# Working with the BigQuery DataFrames Python API

The BigQuery DataFrames Python API lets us use Python to analyze and manipulate data in BigQuery and to perform various machine learning tasks. It's a relatively new, open-source option, launched and maintained by Google Cloud, for using dataframes to interact with BigQuery. We can access it through the `bigframes` Python library, which consists of two main parts:
* `bigframes.pandas`, which implements a pandas-like API on top of BigQuery.
* `bigframes.ml`, which implements a scikit-learn-like API on top of BigQuery ML.

This notebook focuses on using **bigframes.pandas**.


## Prerequisites
**Note:** This notebook and repository are supporting artifacts for the "Google Machine Learning and Generative AI for Solutions Architects" book. The book describes the concepts associated with this notebook, and for some of the activities, the book contains instructions that should be performed before running the steps in the notebooks. Each top-level folder in this repo is associated with a chapter in the book. Please ensure that you have read the relevant chapter sections before performing the activities in this notebook.

**There are also important generic prerequisite steps outlined [here](https://github.com/PacktPublishing/Google-Machine-Learning-for-Solutions-Architects/blob/main/Prerequisite-steps/Prerequisites.ipynb).**

**Attention:** The code in this notebook creates Google Cloud resources that can incur costs.

Refer to the Google Cloud pricing documentation for details, for example:

* [Vertex AI Pricing](https://cloud.google.com/vertex-ai/pricing)
* [BigQuery Pricing](https://cloud.google.com/bigquery/pricing)

**Let's begin by importing `bigframes.pandas` into our notebook. Note: this assumes that you are using the bigframes custom Jupyter kernel created during the prerequisite steps in Chapter 14.**

In [None]:
import bigframes.pandas as bpd
import numpy as np

## Define constants

Next, we define the constants that hold our project ID, along with the dataset ID and table name under which we will save our data in BigQuery later.

We will use the `gcloud` command to get the project ID from the local Google Cloud configuration and assign the result to the `PROJECT_ID` variable. If, for any reason, `PROJECT_ID` is not set, you can set it manually, or change it if you prefer a different project.


In [None]:
PROJECT_ID_DETAILS = !gcloud config get-value project
PROJECT_ID = PROJECT_ID_DETAILS[0]  # The project ID is item 0 in the list returned by the gcloud command
UPDATED_DATASET_ID = "new_york_taxi_trips"
TABLE = "transformed_taxi_data_bigframes"
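The manual fallback mentioned above can be made explicit. The sketch below is one possible approach and is not part of the original notebook: `resolve_project_id` and the placeholder `"your-project-id"` are hypothetical names, and it assumes that an unset project shows up as an empty result or as the literal text `(unset)`.

```python
def resolve_project_id(gcloud_lines, fallback):
    """Return the first usable project ID from gcloud output, else the fallback.

    `gcloud_lines` is the list of output lines captured from
    `gcloud config get-value project` (hypothetical helper, for illustration).
    """
    for line in gcloud_lines:
        candidate = line.strip()
        # Assumption: an unset project appears as an empty line or "(unset)".
        if candidate and candidate != "(unset)":
            return candidate
    return fallback


# Example with made-up values; in the notebook you would pass PROJECT_ID_DETAILS.
print(resolve_project_id(["my-sample-project"], "your-project-id"))
print(resolve_project_id(["(unset)"], "your-project-id"))
```

In the notebook, this could be used as `PROJECT_ID = resolve_project_id(PROJECT_ID_DETAILS, "your-project-id")`, with the placeholder replaced by a real project ID.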

## Read data in from BigQuery

The code in the next cell will read data from the `New York Taxi Trips` BigQuery Public Dataset into a dataframe that we can then use in the remaining steps in this notebook.

In [None]:
df = bpd.read_gbq("SELECT * FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2020`")
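`SELECT *` reads every column of the table. If only a few fields are needed, a narrower query may reduce the amount of data scanned. The following is a minimal sketch: `build_query` is a hypothetical helper, and the column list simply reuses the fields that appear later in this notebook.

```python
SOURCE_TABLE = "bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2020"


def build_query(table, columns):
    """Build a SELECT statement for the given columns (hypothetical helper)."""
    if not columns:
        raise ValueError("at least one column is required")
    column_list = ", ".join(columns)
    return f"SELECT {column_list} FROM `{table}`"


query = build_query(SOURCE_TABLE, ["passenger_count", "trip_distance", "fare_amount"])
print(query)
# The resulting string could then be passed to bpd.read_gbq(query).
```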

## Data exploration

Now that we've read the data into a dataframe, we can begin to explore our dataset.

### Preview the data

Let's take a look at some of the values in our dataset:

In [None]:
df.head()

### Explore the data types

We can use the `dtypes` property to explore the data types of the fields in our dataset:

In [None]:
df.dtypes

### Summary statistics

We can use the `describe()` method to display summary statistics for the numeric fields in our dataset. Statistics such as `count`, `min`, `max`, `mean`, and the standard deviation (`std`) help us understand the scale of each feature:

In [None]:
df.describe()

### Explore missing values

Missing values can cause problems for many machine learning algorithms, so it's often important for data scientists to be aware of any missing values that exist in the dataset, and to address them accordingly. The code in the next cell will tell us how many missing values exist for each feature in the dataset:

In [None]:
df.isnull().sum()
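To illustrate what a per-column missing-value count represents, here is a plain-Python sketch on a handful of made-up rows (not the taxi data). The helper `count_missing` is hypothetical and only mirrors the idea of counting nulls per field.

```python
# Toy rows standing in for a few taxi trips; the values are made up.
rows = [
    {"passenger_count": 1, "trip_distance": 2.5, "fare_amount": 10.0},
    {"passenger_count": None, "trip_distance": 0.0, "fare_amount": 3.5},
    {"passenger_count": 2, "trip_distance": None, "fare_amount": None},
    {"passenger_count": None, "trip_distance": 1.2, "fare_amount": 7.0},
]


def count_missing(records):
    """Count None values per column across a list of row dictionaries."""
    counts = {}
    for record in records:
        for column, value in record.items():
            # `value is None` is a bool, which adds as 0 or 1.
            counts[column] = counts.get(column, 0) + (value is None)
    return counts


missing = count_missing(rows)
print(missing)
```

Note that a value of `0.0` is not counted as missing; only absent values are.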

### Value counts

It's also often important to understand how many unique values each feature contains. This is referred to as the `cardinality` of a feature: low-cardinality features have a small number of unique values (e.g., binary features that are either `yes` or `no`), and high-cardinality features have a large number of unique values (e.g., product IDs).

Feature cardinality can be important to understand for tasks such as feature encoding and feature selection.

In [None]:
df['passenger_count'].value_counts()
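To make the relationship between value counts and cardinality concrete, here is a small plain-Python sketch on made-up values (not the taxi data). The value counts list how often each distinct value occurs, and the cardinality is the number of distinct values.

```python
from collections import Counter

# Made-up sample values for illustration only.
passenger_counts = [1, 1, 2, 1, 3, 2, 1]

value_counts = Counter(passenger_counts)  # occurrences of each distinct value
cardinality = len(value_counts)           # number of distinct values

print(value_counts.most_common())
print(cardinality)
```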

## Feature engineering

After exploring our data, we can perform any feature engineering that we believe could be important for helping our models to learn specific patterns in our dataset.

For example, we can engineer a new feature named `fare_per_mile` by dividing the `fare_amount` feature by the `trip_distance` feature. This new feature may be more useful if we want to build a model that estimates the fare for a given trip distance.

To avoid errors such as type mismatches during the division, we convert both columns to `Float64`.
To avoid division by zero, we replace every zero in `trip_distance` with `np.finfo(float).eps` (machine epsilon), which is a tiny positive number.

In [None]:
df['fare_amount'] = df['fare_amount'].astype('Float64')
df['trip_distance'] = df['trip_distance'].astype('Float64')
df['fare_per_mile'] = df['fare_amount'] / df['trip_distance'].replace(0, np.finfo(float).eps)
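One consequence of the epsilon substitution is worth seeing with numbers: dividing by such a tiny value produces an extremely large `fare_per_mile`. The plain-Python sketch below uses made-up fares and a hypothetical `fare_per_mile` function to show the effect on single values; it is not the notebook's dataframe code.

```python
import sys

EPS = sys.float_info.epsilon  # same value as np.finfo(float).eps, about 2.22e-16


def fare_per_mile(fare_amount, trip_distance):
    """Divide fare by distance, substituting epsilon for a zero distance."""
    distance = EPS if trip_distance == 0 else trip_distance
    return fare_amount / distance


print(fare_per_mile(10.0, 2.0))  # ordinary trip
print(fare_per_mile(10.0, 0.0))  # zero-distance trip: a very large number
```

Values of this size can dominate summary statistics, so it may be worth filtering or capping zero-distance trips before training a model. Whether to do so depends on the use case.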

We already covered feature engineering extensively in Chapter 7 of the book, and you can refer to the [feature-eng-titanic.ipynb](https://github.com/PacktPublishing/Google-Machine-Learning-for-Solutions-Architects/blob/main/Chapter-07/feature-eng-titanic.ipynb) Jupyter Notebook file for additional examples.

## Writing data to BigQuery

After performing our feature engineering steps, we can write our updated data back to BigQuery for long-term storage, reference, and analytics:

In [None]:
df.to_gbq(f"{PROJECT_ID}.{UPDATED_DATASET_ID}.{TABLE}") 
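Because the destination is assembled from three variables, an empty `PROJECT_ID` would silently yield a malformed table ID such as `.new_york_taxi_trips.transformed_taxi_data_bigframes`. A small guard, sketched below with the hypothetical helper `build_table_id`, could catch that before the write is attempted.

```python
def build_table_id(project_id, dataset_id, table_name):
    """Join the three parts into a `project.dataset.table` string (hypothetical helper)."""
    parts = {"project_id": project_id, "dataset_id": dataset_id, "table_name": table_name}
    for name, value in parts.items():
        # Reject missing or whitespace-only parts before building the ID.
        if not value or not value.strip():
            raise ValueError(f"{name} must not be empty")
    return ".".join(value.strip() for value in parts.values())


# Made-up project ID for illustration; the notebook uses PROJECT_ID from gcloud.
print(build_table_id("my-sample-project", "new_york_taxi_trips", "transformed_taxi_data_bigframes"))
```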

**After that operation completes, you can view the dataset in the [BigQuery console](https://console.cloud.google.com/bigquery).**

# That's it! Well Done!

# Clean up

When you no longer need the resources created by this notebook, you can delete them as follows.

**Note: if you do not delete the resources, you will continue to pay for them.**

In [None]:
clean_up = False  # Set to True if you want to delete the resources

## Delete BigQuery resources

In [None]:
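if clean_up:
    # Shell commands run with `!` do not raise Python exceptions,
    # so check the exit code that IPython stores in `_exit_code`.
    ! bq rm -r -f -d $PROJECT_ID:$UPDATED_DATASET_ID
    if _exit_code == 0:
        print(f"Deleted dataset {UPDATED_DATASET_ID}")
    else:
        print(f"Error deleting dataset {UPDATED_DATASET_ID} (exit code {_exit_code})")
else:
    print("clean_up parameter is set to False.")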

**You can also verify or delete the dataset in the [BigQuery console](https://console.cloud.google.com/bigquery).**