# BigQuery ML (BQML)

In this notebook, we will use BQML to train and evaluate a linear regression model to predict the fare amount for a given taxi trip, based on the data contained in the New York City Taxi Trips public dataset in BigQuery.

**This notebook uses the BigQuery feature dataset and view that we created in the Feature Store activities in Chapter 7. If you deleted those resources, recreate them by repeating the Feature Store activities in Chapter 7 (in the `feature_store.ipynb` notebook file).**

## Prerequisites
**Note:** This notebook and repository are supporting artifacts for the "Google Machine Learning and Generative AI for Solutions Architects" book. The book describes the concepts associated with this notebook, and for some of the activities, the book contains instructions that should be performed before running the steps in the notebooks. Each top-level folder in this repo is associated with a chapter in the book. Please ensure that you have read the relevant chapter sections before performing the activities in this notebook.

**There are also important generic prerequisite steps outlined [here](https://github.com/PacktPublishing/Google-Machine-Learning-for-Solutions-Architects/blob/main/Prerequisite-steps/Prerequisites.ipynb).**


**Attention:** The code in this notebook creates Google Cloud resources that can incur costs.

Refer to the Google Cloud pricing documentation for details.

For example:

* [Vertex AI Pricing](https://cloud.google.com/vertex-ai/pricing)
* [BigQuery Pricing](https://cloud.google.com/bigquery/pricing)


## Install and import libraries

Let's begin by installing the BigQuery Python client library. We will import it after restarting the kernel.

In [None]:
! pip install --upgrade --quiet google-cloud-bigquery

*The pip installation commands sometimes report various errors. Those errors usually do not affect the activities in this notebook, and you can ignore them.*


## Restart the kernel

The code in the next cell will restart the kernel, which is sometimes required after installing or upgrading packages.

**When prompted, click OK to restart the kernel.**

The sleep command simply prevents further cells from executing before the kernel restarts.

In [None]:
import IPython
import time

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

time.sleep(10)

# (Wait for kernel to restart before proceeding...)

## Import required libraries

In [None]:
from google.cloud import bigquery

## Set Google Cloud resource variables

The following code will set variables specific to your Google Cloud resources that will be used in this notebook, such as the Project ID, Region, and GCS Bucket.

**Note: This notebook is intended to execute in a Vertex AI Workbench Notebook, in which case the API calls issued in this notebook are authenticated according to the permissions (e.g., service account) assigned to the Vertex AI Workbench Notebook.**

We will use the `gcloud` command to get the Project ID details from the local Google Cloud project, and assign the results to the PROJECT_ID variable. If, for any reason, PROJECT_ID is not set, you can set it manually or change it, if preferred.

We also use a default bucket name for most of the examples and activities in this book, which has the format: `{PROJECT_ID}-aiml-sa-bucket`. You can change the bucket name if preferred.

Also, we're defaulting to the **us-central1** region, but you can optionally replace this with your [preferred region](https://cloud.google.com/about/locations).
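
If the `gcloud` lookup returns nothing, one possible fallback is sketched below. The helper name `resolve_project_id` and the choice of falling back to the `GOOGLE_CLOUD_PROJECT` environment variable are assumptions made for illustration; they are not part of this notebook's code.

```python
import os


def resolve_project_id(gcloud_output, env=None):
    """Return the first usable project ID, or raise if none is found.

    `gcloud_output` is the list of lines captured from
    `!gcloud config get-value project` (hypothetical helper).
    """
    env = os.environ if env is None else env
    # Keep only non-empty lines that are not the "(unset)" placeholder.
    candidates = [line.strip() for line in gcloud_output if line.strip()]
    candidates = [line for line in candidates if line != "(unset)"]
    if candidates:
        return candidates[0]
    # Assumed fallback: an environment variable that you set yourself.
    fallback = env.get("GOOGLE_CLOUD_PROJECT", "").strip()
    if fallback:
        return fallback
    raise ValueError("PROJECT_ID could not be determined; set it manually.")


# Example with a made-up project ID:
print(resolve_project_id(["my-sample-project"]))
```

Raising an error early, rather than continuing with an empty value, makes a missing project ID visible before any BigQuery call is issued.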

In [None]:
PROJECT_ID_DETAILS = !gcloud config get-value project
PROJECT_ID = PROJECT_ID_DETAILS[0]  # The project ID is item 0 in the list returned by the gcloud command
REGION = "us-central1"  # Optional: replace with your preferred region (See: https://cloud.google.com/about/locations)
print(f"Project ID: {PROJECT_ID}")


## Begin implementation

Now that we have performed the prerequisite steps for this activity, it's time to implement the activity.

## Define constants and variables

The code in the following cell creates constants and variables that will be used throughout the notebook.


In [None]:
API_ENDPOINT = f"{REGION}-aiplatform.googleapis.com"

# Specify the public dataset and table
public_project_name = 'bigquery-public-data'
public_dataset_name = 'new_york_taxi_trips'
public_table_name = 'tlc_yellow_trips_2020'
public_table_id = f"{public_project_name}.{public_dataset_name}.{public_table_name}"

# Define our dataset and view names
# Feature store view we created in Chapter 7: 
feature_dataset_name = 'feature_store_for_nyc_taxi_data'
feature_view_name = 'nyc_taxi_data_view'
feature_view_id = f"{PROJECT_ID}.{feature_dataset_name}.{feature_view_name}"
# Extended view to combine engineered features with original dataset
extended_view_name = 'extended_nyc_taxi_data_view'
extended_view_id = f"{PROJECT_ID}.{feature_dataset_name}.{extended_view_name}"

# Define our model details
model_name = 'taxi_fare_model'
model_id = f'{feature_dataset_name}.{model_name}'
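
The IDs above are plain dotted strings. The sketch below shows their shape using a made-up project ID (`my-sample-project`) and a hypothetical helper, `qualified_id`. Note that `model_id` in this notebook has only two parts (dataset and model), so we assume it is resolved against the default project of the client that runs the query; the three-part variant shown here is optional and is not used elsewhere in the notebook.

```python
def qualified_id(*parts):
    """Join non-empty ID parts with dots, e.g. project.dataset.table."""
    if not parts or any(not p for p in parts):
        raise ValueError("every ID part must be a non-empty string")
    return ".".join(parts)


# Made-up project ID, used only to show the shape of the strings.
example_project = "my-sample-project"
example_dataset = "feature_store_for_nyc_taxi_data"

# Three-part ID: project.dataset.view
example_view_id = qualified_id(example_project, example_dataset, "extended_nyc_taxi_data_view")
# Two-part ID, as used for the model in this notebook: dataset.model
example_model_id = qualified_id(example_dataset, "taxi_fare_model")
# Optional fully qualified variant (an assumption, not used by the notebook).
example_qualified_model_id = qualified_id(example_project, example_model_id)

print(example_view_id)
print(example_model_id)
print(example_qualified_model_id)
```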

## Get the base table from the New York Taxi Trips public dataset in BigQuery

In [None]:
# Create a BigQuery client
client = bigquery.Client()

# Get the table
table = client.get_table(public_table_id)

# List the field names
field_names = [field.name for field in table.schema]

# Print the field names
print("Field Names in the Table:")
print(field_names)

## Define and create our extended view

We will create a new view to train our model. The view will combine the base dataset from the New York Taxi Trips public dataset with the features we engineered in Chapter 7 ([see here for reference](https://github.com/PacktPublishing/Google-Machine-Learning-for-Solutions-Architects/blob/main/Chapter-07/feature-store.ipynb)).

First, we define the query that will create the view.
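
The query joins the two sources on a derived key: the feature view's `entity_id` must equal the public table's `pickup_datetime` and `pickup_location_id` joined by a hyphen. The sketch below mimics that join condition in plain Python on made-up rows. The helper name `make_entity_id` is hypothetical, and the string form of the timestamp is an assumption; BigQuery's own formatting when it converts the value to a string may differ.

```python
def make_entity_id(pickup_datetime, pickup_location_id):
    """Mimic CONCAT(pickup_datetime, '-', CAST(pickup_location_id AS STRING))."""
    return f"{pickup_datetime}-{pickup_location_id}"


# Made-up rows standing in for the feature view (e) and the public table (o).
feature_rows = [
    {"entity_id": "2020-01-01 00:05:00-142", "fare_per_mile": 4.0},
    {"entity_id": "2020-01-01 00:09:00-238", "fare_per_mile": 3.5},
]
public_rows = [
    {"pickup_datetime": "2020-01-01 00:05:00", "pickup_location_id": 142, "payment_type": "1"},
    {"pickup_datetime": "2020-01-01 00:30:00", "pickup_location_id": 7, "payment_type": "2"},
]

# Index the public rows by the derived key, then keep only the feature rows
# that have a match (an inner join, like JOIN ... ON in the query).
# Note: a dict keeps one row per key, whereas SQL returns every matching pair.
public_by_key = {
    make_entity_id(r["pickup_datetime"], r["pickup_location_id"]): r
    for r in public_rows
}
joined = [
    {**e, "payment_type": public_by_key[e["entity_id"]]["payment_type"]}
    for e in feature_rows
    if e["entity_id"] in public_by_key
]
print(joined)
```

Only the first feature row survives, because it is the only one whose key appears on both sides.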

In [None]:
extended_view_query = f"""
SELECT
  e.pickup_datetime,
  e.dropoff_datetime,
  e.passenger_count,
  e.trip_distance,
  e.fare_amount,
  e.fare_per_mile,
  e.pickup_hour,
  e.pickup_day_of_week,
  e.dropoff_hour,
  e.dropoff_day_of_week,
  e.entity_id,
  e.feature_timestamp,
  o.vendor_id,
  o.rate_code,
  o.store_and_fwd_flag,
  o.payment_type,
  o.extra,
  o.mta_tax,
  o.tip_amount,
  o.tolls_amount,
  o.imp_surcharge,
  o.airport_fee,
  o.total_amount,
  o.data_file_year,
  o.data_file_month,
FROM
  `{feature_view_id}` AS e
JOIN
  `{public_table_id}` AS o
ON
  e.entity_id = CONCAT(o.pickup_datetime, '-', CAST(o.pickup_location_id AS STRING))
"""

Next, execute the query that will create the view:

In [None]:
from google.api_core.exceptions import GoogleAPIError

extended_view = bigquery.Table(extended_view_id)
extended_view.view_query = extended_view_query

try:
    # Use `exists_ok=True` to avoid an 'already exists' error if the view is already there
    client.create_table(extended_view, exists_ok=True)
    print(f"View {extended_view_id} is ready (created, or it already existed).")
except GoogleAPIError as e:
    print(f"An error occurred: {e}")

## Verify contents of the extended view

The code in the next cell will get our newly created view details and list the fields (or features) in the view.

In [None]:
# Get the view
extended_table = client.get_table(extended_view_id)

# List the field names
extended_field_names = [field.name for field in extended_table.schema]

# Print the field names
print("Field Names in the View:")
print(extended_field_names)

## Define the query that will be used to create our linear regression model

In the next cell, we define a query that will be used to create our linear regression model. 
The query selects the features from our extended view (that we created in the previous steps in this notebook) to be used in training our model.
It also specifies that the model type is linear regression, and that the target column is `fare_amount`.
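
For intuition about what a linear regression model fits, here is a minimal one-feature least-squares sketch in plain Python. The numbers are made up, the function name `fit_line` is hypothetical, and this does not reproduce BigQuery ML's actual training procedure, which handles many features and applies its own preprocessing and optimization.

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the x/y co-deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up training data: distance in miles and the fare paid.
example_distances = [1.0, 2.0, 3.0, 4.0]
example_fares = [5.5, 8.5, 11.5, 14.5]

example_intercept, example_slope = fit_line(example_distances, example_fares)
print(f"fare ~= {example_intercept:.2f} + {example_slope:.2f} * distance")
```

In this toy setup the label plays the role of `fare_amount`, and the single feature plays the role of `trip_distance`.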

In [None]:
model_query = f"""
CREATE OR REPLACE MODEL `{model_id}`
OPTIONS(model_type='linear_reg', input_label_cols=['fare_amount']) AS
SELECT
  pickup_datetime,
  dropoff_datetime,
  passenger_count,
  trip_distance,
  fare_amount,
  fare_per_mile,
  pickup_hour,
  pickup_day_of_week,
  dropoff_hour,
  dropoff_day_of_week,
  vendor_id,
  rate_code,
  store_and_fwd_flag,
  payment_type,
  extra,
  mta_tax,
  tip_amount,
  tolls_amount,
  imp_surcharge,
  airport_fee,
  total_amount,
  data_file_year,
  data_file_month
FROM
  `{extended_view_id}`;
"""

## Execute the query to create our linear regression model

Next, we execute the query that we defined to create our linear regression model. 

In [None]:
# Run the query to create the model and wait for training to complete
print("Starting model training and waiting for completion...")
model_query_job = client.query(model_query)
model_query_job.result()  # Blocks until the training job finishes
print(f"Model {model_id} trained successfully.")

## Evaluate our linear regression model

Next, we evaluate our linear regression model.

The following query will return evaluation metric values for our model.

In [None]:
# Define the SQL query for model evaluation
evaluation_query = f"""
SELECT
  *
FROM
  ML.EVALUATE(MODEL `{model_id}`);
"""

# Run the query and store the results in a dataframe 
evaluation_result = client.query(evaluation_query).to_dataframe()

# Display the evaluation metrics
print(evaluation_result)
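
For a linear regression model, the evaluation output includes columns such as `mean_absolute_error`, `mean_squared_error`, and `r2_score`. To make those three numbers less abstract, the sketch below computes them by hand on made-up values. The function name `regression_metrics` is hypothetical, and this is only an illustration of the definitions, not of how BigQuery ML computes them internally.

```python
def regression_metrics(actual, predicted):
    """Compute MAE, MSE, and R^2 from paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    # R^2 compares the model's squared error with that of always
    # predicting the mean of the actual values.
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    ss_res = sum(e * e for e in errors)
    r2 = 1.0 - ss_res / ss_tot
    return {"mean_absolute_error": mae, "mean_squared_error": mse, "r2_score": r2}


# Made-up fares and predictions.
example_actual = [10.0, 20.0, 30.0]
example_predicted = [12.0, 18.0, 33.0]

example_metrics = regression_metrics(example_actual, example_predicted)
print(example_metrics)
```

Lower error values and an `r2_score` closer to 1 indicate a better fit on the evaluated data.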

## Get prediction from our model

Next, we will get a prediction from our model. In a real-world scenario, we would have new data to send to our model. However, to keep the demonstration simple, rather than fabricating new data, we can use existing data to create a record that contains all fields except the field that we wish to predict (i.e., the `fare_amount` field).

To help understand the code in the next cell, I'll break it down as follows:

We will run a SELECT query within a SELECT query. The first SELECT query that executes is:

```
SELECT
  * EXCEPT(fare_amount)
FROM
  `{extended_view_id}`
LIMIT 1
```

This returns a single record (due to the `LIMIT 1` clause) from our training dataset that contains all fields except the `fare_amount` field. Bear in mind that this is just a convenient way to create a record that we can use in our prediction request. As mentioned above, in a real-world scenario, we would have new data to send to our model, and we would not use this trick.

The results of that query are then used in our prediction query:

```
SELECT
  predicted_fare_amount
FROM
  ML.PREDICT(MODEL `{model_id}`,
    (<inner SELECT query from above>))

That query is what requests a prediction from our trained model, and it uses the record from the previous query above as input.

Then, we assign the results of our prediction request to a dataframe and we print the predicted value.
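
The two steps described above can be mimicked locally as a rough sketch: drop the label from a record, apply a linear model, and name the output column `predicted_fare_amount`, following the `predicted_<label>` naming that `ML.PREDICT` uses. The function `predict_fare`, the weights, the intercept, and the record are all made up for illustration; the real model's weights come from training in BigQuery ML.

```python
def predict_fare(record, weights, intercept, label="fare_amount"):
    """Drop the label, apply a linear model, and name the output column."""
    # Equivalent in spirit to SELECT * EXCEPT(fare_amount).
    features = {k: v for k, v in record.items() if k != label}
    prediction = intercept + sum(weights[k] * features[k] for k in weights)
    # Return the prediction alongside the input features.
    return {f"predicted_{label}": prediction, **features}


# Made-up record and model parameters.
example_record = {"trip_distance": 2.0, "passenger_count": 1, "fare_amount": 9.0}
example_weights = {"trip_distance": 3.0, "passenger_count": 0.5}

example_result = predict_fare(example_record, example_weights, intercept=2.5)
print(example_result["predicted_fare_amount"])
```

The original `fare_amount` never reaches the model in this sketch, which mirrors the purpose of `* EXCEPT(fare_amount)` in the query.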

In [None]:
# Define and run the SQL query for making a prediction
prediction_query = f"""
SELECT
  predicted_fare_amount
FROM
  ML.PREDICT(MODEL `{model_id}`, 
    (SELECT
      * EXCEPT(fare_amount)
    FROM
      `{extended_view_id}`
    LIMIT 1))
"""

# Run the prediction query and convert to DataFrame
prediction_result = client.query(prediction_query).to_dataframe()

# Display the predicted fare amount
print("Predicted Fare Amount:", prediction_result['predicted_fare_amount'].iloc[0])

# That's it! Well Done!

# Clean up

When you no longer need the resources created by this notebook, you can delete them as follows.

**Note: if you do not delete the resources, they may continue to incur costs.**

In [None]:
clean_up = False  # Set to True if you want to delete the resources

## Delete BigQuery dataset, view, and model

In [None]:
if clean_up:
    # Delete the model first: deleting the dataset with delete_contents=True
    # also removes the model, after which a separate model lookup would fail.
    try:
        client.delete_model(model_id, not_found_ok=True)
        print(f"Deleted model: {model_id}")
    except Exception as e:
        print(f"Error deleting model: {e}")

    # Delete the views (the extended view depends on the feature view)
    for view_id in (extended_view_id, feature_view_id):
        try:
            client.delete_table(view_id, not_found_ok=True)
            print(f"Deleted view: {view_id}")
        except Exception as e:
            print(f"Error deleting view: {e}")

    # Delete the dataset and anything left in it
    try:
        client.delete_dataset(feature_dataset_name, delete_contents=True, not_found_ok=True)
        print(f"Deleted dataset: {feature_dataset_name}")
    except Exception as e:
        print(f"Error deleting dataset: {e}")
else:
    print("clean_up parameter is set to False.")