In [1]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

## Train a linear regression model with BigQuery DataFrames ML


<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/googleapis/python-bigquery-dataframes/blob/main/notebooks/ml/bq_dataframes_ml_linear_regression.ipynb">
      <img src="https://raw.githubusercontent.com/googleapis/python-bigquery-dataframes/refs/heads/main/third_party/logo/colab-logo.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/ml/bq_dataframes_ml_linear_regression.ipynb">
      <img src="https://raw.githubusercontent.com/googleapis/python-bigquery-dataframes/refs/heads/main/third_party/logo/github-logo.png" width="32" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/googleapis/python-bigquery-dataframes/main/notebooks/ml/bq_dataframes_ml_linear_regression.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/bigquery/import?url=https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/ml/bq_dataframes_ml_linear_regression.ipynb">
      <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTW1gvOovVlbZAIZylUtf5Iu8-693qS1w5NJw&s" alt="BQ logo" width="35">
      Open in BQ Studio
    </a>
  </td>
</table>

**_NOTE_**: This notebook has been tested in the following environment:

* Python version = 3.10

## Overview

Use this notebook to learn how to train a linear regression model using BigQuery ML and the `bigframes.bigquery` module.

This example is adapted from the [BQML linear regression tutorial](https://cloud.google.com/bigquery-ml/docs/linear-regression-tutorial).

Learn more about [BigQuery DataFrames](https://dataframes.bigquery.dev/).

### Objective

In this tutorial, you use BigQuery DataFrames to create a linear regression model that predicts the weight of an Adelie penguin based on the penguin's island of residence, culmen length and depth, flipper length, and sex.

The steps include:

- Creating a DataFrame from a BigQuery table.
- Cleaning and preparing data using pandas.
- Creating a linear regression model using `bigframes.bigquery.ml`.
- Saving the ML model to BigQuery for future use.

### Dataset

This tutorial uses the [```penguins``` table](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=ml_datasets&t=penguins), a BigQuery public dataset that includes data on a set of penguins: species, island of residence, weight, culmen length and depth, flipper length, and sex.

### Costs

This tutorial uses billable components of Google Cloud:

* BigQuery (compute)
* BigQuery ML

Learn about [BigQuery compute pricing](https://cloud.google.com/bigquery/pricing#analysis_pricing_models)
and [BigQuery ML pricing](https://cloud.google.com/bigquery/pricing#bqml),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Installation

If you don't already have the [bigframes](https://pypi.org/project/bigframes/) package installed, uncomment and run the following cells to:

1. Install the package
1. Restart the notebook kernel (Jupyter or Colab) to work with the package

In [2]:
# !pip install bigframes

In [3]:
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Before you begin

Complete the tasks in this section to set up your environment.

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the BigQuery API](https://console.cloud.google.com/flows/enableapi?apiid=bigquery.googleapis.com).

4. If you are running this notebook locally, install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

If you don't know your project ID, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113).

In [None]:
PROJECT_ID = ""  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

Updated property [core/project].
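
Because `PROJECT_ID` starts out empty, a quick format check before calling `gcloud` can catch typos early. The following is a minimal sketch, not part of the original notebook: the helper name `looks_like_project_id` is hypothetical, and the pattern encodes the commonly documented rules for project IDs (6 to 30 characters; lowercase letters, digits, and hyphens; starts with a letter; doesn't end with a hyphen). It doesn't cover legacy domain-scoped IDs, and it doesn't confirm that the project exists.

```python
import re

# Assumed rule set: 6-30 characters, lowercase letters, digits, and hyphens,
# starting with a letter and not ending with a hyphen.
_PROJECT_ID_PATTERN = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def looks_like_project_id(value: str) -> bool:
    """Return True if value has the shape of a Google Cloud project ID."""
    return _PROJECT_ID_PATTERN.fullmatch(value) is not None


# Hypothetical candidates, used only to exercise the check.
for candidate in ["bigframes-dev", "", "My_Project"]:
    print(repr(candidate), looks_like_project_id(candidate))
```

A check like this only validates the format. Running `gcloud projects list` remains the way to see which projects you can actually access.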


#### Set the region

You can also change the `REGION` variable used by BigQuery. Learn more about [BigQuery regions](https://cloud.google.com/bigquery/docs/locations#supported_locations).

In [5]:
REGION = "US"  # @param {type: "string"}

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you might have to manually authenticate. Follow the relevant instructions below.

**Vertex AI Workbench**

No action is needed. You are already authenticated.

**Local JupyterLab instance**

Uncomment and run the following cell:

In [6]:
# ! gcloud auth login

**Colab**

Uncomment and run the following cell:

In [7]:
# from google.colab import auth
# auth.authenticate_user()

### Import libraries

In [8]:
import bigframes.pandas as bpd

### Set BigQuery DataFrames options

In [9]:
# Note: The project option is not required in all environments.
# On BigQuery Studio, the project ID is automatically detected.
bpd.options.bigquery.project = PROJECT_ID

# Note: The location option is not required.
# It defaults to the location of the first table or query
# passed to read_gbq(). For APIs where a location can't be
# auto-detected, the location defaults to the "US" location.
bpd.options.bigquery.location = REGION

# Recommended for performance. Disables pandas default ordering of all rows.
bpd.options.bigquery.ordering_mode = "partial"

To change the location used for new DataFrame or Series objects, first reset the session by calling `bpd.close_session()`. After that, you can set `bpd.options.bigquery.location` again to specify another location.

## Read a BigQuery table into a BigQuery DataFrames DataFrame

Read the [```penguins``` table](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=ml_datasets&t=penguins) into a BigQuery DataFrames DataFrame:

In [10]:
df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

Take a look at the DataFrame:

In [11]:
df.peek()

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie Penguin (Pygoscelis adeliae),Dream,36.6,18.4,184.0,3475.0,FEMALE
1,Adelie Penguin (Pygoscelis adeliae),Dream,39.8,19.1,184.0,4650.0,MALE
2,Adelie Penguin (Pygoscelis adeliae),Dream,40.9,18.9,184.0,3900.0,MALE
3,Chinstrap penguin (Pygoscelis antarctica),Dream,46.5,17.9,192.0,3500.0,FEMALE
4,Adelie Penguin (Pygoscelis adeliae),Dream,37.3,16.8,192.0,3000.0,FEMALE


## Clean and prepare data

You can use the pandas API as you normally would on the BigQuery DataFrames DataFrame, but calculations happen in the BigQuery query engine instead of your local environment.

Because this model will focus on the Adelie Penguin species, you need to filter the data for only those rows representing Adelie penguins. Then you drop the `species` column because it is no longer needed.

As these functions are applied, only the new DataFrame object `adelie_data` is modified. The source table and the original DataFrame object `df` don't change.

In [12]:
# Filter the data down to the Adelie Penguin species
adelie_data = df[df.species == "Adelie Penguin (Pygoscelis adeliae)"]

# Drop the species column
adelie_data = adelie_data.drop(columns=["species"])

# Take a look at the filtered DataFrame
adelie_data

Unnamed: 0,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Dream,36.6,18.4,184.0,3475.0,FEMALE
1,Dream,39.8,19.1,184.0,4650.0,MALE
2,Dream,40.9,18.9,184.0,3900.0,MALE
3,Dream,37.3,16.8,192.0,3000.0,FEMALE
4,Dream,43.2,18.5,192.0,4100.0,MALE
5,Dream,40.2,20.1,200.0,3975.0,MALE
6,Dream,40.8,18.9,208.0,4300.0,MALE
7,Dream,39.0,18.7,185.0,3650.0,MALE
8,Dream,37.0,16.9,185.0,3000.0,FEMALE
9,Dream,34.0,17.1,185.0,3400.0,FEMALE
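
The claim that filtering and dropping columns leaves the original DataFrame untouched can be checked locally. The sketch below uses plain pandas as a stand-in for BigQuery DataFrames, which is an assumption made so that it runs without a Google Cloud project. The three rows are copied from outputs shown in this notebook.

```python
import pandas as pd

ADELIE = "Adelie Penguin (Pygoscelis adeliae)"

# Small local stand-in for the penguins table.
df = pd.DataFrame(
    {
        "species": [ADELIE, "Gentoo penguin (Pygoscelis papua)", ADELIE],
        "island": ["Dream", "Biscoe", "Biscoe"],
        "body_mass_g": [3475.0, 4300.0, 3550.0],
    }
)

# The same two steps as in the cell above: filter rows, then drop a column.
adelie_data = df[df.species == ADELIE]
adelie_data = adelie_data.drop(columns=["species"])

print(list(adelie_data.columns), len(adelie_data))
# The original DataFrame still has every row and the species column.
print(list(df.columns), len(df))
```

Each step returns a new object, so `df` can be reused later, for example to select penguins from a different species or island.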


Drop rows with `NULL` values in order to create a BigQuery DataFrames DataFrame for the training data:

In [13]:
# Drop rows with nulls to get training data
training_data = adelie_data.dropna()

# Take a peek at the training data
training_data

Unnamed: 0,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Dream,36.6,18.4,184.0,3475.0,FEMALE
1,Dream,39.8,19.1,184.0,4650.0,MALE
2,Dream,40.9,18.9,184.0,3900.0,MALE
3,Dream,37.3,16.8,192.0,3000.0,FEMALE
4,Dream,43.2,18.5,192.0,4100.0,MALE
5,Dream,40.2,20.1,200.0,3975.0,MALE
6,Dream,40.8,18.9,208.0,4300.0,MALE
7,Dream,39.0,18.7,185.0,3650.0,MALE
8,Dream,37.0,16.9,185.0,3000.0,FEMALE
9,Dream,34.0,17.1,185.0,3400.0,FEMALE
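
Before dropping rows, it can help to see how many values are missing in each column. The sketch below again uses plain pandas as a local stand-in, and the rows are hypothetical values invented for illustration. With its default arguments, `dropna()` removes every row that has at least one missing value.

```python
import pandas as pd

# Hypothetical rows; None marks a missing measurement.
adelie_data = pd.DataFrame(
    {
        "island": ["Dream", "Biscoe", "Torgersen"],
        "culmen_length_mm": [36.6, None, 39.1],
        "body_mass_g": [3475.0, 3550.0, 3750.0],
        "sex": ["FEMALE", "MALE", None],
    }
)

# Count missing values per column before dropping anything.
null_counts = adelie_data.isna().sum()
print(null_counts)

# Keep only the rows where every column has a value.
training_data = adelie_data.dropna()
print(len(adelie_data), "rows before,", len(training_data), "rows after")
```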


## Create the linear regression model

In this notebook, you create a linear regression model, a type of regression model that generates a continuous value from a linear combination of input features.
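
As a sketch in symbols (the notation is an assumption for illustration and doesn't appear in the BigQuery ML output), with input features $x_1, \dots, x_n$, learned weights $w_1, \dots, w_n$, and an intercept $w_0$, the prediction is:

$$
\hat{y} = w_0 + \sum_{i=1}^{n} w_i x_i
$$

In this tutorial, $\hat{y}$ is the predicted body mass in grams, and the $x_i$ are the preprocessed island, culmen, flipper, and sex features.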

Create a BigQuery dataset to house the model, adding a name for your dataset as the `DATASET_ID` variable:

In [14]:
DATASET_ID = "bqml_tutorial"  # @param {type:"string"}

from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID)
dataset = bigquery.Dataset(PROJECT_ID + "." + DATASET_ID)
dataset.location = REGION
dataset = client.create_dataset(dataset, exists_ok=True)
print(f"Dataset {dataset.dataset_id} created.")

Dataset bqml_tutorial created.


### Create the model using `bigframes.bigquery.ml`

When you pass the feature columns without transforms, BigQuery ML uses
[automatic preprocessing](https://cloud.google.com/bigquery/docs/auto-preprocessing) to encode string values and scale numeric values.

BigQuery ML also [automatically splits the data for training and evaluation](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-glm#data_split_method), although for datasets with fewer than 500 rows (such as this one), all rows are used for training.

In [15]:
import bigframes.bigquery as bbq

model_name = f"{PROJECT_ID}.{DATASET_ID}.penguin_weight"
model_metadata = bbq.ml.create_model(
    model_name,
    replace=True,
    options={
        "model_type": "LINEAR_REG",
    },
    training_data=training_data.rename(columns={"body_mass_g": "label"})
)
model_metadata

etag                                         hxX7+/9HAtOBZmonQ0Kvqg==
modelReference      {'projectId': 'bigframes-dev', 'datasetId': 'b...
creationTime                                            1764777925449
lastModifiedTime                                        1764777925525
modelType                                           LINEAR_REGRESSION
trainingRuns        [{'trainingOptions': {'lossType': 'MEAN_SQUARE...
featureColumns      [{'name': 'island', 'type': {'typeKind': 'STRI...
labelColumns        [{'name': 'predicted_label', 'type': {'typeKin...
location                                                           US
dtype: object
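
The automatic preprocessing mentioned above can be pictured with a small local sketch. It assumes that numeric scaling means standardization (zero mean, unit standard deviation) and that string encoding means one-hot encoding. The helper names `standardize` and `one_hot` are hypothetical, and BigQuery ML's actual implementation may differ in its details.

```python
from statistics import mean, pstdev


def standardize(values):
    """Center values at zero and scale them to unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Return the sorted categories and one 0/1 indicator vector per value."""
    categories = sorted(set(values))
    vectors = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, vectors


# A few feature values taken from the training data shown earlier.
flipper_length_mm = [184.0, 192.0, 200.0]
sex = ["FEMALE", "MALE", "MALE"]

scaled = standardize(flipper_length_mm)
categories, encoded = one_hot(sex)

print(scaled)
print(categories, encoded)
```

Because BigQuery ML applies these steps when you pass the columns without transforms, the `create_model` call above receives the raw columns.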

### Evaluate the model

Check how the model performed by using the `evaluate` function. For details on the output columns, see the [`ML.EVALUATE` output reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-evaluate#mlevaluate_output).

In [16]:
bbq.ml.evaluate(model_name)

Unnamed: 0,mean_absolute_error,mean_squared_error,mean_squared_log_error,median_absolute_error,r2_score,explained_variance
0,223.878763,78553.601634,0.005614,181.330911,0.623951,0.623951
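
To make the metric names concrete, the sketch below computes four of them by hand using their standard definitions. The function name `regression_metrics` and the actual and predicted values are hypothetical, chosen so that the arithmetic is easy to follow. This is not how BigQuery ML computes its output internally.

```python
from statistics import mean, median


def regression_metrics(actual, predicted):
    """Compute common regression metrics from paired actual/predicted values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    abs_errors = [abs(e) for e in errors]
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    y_bar = mean(actual)
    ss_tot = sum((a - y_bar) ** 2 for a in actual)  # total sum of squares
    return {
        "mean_absolute_error": mean(abs_errors),
        "mean_squared_error": ss_res / len(errors),
        "median_absolute_error": median(abs_errors),
        "r2_score": 1 - ss_res / ss_tot,
    }


# Hypothetical body masses in grams.
actual = [3000.0, 3500.0, 4000.0, 4500.0]
predicted = [3100.0, 3400.0, 4100.0, 4400.0]

metrics = regression_metrics(actual, predicted)
print(metrics)
```

For the evaluation output above, the square root of `mean_squared_error` is about 280 g, which gives a typical error in the same unit as the label.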


### Use the model to predict outcomes

Now that you have evaluated your model, the next step is to use it to predict an
outcome. You can run the `bigframes.bigquery.ml.predict` function on the model to
predict the body mass in grams of all penguins that reside on the Biscoe
Islands.

In [17]:
df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
biscoe = df[df["island"].str.contains("Biscoe")]
bbq.ml.predict(model_name, biscoe)



Unnamed: 0,predicted_label,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,3945.010052,Gentoo penguin (Pygoscelis papua),Biscoe,,,,,
1,3914.916297,Adelie Penguin (Pygoscelis adeliae),Biscoe,39.7,18.9,184.0,3550.0,MALE
2,3278.611224,Adelie Penguin (Pygoscelis adeliae),Biscoe,36.4,17.1,184.0,2850.0,FEMALE
3,4006.367355,Adelie Penguin (Pygoscelis adeliae),Biscoe,41.6,18.0,192.0,3950.0,MALE
4,3417.610478,Adelie Penguin (Pygoscelis adeliae),Biscoe,35.0,17.9,192.0,3725.0,FEMALE
5,4009.612421,Adelie Penguin (Pygoscelis adeliae),Biscoe,41.1,18.2,192.0,4050.0,MALE
6,4231.330911,Adelie Penguin (Pygoscelis adeliae),Biscoe,42.0,19.5,200.0,4050.0,MALE
7,3554.308906,Gentoo penguin (Pygoscelis papua),Biscoe,43.8,13.9,208.0,4300.0,FEMALE
8,3550.677455,Gentoo penguin (Pygoscelis papua),Biscoe,43.3,14.0,208.0,4575.0,FEMALE
9,3537.882543,Gentoo penguin (Pygoscelis papua),Biscoe,44.0,13.6,208.0,4350.0,FEMALE
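
One way to read this output is to compare `predicted_label` with the measured `body_mass_g`. The sketch below does that locally for the six Adelie rows shown above (rows 1 to 6), with the values copied from the output. Treating the difference as a residual is a standard convention, not something the `predict` function returns.

```python
# (predicted_label, body_mass_g) pairs for rows 1-6 of the output above.
rows = [
    (3914.916297, 3550.0),
    (3278.611224, 2850.0),
    (4006.367355, 3950.0),
    (3417.610478, 3725.0),
    (4009.612421, 4050.0),
    (4231.330911, 4050.0),
]

# Residual = actual minus predicted; negative means the model predicted too high.
residuals = [actual - predicted for predicted, actual in rows]
mean_abs_residual = sum(abs(r) for r in residuals) / len(residuals)

for (predicted, actual), residual in zip(rows, residuals):
    print(f"predicted={predicted:9.2f}  actual={actual:7.1f}  residual={residual:8.2f}")
print(f"mean absolute residual: {mean_abs_residual:.2f} g")
```

The Biscoe filter also returns Gentoo penguins, which were not part of the Adelie-only training data, so their predictions are best treated with extra caution.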


### Explain the prediction results

To understand why the model is generating these prediction results, you can use the `explain_predict` function.

In [18]:
bbq.ml.explain_predict(model_name, biscoe, top_k_features=3)

Unnamed: 0,predicted_label,top_feature_attributions,baseline_prediction_value,prediction_value,approximation_error,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,3945.010052,"[{'feature': 'island', 'attribution': -560.0} ...",4505.010052,3945.010052,0.0,Gentoo penguin (Pygoscelis papua),Biscoe,,,,,
1,3914.916297,"[{'feature': 'island', 'attribution': -560.0} ...",4505.010052,3914.916297,0.0,Adelie Penguin (Pygoscelis adeliae),Biscoe,39.7,18.9,184.0,3550.0,MALE
2,3278.611224,"[{'feature': 'island', 'attribution': -560.0} ...",4505.010052,3278.611224,0.0,Adelie Penguin (Pygoscelis adeliae),Biscoe,36.4,17.1,184.0,2850.0,FEMALE
3,4006.367355,"[{'feature': 'island', 'attribution': -560.0} ...",4505.010052,4006.367355,0.0,Adelie Penguin (Pygoscelis adeliae),Biscoe,41.6,18.0,192.0,3950.0,MALE
4,3417.610478,"[{'feature': 'island', 'attribution': -560.0} ...",4505.010052,3417.610478,0.0,Adelie Penguin (Pygoscelis adeliae),Biscoe,35.0,17.9,192.0,3725.0,FEMALE
5,4009.612421,"[{'feature': 'island', 'attribution': -560.0} ...",4505.010052,4009.612421,0.0,Adelie Penguin (Pygoscelis adeliae),Biscoe,41.1,18.2,192.0,4050.0,MALE
6,4231.330911,"[{'feature': 'island', 'attribution': -560.0} ...",4505.010052,4231.330911,0.0,Adelie Penguin (Pygoscelis adeliae),Biscoe,42.0,19.5,200.0,4050.0,MALE
7,3554.308906,"[{'feature': 'island', 'attribution': -560.0} ...",4505.010052,3554.308906,0.0,Gentoo penguin (Pygoscelis papua),Biscoe,43.8,13.9,208.0,4300.0,FEMALE
8,3550.677455,"[{'feature': 'island', 'attribution': -560.0} ...",4505.010052,3550.677455,0.0,Gentoo penguin (Pygoscelis papua),Biscoe,43.3,14.0,208.0,4575.0,FEMALE
9,3537.882543,"[{'feature': 'island', 'attribution': -560.0} ...",4505.010052,3537.882543,0.0,Gentoo penguin (Pygoscelis papua),Biscoe,44.0,13.6,208.0,4350.0,FEMALE
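
The columns in this output can be related with simple arithmetic, assuming the attributions are additive: the baseline plus all feature attributions gives the prediction. That assumption is consistent with the `approximation_error` of 0.0 shown here. The sketch below uses only values visible in rows 0 and 1. The attribution entries after `island` are elided in the output, so they are treated as an unknown remainder rather than filled in.

```python
# (baseline_prediction_value, prediction_value, island attribution)
# copied from rows 0 and 1 of the output above.
rows = {
    0: (4505.010052, 3945.010052, -560.0),
    1: (4505.010052, 3914.916297, -560.0),
}

remainders = {}
for row_id, (baseline, prediction, island_attribution) in rows.items():
    # Total contribution of all features under the additivity assumption.
    total_attribution = prediction - baseline
    # Whatever is not explained by the island feature.
    remainders[row_id] = total_attribution - island_attribution
    print(row_id, round(total_attribution, 6), round(remainders[row_id], 6))
```

In row 0, where every other feature is null, the island attribution accounts for the whole gap between baseline and prediction. In row 1, the remaining features contribute the rest.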


### Globally explain the model

To see which features are generally the most important in determining penguin
weight, you can use the `global_explain` function. To use
`global_explain`, you must retrain the model with the `enable_global_explain`
option set to `True`.

In [19]:
model_name = f"{PROJECT_ID}.{DATASET_ID}.penguin_weight_with_global_explain"
model_metadata = bbq.ml.create_model(
    model_name,
    replace=True,
    options={
        "model_type": "LINEAR_REG",
        "input_label_cols": ["body_mass_g"],
        "enable_global_explain": True,
    },
    training_data=training_data,
)

In [20]:
bbq.ml.global_explain(model_name)

Unnamed: 0,feature,attribution
0,sex,221.587592
1,flipper_length_mm,71.311846
2,culmen_depth_mm,66.17986
3,culmen_length_mm,45.443363
4,island,17.258076
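
To compare the features at a glance, one option is to express each attribution as a share of the total. The sketch below does this with the values copied from the output above. Normalizing by the sum is an illustrative choice for reading the numbers, not something `global_explain` does itself.

```python
# Global attributions copied from the output above.
attributions = {
    "sex": 221.587592,
    "flipper_length_mm": 71.311846,
    "culmen_depth_mm": 66.17986,
    "culmen_length_mm": 45.443363,
    "island": 17.258076,
}

total = sum(attributions.values())
shares = {feature: value / total for feature, value in attributions.items()}

# Print features from most to least influential.
for feature, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{feature:<20} {share:6.1%}")
```

Read this way, `sex` accounts for roughly half of the total attribution, and `island` contributes the least.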


## Compatibility with `bigframes.ml`

The models created with `bigframes.bigquery.ml` can be used with the scikit-learn-like `bigframes.ml` modules by using the `read_gbq_model` method.


In [21]:
model = bpd.read_gbq_model(model_name)
model

LinearRegression(enable_global_explain=True,
                 optimize_strategy='NORMAL_EQUATION')

In [24]:
X = training_data[["sex", "flipper_length_mm", "culmen_depth_mm", "culmen_length_mm", "island"]]
y = training_data[["body_mass_g"]]
model.score(X, y)

Unnamed: 0,mean_absolute_error,mean_squared_error,mean_squared_log_error,median_absolute_error,r2_score,explained_variance
0,223.878763,78553.601634,0.005614,181.330911,0.623951,0.623951


## Summary and next steps

You've created a linear regression model using `bigframes.bigquery.ml`.

Learn more about BigQuery DataFrames in the [documentation](https://dataframes.bigquery.dev/) and find more sample notebooks in the [GitHub repo](https://github.com/googleapis/python-bigquery-dataframes/tree/main/notebooks).

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can uncomment the remaining cells and run them to delete the individual resources you created in this tutorial:

In [None]:
# # Delete the BigQuery dataset and associated ML model
# from google.cloud import bigquery
# client = bigquery.Client(project=PROJECT_ID)
# client.delete_dataset(
#     DATASET_ID, delete_contents=True, not_found_ok=True
# )
# print("Deleted dataset '{}'.".format(DATASET_ID))