In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

## Train a linear regression model with BigQuery DataFrames ML


<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/googleapis/python-bigquery-dataframes/blob/main/notebooks/ml/bq_dataframes_ml_linear_regression_big.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/ml/bq_dataframes_ml_linear_regression_big.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/ml/bq_dataframes_ml_linear_regression_big.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/bigquery/import?url=https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/ml/bq_dataframes_ml_linear_regression_big.ipynb">
      <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTW1gvOovVlbZAIZylUtf5Iu8-693qS1w5NJw&s" alt="BQ logo" width="35">
      Open in BQ Studio
    </a>
  </td>
</table>

**_NOTE_**: This notebook has been tested in the following environment:

* Python version = 3.11

## Overview

This notebook demonstrates training a linear regression model on big data using BigQuery DataFrames ML. BigQuery DataFrames ML provides a scikit-learn-like API for ML powered by the BigQuery engine.

Learn more about [BigQuery DataFrames](https://cloud.google.com/python/docs/reference/bigframes/latest).

### Objective

In this tutorial, you use BigQuery DataFrames to create a linear regression model that predicts the weight of newborns.

The steps include:

- Creating a DataFrame from the BigQuery table.
- Cleaning and preparing data using the `bigframes.pandas` module.
- Creating a linear regression model using the `bigframes.ml` module.
- Saving the ML model to BigQuery for future use.

### Dataset

This tutorial uses the table `bigquery-public-data.samples.natality` (a BigQuery public dataset), which includes data on child births. This table contains more than 20 GB of data in more than 135 million rows.

### Costs

This tutorial uses billable components of Google Cloud:

* BigQuery (compute)
* BigQuery ML

Learn about [BigQuery compute pricing](https://cloud.google.com/bigquery/pricing#analysis_pricing_models)
and [BigQuery ML pricing](https://cloud.google.com/bigquery/pricing#bqml),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Installation

If you don't already have the [bigframes](https://pypi.org/project/bigframes/) package installed, uncomment and run the following cells to:

1. Install the package
1. Restart the notebook kernel (Jupyter or Colab) to work with the package

In [11]:
# !pip install bigframes

In [12]:
# Automatically restart kernel after installs so that your environment can access the new packages

# import IPython
#
# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Before you begin

Complete the tasks in this section to set up your environment.

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the BigQuery API](https://console.cloud.google.com/flows/enableapi?apiid=bigquery.googleapis.com).

4. If you are running this notebook locally, install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

If you don't know your project ID, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113).

In [None]:
PROJECT_ID = ""  # @param {type:"string"}
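If you prefer not to hard-code the project ID, one option is to fall back to an environment variable. The sketch below is a hypothetical helper, not part of BigQuery DataFrames; the function name `resolve_project_id` and the choice of the `GOOGLE_CLOUD_PROJECT` variable are assumptions about your setup.

```python
import os


def resolve_project_id(explicit_id: str, env_var: str = "GOOGLE_CLOUD_PROJECT") -> str:
    """Return explicit_id if set, else the value of env_var, else raise."""
    # Prefer the value typed into the notebook form field.
    if explicit_id:
        return explicit_id
    # Fall back to the environment variable (an assumption about your setup).
    from_env = os.environ.get(env_var, "")
    if from_env:
        return from_env
    raise ValueError(f"Set PROJECT_ID or the {env_var} environment variable")


# "my-example-project" is a placeholder, not a real project.
print(resolve_project_id("my-example-project"))
```

In the notebook you could then write `PROJECT_ID = resolve_project_id(PROJECT_ID)` before setting the BigQuery DataFrames options.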

#### Set the region

You can also change the `REGION` variable used by BigQuery. Learn more about [BigQuery regions](https://cloud.google.com/bigquery/docs/locations#supported_locations).

In [None]:
REGION = "US"  # @param {type: "string"}

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you might have to manually authenticate. Follow the relevant instructions below.

**Vertex AI Workbench**

No action is needed; you are already authenticated.

**Local JupyterLab instance**

Uncomment and run the following cell:

In [None]:
# ! gcloud auth login

**Colab**

Uncomment and run the following cell:

In [None]:
# from google.colab import auth
# auth.authenticate_user()

### Import libraries

In [None]:
import bigframes.pandas as bpd

### Set BigQuery DataFrames options

In [None]:
# NOTE: The project option is not required in all environments.
# On BigQuery Studio, the project ID is automatically detected.
bpd.options.bigquery.project = PROJECT_ID

# NOTE: The location option is not required.
# It defaults to the location of the first table or query
# passed to read_gbq(). For APIs where a location can't be
# auto-detected, the location defaults to the "US" location.
bpd.options.bigquery.location = REGION

# NOTE: The default ordering mode is fully ordered, for compatibility with
# pandas. Row order does not matter for model training, so relax the
# ordering mode to gain efficiency.
bpd.options.bigquery.ordering_mode = "partial"

If you want to reset the location of the created DataFrame or Series objects, reset the session by executing `bpd.close_session()`. After that, you can reuse `bpd.options.bigquery.location` to specify another location.
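The order of those two steps matters: close the session first, then set the new location. A minimal sketch follows, using a stand-in object so that it runs without a Google Cloud connection. The helper name `switch_location` and the `"EU"` value are illustrative assumptions.

```python
from types import SimpleNamespace


def switch_location(bpd_module, new_location: str) -> None:
    """Close the current session, then point the options at new_location."""
    # Close first: the location cannot be changed while a session is active.
    bpd_module.close_session()
    bpd_module.options.bigquery.location = new_location


# Stand-in object so the sketch runs without a Google Cloud connection.
# In the notebook you would pass the real module: switch_location(bpd, "EU")
calls = []
fake_bpd = SimpleNamespace(
    close_session=lambda: calls.append("closed"),
    options=SimpleNamespace(bigquery=SimpleNamespace(location="US")),
)
switch_location(fake_bpd, "EU")
print(calls, fake_bpd.options.bigquery.location)
```

DataFrame and Series objects created before the session was closed belong to the old session, so re-create them after switching.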

## Read a BigQuery table into a BigQuery DataFrames DataFrame

Read the `bigquery-public-data.samples.natality` table into a BigQuery DataFrames DataFrame:

In [None]:
df = bpd.read_gbq("bigquery-public-data.samples.natality")

See the preview of the DataFrame:

In [None]:
df.peek()

Observe that the column `weight_pounds` has missing values for several rows. In this tutorial, you predict those missing values.

In [None]:
df["weight_pounds"].isna().value_counts()

## Clean and prepare data

You can use pandas APIs as you normally would on the BigQuery DataFrames DataFrame, but calculations happen in the BigQuery query engine instead of your local environment.

Check the column names and types.

In [None]:
df.dtypes

The following columns are likely to carry information useful for predicting the weight:

In [None]:
target_column = "weight_pounds"
feature_columns = [
    "year", "state", "is_male", "apgar_1min", "apgar_5min",
    "mother_age", "mother_race", "gestation_weeks", "mother_married",
    "cigarette_use", "cigarettes_per_day", "alcohol_use", "drinks_per_week",
    "weight_gain_pounds", "born_alive_alive", "born_alive_dead", "born_dead",
    "ever_born", "father_race", "father_age", 
]

Keep only the columns of interest.

In [None]:
df = df[feature_columns + [target_column]]
df.peek()

Separate the rows with known and unknown `weight_pounds`. You train a linear regression model on the known data and use it to predict the unknown weights.

In [None]:
# Define a filter to identify the rows with missing weights
is_weight_unknown = df.weight_pounds.isna()

# Separate the rows with missing `weight_pounds` values
df_unknown = df[is_weight_unknown]

# Keep the rows with known `weight_pounds` values
df_known = df[~is_weight_unknown]

# Show the number of rows in each DataFrame
len(df_known), len(df_unknown), len(df)
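The boolean-mask pattern used in the cell above can be illustrated on a few in-memory rows. This is a plain-Python sketch with made-up values, not the BigQuery DataFrames implementation; it shows only that the mask splits the rows into two non-overlapping parts.

```python
# Toy rows standing in for the natality table; None marks a missing weight.
rows = [
    {"gestation_weeks": 39, "weight_pounds": 7.5},
    {"gestation_weeks": 36, "weight_pounds": None},
    {"gestation_weeks": 40, "weight_pounds": 8.1},
    {"gestation_weeks": 38, "weight_pounds": None},
]

# Boolean mask with one entry per row, like df.weight_pounds.isna()
is_weight_unknown = [row["weight_pounds"] is None for row in rows]

# Rows where the mask is True or False, like df[mask] and df[~mask]
unknown = [row for row, missing in zip(rows, is_weight_unknown) if missing]
known = [row for row, missing in zip(rows, is_weight_unknown) if not missing]

# The two parts never overlap and together cover every row.
print(len(known), len(unknown), len(rows))
```

The same check applies to the real data: the first two numbers printed by the notebook cell should add up to the third.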

Prepare your feature (or input) columns and the target (or output) column from the known data:

In [None]:
X = df_known[feature_columns]
y = df_known[target_column]

Split the known data into 80% training data and 20% test data.

In [None]:
from bigframes.ml.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Show the number of rows in the data splits
len(X_train), len(X_test), len(y_train), len(y_test)
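Conceptually, an 80/20 split shuffles the row positions and cuts them at the 20% mark. The plain-Python sketch below illustrates the idea only; BigQuery DataFrames performs the split inside BigQuery and its mechanism may differ. The helper name `split_indices` and the seed value are assumptions made for the example.

```python
import random


def split_indices(n_rows: int, test_size: float, seed: int):
    """Shuffle row positions and cut them into train and test parts."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)  # seeded, so the split is repeatable
    n_test = round(n_rows * test_size)
    return positions[n_test:], positions[:n_test]


train_idx, test_idx = split_indices(n_rows=1000, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split reproducible between runs, which helps when comparing models. Check the `bigframes.ml.model_selection` reference for the equivalent option.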

Prepare the unknown data for prediction.

In [None]:
X_unknown = df_unknown[feature_columns]

## Create the linear regression model

BigQuery DataFrames ML lets you move from exploring data to creating machine learning models through its scikit-learn-like API, `bigframes.ml`. BigQuery DataFrames ML supports several types of [ML models](https://cloud.google.com/python/docs/reference/bigframes/latest#ml-capabilities).

In this notebook, you create a linear regression model, a type of regression model that generates a continuous value from a linear combination of input features.
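As a small illustration of "a linear combination of input features", the sketch below computes a prediction by hand. The coefficients and intercept are made-up numbers chosen for easy arithmetic; a trained model learns its own values from the data.

```python
def predict_weight(features: dict, coefficients: dict, intercept: float) -> float:
    """Return intercept + sum(coefficient * feature value)."""
    return intercept + sum(
        coefficients[name] * value for name, value in features.items()
    )


# Made-up numbers for illustration only; a trained model learns these from data.
coefficients = {"gestation_weeks": 0.25, "is_male": 0.5}
intercept = -3.0

baby = {"gestation_weeks": 40, "is_male": 1}
print(predict_weight(baby, coefficients, intercept))  # -3.0 + 10.0 + 0.5 = 7.5
```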

When you create a model with BigQuery DataFrames ML, it is saved in an internal location and limited to the BigQuery DataFrames session. However, as you'll see in the next section, you can use `to_gbq` to save the model permanently to your BigQuery project.

### Create the model using `bigframes.ml`

When you pass the feature columns without transforms, BigQuery ML uses
[automatic preprocessing](https://cloud.google.com/bigquery/docs/auto-preprocessing) to encode string values and scale numeric values.
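To give a feel for what encoding and scaling mean, here is a plain-Python sketch of standardization and one-hot encoding. It is a conceptual illustration under simplifying assumptions (population standard deviation, alphabetical category order), not BigQuery ML's implementation.

```python
from statistics import mean, pstdev


def standardize(values):
    """Center values at zero and scale them to unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Turn each string into a 0/1 vector with one slot per distinct category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


print(standardize([36, 38, 40, 42]))
print(one_hot(["CA", "NY", "CA"]))
```

Because BigQuery ML applies these steps for you, the notebook can pass columns such as `state` and `gestation_weeks` to `fit` unchanged.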

BigQuery ML also [automatically splits the data for training and evaluation](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-glm#data_split_method). Only datasets with fewer than 500 rows are used entirely for training; the training data here is far larger, so BigQuery ML holds out some rows for evaluation.
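The automatic split can be sketched as a small function, assuming the `AUTO_SPLIT` thresholds described in the linked documentation: fewer than 500 rows means no evaluation rows, 500 to 50,000 rows means 20% are held out, and more than 50,000 rows means 10,000 are held out. The function name and the exact rounding are assumptions made for the example.

```python
def auto_split_eval_rows(n_rows: int) -> int:
    """Approximate number of rows held out for evaluation under AUTO_SPLIT."""
    if n_rows < 500:
        return 0  # small inputs: every row is used for training
    if n_rows <= 50_000:
        return n_rows // 5  # 20% held out; exact rounding is an assumption
    return 10_000  # large inputs: a fixed-size evaluation set


for n in (400, 1_000, 100_000_000):
    print(n, auto_split_eval_rows(n))
```

This internal evaluation split is separate from the `train_test_split` call earlier in the notebook, which keeps `X_test` and `y_test` entirely out of training.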

In [None]:
from bigframes.ml.linear_model import LinearRegression

model = LinearRegression()

model.fit(X_train, y_train)

### Score the model

Check how the model performed by using the `score` method. More information on model scoring can be found [here](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-evaluate#mlevaluate_output).

In [None]:
# On the training data
model.score(X_train, y_train)

In [None]:
# On the test data
model.score(X_test, y_test)
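The output of `score` for a regression model includes metrics such as `mean_absolute_error`, `mean_squared_error`, and `r2_score`. The sketch below computes these three by hand on a few made-up values, to show what the numbers mean. It uses the standard textbook formulas and is not the BigQuery ML implementation.

```python
def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, and R^2 from true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    squared_error_sum = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    # Total variation of the true values around their mean
    total_variation = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mean_absolute_error": sum(abs(e) for e in errors) / n,
        "mean_squared_error": squared_error_sum / n,
        "r2_score": 1 - squared_error_sum / total_variation,
    }


# Made-up weights in pounds, for illustration only.
print(regression_metrics([6.0, 7.0, 8.0, 9.0], [6.5, 7.0, 7.5, 9.0]))
```

An `r2_score` close to 1 means the model explains most of the variation in the target. Similar scores on the training and test data suggest that the model is not overfitting.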

### Predict using the model

Use the model to predict the weights of the newborns for whom the weights are unknown. The predicted weights are returned in the column `predicted_weight_pounds`.

In [None]:
df_pred = model.predict(X_unknown)
df_pred.peek()

In [None]:
# Fill the missing weights with the predictions, then count any values still missing
df["weight_pounds"].fillna(df_pred["predicted_weight_pounds"]).isna().value_counts()

## Save the model in BigQuery

The model currently exists only within this BigQuery DataFrames session. You can save the model permanently to BigQuery for use in future sessions, and to make the model sharable with others.

Create a BigQuery dataset to house the model, adding a name for your dataset as the `DATASET_ID` variable:

In [None]:
DATASET_ID = ""  # @param {type:"string"}

if not DATASET_ID:
    raise ValueError("Please define the DATASET_ID")

client = bpd.get_global_session().bqclient
dataset = client.create_dataset(DATASET_ID, exists_ok=True)
print(f"Dataset {dataset.dataset_id} created.")

Save the model using the `to_gbq` method:

In [None]:
model.to_gbq(DATASET_ID + ".natality_weight", replace=True)

You can view the saved model in the BigQuery console under the dataset you created in the first step. Run the following cell and follow the link to view your BigQuery console:

In [None]:
print(f'https://console.cloud.google.com/bigquery?project={PROJECT_ID}')
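To reuse the saved model in a later session, you can load it back by its ID. The sketch below assumes the model was saved as `natality_weight`, as in the cell above. The helper names and the `my_dataset` value are hypothetical, and `load_saved_model` needs BigQuery access, so it is defined here but not called.

```python
def saved_model_id(dataset_id: str, model_name: str = "natality_weight") -> str:
    """Build the 'dataset.model' ID used when the model was saved."""
    if not dataset_id:
        raise ValueError("dataset_id must not be empty")
    return f"{dataset_id}.{model_name}"


def load_saved_model(dataset_id: str):
    """Load the saved model in a new session (requires BigQuery access)."""
    # Imported lazily so the sketch can be read and run without bigframes.
    import bigframes.pandas as bpd

    return bpd.read_gbq_model(saved_model_id(dataset_id))


# "my_dataset" is a placeholder; use your own DATASET_ID.
print(saved_model_id("my_dataset"))
```

The loaded model can then be used with `predict` and `score` in the same way as the model trained in this notebook.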

## Summary and next steps

You've created a linear regression model using `bigframes.ml`.

Learn more about BigQuery DataFrames in the [documentation](https://cloud.google.com/python/docs/reference/bigframes/latest) and find more sample notebooks in the [GitHub repo](https://github.com/googleapis/python-bigquery-dataframes/tree/main/notebooks).

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can uncomment the remaining cells and run them to delete the individual resources you created in this tutorial:

In [None]:
# # Delete the BigQuery dataset and associated ML model
# client.delete_dataset(DATASET_ID, delete_contents=True, not_found_ok=True)