# Using ML - Easy linear regression

This demo shows how BigQuery DataFrames ML provides a scikit-learn-like experience for
training a linear regression model.

In this "easy" version of linear regression, we use a couple of BQML features to simplify our code:

- We rely on automatic preprocessing to encode string values and scale numeric values
- We rely on automatic data split & evaluation to test the model
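
To make the first point concrete, here is a minimal sketch of the two transforms that automatic preprocessing conceptually stands in for: standardizing numeric values and one-hot encoding strings. It is plain Python for illustration only. The helper names and sample values are hypothetical, and the exact formulas BQML applies internally (for example, which standard deviation it uses) are an assumption here.

```python
from statistics import mean, pstdev


def standardize(values):
    """Scale numeric values to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    # Guard against a constant column, which has zero spread.
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in values]


def one_hot(values):
    """Encode each string as a 0/1 indicator vector, one slot per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical sample values, not taken from the penguins table.
flipper_length_mm = [180.0, 190.0, 200.0]
island = ["Biscoe", "Dream", "Biscoe"]

scaled = standardize(flipper_length_mm)
encoded = one_hot(island)
```

In the rest of this notebook none of this is written by hand: the raw columns are passed straight to the model.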

This example is adapted from the [BQML linear regression tutorial](https://cloud.google.com/bigquery-ml/docs/linear-regression-tutorial).

## 1. Init & load data

Import the `bigframes.pandas` module and get the default session.

In [22]:
import bigframes.pandas
session = bigframes.pandas.get_global_session()

Define a dataset for storing the BQML model, and create it if it does not exist.

In [None]:
dataset = f"{session.bqclient.project}.bqml_tutorial"
session.bqclient.create_dataset(dataset, exists_ok=True)

Define the path where the model will be saved.

In [24]:
penguins_model = f"{dataset}.penguins_model"

Read the penguins data.

In [None]:
# read a BigQuery table into a BigQuery DataFrame
df = bigframes.pandas.read_gbq("bigquery-public-data.ml_datasets.penguins")

# take a peek at the dataframe
df

## 2. Data cleaning / prep

In [None]:
# filter down to the data we want to analyze
adelie_data = df[df.species == "Adelie Penguin (Pygoscelis adeliae)"]

# drop the columns we don't care about
adelie_data = adelie_data.drop(columns=["species"])

# drop rows with nulls to get our training data
training_data = adelie_data.dropna()

# take a peek at the training data
training_data

In [27]:
# pick feature columns and label column
feature_columns = training_data[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex']]
label_columns = training_data[['body_mass_g']]

# also get the rows that we want to make predictions for (i.e. where the label column, body_mass_g, is null)
missing_body_mass = adelie_data[adelie_data.body_mass_g.isnull()]
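
The two selections above are not complements of each other: `dropna()` removes a row if any column is null, while the prediction set keeps only rows whose label is null. A small plain-Python sketch of the same logic, using hypothetical rows rather than the real table, shows the difference:

```python
# Hypothetical rows for illustration; None stands in for a null value.
rows = [
    {"island": "Dream", "flipper_length_mm": 181.0, "body_mass_g": 3750.0},
    {"island": "Biscoe", "flipper_length_mm": 190.0, "body_mass_g": None},
    {"island": "Dream", "flipper_length_mm": None, "body_mass_g": 3800.0},
]

# Equivalent of dropna(): keep rows with no nulls in any column.
training_rows = [r for r in rows if all(v is not None for v in r.values())]

# Equivalent of filtering on body_mass_g.isnull(): keep rows missing the label.
rows_to_predict = [r for r in rows if r["body_mass_g"] is None]
```

The third row lands in neither set: it has a label, so it is not a prediction target, but a missing feature excludes it from training.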

## 3. Create, fit, score, predict

In [None]:
from bigframes.ml.linear_model import LinearRegression

model = LinearRegression()

# Here we pass the feature columns without transforms - BQML will then use
# automatic preprocessing to encode these columns
model.fit(feature_columns, label_columns)

In [None]:
# check how the model performed (scored here on the same data it was trained on)
model.score(feature_columns, label_columns)
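
For readers unfamiliar with regression scoring, the sketch below computes two common metrics, R² and mean absolute error, by hand in plain Python. Which metrics `model.score` actually reports is not stated in this notebook, so treat the choice of metrics as an assumption; the helper functions and the sample values are hypothetical.

```python
def r2_score(actual, predicted):
    """Fraction of the label variance explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical body masses in grams, not real model output.
actual = [3500.0, 3800.0, 4100.0]
predicted = [3600.0, 3800.0, 4000.0]

r2 = r2_score(actual, predicted)
mae = mean_absolute_error(actual, predicted)
```

An R² close to 1 means the predictions track the labels closely, and the mean absolute error is expressed in the label's own units (grams here).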

In [None]:
# use the model to predict the missing labels
model.predict(missing_body_mass)

## 4. Save in BigQuery

In [None]:
# save the model to a permanent location in BigQuery, so we can use it in future sessions (and elsewhere in BQ)
model.to_gbq(penguins_model, replace=True)

## 5. Reload from BigQuery

In [None]:
# WARNING - until b/281709360 is fixed & pipeline is updated, pipelines will load as models,
# and details of their transform steps will be lost (the loaded model will behave the same)
bigframes.pandas.read_gbq_model(penguins_model)