**Note: this notebook requires changes not yet checked in**

# Introduction

This is a prototype for how a minimal SKLearn-like wrapper for BQML might work in BigFrames.

Disclaimer: this is not a polished design or a robust implementation. It is a quick prototype for workshopping some ideas; the design will come next.

What is BigFrames?
- Pandas API for BigQuery
- Lets data scientists quickly iterate and prepare their data as they do in Pandas, but executed by BigQuery

What is meant by SKLearn-like?
- Follow the API design practices from the SKLearn project
    - [API design for machine learning software: experiences from the scikit-learn project](https://arxiv.org/pdf/1309.0238.pdf)
- Not a copy of, or compatible with, SKLearn

Briefly, patterns taken from SKLearn are:
- Models and transforms are 'Estimators'
    - A bundle of parameters with a consistent way to initialize/get/set
    - And a .fit(..) method to fit to training data
- Models additionally have a .predict(..)
- By default, these objects are transient, making them easy to play around with. No need to give them names or decide how to persist them.
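
The patterns above can be sketched in plain Python. This is only a minimal illustration, not the BigFrames implementation: `MeanRegressor` and its `offset` parameter are hypothetical names invented for this example.

```python
class MeanRegressor:
    """Toy estimator (hypothetical): predicts the training mean plus an offset."""

    def __init__(self, offset=0.0):
        # Parameters are plain attributes set at construction time.
        self.offset = offset
        self.mean_ = None  # learned state; stays None until fit() is called

    def get_params(self):
        return {"offset": self.offset}

    def set_params(self, **params):
        for name, value in params.items():
            if name not in self.get_params():
                raise ValueError(f"Unknown parameter: {name}")
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        # X is ignored by this toy model; a real model would learn from it.
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        if self.mean_ is None:
            raise RuntimeError("Model is not fitted yet")
        return [self.mean_ + self.offset for _ in X]


# The object is transient: no name, no persistence, just parameters and state.
toy = MeanRegressor()
toy.set_params(offset=1.0)
toy.fit([[1], [2], [3]], [10.0, 20.0, 30.0])
toy_predictions = toy.predict([[4], [5]])
```

The point of the pattern is that every estimator exposes the same small surface (`get_params`, `set_params`, `fit`, and for models `predict`), so they can be swapped and composed freely.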


Design goals:
- Zero-friction ML capabilities for BigFrames users (no extra auth, configuration, etc.)
- Offers first-class integration with the Pandas-like BigFrames API
- Uses SKLearn-like design patterns that feel familiar to data scientists
- Also a first-class BigQuery experience
    - Offers BigQuery's scalability and storage / compute management
    - Works naturally with BigQuery's other interfaces, e.g. GUI and SQL
    - BQML features

# Linear regression tutorial

Adapted from the "Penguin weight" Linear Regression tutorial for BQML: https://cloud.google.com/bigquery-ml/docs/linear-regression-tutorial


## Setting the scene

Our conservationists have sent us some measurements of penguins found in the Antarctic islands. They say that some of the body mass measurements for the Adelie penguins are missing, and ask if we can use some data science magic to estimate them. Sounds like a job for a linear regression!

Let's take a look at the data...

In [1]:
import bigframes

session = bigframes.connect()
df = session.read_gbq("bigframes-dev.bqml_tutorial.penguins")
df

,name,tag_number,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Genesis,1245,Adelie Penguin (Pygoscelis adeliae),Dream,37.5,18.9,179.0,2975.0,
1,Bella,1162,Adelie Penguin (Pygoscelis adeliae),Dream,40.7,17.0,190.0,3725.0,MALE
2,Diego,1175,Adelie Penguin (Pygoscelis adeliae),Dream,41.1,19.0,182.0,3425.0,MALE
3,Oliver,1178,Adelie Penguin (Pygoscelis adeliae),Dream,41.6,20.0,204.0,,MALE
4,Cole,1180,Adelie Penguin (Pygoscelis adeliae),Dream,38.8,20.0,190.0,3950.0,MALE
...,...,...,...,...,...,...,...,...,...
342,Annabelle,1299,Adelie Penguin (Pygoscelis adeliae),Torgersen,35.2,15.9,186.0,3050.0,FEMALE
343,Sadie,1301,Adelie Penguin (Pygoscelis adeliae),Torgersen,40.9,16.8,191.0,3700.0,FEMALE
344,Leonardo,1303,Adelie Penguin (Pygoscelis adeliae),Torgersen,36.6,17.8,185.0,3700.0,FEMALE
345,Violet,1306,Adelie Penguin (Pygoscelis adeliae),Torgersen,38.9,17.8,181.0,3625.0,FEMALE


First, note that although BigQuery has generated a default numbered index, the penguins are actually uniquely identified by their names and tags. Giving birds unique, memorable names in addition to their tags is a common practice among conservationists!

Let's make the data a bit friendlier to work with by setting the name column as the index.

In [2]:
df = df.set_index("name")
df

name,tag_number,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
Aaliyah,1177,Adelie Penguin (Pygoscelis adeliae),Biscoe,41.0,20.0,203.0,4725.0,MALE
Aaron,1110,Gentoo penguin (Pygoscelis papua),Biscoe,49.0,16.1,216.0,5550.0,MALE
Abigail,1054,Gentoo penguin (Pygoscelis papua),Biscoe,46.4,15.0,216.0,4700.0,FEMALE
Adam,1164,Adelie Penguin (Pygoscelis adeliae),Torgersen,38.6,17.0,188.0,2900.0,FEMALE
Addison,1077,Gentoo penguin (Pygoscelis papua),Biscoe,46.4,15.6,221.0,5000.0,MALE
...,...,...,...,...,...,...,...,...
Wyatt,1112,Gentoo penguin (Pygoscelis papua),Biscoe,52.2,17.1,228.0,5400.0,MALE
Xavier,1145,Gentoo penguin (Pygoscelis papua),Biscoe,46.5,14.4,217.0,4900.0,FEMALE
Zachary,1114,Gentoo penguin (Pygoscelis papua),Biscoe,45.3,13.8,208.0,4200.0,FEMALE
Zoe,1135,Gentoo penguin (Pygoscelis papua),Biscoe,51.5,16.3,230.0,5500.0,MALE


We saw in the first view that there were some missing values. We're especially interested in observations that are missing just the body_mass_g, so let's look at those:

In [3]:
df[df.body_mass_g.isnull()]

name,tag_number,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
Ayden,1157,Adelie Penguin (Pygoscelis adeliae),Torgersen,,,,,
Bentley,1197,Adelie Penguin (Pygoscelis adeliae),Dream,36.3,18.5,194.0,,MALE
Isabella,1033,Gentoo penguin (Pygoscelis papua),Biscoe,,,,,
Jesus,1186,Adelie Penguin (Pygoscelis adeliae),Dream,38.0,17.5,194.0,,FEMALE
Oliver,1178,Adelie Penguin (Pygoscelis adeliae),Dream,41.6,20.0,204.0,,MALE


Here we see three Adelie penguins, "Bentley", "Jesus", and "Oliver", that are missing their weight but have the other measurements. These are the ones we need to estimate. We can do this by training a statistical model on the measurements that we do have, and then using it to predict the missing values.

Our conservationists warned us that trying to generalize across species is a bad idea, so for now let's just try building a model for Adelie penguins. We can revisit this later and see whether including the other observations improves the model's performance.

In [4]:
# get all the rows with adelie penguins
adelie_data = df[df.species.str.startswith("Adelie")]

# separate out the rows that have a body mass measurement
training_data = adelie_data[adelie_data.body_mass_g.notnull()]

# we noticed there were also some rows that were missing other values,
# let's remove these so they don't affect our results
training_data = training_data.dropna()

# let's take a quick peek and make sure things look right:
training_data

name,tag_number,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
Aaliyah,1177,Adelie Penguin (Pygoscelis adeliae),Biscoe,41.0,20.0,203.0,4725.0,MALE
Adam,1164,Adelie Penguin (Pygoscelis adeliae),Torgersen,38.6,17.0,188.0,2900.0,FEMALE
Aidan,1189,Adelie Penguin (Pygoscelis adeliae),Dream,40.3,18.5,196.0,4350.0,MALE
Alejandro,1213,Adelie Penguin (Pygoscelis adeliae),Biscoe,35.9,19.2,189.0,3800.0,FEMALE
Alex,1183,Adelie Penguin (Pygoscelis adeliae),Dream,37.0,16.5,185.0,3400.0,FEMALE
...,...,...,...,...,...,...,...,...
Vanessa,1279,Adelie Penguin (Pygoscelis adeliae),Torgersen,42.9,17.6,196.0,4700.0,MALE
Victor,1253,Adelie Penguin (Pygoscelis adeliae),Dream,40.2,17.1,193.0,3400.0,FEMALE
Vincent,1221,Adelie Penguin (Pygoscelis adeliae),Biscoe,37.7,18.7,180.0,3600.0,MALE
Violet,1306,Adelie Penguin (Pygoscelis adeliae),Torgersen,38.9,17.8,181.0,3625.0,FEMALE


In [5]:
# we'll look at the schema too:
training_data.dtypes

tag_number             Int64
species               string
island                string
culmen_length_mm     Float64
culmen_depth_mm      Float64
flipper_length_mm    Float64
body_mass_g          Float64
sex                   string
dtype: object

Great! Now let's configure a linear regression model to predict body mass from the other columns.

In [6]:
import bigframes.ml as ml

model = ml.LinearRegression()
model

BQML linear regression model

Not yet fitted

As in SKLearn, an unfitted model object is just a bundle of parameters.

In [7]:
# let's view the parameters
model.get_params()

{'auto_class_weights': None,
 'calculate_p_values': None,
 'category_encoding_method': None,
 'class_weights': None,
 'data_split_col': None,
 'data_split_eval_fraction': None,
 'data_split_method': None,
 'early_stop': None,
 'enable_global_explain': None,
 'fit_intercept': None,
 'l1_reg': None,
 'l2_reg': None,
 'learn_rate': None,
 'learn_rate_strategy': None,
 'ls_init_learn_rate': None,
 'max_iterations': None,
 'min_rel_progress': None,
 'optimize_strategy': None,
 'warm_start': None}

For this task, the default options are all fine. But just so we can see how configuration works, let's specify that we want to use gradient descent to find the solution:

In [8]:
model.optimize_strategy = "BATCH_GRADIENT_DESCENT"
model

BQML linear regression model

optimize_strategy: BATCH_GRADIENT_DESCENT

Not yet fitted

BigQuery models provide a couple of extra conveniences:

1. By default, they will automatically perform feature engineering on the inputs - encoding our string columns and scaling our numeric columns.
2. By default, they will also automatically manage the test/training data split for us.
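
As a rough illustration of the first point, the sketch below shows what encoding string columns and scaling numeric columns amounts to, in plain Python. The helpers `one_hot` and `standardize` are hypothetical, written only for this example; the encoding and scaling that BigQuery actually applies may differ in detail.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Encode strings as 0/1 indicator vectors, one slot per distinct category."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


def standardize(values):
    """Rescale numbers to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# A few illustrative values in the style of the penguin columns
islands = ["Dream", "Biscoe", "Dream", "Torgersen"]
flipper_length_mm = [179.0, 190.0, 182.0, 204.0]

encoded = one_hot(islands)               # slots are ordered Biscoe, Dream, Torgersen
scaled = standardize(flipper_length_mm)  # now centred on 0 with unit spread
```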

So all we need to do is hook our chosen feature and label columns into the model and call .fit()!

In [9]:
train_x = training_data[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex']]
train_y = training_data[['body_mass_g']]
model.fit(train_x, train_y)
model

BQML linear regression model

optimize_strategy: BATCH_GRADIENT_DESCENT

Fitted: [_5d43055eacdc4edbad25851a3a083d71._a43deb1c-40db-4965-9d79-85a5e53e7a01](https://console.cloud.google.com/bigquery?project=bigframes-dev)

...and there, we've successfully trained a linear regression model. Let's see how it performs, using the automatic data split:

In [10]:
model.evaluate()

mean_absolute_error         222.885846
mean_squared_error        79645.032886
mean_squared_log_error        0.005656
median_absolute_error       169.966814
r2_score                      0.618726
explained_variance            0.620046
dtype: float64

Great! The model seems useful, explaining about 62% of the variance in body mass (r2_score of 0.62).
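
For reference, `r2_score` is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the observed values. A minimal sketch in plain Python, using made-up numbers rather than the penguin data:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only (grams)
observed = [3000.0, 3500.0, 4000.0, 4500.0]
predicted = [3200.0, 3400.0, 4100.0, 4300.0]

r2 = r2_score(observed, predicted)              # 1 - 100000 / 1250000 = 0.92
mae = mean_absolute_error(observed, predicted)  # (200 + 100 + 100 + 200) / 4 = 150.0
```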

We realize we made a mistake, though: we're trying to predict mass with a linear model, but mass increases with the cube of a penguin's size, whereas our inputs are linear in size. Can we improve our model by cubing them?

In [None]:
# SKIP THIS STEP (not yet working in BigFrames)

# let's define a preprocessing step that adjusts the linear measurements to use the cube
'''
def cubify(penguin_df):
    # cube the columns of the DataFrame that was passed in (modifies it in place)
    penguin_df.culmen_length_mm = penguin_df.culmen_length_mm.pow(3)
    penguin_df.culmen_depth_mm = penguin_df.culmen_depth_mm.pow(3)
    penguin_df.flipper_length_mm = penguin_df.flipper_length_mm.pow(3)

cubify(train_x)
train_x
'''

In [None]:
# AS ABOVE, SKIP FOR NOW
'''
model.fit(train_x, train_y)
model.evaluate()
'''
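
Since the cells above are skipped, the intuition can be illustrated with a self-contained sketch in plain Python. The data here is synthetic and noise-free by construction, so it only shows why cubing helps when mass really does scale with the cube of a length; it says nothing about how much the penguin model would improve. The helpers `fit_line` and `r_squared` are hypothetical names defined just for this example.

```python
def fit_line(x, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(x, y):
    """R^2 of the best-fit straight line through (x, y)."""
    slope, intercept = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot


# Synthetic data: mass grows exactly with the cube of length.
lengths = [10.0, 20.0, 30.0, 40.0, 50.0]
masses = [0.05 * length ** 3 for length in lengths]

r2_raw = r_squared(lengths, masses)                               # line through raw lengths
r2_cubed = r_squared([length ** 3 for length in lengths], masses)  # line through cubed lengths
```

On this toy data the cubed feature fits essentially perfectly, while the raw feature leaves some variance unexplained.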

Now that we're satisfied with our model, let's see what it predicts for those Adelie penguins with no body mass measurement:

In [11]:
# Let's predict the missing observations
missing_body_mass = adelie_data[adelie_data.body_mass_g.isnull()]

model.predict(missing_body_mass)

name,predicted_body_mass_g
Oliver,4305.620607
Bentley,3877.167499
Jesus,3477.661804
Ayden,2430.849117


Because we created the model without a name, it is only a temporary model that will disappear after 24 hours.

We decide that this approach is promising, so let's tell BigQuery to save it.

In [12]:
model.to_gbq("bqml_tutorial.penguins_model", replace=True)
model

BQML linear regression model

optimize_strategy: BATCH_GRADIENT_DESCENT

Fitted: [bqml_tutorial.penguins_model](https://console.cloud.google.com/bigquery?project=bigframes-dev)

We can now use this model anywhere in BigQuery under this name. We can also load it again in our BigFrames session and evaluate it or run inference with it, without needing to retrain it:

In [13]:
model = session.read_gbq("bqml_tutorial.penguins_model")
model

BQML linear regression model

optimize_strategy: BATCH_GRADIENT_DESCENT

Fitted: [bigframes-dev.bqml_tutorial.penguins_model](https://console.cloud.google.com/bigquery?project=bigframes-dev)

And of course we can retrain it if we like. Let's make another version that is based on all the penguins, so we can test the assumption we made at the beginning that it would be best to separate them:

In [14]:
# This time we'll take all the training data, for all species
training_data = df[df.body_mass_g.notnull()]
training_data = training_data.dropna()

# And we'll include species in our features
train_x = training_data[['species', 'island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex']]
train_y = training_data[['body_mass_g']]
model.fit(train_x, train_y)

# And we'll evaluate it on the Adelie penguins only
adelie_data = training_data[training_data.species.str.startswith("Adelie")]
test_x = adelie_data[['species', 'island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex']]
test_y = adelie_data[['body_mass_g']]
model.evaluate(test_x, test_y)

mean_absolute_error         244.218173
mean_squared_error        92113.963273
mean_squared_log_error        0.006737
median_absolute_error       206.081168
r2_score                      0.559035
explained_variance            0.559944
dtype: float64

It looks like the conservationists were right! Including other species, even though it gave us more training data, worsened prediction on the Adelie penguins.

===============================================

**Everything below this line not yet implemented**

We want to productionize this model, so let's start by publishing it to the Vertex AI Model Registry ([prerequisites](https://cloud.google.com/bigquery-ml/docs/managing-models-vertex#prerequisites)):

In [None]:
model.publish(
    registry="vertex_ai",
    vertex_ai_model_version_aliases=["experimental"])

Now when we fit the model, we can see it published here: https://console.cloud.google.com/vertex-ai/models

# Custom feature engineering

So far, we've relied on BigQuery to do our feature engineering for us. What if we want to do it manually?

BigFrames provides a way to do this using Pipelines.

In [None]:
from bigframes.ml.pipeline import Pipeline
from bigframes.ml.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('linreg', ml.LinearRegression())
])

pipe.fit(train_x, train_y)
pipe.evaluate()
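
Because the pipeline API above is not yet implemented, here is a plain-Python sketch of the underlying idea: each transform is fitted and applied in order, and the final model is fitted on the transformed data. `ToyScaler`, `ToyLinearModel`, and `ToyPipeline` are hypothetical classes for illustration only and handle a single numeric feature.

```python
from statistics import mean, pstdev


class ToyScaler:
    """Hypothetical transform: standardizes a single numeric feature."""

    def fit(self, x):
        self.mean_, self.std_ = mean(x), pstdev(x)
        return self

    def transform(self, x):
        return [(v - self.mean_) / self.std_ for v in x]


class ToyLinearModel:
    """Hypothetical model: one-feature least-squares regression."""

    def fit(self, x, y):
        mx, my = mean(x), mean(y)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        self.slope_ = sxy / sxx
        self.intercept_ = my - self.slope_ * mx
        return self

    def predict(self, x):
        return [self.slope_ * v + self.intercept_ for v in x]


class ToyPipeline:
    """Fits each transform in order, then fits the final model on the result."""

    def __init__(self, steps):
        self.transforms = [step for _, step in steps[:-1]]
        self.model = steps[-1][1]

    def fit(self, x, y):
        for transform in self.transforms:
            x = transform.fit(x).transform(x)
        self.model.fit(x, y)
        return self

    def predict(self, x):
        # Reuse the statistics learned during fit(); never refit on new data.
        for transform in self.transforms:
            x = transform.transform(x)
        return self.model.predict(x)


toy_pipe = ToyPipeline([("scaler", ToyScaler()), ("linreg", ToyLinearModel())])
toy_pipe.fit([180.0, 190.0, 200.0], [3000.0, 3500.0, 4000.0])
toy_prediction = toy_pipe.predict([195.0])  # halfway between the last two points
```

Bundling the steps this way means the same preprocessing is applied at training and prediction time, which is what makes it possible to store the whole thing as one unit.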

We can then save the entire pipeline to BigQuery. BigQuery will save it as a single model, with the preprocessing steps embedded in the TRANSFORM property:

In [None]:
pipe.to_gbq("bqml_tutorial.penguins_pipeline")

# Custom data split

BigQuery has also managed splitting out our training data. What if we want to do this manually?
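
One common approach is to shuffle the rows with a fixed seed and hold out a fraction for evaluation. The sketch below does this in plain Python. Note that `train_test_split` here is a hypothetical helper defined for illustration, not a BigFrames function, and how a manual split would be passed to the model has not been designed yet.

```python
import random


def train_test_split(rows, eval_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out eval_fraction of them for evaluation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


# Penguin names stand in for whole rows in this illustration.
penguin_names = ["Genesis", "Bella", "Diego", "Cole", "Annabelle",
                 "Sadie", "Leonardo", "Violet", "Aaliyah", "Adam"]
train_names, eval_names = train_test_split(penguin_names, eval_fraction=0.2, seed=42)
```

Every row lands in exactly one of the two sets, and rerunning with the same seed reproduces the same split.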

*TODO: Write this section*