**Note: this notebook requires changes not yet checked in**

# Introduction

This is a prototype for how a minimal SKLearn-like wrapper for BQML might work in BigQuery DataFrames.

Disclaimer: this is neither a polished design nor a robust implementation. It is a quick prototype for workshopping some ideas; the design will come next.

What is BigQuery DataFrames?
- Pandas API for BigQuery
- Lets data scientists quickly iterate and prepare their data as they do in Pandas, but executed by BigQuery

What is meant by SKLearn-like?
- Follow the API design practices from the SKLearn project
    - [API design for machine learning software: experiences from the scikit-learn project](https://arxiv.org/pdf/1309.0238.pdf)
- Not a copy of, or compatible with, SKLearn

Briefly, patterns taken from SKLearn are:
- Models and transforms are 'Estimators'
    - A bundle of parameters with a consistent way to initialize/get/set
    - And a .fit(..) method to fit to training data
- Models additionally have a .predict(..)
- By default, these objects are transient, making them easy to play around with. No need to give them names or decide how to persist them.
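
The patterns above can be sketched in a few lines of plain Python. The `MeanRegressor` class below is purely hypothetical (it is not part of BigQuery DataFrames or SKLearn); it only illustrates the general shape of an estimator: parameters set in the constructor, `get_params`/`set_params` for inspection and configuration, `fit` for training, and `predict` for inference.

```python
class MeanRegressor:
    """Hypothetical toy estimator: predicts the mean of the training labels."""

    def __init__(self, fit_intercept=True):
        # Parameters are plain attributes set at construction time.
        self.fit_intercept = fit_intercept
        self._mean = None  # learned state, filled in by fit()

    def get_params(self):
        # Return only the configuration, not the learned state.
        return {"fit_intercept": self.fit_intercept}

    def set_params(self, **params):
        for name, value in params.items():
            if name not in self.get_params():
                raise ValueError(f"Unknown parameter: {name}")
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        # Without an intercept, this toy model can only predict zero.
        self._mean = sum(y) / len(y) if self.fit_intercept else 0.0
        return self

    def predict(self, X):
        if self._mean is None:
            raise RuntimeError("Call fit() before predict().")
        return [self._mean for _ in X]


# The object is transient: no name or storage location is needed to use it.
model = MeanRegressor()
model.fit([[1.0], [2.0], [3.0]], [10.0, 20.0, 30.0])
predictions = model.predict([[4.0], [5.0]])
```

Keeping configuration (`get_params`) separate from learned state is what lets an unfitted model be treated as just a bundle of parameters.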


Design goals:
- Zero-friction ML capabilities for BigQuery DataFrames users (no extra auth, configuration, etc.)
- Offers first-class integration with the Pandas-like BigQuery DataFrames API
- Uses SKLearn-like design patterns that feel familiar to data scientists
- Also a first-class BigQuery experience
    - Offers BigQuery's scalability and storage / compute management
    - Works naturally with BigQuery's other interfaces, e.g. GUI and SQL
    - BQML features

# Linear regression tutorial

Adapted from the "Penguin weight" Linear Regression tutorial for BQML: https://cloud.google.com/bigquery-ml/docs/linear-regression-tutorial


## Setting the scene

Our conservationists have sent us some measurements of penguins found in the Antarctic islands. They say that some of the body mass measurements for the Adelie penguins are missing, and ask if we can use some data science magic to estimate them. Sounds like a job for a linear regression!

Let's take a look at the data...

In [1]:
import bigframes.pandas

df = bigframes.pandas.read_gbq("bigframes-dev.bqml_tutorial.penguins")
df

,tag_number,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,1225,Gentoo penguin (Pygoscelis papua),Biscoe,,,,,
1,1278,Gentoo penguin (Pygoscelis papua),Biscoe,42.0,13.5,210.0,4150.0,FEMALE
2,1275,Gentoo penguin (Pygoscelis papua),Biscoe,46.5,13.5,210.0,4550.0,FEMALE
3,1233,Gentoo penguin (Pygoscelis papua),Biscoe,43.3,14.0,208.0,4575.0,FEMALE
4,1311,Gentoo penguin (Pygoscelis papua),Biscoe,47.5,14.0,212.0,4875.0,FEMALE
5,1316,Gentoo penguin (Pygoscelis papua),Biscoe,49.1,14.5,212.0,4625.0,FEMALE
6,1313,Gentoo penguin (Pygoscelis papua),Biscoe,45.5,14.5,212.0,4750.0,FEMALE
7,1381,Gentoo penguin (Pygoscelis papua),Biscoe,47.6,14.5,215.0,5400.0,MALE
8,1377,Gentoo penguin (Pygoscelis papua),Biscoe,45.1,14.5,207.0,5050.0,FEMALE
9,1380,Gentoo penguin (Pygoscelis papua),Biscoe,45.1,14.5,215.0,5000.0,FEMALE


First, we note that although BigQuery generated a default numbered index, the penguins are actually uniquely identified by their tags.

Let's make the data a bit friendlier to work with by setting the tag number column as the index.

In [2]:
df = df.set_index("tag_number")
df

tag_number,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
1225,Gentoo penguin (Pygoscelis papua),Biscoe,,,,,
1278,Gentoo penguin (Pygoscelis papua),Biscoe,42.0,13.5,210.0,4150.0,FEMALE
1275,Gentoo penguin (Pygoscelis papua),Biscoe,46.5,13.5,210.0,4550.0,FEMALE
1233,Gentoo penguin (Pygoscelis papua),Biscoe,43.3,14.0,208.0,4575.0,FEMALE
1311,Gentoo penguin (Pygoscelis papua),Biscoe,47.5,14.0,212.0,4875.0,FEMALE
1316,Gentoo penguin (Pygoscelis papua),Biscoe,49.1,14.5,212.0,4625.0,FEMALE
1313,Gentoo penguin (Pygoscelis papua),Biscoe,45.5,14.5,212.0,4750.0,FEMALE
1381,Gentoo penguin (Pygoscelis papua),Biscoe,47.6,14.5,215.0,5400.0,MALE
1377,Gentoo penguin (Pygoscelis papua),Biscoe,45.1,14.5,207.0,5050.0,FEMALE
1380,Gentoo penguin (Pygoscelis papua),Biscoe,45.1,14.5,215.0,5000.0,FEMALE


We saw in the first view that there were some missing values. We're especially interested in observations that are missing just the body_mass_g, so let's look at those:

In [3]:
df[df.body_mass_g.isnull()]

tag_number,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
1225,Gentoo penguin (Pygoscelis papua),Biscoe,,,,,
1393,Adelie Penguin (Pygoscelis adeliae),Torgersen,,,,,
1524,Adelie Penguin (Pygoscelis adeliae),Dream,41.6,20.0,204.0,,MALE
1523,Adelie Penguin (Pygoscelis adeliae),Dream,38.0,17.5,194.0,,FEMALE
1525,Adelie Penguin (Pygoscelis adeliae),Dream,36.3,18.5,194.0,,MALE


Here we see that three Adelie penguins (tag numbers 1523, 1524, and 1525) are missing their body_mass_g but have the other measurements. These are the ones we need to estimate. We can do this by training a statistical model on the measurements that we do have, and then using it to predict the missing values.

Our conservationists warned us that trying to generalize across species is a bad idea, so for now let's just try building a model for Adelie penguins. We can revisit it later and see if including the other observations improves the model performance.

In [4]:
# get all the rows with adelie penguins
adelie_data = df[df.species == "Adelie Penguin (Pygoscelis adeliae)"]

# separate out the rows that have a body mass measurement
training_data = adelie_data[adelie_data.body_mass_g.notnull()]

# we noticed there were also some rows that were missing other values,
# let's remove these so they don't affect our results
training_data = training_data.dropna()

# let's take a quick peek and make sure things look right:
training_data

tag_number,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
1172,Adelie Penguin (Pygoscelis adeliae),Dream,32.1,15.5,188.0,3050.0,FEMALE
1371,Adelie Penguin (Pygoscelis adeliae),Biscoe,37.7,16.0,183.0,3075.0,FEMALE
1417,Adelie Penguin (Pygoscelis adeliae),Torgersen,38.6,17.0,188.0,2900.0,FEMALE
1204,Adelie Penguin (Pygoscelis adeliae),Dream,40.7,17.0,190.0,3725.0,MALE
1251,Adelie Penguin (Pygoscelis adeliae),Biscoe,37.6,17.0,185.0,3600.0,FEMALE
1422,Adelie Penguin (Pygoscelis adeliae),Torgersen,35.7,17.0,189.0,3350.0,FEMALE
1394,Adelie Penguin (Pygoscelis adeliae),Torgersen,40.2,17.0,176.0,3450.0,FEMALE
1163,Adelie Penguin (Pygoscelis adeliae),Dream,36.4,17.0,195.0,3325.0,FEMALE
1329,Adelie Penguin (Pygoscelis adeliae),Biscoe,38.1,17.0,181.0,3175.0,FEMALE
1406,Adelie Penguin (Pygoscelis adeliae),Torgersen,44.1,18.0,210.0,4000.0,MALE


In [5]:
# we'll look at the schema too:
training_data.dtypes

species              string[pyarrow]
island               string[pyarrow]
culmen_length_mm             Float64
culmen_depth_mm              Float64
flipper_length_mm            Float64
body_mass_g                  Float64
sex                  string[pyarrow]
dtype: object

Great! Now let's configure a linear regression model to predict body mass from the other columns.

In [6]:
import bigframes.ml.linear_model as ml

model = ml.LinearRegression()
model

LinearRegression()

As in SKLearn, an unfitted model object is just a bundle of parameters.

In [7]:
# let's view the parameters
model.get_params()

{'fit_intercept': True}

For this task, the default options are all fine. But just so we can see how configuration works, let's specify that we want to use gradient descent to find the solution:

In [8]:
model.optimize_strategy = "BATCH_GRADIENT_DESCENT"
model

LinearRegression()

BigQuery models provide a couple of extra conveniences:

1. By default, they will automatically perform feature engineering on the inputs - encoding our string columns and scaling our numeric columns.
2. By default, they will also automatically manage the test/training data split for us.
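
To give a feel for what this kind of automatic preprocessing involves, here is a plain-Python sketch of the two transformations mentioned in the first point: standard scaling of a numeric column and one-hot encoding of a string column. It illustrates the general techniques with hypothetical helper names and a few illustrative values; it is not BigQuery's actual implementation, which may differ in detail.

```python
import statistics


def standard_scale(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def one_hot_encode(values):
    """Map each distinct string to its own 0/1 indicator column."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


# A few illustrative values in the style of the penguin columns.
flipper_length_mm = [188.0, 183.0, 190.0, 195.0]
sex = ["FEMALE", "FEMALE", "MALE", "FEMALE"]

scaled = standard_scale(flipper_length_mm)
encoded = one_hot_encode(sex)
```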

So all we need to do is hook our chosen feature and label columns into the model and call .fit()!

In [9]:
train_x = training_data[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex']]
train_y = training_data[['body_mass_g']]
model.fit(train_x, train_y)
model

LinearRegression()

...and there, we've successfully trained a linear regression model. Let's see how it performs, using the automatic data split:

In [10]:
model.score(train_x, train_y)

,mean_absolute_error,mean_squared_error,mean_squared_log_error,median_absolute_error,r2_score,explained_variance
0,223.878763,78553.601634,0.005614,181.330911,0.623951,0.623951


Great! The model seems useful, explaining about 62% of the variance in body mass.
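
For reference, the `r2_score` and `mean_absolute_error` columns use names that match the standard regression metrics, which can be sketched in plain Python as follows. The helper functions are written for this sketch, and the sample values are made up for illustration; they are not taken from the penguin model.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Fraction of the variance in y_true that the predictions account for."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return 1 - ss_res / ss_tot


# Made-up body masses in grams and made-up predictions.
y_true = [3000.0, 3500.0, 4000.0, 4500.0]
y_pred = [3100.0, 3400.0, 4100.0, 4300.0]

mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
```

An `r2_score` of 1.0 would mean perfect predictions, while 0.0 would mean the model does no better than always predicting the mean.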

We realize we made a mistake though: we're trying to predict mass with a linear model, but mass increases with the cube of a penguin's size, whereas our inputs scale linearly with size. Can we improve our model by cubing them?
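
A small synthetic experiment illustrates the idea. The sketch below assumes a made-up relationship in which mass is exactly proportional to the cube of a length, then fits a one-variable least-squares line first to the raw length and then to the cubed length. The data and helper names are hypothetical, and real measurements are noisier, so the gain on the penguin data may be smaller or absent.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r2_of_fit(x, y):
    """R^2 of the best-fit line through (x, y)."""
    slope, intercept = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Synthetic data: mass is exactly proportional to length cubed.
lengths = [30.0, 35.0, 40.0, 45.0, 50.0]
masses = [0.05 * length ** 3 for length in lengths]

r2_raw = r2_of_fit(lengths, masses)  # line fitted to the raw length
r2_cubed = r2_of_fit([length ** 3 for length in lengths], masses)  # line fitted to length cubed
```

On this synthetic data the cubed feature gives an essentially perfect fit, while the raw feature leaves some curvature unexplained.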

In [11]:
# SKIP THIS STEP (not yet working in BigQuery DataFrames)

# let's define a preprocessing step that adjusts the linear measurements to use the cube
'''
def cubify(penguin_df):
    penguin_df.culmen_length_mm = penguin_df.culmen_length_mm.pow(3)
    penguin_df.culmen_depth_mm = penguin_df.culmen_depth_mm.pow(3)
    penguin_df.flipper_length_mm = penguin_df.flipper_length_mm.pow(3)

cubify(train_x)
train_x
'''

'\ndef cubify(penguin_df):\n    penguin_df.culmen_length_mm = penguin_df.culmen_length_mm.pow(3)\n    penguin_df.culmen_depth_mm = penguin_df.culmen_depth_mm.pow(3)\n    penguin_df.flipper_length_mm = penguin_df.flipper_length_mm.pow(3)\n\ncubify(train_x)\ntrain_x\n'

In [12]:
# AS ABOVE, SKIP FOR NOW
'''
model.fit(train_x, train_y)
model.evaluate()
'''

'\nmodel.fit(train_x, train_y)\nmodel.evaluate()\n'

Now that we're satisfied with our model, let's see what it predicts for those Adelie penguins with no body mass measurement:

In [13]:
# Let's predict the missing observations
missing_body_mass = adelie_data[adelie_data.body_mass_g.isnull()]

model.predict(missing_body_mass)

tag_number,predicted_body_mass_g
1393,3459.735118
1524,4304.175638
1523,3471.668379
1525,3947.881639


Because we created it without a name, it was just a temporary model that will disappear after 24 hours.

We decide that this approach is promising, so let's tell BigQuery to save it.

In [14]:
model.to_gbq("bqml_tutorial.penguins_model", replace=True)
model

LinearRegression()

We can now use this model anywhere in BigQuery with this name. We can also load
it again in our BigQuery DataFrames session and evaluate it or run inference with
it without needing to retrain it:

In [15]:
model = bigframes.pandas.read_gbq_model("bqml_tutorial.penguins_model")
model

LinearRegression()

And of course we can retrain it if we like. Let's make another version that is based on all the penguins, so we can test the assumption we made at the beginning that it would be best to separate them:

In [16]:
# This time we'll take all the training data, for all species
training_data = df[df.body_mass_g.notnull()]
training_data = training_data.dropna()

# And we'll include species in our features
train_x = training_data[['species', 'island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex']]
train_y = training_data[['body_mass_g']]
model.fit(train_x, train_y)

# And we'll evaluate it on the Adelie penguins only
adelie_data = training_data[training_data.species == "Adelie Penguin (Pygoscelis adeliae)"]
test_x = adelie_data[['species', 'island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex']]
test_y = adelie_data[['body_mass_g']]
model.score(test_x, test_y)

,mean_absolute_error,mean_squared_error,mean_squared_log_error,median_absolute_error,r2_score,explained_variance
0,224.717433,79527.879623,0.005693,169.235869,0.619287,0.619287


It looks like the conservationists were right! Including other species, even though it gave us more training data, slightly worsened prediction on the Adelie penguins (an r2_score of 0.619, down from 0.624).

===============================================

**Everything below this line not yet implemented**

We want to productionalize this model, so let's start by publishing it to the Vertex AI Model Registry ([prerequisites](https://cloud.google.com/bigquery-ml/docs/managing-models-vertex#prerequisites))

In [None]:
model.publish(
    registry="vertex_ai",
    vertex_ai_model_version_aliases=["experimental"])

Now when we fit the model, we can see it published here: https://console.cloud.google.com/vertex-ai/models

# Custom feature engineering

So far, we've relied on BigQuery to do our feature engineering for us. What if we want to do it manually?

BigQuery DataFrames provides a way to do this using Pipelines.

In [None]:
from bigframes.ml.linear_model import LinearRegression
from bigframes.ml.pipeline import Pipeline
from bigframes.ml.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('linreg', LinearRegression())
])

pipe.fit(train_x, train_y)
pipe.evaluate()

We can then save the entire pipeline to BigQuery. BigQuery will save it as a single model, with the preprocessing steps embedded in the TRANSFORM property:

In [None]:
pipe.to_gbq("bqml_tutorial.penguins_pipeline")
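
Conceptually, a pipeline fits each preprocessing step in turn, passes the transformed data along, and then fits the final model. The plain-Python sketch below illustrates that chaining with hypothetical toy classes (`ToyScaler`, `ToyLineModel`, `ToyPipeline`) and made-up data; it is not how BigQuery DataFrames implements `Pipeline`.

```python
import statistics


class ToyScaler:
    """Hypothetical transformer: standardizes a single numeric column."""

    def fit(self, X, y=None):
        self.mean_ = statistics.fmean(X)
        self.std_ = statistics.pstdev(X)
        return self

    def transform(self, X):
        return [(x - self.mean_) / self.std_ for x in X]


class ToyLineModel:
    """Hypothetical model: one-variable least-squares line."""

    def fit(self, X, y):
        mean_x = statistics.fmean(X)
        mean_y = statistics.fmean(y)
        sxx = sum((x - mean_x) ** 2 for x in X)
        sxy = sum((x - mean_x) * (t - mean_y) for x, t in zip(X, y))
        self.slope_ = sxy / sxx
        self.intercept_ = mean_y - self.slope_ * mean_x
        return self

    def predict(self, X):
        return [self.slope_ * x + self.intercept_ for x in X]


class ToyPipeline:
    """Hypothetical pipeline: transformers first, model last."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, object) pairs

    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)  # reuse the statistics learned during fit
        return self.steps[-1][1].predict(X)


pipe = ToyPipeline([("scaler", ToyScaler()), ("linreg", ToyLineModel())])
pipe.fit([180.0, 190.0, 200.0, 210.0], [3000.0, 3500.0, 4000.0, 4500.0])
prediction = pipe.predict([195.0])
```

The key design point is that `predict` reuses the scaling statistics learned in `fit`, so new data is transformed exactly as the training data was.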

# Custom data split

BigQuery has also managed splitting out our training data. What if we want to do this manually?
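
One common approach, sketched here in plain Python, is to shuffle the row identifiers with a fixed seed and hold out a fraction of them for testing. The `train_test_split` helper is a hypothetical stand-in written for this sketch, not a BigQuery DataFrames API, and the tag numbers are simply examples from the tables above.

```python
import random


def train_test_split(row_ids, test_fraction=0.2, seed=42):
    """Shuffle row ids reproducibly and split them into train and test lists."""
    shuffled = list(row_ids)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


tag_numbers = [1172, 1371, 1417, 1204, 1251, 1422, 1394, 1163, 1329, 1406]
train_ids, test_ids = train_test_split(tag_numbers)
```

Fixing the seed keeps the split reproducible, so repeated runs evaluate the model on the same held-out penguins.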

*TODO: Write this section*