# Linear regression tutorial

WIP: adapt the [BQML penguin weight tutorial](https://cloud.google.com/bigquery-ml/docs/linear-regression-tutorial) to BigFrames.

This is an exploration of how a minimal combination of BQML and scikit-learn style might work.

In [1]:
import bigframes

session = bigframes.connect()

Let's load the table containing our source data.

In [7]:
df = session.read_gbq("bigquery-public-data.ml_datasets.penguins")
df.head()


                                       species     island  ...  body_mass_g     sex
0          Adelie Penguin (Pygoscelis adeliae)      Dream  ...       3475.0  FEMALE
1          Adelie Penguin (Pygoscelis adeliae)      Dream  ...       4650.0    MALE
2          Adelie Penguin (Pygoscelis adeliae)      Dream  ...       3900.0    MALE
3    Chinstrap penguin (Pygoscelis antarctica)      Dream  ...       3500.0  FEMALE
4          Adelie Penguin (Pygoscelis adeliae)      Dream  ...       3000.0  FEMALE
..                                         ...        ...  ...          ...     ...
339        Adelie Penguin (Pygoscelis adeliae)  Torgersen  ...       3275.0  FEMALE
340        Adelie Penguin (Pygoscelis adeliae)  Torgersen  ...       3700.0  FEMALE
341        Adelie Penguin (Pygoscelis adeliae)  Torgersen  ...       3050.0  FEMALE
342        Adelie Penguin (Pygoscelis adeliae)  Torgersen  ...       4000.0    MALE
343        Adelie Penguin (Pygoscelis adeliae)  Torgersen  ...       3775.0

We want to predict body_mass_g, but only for the female penguins. Let's remove the males from the data.

In [6]:
df = df[df['sex'] == 'FEMALE']
df.head()


                                       species     island  ...  body_mass_g     sex
0          Adelie Penguin (Pygoscelis adeliae)      Dream  ...       3475.0  FEMALE
3    Chinstrap penguin (Pygoscelis antarctica)      Dream  ...       3500.0  FEMALE
4          Adelie Penguin (Pygoscelis adeliae)      Dream  ...       3000.0  FEMALE
6    Chinstrap penguin (Pygoscelis antarctica)      Dream  ...       2700.0  FEMALE
7    Chinstrap penguin (Pygoscelis antarctica)      Dream  ...       3400.0  FEMALE
..                                         ...        ...  ...          ...     ...
331        Adelie Penguin (Pygoscelis adeliae)  Torgersen  ...       3700.0  FEMALE
334        Adelie Penguin (Pygoscelis adeliae)  Torgersen  ...       3325.0  FEMALE
339        Adelie Penguin (Pygoscelis adeliae)  Torgersen  ...       3275.0  FEMALE
340        Adelie Penguin (Pygoscelis adeliae)  Torgersen  ...       3700.0  FEMALE
341        Adelie Penguin (Pygoscelis adeliae)  Torgersen  ...       3050.0
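
The filter keeps only rows where the comparison is true, and the surviving rows keep their original index labels (0, 3, 4, ... in the output above). A minimal plain-Python sketch of the same idea follows. The rows are toy values for illustration, and the row with no recorded sex is hypothetical, included only to show that it would be dropped along with the males.

```python
# Toy rows keyed by index label; values are illustrative only.
penguins = {
    0: {"sex": "FEMALE", "body_mass_g": 3475.0},
    1: {"sex": "MALE", "body_mass_g": 4650.0},
    2: {"sex": None, "body_mass_g": 3700.0},  # hypothetical row with no recorded sex
    3: {"sex": "FEMALE", "body_mass_g": 3500.0},
}

# Keep a row only when the equality test is true. A missing value never
# equals "FEMALE", so that row is removed as well.
females = {label: row for label, row in penguins.items() if row["sex"] == "FEMALE"}

print(list(females))  # the original labels of the rows that survived
```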

Great! Now let's configure a linear regression model to predict body mass from the other columns.

In [8]:
import bigframes.ml as ml

model = ml.LinearRegression()
model.get_params()

{'auto_class_weights': None,
 'calculate_p_values': None,
 'category_encoding_method': None,
 'class_weights': None,
 'data_split_col': None,
 'data_split_eval_fraction': None,
 'data_split_method': None,
 'early_stop': None,
 'enable_global_explain': None,
 'fit_intercept': None,
 'l1_reg': None,
 'l2_reg': None,
 'learn_rate': None,
 'learn_rate_strategy': None,
 'ls_init_learn_rate': None,
 'max_iterations': None,
 'min_rel_progress': None,
 'optimize_strategy': None,
 'warm_start': None}

The model is just an empty configuration at the moment; it won't create anything in BigQuery until we fit it to some training data.

In [10]:
train_x = df[['species', 'island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex']]
train_y = df['body_mass_g']
model.fit(train_x, train_y)

LinearRegression()
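
To make the "empty configuration until fit" idea concrete, here is a rough sketch of how unset parameters could be turned into a BQML `CREATE MODEL` statement only at fit time. `LinearRegressionConfig` and `render_create_model` are hypothetical names invented for this illustration; they are not part of BigFrames, and the SQL BigFrames actually generates may differ. The sketch also selects from the whole source table for brevity, whereas the notebook trains on the filtered frame.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class LinearRegressionConfig:
    """Hypothetical stand-in for an unfitted estimator: settings only."""

    l1_reg: Optional[float] = None
    l2_reg: Optional[float] = None
    max_iterations: Optional[int] = None

    def get_params(self) -> dict:
        return asdict(self)


def render_create_model(config, model_name, label_col, source_table):
    """Render a CREATE MODEL statement, skipping parameters left as None."""
    options = {
        "model_type": "'linear_reg'",
        "input_label_cols": f"['{label_col}']",
    }
    for key, value in config.get_params().items():
        if value is not None:  # None means "let BigQuery pick its default"
            options[key] = str(value)
    rendered = ",\n  ".join(f"{key} = {value}" for key, value in options.items())
    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        f"OPTIONS (\n  {rendered}\n)\n"
        f"AS SELECT * FROM `{source_table}`"
    )


config = LinearRegressionConfig(l2_reg=0.1)
print(
    render_create_model(
        config,
        "bqml_tutorial.penguins_model",
        "body_mass_g",
        "bigquery-public-data.ml_datasets.penguins",
    )
)
```

Nothing touches BigQuery while the configuration object is built or inspected; a statement only exists once something asks for it, which mirrors the deferred behaviour described above.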


BigQuery automatically managed our training data split and model evaluation for us. Let's see how well the model performed.

In [9]:
model.evaluate()


Mean absolute error:      227.0122
Mean squared error:     81838.1599
Mean squared log error:     0.0051
Median absolute error:    173.0808
R squared:                  0.8724
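
Metrics like these are computed from pairs of actual and predicted values. The sketch below uses the standard textbook definitions; that BigQuery ML uses exactly these formulas is an assumption here, and the three toy data points are made up for illustration, not taken from the penguins table.

```python
import math
from statistics import mean, median


def regression_metrics(actual, predicted):
    """Return common regression metrics for paired actual/predicted values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    abs_errors = [abs(e) for e in errors]
    mean_actual = mean(actual)
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    return {
        "mean_absolute_error": mean(abs_errors),
        "mean_squared_error": ss_res / len(errors),
        "mean_squared_log_error": mean(
            (math.log1p(a) - math.log1p(p)) ** 2 for a, p in zip(actual, predicted)
        ),
        "median_absolute_error": median(abs_errors),
        "r_squared": 1 - ss_res / ss_tot,
    }


# Toy body masses in grams (illustrative values only).
actual = [3000.0, 3500.0, 4000.0]
predicted = [3100.0, 3400.0, 4000.0]
for name, value in regression_metrics(actual, predicted).items():
    print(f"{name}: {value:.4f}")
```

An R squared close to 1 means the predictions explain most of the variance in body mass; the absolute-error metrics are in the same unit as the label (grams).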


Great! The model works well. Because we created it without a name, it is only a temporary model that will disappear after 24 hours.

We decide that this approach is promising, so let's fit it again, but this time we'll specify a name so that the fitted model is saved.

In [11]:
model.set_persistence(name="bqml_tutorial.penguins_model")
model.fit(train_x, train_y)


LinearRegression()
Persistence:
   BigQuery model: bqml_tutorial.penguins_model


We can now use this model anywhere in BigQuery under this name, and we can view it in Pantheon [here](https://pantheon.corp.google.com/bigquery?project=bigframes-dev&cloudshell=true&mods=pan_ng2&ws=!1m10!1m4!1m3!1sbigframes-dev!2sbquxjob_294496a9_18657567655!3sUS!1m4!5m3!1sbigframes-dev!2sbqml_tutorial!3spenguins_model). We can also load it again in our BigFrames session and evaluate it or run inference with it without retraining:

In [12]:
model = session.read_gbq("bqml_tutorial.penguins_model")

model


LinearRegression()
Persistence:
   BigQuery model: bqml_tutorial.penguins_model


And of course we can retrain it:

In [13]:
model.fit(train_x, train_y)


LinearRegression()
Persistence:
   BigQuery model: bqml_tutorial.penguins_model


We want to productionize this model, so let's start publishing it to the Vertex AI Model Registry ([prerequisites](https://cloud.google.com/bigquery-ml/docs/managing-models-vertex#prerequisites)).

Note that while we can load models from BigQuery and change their parameters, changes are only ever persisted when we run .fit().

In [14]:
model.set_persistence(
    registry="vertex_ai",
    vertex_ai_model_version_aliases=["experimental"])
model.fit(train_x, train_y)


LinearRegression()
Persistence:
   BigQuery model: bqml_tutorial.penguins_model
   Model registry: Vertex AI
   Vertex AI model version aliases: [ experimental ]
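
The persistence behaviour seen in this notebook (settings accumulate across `set_persistence` calls, and nothing is written until `.fit()` runs) can be modelled with a small toy class. `ToyModel` and its dictionary-backed store are hypothetical stand-ins written for this illustration; they say nothing about how BigFrames implements persistence internally.

```python
class ToyModel:
    """Hypothetical stand-in for the estimator; not the BigFrames class."""

    def __init__(self, store):
        self._store = store  # plays the role of BigQuery
        self._persistence = {}  # settings recorded locally only

    def set_persistence(self, **settings):
        # Merge new settings with earlier ones; nothing is written yet.
        self._persistence.update(settings)

    def fit(self, train_x, train_y):
        # Training itself is elided; the point is that only fit() writes.
        name = self._persistence.get("name")
        if name is not None:
            self._store[name] = dict(self._persistence)
        return self


store = {}
model = ToyModel(store)
model.set_persistence(name="bqml_tutorial.penguins_model")
print(store)  # still empty: set_persistence alone saves nothing
model.fit([], [])
model.set_persistence(registry="vertex_ai")
model.fit([], [])
print(store)  # the saved entry now carries both the name and the registry
```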


Now when we fit the model, we can see it published here: https://pantheon.corp.google.com/vertex-ai/models