# Using ML - Easy linear regression

This demo shows the BigFrames ML API, which provides an SKLearn-like experience for training a linear regression model.

In this "easy" version of linear regression, we use a couple of BQML features to simplify our code:

- We rely on automatic preprocessing to encode string values and scale numeric values
- We rely on automatic data split & evaluation to test the model
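
As a rough illustration of what "encode string values and scale numeric values" means, here is a minimal pure-Python sketch of one-hot encoding and standardization. It is a conceptual stand-in, not BQML's implementation; the helper names `one_hot` and `standardize` and the toy values are hypothetical.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Map each distinct string to a 0/1 indicator column."""
    categories = sorted(set(values))
    return {c: [1.0 if v == c else 0.0 for v in values] for c in categories}


def standardize(values):
    """Rescale numbers to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    # A constant column has sigma == 0; map it to all zeros instead of dividing by zero.
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


# Hypothetical toy rows shaped like two columns of the penguins table.
island = ["Biscoe", "Dream", "Biscoe", "Torgersen"]
flipper_length_mm = [192.0, 184.0, 200.0, 208.0]

encoded_island = one_hot(island)                  # one indicator column per island
scaled_flipper = standardize(flipper_length_mm)   # centered at 0, unit spread
```

With automatic preprocessing, none of this has to appear in the notebook: the raw string and numeric columns are passed to the model directly.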

This example is adapted from the [BQML linear regression tutorial](https://cloud.google.com/bigquery-ml/docs/linear-regression-tutorial).

## 1. Init & load data

In [1]:
# initialize BigFrames
import bigframes
session = bigframes.connect()

# read a BigQuery table to a BigFrames dataframe
df = session.read_gbq("bigframes-dev.bqml_tutorial.penguins")

# take a peek at the dataframe
df

     tag_number                                    species     island  \
136        1228        Adelie Penguin (Pygoscelis adeliae)     Biscoe   
144        1295        Adelie Penguin (Pygoscelis adeliae)     Biscoe   
167        1200        Adelie Penguin (Pygoscelis adeliae)      Dream   
199        1398        Adelie Penguin (Pygoscelis adeliae)  Torgersen   
227        1137        Adelie Penguin (Pygoscelis adeliae)      Dream   
251        1304        Adelie Penguin (Pygoscelis adeliae)     Biscoe   
264        1374        Adelie Penguin (Pygoscelis adeliae)     Biscoe   
316        1167  Chinstrap penguin (Pygoscelis antarctica)      Dream   
42         1360          Gentoo penguin (Pygoscelis papua)     Biscoe   
78         1339          Gentoo penguin (Pygoscelis papua)     Biscoe   

     culmen_length_mm  culmen_depth_mm  flipper_length_mm  body_mass_g     sex  
136              41.6             18.0              192.0       3950.0    MALE  
144              41.0             

## 2. Data cleaning / prep

In [2]:
# set a friendlier index to uniquely identify the rows
df = df.set_index("tag_number")

# filter down to the data we want to analyze
adelie_data = df[df.species == "Adelie Penguin (Pygoscelis adeliae)"]

# drop the columns we don't care about
adelie_data = adelie_data.drop(columns=["species"])

# drop rows with nulls to get our training data
training_data = adelie_data.dropna()

# take a peek at the training data
training_data

           island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  \
tag_number                                                                
1101        Dream              36.6             18.4              184.0   
1102        Dream              39.8             19.1              184.0   
1103        Dream              40.9             18.9              184.0   
1105        Dream              37.3             16.8              192.0   
1106        Dream              43.2             18.5              192.0   
1110        Dream              40.2             20.1              200.0   
1111        Dream              40.8             18.9              208.0   
1112        Dream              39.0             18.7              185.0   
1113        Dream              37.0             16.9              185.0   
1115        Dream              34.0             17.1              185.0   

            body_mass_g     sex  
tag_number                       
1101             3475.0  FEMALE

In [3]:
# pick feature columns and label column
feature_columns = training_data[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex']]
label_columns = training_data[['body_mass_g']]

# also get the rows that we want to make predictions for (i.e. where the label column is null)
missing_body_mass = adelie_data[adelie_data.body_mass_g.isnull()]

## 3. Create, fit, score, predict

In [4]:
from bigframes.ml.linear_model import LinearRegression

# as in scikit-learn, a newly created model is just a bundle of parameters
# default parameters are fine here; we only ask BQML to choose the data split automatically
model = LinearRegression(data_split_method="AUTO_SPLIT")

# this will train a temporary model in BQML
model.fit(feature_columns, label_columns)
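
To make the `"AUTO_SPLIT"` setting a little more concrete, the sketch below encodes the size-based rule as I understand it from the BQML documentation. The thresholds (500 rows, 50,000 rows, 20%, 10,000 rows) are assumptions to check against the current docs, and `auto_split_eval_rows` is a hypothetical helper, not part of BigFrames or BQML.

```python
def auto_split_eval_rows(n_rows):
    """Estimate how many rows an automatic split would hold out for evaluation.

    Thresholds are assumptions taken from a reading of the BQML docs.
    """
    if n_rows < 500:
        # Small inputs: assumed to be used entirely for training.
        return 0
    if n_rows <= 50_000:
        # Medium inputs: assumed 20% random holdout.
        return int(n_rows * 0.2)
    # Large inputs: assumed fixed-size random holdout.
    return 10_000


small = auto_split_eval_rows(300)
medium = auto_split_eval_rows(1_000)
large = auto_split_eval_rows(100_000)
```

If these thresholds hold, a table with only a few hundred rows would be used entirely for training, so evaluation metrics on it would describe the fit on the training rows rather than on a held-out set.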

In [5]:
# check how the model performed, using the automatic test/training data split chosen by BQML
model.score()

   mean_absolute_error  mean_squared_error  mean_squared_log_error  \
0           223.878763        78553.601634                0.005614   

   median_absolute_error  r2_score  explained_variance  
0             181.330911  0.623951            0.623951  

[1 rows x 6 columns]
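
For readers who want to see how a few of these columns are defined, here is a minimal pure-Python sketch of the standard formulas for mean absolute error, mean squared error, and R². The function name `regression_metrics` and the toy values are hypothetical; this is not how BQML computes its output internally.

```python
def regression_metrics(actual, predicted):
    """Compute MAE, MSE and R^2 from paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_actual = sum(actual) / n
    # Total sum of squares: spread of the actual values around their mean.
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    # R^2 = 1 - (residual sum of squares / total sum of squares).
    r2 = 1.0 - (mse * n) / ss_total
    return {
        "mean_absolute_error": mae,
        "mean_squared_error": mse,
        "r2_score": r2,
    }


# Hypothetical body masses in grams.
actual = [3500.0, 3800.0, 4100.0, 3600.0]
predicted = [3600.0, 3700.0, 4000.0, 3800.0]
metrics = regression_metrics(actual, predicted)
```

Read this way, the `mean_squared_error` of about 78,554 reported above corresponds to a root-mean-squared error of roughly 280 g, and the `r2_score` of about 0.62 means the model accounts for roughly 62% of the variance in body mass on the rows it was scored on.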

In [6]:
# use the model to predict the missing labels
model.predict(missing_body_mass)

                predicted_body_mass_g
tag_number_z_z                       
1393                      3459.735118
1524                      4304.175638
1523                      3471.668379
1525                      3947.881639

[4 rows x 1 columns]

## 4. Save in BigQuery

In [7]:
# save the model to a permanent location in BigQuery, so we can use it in future sessions (and elsewhere in BQ)
model.to_gbq("bqml_tutorial.penguins_model", replace=True)

I can't believe it's not SKLearn!