# Linear Regression - Penguin weight

![weighing a penguin - Parti](penguinweigh.png)

This demo shows how the BigFrames ML API provides a scikit-learn-like experience for training a linear regression model.

This example is adapted from the [BQML linear regression tutorial](https://cloud.google.com/bigquery-ml/docs/linear-regression-tutorial).

## 1. Init & load data

In [1]:
# initialize BigFrames
import bigframes
session = bigframes.connect()

# read a BigQuery table to a BigFrames dataframe
df = session.read_gbq("bigframes-dev.bqml_tutorial.penguins")

# take a peek at the dataframe
df

   tag_number                            species  island  culmen_length_mm  \
0        1225  Gentoo penguin (Pygoscelis papua)  Biscoe               NaN   
1        1278  Gentoo penguin (Pygoscelis papua)  Biscoe              42.0   
2        1275  Gentoo penguin (Pygoscelis papua)  Biscoe              46.5   
3        1233  Gentoo penguin (Pygoscelis papua)  Biscoe              43.3   
4        1311  Gentoo penguin (Pygoscelis papua)  Biscoe              47.5   
5        1316  Gentoo penguin (Pygoscelis papua)  Biscoe              49.1   
6        1313  Gentoo penguin (Pygoscelis papua)  Biscoe              45.5   
7        1381  Gentoo penguin (Pygoscelis papua)  Biscoe              47.6   
8        1377  Gentoo penguin (Pygoscelis papua)  Biscoe              45.1   
9        1380  Gentoo penguin (Pygoscelis papua)  Biscoe              45.1   

   culmen_depth_mm  flipper_length_mm  body_mass_g     sex  
0              NaN                NaN          NaN    None  
1             13.5 

## 2. Data cleaning / prep

In [2]:
# set a friendlier index to uniquely identify the rows
df = df.set_index("tag_number")

# filter down to the data we want to analyze
adelie_data = df[df.species == "Adelie Penguin (Pygoscelis adeliae)"]

# drop the columns we don't care about
adelie_data = adelie_data.drop(columns=["species"])

# drop rows with nulls to get our training data
training_data = adelie_data.dropna()

# take a peek at the training data
training_data

   tag_number island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  \
0        1101  Dream              36.6             18.4              184.0   
1        1102  Dream              39.8             19.1              184.0   
2        1103  Dream              40.9             18.9              184.0   
3        1105  Dream              37.3             16.8              192.0   
4        1106  Dream              43.2             18.5              192.0   
5        1110  Dream              40.2             20.1              200.0   
6        1111  Dream              40.8             18.9              208.0   
7        1112  Dream              39.0             18.7              185.0   
8        1113  Dream              37.0             16.9              185.0   
9        1115  Dream              34.0             17.1              185.0   

   body_mass_g     sex  
0       3475.0  FEMALE  
1       4650.0    MALE  
2       3900.0    MALE  
3       3000.0  FEMALE  
4       4100.0  

In [3]:
# pick feature columns and label column
feature_columns = training_data[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex']]
label_columns = training_data[['body_mass_g']]

# also get the rows that we want to make predictions for (i.e. where the label column is null)
missing_body_mass = adelie_data[adelie_data.body_mass_g.isnull()]
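
The split above can be pictured in plain Python. This is a minimal sketch on a handful of made-up rows (the tag numbers and measurements are illustrative, not taken from the penguins table, and the helper names `drop_nulls` and `missing_label` are hypothetical). Rows with every field present become training data, and rows whose label is missing are set aside for prediction.

```python
# Illustrative rows only; None stands in for a NULL value in BigQuery.
rows = [
    {"tag_number": 1, "flipper_length_mm": 184.0, "sex": "FEMALE", "body_mass_g": 3475.0},
    {"tag_number": 2, "flipper_length_mm": 190.0, "sex": "MALE", "body_mass_g": None},
    {"tag_number": 3, "flipper_length_mm": None, "sex": "MALE", "body_mass_g": 3900.0},
    {"tag_number": 4, "flipper_length_mm": 192.0, "sex": "FEMALE", "body_mass_g": 3000.0},
]

LABEL = "body_mass_g"


def drop_nulls(rows):
    """Keep only rows where every value is present (similar to dropna())."""
    return [r for r in rows if all(v is not None for v in r.values())]


def missing_label(rows, label):
    """Keep only rows where the label is null (similar to df[df.label.isnull()])."""
    return [r for r in rows if r[label] is None]


training_rows = drop_nulls(rows)
to_predict = missing_label(rows, LABEL)

print([r["tag_number"] for r in training_rows])  # [1, 4]
print([r["tag_number"] for r in to_predict])     # [2]
```

Row 3 in this toy data has a label but is missing a feature, so it lands in neither group: it cannot be used for training and it does not need a prediction.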

## 3. Create, fit, score, predict

In [5]:
from bigframes.ml.linear_model import LinearRegression

# as in scikit-learn, a newly created model is just a bundle of parameters
# default parameters are fine here
model = LinearRegression()

# this will train a temporary model in BQML
model.fit(feature_columns, label_columns)
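
For intuition about what fitting involves, here is a minimal sketch of ordinary least squares with a single numeric feature, in plain Python. It is not how BQML trains the model, which handles several features and categorical columns on the BigQuery side. The function name `fit_simple_ols` and the toy data are assumptions made for illustration.

```python
def fit_simple_ols(xs, ys):
    """Return (slope, intercept) minimizing squared error for y ~ slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Unnormalized covariance of x and y, and unnormalized variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_simple_ols(xs, ys)
print(slope, intercept)  # 2.0 1.0
```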

In [6]:
# check how the model performed, using the automatic test/training data split chosen by BQML
model.score()

   mean_absolute_error  mean_squared_error  mean_squared_log_error  \
0           223.867573        78092.919595                0.005589   

   median_absolute_error  r2_score  explained_variance  
0              182.67685  0.626156            0.626156  

[1 rows x 6 columns]
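
The six columns above carry the same names as common regression metrics. As a rough guide to reading them, the sketch below computes each one in plain Python, assuming the standard definitions used by scikit-learn; whether BQML computes them in exactly this way is an assumption. The function name `regression_metrics` and the toy values are hypothetical.

```python
import math
from statistics import median


def regression_metrics(y_true, y_pred):
    """Compute common regression metrics for two equal-length lists of numbers."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    mean_res = sum(residuals) / n
    ss_res = sum(r ** 2 for r in residuals)               # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
    var_res = sum((r - mean_res) ** 2 for r in residuals) / n
    var_true = ss_tot / n
    return {
        "mean_absolute_error": sum(abs(r) for r in residuals) / n,
        "mean_squared_error": ss_res / n,
        # log1p(x) = log(1 + x); requires values greater than -1.
        "mean_squared_log_error": sum(
            (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
        ) / n,
        "median_absolute_error": median(abs(r) for r in residuals),
        "r2_score": 1 - ss_res / ss_tot,
        "explained_variance": 1 - var_res / var_true,
    }


# Toy values, not model output.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]
metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Under these definitions, `r2_score` and `explained_variance` coincide whenever the residuals average to zero, which is consistent with the two identical values in the output above.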

In [7]:
# use the model to predict the missing labels
model.predict(missing_body_mass)

   tag_number  predicted_body_mass_g
0        1393            3441.163571
1        1524            4522.356778
2        1523            3695.111770
3        1525            4175.124071

[4 rows x 2 columns]
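
Conceptually, a fitted linear model predicts by taking a weighted sum of the features plus an intercept. The sketch below illustrates that for one numeric and one categorical feature, assuming string columns such as `sex` are one-hot encoded before the weights are applied. The weights, intercept, and helper names are hypothetical values chosen for easy arithmetic; they are not the weights of the model trained above.

```python
def one_hot(value, categories):
    """Encode a categorical value as a list of 0.0/1.0 indicators."""
    return [1.0 if value == c else 0.0 for c in categories]


def predict_row(row, weights, intercept, categories):
    """Return intercept + sum(weight * feature) for a single row."""
    features = [row["flipper_length_mm"]] + one_hot(row["sex"], categories["sex"])
    return intercept + sum(w * x for w, x in zip(weights, features))


# Hypothetical values, not the fitted model's weights.
categories = {"sex": ["FEMALE", "MALE"]}
weights = [10.0, 0.0, 500.0]  # flipper_length_mm, sex=FEMALE, sex=MALE
intercept = 1000.0

row = {"flipper_length_mm": 190.0, "sex": "MALE"}
print(predict_row(row, weights, intercept, categories))  # 3400.0
```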

## 4. Save in BigQuery

In [8]:
# save the model to a permanent location in BigQuery, so we can use it in future sessions (and elsewhere in BQ)
model.to_gbq("bqml_tutorial.penguins_model", replace=True)

I can't believe it's not scikit-learn!