# Using ML - Easy linear regression

This demo shows how BigQuery DataFrames ML provides a scikit-learn-like experience for
training a linear regression model.

In this "easy" version of linear regression, we use a couple of BQML features to simplify our code:

- We rely on automatic preprocessing to encode string values and scale numeric values
- We rely on automatic data split & evaluation to test the model
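
As a rough illustration of what "encode string values and scale numeric values" means, the sketch below standardizes a numeric column and one-hot encodes a string column in plain Python. BQML does its preprocessing inside BigQuery, and the exact transforms it applies may differ. The helper names `standardize` and `one_hot` are hypothetical and not part of any BigQuery API. The sample values are the first four `flipper_length_mm` and `island` values from the training data shown in section 2.

```python
from statistics import mean, pstdev


def standardize(values):
    """Scale numbers to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Encode each string as a 0/1 vector with one slot per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


# First four rows of the training data shown in section 2 of this notebook.
flipper_length_mm = [188.0, 183.0, 188.0, 190.0]
island = ["Dream", "Biscoe", "Torgersen", "Dream"]

scaled = standardize(flipper_length_mm)
encoded = one_hot(island)  # slots are ordered Biscoe, Dream, Torgersen
```

With automatic preprocessing, none of this has to appear in the notebook: the raw columns are passed to `fit` as they are.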

This example is adapted from the [BQML linear regression tutorial](https://cloud.google.com/bigquery-ml/docs/linear-regression-tutorial).

## 1. Init & load data

In [1]:
import bigframes.pandas

# read a BigQuery table into a BigQuery DataFrame
df = bigframes.pandas.read_gbq("bigframes-dev.bqml_tutorial.penguins")

# take a peek at the dataframe
df

,tag_number,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,1225,Gentoo penguin (Pygoscelis papua),Biscoe,,,,,
1,1278,Gentoo penguin (Pygoscelis papua),Biscoe,42.0,13.5,210.0,4150.0,FEMALE
2,1275,Gentoo penguin (Pygoscelis papua),Biscoe,46.5,13.5,210.0,4550.0,FEMALE
3,1233,Gentoo penguin (Pygoscelis papua),Biscoe,43.3,14.0,208.0,4575.0,FEMALE
4,1311,Gentoo penguin (Pygoscelis papua),Biscoe,47.5,14.0,212.0,4875.0,FEMALE
5,1316,Gentoo penguin (Pygoscelis papua),Biscoe,49.1,14.5,212.0,4625.0,FEMALE
6,1313,Gentoo penguin (Pygoscelis papua),Biscoe,45.5,14.5,212.0,4750.0,FEMALE
7,1381,Gentoo penguin (Pygoscelis papua),Biscoe,47.6,14.5,215.0,5400.0,MALE
8,1377,Gentoo penguin (Pygoscelis papua),Biscoe,45.1,14.5,207.0,5050.0,FEMALE
9,1380,Gentoo penguin (Pygoscelis papua),Biscoe,45.1,14.5,215.0,5000.0,FEMALE


## 2. Data cleaning / prep

In [2]:
# set a friendlier index to uniquely identify the rows
df = df.set_index("tag_number")

# filter down to the data we want to analyze
adelie_data = df[df.species == "Adelie Penguin (Pygoscelis adeliae)"]

# drop the columns we don't care about
adelie_data = adelie_data.drop(columns=["species"])

# drop rows with nulls to get our training data
training_data = adelie_data.dropna()

# take a peek at the training data
training_data

tag_number,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
1172,Dream,32.1,15.5,188.0,3050.0,FEMALE
1371,Biscoe,37.7,16.0,183.0,3075.0,FEMALE
1417,Torgersen,38.6,17.0,188.0,2900.0,FEMALE
1204,Dream,40.7,17.0,190.0,3725.0,MALE
1251,Biscoe,37.6,17.0,185.0,3600.0,FEMALE
1422,Torgersen,35.7,17.0,189.0,3350.0,FEMALE
1394,Torgersen,40.2,17.0,176.0,3450.0,FEMALE
1163,Dream,36.4,17.0,195.0,3325.0,FEMALE
1329,Biscoe,38.1,17.0,181.0,3175.0,FEMALE
1406,Torgersen,44.1,18.0,210.0,4000.0,MALE


In [3]:
# pick feature columns and label column
feature_columns = training_data[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex']]
label_columns = training_data[['body_mass_g']]

# also get the rows that we want to make predictions for (i.e. where the label column is null)
missing_body_mass = adelie_data[adelie_data.body_mass_g.isnull()]

## 3. Create, fit, score, predict

In [4]:
from bigframes.ml.linear_model import LinearRegression

model = LinearRegression()

# Here we pass the feature columns without transforms - BQML will then use
# automatic preprocessing to encode these columns
model.fit(feature_columns, label_columns)

In [5]:
# check how the model performed (note: this scores it on the same rows it was trained on)
model.score(feature_columns, label_columns)

,mean_absolute_error,mean_squared_error,mean_squared_log_error,median_absolute_error,r2_score,explained_variance
0,223.878763,78553.601634,0.005614,181.330911,0.623951,0.623951
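
To make the columns above easier to read, here is a minimal sketch of how three of these metrics are defined, written in plain Python. The function `regression_metrics` is a hypothetical helper, not part of BigQuery DataFrames, and the toy values are made up for illustration. The last lines take the square root of the `mean_squared_error` reported above to express the typical error in grams.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MAE, MSE and R^2 for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"mean_absolute_error": mae, "mean_squared_error": mse, "r2_score": r2}


# Made-up body masses in grams, purely for illustration (not from the dataset).
toy_actual = [3000.0, 3500.0, 4000.0]
toy_predicted = [3100.0, 3400.0, 4000.0]
toy_metrics = regression_metrics(toy_actual, toy_predicted)

# Root mean squared error derived from the mean_squared_error reported above.
reported_mse = 78553.601634
rmse_grams = math.sqrt(reported_mse)
```

An `r2_score` of about 0.62 means the model accounts for roughly 62% of the variance in body mass across the rows it was scored on.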


In [6]:
# use the model to predict the missing labels
model.predict(missing_body_mass)

tag_number,predicted_body_mass_g
1393,3459.735118
1525,3947.881639
1524,4304.175638
1523,3471.668379


## 4. Save in BigQuery

In [7]:
# save the model to a permanent location in BigQuery, so we can use it in future sessions (and elsewhere in BQ)
model.to_gbq("bqml_tutorial.penguins_model", replace=True)

LinearRegression()
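
A later session can load the saved model back instead of retraining. The sketch below assumes `bigframes.pandas.read_gbq_model` resolves the model name the same way `to_gbq` did when saving it, and that BigQuery credentials are configured. The wrapper `load_saved_model` is a hypothetical helper introduced only for this example.

```python
def load_saved_model(model_name="bqml_tutorial.penguins_model"):
    """Load a model that was previously saved with model.to_gbq()."""
    # Imported lazily so that defining this helper does not need a BigQuery session.
    import bigframes.pandas as bpd

    return bpd.read_gbq_model(model_name)


# Example use in a later session (needs BigQuery access and the saved model):
# reloaded = load_saved_model()
# reloaded.predict(missing_body_mass)
```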