# Using ML - Easy linear regression

This demo shows how BigQuery DataFrames ML provides a scikit-learn-like experience for
training a linear regression model.

In this "easy" version of linear regression, we use two BigQuery ML (BQML) features to simplify our code:

- We rely on automatic preprocessing to encode string values and scale numeric values
- We rely on automatic data split & evaluation to test the model
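To make the first bullet concrete, here is a minimal local sketch of what "encode string values and scale numeric values" means. It is an illustration only, not BQML's implementation: the helper names `standardize` and `one_hot` are hypothetical, the population standard deviation is an assumption, and the three sample values are taken from the penguins table purely for demonstration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Center numeric values at zero and scale them to unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Give each distinct string its own 0/1 indicator column."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# illustrative inputs (hypothetical three-row sample)
flipper_length_mm = [184.0, 192.0, 200.0]
sex = ["FEMALE", "MALE", "MALE"]

scaled = standardize(flipper_length_mm)
categories, encoded = one_hot(sex)

print(scaled)
print(categories, encoded)
```

With automatic preprocessing, none of this code has to be written by hand; the raw columns are passed to the model as-is.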

This example is adapted from the [BQML linear regression tutorial](https://cloud.google.com/bigquery-ml/docs/linear-regression-tutorial).

## 1. Init & load data

In [3]:
import bigframes.pandas

# read a BigQuery table to a BigQuery DataFrame
df = bigframes.pandas.read_gbq("bigframes-dev.bqml_tutorial.penguins")

# take a peek at the dataframe
df

|   | species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie Penguin (Pygoscelis adeliae) | Dream | 36.6 | 18.4 | 184.0 | 3475.0 | FEMALE |
| 1 | Adelie Penguin (Pygoscelis adeliae) | Dream | 39.8 | 19.1 | 184.0 | 4650.0 | MALE |
| 2 | Adelie Penguin (Pygoscelis adeliae) | Dream | 40.9 | 18.9 | 184.0 | 3900.0 | MALE |
| 3 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 46.5 | 17.9 | 192.0 | 3500.0 | FEMALE |
| 4 | Adelie Penguin (Pygoscelis adeliae) | Dream | 37.3 | 16.8 | 192.0 | 3000.0 | FEMALE |
| 5 | Adelie Penguin (Pygoscelis adeliae) | Dream | 43.2 | 18.5 | 192.0 | 4100.0 | MALE |
| 6 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 46.9 | 16.6 | 192.0 | 2700.0 | FEMALE |
| 7 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 50.5 | 18.4 | 200.0 | 3400.0 | FEMALE |
| 8 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 49.5 | 19.0 | 200.0 | 3800.0 | MALE |
| 9 | Adelie Penguin (Pygoscelis adeliae) | Dream | 40.2 | 20.1 | 200.0 | 3975.0 | MALE |


## 2. Data cleaning / prep

In [4]:
# filter down to the data we want to analyze
adelie_data = df[df.species == "Adelie Penguin (Pygoscelis adeliae)"]

# drop the columns we don't care about
adelie_data = adelie_data.drop(columns=["species"])

# drop rows with nulls to get our training data
training_data = adelie_data.dropna()

# take a peek at the training data
training_data

|   | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|
| 0 | Dream | 36.6 | 18.4 | 184.0 | 3475.0 | FEMALE |
| 1 | Dream | 39.8 | 19.1 | 184.0 | 4650.0 | MALE |
| 2 | Dream | 40.9 | 18.9 | 184.0 | 3900.0 | MALE |
| 4 | Dream | 37.3 | 16.8 | 192.0 | 3000.0 | FEMALE |
| 5 | Dream | 43.2 | 18.5 | 192.0 | 4100.0 | MALE |
| 9 | Dream | 40.2 | 20.1 | 200.0 | 3975.0 | MALE |
| 10 | Dream | 40.8 | 18.9 | 208.0 | 4300.0 | MALE |
| 11 | Dream | 39.0 | 18.7 | 185.0 | 3650.0 | MALE |
| 12 | Dream | 37.0 | 16.9 | 185.0 | 3000.0 | FEMALE |
| 14 | Dream | 34.0 | 17.1 | 185.0 | 3400.0 | FEMALE |


In [5]:
# pick feature columns and label column
feature_columns = training_data[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex']]
label_columns = training_data[['body_mass_g']]

# also get the rows that we want to make predictions for (i.e. where the label column is null)
missing_body_mass = adelie_data[adelie_data.body_mass_g.isnull()]
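The two filters used in this section select different rows, which is easy to miss. The sketch below mimics them on plain Python dictionaries, assuming `None` stands in for a null; the three rows are hypothetical and exist only to show the logic.

```python
# hypothetical rows; None plays the role of a null value
rows = [
    {"island": "Dream", "flipper_length_mm": 184.0, "body_mass_g": 3475.0, "sex": "FEMALE"},
    {"island": "Dream", "flipper_length_mm": 192.0, "body_mass_g": None, "sex": "MALE"},
    {"island": "Dream", "flipper_length_mm": None, "body_mass_g": 3900.0, "sex": "MALE"},
]

# like dropna(): keep only rows with no missing value in any column
training_rows = [r for r in rows if all(v is not None for v in r.values())]

# like body_mass_g.isnull(): keep only rows whose label is missing
rows_to_predict = [r for r in rows if r["body_mass_g"] is None]

print(len(training_rows), len(rows_to_predict))
```

The third row has a label but a missing feature, so it lands in neither set: `dropna()` removes rows with a null in any column, not only rows with a null label.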

## 3. Create, fit, score, predict

In [6]:
from bigframes.ml.linear_model import LinearRegression

model = LinearRegression()

# Here we pass the feature columns without transforms - BQML will then use
# automatic preprocessing to encode these columns
model.fit(feature_columns, label_columns)

In [7]:
# check how the model performed
model.score(feature_columns, label_columns)

|   | mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score | explained_variance |
|---|---|---|---|---|---|---|
| 0 | 223.878763 | 78553.601634 | 0.005614 | 181.330911 | 0.623951 | 0.623951 |
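For readers unfamiliar with these columns, the sketch below computes four of them from scratch using the conventional textbook definitions. It assumes BQML follows those definitions, which this notebook does not confirm. The function name `regression_metrics` and the four sample values are hypothetical. The last lines take the square root of the mean squared error reported above to express the typical error in grams.

```python
import math
from statistics import mean, median


def regression_metrics(y_true, y_pred):
    """Compute common regression metrics from true and predicted labels."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    y_mean = mean(y_true)
    ss_res = sum(r * r for r in residuals)           # residual sum of squares
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)  # total sum of squares
    return {
        "mean_absolute_error": mean(abs(r) for r in residuals),
        "mean_squared_error": ss_res / len(residuals),
        "median_absolute_error": median(abs(r) for r in residuals),
        "r2_score": 1.0 - ss_res / ss_tot,
    }


# hypothetical body masses in grams, for illustration only
y_true = [3000.0, 3500.0, 4000.0, 4500.0]
y_pred = [3100.0, 3400.0, 4100.0, 4400.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)

# root mean squared error implied by the MSE reported in the table above
rmse = math.sqrt(78553.601634)
print(round(rmse, 2))
```

Read this way, the reported scores say the model is off by roughly 280 g in the root-mean-square sense and explains about 62% of the variance in body mass.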


In [8]:
# use the model to predict the missing labels
model.predict(missing_body_mass)

|   | predicted_body_mass_g |
|---|---|
| 292 | 3459.735118 |


## 4. Save in BigQuery

In [9]:
# save the model to a permanent location in BigQuery, so we can use it in future sessions (and elsewhere in BQ)
model.to_gbq("bqml_tutorial.penguins_model", replace=True)

LinearRegression()
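A saved model can be read back in a later session. The sketch below assumes a bigframes version that exposes `bigframes.pandas.read_gbq_model`, and that the session has BigQuery credentials and access to the dataset. The wrapper name `load_saved_model` is hypothetical. The import sits inside the function so that defining it does not require a BigQuery connection.

```python
def load_saved_model(model_name="bqml_tutorial.penguins_model"):
    """Load a model that was previously saved with model.to_gbq()."""
    # imported lazily: calling this function requires BigQuery access
    import bigframes.pandas as bpd

    return bpd.read_gbq_model(model_name)


# example usage in a session with BigQuery access (not run here):
# reloaded_model = load_saved_model()
# reloaded_model.predict(missing_body_mass)
```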