# Using ML - Easy linear regression

This demo shows how BigQuery DataFrames ML provides a scikit-learn-like experience for
training a linear regression model.

In this "easy" version of linear regression, we use a couple of BQML features to simplify our code:

- We rely on automatic preprocessing to encode string values and scale numeric values
- We rely on automatic data splitting and evaluation to test the model

This example is adapted from the [BQML linear regression tutorial](https://cloud.google.com/bigquery-ml/docs/linear-regression-tutorial).

## 1. Init & load data

In [20]:
import bigframes.pandas

# read a BigQuery table to a BigQuery DataFrame
df = bigframes.pandas.read_gbq("bigframes-dev.bqml_tutorial.penguins")

# take a peek at the dataframe
df

| | species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie Penguin (Pygoscelis adeliae) | Dream | 36.6 | 18.4 | 184.0 | 3475.0 | FEMALE |
| 1 | Adelie Penguin (Pygoscelis adeliae) | Dream | 39.8 | 19.1 | 184.0 | 4650.0 | MALE |
| 2 | Adelie Penguin (Pygoscelis adeliae) | Dream | 40.9 | 18.9 | 184.0 | 3900.0 | MALE |
| 3 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 46.5 | 17.9 | 192.0 | 3500.0 | FEMALE |
| 4 | Adelie Penguin (Pygoscelis adeliae) | Dream | 37.3 | 16.8 | 192.0 | 3000.0 | FEMALE |
| 5 | Adelie Penguin (Pygoscelis adeliae) | Dream | 43.2 | 18.5 | 192.0 | 4100.0 | MALE |
| 6 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 46.9 | 16.6 | 192.0 | 2700.0 | FEMALE |
| 7 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 50.5 | 18.4 | 200.0 | 3400.0 | FEMALE |
| 8 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 49.5 | 19.0 | 200.0 | 3800.0 | MALE |
| 9 | Adelie Penguin (Pygoscelis adeliae) | Dream | 40.2 | 20.1 | 200.0 | 3975.0 | MALE |


## 2. Data cleaning / prep

In [21]:
# filter down to the data we want to analyze
adelie_data = df[df.species == "Adelie Penguin (Pygoscelis adeliae)"]

# drop the columns we don't care about
adelie_data = adelie_data.drop(columns=["species"])

# drop rows with nulls to get our training data
training_data = adelie_data.dropna()

# take a peek at the training data
training_data

| | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|
| 0 | Dream | 36.6 | 18.4 | 184.0 | 3475.0 | FEMALE |
| 1 | Dream | 39.8 | 19.1 | 184.0 | 4650.0 | MALE |
| 2 | Dream | 40.9 | 18.9 | 184.0 | 3900.0 | MALE |
| 4 | Dream | 37.3 | 16.8 | 192.0 | 3000.0 | FEMALE |
| 5 | Dream | 43.2 | 18.5 | 192.0 | 4100.0 | MALE |
| 9 | Dream | 40.2 | 20.1 | 200.0 | 3975.0 | MALE |
| 10 | Dream | 40.8 | 18.9 | 208.0 | 4300.0 | MALE |
| 11 | Dream | 39.0 | 18.7 | 185.0 | 3650.0 | MALE |
| 12 | Dream | 37.0 | 16.9 | 185.0 | 3000.0 | FEMALE |
| 14 | Dream | 34.0 | 17.1 | 185.0 | 3400.0 | FEMALE |


In [22]:
# pick feature columns and label column
feature_columns = training_data[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex']]
label_columns = training_data[['body_mass_g']]

# also get the rows that we want to make predictions for (i.e. where the label column is null)
missing_body_mass = adelie_data[adelie_data.body_mass_g.isnull()]
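
The split performed above can be sketched in plain Python to make the logic explicit. This is only an illustration, not how BigQuery DataFrames executes it: the rows are plain dictionaries, `None` stands in for a null, and the two rows with missing values are invented for the example.

```python
rows = [
    {"island": "Dream", "flipper_length_mm": 184.0, "body_mass_g": 3475.0, "sex": "FEMALE"},
    {"island": "Dream", "flipper_length_mm": 184.0, "body_mass_g": 4650.0, "sex": "MALE"},
    # Invented rows with missing values, for illustration only.
    {"island": "Dream", "flipper_length_mm": 190.0, "body_mass_g": None, "sex": "MALE"},
    {"island": "Dream", "flipper_length_mm": 186.0, "body_mass_g": 3600.0, "sex": None},
]

# Counterpart of dropna(): keep only rows where every column has a value.
training_rows = [r for r in rows if all(v is not None for v in r.values())]

# Counterpart of filtering on body_mass_g.isnull(): rows whose label is missing.
rows_to_predict = [r for r in rows if r["body_mass_g"] is None]

print(len(training_rows), len(rows_to_predict))
```

Note that the last row, which has a label but is missing a feature, ends up in neither set: it is dropped from training and is not selected for prediction.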

## 3. Create, fit, score, predict

In [23]:
from bigframes.ml.linear_model import LinearRegression

model = LinearRegression()

# Here we pass the feature columns without transforms - BQML will then use
# automatic preprocessing to encode these columns
model.fit(feature_columns, label_columns)

LinearRegression()
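
To give a feel for what automatic preprocessing does with the columns passed to `fit`, here is a minimal pure-Python sketch. It is not BQML's actual implementation. It assumes the usual treatment for linear models: string columns are one-hot encoded and numeric columns are standardized to zero mean and unit variance (the use of the population standard deviation here is also an assumption). The helper names `one_hot` and `standardize` are hypothetical; the sample values come from the training data shown earlier.

```python
from statistics import mean, pstdev

# A few rows from the training data (culmen_length_mm and sex only).
rows = [
    {"culmen_length_mm": 36.6, "sex": "FEMALE"},
    {"culmen_length_mm": 39.8, "sex": "MALE"},
    {"culmen_length_mm": 40.9, "sex": "MALE"},
    {"culmen_length_mm": 37.3, "sex": "FEMALE"},
]


def one_hot(values):
    """Hypothetical helper: turn each string into a 0/1 indicator vector."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories


def standardize(values):
    """Hypothetical helper: rescale numbers to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


sex_encoded, sex_categories = one_hot([r["sex"] for r in rows])
length_scaled = standardize([r["culmen_length_mm"] for r in rows])

print(sex_categories)  # ['FEMALE', 'MALE']
print(sex_encoded)     # [[1, 0], [0, 1], [0, 1], [1, 0]]
print([round(x, 3) for x in length_scaled])
```

Because BQML applies this kind of transformation automatically, the notebook can pass raw string and numeric columns straight to `model.fit`.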

In [24]:
# check how the model performed
model.score(feature_columns, label_columns)

| | mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score | explained_variance |
|---|---|---|---|---|---|---|
| 0 | 223.878763 | 78553.601634 | 0.005614 | 181.330911 | 0.623951 | 0.623951 |
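
The columns returned by `score` are standard regression metrics. As a rough illustration of how several of them are defined, the sketch below computes them in plain Python. The `y_true` values are body masses from the training data; the `y_pred` values are invented for the example and are not output from the model above.

```python
from statistics import mean, median

y_true = [3475.0, 4650.0, 3900.0, 3000.0]  # body_mass_g from the training data
y_pred = [3500.0, 4400.0, 4000.0, 3200.0]  # invented predictions, illustration only

residuals = [t - p for t, p in zip(y_true, y_pred)]

# Average and median size of the errors, in grams.
mean_absolute_error = mean([abs(r) for r in residuals])
median_absolute_error = median([abs(r) for r in residuals])

# Average squared error, in grams squared; penalizes large misses more heavily.
mean_squared_error = mean([r ** 2 for r in residuals])

# R^2: fraction of the label's variance explained by the predictions.
y_mean = mean(y_true)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((t - y_mean) ** 2 for t in y_true)
r2_score = 1 - ss_res / ss_tot

print(mean_absolute_error, median_absolute_error, mean_squared_error)
print(round(r2_score, 4))
```

Read this way, the `mean_absolute_error` of about 224 in the table means the model's predictions are off by roughly 224 g on average, and the `r2_score` of about 0.62 means it explains roughly 62% of the variance in body mass.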


In [25]:
# use the model to predict the missing labels
model.predict(missing_body_mass)

| | predicted_body_mass_g |
|---|---|
| 292 | 3459.735118 |


## 4. Save in BigQuery

In [26]:
# save the model to a permanent location in BigQuery, so we can use it in future sessions (and elsewhere in BQ)
model.to_gbq("bigframes-dev.bqml_tutorial.penguins_model", replace=True)

LinearRegression(optimize_strategy='NORMAL_EQUATION')
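
The saved model's repr reports `optimize_strategy='NORMAL_EQUATION'`, meaning the coefficients were computed with the closed-form least-squares solution rather than by iterative gradient descent. As a hedged illustration of that idea (not BQML's implementation), the sketch below applies the closed form for the single-feature case to made-up data. The function name `fit_line` and the data are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Closed-form least squares for y = slope * x + intercept (one feature)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Made-up data lying exactly on y = 2x + 1, so the fit is easy to verify.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

With several features the same solution is written in matrix form, but the principle is identical: the coefficients come from solving a system of equations once, with no learning rate or iteration count to tune.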

## 5. Reload from BigQuery

In [27]:
# WARNING - until b/281709360 is fixed & pipeline is updated, pipelines will load as models,
# and details of their transform steps will be lost (the loaded model will behave the same)
bigframes.pandas.read_gbq_model("bigframes-dev.bqml_tutorial.penguins_model")

LinearRegression(optimize_strategy='NORMAL_EQUATION')