# Using ML - Easy linear regression

This demo shows how BigQuery DataFrames ML provides a scikit-learn-like experience for
training a linear regression model.

In this "easy" version of linear regression, we use a couple of BQML features to simplify our code:

- We rely on automatic preprocessing to encode string values and scale numeric values
- We rely on automatic data split & evaluation to test the model

This example is adapted from the [BQML linear regression tutorial](https://cloud.google.com/bigquery-ml/docs/linear-regression-tutorial).
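To make the first bullet concrete, here is a minimal pure-Python sketch of one common way to encode string values (one-hot encoding) and scale numeric values (standardization). It illustrates the idea only and is not BQML's implementation. The helper names `one_hot` and `standardize` are hypothetical, and the four sample values are taken from the training-data preview later in this notebook.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale numbers to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Return a 0/1 vector per value, plus the sorted list of categories."""
    categories = sorted(set(values))
    vectors = [[1 if v == c else 0 for c in categories] for v in values]
    return vectors, categories


# A few island / flipper_length_mm pairs from the penguins data.
islands = ["Biscoe", "Torgersen", "Biscoe", "Dream"]
flipper_length_mm = [188.0, 181.0, 197.0, 190.0]

encoded_islands, island_categories = one_hot(islands)
scaled_flippers = standardize(flipper_length_mm)

print(island_categories)
print(encoded_islands)
print(scaled_flippers)
```

With automatic preprocessing, none of this code is needed: the raw columns are passed straight to `fit`.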

## 1. Init & load data

Import the `bigframes.pandas` module and get the default session.

In [22]:
import bigframes.pandas
session = bigframes.pandas.get_global_session()

Define a dataset for storing the BQML model, and create it if it does not exist.

In [23]:
dataset = f"{session.bqclient.project}.bqml_tutorial"
session.bqclient.create_dataset(dataset, exists_ok=True)

Dataset(DatasetReference('shobs-test', 'bqml_tutorial'))

Define a path for the model inside that dataset.

In [24]:
penguins_model = f"{dataset}.penguins_model"

Read the penguins data.

In [25]:
# read a BigQuery table into a BigQuery DataFrame
df = bigframes.pandas.read_gbq("bigquery-public-data.ml_datasets.penguins")

# take a peek at the dataframe
df

index,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie Penguin (Pygoscelis adeliae),Biscoe,40.1,18.9,188.0,4300.0,MALE
1,Adelie Penguin (Pygoscelis adeliae),Torgersen,39.1,18.7,181.0,3750.0,MALE
2,Gentoo penguin (Pygoscelis papua),Biscoe,47.4,14.6,212.0,4725.0,FEMALE
3,Chinstrap penguin (Pygoscelis antarctica),Dream,42.5,16.7,187.0,3350.0,FEMALE
4,Adelie Penguin (Pygoscelis adeliae),Biscoe,43.2,19.0,197.0,4775.0,MALE
5,Gentoo penguin (Pygoscelis papua),Biscoe,46.7,15.3,219.0,5200.0,MALE
6,Adelie Penguin (Pygoscelis adeliae),Biscoe,41.3,21.1,195.0,4400.0,MALE
7,Gentoo penguin (Pygoscelis papua),Biscoe,45.2,13.8,215.0,4750.0,FEMALE
8,Gentoo penguin (Pygoscelis papua),Biscoe,46.5,13.5,210.0,4550.0,FEMALE
9,Gentoo penguin (Pygoscelis papua),Biscoe,50.5,15.2,216.0,5000.0,FEMALE


## 2. Data cleaning / prep

In [26]:
# filter down to the data we want to analyze
adelie_data = df[df.species == "Adelie Penguin (Pygoscelis adeliae)"]

# drop the columns we don't care about
adelie_data = adelie_data.drop(columns=["species"])

# drop rows with nulls to get our training data
training_data = adelie_data.dropna()

# take a peek at the training data
training_data

index,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Biscoe,40.1,18.9,188.0,4300.0,MALE
1,Torgersen,39.1,18.7,181.0,3750.0,MALE
4,Biscoe,43.2,19.0,197.0,4775.0,MALE
6,Biscoe,41.3,21.1,195.0,4400.0,MALE
11,Dream,38.1,18.6,190.0,3700.0,FEMALE
13,Biscoe,37.8,20.0,190.0,4250.0,MALE
14,Biscoe,35.0,17.9,190.0,3450.0,FEMALE
16,Torgersen,34.6,21.1,198.0,4400.0,MALE
19,Dream,37.2,18.1,178.0,3900.0,MALE
21,Biscoe,40.5,17.9,187.0,3200.0,FEMALE


In [27]:
# pick feature columns and label column
feature_columns = training_data[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex']]
label_columns = training_data[['body_mass_g']]

# also get the rows that we want to make predictions for (i.e. where the label column is null)
missing_body_mass = adelie_data[adelie_data.body_mass_g.isnull()]
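The two selections above are not complements of each other: `dropna()` removes a row if any column is null, while the prediction set keeps only rows whose label is null. A minimal pure-Python sketch of the same logic, using hypothetical toy rows (the `None` values are invented for illustration):

```python
# Hypothetical toy rows; None stands in for a null value.
rows = [
    {"island": "Biscoe", "flipper_length_mm": 188.0, "body_mass_g": 4300.0},
    {"island": "Torgersen", "flipper_length_mm": 181.0, "body_mass_g": 3750.0},
    {"island": "Dream", "flipper_length_mm": None, "body_mass_g": 3700.0},
    {"island": "Biscoe", "flipper_length_mm": 197.0, "body_mass_g": None},
]

# Equivalent of dropna(): keep rows with no null in any column.
training_rows = [r for r in rows if all(v is not None for v in r.values())]

# Equivalent of filtering on body_mass_g.isnull(): keep rows with a null label.
rows_to_predict = [r for r in rows if r["body_mass_g"] is None]

print(len(training_rows), len(rows_to_predict))
```

In this sketch the third row has a label but a missing feature, so it lands in neither set.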

## 3. Create, fit, score, predict

In [28]:
from bigframes.ml.linear_model import LinearRegression

model = LinearRegression()

# Here we pass the feature columns without transforms - BQML will then use
# automatic preprocessing to encode these columns
model.fit(feature_columns, label_columns)

LinearRegression()

In [29]:
# check how the model performed
model.score(feature_columns, label_columns)

index,mean_absolute_error,mean_squared_error,mean_squared_log_error,median_absolute_error,r2_score,explained_variance
0,223.878763,78553.601634,0.005614,181.330911,0.623951,0.623951
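For readers unfamiliar with these metrics, the sketch below computes four of them from their standard definitions in pure Python. It is an illustration, not how BigQuery computes them. The function name `regression_metrics` is hypothetical, the true values are body masses from the preview above, and the predictions are invented. `mean_squared_log_error` and `explained_variance` are left out for brevity.

```python
from math import sqrt
from statistics import mean, median


def regression_metrics(y_true, y_pred):
    """Compute common regression metrics from their textbook definitions."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    squared_errors = [e * e for e in errors]
    y_bar = mean(y_true)
    total_ss = sum((t - y_bar) ** 2 for t in y_true)
    return {
        "mean_absolute_error": mean(abs_errors),
        "mean_squared_error": mean(squared_errors),
        "median_absolute_error": median(abs_errors),
        # R^2 = 1 - (residual sum of squares / total sum of squares)
        "r2_score": 1 - sum(squared_errors) / total_ss,
    }


y_true = [4300.0, 3750.0, 4775.0, 4400.0]  # body_mass_g values from the data
y_pred = [4200.0, 3900.0, 4600.0, 4450.0]  # hypothetical predictions
metrics = regression_metrics(y_true, y_pred)
print(metrics)

# The square root of the reported mean_squared_error gives an error in grams.
reported_rmse = sqrt(78553.601634)
print(reported_rmse)
```

Taking the square root of the reported `mean_squared_error` gives a root-mean-squared error of roughly 280 g, which is easier to compare against typical body masses than the squared value.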


In [30]:
# use the model to predict the missing labels
model.predict(missing_body_mass)

index,predicted_body_mass_g
334,5891.735118


## 4. Save in BigQuery

In [31]:
# save the model to a permanent location in BigQuery, so we can use it in future sessions (and elsewhere in BQ)
model.to_gbq(penguins_model, replace=True)

LinearRegression(optimize_strategy='NORMAL_EQUATION')
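The saved model reports `optimize_strategy='NORMAL_EQUATION'`. As background, the normal equation solves least squares in closed form instead of iterating with gradient descent. The sketch below shows that closed form for a single feature, on hypothetical toy data. It is an illustration of the technique, not BQML's implementation, and `fit_simple_linear` is a made-up helper name.

```python
from statistics import mean


def fit_simple_linear(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical toy data lying exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_simple_linear(xs, ys)
print(slope, intercept)
```

With several features the same idea generalizes to solving the linear system (XᵀX)w = Xᵀy for the weight vector w.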

## 5. Reload from BigQuery

In [32]:
# WARNING - until b/281709360 is fixed & pipeline is updated, pipelines will load as models,
# and details of their transform steps will be lost (the loaded model will behave the same)
bigframes.pandas.read_gbq_model(penguins_model)

LinearRegression(optimize_strategy='NORMAL_EQUATION')