**Note: this notebook requires changes not yet checked in**

# BigFrames ML demo - Penguin weight

![weighing a penguin - Parti](penguinweigh.png)

This demo shows the BigFrames ML API providing a scikit-learn-like experience for training a linear regression model.

This example is adapted from the [BQML linear regression tutorial](https://cloud.google.com/bigquery-ml/docs/linear-regression-tutorial).

## 1. Init & load data

In [1]:
# initialize BigFrames
import bigframes
session = bigframes.connect()

# read a BigQuery table to a BigFrames dataframe
df = session.read_gbq("bigframes-dev.bqml_tutorial.penguins")

# take a peek at the dataframe
df

,name,tag_number,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Genesis,1245,Adelie Penguin (Pygoscelis adeliae),Dream,37.5,18.9,179.0,2975.0,
1,Bella,1162,Adelie Penguin (Pygoscelis adeliae),Dream,40.7,17.0,190.0,3725.0,MALE
2,Diego,1175,Adelie Penguin (Pygoscelis adeliae),Dream,41.1,19.0,182.0,3425.0,MALE
3,Oliver,1178,Adelie Penguin (Pygoscelis adeliae),Dream,41.6,20.0,204.0,,MALE
4,Cole,1180,Adelie Penguin (Pygoscelis adeliae),Dream,38.8,20.0,190.0,3950.0,MALE
...,...,...,...,...,...,...,...,...,...
342,Annabelle,1299,Adelie Penguin (Pygoscelis adeliae),Torgersen,35.2,15.9,186.0,3050.0,FEMALE
343,Sadie,1301,Adelie Penguin (Pygoscelis adeliae),Torgersen,40.9,16.8,191.0,3700.0,FEMALE
344,Leonardo,1303,Adelie Penguin (Pygoscelis adeliae),Torgersen,36.6,17.8,185.0,3700.0,FEMALE
345,Violet,1306,Adelie Penguin (Pygoscelis adeliae),Torgersen,38.9,17.8,181.0,3625.0,FEMALE


## 2. Data cleaning / prep

In [2]:
# set a friendlier index to uniquely identify the rows
df = df.set_index("tag_number")

# filter down to the data we want to analyze
adelie_data = df[df.species.str.startswith("Adelie")]

# drop the columns we don't care about
adelie_data = adelie_data.drop(["name", "species"])

# drop rows with nulls to get our training data
training_data = adelie_data.dropna()

# take a peek at the training data
training_data

tag_number,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
1158,Dream,32.1,15.5,188.0,3050.0,FEMALE
1159,Biscoe,37.7,16.0,183.0,3075.0,FEMALE
1160,Torgersen,40.2,17.0,176.0,3450.0,FEMALE
1161,Biscoe,37.6,17.0,185.0,3600.0,FEMALE
1162,Dream,40.7,17.0,190.0,3725.0,MALE
...,...,...,...,...,...,...
1307,Dream,38.9,18.8,190.0,3600.0,FEMALE
1308,Biscoe,40.6,18.8,193.0,3800.0,MALE
1309,Dream,39.6,18.8,190.0,4600.0,MALE
1310,Torgersen,36.7,18.8,187.0,3800.0,FEMALE
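
To make the cleaning steps concrete, here is a minimal sketch that replays them on the five rows visible at the top of the section 1 output, using only the Python standard library. It is not how BigFrames executes these operations (those run in BigQuery); the dictionary-based representation and the names `RAW`, `by_tag`, `adelie`, `training` and `missing_label` are assumptions made for illustration.

```python
COLUMNS = ["name", "tag_number", "species", "island", "culmen_length_mm",
           "culmen_depth_mm", "flipper_length_mm", "body_mass_g", "sex"]
ADELIE = "Adelie Penguin (Pygoscelis adeliae)"

# Five rows copied from the section 1 output; None marks a missing value.
RAW = [
    ("Genesis", 1245, ADELIE, "Dream", 37.5, 18.9, 179.0, 2975.0, None),
    ("Bella", 1162, ADELIE, "Dream", 40.7, 17.0, 190.0, 3725.0, "MALE"),
    ("Diego", 1175, ADELIE, "Dream", 41.1, 19.0, 182.0, 3425.0, "MALE"),
    ("Oliver", 1178, ADELIE, "Dream", 41.6, 20.0, 204.0, None, "MALE"),
    ("Cole", 1180, ADELIE, "Dream", 38.8, 20.0, 190.0, 3950.0, "MALE"),
]
rows = [dict(zip(COLUMNS, values)) for values in RAW]

# Equivalent of set_index("tag_number"): key each row by its tag number.
by_tag = {row["tag_number"]: row for row in rows}

# Keep Adelie penguins only, and drop the columns we don't care about
# (tag_number is dropped from the values because it is now the key).
dropped = ("name", "species", "tag_number")
adelie = {
    tag: {k: v for k, v in row.items() if k not in dropped}
    for tag, row in by_tag.items()
    if row["species"].startswith("Adelie")
}

# Equivalent of dropna(): keep rows that have no missing value at all.
training = {tag: row for tag, row in adelie.items()
            if None not in row.values()}

# Rows whose body mass is missing are the ones we want predictions for.
missing_label = {tag: row for tag, row in adelie.items()
                 if row["body_mass_g"] is None}

print(sorted(training))
print(sorted(missing_label))
```

Note that tag 1245 is excluded from the training rows because its `sex` value is missing, even though its body mass is known: `dropna()` removes a row if any column is null.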


In [3]:
# pick feature columns and label column
feature_columns = training_data[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex']]
label_columns = training_data[['body_mass_g']]

# also get the rows that we want to make predictions for (i.e. where the label column is null)
missing_body_mass = adelie_data[adelie_data.body_mass_g.isnull()]

## 3. Create, fit, score, predict

In [4]:
import bigframes.ml as ml

# as in scikit-learn, a newly created model is just a bundle of parameters
# default parameters are fine here
model = ml.LinearRegression()

# this will train a temporary model in BQML
model.fit(feature_columns, label_columns)
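
As a rough illustration of what fitting a linear regression means, the sketch below fits a straight line through three (flipper length, body mass) pairs taken from the section 1 output, using the closed-form least-squares solution for a single feature. This is a simplification under stated assumptions: the model above is trained by BQML on all five features, and its actual training procedure is not shown in this notebook. The helper names `fit_line` and `predict` are hypothetical.

```python
# (flipper_length_mm, body_mass_g) for tags 1162, 1175 and 1180
points = [(190.0, 3725.0), (182.0, 3425.0), (190.0, 3950.0)]


def fit_line(points):
    """Return (slope, intercept) of the line minimizing the squared error."""
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    s_xx = sum((x - x_mean) ** 2 for x, _ in points)
    slope = s_xy / s_xx
    return slope, y_mean - slope * x_mean


slope, intercept = fit_line(points)


def predict(flipper_length_mm):
    """Predict body mass in grams from flipper length alone."""
    return slope * flipper_length_mm + intercept


print(slope, intercept)
# Flipper length of tag 1178, whose body mass is missing in the data.
print(predict(204.0))
```

With only three points and one feature, the numbers this produces are not comparable to the BQML model's predictions; the point is only the shape of the computation.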

In [5]:
# check how the model performed, using the automatic test/training data split chosen by BQML
model.score()

mean_absolute_error         223.878763
mean_squared_error        78553.601634
mean_squared_log_error        0.005614
median_absolute_error       181.330911
r2_score                      0.623951
explained_variance            0.623951
dtype: float64
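
For readers unfamiliar with these metrics, the following sketch computes four of them by hand from their standard definitions, using only the standard library. The `y_pred` values are hypothetical numbers chosen for illustration, not output from the model above, and the function names are assumptions rather than part of the BigFrames API.

```python
import statistics


def mean_absolute_error(y_true, y_pred):
    return statistics.fmean(abs(t - p) for t, p in zip(y_true, y_pred))


def mean_squared_error(y_true, y_pred):
    return statistics.fmean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def median_absolute_error(y_true, y_pred):
    return statistics.median(abs(t - p) for t, p in zip(y_true, y_pred))


def r2_score(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Body masses of tags 1162, 1175 and 1180, and made-up predictions.
y_true = [3725.0, 3425.0, 3950.0]
y_pred = [3800.0, 3400.0, 3900.0]  # hypothetical values

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
med = median_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(mae, mse, med, r2)
```

The explained variance equals the R² score whenever the prediction errors average to zero, which is consistent with the two identical values in the output above.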

In [6]:
# use the model to predict the missing labels
model.predict(missing_body_mass)

tag_number,predicted_body_mass_g
1178,4304.175638
1197,3947.881639
1186,3471.668379
1157,3459.735118


## 4. Save in BigQuery

In [7]:
# save the model to a permanent location in BigQuery, so we can use it in future sessions (and elsewhere in BQ)
model.to_gbq("bqml_tutorial.penguins_model", replace=True)

# Addendum: comparing step 3 with scikit-learn

BQML provides extra conveniences:
- Automatic preprocessing scales the numeric feature columns and encodes the string feature columns
- Automatic training/test data split
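
To show what these two conveniences amount to, here is a minimal standard-library sketch of standard scaling, one-hot encoding and a random train/test split, applied to the first five rows of the training data shown in section 2. It illustrates the general techniques only; how BQML implements its preprocessing and chooses its split is not described in this notebook and may differ. The helper names `standard_scale`, `one_hot` and `split` are hypothetical.

```python
import random
import statistics

# island and flipper_length_mm for tags 1158 to 1162 (section 2 output)
islands = ["Dream", "Biscoe", "Torgersen", "Biscoe", "Dream"]
flipper_lengths = [188.0, 183.0, 176.0, 185.0, 190.0]


def standard_scale(values):
    """Shift to zero mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def one_hot(values):
    """Encode each string as a 0/1 vector, one position per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories


def split(rows, test_fraction, seed):
    """Shuffle a copy of the rows and cut it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


scaled = standard_scale(flipper_lengths)
encoded, categories = one_hot(islands)

# One feature row per penguin: scaled number followed by the one-hot vector.
features = [[s] + e for s, e in zip(scaled, encoded)]
train, test = split(features, test_fraction=0.2, seed=42)

print(categories)
print(encoded[0])
print(len(train), len(test))
```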

Doing the same example in scikit-learn is slightly more complex.

In [8]:
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

# Define preprocessing steps for continuous and categorical features
# (each transformer is given as: name, transformer, list of column names)
preprocessor = ColumnTransformer(
    transformers=[
        ('categorical', OneHotEncoder(), ['island', 'sex']),
        ('continuous', StandardScaler(), ['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm']),
    ])

# Apply preprocessing to the dataset
X_preprocessed = preprocessor.fit_transform(feature_columns)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, label_columns, test_size=0.2, random_state=42)

# Create a linear regression model
model = LinearRegression()

# Train the model using the preprocessed training data
model.fit(X_train, y_train)

# Evaluate the model with the test set
model.score(X_test, y_test)

# Preprocess and make predictions for the missing labels
X_missing = preprocessor.transform(missing_body_mass)
y_pred = model.predict(X_missing)

ModuleNotFoundError: No module named 'sklearn'