# Using ML - ML fundamentals

The `bigframes.ml` module implements Scikit-Learn's machine learning API in BigFrames. It exposes BigQuery's ML capabilities in a simple, popular API that works seamlessly with both BigFrames and BigQuery.

In [1]:
# Let's load some test data to use in this tutorial
import bigframes
session = bigframes.connect()

df = session.read_gbq("bigquery-public-data.ml_datasets.penguins")
df = df.dropna()

# Temporary workaround: let's name our index so it isn't lost
# BigFrames currently drops unnamed indexes when round-tripping through
# pandas, which some ML APIs do to route around missing functionality
df.index.name = "penguin_id"

df

penguin_id,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie Penguin (Pygoscelis adeliae),Dream,36.6,18.4,184.0,3475.0,FEMALE
1,Adelie Penguin (Pygoscelis adeliae),Dream,39.8,19.1,184.0,4650.0,MALE
2,Adelie Penguin (Pygoscelis adeliae),Dream,40.9,18.9,184.0,3900.0,MALE
3,Chinstrap penguin (Pygoscelis antarctica),Dream,46.5,17.9,192.0,3500.0,FEMALE
4,Adelie Penguin (Pygoscelis adeliae),Dream,37.3,16.8,192.0,3000.0,FEMALE
5,Adelie Penguin (Pygoscelis adeliae),Dream,43.2,18.5,192.0,4100.0,MALE
6,Chinstrap penguin (Pygoscelis antarctica),Dream,46.9,16.6,192.0,2700.0,FEMALE
7,Chinstrap penguin (Pygoscelis antarctica),Dream,50.5,18.4,200.0,3400.0,FEMALE
8,Chinstrap penguin (Pygoscelis antarctica),Dream,49.5,19.0,200.0,3800.0,MALE
9,Adelie Penguin (Pygoscelis adeliae),Dream,40.2,20.1,200.0,3975.0,MALE


## Data split

Part of preparing data for a machine learning task is splitting it into subsets for training and testing, so that the model can be checked for overfitting on data it has not seen. Most commonly this is done with `bigframes.ml.model_selection.train_test_split`, like so:

In [2]:
# In this example, we're doing supervised learning, where we will learn to predict
# output variable `y` from input features `X`
X = df[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex', 'species']]
y = df[['body_mass_g']] 

from bigframes.ml.model_selection import train_test_split

# This will split X and y into test and training sets, with 20% of the rows in the test set,
# and the rest in the training set
train_X, test_X, train_y, test_y = train_test_split(
  X, y, test_size=0.2)

# Show the shape of the data after the split
print(f"""train_X shape: {train_X.shape}
test_X shape: {test_X.shape}
train_y shape: {train_y.shape}
test_y shape: {test_y.shape}""")

train_X shape: (267, 6)
test_X shape: (67, 6)
train_y shape: (267, 1)
test_y shape: (67, 1)
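
The shapes above are consistent with 334 rows remaining after `dropna()` (267 + 67), with 20% of them (66.8, rounded up to 67) going to the test set. As a rough illustration of the idea only, here is a plain-Python sketch of a shuffled split. `simple_split` is a hypothetical helper, and the round-up rule is an assumption inferred from the output above, not a description of how BigFrames performs the split:

```python
import math
import random


def simple_split(rows, test_size=0.2, seed=0):
    """Shuffle rows and split them into (train, test) lists.

    Hypothetical helper for illustration; rounding the test count up
    is an assumption, not documented BigFrames behaviour.
    """
    n_test = math.ceil(len(rows) * test_size)
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(334))  # stand-in for 334 row ids
train, test = simple_split(rows, test_size=0.2)
print(len(train), len(test))
```

Because `X` and `y` are split using the same row selection, the rows of the test features and test labels stay aligned.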


In [3]:
# If we look at the data, we can see that random rows were selected for
# each side of the split
test_X.head(5)

Unnamed: 0,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,sex,species
0,Biscoe,39.0,17.5,186.0,FEMALE,Adelie Penguin (Pygoscelis adeliae)
1,Biscoe,39.6,20.7,191.0,FEMALE,Adelie Penguin (Pygoscelis adeliae)
2,Dream,37.3,17.8,191.0,FEMALE,Adelie Penguin (Pygoscelis adeliae)
3,Biscoe,40.5,18.9,180.0,MALE,Adelie Penguin (Pygoscelis adeliae)
4,Biscoe,51.5,16.3,230.0,MALE,Gentoo penguin (Pygoscelis papua)


In [4]:
# Note that this matches the rows in test_X
test_y.head(5)

Unnamed: 0,body_mass_g
0,3550.0
1,3900.0
2,3350.0
3,3950.0
4,5500.0


## Estimators

Following Scikit-Learn, all learning components are "estimators": objects that can learn from training data and then apply themselves to new data. Estimators share the following patterns:

- a constructor that takes a list of parameters
- a standard string representation that shows the class name and all non-default parameters, e.g. `LinearRegression(fit_intercept=False)`
- a `.fit(...)` method to fit the estimator to training data
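
The three patterns in the list above can be sketched with a toy class in plain Python. `SketchEstimator` and its parameters are hypothetical and exist only for illustration; the trailing-underscore attribute for learned state follows the Scikit-Learn convention and is assumed, not confirmed, for BigFrames:

```python
import inspect


class SketchEstimator:
    """Toy estimator that only records how many rows it was fitted on."""

    def __init__(self, fit_intercept=True, scale=1.0):
        # Constructor parameters are stored unchanged under the same names.
        self.fit_intercept = fit_intercept
        self.scale = scale

    def __repr__(self):
        # Show only the parameters that differ from the constructor defaults.
        sig = inspect.signature(type(self).__init__)
        changed = [
            f"{name}={getattr(self, name)!r}"
            for name, param in sig.parameters.items()
            if name != "self" and getattr(self, name) != param.default
        ]
        return f"{type(self).__name__}({', '.join(changed)})"

    def fit(self, X, y=None):
        # Learned state gets a trailing underscore (Scikit-Learn convention).
        self.n_rows_ = len(X)
        return self


print(SketchEstimator())                     # SketchEstimator()
print(SketchEstimator(fit_intercept=False))  # SketchEstimator(fit_intercept=False)
```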

These estimators can be further broken down into two main subtypes:

### Transformers

Transformers are estimators that are used to prepare data for consumption by other estimators ('preprocessing'). In addition to `.fit(...)`, the transformer implements a `.transform(...)` method, which applies a transformation based on what was computed during `.fit(...)`. With this pattern, dynamic preprocessing steps can be applied consistently to both training and test/production data.

An example of a transformer is `bigframes.ml.preprocessing.StandardScaler`, which rescales a dataset to have a mean of zero and a standard deviation of one:

In [5]:
from bigframes.ml.preprocessing import StandardScaler

# StandardScaler will only work on numeric columns
numeric_columns = ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]

scaler = StandardScaler()
scaler.fit(train_X[numeric_columns])

# Now, StandardScaler should transform the numbers to have mean of zero
# and standard deviation of one:
scaler.transform(train_X[numeric_columns])

Unnamed: 0,scaled_culmen_length_mm,scaled_culmen_depth_mm,scaled_flipper_length_mm
127,0.085923,-0.717686,1.124315
108,0.854893,-0.717686,0.485016
110,0.983055,-0.667098,1.053282
111,1.019673,-1.071799,1.053282
112,0.90982,-0.515336,1.053282
114,1.001364,-0.515336,1.621549
115,1.092908,-0.616511,1.621549
121,0.745041,-1.021211,0.556049
129,1.111217,-1.071799,1.692582
130,1.294305,-0.312985,1.692582


In [6]:
# We can then repeat this transformation on new data
scaler.transform(test_X[numeric_columns])

Unnamed: 0,scaled_culmen_length_mm,scaled_culmen_depth_mm,scaled_flipper_length_mm
4,1.36754,-0.41416,2.047749
6,1.715408,-0.667098,1.266382
8,0.378864,-1.021211,0.982249
19,2.026658,-0.565923,2.047749
20,1.239379,0.091715,1.905682
23,0.873202,-0.464748,1.479482
36,0.177467,-0.869449,1.337415
41,0.507026,-0.515336,0.982249
53,0.964746,-0.717686,1.124315
60,1.477393,-0.060047,2.047749
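
To make the fit/transform split concrete, here is a minimal plain-Python sketch of a single-column standard scaler. `SketchScaler` is a hypothetical class, and the use of the population standard deviation is an assumption; it is not the BigFrames implementation:

```python
import math


class SketchScaler:
    """Toy single-column standard scaler (population standard deviation)."""

    def fit(self, values):
        # Learn the statistics from the training data only.
        n = len(values)
        self.mean_ = sum(values) / n
        self.std_ = math.sqrt(sum((v - self.mean_) ** 2 for v in values) / n)
        return self

    def transform(self, values):
        # Reuse the statistics learned in fit(); nothing is recomputed here.
        return [(v - self.mean_) / self.std_ for v in values]


train = [180.0, 190.0, 200.0, 210.0, 220.0]
test = [230.0]

sketch_scaler = SketchScaler().fit(train)
scaled_train = sketch_scaler.transform(train)
scaled_test = sketch_scaler.transform(test)
print(scaled_train)
print(scaled_test)
```

The scaled training values have a mean of zero, while the scaled test values generally do not, because the mean and standard deviation come from the training data alone.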


#### Composing transformers

To process data where different columns need different preprocessors, `bigframes.ml.compose.ColumnTransformer` can be employed:

In [7]:
from bigframes.ml.compose import ColumnTransformer
from bigframes.ml.preprocessing import OneHotEncoder

# Create an aggregate transform that applies StandardScaler to the numeric columns,
# and OneHotEncoder to the string columns
preproc = ColumnTransformer([
    ("scale", StandardScaler(), ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]),
    ("encode", OneHotEncoder(), ["species", "sex", "island"])])

# Now we can fit all columns of the training data
preproc.fit(train_X)

processed_train_X = preproc.transform(train_X)
processed_test_X = preproc.transform(test_X)

processed_train_X

Unnamed: 0,onehotencoded_island,scaled_culmen_length_mm,scaled_culmen_depth_mm,scaled_flipper_length_mm,onehotencoded_sex,onehotencoded_species
127,"[{'index': 1, 'value': 1.0}]",0.085923,-0.717686,1.124315,"[{'index': 0, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
108,"[{'index': 1, 'value': 1.0}]",0.854893,-0.717686,0.485016,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
110,"[{'index': 1, 'value': 1.0}]",0.983055,-0.667098,1.053282,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
111,"[{'index': 1, 'value': 1.0}]",1.019673,-1.071799,1.053282,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
112,"[{'index': 1, 'value': 1.0}]",0.90982,-0.515336,1.053282,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
114,"[{'index': 1, 'value': 1.0}]",1.001364,-0.515336,1.621549,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
115,"[{'index': 1, 'value': 1.0}]",1.092908,-0.616511,1.621549,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
121,"[{'index': 1, 'value': 1.0}]",0.745041,-1.021211,0.556049,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
129,"[{'index': 1, 'value': 1.0}]",1.111217,-1.071799,1.692582,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
130,"[{'index': 1, 'value': 1.0}]",1.294305,-0.312985,1.692582,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
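
The column-wise dispatch can be illustrated with a small plain-Python sketch that works on a dict of column lists. All class names here are hypothetical, the toy transformers are stand-ins rather than equivalents of `StandardScaler` and `OneHotEncoder`, and unlike the real API this sketch takes transformer classes (not instances) so that each column gets its own fitted copy:

```python
class MeanCenter:
    """Toy numeric transformer: subtracts the training mean."""

    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self

    def transform(self, values):
        return [v - self.mean_ for v in values]


class CategoryIndex:
    """Toy categorical transformer: maps each training category to an integer."""

    def fit(self, values):
        self.index_ = {c: i for i, c in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        # Categories not seen during fit() map to -1 in this sketch.
        return [self.index_.get(v, -1) for v in values]


class SketchColumnTransformer:
    def __init__(self, transformers):
        # transformers: list of (prefix, transformer_class, column_names)
        self.transformers = transformers

    def fit(self, table):
        self.fitted_ = {}
        for prefix, factory, columns in self.transformers:
            for col in columns:
                self.fitted_[col] = (prefix, factory().fit(table[col]))
        return self

    def transform(self, table):
        # Columns that were not listed in any transformer are dropped.
        return {
            f"{prefix}_{col}": fitted.transform(table[col])
            for col, (prefix, fitted) in self.fitted_.items()
        }


train = {"flipper_length_mm": [180.0, 200.0, 220.0],
         "island": ["Dream", "Biscoe", "Dream"]}
test = {"flipper_length_mm": [190.0], "island": ["Torgersen"]}

sketch_preproc = SketchColumnTransformer([
    ("centered", MeanCenter, ["flipper_length_mm"]),
    ("indexed", CategoryIndex, ["island"]),
]).fit(train)

print(sketch_preproc.transform(train))
print(sketch_preproc.transform(test))
```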


### Predictors

Predictors are estimators that learn and make predictions. In addition to `.fit(...)`, the predictor implements a `.predict(...)` method, which will use what was learned during `.fit(...)` to predict some output.

Predictors can be further broken down into two categories:

#### Supervised predictors

Supervised learning is when we train a model on input-output pairs, and then ask it to predict the output for new inputs. An example of such a predictor is `bigframes.ml.linear_model.LinearRegression`.

In [8]:
from bigframes.ml.linear_model import LinearRegression

linreg = LinearRegression()

# Learn from the training data how to predict output y
linreg.fit(processed_train_X, train_y)

# Predict y for the test data
predicted_test_y = linreg.predict(processed_test_X)

predicted_test_y

Unnamed: 0,predicted_body_mass_g
26,4637.784248
54,3247.336085
22,4219.706307
62,5361.446067
57,4085.755705
8,5201.034041
36,5286.578579
0,3416.315476
63,3440.931066
14,3587.093257
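
The fit/predict pattern for a supervised predictor can be sketched with a one-feature least-squares regression in plain Python. `SketchLinearRegression` is a hypothetical toy class that uses the closed-form solution for a single feature; it says nothing about how BigQuery ML trains its models:

```python
class SketchLinearRegression:
    """Toy one-feature least-squares regression."""

    def fit(self, x, y):
        n = len(x)
        mean_x = sum(x) / n
        mean_y = sum(y) / n
        # Closed-form slope: covariance(x, y) / variance(x).
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        var = sum((a - mean_x) ** 2 for a in x)
        self.coef_ = cov / var
        self.intercept_ = mean_y - self.coef_ * mean_x
        return self

    def predict(self, x):
        # Apply the learned line to new inputs.
        return [self.intercept_ + self.coef_ * a for a in x]


train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]  # exactly y = 2x + 1

sketch_linreg = SketchLinearRegression().fit(train_x, train_y)
print(sketch_linreg.predict([5.0]))  # [11.0]
```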


#### Unsupervised predictors

In unsupervised learning, there are no known outputs in the training data. Instead, the model learns from the input data alone and predicts something else, such as a grouping. An example of an unsupervised predictor is `bigframes.ml.cluster.KMeans`, which learns how to fit input data to a target number of clusters.

In [16]:
from bigframes.ml.cluster import KMeans

kmeans = KMeans(n_clusters=4)

kmeans.fit(processed_train_X)

kmeans.predict(processed_test_X)

Unnamed: 0,CENTROID_ID
46,2
25,4
65,2
20,2
0,1
11,2
21,3
15,1
37,1
4,2
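
As a rough illustration of what a clustering predictor learns, here is a plain-Python sketch of k-means (Lloyd's algorithm) on one-dimensional data. `SketchKMeans` is hypothetical, its evenly spaced initialisation is a simplification chosen to keep the example deterministic, and the 1-based cluster ids merely mirror the `CENTROID_ID` values shown above:

```python
class SketchKMeans:
    """Toy 1-D k-means (Lloyd's algorithm) with a fixed initialisation."""

    def __init__(self, n_clusters=2, n_iter=10):
        self.n_clusters = n_clusters
        self.n_iter = n_iter

    def _nearest(self, value):
        # Index of the closest centroid.
        dists = [abs(value - c) for c in self.centroids_]
        return dists.index(min(dists))

    def fit(self, values):
        # Start from evenly spaced sorted values (a simplification; real
        # implementations typically use random or k-means++ initialisation).
        ordered = sorted(values)
        step = (len(ordered) - 1) / max(self.n_clusters - 1, 1)
        self.centroids_ = [ordered[round(i * step)] for i in range(self.n_clusters)]
        for _ in range(self.n_iter):
            groups = [[] for _ in self.centroids_]
            for v in values:
                groups[self._nearest(v)].append(v)
            # Move each centroid to the mean of its group (keep it if empty).
            self.centroids_ = [
                sum(g) / len(g) if g else c
                for g, c in zip(groups, self.centroids_)
            ]
        return self

    def predict(self, values):
        # 1-based ids, mirroring the CENTROID_ID values in the output above.
        return [self._nearest(v) + 1 for v in values]


train = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
sketch_kmeans = SketchKMeans(n_clusters=2).fit(train)
print(sketch_kmeans.predict([0.9, 5.1]))  # [1, 2]
```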


## Pipelines

Transformers and predictors can be chained into a single estimator component using `bigframes.ml.pipeline.Pipeline`:

In [10]:
from bigframes.ml.pipeline import Pipeline

pipeline = Pipeline([
  ('preproc', preproc),
  ('linreg', linreg)
])

# Print our pipeline
pipeline

Pipeline(steps=[('preproc',
                 ColumnTransformer(transformers=[('scale', StandardScaler(),
                                                  ['culmen_length_mm',
                                                   'culmen_depth_mm',
                                                   'flipper_length_mm']),
                                                 ('encode', OneHotEncoder(),
                                                  ['species', 'sex',
                                                   'island'])])),
                ('linreg', LinearRegression())])

The pipeline simplifies the workflow by applying each of its component steps automatically:

In [11]:
pipeline.fit(train_X, train_y)

predicted_test_y = pipeline.predict(test_X)
predicted_test_y

Unnamed: 0,predicted_body_mass_g
4,5623.418778
6,5436.723898
8,5201.041354
19,5664.803711
20,5641.825206
23,5435.810218
36,5286.586113
41,5276.955531
53,5327.33637
60,5678.411067


In the backend, a pipeline will actually be compiled into a single model with an embedded TRANSFORM step.
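
Setting aside how BigQuery compiles it, the observable behaviour of a pipeline can be sketched in plain Python: `fit` fits and applies each transformer in order before fitting the final predictor, and `predict` replays the same transforms before predicting. All classes below are hypothetical toys, not the BigFrames implementation:

```python
class Center:
    """Toy transformer: subtracts the training mean."""

    def fit(self, x, y=None):
        self.mean_ = sum(x) / len(x)
        return self

    def transform(self, x):
        return [v - self.mean_ for v in x]


class SlopeThroughOrigin:
    """Toy predictor: least-squares fit of y = slope * x."""

    def fit(self, x, y):
        self.slope_ = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
        return self

    def predict(self, x):
        return [self.slope_ * v for v in x]


class SketchPipeline:
    def __init__(self, steps):
        # steps: list of (name, estimator); the last one is the predictor.
        self.steps = steps

    def fit(self, x, y):
        for _, step in self.steps[:-1]:
            x = step.fit(x, y).transform(x)
        self.steps[-1][1].fit(x, y)
        return self

    def predict(self, x):
        # Replay the fitted transforms, then call the final predictor.
        for _, step in self.steps[:-1]:
            x = step.transform(x)
        return self.steps[-1][1].predict(x)


sketch_pipeline = SketchPipeline([
    ("center", Center()),
    ("model", SlopeThroughOrigin()),
])
sketch_pipeline.fit([1.0, 2.0, 3.0], [-2.0, 0.0, 2.0])
print(sketch_pipeline.predict([4.0]))  # [4.0]
```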

## Evaluating results

Some models include a convenient `.score()` method for evaluation with a preset set of accuracy metrics:

In [12]:
# In the case of a pipeline, this will be equivalent to calling .score on the contained LinearRegression
pipeline.score(test_X, test_y)

Unnamed: 0,mean_absolute_error,mean_squared_error,mean_squared_log_error,median_absolute_error,r2_score,explained_variance
0,269.730614,117209.606054,0.00734,185.196289,0.852767,0.852911


For a more general approach, the module `bigframes.ml.metrics` is provided:

In [13]:
from bigframes.ml.metrics import r2_score

r2_score(test_y, predicted_test_y)

0.8527672439780059
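
For reference, the usual definition of the R² score is one minus the ratio of the residual sum of squares to the total sum of squares. The plain-Python sketch below assumes `bigframes.ml.metrics.r2_score` follows this standard definition, as Scikit-Learn's does; `r2_sketch` is a hypothetical helper and the numbers are made up for illustration:

```python
def r2_sketch(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3000.0, 4000.0, 5000.0]
y_pred = [3100.0, 3900.0, 5000.0]
score = r2_sketch(y_true, y_pred)
print(score)
```

A score of 1.0 means perfect predictions, and 0.0 means the model does no better than always predicting the mean of `y_true`.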

## Save/Load to BigQuery

Estimators can be saved to BigQuery as BQML models, and loaded again later:

In [14]:
# Replace with a path where you have permission to save a model
model_name = "bigframes-dev.bqml_tutorial.penguins_model"

linreg.to_gbq(model_name, replace=True)

LinearRegression()

In [15]:
# WARNING - until b/281709360 is fixed & pipeline is updated, pipelines will load as models,
# and details of their transform steps will be lost (the loaded model will behave the same)
session.read_gbq_model(model_name)

LinearRegression()