# Using ML - ML fundamentals

The `bigframes.ml` module implements Scikit-Learn's machine learning API in
BigQuery DataFrames. It exposes BigQuery's ML capabilities in a simple, popular
API that works seamlessly with the rest of the BigQuery DataFrames API.

In [1]:
# Let's load some test data to use in this tutorial
import bigframes.pandas

df = bigframes.pandas.read_gbq("bigquery-public-data.ml_datasets.penguins")
df = df.dropna()

# Temporary workaround: let's name our index so it isn't lost. BigQuery DataFrames
# currently drops unnamed indexes when round-tripping through pandas, which
# some ML APIs do to route around missing functionality
df.index.name = "penguin_id"

df

penguin_id,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie Penguin (Pygoscelis adeliae),Dream,36.6,18.4,184.0,3475.0,FEMALE
1,Adelie Penguin (Pygoscelis adeliae),Dream,39.8,19.1,184.0,4650.0,MALE
2,Adelie Penguin (Pygoscelis adeliae),Dream,40.9,18.9,184.0,3900.0,MALE
3,Chinstrap penguin (Pygoscelis antarctica),Dream,46.5,17.9,192.0,3500.0,FEMALE
4,Adelie Penguin (Pygoscelis adeliae),Dream,37.3,16.8,192.0,3000.0,FEMALE
5,Adelie Penguin (Pygoscelis adeliae),Dream,43.2,18.5,192.0,4100.0,MALE
6,Chinstrap penguin (Pygoscelis antarctica),Dream,46.9,16.6,192.0,2700.0,FEMALE
7,Chinstrap penguin (Pygoscelis antarctica),Dream,50.5,18.4,200.0,3400.0,FEMALE
8,Chinstrap penguin (Pygoscelis antarctica),Dream,49.5,19.0,200.0,3800.0,MALE
9,Adelie Penguin (Pygoscelis adeliae),Dream,40.2,20.1,200.0,3975.0,MALE


## Data split

Part of preparing data for a machine learning task is splitting it into subsets for training and testing, so that the model can be checked for overfitting on data it has not seen. Most commonly this is done with `bigframes.ml.model_selection.train_test_split`, like so:

In [2]:
# In this example, we're doing supervised learning, where we will learn to predict
# output variable `y` from input features `X`
X = df[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex', 'species']]
y = df[['body_mass_g']] 

from bigframes.ml.model_selection import train_test_split

# This will split X and y into test and training sets, with 20% of the rows in the test set,
# and the rest in the training set
train_X, test_X, train_y, test_y = train_test_split(
  X, y, test_size=0.2)

# Show the shape of the data after the split
print(f"""train_X shape: {train_X.shape}
test_X shape: {test_X.shape}
train_y shape: {train_y.shape}
test_y shape: {test_y.shape}""")

train_X shape: (267, 6)
test_X shape: (67, 6)
train_y shape: (267, 1)
test_y shape: (67, 1)


In [3]:
# If we look at the data, we can see that random rows were selected for
# each side of the split
test_X.head(5)

penguin_id,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,sex,species
241,Biscoe,46.2,14.9,221.0,MALE,Gentoo penguin (Pygoscelis papua)
121,Dream,48.1,16.4,199.0,FEMALE,Chinstrap penguin (Pygoscelis antarctica)
209,Biscoe,42.7,18.3,196.0,MALE,Adelie Penguin (Pygoscelis adeliae)
270,Biscoe,37.7,16.0,183.0,FEMALE,Adelie Penguin (Pygoscelis adeliae)
187,Biscoe,43.4,14.4,218.0,FEMALE,Gentoo penguin (Pygoscelis papua)


In [4]:
# Note that this matches the rows in test_X
test_y.head(5)

penguin_id,body_mass_g
241,5300.0
121,3325.0
209,4075.0
270,3075.0
187,4600.0
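
The idea behind the split can be illustrated without BigQuery. The sketch below is plain Python, not the `bigframes` implementation: the helper `split_ids` and its `seed` parameter are hypothetical names used only for illustration. It shuffles the row labels once and slices them, so features and targets selected by the same labels stay aligned.

```python
import random


def split_ids(ids, test_size=0.2, seed=0):
    """Shuffle row labels and split them into (train_ids, test_ids).

    Hypothetical helper for illustration; not part of bigframes.
    """
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Toy data keyed by row label, standing in for X and y
features = {i: {"flipper_length_mm": 180.0 + i} for i in range(10)}
targets = {i: 3000.0 + 50.0 * i for i in range(10)}

train_ids, test_ids = split_ids(features.keys(), test_size=0.2)

# Selecting by the same labels keeps X and y rows matched
toy_test_X = [features[i] for i in test_ids]
toy_test_y = [targets[i] for i in test_ids]
print(len(train_ids), len(test_ids))
```

Because both sides are selected by label, the first rows of the test features and test targets share the same `penguin_id` values, as in the two outputs above.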


## Estimators

Following Scikit-Learn, all learning components are "estimators": objects that can learn from training data and then apply themselves to new data. Estimators share the following patterns:

- a constructor that takes a list of parameters
- a standard string representation that shows the class name and all non-default parameters, e.g. `LinearRegression(fit_intercept=False)`
- a `.fit(...)` method to fit the estimator to training data
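
These conventions can be sketched in a few lines of plain Python. `TinyEstimator` is a hypothetical class written for this illustration and is not part of `bigframes.ml`; its parameters are placeholders.

```python
import inspect


class TinyEstimator:
    """Hypothetical estimator illustrating the shared conventions."""

    def __init__(self, fit_intercept=True, scale=1.0):
        self.fit_intercept = fit_intercept
        self.scale = scale

    def __repr__(self):
        # Show only parameters that differ from the constructor defaults
        signature = inspect.signature(type(self).__init__)
        changed = [
            f"{name}={getattr(self, name)!r}"
            for name, param in signature.parameters.items()
            if name != "self" and getattr(self, name) != param.default
        ]
        return f"{type(self).__name__}({', '.join(changed)})"

    def fit(self, X, y=None):
        # A real estimator would learn from X (and y) here
        self.n_rows_seen_ = len(X)
        return self


print(TinyEstimator())
print(TinyEstimator(fit_intercept=False))
```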

These estimators can be further broken down into two main subtypes:

### Transformers

Transformers are estimators that prepare data for consumption by other estimators ('preprocessing'). In addition to `.fit(...)`, a transformer implements a `.transform(...)` method, which applies a transformation based on what was computed during `.fit(...)`. With this pattern, dynamic preprocessing steps can be applied consistently to training data and to test or production data.

An example of a transformer is `bigframes.ml.preprocessing.StandardScaler`, which rescales a dataset to have a mean of zero and a standard deviation of one:

In [5]:
from bigframes.ml.preprocessing import StandardScaler

# StandardScaler will only work on numeric columns
numeric_columns = ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]

scaler = StandardScaler()
scaler.fit(train_X[numeric_columns])

# Now, StandardScaler should transform the numbers to have a mean of zero
# and a standard deviation of one:
scaler.transform(train_X[numeric_columns])

penguin_id,scaled_culmen_length_mm,scaled_culmen_depth_mm,scaled_flipper_length_mm
0,-1.364965,0.629892,-1.226537
1,-0.771824,0.984275,-1.226537
2,-0.567932,0.883023,-1.226537
3,0.470064,0.376761,-0.652517
4,-1.235216,-0.180128,-0.652517
5,-0.141612,0.680518,-0.652517
6,0.544207,-0.281381,-0.652517
7,1.21149,0.629892,-0.078497
8,1.026133,0.933649,-0.078497
10,-0.586468,0.883023,0.495523


In [6]:
# We can then repeat this transformation on new data
scaler.transform(test_X[numeric_columns])

penguin_id,scaled_culmen_length_mm,scaled_culmen_depth_mm,scaled_flipper_length_mm
9,-0.697682,1.490538,-0.078497
12,-1.290822,-0.129502,-1.154784
13,0.562742,0.073003,-1.154784
19,-1.142537,0.478013,-0.580765
22,-0.697682,-0.028249,-0.580765
23,-1.327894,0.680518,-0.580765
25,1.489525,0.478013,-0.006745
28,1.897309,1.844922,-0.006745
30,1.267097,0.680518,-0.006745
38,-0.49379,1.59179,-0.509012
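
The key point of the fit/transform split is that statistics are learned once, from the training data, and then reused unchanged on any other data. A minimal plain-Python sketch follows. `TinyStandardScaler` is a hypothetical stand-in, and its use of the population standard deviation is an assumption; the convention used by BigQuery's scaler may differ.

```python
import statistics


class TinyStandardScaler:
    """Hypothetical stand-in showing the fit/transform split."""

    def fit(self, values):
        # Learn the statistics from the training data only
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)
        return self

    def transform(self, values):
        # Reuse the learned statistics on any data set
        return [(v - self.mean_) / self.std_ for v in values]


train_flipper = [184.0, 192.0, 200.0, 221.0]
test_flipper = [196.0, 218.0]

tiny_scaler = TinyStandardScaler().fit(train_flipper)
scaled_train = tiny_scaler.transform(train_flipper)
scaled_test = tiny_scaler.transform(test_flipper)

print(statistics.fmean(scaled_train), statistics.pstdev(scaled_train))
```

The scaled training values have a mean of (approximately) zero and a standard deviation of one. The scaled test values generally do not, because they are shifted and scaled by the training statistics rather than their own.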


#### Composing transformers

To process data where different columns need different preprocessors, `bigframes.ml.compose.ColumnTransformer` can be employed:

In [7]:
from bigframes.ml.compose import ColumnTransformer
from bigframes.ml.preprocessing import OneHotEncoder

# Create an aggregate transform that applies StandardScaler to the numeric columns,
# and OneHotEncoder to the string columns
preproc = ColumnTransformer([
    ("scale", StandardScaler(), ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]),
    ("encode", OneHotEncoder(), ["species", "sex", "island"])])

# Now we can fit all columns of the training data
preproc.fit(train_X)

processed_train_X = preproc.transform(train_X)
processed_test_X = preproc.transform(test_X)

processed_train_X

penguin_id,onehotencoded_island,scaled_culmen_length_mm,scaled_culmen_depth_mm,scaled_flipper_length_mm,onehotencoded_sex,onehotencoded_species
0,"[{'index': 2, 'value': 1.0}]",-1.364965,0.629892,-1.226537,"[{'index': 2, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
1,"[{'index': 2, 'value': 1.0}]",-0.771824,0.984275,-1.226537,"[{'index': 3, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
2,"[{'index': 2, 'value': 1.0}]",-0.567932,0.883023,-1.226537,"[{'index': 3, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
3,"[{'index': 2, 'value': 1.0}]",0.470064,0.376761,-0.652517,"[{'index': 2, 'value': 1.0}]","[{'index': 2, 'value': 1.0}]"
4,"[{'index': 2, 'value': 1.0}]",-1.235216,-0.180128,-0.652517,"[{'index': 2, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
5,"[{'index': 2, 'value': 1.0}]",-0.141612,0.680518,-0.652517,"[{'index': 3, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
6,"[{'index': 2, 'value': 1.0}]",0.544207,-0.281381,-0.652517,"[{'index': 2, 'value': 1.0}]","[{'index': 2, 'value': 1.0}]"
7,"[{'index': 2, 'value': 1.0}]",1.21149,0.629892,-0.078497,"[{'index': 2, 'value': 1.0}]","[{'index': 2, 'value': 1.0}]"
8,"[{'index': 2, 'value': 1.0}]",1.026133,0.933649,-0.078497,"[{'index': 3, 'value': 1.0}]","[{'index': 2, 'value': 1.0}]"
10,"[{'index': 2, 'value': 1.0}]",-0.586468,0.883023,0.495523,"[{'index': 3, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
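
The one-hot columns above are shown in a sparse form: each cell lists only the non-zero slot as an `index`/`value` pair. The sketch below reproduces that shape in plain Python. The helpers `fit_categories` and `encode` are hypothetical, and numbering the sorted categories from 1 is an assumption inferred from the displayed rows (for example, `Dream` appearing as index 2), not a documented guarantee.

```python
def fit_categories(values):
    """Map each distinct category to an index, numbering sorted values from 1.

    The numbering scheme is an assumption inferred from the output above.
    """
    return {category: i for i, category in enumerate(sorted(set(values)), start=1)}


def encode(value, category_index):
    """Return a sparse one-hot encoding: only the non-zero slot is stored."""
    return [{"index": category_index[value], "value": 1.0}]


islands = ["Dream", "Biscoe", "Torgersen", "Dream"]
island_index = fit_categories(islands)

print(island_index)
print(encode("Dream", island_index))
```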


### Predictors

Predictors are estimators that learn and make predictions. In addition to `.fit(...)`, the predictor implements a `.predict(...)` method, which will use what was learned during `.fit(...)` to predict some output.

Predictors can be further broken down into two categories:

#### Supervised predictors

Supervised learning is when we train a model on input-output pairs, and then ask it to predict the output for new inputs. An example of such a predictor is `bigframes.ml.linear_model.LinearRegression`.

In [8]:
from bigframes.ml.linear_model import LinearRegression

linreg = LinearRegression()

# Learn from the training data how to predict output y
linreg.fit(processed_train_X, train_y)

# Predict y for the test data
predicted_test_y = linreg.predict(processed_test_X)

predicted_test_y

penguin_id,predicted_body_mass_g
9,4295.335461
12,3338.44131
13,3201.820204
19,3982.814079
22,3538.610664
23,3613.50305
25,4009.759444
28,4240.515635
30,4028.904195
38,4206.810346
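
To make the fit/predict pattern concrete, here is a plain-Python sketch of least-squares regression on a single feature. `TinyLinearRegression` is a hypothetical class for illustration only; it says nothing about how BigQuery trains its model, which handles many features and categorical inputs.

```python
class TinyLinearRegression:
    """Hypothetical one-feature least-squares model, for illustration only."""

    def fit(self, x, y):
        n = len(x)
        mean_x = sum(x) / n
        mean_y = sum(y) / n
        # Closed-form least-squares slope and intercept for a single feature
        covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        variance = sum((xi - mean_x) ** 2 for xi in x)
        self.slope_ = covariance / variance
        self.intercept_ = mean_y - self.slope_ * mean_x
        return self

    def predict(self, x):
        return [self.intercept_ + self.slope_ * xi for xi in x]


# Toy input-output pairs that follow y = 2x + 1 exactly
toy_model = TinyLinearRegression().fit([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(toy_model.predict([5.0, 6.0]))
```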


#### Unsupervised predictors

In unsupervised learning, there are no known outputs in the training data. Instead, the model learns from the input data alone and predicts something else. An example of an unsupervised predictor is `bigframes.ml.cluster.KMeans`, which learns how to fit input data to a target number of clusters.

In [9]:
from bigframes.ml.cluster import KMeans

kmeans = KMeans(n_clusters=4)

kmeans.fit(processed_train_X)

kmeans.predict(processed_test_X)

penguin_id,CENTROID_ID
9,4
12,4
13,2
19,4
22,4
23,4
25,2
28,2
30,2
38,4
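
For a k-means model, the prediction step amounts to assigning each row to its nearest learned centroid. The plain-Python sketch below shows only that assignment step, with made-up centroids. The helper `nearest_centroid` is hypothetical, and the 1-based numbering is an assumption chosen to resemble the `CENTROID_ID` values shown above.

```python
import math


def nearest_centroid(point, centroids):
    """Return the 1-based ID of the centroid closest to `point`."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances)) + 1


# Hypothetical centroids, as if they had been learned during fit
toy_centroids = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
toy_points = [(0.5, 0.2), (4.0, 6.0), (1.0, 4.0)]

print([nearest_centroid(p, toy_centroids) for p in toy_points])
```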


## Pipelines

Transformers and predictors can be chained into a single estimator component using `bigframes.ml.pipeline.Pipeline`:

In [10]:
from bigframes.ml.pipeline import Pipeline

pipeline = Pipeline([
  ('preproc', preproc),
  ('linreg', linreg)
])

# Print our pipeline
pipeline

Pipeline(steps=[('preproc',
                 ColumnTransformer(transformers=[('scale', StandardScaler(),
                                                  ['culmen_length_mm',
                                                   'culmen_depth_mm',
                                                   'flipper_length_mm']),
                                                 ('encode', OneHotEncoder(),
                                                  ['species', 'sex',
                                                   'island'])])),
                ('linreg', LinearRegression())])

The pipeline simplifies the workflow by applying each of its component steps automatically:

In [11]:
pipeline.fit(train_X, train_y)

predicted_test_y = pipeline.predict(test_X)
predicted_test_y

penguin_id,predicted_body_mass_g
9,4295.328991
12,3338.434943
13,3201.813783
19,3982.807707
22,3538.604385
23,3613.496641
25,4009.753161
28,4240.509087
30,4028.897875
38,4206.80377


In the backend, a pipeline will actually be compiled into a single model with an embedded TRANSFORM step.
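
Conceptually, the chaining described in this section behaves like the plain-Python sketch below. It illustrates the pattern only: all three classes are hypothetical, and BigQuery DataFrames compiles the pipeline into a single model rather than running the steps one by one in Python.

```python
class AddOne:
    """Hypothetical transformer: learns nothing, adds 1 to every value."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [x + 1 for x in X]


class MeanPredictor:
    """Hypothetical predictor: always predicts the mean training target."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class TinyPipeline:
    """Chains transformers and a final predictor behind fit/predict."""

    def __init__(self, steps):
        self.steps = steps

    def _transform(self, X):
        # Apply every step except the final predictor
        for _, transformer in self.steps[:-1]:
            X = transformer.transform(X)
        return X

    def fit(self, X, y):
        # Fit each transformer in turn, passing its output to the next step
        for _, transformer in self.steps[:-1]:
            X = transformer.fit(X).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        return self.steps[-1][1].predict(self._transform(X))


toy_pipeline = TinyPipeline([("shift", AddOne()), ("mean", MeanPredictor())])
toy_pipeline.fit([1, 2, 3], [10.0, 20.0, 30.0])
print(toy_pipeline.predict([4, 5]))
```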

## Evaluating results

Some models include a convenient `.score(X, y)` method for evaluation with a preset set of metrics:

In [12]:
# In the case of a pipeline, this will be equivalent to calling .score on the contained LinearRegression
pipeline.score(test_X, test_y)

,mean_absolute_error,mean_squared_error,mean_squared_log_error,median_absolute_error,r2_score,explained_variance
0,241.640738,90117.84266,0.005652,200.718678,0.8727,0.878359


For a more general approach, the `bigframes.ml.metrics` module is provided:

In [13]:
from bigframes.ml.metrics import r2_score

r2_score(test_y, predicted_test_y)

0.8726996609087831
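
For reference, the R² score is conventionally defined as one minus the ratio of the residual sum of squares to the total sum of squares. The plain-Python sketch below computes it on toy numbers. The function `r2` is a hypothetical helper written for this illustration, not the `bigframes.ml.metrics` implementation.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


toy_true = [3.0, -0.5, 2.0, 7.0]
toy_pred = [2.5, 0.0, 2.0, 8.0]
print(round(r2(toy_true, toy_pred), 4))
```

A score of 1.0 means the predictions match the targets exactly, while a score of 0.0 is what a model that always predicts the mean target would get.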

## Save/Load to BigQuery

Estimators can be saved to BigQuery as BQML models and loaded again later.

In [14]:
# Replace with a path where you have permission to save a model
model_name = "bigframes-dev.bqml_tutorial.penguins_model"

linreg.to_gbq(model_name, replace=True)

Pipeline(steps=[('transform',
                 ColumnTransformer(transformers=[('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'island'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_length_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_depth_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'flipper_length_mm'),
                                              

In [15]:
# WARNING - until b/281709360 is fixed & pipeline is updated, pipelines will load as models,
# and details of their transform steps will be lost (the loaded model will behave the same)
bigframes.pandas.read_gbq_model(model_name)

Pipeline(steps=[('transform',
                 ColumnTransformer(transformers=[('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'island'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_length_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_depth_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'flipper_length_mm'),
                                              