# Using ML - ML fundamentals

The `bigframes.ml` module implements Scikit-Learn's machine learning API in
BigQuery DataFrames. It exposes BigQuery's ML capabilities in a simple, popular
API that works seamlessly with the rest of the BigQuery DataFrames API.

In [18]:
# Let's load some test data to use in this tutorial
import bigframes.pandas

df = bigframes.pandas.read_gbq("bigquery-public-data.ml_datasets.penguins")
df = df.dropna()

# Temporary workaround: name the index so it isn't lost. BigQuery DataFrames
# currently drops unnamed indexes when round-tripping through pandas, which
# some ML APIs do to work around missing functionality
df.index.name = "penguin_id"

df

penguin_id,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie Penguin (Pygoscelis adeliae),Dream,36.6,18.4,184.0,3475.0,FEMALE
1,Adelie Penguin (Pygoscelis adeliae),Dream,39.8,19.1,184.0,4650.0,MALE
2,Adelie Penguin (Pygoscelis adeliae),Dream,40.9,18.9,184.0,3900.0,MALE
3,Chinstrap penguin (Pygoscelis antarctica),Dream,46.5,17.9,192.0,3500.0,FEMALE
4,Adelie Penguin (Pygoscelis adeliae),Dream,37.3,16.8,192.0,3000.0,FEMALE
5,Adelie Penguin (Pygoscelis adeliae),Dream,43.2,18.5,192.0,4100.0,MALE
6,Chinstrap penguin (Pygoscelis antarctica),Dream,46.9,16.6,192.0,2700.0,FEMALE
7,Chinstrap penguin (Pygoscelis antarctica),Dream,50.5,18.4,200.0,3400.0,FEMALE
8,Chinstrap penguin (Pygoscelis antarctica),Dream,49.5,19.0,200.0,3800.0,MALE
9,Adelie Penguin (Pygoscelis adeliae),Dream,40.2,20.1,200.0,3975.0,MALE


## Data split

Part of preparing data for a machine learning task is splitting it into subsets for training and testing, so that the model can be evaluated on data it has not seen and overfitting can be detected. Most commonly this is done with `bigframes.ml.model_selection.train_test_split`, like so:

In [19]:
# In this example, we're doing supervised learning, where we will learn to predict
# output variable `y` from input features `X`
X = df[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex', 'species']]
y = df[['body_mass_g']] 

from bigframes.ml.model_selection import train_test_split

# This will split X and y into test and training sets, with 20% of the rows in the test set,
# and the rest in the training set
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2)

# Show the shape of the data after the split
print(f"""X_train shape: {X_train.shape}
X_test shape: {X_test.shape}
y_train shape: {y_train.shape}
y_test shape: {y_test.shape}""")

X_train shape: (267, 6)
X_test shape: (67, 6)
y_train shape: (267, 1)
y_test shape: (67, 1)


In [20]:
# If we look at the data, we can see that random rows were selected for
# each side of the split
X_test.head(5)

penguin_id,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,sex,species
156,Biscoe,46.2,14.5,209.0,FEMALE,Gentoo penguin (Pygoscelis papua)
189,Biscoe,35.3,18.9,187.0,FEMALE,Adelie Penguin (Pygoscelis adeliae)
279,Biscoe,45.1,14.5,215.0,FEMALE,Gentoo penguin (Pygoscelis papua)
245,Biscoe,49.5,16.2,229.0,MALE,Gentoo penguin (Pygoscelis papua)
343,Torgersen,37.3,20.5,199.0,MALE,Adelie Penguin (Pygoscelis adeliae)


In [21]:
# Note that this matches the rows in X_test
y_test.head(5)

penguin_id,body_mass_g
156,4800.0
189,3800.0
279,5000.0
245,5800.0
343,3775.0
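
One way to picture the split is as a single random choice of row positions that is then applied to both `X` and `y`, which is why the rows of `y_test` line up with those of `X_test`. The sketch below is plain Python and is not how BigQuery DataFrames implements `train_test_split`; the helper name `split_indices`, the `seed` argument, and the choice to round the test set size up are all assumptions made for illustration.

```python
import math
import random


def split_indices(n_rows, test_size, seed=0):
    """Shuffle row positions and cut them into train and test groups."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    # Assumption: the test set size is rounded up.
    n_test = math.ceil(n_rows * test_size)
    return positions[n_test:], positions[:n_test]


# 334 rows in total, matching the shapes printed earlier (267 + 67)
train_idx, test_idx = split_indices(334, test_size=0.2)
print(len(train_idx), len(test_idx))
```

Selecting rows of `X` and `y` with the same `test_idx` keeps each feature row paired with its label.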


## Estimators

Following Scikit-Learn, all learning components are "estimators": objects that can learn from training data and then apply themselves to new data. Estimators share the following patterns:

- a constructor that takes a list of parameters
- a standard string representation that shows the class name and all non-default parameters, e.g. `LinearRegression(fit_intercept=False)`
- a `.fit(...)` method to fit the estimator to training data
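
These three patterns can be sketched with a toy class in plain Python. `ToyEstimator` is hypothetical and is not part of `bigframes.ml`; it only illustrates a constructor that stores parameters, a string representation limited to non-default parameters, and a `.fit(...)` method that stores what was learned.

```python
import inspect


class ToyEstimator:
    """Hypothetical estimator used only to illustrate the shared patterns."""

    def __init__(self, fit_intercept=True):
        # The constructor only records parameters; nothing is learned here.
        self.fit_intercept = fit_intercept

    def __repr__(self):
        # Show the class name and only the parameters that differ from defaults.
        params = inspect.signature(type(self).__init__).parameters
        changed = [
            f"{name}={getattr(self, name)!r}"
            for name, param in params.items()
            if name != "self" and getattr(self, name) != param.default
        ]
        return f"{type(self).__name__}({', '.join(changed)})"

    def fit(self, X, y):
        # Learned state is stored separately from the constructor parameters.
        self.intercept_ = sum(y) / len(y) if self.fit_intercept else 0.0
        return self


default_repr = repr(ToyEstimator())
custom_repr = repr(ToyEstimator(fit_intercept=False))
fitted = ToyEstimator().fit([[1], [2], [3]], [10.0, 20.0, 30.0])
print(default_repr, custom_repr, fitted.intercept_)
```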

These estimators can be further broken down into two main subtypes:

### Transformers

Transformers are estimators that are used to prepare data for consumption by other estimators ('preprocessing'). In addition to `.fit(...)`, the transformer implements a `.transform(...)` method, which will apply a transformation based on what was computed during `.fit(...)`. With this pattern, dynamic preprocessing steps can be applied consistently to both training and test/production data.

An example of a transformer is `bigframes.ml.preprocessing.StandardScaler`, which rescales a dataset to have a mean of zero and a standard deviation of one:

In [22]:
from bigframes.ml.preprocessing import StandardScaler

# StandardScaler will only work on numeric columns
numeric_columns = ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]

scaler = StandardScaler()
scaler.fit(X_train[numeric_columns])

# Now, StandardScaler should transform the numbers to have mean of zero
# and standard deviation of one:
scaler.transform(X_train[numeric_columns])

penguin_id,standard_scaled_culmen_length_mm,standard_scaled_culmen_depth_mm,standard_scaled_flipper_length_mm
0,-1.344188,0.642519,-1.193942
1,-0.750047,1.005876,-1.193942
2,-0.545811,0.90206,-1.193942
4,-1.214219,-0.188011,-0.619171
5,-0.118772,0.694427,-0.619171
6,0.568203,-0.291828,-0.619171
7,1.236611,0.642519,-0.044401
9,-0.675779,1.524957,-0.044401
10,-0.564378,0.90206,0.530369
11,-0.898582,0.798243,-1.122096


In [23]:
# We can then repeat this transformation on new data
scaler.transform(X_test[numeric_columns])

penguin_id,standard_scaled_culmen_length_mm,standard_scaled_culmen_depth_mm,standard_scaled_flipper_length_mm
3,0.493935,0.382978,-0.619171
8,1.050942,0.953968,-0.044401
17,1.255178,1.1616,-0.547325
23,-1.307054,0.694427,-0.547325
25,1.515114,0.486795,0.027445
27,1.236611,1.265417,0.027445
29,1.403713,0.953968,0.027445
34,0.419668,0.538703,-1.62502
35,-1.455589,0.694427,-1.050249
39,0.326833,1.1616,-0.475479
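
The fit/transform pattern used above can be sketched in plain Python for a single numeric column. `TinyStandardScaler` is a hypothetical stand-in, not the `bigframes.ml` implementation, and it assumes the population standard deviation (dividing by `n`), which the original does not specify. The point it illustrates is that the mean and standard deviation are learned once from the training data and then reused unchanged on new data.

```python
import math


class TinyStandardScaler:
    """Hypothetical single-column scaler illustrating fit/transform."""

    def fit(self, values):
        # Learn the statistics from the training data only.
        self.mean_ = sum(values) / len(values)
        variance = sum((v - self.mean_) ** 2 for v in values) / len(values)
        self.std_ = math.sqrt(variance)
        return self

    def transform(self, values):
        # Reuse the stored training statistics on whatever data is passed in.
        return [(v - self.mean_) / self.std_ for v in values]


# Made-up flipper lengths, not the actual training split
train_values = [184.0, 192.0, 200.0]
test_values = [192.0, 208.0]

scaler_sketch = TinyStandardScaler().fit(train_values)
scaled_train = scaler_sketch.transform(train_values)
scaled_test = scaler_sketch.transform(test_values)
print(scaled_train)
print(scaled_test)
```

Because the test values are scaled with the training statistics, the scaled test data will generally not have a mean of exactly zero, which is expected.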


#### Composing transformers

To process data where different columns need different preprocessors, `bigframes.ml.compose.ColumnTransformer` can be employed:

In [24]:
from bigframes.ml.compose import ColumnTransformer
from bigframes.ml.preprocessing import OneHotEncoder

# Create an aggregate transform that applies StandardScaler to the numeric columns,
# and OneHotEncoder to the string columns
preproc = ColumnTransformer([
    ("scale", StandardScaler(), ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]),
    ("encode", OneHotEncoder(), ["species", "sex", "island"])])

# Now we can fit all columns of the training data
preproc.fit(X_train)

processed_X_train = preproc.transform(X_train)
processed_X_test = preproc.transform(X_test)

processed_X_train

penguin_id,onehotencoded_island,standard_scaled_culmen_length_mm,standard_scaled_culmen_depth_mm,standard_scaled_flipper_length_mm,onehotencoded_sex,onehotencoded_species
0,"[{'index': 2, 'value': 1.0}]",-1.344188,0.642519,-1.193942,"[{'index': 2, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
1,"[{'index': 2, 'value': 1.0}]",-0.750047,1.005876,-1.193942,"[{'index': 3, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
2,"[{'index': 2, 'value': 1.0}]",-0.545811,0.90206,-1.193942,"[{'index': 3, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
4,"[{'index': 2, 'value': 1.0}]",-1.214219,-0.188011,-0.619171,"[{'index': 2, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
5,"[{'index': 2, 'value': 1.0}]",-0.118772,0.694427,-0.619171,"[{'index': 3, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
6,"[{'index': 2, 'value': 1.0}]",0.568203,-0.291828,-0.619171,"[{'index': 2, 'value': 1.0}]","[{'index': 2, 'value': 1.0}]"
7,"[{'index': 2, 'value': 1.0}]",1.236611,0.642519,-0.044401,"[{'index': 2, 'value': 1.0}]","[{'index': 2, 'value': 1.0}]"
9,"[{'index': 2, 'value': 1.0}]",-0.675779,1.524957,-0.044401,"[{'index': 3, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
10,"[{'index': 2, 'value': 1.0}]",-0.564378,0.90206,0.530369,"[{'index': 3, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
11,"[{'index': 2, 'value': 1.0}]",-0.898582,0.798243,-1.122096,"[{'index': 3, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
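
The `onehotencoded_*` columns above hold a sparse representation: instead of one 0/1 column per category, each row stores the index of the single active category and its value. The plain-Python sketch below shows one way such an encoding could be produced. The index numbering (categories sorted alphabetically and numbered from 1, with 0 kept for unseen values) is an assumption that happens to agree with the output shown, not a documented guarantee, and `fit_categories` and `encode` are hypothetical helpers.

```python
def fit_categories(values):
    """Map each distinct training value to a 1-based index (assumed numbering)."""
    return {value: i for i, value in enumerate(sorted(set(values)), start=1)}


def encode(value, categories):
    """Return the sparse one-hot form: a single index/value entry."""
    # Assumption: values not seen during fitting fall back to index 0.
    return [{"index": categories.get(value, 0), "value": 1.0}]


islands = ["Dream", "Biscoe", "Torgersen", "Dream"]
categories = fit_categories(islands)
encoded = [encode(v, categories) for v in islands]
print(categories)
print(encoded[0])
```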


### Predictors

Predictors are estimators that learn and make predictions. In addition to `.fit(...)`, the predictor implements a `.predict(...)` method, which will use what was learned during `.fit(...)` to predict some output.

Predictors can be further broken down into two categories:

#### Supervised predictors

Supervised learning is when we train a model on input-output pairs, and then ask it to predict the output for new inputs. An example of such a predictor is `bigframes.ml.linear_model.LinearRegression`.

In [25]:
from bigframes.ml.linear_model import LinearRegression

linreg = LinearRegression()

# Learn from the training data how to predict output y
linreg.fit(processed_X_train, y_train)

# Predict y for the test data
predicted_y_test = linreg.predict(processed_X_test)

predicted_y_test

penguin_id,predicted_body_mass_g
3,3394.118128
8,4048.685642
17,3976.454093
23,3541.582194
25,4032.844186
27,4118.351772
29,4087.767826
34,3183.755249
35,3418.802274
39,3519.186468
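
To make the fit/predict pattern concrete, here is a minimal plain-Python sketch of ordinary least squares with a single feature. It is not the BigQuery ML implementation, the helper names `fit_line` and `predict_line` are hypothetical, and the numbers are made up so that they lie exactly on a line.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def predict_line(model, xs):
    """Apply the learned slope and intercept to new inputs."""
    slope, intercept = model
    return [slope * x + intercept for x in xs]


# Made-up values that lie exactly on y = 50 * x - 6000
flipper_mm = [180.0, 190.0, 200.0, 210.0]
mass_g = [3000.0, 3500.0, 4000.0, 4500.0]

line_model = fit_line(flipper_mm, mass_g)
line_predictions = predict_line(line_model, [195.0, 220.0])
print(line_model, line_predictions)
```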


#### Unsupervised predictors

In unsupervised learning, there are no known outputs in the training data. Instead, the model learns from the input data alone and predicts something else. An example of an unsupervised predictor is `bigframes.ml.cluster.KMeans`, which learns how to fit input data to a target number of clusters.

In [26]:
from bigframes.ml.cluster import KMeans

kmeans = KMeans(n_clusters=4)

kmeans.fit(processed_X_train)

kmeans.predict(processed_X_test)

penguin_id,CENTROID_ID
3,3
8,3
17,3
23,1
25,3
27,3
29,3
34,3
35,1
39,3
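
Conceptually, fitting k-means finds cluster centres and predicting assigns each new row to its nearest centre. The sketch below uses the textbook Lloyd's algorithm in plain Python. It is only an illustration under assumptions: the starting centroids are picked by hand, labels are 0-based (unlike the `CENTROID_ID` values above), and nothing here reflects how BigQuery ML initialises or trains its model.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances))


def fit_kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for point in points:
            groups[nearest(point, centroids)].append(point)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            [sum(axis) / len(group) for axis in zip(*group)] if group else old
            for group, old in zip(groups, centroids)
        ]
    return centroids


train_points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
# Assumption: start from the first and last training points.
centroids = fit_kmeans(train_points, [train_points[0], train_points[-1]])
labels = [nearest(p, centroids) for p in [[1.0, 0.0], [9.0, 12.0]]]
print(centroids, labels)
```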


## Pipelines

Transformers and predictors can be chained into a single estimator component using `bigframes.ml.pipeline.Pipeline`:

In [27]:
from bigframes.ml.pipeline import Pipeline

pipeline = Pipeline([
  ('preproc', preproc),
  ('linreg', linreg)
])

# Print our pipeline
pipeline

Pipeline(steps=[('preproc',
                 ColumnTransformer(transformers=[('scale', StandardScaler(),
                                                  ['culmen_length_mm',
                                                   'culmen_depth_mm',
                                                   'flipper_length_mm']),
                                                 ('encode', OneHotEncoder(),
                                                  ['species', 'sex',
                                                   'island'])])),
                ('linreg', LinearRegression())])

The pipeline simplifies the workflow by applying each of its component steps automatically:

In [28]:
pipeline.fit(X_train, y_train)

predicted_y_test = pipeline.predict(X_test)
predicted_y_test

penguin_id,predicted_body_mass_g
3,3394.116212
8,4048.683645
17,3976.452358
23,3541.580346
25,4032.842027
27,4118.34983
29,4087.765797
34,3183.75379
35,3418.800633
39,3519.18471


In the backend, a pipeline will actually be compiled into a single model with an embedded TRANSFORM step.
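
Setting the backend compilation aside, the behaviour a pipeline exposes can be sketched in plain Python: `fit` fits and applies each transformer in order before fitting the final predictor, and `predict` replays the same transforms before predicting. `Center`, `Slope`, and `TinyPipeline` are hypothetical toy classes written for this sketch and are not part of `bigframes.ml`.

```python
class Center:
    """Toy transformer: subtracts the training mean."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class Slope:
    """Toy predictor: least-squares fit of y = k * x through the origin."""

    def fit(self, X, y):
        self.k_ = sum(a * b for a, b in zip(X, y)) / sum(a * a for a in X)
        return self

    def predict(self, X):
        return [self.k_ * x for x in X]


class TinyPipeline:
    """Chains (name, step) pairs; the last step is the predictor."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        *transformers, (_, final) = self.steps
        for _, step in transformers:
            X = step.fit(X, y).transform(X)
        final.fit(X, y)
        return self

    def predict(self, X):
        *transformers, (_, final) = self.steps
        for _, step in transformers:
            X = step.transform(X)
        return final.predict(X)


pipe = TinyPipeline([("center", Center()), ("slope", Slope())])
pipe.fit([1.0, 2.0, 3.0], [-2.0, 0.0, 2.0])
pipeline_predictions = pipe.predict([4.0])
print(pipeline_predictions)
```

The caller passes raw inputs to both `fit` and `predict`; the preprocessing learned during `fit` is applied automatically at prediction time.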

## Evaluating results

Some models include a convenient `.score(X, y)` method for evaluation with a preset set of metrics:

In [29]:
# In the case of a pipeline, this will be equivalent to calling .score on the contained LinearRegression
pipeline.score(X_test, y_test)

,mean_absolute_error,mean_squared_error,mean_squared_log_error,median_absolute_error,r2_score,explained_variance
0,229.48269,82962.794947,0.004248,206.728384,0.88633,0.892953


For a more general approach, the module `bigframes.ml.metrics` is provided:

In [30]:
from bigframes.ml.metrics import r2_score

r2_score(y_test, predicted_y_test)

0.8863300923278365
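
For reference, the R² score compares the model's squared error with that of always predicting the mean of the true values: `1 - SS_res / SS_tot`. The plain-Python sketch below computes it, along with the mean absolute error, on made-up numbers. The function names are hypothetical and this is not the `bigframes.ml.metrics` implementation.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values, not taken from the penguin results
actual = [3000.0, 4000.0, 5000.0]
predicted = [3100.0, 3900.0, 5000.0]

r2_value = r2(actual, predicted)
mae_value = mean_absolute_error(actual, predicted)
print(r2_value, mae_value)
```

A score of 1.0 means perfect predictions, while 0.0 means the model does no better than predicting the mean.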

## Save/Load to BigQuery

Estimators can be saved to BigQuery as BQML models and loaded again later.

Saving requires `bigquery.tables.create` permission, and loading requires `bigquery.models.getMetadata` permission.
These permissions can be at project level or the dataset level.

If you have those permissions, uncomment the code in the following cells and run it.

In [33]:
# # Replace with a path where you have permission to save a model
# model_name = "bigframes-dev.bqml_tutorial.penguins_model"

# linreg.to_gbq(model_name, replace=True)

Pipeline(steps=[('transform',
                 ColumnTransformer(transformers=[('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'island'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_length_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_depth_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'flipper_length_mm'),
                                              

In [34]:
# # WARNING - until b/281709360 is fixed & pipeline is updated, pipelines will load as models,
# # and details of their transform steps will be lost (the loaded model will behave the same)
# bigframes.pandas.read_gbq_model(model_name)

Pipeline(steps=[('transform',
                 ColumnTransformer(transformers=[('ont_hot_encoder',
                                                  OneHotEncoder(max_categories=1000001,
                                                                min_frequency=0),
                                                  'island'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_length_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'culmen_depth_mm'),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  'flipper_length_mm'),
                                              