# Using ML - ML fundamentals

The `bigframes.ml` module implements Scikit-Learn's machine learning API in BigFrames. It exposes BigQuery's ML capabilities in a simple, popular API that works seamlessly with both BigFrames and BigQuery.

In [1]:
# Let's load some test data to use in this tutorial
import bigframes
session = bigframes.connect()

df = session.read_gbq("bigquery-public-data.ml_datasets.penguins")
df = df.dropna()

# Temporary workaround: let's name our index so it isn't lost.
# BigFrames currently drops unnamed indexes when round-tripping through
# pandas, which some ML APIs do to route around missing functionality.
df.index.name = "penguin_id"

df

penguin_id,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie Penguin (Pygoscelis adeliae),Dream,36.6,18.4,184.0,3475.0,FEMALE
1,Adelie Penguin (Pygoscelis adeliae),Dream,39.8,19.1,184.0,4650.0,MALE
2,Adelie Penguin (Pygoscelis adeliae),Dream,40.9,18.9,184.0,3900.0,MALE
3,Chinstrap penguin (Pygoscelis antarctica),Dream,46.5,17.9,192.0,3500.0,FEMALE
4,Adelie Penguin (Pygoscelis adeliae),Dream,37.3,16.8,192.0,3000.0,FEMALE
5,Adelie Penguin (Pygoscelis adeliae),Dream,43.2,18.5,192.0,4100.0,MALE
6,Chinstrap penguin (Pygoscelis antarctica),Dream,46.9,16.6,192.0,2700.0,FEMALE
7,Chinstrap penguin (Pygoscelis antarctica),Dream,50.5,18.4,200.0,3400.0,FEMALE
8,Chinstrap penguin (Pygoscelis antarctica),Dream,49.5,19.0,200.0,3800.0,MALE
9,Adelie Penguin (Pygoscelis adeliae),Dream,40.2,20.1,200.0,3975.0,MALE


## Data split

Part of preparing data for a machine learning task is splitting it into subsets for training and testing, so that you can check whether the model is overfitting. Most commonly this is done with `bigframes.ml.model_selection.train_test_split`, like so:

In [2]:
# In this example, we're doing supervised learning, where we will learn to predict
# output variable `y` from input features `X`
X = df[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex', 'species']]
y = df[['body_mass_g']] 

from bigframes.ml.model_selection import train_test_split

# This will split X and y into test and training sets, with 20% of the rows in the test set,
# and the rest in the training set
train_X, test_X, train_y, test_y = train_test_split(
  X, y, test_size=0.2)

# Show the shape of the data after the split
print(f"""train_X shape: {train_X.shape}
test_X shape: {test_X.shape}
train_y shape: {train_y.shape}
test_y shape: {test_y.shape}""")

train_X shape: (267, 6)
test_X shape: (67, 6)
train_y shape: (267, 1)
test_y shape: (67, 1)


In [3]:
# If we look at the data, we can see that random rows were selected for
# each side of the split
test_X.head(5)

penguin_id,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,sex,species
234,Biscoe,44.9,13.3,213.0,FEMALE,Gentoo penguin (Pygoscelis papua)
15,Dream,37.0,16.5,185.0,FEMALE,Adelie Penguin (Pygoscelis adeliae)
103,Dream,40.7,17.0,190.0,MALE,Adelie Penguin (Pygoscelis adeliae)
7,Dream,50.5,18.4,200.0,FEMALE,Chinstrap penguin (Pygoscelis antarctica)
25,Dream,52.0,18.1,201.0,MALE,Chinstrap penguin (Pygoscelis antarctica)


In [4]:
# Note that this matches the rows in test_X
test_y.head(5)

penguin_id,body_mass_g
234,5100.0
15,3400.0
103,3725.0
7,3400.0
25,4050.0
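
The reason `test_X` and `test_y` line up row for row can be sketched in plain Python: one shuffle of row positions is applied to both inputs. This is only an illustration of the idea, not how `bigframes.ml` implements the split. The helper `toy_train_test_split`, its `seed` parameter, the rounding rule, and the data are all assumptions made for the example.

```python
import math
import random


def toy_train_test_split(features, targets, test_size=0.2, seed=0):
    """Split two row-aligned lists using one shared shuffle of row positions."""
    if len(features) != len(targets):
        raise ValueError("features and targets must have the same number of rows")
    positions = list(range(len(features)))
    random.Random(seed).shuffle(positions)
    # Assumption: the test set size is rounded up to a whole number of rows
    n_test = math.ceil(len(positions) * test_size)
    test_pos, train_pos = positions[:n_test], positions[n_test:]

    def take(rows, chosen):
        return [rows[i] for i in chosen]

    return (take(features, train_pos), take(features, test_pos),
            take(targets, train_pos), take(targets, test_pos))


# Made-up rows: each target is the sum of its two features
toy_X = [[float(i), float(i) * 2] for i in range(10)]
toy_y = [row[0] + row[1] for row in toy_X]
toy_train_X, toy_test_X, toy_train_y, toy_test_y = toy_train_test_split(toy_X, toy_y)

print(len(toy_train_X), len(toy_test_X))
# Every test target still belongs to the feature row at the same position
print(all(y == x[0] + x[1] for x, y in zip(toy_test_X, toy_test_y)))
```

Because the same positions are used for both lists, a feature row and its target can never end up on different sides of the split.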


## Estimators

Following Scikit-Learn, all learning components are "estimators": objects that can learn from training data and then apply themselves to new data. Estimators share the following patterns:

- a constructor that takes a list of parameters
- a standard string representation that shows the class name and all non-default parameters, e.g. `LinearRegression(fit_intercept=False)`
- a `.fit(...)` method to fit the estimator to training data
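
The three conventions above can be sketched with a toy class. `ToyEstimator` is hypothetical and not part of `bigframes.ml`; it only mirrors the shape of the pattern (constructor parameters, a representation that lists non-default parameters, and a `fit` method that returns the estimator).

```python
import inspect


class ToyEstimator:
    """Toy estimator that learns a single intercept from training values."""

    def __init__(self, fit_intercept=True):
        # Constructor parameters are stored unchanged as attributes
        self.fit_intercept = fit_intercept

    def __repr__(self):
        # List only the parameters whose value differs from the constructor default
        params = inspect.signature(type(self).__init__).parameters
        changed = [
            f"{name}={getattr(self, name)!r}"
            for name, param in params.items()
            if name != "self" and getattr(self, name) != param.default
        ]
        return f"{type(self).__name__}({', '.join(changed)})"

    def fit(self, values):
        # Learned state gets a trailing underscore, as in Scikit-Learn
        self.intercept_ = sum(values) / len(values) if self.fit_intercept else 0.0
        return self


print(ToyEstimator())
print(ToyEstimator(fit_intercept=False))
print(ToyEstimator().fit([1.0, 3.0]).intercept_)
```

Returning `self` from `fit` is what allows calls to be chained, as in the last line.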

These estimators can be further broken down into two main subtypes:

### Transformers

Transformers are estimators that are used to prepare data for consumption by other estimators ('preprocessing'). In addition to `.fit(...)`, a transformer implements a `.transform(...)` method, which applies a transformation based on what was computed during `.fit(...)`. With this pattern, dynamic preprocessing steps can be applied consistently to training data and to test or production data.

An example of a transformer is `bigframes.ml.preprocessing.StandardScaler`, which rescales a dataset to have a mean of zero and a standard deviation of one:

In [5]:
from bigframes.ml.preprocessing import StandardScaler

# StandardScaler will only work on numeric columns
numeric_columns = ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]

scaler = StandardScaler()
scaler.fit(train_X[numeric_columns])

# Now, StandardScaler should transform the numbers to have a mean of zero
# and a standard deviation of one:
scaler.transform(train_X[numeric_columns])

penguin_id,scaled_culmen_length_mm,scaled_culmen_depth_mm,scaled_flipper_length_mm
164,0.143385,-0.762996,1.214164
139,1.035787,-0.711952,1.141512
141,1.072212,-1.120299,1.141512
147,1.145061,-0.660909,1.72273
148,1.21791,-0.967169,1.72273
161,1.017575,-0.762996,1.214164
166,1.163273,-1.120299,1.795382
167,1.345396,-0.354649,1.795382
168,1.072212,-0.609866,1.795382
169,1.236122,-0.660909,1.795382


In [6]:
# We can then repeat this transformation on new data
scaler.transform(test_X[numeric_columns])

penguin_id,scaled_culmen_length_mm,scaled_culmen_depth_mm,scaled_flipper_length_mm
225,2.219585,-0.099432,2.013339
198,0.125173,0.053698,1.359469
238,1.126849,-0.558822,0.923555
268,1.108636,-0.201519,2.158643
267,0.890089,-0.609866,2.158643
202,0.544056,-0.967169,1.359469
226,1.290759,0.053698,2.013339
260,1.21791,-0.762996,1.577425
216,0.853664,-0.456736,1.432121
265,1.418245,-0.456736,2.158643
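
Why fit once and transform twice? A rough sketch of the arithmetic follows: the mean and standard deviation are computed from the training data only, then reused on new data. `ToyStandardScaler` and the numbers are made up for illustration. The sketch assumes the population standard deviation; which variant BigQuery ML uses is not stated here.

```python
from statistics import fmean, pstdev


class ToyStandardScaler:
    """Toy scaler for a single numeric column held in a list."""

    def fit(self, values):
        # Statistics are learned from the data passed to fit, and nothing else
        self.mean_ = fmean(values)
        self.std_ = pstdev(values)  # assumption: population standard deviation
        return self

    def transform(self, values):
        # Reuse the learned statistics without recomputing them
        return [(v - self.mean_) / self.std_ for v in values]


toy_train = [180.0, 190.0, 200.0, 210.0]  # made-up flipper lengths
toy_test = [195.0, 220.0]

toy_scaler = ToyStandardScaler().fit(toy_train)
scaled_train = toy_scaler.transform(toy_train)
scaled_test = toy_scaler.transform(toy_test)

print(fmean(scaled_train), pstdev(scaled_train))
print(scaled_test)
```

The transformed training data has mean zero and unit standard deviation, while the transformed test data generally does not, because it is scaled with the training statistics.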


#### Composing transformers

To process data where different columns need different preprocessors, `bigframes.ml.compose.ColumnTransformer` can be employed:

In [7]:
from bigframes.ml.compose import ColumnTransformer
from bigframes.ml.preprocessing import OneHotEncoder

# Create an aggregate transform that applies StandardScaler to the numeric columns,
# and OneHotEncoder to the string columns
preproc = ColumnTransformer([
    ("scale", StandardScaler(), ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]),
    ("encode", OneHotEncoder(), ["species", "sex", "island"])])

# Now we can fit all columns of the training data
preproc.fit(train_X)

processed_train_X = preproc.transform(train_X)
processed_test_X = preproc.transform(test_X)

processed_train_X

penguin_id,onehotencoded_island,scaled_culmen_length_mm,scaled_culmen_depth_mm,scaled_flipper_length_mm,onehotencoded_sex,onehotencoded_species
164,"[{'index': 1, 'value': 1.0}]",0.143385,-0.762996,1.214164,"[{'index': 0, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
139,"[{'index': 1, 'value': 1.0}]",1.035787,-0.711952,1.141512,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
141,"[{'index': 1, 'value': 1.0}]",1.072212,-1.120299,1.141512,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
147,"[{'index': 1, 'value': 1.0}]",1.145061,-0.660909,1.72273,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
148,"[{'index': 1, 'value': 1.0}]",1.21791,-0.967169,1.72273,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
161,"[{'index': 1, 'value': 1.0}]",1.017575,-0.762996,1.214164,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
166,"[{'index': 1, 'value': 1.0}]",1.163273,-1.120299,1.795382,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
167,"[{'index': 1, 'value': 1.0}]",1.345396,-0.354649,1.795382,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
168,"[{'index': 1, 'value': 1.0}]",1.072212,-0.609866,1.795382,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
169,"[{'index': 1, 'value': 1.0}]",1.236122,-0.660909,1.795382,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
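
The one-hot columns in the output above hold lists of index/value pairs rather than wide 0/1 columns. One plausible reading is a sparse encoding that stores only the non-zero position of each one-hot vector, which the following sketch illustrates. The index numbering here (sorted categories starting at 1, with 0 for unseen values) is an assumption made for the example; the actual assignment is decided by BigQuery ML. The function names are hypothetical.

```python
def fit_vocabulary(values):
    """Give each distinct category an index, starting at 1 (assumed numbering)."""
    return {category: i for i, category in enumerate(sorted(set(values)), start=1)}


def encode(value, vocabulary):
    """Return a sparse one-hot vector: only the non-zero entry is stored."""
    # Assumption: categories not seen during fitting fall back to index 0
    return [{"index": vocabulary.get(value, 0), "value": 1.0}]


toy_islands = ["Dream", "Biscoe", "Dream", "Torgersen"]
toy_vocab = fit_vocabulary(toy_islands)

print(toy_vocab)
print(encode("Dream", toy_vocab))
print(encode("Unknown island", toy_vocab))
```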


### Predictors

Predictors are estimators that learn and make predictions. In addition to `.fit(...)`, the predictor implements a `.predict(...)` method, which will use what was learned during `.fit(...)` to predict some output.

Predictors can be further broken down into two categories:

#### Supervised predictors

Supervised learning is when we train a model on input-output pairs, and then ask it to predict the output for new inputs. An example of such a predictor is `bigframes.ml.linear_model.LinearRegression`.

In [8]:
from bigframes.ml.linear_model import LinearRegression

linreg = LinearRegression()

# Learn from the training data how to predict output y
linreg.fit(processed_train_X, train_y)

# Predict y for the test data
predicted_test_y = linreg.predict(processed_test_X)

predicted_test_y

penguin_id,predicted_body_mass_g
176,4733.306444
117,3439.072734
309,4034.287495
236,4730.548497
322,3941.153831
107,3390.970008
63,3928.46871
20,3397.713678
25,4053.649981
123,4348.591721
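
To make the fit/predict split concrete, here is a minimal single-feature least-squares sketch in plain Python. `ToyLinearRegression` and the data are made up; the real `LinearRegression` handles many features and runs in BigQuery, so this shows only the general idea.

```python
from statistics import fmean


class ToyLinearRegression:
    """Ordinary least squares for one input feature."""

    def fit(self, x, y):
        x_mean, y_mean = fmean(x), fmean(y)
        covariance = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
        variance = sum((xi - x_mean) ** 2 for xi in x)
        # Learned parameters are stored on the estimator for later use
        self.slope_ = covariance / variance
        self.intercept_ = y_mean - self.slope_ * x_mean
        return self

    def predict(self, x):
        # Prediction uses only what was learned during fit
        return [self.intercept_ + self.slope_ * xi for xi in x]


# Made-up, perfectly linear data: 50 g of body mass per mm of flipper length
toy_flipper_mm = [180.0, 190.0, 200.0, 210.0]
toy_mass_g = [3500.0, 4000.0, 4500.0, 5000.0]

toy_model = ToyLinearRegression().fit(toy_flipper_mm, toy_mass_g)
toy_prediction = toy_model.predict([195.0])
print(toy_model.slope_, toy_model.intercept_, toy_prediction)
```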


#### Unsupervised predictors

In unsupervised learning, there are no known outputs in the training data. Instead, the model learns from the input data alone and predicts something else. An example of an unsupervised predictor is `bigframes.ml.cluster.KMeans`, which learns how to fit input data to a target number of clusters.

In [9]:
from bigframes.ml.cluster import KMeans

kmeans = KMeans(n_clusters=4)

kmeans.fit(processed_train_X)

kmeans.predict(processed_test_X)

penguin_id,CENTROID_ID
225,3
82,1
15,1
107,1
206,1
144,2
10,1
103,1
238,2
155,2
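
As a rough illustration of what "fitting input data to clusters" means, the sketch below runs a one-dimensional k-means loop in plain Python. It is a simplified stand-in, not the BigQuery ML algorithm: the helper names, the fixed starting centroids, and the data are assumptions, and cluster labels here start at 0 rather than at 1 as in the `CENTROID_ID` column above.

```python
from statistics import fmean


def nearest(value, centroids):
    """Return the position of the centroid closest to value."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))


def fit_toy_kmeans(values, initial_centroids, n_iter=10):
    """Alternate between assigning points and moving centroids to cluster means."""
    centroids = list(initial_centroids)
    for _ in range(n_iter):
        groups = [[] for _ in centroids]
        for v in values:
            groups[nearest(v, centroids)].append(v)
        # An empty cluster keeps its previous centroid
        centroids = [fmean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


toy_values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]  # made-up data with two obvious groups
toy_centroids = fit_toy_kmeans(toy_values, initial_centroids=[0.0, 6.0])

# "Predicting" means assigning new points to the nearest learned centroid
toy_labels = [nearest(v, toy_centroids) for v in [1.1, 4.9]]
print(toy_centroids, toy_labels)
```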


## Pipelines

Transformers and predictors can be chained into a single estimator component using `bigframes.ml.pipeline.Pipeline`:

In [10]:
from bigframes.ml.pipeline import Pipeline

pipeline = Pipeline([
  ('preproc', preproc),
  ('linreg', linreg)
])

# Print our pipeline
pipeline

Pipeline(steps=[('preproc',
                 ColumnTransformer(transformers=[('scale', StandardScaler(),
                                                  ['culmen_length_mm',
                                                   'culmen_depth_mm',
                                                   'flipper_length_mm']),
                                                 ('encode', OneHotEncoder(),
                                                  ['species', 'sex',
                                                   'island'])])),
                ('linreg', LinearRegression())])

The pipeline simplifies the workflow by applying each of its component steps automatically:

In [11]:
pipeline.fit(train_X, train_y)

predicted_test_y = pipeline.predict(test_X)
predicted_test_y

penguin_id,predicted_body_mass_g
225,5760.822218
198,5357.784351
238,5310.316536
268,5640.40225
267,5563.409525
202,5288.878646
226,5660.289322
260,5450.749037
216,5406.814023
265,5649.406601


In the backend, a pipeline will actually be compiled into a single model with an embedded TRANSFORM step.
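
From the caller's point of view, a pipeline behaves as if it ran its steps in order. The sketch below shows that call order in plain Python with hypothetical toy classes (`ToyCenter`, `ToyProportionalModel`, `ToyPipeline`). It does not reflect the compiled single-model form described above, only the behavior the pipeline is meant to reproduce.

```python
class ToyCenter:
    """Toy transformer: subtracts the training mean."""

    def fit(self, X):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class ToyProportionalModel:
    """Toy predictor: y = slope * x, fitted by least squares through the origin."""

    def fit(self, X, y):
        self.slope_ = sum(a * b for a, b in zip(X, y)) / sum(a * a for a in X)
        return self

    def predict(self, X):
        return [self.slope_ * x for x in X]


class ToyPipeline:
    def __init__(self, steps):
        # steps is a list of (name, estimator); the last estimator is the predictor
        self.steps = steps

    def fit(self, X, y):
        for _, transformer in self.steps[:-1]:
            X = transformer.fit(X).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # New data passes through the already-fitted transformers, then the predictor
        for _, transformer in self.steps[:-1]:
            X = transformer.transform(X)
        return self.steps[-1][1].predict(X)


toy_pipeline = ToyPipeline([("center", ToyCenter()), ("model", ToyProportionalModel())])
toy_pipeline.fit([1.0, 2.0, 3.0], [-2.0, 0.0, 2.0])
toy_pipeline_prediction = toy_pipeline.predict([4.0])
print(toy_pipeline_prediction)
```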

## Evaluating results

Some models include a convenient `.score()` method for evaluation with a preset collection of metrics:

In [12]:
# In the case of a pipeline, this will be equivalent to calling .score on the contained LinearRegression
pipeline.score(test_X, test_y)

Unnamed: 0,mean_absolute_error,mean_squared_error,mean_squared_log_error,median_absolute_error,r2_score,explained_variance
0,238.802892,91765.717649,0.005583,178.636467,0.853166,0.859504


For a more general approach, the `bigframes.ml.metrics` module is provided:

In [13]:
from bigframes.ml.metrics import r2_score

r2_score(test_y, predicted_test_y)

0.8531658892882623
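
For readers unfamiliar with the metric, the coefficient of determination is conventionally defined as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes that textbook definition in plain Python; `toy_r2_score` is a hypothetical helper, and the sketch does not claim to match the internals of `bigframes.ml.metrics.r2_score`.

```python
def toy_r2_score(y_true, y_pred):
    """Textbook R^2: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Perfect predictions score 1.0
print(toy_r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
# Always predicting the mean of y_true scores 0.0
print(toy_r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))
```

A score near 1 means the predictions explain most of the variance in the true values; a score of 0 is no better than always predicting the mean.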

## Save/Load to BigQuery

Estimators can be saved to BigQuery as BQML models and loaded again later:

In [14]:
# Replace with a path where you have permission to save a model
model_name = "bigframes-dev.bqml_tutorial.penguins_model"

linreg.to_gbq(model_name, replace=True)

LinearRegression()

In [15]:
# WARNING - until b/281709360 is fixed & pipeline is updated, pipelines will load as models,
# and details of their transform steps will be lost (the loaded model will behave the same)
session.read_gbq_model(model_name)

LinearRegression()