# Using ML - ML fundamentals

The `bigframes.ml` module implements Scikit-Learn's machine learning API in
BigQuery DataFrames. It exposes BigQuery's ML capabilities in a simple, popular
API that works seamlessly with the rest of the BigQuery DataFrames API.

In [1]:
# Let's load some test data to use in this tutorial
import bigframes.pandas

df = bigframes.pandas.read_gbq("bigquery-public-data.ml_datasets.penguins")
df = df.dropna()

# Temporary workaround: let's name our index so it isn't lost. BigQuery DataFrames
# currently drops unnamed indexes when round-tripping through pandas, which
# some ML APIs do to route around missing functionality
df.index.name = "penguin_id"

df

penguin_id,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie Penguin (Pygoscelis adeliae),Biscoe,40.1,18.9,188.0,4300.0,MALE
1,Adelie Penguin (Pygoscelis adeliae),Torgersen,39.1,18.7,181.0,3750.0,MALE
2,Gentoo penguin (Pygoscelis papua),Biscoe,47.4,14.6,212.0,4725.0,FEMALE
3,Chinstrap penguin (Pygoscelis antarctica),Dream,42.5,16.7,187.0,3350.0,FEMALE
4,Adelie Penguin (Pygoscelis adeliae),Biscoe,43.2,19.0,197.0,4775.0,MALE
5,Gentoo penguin (Pygoscelis papua),Biscoe,46.7,15.3,219.0,5200.0,MALE
6,Adelie Penguin (Pygoscelis adeliae),Biscoe,41.3,21.1,195.0,4400.0,MALE
7,Gentoo penguin (Pygoscelis papua),Biscoe,45.2,13.8,215.0,4750.0,FEMALE
8,Gentoo penguin (Pygoscelis papua),Biscoe,46.5,13.5,210.0,4550.0,FEMALE
9,Gentoo penguin (Pygoscelis papua),Biscoe,50.5,15.2,216.0,5000.0,FEMALE


## Data split

Part of preparing data for a machine learning task is splitting it into subsets for training and testing, so that overfitting can be detected by evaluating the model on data it has not seen. Most commonly this is done with `bigframes.ml.model_selection.train_test_split`, like so:

In [2]:
# In this example, we're doing supervised learning, where we will learn to predict
# output variable `y` from input features `X`
X = df[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex', 'species']]
y = df[['body_mass_g']] 

from bigframes.ml.model_selection import train_test_split

# This will split X and y into test and training sets, with 20% of the rows in the test set,
# and the rest in the training set
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2)

# Show the shape of the data after the split
print(f"""X_train shape: {X_train.shape}
X_test shape: {X_test.shape}
y_train shape: {y_train.shape}
y_test shape: {y_test.shape}""")

X_train shape: (267, 6)
X_test shape: (67, 6)
y_train shape: (267, 1)
y_test shape: (67, 1)
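
The shapes above follow from the row count: 267 + 67 = 334 rows, and 20% of 334 is 66.8, which comes out as 67 test rows. As a rough illustration of what a random split does, here is a plain-Python sketch. The helper `split_indices` is hypothetical, and rounding the test count up is an assumption borrowed from scikit-learn's convention; it is not taken from the bigframes implementation.

```python
import math
import random


def split_indices(n_rows, test_size, seed=0):
    """Shuffle row positions and cut them into (train, test) lists."""
    # Assumption: the test count is rounded up, as scikit-learn does
    n_test = math.ceil(n_rows * test_size)
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    return positions[n_test:], positions[:n_test]


train_idx, test_idx = split_indices(334, test_size=0.2)
print(len(train_idx), len(test_idx))  # 267 67
```

Every row lands in exactly one of the two subsets, which is why `X_test` and `y_test` can be matched up row by row through their shared index.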


In [3]:
# If we look at the data, we can see that random rows were selected for
# each side of the split
X_test.head(5)

penguin_id,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,sex,species
249,Torgersen,41.1,18.6,189.0,MALE,Adelie Penguin (Pygoscelis adeliae)
36,Biscoe,43.4,14.4,218.0,FEMALE,Gentoo penguin (Pygoscelis papua)
74,Biscoe,42.8,14.2,209.0,FEMALE,Gentoo penguin (Pygoscelis papua)
235,Dream,34.0,17.1,185.0,FEMALE,Adelie Penguin (Pygoscelis adeliae)
117,Dream,37.8,18.1,193.0,MALE,Adelie Penguin (Pygoscelis adeliae)


In [4]:
# Note that this matches the rows in X_test
y_test.head(5)

penguin_id,body_mass_g
249,3325.0
36,4600.0
74,4700.0
235,3400.0
117,3750.0


## Estimators

Following Scikit-Learn, all learning components are "estimators": objects that can learn from training data and then apply themselves to new data. Estimators share the following patterns:

- a constructor that takes a list of parameters
- a standard string representation that shows the class name and all non-default parameters, e.g. `LinearRegression(fit_intercept=False)`
- a `.fit(...)` method to fit the estimator to training data
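
To make these three patterns concrete, here is a minimal plain-Python sketch of an estimator. `MeanRegressor` is a hypothetical toy class invented for illustration; it is not part of `bigframes.ml`, and the real estimators are implemented differently.

```python
import inspect


class MeanRegressor:
    """Toy estimator (hypothetical): predicts the mean of the training targets."""

    def __init__(self, fit_intercept=True):
        # Constructor parameters are stored unchanged on the instance
        self.fit_intercept = fit_intercept

    def __repr__(self):
        # Show only the parameters that differ from the constructor defaults
        params = inspect.signature(type(self).__init__).parameters
        changed = [
            f"{name}={getattr(self, name)!r}"
            for name, param in params.items()
            if name != "self" and getattr(self, name) != param.default
        ]
        return f"{type(self).__name__}({', '.join(changed)})"

    def fit(self, X, y):
        # Learn from the training data; returning self follows scikit-learn's convention
        self.mean_ = sum(y) / len(y) if self.fit_intercept else 0.0
        return self


print(MeanRegressor())                     # MeanRegressor()
print(MeanRegressor(fit_intercept=False))  # MeanRegressor(fit_intercept=False)
```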

These estimators can be further broken down into two main subtypes:

### Transformers

Transformers are estimators that prepare data for consumption by other estimators ('preprocessing'). In addition to `.fit(...)`, a transformer implements a `.transform(...)` method, which applies a transformation based on what was computed during `.fit(...)`. With this pattern, preprocessing steps whose parameters are learned from the data can be applied consistently to both training and test/production data.

An example of a transformer is `bigframes.ml.preprocessing.StandardScaler`, which rescales a dataset to have a mean of zero and a standard deviation of one:

In [5]:
from bigframes.ml.preprocessing import StandardScaler

# StandardScaler will only work on numeric columns
numeric_columns = ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]

scaler = StandardScaler()
scaler.fit(X_train[numeric_columns])

# Now, StandardScaler should transform the numbers to have a mean of zero
# and a standard deviation of one:
scaler.transform(X_train[numeric_columns])

penguin_id,standard_scaled_culmen_length_mm,standard_scaled_culmen_depth_mm,standard_scaled_flipper_length_mm
0,-0.750505,0.84903,-0.937262
2,0.622496,-1.322402,0.804051
3,-0.299107,-0.261935,-1.009817
5,0.490839,-0.968913,1.311935
6,-0.524806,1.959995,-0.429379
7,0.208715,-1.726389,1.021716
9,1.205551,-1.019412,1.09427
10,0.772962,-0.817418,1.457044
12,1.243168,-1.120408,1.602153
14,-1.709725,0.344046,-0.792152


In [6]:
# We can then repeat this transformation on new data
scaler.transform(X_test[numeric_columns])

penguin_id,standard_scaled_culmen_length_mm,standard_scaled_culmen_depth_mm,standard_scaled_flipper_length_mm
1,-0.938587,0.748033,-1.445145
4,-0.16745,0.899528,-0.284269
8,0.453222,-1.877885,0.658942
11,-1.12667,0.697535,-0.792152
13,-1.183094,1.404513,-0.792152
15,0.867003,-0.766919,0.513833
16,-1.784958,1.959995,-0.211715
23,-0.355532,0.647036,-1.5177
34,-0.600039,-1.776888,0.949161
36,-0.129833,-1.423399,1.23938
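
The key point of the fit/transform split is that the mean and standard deviation are computed once, from the training data, and then reused unchanged on any other data. The sketch below shows this for a single column in plain Python. `TinyScaler` is a hypothetical class, the input values are a handful of `culmen_length_mm` readings from the table at the top, and the use of the population standard deviation is an assumption borrowed from scikit-learn rather than something the source states about BigQuery.

```python
import statistics


class TinyScaler:
    """Hypothetical single-column scaler illustrating fit/transform."""

    def fit(self, values):
        # Statistics are learned from the data passed to fit only
        self.mean_ = statistics.fmean(values)
        # Assumption: population standard deviation, as in scikit-learn
        self.scale_ = statistics.pstdev(values)
        return self

    def transform(self, values):
        # Reuse the stored statistics; nothing is re-estimated here
        return [(v - self.mean_) / self.scale_ for v in values]


train = [40.1, 47.4, 42.5, 46.7]
test = [39.1, 43.2]

tiny_scaler = TinyScaler().fit(train)
scaled_train = tiny_scaler.transform(train)
scaled_test = tiny_scaler.transform(test)
```

The scaled training values have mean zero and unit standard deviation by construction. The scaled test values generally do not, because they are measured against the training statistics.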


#### Composing transformers

To process data where different columns need different preprocessors, `bigframes.ml.compose.ColumnTransformer` can be employed:

In [7]:
from bigframes.ml.compose import ColumnTransformer
from bigframes.ml.preprocessing import OneHotEncoder

# Create an aggregate transform that applies StandardScaler to the numeric columns,
# and OneHotEncoder to the string columns
preproc = ColumnTransformer([
    ("scale", StandardScaler(), ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]),
    ("encode", OneHotEncoder(), ["species", "sex", "island"])])

# Now we can fit all columns of the training data
preproc.fit(X_train)

processed_X_train = preproc.transform(X_train)
processed_X_test = preproc.transform(X_test)

processed_X_train

penguin_id,onehotencoded_island,standard_scaled_culmen_length_mm,standard_scaled_culmen_depth_mm,standard_scaled_flipper_length_mm,onehotencoded_sex,onehotencoded_species
0,"[{'index': 1, 'value': 1.0}]",-0.750505,0.84903,-0.937262,"[{'index': 2, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
2,"[{'index': 1, 'value': 1.0}]",0.622496,-1.322402,0.804051,"[{'index': 1, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
3,"[{'index': 2, 'value': 1.0}]",-0.299107,-0.261935,-1.009817,"[{'index': 1, 'value': 1.0}]","[{'index': 2, 'value': 1.0}]"
5,"[{'index': 1, 'value': 1.0}]",0.490839,-0.968913,1.311935,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
6,"[{'index': 1, 'value': 1.0}]",-0.524806,1.959995,-0.429379,"[{'index': 2, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
7,"[{'index': 1, 'value': 1.0}]",0.208715,-1.726389,1.021716,"[{'index': 1, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
9,"[{'index': 1, 'value': 1.0}]",1.205551,-1.019412,1.09427,"[{'index': 1, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
10,"[{'index': 1, 'value': 1.0}]",0.772962,-0.817418,1.457044,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
12,"[{'index': 1, 'value': 1.0}]",1.243168,-1.120408,1.602153,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
14,"[{'index': 1, 'value': 1.0}]",-1.709725,0.344046,-0.792152,"[{'index': 1, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
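
The one-hot columns in this output are stored sparsely: each cell lists only the non-zero slot as an `index`/`value` pair. Judging from the rows shown in this tutorial, categories appear to be numbered from 1 in sorted order (for example Biscoe, Dream, Torgersen map to 1, 2, 3). The plain-Python sketch below reproduces that layout. Both helper functions are hypothetical, and the numbering rule is inferred from the displayed output, not from the library's documentation.

```python
def fit_categories(values):
    """Map each distinct category to a 1-based index, in sorted order."""
    # Assumption inferred from the output above: indexes start at 1
    return {cat: i for i, cat in enumerate(sorted(set(values)), start=1)}


def encode(value, categories):
    """Return the sparse one-hot form: only the non-zero slot is listed."""
    return [{"index": categories[value], "value": 1.0}]


islands = fit_categories(["Biscoe", "Torgersen", "Dream", "Biscoe"])
print(encode("Dream", islands))  # [{'index': 2, 'value': 1.0}]
```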


### Predictors

Predictors are estimators that learn and make predictions. In addition to `.fit(...)`, the predictor implements a `.predict(...)` method, which will use what was learned during `.fit(...)` to predict some output.

Predictors can be further broken down into two categories:

#### Supervised predictors

Supervised learning means training a model on input-output pairs and then asking it to predict the output for new inputs. An example of such a predictor is `bigframes.ml.linear_model.LinearRegression`.

In [8]:
from bigframes.ml.linear_model import LinearRegression

linreg = LinearRegression()

# Learn from the training data how to predict output y
linreg.fit(processed_X_train, y_train)

# Predict y for the test data
predicted_y_test = linreg.predict(processed_X_test)

predicted_y_test

penguin_id,predicted_body_mass_g,onehotencoded_island,standard_scaled_culmen_length_mm,standard_scaled_culmen_depth_mm,standard_scaled_flipper_length_mm,onehotencoded_sex,onehotencoded_species
1,3781.402407,"[{'index': 3, 'value': 1.0}]",-0.938587,0.748033,-1.445145,"[{'index': 2, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
4,4124.107944,"[{'index': 1, 'value': 1.0}]",-0.16745,0.899528,-0.284269,"[{'index': 2, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
8,4670.344196,"[{'index': 1, 'value': 1.0}]",0.453222,-1.877885,0.658942,"[{'index': 1, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
11,3529.417214,"[{'index': 2, 'value': 1.0}]",-1.12667,0.697535,-0.792152,"[{'index': 1, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
13,4014.101714,"[{'index': 1, 'value': 1.0}]",-1.183094,1.404513,-0.792152,"[{'index': 2, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
15,5212.41288,"[{'index': 1, 'value': 1.0}]",0.867003,-0.766919,0.513833,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
16,4163.595615,"[{'index': 3, 'value': 1.0}]",-1.784958,1.959995,-0.211715,"[{'index': 2, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
23,3392.453069,"[{'index': 2, 'value': 1.0}]",-0.355532,0.647036,-1.5177,"[{'index': 1, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
34,4698.305397,"[{'index': 1, 'value': 1.0}]",-0.600039,-1.776888,0.949161,"[{'index': 1, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
36,4828.226949,"[{'index': 1, 'value': 1.0}]",-0.129833,-1.423399,1.23938,"[{'index': 1, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
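
To illustrate what "learning from input-output pairs" means in the simplest case, the sketch below fits a straight line with ordinary least squares using a single feature. It is plain Python, not the BigQuery ML implementation, and the helper names are hypothetical. The data are the `flipper_length_mm` and `body_mass_g` values of the first five penguins in the table at the top of this tutorial.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


flipper = [188.0, 181.0, 212.0, 187.0, 197.0]
mass = [4300.0, 3750.0, 4725.0, 3350.0, 4775.0]

# "fit": learn the parameters from the input-output pairs
slope, intercept = fit_line(flipper, mass)


def predict_mass(flipper_length):
    """"predict": apply the learned parameters to a new input."""
    return slope * flipper_length + intercept
```

With only five rows and one feature the fit is crude, but the shape is the same as in the cell above: parameters are estimated in `fit` and reused in `predict`.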


#### Unsupervised predictors

In unsupervised learning, there are no known outputs in the training data. Instead, the model learns from the input data alone and predicts some derived property of it, such as a cluster assignment. An example of an unsupervised predictor is `bigframes.ml.cluster.KMeans`, which learns how to fit input data to a target number of clusters.

In [9]:
from bigframes.ml.cluster import KMeans

kmeans = KMeans(n_clusters=4)

kmeans.fit(processed_X_train)

kmeans.predict(processed_X_test)

penguin_id,CENTROID_ID,NEAREST_CENTROIDS_DISTANCE,onehotencoded_island,standard_scaled_culmen_length_mm,standard_scaled_culmen_depth_mm,standard_scaled_flipper_length_mm,onehotencoded_sex,onehotencoded_species
1,3,"[{'CENTROID_ID': 3, 'DISTANCE': 1.236380597035...","[{'index': 3, 'value': 1.0}]",-0.938587,0.748033,-1.445145,"[{'index': 2, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
4,3,"[{'CENTROID_ID': 3, 'DISTANCE': 1.039497631856...","[{'index': 1, 'value': 1.0}]",-0.16745,0.899528,-0.284269,"[{'index': 2, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
8,1,"[{'CENTROID_ID': 1, 'DISTANCE': 1.171040485975...","[{'index': 1, 'value': 1.0}]",0.453222,-1.877885,0.658942,"[{'index': 1, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
11,2,"[{'CENTROID_ID': 2, 'DISTANCE': 0.969102754012...","[{'index': 2, 'value': 1.0}]",-1.12667,0.697535,-0.792152,"[{'index': 1, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
13,3,"[{'CENTROID_ID': 3, 'DISTANCE': 1.113138945949...","[{'index': 1, 'value': 1.0}]",-1.183094,1.404513,-0.792152,"[{'index': 2, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
15,1,"[{'CENTROID_ID': 1, 'DISTANCE': 1.070996026772...","[{'index': 1, 'value': 1.0}]",0.867003,-0.766919,0.513833,"[{'index': 2, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
16,3,"[{'CENTROID_ID': 3, 'DISTANCE': 1.780136190720...","[{'index': 3, 'value': 1.0}]",-1.784958,1.959995,-0.211715,"[{'index': 2, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
23,2,"[{'CENTROID_ID': 2, 'DISTANCE': 1.382540667483...","[{'index': 2, 'value': 1.0}]",-0.355532,0.647036,-1.5177,"[{'index': 1, 'value': 1.0}]","[{'index': 1, 'value': 1.0}]"
34,1,"[{'CENTROID_ID': 1, 'DISTANCE': 1.598627908302...","[{'index': 1, 'value': 1.0}]",-0.600039,-1.776888,0.949161,"[{'index': 1, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
36,1,"[{'CENTROID_ID': 1, 'DISTANCE': 1.095162305190...","[{'index': 1, 'value': 1.0}]",-0.129833,-1.423399,1.23938,"[{'index': 1, 'value': 1.0}]","[{'index': 3, 'value': 1.0}]"
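
The `CENTROID_ID` and `NEAREST_CENTROIDS_DISTANCE` columns can be read as the result of a nearest-centroid lookup: each row is compared against every learned centroid and assigned to the closest one. The plain-Python sketch below shows that prediction step only. The centroids are made up, the function is hypothetical, and Euclidean distance is an assumption; how the real model measures distance over one-hot encoded columns is not shown in this tutorial.

```python
import math


def nearest_centroids(point, centroids):
    """Return (centroid_id, distance) pairs sorted from nearest to farthest."""
    distances = [
        (centroid_id, math.dist(point, center))
        for centroid_id, center in centroids.items()
    ]
    return sorted(distances, key=lambda pair: pair[1])


# Hypothetical centroids in a 2-D feature space
centroids = {1: (0.0, 0.0), 2: (3.0, 4.0)}

ranked = nearest_centroids((0.0, 1.0), centroids)
centroid_id = ranked[0][0]  # the predicted cluster is the nearest centroid
```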


## Pipelines

Transformers and predictors can be chained into a single estimator component using `bigframes.ml.pipeline.Pipeline`:

In [10]:
from bigframes.ml.pipeline import Pipeline

pipeline = Pipeline([
  ('preproc', preproc),
  ('linreg', linreg)
])

# Print our pipeline
pipeline

Pipeline(steps=[('preproc',
                 ColumnTransformer(transformers=[('scale', StandardScaler(),
                                                  ['culmen_length_mm',
                                                   'culmen_depth_mm',
                                                   'flipper_length_mm']),
                                                 ('encode', OneHotEncoder(),
                                                  ['species', 'sex',
                                                   'island'])])),
                ('linreg', LinearRegression())])

The pipeline simplifies the workflow by applying each of its component steps automatically:

In [11]:
pipeline.fit(X_train, y_train)

predicted_y_test = pipeline.predict(X_test)
predicted_y_test

penguin_id,predicted_body_mass_g,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,sex,species
1,3781.396682,Torgersen,39.1,18.7,181.0,MALE,Adelie Penguin (Pygoscelis adeliae)
4,4124.102574,Biscoe,43.2,19.0,197.0,MALE,Adelie Penguin (Pygoscelis adeliae)
8,4670.338389,Biscoe,46.5,13.5,210.0,FEMALE,Gentoo penguin (Pygoscelis papua)
11,3529.411644,Dream,38.1,18.6,190.0,FEMALE,Adelie Penguin (Pygoscelis adeliae)
13,4014.09632,Biscoe,37.8,20.0,190.0,MALE,Adelie Penguin (Pygoscelis adeliae)
15,5212.407319,Biscoe,48.7,15.7,208.0,MALE,Gentoo penguin (Pygoscelis papua)
16,4163.590502,Torgersen,34.6,21.1,198.0,MALE,Adelie Penguin (Pygoscelis adeliae)
23,3392.44731,Dream,42.2,18.5,180.0,FEMALE,Adelie Penguin (Pygoscelis adeliae)
34,4698.299674,Biscoe,40.9,13.7,214.0,FEMALE,Gentoo penguin (Pygoscelis papua)
36,4828.221398,Biscoe,43.4,14.4,218.0,FEMALE,Gentoo penguin (Pygoscelis papua)


In the backend, a pipeline will actually be compiled into a single model with an embedded TRANSFORM step.
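
Conceptually, a pipeline's `fit` fits and applies each transformer in turn before fitting the final predictor, and `predict` replays the same transformations before predicting. The plain-Python sketch below shows that control flow with toy components. All three classes are hypothetical, and this is not how BigQuery compiles the pipeline; it only illustrates the calling pattern.

```python
class Center:
    """Toy transformer: subtracts the training mean."""

    def fit(self, X):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class SlopeThroughOrigin:
    """Toy predictor: least-squares line through the origin."""

    def fit(self, X, y):
        self.slope_ = sum(x * t for x, t in zip(X, y)) / sum(x * x for x in X)
        return self

    def predict(self, X):
        return [self.slope_ * x for x in X]


class TinyPipeline:
    """Every step but the last must offer fit/transform; the last fit/predict."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def fit(self, X, y=None):
        for _, transformer in self.steps[:-1]:
            X = transformer.fit(X).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, transformer in self.steps[:-1]:
            X = transformer.transform(X)
        return self.steps[-1][1].predict(X)


pipe = TinyPipeline([("center", Center()), ("line", SlopeThroughOrigin())])
pipe.fit([1.0, 2.0, 3.0], [-2.0, 0.0, 2.0])
print(pipe.predict([4.0]))  # [4.0]
```

The caller passes raw data to both `fit` and `predict`, which is why `pipeline.predict(X_test)` above takes `X_test` rather than `processed_X_test`.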

## Evaluating results

Some models include a convenient `.score(X, y)` method for evaluation with a preset collection of metrics:

In [12]:
# In the case of a pipeline, this will be equivalent to calling .score on the contained LinearRegression
pipeline.score(X_test, y_test)

,mean_absolute_error,mean_squared_error,mean_squared_log_error,median_absolute_error,r2_score,explained_variance
0,216.444357,72639.698707,0.00463,170.588356,0.896396,0.900547


For a more general approach, the `bigframes.ml.metrics` module is provided:

In [14]:
from bigframes.ml.metrics import r2_score

r2_score(y_test, predicted_y_test["predicted_body_mass_g"])

0.8963962044533755
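
For reference, the R² score is the standard coefficient of determination, 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The plain-Python sketch below computes it on made-up numbers. The function `r2` is a hypothetical stand-in written for illustration, not the `bigframes.ml.metrics` implementation.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only
actual = [3000.0, 4000.0, 5000.0]
predicted = [3100.0, 4000.0, 4900.0]
score = r2(actual, predicted)  # 1 - 20000 / 2000000, i.e. about 0.99
```

A score of 1.0 means perfect predictions, and 0.0 means the model does no better than always predicting the mean of the true values.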

## Save/Load to BigQuery

Estimators can be saved to BigQuery as BQML models, and loaded again in future.

Saving requires `bigquery.tables.create` permission, and loading requires `bigquery.models.getMetadata` permission.
These permissions can be at project level or the dataset level.

If you have those permissions, uncomment the code in the following cells and run it.

In [15]:
# # Replace with a path where you have permission to save a model
# model_name = "bigframes-dev.bqml_tutorial.penguins_model"

# linreg.to_gbq(model_name, replace=True)

In [16]:
# # WARNING - until b/281709360 is fixed & pipeline is updated, pipelines will load as models,
# # and details of their transform steps will be lost (the loaded model will behave the same)
# bigframes.pandas.read_gbq_model(model_name)