# Intermediate Machine Learning with scikit-learn

When you have explored a series of steps that is useful for your task, or for a related family of tasks, you would like to package those steps in a less ad hoc and more reusable way.  Wrapping a set of steps in a simple factory function is certainly not difficult.  But for the most flexibility, it is best to take advantage of scikit-learn **pipelines**.

In [None]:
# Some libraries tend to be in flux for their dependency versions
import warnings
warnings.simplefilter("ignore")

## Imperative Sequential Processing

Let us repeat a similar construction of building and training a model as we have seen previously.  Here we carry out the steps imperatively, in the sequence we worked out in previous lessons.

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


### Load the data

First load the data; this step will not become part of the pipeline, since how the initial data is loaded depends on the use case.  We will use a cross-validation score rather than do a train/test split up front.  In the next lesson we will see how this can be more robust than a simple train/test split.

In [None]:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

# From here on, we refer to features and target by the
# generic X, y rather than tie it to the dataset
X, y = cancer.data, cancer.target
X.shape

### Generate Synthetic Features

We think the model may perform better with polynomial features that get at the interactions of multiple features.

In [None]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(2)
X_poly = poly.fit_transform(X)
X_poly.shape
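To make the expansion concrete, here is a minimal sketch on a toy sample (the `toy` array is invented for illustration).  It also computes how many columns a degree-2 expansion produces for 30 input features, which is the feature count of the breast cancer dataset.

```python
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

# One toy sample with two features, a=2 and b=3 (illustrative values)
toy = np.array([[2.0, 3.0]])
expanded = PolynomialFeatures(2).fit_transform(toy)
print(expanded)  # columns are: 1, a, b, a^2, a*b, b^2

# For n input features at degree d, the expansion has C(n + d, d) columns
n_features, degree = 30, 2
n_expanded = comb(n_features + degree, degree)
print(n_expanded)
```

The count grows quickly with the degree, which is why a feature selection step follows later.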

### Scale the Data

Scale the data for better performance in subsequent models.

In [None]:
from sklearn.preprocessing import MinMaxScaler

# compute per-feature minimum and maximum (on the full dataset here,
# since we have not made a train/test split)
scaler = MinMaxScaler()
scaler.fit(X_poly)
# rescale each feature to the [0, 1] range
X_poly_scaled = scaler.transform(X_poly)
X_poly_scaled.shape
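As a small sketch of what the scaler computes, the following compares `MinMaxScaler` with the formula `(x - min) / (max - min)` applied by hand, column by column.  The `toy` array is an invented example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data: three samples, two features on very different scales
toy = np.array([[1.0, 10.0],
                [2.0, 30.0],
                [4.0, 20.0]])

scaled = MinMaxScaler().fit_transform(toy)

# The same result by hand, using per-column minimum and maximum
col_min, col_max = toy.min(axis=0), toy.max(axis=0)
by_hand = (toy - col_min) / (col_max - col_min)

print(scaled)
print(np.allclose(scaled, by_hand))
```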

### Select Most Important Engineered Features

Since we have increased the number of features to an unwieldy number, let us select only the most important few.

In [None]:
from sklearn.feature_selection import SelectPercentile

select = SelectPercentile(percentile=20)
select.fit(X_poly_scaled, y)
X_selected = select.transform(X_poly_scaled)
X_selected.shape
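A minimal sketch of the selection step on synthetic data may help.  By default `SelectPercentile` scores each feature with the ANOVA F-test (`f_classif`) and keeps the top percentile.  The data and the target rule below are invented so that only one column is informative.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile

rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 10))
# Hypothetical target that depends only on column 3
y_toy = (X_toy[:, 3] > 0).astype(int)

selector = SelectPercentile(percentile=20).fit(X_toy, y_toy)
kept = selector.get_support(indices=True)   # indices of surviving columns
X_kept = selector.transform(X_toy)

print(kept, X_kept.shape)
```

With 10 features and `percentile=20`, two columns survive, and the informative column is among them.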

### Test Feature Engineered Data Against Model

Having gone through those steps, we would like to see how our engineered dataset performs on a model that showed some success in earlier lessons.

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(max_depth=7, random_state=1)

In [None]:
from sklearn.metrics import f1_score, make_scorer
scorer = make_scorer(f1_score)

from sklearn.model_selection import KFold
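# random_state only matters with shuffle=True; recent scikit-learn
# versions raise an error if it is set without shuffling
kf = KFold(5)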

In [None]:
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(rfc, X_selected, y, scoring=scorer, cv=kf)
print(" CV scores:", cv_scores)
print("Mean score:", np.mean(cv_scores))
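For intuition, `cross_val_score` does roughly what the explicit loop below does: clone the model, fit it on each training fold, and score it on the held-out fold.  This is a sketch rather than the library's implementation; `n_estimators=20` is an assumption chosen only to keep the example fast, and the variable names are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import KFold, cross_val_score

X_demo, y_demo = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=20, max_depth=7, random_state=1)
folds = KFold(5)

manual_scores = []
for train_idx, test_idx in folds.split(X_demo):
    fold_model = clone(model)                      # fresh, unfitted copy
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    predictions = fold_model.predict(X_demo[test_idx])
    manual_scores.append(f1_score(y_demo[test_idx], predictions))

auto_scores = cross_val_score(model, X_demo, y_demo,
                              scoring=make_scorer(f1_score), cv=folds)
print(np.allclose(manual_scores, auto_scores))
```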

As a sanity check, let us see how we would have performed on the raw data.  The engineered features get a moderate but significant improvement in F1 score over the raw data.

In [None]:
cv_scores = cross_val_score(rfc, X, y, scoring=scorer, cv=kf)
print("Raw data CV scores:", cv_scores)
print("    Raw mean score:", np.mean(cv_scores))

## Using Pipelines

A pipeline is simply an abstraction in scikit-learn to bundle together steps like those used above into a single model interface, following the same APIs as a model itself.  A particular pipeline is likely to be somewhat domain specific in that you may learn that those particular steps are useful for e.g. cancer data, but not as useful for data with very different characteristics.

<img src="img/pipeline-diagram.png" alt="Pipeline Illustration" width="75%"/>

Image credit (CC-BY-NA): [Karl Rosaen](http://karlrosaen.com/ml/learning-log/2016-06-20/)

### Putting it together

We can easily construct a pipeline consisting of just those steps we want.  Each step is an instance that follows the scikit-learn estimator interface: transformers for the intermediate steps, and a final estimator such as a classifier.  When we instantiate the various classes, we are free to pass in parameters we know we will want; these likely reflect our previous exploration of the particular domain and its datasets.

In [None]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("Polynomial features", PolynomialFeatures(2)),
    ("MinMax scaling", MinMaxScaler()),
    ("Top 20% features", SelectPercentile(percentile=20)),
    ("Random Forest", RandomForestClassifier(max_depth=7)),
])
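To see what a pipeline bundles, the sketch below fits a small pipeline and then performs roughly the same work by hand: `fit_transform` on each intermediate step, `fit` on the final estimator, and `transform` followed by `predict` at prediction time.  A decision tree stands in for the random forest only to keep the example quick and deterministic; the step names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)

demo_pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("select", SelectPercentile(percentile=20)),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
])
demo_pipe.fit(X_demo, y_demo)

# Roughly the same steps, carried out one at a time
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_demo)
selector = SelectPercentile(percentile=20)
X_small = selector.fit_transform(X_scaled, y_demo)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_small, y_demo)

by_hand = tree.predict(selector.transform(scaler.transform(X_demo)))
same = np.array_equal(demo_pipe.predict(X_demo), by_hand)
print(same)
```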

Using the pipeline object is just like using a plain model, but all the preparation steps are bundled into a single object.

In [None]:
cv_scores = cross_val_score(pipe, 
                            X, y, 
                            scoring=make_scorer(f1_score), 
                            cv=KFold(5))

print(" Pipeline CV scores:", cv_scores)
print("Pipeline mean score:", np.mean(cv_scores))

We can recover (and even modify in place) the steps of a pipeline:

In [None]:
pipe.steps
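As a sketch of inspecting and modifying steps, the example below looks a step up by name with `named_steps` and changes one of its parameters with `set_params`, using the `<step name>__<parameter>` spelling.  The short step names here are illustrative, not the ones used in this lesson.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

demo_pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("forest", RandomForestClassifier(max_depth=7)),
])

# Step names, in order
step_names = [name for name, _ in demo_pipe.steps]
print(step_names)

# Look a step up by name and read one of its parameters
depth_before = demo_pipe.named_steps["forest"].max_depth

# Change that parameter in place on the pipeline
demo_pipe.set_params(forest__max_depth=5)
depth_after = demo_pipe.named_steps["forest"].max_depth
print(depth_before, depth_after)
```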

In [None]:
# We can serialize the pipeline for later use
from pickle import dump, load
# A context manager closes the file even if pickling fails
with open('data/cancer-pipeline.pkl', 'wb') as fh:
    dump(pipe, fh)

In [None]:
with open('data/cancer-pipeline.pkl', 'rb') as fh:
    newpipe = load(fh)
cv_scores = cross_val_score(newpipe, 
                            X, y, 
                            scoring=make_scorer(f1_score), 
                            cv=KFold(5))

print(" Pipeline CV scores:", cv_scores)
print("Pipeline mean score:", np.mean(cv_scores))
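Pickling also preserves a pipeline that has already been fitted.  The sketch below round-trips a small fitted pipeline through an in-memory pickle and checks that the restored copy predicts identically.  The stand-in pipeline is an assumption made to keep the example fast.  A pickle is generally only safe to load with the same scikit-learn version that wrote it.

```python
import pickle
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Small stand-in pipeline, fitted before serialization
fitted = make_pipeline(MinMaxScaler(),
                       DecisionTreeClassifier(max_depth=3, random_state=0))
fitted.fit(X_demo, y_demo)

# Serialize to bytes and restore, without touching the filesystem
restored = pickle.loads(pickle.dumps(fitted))
identical = np.array_equal(fitted.predict(X_demo), restored.predict(X_demo))
print(identical)
```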

In [None]:
pipe.fit(X, y)

In [None]:
pipe.predict(X)

#### A pipeline factory

There is a convenience function to create pipelines.  The only difference of interest from the class constructor is that names of steps are generated for you rather than being spelled out explicitly when you create a pipeline.  This is slightly more convenient but also takes away your option of giving steps more descriptive names.

In [None]:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(
    PolynomialFeatures(2),
    MinMaxScaler(),
    SelectPercentile(percentile=20),
    RandomForestClassifier(max_depth=7))
pipe.steps
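A short sketch of how the generated names look: each name is the lowercased class name, and when the same class appears more than once the names receive a numeric suffix.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

auto = make_pipeline(PolynomialFeatures(2), MinMaxScaler())
auto_names = [name for name, _ in auto.steps]
print(auto_names)

# Repeated step types are disambiguated with -1, -2, ...
repeated = make_pipeline(MinMaxScaler(), MinMaxScaler())
repeated_names = [name for name, _ in repeated.steps]
print(repeated_names)
```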

## Pipelines with Grid Search

A very nice feature of using a pipeline is that it "plays well" with a grid search.  In fact, you are not restricted to searching over the hyperparameters of the model step, but can also search over arguments to other steps in the pipeline.  Each key in the search grid is spelled `<step name>__<parameter name>`, so spelling is a little easier if we use the generated step names from `make_pipeline()`.

In [None]:
%%time
# Takes about a minute for this grid search
from sklearn.model_selection import GridSearchCV

params = {'polynomialfeatures__degree': [1, 2, 3],
          'selectpercentile__percentile': [10, 15, 20, 50],
          'randomforestclassifier__max_depth': [5, 7, 9],
          'randomforestclassifier__criterion': ['entropy', 'gini']}

grid = GridSearchCV(pipe, param_grid=params, cv=5)
grid.fit(X, y)

print("best cross-validation accuracy:", grid.best_score_)
print("best dataset score: ", grid.score(X, y))   # Overfitting against entire dataset
print("best parameters: ", grid.best_params_)
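To estimate the cost of a search before running it, the sketch below counts the candidate combinations with `ParameterGrid` and checks that every grid key is a valid parameter name of the pipeline.  It rebuilds the same pipeline and grid so that it stands alone; the `pipe_demo` name is illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectPercentile
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

pipe_demo = make_pipeline(
    PolynomialFeatures(2),
    MinMaxScaler(),
    SelectPercentile(percentile=20),
    RandomForestClassifier(max_depth=7))

params = {'polynomialfeatures__degree': [1, 2, 3],
          'selectpercentile__percentile': [10, 15, 20, 50],
          'randomforestclassifier__max_depth': [5, 7, 9],
          'randomforestclassifier__criterion': ['entropy', 'gini']}

# Every key must name a real parameter of the pipeline
valid = all(key in pipe_demo.get_params() for key in params)

# 3 * 4 * 3 * 2 candidates, each fitted once per fold, plus one final refit
n_candidates = len(ParameterGrid(params))
n_fits = n_candidates * 5 + 1
print(valid, n_candidates, n_fits)
```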

In [None]:
model = grid.best_estimator_
cv_scores = cross_val_score(model, 
                            X, y, 
                            scoring=make_scorer(f1_score), 
                            cv=KFold(5))

print(" Grid CV scores:", cv_scores)
print("Grid mean score:", np.mean(cv_scores))

The model we select as `.best_estimator_` is itself a pipeline; it simply has been re-parameterized from the original pipeline, using the grid search.

In [None]:
model.steps

We can also examine the relative success of all the parameter combinations.  As we saw in a prior lesson, `.cv_results_` contains much more information than the ranking alone.  Although the best estimator used the highest-degree polynomial features and the highest feature-selection percentile, the ranking of classifiers shows quite a bit of variation in all the parameters searched.  In particular, entirely different combinations perform only slightly worse in this example; they are close enough that the order among the top few might differ with different random seeds.

In [None]:
df_grid = pd.DataFrame(grid.cv_results_).set_index('rank_test_score').sort_index()
df_params = df_grid.loc[:,df_grid.columns.str.contains('param_')]
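# Keep the parameter name after the "<step>__" prefix (e.g. max_depth);
# splitting on a single underscore would shorten max_depth to depth
cols = [c.split('__')[-1] for c in df_params.columns]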
df_params.columns = cols
df_params.head(10)
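The parameter columns are easier to interpret next to the scores.  As a sketch on a deliberately tiny search (a single decision tree parameter, chosen only for speed), the following keeps the mean and standard deviation of the test score alongside the parameter and the rank.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Tiny illustrative grid: three candidate depths, five folds each
small_grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                          param_grid={"max_depth": [2, 4, 6]}, cv=5)
small_grid.fit(X_demo, y_demo)

results = pd.DataFrame(small_grid.cv_results_)
columns = ["param_max_depth", "mean_test_score",
           "std_test_score", "rank_test_score"]
summary = results[columns].sort_values("rank_test_score")
print(summary)
```

Comparing `std_test_score` with the gaps between `mean_test_score` values gives a rough sense of whether the ranking among the top candidates is meaningful.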

## Next lesson

**Robust train/test splits**: In this lesson we looked at the very useful pipeline interface provided by scikit-learn.  Using pipelines greatly aids in making models and processing steps reproducible and easy to distribute to colleagues (or to yourself for later projects).

The next and final lesson of this course, on robust train/test splits, will look at a variety of capabilities in scikit-learn for performing divisions between training and validation data that go beyond the basic function we used in most of these lessons.

<a href="TrainTest.ipynb"><img src="img/open-notebook.png" align="left"/></a>