<img src='img/logo.png'>
<img src='img/title.png'>

# Pipelines

This notebook covers "pipelines": machine learning workflows expressed as a sequence of named steps, whose parameters can be varied in a grid or randomized search.

# Table of Contents
* [Pipelines](#Pipelines)
	* [Algorithm chains and pipelines](#Algorithm-chains-and-pipelines)
		* [Building pipelines](#Building-pipelines)
		* [Using pipelines in grid searches](#Using-pipelines-in-grid-searches)
			* [Another `Pipeline` example](#Another-Pipeline-example)
			* [And here it is with `Pipeline`](#And-here-it-is-with-Pipeline)
		* [The General Pipeline Interface](#The-General-Pipeline-Interface)
		* [`Pipeline` creation with ``make_pipeline``](#Pipeline-creation-with-make_pipeline)
			* [Accessing step attributes](#Accessing-step-attributes)
			* [Accessing attributes in grid-searched pipeline.](#Accessing-attributes-in-grid-searched-pipeline.)
		* [Grid-searching preprocessing steps and model parameters](#Grid-searching-preprocessing-steps-and-model-parameters)
* [Summary](#Summary)


In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

plt.rcParams['image.interpolation'] = "none"
np.set_printoptions(precision=3)

import src.mglearn as mglearn

## Algorithm chains and pipelines

The next few cells write out a workflow step by step with the tools we have covered so far: `MinMaxScaler` for scaling and `GridSearchCV` for model selection with cross-validation.

In [None]:
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# load and split the data
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# compute minimum and maximum on the training data
scaler = MinMaxScaler().fit(X_train)
# rescale training data
X_train_scaled = scaler.transform(X_train)

svm = SVC()
# learn an SVM on the scaled training data
svm.fit(X_train_scaled, y_train)
# scale test data and score the scaled data
X_test_scaled = scaler.transform(X_test)
svm.score(X_test_scaled, y_test)

Grid search with cross validation on the scaled data:

In [None]:
from sklearn.model_selection import GridSearchCV
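# for illustration only, don't use this code: X_train_scaled was scaled with
# statistics from all training rows, including each fold's validation split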
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(SVC(), param_grid=param_grid, cv=5)
grid.fit(X_train_scaled, y_train)
print("best cross-validation accuracy:", grid.best_score_)
print("test set score: ", grid.score(X_test_scaled, y_test))
print("best parameters: ", grid.best_params_)

In [None]:
mglearn.plots.plot_improper_processing()
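
The leak in the grid search above can be seen in a small sketch. The toy data, the index split, and the names `X_demo`, `fold_train`, `leaky`, and `clean` are made up for illustration. A scaler fit before splitting records statistics from rows that a cross-validation fold is supposed to hold out.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# hypothetical toy data: 20 rows, 3 features
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(20, 3))

# pretend rows 0-14 are a fold's training split and rows 15-19 its validation split
fold_train = np.arange(15)

# place an extreme value in one of the validation rows
X_demo[17, 0] = 50.0

# scaler fit before splitting: its statistics include the validation rows
leaky = MinMaxScaler().fit(X_demo)
# scaler fit inside the fold: it only sees the training split
clean = MinMaxScaler().fit(X_demo[fold_train])

print("max seen by leaky scaler:", leaky.data_max_[0])
print("max seen by clean scaler:", clean.data_max_[0])
```

The leaky scaler has already seen the extreme validation value, so the model trained on its output is evaluated on data that influenced preprocessing.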

### Building pipelines

The `Pipeline` built in the next cell expresses the logic of the cells above more compactly. It has two named steps, `scaler` and `svm`.

In [None]:
from sklearn.pipeline import Pipeline
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

In [None]:
# pipe has the same interface as SVC(), but fit() first fits and applies the scaler, then fits the SVM
pipe.fit(X_train, y_train)

In [None]:
pipe.score(X_test, y_test)
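
As a sanity check, the sketch below compares a pipeline against the same steps written out by hand. It assumes the same data split as earlier in the notebook; the `_demo` names are introduced here only to avoid clobbering notebook variables.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target, random_state=0)

# pipeline version: scaling and the SVM are fit together
pipe_demo = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
pipe_score = pipe_demo.fit(X_tr, y_tr).score(X_te, y_te)

# manual version: fit the scaler on training data, reuse it for the test data
scaler_demo = MinMaxScaler().fit(X_tr)
svm_demo = SVC().fit(scaler_demo.transform(X_tr), y_tr)
manual_score = svm_demo.score(scaler_demo.transform(X_te), y_te)

print("pipeline:", pipe_score)
print("manual:  ", manual_score)
```

Both routes should report the same accuracy, since `pipe.score` applies the fitted scaler to the test data before scoring.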

### Using pipelines in grid searches

This is the same kind of grid search we have done before, but we pass `pipe`, our `Pipeline` with two steps, as the `estimator` argument to `GridSearchCV`.

Keys in `param_grid` refer to a parameter of a named step as `<step name>__<parameter name>`, joined by a double underscore. For example, `svm__C` lists the values to try for the `C` regularization parameter of the step named `svm`.

In [None]:
param_grid = {'svm__C': [0.001, 0.01, 0.1, 1, 10, 100],
              'svm__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

In [None]:
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
print("best cross-validation accuracy:", grid.best_score_)
print("test set score: ", grid.score(X_test, y_test))
print("best parameters: ", grid.best_params_)
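
One way to check which `step__parameter` names a pipeline accepts is to list the keys of `get_params()`. The sketch below rebuilds the two-step pipeline under the hypothetical name `pipe_demo` and also sets a nested parameter directly with `set_params`.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe_demo = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

# every tunable name, including nested step parameters
param_names = sorted(pipe_demo.get_params().keys())
svm_params = [name for name in param_names if name.startswith("svm__")]
print(svm_params)

# the same double-underscore syntax works with set_params
pipe_demo.set_params(svm__C=10)
print(pipe_demo.named_steps["svm"].C)
```

A key in `param_grid` that does not appear in this list makes `GridSearchCV` raise an error when fitting.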

#### Another `Pipeline` example

Let's start with 100 rows and 10,000 columns of random data.

In [None]:
rnd = np.random.RandomState(seed=0)
X = rnd.normal(size=(100, 10000))
y = rnd.normal(size=(100,))

The first step is a `SelectPercentile` feature selector, followed by `Ridge` regression (regression with an `L2` norm penalty).

Select the 5% of the features with the highest `f_regression` score.


In [None]:
from sklearn.feature_selection import SelectPercentile, f_regression

select = SelectPercentile(score_func=f_regression, percentile=5).fit(X, y)
X_selected = select.transform(X)
print(X_selected.shape)

To cross-validate correctly, the feature selector must be fit on the training split of each fold only and then applied to that fold's test split.

First, here's how it would be done manually:

In [None]:
from sklearn.model_selection import KFold

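# random_state only applies when shuffle=True; recent scikit-learn versions
# raise an error if it is set without shuffling, so it is omitted here
kfold5 = KFold(n_splits=5)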

In [None]:
from sklearn.linear_model import Ridge

scores = np.empty(0)
for train, test in kfold5.split(X):
    X_train = X[train]
    y_train = y[train]
    X_test  = X[test]
    y_test  = y[test]
    
    selector = SelectPercentile(score_func=f_regression, percentile=5).fit(X_train, y_train)
    X_train_selected = selector.transform(X_train)
    X_test_selected = selector.transform(X_test)
    
    score = Ridge().fit(X_train_selected, y_train).score(X_test_selected, y_test)
    scores = np.append(scores, score)
    
    
np.mean(scores)

#### And here it is with `Pipeline`

In [None]:
# cross_val_score lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import cross_val_score

pipe = Pipeline([("select", SelectPercentile(score_func=f_regression, percentile=5)), 
                 ("ridge", Ridge())])
np.mean(cross_val_score(pipe, X, y, cv=kfold5.split(X)))
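
To see why the selector belongs inside the pipeline, the sketch below contrasts two ways of scoring the same random data. The `_rand` and `_r2` names are hypothetical. Because the targets are pure noise, no model should generalize, so an R² well above zero points to leakage.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rnd_demo = np.random.RandomState(seed=0)
X_rand = rnd_demo.normal(size=(100, 10000))
y_rand = rnd_demo.normal(size=(100,))

# leaky: features are selected using all rows, then only Ridge is cross-validated
X_rand_selected = SelectPercentile(
    score_func=f_regression, percentile=5).fit_transform(X_rand, y_rand)
leaky_r2 = np.mean(cross_val_score(Ridge(), X_rand_selected, y_rand, cv=5))

# correct: selection is refit inside each fold as part of the pipeline
pipe_demo = Pipeline([
    ("select", SelectPercentile(score_func=f_regression, percentile=5)),
    ("ridge", Ridge())])
honest_r2 = np.mean(cross_val_score(pipe_demo, X_rand, y_rand, cv=5))

print("selection outside CV:", leaky_r2)
print("selection inside CV: ", honest_r2)
```

The first score looks good only because the selector saw the held-out targets when picking features.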

### The General Pipeline Interface

In [None]:
def fit(self, X, y):
    X_transformed = X
    for step in self.steps[:-1]:
        # iterate over all but the final step
        # fit and transform the data
        X_transformed = step[1].fit_transform(X_transformed, y)
    # fit the last step
    self.steps[-1][1].fit(X_transformed, y)
    return self

In [None]:
def predict(self, X):
    X_transformed = X
    for step in self.steps[:-1]:
        # iterate over all but the final step
        # transform the data
        X_transformed = step[1].transform(X_transformed)
    # predict using the last step
    return self.steps[-1][1].predict(X_transformed)
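
The two methods above can be wrapped in a small class and compared with scikit-learn's own `Pipeline`. `TinyPipeline` is a hypothetical teaching class, not part of scikit-learn, and the toy regression data is invented for the comparison. The real `Pipeline` does considerably more (parameter handling, caching, validation).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class TinyPipeline:
    """Hypothetical minimal pipeline: fit and predict only."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) tuples

    def fit(self, X, y):
        X_transformed = X
        # fit and apply every step except the last
        for _, transformer in self.steps[:-1]:
            X_transformed = transformer.fit_transform(X_transformed, y)
        # fit the final estimator on the transformed data
        self.steps[-1][1].fit(X_transformed, y)
        return self

    def predict(self, X):
        X_transformed = X
        # apply the already-fitted transformers, without refitting
        for _, transformer in self.steps[:-1]:
            X_transformed = transformer.transform(X_transformed)
        return self.steps[-1][1].predict(X_transformed)


# made-up regression data
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(50, 4))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=50)

tiny = TinyPipeline([("scaler", StandardScaler()), ("ridge", Ridge())]).fit(X_toy, y_toy)
real = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())]).fit(X_toy, y_toy)

same_predictions = np.allclose(tiny.predict(X_toy), real.predict(X_toy))
print(same_predictions)
```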

**Pipeline Illustration**
<img src="img/pipeline-diagram.png" alt="Pipeline Illustration" width="50%"/>

Image: CC-BY-NA, [Karl Rosaen](http://karlrosaen.com/ml/learning-log/2016-06-20/)

### `Pipeline` creation with ``make_pipeline``

In [None]:
from sklearn.pipeline import make_pipeline
# standard syntax
pipe_long = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC(C=100))])
# abbreviated syntax
pipe_short = make_pipeline(MinMaxScaler(), SVC(C=100))

In [None]:
pipe_short.steps

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# auto-naming the steps
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), StandardScaler())
pipe.steps
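
The naming rule can be checked by pulling just the names out of `steps`. In this sketch (variable names are illustrative), each step is named after its lowercased class, and a numeric suffix is added when the same class appears more than once.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

short_names = [name for name, _ in
               make_pipeline(MinMaxScaler(), SVC(C=100)).steps]
repeated_names = [name for name, _ in
                  make_pipeline(StandardScaler(), PCA(n_components=2),
                                StandardScaler()).steps]

print(short_names)     # ['minmaxscaler', 'svc']
print(repeated_names)  # ['standardscaler-1', 'pca', 'standardscaler-2']
```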

#### Accessing step attributes

In [None]:
# fit the pipeline defined above to the cancer dataset
pipe.fit(cancer.data)
# extract the first two principal components from the "pca" step
components = pipe.named_steps["pca"].components_
print(components.shape)
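
Besides `named_steps`, a step can also be reached by indexing the pipeline itself, assuming a reasonably recent scikit-learn release (indexing was added in version 0.21). The sketch below rebuilds the same pipeline under the hypothetical name `pipe_demo` and shows that all three routes return the same fitted object.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer_demo = load_breast_cancer()
pipe_demo = make_pipeline(StandardScaler(), PCA(n_components=2), StandardScaler())
pipe_demo.fit(cancer_demo.data)

by_name = pipe_demo.named_steps["pca"]  # dictionary-style lookup
by_key = pipe_demo["pca"]               # indexing by step name
by_position = pipe_demo[1]              # indexing by position

print(by_name is by_key, by_name is by_position)
# one row per component, one column per original feature
print(by_name.components_.shape)
```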

#### Accessing attributes in grid-searched pipeline.

In [None]:
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(StandardScaler(), LogisticRegression())

In [None]:
param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10, 100]}

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=4)
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

In [None]:
print(grid.best_estimator_)

In [None]:
print(grid.best_estimator_.named_steps["logisticregression"])

In [None]:
print(grid.best_estimator_.named_steps["logisticregression"].coef_)
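
The sketch below repeats the search in one self-contained block and checks that the step inside `best_estimator_` carries the winning `C`. The `_demo` names are hypothetical, and `max_iter=1000` is an assumption added here only to avoid convergence warnings; it is not in the cells above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer_demo = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    cancer_demo.data, cancer_demo.target, random_state=4)

# max_iter=1000 is an assumption, not part of the original cells
pipe_demo = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid_demo = GridSearchCV(
    pipe_demo, {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid_demo.fit(X_tr, y_tr)

# best_estimator_ is a pipeline refit on all of X_tr with the best parameters
best_logreg = grid_demo.best_estimator_.named_steps["logisticregression"]
print(best_logreg.C, grid_demo.best_params_["logisticregression__C"])
# one row of coefficients, one coefficient per input feature
print(best_logreg.coef_.shape)
```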

### Grid-searching preprocessing steps and model parameters

The following shows a pipeline that
 * Standardizes features to zero mean and unit variance (`StandardScaler`)
 * Adds polynomial features (`PolynomialFeatures`)
 * Runs `Ridge` regression

In [None]:
# note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston
boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=0)

from sklearn.preprocessing import PolynomialFeatures
pipe = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(),
    Ridge())

The steps were auto-named using lower case versions of the classes we used in the `Pipeline`.  Double underscores are used to control the `degree` parameter for polynomial features and `alpha` parameter for `Ridge` regression.

In [None]:
param_grid = {'polynomialfeatures__degree': [1, 2, 3],
              'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
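
To see how many candidates this grid produces and in what order they are tried, `ParameterGrid` can enumerate them. The names `param_grid_demo`, `candidates`, and `n_fits` in this sketch are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid_demo = {'polynomialfeatures__degree': [1, 2, 3],
                   'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}

candidates = list(ParameterGrid(param_grid_demo))
print(len(candidates))   # 3 degrees x 6 alphas
print(candidates[0])
print(candidates[1])

# with cv=5, each candidate is fit once per fold
n_fits = len(candidates) * 5
print(n_fits)
```

The keys are iterated in sorted order with the last key varying fastest, so `ridge__alpha` changes before `polynomialfeatures__degree` does.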

In [None]:
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)

In [None]:
plt.matshow(np.array(pd.DataFrame(grid.cv_results_).mean_test_score).reshape(3, -1),
            vmin=0, cmap="viridis")
plt.xlabel("ridge__alpha")
plt.ylabel("polynomialfeatures__degree")
plt.xticks(range(len(param_grid['ridge__alpha'])), param_grid['ridge__alpha'])
plt.yticks(range(len(param_grid['polynomialfeatures__degree'])), 
                     param_grid['polynomialfeatures__degree'])

plt.colorbar();
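
The `reshape(3, -1)` above relies on the order in which the grid is enumerated. A labeled table makes that layout explicit. This sketch uses `load_diabetes` as a stand-in dataset (an assumption, chosen only because it ships with scikit-learn), and the `_demo` names, `results`, and `score_table` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# stand-in data, not the dataset used in the notebook
X_demo, y_demo = load_diabetes(return_X_y=True)

param_grid_demo = {'polynomialfeatures__degree': [1, 2, 3],
                   'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_demo = GridSearchCV(
    make_pipeline(StandardScaler(), PolynomialFeatures(), Ridge()),
    param_grid=param_grid_demo, cv=5)
grid_demo.fit(X_demo, y_demo)

# one row per candidate, then pivot into a degree-by-alpha table
results = pd.DataFrame(grid_demo.cv_results_["params"])
results["mean_test_score"] = grid_demo.cv_results_["mean_test_score"]
score_table = results.pivot(index="polynomialfeatures__degree",
                            columns="ridge__alpha",
                            values="mean_test_score")
print(score_table.round(2))
```

The table has the same rows and columns as the reshaped array, with the parameter values attached as labels.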

In [None]:
print(grid.best_params_)

In [None]:
grid.score(X_test, y_test)

In [None]:
param_grid = {'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
pipe = make_pipeline(StandardScaler(), Ridge())
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
grid.score(X_test, y_test)

# Summary

In this notebook, we reviewed the following topics in preparation for more advanced topics:

 * [Algorithm chains and pipelines](#Algorithm-chains-and-pipelines)
 * [The general pipeline interface](#The-General-Pipeline-Interface)
 * [Accessing step attributes](#Accessing-step-attributes)
 * [Accessing attributes in grid-searched pipeline.](#Accessing-attributes-in-grid-searched-pipeline.)
 * [Grid-searching preprocessing steps and model parameters](#Grid-searching-preprocessing-steps-and-model-parameters)

<a href='Pipelines_Exercises.ipynb' class='btn btn-primary btn-lg'>Exercises</a>

<img src='img/copyright.png'>