### Sklearn Pipelines 

This notebook shows how to implement pipelines to simplify the model creation process.

In [26]:
import numpy as np
import matplotlib.pyplot as plt

# load data
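from sklearn.datasets import load_digits
# train_test_split lives in sklearn.model_selection;
# the old sklearn.cross_validation module has been removed
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, random_state=0)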

In [27]:
# without pipelines
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
svm = SVC().fit(X_train_scaled, y_train)

X_test_scaled = scaler.transform(X_test)
score = svm.score(X_test_scaled, y_test)
print('SVM score WITHOUT pipeline: {}'.format(score))

SVM score WITHOUT pipeline: 0.9844444444444445


In [31]:
# with pipelines
from sklearn.pipeline import Pipeline

# verbose constructor
pipe = Pipeline([("my_scaler", StandardScaler()), ("my_svm", SVC())])
pipe.fit(X_train, y_train)

score_p = pipe.score(X_test, y_test)
print('SVM score WITH pipeline: {}'.format(score_p))

SVM score WITH pipeline: 0.9844444444444445


The score is exactly the same as before, as desired: the pipeline applies the same scaling and SVM fitting internally.
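To make the equivalence concrete, the sketch below replays the pipeline by hand using its own fitted steps and compares the predictions. It rebuilds the data and pipeline from the cells above so it can run on its own; the variable names `fitted_scaler`, `fitted_svm`, `manual_pred` and `same` are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

pipe = Pipeline([("my_scaler", StandardScaler()), ("my_svm", SVC())])
pipe.fit(X_train, y_train)

# Fetch the fitted steps by the names given in the constructor
fitted_scaler = pipe.named_steps["my_scaler"]
fitted_svm = pipe.named_steps["my_svm"]

# Manual replay: transform with the scaler, then predict with the SVM
manual_pred = fitted_svm.predict(fitted_scaler.transform(X_test))

# Compare against what the pipeline predicts in a single call
same = np.array_equal(manual_pred, pipe.predict(X_test))
print('Manual and pipeline predictions match: {}'.format(same))
```

Calling `pipe.predict` (or `pipe.score`) on raw data therefore saves the explicit `transform` call on the test set, which is the step that is easiest to forget.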
___

### Longer Pipelines

Pipelines are not limited to two steps. The only requirement is that every step except the last must be a transformer, that is, it must have a .transform method (in addition to .fit).
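As a minimal sketch of that requirement, any object that provides fit and transform can serve as an intermediate step. The `PixelClipper` class below and its `max_value` parameter are hypothetical, made up for illustration only, and are not part of scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


class PixelClipper(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: clips pixel values to [0, max_value]."""

    def __init__(self, max_value=12.0):
        self.max_value = max_value

    def fit(self, X, y=None):
        # Nothing is learned from the data; return self so calls can be chained
        return self

    def transform(self, X):
        # This is the method the pipeline calls on every non-final step
        return np.clip(X, 0.0, self.max_value)


digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

clip_pipe = Pipeline([("my_clipper", PixelClipper()), ("my_svm", SVC())])
clip_pipe.fit(X_train, y_train)

clip_score = clip_pipe.score(X_test, y_test)
print('SVM score with custom step: {}'.format(clip_score))
```

The final step (here the SVC) does not need a .transform method, because the pipeline only calls fit, predict, or score on it.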

Here is an example of a longer pipeline and its transformations:

1. Drop features with little variance (with the default threshold, only constant features)
2. Scale the features
3. Select features with a statistical test
4. Train the SVM


In [39]:
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectFdr, VarianceThreshold

# transformations ----------\/-------------------\/--------------\/------------
pipe = make_pipeline(VarianceThreshold(), StandardScaler(), SelectFdr(), SVC())
pipe.fit(X_train, y_train)

score_l = pipe.score(X_test, y_test)
print('SVM score: {}'.format(score_l))

SVM score: 0.9866666666666667
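To see what each transformation does to the data, one option is to push the test set through the fitted steps one at a time and record the shape after each. This is only a sketch that rebuilds the pipeline above so it can run on its own; the `shapes` list and the loop variable `Xt` are illustrative names, and the exact number of features that survive each step depends on the data split.

```python
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectFdr, VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

pipe = make_pipeline(VarianceThreshold(), StandardScaler(), SelectFdr(), SVC())
pipe.fit(X_train, y_train)

# Apply each fitted transformer in turn, skipping the final SVC,
# and keep track of the data shape after every step
shapes = [("input", X_test.shape)]
Xt = X_test
for name, step in pipe.steps[:-1]:
    Xt = step.transform(Xt)
    shapes.append((name, Xt.shape))

for name, shape in shapes:
    print('{}: {}'.format(name, shape))
```

The number of rows stays fixed throughout, while the two selection steps can only keep or reduce the number of columns.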


### Unsupervised Pipelines

* Reduce to 10 dimensions
* Find 10 clusters

In [43]:
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

cluster_pipe = make_pipeline(PCA(n_components=10), KMeans(n_clusters=10))
cluster_pipe.fit(X_train)
cluster_pipe.predict(X_train)

array([0, 4, 4, ..., 5, 5, 7], dtype=int32)

Accessing attributes:

In [44]:
cluster_pipe.named_steps['pca'] 

PCA(copy=True, n_components=10, whiten=False)

In [48]:
cluster_pipe.named_steps['pca'].components_.shape

(10, 64)

In [52]:
# shape of the fitted KMeans centroids: (n_clusters, n_components)
cluster_pipe.named_steps['kmeans'].cluster_centers_.shape

(10, 10)
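The keys 'pca' and 'kmeans' used above come from make_pipeline, which names each step after its lowercased class name. The sketch below lists those names and reads the same attributes through them. It refits on the full digits data so it is self-contained, and the `n_init` and `random_state` arguments are assumptions added here for reproducibility, not part of the original cell.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

X = load_digits().data

cluster_pipe = make_pipeline(
    PCA(n_components=10),
    KMeans(n_clusters=10, n_init=10, random_state=0),  # assumed settings
)
cluster_pipe.fit(X)

# Each entry of .steps is a (name, estimator) pair
step_names = [name for name, _ in cluster_pipe.steps]
print(step_names)

# Fitted attributes are reached through the step name
pca_shape = cluster_pipe.named_steps['pca'].components_.shape
centers_shape = cluster_pipe.named_steps['kmeans'].cluster_centers_.shape
print(pca_shape, centers_shape)
```

With the verbose Pipeline constructor the same lookup works, but the keys are whatever names were passed in, such as "my_scaler" and "my_svm" earlier in this notebook.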