## sklearn.pipeline.Pipeline
- `class sklearn.pipeline.Pipeline(steps,memory=None)`

- Pipeline of transforms with a final estimator.
- Sequentially apply a list of transforms and a final estimator.
- Intermediate steps of the pipeline must be transformers, that is, they must implement `fit` and `transform` methods. The final estimator only needs to implement `fit`.
- The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters.

- `Pipeline` can be used to chain multiple estimators into one.
- This is useful because there is often a fixed sequence of steps in processing the data.
- The Pipeline is built from a list of `(key, value)` pairs.
- key: a string containing the name you want to give this step
- value: an estimator object

In [2]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA

In [3]:
estimators = [('reduce_dim',PCA()),('clf',SVC())]
pipe = Pipeline(estimators)
pipe

Pipeline(memory=None,
     steps=[('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
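
Calling `fit` on the pipeline fits each transformer in turn, passes the transformed data to the next step, and finally fits the last estimator. A minimal sketch of that flow, assuming the iris dataset as stand-in data (the dataset and `n_components=2` are illustrative choices, not part of the original notebook):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Illustrative data: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

pipe = Pipeline([("reduce_dim", PCA(n_components=2)), ("clf", SVC())])

# fit() runs PCA.fit_transform on X, then SVC.fit on the reduced data.
pipe.fit(X, y)

# predict() and score() apply PCA.transform before calling the SVC.
predictions = pipe.predict(X)
train_accuracy = pipe.score(X, y)

# The fitted PCA step can still be used on its own.
X_reduced = pipe.named_steps["reduce_dim"].transform(X)
print(X_reduced.shape, train_accuracy)
```

The score here is measured on the training data, so it only confirms that the steps are wired together; it is not an estimate of generalization.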

The utility function `make_pipeline` is a shorthand for constructing pipelines; it takes a variable number of estimators and returns a pipeline, filling in the names automatically:

In [8]:
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import Binarizer

In [9]:
make_pipeline(Binarizer(),MultinomialNB())

Pipeline(memory=None,
     steps=[('binarizer', Binarizer(copy=True, threshold=0.0)), ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
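
The generated names are the lowercased class names of the estimators. As a small sketch of the naming rule (the second pipeline, with two `Binarizer` steps, is a contrived example added only to show how repeated classes are disambiguated):

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Binarizer

pipe_nb = make_pipeline(Binarizer(), MultinomialNB())
names = [name for name, _ in pipe_nb.steps]
print(names)  # ['binarizer', 'multinomialnb']

# Two steps of the same class get a numeric suffix.
pipe_dup = make_pipeline(Binarizer(threshold=0.0), Binarizer(threshold=1.0))
dup_names = [name for name, _ in pipe_dup.steps]
print(dup_names)  # ['binarizer-1', 'binarizer-2']
```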

The estimators of a pipeline are stored as a list in the `steps` attribute:

In [10]:
pipe.steps[0]

('reduce_dim',
 PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
   svd_solver='auto', tol=0.0, whiten=False))

and as a `dict` in `named_steps`:

In [11]:
pipe.named_steps['reduce_dim']

PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

Parameters of the estimators in the pipeline can be accessed using the `<estimator>__<parameter>` syntax:

In [12]:
pipe.set_params(clf__C=10)

Pipeline(memory=None,
     steps=[('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
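
The same double-underscore names work for reading parameters back with `get_params`. A short sketch, reusing the `reduce_dim` and `clf` step names from above (the value `n_components=2` is an arbitrary illustration):

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([("reduce_dim", PCA()), ("clf", SVC())])

# Several nested parameters can be set in one call.
pipe.set_params(reduce_dim__n_components=2, clf__C=10)

# get_params() exposes the same names, so values can be read back.
params = pipe.get_params()
print(params["clf__C"], params["reduce_dim__n_components"])  # 10 2

# The underlying estimator objects are updated in place.
print(pipe.named_steps["clf"].C)  # 10
```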

The keys of `named_steps` are also available as attributes, which enables tab completion in interactive environments:

In [14]:
pipe.named_steps.reduce_dim is pipe.named_steps['reduce_dim']

True

## Pipelining: chaining a PCA and a logistic regression
- The PCA does an unsupervised dimensionality reduction, while the logistic regression does the prediction.

In [38]:
print(__doc__)
# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model,decomposition,datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

Automatically created module for IPython interactive environment


In [23]:
logistic = linear_model.LogisticRegression()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca',pca),('logistic',logistic)])

In [24]:
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
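
Because each step exposes its parameters under `<step>__<parameter>` names, the PCA and the logistic regression can be tuned together with `GridSearchCV`. The sketch below is one way to do that; the grid values, `cv=3`, and `max_iter=1000` are assumptions chosen to keep the run short, not settings taken from the original notebook:

```python
from sklearn import datasets, decomposition, linear_model
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X_digits, y_digits = datasets.load_digits(return_X_y=True)

pipe = Pipeline(steps=[
    ("pca", decomposition.PCA()),
    # max_iter raised from the default to help the solver converge (assumption).
    ("logistic", linear_model.LogisticRegression(max_iter=1000)),
])

# Illustrative grid: keys follow the <step>__<parameter> convention.
param_grid = {
    "pca__n_components": [20, 40],
    "logistic__C": [0.1, 1.0],
}

# Every candidate refits PCA and the classifier on each training fold.
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_digits, y_digits)

print(search.best_params_)
print(search.best_score_)
```

Searching over the pipeline rather than over the classifier alone means the PCA is refit inside every fold, so the held-out data never influences the projection.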

## 20 Newsgroups classification

In [26]:
from sklearn.datasets import fetch_20newsgroups

news = fetch_20newsgroups(subset='all')
X = news.data
y = news.target

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

In [28]:
model1 = Pipeline([
    ('vect',CountVectorizer()),
    ('model_mult',MultinomialNB()),
        ])

In [29]:
model2 = Pipeline([
    ('vect',TfidfVectorizer()),
    ('model',MultinomialNB()),
        ])

In [30]:
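model3 = Pipeline([
    # alternate_sign=False keeps the hashed features non-negative,
    # which MultinomialNB requires
    ('vect',HashingVectorizer(alternate_sign=False)),
    ('model_mult',MultinomialNB()),
])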

In [32]:
model4 = Pipeline([
    ('vect',TfidfVectorizer(stop_words='english',
                           token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b")),
    ('model',MultinomialNB()),
])
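
Pipelines like these are typically compared with cross-validation. The following sketch shows the idea on a tiny made-up corpus so that it runs without downloading the 20 newsgroups data; the documents, labels, and `cv=3` are all hypothetical, and only the count and tf-idf variants are rebuilt here:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus (two topics, six documents each), for illustration only.
docs = [
    "the rocket launch reached orbit",
    "nasa plans a new moon mission",
    "the satellite orbit decayed slowly",
    "astronauts aboard the space station",
    "the telescope observed a distant galaxy",
    "rocket engine test on the launch pad",
    "the goalie made a great save",
    "hockey playoffs start next week",
    "the team scored in overtime",
    "the puck hit the goal post",
    "the referee called a penalty",
    "the hockey season ends with the playoffs",
]
labels = [0] * 6 + [1] * 6

candidates = {
    "count": Pipeline([("vect", CountVectorizer()), ("model", MultinomialNB())]),
    "tfidf": Pipeline([("vect", TfidfVectorizer()), ("model", MultinomialNB())]),
}

# Each fold refits the vectorizer on the training part only,
# so no vocabulary leaks in from the held-out documents.
scores = {
    name: cross_val_score(model, docs, labels, cv=3)
    for name, model in candidates.items()
}
for name, fold_scores in scores.items():
    print(name, fold_scores.mean())
```

With the real dataset, `docs` and `labels` would be replaced by `X` and `y` from `fetch_20newsgroups`, and the accuracies on this toy corpus say nothing about how the models rank there.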

## Intuitive example of a pipeline

In [45]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
iris = load_iris()
dfX = pd.DataFrame(iris.data,columns=iris.feature_names)
dfy = pd.DataFrame(iris.target,columns = ['y'])

In [None]:
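from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Assumption: X and y are the 20 newsgroups texts and labels loaded earlier.
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = SGDClassifier()

# Fit every step on the training set
vX = vect.fit_transform(Xtrain)
tfidfX = tfidf.fit_transform(vX)
clf.fit(tfidfX, ytrain)

# Now evaluate all steps on the test set: transform only, never refit
vX = vect.transform(Xtest)
tfidfX = tfidf.transform(vX)
predicted = clf.predict(tfidfX)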
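
The manual sequence above is what a `Pipeline` automates: `fit` calls `fit_transform` on each transformer and `fit` on the final classifier, while `predict` calls only `transform` before predicting. A sketch of the equivalent pipeline, using a few made-up sentences in place of real training data (the texts, labels, and `random_state=0` are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Made-up toy data, for illustration only.
Xtrain = [
    "the rocket launch reached orbit",
    "nasa plans a new moon mission",
    "astronauts aboard the space station",
    "the goalie made a great save",
    "hockey playoffs start next week",
    "the team scored in overtime",
]
ytrain = ["space", "space", "space", "hockey", "hockey", "hockey"]
Xtest = ["a new rocket mission", "the hockey team"]

text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(random_state=0)),
])

# One call fits the vectorizer, the tf-idf weighting, and the classifier.
text_clf.fit(Xtrain, ytrain)

# predict() reuses the fitted vocabulary and weights; nothing is refit.
predicted = text_clf.predict(Xtest)
print(predicted)
```

Keeping the steps in one object removes the risk of accidentally refitting a transformer on the test set, which is easy to do when the calls are written out by hand.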
