## Pipeline

Scikit-learn documentation:
- [pipeline module](http://scikit-learn.org/stable/modules/pipeline.html)
- [Pipeline class](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
- [FeatureUnion](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html)
- [GridSearchCV class](http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html)

In [1]:
from __future__ import division
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.grid_search import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.cross_validation import train_test_split
from sklearn.base import BaseEstimator
import pandas as pd
import numpy as np

Generate data

In [2]:
df = pd.DataFrame([['some word here', 1000, 0.6, 0],
                  ['another word here', 900, 0.7, 0],
                  ['a world here', 900, 0.7, 0],
                  ['another world here', 900, 0.7, 0],
                  ['cat and dog', 100, 0.9, 1],
                  ['dog and fish', 50, 0.9, 1],
                  ['cat and fish', 100, 0.9, 1],
                  ['dog and cat', 50, 0.9, 1]])
df.columns = ['description', 'feature1', 'feature2', 'target']
df.head()

| | description | feature1 | feature2 | target |
|---|---|---|---|---|
| 0 | some word here | 1000 | 0.6 | 0 |
| 1 | another word here | 900 | 0.7 | 0 |
| 2 | a world here | 900 | 0.7 | 0 |
| 3 | another world here | 900 | 0.7 | 0 |
| 4 | cat and dog | 100 | 0.9 | 1 |


In [3]:
y = df.pop('target')
X = df

Split the data into training and test sets (with `test_size=0.25`, 2 of the 8 rows are held out for testing)

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

Custom steps like `ColumnExtractor` should inherit from `BaseEstimator`, or implement `get_params` and `set_params` themselves, so that GridSearchCV can clone them and set their parameters. Each `__init__` argument should be stored as an attribute of the same name:

In [5]:
class ColumnExtractor(BaseEstimator):

    def __init__(self, column_name):
        self.column_name = column_name

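    def transform(self, X_in):
        # Return the selected column as a 1-D array (an iterable of strings for TfidfVectorizer).
        # Indexing does not modify X_in, so no defensive copy is needed.
        return X_in[self.column_name].values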

    def fit(self, X_in, y=None):
        return self

In [6]:
class ColumnPop(BaseEstimator):

    def __init__(self, column_name):
        self.column_name = column_name

    def transform(self, X_in):
        # pop() removes the column in place, so work on a copy to leave the caller's DataFrame intact
        X = X_in.copy(deep=True)
        X.pop(self.column_name)
        return X

    def fit(self, X, y=None):
        return self
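
For a step that does not inherit from `BaseEstimator`, the parameter protocol can be written by hand. The following is a minimal sketch, assuming the only requirements are that `get_params` returns the constructor arguments and that `set_params` updates them and returns the step itself. The class name `ManualColumnExtractor` is hypothetical, and it reads from a plain dict of columns instead of a DataFrame so that the snippet has no dependencies.

```python
class ManualColumnExtractor(object):
    """Hypothetical step that implements the parameter protocol by hand."""

    def __init__(self, column_name):
        # Store each constructor argument under an attribute of the same name.
        self.column_name = column_name

    def get_params(self, deep=True):
        # Return the constructor arguments as a dict.
        return {'column_name': self.column_name}

    def set_params(self, **params):
        # Update known parameters and return self so that calls can be chained.
        valid = self.get_params()
        for name, value in params.items():
            if name not in valid:
                raise ValueError('Invalid parameter: {}'.format(name))
            setattr(self, name, value)
        return self

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Assumption: X maps each column name to a list of values.
        return X[self.column_name]


step = ManualColumnExtractor('description')
step.set_params(column_name='feature1')

# A fresh, unfitted copy can be rebuilt from the parameters alone.
copy_of_step = ManualColumnExtractor(**step.get_params())

data = {'description': ['cat and dog'], 'feature1': [100]}
print(copy_of_step.fit(data).transform(data))
```

Rebuilding a step from `get_params()` is what makes it possible to try many parameter settings without the candidates sharing state.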

### Define pipeline

- Features are the union of
  - NLP features (TF-IDF on `description`)
  - Other features (the remaining columns, passed through unchanged)
- Model (Multinomial Naive Bayes) fitted on the union of features

In [7]:
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('nlp', Pipeline([
            ('extract', ColumnExtractor("description")),
            ('tfidf', TfidfVectorizer())
        ])),
        ('non-nlp', Pipeline([
            ('pop', ColumnPop("description"))
            ]))
    ])),    
    ('model', MultinomialNB())
    ])
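
Conceptually, the `features` step runs each branch on the same input rows and places the resulting columns side by side. The sketch below illustrates that idea with plain lists; the helper `union_features` is hypothetical, and the text features are made-up stand-in values rather than real TF-IDF output.

```python
def union_features(*blocks):
    """Concatenate feature blocks row by row (column-wise stacking)."""
    n_rows = len(blocks[0])
    if any(len(block) != n_rows for block in blocks):
        raise ValueError('all blocks must have the same number of rows')
    # For each row index, join that row from every block into one long row.
    return [sum((list(block[i]) for block in blocks), [])
            for i in range(n_rows)]


# Stand-in values: two made-up text features per row, not real TF-IDF weights.
nlp_block = [[0.5, 0.0], [0.0, 0.7]]
# feature1 and feature2 for two rows of the example data.
other_block = [[1000, 0.6], [100, 0.9]]

combined = union_features(nlp_block, other_block)
print(combined)
```

The model then sees one row per sample, holding the text-derived columns followed by the numeric columns.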

See what the created pipeline looks like

In [8]:
pipeline

Pipeline(steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('nlp', Pipeline(steps=[('extract', ColumnExtractor(column_name='description')), ('tfidf', TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'conten...  transformer_weights=None)), ('model', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

Fit the pipeline and then use it to predict (and score)

In [9]:
pipeline.fit(X_train, y_train)

Pipeline(steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('nlp', Pipeline(steps=[('extract', ColumnExtractor(column_name='description')), ('tfidf', TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'conten...  transformer_weights=None)), ('model', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [10]:
print('predictions: {}'.format(pipeline.predict(X_test)))
print('accuracy: {}'.format(pipeline.score(X_test, y_test)))

predictions: [0 1]
accuracy: 1.0


### Set Pipeline Parameters

Set a parameter on a top-level step, using the `<step name>__<parameter name>` convention

In [11]:
pipeline.set_params(model__alpha=0.1).fit(X_train, y_train)

Pipeline(steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('nlp', Pipeline(steps=[('extract', ColumnExtractor(column_name='description')), ('tfidf', TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'conten...  transformer_weights=None)), ('model', MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True))])

Set a "nested" parameter by chaining the step names with double underscores, from the outermost step to the parameter

In [12]:
pipeline.set_params(features__nlp__tfidf__max_features=2).fit(X_train, y_train)

Pipeline(steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('nlp', Pipeline(steps=[('extract', ColumnExtractor(column_name='description')), ('tfidf', TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'conten...  transformer_weights=None)), ('model', MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True))])
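
A nested name such as `features__nlp__tfidf__max_features` reads as a path of step names followed by the parameter. The helper below is a hypothetical illustration of how to read such a name. scikit-learn resolves these names one level at a time, so this is a sketch of the naming convention, not of the library's implementation.

```python
def split_param_path(name):
    """Split a nested parameter name into (list of step names, parameter name)."""
    parts = name.split('__')
    return parts[:-1], parts[-1]


path, param = split_param_path('features__nlp__tfidf__max_features')
print(path)   # ['features', 'nlp', 'tfidf']
print(param)  # max_features

# A top-level parameter has a single step in its path.
print(split_param_path('model__alpha'))  # (['model'], 'alpha')
```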

Set several parameters in one call by passing them as separate keyword arguments

In [13]:
pipeline.set_params(model__alpha=0.1, features__nlp__tfidf__max_features=2).fit(X_train, y_train)

Pipeline(steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('nlp', Pipeline(steps=[('extract', ColumnExtractor(column_name='description')), ('tfidf', TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'conten...  transformer_weights=None)), ('model', MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True))])

### Grid Search Pipeline

Run a grid search over multiple parameters and retrieve the best combination

In [14]:
parameters = {'model__alpha': (0.1, 1),
             'features__nlp__tfidf__max_features': (None, 2, 10)}
grid_search_pip = GridSearchCV(pipeline, parameters)
grid_search_pip.fit(X_train, y_train)
grid_search_pip.best_params_

{'features__nlp__tfidf__max_features': None, 'model__alpha': 0.1}
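
The grid above contains every combination of the listed values, so 2 values of `model__alpha` times 3 values of `max_features` gives 6 candidates, each of which is fitted and scored with cross-validation. The following sketch enumerates those candidates with `itertools.product`; it only illustrates the size and shape of the grid and does not reproduce how GridSearchCV is implemented.

```python
from itertools import product

parameters = {'model__alpha': (0.1, 1),
              'features__nlp__tfidf__max_features': (None, 2, 10)}

# Fix an order for the parameter names, then take the Cartesian product of their values.
names = sorted(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[name] for name in names))]

print(len(candidates))
for candidate in candidates:
    print(candidate)
```

Because the number of candidates is the product of the number of values per parameter, adding parameters or values increases the search cost multiplicatively.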