# Pipelines and custom transformers in sklearn

---

**Data pipelines** are a series of automated data transformations used to perform (and ensure the validity of) routine data maintenance and analysis tasks. 

Many organizations rely on data engineering teams to encode common tasks into pipelines. It is likely that you will at some point be required to productionize a data pipeline.

## Data pipelines

---

The term **pipeline** is jargon for **a series of concatenated data transformations**. Each stage of a pipeline feeds from the previous stage, i.e. the output of a stage is plugged into the input of the next stage and data flows through the pipeline from beginning to end.

---

<img src="./assets/images/pipeline.png">

---

Pipelines provide a higher level of abstraction than the individual building blocks of a data science process and are a great way to organize analyses.

### Examples of data pipelines

---

What are some examples of data pipelines?

- Change in units (lbs -> kg)

- Change in scale (normalization)

- Missing data imputation

- Image & sound processing
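
The first three examples above can be chained into a tiny pipeline using nothing but plain Python. This is only a sketch of the idea: the helper names (`impute_missing`, `lbs_to_kg`, `min_max_scale`, `run_pipeline`) and the sample weights are made up for illustration and are not part of any library.

```python
def impute_missing(weights_lbs):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [w for w in weights_lbs if w is not None]
    mean = sum(observed) / len(observed)
    return [mean if w is None else w for w in weights_lbs]


def lbs_to_kg(weights_lbs):
    """Change of units: pounds to kilograms."""
    return [w * 0.45359237 for w in weights_lbs]


def min_max_scale(values):
    """Change of scale: map the values onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def run_pipeline(data, stages):
    """Feed the output of each stage into the next one, in order."""
    for stage in stages:
        data = stage(data)
    return data


raw_weights = [150.0, None, 200.0, 100.0]  # hypothetical input data
scaled = run_pipeline(raw_weights, [impute_missing, lbs_to_kg, min_max_scale])
print(scaled)
```

The missing value is filled with the mean (150 lbs), so after scaling it sits halfway between the smallest and largest weights.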

## Pipelines in scikit-learn

---

Pipelines improve coding and model management in `scikit-learn`. They tie together all the steps that you may need to prepare your datasets and make your predictions. 

Because you will need to perform all of the exact same transformations on your evaluation data, encoding the exact steps is important for reproducibility and consistency. **This is especially important and convenient when sharing code with a team!**

Loading the sklearn pipeline code:

---

In [1]:
import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline

## Example with natural language processing (NLP)

---

**NLP** is currently a very popular and high-demand skill in data science. It is essentially statistics and machine learning applied to human text data. A very common example of such data is "comments", which now appear in nearly every app and website.

---

Our practice data comes from the **Evergreen Stumbleupon Kaggle Competition**. Participants were challenged to build a classifier to categorize webpages as "evergreen" or "non-evergreen".

Check out the information on Kaggle here:

    https://www.kaggle.com/c/stumbleupon/data
    
Binary evergreen labels (either evergreen (1) or non-evergreen (0)) were provided.

#### Load the local Kaggle dataset:

---

In [4]:
# For this we are going to use the json package as well to create the "title" 
# and "body" columns from the "boilerplate" column, which is in json format
import json

# it is a tab delimited file:
data = pd.read_csv("./assets/datasets/stumbleupon.tsv", sep='\t')

data.head(1)

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,is_news,lengthyLinkDomain,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,1,1,24,0,5424,170,8,0.152941,0.07913,0


In [10]:
# create the columns by converting json:
data['title'] = data.boilerplate.map(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.map(lambda x: json.loads(x).get('body', ''))

# change title and body na columns to blank strings
titles = data['title'].fillna('')
body = data['body'].fillna('')

titles[0:3]

0    IBM Sees Holographic Calls Air Breathing Batte...
1    The Fully Electronic Futuristic Starting Gun T...
2    Fruits that Fight the Flu fruits that fight th...
Name: title, dtype: object

#### Make and check the target for the classifier

---

In [12]:
Y = data['label']
Y[0:3]

0    0
1    1
2    1
Name: label, dtype: int64

In [13]:
Y.value_counts() / len(Y)

1    0.51332
0    0.48668
Name: label, dtype: float64

### CountVectorizer

---

Each datapoint is a string of free form text. How can we make this numeric and useful for a classification model? 

One of the simplest and most common ways is to **build a dictionary of words and use those as features**. 

This is what sklearn's **`CountVectorizer`** does.

_Example output:_


|Sentence|the|cat|is|on|table|blue|
|---|---|---|---|---|---|---|
|The cat is on the table|2|1|1|1|1|0|
|The table is blue|1|0|1|0|1|1|
|...|...|...|...|...|...|...|
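
The table above can be reproduced with a short sketch. It assumes `CountVectorizer` with its default settings (lowercasing, tokens of two or more characters, no stop-word removal), and it reads the column order from the fitted `vocabulary_` attribute; the columns come out in alphabetical order rather than the order shown in the table.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The cat is on the table", "The table is blue"]

# Default settings: lowercase the text and count every word
table_vectorizer = CountVectorizer()
counts = table_vectorizer.fit_transform(sentences)

# vocabulary_ maps each word to its column index; sort the words by that index
vocab = sorted(table_vectorizer.vocabulary_, key=table_vectorizer.vocabulary_.get)

print(vocab)             # ['blue', 'cat', 'is', 'on', 'table', 'the']
print(counts.toarray())  # one row per sentence, one column per word
```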

#### Load and apply the CountVectorizer

---

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

# We want 1000 words as features max.
# ngram_range = (1,2) specifies that we want single words and word pairs
# stop_words = 'english' specifies that we do not want the most common english words
# binary = True means we want 1 if the word was present rather than a count of that word
vectorizer = CountVectorizer(max_features = 1000,
                             ngram_range=(1, 2),
                             stop_words='english',
                             binary=True)

# fit on an example:
vectorizer.fit(['IBM Sees Holographic Calls Air Breathing'])

# Note: scikit-learn 1.0+ renames this method to get_feature_names_out()
vectorizer.get_feature_names()

[u'air',
 u'air breathing',
 u'breathing',
 u'calls',
 u'calls air',
 u'holographic',
 u'holographic calls',
 u'ibm',
 u'ibm sees',
 u'sees',
 u'sees holographic']

#### Example row from CountVectorizer

---

In [9]:
vectorizer.transform(['IBM Sees Holographic Air']).todense()


matrix([[1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1]])

#### Create a vectorizer for the title and body columns

---

In [11]:
title_vectorizer = CountVectorizer(max_features = 1000,
                                   ngram_range=(1, 2),
                                   stop_words='english',
                                   binary=True)

body_vectorizer = CountVectorizer(max_features = 1000,
                                  ngram_range=(1, 2),
                                  stop_words='english',
                                  binary=True)

# Use `fit` to learn the vocabulary of the titles
title_vectorizer.fit(titles)

# and the vocabulary of the body
body_vectorizer.fit(body)

# Use `transform` to generate the sample title and body word matrix
# one column per feature (word or n-grams)
title_X = title_vectorizer.transform(titles)
body_X = body_vectorizer.transform(body)

We can use **`title_X`** (a binary matrix with one column per common title n-gram in the dataset) as the input to a logistic regression classifier. 

**Classify whether a story is evergreen based on title features:**

---

In [14]:
from sklearn.linear_model import LogisticRegression
# cross_val_score lives in model_selection (the old cross_validation module was removed)
from sklearn.model_selection import cross_val_score

title_model = LogisticRegression()
title_scores = cross_val_score(title_model, title_X, Y, cv=5)

print('CV scores: {}'.format(title_scores))

print('Average CVScore: {:0.3f} +/- {:0.3f}'.format(title_scores.mean(), title_scores.std()))

CV scores: [ 0.75743243  0.75997295  0.75794456  0.74983097  0.76589986]
Average CVScore: 0.758 +/- 0.005


#### Try on the body features:

---

In [16]:
body_model = LogisticRegression()
body_scores = cross_val_score(body_model, body_X, Y, cv=5)

print('CV scores: {}'.format(body_scores))

print('Average CVScore: {:0.3f} +/- {:0.3f}'.format(body_scores.mean(), body_scores.std()))

CV scores: [ 0.73243243  0.74104124  0.73427992  0.74171738  0.7435724 ]
Average CVScore: 0.739 +/- 0.004


## Combining steps together in a pipeline

---

We can combine these steps to evaluate some future dataset. To properly do this we need to make sure we perform the exact same transformations on the data. 

For example, if `has_brownies_in_text` is column 19, we need to make sure it is also column 19 during future evaluation.

**Pipelines combine both pre-processing and model building steps into a single object**. Rather than manually building transformations and then feeding them into the models, pipelines tie both of these steps together.

Similar to models and vectorizers in scikit-learn, pipelines are equipped with

- `fit()` methods
- `predict()` or `predict_proba()` methods (as any model would be)
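
As a minimal sketch of these methods, and of the column-consistency point above, the following builds a pipeline on a handful of made-up titles. The texts, labels, and the name `toy_pipeline` are invented for illustration and have nothing to do with the Kaggle data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical training titles: 1 = evergreen, 0 = non-evergreen
train_texts = ["timeless recipe for bread",
               "classic bread baking tips",
               "breaking news election results",
               "election news update today"]
train_labels = [1, 1, 0, 0]
new_texts = ["bread baking recipe", "news about something unseen"]

toy_pipeline = Pipeline([
    ('vec', CountVectorizer(binary=True)),
    ('model', LogisticRegression())
])

# fit() fits the vectorizer, transforms the text, then fits the model
toy_pipeline.fit(train_texts, train_labels)

# The fitted vectorizer always produces the columns learned during training,
# so new text lines up with the features the model was trained on
n_train_columns = len(toy_pipeline.named_steps['vec'].vocabulary_)
new_matrix = toy_pipeline.named_steps['vec'].transform(new_texts)
print(n_train_columns, new_matrix.shape)

# predict() and predict_proba() apply the same transformation automatically
predictions = toy_pipeline.predict(new_texts)
probabilities = toy_pipeline.predict_proba(new_texts)
print(predictions)
print(probabilities)
```

Words that were never seen during training (such as "unseen") are simply ignored, which is what keeps the column layout stable.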


#### Build a pipeline for the title feature data

---

In [18]:
# Manually split the data to build a training set
training_data = data[:6000]
title_X_train = training_data['title'].fillna('')
Y_train = training_data['label']

# Manually construct the test data
title_X_test = data[6000:]['title'].fillna('')

from sklearn.pipeline import Pipeline

pipeline = Pipeline([
        ('vec', title_vectorizer),
        ('model', title_model)
    ])

# Fit the full pipeline
# This means we perform the steps laid out above
# First we fit the vectorizer,
# And then feed the output of that into the fit function of the model

pipeline.fit(title_X_train, Y_train)

# Here again we apply the full pipeline for predictions
# The text is transformed automatically to match the features from the pipeline
pipeline.predict_proba(title_X_test)

array([[ 0.46800424,  0.53199576],
       [ 0.28316267,  0.71683733],
       [ 0.00514   ,  0.99486   ],
       ..., 
       [ 0.29063018,  0.70936982],
       [ 0.60683954,  0.39316046],
       [ 0.66318704,  0.33681296]])

### Add a scaler to the pipeline 

---

As an additional step, we could add a **`MaxAbsScaler`** scaling step to the pipeline, which will occur after the vectorization.

**`MaxAbsScaler`** by default transforms the features so that the maximum absolute value of any feature is 1. 

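Because our title features are binary, their maximum absolute value is already 1 and the scaler leaves them unchanged. The point here is to show how a scaling step fits into the pipeline: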

In [21]:
from sklearn.preprocessing import MaxAbsScaler

ma_scaler = MaxAbsScaler()

pipeline = Pipeline([
        ('vec', title_vectorizer),
        ('max_abs_scaler', ma_scaler),
        ('model', title_model)
    ])

pipeline.fit(title_X_train, Y_train)

pipeline.predict(title_X_test)

array([1, 1, 1, ..., 1, 0, 0])
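
To see what the scaler does in isolation, here is a small sketch on a made-up numeric matrix (the values are arbitrary). Each column is divided by its largest absolute value, so signs and zeros are preserved.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical data: the largest absolute value in each column is 4
X_demo = np.array([[ 1.0, -2.0],
                   [ 2.0,  4.0],
                   [-4.0,  1.0]])

scaled_demo = MaxAbsScaler().fit_transform(X_demo)
print(scaled_demo)
# [[ 0.25 -0.5 ]
#  [ 0.5   1.  ]
#  [-1.    0.25]]
```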

## Merging feature sets in pipelines

---

While scikit-learn pipelines facilitate transformations of raw data, there may be many steps required before this takes place in your pipeline. These complex pipelines are often referred to as **ETL (Extract, Transform, Load) pipelines**.

In an ETL pipeline, the data is pulled or extracted from some source (like a database), transformed or manipulated, and then loaded into whatever system will analyze the data.

Many data science teams rely on software tools to manage these ETL pipelines. If a transformation step fails, these tools alert you, or ensure that step can be re-run. If these transformation steps need to happen daily or weekly, these tools can manage that timeline.

One of the most popular Python tools for this is Luigi, developed by Spotify:

    https://github.com/spotify/luigi

Or Airflow, originally developed at Airbnb:

    https://github.com/apache/incubator-airflow

### FeatureUnion

---

We can also get some of this functionality in sklearn.

Let's say, for example, we want to merge many different feature sets together automatically. **`FeatureUnion`** is a very useful tool to do this for us.

In [28]:
# We want not only our binary title features, but the word counts as well:

title_normcount_vectorizer = CountVectorizer(max_features = 1000,
                                             ngram_range=(1, 2),
                                             stop_words='english',
                                             binary=False)

feature_makers = [('binarizer', title_vectorizer), ('normalizer', title_normcount_vectorizer)]

from sklearn.pipeline import FeatureUnion

feature_union = FeatureUnion(feature_makers)

comb_features = feature_union.fit_transform(title_X_train)

comb_features

<6000x2000 sparse matrix of type '<type 'numpy.int64'>'
	with 40981 stored elements in Compressed Sparse Row format>
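
A smaller sketch makes the merge easier to inspect. The three texts below are made up; the point is only that `FeatureUnion` places the output columns of each transformer side by side, in the order the transformers were listed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

texts = ["the cat sat", "the cat the cat", "dog"]  # hypothetical documents

toy_union = FeatureUnion([
    ('binary', CountVectorizer(binary=True)),   # 1 if the word is present
    ('counts', CountVectorizer(binary=False))   # number of occurrences
])

combined = toy_union.fit_transform(texts)

# 4 vocabulary words (cat, dog, sat, the) from each vectorizer: 8 columns
print(combined.shape)         # (3, 8)

# Second document: binary block [1 0 0 1], then count block [2 0 0 2]
print(combined.toarray()[1])
```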

## `make_pipeline()` with preprocessing and modeling

---

Scikit-learn pipelines can also be built, often more concisely, using the `make_pipeline()` function, which names the steps for you.

In [30]:
# import the make_pipeline function
from sklearn.pipeline import make_pipeline

# StandardScaler standardizes each feature to zero mean and unit variance
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe1 = make_pipeline(StandardScaler(), LogisticRegression())    

pipe2 = Pipeline(steps=[('standardscaler',StandardScaler()),
                        ('logistic_regr',LogisticRegression())
                       ])

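The pipelines are equivalent; only the step names differ, because `make_pipeline()` names each step after its lowercased class name:

In [32]:
print(pipe1)
print('\n--------------------------------------------\n')
print(pipe2)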

Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

--------------------------------------------

Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logistic_regr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])
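
The difference in step names is easier to see by listing them directly. This sketch rebuilds the two pipelines under new variable names (`auto_named` and `hand_named`, chosen here for illustration) and reads the names from the `steps` attribute.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

auto_named = make_pipeline(StandardScaler(), LogisticRegression())
hand_named = Pipeline(steps=[('standardscaler', StandardScaler()),
                             ('logistic_regr', LogisticRegression())])

# `steps` is a list of (name, estimator) tuples
auto_names = [name for name, _ in auto_named.steps]
hand_names = [name for name, _ in hand_named.steps]

print(auto_names)  # ['standardscaler', 'logisticregression']
print(hand_names)  # ['standardscaler', 'logistic_regr']
```

Choosing names by hand is mostly useful when you need to refer to a step later, for example through `named_steps`.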


## Examining the preprocessing module

---

The preprocessing module comes loaded with many very useful pre-processing classes.

**Data Manipulators**

- Binarizer
- KernelCenterer
- MaxAbsScaler
- MinMaxScaler
- Normalizer
- OneHotEncoder
- PolynomialFeatures
- RobustScaler
- StandardScaler

**Data Imputation**

- Imputer (replaced by `SimpleImputer` in the `sklearn.impute` module in newer scikit-learn versions)

**Function Transformer**

- FunctionTransformer

**Label Manipulators**

- LabelBinarizer
- LabelEncoder
- MultiLabelBinarizer
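
As a quick, non-exhaustive sketch, here is one class from three of the groups above applied to made-up data. The array values and label strings are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer, LabelEncoder, MinMaxScaler

X_small = np.array([[0.0, 10.0],
                    [1.0, 20.0],
                    [3.0, 40.0]])  # hypothetical numeric features

# Data manipulator: rescale each column onto the range [0, 1]
minmax_X = MinMaxScaler().fit_transform(X_small)

# Function transformer: wrap any function so it can be used as a pipeline step
log_X = FunctionTransformer(np.log1p).fit_transform(X_small)

# Label manipulator: map string labels to integers (classes sorted alphabetically)
encoder = LabelEncoder()
encoded = encoder.fit_transform(['evergreen', 'ephemeral', 'evergreen'])

print(minmax_X)
print(log_X)
print(encoder.classes_, encoded)  # ['ephemeral' 'evergreen'] [1 0 1]
```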

## Custom transformer classes

---

You can build your own transformers using Python classes. This is your first look into Python classes, but don't be intimidated!

Classes allow you to create **objects containing attributes and functions**, much like how pandas Series objects contain functions such as **`unique()`** as well as attributes such as **`values`**.

#### Build a custom transformer class

---

In [33]:
# we need to import the template classes to create a class that works like an sklearn class
from sklearn.base import BaseEstimator, TransformerMixin

# our "FeatureMultiplier" will simply multiply "X", the input, by some factor set during initialization:
class FeatureMultiplier(BaseEstimator, TransformerMixin):
    def __init__(self, factor):
        self.factor = factor

    def transform(self, X, *_):
        return X * self.factor

    def fit(self, *_):
        return self

In [37]:
fm = FeatureMultiplier(2)

test = np.diag((1,2,3,4))
print(test)

fm.transform(test)

[[1 0 0 0]
 [0 2 0 0]
 [0 0 3 0]
 [0 0 0 4]]


array([[2, 0, 0, 0],
       [0, 4, 0, 0],
       [0, 0, 6, 0],
       [0, 0, 0, 8]])

In [38]:
# Like any class with attributes, we can access and change it!

print(fm.factor)

fm.factor = 6

print(fm.factor)

fm.transform(test)

2
6


array([[ 6,  0,  0,  0],
       [ 0, 12,  0,  0],
       [ 0,  0, 18,  0],
       [ 0,  0,  0, 24]])

### Practice custom transformation classes!

---

Create a custom transformer that:

- Is initialized with a column name for a pandas DataFrame (use `'title'`)
- Accepts a pandas DataFrame and returns the column (send in `data`)

In [41]:
class ColumnExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        self.column = column

    def transform(self, dataframe, *_):
        return dataframe[self.column]

    def fit(self, *_):
        return self
    
ce = ColumnExtractor('title')

print(ce.column)

ce.transform(data)[0:5]

title


0    IBM Sees Holographic Calls Air Breathing Batte...
1    The Fully Electronic Futuristic Starting Gun T...
2    Fruits that Fight the Flu fruits that fight th...
3                  10 Foolproof Tips for Better Sleep 
4    The 50 Coolest Jerseys You Didn t Know Existed...
Name: title, dtype: object
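
A custom transformer like this becomes most useful as the first step of a pipeline, so the whole DataFrame can be passed in directly. The sketch below restates the class so that it runs on its own, with two small changes that are choices made here rather than part of the exercise: explicit `fit(X, y=None)` and `transform(X)` signatures, and the mixin listed before `BaseEstimator`, the order scikit-learn's developer guide recommends. The DataFrame `toy_df` and its contents are made up.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class ColumnExtractor(TransformerMixin, BaseEstimator):
    """Select a single column from a pandas DataFrame."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        # Nothing to learn: the column name is fixed at initialization
        return self

    def transform(self, X):
        return X[self.column]


# Hypothetical data standing in for the Kaggle DataFrame
toy_df = pd.DataFrame({
    'title': ["timeless bread recipe", "classic baking tips",
              "breaking election news", "election results today"],
    'label': [1, 1, 0, 0],
})

extract_pipeline = Pipeline([
    ('extract', ColumnExtractor('title')),   # DataFrame -> Series of strings
    ('vec', CountVectorizer(binary=True)),   # strings -> word matrix
    ('model', LogisticRegression())          # word matrix -> predictions
])

extract_pipeline.fit(toy_df, toy_df['label'])
toy_predictions = extract_pipeline.predict(toy_df)
print(toy_predictions)
```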