<img src= "img/pipelines.png" style="height:450px">


[Image Source](https://towardsdatascience.com/using-functiontransformer-and-pipeline-in-sklearn-to-predict-chardonnay-ratings-9b13fdd6c6fd)

__Agenda__

- Pipelines and Composite estimators

- Why do we need them?

- How to use them in sklearn: accessing a particular object in pipe and changing parameters

- Combining pipelines with gridsearch

- Summary and further reading

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.preprocessing import MinMaxScaler, Normalizer


# Pipelines

__What is a Pipeline?__

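_Transformers:_ Any object with a .transform method. Ex: PCA, OneHotEncoder.

_Estimators:_ Here, any object with a .predict method. Ex: RandomForestClassifier, LinearRegression etc. (The sklearn glossary calls these predictors and uses "estimator" more broadly for any object with a .fit method.)

_Pipelines:_ A tool for chaining transformers with a final estimator.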


__Other relevant tools are:__

- FeatureUnion

- TransformedTargetRegressor

__Why do we need pipelines?__

- Convenience and encapsulation

However many transformers the pipeline chains before its final estimator, we call fit and predict only once.

- Joint parameter selection, including the preprocessing steps

We can put a pipeline into a grid search and find the best parameters for all of its steps at once.

- Safety

Pipelines help avoid leaking statistics from the test data into the trained model.

[Pipelines and Composite Estimators](https://scikit-learn.org/stable/modules/compose.html#combining-estimators)
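
A minimal sketch of the safety point, on synthetic data (the `make_classification` data and the step names are illustrative assumptions, not part of this notebook). Because the scaler sits inside the pipeline, `cross_val_score` refits it on each training fold only, so the held-out fold never influences the scaling statistics.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     random_state=0)

safe_pipe = Pipeline([('scaler', StandardScaler()),
                      ('clf', LogisticRegression(max_iter=1000))])

# The whole pipeline (scaler included) is cloned and refit for every fold
scores = cross_val_score(safe_pipe, X_demo, y_demo, cv=5)
print(scores.mean())
```

Scaling the full data set first and cross-validating only the classifier would let the test folds contribute to the mean and variance used for scaling.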



## Usage of Pipelines

The Pipeline is built using a list of (key, value) pairs, where the key is a string containing the name you want to give this step and value is an estimator object:

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA 
from sklearn.impute import SimpleImputer

estimators = [('imputer', SimpleImputer()),('reduce_dim', PCA()), ('clf', SVC())]

pipe = Pipeline(estimators)

pipe



__Your Turn__

- Create your own pipeline. You can use the same transformers and estimators with different parameters and 'name'.

- You can create a new pipe with a scaler also.

In [None]:
pipe = Pipeline([('imputer', SimpleImputer()),
                 ('scaler', StandardScaler()),
                 ('clf', LogisticRegression(C=1000,
                                            max_iter=1000,
                                            solver='saga'))])

Sklearn also provides "make_pipeline", which does almost the same thing but does not require names: each step is named automatically after the lowercased class name of its estimator.

__Your Turn__

- [Check documentation: 6.1.1.1.1. Construction](https://scikit-learn.org/stable/modules/compose.html) and use make_pipeline to construct a pipeline.

In [None]:
from sklearn.pipeline import make_pipeline

In [None]:
# %load -r 1-10 supplement.py
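
One possible sketch of such a pipeline (not necessarily what supplement.py contains; the variable name `auto_pipe` is an assumption chosen so that it does not overwrite `pipe`). It shows the step names that make_pipeline generates.

```python
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# No names are given; they are derived from the class names
auto_pipe = make_pipeline(SimpleImputer(), PCA(), SVC())

step_names = [name for name, _ in auto_pipe.steps]
print(step_names)  # ['simpleimputer', 'pca', 'svc']
```

These generated names are the ones to use later when indexing the pipeline or writing parameter names.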

## Accessing steps

We have multiple ways to access an object in the pipeline:

- steps attribute

- [idx] or [name]



In [None]:
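## pipe.steps[0] returns a (name, estimator) tuple; the two lookups after it
## return the estimator itself. The name 'simpleimputer' assumes the pipe
## was built with make_pipeline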
pipe.steps[0]

pipe['simpleimputer']

pipe[0]

In [None]:
## We can also access a particular object by named_steps
## sklearn claims that tab completion should work here but 
## in my notebook it didn't

pipe.named_steps.simpleimputer

In [None]:
## We can 'slice' pipelines to create sub-pipes

pipe[1:]
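
A self-contained sketch of the access patterns side by side (the pipeline `demo_pipe` and its step names are assumptions made for illustration). All lookups return the very same imputer object; only `steps` wraps it in a tuple.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = Pipeline([('imputer', SimpleImputer()),
                      ('scaler', StandardScaler()),
                      ('clf', LogisticRegression())])

# steps holds (name, estimator) tuples
name, first_step = demo_pipe.steps[0]

# Indexing by position, by name, or through named_steps gives the estimator
same_by_index = demo_pipe[0]
same_by_name = demo_pipe['imputer']
same_by_attr = demo_pipe.named_steps.imputer

# Slicing returns a new Pipeline made of the selected steps
tail = demo_pipe[1:]
print(name, [n for n, _ in tail.steps])
```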

## Access to the parameters

Parameters of the steps in the pipeline can be accessed using the
"estimator__parameter" syntax: the step name, two underscores, then the parameter name.

In [None]:
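# 'clf' is the step name; for a pipe built with make_pipeline it would be
# the lowercased class name instead, e.g. svc__C
pipe.set_params(clf__C=10)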

pipe
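
When the exact parameter name is unclear, `get_params` lists every name the pipeline accepts. A small sketch, assuming a hypothetical two-step pipeline `param_pipe`:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

param_pipe = Pipeline([('scaler', StandardScaler()),
                       ('clf', LogisticRegression())])

# Every settable name that belongs to the 'clf' step
clf_params = sorted(k for k in param_pipe.get_params()
                    if k.startswith('clf__'))
print(clf_params)

# Several steps can be updated in a single call
param_pipe.set_params(clf__C=10, scaler__with_mean=False)
print(param_pipe.get_params()['clf__C'])
```

A misspelled name in `set_params` raises a ValueError, which makes typos easy to catch before a long fit.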

## Caching Transformers

[(6.1.1.3. Caching transformers: avoid repeated computation)](https://scikit-learn.org/stable/modules/compose.html#pipeline-chaining-estimators)

There are two ways to set the cache location through the `memory` parameter:

1. Use a temporary directory created with mkdtemp

In [None]:
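# load_boston was removed in scikit-learn 1.2; load_diabetes is a bundled
# regression dataset used here as a stand-in
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)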

In [None]:
from tempfile import mkdtemp
from shutil import rmtree
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
estimators = [('reduce_dim', PCA()), ('clf', LinearRegression())]
cachedir = mkdtemp()
pipe = Pipeline(estimators, memory=cachedir)
pipe.fit(X, y)

In [None]:
# Clear the cache directory when you don't need it anymore
rmtree(cachedir)

2. Or pass the name of the cache directory as a string

In [None]:
pipe = Pipeline(estimators, memory='cached_transformers')
pipe.fit(X, y)

In [None]:
## again, remove the directory when you are done with it
rmtree('cached_transformers')
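
A sketch of one consequence of caching, on synthetic data (the data, the step name 'reg', and the use of `TemporaryDirectory` for cleanup are assumptions for illustration). With `memory` set, the transformers are cloned before fitting, so the instance passed to the pipeline stays unfitted and the fitted copy has to be read from the pipeline.

```python
from tempfile import TemporaryDirectory

from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Synthetic regression data, used only for illustration
X_demo, y_demo = make_regression(n_samples=100, n_features=5, random_state=0)
pca = PCA(n_components=2)

# The context manager deletes the cache directory on exit
with TemporaryDirectory() as cachedir:
    cached_pipe = Pipeline([('reduce_dim', pca),
                            ('reg', LinearRegression())],
                           memory=cachedir)
    cached_pipe.fit(X_demo, y_demo)
    # The fitted PCA lives inside the pipeline
    n_fitted = cached_pipe.named_steps['reduce_dim'].components_.shape[0]

# The instance we passed in was cloned, so it has no fitted attributes
original_is_fitted = hasattr(pca, 'components_')
print(n_fitted, original_is_fitted)
```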

## Transforming target in regression

In [None]:
import numpy as np
# load_boston was removed in scikit-learn 1.2; load_diabetes is a bundled
# regression dataset used here as a stand-in
from sklearn.datasets import load_diabetes
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import QuantileTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
X, y = load_diabetes(return_X_y=True)
# n_quantiles is kept below the number of training samples to avoid a warning
transformer = QuantileTransformer(n_quantiles=100,
                                  output_distribution='normal')
regressor = LinearRegression()
# The target is transformed before fitting; predictions are mapped back
regr = TransformedTargetRegressor(regressor=regressor,
                                  transformer=transformer)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
regr.fit(X_train, y_train)

print('R2 score (transformed target): {0:.2f}'.format(regr.score(X_test, y_test)))

# Baseline: the same regressor fitted on the raw target
raw_target_regr = LinearRegression().fit(X_train, y_train)
print('R2 score (raw target): {0:.2f}'.format(raw_target_regr.score(X_test, y_test)))
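
Instead of a transformer object, TransformedTargetRegressor also accepts a pair of functions through `func` and `inverse_func`. A minimal sketch on synthetic data (the data and the names `log_regr`, `X_demo`, `y_demo` are assumptions for illustration): the target is exponential in the features, so a linear model fits its logarithm exactly.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 1, size=(200, 3))
# Strictly positive target whose log is linear in the features
y_demo = np.exp(X_demo @ np.array([1.0, 2.0, 0.5]))

# The regressor is fitted on log(y); predictions are mapped back with exp
log_regr = TransformedTargetRegressor(regressor=LinearRegression(),
                                      func=np.log, inverse_func=np.exp)
log_regr.fit(X_demo, y_demo)

r2 = log_regr.score(X_demo, y_demo)
print(round(r2, 3))
```

The score is computed on the original scale of the target, since predict already applies the inverse function.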

## Using Pipelines with GridSearchCV

In [None]:
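# Rebuild the pipe so that its step names match the parameter grids below
pipe = Pipeline([('simpleimputer', SimpleImputer()),
                 ('reduce_dim', PCA()),
                 ('svc', SVC())])
pipe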

In [None]:
# the same step__parameter names serve as keys of the grid search's param_grid

from sklearn.model_selection import GridSearchCV
param_grid = dict(simpleimputer__strategy=['mean', 'median'],
                  svc__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=param_grid)

grid_search

In [None]:
# note also that a whole step can be skipped ('passthrough') or swapped
# for another estimator by using the step name itself as a key

param_grid = dict(reduce_dim=['passthrough', PCA(5), PCA(6)],
                  svc=[SVC(gamma='auto'),
                       LogisticRegression(solver='lbfgs', max_iter=1000)],
                  svc__C=[0.1, 10, 100])

grid_search = GridSearchCV(pipe, param_grid=param_grid)
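
The cells above only construct the search. A self-contained sketch of fitting one and reading the result, on synthetic data (the data, the smaller grid, and the names `search_pipe`, `demo_grid`, `demo_search` are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     random_state=0)

search_pipe = Pipeline([('simpleimputer', SimpleImputer()),
                        ('reduce_dim', PCA()),
                        ('svc', SVC())])

# 2 strategies x 2 reduce_dim options x 2 values of C = 8 candidates
demo_grid = {'simpleimputer__strategy': ['mean', 'median'],
             'reduce_dim': ['passthrough', PCA(5)],
             'svc__C': [0.1, 10]}

demo_search = GridSearchCV(search_pipe, param_grid=demo_grid, cv=3)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(demo_search.best_score_)
```

After fitting, `demo_search.best_estimator_` is the whole pipeline refit on all the data with the best parameters, so it can be used directly for prediction.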

## Pipelines in action

In [None]:
df = pd.read_csv('data/diabetes.csv')
display(df.head(), df.shape)

In [None]:
target = df.Outcome
column_list = df.columns.tolist()

column_list.remove('Outcome')
print(column_list)

data = df[column_list]

[On Scaling data](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler)

In [None]:
# split the diabetes features (data), not the X left over from earlier cells
X_train, X_test, y_train, y_test = train_test_split(data,
                                                    target,
                                                    test_size=0.20,
                                                    stratify=target,
                                                    random_state=120919)

__Your Turn__

- Create a pipeline and use this pipeline for fitting and predicting diabetes results for the above data.


In [None]:
# %load -r 1-10 supplement.py

__Your Turn__

- Now use gridsearch with pipelines and return the best parameters

In [None]:
# %load -r 13-22 supplement.py


__Remark__


Note that even though grid search and pipelines get along very well, the options are not limitless. Try to add a random forest as a classifier in the grid search.
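
One likely stumbling block is that RandomForestClassifier has no `C` parameter, so a single grid that combines it with a `C` entry cannot be fitted. A possible workaround, sketched on synthetic data (the data and the names `clf_pipe`, `grid_list`, `rf_search` are assumptions for illustration), is to pass a list of grids, one per classifier:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     random_state=0)

clf_pipe = Pipeline([('scaler', StandardScaler()), ('clf', SVC())])

# Each dict is searched separately, so every classifier only sees
# parameters it actually has
grid_list = [
    {'clf': [SVC()], 'clf__C': [0.1, 10]},
    {'clf': [RandomForestClassifier(random_state=0)],
     'clf__n_estimators': [10, 50]},
]

rf_search = GridSearchCV(clf_pipe, param_grid=grid_list, cv=3)
rf_search.fit(X_demo, y_demo)

n_candidates = len(rf_search.cv_results_['params'])
print(n_candidates, type(rf_search.best_estimator_['clf']).__name__)
```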

## Further research and miscellaneous

- [FeatureUnion](https://scikit-learn.org/stable/modules/compose.html#featureunion-composite-feature-spaces)

- [ColumnTransformer](https://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data)

- [sklearn, dictionary of terms](https://scikit-learn.org/stable/glossary.html#term-transformer)

- [Pydata meeting on pipelines](https://www.youtube.com/watch?v=BFaadIqWlAg)

- [Another pydata talk on pipelines with FeatureUnion](https://www.youtube.com/watch?v=URdnFlZnlaE)

- [On scalers](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler)

- [A nice notebook on pipelines](https://github.com/amueller/introduction_to_ml_with_python/blob/master/06-algorithm-chains-and-pipelines.ipynb)