<img src= "img/pipelines.png" style="height:450px">


[Image Source](https://towardsdatascience.com/using-functiontransformer-and-pipeline-in-sklearn-to-predict-chardonnay-ratings-9b13fdd6c6fd)

__Agenda__

- Pipelines and Composite estimators

- Why do we need them?

- How to use them in sklearn: accessing a particular object in pipe and changing parameters

- Combining pipelines with gridsearch

- Summary and further reading

__A typical modeling workflow__

- Objective defined
- Data cleaning
- Preprocessing before modeling:
    - Scaling
    - One-hot encoding
    - Imputing missing values (including placeholder codes such as '0' or '99')
    - PCA
    - Feature engineering
        - Polynomial, log (applied to either X or y), square, interaction terms
    - Feature selection
    
- **All of this preprocessing can be done in a pipeline**

In [14]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.preprocessing import MinMaxScaler, Normalizer


# Pipelines

__What is a Pipeline?__

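_Transformers:_ Any object with a `.transform` method. Ex: PCA, OneHotEncoder.

_Estimators:_ Strictly, any object with a `.fit` method; in this notebook the term refers to the final model, which also has a `.predict` method. Ex: RandomForestClassifier, LinearRegression.

_Pipelines:_ A tool for chaining transformers with a final estimator so that they behave as a single estimator.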


__Other relevant tools are:__

- FeatureUnion

- TransformedTargetRegressor

__Why do we need pipelines?__

- Convenience and encapsulation

No matter how many transformers come before the final estimator, we call `fit` and `predict` only once on the whole pipeline.

- Joint parameter selection

We can put a pipeline into a grid search and tune the parameters of every step at once: not only the model, but also the preprocessing transformers (very helpful!).

- Safety

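Pipelines help avoid leaking statistics from your test data into the trained model, because during cross-validation the transformers are fitted only on the training portion of the data.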

[Pipelines and Composite Estimators](https://scikit-learn.org/stable/modules/compose.html#combining-estimators)
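
As a minimal sketch of the safety point, assuming synthetic data from `make_classification` (the names `X_demo`, `y_demo` and `leak_free` are illustrative and not part of the original notebook), the scaler below is refitted on the training folds of every split because it lives inside the pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used here only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# The scaler sits inside the pipeline, so in every CV split it is fitted
# on the training folds only and then applied to the held-out fold.
leak_free = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(leak_free, X_demo, y_demo, cv=5)

print(scores.mean())
```

Scaling the full dataset first and cross-validating afterwards would let the held-out folds influence the scaler's mean and variance, which is the leak the pipeline prevents.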



## Usage of Pipelines

The Pipeline is built using a list of (key, value) pairs, where the key is a string containing the name you want to give this step and value is an estimator object:

In [29]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA 
from sklearn.impute import SimpleImputer

estimators = [('imputer', SimpleImputer()),('reduce_dim', PCA()), ('clf', SVC())]

pipe = Pipeline(estimators)

pipe

Pipeline(memory=None,
         steps=[('imputer',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='mean',
                               verbose=0)),
                ('reduce_dim',
                 PCA(copy=True, iterated_power='auto', n_components=None,
                     random_state=None, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('clf',
                 SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None,
                     coef0=0.0, decision_function_shape='ovr', degree=3,
                     gamma='scale', kernel='rbf', max_iter=-1,
                     probability=False, random_state=None, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

__Your Turn__

- Create your own pipeline. You can use the same transformers and estimators with different parameters and 'name'.

- You can create a new pipe with a scaler also.

In [26]:
# %load -r 1-5 supplement.py
pipe = Pipeline([('imputer', SimpleImputer()),
                 ('scaler', StandardScaler()),
                 ('clf', LogisticRegression(C = 1000,
                                                max_iter = 1000,
                                                solver = 'saga'))])

In [16]:
from sklearn.pipeline import make_pipeline

pipe2 = make_pipeline(SimpleImputer(), StandardScaler(),
                      LogisticRegression(C=1000, max_iter=1000, solver='saga'))

sklearn also provides `make_pipeline`, a shorthand for the `Pipeline` constructor: you pass only the estimators, and each step is named automatically after its lowercased class name.
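
A short sketch of how the automatic names come out (the variables `named_demo` and `dup_demo` are illustrative only). It also covers the case where the same class appears twice:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

named_demo = make_pipeline(StandardScaler(), MinMaxScaler())
names = [name for name, _ in named_demo.steps]
print(names)

# Two steps of the same class get a numeric suffix so the names stay unique
dup_demo = make_pipeline(StandardScaler(), StandardScaler())
dup_names = [name for name, _ in dup_demo.steps]
print(dup_names)
```

These generated names are the ones to use later in `set_params` or in a grid-search parameter grid.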

__Your Turn__

-  [Check documentation: 6.1.1.1.1. Construction](https://scikit-learn.org/stable/modules/compose.html) and use make_pipeline to construct a pipeline.

In [11]:
from sklearn.pipeline import make_pipeline

In [17]:
# %load -r 26-28 supplement.py
pipe2 = make_pipeline(SimpleImputer(), StandardScaler(), LogisticRegression())

pipe2

Pipeline(memory=None,
         steps=[('simpleimputer',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='mean',
                               verbose=0)),
                ('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('logisticregression',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)

## Accessing steps

We have multiple ways to access an object in the pipeline:

- the `steps` attribute

- indexing with `[idx]` or `[name]`



In [18]:
## note that these will all give the simple imputer object of pipe2
## (steps is a list of (name, estimator) tuples, hence the extra [1])
pipe2.steps[0][1]

pipe2['simpleimputer']

pipe2[0]

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)

In [19]:
## We can also access a particular object through named_steps
## sklearn says tab completion should work here,
## but it did not in my notebook

pipe2.named_steps.simpleimputer

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)

In [20]:
## We can 'slice' pipelines to create sub-pipes

pipe2[1:]


Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('logisticregression',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)

## Access to the parameters

Parameters of any step in the pipeline can be accessed using the `<step name>__<parameter name>` syntax (note the double underscore).

In [30]:
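# rebuild the imputer -> PCA -> SVC pipeline from the start of this section
pipe = Pipeline([('imputer', SimpleImputer()), ('reduce_dim', PCA()), ('clf', SVC())])
pipe.set_params(clf__C = 10)

pipe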

Pipeline(memory=None,
         steps=[('imputer',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='mean',
                               verbose=0)),
                ('reduce_dim',
                 PCA(copy=True, iterated_power='auto', n_components=None,
                     random_state=None, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('clf',
                 SVC(C=10, break_ties=False, cache_size=200, class_weight=None,
                     coef0=0.0, decision_function_shape='ovr', degree=3,
                     gamma='scale', kernel='rbf', max_iter=-1,
                     probability=False, random_state=None, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)
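
To find out which parameter names a pipeline exposes, `get_params` can be inspected. The sketch below uses a small illustrative pipeline (`params_demo` is a hypothetical name, not from the notebook):

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

params_demo = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])

# Keys that contain a double underscore address a parameter of one step
nested = sorted(k for k in params_demo.get_params() if '__' in k)
print(nested)

# Several steps can be updated in a single call
params_demo.set_params(reduce_dim__n_components=2, clf__C=10)
print(params_demo.get_params()['clf__C'])
```

The same keys are the ones a grid search expects in its parameter grid.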

In [32]:
# Change the parameters of the pipeline after defining them
pipe.set_params(clf__cache_size=100)
pipe.set_params(imputer__strategy = 'median')

Pipeline(memory=None,
         steps=[('imputer',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='median',
                               verbose=0)),
                ('reduce_dim',
                 PCA(copy=True, iterated_power='auto', n_components=None,
                     random_state=None, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('clf',
                 SVC(C=10, break_ties=False, cache_size=100, class_weight=None,
                     coef0=0.0, decision_function_shape='ovr', degree=3,
                     gamma='scale', kernel='rbf', max_iter=-1,
                     probability=False, random_state=None, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

## Caching Transformers

[(6.1.1.3. Caching transformers: avoid repeated computation)](https://scikit-learn.org/stable/modules/compose.html#pipeline-chaining-estimators)

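**Caching stores the fitted transformers on disk. This is useful in a grid search: a transformer whose parameters and input data are unchanged is loaded from the cache instead of being refitted for every candidate.**

There are two ways of caching transformers:

1. Use `mkdtemp` to create a temporary cache directory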

In [33]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston

X, y = load_boston(return_X_y=True)

In [34]:
from tempfile import mkdtemp
from shutil import rmtree
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
estimators = [('reduce_dim', PCA()), ('clf', LinearRegression())]

# create a temp folder that will store the fitted transformer (PCA); the final estimator is never cached
cachedir = mkdtemp()
pipe = Pipeline(estimators, memory=cachedir) 
pipe.fit(X,y)

Pipeline(memory='/var/folders/j4/x_vwn42s7nn6tdn2j0r9ljb40000gn/T/tmpc6m28hh9',
         steps=[('reduce_dim',
                 PCA(copy=True, iterated_power='auto', n_components=None,
                     random_state=None, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('clf',
                 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                                  normalize=False))],
         verbose=False)

In [35]:
# Clear the cache directory when you don't need it anymore
rmtree(cachedir)

2. Pass the path of a cache directory as a string

In [36]:
pipe = Pipeline(estimators, memory='cached_transformers')
pipe.fit(X,y)

Pipeline(memory='cached_transformers',
         steps=[('reduce_dim',
                 PCA(copy=True, iterated_power='auto', n_components=None,
                     random_state=None, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('clf',
                 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                                  normalize=False))],
         verbose=False)

In [37]:
## again, remove the cache directory when you are done with it
rmtree('cached_transformers')

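**However, caching makes the pipeline fit clones of the transformers, so the transformer instances you passed in stay unfitted and cannot be inspected directly. Access the fitted transformers through the pipeline's `steps` or `named_steps` instead.**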
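
A minimal sketch of this side effect, assuming synthetic regression data (the names `X_demo`, `y_demo`, `cache_dir` and `cached_pipe` are illustrative only):

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

X_demo, y_demo = make_regression(n_samples=100, n_features=5, random_state=0)

pca = PCA(n_components=2)
cache_dir = mkdtemp()
cached_pipe = Pipeline([('reduce_dim', pca), ('reg', LinearRegression())],
                       memory=cache_dir)
cached_pipe.fit(X_demo, y_demo)

# The instance we passed in was cloned before fitting, so it is still unfitted
original_is_fitted = hasattr(pca, 'components_')
# The fitted clone lives inside the pipeline
fitted_pca = cached_pipe.named_steps['reduce_dim']
print(original_is_fitted, fitted_pca.components_.shape)

# Clear the cache directory once it is no longer needed
rmtree(cache_dir)
```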

## Transforming the target in regression

A pipeline transforms only the features X, not the target y. To transform the target, wrap the regressor in `TransformedTargetRegressor`.

In [39]:
import numpy as np
from sklearn.datasets import load_boston
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import QuantileTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_boston(return_X_y=True)

transformer = QuantileTransformer(output_distribution='normal')
regressor = LinearRegression()
regr = TransformedTargetRegressor(regressor=regressor,
                                  transformer=transformer)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# applies transformation to y
regr.fit(X_train, y_train)

# has inverse transform already included
print('R2 score: {0:.2f}'.format(regr.score(X_test, y_test)))

raw_target_regr = LinearRegression().fit(X_train, y_train)
print('R2 score: {0:.2f}'.format(raw_target_regr.score(X_test, y_test)))

R2 score: 0.67
R2 score: 0.64




In [41]:
# predict() applies the inverse transformation automatically,
# so the predictions come back on the original scale of y (not the transformed scale)
regr.predict(X_test)

array([23.27152702, 23.4       , 33.2       , 10.9       , 21.4       ,
       19.62021138, 21.2       , 21.1       , 19.2       , 20.1       ,
        5.        , 14.9       , 17.4       ,  6.65004785, 50.        ,
       34.61034559, 22.7       , 44.34453699, 32.9867652 , 22.5       ,
       23.9       , 24.20172669, 20.05133096, 31.70093439, 22.        ,
       12.7       , 17.8       , 18.58001694, 43.42653849, 20.04348536,
       18.2       , 18.4       , 19.9404876 , 22.57570934, 29.6       ,
       20.6       ,  8.5       , 25.        , 17.86265554, 14.5       ,
       25.28479391, 20.1       , 21.41578664, 16.08486244, 21.7       ,
       23.97642227, 20.1       , 22.8       ,  9.36997255, 23.2       ,
       21.27308207, 16.60345899, 23.2       , 30.79044804, 14.31310147,
       20.42443604, 19.86302251, 15.17376169, 11.20175978, 22.13741921,
       17.60180754, 20.9       , 36.16350157, 34.83675554, 17.8       ,
       35.4       , 18.87210869, 18.5       , 17.8       , 22.3 
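
Besides a transformer object, `TransformedTargetRegressor` also accepts a pair of functions through `func` and `inverse_func`. The following sketch applies a log transform to the target; it assumes synthetic data whose logarithm is linear in the features, and the names `X_demo`, `y_demo` and `log_regr` are illustrative only:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
# Synthetic positive target whose logarithm is linear in the features
y_demo = np.exp(X_demo @ np.array([0.5, -0.2, 0.1])
                + rng.normal(scale=0.1, size=200))

log_regr = TransformedTargetRegressor(regressor=LinearRegression(),
                                      func=np.log, inverse_func=np.exp)
log_regr.fit(X_demo, y_demo)

# predict() applies inverse_func, so predictions are on the original scale
pred_demo = log_regr.predict(X_demo)
print(pred_demo[:3])
```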

**FeatureUnion: the parallel counterpart of a pipeline. Instead of chaining steps one after another, every transformer in the union is applied to the same input data, and the outputs are concatenated side by side into one feature matrix.**
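
A small sketch of a union on the iris data bundled with sklearn (the choice of `PCA` plus `SelectKBest` and the variable names are illustrative assumptions, not taken from the notebook):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X_iris, y_iris = load_iris(return_X_y=True)

union = FeatureUnion([('pca', PCA(n_components=2)),
                      ('kbest', SelectKBest(k=1))])

# Both transformers see the same 4 input columns; their outputs
# (2 PCA components + 1 selected feature) are stacked side by side.
X_union = union.fit_transform(X_iris, y_iris)
print(X_union.shape)
```

A `FeatureUnion` is itself a transformer, so it can be used as one step of a `Pipeline`.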

## Using Pipelines with GridSearchCV

In [43]:
pipe2

Pipeline(memory=None,
         steps=[('simpleimputer',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='mean',
                               verbose=0)),
                ('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('logisticregression',
                 LogisticRegression(C=1000, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=1000,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='saga', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)

In [44]:
## we can also tune the parameters of each step in the grid search

from sklearn.model_selection import GridSearchCV
param_grid = dict(simpleimputer__strategy=['mean', 'median'],
                  logisticregression__C=[0.1, 10, 100]) # a log-spaced grid, e.g. np.logspace, is often better
grid_search = GridSearchCV(pipe2, param_grid=param_grid, cv=5)
# 2 x 3 = 6 candidates, each fitted on cv=5 folds -> 30 fits (plus one final refit)

grid_search

GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('simpleimputer',
                                        SimpleImputer(add_indicator=False,
                                                      copy=True,
                                                      fill_value=None,
                                                      missing_values=nan,
                                                      strategy='mean',
                                                      verbose=0)),
                                       ('standardscaler',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('logisticregression',
                                        LogisticRegression(C=1000,
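
As a rough sketch of how the number of fits grows, the grid below swaps the hand-picked `C` values for a log-spaced range and counts the candidates with `ParameterGrid` (the name `logspace_grid` and the chosen range are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

logspace_grid = dict(simpleimputer__strategy=['mean', 'median'],
                     logisticregression__C=np.logspace(-2, 2, 5))

# Every combination of values is one candidate
n_candidates = len(ParameterGrid(logspace_grid))
n_folds = 5
# Each candidate is fitted once per fold
print(n_candidates, n_candidates * n_folds)
```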
                                           

In [None]:
## note also that we can skip or swap whole steps in the pipeline
## (the keys must match pipe2's auto-generated step names)

param_grid = dict(standardscaler=['passthrough', PCA(5), PCA(6)], # 'passthrough' skips the step entirely
                  logisticregression=[SVC(gamma = 'auto'), LogisticRegression(solver = 'lbfgs',max_iter =1000)], # try 2 models in the last step: SVC, LR
                  logisticregression__C=[0.1, 10, 100]) # C exists on both SVC and LogisticRegression

grid_search = GridSearchCV(pipe2, param_grid=param_grid)



## Pipelines in action

In [48]:
df = pd.read_csv('data/diabetes.csv')
display(df.head(),df.shape)

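| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |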


(768, 9)

In [49]:
target = df.Outcome
column_list = df.columns.tolist()

column_list.remove('Outcome')
print(column_list)

data = df[column_list]

['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']


In [50]:
X = data

[On Scaling data](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler)

In [51]:
X_train, X_test, y_train, y_test = train_test_split(X,
                target,
                test_size=0.20,
                stratify= target,
                random_state = 120919)

__Your Turn__

- Create a pipeline and use this pipeline for fitting and predicting diabetes results for the above data.


In [53]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [55]:
estimators = [('reduce_dim', PCA(n_components=5)), ('clf', SVC())]
my_pipe = Pipeline(estimators)
my_pipe

In [52]:
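# a pipeline has exactly one final estimator; alternative models go in the grid
my_pipe = Pipeline([('ss', StandardScaler()), ('pca', PCA()), ('clf', SVC())])
param_grid = dict(clf=[SVC(), LogisticRegression(max_iter=1000)])
grid_search = GridSearchCV(my_pipe, param_grid=param_grid)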

In [None]:
# %load -r 1-10 supplement.py
pipe = Pipeline([('imputer', SimpleImputer()),
                 ('scaler', StandardScaler()),
                 ('clf', LogisticRegression(C = 1000,
                                                max_iter = 1000,
                                                solver = 'saga'))])

## fit the whole pipeline, then report accuracy on the training data

pipe.fit(X_train, y_train)
pipe.score(X_train, y_train)

__Your Turn__

- Now use gridsearch with pipelines and return the best parameters

In [None]:
# %load -r 14-23 supplement.py
param_grid = dict(scaler=['passthrough',MinMaxScaler() , Normalizer(), StandardScaler()],
                  clf = [SVC(gamma = 'auto'),
                         LogisticRegression(solver = 'lbfgs',max_iter =1000)],
                  clf__C = [0.1, 10, 100])

gs = GridSearchCV(pipe, param_grid=param_grid, cv = 3, verbose = 1)

gs.fit(X_train, y_train)

gs.best_params_

__Remark__


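Grid search and pipelines get along very well, but the options are not limitless: within a single grid, every parameter must exist on every candidate estimator. Try adding `RandomForestClassifier` as a classifier in the grid search above: it fails because a random forest has no `C` parameter. Passing a list of grids, one per family of models, avoids this.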
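
One way around this limitation is to pass `GridSearchCV` a list of grids. This is a sketch on synthetic data, with illustrative names (`rf_pipe`, `grid_list`, `rf_search`) and deliberately small grids:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=150, n_features=8, random_state=0)

rf_pipe = Pipeline([('scaler', StandardScaler()), ('clf', SVC())])

# One grid per family of models, each with the parameters that family has
grid_list = [
    dict(clf=[SVC(gamma='auto'), LogisticRegression(max_iter=1000)],
         clf__C=[0.1, 10]),
    dict(clf=[RandomForestClassifier(random_state=0)],
         clf__n_estimators=[10, 50]),
]

rf_search = GridSearchCV(rf_pipe, param_grid=grid_list, cv=3)
rf_search.fit(X_demo, y_demo)
print(rf_search.best_params_)
```

Each dictionary is expanded separately, so `clf__C` is only ever set on the estimators that have it.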

## Further research and miscellaneous

- [FeatureUnion](https://scikit-learn.org/stable/modules/compose.html#featureunion-composite-feature-spaces)

- [ColumnTransformer](https://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data)

- [sklearn, dictionary of terms](https://scikit-learn.org/stable/glossary.html#term-transformer)

- [Pydata meeting on pipelines](https://www.youtube.com/watch?v=BFaadIqWlAg)

- [Another pydata talk on pipelines with FeatureUnion](https://www.youtube.com/watch?v=URdnFlZnlaE)

- [On scalers](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler)

- [A nice notebook on pipelines](https://github.com/amueller/introduction_to_ml_with_python/blob/master/06-algorithm-chains-and-pipelines.ipynb)