### Pipelines

<include type="image">

###### This is not a pipe
    
![](../../img/cecinestpas.jpg)

</include>


So far, we've established a general workflow where we:


1. Clean the data  
   Fill/impute/drop `NaN` values  

   One-hot encode categorical variables

   Label-encode target if categorical

   Check for skew / deskew

1. Preprocess the data

   Feature selection (`SelectKBest, SelectFromModel, SelectPercentile, RFE`, etc.)

   Scaling (`StandardScaler, MinMaxScaler`)

1. Modeling

   Classification (`KNeighborsClassifier, LogisticRegression`, etc.)

   Regression (`Lasso, Ridge, ElasticNet`, etc.)
  
For every dataset, we've done some version of all of these. Pipelines give us a convenient way to chain these tasks together. As a result, we can feed cleaned data into a pipeline and get a trained model at the end!
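As a rough preview of what "chaining" means, the sketch below fits a scaler and a `Lasso` by hand and then does the same thing with a `Pipeline`. The synthetic data and the names `X_demo`, `y_demo`, and `demo_pipe` are made up for illustration and are not part of the bike sharing example.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Manual chaining: fit the scaler, transform, then fit the model
manual_scaler = StandardScaler().fit(X_demo)
manual_model = Lasso(alpha=0.1).fit(manual_scaler.transform(X_demo), y_demo)
manual_preds = manual_model.predict(manual_scaler.transform(X_demo))

# The same two steps chained in a Pipeline
demo_pipe = Pipeline([('scaler', StandardScaler()), ('lasso', Lasso(alpha=0.1))])
demo_pipe.fit(X_demo, y_demo)
pipe_preds = demo_pipe.predict(X_demo)

# Both routes should produce the same predictions
print(np.allclose(manual_preds, pipe_preds))
```

The pipeline version has one object to fit and one object to predict with, so there is no way to forget to transform the data before predicting.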

In [2]:
###### Import the Python Numerical Stack

import pandas as pd
import numpy as np

In [3]:
###### Load bike sharing data

bike_data = pd.read_csv('day.csv',index_col=0)

In [4]:
###### Display head of `DataFrame`

bike_data.head()

| instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.160446 | 331 | 654 | 985 |
| 2 | 2011-01-02 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 131 | 670 | 801 |
| 3 | 2011-01-03 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 120 | 1229 | 1349 |
| 4 | 2011-01-04 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.2 | 0.212122 | 0.590435 | 0.160296 | 108 | 1454 | 1562 |
| 5 | 2011-01-05 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.22927 | 0.436957 | 0.1869 | 82 | 1518 | 1600 |


#### Clean the data

In [6]:
###### Convert the `dteday` column to datetime using `pd.to_datetime()`

bike_data['datetime'] = pd.to_datetime(bike_data['dteday'])

In [8]:
###### Make a feature for the day of week

bike_data['dayofweek'] = bike_data['datetime'].apply(lambda x: x.dayofweek)

In [9]:
###### Make a feature for month

bike_data['month'] = bike_data['datetime'].apply(lambda x: x.month)

In [10]:
###### Make a feature for hour

bike_data['hour'] = bike_data['datetime'].apply(lambda x: x.hour)

In [11]:
###### Drop the datetime column

bike_data.drop('datetime', axis=1, inplace=True)
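The date features above are built with `.apply` and a lambda. A minimal sketch of an equivalent approach uses pandas' vectorized `.dt` accessor. The three-row frame `demo` is a hypothetical stand-in for the `dteday` column, not the real data.

```python
import pandas as pd

# Hypothetical three-row frame mimicking the `dteday` column
demo = pd.DataFrame({'dteday': ['2011-01-01', '2011-01-02', '2011-01-03']})
dates = pd.to_datetime(demo['dteday'])

demo['dayofweek'] = dates.dt.dayofweek  # Monday=0 ... Sunday=6
demo['month'] = dates.dt.month
demo['hour'] = dates.dt.hour            # always 0 for date-only strings

print(demo)
```

Because the source strings carry no time of day, the `hour` feature is constant here and adds no information to a model.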

In [22]:
###### Split the data into features and target

features = bike_data.drop(
    ['cnt', 'registered', 'casual'],
    axis=1
)

target = bike_data['cnt']

In [30]:
###### Get dummies of categorical columns

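# NOTE: the humidity column in this dataset is named 'hum' (see the head above),
# so 'humidity' matches nothing and 'hum' is one-hot encoded with the categorical columns.
num_cols = ['temp','atemp','humidity','windspeed']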
cat_cols = [i for i in features.columns if i not in num_cols]

features_dummies = pd.get_dummies(features, columns=cat_cols)

#### `Pipeline`

`Pipeline` is a class in `sklearn` that allows us to chain steps together. 

We add steps to the pipeline using a list of tuples of the form `[('step name', sklearn object)...]`

Let's make a `Pipeline` that scales the data and fits a `Lasso` model.

In [31]:
###### Load necessary classes and functions from Scikit-Learn 

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

In [32]:
###### Train-test split your data

X_train, X_test, y_train, y_test = train_test_split(features_dummies, target, random_state = 42)

In [33]:
###### Instantiate your pipeline

simple_pipe = Pipeline([('scaler',StandardScaler()), ('lasso', Lasso())])

In [34]:
###### Fit the pipeline to your training features and target

simple_pipe.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('lasso', Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False))])

In [35]:
###### What's your train r2 score?

simple_pipe.score(X_train, y_train)

0.9998648476654407

In [36]:
###### What's your test r2 score?

simple_pipe.score(X_test, y_test)

0.8329634849239228

We now have a fitted `Pipeline` object that scores just like any other model. It consists of a `StandardScaler` and a `Lasso`. What properties does this `Pipeline` have?

- `.steps` gives you a list of tuples containing the name of each step and the fitted object for that step.
- `.named_steps` gives you a dictionary of your pipeline objects, where the keys are the step names and the values are the fitted sklearn objects.

In [37]:
###### Look at the steps

simple_pipe.steps

[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)),
 ('lasso', Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
     normalize=False, positive=False, precompute=False, random_state=None,
     selection='cyclic', tol=0.0001, warm_start=False))]

In [38]:
###### Look at the named steps

simple_pipe.named_steps

{'scaler': StandardScaler(copy=True, with_mean=True, with_std=True),
 'lasso': Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
    normalize=False, positive=False, precompute=False, random_state=None,
    selection='cyclic', tol=0.0001, warm_start=False)}

We can access each step and use it if we'd like. Let's look at the mean and standard deviation of our data from the scaler object.

In [39]:
###### Display means from `StandardScaler`

simple_pipe.steps[0][1].mean_

array([ 0.50145839,  0.48048566,  0.18949707, ...,  0.07664234,
        0.07664234,  1.        ])

In [40]:
###### Display standard deviations from `StandardScaler` (`.std_` is deprecated, use `.scale_`)

simple_pipe.named_steps['scaler'].scale_

array([ 0.18158071,  0.16167184,  0.0762798 , ...,  0.2660231 ,
        0.2660231 ,  1.        ])

In [41]:
simple_pipe.named_steps['lasso'].coef_

array([  6.36296770e+02,   1.27054620e+02,  -1.25504305e+02, ...,
         0.00000000e+00,  -5.30034457e-01,   0.00000000e+00])
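To make the meaning of `mean_` and `scale_` concrete, here is a small sketch on synthetic data (the names `X_demo`, `y_demo`, and `demo_pipe` are made up for illustration). It checks that the fitted scaler inside a pipeline stores the per-column mean and standard deviation of the training data, and that `.steps` and `.named_steps` point at the same fitted object.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
rng = np.random.RandomState(1)
X_demo = rng.uniform(size=(50, 4))
y_demo = X_demo.sum(axis=1)

demo_pipe = Pipeline([('scaler', StandardScaler()), ('lasso', Lasso(alpha=0.01))])
demo_pipe.fit(X_demo, y_demo)

fitted_scaler = demo_pipe.named_steps['scaler']

# mean_ holds the column means of the data the pipeline was fit on
means_match = np.allclose(fitted_scaler.mean_, X_demo.mean(axis=0))
# scale_ holds the population standard deviation (ddof=0) of each column
scales_match = np.allclose(fitted_scaler.scale_, X_demo.std(axis=0))
# Positional and named access return the same fitted object
same_object = demo_pipe.steps[0][1] is fitted_scaler

print(means_match, scales_match, same_object)
```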

####  `make_pipeline`

While `Pipeline` gives us the ability to explicitly name our steps, this can be cumbersome, especially when we don't care what our steps are named. If this is the case, we use `make_pipeline`.

Let's execute the same pipeline, except this time we'll use `make_pipeline`.

In [42]:
###### Import `make_pipeline` helper function 

from sklearn.pipeline import make_pipeline

In [43]:
###### Define a second `Pipeline` using `make_pipeline`

another_pipe = make_pipeline(
    StandardScaler(),
    Lasso()
)

In [44]:
###### Fit second `Pipeline`

another_pipe.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('lasso', Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False))])

In [45]:
###### Display training score

another_pipe.score(X_train, y_train)

0.9998648476654407

In [46]:
###### Display testing score

another_pipe.score(X_test, y_test)

0.8329634849239228

Even though we don't name the steps ourselves, the `Pipeline` returned by `make_pipeline` still has a `.named_steps` attribute. `make_pipeline` automatically assigns a name to each step (the lowercased class name), and we can access the steps the same way we did before.

In [47]:
###### Display named steps for `Pipeline`

another_pipe.named_steps

{'standardscaler': StandardScaler(copy=True, with_mean=True, with_std=True),
 'lasso': Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
    normalize=False, positive=False, precompute=False, random_state=None,
    selection='cyclic', tol=0.0001, warm_start=False)}

In [48]:
###### Display steps for `Pipeline`

another_pipe.steps

[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)),
 ('lasso', Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
     normalize=False, positive=False, precompute=False, random_state=None,
     selection='cyclic', tol=0.0001, warm_start=False))]
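One detail of the automatic naming is worth a quick sketch: when the same class appears more than once, `make_pipeline` appends a numeric suffix so the names stay unique. The pipeline below is a made-up example (scaling twice is not a recommendation) that only illustrates the naming.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# StandardScaler appears twice, so its two steps get numbered names
demo_pipe = make_pipeline(StandardScaler(), MinMaxScaler(), StandardScaler())

step_names = [name for name, _ in demo_pipe.steps]
print(step_names)
```

If you need stable, readable names (for example, to refer to a step in a parameter grid), naming the steps yourself with `Pipeline` is usually the safer choice.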

#### Aside: Transformation pipelines

Although it's standard to have a pipeline end in a model, it's also possible to have a pipeline just for transformers, as shown below:

In [49]:
###### Import transformers from Scikit-Learn 

from sklearn.feature_selection import SelectFromModel, SelectKBest, f_regression

In [51]:
###### Make transformation pipeline

transformer_pipe = make_pipeline(
    SelectKBest(score_func=f_regression, k=40),
    StandardScaler(),
    SelectFromModel(Lasso())
)

In [52]:
###### Fit transformation pipeline

transformer_pipe.fit(X_train, y_train)


Pipeline(memory=None,
     steps=[('selectkbest', SelectKBest(k=40, score_func=<function f_regression at 0x7f4a65c9c840>)), ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('selectfrommodel', SelectFromModel(estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False),
        norm_order=1, prefit=False, threshold=None))])

In [53]:
###### Use transformation pipeline to transform data

features_skb_scaled_sfm = transformer_pipe.transform(X_train)

In [54]:
###### Show shape of training set

X_train.shape

(548, 1381)

In [55]:
###### Show shape of transformed data

features_skb_scaled_sfm.shape

(548, 32)

In [56]:
###### Show steps in transformation pipeline

transformer_pipe.steps

[('selectkbest',
  SelectKBest(k=40, score_func=<function f_regression at 0x7f4a65c9c840>)),
 ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)),
 ('selectfrommodel',
  SelectFromModel(estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
     normalize=False, positive=False, precompute=False, random_state=None,
     selection='cyclic', tol=0.0001, warm_start=False),
          norm_order=1, prefit=False, threshold=None))]

In [57]:
from sklearn.preprocessing import FunctionTransformer

In [63]:
###### Make transformation pipeline

transformer_pipe = make_pipeline(
    SelectKBest(score_func=f_regression, k=40),
    FunctionTransformer(lambda x: x + 1),
    FunctionTransformer(np.log),    
    StandardScaler(),
    SelectFromModel(Lasso())
)

In [64]:
transformer_pipe.fit(X_train, y_train)


Pipeline(memory=None,
     steps=[('selectkbest', SelectKBest(k=40, score_func=<function f_regression at 0x7f4a65c9c840>)), ('functiontransformer-1', FunctionTransformer(accept_sparse=False,
          func=<function <lambda> at 0x7f4a65c43ea0>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass_y='deprecated',
...ction='cyclic', tol=0.0001, warm_start=False),
        norm_order=1, prefit=False, threshold=None))])
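The two `FunctionTransformer` steps above add 1 and then take the natural log. As a hedged alternative sketch, a single `FunctionTransformer(np.log1p)` should compute the same thing in one step. The tiny array `X_demo` is made up for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Small non-negative array, used only for illustration
X_demo = np.array([[0.0, 1.0], [3.0, 7.0]])

# Two steps: x -> x + 1, then x -> log(x)
two_step = make_pipeline(
    FunctionTransformer(lambda x: x + 1),
    FunctionTransformer(np.log),
)

# One step: x -> log(1 + x)
one_step = FunctionTransformer(np.log1p)

two_step_out = two_step.fit_transform(X_demo)
one_step_out = one_step.fit_transform(X_demo)

print(np.allclose(two_step_out, one_step_out))
```

Using a named function such as `np.log1p` instead of a lambda also avoids trouble if the pipeline is later pickled, since lambdas cannot be pickled by the standard `pickle` module.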

#### Using Pipelines with `GridSearchCV`

So far, we've only chained transformers and models together in a `Pipeline`. What if we want to use `GridSearchCV` to tune our model in the `Pipeline`?

Since we have to refer to our steps by name, let's use `Pipeline` instead of `make_pipeline`. 

Let's make a pipeline with the following steps:

    ('skb', SelectKBest(score_func=f_regression, k=40)),
    ('scaler', StandardScaler()),
    ('sfm', SelectFromModel(Lasso())),
    ('regr', ElasticNet())

In [65]:
###### Load necessary models

from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.linear_model import ElasticNet

In [66]:
###### Create model pipeline

pipe_for_gs = Pipeline([
    ('skb', SelectKBest(score_func=f_regression, k=40)),
    ('scaler', StandardScaler()),
    ('sfm', SelectFromModel(Lasso())),
    ('regr', ElasticNet())
])

In [67]:
pipe_for_gs.steps

[('skb',
  SelectKBest(k=40, score_func=<function f_regression at 0x7f4a65c9c840>)),
 ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)),
 ('sfm',
  SelectFromModel(estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
     normalize=False, positive=False, precompute=False, random_state=None,
     selection='cyclic', tol=0.0001, warm_start=False),
          norm_order=1, prefit=False, threshold=None)),
 ('regr', ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
        max_iter=1000, normalize=False, positive=False, precompute=False,
        random_state=None, selection='cyclic', tol=0.0001, warm_start=False))]

Next, let's make our parameter grid. When using a `Pipeline`, we need to specify which step each parameter belongs to. To do that, we use the name we gave the step (in this case `'regr'` for `ElasticNet`), followed by a **dunder** (double underscore) and the name of the parameter for that model.

As an example, if we wanted to tune `ElasticNet`'s `l1_ratio` parameter, we would use `'regr__l1_ratio': [.1, .5, .9]`.

Let's fill out the params below to tune `alpha` and `l1_ratio`:

In [68]:
###### Define parameter grid

params = {
    'regr__l1_ratio':[.1,.3,.5,.7,.9],
    'regr__alpha':np.logspace(-3,3,7)
}
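If you are unsure which parameter names a pipeline accepts, a minimal sketch like the one below lists them with `get_params()`. The two-step `demo_pipe` is a simplified stand-in for `pipe_for_gs`, built only for illustration.

```python
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simplified stand-in pipeline with a step named 'regr'
demo_pipe = Pipeline([('scaler', StandardScaler()), ('regr', ElasticNet())])

# Every tunable parameter is listed as '<step name>__<parameter name>'
regr_params = sorted(k for k in demo_pipe.get_params() if k.startswith('regr__'))
print(regr_params)

# The same names work with set_params
demo_pipe.set_params(regr__alpha=0.1, regr__l1_ratio=0.9)
print(demo_pipe.named_steps['regr'].alpha, demo_pipe.named_steps['regr'].l1_ratio)
```

Any key in a `GridSearchCV` parameter grid for a pipeline must be one of the names returned by `get_params()`.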

Now pass your pipeline into `GridSearchCV` with your parameters, using `ShuffleSplit` for cross-validation.

In [70]:
###### Create grid search model pipeline

gspipe = GridSearchCV(
    pipe_for_gs, 
    param_grid=params, 
    cv=ShuffleSplit(n_splits=5, random_state=42), 
    n_jobs=-1
)

In [72]:
import warnings

warnings.filterwarnings("ignore")

In [73]:
###### Fit the grid search model pipeline

gspipe.fit(X_train, y_train.ravel())

GridSearchCV(cv=ShuffleSplit(n_splits=5, random_state=42, test_size='default',
       train_size=None),
       error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('skb', SelectKBest(k=40, score_func=<function f_regression at 0x7f4a65c9c840>)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('sfm', SelectFromModel(estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precom...alse, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'regr__l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9], 'regr__alpha': array([  1.00000e-03,   1.00000e-02,   1.00000e-01,   1.00000e+00,
         1.00000e+01,   1.00000e+02,   1.00000e+03])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [74]:
###### Display best score for grid search model pipeline

gspipe.best_score_

0.78172296797392005

To get the `.steps` or `.named_steps`, we need to access `GridSearchCV`'s `.best_estimator_` attribute, which contains our refit `Pipeline`. How do we access our model? Our scaler?

In [75]:
###### Display named steps for best estimator

gspipe.best_estimator_.named_steps

{'skb': SelectKBest(k=40, score_func=<function f_regression at 0x7f4a65c9c840>),
 'scaler': StandardScaler(copy=True, with_mean=True, with_std=True),
 'sfm': SelectFromModel(estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
    normalize=False, positive=False, precompute=False, random_state=None,
    selection='cyclic', tol=0.0001, warm_start=False),
         norm_order=1, prefit=False, threshold=None),
 'regr': ElasticNet(alpha=0.01, copy_X=True, fit_intercept=True, l1_ratio=0.5,
       max_iter=1000, normalize=False, positive=False, precompute=False,
       random_state=None, selection='cyclic', tol=0.0001, warm_start=False)}

In [77]:
###### Extract the regression model from the best estimator

best_enet = gspipe.best_estimator_.named_steps['regr']

In [79]:
best_enet.coef_

array([ 378.87721844,  453.72992468, -148.81005573,   31.9290364 ,
        -44.79472115,   80.20697489,  -21.38346023, -509.73015276,
       -107.56767034,  143.505641  , -996.71026257,   15.31506559,
         71.47075243,   47.08860356,   43.04100281,  -34.3541221 ,
         12.95372759,  108.20101907,   74.14514789,  -29.95968775,
        347.24735486, -237.61765384,   47.44770019,   31.95266437,
         15.29584749,   71.46501315,   47.09969187,   43.06459276,
        -34.30972345,   12.98314759,  108.22568078,   74.1803482 ])

In [49]:
###### Display `SelectFromModel` transformer

gspipe.best_estimator_.named_steps['sfm']

SelectFromModel(estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False),
        norm_order=1, prefit=False, threshold=None)
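To recap the whole round trip in one place, here is a compact sketch on synthetic data. The dataset from `make_regression`, the reduced two-step pipeline, and the small grid are assumptions chosen to keep the example fast, not the settings used above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, used only for illustration
X_demo, y_demo = make_regression(n_samples=120, n_features=8, noise=5.0, random_state=0)

demo_search = GridSearchCV(
    Pipeline([('scaler', StandardScaler()), ('regr', ElasticNet())]),
    param_grid={'regr__alpha': [0.01, 0.1, 1.0], 'regr__l1_ratio': [0.2, 0.8]},
    cv=ShuffleSplit(n_splits=3, test_size=0.25, random_state=42),
)
demo_search.fit(X_demo, y_demo)

# best_estimator_ is the pipeline refit on all the data with the best parameters
best_pipe = demo_search.best_estimator_
best_regr = best_pipe.named_steps['regr']      # the tuned model
best_scaler = best_pipe.named_steps['scaler']  # the fitted scaler

print(demo_search.best_params_)
print(best_regr.coef_.shape, best_scaler.mean_.shape)
```

Because the scaler lives inside the pipeline, it is refit on the training portion of every split, so the held-out portion never influences the scaling statistics.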