# Crash Course on Sklearn pipelines

[Sklearn pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) allow you to `stack the different steps of the modelling pipeline` (data transformation, hyper-parameter tuning and model estimation) into a single object.
The pipeline is fitted on the training data and can then be used to transform new data and compute predictions on it.

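Each element of the pipeline must have a `fit` method, which learns from the training data whatever the step needs (for example, which features contain missing values and should be removed). Every step except the last one must also have a `transform` method that applies the learned action to the data.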

The two main families of pipeline components are:

- Transformers: Define how to process data (imputation, outliers, ...).
- Estimators: Fit a model and compute predictions (classifier, grid search engine, ...).
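
To make the two families concrete, here is a minimal sketch with one transformer and one estimator used on their own, outside any pipeline. The toy arrays `X_demo` and `y_demo` and the choice of `LogisticRegression` are assumptions made for illustration only; they are not part of this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data with one missing value in the first column.
X_demo = np.array([[1.0, 2.0], [np.nan, 0.0], [3.0, 4.0], [5.0, 6.0]])
y_demo = np.array([0, 0, 1, 1])

# Transformer: fit learns the column means, transform fills the gaps with them.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit(X_demo).transform(X_demo)

# Estimator: fit learns the model, predict returns one label per row.
clf = LogisticRegression()
labels = clf.fit(X_filled, y_demo).predict(X_filled)

print(X_filled[1, 0])  # mean of 1, 3 and 5
print(labels.shape)
```

A pipeline chains exactly these calls: `transform` on every intermediate step, then `fit` or `predict` on the last one.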


Pipelines are easy to use and ensure that the same treatment is applied to all data sets. Many standard components are available (we will show some of them), and it is also easy to write customized components (classes) tailored to your needs.

## Set-up

Create a data set to play with and load the libraries.

- For the current notebook we use a small artificial dataset created with `make_classification`.
  - We use 500 observations.
  - We use the default values for the number of relevant, redundant, ... features.
- We also load the `Pipeline` class from Sklearn.

In [3]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.datasets import make_classification

We make our own data creating function so that we can directly play with dataframes. 

In [4]:
def make_X_y():
    X, y = make_classification(n_samples=500)
    X = pd.DataFrame(X)
    X.columns = ['var_'+str(i) for i in range(0, X.shape[1])]
    y = pd.Series(y)
    nan_loc = [(2, 3), (17, 1), (4, 12)]
    for loc in nan_loc:
        X.iloc[loc] = np.nan
    
    return X, y

X, y = make_X_y()

In [5]:
X.sample(4)

| | var_0 | var_1 | var_2 | var_3 | var_4 | var_5 | var_6 | var_7 | var_8 | var_9 | var_10 | var_11 | var_12 | var_13 | var_14 | var_15 | var_16 | var_17 | var_18 | var_19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 255 | -0.070558 | -0.003146 | -0.392045 | -0.665389 | -0.778423 | 0.616162 | -0.666926 | -1.517769 | -0.713532 | -1.036467 | 2.570161 | -0.285801 | 1.563787 | 0.310235 | 0.271145 | -0.677554 | -1.616693 | -0.391866 | 1.035647 | 0.966569 |
| 385 | 1.788208 | 1.376253 | 0.181644 | -1.186774 | 0.091714 | 0.453242 | -1.443753 | 1.138081 | -0.557411 | 0.032735 | 0.382635 | 0.00227 | -0.297153 | -1.272223 | 0.168665 | -0.181349 | 1.229209 | 0.177103 | 0.1932 | 2.037883 |
| 209 | 1.794388 | -0.48957 | 0.007432 | -2.403328 | -1.134548 | 0.949206 | -2.681073 | -0.175683 | -0.569914 | -0.243658 | -0.468765 | -0.104364 | -0.546232 | -0.386275 | 0.221393 | -1.064874 | -1.436184 | -0.595917 | -0.220731 | -1.265249 |
| 177 | -0.376922 | 1.46997 | 0.795458 | 0.918363 | 0.328963 | 0.597953 | 0.983485 | 0.581397 | -0.224482 | -0.089829 | -0.52133 | 1.501288 | -1.996734 | 0.307053 | -1.824602 | 0.058471 | 1.212209 | -0.906636 | -0.044441 | -0.299773 |


## Train/test split

As in a real use case, we split the data into a train and a test set. For this we use `train_test_split` with its default settings, which hold out 25% of the rows for testing.

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
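
As a quick check of what the default split produces, the sketch below reproduces it on a stand-in dataset. The `random_state` and `stratify` arguments are optional extras that this notebook does not use (they make the split reproducible and keep the class balance equal in both parts), and the names `X_demo`, `X_tr`, ... are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in for the notebook's data: 500 rows, default 20 features.
X_arr, y_arr = make_classification(n_samples=500, random_state=0)
X_demo, y_demo = pd.DataFrame(X_arr), pd.Series(y_arr)

# Default test_size is 0.25; random_state and stratify are optional additions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

print(X_tr.shape, X_te.shape)  # (375, 20) (125, 20)
```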

## Full modelling pipeline

Let's create our first pipeline that will do the following:

- `Impute` the missing values (`make_X_y` injected three of them into the data).
- Scale the features using `MinMaxScaler`.
- Estimate a `Random Forest Classifier`.

Let's load the required functionalities!

In [7]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer

## Define the pipeline

`A pipeline is defined by a series of steps`. Each step contains a label and an object. The object is usually a transformer or an estimator.
- The objects are instantiated in the definition of the pipeline.
- If the objects do not have the required methods (fit, transform, predict, ...) the pipeline will fail.

In [8]:
model = Pipeline(
    steps = [
        ("Impute", SimpleImputer(strategy="mean")),
        ('scaler', MinMaxScaler()),
        ("clf", RandomForestClassifier(max_depth=3))
    ]
)
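
The labels are more than decoration: they are the handles used to reach each step later on. A small sketch, rebuilding the same three steps under the hypothetical name `demo_pipe` so that the block runs on its own:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Same three steps as above, under a separate (hypothetical) name.
demo_pipe = Pipeline(steps=[
    ("Impute", SimpleImputer(strategy="mean")),
    ("scaler", MinMaxScaler()),
    ("clf", RandomForestClassifier(max_depth=3)),
])

# The steps are stored as (label, object) tuples, in execution order.
step_labels = [label for label, _ in demo_pipe.steps]

# named_steps gives access to a step object through its label.
scaler_step = demo_pipe.named_steps["scaler"]
clf_depth = demo_pipe.named_steps["clf"].max_depth

print(step_labels)
print(type(scaler_step).__name__, clf_depth)
```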

Fitting the pipeline will do the following:
- Compute the value for imputation.
- Compute the scaling factor after imputation.
- Estimate the Random Forest on the imputed and scaled train set.
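
The three bullets can be written out by hand to see what the pipeline saves us from. The sketch below runs the steps manually and compares the result with a pipeline built from the same components. The stand-in data, the variable names and the fixed `random_state` (needed so that both forests are identical) are assumptions added for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in data with one injected missing value.
X_demo, y_demo = make_classification(n_samples=500, random_state=0)
X_demo[2, 3] = np.nan
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Manual version: every statistic is learned on the train set only.
imputer = SimpleImputer(strategy="mean")
scaler = MinMaxScaler()
forest = RandomForestClassifier(max_depth=3, random_state=0)
X_tr_imputed = imputer.fit_transform(X_tr)          # compute imputation values
X_tr_scaled = scaler.fit_transform(X_tr_imputed)    # compute scaling factors
forest.fit(X_tr_scaled, y_tr)                       # estimate the forest
# On new data we only transform, never re-fit.
manual_proba = forest.predict_proba(
    scaler.transform(imputer.transform(X_te))
)[:, 1]

# Pipeline version: the same chain in two calls.
pipe = Pipeline([
    ("Impute", SimpleImputer(strategy="mean")),
    ("scaler", MinMaxScaler()),
    ("clf", RandomForestClassifier(max_depth=3, random_state=0)),
])
pipe_proba = pipe.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

same = np.allclose(manual_proba, pipe_proba)
print(same)
```

The manual version is easy to get wrong (for instance by fitting the scaler on the test set); the pipeline removes that risk.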

Let's run it

In [9]:
model.fit(X=X_train, y=y_train)

Pipeline(memory=None,
         steps=[('Impute',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='mean',
                               verbose=0)),
                ('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))),
                ('clf',
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=3, max_features='auto',
                                        max_leaf_nodes=None, max_samples=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        n_estimators=100, n_jobs=None,
     

## Make prediction

The pipeline can be used on the test set to compute the model performance on unseen data.

In [10]:
from sklearn.metrics import roc_auc_score

prediction = model.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, prediction))

0.9487577639751552


## Pushing a bit further

In the pipeline, we can also `include cross-validation`. Among other things, this allows model tuning at different levels of the pipeline.
To illustrate this, we reuse the pipeline defined above and define 3 parameters we want to tune:
- The scaler: Standard vs MinMax.
- The maximum depth of the random forest.
- The number of estimators of the random forest.


In [11]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV

parameters = {
    'scaler': [StandardScaler(), MinMaxScaler()],
    'clf__max_depth': [3, 5, 7],
    'clf__n_estimators': [10, 25, 50, 100],
    }
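
The keys of the dictionary follow the pipeline's naming convention: `<step label>__<parameter>` reaches a parameter inside a step, while the step label alone replaces the whole step. A minimal sketch of how to list the valid names and set them by hand, using a hypothetical copy of the pipeline called `demo_pipe`:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

demo_pipe = Pipeline(steps=[
    ("Impute", SimpleImputer(strategy="mean")),
    ("scaler", MinMaxScaler()),
    ("clf", RandomForestClassifier(max_depth=3)),
])

# Every key returned by get_params() is a valid key for the search space.
tunable = sorted(demo_pipe.get_params().keys())
print([name for name in tunable if name.startswith("clf__max")])

# set_params uses the same names: swap the scaler, change the tree depth.
demo_pipe.set_params(scaler=StandardScaler(), clf__max_depth=5)
print(demo_pipe.named_steps["scaler"], demo_pipe.named_steps["clf"].max_depth)
```

A misspelled key makes the search fail when it tries to set the parameter, so checking against `get_params()` first can save a failed run.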

To optimize the pipeline, we use a randomized search that samples 10 combinations from the 3-dimensional space of hyper-parameters.
To do so, we provide the pipeline as `input` to the search object, [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html).

In [12]:
grid = RandomizedSearchCV(model, parameters, cv=3, n_iter=10).fit(X_train, y_train)

print('Training set score: ' + str(grid.score(X_train, y_train)))
print('Test set score: ' + str(grid.score(X_test, y_test)))

Training set score: 0.9333333333333333
Test set score: 0.88
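
Note that these scores are accuracies, not AUC values: without a `scoring` argument the search falls back on the classifier's own `score` method. If we want the search to optimize the same metric as in the prediction section, one option is to pass `scoring="roc_auc"`. The sketch below does this on stand-in data; the smaller search space, the `random_state` values and all variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

demo_pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])
demo_params = {"clf__max_depth": [3, 5, 7], "clf__n_estimators": [10, 25]}

# scoring="roc_auc" makes both the search and .score() use the AUC.
auc_search = RandomizedSearchCV(
    demo_pipe, demo_params, cv=3, n_iter=4, scoring="roc_auc", random_state=0
)
auc_search.fit(X_tr, y_tr)

test_auc = auc_search.score(X_te, y_te)
manual_auc = roc_auc_score(y_te, auc_search.predict_proba(X_te)[:, 1])
print(test_auc, manual_auc)
```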


We can then look at the best combination of parameters:

In [13]:
grid.best_params_

{'scaler': StandardScaler(copy=True, with_mean=True, with_std=True),
 'clf__n_estimators': 50,
 'clf__max_depth': 5}

And at the best pipeline, refitted on the full train set:

In [20]:
grid.best_estimator_

Pipeline(memory=None,
         steps=[('Impute',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='mean',
                               verbose=0)),
                ('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('clf',
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=5, max_features='auto',
                                        max_leaf_nodes=None, max_samples=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        n_estima

Often, it is useful to look at the full set of results.

In [21]:
pd.DataFrame(grid.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_scaler | param_clf__n_estimators | param_clf__max_depth | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.035623 | 0.002476 | 0.003237 | 0.000355 | StandardScaler(copy=True, with_mean=True, with... | 25 | 7 | {'scaler': StandardScaler(copy=True, with_mean... | 0.896 | 0.872 | 0.856 | 0.874667 | 0.016438 | 2 |
| 1 | 0.013822 | 0.000349 | 0.002197 | 0.000188 | MinMaxScaler(copy=True, feature_range=(0, 1)) | 10 | 3 | {'scaler': MinMaxScaler(copy=True, feature_ran... | 0.904 | 0.864 | 0.848 | 0.872 | 0.023551 | 4 |
| 2 | 0.031884 | 0.001867 | 0.003018 | 0.00031 | MinMaxScaler(copy=True, feature_range=(0, 1)) | 25 | 3 | {'scaler': MinMaxScaler(copy=True, feature_ran... | 0.904 | 0.864 | 0.848 | 0.872 | 0.023551 | 4 |
| 3 | 0.014619 | 0.000179 | 0.00182 | 3.5e-05 | StandardScaler(copy=True, with_mean=True, with... | 10 | 7 | {'scaler': StandardScaler(copy=True, with_mean... | 0.888 | 0.864 | 0.848 | 0.866667 | 0.016438 | 8 |
| 4 | 0.123424 | 0.002033 | 0.008005 | 0.000475 | MinMaxScaler(copy=True, feature_range=(0, 1)) | 100 | 7 | {'scaler': MinMaxScaler(copy=True, feature_ran... | 0.896 | 0.856 | 0.856 | 0.869333 | 0.018856 | 7 |
| 5 | 0.063445 | 0.001113 | 0.004626 | 0.000165 | MinMaxScaler(copy=True, feature_range=(0, 1)) | 50 | 7 | {'scaler': MinMaxScaler(copy=True, feature_ran... | 0.888 | 0.856 | 0.856 | 0.866667 | 0.015085 | 8 |
| 6 | 0.016201 | 0.002325 | 0.002256 | 0.000337 | StandardScaler(copy=True, with_mean=True, with... | 10 | 3 | {'scaler': StandardScaler(copy=True, with_mean... | 0.88 | 0.88 | 0.832 | 0.864 | 0.022627 | 10 |
| 7 | 0.063577 | 0.000964 | 0.004708 | 0.000191 | StandardScaler(copy=True, with_mean=True, with... | 50 | 5 | {'scaler': StandardScaler(copy=True, with_mean... | 0.904 | 0.88 | 0.856 | 0.88 | 0.019596 | 1 |
| 8 | 0.060924 | 0.003342 | 0.005016 | 0.000712 | MinMaxScaler(copy=True, feature_range=(0, 1)) | 50 | 3 | {'scaler': MinMaxScaler(copy=True, feature_ran... | 0.896 | 0.864 | 0.856 | 0.872 | 0.017282 | 4 |
| 9 | 0.140336 | 0.001724 | 0.009655 | 0.000287 | MinMaxScaler(copy=True, feature_range=(0, 1)) | 100 | 5 | {'scaler': MinMaxScaler(copy=True, feature_ran... | 0.904 | 0.848 | 0.872 | 0.874667 | 0.02294 | 2 |
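
The full table is wide, so in practice it helps to keep only the tuned parameters and the scores, sorted by rank. A minimal sketch on a stand-in search (the data, the reduced search space and the names `demo_search` and `summary` are hypothetical):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = make_classification(n_samples=200, random_state=0)
demo_pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])
demo_params = {"clf__max_depth": [3, 5], "clf__n_estimators": [10, 25]}
demo_search = RandomizedSearchCV(
    demo_pipe, demo_params, cv=3, n_iter=3, random_state=0
).fit(X_demo, y_demo)

# Keep one column per tuned parameter plus the aggregated scores.
columns_to_show = [
    "param_clf__max_depth",
    "param_clf__n_estimators",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
summary = (
    pd.DataFrame(demo_search.cv_results_)
    .sort_values("rank_test_score")[columns_to_show]
)
print(summary)
```

Comparing `mean_test_score` with `std_test_score` shows whether the top candidates really differ or are within noise of each other, as is the case for several rows of the table above.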


# Make your own transformer

In [16]:
from sklearn.base import BaseEstimator, TransformerMixin


class CustomTransformer(BaseEstimator, TransformerMixin):
    """Remove features with fewer than `min_number_values` distinct values."""

    def __init__(self, min_number_values):
        """Initiate the class."""
        self.min_number_values = min_number_values

    def fit(self, X, y=None):
        """Assess the features to filter out."""
        self.features_to_drop_ = []
        
        for feature in X.columns:
            number_of_distinct_values = X[feature].nunique()
            
            if number_of_distinct_values < self.min_number_values:
                self.features_to_drop_.append(feature)
    
        return self

    def transform(self, X):
        """Apply the filter learned in fit."""
        return X.drop(columns=self.features_to_drop_)


In [17]:
X = pd.DataFrame({
    "constant": [1, 1, 1, 1],
    "numerical": [1, 2, 2, 1],
    "numerical_2": [1, 2, 3, 4],
    "correlated": [1.7, 4.7, 6.6, 7.8]
})
X

| | constant | numerical | numerical_2 | correlated |
|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1.7 |
| 1 | 1 | 2 | 2 | 4.7 |
| 2 | 1 | 2 | 3 | 6.6 |
| 3 | 1 | 1 | 4 | 7.8 |


In [18]:
transformer = CustomTransformer(3)
transformer.fit_transform(X)

| | numerical_2 | correlated |
|---|---|---|
| 0 | 1 | 1.7 |
| 1 | 2 | 4.7 |
| 2 | 3 | 6.6 |
| 3 | 4 | 7.8 |
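
Because the class has `fit` and `transform`, it can be dropped into a pipeline like any built-in transformer. The sketch below restates the class in a compact form so that the block runs on its own, and chains it with `MinMaxScaler`. The step labels and variable names are hypothetical, and the mixin is listed before `BaseEstimator` here, which is the order the scikit-learn developer guide recommends.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


class CustomTransformer(TransformerMixin, BaseEstimator):
    """Compact restatement: drop features with too few distinct values."""

    def __init__(self, min_number_values):
        self.min_number_values = min_number_values

    def fit(self, X, y=None):
        counts = X.nunique()
        self.features_to_drop_ = counts[
            counts < self.min_number_values
        ].index.tolist()
        return self

    def transform(self, X):
        return X.drop(columns=self.features_to_drop_)


X_demo = pd.DataFrame({
    "constant": [1, 1, 1, 1],
    "numerical": [1, 2, 2, 1],
    "numerical_2": [1, 2, 3, 4],
    "correlated": [1.7, 4.7, 6.6, 7.8],
})

# The custom step comes first: it relies on X.columns, so it needs a
# DataFrame, while built-in scalers return NumPy arrays by default.
demo_pipe = Pipeline([
    ("filter", CustomTransformer(min_number_values=3)),
    ("scaler", MinMaxScaler()),
])
X_out = demo_pipe.fit_transform(X_demo)
dropped = demo_pipe.named_steps["filter"].features_to_drop_

print(dropped)      # ['constant', 'numerical']
print(X_out.shape)  # (4, 2)
```

Since `min_number_values` is a regular constructor argument, it could also be tuned in a search under the key `filter__min_number_values`.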
