# Machine Learning Pipeline

### What is a pipeline and why do we need it?

A pipeline is a set of automated processes that execute tasks without manual intervention.
Pipelines are typically scheduled and idempotent: executed multiple times with the same inputs, they should produce the same output.
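
To make the idempotency property concrete, here is a minimal sketch. The function name `clean_ages` and the sample values are hypothetical, chosen only for illustration: running the step twice on the same input gives the same output, and the input itself is left untouched.

```python
import pandas as pd


def clean_ages(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical pipeline step: clip ages to a plausible range."""
    # Work on a copy so the caller's DataFrame is never modified
    out = df.copy()
    out["age"] = out["age"].clip(lower=18, upper=100)
    return out


raw = pd.DataFrame({"age": [15, 45, 130]})

first_run = clean_ages(raw)
second_run = clean_ages(raw)

# Same input, same output: the step can safely be re-run
print(first_run.equals(second_run))
```

A step that modified `raw` in place, or that depended on the current time or on global state, would be much harder to re-run safely.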

When starting on an ML model, most data scientists write their code in a notebook, cell by cell, without functions.

This works perfectly for EDA (Exploratory Data Analysis), but is not an ideal way to build a machine learning model. Building your ML model as a pipeline provides the following benefits:
1. Fewer bugs - by compartmentalizing your code into `functions`, each `function` takes care of a single part of the pipeline, making bugs much easier to find and fix
2. Readability - by breaking your code into small pieces, it becomes more readable, easier to document and share - essential if you're working in a team
3. Flexibility - it is much easier to adjust parts of the model without risking breaking the rest
4. Reproducibility - it's quite easy to produce unstable code in notebooks (execution order of cells is not fixed, easy to shadow variables); a pipeline always runs its steps in the same order
5. Production readiness - code produced this way is going to be much closer to what is expected in a production environment (where reliability and ease of debugging are critical)

_Today, we're building the pipeline in a notebook for simplicity/speed. A real-world pipeline should be a Python script (for a simple pipeline) or a module (if it's more complex), which allows the pipeline to easily be scheduled on a server._
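
As a rough idea of what such a script could look like, here is a minimal sketch of an entry point. The script name `train.py`, the argument names and the `main` function are assumptions made for illustration; the actual pipeline steps would be called where the comment indicates.

```python
import argparse
import logging

logger = logging.getLogger(__name__)


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface for a hypothetical train.py script."""
    parser = argparse.ArgumentParser(description="Train the credit model")
    parser.add_argument("--training-path", required=True,
                        help="CSV file with the training data")
    parser.add_argument("--log-level", default="INFO",
                        help="logging level name, e.g. DEBUG or INFO")
    return parser


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    logging.basicConfig(level=args.log_level)
    logger.info("training on %s", args.training_path)
    # The pipeline steps (load, pre-process, train...) would be called here
    return 0


if __name__ == "__main__":
    # Demo invocation with explicit arguments; a real script would call main()
    # with no argument so that argparse reads the command line
    main(["--training-path", "../datasets/credit/cs-training.csv"])
```

A scheduler (cron, for instance) can then run the script with the path it needs, without anyone opening a notebook.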


In [1]:
"""
    Importing basic libraries, and setting up a logger - we're aiming at a script usable in a production 
    environment, so let's not use print here.
"""
import pandas as pd
import logging 

import warnings
warnings.filterwarnings("ignore")

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# paths to data
path_test = '../datasets/credit/cs-test.csv'
path_training = '../datasets/credit/cs-training.csv'

In [3]:
# Taking a first look at the data
# There is a column "Unnamed: 0" with an auto-incrementing integer - doesn't seem necessary, let's remove it
df = pd.read_csv(path_training)
df.head()

Unnamed: 0.1,Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,1,1,0.766127,45,2,0.802982,9120.0,13,0,6,0,2.0
1,2,0,0.957151,40,0,0.121876,2600.0,4,0,0,0,1.0
2,3,0,0.65818,38,1,0.085113,3042.0,2,1,0,0,0.0
3,4,0,0.23381,30,0,0.03605,3300.0,5,0,0,0,0.0
4,5,0,0.907239,49,1,0.024926,63588.0,7,0,1,0,0.0


In [4]:
# Let's create a function to properly load this data - the first (humble) 
# brick of our pipeline
# note the type hints - another way to make our code more readable
def load(path: str) -> pd.DataFrame:
    """Loads a credit data file"""
    return pd.read_csv(path).drop(labels=['Unnamed: 0'], axis='columns')
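
A column named `Unnamed: 0` typically appears when a DataFrame was written to CSV together with its index. Whether this dataset was produced that way is an assumption, but the mechanism is easy to reproduce. The sketch below uses a hypothetical two-row sample and also shows `index_col=0` as a possible alternative to `drop`.

```python
import io

import pandas as pd

# Hypothetical two-row sample, written to CSV together with its index
sample = pd.DataFrame({"age": [45, 40], "DebtRatio": [0.80, 0.12]})
csv_text = sample.to_csv()

# Reading it back naively turns the unlabeled index into "Unnamed: 0"
naive = pd.read_csv(io.StringIO(csv_text))
print(naive.columns.tolist())

# Alternative to drop(): tell pandas that the first column is the index
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.columns.tolist())
```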
    

In [5]:
# current state of the pipeline - it loads...

# *** Pipeline v0
path_training = '../datasets/credit/cs-training.csv'
df = load(path_training)
df.head()

Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,1,0.766127,45,2,0.802982,9120.0,13,0,6,0,2.0
1,0,0.957151,40,0,0.121876,2600.0,4,0,0,0,1.0
2,0,0.65818,38,1,0.085113,3042.0,2,1,0,0,0.0
3,0,0.23381,30,0,0.03605,3300.0,5,0,0,0,0.0
4,0,0.907239,49,1,0.024926,63588.0,7,0,1,0,0.0


In [7]:
def pre_processing(df: pd.DataFrame) -> pd.DataFrame:
    """
        Pre-processing is the place for:
            - cleaning data
            - feature engineering
            
        This is the brain of the model
    """
    logger.info('pre-processing')
    # I've included 2 simple new features. To improve the accuracy of the model, we'll need more than this !
    # computing expenses
    df['monthly-expenses'] = df['MonthlyIncome'] * df['DebtRatio']
    # cash out
    df['cash-out'] = df['MonthlyIncome'] - df['monthly-expenses']
    # There should be no `None` values after this step
    return df.fillna(0)
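
To see what these two features and the final `fillna(0)` do, here is a small sketch on a hypothetical two-row sample (the values are made up). A missing `MonthlyIncome` propagates to both derived features, and all three end up as 0, which the model cannot distinguish from a genuine zero income.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the second row has no reported income
sample = pd.DataFrame({
    "MonthlyIncome": [9120.0, np.nan],
    "DebtRatio": [0.5, 0.25],
})

features = sample.copy()  # copy so that `sample` itself is left unchanged
features["monthly-expenses"] = features["MonthlyIncome"] * features["DebtRatio"]
features["cash-out"] = features["MonthlyIncome"] - features["monthly-expenses"]
features = features.fillna(0)

print(features)
```

Filling with 0 is the simplest choice; replacing missing incomes with the median is another option worth comparing through cross validation.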

In [8]:
# We have two steps now

# *** Pipeline v1
logger.info('i am pipeline')
training_df = load(path_training)
preproc_df = pre_processing(df=training_df)

# *** Pipeline end
preproc_df.head()

INFO:__main__:i am pipeline
INFO:__main__:pre-processing


Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents,monthly-expenses,cash-out
0,1,0.766127,45,2,0.802982,9120.0,13,0,6,0,2.0,7323.197016,1796.802984
1,0,0.957151,40,0,0.121876,2600.0,4,0,0,0,1.0,316.878123,2283.121877
2,0,0.65818,38,1,0.085113,3042.0,2,1,0,0,0.0,258.914887,2783.085113
3,0,0.23381,30,0,0.03605,3300.0,5,0,0,0,0.0,118.963951,3181.036049
4,0,0.907239,49,1,0.024926,63588.0,7,0,1,0,0.0,1584.975094,62003.024906


In [10]:
import numpy as np
from sklearn import preprocessing
from typing import Tuple

def scale_features(df: pd.DataFrame) -> Tuple[np.ndarray, np.ndarray, list]:
    """ 
        1. Scale features - necessary step for some models
        2. Extract the X & y matrices
    """
    logger.info('scaling features... ')
    # Extract arrays
    y = df['SeriousDlqin2yrs'].values
    X = df.drop('SeriousDlqin2yrs', axis='columns').values
    feature_names = df.drop('SeriousDlqin2yrs', axis='columns').columns.tolist()
    
    # Scale
    scaler = preprocessing.StandardScaler()
    X = scaler.fit_transform(X)

    logger.info(X.shape)

    return X, y, feature_names
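
The sketch below illustrates what `StandardScaler` does, on hypothetical toy matrices. It also shows the difference between `fit_transform` (learn the statistics and scale) and `transform` (scale with statistics learned earlier). Since `scale_features` calls `fit_transform` on whatever data it receives, keeping the fitted scaler around for prediction time is one possible refinement, not something this notebook does.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical toy matrices: two features on very different scales
X_train = np.array([[30.0, 2000.0], [40.0, 4000.0], [50.0, 9000.0]])
X_new = np.array([[45.0, 3000.0]])

scaler = preprocessing.StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std, then scales

# Each column now has mean 0 and standard deviation 1
print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))

# New data is scaled with the statistics learned on the training data
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)
```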

In [11]:

# *** Pipeline v2
logger.info('i am pipeline')
training_df = load(path_training)
preproc_df = pre_processing(df=training_df)
X, y, feature_names = scale_features(df=preproc_df)

# *** Pipeline end


INFO:__main__:i am pipeline
INFO:__main__:pre-processing
INFO:__main__:scaling features... 
INFO:__main__:(150000, 12)


In [13]:
from sklearn import model_selection
"""
    AUC - Area Under the ROC (Receiver Operating Characteristic) Curve

    We're using the AUC rather than the accuracy as the metric for this model
    (this is what scoring='roc_auc' computes in the cross validation below).

    The ROC curve plots the true positive rate, or recall (y-axis), against the
    false positive rate (x-axis) for every possible decision threshold.
    A high recall means a low false negative rate.
    A high AUC means the model ranks defaulters above non-defaulters: we can reach a
    high recall (few missed defaults) while keeping the false positive rate low -
    that's the ultimate objective here.

    Besides, accuracy is quite useless for this problem. We are trying to predict a rare event (default).
    Less than 7% of the sample has defaulted. So with a naive model (nobody defaults), we'd get a
    seemingly excellent accuracy (93%+), while the model would be useless (fails to predict any default)

"""

def cross_validation(X, y, estimator, n_jobs: int=1):
    """
        Cross validation (CV) is a common technique used to test the stability of a prediction.
        CV warns us if we're overfitting.
        In other words, CV helps us select the model which will perform best on unseen data.
    """
    logger.info('Cross Validation')
    # cross_validation
    cv = model_selection.ShuffleSplit(n_splits=5,
                                      test_size=0.2,
                                      random_state=0
                                      )


    scores = model_selection.cross_val_score(estimator,
                                                 X,
                                                 y,
                                                 cv=cv,
                                                 scoring='roc_auc',
                                                 n_jobs=n_jobs
                                                 )
    logger.info('CV performance:')
    logger.info("AUC: {:.2f} % (+/- {:.2f} )".format(scores.mean() * 100,
                                                  scores.std() * 2 * 100
                                                  )
         )
    return scores
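
The argument about accuracy can be checked on a small example. The sketch below uses hypothetical labels (7 defaults out of 100 borrowers) and a "nobody defaults" model. It compares accuracy with the ROC AUC, which is what `scoring='roc_auc'` computes.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels: 7 defaults out of 100 borrowers
y_true = np.array([1] * 7 + [0] * 93)

# "Nobody defaults" model: always predicts class 0 with the same score
y_pred = np.zeros(100, dtype=int)
y_score = np.zeros(100)

accuracy = metrics.accuracy_score(y_true, y_pred)
auc = metrics.roc_auc_score(y_true, y_score)
recall = metrics.recall_score(y_true, y_pred)

print(f"accuracy: {accuracy:.2f}")
print(f"ROC AUC:  {auc:.2f}")
print(f"recall:   {recall:.2f}")
```

The accuracy looks excellent, while the AUC of 0.5 (no better than random ranking) and the recall of 0 reveal that the model never finds a default.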

In [15]:
from sklearn import linear_model

# *** Pipeline v3: Load, Pre-process, Scale, Cross Validation
logger.info('i am pipeline')
training_df = load(path_training)
preproc_df = pre_processing(df=training_df)
X, y, feature_names = scale_features(df=preproc_df)

estimator = linear_model.LinearRegression()
scores = cross_validation(X, y, estimator)

# *** Pipeline end


INFO:__main__:i am pipeline
INFO:__main__:pre-processing
INFO:__main__:scaling features... 
INFO:__main__:(150000, 12)
INFO:__main__:Cross Validation
INFO:__main__:CV performance:
INFO:__main__:AUC: 69.54 % (+/- 1.12 )
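
To see what `ShuffleSplit` feeds into `cross_val_score`, here is a minimal sketch on synthetic data (generated with `make_classification`, standing in for the credit dataset). Each of the 5 splits is an independent random 80/20 partition of the rows, and one AUC is computed per split.

```python
import numpy as np
from sklearn import linear_model, model_selection
from sklearn.datasets import make_classification

# Synthetic, imbalanced data standing in for the credit dataset
X, y = make_classification(n_samples=500, n_features=5,
                           weights=[0.9, 0.1], random_state=0)

cv = model_selection.ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# Each split is an independent random 80/20 partition of the row indices
split_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in cv.split(X)]
print(split_sizes)

scores = model_selection.cross_val_score(
    linear_model.LogisticRegression(), X, y, cv=cv, scoring="roc_auc"
)
print(scores.shape)
```

With a rare positive class, `model_selection.StratifiedShuffleSplit` is an alternative worth considering, since it keeps the class ratio the same in every split.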


In [40]:
import joblib  # `sklearn.externals.joblib` was removed from recent scikit-learn versions
import os

def fit_and_save(X: np.ndarray, y: np.ndarray, estimator):
    """
        Trains the model (finally!)
        and saves the result to disk (so we can later load it to predict)
    """
    logger.info('training model')
    estimator.fit(X, y)

    os.makedirs('models', exist_ok=True)
    path = 'models/logistic-reg.model'
    logger.info(f'saving model in {path}')
    joblib.dump(estimator, path)
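
The save/load round trip can be tried in isolation. The sketch below assumes the standalone `joblib` package is installed, trains a model on a tiny hypothetical dataset, writes it to a temporary directory and checks that the restored estimator predicts the same thing.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn import linear_model

# Tiny hypothetical training set
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

estimator = linear_model.LogisticRegression()
estimator.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.model")
    joblib.dump(estimator, path)
    restored = joblib.load(path)

# The restored estimator gives the same predictions as the original
same_predictions = np.array_equal(estimator.predict(X), restored.predict(X))
print(same_predictions)
```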

    

In [41]:
from sklearn import linear_model


def train(*, path: str, estimator):
    # *** Pipeline v4 - First complete version. Gains proper logs, and becomes a function
    logger.info('*** Pipeline Start ***')
    training_df = load(path)
    preproc_df = pre_processing(df=training_df)
    X, y, feature_names = scale_features(df=preproc_df)
    scores = cross_validation(X, y, estimator)
    fit_and_save(X, y, estimator)
    logger.info('*** Pipeline End ***')
    # *** Pipeline end

# Calling the pipeline
estimator = linear_model.LogisticRegression(max_iter=5)
train(path=path_training, estimator=estimator)

INFO:__main__:*** Pipeline Start ***
INFO:__main__:pre-processing
INFO:__main__:scaling features... 
INFO:__main__:(150000, 12)
INFO:__main__:Cross Validation
INFO:__main__:CV performance:
INFO:__main__:AUC: 70.02 % (+/- 0.72 )
INFO:__main__:training model
INFO:__main__:saving model in models/logistic-reg.model
INFO:__main__:*** Pipeline End ***


In [42]:
# to predict
# estimator = joblib.load(model_path)
# estimator.predict

# to use another model
from sklearn import ensemble
estimator = ensemble.RandomForestClassifier()
train(path=path_training, estimator=estimator)

INFO:__main__:*** Pipeline Start ***
INFO:__main__:pre-processing
INFO:__main__:scaling features... 
INFO:__main__:(150000, 12)
INFO:__main__:Cross Validation
INFO:__main__:CV performance:
INFO:__main__:AUC: 78.05 % (+/- 0.71 )
INFO:__main__:training model
INFO:__main__:saving model in models/logistic-reg.model
INFO:__main__:*** Pipeline End ***


In [43]:
gbc = ensemble.GradientBoostingClassifier()
train(path=path_training, estimator=gbc)

INFO:__main__:*** Pipeline Start ***
INFO:__main__:pre-processing
INFO:__main__:scaling features... 
INFO:__main__:(150000, 12)
INFO:__main__:Cross Validation
INFO:__main__:CV performance:
INFO:__main__:AUC: 86.14 % (+/- 1.18 )
INFO:__main__:training model
INFO:__main__:saving model in models/logistic-reg.model
INFO:__main__:*** Pipeline End ***


## Exercises
1. Create a pipeline to predict results using the trained model
2. Try to run the training pipeline with another sklearn model (try random forests). What breaks? Why? Have you found all the bugs?
3. [harder] Update the training pipeline so that it runs several sklearn models in a loop, and informs you of the best model according to the cross validation
4. [at home, if you want to become a data engineer] How do you add hyper parameters optimization to this pipeline ?
5. [at home, if you want to become a data scientist] Create more features ? How high can you bring the AUC ?

## Solutions

Try before checking !

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

Spoilers ahead !


In [26]:
# 1. Create a pipeline to predict results using the trained model

from sklearn import metrics

    
def predict(path: str):
    # loading
    model_path = 'models/logistic-reg.model'
    estimator = joblib.load(model_path)

    df = load(path)
    preproc_df = pre_processing(df=df)
    X, y, _ = scale_features(df=preproc_df)

    y_pred = estimator.predict(X)
    y_score = estimator.predict_proba(X)
    cm = metrics.confusion_matrix(y, y_pred)
    print(cm)


predict(path=path_training)    
    


INFO:__main__:pre-processing
INFO:__main__:scaling features... 
INFO:__main__:(150000, 12)


[[139668    306]
 [  9654    372]]


## 2. Try to run the training pipeline with another sklearn model (try random forests).  What breaks ? Why ? Have you found all the bugs ?

### Answer
The code works and executes, but we're creating a hard-to-detect bug by having a hard-coded
filename for our model. That means that a predict function could work incorrectly or crash
if the file does not contain the expected model (e.g. it expects a random forest model, but
the file actually holds whichever model was trained last)

 This kind of bug is very tricky because it generates no immediate exception / trace
 meaning that with the "right" circumstances, it can stay undiscovered in production 
 environments for long periods of time.
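
Independently of how the file is named, one way to make this kind of mismatch fail loudly is to check the type of the loaded model. The helper `load_expected` below is hypothetical and only sketches the idea: it raises as soon as the file does not contain the expected estimator class.

```python
import os
import tempfile

import joblib
from sklearn import ensemble, linear_model


def load_expected(path: str, expected_class: type):
    """Hypothetical helper: load a model and fail loudly on a type mismatch."""
    estimator = joblib.load(path)
    if not isinstance(estimator, expected_class):
        raise TypeError(
            f"{path} contains a {type(estimator).__name__}, "
            f"expected a {expected_class.__name__}"
        )
    return estimator


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "logistic-reg.model")
    # An (unfitted) random forest is saved under the hard-coded name
    joblib.dump(ensemble.RandomForestClassifier(), path)
    try:
        load_expected(path, linear_model.LogisticRegression)
        mismatch_detected = False
    except TypeError as error:
        mismatch_detected = True
        print(error)
```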


### Solution
We use the class name of the estimator (available through the `__class__` dunder attribute)
to dynamically generate the filename


In [29]:

def fit_and_save(X, y, estimator):
    logger.info('training model')
    estimator.fit(X, y)

    path = f'models/{estimator.__class__.__name__}.model' ## NEW
    logger.info(f'saving model in {path}')
    joblib.dump(estimator, path)

    
    
def train(*, path: str, estimator):
    # *** Pipeline v4 - no change here
    logger.info('*** Pipeline Start ***')
    training_df = load(path)
    preproc_df = pre_processing(df=training_df)
    X, y, feature_names = scale_features(df=preproc_df)
    scores = cross_validation(X, y, estimator)
    fit_and_save(X, y, estimator)
    logger.info('*** Pipeline End ***')
    # *** Pipeline end

# Calling the pipeline
from sklearn import ensemble
estimator = ensemble.RandomForestClassifier(n_estimators= 10)
train(path=path_training, estimator=estimator)

INFO:__main__:*** Pipeline Start ***
INFO:__main__:pre-processing
INFO:__main__:scaling features... 
INFO:__main__:(150000, 12)
INFO:__main__:Cross Validation
INFO:__main__:CV performance:
INFO:__main__:AUC: 77.92 % (+/- 0.77 )
INFO:__main__:training model
INFO:__main__:saving model in models/RandomForestClassifier.model
INFO:__main__:*** Pipeline End ***


## 3. Update the training pipeline so that it runs several sklearn models in a loop, and informs you of the best model according to the cross validation

First we create a list of the models we want to try.
A wrapper class will help keep our code organized, as opposed to a simple list of model instances.

Then we add a new step to our pipeline: `find_best_model` - it iterates over the models, computes the cross-validated AUC of each one, and returns the best

In [45]:
from dataclasses import dataclass
from typing import List

@dataclass
class ModelWrapper:
    name: str
    model: 'typing.Any'
    hyper: dict
    _estimator: 'typing.Any' = None
        
    @property
    def estimator(self):
        if self._estimator is None:
            self._estimator = self.model(**self.hyper)

        return self._estimator
    
MODELS = [
    ModelWrapper(name='log', model=linear_model.LogisticRegression, hyper={'max_iter': 100, 'penalty': 'l2', 'dual': False, 
                                               'solver': 'liblinear', 'C': 0.5}),
    ModelWrapper(name='rfc-1', model=ensemble.RandomForestClassifier, hyper={'n_estimators': 2}),
    ModelWrapper(name='rfc-2', model=ensemble.RandomForestClassifier, hyper={'n_estimators': 20}),  
    ModelWrapper(name='rfc-3', model=ensemble.RandomForestClassifier, hyper={'n_estimators': 50}),  
    ModelWrapper(name='ada-1', model=ensemble.AdaBoostClassifier, hyper={'n_estimators':50, 'learning_rate':1.0}),
    ModelWrapper(name='gbc-1', model=ensemble.GradientBoostingClassifier, hyper={'n_estimators':200,'learning_rate':0.2}),
#     ModelWrapper(name='hgb-1', model=ensemble.HistGradientBoostingClassifier, hyper={'learning_rate':0.1, 'max_iter':100, 'max_leaf_nodes':31})
]


In [46]:
import time

def fit_and_save(X: np.ndarray, y: np.ndarray, model: ModelWrapper) -> None:
    """
        Slight adaptation of the previous code to receive the ModelWrapper
    """
    logger.info('training model')
    model.estimator.fit(X, y)

    path = f'models/{model.name}.model' ## MODIFIED TO USE THE ModelWrapper Class
    logger.info(f'saving model in {path}')
    joblib.dump(model.estimator, path)


def find_best_model(X: np.ndarray, y: np.ndarray, models: List[ModelWrapper]) -> ModelWrapper:
    """ Completely new function
            1. Iterate each model defined in the MODELS list
            2. Get the Cross Validated AUC for it
            3. Select the best model based on AUC, and return it
    """
    # test all models
    results = []
    for modelwrapper in models:
        t0 = time.time()
        logger.info(f'*** Training {modelwrapper.name} ***')
        scores = cross_validation(X, y, modelwrapper.estimator)
        results.append([scores.mean(), modelwrapper])
        logger.info(f'training took {time.time()-t0:.2f} seconds')

    # find best model
    best_modelwrapper = sorted(results, key=lambda k: k[0], reverse=True)[0][1]
    logger.info(f'*** Best model is {best_modelwrapper.name} ***')
    return best_modelwrapper

        
    
def train(*, path: str, models: List[ModelWrapper]):
    # *** Pipeline v6
    logger.info('*** Pipeline Start ***')
    training_df = load(path)
    preproc_df = pre_processing(df=training_df)
    X, y, feature_names = scale_features(df=preproc_df)
    best_modelwrapper = find_best_model(X, y, models)
    fit_and_save(X, y, best_modelwrapper)
    logger.info('*** Pipeline End ***')
    # *** Pipeline end
    
train(path=path_training, models=MODELS)    


INFO:__main__:*** Pipeline Start ***
INFO:__main__:pre-processing
INFO:__main__:scaling features... 
INFO:__main__:(150000, 12)
INFO:__main__:*** Training log ***
INFO:__main__:Cross Validation
INFO:__main__:CV performance:
INFO:__main__:AUC: 69.94 % (+/- 0.76 )
INFO:__main__:training took 2.69 seconds
INFO:__main__:*** Training rfc-1 ***
INFO:__main__:Cross Validation
INFO:__main__:CV performance:
INFO:__main__:AUC: 66.15 % (+/- 1.06 )
INFO:__main__:training took 2.68 seconds
INFO:__main__:*** Training rfc-2 ***
INFO:__main__:Cross Validation
INFO:__main__:CV performance:
INFO:__main__:AUC: 80.73 % (+/- 0.73 )
INFO:__main__:training took 24.31 seconds
INFO:__main__:*** Training rfc-3 ***
INFO:__main__:Cross Validation
INFO:__main__:CV performance:
INFO:__main__:AUC: 82.95 % (+/- 1.15 )
INFO:__main__:training took 58.83 seconds
INFO:__main__:*** Training ada-1 ***
INFO:__main__:Cross Validation
INFO:__main__:CV performance:
INFO:__main__:AUC: 85.65 % (+/- 1.24 )
INFO:__main__:training 