# V.- Machine Learning Pipeline: Wrapping up for Deployment


In the previous notebooks, we worked through the typical Machine Learning pipeline steps to build a classification model that allows us to predict repeat customers for a travel agency. **The purpose of these notebooks is to provide an idea of the steps that must be covered when preparing a machine learning model for deployment.**

We want to deploy our model in production, so the code must be written in a way that gives the same results every time it runs.

Here we summarise the key pieces of code that we need to take forward to deploy this project in production. **Most importantly, we will make sure that we can reproduce the results obtained in the previous notebooks.** This is an essential check before deploying a model.

Let's get started.

### Setting the seed

Note that we are pre-processing the data with the aim of deploying the model. From now on, every step that involves randomness must **set the seed**. This keeps the research code and the production code reproducible against each other.
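As a rough illustration of why the seed matters, the sketch below splits a toy list twice with the same seed and checks that both splits are identical. It uses only the standard library; `seeded_split` is a hypothetical helper written for this example and does not reproduce the shuffling algorithm used by `train_test_split`.

```python
import random


def seeded_split(rows, test_size, seed):
    """Shuffle a copy of `rows` with a dedicated seeded generator, then split it."""
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))  # toy stand-in for the rows of a dataset

# Two independent runs with the same seed (801, as in this notebook)
train_a, test_a = seeded_split(rows, test_size=0.1, seed=801)
train_b, test_b = seeded_split(rows, test_size=0.1, seed=801)

# Same seed, same split: the two runs are interchangeable
print(train_a == train_b and test_a == test_b)
```

The same idea applies to any step that draws random numbers: pass the seed explicitly instead of relying on global state.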

Let's go ahead and load the dataset.

In [1]:
# to handle datasets
import pandas as pd
import numpy as np

# to split data into train and test set
from sklearn.model_selection import train_test_split

# to assess model performance
from sklearn.metrics import log_loss
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score

# to persist the model and the scaler
import joblib 

# pretty print
from pprint import pprint

# maximum number of dataframe rows and columns displayed
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', None)

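# random seed, reused in every step that involves randomness
RANDOM_STATE = 801

# silence pandas chained-assignment warnings
pd.options.mode.chained_assignment = None

# silence all other warnings to keep the notebook output readable
import warnings
warnings.simplefilter(action='ignore')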

# 1.- Load data

In [2]:
# load dataset
data0 = pd.read_csv('travelChurn_20k.csv')
print(data0.shape)
data0.head()

(20000, 18)


Unnamed: 0,repeat,Gender,Age_Range,Income_Range,Occupation,Household_Type,Length_of_Residence,Home_Value_Range,Wealth_Rank,Mail_Buyer,Ecommerce_Behav_Rank,Upscale_Retail_Shopper,Premium_Bank_Card,Books_Behav,Family_Behav,Health_Magazine,Personal_Travel,Sporting_Goods_Interest
0,0,M,45-54 Years Old,"$100,000 - $124,999",Executive/Administrator,Adult Male & Adult Female Present,In the 6th Year,"$250,000 - $300,000",8,,9,Y,Y,1,,0,Y,U
1,0,M,45-54 Years Old,"$75,000 - $99,999",Unknown,Adult Male & Adult Female Present,In the 14th Year,"$150,000 - $200,000",8,,5,U,U,0,,0,U,U
2,0,M,24-34 Years Old,"$100,000 - $124,999",Unknown,Adult Male & Adult Female Present,In the 2nd Year,"$600,000 - $650,000",9,,8,,,2,,1,,
3,1,M,24-34 Years Old,"$75,000 - $99,999",Unknown,Adult Male & Adult Female Present,In the 6th Year,"$100,000 - $150,000",8,,5,U,U,3,,2,U,U
4,0,M,55-64 Years Old,"$125,000 - $149,999",Unknown,Unknown,In the 1st Year,Unknown,3,,10,,,0,,0,,


In [3]:
data0.columns

Index(['repeat', 'Gender', 'Age_Range', 'Income_Range', 'Occupation',
       'Household_Type', 'Length_of_Residence', 'Home_Value_Range',
       'Wealth_Rank', 'Mail_Buyer', 'Ecommerce_Behav_Rank',
       'Upscale_Retail_Shopper', 'Premium_Bank_Card', 'Books_Behav',
       'Family_Behav', 'Health_Magazine', 'Personal_Travel',
       'Sporting_Goods_Interest'],
      dtype='object')

In [4]:
data0.dtypes

repeat                      int64
Gender                     object
Age_Range                  object
Income_Range               object
Occupation                 object
Household_Type             object
Length_of_Residence        object
Home_Value_Range           object
Wealth_Rank                object
Mail_Buyer                 object
Ecommerce_Behav_Rank       object
Upscale_Retail_Shopper     object
Premium_Bank_Card          object
Books_Behav                object
Family_Behav               object
Health_Magazine            object
Personal_Travel            object
Sporting_Goods_Interest    object
dtype: object

# 2.- Separate dataset into train and test

In [5]:
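# the target stays in X for now so that the saved raw sets include it;
# it is dropped from the features further down
X_train, X_test, y_train, y_test = train_test_split(data0, data0['repeat'], test_size=0.1,
                                                    random_state=RANDOM_STATE)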

X_train.shape, X_test.shape

((18000, 18), (2000, 18))

In [6]:
X_train.head()

Unnamed: 0,repeat,Gender,Age_Range,Income_Range,Occupation,Household_Type,Length_of_Residence,Home_Value_Range,Wealth_Rank,Mail_Buyer,Ecommerce_Behav_Rank,Upscale_Retail_Shopper,Premium_Bank_Card,Books_Behav,Family_Behav,Health_Magazine,Personal_Travel,Sporting_Goods_Interest
18239,0,M,45-54 Years Old,"$100,000 - $124,999",Executive/Administrator,Adult Male & Adult Female Present With Children,In the 11th Year,"$200,000 - $250,000",,2,,Y,U,,0,,U,U
6282,0,F,45-54 Years Old,"$100,000 - $124,999",Unknown,Adult Male & Adult Female Present,15+ Years,"$150,000 - $200,000",7.0,2,8.0,Y,U,2.0,0,2.0,U,U
18641,0,F,55-64 Years Old,"$40,000 - $49,999",Unknown,Adult Male & Adult Female Present With Children,15+ Years,"$150,000 - $200,000",,2,,Y,U,,4,,U,U
7354,0,M,45-54 Years Old,"$125,000 - $149,999",Executive/Administrator,Adult Male & Adult Female Present,15+ Years,"$200,000 - $250,000",8.0,1,10.0,Y,U,5.0,2,4.0,Y,U
5691,0,F,55-64 Years Old,"$50,000 - $74,999",Unknown,Adult Male & Adult Female Present,15+ Years,"$50,000 - $100,000",3.0,2,5.0,U,U,0.0,0,0.0,U,U


In [7]:
X_test.head()

Unnamed: 0,repeat,Gender,Age_Range,Income_Range,Occupation,Household_Type,Length_of_Residence,Home_Value_Range,Wealth_Rank,Mail_Buyer,Ecommerce_Behav_Rank,Upscale_Retail_Shopper,Premium_Bank_Card,Books_Behav,Family_Behav,Health_Magazine,Personal_Travel,Sporting_Goods_Interest
2384,1,M,35-44 Years Old,"$40,000 - $49,999",Unknown,Adult Male Present,In the 4th Year,"$100,000 - $150,000",1,1,5,,,0,0,0,,
3408,0,M,45-54 Years Old,"$100,000 - $124,999",Unknown,Adult Male Present,In the 9th Year,"$150,000 - $200,000",9,1,9,Y,U,0,0,0,U,U
16620,0,M,45-54 Years Old,"$75,000 - $99,999",Executive/Administrator,Adult Male & Adult Female Present,In the 9th Year,"$50,000 - $100,000",1,1,9,U,U,2,1,4,U,U
12534,0,M,45-54 Years Old,"$75,000 - $99,999",Executive/Administrator,Adult Male & Adult Female Present,15+ Years,"$50,000 - $100,000",3,1,9,U,U,0,2,0,U,U
2713,0,M,35-44 Years Old,"$250,000+",Unknown,Adult Male Present,In the 6th Year,"$100,000 - $150,000",3,1,10,U,U,1,0,0,U,U


In [8]:
# save raw train and test sets
X_train.to_csv('X_train.csv', index=False)
X_test.to_csv('X_test.csv', index=False)

In [9]:
# drop target from training and testing sets
X_train = X_train.drop('repeat', axis=1)
X_test = X_test.drop('repeat', axis=1)

# 3.- Feature Engineering

We will load the pre-processing pipeline we developed in a previous notebook.

In [10]:
# load pre-processing pipeline
preprocess = joblib.load('preprocessing.pkl')

In [11]:
df_train = preprocess.transform(X_train)

In [12]:
df_test = preprocess.transform(X_test)

# 4.- Get predictions

We will load the random forest classifier trained in a previous notebook to score our imputed & encoded data.

In [13]:
rf = joblib.load('model.pkl')

In [14]:
# make predictions on the train set
y_train_pred = rf.predict(df_train)

In [15]:
# make predictions on the test set
y_test_pred = rf.predict(df_test)

## 4.1.- Evaluate predictions

In [16]:
def evaluate(model, test_features, test_labels):
    
    """    
    A function to assess a classification model's performance
    
    -------------------------------
    model: model to be evaluated
    test_features: features to be fed to the model
    test_labels: known outcomes
    """
    
    # precision & recall
    predictions = model.predict(test_features)
    precision = precision_score(test_labels, predictions)
    recall = recall_score(test_labels, predictions)   
    f1 = f1_score(test_labels, predictions)
   
    # log-loss
    probabilities = model.predict_proba(test_features)
    
    # keep the predictions for class 1 only
    probabilities = probabilities[:, 1]
    
    # calculate log loss
    loss = log_loss(test_labels, probabilities)
    
    print('Precision = {:0.2f}.'.format(precision))
    print('Recall = {:0.2f}.'.format(recall))
    print('F1 = {:0.4f}.'.format(f1))
    print('LogLoss = {:0.4f}.'.format(loss))
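To make the log-loss step above concrete, here is a minimal pure-Python sketch of binary log loss computed from class 1 probabilities. The function name `binary_log_loss`, the clipping constant `eps` and the toy inputs are assumptions made for illustration; `sklearn.metrics.log_loss` may handle clipping differently depending on the version.

```python
import math


def binary_log_loss(labels, probs, eps=1e-15):
    """Average negative log-likelihood of the true labels under `probs`.

    labels: iterable of 0/1 outcomes
    probs:  predicted probability of class 1 for each observation
    eps:    clipping constant (an assumption here) that avoids log(0)
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)


# toy data, not taken from the travel dataset
labels = [1, 0, 1, 0]
probs = [0.9, 0.1, 0.6, 0.4]

loss = binary_log_loss(labels, probs)
print('LogLoss = {:0.4f}.'.format(loss))
```

Confident wrong predictions are penalised heavily, which is why log loss complements precision and recall: it looks at the probabilities rather than at the thresholded labels.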

In [17]:
evaluate(rf, df_train, y_train)

Precision = 1.00.
Recall = 1.00.
F1 = 0.9992.
LogLoss = 0.0004.


In [18]:
evaluate(rf, df_test, y_test)

Precision = 0.98.
Recall = 0.19.
F1 = 0.3158.
LogLoss = 0.4092.


The precision, recall, F1 and log-loss values we obtain are the same ones we observed when training the model, so the results are fully reproducible.
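In deployment code this comparison can be automated instead of checked by eye. The sketch below is one possible way to do it: the reference values are the test-set metrics printed above, while the helper `check_reproduced`, the tolerances (half a unit of the last printed decimal) and the `drifted` example are assumptions made for illustration.

```python
import math

# test-set metrics as printed above
REFERENCE = {'precision': 0.98, 'recall': 0.19, 'f1': 0.3158, 'log_loss': 0.4092}

# assumed tolerances: half a unit of the last decimal shown in the output
TOLERANCE = {'precision': 0.005, 'recall': 0.005, 'f1': 0.00005, 'log_loss': 0.00005}


def check_reproduced(observed, reference=REFERENCE, tolerance=TOLERANCE):
    """Return the names of the metrics that deviate from the reference values."""
    return [
        name
        for name, ref in reference.items()
        if not math.isclose(observed[name], ref, abs_tol=tolerance[name])
    ]


# a run that matches the reference exactly
matching = dict(REFERENCE)

# a hypothetical run in which recall has drifted
drifted = dict(REFERENCE, recall=0.25)

print(check_reproduced(matching))
print(check_reproduced(drifted))
```

An empty list means every metric was reproduced within tolerance; any name in the list points at a step where the research and deployment code diverge.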

## 4.2.- Prediction threshold verification

We will now verify the prediction threshold to be used. In this case it should be the default, 0.5.

In [19]:
results = rf.predict_proba(df_test)
predictions = pd.DataFrame({'0': results[:, 0], '1':results[:, 1]})

In [20]:
# the prediction threshold is 0.5
y_pred = (predictions.iloc[:,1] >= 0.5).astype(int)

In [21]:
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)   
f1 = f1_score(y_test, y_pred)

In [22]:
print('Precision = {:0.2f}.'.format(precision))
print('Recall = {:0.2f}.'.format(recall))
print('F1 = {:0.4f}.'.format(f1))

Precision = 0.98.
Recall = 0.19.
F1 = 0.3158.


We see that the same precision, recall and F1 values are obtained, which confirms that the threshold to use in production is 0.5.
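The check above can be reproduced by hand on a small example. The sketch below thresholds class 1 probabilities and derives precision, recall and F1 from the confusion counts; `metrics_at_threshold` and the toy data are hypothetical and are only meant to show how the metrics move when the cut-off changes.

```python
def metrics_at_threshold(labels, probs, threshold=0.5):
    """Precision, recall and F1 after labelling as 1 every prob >= threshold."""
    preds = [int(p >= threshold) for p in probs]

    # confusion counts for the positive class
    tp = sum(1 for y, y_hat in zip(labels, preds) if y == 1 and y_hat == 1)
    fp = sum(1 for y, y_hat in zip(labels, preds) if y == 0 and y_hat == 1)
    fn = sum(1 for y, y_hat in zip(labels, preds) if y == 1 and y_hat == 0)

    # fall back to 0.0 when a denominator is empty
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# toy data, not taken from the travel dataset
labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.5, 0.3, 0.6, 0.2, 0.1]

for threshold in (0.5, 0.25):
    precision, recall, f1 = metrics_at_threshold(labels, probs, threshold)
    print('threshold={:0.2f}  Precision={:0.2f}  Recall={:0.2f}  F1={:0.4f}'.format(
        threshold, precision, recall, f1))
```

Lowering the threshold trades precision for recall, so the value fixed at training time has to be carried into production unchanged.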