# III.- Travel Marketing Machine Learning Pipeline: Model Build 

There will be a notebook for each one of the Machine Learning Pipeline steps:

1. Data Analysis
2. Feature Engineering
3. Model Build

**This is the notebook for step 3: Model Build**

**The purpose of these notebooks is to provide an idea of the steps that must be covered when preparing a machine learning model for deployment.**

===================================================================================================

## Predicting Repeat Customers in the Travel Business

The aim of the project is to build a machine learning model to predict which customers of a travel agency are going to be repeat customers.

### Why is this important? 

The travel agency is giving out too many discounted packages without a return on investment (ROI): it wants to send discounted offers only to customers who are likely to repeat. It also wants to reduce churn by sending targeted marketing to customers who are likely to defect.

### What is the objective of the machine learning model?

We aim to identify customers that will repeat using data describing each customer's socioeconomic status and interests. 

====================================================================================================

## Travel marketing dataset: Initial Model Build

In the following cells, we will train and tune a random forest classifier. The choice of model is arbitrary as the idea is to provide an illustration of the steps that must be followed when preparing a machine learning model for production.


### Hyperparameter Tuning

We will perform model training & hyperparameter tuning in two stages:

1. A coarse random search will be used to get ballpark figures for the model hyperparameters.
2. The values found in the previous step will be used to define a refined full grid search to find optimal hyperparameters.

The criterion that will be used to tune the model is to **maximise the F1 score.** This choice of metric is arbitrary.
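As a reminder of what this metric rewards, the short sketch below computes F1 as the harmonic mean of precision and recall. The function name and the input values are hypothetical and chosen only for illustration; they are not results from this notebook.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values: a balanced model versus a model with
# very high precision but low recall.
balanced = f1_from_precision_recall(0.60, 0.60)
lopsided = f1_from_precision_recall(0.98, 0.20)

print('balanced F1 = {:0.4f}'.format(balanced))
print('lopsided F1 = {:0.4f}'.format(lopsided))
```

Because the harmonic mean is pulled towards the smaller of the two values, a model cannot reach a high F1 through precision alone.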


### Setting the seed

It is important to note that we are building the model with the idea of deploying it. Therefore, for each step that includes some element of randomness, it is extremely important that we **set the seed**. This way, we can obtain reproducibility between our research and our development code.
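The effect of fixing the seed can be checked with a minimal sketch on synthetic data. The dataset, the helper `fit_and_predict` and the small forest size are assumptions made for illustration; only the seed value mirrors the one used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

SEED = 801  # same value as the RANDOM_STATE used in this notebook

# Synthetic data standing in for the engineered travel dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     random_state=SEED)


def fit_and_predict(seed):
    """Train a small forest with the given seed and return class-1 probabilities."""
    model = RandomForestClassifier(n_estimators=10, random_state=seed)
    model.fit(X_demo, y_demo)
    return model.predict_proba(X_demo)[:, 1]


# Two independent training runs with the same seed
same_seed_match = np.array_equal(fit_and_predict(SEED), fit_and_predict(SEED))
print('Identical predictions with the same seed:', same_seed_match)
```

Without a fixed `random_state`, the bootstrap samples and feature subsets would differ between runs, so the two sets of probabilities would generally not match.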

In [1]:
# to handle datasets
import pandas as pd
import numpy as np

# pretty print
from pprint import pprint

# to build the model
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

# to assess model performance
from sklearn.metrics import log_loss
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score

# to save tuned model
import joblib

# maximum number of dataframe rows and columns displayed
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', None)

# to time hyperparameter searches
import time

# random seed
RANDOM_STATE = 801

# disable pandas' chained-assignment (SettingWithCopy) warning
pd.options.mode.chained_assignment = None

In [2]:
# load the engineered train and test sets
# that we saved in the previous notebook

X_train_encoded = pd.read_csv('X_train_engineered.csv')
X_test_encoded = pd.read_csv('X_test_engineered.csv')

In [3]:
X_train_encoded.shape

(18000, 168)

In [4]:
X_test_encoded.shape

(2000, 168)

In [5]:
# load target
y_train = pd.read_csv('y_train.csv')
y_test = pd.read_csv('y_test.csv')

In the following cells we will train and tune the hyperparameters of a random forest model.

# 1.- Coarse random search

We will first perform an initial random search to get ballpark figures for the model's hyperparameters.

In [6]:
#########################################################
# parameter values for a random grid search are defined #
#########################################################

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 25, stop = 175, num = 7)]

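# Number of features to consider at every split
# (for classifiers 'auto' is equivalent to 'sqrt'; 'auto' was deprecated in
# scikit-learn 1.1 and removed in 1.3)
max_features = ['auto', 'sqrt']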

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 30, num = 3)]
max_depth.append(None)

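# Minimum number of samples required to split a node
# (note: scikit-learn requires an integer value >= 2, so candidates with 1
# fail to fit and, with the default error_score, receive a NaN score)
min_samples_split = [1, 3, 5]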

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 3, 5]

# Method of selecting samples for training each tree
bootstrap = [True, False]

# Metric to assess the quality of a split.
criterion = ['gini', 'entropy']

# Create parameter grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap,
               'criterion': criterion}

pprint(random_grid)

{'bootstrap': [True, False],
 'criterion': ['gini', 'entropy'],
 'max_depth': [10, 20, 30, None],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 3, 5],
 'min_samples_split': [1, 3, 5],
 'n_estimators': [25, 50, 75, 100, 125, 150, 175]}
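To put the random search in context, the sketch below counts how many combinations this grid contains, using scikit-learn's `ParameterGrid`. The dictionary is copied from the output above; the name `demo_random_grid` is only used here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the random grid printed above
demo_random_grid = {
    'n_estimators': [25, 50, 75, 100, 125, 150, 175],
    'max_features': ['auto', 'sqrt'],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [1, 3, 5],
    'min_samples_leaf': [1, 3, 5],
    'bootstrap': [True, False],
    'criterion': ['gini', 'entropy'],
}

# ParameterGrid enumerates the Cartesian product of all value lists
n_combinations = len(ParameterGrid(demo_random_grid))
n_sampled = 200  # number of candidates drawn by the random search below

print('Total combinations:', n_combinations)
print('Fraction explored: {:0.1%}'.format(n_sampled / n_combinations))
```

An exhaustive search over this grid with 5-fold cross-validation would need five fits per combination, which is why a random sample is used for the coarse stage.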


In [7]:
# instantiate a random forest classifier 
rf = RandomForestClassifier(random_state = RANDOM_STATE)

In [8]:
# Use a random grid search to find initial hyperparameters

# Random parameter search, using 5-fold cross-validation;
# search across 200 different combinations, and use all available cores.
# The metric that will be used for the search is F1 (please note this is an arbitrary choice)
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 200, cv = 5, 
                               verbose=2, random_state=RANDOM_STATE, n_jobs = -1, scoring = 'f1')

In [9]:
# Fit the random search model
time_start = time.time()
rf_random.fit(X_train_encoded, np.ravel(y_train))
time_end = time.time()
print("Random grid search took" ,(time_end - time_start)/60., ' minutes')

Fitting 5 folds for each of 200 candidates, totalling 1000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   18.1s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done 632 tasks      | elapsed:  4.3min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  6.5min finished


Random grid search took 6.569512482484182  minutes


In [10]:
# These are the best hyperparameters found by the random search - these values will guide the 
# full search performed in the next section
pprint(rf_random.best_params_)

{'bootstrap': False,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'min_samples_leaf': 1,
 'min_samples_split': 3,
 'n_estimators': 75}
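Besides `best_params_`, a fitted search object exposes `cv_results_`, which is useful for checking how close the runner-up candidates were. The following is a minimal sketch on synthetic data with a deliberately tiny, hypothetical search space; it does not reproduce the search above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the engineered travel dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     random_state=801)

# Small hypothetical search space so the example runs quickly
demo_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=801),
    param_distributions={'n_estimators': [5, 10, 20],
                         'max_depth': [3, 5, None],
                         'min_samples_leaf': [1, 3, 5]},
    n_iter=5, cv=3, scoring='f1', random_state=801)
demo_search.fit(X_demo, y_demo)

# One row per sampled candidate, best candidate first
results = pd.DataFrame(demo_search.cv_results_)
ranking = results.sort_values('rank_test_score')[
    ['params', 'mean_test_score', 'std_test_score']]
print(ranking.head())
```

If the top candidates differ by less than their `std_test_score`, the ranking between them is not very reliable, which is one reason to follow up with a finer search.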


## 1.1.- Random Search Model Evaluation

We will now compare a base model (the one with default parameters) with the model that resulted from the random search; the idea is to verify that there has been an improvement in performance after the initial hyperparameter tuning.

In [11]:
def evaluate(model, test_features, test_labels):
    
    """    
    A function to assess a classification model's performance
    
    -------------------------------
    model: model to be evaluated
    test_features: features to be fed to the model
    test_labels: known outcomes
    """
    
    # precision & recall
    predictions = model.predict(test_features)
    precision = precision_score(test_labels, predictions)
    recall = recall_score(test_labels, predictions)   
    f1 = f1_score(test_labels, predictions)
   
    # log-loss
    probabilities = model.predict_proba(test_features)
    
    # keep the predictions for class 1 only
    probabilities = probabilities[:, 1]
    
    # calculate log loss
    loss = log_loss(test_labels, probabilities)
    
    print('Precision = {:0.2f}.'.format(precision))
    print('Recall = {:0.2f}.'.format(recall))
    print('F1 = {:0.4f}.'.format(f1))
    print('LogLoss = {:0.4f}.'.format(loss))

In [12]:
base_model = RandomForestClassifier(random_state = RANDOM_STATE)
base_model.fit(X_train_encoded, np.ravel(y_train))

RandomForestClassifier(random_state=801)

We evaluate the base model on the training and testing sets:

In [13]:
evaluate(base_model, X_train_encoded, y_train)

Precision = 1.00.
Recall = 1.00.
F1 = 0.9992.
LogLoss = 0.0833.


In [14]:
evaluate(base_model, X_test_encoded, y_test)

Precision = 0.98.
Recall = 0.17.
F1 = 0.2893.
LogLoss = 0.3570.


Now we evaluate the model resulting from our initial random search:

In [15]:
best_random = rf_random.best_estimator_

In [16]:
evaluate(best_random, X_train_encoded, y_train)

Precision = 1.00.
Recall = 1.00.
F1 = 0.9992.
LogLoss = 0.0410.


In [17]:
evaluate(best_random, X_test_encoded, y_test)

Precision = 1.00.
Recall = 0.17.
F1 = 0.2956.
LogLoss = 0.3523.


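The F1 score on the test set increases slightly after the random search (from 0.2893 to 0.2956), so the corresponding hyperparameters are a reasonable starting point for a full grid search on a finer grid. Note, however, that both models score almost perfectly on the training set while recall on the test set stays at 0.17, which indicates substantial overfitting.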

# 2.- Full grid search

Now that we know approximate values for the hyperparameters we can run a full grid search using these as guidance for search ranges.

In [18]:
#######################################################
# parameter values for a full grid search are defined #
#######################################################

# Number of trees in random forest
n_estimators = [60, 65, 70, 75, 80, 85]

# Number of features to consider at every split
max_features = ['auto']

# Maximum number of levels in tree
max_depth = [10, 20, 30]
max_depth.append(None)

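# Minimum number of samples required to split a node
# (note: scikit-learn requires an integer value >= 2, so candidates with 1
# fail to fit and, with the default error_score, receive a NaN score)
min_samples_split = [1, 2, 3, 4]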

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 3, 4]

# Method of selecting samples for training each tree
bootstrap = [False]

# Metric to assess the quality of a split.
criterion = ['gini']

# Create parameter grid
param_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap,
               'criterion': criterion}

pprint(param_grid)

{'bootstrap': [False],
 'criterion': ['gini'],
 'max_depth': [10, 20, 30, None],
 'max_features': ['auto'],
 'min_samples_leaf': [1, 2, 3, 4],
 'min_samples_split': [1, 2, 3, 4],
 'n_estimators': [60, 65, 70, 75, 80, 85]}


In [19]:
# Create a base model
rf = RandomForestClassifier(random_state = RANDOM_STATE)

In [20]:
# Instantiate the grid search 
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 5, n_jobs = -1, verbose = 2, scoring = 'f1')

In [21]:
# Fit the full grid search model
time_start = time.time()
grid_search.fit(X_train_encoded, np.ravel(y_train))
time_end = time.time()
print("Full grid search took" ,(time_end - time_start)/60., ' minutes')

Fitting 5 folds for each of 384 candidates, totalling 1920 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    3.2s
[Parallel(n_jobs=-1)]: Done 178 tasks      | elapsed:   34.3s
[Parallel(n_jobs=-1)]: Done 381 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 664 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done 1029 tasks      | elapsed:  4.8min
[Parallel(n_jobs=-1)]: Done 1474 tasks      | elapsed:  7.6min
[Parallel(n_jobs=-1)]: Done 1920 out of 1920 | elapsed: 10.9min finished


Full grid search took 10.993253016471863  minutes


In [22]:
# These are the optimal hyperparameters found by the full grid search
pprint(grid_search.best_params_)

{'bootstrap': False,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 65}


## 2.1.- Full Grid Search Model Evaluation

We now evaluate the model resulting from the full grid search.

In [23]:
best_grid = grid_search.best_estimator_

In [24]:
evaluate(best_grid, X_train_encoded, y_train)

Precision = 1.00.
Recall = 1.00.
F1 = 0.9992.
LogLoss = 0.0004.


In [25]:
evaluate(best_grid, X_test_encoded, y_test)

Precision = 0.98.
Recall = 0.19.
F1 = 0.3158.
LogLoss = 0.4092.


The F1 score on the test set improved after the full grid search (from 0.2956 to 0.3158), although the log loss on the test set worsened (from 0.3523 to 0.4092).
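The evaluations above use the default decision threshold of 0.5, at which recall on the test set is low. One option that is not explored in this notebook is to choose the threshold that maximises F1 on a held-out validation set, using `precision_recall_curve`. The sketch below illustrates the idea on synthetic, imbalanced data; the dataset, the split and all variable names are assumptions, and the threshold should not be tuned on the final test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the travel dataset
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.85, 0.15], flip_y=0.05,
                                     random_state=801)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=801)

model = RandomForestClassifier(n_estimators=50, random_state=801)
model.fit(X_fit, y_fit)
val_probabilities = model.predict_proba(X_val)[:, 1]

# Precision and recall for every candidate threshold; the final
# (precision=1, recall=0) point has no threshold, so it is dropped
precision, recall, thresholds = precision_recall_curve(y_val, val_probabilities)
denominator = precision[:-1] + recall[:-1]
f1_per_threshold = np.divide(2 * precision[:-1] * recall[:-1], denominator,
                             out=np.zeros_like(denominator),
                             where=denominator > 0)
best_threshold = thresholds[np.argmax(f1_per_threshold)]

# Compare F1 at the default and at the tuned threshold
f1_default = f1_score(y_val, (val_probabilities >= 0.5).astype(int))
f1_tuned = f1_score(y_val, (val_probabilities >= best_threshold).astype(int))

print('Best threshold = {:0.3f}'.format(best_threshold))
print('F1 at 0.5 = {:0.4f}, F1 at best threshold = {:0.4f}'.format(
    f1_default, f1_tuned))
```

A tuned threshold would have to be stored together with the model, since `predict` always applies the default cut-off.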

# 3.- Saving the tuned model

We will save the best model resulting from the full grid search as a pickle file that will be invoked by the production code.

In [26]:
# passing a filename lets joblib open and close the file itself
joblib.dump(best_grid, 'model.pkl')
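Before handing the file over to production code, it can be worth checking that the saved model loads and reproduces the same predictions. The following is a minimal sketch of such a round trip, using a small model trained on synthetic data and a temporary directory; these are illustrative stand-ins for `best_grid` and the real file location.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on synthetic data
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     random_state=801)
model = RandomForestClassifier(n_estimators=10, random_state=801)
model.fit(X_demo, y_demo)

# Save to a temporary location and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'model.pkl')
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should give exactly the same probabilities
predictions_match = np.array_equal(model.predict_proba(X_demo),
                                   restored.predict_proba(X_demo))
print('Restored model matches original:', predictions_match)
```

Note that a pickled scikit-learn model should be loaded with the same scikit-learn version that was used to save it.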

This concludes the model building section for this project.