# ECB Data Academy - Week 5 - Workshop - Solutions

[Krisolis](http://www.krisolis.ie)

## Building Ensemble Models in Python

## Workshop Tasks
This workshop builds on last week's workshop. The code below builds a decision tree for the channel prediction task. Perform the following tasks:
- For both a Bagging model and a GradientBoosting model:
  - Train a prediction model for the channel prediction problem.
  - Assess the performance of the model on the test dataset.

In [1]:
import pandas as pd
import numpy as np

import TAS_Python_Utilities

import sklearn
import sklearn.impute
import sklearn.model_selection
import sklearn.metrics
import sklearn.tree
import sklearn.svm
import sklearn.ensemble
import sklearn.linear_model
import sklearn.neighbors

import matplotlib.pyplot as plt
%matplotlib inline

import pandas_profiling

### Load Dataset

Import the dataset from the file InsureABC_Channel_Data.csv into a pandas data frame called abt.

In [2]:
abt = pd.read_csv("InsureABC_Channel_Data.csv", encoding = "UTF-8", index_col = 0)
target_feature_name = 'PrefChannel'
print(abt.columns)
print(abt.shape)
display(abt.head())

Index(['Title', 'GivenName', 'MiddleInitial', 'Surname', 'CreditCardType',
       'Occupation', 'Gender', 'Age', 'Location', 'MotorInsurance',
       'MotorValue', 'MotorType', 'HealthInsurance', 'HealthType',
       'HealthDependentsAdults', 'HealthDependentsKids', 'TravelInsurance',
       'TravelType', 'PrefChannel'],
      dtype='object')
(5200, 19)


CustomerID,Title,GivenName,MiddleInitial,Surname,CreditCardType,Occupation,Gender,Age,Location,MotorInsurance,MotorValue,MotorType,HealthInsurance,HealthType,HealthDependentsAdults,HealthDependentsKids,TravelInsurance,TravelType,PrefChannel
1,Mrs.,Macy,A,Boyle,AMEX,Clinical laboratory technologist,female,23,Urban,No,,,No,,,,Yes,Premium,SMS
2,Ms.,Thea,L,McIntosh,AMEX,,female,44,Urban,No,,,Yes,Level1,2.0,3.0,No,,Phone
3,Mr.,Niall,T,Graham,AMEX,Biophysicist,male,52,Urban,Yes,17274.0,Single,No,,,,No,,Phone
4,Ms.,Murron,P,Miller,AMEX,Sheriff,female,19,Urban,Yes,4920.0,Bundle,No,,,,No,,SMS
5,Mr.,Kai,A,Henderson,Visa,Automotive painter,male,47,Rural,Yes,14994.0,Single,Yes,Level1,1.0,2.0,Yes,Business,Phone


### Data Preparation - From Week 3 Workshop

Remove columns with too many levels

In [3]:
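# Drop the name and occupation columns; .copy() avoids a SettingWithCopyWarning on the later assignments
abt = abt[abt.columns.difference(['GivenName', 'MiddleInitial', 'Surname', 'Occupation'])].copy()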

Remap spurious target level values

In [4]:
abt.loc[abt['PrefChannel'] == 'P','PrefChannel'] = "Phone"
abt.loc[abt['PrefChannel'] == 'E','PrefChannel'] = "Email"
abt.loc[abt['PrefChannel'] == 'S','PrefChannel'] = "SMS"
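
The three assignments above can also be written as a single dictionary-based replacement. The following is a minimal sketch on a toy series (the values are illustrative, not taken from the workshop file); `level_map` is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for the PrefChannel column (illustrative values only)
channel = pd.Series(["Phone", "P", "E", "SMS", "S", "Email"], name="PrefChannel")

# Map each abbreviated level to its full name; values not in the map are left unchanged
level_map = {"P": "Phone", "E": "Email", "S": "SMS"}
channel_clean = channel.replace(level_map)

print(channel_clean.value_counts())
```

On the workshop data the equivalent call would be `abt['PrefChannel'] = abt['PrefChannel'].replace(level_map)`.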

Perform simple imputation on columns with missing values

In [5]:
imputers = dict()

imputers['HealthDependentsAdults'] = sklearn.impute.SimpleImputer(strategy="constant", fill_value = 0)
abt['HealthDependentsAdults'] = imputers['HealthDependentsAdults'].fit_transform(abt['HealthDependentsAdults'].values.reshape(-1, 1))

#imp = sklearn.impute.SimpleImputer(strategy="constant", fill_value = 0)
#abt['HealthDependentsAdults'] = imp.fit_transform(abt['HealthDependentsAdults'].values.reshape(-1, 1))

#imp = sklearn.impute.SimpleImputer(strategy="median")
#abt['HealthDependentsAdults'] = imp.fit_transform(abt['HealthDependentsAdults'].values.reshape(-1, 1))


imputers['HealthDependentsKids'] = sklearn.impute.SimpleImputer(strategy="constant", fill_value = 0)
abt['HealthDependentsKids'] = imputers['HealthDependentsKids'].fit_transform(abt['HealthDependentsKids'].values.reshape(-1, 1))

imputers['CreditCardType'] = sklearn.impute.SimpleImputer(strategy="constant", fill_value = 'missing')
abt['CreditCardType'] = imputers['CreditCardType'].fit_transform(abt['CreditCardType'].values.reshape(-1, 1))

imputers['MotorValue'] = sklearn.impute.SimpleImputer(strategy="constant", fill_value = 0)
abt['MotorValue'] = imputers['MotorValue'].fit_transform(abt['MotorValue'].values.reshape(-1, 1))

imputers['MotorType'] = sklearn.impute.SimpleImputer(strategy="constant", fill_value = 'none')
abt['MotorType'] = imputers['MotorType'].fit_transform(abt['MotorType'].values.reshape(-1, 1))

imputers['HealthType'] = sklearn.impute.SimpleImputer(strategy="constant", fill_value = 'none')
abt['HealthType'] = imputers['HealthType'].fit_transform(abt['HealthType'].values.reshape(-1, 1))

imputers['TravelType'] = sklearn.impute.SimpleImputer(strategy="constant", fill_value = 'none')
abt['TravelType'] = imputers['TravelType'].fit_transform(abt['TravelType'].values.reshape(-1, 1))
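
Since every column above uses the same constant-fill pattern, the repetition could be folded into a loop. This is a sketch under the assumption that a simple column-to-fill-value lookup is enough; the toy frame and the name `fill_values` are hypothetical.

```python
import numpy as np
import pandas as pd
import sklearn.impute

# Toy frame with the same kinds of gaps as the workshop data (made-up values)
toy = pd.DataFrame({
    "MotorValue": [17274.0, np.nan, 4920.0],
    "MotorType": pd.Series(["Single", np.nan, "Bundle"], dtype=object),
})

# Hypothetical lookup: column name -> constant used to fill missing values
fill_values = {"MotorValue": 0, "MotorType": "none"}

imputers = {}
for column, fill_value in fill_values.items():
    imputers[column] = sklearn.impute.SimpleImputer(strategy="constant", fill_value=fill_value)
    # fit_transform needs a 2-D input and returns a 2-D array, so select with [[column]] and flatten
    toy[column] = imputers[column].fit_transform(toy[[column]]).ravel()

print(toy)
```

Keeping the fitted imputers in a dictionary, as the original cell does, means the same transformations can be reapplied to new data later.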

### Build a Decision Tree Model - From Week 4 Workshop

Extract descriptive and target features.

In [6]:
X = abt[abt.columns.difference(['CustomerID', target_feature_name])]
Y = abt[target_feature_name]

Convert categorical features to dummy variables.

In [7]:
# Create dummy variables for all categorical features
X = pd.get_dummies(X) 
X

CustomerID,Age,HealthDependentsAdults,HealthDependentsKids,MotorValue,CreditCardType_AMEX,CreditCardType_Visa,CreditCardType_missing,Gender_f,Gender_female,Gender_m,...,Title_Mrs.,Title_Ms.,TravelInsurance_No,TravelInsurance_Yes,TravelType_Backpacker,TravelType_Business,TravelType_Premium,TravelType_Senior,TravelType_Standard,TravelType_none
1,23,0.0,0.0,0.0,1,0,0,0,1,0,...,1,0,0,1,0,0,1,0,0,0
2,44,2.0,3.0,0.0,1,0,0,0,1,0,...,0,1,1,0,0,0,0,0,0,1
3,52,0.0,0.0,17274.0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
4,19,0.0,0.0,4920.0,1,0,0,0,1,0,...,0,1,1,0,0,0,0,0,0,1
5,47,1.0,2.0,14994.0,0,1,0,0,0,0,...,0,0,0,1,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5196,68,1.0,0.0,0.0,0,1,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
5197,45,0.0,0.0,36429.0,0,1,0,0,1,0,...,1,0,1,0,0,0,0,0,0,1
5198,48,0.0,3.0,14206.0,0,1,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
5199,47,1.0,2.0,27251.0,0,1,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
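
To make the effect of `pd.get_dummies` concrete, here is a small sketch on a made-up two-column frame: numeric columns pass through unchanged, while each categorical column is replaced by one indicator column per level.

```python
import pandas as pd

# Made-up frame with one numeric and one categorical feature
toy = pd.DataFrame({"Age": [23, 44, 52], "Gender": ["female", "female", "male"]})

# dtype=int requests 0/1 indicators; recent pandas versions default to booleans
dummies = pd.get_dummies(toy, dtype=int)

print(dummies.columns.tolist())
print(dummies)
```

Because each distinct level becomes its own column, inconsistently coded levels in the raw data show up as separate indicator columns.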


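Divide this data into a 50% training set, a 20% validation set, and a 30% test set. The second split operates on the 70% left after removing the test set, so its sizes are expressed as 0.2/0.7 and 0.5/0.7.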

In [8]:
X_train_plus_valid, X_test, y_train_plus_valid, y_test = \
sklearn.model_selection.train_test_split(X, Y, random_state=0, test_size = 0.30, train_size = 0.7, stratify=Y)

X_train, X_valid, y_train, y_valid = \
sklearn.model_selection.train_test_split(X_train_plus_valid, y_train_plus_valid, \
                                    random_state=0, test_size = 0.2/0.7, train_size = 0.5/0.7, stratify=y_train_plus_valid)
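
The resulting proportions can be checked numerically. The sketch below repeats the two-stage split on synthetic data (1000 random rows, not the workshop data) and reports each partition as a fraction of the whole; all `*_demo` names are placeholders.

```python
import numpy as np
import sklearn.model_selection

# Synthetic stand-in: 1000 rows, three classes with unequal frequencies
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = rng.choice(["Email", "Phone", "SMS"], size=1000, p=[0.43, 0.39, 0.18])

# Stage 1: hold out 30% as the test set
X_tv, X_te, y_tv, y_te = sklearn.model_selection.train_test_split(
    X_demo, y_demo, random_state=0, test_size=0.30, stratify=y_demo)

# Stage 2: split the remaining 70% so that validation is 20% of the original data
X_tr, X_va, y_tr, y_va = sklearn.model_selection.train_test_split(
    X_tv, y_tv, random_state=0, test_size=0.2 / 0.7, stratify=y_tv)

fractions = {name: len(part) / len(X_demo)
             for name, part in [("train", X_tr), ("valid", X_va), ("test", X_te)]}
print(fractions)
```

The fractions come out close to 0.5, 0.2, and 0.3; small deviations arise because partition sizes are rounded to whole rows.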

Train a decision tree to predict the PrefChannel target feature from the training set.

In [9]:
my_tree = sklearn.tree.DecisionTreeClassifier(criterion="entropy",
                                             min_samples_leaf=60)
my_tree.fit(X_train,y_train)

DecisionTreeClassifier(criterion='entropy', min_samples_leaf=60)

Assess the performance of the tree

In [10]:
print("******** Training Data ********")
# Make a set of predictions for the training data
y_pred = my_tree.predict(X_train)

# Print performance details
print(sklearn.metrics.classification_report(y_train, y_pred))

# Print confusion matrix
print("Confusion Matrix")
display(pd.crosstab(y_train, y_pred, rownames=['True'], colnames=['Predicted'], margins=True))

print("****** Validation Data ********")

# Make a set of predictions for the validation data
y_pred = my_tree.predict(X_valid)

# Print performance details
print(sklearn.metrics.classification_report(y_valid, y_pred))

# Print confusion matrix
print("Confusion Matrix")
display(pd.crosstab(y_valid, y_pred, rownames=['True'], colnames=['Predicted'], margins=True))

******** Training Data ********
              precision    recall  f1-score   support

       Email       0.65      0.73      0.69      1116
       Phone       0.75      0.73      0.74      1003
         SMS       0.61      0.44      0.51       480

    accuracy                           0.68      2599
   macro avg       0.67      0.64      0.65      2599
weighted avg       0.68      0.68      0.68      2599

Confusion Matrix


True \ Predicted,Email,Phone,SMS,All
Email,820,180,116,1116
Phone,244,737,22,1003
SMS,207,60,213,480
All,1271,977,351,2599


****** Validation Data ********
              precision    recall  f1-score   support

       Email       0.60      0.68      0.64       447
       Phone       0.72      0.71      0.71       401
         SMS       0.50      0.38      0.43       192

    accuracy                           0.63      1040
   macro avg       0.61      0.59      0.59      1040
weighted avg       0.63      0.63      0.63      1040

Confusion Matrix


True \ Predicted,Email,Phone,SMS,All
Email,302,81,64,447
Phone,109,283,9,401
SMS,90,30,72,192
All,501,394,145,1040
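
The same report-and-confusion-matrix block is repeated for every model in this workshop. One way to avoid the repetition is a small helper; `report_performance` is a hypothetical name, and the toy tree below exists only to make the sketch runnable on its own.

```python
import numpy as np
import pandas as pd
import sklearn.metrics
import sklearn.tree

def report_performance(model, X, y, label):
    """Print a classification report for one dataset and return its confusion matrix."""
    print(f"******** {label} ********")
    y_pred = model.predict(X)
    print(sklearn.metrics.classification_report(y, y_pred))
    # Rows are true classes, columns are predicted classes, with totals in 'All'
    return pd.crosstab(np.asarray(y), np.asarray(y_pred),
                       rownames=["True"], colnames=["Predicted"], margins=True)

# Toy demonstration data (made up); an unconstrained tree fits it exactly
X_toy = pd.DataFrame({"Age": [20, 25, 40, 45, 60, 65]})
y_toy = pd.Series(["SMS", "SMS", "Email", "Email", "Phone", "Phone"])
toy_tree = sklearn.tree.DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

confusion = report_performance(toy_tree, X_toy, y_toy, "Toy Data")
print(confusion)
```

With such a helper, each assessment cell would reduce to two calls, for example `report_performance(my_tree, X_train, y_train, "Training Data")` followed by the same call on the validation data.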


### Build Ensemble Models

Train a **Bagging** ensemble model.

In [11]:
bagging_model = sklearn.ensemble.BaggingClassifier(sklearn.tree.DecisionTreeClassifier(min_samples_leaf=0.05),
                                                  n_estimators=50)
bagging_model.fit(X_train,y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(min_samples_leaf=0.05),
                  n_estimators=50)
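
As an optional addition that is not part of the original solution, `BaggingClassifier` can report an out-of-bag estimate: each tree is scored on the training rows left out of its bootstrap sample, which gives a rough accuracy estimate without touching the validation set. The sketch below uses synthetic data from `make_classification`, so the number it prints says nothing about the channel prediction task.

```python
import sklearn.datasets
import sklearn.ensemble
import sklearn.tree

# Synthetic three-class problem standing in for the workshop data
X_demo, y_demo = sklearn.datasets.make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0)

# The base tree is passed positionally, as in the cell above, because the
# keyword name changed between scikit-learn versions
bagging_demo = sklearn.ensemble.BaggingClassifier(
    sklearn.tree.DecisionTreeClassifier(min_samples_leaf=0.05),
    n_estimators=50, oob_score=True, random_state=0)
bagging_demo.fit(X_demo, y_demo)

print("Number of trees:", len(bagging_demo.estimators_))
print("Out-of-bag accuracy:", bagging_demo.oob_score_)
```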

Assess the performance of the Bagging model

In [12]:
print("******** Training Data ********")
# Make a set of predictions for the training data
y_pred = bagging_model.predict(X_train)

# Print performance details
print(sklearn.metrics.classification_report(y_train, y_pred))

# Print confusion matrix
print("Confusion Matrix")
display(pd.crosstab(y_train, y_pred, rownames=['True'], colnames=['Predicted'], margins=True))

print("****** Validation Data ********")

# Make a set of predictions for the validation data
y_pred = bagging_model.predict(X_valid)

# Print performance details
print(sklearn.metrics.classification_report(y_valid, y_pred))

# Print confusion matrix
print("Confusion Matrix")
display(pd.crosstab(y_valid, y_pred, rownames=['True'], colnames=['Predicted'], margins=True))

******** Training Data ********
              precision    recall  f1-score   support

       Email       0.63      0.66      0.65      1116
       Phone       0.70      0.71      0.71      1003
         SMS       0.58      0.50      0.54       480

    accuracy                           0.65      2599
   macro avg       0.64      0.62      0.63      2599
weighted avg       0.65      0.65      0.65      2599

Confusion Matrix


True \ Predicted,Email,Phone,SMS,All
Email,739,228,149,1116
Phone,265,712,26,1003
SMS,169,70,241,480
All,1173,1010,416,2599


****** Validation Data ********
              precision    recall  f1-score   support

       Email       0.61      0.61      0.61       447
       Phone       0.69      0.72      0.70       401
         SMS       0.50      0.44      0.47       192

    accuracy                           0.62      1040
   macro avg       0.60      0.59      0.59      1040
weighted avg       0.62      0.62      0.62      1040

Confusion Matrix


True \ Predicted,Email,Phone,SMS,All
Email,273,98,76,447
Phone,104,288,9,401
SMS,74,33,85,192
All,451,419,170,1040


Train a **GradientBoosting** ensemble model.

In [16]:
gb_model = sklearn.ensemble.GradientBoostingClassifier(min_samples_split = 0.2, n_estimators=500)
gb_model.fit(X_train,y_train)

GradientBoostingClassifier(min_samples_split=0.2, n_estimators=500)

Assess the performance of the GradientBoosting model

In [17]:
print("******** Training Data ********")
# Make a set of predictions for the training data
y_pred = gb_model.predict(X_train)

# Print performance details
print(sklearn.metrics.classification_report(y_train, y_pred))

# Print confusion matrix
print("Confusion Matrix")
display(pd.crosstab(y_train, y_pred, rownames=['True'], colnames=['Predicted'], margins=True))

print("****** Validation Data ********")

# Make a set of predictions for the validation data
y_pred = gb_model.predict(X_valid)

# Print performance details
print(sklearn.metrics.classification_report(y_valid, y_pred))

# Print confusion matrix
print("Confusion Matrix")
display(pd.crosstab(y_valid, y_pred, rownames=['True'], colnames=['Predicted'], margins=True))

******** Training Data ********
              precision    recall  f1-score   support

       Email       0.74      0.79      0.76      1116
       Phone       0.80      0.82      0.81      1003
         SMS       0.73      0.59      0.65       480

    accuracy                           0.76      2599
   macro avg       0.76      0.73      0.74      2599
weighted avg       0.76      0.76      0.76      2599

Confusion Matrix


True \ Predicted,Email,Phone,SMS,All
Email,879,146,91,1116
Phone,168,820,15,1003
SMS,138,59,283,480
All,1185,1025,389,2599


****** Validation Data ********
              precision    recall  f1-score   support

       Email       0.60      0.63      0.61       447
       Phone       0.69      0.71      0.70       401
         SMS       0.47      0.39      0.43       192

    accuracy                           0.62      1040
   macro avg       0.59      0.58      0.58      1040
weighted avg       0.61      0.62      0.61      1040

Confusion Matrix


True \ Predicted,Email,Phone,SMS,All
Email,282,92,73,447
Phone,106,284,11,401
SMS,83,34,75,192
All,471,410,159,1040
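
The gap between training accuracy (0.76) and validation accuracy (0.62) suggests the 500-tree model is overfitting. One way to explore this is `staged_predict`, which yields the ensemble's predictions after each boosting stage, so validation accuracy can be tracked as trees are added. The sketch below runs on synthetic data with 100 trees to keep it quick; the `*_demo` names and `best_n` are placeholders.

```python
import sklearn.datasets
import sklearn.ensemble
import sklearn.metrics
import sklearn.model_selection

# Synthetic three-class problem standing in for the workshop data
X_demo, y_demo = sklearn.datasets.make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0)
X_tr, X_va, y_tr, y_va = sklearn.model_selection.train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0)

gb_demo = sklearn.ensemble.GradientBoostingClassifier(
    min_samples_split=0.2, n_estimators=100, random_state=0)
gb_demo.fit(X_tr, y_tr)

# One accuracy value per boosting stage (1 tree, 2 trees, ..., 100 trees)
valid_accuracy = [sklearn.metrics.accuracy_score(y_va, y_stage)
                  for y_stage in gb_demo.staged_predict(X_va)]

# Stage count with the highest validation accuracy (first one if tied)
best_n = valid_accuracy.index(max(valid_accuracy)) + 1
print("Best number of trees on validation data:", best_n)
```

If validation accuracy peaks well before the final stage, a smaller `n_estimators` is worth including in any later hyper-parameter search.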


Use a **grid search** to optimize the hyper-parameters of the gradient boosting model. (It is probably useful to optimize *n_estimators*, *max_features*, and *min_samples_split*.)

In [22]:
# Set up the parameter grid to search
param_grid = [
 {'n_estimators': list(range(100, 501, 50)), 
  'max_features': list(range(2, 10, 2)), 
  'min_samples_split': list(range(20, 200, 50)) }
]

# Perform the search
my_tuned_model = sklearn.model_selection.GridSearchCV(sklearn.ensemble.GradientBoostingClassifier(), 
                                                      param_grid, 
                                                      cv=5,
                                                      verbose = 2,
                                                      n_jobs = -1)
my_tuned_model.fit(X_train_plus_valid, y_train_plus_valid)

# Print details
print("Best parameters set found on development set:")
print(my_tuned_model.best_params_)
print(my_tuned_model.best_score_)

Fitting 5 folds for each of 144 candidates, totalling 720 fits
Best parameters set found on development set:
{'max_features': 2, 'min_samples_split': 170, 'n_estimators': 150}
0.6565004459089742
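
The candidate count in the log follows directly from the grid: 9 values of `n_estimators` times 4 of `max_features` times 4 of `min_samples_split` gives 144 combinations, and 5-fold cross-validation fits each one 5 times. A short sketch using `ParameterGrid` confirms the arithmetic without running any fits.

```python
import sklearn.model_selection

# Same grid as in the search above
param_grid = [
    {'n_estimators': list(range(100, 501, 50)),      # 100, 150, ..., 500 -> 9 values
     'max_features': list(range(2, 10, 2)),          # 2, 4, 6, 8         -> 4 values
     'min_samples_split': list(range(20, 200, 50))}  # 20, 70, 120, 170   -> 4 values
]

n_candidates = len(sklearn.model_selection.ParameterGrid(param_grid))
n_fits = n_candidates * 5  # cv=5 folds per candidate

print(n_candidates, n_fits)
```

Checking the size this way before launching a search is a cheap guard against grids that grow larger than intended.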


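Assess the performance of the tuned model. Because the grid search refits the best model on the combined training and validation data, the validation figures below are computed on data the model has already seen and are likely optimistic. The cross-validated score above (0.657) is a fairer estimate, and the held-out test set gives an independent one.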

In [23]:
print("******** Training Data ********")
# Make a set of predictions for the training data
y_pred = my_tuned_model.predict(X_train)

# Print performance details
print(sklearn.metrics.classification_report(y_train, y_pred))

# Print confusion matrix
print("Confusion Matrix")
display(pd.crosstab(y_train, y_pred, rownames=['True'], colnames=['Predicted'], margins=True))

print("****** Validation Data ********")

# Make a set of predictions for the validation data
y_pred = my_tuned_model.predict(X_valid)

# Print performance details
print(sklearn.metrics.classification_report(y_valid, y_pred))

# Print confusion matrix
print("Confusion Matrix")
display(pd.crosstab(y_valid, y_pred, rownames=['True'], colnames=['Predicted'], margins=True))

******** Training Data ********
              precision    recall  f1-score   support

       Email       0.65      0.74      0.69      1116
       Phone       0.76      0.73      0.75      1003
         SMS       0.61      0.45      0.52       480

    accuracy                           0.68      2599
   macro avg       0.67      0.64      0.65      2599
weighted avg       0.68      0.68      0.68      2599

Confusion Matrix


True \ Predicted,Email,Phone,SMS,All
Email,825,173,118,1116
Phone,245,735,23,1003
SMS,204,60,216,480
All,1274,968,357,2599


****** Validation Data ********
              precision    recall  f1-score   support

       Email       0.63      0.69      0.66       447
       Phone       0.74      0.73      0.73       401
         SMS       0.53      0.42      0.47       192

    accuracy                           0.66      1040
   macro avg       0.63      0.61      0.62      1040
weighted avg       0.65      0.66      0.65      1040

Confusion Matrix


True \ Predicted,Email,Phone,SMS,All
Email,310,73,64,447
Phone,100,292,9,401
SMS,79,32,81,192
All,489,397,154,1040
