In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import xgboost as xgb
import os
from sklearn.preprocessing import OneHotEncoder , LabelEncoder
from sklearn.model_selection import train_test_split , GridSearchCV , RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier , RandomForestClassifier , AdaBoostClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn import metrics
from sklearn.metrics import confusion_matrix # creates a confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay # draws a confusion matrix (plot_confusion_matrix was removed in scikit-learn 1.2)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

## ML Final Project

**Scenario:** You work at a multinational bank that is aiming to increase its market share in
Europe. Recently, it has been noticed that the number of customers using the banking 
services has declined, and the bank is worried that existing customers have stopped 
using them as their main bank. <br> 

As a data scientist, you are tasked with finding out the 
reasons behind customer churn (when a customer stops using them as the main bank) and to predict customer churn. <br> 

The marketing team, 
in particular, is interested in your findings and want to better understand existing 
customer behavior and possibly predict customer churn. Your results will help the 
marketing team to use their budget wisely to target potential churners. To achieve 
this objective, in this exercise, you will import the banking data (Churn_Modelling.csv) 
provided by the bank and do some machine learning to solve their problem.

Data dictionary

- CustomerID: Unique ID of each customer
- CredRate: Credit Score of the customer 
- Geography: Country customer is from 
- Gender
- Age
- Tenure: How long customer has been with bank 
- Balance: Amount of money customer has/had with the bank
- Prod Number: Number of products customer has with bank 
- HasCrCard: Does customer have credit card
- ActMem: Is customer active member 
- Estimated salary: Annual estimated salary of customer 
- Exited: Whether customer has churned (1 is yes)

# 1. Introduction

Objective: To better understand existing customer behavior and predict potential churners, so that the marketing budget is used more effectively.

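Input Features: "CredRate", "Geography" (one-hot encoded), "Gender" (one-hot encoded), "Age", "Tenure", "Balance", "Prod Number", "HasCrCard", "ActMem" and "Estimated salary"

Drop Features: "CustomerID", the raw "Geography" and "Gender" text columns (replaced by their one-hot columns), the EDA bin columns, and "Exited" (prediction target)

Prediction Target: "Exited" variable (1 = churned, 0 = retained). The business goal is to reduce the share of 1s.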

# 2. Data Understanding and Preprocessing

### Visualize Dataset

In [None]:
#Import Data
data = pd.read_csv("Churn_Modelling.csv")

In [None]:
#Strip spaces in column headers, as they cause errors in code - Prod Number and Estimated Salary
data.columns = data.columns.str.replace(' ', '')
#Check data
data.head()

### Data Exploration

In [None]:
#checking for null values
data.isna().sum()

In [None]:
#look at the 14 null data - 13 (Active = 0) vs 1 (Churned = 1)
data[data.isnull().any(axis=1)]

Let's take a look at the NaN features - Gender, Age and Estimated Salary

In [None]:
plt.figure(figsize=(13,13))
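# Correlation is only defined for numeric columns, so select them explicitly
sns.heatmap(data.select_dtypes(include='number').corr(), annot=True)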

No other variable is closely correlated with Age or Estimated Salary, so there is no obvious proxy for imputing the missing values. Gender is non-numeric and does not appear in the correlation matrix.

Let's look at the data statistics.

In [None]:
data.describe()

No particular skewness observed for Age data.

Mean = 38,
Median = 37

No particular skewness observed for Estimated Salary data.

Mean = 100,074,
Median = 100,168

Let's fill the Age and Estimated Salary NaN values with their median values (mean and median are nearly identical here).

In [None]:
data['Age'] = data['Age'].fillna(data['Age'].median())
data['EstimatedSalary'] = data['EstimatedSalary'].fillna(data['EstimatedSalary'].median())
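
The mean-versus-median check above can be sketched on a small series. The ages below are made up for illustration and are not taken from the bank data; the point is only to show how the two candidate fill values and the skewness can be compared before imputing.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries (not the bank data).
age = pd.Series([22, 35, 37, 38, 41, 45, np.nan, 52, np.nan, 60], name="Age")

# Compare the two candidate fill values and the sample skewness.
mean_age, median_age, skew_age = age.mean(), age.median(), age.skew()
print(f"mean={mean_age:.2f} median={median_age:.2f} skew={skew_age:.2f}")

# The median is less sensitive to outliers, so it is a reasonable default fill.
age_filled = age.fillna(median_age)
print(age_filled.isna().sum())
```

When mean and median are this close, either fill value gives almost the same result; the median only matters more as the distribution becomes skewed.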

In [None]:
df = data.groupby(['Gender','Exited']).size().reset_index().pivot(columns='Exited', index='Gender', values=0)
df.plot(kind='bar', stacked=True)

No particular skewness is observed for Gender. Female customers have a higher tendency to churn than male customers.

Fill the NA gender with mode (highest frequency)

In [None]:
data['Gender'] = data['Gender'].fillna(data['Gender'].mode()[0])
# data['Gender'].mode()[0]

In [None]:
#checking for null values again
data.isna().sum()

### Binning for EDA - Age

In [None]:
data['age_bins'] = pd.cut(x=data['Age'], bins=[0, 29, 39, 49, 59, 99])
data

In [None]:
df = data.groupby(['age_bins','Exited']).size().reset_index().pivot(columns='Exited', index='age_bins', values=0)
df.plot(kind='bar', stacked=True, figsize=(7,5))
df.plot(figsize=(7,5))

The churn rate is highest for customers in their 40s (%), followed by those in their 50s (%), then their 30s (%). These are our prime spenders.
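
The stacked bars show counts, so the churn rate per age group has to be computed separately. A minimal sketch, assuming a toy frame whose column names mirror the notebook's (`Age`, `Exited`, `age_bins`); the rows are invented, so the printed rates say nothing about the real data.

```python
import pandas as pd

# Hypothetical mini-sample; column names mirror the notebook's.
toy = pd.DataFrame({
    "Age":    [25, 28, 33, 36, 38, 42, 44, 47, 52, 55, 58, 63],
    "Exited": [0,  0,  0,  1,  0,  1,  1,  0,  1,  1,  0,  0],
})
toy["age_bins"] = pd.cut(toy["Age"], bins=[0, 29, 39, 49, 59, 99])

# Exited is 0/1, so its mean within a bin is that bin's churn rate.
churn_rate = toy.groupby("age_bins", observed=False)["Exited"].mean() * 100
print(churn_rate.round(1))
```

Running the same `groupby(...).mean()` on `data` would give the percentages for each age band directly.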

### Binning for EDA - Salary

In [None]:
data['salary_bins'] = pd.cut(x=data['EstimatedSalary'], bins=[0,30000,60000,90000,120000,150000,200000])

In [None]:
df = data.groupby(['salary_bins','Exited']).size().reset_index().pivot(columns='Exited', index='salary_bins', values=0)
df.plot(kind='bar', stacked=True, figsize=(7,5))
df.plot(figsize=(7,5))

Churn Rate is pretty consistent across different Salary ranges with the exception of slightly higher churn rate observed for >150k Salary (highest spending power).
Action must be taken to retain these customers with higher spending power.

In [None]:
df = data.groupby(['ProdNumber','Exited']).size().reset_index().pivot(columns='Exited', index='ProdNumber', values=0)
df.plot(kind='bar', stacked=True, figsize=(7,5))
df.plot(figsize=(7,5))

The sweet spot seems to be 2 products: we should cross-sell customers up to 2 products, then stop. Customers with more than 3 products have a higher tendency to exit.

### Binning for EDA - Balance

In [None]:
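# Note: pd.cut intervals are right-closed by default, so Balance == 0 falls outside (0, 60000] and becomes NaN
data['bal_bins'] = pd.cut(x=data['Balance'], bins=[0,60000,90000,120000,150000,180000,260000])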

In [None]:
df = data.groupby(['bal_bins','Exited']).size().reset_index().pivot(columns='Exited', index='bal_bins', values=0)
df.plot(kind='bar', stacked=True, figsize=(7,5))
df.plot(figsize=(7,5))

The trend is roughly bell-shaped across balance bins, with customers holding a 90k-150k balance churning the most!
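
One detail worth checking when binning balances: `pd.cut` builds right-closed intervals by default, so a value equal to the lowest edge is left out. The sketch below uses hypothetical balances and shortened bin edges to show the effect and the `include_lowest=True` option.

```python
import pandas as pd

balance = pd.Series([0.0, 50_000.0, 60_000.0, 125_000.0])  # hypothetical balances
edges = [0, 60_000, 120_000, 260_000]

default_bins = pd.cut(balance, bins=edges)                      # intervals are (left, right]
with_lowest = pd.cut(balance, bins=edges, include_lowest=True)  # first interval also takes 0

n_unbinned_default = int(default_bins.isna().sum())  # rows that fell outside every bin
n_unbinned_lowest = int(with_lowest.isna().sum())
print(n_unbinned_default, n_unbinned_lowest)
```

Whether zero-balance customers belong in the first bin or in a bin of their own is a modelling choice; the sketch only makes the default behaviour visible.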

### Age X Salary (Active) - Acquisition

In [None]:
data0=data[data.Exited==0]
df = data0.groupby(['salary_bins','age_bins']).size().reset_index().pivot(columns='salary_bins', index='age_bins', values=0)
df.plot(kind='bar', stacked=True, figsize=(10,7))
df.plot(figsize=(10,6))

### Age X Salary (Exited) - Acquisition

In [None]:
data1=data[data.Exited==1]
df = data1.groupby(['salary_bins','age_bins']).size().reset_index().pivot(columns='salary_bins', index='age_bins', values=0)
df.plot(kind='bar', stacked=True, figsize=(10,7))
df.plot(figsize=(10,6))

Looking at those who have churned (chart above), action must be taken to recover the prime age groups (30s and 40s) and high-income earners (90k and above).

### Age X ProdNumber (Active)

In [None]:
data0=data[data.Exited==0]
df = data0.groupby(['ProdNumber','age_bins']).size().reset_index().pivot(columns='ProdNumber', index='age_bins', values=0)
df.plot(kind='bar', stacked=True, figsize=(10,7))
df.plot(figsize=(10,6))

### Age X ProdNumber (Exited)

In [None]:
data1=data[data.Exited==1]
df = data1.groupby(['ProdNumber','age_bins']).size().reset_index().pivot(columns='ProdNumber', index='age_bins', values=0)
df.plot(kind='bar', stacked=True, figsize=(10,7))
df.plot(figsize=(10,6))

The counts are roughly bell-shaped across age groups.

In [None]:
df = data.groupby(['ProdNumber','bal_bins']).size().reset_index().pivot(columns='ProdNumber', index='bal_bins', values=0)
df.plot(kind='bar', stacked=True, figsize=(10,6))
df.plot(figsize=(10,5))

### Balance X ProdNumber (Active)  - Existing Customers

In [None]:
data0=data[data.Exited==0]
df = data0.groupby(['ProdNumber','bal_bins']).size().reset_index().pivot(columns='ProdNumber', index='bal_bins', values=0)
df.plot(kind='bar', stacked=True, figsize=(10,6))
df.plot(figsize=(10,5))

Customers who stay with the bank mainly hold 1-2 products. We shouldn't be too aggressive in upselling customers.

### Balance X ProdNumber (Exited)  - Existing Customers

In [None]:
data1=data[data.Exited==1]
df = data1.groupby(['ProdNumber','bal_bins']).size().reset_index().pivot(columns='bal_bins', index='ProdNumber', values=0)
df.plot(kind='bar', stacked=True, figsize=(10,6))
df.plot(figsize=(10,5))

We should focus on cross-selling customers with a 90k to 150k balance from 1 product to 2 products to reduce the churn rate!

In [None]:
ax1 = sns.barplot(x="ProdNumber", y="Balance", hue="Exited", data=data, estimator=sum)
ax1.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))

In [None]:
ax1 = sns.barplot(x="ProdNumber", y="Balance", data=data1, estimator=sum)
ax1.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))

### Age X Balance

In [None]:
df = data0.groupby(['bal_bins','age_bins']).size().reset_index().pivot(columns='bal_bins', index='age_bins', values=0)
df.plot(kind='bar', stacked=True, figsize=(10,7))
df.plot(figsize=(10,6))

In [None]:
plt.figure(figsize=(10,8))
ax = sns.barplot(x="age_bins", y="Balance", hue="Exited", data=data, estimator=sum)
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))

In [None]:
ax1 = sns.barplot(x="age_bins", y="Balance", data=data1, estimator=sum)
ax1.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))

In total, close to $170MM of funds exited!

In [None]:
data1=data[data.Exited==1]
df = data1.groupby(['bal_bins','age_bins']).size().reset_index().pivot(columns='bal_bins', index='age_bins', values=0)
df.plot(kind='bar', stacked=True, figsize=(10,7))
df.plot(figsize=(10,6))

In [None]:
data['cr_bins'] = pd.cut(x=data['CredRate'], bins=[300,579,669,739,799,850])

In [None]:
df = data.groupby(['cr_bins','Exited']).size().reset_index().pivot(columns='Exited', index='cr_bins', values=0)
df.plot(kind='bar', stacked=True, figsize=(10,5))
df.plot(figsize=(10,5))

In [None]:
df['total'] = df[0]+df[1]
df['perc0'] = df[0].div(df[0].sum(),0)*100
df['perc1'] = df[1].div(df[1].sum(),0)*100
df['percT'] = df['total'].div(df.total.sum(),0)*100
df

In [None]:
df_total = df[0]+df[1]
df_rel = df[df.columns[:2]].div(df_total, 0)*100
df_rel
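
The two percentage tables above (share of each column, and churn share within each row) can also be produced with `pd.crosstab` and its `normalize` argument. This is an alternative sketch on invented rows, with band labels standing in for the `cr_bins` intervals.

```python
import pandas as pd

# Hypothetical rows; "Fair"/"Good" stand in for the credit-score bins.
toy = pd.DataFrame({
    "cr_bins": ["Fair", "Fair", "Fair", "Fair", "Good", "Good", "Good", "Good", "Good"],
    "Exited":  [1, 0, 0, 0, 1, 1, 0, 0, 0],
})

# normalize="index": each row sums to 100 (churn rate within a band).
row_pct = pd.crosstab(toy["cr_bins"], toy["Exited"], normalize="index") * 100
# normalize="columns": each column sums to 100 (where the churners come from).
col_pct = pd.crosstab(toy["cr_bins"], toy["Exited"], normalize="columns") * 100

print(row_pct.round(1))
print(col_pct.round(1))
```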

### One of the best-known types of credit score is the FICO® Score, created by the Fair Isaac Corporation. FICO® Scores are used by many lenders and often range from 300 to 850.

Credit Score (Rating): % of customers - Impact

300-579 (Very Poor): 24% - Credit applicants may be required to pay a fee or deposit, and applicants with this rating may not be approved for credit at all.

580-669 (Fair): 33% - Applicants with scores in this range are considered to be subprime borrowers.

670-739 (Good): 24% - Studies show that only 8% of applicants in this score range are likely to become seriously delinquent in the future.

740-799 (Very Good): 12% - Applicants with scores here are likely to receive better than average rates from lenders.

800-850 (Exceptional): 7% - Applicants with scores in this range are at the top of the list for the best rates from lenders.

### Insights/Analysis
Based on the score bands above (range 300-850), more than half of the bank's customers (57%) score below 670 and are considered subprime borrowers. We should consider targeting customers whose credit score is Very Good for sharper targeting and lower costs.

We might also include the Good credit rating for a larger target base, depending on what the bank intends to go for.

Spend revenue? Or loan revenue?
Or retail spend/investment revenue?

In [None]:
df = data.groupby(['HasCrCard','Exited']).size().reset_index().pivot(columns='Exited', index='HasCrCard', values=0)
df.plot(kind='bar', stacked=True)

HasCrCard ~ no notable effect on churn

In [None]:
df = data.groupby(['Tenure','Exited']).size().reset_index().pivot(columns='Exited', index='Tenure', values=0)
df.plot(kind='bar', stacked=True, figsize=(10,7))

Tenure ~ churn is consistent across tenures

In [None]:
df = data.groupby(['Geography','Exited']).size().reset_index().pivot(columns='Exited', index='Geography', values=0)
df.plot(kind='bar', stacked=True)

In [None]:
df = data.groupby(['ActMem','Exited']).size().reset_index().pivot(columns='Exited', index='ActMem', values=0)
df.plot(kind='bar', stacked=True, figsize=(10,7))

Customers are more likely to churn when they are inactive, which is logical. This coincides with the XGBoost feature importance.

With more time/resources/data, we could look at which products each customer currently holds, then apply a model to recommend the next best product and move customers from 1 product to 2 products to increase stickiness.
We should also consider acquisition instead of only looking to squeeze more from our existing customers, as the incremental revenue will always be higher for newly acquired customers.
We could also offer differential pricing to customers with better credit ratings to incentivise them to take up a loan or spend more, increasing the bank's net revenue.

So does the bank want to move towards higher-wealth customers or stay within the mass market?
What other factors lead customers to choose the bank: pricing/interest rate, service, ATM locations/accessibility?

### Perform the necessary preprocessing steps so that the data is prepared for model building

In [None]:
df_enc = pd.get_dummies(data['Gender'])

data = pd.concat([data, df_enc], axis=1)
data.head()

In [None]:
df_enc = pd.get_dummies(data['Geography'])

data = pd.concat([data, df_enc], axis=1)
data.head()

In [None]:
features = data.drop(['CustomerId','Exited','Geography','Male','Gender','age_bins','bal_bins','salary_bins','cr_bins'] , axis=1)
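
Dropping `Male` by hand removes one redundant dummy column. An alternative sketch, on a made-up three-row frame, is to let `pd.get_dummies` drop the first level of every encoded column through `drop_first=True`; the prefixed column names it produces differ from the ones used in this notebook.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the notebook.
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male"],
    "Geography": ["France", "Spain", "Germany"],
    "Age": [41, 35, 52],
})

# One call encodes both columns and drops one level per column
# (Female and France here), which avoids perfectly collinear dummies.
encoded = pd.get_dummies(toy, columns=["Gender", "Geography"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

Collinear dummies matter mostly for linear models such as logistic regression; tree ensembles are largely indifferent to them.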

In [None]:
target = data.Exited

In [None]:
features

In [None]:
plt.figure(figsize= (14 , 6))
sns.boxplot(data= features)

### Separate features and target: target is Exited. Do a train test split. 

In [None]:
X_train , X_test , y_train , y_test = train_test_split(features, target, test_size=0.2, random_state = 42)

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)  # reuse the training-set mean/std; refitting on the test set would leak information
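
A scaler should learn its mean and standard deviation from the training rows only and then apply those same statistics to the test rows. The sketch below shows this on synthetic data (the arrays and names such as `X_tr` are placeholders, not the notebook's variables).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic features
y = rng.integers(0, 2, size=200)                      # synthetic labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_tr)   # statistics come from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuse the training mean and std

print(X_tr_scaled.mean(axis=0).round(3))  # zero by construction
print(X_te_scaled.mean(axis=0).round(3))  # near zero, but generally not exactly zero
```

Wrapping the scaler and the model in a `sklearn.pipeline.Pipeline` gives the same guarantee inside cross-validation, since each fold then refits the scaler on its own training part.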

In [None]:
X_test

In [None]:
X_train

In [None]:
from sklearn.preprocessing import StandardScaler

scalar_method = StandardScaler()

scaled_X = scalar_method.fit_transform(features)

scaled_data = pd.DataFrame(scaled_X , columns= features.columns)

scaled_data.head()

In [None]:
plt.figure(figsize= (14 , 7))
sns.boxplot(data= scaled_data)

In [None]:
plt.figure(figsize= (14 , 7))
sns.boxplot(data= X_train)

In [None]:
plt.figure(figsize= (14 , 7))
sns.boxplot(data= X_test)

In [None]:
target.value_counts() 

In [None]:
y_test.value_counts() 

The class split in the test set is approximately the same as in the full data, which is good.
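
If the class proportions need to match by construction rather than by luck, `train_test_split` accepts a `stratify` argument. A minimal sketch on synthetic labels with a 20% positive rate:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 80 + [1] * 20)     # synthetic 80/20 class mix
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature

# stratify=y keeps the 80/20 ratio in both the train and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_tr.mean(), y_te.mean())
```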

### Starting with RFC

In [None]:
model_rfc = RandomForestClassifier(random_state=42)

In [None]:
model_rfc.fit(X_train, y_train)

In [None]:
result_rfc = model_rfc.predict(X_test)

In [None]:
metrics.accuracy_score(y_test, result_rfc)

In [None]:
data.Exited.value_counts(1)
#Target data is fairly skewed: ~80% active vs ~20% churned, so accuracy alone is misleading (always predicting "active" already scores ~80%).
#Focus instead on precision/recall for the churn class and on the area under the ROC curve.
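
To see why accuracy is a weak yardstick at an 80/20 class mix, compare it with recall for a baseline that always predicts the majority class. The labels below are synthetic and only reproduce the approximate class ratio.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

y = np.array([0] * 80 + [1] * 20)  # synthetic labels with a 20% churn rate
X = np.zeros((len(y), 1))          # features are irrelevant to this baseline

# A "model" that always predicts the most frequent class (no churn).
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

acc = accuracy_score(y, pred)  # share of correct labels
rec = recall_score(y, pred)    # share of churners actually caught
print(acc, rec)
```

Any real model should therefore be judged against this baseline rather than against 50%.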

In [None]:
y_test.value_counts(1)
#Checking our y Test to ensure there's distribution of 1 (churned customer)

In [None]:
print(metrics.classification_report(y_test, result_rfc))

In [None]:
# Making the Confusion Matrix
plt.rcParams.update({'font.size': 40})
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, result_rfc)
ax=sns.heatmap(cm, annot= True, cmap="Blues",cbar=False, fmt='g')
plt.xlabel("predicted", va = 'top')
plt.ylabel("true")
# plt.title('confusion matrix') 
ax.xaxis.set_label_position('top')


In [None]:
from sklearn.metrics import roc_auc_score

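# AUC needs ranking scores, so use the predicted churn probability rather than hard 0/1 labels
area_under_curve = roc_auc_score(y_test, model_rfc.predict_proba(X_test)[:, 1])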

print(area_under_curve)
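
ROC AUC measures how well a model ranks churners above non-churners, so it is normally computed from predicted probabilities. Passing hard 0/1 predictions collapses the curve to a single threshold and usually gives a different, lower number. A small sketch with hypothetical probabilities:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1])
proba = np.array([0.10, 0.20, 0.30, 0.45, 0.40, 0.90])  # hypothetical P(churn)
labels = (proba >= 0.5).astype(int)                     # hard 0/1 predictions

auc_from_proba = roc_auc_score(y_true, proba)    # uses the full ranking
auc_from_labels = roc_auc_score(y_true, labels)  # only one operating point
print(auc_from_proba, auc_from_labels)
```

The value shown in a ROC plot legend should come from the same scores that were passed to `roc_curve`.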

In [None]:
# ROC curve plot

from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, model_rfc.predict_proba(X_test)[:,1])  # second argument = positive class predictions
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
plt.rcParams.update({'font.size': 12})
plt.plot(fpr, tpr, label='Random Forest (area = %0.2f)' %area_under_curve)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.show()

RFC already gives quite a good model score! Still, before spending the marketing budget we should also consider logistic regression and GaussianNB, which are well suited to binary classification such as predicting whether a customer will churn.

## Logistic Regression

In [None]:
sns.regplot(x= 'Age', y= 'Exited', data= data, logistic= True).set_title("Age Log Odds Linear Plot")

Age seems like a good fit for logistic regression modelling.

In [None]:
# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
Model_LR = LogisticRegression()
Model_LR.fit(X_train, y_train)

In [None]:
# Predicting the Test set results
LR_pred = Model_LR.predict(X_test)
LR_pred

In [None]:
# Making the Confusion Matrix

plt.rcParams.update({'font.size': 40})
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, LR_pred)
ax=sns.heatmap(cm, annot= True, cmap="Blues", cbar=False, fmt='g')
plt.xlabel("predicted", va = 'top')
plt.ylabel("true")
# plt.title('confusion matrix') 
ax.xaxis.set_label_position('top')

In [None]:
print(classification_report(y_test, LR_pred))

# AUC score
from sklearn.metrics import roc_auc_score
area_under_curve = roc_auc_score(y_test, Model_LR.predict_proba(X_test)[:, 1])  # score with probabilities, not hard labels
print(area_under_curve)

In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, Model_LR.predict_proba(X_test)[:,1])  # second argument = positive class predictions
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
plt.rcParams.update({'font.size': 12})

plt.plot(fpr, tpr, label='Logistic Classifier (area = %0.2f)' % area_under_curve)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.show()

## Gaussian Naive Bayes

In [None]:
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
ModelGNB = GaussianNB()
ModelGNB.fit(X_train, y_train)


In [None]:
# Predicting the Test set results
GNB_pred = ModelGNB.predict(X_test)

In [None]:
# Making the Confusion Matrix

plt.rcParams.update({'font.size': 40})
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, GNB_pred)
ax=sns.heatmap(cm, annot= True, cmap="Blues", cbar=False, fmt='g')
plt.xlabel("predicted", va = 'top')
plt.ylabel("true")
# plt.title('confusion matrix') 
ax.xaxis.set_label_position('top')

In [None]:
print(classification_report(y_test, GNB_pred))
# AUC score
from sklearn.metrics import roc_auc_score
area_under_curve = roc_auc_score(y_test, ModelGNB.predict_proba(X_test)[:, 1])  # score with probabilities, not hard labels
print(area_under_curve)

In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, ModelGNB.predict_proba(X_test)[:,1])  # second argument = positive class predictions
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
plt.rcParams.update({'font.size': 12})

plt.plot(fpr, tpr, label='GaussianNB (area = %0.2f)' % area_under_curve)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.show()


## XGBoostClassifier

In [None]:
import xgboost as xgb
model_xgb = xgb.XGBClassifier(objective = 'binary:logistic', random_state = 42)

model_xgb = model_xgb.fit(X_train, y_train)
result_xgb = model_xgb.predict(X_test)

In [None]:
# Making the Confusion Matrix

plt.rcParams.update({'font.size': 40})
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, result_xgb)
ax=sns.heatmap(cm, annot= True, cmap="Blues", cbar=False, fmt='g')
plt.xlabel("predicted", va = 'top')
plt.ylabel("true")
# plt.title('confusion matrix') 
ax.xaxis.set_label_position('top')

In [None]:
print(metrics.classification_report(y_test, result_xgb))
area_under_curve = roc_auc_score(y_test, model_xgb.predict_proba(X_test)[:, 1])  # score with probabilities, not hard labels
print(area_under_curve)

In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, model_xgb.predict_proba(X_test)[:,1])  # second argument = positive class predictions
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
plt.rcParams.update({'font.size': 12})

plt.plot(fpr, tpr, label='XGBoost (area = %0.2f)' % area_under_curve)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.show()

## GradientBoostingClassifier

In [None]:
model_GB = GradientBoostingClassifier(random_state = 42)
model_GB = model_GB.fit(X_train, y_train)
result_GB= model_GB.predict(X_test)

In [None]:
print(metrics.classification_report(y_test, result_GB))
area_under_curve = roc_auc_score(y_test, model_GB.predict_proba(X_test)[:, 1])  # score with probabilities, not hard labels
print(area_under_curve)

In [None]:
# Making the Confusion Matrix

plt.rcParams.update({'font.size': 40})
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, result_GB)
ax=sns.heatmap(cm, annot= True, cmap="Blues", cbar=False, fmt='g')
plt.xlabel("predicted", va = 'top')
plt.ylabel("true")
# plt.title('confusion matrix') 
ax.xaxis.set_label_position('top')

In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, model_GB.predict_proba(X_test)[:,1])  # second argument = positive class predictions
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
plt.rcParams.update({'font.size': 12})

plt.plot(fpr, tpr, label='Gradient Boosting (area = %0.2f)' % area_under_curve)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.show()

## MODEL Training/Tuning

### RFC Tuning

In [None]:
param_dict_RFC = {"n_estimators" : [64 , 96 , 128] , 
               "max_depth" : [6 , 9, 12],
               "max_features" : [5, 7, 9], 
               "criterion" : ['entropy','gini'], 
               "bootstrap" : [True , False ] }

In [None]:
param_dict_RFC

In [None]:
grid_modelrfc = GridSearchCV(estimator=model_rfc, param_grid= param_dict_RFC , cv=10, n_jobs=-1 , verbose=1) 
# n_jobs means use all processors

In [None]:
grid_modelrfc.fit(X_train , y_train)

grid_modelrfc.best_params_
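
By default `GridSearchCV` ranks candidates with the estimator's own score, which is accuracy for classifiers. On imbalanced data it can be preferable to rank by AUC instead through the `scoring` argument. The sketch below runs on a small synthetic dataset with a deliberately tiny grid; none of the values are recommendations for the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the churn data (about 80/20).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=42
)

search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=25, random_state=42),
    param_grid={"max_depth": [3, 6]},
    scoring="roc_auc",  # rank candidates by AUC instead of the default accuracy
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```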

In [None]:
result_rfc_tuned = grid_modelrfc.predict(X_test)

In [None]:
print(metrics.classification_report(y_test, result_rfc_tuned))
area_under_curve = roc_auc_score(y_test, grid_modelrfc.predict_proba(X_test)[:, 1])  # score with probabilities, not hard labels
print(area_under_curve)

In [None]:
# Making the Confusion Matrix

plt.rcParams.update({'font.size': 40})
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, result_rfc_tuned)
ax=sns.heatmap(cm, annot= True, cmap="Blues", cbar=False, fmt='g')
plt.xlabel("predicted", va = 'top')
plt.ylabel("true")
# plt.title('confusion matrix') 
ax.xaxis.set_label_position('top')

In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, grid_modelrfc.predict_proba(X_test)[:,1])  # second argument = positive class predictions
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
plt.rcParams.update({'font.size': 12})

plt.plot(fpr, tpr, label='RFC Tuned (area = %0.2f)' % area_under_curve)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.show()

### AdaBoost Tuning for RFC


In [None]:
model_adaboost = AdaBoostClassifier(base_estimator= model_rfc)

In [None]:
model_adaboost = model_adaboost.fit(X_train, y_train)
result_ada= model_adaboost.predict(X_test)

In [None]:
print(metrics.classification_report(y_test, result_ada))
area_under_curve = roc_auc_score(y_test, model_adaboost.predict_proba(X_test)[:, 1])  # score with probabilities, not hard labels
print(area_under_curve)

In [None]:
param_dict_ada =  {'n_estimators' : [16,32,64] 
                 , 'learning_rate' : [0.1, 0.5, 1]}

In [None]:
grid_model_ada = GridSearchCV(param_grid= param_dict_ada , 
                              estimator= model_adaboost, n_jobs=-1,
                              cv= 10, verbose=1)

In [None]:
grid_model_ada.fit(X_train, y_train)  # takes some time to train...

In [None]:
grid_model_ada.best_params_

In [None]:
result_ada_Tuned = grid_model_ada.predict(X_test)

In [None]:
print(metrics.classification_report(y_test, result_ada_Tuned))
area_under_curve = roc_auc_score(y_test, grid_model_ada.predict_proba(X_test)[:, 1])  # score with probabilities, not hard labels
print(area_under_curve)

In [None]:
# Making the Confusion Matrix

plt.rcParams.update({'font.size': 40})
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, result_ada_Tuned)
ax=sns.heatmap(cm, annot= True, cmap="Blues", cbar=False, fmt='g')
plt.xlabel("predicted", va = 'top')
plt.ylabel("true")
# plt.title('confusion matrix') 
ax.xaxis.set_label_position('top')

In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, grid_model_ada.predict_proba(X_test)[:,1])  # second argument = positive class predictions
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
plt.rcParams.update({'font.size': 12})

plt.plot(fpr, tpr, label='RFC Adaboost (area = %0.2f)' % area_under_curve)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.show()

In [None]:
len(model_rfc.feature_importances_)

In [None]:
model_rfc.feature_importances_

In [None]:
plt.rcParams.update({'font.size': 12})
feat_importances = pd.Series(model_rfc.feature_importances_, index=features.columns)
feat_importances.nlargest(15).plot(kind='barh')

In [None]:
plt.rcParams.update({'font.size': 12})
feat_importances = pd.Series(model_xgb.feature_importances_, index=features.columns)
feat_importances.nlargest(15).plot(kind='barh')

### XGBoost Tuning

In [None]:
param_grid = {
    'max_depth': [6,9,12],
    'max_delta_step' : [0, 1, 2], 
    'learning_rate': [0.00001, 0.00005, 0.0001],
    'reg_lambda': [250, 300, 350],
    'gamma': [25, 50, 75],
    'scale_pos_weight': [4,5]
}


In [None]:
optimal_params = GridSearchCV(
                    estimator = model_xgb,
#                     objective = 'binary:logistic',
                    param_grid = param_grid,
                    scoring = 'f1', ## f1 see https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
                    verbose = 1, 
                    cv = 10
)

optimal_params.fit(X_train, 
                   y_train, 
                   early_stopping_rounds=10,   
                   eval_set=[(X_test, y_test)],   # early stopping is evaluated on the test set; this leaks test data into tuning, so a separate validation split would be safer
                   verbose=False)

print(optimal_params.best_params_)

In [None]:
result_xgbt = optimal_params.predict(X_test)

In [None]:
print(metrics.classification_report(y_test, result_xgbt))
area_under_curve = roc_auc_score(y_test, optimal_params.predict_proba(X_test)[:, 1])  # score with probabilities, not hard labels
print(area_under_curve)

In [None]:
# Making the Confusion Matrix

plt.rcParams.update({'font.size': 40})
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, result_xgbt)
ax=sns.heatmap(cm, annot= True, cmap="Blues", cbar=False, fmt='g')
plt.xlabel("predicted", va = 'top')
plt.ylabel("true")
# plt.title('confusion matrix') 
ax.xaxis.set_label_position('top')

In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, optimal_params.predict_proba(X_test)[:,1])  # second argument = positive class predictions
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
plt.rcParams.update({'font.size': 12})

plt.plot(fpr, tpr, label='XGBoost Tuned (area = %0.2f)' % area_under_curve)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.show()

### AdaBoost Tuning for XGBoost

In [None]:
model_adaboostxgb = AdaBoostClassifier(base_estimator= model_xgb, 
                               n_estimators= 16 , learning_rate= 0.05)

In [None]:
model_adaboostxgb = model_adaboostxgb.fit(X_train, y_train)
result_adaxgb= model_adaboostxgb.predict(X_test)

In [None]:
print(metrics.classification_report(y_test, result_adaxgb))
area_under_curve = roc_auc_score(y_test, model_adaboostxgb.predict_proba(X_test)[:, 1])  # score with probabilities, not hard labels
print(area_under_curve)

In [None]:
param_dict_adaxgb =  {'n_estimators' : [4,10,16] 
                 , 'learning_rate' : [0.01, 0.05, 0.1]}

In [None]:
grid_model_adaxgb = GridSearchCV(param_grid= param_dict_adaxgb , 
                              estimator= model_adaboostxgb, n_jobs=-1,  # tune the XGBoost-based AdaBoost, not the RFC-based one
                              cv= 10, verbose=1)

In [None]:
grid_model_adaxgb.fit(X_train, y_train)  # takes some time to train...

In [None]:
grid_model_adaxgb.best_params_

In [None]:
result_adaxgb_Tuned = grid_model_adaxgb.predict(X_test)

In [None]:
print(metrics.classification_report(y_test, result_adaxgb_Tuned))
area_under_curve = roc_auc_score(y_test, grid_model_adaxgb.predict_proba(X_test)[:, 1])  # score with probabilities, not hard labels
print(area_under_curve)

In [None]:
# Making the Confusion Matrix

plt.rcParams.update({'font.size': 40})
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, result_adaxgb_Tuned)
ax=sns.heatmap(cm, annot= True, cmap="Blues", cbar=False, fmt='g')
plt.xlabel("predicted", va = 'top')
plt.ylabel("true")
# plt.title('confusion matrix') 
ax.xaxis.set_label_position('top')

In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, grid_model_adaxgb.predict_proba(X_test)[:,1])  # second argument = positive class predictions
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
plt.rcParams.update({'font.size': 12})

plt.plot(fpr, tpr, label='XGB Adaboost (area = %0.2f)' % area_under_curve)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.show()

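The choice ultimately depends on your marketing budget. The figures below take $1 per targeted customer as the unit cost.

Using RF (precision): By spending $251 (189+62) on the customers the model flags, 189/(189+62) = 75% of that spend reaches real churners. (Precision measure: reduce false positives.)

Using RF (recall): There are 393 real churners (189+204), and the model catches 189 of them, so 189/(189+204) = 48% of churners can be salvaged. (Recall measure: reduce false negatives.)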
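
The two ratios quoted above follow directly from the confusion-matrix counts. As a quick check, using only the counts stated in the text (189 true positives, 62 false positives, 204 false negatives):

```python
# Confusion-matrix counts quoted in the text above.
tp, fp, fn = 189, 62, 204

precision = tp / (tp + fp)  # of the customers we target, how many were real churners
recall = tp / (tp + fn)     # of the real churners, how many we targeted
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.1%} recall={recall:.1%} f1={f1:.1%}")
```

Lowering the decision threshold below 0.5 would trade some precision for more recall, which is the lever to pull if the budget allows contacting more customers.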

### Control Overfitting
When you observe high training accuracy but low test accuracy, you have likely encountered an overfitting problem.

There are, in general, two ways to control overfitting in XGBoost:

- Directly control model complexity. This includes max_depth, min_child_weight and gamma.
- Add randomness to make training robust to noise. This includes subsample and colsample_bytree.

You can also reduce the step size eta. Remember to increase num_round when you do so.
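
These knobs can be collected into one parameter dictionary. The values below are illustrative starting points, not results tuned on the churn data; in the scikit-learn wrapper, `learning_rate` corresponds to eta and `n_estimators` to num_round.

```python
# Illustrative starting values only; none of these were tuned on the churn data.
overfit_controls = {
    # 1) limit model complexity
    "max_depth": 4,
    "min_child_weight": 5,
    "gamma": 1.0,
    # 2) add randomness
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    # smaller step size (eta) paired with more boosting rounds (num_round)
    "learning_rate": 0.05,
    "n_estimators": 400,
}

# With xgboost installed, this could be passed as xgb.XGBClassifier(**overfit_controls).
for name, value in overfit_controls.items():
    print(f"{name:>18} = {value}")
```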