# Background

The data comes from direct marketing efforts of a European banking institution. The marketing campaign involves making a phone call to a customer, often multiple times to ensure a product subscription, in this case a term deposit. Term deposits are usually short-term deposits with maturities ranging from one month to a few years. The customer must understand when buying a term deposit that they can withdraw their funds only after the term ends. All customer information that might reveal personal information is removed due to privacy concerns.

#### Goal(s):

Predict if the customer will subscribe (yes/no) to a term deposit (variable y).

#### Success Metric(s):

Hit 81% or above accuracy by evaluating with 5-fold cross validation and reporting the average performance score.

#### Bonus(es):

Determine which customers are more likely to be interested in investment products. What makes the customers buy? Tell us which feature we should be focusing more on.
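The success metric above can be sketched as follows. This is only a minimal illustration on synthetic data, not the model used later in the notebook: the `make_classification` data, the `DecisionTreeClassifier` and its parameters are all assumptions chosen for the demo.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and target (assumption: the
# notebook would pass its own encoded X and y here instead).
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     weights=[0.9, 0.1], random_state=0)

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_results = cross_validate(DecisionTreeClassifier(max_depth=3, random_state=0),
                            X_demo, y_demo, cv=cv, scoring='accuracy')

# The reported score is the mean of the five per-fold accuracies.
mean_accuracy = cv_results['test_score'].mean()
print(f'Mean 5-fold accuracy: {mean_accuracy:.3f}')
```

Stratified folds are used here because the target is imbalanced; a plain `cv=5` would also satisfy the stated metric.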

# Imports

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import chart_studio.plotly as py
from plotly.offline import download_plotlyjs, init_notebook_mode,plot,iplot
init_notebook_mode(connected=True)
import cufflinks as cf
cf.go_offline()
import plotly.express as px
import plotly.graph_objs as go
from statsmodels.graphics.mosaicplot import mosaic
from itertools import product

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate
from sklearn.metrics import confusion_matrix, classification_report, matthews_corrcoef, roc_auc_score, roc_curve, auc, f1_score, recall_score, precision_score, accuracy_score


# HistGradientBoostingClassifier is stable since scikit-learn 1.0, so the
# sklearn.experimental.enable_hist_gradient_boosting import is no longer needed.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, HistGradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score



ModuleNotFoundError: No module named 'xgboost'

# Data

In [None]:
data = pd.read_csv("term-deposit-marketing-2020.csv")

In [None]:
data.head()

In [None]:
data.info()

##### Target Variable

y - has the client subscribed to a term deposit? (binary)

In [None]:
data.describe()

The months are easy to turn into numerical values, so let's go ahead and change the data types. 

In [None]:
data.month.unique()

In [None]:
months_dict = {'jan': 1, 'feb': 2, 'mar' : 3,
              'apr': 4, 'may': 5, 'jun': 6,
              'jul': 7, 'aug': 8, 'sep': 9, 
              'oct': 10, 'nov': 11, 'dec': 12}

In [None]:
data.replace({'month': months_dict}, inplace=True)
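As a hedged alternative to `replace`, the same conversion can be sketched with `Series.map`, which makes it easy to spot labels missing from the dictionary. The `demo` frame below is toy data, not the real dataset.

```python
import pandas as pd

# Toy frame standing in for the real data (assumption for illustration).
demo = pd.DataFrame({'month': ['jan', 'may', 'dec', 'may']})

months_dict = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
               'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}

# Series.map returns NaN for any label missing from the dictionary,
# which makes unmapped or misspelled months easy to detect.
mapped = demo['month'].map(months_dict)
unmapped = demo.loc[mapped.isna(), 'month'].unique()
print('Unmapped labels:', list(unmapped))

# Safe to cast to int once we know nothing was left unmapped.
demo['month'] = mapped.astype(int)
print(demo['month'].tolist())
```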

In [None]:
data.info()

In [None]:
data.describe()

#### Missing Data

In [None]:
data.isnull().sum()

In [None]:
data.head()

In [None]:
data = data.sample(frac=1)
data.head()

# EDA

### Distribution of Data

In [None]:
ax = data.groupby('y').count().plot(kind='bar')
plt.show();

We can see the dataset is clearly imbalanced. Let's have a look at the distribution of the two classes.

In [None]:
print('Percentage of clients not subscribed: ', (len(data[data['y'] == 'no'])/len(data)*100 ))
print('Percentage of clients subscribed: ', (len(data[data['y'] == 'yes'])/len(data)*100 ))
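With an imbalanced target it helps to know how well a model would score by always predicting the majority class, since any accuracy target has to be read against that baseline. Here is a small sketch on a toy target column; the 9:1 split is an assumption for illustration, not the real class ratio.

```python
import pandas as pd

# Toy target column (assumption: not the real class counts).
y_demo = pd.Series(['no'] * 9 + ['yes'] * 1, name='y')

# Share of each class, as percentages.
class_share = y_demo.value_counts(normalize=True) * 100
print(class_share)

# Accuracy of a model that always predicts the most frequent class.
baseline_accuracy = class_share.max()
print(f'Majority-class baseline accuracy: {baseline_accuracy:.1f}%')
```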

In [None]:
sns.pairplot(data, diag_kind = 'kde')
plt.show();

In [None]:
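# Skewness is only defined for numeric columns; pandas 2.0+ raises a TypeError on the text columns otherwise.
data.skew(numeric_only=True)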

## Customer Profiling
Some of the features in the dataset can be used to categorise customers: age, job, marital status, and education.

### Age of customer

Type of Data: Numeric (integers)


In [None]:
fig, ax = plt.subplots(figsize = (12, 6))
sns.kdeplot(data[data['y']=='yes']['age'], color = 'orange', ax = ax, label = 'yes')
sns.kdeplot(data[data['y']=='no']['age'], ax = ax, label = 'no')
plt.axvline(np.mean((data[data['y']=='yes']['age'])), c = 'orange')
plt.axvline(np.mean((data[data['y']=='no']['age'])), c = 'blue')
ax.legend()
plt.show();

In [None]:
fig, axes = plt.subplots(nrows=2, ncols = 1, figsize = (15,5))
pd.crosstab(data['age'], data['y']).plot(kind='bar', stacked=True, rot=0, ax = axes[0], 
                                         title = 'Distribution of Clients by Age')
data.groupby('age')['y'].value_counts(normalize = True).unstack('y').plot.bar(stacked=True, ax = axes[1], rot = 0,
                                                                             title = 'Normalised Distribution of subscriptions by age')
plt.subplots_adjust(wspace=0.6, hspace=0.6)
plt.show();

In [None]:
# len() counts rows (clients); DataFrame.size would count cells (rows x columns).
print('No. of clients aged 62: ', len(data[data['age']==62]))
print('No of 62 year old clients who subscribed: ', len(data.query('age==62 & y == "yes"')))
(data[data['age']==62]['y']=='yes').mean()*100

In [None]:
print('No. of clients aged 65: ', len(data[data['age']==65]))
print('No of 65 year old clients who subscribed: ', len(data.query('age==65 & y == "yes"')))
(data[data['age']==65]['y']=='yes').mean()*100

In [None]:
print('No. of clients aged 73: ', len(data[data['age']==73]))
print('No of 73 year old clients who subscribed: ', len(data.query('age==73 & y == "yes"')))
(data[data['age']==73]['y']=='yes').mean()*100

In [None]:
print('No. of clients aged 85: ', len(data[data['age']==85]))
print('No of 85 year old clients who subscribed: ', len(data.query('age==85 & y == "yes"')))
(data[data['age']==85]['y']=='yes').mean()*100

In [None]:
print('No. of clients aged 90: ', len(data[data['age']==90]))
print('No of 90 year old clients who subscribed: ', len(data.query('age==90 & y == "yes"')))
(data[data['age']==90]['y']=='yes').mean()*100

In [None]:
print('No. of clients aged 95: ', len(data[data['age']==95]))
print('No of 95 year old clients who subscribed: ', len(data.query('age==95 & y == "yes"')))
(data[data['age']==95]['y']=='yes').mean()*100

In [None]:
print('No. of clients aged 80: ', len(data[data['age']==80]))
print('No of 80 year old clients who subscribed: ', len(data.query('age==80 & y == "yes"')))

In [None]:
print('No. of clients aged 81: ', len(data[data['age']==81]))
print('No of 81 year old clients who subscribed: ', len(data.query('age==81 & y == "yes"')))

In [None]:
print('No. of clients aged 82: ', len(data[data['age']==82]))
print('No of 82 year old clients who subscribed: ', len(data.query('age==82 & y == "yes"')))

In [None]:
print('No. of clients aged 86: ', len(data[data['age']==86]))
print('No of 86 year old clients who subscribed: ', len(data.query('age==86 & y == "yes"')))

In [None]:
print('No. of clients aged 94: ', len(data[data['age']==94]))
print('No of 94 year old clients who subscribed: ', len(data.query('age==94 & y == "yes"')))

There are some clear age groups with a higher percentage of clients who sign up for term deposits. It's worth noting that customers aged 90 and 95 have exactly the same number of people, all of whom have subscribed to term deposits. This could possibly be spurious or a mere coincidence. For now, there isn't sufficient information to suggest these groups are outliers, so I'll leave them in the dataset.

It's also interesting to note that the dataset has a higher proportion of clients aged between 26 and 60, but the percentage of clients who sign up for term deposits is higher among older clients.

Age groups worth exploring further are 80-82 inclusive, 86 and 94, as these groups consist exclusively of customers who didn't subscribe to term deposits.
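Rather than checking one age at a time, the counts and subscription rates for every age can be computed in a single table, which also shows how few clients sit behind some of the extreme rates. A minimal sketch on toy data follows; the `demo` frame and the `min_clients` threshold are illustrative assumptions.

```python
import pandas as pd

# Toy data (assumption): one 90-year-old subscriber, several 40-year-olds.
demo = pd.DataFrame({
    'age': [40, 40, 40, 40, 90],
    'y':   ['no', 'no', 'yes', 'no', 'yes'],
})

# For each age: number of clients and share of them who subscribed.
by_age = (demo.assign(subscribed=demo['y'].eq('yes'))
              .groupby('age')['subscribed']
              .agg(clients='count', rate_pct=lambda s: s.mean() * 100))
print(by_age)

# Ages with very few clients give unstable rates; min_clients is an
# arbitrary illustrative threshold.
min_clients = 3
print(by_age[by_age['clients'] < min_clients])
```

A 100% rate backed by a single client says very little, which is worth keeping in mind before treating any age as special.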

In [None]:
print('Age of youngest customer: ', data['age'].min())
print('Age of oldest customer: ',data['age'].max())
print('\n')
print('Age by quartile: \n', data['age'].quantile([0.25, 0.5, 0.75]));

In [None]:
print('Age of youngest customer (Yes): ', data[data['y']=='yes']['age'].min())
print('Age of oldest customer (Yes): ',data[data['y']=='yes']['age'].max())
print('\n')
print('Age by quartile (yes): \n', data[data['y']=='yes']['age'].quantile([0.25, 0.5, 0.75]))
print('Age of youngest customer (No): ', data[data['y']=='no']['age'].min())
print('Age of oldest customer (No): ',data[data['y']=='no']['age'].max())
print('\n')
print('Age by quartile (No): \n', data[data['y']=='no']['age'].quantile([0.25, 0.5, 0.75]))
fig, axes = plt.subplots(3, 1, sharex = True, figsize=(15,6))
fig.suptitle('Boxplots of distribution of Age of clients')
sns.boxplot(ax = axes[0],x='age', data = data, color= 'green')
axes[0].set_title('Everyone')
sns.boxplot(ax = axes[1],x='age', data = data[data['y']=='no'])
axes[1].set_title('No')
sns.boxplot(ax = axes[2], x='age', data = data[data['y']=='yes'], color = 'orange')
axes[2].set_title('Yes')
plt.subplots_adjust(wspace=0.6, hspace=0.6)
plt.show();

The distribution of client ages is roughly comparable between clients who did and didn't subscribe to term deposits. The median is slightly lower for those with term deposit subscriptions than for those without: 37 compared to 39.
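The per-class minimum, quartiles and maximum printed above can also be obtained in one call with `groupby(...).describe()`. A small sketch on toy data (the ages below are made up for illustration):

```python
import pandas as pd

# Toy data (assumption), three clients per class.
demo = pd.DataFrame({
    'age': [25, 35, 45, 30, 40, 60],
    'y':   ['yes', 'yes', 'yes', 'no', 'no', 'no'],
})

# One row per class with min, quartiles and max of age.
age_summary = demo.groupby('y')['age'].describe()
print(age_summary[['min', '25%', '50%', '75%', 'max']])
```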

### Jobs

Type of Data: Categorical

In [None]:
data['job'].unique()

In [None]:
fig, axes = plt.subplots(nrows=1, ncols = 2, figsize = (12,6))
pd.crosstab(data['job'], data['y']).plot(kind='bar', stacked=True, ax=axes[0], rot=45)
data.groupby('job')['y'].value_counts(normalize = True).unstack('y').plot.bar(stacked=True, ax = axes[1], rot = 45,
                                                                             title = 'Normalised Distribution of subscriptions by Jobs')
plt.show();

In [None]:
print("Percentage of Clients by Jobs who subscribed for term deposits: \n",
      (data[data['y']=='yes']['job'].value_counts()/len(data[data['y']=='yes']['job']))*100)
print('\n')
print("Percentage of Clients by Jobs who didn't subscribe for term deposits: \n",
      (data[data['y']=='no']['job'].value_counts()/len(data[data['y']=='no']['job']))*100)
fig, axes = plt.subplots(1, 2, sharex = True, figsize = (15,5))
((data[data['y']=='yes']['job'].value_counts()/len(data[data['y']=='yes']['job']))*100).plot(ax=axes[0],kind='barh', color = 'orange', 
                                                                                             title = 'Clients who signed up' )
((data[data['y']=='no']['job'].value_counts()/len(data[data['y']=='no']['job']))*100).plot(kind='barh', ax= axes[1], 
                                                                                          title = 'Clients who did not sign up')
plt.show();

In [None]:
plt.figure(figsize=(15,10))
sns.boxenplot(x='age', y='job', data = data, hue = 'y')
plt.xticks(rotation = 45)
plt.title('Distribution of Clients in Different Jobs by Age')
plt.show();

When we group clients by job and look at their ages, most groups (each profession, students, the unemployed, and those whose employment is recorded as unknown) have very similar age distributions for both classes. The exception is retired clients, where the orange boxes sit at a much higher age range than the blue boxes. This roughly supports what we saw earlier: the percentage of clients who subscribed to term deposits is higher among older age groups.
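To put numbers next to the visual comparison, the median age per job and class can be tabulated. This is a hedged sketch on toy data; the jobs and ages below are invented for illustration.

```python
import pandas as pd

# Toy data (assumption): retired subscribers are older than retired non-subscribers.
demo = pd.DataFrame({
    'job': ['retired', 'retired', 'retired', 'retired', 'student', 'student'],
    'age': [70, 72, 58, 60, 20, 22],
    'y':   ['yes', 'yes', 'no', 'no', 'yes', 'no'],
})

# Rows are jobs, columns are the two classes, cells are median ages.
median_age = demo.groupby(['job', 'y'])['age'].median().unstack('y')
print(median_age)
```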

### Marital

Type of Data: Categorical

In [None]:
fig, axes = plt.subplots(nrows=1, ncols = 2, figsize = (12,6))
pd.crosstab(data['marital'], data['y']).plot(kind='bar', stacked=True, rot=0, ax=axes[0])
data.groupby('marital')['y'].value_counts(normalize = True).unstack('y').plot.bar(stacked=True, ax = axes[1], rot = 45,
                                                                             title = 'Normalised Distribution of subscriptions by Marital Status')
plt.show();


In [None]:
print("Percentage of Clients by Marital Status who subscribed for term deposits: \n",
      (data[data['y']=='yes']['marital'].value_counts()/len(data[data['y']=='yes']['marital']))*100)
print('\n')
print("Percentage of Clients by Marital Status who didn't subscribe for term deposits: \n",
      (data[data['y']=='no']['marital'].value_counts()/len(data[data['y']=='no']['marital']))*100)
      

fig, axes = plt.subplots(1, 2, sharex = True, figsize = (15,5))
((data[data['y']=='yes']['marital'].value_counts()/len(data[data['y']=='yes']['marital']))*100).plot(ax = axes[0], kind='barh',
                                                                                                     color = 'orange',
                                                                                                     title = 'Clients who signed up')


((data[data['y']=='no']['marital'].value_counts()/len(data[data['y']=='no']['marital']))*100).plot(ax = axes[1], kind='barh', 
                                                                                                   title = 'Clients who did not sign up')
plt.show();

In [None]:
plt.figure(figsize=(8,8))
sns.boxenplot(x='age', y='marital', data = data, hue = 'y')
plt.xticks(rotation = 45)
plt.title('Distribution of Clients by Age for different Marital Statuses')
plt.show();

In [None]:
married = data[data['marital']== 'married']
single = data[data['marital']== 'single']
divorced = data[data['marital']== 'divorced']

In [None]:
fig, axes = plt.subplots(1, 3, sharex = True, sharey = True, figsize = (12,8))

(married.groupby('y')['job'].value_counts()/len(married)*100).plot(kind='barh', ax=axes[0],
                                                                            title = 'Married')

(single.groupby('y')['job'].value_counts()/len(single)*100).plot(kind='barh', ax = axes[1], 
                                                                      title = 'Single')

(divorced.groupby('y')['job'].value_counts()/len(divorced)*100).plot(kind='barh', ax = axes[2], 
                                                                      title = 'Divorced')
plt.show();


### Education

Type of Data: Categorical

In [None]:
fig, axes = plt.subplots(nrows=1, ncols = 2, figsize = (12,6))
pd.crosstab(data['education'], data['y']).plot(ax = axes[0],kind='bar', stacked=True, rot=0)
data.groupby('education')['y'].value_counts(normalize = True).unstack('y').plot.bar(stacked=True, ax = axes[1], rot = 45,
                                                                             title = 'Normalised Distribution of subscriptions by Education')

plt.show();

It looks like those with tertiary education backgrounds seem to have a higher proportion of subscriptions. 

In [None]:
print("Percentage of Clients by Education background who subscribed for term deposits: \n",
      (data[data['y']=='yes']['education'].value_counts()/len(data[data['y']=='yes']['education']))*100)
print('\n')
print("Percentage of Clients by Education background who didn't subscribe for term deposits: \n",
      (data[data['y']=='no']['education'].value_counts()/len(data[data['y']=='no']['education']))*100)
      

fig, axes = plt.subplots(1, 2, sharex = True, figsize = (15,5))
((data[data['y']=='yes']['education'].value_counts()/len(data[data['y']=='yes']['education']))*100).plot(ax = axes[0], kind='barh',
                                                                                                     color = 'orange',
                                                                                                     title = 'Clients who signed up')


((data[data['y']=='no']['education'].value_counts()/len(data[data['y']=='no']['education']))*100).plot(ax = axes[1], kind='barh', 
                                                                                                   title = 'Clients who did not sign up')
plt.show();


We see a very similar distribution between clients who did and didn't sign up for term deposits. However, clients with a tertiary education make up a noticeably larger share of subscribers than of non-subscribers.
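Two different percentages are in play in this section: the subscription rate within each education level, and the education mix within each class. `pd.crosstab` can produce both through its `normalize` argument. A minimal sketch on toy data (the counts are invented for illustration):

```python
import pandas as pd

# Toy data (assumption): 2 tertiary and 4 secondary clients.
demo = pd.DataFrame({
    'education': ['tertiary', 'tertiary', 'secondary', 'secondary',
                  'secondary', 'secondary'],
    'y':         ['yes', 'no', 'yes', 'no', 'no', 'no'],
})

# Rows sum to 100: subscription rate within each education level.
rate_within_group = pd.crosstab(demo['education'], demo['y'],
                                normalize='index') * 100

# Columns sum to 100: education mix among subscribers vs. non-subscribers.
mix_within_class = pd.crosstab(demo['education'], demo['y'],
                               normalize='columns') * 100

print(rate_within_group)
print(mix_within_class)
```

The first table answers "which group is more likely to subscribe?", while the second answers "who are the subscribers?"; the two can point in different directions when group sizes differ.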

In [None]:
plt.figure(figsize=(8,8))
sns.boxenplot(x='age', y='education', data = data, hue = 'y')
plt.xticks(rotation = 45)
plt.title('Distribution of Clients by Age for different Education')
plt.show();

We see roughly the same distribution of ages across the different education groups.

In [None]:
crosstable = pd.crosstab(data['education'], data['marital'])
crosstable = pd.DataFrame(crosstable)
crosstable

In [None]:
column_names = list(crosstable.columns)
data_names = list(data.columns)

In [None]:
# Pass an explicit Axes: without ax=, mosaic() draws on a new default-sized figure.
fig, ax = plt.subplots(figsize = (20,20))
props = {}
props[(column_names[1], str(crosstable.index[1]))] = {'facecolor': 'red', 'edgecolor': 'white'}
props[(column_names[1], str(crosstable.index[0]))] = {'facecolor': 'yellow', 'edgecolor': 'white'}
props[(column_names[1], str(crosstable.index[2]))] = {'facecolor': 'blue', 'edgecolor': 'white'}
props[(column_names[0], str(crosstable.index[2]))] = {'facecolor': 'green', 'edgecolor': 'white'}
props[(column_names[0], str(crosstable.index[1]))] = {'facecolor': 'orange', 'edgecolor': 'white'}
props[(column_names[0], str(crosstable.index[0]))] = {'facecolor': 'brown', 'edgecolor': 'white'}
props[(column_names[2], str(crosstable.index[0]))] = {'facecolor': 'purple', 'edgecolor': 'white'}
props[(column_names[2], str(crosstable.index[1]))] = {'facecolor': 'cyan', 'edgecolor': 'white'}
props[(column_names[2], str(crosstable.index[2]))] = {'facecolor': 'gray', 'edgecolor': 'white'}
mosaic(data, [data_names[3], data_names[2]], ax = ax, properties = props, gap=0.01 )
plt.show();

## Financial Profile
Some of the features can be used to group customers according to known financial information. For example, we can try to group customers based on their balance and their borrowing history, such as whether they have defaulted on loans and whether they have housing or personal loans.

### Default

Type of Data: Binary where 'yes' indicates the customer defaulted on a loan

In [None]:
fig, axes = plt.subplots(nrows=1, ncols = 2, figsize = (12,6))
pd.crosstab(data['default'], data['y']).plot(ax = axes[0],kind='bar', stacked=True, rot=0)
data.groupby('default')['y'].value_counts(normalize = True).unstack('y').plot.bar(stacked=True, ax = axes[1], rot = 45,
                                                                                  title = 'Normalised Distribution of subscriptions depending on Default status')
plt.show();

In [None]:
default_cust = data[data['default']=='yes']
default_cust.describe()

In [None]:
rest_cust = data[data['default']=='no']
rest_cust.describe()

In [None]:
print('Percentage of customers who default on loans and are either on 0 balance or overdrawn: ', len(default_cust[default_cust['balance']<=0])/len(default_cust) *100)
print('Percentage of customers who do not default on loans, but are on 0 balance or overdrawn: ', len(rest_cust[rest_cust['balance']<=0])/len(rest_cust) *100)

In [None]:
len(rest_cust[rest_cust['age']>=60])/len(rest_cust)*100

In [None]:
print('Distribution of customers who have defaulted on loans by education:\n' , default_cust.groupby('y')['education'].value_counts()/len(default_cust)*100)
print('\n')
print('Distribution of customers who have not defaulted on loans by education:\n' , rest_cust.groupby('y')['education'].value_counts()/len(rest_cust)*100)
fig, axes = plt.subplots(1, 2, sharex = True, sharey = True, figsize = (17,5))

(default_cust.groupby('y')['education'].value_counts()/len(default_cust)*100).plot(kind='barh', ax=axes[0], title = 'Clients who defaulted on loans')

(rest_cust.groupby('y')['education'].value_counts()/len(rest_cust)*100).plot(kind='barh', ax=axes[1],
                                                                            title= 'Clients who did not default on loans')
plt.show();

In [None]:
print('Distribution of customers who have defaulted on loans segmented by jobs:\n' , default_cust.groupby('y')['job'].value_counts()/len(default_cust)*100)
print('\n')
print('Distribution of customers who have not defaulted on loans segmented by jobs:\n' , rest_cust.groupby('y')['job'].value_counts()/len(rest_cust)*100)
fig, axes = plt.subplots(1, 2, sharex = True, sharey = True, figsize = (17,5))

(default_cust.groupby('y')['job'].value_counts()/len(default_cust)*100).plot(kind='barh', ax=axes[0],
                                                                            title = 'Clients who did default on loans')

(rest_cust.groupby('y')['job'].value_counts()/len(rest_cust)*100).plot(kind='barh', ax = axes[1], 
                                                                      title = 'Clients who did not default on loans')
plt.show();


In [None]:
plt.figure(figsize=(8,8))
sns.boxenplot(x='age', y='default', data = data, hue = 'y')
plt.xticks(rotation = 45)
plt.title('Distribution of Clients by Age by Default Status')
plt.show();

As we would expect, those who have previously defaulted on loans mostly didn't sign up for term deposits. Clients with secondary-level education are the largest group of those who took out loans, and they also form the largest segment of customers who defaulted on loans and still signed up for term deposits. This doesn't necessarily mean that secondary education is what we should look for when deciding which customers to target; it could just be a consequence of the imbalanced dataset.

### Average Yearly Balance (Euros)

Type of Data: Numeric (integer)


In [None]:
fig, ax = plt.subplots(figsize = (12, 6))
sns.kdeplot(data[data['y']=='yes']['balance'], color = 'orange', ax = ax, label = 'yes')
sns.kdeplot(data[data['y']=='no']['balance'], ax = ax, label = 'no')
plt.axvline(np.mean((data[data['y']=='yes']['balance'])), c = 'orange')
plt.axvline(np.mean((data[data['y']=='no']['balance'])), c = 'blue')
ax.legend()
plt.show();

In [None]:
np.min(data['balance'])

In [None]:
print('Number of customers who have 0 or negative balance: ', len(data[data['balance']<=0]))
print('Percentage of the whole dataset who are overdrawn or on 0 balance: ',len(data[data['balance']<=0])/len(data)*100)

In [None]:
overdrawn_customers = data[data['balance']<=0]
overdrawn_customers.head()

In [None]:
incredit_customers = data[data['balance']>0]
incredit_customers.head()

In [None]:
print('Percentage of Customers who have 0 balance or are overdrawn: ', overdrawn_customers['y'].value_counts()/len(overdrawn_customers)*100)
print('\n')
print('Percentage of Customers who have more than 0 balance: ',incredit_customers['y'].value_counts()/len(incredit_customers)*100)

fig, axes = plt.subplots(1, 2, sharex = True, sharey = True, figsize = (17,5))

(overdrawn_customers['y'].value_counts()/len(overdrawn_customers)*100).plot(ax = axes[0], 
                                                                            title= 'Overdrawn or 0 Balance',
                                                                            kind='barh')


(incredit_customers['y'].value_counts()/len(incredit_customers)*100).plot(ax = axes[1],
                                                                          title = "Clients with balance more than 0",
                                                                          kind='barh')

In [None]:
overdrawn_customers.describe()

In [None]:
incredit_customers.describe()

In [None]:
print('Distribution of overdrawn customers by education:\n' , overdrawn_customers.groupby('y')['education'].value_counts()/len(overdrawn_customers)*100)
print('Distribution of incredit customers by education:\n' , incredit_customers.groupby('y')['education'].value_counts()/len(incredit_customers)*100)
fig, axes = plt.subplots(1, 2, sharex = True, sharey = True, figsize = (17,5))
(overdrawn_customers.groupby('y')['education'].value_counts()/len(overdrawn_customers)*100).plot(ax = axes[0],kind='barh', 
                                                                                 title= 'Overdrawn Customers')
(incredit_customers.groupby('y')['education'].value_counts()/len(incredit_customers)*100).plot(kind='barh', ax=axes[1], 
                                                                              title = 'Incredit Customers')
plt.show();

In [None]:
print('Distribution of overdrawn customers by employment:\n' , overdrawn_customers.groupby('y')['job'].value_counts()/len(overdrawn_customers)*100)
print('\n')
print('Distribution of incredit customers by employment:\n' , incredit_customers.groupby('y')['job'].value_counts()/len(incredit_customers)*100)
fig, axes = plt.subplots(1, 2, sharex = True, sharey = True, figsize = (17,5))
(overdrawn_customers.groupby('y')['job'].value_counts()/len(overdrawn_customers)*100).plot(ax=axes[0],kind='barh', 
                                                                                          title = 'Overdrawn Customers')
(incredit_customers.groupby('y')['job'].value_counts()/len(incredit_customers)*100).plot(kind='barh', stacked=True, ax=axes[1], 
                                                                                        title = 'Incredit Customers')
plt.show();

In [None]:
fig, axes = plt.subplots(2, 1, sharex = True, figsize = (15,5))
fig.suptitle('Boxplots of distribution of Balance maintained by clients')
sns.boxplot(x='balance', data = data[data['y']=='yes'], ax= axes[0])
axes[0].set_title('Yes')
sns.boxplot(x='balance', data = data[data['y']=='no'], ax = axes[1] )
axes[1].set_title('No')
plt.subplots_adjust(wspace=0.6, hspace=0.6)
plt.show();

In [None]:
np.min(data[data['y']=='yes']['balance'])

In [None]:
np.max(data[data['y']=='no']['balance'])

In [None]:
np.max(data[data['y']=='yes']['balance'])

In [None]:
sns.scatterplot(x='balance', y='default', data=data, hue='y')

In [None]:
sns.scatterplot(x='age', y='balance', data = data, hue = 'y')
plt.show();

We see that roughly 5% of customers who had defaulted on loans and had agreed to term deposits maintained an average annual balance of 0 or were overdrawn. This goes up to roughly 8% when we look at customers who haven't defaulted on loans. When we compare the customer profiling metrics between overdrawn and in-credit customers, both groups have roughly the same age distribution. The percentages of customers from the various education backgrounds, and to some extent jobs, are also similar between those who are overdrawn and those who aren't. This suggests that being overdrawn is not a useful feature for differentiating customers. When we look at the balance of customers who signed up for term deposits, grouped by employment and education background, we do see that in every segment customers with higher balances sign up for term deposits. This suggests balance is a possible key feature; however, how to segment customers based on balance is something we need to dig deeper into.
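One possible starting point for segmenting by balance is to cut it into quantile bins and compare the subscription rate per bin. The sketch below uses toy data, and the choice of four bins (`q=4`) is an illustrative assumption rather than a recommendation.

```python
import pandas as pd

# Toy data (assumption): eight clients with increasing balances.
demo = pd.DataFrame({
    'balance': [-200, 0, 50, 100, 400, 800, 1500, 5000],
    'y':       ['no', 'no', 'no', 'yes', 'no', 'no', 'yes', 'yes'],
})

# Four equal-sized bins by balance quantile.
demo['balance_bin'] = pd.qcut(demo['balance'], q=4)

# Share of subscribers (in %) within each balance bin.
rate_by_bin = (demo['y'].eq('yes')
                   .groupby(demo['balance_bin'], observed=True)
                   .mean() * 100)
print(rate_by_bin)
```

Quantile bins keep the group sizes equal, which avoids comparing a rate from thousands of clients against one from a handful in the long right tail of balances.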

### Housing Loans

Type of Data: Binary where 'yes' indicates the customer has a housing loan

In [None]:
fig, axes = plt.subplots(nrows=1, ncols = 2, figsize = (12,6))
pd.crosstab(data['housing'], data['y']).plot(kind='bar', stacked=True, rot=0, ax=axes[0])
data.groupby('housing')['y'].value_counts(normalize = True).unstack('y').plot.bar(stacked=True, ax = axes[1], rot = 45,
                                                                             title = 'Normalised Distribution of subscriptions based on Housing Loans')
plt.show();

In [None]:
has_housing = data[data['housing']=='yes']
no_housing = data[data['housing']=='no']

In [None]:
print('Distribution of customers who have housing loans segmented by education:\n' , has_housing.groupby('y')['education'].value_counts()/len(has_housing)*100)
print('Distribution of customers without housing loans segmented by education:\n' , no_housing.groupby('y')['education'].value_counts()/len(no_housing)*100)
fig, axes = plt.subplots(1, 2, sharex = True, sharey = True, figsize = (17,5))
(has_housing.groupby('y')['education'].value_counts()/len(has_housing)*100).plot(ax = axes[0],kind='barh', 
                                                                                 title= 'Has Housing Loans')
(no_housing.groupby('y')['education'].value_counts()/len(no_housing)*100).plot(kind='barh', ax=axes[1], 
                                                                              title = 'Without Housing Loans')
plt.show();

In [None]:
print('Distribution of customers who have housing loans segmented by jobs:\n' , has_housing.groupby('y')['job'].value_counts()/len(has_housing)*100)
print('Distribution of customers without housing loans segmented by jobs:\n' , no_housing.groupby('y')['job'].value_counts()/len(no_housing)*100)
fig, axes = plt.subplots(1, 2, sharex = True, sharey = True, figsize = (17,5))
(has_housing.groupby('y')['job'].value_counts()/len(has_housing)*100).plot(ax = axes[0],kind='barh', 
                                                                                 title= 'Has Housing Loans')
(no_housing.groupby('y')['job'].value_counts()/len(no_housing)*100).plot(kind='barh', ax=axes[1], 
                                                                              title = 'Without Housing Loans')
plt.show();

In [None]:
has_housing

In [None]:
plt.figure(figsize=(8,8))
sns.boxenplot(x='age', y='housing', data = data, hue = 'y')
plt.xticks(rotation = 45)
plt.title('Distribution of Clients by Age by Housing Loans')
plt.show();

### Personal Loan

Type of Data: Binary data where 'yes' indicates the customer has personal loans 

In [None]:
fig, axes = plt.subplots(nrows=1, ncols = 2, figsize = (12,6))
pd.crosstab(data['loan'], data['y']).plot(ax = axes[0],kind='bar', stacked=True, rot=0)
data.groupby('loan')['y'].value_counts(normalize = True).unstack('y').plot.bar(stacked=True, ax = axes[1], rot = 45,
                                                                                  title = 'Normalised Distribution of subscriptions depending on Personal Loan Status')
plt.show();

In [None]:
plt.figure(figsize=(8,8))
sns.boxenplot(x='age', y='loan', data = data, hue = 'y')
plt.xticks(rotation = 45)
plt.title('Distribution of Clients by Age by Personal Loans')
plt.show();

In [None]:
has_personal_loan = data[data['loan']=='yes']
has_no_loan = data[data['loan']=='no']

In [None]:
print('Distribution of customers with personal loans:\n' , has_personal_loan.groupby('y')['education'].value_counts()/len(has_personal_loan)*100)
print('\n')
print('Distribution of customers without personal loans: \n' , has_no_loan.groupby('y')['education'].value_counts()/len(has_no_loan)*100)
fig, axes = plt.subplots(1, 2, sharex = True, sharey = True, figsize = (17,5))

(has_personal_loan.groupby('y')['education'].value_counts()/len(has_personal_loan)*100).plot(kind='barh', ax=axes[0], title = 'Clients with personal loans')

(has_no_loan.groupby('y')['education'].value_counts()/len(has_no_loan)*100).plot(kind='barh', ax=axes[1],
                                                                            title= 'Clients without personal loans')
plt.show();

In [None]:
print('Distribution of customers with personal loans:\n' , has_personal_loan.groupby('y')['job'].value_counts()/len(has_personal_loan)*100)
print('\n')
print('Distribution of customers without personal loans: \n' , has_no_loan.groupby('y')['job'].value_counts()/len(has_no_loan)*100)
fig, axes = plt.subplots(1, 2, sharex = True, sharey = True, figsize = (17,5))

(has_personal_loan.groupby('y')['job'].value_counts()/len(has_personal_loan)*100).plot(kind='barh', ax=axes[0], title = 'Clients with personal loans')

(has_no_loan.groupby('y')['job'].value_counts()/len(has_no_loan)*100).plot(kind='barh', ax=axes[1],
                                                                            title= 'Clients without personal loans')
plt.show();

In [None]:
fig, axes = plt.subplots(1, 2, sharey = True, figsize = (15,5))
sns.countplot(x='default', data = has_personal_loan, ax=axes[0])
axes[0].set_title('Has Loans')
sns.countplot(x='default', data = has_no_loan, ax = axes[1])
axes[1].set_title('Has No Personal Loans')
plt.show();


## Marketing Strategies
We can finally explore features such as communication type, the last day and month of contact, duration of last contact, and number of times contacted during the last campaign to see if there are strategies that have yielded a higher number of subscriptions for term deposits. 

### Communication Type

Type of Data: categorical

In [None]:
fig, axes = plt.subplots(nrows=1, ncols = 2, figsize = (12,6))
pd.crosstab(data['contact'], data['y']).plot(kind='bar', stacked=True,rot=0, ax= axes[0])
data.groupby('contact')['y'].value_counts(normalize = True).unstack('y').plot.bar(rot = 45, stacked=True, ax = axes[1], 
                                                                                 title = 'Normalised Distribution of clients based on Communication Type')
plt.show();

In [None]:
cellular = data[data['contact']=='cellular']
tel_contact = data[data['contact']=='telephone']
misc_cont = data[data['contact']=='unknown']

In [None]:
print("Count of Clients by communication style who subscribed for term deposits: \n",
      (data[data['y']=='yes']['contact'].value_counts()/len(data[data['y']=='yes']['contact']))*100)
print('\n')
print("Count of Clients by communication style who didn't subscribe for term deposits: \n",
      (data[data['y']=='no']['contact'].value_counts()/len(data[data['y']=='no']['contact']))*100)
fig, axes = plt.subplots(1, 2, sharex = True, figsize = (15,5))
((data[data['y']=='yes']['contact'].value_counts()/len(data[data['y']=='yes']['contact']))*100).plot(ax=axes[0],kind='barh', color = 'orange', 
                                                                                             title = 'Clients who signed up' )
((data[data['y']=='no']['contact'].value_counts()/len(data[data['y']=='no']['contact']))*100).plot(kind='barh', ax= axes[1], 
                                                                                          title = 'Clients who did not sign up')
plt.show();

In [None]:
plt.figure(figsize=(10,8))
sns.boxenplot(x='age', y='contact', data = data, hue = 'y')
plt.xticks(rotation = 45)
plt.title('Distribution of Clients with different communication styles by Age')
plt.show();

In [None]:
fig, axes = plt.subplots(1, 3, sharex = True, sharey = True, figsize = (15,8))

(cellular.groupby('y')['job'].value_counts()/len(cellular)*100).plot(kind='barh', ax=axes[0],
                                                                            title = 'Cellular')

(tel_contact.groupby('y')['job'].value_counts()/len(tel_contact)*100).plot(kind='barh', ax = axes[1], 
                                                                      title = 'Telephone')

(misc_cont.groupby('y')['job'].value_counts()/len(misc_cont)*100).plot(kind='barh', ax = axes[2], 
                                                                      title = 'Unknown')
plt.show();


The distributions are similar; however, cellular contact has a higher frequency amongst the clients who signed up for subscriptions.

### Last Day of Contact of the Month

Type of Data: numeric

In [None]:
sns.displot(data['day'], kind='kde')
plt.axvline(np.mean(data['day']), c='red')
plt.show();

In [None]:
print("Days of the month by frequency:\n", data['day'].value_counts()/len(data)*100)

The distribution looks multimodal.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols = 2, figsize = (18,6))

pd.crosstab(data['day'], data['y']).plot(kind='bar', stacked=True, ax=axes[0], rot=0)
data.groupby('day')['y'].value_counts(normalize = True).unstack('y').plot.bar(stacked=True, ax = axes[1], rot = 45,
                                                                             title = 'Normalised Distribution of subscriptions by Last Day of Contact')
plt.show();

In [None]:
fig, axes = plt.subplots(nrows=1, ncols = 2, figsize = (8, 10))
(data[data['y']=='yes']['day'].value_counts()/len(data[data['y']=='yes'])*100).plot(ax = axes[0], kind='barh', color = 'orange',
                                                                                   title = 'Yes')
(data[data['y']=='no']['day'].value_counts()/len(data[data['y']=='no'])*100).plot(ax = axes[1], kind='barh', title = 'No')
plt.show();

### Last Contact Month of the Year

Type of Data: Categorical

In [None]:
sns.displot(data['month'], kind='kde')
plt.axvline(np.mean(data['month']), c='red')
plt.show();

In [None]:
fig, axes = plt.subplots(nrows=1, ncols = 2, figsize = (12,6))

pd.crosstab(data['month'], data['y']).plot(kind='bar', stacked=True, ax=axes[0], rot=0)
data.groupby('month')['y'].value_counts(normalize = True).unstack('y').plot.bar(stacked=True, ax = axes[1], rot = 45,
                                                                             title = 'Normalised Distribution of subscriptions by Last Month of Contact')
plt.show();

In [None]:
plt.figure(figsize=(15,8))
sns.boxenplot(y='age', x='month', data = data, hue = 'y')
plt.xticks(rotation = 45)
plt.title('Distribution of Clients by Age')
plt.show();

In [None]:
plt.figure(figsize = (10,8))
(data.groupby('y')['month'].value_counts()/len(data)*100).plot(kind='barh', title = 'Clients Segregated by Month')
plt.show();

We see that the highest numbers of customers who subscribed and who did not subscribe to term deposits both occurred in May. This is most likely because more customers were contacted in May than in any other month. This further supports the idea that we should look at March and October as the months with higher subscription rates.
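
The difference between raw counts and subscription rates can be checked directly. The sketch below uses a small made-up frame (the values are illustrative and not taken from the dataset) with the same `month` and `y` column names. It compares the number of subscribers per month with the share of subscribers within each month.

```python
import pandas as pd

# Illustrative toy data only: 'may' has many contacts, 'mar' has few.
toy = pd.DataFrame({
    'month': ['may'] * 8 + ['mar'] * 2,
    'y':     ['yes', 'yes', 'no', 'no', 'no', 'no', 'no', 'no', 'yes', 'no'],
})

# Raw number of subscribers per month.
yes_counts = toy[toy['y'] == 'yes']['month'].value_counts()

# Subscription rate within each month: share of 'yes' among that month's contacts.
yes_rate = toy['y'].eq('yes').groupby(toy['month']).mean()

print(yes_counts)
print(yes_rate)
```

In this toy example May has more subscribers in absolute terms, yet March has the higher subscription rate, which is the pattern the normalised plots are meant to expose.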

### Duration of the Last Contact (Seconds)

Type of Data: Numeric

In [None]:
sns.displot(data['duration'], kind='kde')
plt.axvline(np.mean(data['duration']), c='red')
plt.show();

In [None]:
fig, axes = plt.subplots(nrows=2, ncols = 1, figsize = (25,5))
pd.crosstab(data['duration'], data['y']).plot(kind='bar', stacked=True, rot=90, ax = axes[0], 
                                         title = 'Distribution of Clients by duration of Last Contact')
data.groupby('duration')['y'].value_counts(normalize = True).unstack('y').plot.bar(stacked=True, ax = axes[1], rot = 90,
                                                                             title = 'Normalised Distribution of duration of Last Contact')
plt.subplots_adjust(wspace=0.6, hspace=0.6)
plt.show();

In [None]:
fig, axes = plt.subplots(2, 1, sharex = True, sharey = True, figsize = (15,8))
sns.boxplot(x='duration', data = data[data['y']=='no'], ax = axes[0])
axes[0].set_title('No')
sns.boxplot(x='duration', data = data[data['y']=='yes'], ax = axes[1] )
axes[1].set_title('Yes')
plt.show();

In [None]:
np.min(data[data['y']=='no']['duration'])

In [None]:
zero_duration = data[data['duration']==0]

In [None]:
zero_duration['y'].nunique()

In [None]:
np.max(data[data['y']=='no']['duration'])

In [None]:
np.min(data[data['y']=='yes']['duration'])

In [None]:
np.max(data[data['y']=='yes']['duration'])

In [None]:
sns.scatterplot(x='balance', y='duration', data = data, hue = 'y')

### Number of Contacts for each client for this campaign

(note: this includes the last contact)

Type of Data: Numeric

In [None]:
fig, axes = plt.subplots(nrows=2, ncols = 1, figsize = (20,7))
pd.crosstab(data['campaign'], data['y']).plot(kind='bar', stacked=True, rot=0, ax = axes[0], 
                                         title = 'Distribution of Clients by Number of Contacts')
data.groupby('campaign')['y'].value_counts(normalize = True).unstack('y').plot.bar(stacked=True, ax = axes[1], rot = 0,
                                                                             title = 'Normalised Distribution by Number of Contacts')
plt.subplots_adjust(wspace=0.6, hspace=0.6)
plt.show();

In [None]:
fig, axes = plt.subplots(2, 1, sharex = True, sharey = True, figsize = (15,8))
sns.boxplot(x='campaign', data = data[data['y']=='no'], ax = axes[0])
axes[0].set_title('No')
sns.boxplot(x='campaign', data = data[data['y']=='yes'], ax = axes[1] )
axes[1].set_title('Yes')
plt.show();


In [None]:
plt.figure(figsize=(15,8))
sns.boxenplot(y='age', x='campaign', data = data, hue = 'y')
plt.xticks(rotation = 45)
plt.title('Distribution of Clients by Number of Contacts ')
plt.show();

In [None]:
data[data['campaign']==17]['y'].nunique()

In [None]:
sns.scatterplot(x='duration', y='campaign', data = data, hue = 'y')
plt.title('Duration of Last Contact vs. The Number of times contacted for this campaign')
plt.show()

### Correlation

In [None]:
plt.figure(figsize=(10,10))
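# Correlation is only defined for numeric columns; the categorical ones are still strings here
sns.heatmap(data.select_dtypes('number').corr(), annot=True)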
plt.show();

In [None]:
# Keep an independent copy; a plain assignment would be changed by the encoding below
data_og = data.copy()

In [None]:
data_og

# Feature Engineering 

### Categorical Data

Categorical Features: job, marital, education, default, housing, loan, contact, y

In [None]:
label_encoding = LabelEncoder()
data['job'] = label_encoding.fit_transform(data['job'])
data['marital'] = label_encoding.fit_transform(data['marital'])
data['education'] = label_encoding.fit_transform(data['education'])
data['default'] = label_encoding.fit_transform(data['default'])
data['housing'] = label_encoding.fit_transform(data['housing'])
data['loan'] = label_encoding.fit_transform(data['loan'])
data['contact'] = label_encoding.fit_transform(data['contact'])
data['y'] = label_encoding.fit_transform(data['y'])

In [None]:
data.head()

### Numerical Data 

SCALING: robust vs standard scaler

* robust uses IQR so would be less influenced by outliers
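
To make the point about outliers concrete, here is a minimal sketch on made-up duration values (an assumption for illustration, not the real column). It scales nine typical values plus one extreme value with both scalers.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Illustrative values: nine typical call durations plus one extreme outlier.
durations = np.array(
    [100, 120, 130, 150, 160, 180, 200, 220, 240, 5000], dtype=float
).reshape(-1, 1)

robust = RobustScaler().fit_transform(durations)
standard = StandardScaler().fit_transform(durations)

# RobustScaler centres on the median and divides by the IQR,
# so the typical values stay spread out even with the outlier present.
print('Robust:  ', np.round(robust.ravel(), 2))
# StandardScaler uses the mean and standard deviation, which the outlier inflates,
# so the typical values are squeezed into a narrow band.
print('Standard:', np.round(standard.ravel(), 2))
```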


Numerical Features: age, day, month, duration

In [None]:
from sklearn.preprocessing import RobustScaler
data_robust = data

In [None]:
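rs = RobustScaler()
# Scale the features only; the target 'y' must stay as 0/1 class labels
feature_cols = data_robust.columns.drop('y')
rs_data = pd.DataFrame(rs.fit_transform(data_robust[feature_cols]), columns=feature_cols, index=data_robust.index)
rs_data['y'] = data_robust['y']
rs_data.head()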

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(rs_data.corr(), annot=True)
plt.show();

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
data_standard = data

In [None]:
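ss = StandardScaler()
# Scale the features only; standardising 'y' would turn the class labels into non-integer floats
feature_cols = data_standard.columns.drop('y')
ss_data = pd.DataFrame(ss.fit_transform(data_standard[feature_cols]), columns=feature_cols, index=data_standard.index)
ss_data['y'] = data_standard['y']
ss_data.head()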

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(ss_data.corr(), annot=True)
plt.show();

# Train Test Split

In [None]:
X = data.drop('y', axis = 1)
y = data['y']

X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, test_size = 0.2, random_state = 10)

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
print('Percentage of dataset in Class 0: ', data[data['y'] == 0].shape[0]/len(data)*100)
print('Percentage of dataset in Class 1: ', data[data['y'] == 1].shape[0]/len(data)*100)

In [None]:
print('Percentage of Class 0 in training set:', np.count_nonzero(y_train == 0)/len(y_train)*100)
print('Percentage of Class 1 in training set:', np.count_nonzero(y_train == 1)/len(y_train)*100)

In [None]:
print('Percentage of Class 0 in validation set:', np.count_nonzero(y_test == 0)/len(y_test)*100)
print('Percentage of Class 1 in validation set:', np.count_nonzero(y_test == 1)/len(y_test)*100)

### Scaled Data

In [None]:
X_rs = rs_data.drop('y', axis = 1)
y_rs = rs_data['y']

X_rs_train, X_rs_test, y_rs_train, y_rs_test = train_test_split(X_rs.values, y_rs.values, test_size = 0.2, random_state = 10)

In [None]:
X_ss = ss_data.drop('y', axis = 1)
y_ss = ss_data['y']

X_ss_train, X_ss_test, y_ss_train, y_ss_test = train_test_split(X_ss.values, y_ss.values, test_size = 0.2, random_state = 10)

# Classification Models 

1st attempt without scaling or stratified sampling
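
For comparison, a stratified split can be requested through the `stratify` argument of `train_test_split`. The sketch below uses a synthetic imbalanced target (an assumption for illustration) and shows that the class ratio is preserved in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative imbalanced target: 90% class 0, 10% class 1.
rng = np.random.default_rng(10)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([0] * 900 + [1] * 100)

# stratify=y_demo keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10, stratify=y_demo
)

print('Class 1 share in train:', y_tr.mean())
print('Class 1 share in test: ', y_te.mean())
```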

## K Nearest Neighbors Classifier

In [None]:
base_knn = KNeighborsClassifier()
base_knn.fit(X_train, y_train)
y_pred = base_knn.predict(X_test)
# The model is already fitted above, so score it directly instead of refitting
training_score = base_knn.score(X_train, y_train)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print('Training Set Score: ', training_score)
print('Macro F1 Score: ', f1_score(y_test, y_pred, average = 'macro'))
print('Weighted F1 Score: ', f1_score(y_test, y_pred, average = 'weighted'))
print('Micro F1 Score: ', f1_score(y_test, y_pred, average = 'micro'))

In [None]:
# fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
y_pred_prob = base_knn.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
auc = roc_auc_score(y_test, y_pred_prob)
plt.figure(figsize=(10,6))
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.savefig('roc_best_random.png', bbox_inches = 'tight')
plt.show()
print('Area Under Curve: ',auc)

In [None]:
print(classification_report(y_test, y_pred))

Using Scaled Dataset

In [None]:
rs_scaled_knn = KNeighborsClassifier()
rs_scaled_knn.fit(X_rs_train, y_rs_train)
# Predict on the scaled test split so it matches the data the model was trained on
y_pred = rs_scaled_knn.predict(X_rs_test)
training_score = rs_scaled_knn.score(X_rs_train, y_rs_train)
print('Accuracy Score: ', accuracy_score(y_rs_test, y_pred))
print('Training Set Score: ', training_score)
print('Macro F1 Score: ', f1_score(y_rs_test, y_pred, average = 'macro'))
print('Weighted F1 Score: ', f1_score(y_rs_test, y_pred, average = 'weighted'))
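
One caveat with scaling the whole dataset before splitting is that the test rows influence the scaling statistics. A possible alternative, sketched here on synthetic data from `make_classification` (a stand-in, not the bank data), is to put the scaler and the classifier in a scikit-learn pipeline so the scaler is fitted on the training rows only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the bank data (assumption: any numeric feature matrix works here).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, weights=[0.9, 0.1], random_state=10
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only and reuses
# those statistics to transform the test rows at prediction time.
scaled_knn = make_pipeline(RobustScaler(), KNeighborsClassifier())
scaled_knn.fit(X_tr, y_tr)

print('Training accuracy:', scaled_knn.score(X_tr, y_tr))
print('Test accuracy:    ', scaled_knn.score(X_te, y_te))
```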

references: https://iopscience.iop.org/article/10.1088/1757-899X/719/1/012072/pdf

## Random Forest

In [None]:
base_rfc = RandomForestClassifier(random_state=10)
base_rfc.fit(X_train, y_train)
y_pred = base_rfc.predict(X_test)
# The model is already fitted above, so score it directly instead of refitting
training_score = base_rfc.score(X_train, y_train)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print('Training Set Score: ', training_score)
print('Macro F1 Score: ', f1_score(y_test, y_pred, average = 'macro'))
print('Weighted F1 Score: ', f1_score(y_test, y_pred, average = 'weighted'))

In [None]:
# fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
y_pred_prob = base_rfc.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
auc = roc_auc_score(y_test, y_pred_prob)
plt.figure(figsize=(10,6))
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.savefig('roc_best_random.png', bbox_inches = 'tight')
plt.show()
print('Area Under Curve: ',auc)

In [None]:
sorted_importances = base_rfc.feature_importances_.argsort()
_ = plt.figure(figsize=(15,10))
_ = plt.barh(X.columns[sorted_importances], base_rfc.feature_importances_[sorted_importances])
_ = plt.title('Feature Importance')
_ = plt.savefig('feature_imp_grid.png', bbox_inches='tight')
plt.show()

## AdaBoosting 

In [None]:
base_estimator = DecisionTreeClassifier(random_state=10)
base_adaboost = AdaBoostClassifier(base_estimator = base_estimator)
base_adaboost.fit(X_train, y_train)
y_pred = base_adaboost.predict(X_test)
# Score the fitted model directly; refitting could produce a different ensemble than the one used for y_pred
training_score = base_adaboost.score(X_train, y_train)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print('Training Set Score: ', training_score)
print('Macro F1 Score: ', f1_score(y_test, y_pred, average = 'macro'))
print('Weighted F1 Score: ', f1_score(y_test, y_pred, average = 'weighted'))
# fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
y_pred_prob = base_adaboost.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
auc = roc_auc_score(y_test, y_pred_prob)
plt.figure(figsize=(10,6))
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.savefig('roc_best_random.png', bbox_inches = 'tight')
plt.show()
print('Area Under Curve: ',auc)
sorted_importances = base_adaboost.feature_importances_.argsort()
_ = plt.figure(figsize=(15,10))
_ = plt.barh(X.columns[sorted_importances], base_adaboost.feature_importances_[sorted_importances])
_ = plt.title('Feature Importance')
_ = plt.savefig('feature_imp_grid.png', bbox_inches='tight')
plt.show()

## CatBoost

In [None]:
base_cb = CatBoostClassifier(random_state=10)
base_cb.fit(X_train, y_train)
y_pred = base_cb.predict(X_test)
# The model is already fitted above, so score it directly instead of training it a second time
training_score = base_cb.score(X_train, y_train)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print('Training Set Score: ', training_score)
print('Macro F1 Score: ', f1_score(y_test, y_pred, average = 'macro'))
print('Weighted F1 Score: ', f1_score(y_test, y_pred, average = 'weighted'))

In [None]:
# fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
y_pred_prob = base_cb.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
auc = roc_auc_score(y_test, y_pred_prob)
plt.figure(figsize=(10,6))
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.savefig('roc_best_random.png', bbox_inches = 'tight')
plt.show()
print('Area Under Curve: ',auc)

In [None]:
sorted_importances = base_cb.feature_importances_.argsort()
_ = plt.figure(figsize=(15,10))
_ = plt.barh(X.columns[sorted_importances], base_cb.feature_importances_[sorted_importances])
_ = plt.title('Feature Importance')
_ = plt.savefig('feature_imp_grid.png', bbox_inches='tight')
plt.show()

## HistGradBoost

In [None]:
base_histgradboost = HistGradientBoostingClassifier(random_state=10)
base_histgradboost.fit(X_train, y_train)
y_pred = base_histgradboost.predict(X_test)
# The model is already fitted above, so score it directly instead of refitting
training_score = base_histgradboost.score(X_train, y_train)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print('Training Set Score: ', training_score)
print('Macro F1 Score: ', f1_score(y_test, y_pred, average = 'macro'))
print('Weighted F1 Score: ', f1_score(y_test, y_pred, average = 'weighted'))

In [None]:
y_pred_prob = base_histgradboost.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
auc = roc_auc_score(y_test, y_pred_prob)
plt.figure(figsize=(10,6))
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.savefig('roc_best_random.png', bbox_inches = 'tight')
plt.show()
print('Area Under Curve: ',auc)

## LightGBM 

In [None]:
base_lgbm = LGBMClassifier(random_state=10)
base_lgbm.fit(X_train, y_train)
y_pred = base_lgbm.predict(X_test)
# The model is already fitted above, so score it directly instead of refitting
training_score = base_lgbm.score(X_train, y_train)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print('Training Set Score: ', training_score)
print('Macro F1 Score: ', f1_score(y_test, y_pred, average = 'macro'))
print('Weighted F1 Score: ', f1_score(y_test, y_pred, average = 'weighted'))
y_pred_prob = base_lgbm.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
auc = roc_auc_score(y_test, y_pred_prob)
plt.figure(figsize=(10,6))
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.savefig('roc_best_random.png', bbox_inches = 'tight')
plt.show()
print('Area Under Curve: ',auc)
sorted_importances = base_lgbm.feature_importances_.argsort()
_ = plt.figure(figsize=(15,10))
_ = plt.barh(X.columns[sorted_importances], base_lgbm.feature_importances_[sorted_importances])
_ = plt.title('Feature Importance')
_ = plt.savefig('feature_imp_grid.png', bbox_inches='tight')
plt.show()

References: 
https://machinelearningmastery.com/gradient-boosting-with-scikit-learn-xgboost-lightgbm-and-catboost/
https://towardsdatascience.com/boosting-showdown-scikit-learn-vs-xgboost-vs-lightgbm-vs-catboost-in-sentiment-classification-f7c7f46fd956
https://www.projectpro.io/recipes/use-catboost-classifier-and-regressor-in-python
https://www.kaggle.com/code/prashant111/catboost-classifier-in-python


# Sampling Methods

RandomOverSampler VS SMOTE
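
As a rough illustration of how the two approaches differ, here is a NumPy-only sketch for the binary case. The helpers `random_oversample` and `smote_like` are hypothetical names written for this note and are simplified stand-ins, not the imbalanced-learn implementations: the first duplicates existing minority rows, the second creates new rows by interpolating between a minority row and one of its nearest minority neighbours.

```python
import numpy as np

def random_oversample(X, y, rng):
    """Duplicate randomly chosen minority rows until the two classes are balanced."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    picks = rng.choice(np.flatnonzero(y == minority), size=n_extra, replace=True)
    return np.vstack([X, X[picks]]), np.concatenate([y, y[picks]])

def smote_like(X, y, rng, k=3):
    """Create synthetic minority rows by interpolating between a minority row
    and one of its k nearest minority neighbours (the core idea behind SMOTE)."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    X_min = X[y == minority]
    n_extra = counts.max() - counts.min()
    # Pairwise distances between minority rows; a row is never its own neighbour.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_extra)
    partner = neighbours[base, rng.integers(0, k, size=n_extra)]
    gap = rng.random((n_extra, 1))  # position along the segment between the two rows
    synthetic = X_min[base] + gap * (X_min[partner] - X_min[base])
    return np.vstack([X, synthetic]), np.concatenate([y, np.full(n_extra, minority)])

# Illustrative imbalanced data: 50 rows of class 0 and 10 rows of class 1.
rng = np.random.default_rng(10)
X_demo = rng.normal(size=(60, 2))
y_demo = np.array([0] * 50 + [1] * 10)

X_ros, y_ros = random_oversample(X_demo, y_demo, rng)
X_smt, y_smt = smote_like(X_demo, y_demo, rng)
print('Class counts after random oversampling:', np.bincount(y_ros))
print('Class counts after SMOTE-like sampling:', np.bincount(y_smt))
```

Both outputs are balanced, but random oversampling only repeats the original minority rows, whereas the interpolation step adds minority points that were not in the data.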

## SMOTE

In [None]:
from imblearn.over_sampling import SMOTE
from collections import Counter

In [None]:
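# Fix the seed so the synthetic samples, and the scores that follow, are reproducible
smt = SMOTE(random_state=10)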

In [None]:
X_train_sm, y_train_sm = smt.fit_resample(X_train, y_train)

In [None]:
counter = Counter(y_train)
print(counter)

In [None]:
counter = Counter(y_train_sm)
print(counter)

### KNN

In [None]:
sm_knn = KNeighborsClassifier()
sm_knn.fit(X_train_sm, y_train_sm)
y_pred = sm_knn.predict(X_test)
# The model is already fitted above, so score it directly instead of refitting
training_score = sm_knn.score(X_train_sm, y_train_sm)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print('Training Set Score: ', training_score)
print('Macro F1 Score: ', f1_score(y_test, y_pred, average = 'macro'))
print('Weighted F1 Score: ', f1_score(y_test, y_pred, average = 'weighted'))
# fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
y_pred_prob = sm_knn.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
auc = roc_auc_score(y_test, y_pred_prob)
plt.figure(figsize=(10,6))
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.savefig('roc_best_random.png', bbox_inches = 'tight')
plt.show()
print('Area Under Curve: ',auc)

### Random Forest Classifier 

In [None]:
sm_rfc = RandomForestClassifier(random_state=10)
sm_rfc.fit(X_train_sm, y_train_sm)
# Use the SMOTE-trained model (sm_rfc), not the baseline model (base_rfc)
y_pred = sm_rfc.predict(X_test)
training_score = sm_rfc.score(X_train_sm, y_train_sm)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print('Training Set Score: ', training_score)
print('Macro F1 Score: ', f1_score(y_test, y_pred, average = 'macro'))
print('Weighted F1 Score: ', f1_score(y_test, y_pred, average = 'weighted'))
# fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
y_pred_prob = sm_rfc.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
auc = roc_auc_score(y_test, y_pred_prob)
plt.figure(figsize=(10,6))
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.savefig('roc_best_random.png', bbox_inches = 'tight')
plt.show()
print('Area Under Curve: ',auc)
sorted_importances = sm_rfc.feature_importances_.argsort()
_ = plt.figure(figsize=(15,10))
_ = plt.barh(X.columns[sorted_importances], sm_rfc.feature_importances_[sorted_importances])
_ = plt.axvline(x=0.10)
_ = plt.title('Feature Importance')
_ = plt.savefig('feature_imp_grid.png', bbox_inches='tight')
plt.show()

### ADABoost

In [None]:
base_estimator = DecisionTreeClassifier(random_state=10)
sm_adaboost = AdaBoostClassifier(base_estimator = base_estimator)
sm_adaboost.fit(X_train_sm, y_train_sm)
y_pred = sm_adaboost.predict(X_test)
# Score the SMOTE-trained model (sm_adaboost), not the baseline model (base_adaboost)
training_score = sm_adaboost.score(X_train_sm, y_train_sm)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print('Training Set Score: ', training_score)
print('Macro F1 Score: ', f1_score(y_test, y_pred, average = 'macro'))
print('Weighted F1 Score: ', f1_score(y_test, y_pred, average = 'weighted'))
# fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
y_pred_prob = sm_adaboost.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
auc = roc_auc_score(y_test, y_pred_prob)
plt.figure(figsize=(10,6))
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.savefig('roc_best_random.png', bbox_inches = 'tight')
plt.show()
print('Area Under Curve: ',auc)
sorted_importances = sm_adaboost.feature_importances_.argsort()
_ = plt.figure(figsize=(15,10))
_ = plt.barh(X.columns[sorted_importances], sm_adaboost.feature_importances_[sorted_importances])
_ = plt.title('Feature Importance')
_ = plt.axvline(x=0.10)
_ = plt.savefig('feature_imp_grid.png', bbox_inches='tight')
plt.show()

### CatBoost

In [None]:
sm_cb = CatBoostClassifier(random_state=10)
sm_cb.fit(X_train_sm, y_train_sm)
y_pred = sm_cb.predict(X_test)
# The model is already fitted above, so score it directly instead of training it a second time
training_score = sm_cb.score(X_train_sm, y_train_sm)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print('Training Set Score: ', training_score)
print('Macro F1 Score: ', f1_score(y_test, y_pred, average = 'macro'))
print('Weighted F1 Score: ', f1_score(y_test, y_pred, average = 'weighted'))
# fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
y_pred_prob = sm_cb.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
auc = roc_auc_score(y_test, y_pred_prob)
plt.figure(figsize=(10,6))
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.savefig('roc_best_random.png', bbox_inches = 'tight')
plt.show()
print('Area Under Curve: ',auc)
sorted_importances = sm_cb.feature_importances_.argsort()
_ = plt.figure(figsize=(15,10))
_ = plt.barh(X.columns[sorted_importances], sm_cb.feature_importances_[sorted_importances])
_ = plt.axvline(x=0.10)
_ = plt.title('Feature Importance')
_ = plt.savefig('feature_imp_grid.png', bbox_inches='tight')
plt.show()

### LightGBM

In [None]:
sm_lgbm = LGBMClassifier(random_state=10)
sm_lgbm.fit(X_train_sm, y_train_sm)
y_pred = sm_lgbm.predict(X_test)
# The model is already fitted above, so score it directly instead of refitting
training_score = sm_lgbm.score(X_train_sm, y_train_sm)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print('Training Set Score: ', training_score)
print('Macro F1 Score: ', f1_score(y_test, y_pred, average = 'macro'))
print('Weighted F1 Score: ', f1_score(y_test, y_pred, average = 'weighted'))
y_pred_prob = sm_lgbm.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
auc = roc_auc_score(y_test, y_pred_prob)
plt.figure(figsize=(10,6))
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.savefig('roc_best_random.png', bbox_inches = 'tight')
plt.show()
print('Area Under Curve: ',auc)
sorted_importances = sm_lgbm.feature_importances_.argsort()
_ = plt.figure(figsize=(15,10))
_ = plt.barh(X.columns[sorted_importances], sm_lgbm.feature_importances_[sorted_importances])
_ = plt.title('Feature Importance')
_ = plt.savefig('feature_imp_grid.png', bbox_inches='tight')
plt.show()

### Hyperparameter Tuning of Best Model
The Random Forest Classifier is the best performing model as it has the highest area under the curve and the highest weighted F1 score. 
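
Since the success metric is defined as the average score over 5-fold cross validation, the comparison can also be made with `cross_validate` rather than a single hold-out split. A minimal sketch, assuming a synthetic imbalanced dataset in place of the encoded bank data and an arbitrary `n_estimators=50`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for the encoded bank data (an assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=10
)

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=10)
cv_results = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=10),
    X_demo, y_demo, cv=cv,
    scoring=['accuracy', 'f1_weighted', 'roc_auc'],
)

# Average each metric over the five folds.
for metric in ['accuracy', 'f1_weighted', 'roc_auc']:
    print(metric, cv_results['test_' + metric].mean())
```

Reporting accuracy alongside weighted F1 and AUC is useful here because, with an imbalanced target, accuracy alone can look high for a model that rarely predicts the minority class.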


In [None]:
param_grid = {'n_estimators':np.arange(1,50), 
              'max_depth': np.arange(1,10),
             'criterion': ['gini', 'entropy']}
grid_rfc = GridSearchCV(RandomForestClassifier(random_state = 10), param_grid, cv=5, scoring = 'accuracy')
grid_rfc.fit(X_train, y_train)
print('Mean Cross Validation Score:' ,grid_rfc.best_score_)
print('Parameters with Highest Cross Validation Score: ',grid_rfc.best_params_)
print("Random Forest Classifier Model's Best Accuracy: ", grid_rfc.score(X_test,y_test))

In [None]:
sm_rfc = RandomForestClassifier(random_state=10, criterion = 'gini', max_depth = 9, n_estimators = 34)
sm_rfc.fit(X_train_sm, y_train_sm)
# Use the tuned SMOTE-trained model (sm_rfc), not the baseline model (base_rfc)
y_pred = sm_rfc.predict(X_test)
training_score = sm_rfc.score(X_train_sm, y_train_sm)
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print('Training Set Score: ', training_score)
print('Macro F1 Score: ', f1_score(y_test, y_pred, average = 'macro'))
print('Weighted F1 Score: ', f1_score(y_test, y_pred, average = 'weighted'))
# fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
y_pred_prob = sm_rfc.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
auc = roc_auc_score(y_test, y_pred_prob)
plt.figure(figsize=(10,6))
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.savefig('roc_best_random.png', bbox_inches = 'tight')
plt.show()
print('Area Under Curve: ',auc)
sorted_importances = sm_rfc.feature_importances_.argsort()
_ = plt.figure(figsize=(15,10))
_ = plt.barh(X.columns[sorted_importances], sm_rfc.feature_importances_[sorted_importances])
_ = plt.axvline(x=0.10)
_ = plt.title('Feature Importance')
_ = plt.savefig('feature_imp_grid.png', bbox_inches='tight')
plt.show()

In [None]:
cf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(cf_matrix/np.sum(cf_matrix), annot = True)

# Customer Segmentation

## KMeans Clustering

In [None]:
data.head()
features = data.drop('y', axis = 1)
target = data['y']

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.2, random_state = 10)

In [None]:
scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)

In [None]:
range_n_clusters = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
silhouette_avg = []
inertia = []
for num_clusters in range_n_clusters:
    kmeans = KMeans(n_clusters=num_clusters, random_state = 10, init = 'k-means++')
    kmeans.fit(scaled_X_train)
    cluster_labels = kmeans.labels_
    silhouette_avg.append(silhouette_score(scaled_X_train, cluster_labels))
    inertia.append(kmeans.inertia_)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10,10))
sns.lineplot(x=range_n_clusters,y=silhouette_avg, ax = axes[0], marker = 'o')
axes[0].set_title('Silhouette analysis For Optimal k')
sns.lineplot(x= range_n_clusters, y = inertia, ax= axes[1], marker = 'o')
axes[1].set_title('Elbow Method using Inertia for Optimal k')
plt.show();

In [None]:
kmeans2 = KMeans(n_clusters = 2, init = 'k-means++', random_state = 42)
kmeans2.fit(scaled_X_train)
scaled_X_train_2 = pd.DataFrame(scaled_X_train)
scaled_X_train_2.columns = X_train.columns
scaled_X_train_2['Segment KMeans'] = kmeans2.labels_
scaled_X_train_2['y_true'] = y_train.values

In [None]:
scaled_X_train_2['Segment KMeans'].value_counts()/len(scaled_X_train)*100

In [None]:
scaled_X_train_2['y_true'].value_counts()/len(scaled_X_train)*100

In [None]:
accuracy_score(scaled_X_train_2['y_true'], scaled_X_train_2['Segment KMeans'])

In [None]:
fig, ax = plt.subplots()
sns.countplot(x = 'y_true', data = scaled_X_train_2, ax = ax)

In [None]:
scaled_X_train_2.head(2)

In [None]:
kmeans2_results = pd.DataFrame(y_test)
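# kmeans2 was fitted on scaled data, so apply the same fitted scaler to the test set first
kmeans2_results['y_pred'] = kmeans2.predict(scaler.transform(X_test))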

In [None]:
sns.countplot(x = kmeans2_results['y'])
plt.title('True Y (testing set)')
plt.show();

In [None]:
sns.countplot(x = kmeans2_results['y_pred'])
plt.title('KMeans with 2 clusters predictions(testing set)')
plt.show();

### 4 Clusters

In [None]:
kmeans4 = KMeans(n_clusters = 4, init = 'k-means++', random_state = 42)
kmeans4.fit(scaled_X_train)
scaled_X_train4 = pd.DataFrame(scaled_X_train)
scaled_X_train4.columns = X_train.columns
scaled_X_train4['Segment KMeans'] = kmeans4.labels_
scaled_X_train4['y_true'] = y_train.values

In [None]:
scaled_X_train4['Segment KMeans'].value_counts()/len(scaled_X_train)*100

In [None]:
scaled_X_train4['y_true'].value_counts()/len(scaled_X_train)*100

In [None]:
accuracy_score(scaled_X_train4['y_true'], scaled_X_train4['Segment KMeans'])

In [None]:
scaled_X_train4.head()
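
A note on reading the `accuracy_score` values above: KMeans cluster ids are arbitrary, so cluster 0 need not correspond to class 0, and with 4 clusters there is no one-to-one mapping to the two classes at all. The sketch below uses made-up labels (an assumption for illustration) to show the effect, along with two alternatives that do not depend on the ids: the adjusted Rand index and a crosstab.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Illustrative labels: the clustering recovers the two groups perfectly,
# but happens to call them 1 and 0 instead of 0 and 1.
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])
clusters_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Accuracy compares label values, so a perfect but relabelled clustering scores 0.
acc = accuracy_score(y_true_demo, clusters_demo)
# The adjusted Rand index only looks at which rows are grouped together.
ari = adjusted_rand_score(y_true_demo, clusters_demo)
print('Accuracy:', acc)
print('Adjusted Rand index:', ari)

# A crosstab shows how each cluster is composed, whatever the cluster ids are.
composition = pd.crosstab(clusters_demo, y_true_demo, rownames=['cluster'], colnames=['y_true'])
print(composition)
```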

## PCA and Kmeans Clustering Analysis

In [None]:
data.head()

In [None]:
features = data.drop('y', axis = 1)
target = data['y']

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.2, random_state = 10)

In [None]:
X_train

In [None]:
scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
pca = PCA()
pca.fit(scaled_X_train)
plt.figure(figsize=(10,8))
# Use the fitted number of components rather than a hard-coded 13
plt.plot(range(1, pca.n_components_ + 1), pca.explained_variance_ratio_.cumsum(), marker = 'x')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.show();

Not abundantly clear how many clusters we should go with. I know from the feature importances extracted that we are looking for around 3 or 4 clusters. 

In [None]:
features_pca = range(pca.n_components_)
plt.bar(features_pca, pca.explained_variance_)
plt.xticks(features_pca)
plt.ylabel('Variance')
plt.xlabel('Number of Components')
plt.axhline(0.8)
plt.show()

Will keep at least 80% variance. 
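
The number of components needed for a given share of variance can be read off the cumulative `explained_variance_ratio_` directly. A minimal sketch, assuming a synthetic 13-feature matrix in place of the bank data (so the resulting number is not meaningful for the real dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 13 features (an assumption for illustration).
X_demo, _ = make_classification(
    n_samples=500, n_features=13, n_informative=6, random_state=10
)
X_scaled = StandardScaler().fit_transform(X_demo)

pca_demo = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the 80% target.
n_keep = int(np.argmax(cumulative >= 0.80)) + 1
print('Components needed for 80% variance:', n_keep)
```

Note that the threshold is applied to the cumulative ratio, which lies between 0 and 1, rather than to the raw `explained_variance_` values.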

### PCA with 2 components

In [None]:
pca2 = PCA(n_components = 2)
pca2.fit(scaled_X_train)
transformed2 = pca2.transform(scaled_X_train)
print(scaled_X_train.shape)
print(transformed2.shape)
trans_df2 = pd.DataFrame(transformed2)
plt.figure(figsize=(10,6))
# Pass the raw values so seaborn does not align on the shuffled y_train index
sns.scatterplot(x= trans_df2[0], y= trans_df2[1], hue = y_train.values)
plt.show()

### PCA with 3 components

In [None]:
pca3 = PCA(n_components = 3)
pca3.fit(scaled_X_train)
transformed3 = pca3.transform(scaled_X_train)
print(scaled_X_train.shape)
print(transformed3.shape)
trans_df3 = pd.DataFrame(transformed3)
fig, axes = plt.subplots(nrows=2,ncols=2, figsize = (12, 8))
# Pass the raw values so seaborn does not align on the shuffled y_train index
sns.scatterplot(x= trans_df3[0], y= trans_df3[1], hue = y_train.values, ax = axes[0,0])
sns.scatterplot(x= trans_df3[0], y= trans_df3[2], hue = y_train.values, ax = axes[0,1])
sns.scatterplot(x= trans_df3[1], y= trans_df3[2], hue = y_train.values, ax = axes[1,0])
axes[-1,-1].axis('off')
plt.show();

### PCA with 4 components

In [None]:
pca4 = PCA(n_components = 4)
pca4.fit(scaled_X_train)
transformed4 = pca4.transform(scaled_X_train)
print(scaled_X_train.shape)
print(transformed4.shape)
trans_df4 = pd.DataFrame(transformed4)
fig, axes = plt.subplots(nrows=2,ncols=3, figsize = (12, 8))
# Pass the raw values so seaborn does not align on the shuffled y_train index
sns.scatterplot(x= trans_df4[0], y= trans_df4[1], hue = y_train.values, ax = axes[0,0])
sns.scatterplot(x= trans_df4[0], y= trans_df4[2], hue = y_train.values, ax = axes[0,1])
sns.scatterplot(x= trans_df4[0], y= trans_df4[3], hue = y_train.values, ax = axes[0,2])
sns.scatterplot(x= trans_df4[1], y= trans_df4[2], hue = y_train.values, ax = axes[1,0])
sns.scatterplot(x= trans_df4[1], y= trans_df4[3], hue = y_train.values, ax = axes[1,1])
sns.scatterplot(x= trans_df4[2], y= trans_df4[3], hue = y_train.values, ax = axes[1,2])
plt.show();

### PCA with 8 components

In [None]:
pca8 = PCA(n_components = 8)
pca8.fit(scaled_X_train)
transformed8 = pca8.transform(scaled_X_train)
print(scaled_X_train.shape)
print(transformed8.shape)
trans_df8 = pd.DataFrame(transformed8)
fig, axes = plt.subplots(nrows=2,ncols=3, figsize = (12, 8))
# Only components 1 to 4 are plotted here, as in the 4-component cell
# Labels passed as an array so they are matched by position, not by the original row index
hue_labels = np.asarray(y_train)
sns.scatterplot(x= trans_df8[0], y= trans_df8[1], hue = hue_labels, ax = axes[0,0])
sns.scatterplot(x= trans_df8[0], y= trans_df8[2], hue = hue_labels, ax = axes[0,1])
sns.scatterplot(x= trans_df8[0], y= trans_df8[3], hue = hue_labels, ax = axes[0,2])
sns.scatterplot(x= trans_df8[1], y= trans_df8[2], hue = hue_labels, ax = axes[1,0])
sns.scatterplot(x= trans_df8[1], y= trans_df8[3], hue = hue_labels, ax = axes[1,1])
sns.scatterplot(x= trans_df8[2], y= trans_df8[3], hue = hue_labels, ax = axes[1,2])
plt.show()

We don't see any further improvements from 4 to 8 components, so I'll go ahead with 4 components.
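
One caveat on the comparison above is that, with a full SVD solver, the leading principal components do not depend on how many components are requested. Scatter plots of components 1 to 4 therefore look the same for a 4-component and an 8-component fit. The following is a minimal sketch of that check on synthetic data: `X_demo`, `scores_4` and `scores_8` are hypothetical names, and `svd_solver='full'` is set explicitly as an assumption so the result does not depend on the solver scikit-learn picks by default.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training features (13 columns is an assumption).
X_demo, _ = make_classification(n_samples=300, n_features=13, n_informative=6, random_state=10)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Project onto 4 and onto 8 components using the exact (full) SVD solver.
scores_4 = PCA(n_components=4, svd_solver='full').fit_transform(X_demo_scaled)
scores_8 = PCA(n_components=8, svd_solver='full').fit_transform(X_demo_scaled)

# The first four columns of the 8-component projection match the 4-component projection.
same_leading_scores = np.allclose(scores_4, scores_8[:, :4])
print(same_leading_scores)
```

To judge whether components 5 to 8 add anything, one option is to plot those later components against each other, or to compare the cumulative explained variance ratio at 4 and at 8 components.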

In [None]:
scores_pca = transformed4

In [None]:
range_n_clusters = list(range(2, 14))  # candidate values of k: 2 to 13
silhouette_avg = []
inertia = []
for num_clusters in range_n_clusters:
    kmeans = KMeans(n_clusters=num_clusters, random_state = 10, init = 'k-means++')
    kmeans.fit(scores_pca)
    cluster_labels = kmeans.labels_
    silhouette_avg.append(silhouette_score(scores_pca, cluster_labels))
    inertia.append(kmeans.inertia_)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10,10))
sns.lineplot(x=range_n_clusters,y=silhouette_avg, ax = axes[0], marker = 'o')
axes[0].set_title('Silhouette Analysis for Optimal k')
sns.lineplot(x= range_n_clusters, y = inertia, ax= axes[1], marker= 'o')
axes[1].set_title('Elbow Method Using Inertia for Optimal k')
plt.show();

In [None]:
kmeans_pca_4 = KMeans(n_clusters = 4, init = 'k-means++', random_state = 42)
kmeans_pca_4.fit(scores_pca)

In [None]:
scaled_X_train = pd.DataFrame(scaled_X_train)
scaled_X_train.columns = X_train.columns

In [None]:
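# Name the PCA score columns when building the frame instead of mutating columns.values in place
component_names = ['Component 1', 'Component 2', 'Component 3', 'Component 4']
kmeans_pca4_df = pd.concat([scaled_X_train.reset_index(drop=True), pd.DataFrame(scores_pca, columns = component_names)], axis = 1)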
kmeans_pca4_df['Segment K-means PCA'] = kmeans_pca_4.labels_
kmeans_pca4_df['Segment'] = kmeans_pca4_df['Segment K-means PCA'].map({0:'first', 1:'second', 2: 'third', 3: 'fourth'})
kmeans_pca4_df['Segment K-means PCA'].value_counts()

In [None]:
kmeans_pca4_df.head(2)

In [None]:
fig, axes = plt.subplots(nrows= 2, ncols = 3, figsize = (15,8))
sns.scatterplot(ax = axes[0,0] ,x ='Component 1', y ='Component 2', data = kmeans_pca4_df, hue = kmeans_pca4_df['Segment'], palette = ['g','r','c', 'm'])
sns.scatterplot(ax = axes[0,1] ,x ='Component 1', y ='Component 3', data = kmeans_pca4_df, hue = kmeans_pca4_df['Segment'], palette = ['g','r','c', 'm'])
sns.scatterplot(ax = axes[0,2] ,x ='Component 1', y ='Component 4', data = kmeans_pca4_df, hue = kmeans_pca4_df['Segment'], palette = ['g','r','c', 'm'])
sns.scatterplot(ax = axes[1,0] ,x ='Component 2', y ='Component 3', data = kmeans_pca4_df, hue = kmeans_pca4_df['Segment'], palette = ['g','r','c', 'm'])
sns.scatterplot(ax = axes[1,1] ,x ='Component 2', y ='Component 4', data = kmeans_pca4_df, hue = kmeans_pca4_df['Segment'], palette = ['g','r','c', 'm'])
sns.scatterplot(ax = axes[1,2] ,x ='Component 3', y ='Component 4', data = kmeans_pca4_df, hue = kmeans_pca4_df['Segment'], palette = ['g','r','c', 'm'])
plt.show();

reference: https://365datascience.com/tutorials/python-tutorials/pca-k-means/

In [None]:
features_kmeans = X_train.assign(segment = kmeans_pca_4.labels_).groupby(['segment']).mean().round()
features_kmeans

In [None]:
sns.heatmap(features_kmeans.T, cmap = 'YlGnBu')
plt.show();

Checking with 3 clusters

In [None]:
kmeans_pca_3 = KMeans(n_clusters = 3, init = 'k-means++', random_state = 42)
kmeans_pca_3.fit(scores_pca)
scaled_X_train = pd.DataFrame(scaled_X_train)
scaled_X_train.columns = X_train.columns

In [None]:
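# scores_pca holds four components; name all four so that 'Component 1' really is the first component
component_names = ['Component 1', 'Component 2', 'Component 3', 'Component 4']
kmeans_pca3_df = pd.concat([scaled_X_train.reset_index(drop=True), pd.DataFrame(scores_pca, columns = component_names)], axis = 1)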
kmeans_pca3_df['Segment K-means PCA'] = kmeans_pca_3.labels_
kmeans_pca3_df['Segment'] = kmeans_pca3_df['Segment K-means PCA'].map({0:'first', 1:'second', 2: 'third'})
kmeans_pca3_df['Segment K-means PCA'].value_counts()

In [None]:
fig, axes = plt.subplots(nrows= 2, ncols = 2, figsize = (15,8))
sns.scatterplot(ax = axes[0,0] ,x ='Component 1', y ='Component 2', data = kmeans_pca3_df, hue = kmeans_pca3_df['Segment'], palette = ['g','r','c'])
sns.scatterplot(ax = axes[0,1] ,x ='Component 1', y ='Component 3', data = kmeans_pca3_df, hue = kmeans_pca3_df['Segment'], palette = ['g','r','c'])
sns.scatterplot(ax = axes[1,0] ,x ='Component 2', y ='Component 3', data = kmeans_pca3_df, hue = kmeans_pca3_df['Segment'], palette = ['g','r','c'])
axes[-1,-1].axis('off')
plt.show();

K-means with 5 clusters, as it had the highest silhouette score
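
Picking the k with the highest silhouette score can be done programmatically instead of reading it from the plot. The following is a minimal sketch on synthetic blobs, not the notebook's PCA scores: `X_blobs`, `candidate_ks`, `scores` and `best_k` are hypothetical names, and `n_init=10` is an assumption made to keep the behaviour the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2D data standing in for the PCA scores.
X_blobs, _ = make_blobs(n_samples=400, centers=5, cluster_std=0.6, random_state=10)

candidate_ks = list(range(2, 11))
scores = []
for k in candidate_ks:
    # Fit K-means for this k and record the mean silhouette coefficient.
    labels = KMeans(n_clusters=k, n_init=10, random_state=10).fit_predict(X_blobs)
    scores.append(silhouette_score(X_blobs, labels))

# The k with the highest mean silhouette coefficient.
best_k = candidate_ks[int(np.argmax(scores))]
print(best_k, max(scores))
```

With the notebook's own variables, the equivalent lookup would be `range_n_clusters[int(np.argmax(silhouette_avg))]`.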

In [None]:
kmeans5 = KMeans(n_clusters = 5, init = 'k-means++', random_state = 42)
kmeans5.fit(scores_pca)

In [None]:
scaled_X_train = pd.DataFrame(scaled_X_train)
scaled_X_train.columns = X_train.columns

In [None]:
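# Name the PCA score columns when building the frame instead of mutating columns.values in place
component_names = ['Component 1', 'Component 2', 'Component 3', 'Component 4']
kmeans_pca5_df = pd.concat([scaled_X_train.reset_index(drop=True), pd.DataFrame(scores_pca, columns = component_names)], axis = 1)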
kmeans_pca5_df['Segment K-means PCA'] = kmeans5.labels_
kmeans_pca5_df['Segment'] = kmeans_pca5_df['Segment K-means PCA'].map({0:'first', 1:'second', 2: 'third', 3: 'fourth', 4: 'fifth'})
kmeans_pca5_df['Segment K-means PCA'].value_counts()

In [None]:
kmeans_pca5_df

In [None]:
fig, axes = plt.subplots(nrows= 2, ncols = 3, figsize = (15,8))
sns.scatterplot(ax = axes[0,0] ,x ='Component 1', y ='Component 2', data = kmeans_pca5_df, hue = kmeans_pca5_df['Segment'], palette = ['g','r','c', 'm', 'b'])
sns.scatterplot(ax = axes[0,1] ,x ='Component 1', y ='Component 3', data = kmeans_pca5_df, hue = kmeans_pca5_df['Segment'], palette = ['g','r','c', 'm', 'b'])
sns.scatterplot(ax = axes[0,2] ,x ='Component 1', y ='Component 4', data = kmeans_pca5_df, hue = kmeans_pca5_df['Segment'], palette = ['g','r','c', 'm', 'b'])
sns.scatterplot(ax = axes[1,0] ,x ='Component 2', y ='Component 3', data = kmeans_pca5_df, hue = kmeans_pca5_df['Segment'], palette = ['g','r','c', 'm', 'b'])
sns.scatterplot(ax = axes[1,1] ,x ='Component 2', y ='Component 4', data = kmeans_pca5_df, hue = kmeans_pca5_df['Segment'], palette = ['g','r','c', 'm', 'b'])
sns.scatterplot(ax = axes[1,2] ,x ='Component 3', y ='Component 4', data = kmeans_pca5_df, hue = kmeans_pca5_df['Segment'], palette = ['g','r','c', 'm', 'b'])
plt.show();

# Conclusion
The best performing model here is a Random Forest Classifier, which yielded an accuracy score of 92% after SMOTE was applied to account for the imbalance in the dataset. It is important to note that whilst the success metric outlined in the aims of the project is accuracy, the weighted F1 score was prioritised due to the imbalance in the dataset. For this model, the weighted F1 score is 92% as well.
The Random Forest Classifier, alongside other classification algorithms, highlighted the month of last contact, the duration of the last contact, the means of last contact, and the balance as important features for the classification. This was in line with the results of KMeans clustering with PCA, where 3 to 4 clusters look visible. To gain a better idea of the problem, it is worth looking into how many clients signed up for subscriptions and then tried to withdraw their deposits before maturity. This, alongside the duration of the contacts, can help clarify the relationship between contact duration and subscription. Otherwise, it does look like the longer a customer is spoken with, the higher the chance of them subscribing to a term deposit. October and March have the highest subscription rates, so it is worth focusing more attention on these two months through telephone/cellphone contact.
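
The point about accuracy versus weighted F1 on imbalanced data can be checked by reporting several metrics from the same 5-fold cross validation. The following is a minimal sketch on a synthetic imbalanced dataset, not the bank data and without SMOTE: `X_imb`, `y_imb`, `cv_results` and `mean_scores` are hypothetical names, and the 90/10 class split is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced binary problem (about 90% negative, 10% positive).
X_imb, y_imb = make_classification(n_samples=1000, n_features=13, weights=[0.9, 0.1], random_state=10)

# Stratified folds keep the class ratio roughly constant in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=10)
metric_names = ['accuracy', 'f1_weighted', 'recall']
cv_results = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=10),
    X_imb, y_imb, cv=cv, scoring=metric_names,
)

# Average each metric over the five folds.
mean_scores = {name: cv_results[f'test_{name}'].mean() for name in metric_names}
print(mean_scores)
```

Reporting recall on the positive class next to accuracy and weighted F1 shows how well the minority class (subscribers) is actually recovered, which the two headline numbers can hide.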

