# Problem statement

We have data from a Portuguese bank with customer details related to the sale of a term deposit.
The objective of the project is to help the marketing team identify potential customers who are relatively more likely to subscribe to the term deposit, and thus increase the hit ratio.

# Data dictionary

**Bank client data**
* 1 - age 
* 2 - job : type of job 
* 3 - marital : marital status
* 4 - education 
* 5 - default: has credit in default? 
* 6 - housing: has housing loan? 
* 7 - loan: has personal loan?
* 8 - balance: balance in account

**Related to previous contact**
* 9 - contact: contact communication type
* 10 - month: last contact month of year
* 11 - day: last contact day of the month
* 12 - duration: last contact duration, in seconds

**Other attributes**
* 13 - campaign: number of contacts performed during this campaign and for this client
* 14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (-1 means the client was not previously contacted)
* 15 - previous: number of contacts performed before this campaign and for this client
* 16 - poutcome: outcome of the previous marketing campaign

**Output variable (desired target): has the client subscribed to a term deposit?**



In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Render plots inline in the Jupyter notebook
%matplotlib inline
import seaborn as sns

# Remove scientific notations and display numbers with 2 decimal points instead
pd.options.display.float_format = '{:,.2f}'.format

In [3]:
# Load the file from the local directory using pd.read_csv
bank_df = pd.read_csv("bank-full.csv")
bank_df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,Target
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


## Deliverable – 1 (EDA)

### Univariate

In [4]:
bank_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  Target     45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB


**Numerical columns**

In [None]:
bank_df.describe()

`'balance' has negative values. These are acceptable: the dataset also has loan and default as features, so we can assume the balance is that of a credit account, which can be negative as well.`

In [None]:
# numeric_only=True restricts the calculation to the numerical columns
bank_df.skew(numeric_only=True)

`The distribution of all numerical variables other than age is highly skewed - hence we might want to transform or bin some of these variables`

In [None]:
sns.boxplot(x=bank_df['age'])
plt.show()

`People above the age of 70 are outliers`

`Age column has some outliers. The median age is about 40 years. There are some customers above 90 years of age. This data might have to be checked`

In [None]:
# histplot with kde=True replaces the deprecated distplot
sns.histplot(bank_df['balance'], kde=True)
plt.show()

In [None]:
# histplot with kde=True replaces the deprecated distplot
sns.histplot(bank_df['campaign'], kde=True)
plt.show()

`Binning is to be done for both columns, 'balance' and 'campaign'.`

In [None]:
bank_df[bank_df['duration']==0]

`This attribute highly affects the output target (e.g., if duration=0 then Target='no'). Yet the duration is not known before a call is performed, and after the end of the call the outcome is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.`

In [None]:
bank_df.drop(['duration'], inplace=True, axis=1)

**Non numerical columns**

In [None]:
for i in bank_df.columns[bank_df.dtypes=='object']:
    print(i,":")
    print()
    print(bank_df[i].value_counts(normalize=True)*100)
    print()
    print()

`We can drop poutcome as most of the values are unknown`

`Target is imbalanced, but we do not treat the imbalance here: the 'yes' class is around 11%, and we treat imbalanced data only when one class is very rare.`

In [None]:
bank_df.drop(['poutcome'], inplace=True, axis=1)

In [None]:
sns.countplot(x='Target', data=bank_df)
plt.show()

In [None]:
bank_df['Target'].value_counts(normalize=True)

`The response rate is only about 11.7%, so the Y variable has a high class imbalance. Accuracy will therefore not be a reliable measure of model performance.`

`False negatives (FN) are very critical for this business case: a false negative is a customer who would potentially subscribe to the term deposit but has been classified as 'will not subscribe'. Hence the most relevant model performance measure is recall.`
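
To make the point about accuracy concrete, here is a small sketch on made-up labels (not the bank data, and the 12% rate is only an assumption for illustration). A "model" that predicts 'no' for every customer scores high accuracy but finds no subscribers at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 100 customers, 12 of whom subscribed (1 = yes)
y_true = np.array([1] * 12 + [0] * 88)

# A useless "model" that predicts 'no' for every customer
y_pred_all_no = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred_all_no)
rec = recall_score(y_true, y_pred_all_no)

print(f"Accuracy: {acc:.2f}")  # 0.88
print(f"Recall:   {rec:.2f}")  # 0.00
```

Accuracy simply mirrors the share of the majority class here, which is why recall is tracked alongside it in the model comparison.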

### Bivariate analysis

In [None]:
for i in ['age','balance','day','campaign','pdays','previous']:
    sns.boxplot(x='Target',y=i,data=bank_df)
    plt.show()

`Campaign values are higher for people saying no to term deposits, i.e. people saying yes to term deposits had fewer contacts during the campaign.`

In [None]:
# Group numerical variables by mean for the classes of the Y variable
# numeric_only=True skips the object columns, which cannot be averaged
bank_df.groupby('Target').mean(numeric_only=True).round(1)

`The mean balance is higher for customers who subscribe to the term deposit than for those who don't.`

`The number of days since the client was last contacted in a previous campaign (pdays) is higher for people who have subscribed.`

`The number of contacts performed before this campaign (previous) is also higher for customers who subscribe.`

`Together, these facts indicate that customers with a higher balance and those who were contacted more often before the campaign tend to subscribe to the term deposit.`

**Bivariate analysis using crosstab for categorical values**

In [None]:
pd.crosstab(bank_df['job'], bank_df['Target'], normalize='index').sort_values(by='yes',ascending=False )

#### The highest conversion is for students (28%) and the lowest is for blue-collar (7%)

In [None]:
pd.crosstab(bank_df['marital'], bank_df['Target'], normalize='index').sort_values(by='yes',ascending=False )

In [None]:
pd.crosstab(bank_df['education'], bank_df['Target'], normalize='index').sort_values(by='yes',ascending=False )

In [None]:
print(pd.crosstab(bank_df['default'], bank_df['Target'], normalize='index').sort_values(by='yes',ascending=False ))
print()
print(bank_df['default'].value_counts(normalize=True))

`Since default - yes is only 2% of the data and the conversion is also comparatively lower for default - yes, we can remove this column.`

In [None]:
bank_df.drop(['default'], axis=1, inplace=True)

In [None]:
bank_df.columns

In [None]:
pd.crosstab(bank_df['housing'], bank_df['Target'], normalize='index').sort_values(by='yes',ascending=False )

In [None]:
pd.crosstab(bank_df['loan'], bank_df['Target'], normalize='index').sort_values(by='yes',ascending=False )

In [None]:
pd.crosstab(bank_df['contact'], bank_df['Target'], normalize='index').sort_values(by='yes',ascending=False )

In [None]:
pd.crosstab(bank_df['month'], bank_df['Target'], normalize='index').sort_values(by='yes',ascending=False )

## Deliverable – 2 (Prepare the data for analytics)

In [None]:
# Not a necessary step, but it helps having more categorical variables when target is categorical

# Binning balance

# The first edge is the column minimum - 1 and the last edge is the column maximum + 1,
# so that all values are included. The middle edges can be chosen by hand or set to
# the 25th, 50th and 75th percentile values.
bin_edges = [-8020, 0, 72, 448, 1428, 102128]
# Names of each bin or category
bin_names = ['very low', 'low', 'medium', 'high', 'very high']
bank_df['balance'] = pd.cut(bank_df['balance'], bin_edges, labels=bin_names)
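
The comment above mentions using the 25th, 50th and 75th percentiles as the middle edges. As a minimal sketch of that idea, `pd.qcut` derives quantile-based edges from the data instead of hard-coding them. The series below is made up for illustration, and the four label names are an assumption, not the bins used in this notebook.

```python
import pandas as pd

# Hypothetical balances standing in for bank_df['balance']
balance_demo = pd.Series([-500, 0, 10, 50, 200, 450, 900, 1500, 3000, 10000])

# q=4 cuts at the 25th, 50th and 75th percentiles, giving four bins
balance_binned = pd.qcut(
    balance_demo, q=4, labels=['low', 'medium', 'high', 'very high']
)

print(balance_binned.value_counts().sort_index())
```

Quantile bins hold roughly equal numbers of rows, whereas the hand-picked edges used in the cell above let a bin boundary sit at a meaningful value such as zero.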

In [None]:
# Not a necessary step, but it helps having more categorical variables when target is categorical

# Binning campaign

# The first edge is below the column minimum and the last edge is above the column maximum,
# so that all values are included. pd.cut bins are right-inclusive: (0, 2], (2, 3], (3, 4], (4, 564].
bin_edges = [0, 2, 3, 4, 564]
# Names of each bin or category; the last bin holds values above 4
bin_names = ['<=2', '3', '4', '>4']
bank_df['campaign'] = pd.cut(bank_df['campaign'], bin_edges, labels=bin_names)

In [None]:
bank_df['Target'] = bank_df['Target'].map({'yes':1, 'no':0})

In [None]:
# Separating independent and dependent variables

X = bank_df.drop("Target" , axis=1)
y = bank_df["Target"]   

X = pd.get_dummies(X, drop_first=True)

In [None]:
from sklearn.model_selection import train_test_split
test_size = 0.30  # 70:30 split between training and test sets
seed = 7  # Random number seed for repeatability of the code
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
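
Because the target is imbalanced, one option worth considering is a stratified split, which keeps the share of 'yes' customers the same in both sets. This is a sketch on made-up data, not a change to the split used in this notebook; `X_demo` and `y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 100 rows, 10 of them positive
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 10 + [0] * 90)

# stratify=y_demo preserves the class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=7, stratify=y_demo
)

print("Positive share in train:", y_tr.mean())
print("Positive share in test: ", y_te.mean())
```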

In [None]:
X_train.shape,X_test.shape

## Deliverable – 3 (create the ensemble model)

## 1.

In [None]:
# Blank lists to store model name, training score, testing score, recall, precision and ROC AUC
algo = []
tr = []
te = []
recall = []
precision = []
roc = []
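
The same six metrics are collected for every model in the cells that follow. As an alternative sketch, a small helper can return them as one row per model. `evaluate_model` is a hypothetical function, not part of this notebook, and the data below is synthetic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate_model(name, model, X_train, X_test, y_train, y_test):
    """Fit a model and return its scores as a dict (hypothetical helper)."""
    model.fit(X_train, y_train)
    test_pred = model.predict(X_test)
    return {
        'Model': name,
        'Training Score': model.score(X_train, y_train),
        'Testing Score': model.score(X_test, y_test),
        'Recall': recall_score(y_test, test_pred),
        'Precision': precision_score(y_test, test_pred, zero_division=0),
        'ROC AUC Score': roc_auc_score(y_test, test_pred),
    }


# Synthetic stand-in data (not the bank dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=7
)

row = evaluate_model(
    'Logistic Regression', LogisticRegression(max_iter=1000), X_tr, X_te, y_tr, y_te
)
results_demo = pd.DataFrame([row]).set_index('Model')
print(results_demo)
```

Collecting rows as dicts keeps each model's scores together, so a cell that is re-run cannot leave the parallel lists out of step.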

**Logistic Regression**

In [None]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, precision_score, roc_auc_score, accuracy_score
model = LogisticRegression(random_state=7)

model.fit(X_train, y_train)

algo.append('Logistic Regression')
tr.append(model.score(X_train, y_train))
te.append(model.score(X_test, y_test))
recall.append(recall_score(y_test,model.predict(X_test)))
precision.append(precision_score(y_test,model.predict(X_test)))
roc.append(roc_auc_score(y_test,model.predict(X_test)))
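
The cell above passes hard class labels from `predict` to `roc_auc_score`. ROC AUC is usually computed from scores or probabilities, so that every threshold contributes to the curve. The sketch below contrasts the two calls on synthetic data; it is an illustration, not a change to the scores recorded in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 10% positives), not the bank dataset
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=7, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard 0/1 labels: only one operating point on the ROC curve
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))

# AUC from the predicted probability of the positive class: the full curve
auc_from_proba = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"AUC from labels:        {auc_from_labels:.3f}")
print(f"AUC from probabilities: {auc_from_proba:.3f}")
```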

**Decision Tree**

In [None]:
from sklearn.tree import DecisionTreeClassifier
#instantiating decision tree as the default model
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)

In [None]:
# Training accuracy
dt_model.score(X_train, y_train)

In [None]:
# Testing accuracy
dt_model.score(X_test, y_test)

`The model is overfit, as the testing score is lower than the training score.`

**Note: -** `A decision tree is a non-parametric algorithm and hence prone to overfitting. This is evident from the difference between the training and testing scores. In ensemble techniques, we want multiple instances (each different from the others), and we want each instance to be overfit! Hopefully the different instances will make different mistakes in classification, and when we combine them their errors will cancel out, giving us the benefit of lower bias and lower overall variance.`
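
The note above can be tried out on a small scale. The sketch below fits one unpruned tree and a bagged ensemble of unpruned trees on synthetic data and prints their scores side by side. The data and parameter values are assumptions for illustration, and the result on the bank data may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the bank dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=7
)

# One unpruned tree: it memorizes the training data
single_tree = DecisionTreeClassifier(random_state=7).fit(X_tr, y_tr)

# 50 unpruned trees, each fit on a different bootstrap sample,
# whose predictions are then combined
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=7), n_estimators=50, random_state=7
).fit(X_tr, y_tr)

single_train = single_tree.score(X_tr, y_tr)
single_test = single_tree.score(X_te, y_te)
bagged_test = bagged_trees.score(X_te, y_te)

print(f"Single tree  train: {single_train:.3f}  test: {single_test:.3f}")
print(f"Bagged trees test:  {bagged_test:.3f}")
```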

In [None]:
clf_pruned = DecisionTreeClassifier(criterion = "entropy", random_state = 7, max_depth=3, min_samples_leaf=5)
clf_pruned.fit(X_train, y_train)

In [None]:
# Calculating feature importance
feature_cols = X_train.columns

# Pair each feature name with its normalized importance from the pruned tree
feat_imp_dict = dict(zip(feature_cols, clf_pruned.feature_importances_))
feat_imp = pd.DataFrame.from_dict(feat_imp_dict, orient='index')
feat_imp.sort_values(by=0, ascending=False)[0:10]  # Top 10 features

In [None]:
preds_pruned = clf_pruned.predict(X_test)
preds_pruned_train = clf_pruned.predict(X_train)


In [None]:
print("Training Accuracy:", accuracy_score(y_train, preds_pruned_train))
print()
print("Testing Accuracy:", accuracy_score(y_test, preds_pruned))
print()
print("Recall:", recall_score(y_test, preds_pruned, average="binary", pos_label=1))

`Overfitting is reduced after pruning, but recall has drastically reduced`

In [None]:
# Decision Tree Classifier using entropy, adding the values in the list

model = DecisionTreeClassifier(criterion = "entropy", random_state = 7, max_depth=3, min_samples_leaf=5)

model.fit(X_train, y_train)

algo.append('Decision Tree entropy')
tr.append(model.score(X_train, y_train))
te.append(model.score(X_test, y_test))
recall.append(recall_score(y_test,model.predict(X_test)))
precision.append(precision_score(y_test,model.predict(X_test)))
roc.append(roc_auc_score(y_test,model.predict(X_test)))

In [None]:
# Decision Tree Classifier using gini, adding the values in the list

model = DecisionTreeClassifier(criterion = "gini", random_state = 7, max_depth=3, min_samples_leaf=5)

model.fit(X_train, y_train)

algo.append('Decision Tree gini')
tr.append(model.score(X_train, y_train))
te.append(model.score(X_test, y_test))
recall.append(recall_score(y_test,model.predict(X_test)))
precision.append(precision_score(y_test,model.predict(X_test)))
roc.append(roc_auc_score(y_test,model.predict(X_test)))

## 2.

In [None]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=7, n_estimators=50)

model.fit(X_train, y_train)

algo.append('Random Forest')
tr.append(model.score(X_train, y_train))
te.append(model.score(X_test, y_test))
recall.append(recall_score(y_test,model.predict(X_test)))
precision.append(precision_score(y_test,model.predict(X_test)))
roc.append(roc_auc_score(y_test,model.predict(X_test)))

In [None]:
# Bagging
from sklearn.ensemble import BaggingClassifier

model = BaggingClassifier(random_state=7,n_estimators=100, max_samples= .7, bootstrap=True, oob_score=True)

model.fit(X_train, y_train)

algo.append('Bagging')
tr.append(model.score(X_train, y_train))
te.append(model.score(X_test, y_test))
recall.append(recall_score(y_test,model.predict(X_test)))
precision.append(precision_score(y_test,model.predict(X_test)))
roc.append(roc_auc_score(y_test,model.predict(X_test)))

In [None]:
# AdaBoost
from sklearn.ensemble import AdaBoostClassifier

model = AdaBoostClassifier(random_state=7,n_estimators= 200, learning_rate=0.1)

model.fit(X_train, y_train)

algo.append('AdaBoost')
tr.append(model.score(X_train, y_train))
te.append(model.score(X_test, y_test))
recall.append(recall_score(y_test,model.predict(X_test)))
precision.append(precision_score(y_test,model.predict(X_test)))
roc.append(roc_auc_score(y_test,model.predict(X_test)))

In [None]:
# Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(random_state=7, n_estimators=200)

model.fit(X_train, y_train)

algo.append('Gradient Boosting')
tr.append(model.score(X_train, y_train))
te.append(model.score(X_test, y_test))
recall.append(recall_score(y_test,model.predict(X_test)))
precision.append(precision_score(y_test,model.predict(X_test)))
roc.append(roc_auc_score(y_test,model.predict(X_test)))

In [None]:
# DataFrame to compare results.

results = pd.DataFrame()
results['Model'] = algo
results['Training Score'] = tr
results['Testing Score'] = te
results['Recall'] = recall
results['Precision'] = precision
results['ROC AUC Score'] = roc
results = results.set_index('Model')
results

**What the confusion matrix cells mean**

*True Positive (observed=1, predicted=1):*

The customer subscribed to the term deposit and the model predicted that the customer would.

*False Positive (observed=0, predicted=1):*

The customer did not subscribe to the term deposit but the model predicted that the customer would.

*True Negative (observed=0, predicted=0):*

The customer did not subscribe to the term deposit and the model predicted that the customer wouldn't.

*False Negative (observed=1, predicted=0):*

The customer subscribed to the term deposit but the model predicted that the customer wouldn't.

Here the company wants more people to subscribe to term deposits, so if a customer is willing to subscribe, we shouldn't lose that customer. The focus should therefore be on false negatives: decreasing FN increases recall.
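
The four cells can be read straight off scikit-learn's `confusion_matrix`. The labels below are made up for illustration: ten customers, three of whom actually subscribed.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical observed and predicted labels (1 = subscribed)
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

# For binary labels {0, 1}, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall_demo = tp / (tp + fn)      # share of actual subscribers that were found
precision_demo = tp / (tp + fp)   # share of predicted subscribers that were right

print(tn, fp, fn, tp)  # 6 1 2 1
print(f"Recall: {recall_demo:.2f}, Precision: {precision_demo:.2f}")
```

Two of the three real subscribers are missed here, so recall is only one third even though eight of the ten predictions are correct.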

### Bagging gives the best overall model performance. However, please note that the recall is still very low and will have to be improved.
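
One common way to raise recall, sketched here under assumptions rather than taken from this notebook, is to lower the probability threshold at which a customer is labelled 'yes'. The data is synthetic and the 0.2 threshold is an arbitrary illustrative value; in practice it would be chosen on a validation set against the cost of extra calls.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 10% positives), not the bank dataset
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=7, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba_yes = clf.predict_proba(X_te)[:, 1]

# Default rule: predict 'yes' when the probability is at least 0.5
pred_default = (proba_yes >= 0.5).astype(int)
# Lower threshold: more customers are flagged, so fewer subscribers are missed
pred_low = (proba_yes >= 0.2).astype(int)

recall_default = recall_score(y_te, pred_default)
recall_low = recall_score(y_te, pred_low)
precision_default = precision_score(y_te, pred_default, zero_division=0)
precision_low = precision_score(y_te, pred_low, zero_division=0)

print(f"Threshold 0.5 -> recall {recall_default:.2f}, precision {precision_default:.2f}")
print(f"Threshold 0.2 -> recall {recall_low:.2f}, precision {precision_low:.2f}")
```

Lowering the threshold can never reduce recall, but it typically costs precision, which here means more calls to customers who will not subscribe.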