# Bank Marketing Predictions

Dataset retrieved from [Kaggle](https://www.kaggle.com/henriqueyamahata/bank-marketing)

This dataset contains the outcomes of marketing phone calls made by a Portuguese banking institution. The goal of each call is to get the customer to subscribe to a bank term deposit, and that outcome is what the model will be trained to predict.

The main challenge with this dataset is that many of its features are categorical and need to be encoded.

In [None]:
# numpy, pandas, matplotlib
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
%matplotlib inline

# scikit-learn utilities for model selection and evaluation
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_val_predict, cross_val_score

## Data Analysis

The data is well prepared and contains no null values, as the check below confirms. The main remaining task is encoding the categorical features.

In [None]:
data = pd.read_csv('../../bank-marketing.csv')
data.head()

In [None]:
data.isnull().sum()

As the chart below shows, the classes are heavily imbalanced: only about 11% of calls ended in a subscription. If the model is trained and evaluated without balancing the classes, accuracy will be misleading, because a model that always predicts "no" would already be right about 89% of the time.

In [None]:
y_class = ['Subscribed', 'Didn\'t Subscribe']

plt.bar(y_class, [np.sum(data['y'] == 'yes'), np.sum(data['y'] == 'no')], color='teal')
plt.xlabel("Outcome")
plt.ylabel("Phone Calls")
plt.title("Distribution of Customers Who Did and Didn't Subscribe")
plt.show()
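
To put a number on the imbalance, here is a small sketch that uses the class counts quoted later in this notebook (4,640 subscribers out of 41,188 rows). The Series `y` is a toy stand-in built from those counts; in the notebook itself the same call would presumably be applied to `data['y']`.

```python
import pandas as pd

# Toy target column rebuilt from the class counts quoted in this notebook
# (4,640 "yes" out of 41,188 rows). This is a stand-in, not the real data.
y = pd.Series(['yes'] * 4640 + ['no'] * (41188 - 4640), name='y')

# normalize=True returns each class as a fraction of all rows.
class_share = y.value_counts(normalize=True)
print(class_share.round(3))
```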

## Data Engineering

First, I will drop the number-of-employees feature (`nr.employed`) and bin the education levels into fewer categories to simplify the dataset.

In [None]:
data.pop('nr.employed')

bin_education = {
    "university.degree": "university.degree",
    "professional.course": "professional.course",
    "high.school": "high.school",
    "basic.9y": "basic",
    "basic.6y": "basic",
    "basic.4y": "basic",
    "unknown": "unknown",
    "illiterate": "unknown"
}
data.education = data.education.map(bin_education)

Next, I one-hot encode the categorical features with `pd.get_dummies`.

In [None]:
# One-hot encode every categorical column in a single call.
# Each new column is prefixed with the name of the column it came from.
categorical_cols = ['job', 'marital', 'education', 'default', 'housing',
                    'loan', 'contact', 'month', 'day_of_week', 'poutcome']
data = pd.get_dummies(data, columns=categorical_cols)
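
As a quick illustration of what this encoding does, the sketch below runs `pd.get_dummies` on a three-row toy frame. The frame `toy` and its values are made up for the example; only the column names `marital` and `contact` are borrowed from the dataset.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy = pd.DataFrame({
    'age': [30, 45, 52],
    'marital': ['married', 'single', 'married'],
    'contact': ['cellular', 'telephone', 'cellular'],
})

# Each listed column is replaced by one indicator column per category.
encoded = pd.get_dummies(toy, columns=['marital', 'contact'])
print(list(encoded.columns))
```

Non-listed columns such as `age` pass through unchanged, which is why the numeric features in the notebook are unaffected by this step.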

The `pdays` feature (number of days since the customer was last called) has a value of 999 if the customer hasn't been called before. This sentinel value could introduce unwanted bias, because the model would treat it as a real, very large number of days. Instead, 999 is replaced with 0, and a new binary feature named `called_before` records whether the customer was contacted previously.

In [None]:
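# Flag first: 0 if the sentinel 999 is present (never called), otherwise 1.
data['called_before'] = data.pdays.apply(lambda days: 0 if (days == 999) else 1)
# Then overwrite the sentinel with 0 so it no longer reads as a huge day count.
data['pdays'] = data.pdays.apply(lambda days: 0 if (days == 999) else days)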
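
The same transformation can also be written without `apply`. The following is a minimal sketch on a made-up four-row frame, assuming 999 is the only sentinel value in `pdays`, as described above.

```python
import pandas as pd

# Hypothetical toy values; 999 marks "never called before".
toy = pd.DataFrame({'pdays': [999, 3, 999, 6]})

# The flag must be derived before the sentinel is overwritten.
toy['called_before'] = (toy['pdays'] != 999).astype(int)
toy['pdays'] = toy['pdays'].replace(999, 0)
print(toy)
```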

The target column is currently either 'yes' or 'no' strings. That can't be fed into the model, so it should be encoded to 1 or 0.

In [None]:
data['y'] = data.y.apply(lambda row: 1 if (row == 'yes') else 0)

In [None]:
data.y.value_counts()

## Create Training and Test Sets

Of the 41,188 samples, 4,640 are clients who subscribed to a bank term deposit. To keep the majority class from dominating training, the training set should be balanced: half the samples are non-subscribers and half are subscribers. For the training set, I will use 3,800 non-subscribers and 3,800 subscribers. That leaves 840 subscribers, which are paired with 840 non-subscribers to give 1,680 balanced samples for testing the trained model.

Split the data by class and assign each subset to its own variables for easy management.

In [None]:
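# .copy() makes each subset independent, so pop() below does not act on a view of `data`.
X_not_sub, X_sub = data[data.y == 0].copy(), data[data.y == 1].copy()
# pop() removes the target column from the features and returns it.
y_not_sub, y_sub = X_not_sub.pop('y'), X_sub.pop('y')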

Create the training and test sets.

In [None]:
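# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
X_train, y_train = pd.concat([X_not_sub[:3800], X_sub[:3800]]), pd.concat([y_not_sub[:3800], y_sub[:3800]])
# Rows 3800 to 4639 of each class: 840 subscribers plus 840 non-subscribers (1,680 in total).
X_test, y_test = pd.concat([X_not_sub[3800:4640], X_sub[3800:4640]]), pd.concat([y_not_sub[3800:4640], y_sub[3800:4640]])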
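
The slices above take rows in file order. If the CSV happens to be ordered (for example chronologically), a shuffled variant may give a more representative split. The sketch below is one possible way to do that and is not the approach used in this notebook; `balanced_split` is a hypothetical helper, and the toy frame stands in for `data`.

```python
import pandas as pd

def balanced_split(df, target, n_train_per_class, seed=42):
    """Hypothetical helper: shuffle each class, then take equal-sized slices."""
    # The test slice is capped at the minority-class size so both classes match.
    n_minority = df[target].value_counts().min()
    train_parts, test_parts = [], []
    for _, group in df.groupby(target):
        shuffled = group.sample(frac=1, random_state=seed)
        train_parts.append(shuffled.iloc[:n_train_per_class])
        test_parts.append(shuffled.iloc[n_train_per_class:n_minority])
    return pd.concat(train_parts), pd.concat(test_parts)

# Toy data: 20 positives and 80 negatives.
toy = pd.DataFrame({'x': range(100), 'y': [1] * 20 + [0] * 80})
train, test = balanced_split(toy, 'y', n_train_per_class=15)
print(train['y'].value_counts().to_dict(), test['y'].value_counts().to_dict())
```

With the notebook's numbers, the equivalent call would presumably use `n_train_per_class=3800` on `data` before the target column is popped.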

## Train the Model

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rnd_clf = RandomForestClassifier(random_state=42, n_estimators=100)

# Hyperparameter grid: every combination is evaluated by the grid search.
rnd_params = [
    {
        'max_depth': [12, 14, 16, 18, 20, 22, 24, 26, 28],
        'min_samples_leaf': [1, 2, 3, 4, 5, 6],
        'max_leaf_nodes': [4, 8, 12, 16, 20, 24, 28],
    },
]

rnd_cv = GridSearchCV(estimator=rnd_clf, param_grid=rnd_params, cv=4)
rnd_cv.fit(X_train, y_train)
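
Once a grid search has been fitted, its best parameters can be inspected and the refitted model scored on held-out data. The sketch below shows that pattern on synthetic data from `make_classification`, with a much smaller grid so that it runs quickly. The data, grid values, and variable names here are assumptions for illustration, not results from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the bank marketing features and target.
X, y = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42, n_estimators=20),
    param_grid={'max_leaf_nodes': [4, 8], 'min_samples_leaf': [1, 3]},
    cv=4,
)
search.fit(X_tr, y_tr)

# best_params_ holds the winning combination; best_estimator_ is refit on all of X_tr.
print(search.best_params_)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, search.best_estimator_.predict(X_te))
print(cm)
```

In the notebook, the analogous calls would be `rnd_cv.best_params_` and `confusion_matrix(y_test, rnd_cv.best_estimator_.predict(X_test))`.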