## Imports

In [2]:
import pandas as pd 
import numpy as np
from sklearn.model_selection import train_test_split
import torch 
from torch import nn
from torch import optim 
from imblearn import over_sampling

pd.options.mode.chained_assignment = None

## Data Cleaning

In [4]:
df = pd.read_csv("./data/bank-additional-full.csv", delimiter = ";")
df.head(5)

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


Descriptions of the columns can be found at the UCI Machine Learning Repository, linked [here](https://archive.ics.uci.edu/dataset/222/bank+marketing).

In [6]:
df.columns

Index(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx',
       'cons.conf.idx', 'euribor3m', 'nr.employed', 'y'],
      dtype='object')

For this particular analysis, I am going to ignore the economic indicators. My simplifying assumption is that most people contacted in this campaign are not making their decision based on macroeconomic factors like the consumer price index.

In [8]:
df = df.drop(columns = ['emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed'])
df.head(5)

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,307,1,999,0,nonexistent,no


In [9]:
df['default'].value_counts()

no         32588
unknown     8597
yes            3
Name: default, dtype: int64

In [10]:
df['housing'].value_counts()

yes        21576
no         18622
unknown      990
Name: housing, dtype: int64

In [11]:
df['loan'].value_counts()

no         33950
yes         6248
unknown      990
Name: loan, dtype: int64

As we can see from above, only 3 customers are recorded as having defaulted, against 32,588 who have not and 8,597 unknowns. So, to avoid dealing with the unknown values, we will make the assumption that the vast majority of people have not defaulted and therefore this column provides little predictive information.

We will also drop the rows where the housing or personal loan status is unknown (990 in each column), as we have a sufficient sample size without them.

In [13]:
df = df.drop(columns = ['default'])

df = df[(df['housing'] != 'unknown') & (df['loan'] != 'unknown')]

In order to run our neural network, we have to make all of our variables numeric. We do so by turning any categorical variables into dummy variables (1 = variable is true, 0 = false), and, since the binary variables are stored as yes/no variables, we will convert those into 1/0 values as well. 

In [15]:
# converting categorical variables into dummies

df_dummy = pd.get_dummies(df, columns = ['job', 'marital', 'education', 'contact', 'month', 'day_of_week', 'poutcome'], prefix_sep = ': ')

# turning y/n variables into 1/0

df_dummy['housing'] = np.where(df['housing'].values == 'yes', 1, 0)
df_dummy['loan'] = np.where(df['loan'].values == 'yes', 1, 0)
df_dummy['y'] = np.where(df['y'].values == 'yes', 1, 0)

df_dummy.head(5)

Unnamed: 0,age,housing,loan,duration,campaign,pdays,previous,y,job: admin.,job: blue-collar,...,month: oct,month: sep,day_of_week: fri,day_of_week: mon,day_of_week: thu,day_of_week: tue,day_of_week: wed,poutcome: failure,poutcome: nonexistent,poutcome: success
0,56,0,0,261,1,999,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
1,57,0,0,149,1,999,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
2,37,1,0,226,1,999,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
3,40,0,0,151,1,999,0,0,1,0,...,0,0,0,1,0,0,0,0,1,0
4,56,0,1,307,1,999,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0


## Building the Basic Neural Network

For any model building, we want to ensure our model is not trained on the entirety of our dataset. Otherwise, we would have no held-out data to test the model on, and no way to detect overfitting. I have decided to use Scikit-Learn to split the dataset into 70% training data and 30% testing data. An 80-20 split is another popular option, but I prefer to have more testing data so that the estimate of out-of-sample performance is more reliable.

We make sure our outcome variable, y (whether or not the client subscribed to the term deposit, the goal of the campaign), is separated from our other predictor variables.  

In [18]:
X = df_dummy[df_dummy.columns[~df_dummy.columns.isin(['y'])]]
y = df_dummy['y']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, train_size = .70)

### Training the Sequential Neural Network

In [20]:
def df_to_tensor(df, outcome):
    ''' Converts a df into a numpy array and then a Tensor with dtype float32 to be used in a PyTorch model 
        
        Params: 
            df (DataFrame): Input dataframe to be converted
            outcome (Boolean): Whether or not the df is an outcome vector; if it is, it is reshaped to an (N, 1) column tensor to match the model output
        
        Returns:
            as_tensor (Tensor): dtype float32 Tensor; use float32 as it is the input type for torch.nn neural networks
            
    '''
    
    df_np = df.to_numpy() # convert to numpy so we can use torch.from_numpy method
    if outcome: 
        return torch.from_numpy(df_np).reshape(-1, 1).to(torch.float32) # reshape(-1, 1) makes an (N, 1) column so the target matches the model's output shape
    return torch.from_numpy(df_np).to(torch.float32)
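
To make the shape handling in `df_to_tensor` concrete, here is a small sketch using NumPy only (NumPy's `reshape` follows the same `-1` convention as `torch.Tensor.reshape`). The toy array `labels` is an assumption for illustration, not data from this notebook.

```python
import numpy as np

# Hypothetical outcome vector with three labels (illustrative only)
labels = np.array([0, 1, 1])

# A pandas Series converts to a 1-D array of shape (N,)
print(labels.shape)      # (3,)

# reshape(-1, 1) turns it into an (N, 1) column; -1 lets the length be inferred
column = labels.reshape(-1, 1)
print(column.shape)      # (3, 1)
```

The (N, 1) shape matters because a final `nn.Linear(..., 1)` layer produces one output per row, and `nn.BCELoss` expects predictions and targets to have the same shape.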

In [21]:
# converting training data into tensors

X_train_tensor = df_to_tensor(X_train, outcome = False)
y_train_tensor = df_to_tensor(y_train, outcome = True)

For our model, we will be using PyTorch's `nn.Sequential` container, which chains layers so that the output of each one feeds into the next. Placing non-linear activation functions between the linear layers is what lets the network learn non-linear relationships between the features and the outcome.

For this given model, we use two Linear modules as our hidden layers, each followed by the ReLU activation function. Our output layer is a Linear module with a single output followed by a Sigmoid function, which maps the result to (0, 1) so that it can be read as a class probability.

With regards to the number of neurons we use for our hidden layers, many rules of thumb have been proposed. Some have suggested that the number of neurons should be $\frac{2}{3}$ the size of the input layer (plus the size of the output layer), some have suggested no more than 2x the input layer, so we will split the difference and use 1.3x the input layer. This is an arbitrary decision, but since there is not much literature for guidance beyond these softer suggestions, we will make do with this in lieu of extensive guess and check.

In [23]:
# calculating the number of output neurons for our hidden layers 

N_i = len(X_train.columns) # number of features represents the number of the nodes in the input layer

N_h = int(1.3 * N_i)
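
As a rough worked comparison of the heuristics mentioned above, the sketch below computes the hidden-layer width each one would suggest. The function name `hidden_size_candidates` and the feature count of 50 are assumptions for illustration, and the rules themselves are informal heuristics rather than established results.

```python
def hidden_size_candidates(n_inputs, n_outputs=1):
    """Return hidden-layer sizes suggested by a few common rules of thumb."""
    return {
        "two_thirds_input_plus_output": int(2 / 3 * n_inputs + n_outputs),
        "upper_bound_twice_input": 2 * n_inputs,
        "notebook_choice_1.3x_input": int(1.3 * n_inputs),
    }

# n_inputs=50 is a made-up feature count, not this notebook's actual N_i
sizes = hidden_size_candidates(n_inputs=50)
print(sizes)
```

With 50 inputs the candidates are 34, 100, and 65 neurons, so the 1.3x choice sits between the two rules.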

In [24]:
# Building the model using PyTorch

model = nn.Sequential(
    nn.Linear(N_i, N_h),
    nn.ReLU(),
    nn.Linear(N_h, N_i),
    nn.ReLU(),
    nn.Linear(N_i, 1),
    nn.Sigmoid())
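
To show what this stack of layers computes, here is a minimal NumPy sketch of the same Linear, ReLU, Linear, ReLU, Linear, Sigmoid forward pass. The random weights, the toy feature count of 4, and the helper names are assumptions; PyTorch initializes and stores its weights differently.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Linear -> ReLU -> Linear -> ReLU -> Linear -> Sigmoid."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = relu(x @ W1 + b1)
    h2 = relu(h1 @ W2 + b2)
    return sigmoid(h2 @ W3 + b3)

n_i = 4                      # toy number of input features (assumption)
n_h = int(1.3 * n_i)         # hidden width, same 1.3x rule as above
params = [
    (rng.normal(scale=0.5, size=(n_i, n_h)), np.zeros(n_h)),
    (rng.normal(scale=0.5, size=(n_h, n_i)), np.zeros(n_i)),
    (rng.normal(scale=0.5, size=(n_i, 1)), np.zeros(1)),
]

x = rng.normal(size=(3, n_i))   # batch of 3 made-up rows
probs = forward(x, params)
print(probs.shape)              # (3, 1)
```

Each row of the batch comes out as a single value between 0 and 1, which is what the loss function and the later `.round()` classification step operate on.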

Since we are using a binary classifier (either the campaign is successful or it isn't), we use Binary Cross Entropy loss as it measures the error in mislabeled outcomes for a single outcome vector. We use the Adam optimizer due to its ability to have quick convergence and deal with sparse gradients, which we may deal with as a result of having a lot of dummy variables. 
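
For intuition about what Binary Cross Entropy measures, the sketch below implements the mean BCE formula in NumPy and compares confident-correct with confident-wrong predictions. The clipping constant `eps` is my own guard against `log(0)`; `nn.BCELoss` handles that edge case in its own way.

```python
import numpy as np

def binary_cross_entropy(p, y, eps=1e-12):
    """Mean BCE: -(y*log(p) + (1-y)*log(1-p)), averaged over samples."""
    p = np.clip(p, eps, 1 - eps)   # guard against log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

y_true = np.array([1.0, 0.0])
confident_right = np.array([0.9, 0.1])
confident_wrong = np.array([0.1, 0.9])

loss_right = binary_cross_entropy(confident_right, y_true)
loss_wrong = binary_cross_entropy(confident_wrong, y_true)
print(loss_right)  # about 0.105
print(loss_wrong)  # about 2.303
```

The loss grows sharply as the predicted probability moves away from the true label, so confident mistakes are penalized far more than hesitant ones.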

As for how we go about training the model, we use 100 epochs, where one epoch is a full pass of the training data through the model. 100 is a fairly traditional number based on the existing NN literature. Batch size is the number of samples run through the model before each parameter update. As we are using mini-batch gradient descent, powers of 2 are common, and some recommendations treat 32 as an upper limit for batch size. We will use this upper limit as we are dealing with a high number of samples (N = ~40,000). Our learning rate of 0.001 is the default for PyTorch's Adam optimizer.

In [26]:
loss_fn = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

n_epochs = 100
batch_size = 32

for epoch in range(n_epochs):
    for i in range(0, len(X_train_tensor), batch_size):
        Xbatch = X_train_tensor[i:i+batch_size]
        y_pred = model(Xbatch)
        ybatch = y_train_tensor[i:i+batch_size]
        loss = loss_fn(y_pred, ybatch)
        optimizer.zero_grad() # clears gradients from the previous batch, since PyTorch accumulates them by default
        loss.backward() # computes the gradient for the given batch 
        optimizer.step() # updates parameters based on gradient calculation 
    print(f'Finished epoch {epoch}, latest loss {loss}')

Finished epoch 0, latest loss 0.22431285679340363
Finished epoch 1, latest loss 0.2086951732635498
Finished epoch 2, latest loss 0.2068432867527008
Finished epoch 3, latest loss 0.21198216080665588
Finished epoch 4, latest loss 0.20626583695411682
Finished epoch 5, latest loss 0.18865451216697693
Finished epoch 6, latest loss 0.1815384328365326
Finished epoch 7, latest loss 0.17915308475494385
Finished epoch 8, latest loss 0.17947454750537872
Finished epoch 9, latest loss 0.18734335899353027
Finished epoch 10, latest loss 0.18332123756408691
Finished epoch 11, latest loss 0.183085098862648
Finished epoch 12, latest loss 0.17581342160701752
Finished epoch 13, latest loss 0.18078556656837463
Finished epoch 14, latest loss 0.1804172694683075
Finished epoch 15, latest loss 0.17048503458499908
Finished epoch 16, latest loss 0.16576209664344788
Finished epoch 17, latest loss 0.16001704335212708
Finished epoch 18, latest loss 0.15929800271987915
Finished epoch 19, latest loss 0.17186003923416
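
To see how the slicing in the training loop carves up the data, here is a small arithmetic sketch. It assumes the training split has 28,138 rows, which is what the class counts printed later in this notebook sum to.

```python
n_train = 28138      # 3148 positives + 24990 negatives in the training split
batch_size = 32

# Same slicing pattern as the training loop: range(0, N, batch_size)
starts = list(range(0, n_train, batch_size))
n_batches = len(starts)
last_batch_size = n_train - starts[-1]

print(n_batches)        # 880
print(last_batch_size)  # 10
```

So each epoch performs 880 parameter updates, and the "latest loss" printed after an epoch is the loss on the final 10-row batch rather than an average over the epoch. That is one reason the printed values do not fall smoothly.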

### Training Performance

We compute the training accuracy of our model, which is the percentage of correctly labelled outcomes, to determine how well it classifies our desired outcome.

In [29]:
with torch.no_grad(): # inference only, so there is no need to track gradients
    y_pred_train = model(X_train_tensor)

train_classifications = y_pred_train.round()
    
train_accuracy = (train_classifications == y_train_tensor).float().mean()
print(f"Training Accuracy: {round(float(train_accuracy)*100, 2)}%") # have to convert accuracy from Tensor to float

Training Accuracy: 91.29%


This is a very encouraging training accuracy! However, it is quite high and could be a result of overfitting the training data, or it could be the result of a class imbalance.

Let us examine the class imbalance possibility: 

In [31]:
training_pos = len(torch.masked_select(y_train_tensor, y_train_tensor == 1))
training_neg = len(torch.masked_select(y_train_tensor, y_train_tensor == 0))
print(f"Number of Training Positives: {training_pos}")
print(f"Number of Training Negatives: {training_neg}")
print(f"Ratio of Positives to Negatives: {round(training_pos / training_neg, 2)}")

Number of Training Positives: 3148
Number of Training Negatives: 24990
Ratio of Positives to Negatives: 0.13


We do have a rather large imbalance in class size, which could be artificially inflating our accuracy measure. Considering the size imbalance is large, but not to the point where we have too small of a sample size for either class, we will proceed forward with some other metrics to gain more insight into our model's performance on the training data.
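
One way to quantify how much the imbalance inflates accuracy is to compare against a trivial baseline that always predicts the majority class. The helper name below is an assumption; the counts are the training positives and negatives printed above.

```python
def majority_baseline_accuracy(n_pos, n_neg):
    """Accuracy of a classifier that always predicts the larger class."""
    return max(n_pos, n_neg) / (n_pos + n_neg)

train_baseline = majority_baseline_accuracy(3148, 24990)
print(f"Always-negative baseline: {train_baseline * 100:.2f}%")  # 88.81%
```

Against that baseline, the 91.29% training accuracy is only about 2.5 percentage points better than never predicting a subscription at all.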

In [33]:
def calculate_pos_neg(predictions, original):
    ''' Calculates TP, TN, FP, FN of given Tensors
    
        Inputs:
        predictions (tensor): tensor of predicted class probabilities
        original (tensor): original outcome/class assignments
        
        Outputs:
        results (list): [TP, FP, TN, FN], in that order
    '''
    classifications = predictions.round() # turns probabilities into 0/1 classifications 
    
    combined = torch.stack((classifications, original), 0) # combined[0] is classifications, combined[1] is the original labels
    tp, fp, tn, fn = 0, 0, 0, 0
    for i in range(len(classifications)):
        if combined[0][i] == combined[1][i]:
            if combined[0][i] == 1:
                tp += 1
            else:
                tn += 1
        if combined[0][i] != combined[1][i]:
            if combined[0][i] == 1:
                fp += 1
            else: 
                fn += 1
                
    return [tp, fp, tn, fn]

train_metrics = calculate_pos_neg(y_pred_train, y_train_tensor)
train_tp = train_metrics[0]
train_fp = train_metrics[1]
train_tn = train_metrics[2]
train_fn = train_metrics[3]
            
train_precision = train_tp / (train_tp + train_fp)
train_recall = train_tp / (train_tp + train_fn)
train_f1 = (2 * train_precision * train_recall) / (train_precision + train_recall)
train_fpr = train_fp / (train_fp + train_tn) # false positive rate = 1 - specificity
train_tnr = train_tn / (train_tn + train_fp) # true negative rate, also known as specificity
train_fnr = train_fn / (train_fn + train_tp)

print(f"Training Precision: {round(train_precision * 100, 2)}%")
print(f"Training Recall: {round(train_recall * 100, 2)}%")
print(f"Training F1 Score: {round(train_f1, 2)}")
print(f"Training False Positive Rate (FPR): {round(train_fpr * 100, 2)}%")
print(f"Training True Negative Rate (TNR): {round(train_tnr * 100, 2)}%")
print(f"Training False Negative Rate (FNR): {round(train_fnr * 100, 2)}%")

Training Precision: 64.7%
Training Recall: 48.79%
Training F1 Score: 0.56
Training False Positive Rate (FPR): 3.35%
Training True Negative Rate (TNR): 96.65%
Training False Negative Rate (FNR): 51.21%


As we can see from these other metrics, accuracy alone overstates how well our model performs. The model is exceptionally good at identifying negatives correctly (TNR of 96.65%), but it finds fewer than half of the actual positives (recall of 48.79%), and only about 65% of its positive predictions are correct (precision).

Our F1 score, which is the harmonic mean of precision and recall, suggests that our model is average at best at predicting positives.
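
For reference, here is a vectorized sketch of the same confusion-matrix metrics in NumPy, with each rate defined explicitly. The function name and the toy labels are assumptions for illustration, and the sketch assumes every denominator is non-zero.

```python
import numpy as np

def confusion_metrics(pred, actual):
    """Compute confusion counts and derived rates from 0/1 arrays."""
    pred = np.asarray(pred).astype(bool).ravel()
    actual = np.asarray(actual).astype(bool).ravel()
    tp = int(np.sum(pred & actual))
    fp = int(np.sum(pred & ~actual))
    tn = int(np.sum(~pred & ~actual))
    fn = int(np.sum(~pred & actual))
    precision = tp / (tp + fp)       # share of predicted positives that are right
    recall = tp / (tp + fn)          # share of actual positives that are found
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),  # harmonic mean
        "specificity": tn / (tn + fp),                        # same as TNR
        "fpr": fp / (fp + tn),                                # 1 - specificity
        "fnr": fn / (fn + tp),                                # 1 - recall
    }

# Toy labels, made up for illustration
pred   = [1, 1, 0, 0, 0, 1, 0, 0]
actual = [1, 0, 0, 0, 1, 1, 0, 0]
m = confusion_metrics(pred, actual)
print(m)
```

Because the counts come from boolean array operations rather than a Python loop, this version should scale better to tens of thousands of rows.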

### Evaluating the Model on Test Data

To address the overfitting concern, let us use our test data to assess the model performance:

In [37]:
# Converting test data to Tensors

X_test_tensor = df_to_tensor(X_test, outcome = False)
y_test_tensor = df_to_tensor(y_test, outcome = True)

In [38]:
# computing test accuracy 

with torch.no_grad():
    y_pred_test = model(X_test_tensor)

test_accuracy = (y_pred_test.round() == y_test_tensor).float().mean()
print(f"Test Accuracy: {round(float(test_accuracy)*100, 2)}%")

Test Accuracy: 90.78%


We have a similarly high test data accuracy, so let's examine the class balance here: 

In [40]:
testing_pos = len(torch.masked_select(y_test_tensor, y_test_tensor == 1))
testing_neg = len(torch.masked_select(y_test_tensor, y_test_tensor == 0))
print(f"Number of Testing Positives: {testing_pos}")
print(f"Number of Testing Negatives: {testing_neg}")
print(f"Ratio of Positives to Negatives: {round(testing_pos / testing_neg, 2)}")

Number of Testing Positives: 1385
Number of Testing Negatives: 10675
Ratio of Positives to Negatives: 0.13


Our class imbalance is roughly the same, so we would expect similar performance:

In [42]:
test_metrics = calculate_pos_neg(y_pred_test, y_test_tensor)
test_tp = test_metrics[0]
test_fp = test_metrics[1]
test_tn = test_metrics[2]
test_fn = test_metrics[3]
            
test_precision = test_tp / (test_tp + test_fp)
test_recall = test_tp / (test_tp + test_fn)
test_f1 = (2 * test_precision * test_recall) / (test_precision + test_recall)
test_fpr = test_fp / (test_fp + test_tn) # false positive rate = 1 - specificity
test_tnr = test_tn / (test_tn + test_fp) # true negative rate, also known as specificity
test_fnr = test_fn / (test_fn + test_tp)

print(f"Testing Precision: {round(test_precision * 100, 2)}%")
print(f"Testing Recall: {round(test_recall * 100, 2)}%")
print(f"Testing F1 Score: {round(test_f1, 2)}")
print(f"Testing False Positive Rate (FPR): {round(test_fpr * 100, 2)}%")
print(f"Testing True Negative Rate (TNR): {round(test_tnr * 100, 2)}%")
print(f"Testing False Negative Rate (FNR): {round(test_fnr * 100, 2)}%")

Testing Precision: 63.75%
Testing Recall: 45.7%
Testing F1 Score: 0.53
Testing False Positive Rate (FPR): 3.37%
Testing True Negative Rate (TNR): 96.63%
Testing False Negative Rate (FNR): 54.3%


The test metrics are roughly the same as the training metrics, only slightly lower, so overfitting does not appear to be the main problem.

Is there a way for us to improve model performance?

## Solution 1: Fixing the Class Imbalance

### SMOTE

SMOTE, or Synthetic Minority Oversampling Technique, is a class imbalance correction technique. SMOTE creates new minority class datapoints by interpolating between an existing minority sample and one of its nearest minority-class neighbors. This avoids the drawbacks of purely random techniques: random oversampling only duplicates existing points, and random undersampling throws data away.

There are some limitations, like trouble translating to higher dimensions, or the possibility of adding noise to the data because it does not consider the majority class when creating synthetic minority class data points.
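
The core interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration under my own assumptions (the function name, `k=2`, and the toy points are made up), not imbalanced-learn's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_like_sample(minority, k=2):
    """Create one synthetic point by interpolating toward a near neighbor."""
    i = rng.integers(len(minority))
    x = minority[i]
    # Euclidean distance from x to every minority point
    dists = np.linalg.norm(minority - x, axis=1)
    # Indices of the k nearest neighbors, skipping x itself at position 0
    neighbors = np.argsort(dists)[1:k + 1]
    neighbor = minority[rng.choice(neighbors)]
    lam = rng.random()                 # interpolation weight in [0, 1)
    return x + lam * (neighbor - x)

# Made-up 2-D minority points
minority = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.9]])
synthetic = smote_like_sample(minority)
print(synthetic)
```

Because the new point lies on the line segment between two real points, 0/1 dummy columns can end up with fractional values. The imbalanced-learn library also offers `SMOTENC` for datasets that mix categorical and continuous features, which may be a better fit for data like this.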

First, let's generate the new, more balanced classes:

In [47]:
oversample = over_sampling.SMOTE(sampling_strategy = 0.3)
X_bal, y_bal = oversample.fit_resample(X, y)

num_pos = sum(y) 
num_neg = len(y) - num_pos 

bal_num_pos = sum(y_bal)
bal_num_neg = len(y_bal) - bal_num_pos

print(f"Original Number of Positives: {num_pos} vs. New Number of Positives: {bal_num_pos}")
print(f"Original Number of Negatives: {num_neg} vs. New Number of Negatives: {bal_num_neg}")

Original Number of Positives: 4533 vs. New Number of Positives: 10699
Original Number of Negatives: 35665 vs. New Number of Negatives: 35665


All we have done is inflate the number of positives so that there are now roughly 0.3 positives for every negative (about 23% of the dataset, up from about 11%). Let us see if this creates any improvement in model performance:
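
As a quick check on what `sampling_strategy = 0.3` means, the arithmetic below reproduces the new positive count from the output above. It treats the float as a target minority-to-majority ratio, which is how imbalanced-learn documents it for binary problems.

```python
n_neg = 35665          # majority class count from the output above
n_pos_before = 4533    # minority class count before resampling
sampling_strategy = 0.3

# The float is a target minority/majority ratio, not a share of the dataset
n_pos_after = int(sampling_strategy * n_neg)
share_before = n_pos_before / (n_pos_before + n_neg)
share_after = n_pos_after / (n_pos_after + n_neg)

print(n_pos_after)                   # 10699
print(round(share_before * 100, 2))  # 11.28
print(round(share_after * 100, 2))   # 23.08
```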

### Retraining the Model post-SMOTE

In [50]:
X_bal_train, X_bal_test, y_bal_train, y_bal_test = train_test_split(X_bal, y_bal, random_state=0, train_size = .70)

In [51]:
# converting balanced training data into tensors

X_bal_train_tensor = df_to_tensor(X_bal_train, outcome = False)
y_bal_train_tensor = df_to_tensor(y_bal_train, outcome = True)

In [52]:
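# note: `model` and `optimizer` are reused from above, so training continues from the already-trained weights
for epoch in range(n_epochs):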
    for i in range(0, len(X_bal_train_tensor), batch_size):
        Xbatch = X_bal_train_tensor[i:i+batch_size]
        y_pred = model(Xbatch)
        ybatch = y_bal_train_tensor[i:i+batch_size]
        loss = loss_fn(y_pred, ybatch)
        optimizer.zero_grad()
        loss.backward() 
        optimizer.step()
    print(f'Finished epoch {epoch}, latest loss {loss}')

Finished epoch 0, latest loss 0.5528285503387451
Finished epoch 1, latest loss 0.4085456430912018
Finished epoch 2, latest loss 0.37239256501197815
Finished epoch 3, latest loss 0.2951275110244751
Finished epoch 4, latest loss 0.26278969645500183
Finished epoch 5, latest loss 0.24875782430171967
Finished epoch 6, latest loss 0.22522373497486115
Finished epoch 7, latest loss 0.19584278762340546
Finished epoch 8, latest loss 0.15925084054470062
Finished epoch 9, latest loss 0.14214853942394257
Finished epoch 10, latest loss 0.12505365908145905
Finished epoch 11, latest loss 0.10558471828699112
Finished epoch 12, latest loss 0.11162086576223373
Finished epoch 13, latest loss 0.10797242075204849
Finished epoch 14, latest loss 0.09547233581542969
Finished epoch 15, latest loss 0.07443880289793015
Finished epoch 16, latest loss 0.07245483249425888
Finished epoch 17, latest loss 0.08304374665021896
Finished epoch 18, latest loss 0.08149172365665436
Finished epoch 19, latest loss 0.07781846821
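
If the goal is a like-for-like comparison between the original and SMOTE-balanced runs, one option is to start the second run from a freshly initialized model and optimizer. The helper `build_model` and the toy feature count below are hypothetical, shown only as a sketch of that approach.

```python
import torch
from torch import nn, optim

def build_model(n_inputs, n_hidden):
    """Hypothetical helper: same layer layout as the model defined earlier."""
    return nn.Sequential(
        nn.Linear(n_inputs, n_hidden),
        nn.ReLU(),
        nn.Linear(n_hidden, n_inputs),
        nn.ReLU(),
        nn.Linear(n_inputs, 1),
        nn.Sigmoid(),
    )

torch.manual_seed(0)                       # reproducible initial weights
n_i_demo = 8                               # made-up feature count
fresh_model = build_model(n_i_demo, int(1.3 * n_i_demo))
fresh_optimizer = optim.Adam(fresh_model.parameters(), lr=0.001)

# Sanity check: a batch of 4 rows gives 4 probabilities
with torch.no_grad():
    out = fresh_model(torch.zeros(4, n_i_demo))
print(out.shape)   # torch.Size([4, 1])
```

Training `fresh_model` on the balanced tensors would then separate the effect of SMOTE from the effect of simply training the original model for more epochs.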

### Training Performance w/ SMOTE

In [53]:
with torch.no_grad():
    y_bal_train_pred = model(X_bal_train_tensor)

bal_train_classifications = y_bal_train_pred.round()
    
bal_train_accuracy = (bal_train_classifications == y_bal_train_tensor).float().mean()
print(f"Training Accuracy: {round(float(bal_train_accuracy)*100, 2)}%")

Training Accuracy: 92.73%


We observe a similar level of accuracy to our previous attempts, but given the lower loss during training, I expect this to translate into better performance metrics.

In [86]:
bal_train_metrics = calculate_pos_neg(y_bal_train_pred, y_bal_train_tensor)
bal_train_tp = bal_train_metrics[0]
bal_train_fp = bal_train_metrics[1]
bal_train_tn = bal_train_metrics[2]
bal_train_fn = bal_train_metrics[3]
            
bal_train_precision = bal_train_tp / (bal_train_tp + bal_train_fp)
bal_train_recall = bal_train_tp / (bal_train_tp + bal_train_fn)
bal_train_f1 = (2 * bal_train_precision * bal_train_recall) / (bal_train_precision + bal_train_recall)
bal_train_fpr = bal_train_fp / (bal_train_fp + bal_train_tn) # false positive rate = 1 - specificity
bal_train_tnr = bal_train_tn / (bal_train_tn + bal_train_fp) # true negative rate, also known as specificity
bal_train_fnr = bal_train_fn / (bal_train_fn + bal_train_tp)

print(f"Balanced Training Precision: {round(bal_train_precision * 100, 2)}%")
print(f"Balanced Training Recall: {round(bal_train_recall * 100, 2)}%")
print(f"Balanced Training F1 Score: {round(bal_train_f1, 2)}")
print(f"Balanced Training False Positive Rate (FPR): {round(bal_train_fpr * 100, 2)}%")
print(f"Balanced Training True Negative Rate (TNR): {round(bal_train_tnr * 100, 2)}%")
print(f"Balanced Training False Negative Rate (FNR): {round(bal_train_fnr * 100, 2)}%")

Balanced Training Precision: 89.6%
Balanced Training Recall: 77.42%
Balanced Training F1 Score: 0.83
Balanced Training False Positive Rate (FPR): 2.69%
Balanced Training True Negative Rate (TNR): 97.31%
Balanced Training False Negative Rate (FNR): 22.58%


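We see a dramatic improvement in our training metrics! Keep in mind that these are measured on the SMOTE-balanced training data, which includes synthetic positives, so they still need to be confirmed on held-out data before we can call this a very strong positive predictor.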