## Imports

In [2]:
import pandas as pd 
import numpy as np
from sklearn.model_selection import train_test_split
import torch 
from torch import nn
from torch import optim 
from imblearn import over_sampling

pd.options.mode.chained_assignment = None

## Data Cleaning

In [4]:
df = pd.read_csv("./data/bank-additional-full.csv", delimiter = ";")
df.head(5)

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


The description of the columns can be found at the UCI Machine Learning Repository, linked [here](https://archive.ics.uci.edu/dataset/222/bank+marketing)

In [6]:
df.columns

Index(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx',
       'cons.conf.idx', 'euribor3m', 'nr.employed', 'y'],
      dtype='object')

For this analysis, I ignore the macroeconomic indicators (`emp.var.rate`, `cons.price.idx`, `cons.conf.idx`, `euribor3m`, and `nr.employed`). This rests on a simplifying assumption: most people contacted in this campaign are not making their decision based on economic factors like the consumer price index.

In [8]:
df = df.drop(columns = ['emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed'])
df.head(5)

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,307,1,999,0,nonexistent,no


In [9]:
df['default'].value_counts()

no         32588
unknown     8597
yes            3
Name: default, dtype: int64

In [10]:
df['housing'].value_counts()

yes        21576
no         18622
unknown      990
Name: housing, dtype: int64

In [11]:
df['loan'].value_counts()

no         33950
yes         6248
unknown      990
Name: loan, dtype: int64

As the counts above show, only 3 customers are recorded as having credit in default, while 8,597 are unknown. Rather than deal with the unknown values, we assume that the vast majority of customers have not defaulted, so this column carries almost no predictive information and we drop it.

We also drop the rows where `housing` or `loan` is unknown. Each column has only 990 unknowns, so we still have a sufficient sample size without them.

In [13]:
df = df.drop(columns = ['default'])

df = df[(df['housing'] != 'unknown') & (df['loan'] != 'unknown')]

In order to run our neural network, we have to make all of our variables numeric. We do so by turning any categorical variables into dummy variables (1 = variable is true, 0 = false), and, since the binary variables are stored as yes/no variables, we will convert those into 1/0 values as well. 

In [15]:
# converting categorical variables into dummies

df_dummy = pd.get_dummies(df, columns = ['job', 'marital', 'education', 'contact', 'month', 'day_of_week', 'poutcome'], prefix_sep = ': ')

# turning y/n variables into 1/0

df_dummy['housing'] = np.where(df['housing'].values == 'yes', 1, 0)
df_dummy['loan'] = np.where(df['loan'].values == 'yes', 1, 0)
df_dummy['y'] = np.where(df['y'].values == 'yes', 1, 0)

df_dummy.head(5)

Unnamed: 0,age,housing,loan,duration,campaign,pdays,previous,y,job: admin.,job: blue-collar,...,month: oct,month: sep,day_of_week: fri,day_of_week: mon,day_of_week: thu,day_of_week: tue,day_of_week: wed,poutcome: failure,poutcome: nonexistent,poutcome: success
0,56,0,0,261,1,999,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
1,57,0,0,149,1,999,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
2,37,1,0,226,1,999,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
3,40,0,0,151,1,999,0,0,1,0,...,0,0,0,1,0,0,0,0,1,0
4,56,0,1,307,1,999,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0


## Building the Basic Neural Network

For any model building, we want to ensure our model is not trained on the entirety of our dataset. Otherwise, we would have no held-out data to test the model on, and no way to detect overfitting. I have decided to use Scikit-Learn to split the dataset into 70% training data and 30% testing data. An 80-20 split is another popular option, but I prefer a larger test set because it gives a more reliable estimate of how the model performs on unseen data.

We make sure our outcome variable, y (whether or not the client subscribed to the term deposit, the goal of the campaign), is separated from our other predictor variables.  

In [18]:
X = df_dummy[df_dummy.columns[~df_dummy.columns.isin(['y'])]]
y = df_dummy['y']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, train_size = .70)

### Training the Sequential Neural Network

In [20]:
def df_to_tensor(df, outcome):
    ''' Converts a DataFrame or Series into a numpy array and then a Tensor with dtype float32 to be used in a PyTorch model 
        
        Inputs: 
            df (DataFrame or Series): Input data to be converted
            outcome (Boolean): Whether or not df is an outcome vector; if it is, it is reshaped into a column of shape (N, 1) to match the model output
        
        Outputs:
            as_tensor (Tensor): dtype float32 Tensor; float32 is the default parameter dtype of torch.nn layers
            
    '''
    
    df_np = df.to_numpy(dtype = np.float32) # convert to a float32 numpy array so we can use torch.from_numpy
    if outcome: 
        return torch.from_numpy(df_np).reshape(-1, 1) # column of shape (N, 1), the same shape as the model's predictions
    return torch.from_numpy(df_np)

In [21]:
# converting training data into tensors

X_train_tensor = df_to_tensor(X_train, outcome = False)
y_train_tensor = df_to_tensor(y_train, outcome = True)

For our model, we will be using PyTorch's `nn.Sequential` container, which chains layers so that the output of each layer feeds the next. Placing a non-linear activation function between the linear layers is what allows the network to learn non-linear relationships between the features and the outcome. 

For this model, we use `nn.Linear` modules for our two hidden layers, each followed by the ReLU activation function. Our output layer is a single linear unit followed by a Sigmoid function, which maps the result to a value between 0 and 1 that we can read as the probability of the positive class. 

With regards to the number of neurons we use for our hidden layers, many rules of thumb have been proposed. Some have suggested that the number of hidden neurons should be about $\frac{2}{3}$ the size of the input layer plus the size of the output layer; others have suggested fewer than 2x the size of the input layer. We will use K-fold Cross Validation to determine which option works best, including a third possibility roughly midway between the two (1.3x the input layer). 
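
To make these rules of thumb concrete, here is a small sketch that turns each one into a candidate hidden-layer width. The helper name `hidden_size_candidates` and the input width of 40 are illustrative assumptions, not values taken from the dataset, and the two-thirds rule is read here as two-thirds of the input layer plus the output layer.

```python
def hidden_size_candidates(n_inputs, n_outputs=1):
    """Return candidate hidden-layer widths from common rules of thumb.

    Hypothetical helper, for illustration only.
    """
    return {
        "two_thirds": int(2 / 3 * n_inputs) + n_outputs,  # 2/3 of inputs, plus outputs
        "midpoint_1.3x": int(1.3 * n_inputs),             # the middle-of-the-road option
        "upper_2x": 2 * n_inputs,                         # upper bound: twice the inputs
    }

# 40 input features is an assumed number, not the notebook's actual N_i
candidates = hidden_size_candidates(40)
print(candidates)
```

Each candidate could then be passed to the model-building function as a multiplier (candidate width divided by the number of inputs).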

In [23]:
# calculating the number of output neurons for our hidden layers 

N_i = len(X_train.columns) # number of features represents the number of the nodes in the input layer

N_h = int(1.3 * N_i)

In [159]:
# Building the model using PyTorch

def sequential_NN(multiplier, N_i = len(X.columns)):
    ''' Create a Sequential NN from PyTorch with two hidden layers (Linear + ReLU) and a sigmoid output
    
        Inputs:
            multiplier (float): what we multiply the number of input neurons by to get the number of neurons for the first hidden layer
            N_i (int): The number of input neurons, which we take as the number of features in the model 
        
        Outputs:
            model (nn.Sequential): Sequential NN whose first hidden layer has int(N_i * multiplier) neurons and whose second has N_i neurons
            
    '''
    
    N_h = int(N_i * multiplier) 
                                 
    model = nn.Sequential(
        nn.Linear(N_i, N_h), # first hidden layer
        nn.ReLU(),
        nn.Linear(N_h, N_i), # second hidden layer
        nn.ReLU(),
        nn.Linear(N_i, 1), # output layer
        nn.Sigmoid())
    
    return model 

Since we are training a binary classifier (either the campaign is successful or it isn't), we use Binary Cross Entropy loss, which penalizes the model according to how far each predicted probability is from the true 0/1 label. We use the Adam optimizer because it tends to converge quickly and copes well with sparse gradients, which we may encounter as a result of having a lot of dummy variables. 
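
As a rough illustration of what Binary Cross Entropy measures, the sketch below computes it by hand for two made-up pairs of predictions. The function name and the probabilities are assumptions for illustration. It averages over the samples, which matches the default mean reduction of `nn.BCELoss`.

```python
import math

def binary_cross_entropy(probabilities, labels):
    """Mean binary cross-entropy over paired predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        # only one of the two terms is non-zero for a 0/1 label
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

labels = [1, 0]
confident_right = binary_cross_entropy([0.9, 0.1], labels)  # predictions close to the labels
confident_wrong = binary_cross_entropy([0.1, 0.9], labels)  # predictions far from the labels
print(confident_right, confident_wrong)
```

The loss grows sharply as a prediction moves confidently toward the wrong label, which is why the second value is much larger than the first.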

To train the model, we use 100 epochs, where an epoch is one full pass of the training data through the model; 100 is a common starting point. The batch size is the number of samples run through the model before each parameter update. As we are using mini-batch gradient descent, powers of 2 are common, and 32 is a widely used default, which we adopt here for our ~40,000 samples. We keep the learning rate at 0.001, which is the default for PyTorch's Adam optimizer and a common starting point. 
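
The relationship between dataset size, batch size, and the number of parameter updates per epoch can be sketched as follows. The helper `batch_slices` and the sample count of 1,000 are hypothetical; the slicing mirrors the `range(0, n, batch_size)` pattern commonly used for mini-batches.

```python
import math

def batch_slices(n_samples, batch_size):
    """Yield (start, stop) index pairs that cover n_samples in order."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

n_samples = 1000  # hypothetical training-set size
slices = list(batch_slices(n_samples, 32))
updates_per_epoch = len(slices)  # one optimizer step per mini-batch

print(updates_per_epoch, slices[-1])
```

The last mini-batch is smaller than the rest whenever the sample count is not a multiple of the batch size.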

In [165]:
def train_sgd(model = None, X = None, y = None, n_epochs = 100, batch_size = 32, alpha = 0.001):
    '''Trains a model through mini-batch gradient descent using the Adam optimizer
    
        Inputs: 
            model (nn.Module): A PyTorch Neural Network; defaults to the global `neural`
            X (tensor): training data feature tensor; defaults to the global X_train_tensor
            y (tensor): training data outcome tensor; defaults to the global y_train_tensor
            n_epochs (int): Number of full passes over the training data
            batch_size (int): Number of samples processed in a single parameter update
            alpha (float): learning rate for the Adam optimizer
        
    '''
    
    # fall back to the notebook-level objects so that a bare train_sgd() call still works
    model = neural if model is None else model
    X = X_train_tensor if X is None else X
    y = y_train_tensor if y is None else y
    loss_fn = nn.BCELoss() # Binary Cross-Entropy loss
    optimizer = optim.Adam(model.parameters(), lr=alpha)

    for epoch in range(n_epochs):
        for i in range(0, len(X), batch_size):
            Xbatch = X[i:i+batch_size]
            ybatch = y[i:i+batch_size]
            y_pred = model(Xbatch)
            loss = loss_fn(y_pred, ybatch)
            optimizer.zero_grad() # clears gradients from the previous batch so they do not accumulate
            loss.backward() # computes the gradient for the given batch 
            optimizer.step() # updates parameters based on gradient calculation 
        print(f'Finished epoch {epoch}, latest loss {loss.item()}')

In [167]:
neural = sequential_NN(1.3)
train_sgd()

Finished epoch 0, latest loss 0.2160571813583374
Finished epoch 1, latest loss 0.21437910199165344
Finished epoch 2, latest loss 0.19906504452228546
Finished epoch 3, latest loss 0.21520960330963135
Finished epoch 4, latest loss 0.22045573592185974
Finished epoch 5, latest loss 0.21404850482940674
Finished epoch 6, latest loss 0.20876367390155792
Finished epoch 7, latest loss 0.20605015754699707
Finished epoch 8, latest loss 0.20433136820793152
Finished epoch 9, latest loss 0.19889554381370544
Finished epoch 10, latest loss 0.19855527579784393
Finished epoch 11, latest loss 0.20169496536254883
Finished epoch 12, latest loss 0.19745559990406036
Finished epoch 13, latest loss 0.19541020691394806
Finished epoch 14, latest loss 0.19591735303401947
Finished epoch 15, latest loss 0.19069263339042664
Finished epoch 16, latest loss 0.19934764504432678
Finished epoch 17, latest loss 0.18870408833026886
Finished epoch 18, latest loss 0.19255760312080383
Finished epoch 19, latest loss 0.194712117

### Training Performance

We compute the training accuracy of our model, which is the percentage of correctly labelled outcomes, to determine how well it classifies our desired outcome.

In [137]:
def classify_data(X):
    ''' Runs given data through our model, returning the predicted classes
    
        Inputs:
            X (tensor): tensor containing input data for the model 
        
        Outputs:
            classifications (tensor): tensor containing the predicted classes for X 
        
    '''
    
    with torch.no_grad(): # inference only, so we skip gradient tracking
        predicted_probabilities = neural(X) # `neural` is the model trained above

    classifications = predicted_probabilities.round() # probabilities >= 0.5 get rounded to 1, under to 0 
    return classifications

def accuracy(classifications, y):
    ''' Calculate the accuracy, or the percentage of correct predictions, of the model 
    
        Inputs:
            classifications (tensor): tensor of predicted classes (output of classify_data)
            y (tensor): tensor of true classes
        
        Outputs: 
            accuracy (float): accuracy of the predictions as a percentage
        
    '''
    accuracy = (classifications == y).float().mean()
    return round(float(accuracy)*100, 2) # have to convert accuracy from Tensor to float to convert as percentage

In [139]:
train_classifications = classify_data(X_train_tensor)
    
train_accuracy = accuracy(train_classifications, y_train_tensor)
print(f"Training Accuracy: {train_accuracy}%") 

Training Accuracy: 91.45%


This is a very encouraging training accuracy! However, a number this high could be the result of overfitting to our training data, or of a class imbalance.

Let us examine the class imbalance possibility: 

In [143]:
num_pos = sum(y) # we are dealing with 0s and 1s!
num_neg = len(y) - sum(y)
print(f"Number of Positives: {num_pos}")
print(f"Number of Negatives: {num_neg}")
print(f"Ratio of Positives to Negatives: {round(num_pos / num_neg, 2)}")

Number of Positives: 4533
Number of Negatives: 35665
Ratio of Positives to Negatives: 0.13


We do have a rather large imbalance in class size, which could be artificially inflating our accuracy measure. Since the imbalance is large, but neither class is too small to learn from, we will compute some other metrics to gain more insight into our model's performance on the training data.
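
To see how much the imbalance alone can inflate accuracy, consider a classifier that always predicts the majority class. The sketch below uses the class counts printed above, which cover the whole dataset rather than only the training split, so treat the result as a rough reference point.

```python
num_pos = 4533   # positives reported above
num_neg = 35665  # negatives reported above

# accuracy of a classifier that labels every client as a negative (0)
baseline_accuracy = num_neg / (num_pos + num_neg)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2%}")
```

Any accuracy figure should be read against this baseline: a model only adds value to the extent that it beats always predicting "no".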

In [96]:
# defining functions to calculate performance metrics 

def calculate_pos_neg(predictions, original):
    ''' Calculates TP, FP, TN, FN of given Tensors
    
        Inputs:
            predictions (tensor): tensor of predicted class probabilities
            original (tensor): original outcome/class assignments
        
        Outputs:
            results (list): List of TP, FP, TN, and FN 
    '''
    classifications = predictions.round() # turns probabilities into 0/1 classifications 
    
    predicted_pos = classifications == 1
    actual_pos = original == 1
    
    tp = int((predicted_pos & actual_pos).sum()) # predicted 1, actually 1
    fp = int((predicted_pos & ~actual_pos).sum()) # predicted 1, actually 0
    tn = int((~predicted_pos & ~actual_pos).sum()) # predicted 0, actually 0
    fn = int((~predicted_pos & actual_pos).sum()) # predicted 0, actually 1
                
    return [tp, fp, tn, fn]

def calculate_metrics(frequencies):
    ''' Calculates Precision, Recall, F1 Score, False Positive Rate, True Negative Rate, and False Negative Rate
    
        Inputs:
            frequencies (list): List consisting of TP, FP, TN, FN (all integers, outputted from calculate_pos_neg)
        
        Outputs:
            results (list): List of Precision, Recall, F1 Score, False Positive Rate, True Negative Rate, and False Negative Rate
        
    '''
    
    tp = frequencies[0]
    fp = frequencies[1]
    tn = frequencies[2]
    fn = frequencies[3]

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = (2 * precision * recall) / (precision + recall)
    fpr = fp / (fp + tn) # share of actual negatives wrongly flagged as positive
    tnr = 1 - fpr # TNR (also called Specificity) and FPR are complements 
    fnr = 1 - recall # FNR and Recall are complements 
    
    return [precision, recall, f1, fpr, tnr, fnr]

In [100]:
# train_classifications already holds rounded 0/1 predictions, so the rounding inside calculate_pos_neg leaves them unchanged
training_metrics = calculate_metrics(calculate_pos_neg(train_classifications, y_train_tensor))


print(f"Training Precision: {round(training_metrics[0] * 100, 2)}%")
print(f"Training Recall: {round(training_metrics[1] * 100, 2)}%")
print(f"Training F1 Score: {round(training_metrics[2], 2)}")
print(f"Training False Positive Rate (FPR): {round(training_metrics[3] * 100, 2)}%")
print(f"Training True Negative Rate (TNR): {round(training_metrics[4] * 100, 2)}%")
print(f"Training False Negative Rate (FNR): {round(training_metrics[5] * 100, 2)}%")

Training Precision: 64.7%
Training Recall: 48.79%
Training F1 Score: 0.56
Training False Positive Rate (FPR): 3.35%
Training True Negative Rate (TNR): 96.65%
Training False Negative Rate (FNR): 51.21%


As these other metrics show, accuracy alone paints too rosy a picture. The model is exceptionally good at identifying negatives correctly (TNR of 96.65%), but it finds fewer than half of the actual positives (Recall of 48.79%), and only about 65% of the positives it predicts are correct (Precision). 

Our F1 score, which is the harmonic mean of precision and recall, suggests that our model is average at best at predicting positives.
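
As a quick check on how the F1 score relates to the other two numbers, the sketch below recomputes it from the training precision and recall reported above. The function name `f1_score` is just a local helper for this illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

precision, recall = 0.647, 0.4879  # training precision and recall reported above

f1 = f1_score(precision, recall)
arithmetic_mean = (precision + recall) / 2

print(f"F1: {f1:.2f}")
print(f"Arithmetic mean, for comparison: {arithmetic_mean:.4f}")
```

The harmonic mean is never larger than the arithmetic mean and is pulled toward the smaller of the two values, so a weak recall drags the F1 score down even when precision is higher.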

### Fitting Model on Test Data

To address the overfitting concern, let us use our test data to assess the model performance:

In [37]:
# Converting test data to Tensors

X_test_tensor = df_to_tensor(X_test, outcome = False)
y_test_tensor = df_to_tensor(y_test, outcome = True)

In [38]:
# computing test accuracy 

test_classifications = classify_data(X_test_tensor)
    
test_accuracy = accuracy(test_classifications, y_test_tensor)
print(f"Test Accuracy: {test_accuracy}%") 

Test Accuracy: 90.78%


We have a similarly high test data accuracy, so let's see if our imbalance problems carry over: 

In [102]:
# test_classifications already holds rounded 0/1 predictions, so the rounding inside calculate_pos_neg leaves them unchanged
testing_metrics = calculate_metrics(calculate_pos_neg(test_classifications, y_test_tensor))


print(f"Testing Precision: {round(testing_metrics[0] * 100, 2)}%")
print(f"Testing Recall: {round(testing_metrics[1] * 100, 2)}%")
print(f"Testing F1 Score: {round(testing_metrics[2], 2)}")
print(f"Testing False Positive Rate (FPR): {round(testing_metrics[3] * 100, 2)}%")
print(f"Testing True Negative Rate (TNR): {round(testing_metrics[4] * 100, 2)}%")
print(f"Testing False Negative Rate (FNR): {round(testing_metrics[5] * 100, 2)}%")

Testing Precision: 63.75%
Testing Recall: 45.7%
Testing F1 Score: 0.53
Testing False Positive Rate (FPR): 3.37%
Testing True Negative Rate (TNR): 96.63%
Testing False Negative Rate (FNR): 54.3%


The test metrics are roughly the same as the training metrics, so overfitting does not appear to be the main problem. 

Is there a way for us to improve model performance?

## Solution 1: Fixing the Class Imbalance

### SMOTE

SMOTE, or Synthetic Minority Oversampling Technique, is a class imbalance correction technique. It creates new, synthetic minority class datapoints by interpolating between existing minority samples and their nearest minority class neighbors, rather than duplicating minority samples (random oversampling) or discarding majority samples (random undersampling). 

There are some limitations, like trouble translating to higher dimensions, or the possibility of adding noise to the data because it does not consider the majority class when creating synthetic minority class data points.
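
The core idea behind SMOTE can be sketched in a few lines: a synthetic point is placed at a random position on the line segment between a minority sample and one of its minority class neighbors. This is a simplified illustration, not imbalanced-learn's implementation. The feature values are hypothetical, and the neighbor is given directly instead of being found through a nearest-neighbor search.

```python
import random

def synthesize(sample, neighbor, rng):
    """Create one synthetic point on the segment between two minority points."""
    gap = rng.random()  # uniform in [0, 1): how far to move toward the neighbor
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(0)             # seeded so the sketch is repeatable
minority_point = [30.0, 200.0]     # hypothetical (age, duration) of a positive
minority_neighbor = [40.0, 300.0]  # hypothetical nearby positive

new_point = synthesize(minority_point, minority_neighbor, rng)
print(new_point)
```

Because the new point depends only on minority samples, nothing stops it from landing in a region dominated by the majority class, which is one source of the noise mentioned above.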

First, let's generate the new, more balanced classes:

In [47]:
oversample = over_sampling.SMOTE(sampling_strategy = 0.3)
X_bal, y_bal = oversample.fit_resample(X, y)

bal_num_pos = sum(y_bal)
bal_num_neg = len(y_bal) - bal_num_pos

print(f"Original Number of Positives: {num_pos} vs. New Number of Positives: {bal_num_pos}")
print(f"Original Number of Negatives: {num_neg} vs. New Number of Negatives: {bal_num_neg}")

Original Number of Positives: 4533 vs. New Number of Positives: 10699
Original Number of Negatives: 35665 vs. New Number of Negatives: 35665


All we have done is inflate the number of positives so that there are now roughly 0.3 positives for every negative, which means positives make up about 23% of our dataset. Let us see if this creates any improvement in model performance: 
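
In imbalanced-learn, a float `sampling_strategy` is the desired ratio of minority to majority samples after resampling, not the minority share of the whole dataset. The arithmetic below, using the counts printed above, shows the difference between the two.

```python
num_neg = 35665      # negatives, unchanged by SMOTE (reported above)
bal_num_pos = 10699  # positives after SMOTE (reported above)

ratio_to_majority = bal_num_pos / num_neg                  # what sampling_strategy controls
share_of_dataset = bal_num_pos / (bal_num_pos + num_neg)   # positives as a fraction of all rows

print(f"Positives per negative: {ratio_to_majority:.2f}")
print(f"Positives as share of dataset: {share_of_dataset:.2%}")
```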

### Training Model Post-SMOTE

In [50]:
X_bal_train, X_bal_test, y_bal_train, y_bal_test = train_test_split(X_bal, y_bal, random_state=0, train_size = .70)

In [51]:
# converting balanced training data into tensors

X_bal_train_tensor = df_to_tensor(X_bal_train, outcome = False)
y_bal_train_tensor = df_to_tensor(y_bal_train, outcome = True)

In [52]:
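# train_sgd() trains the global `neural` on X_train_tensor / y_train_tensor by default,
# so build a fresh model and point those names at the balanced training data
neural = sequential_NN(1.3)
X_train_tensor, y_train_tensor = X_bal_train_tensor, y_bal_train_tensor
train_sgd()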

Finished epoch 0, latest loss 0.5528285503387451
Finished epoch 1, latest loss 0.4085456430912018
Finished epoch 2, latest loss 0.37239256501197815
Finished epoch 3, latest loss 0.2951275110244751
Finished epoch 4, latest loss 0.26278969645500183
Finished epoch 5, latest loss 0.24875782430171967
Finished epoch 6, latest loss 0.22522373497486115
Finished epoch 7, latest loss 0.19584278762340546
Finished epoch 8, latest loss 0.15925084054470062
Finished epoch 9, latest loss 0.14214853942394257
Finished epoch 10, latest loss 0.12505365908145905
Finished epoch 11, latest loss 0.10558471828699112
Finished epoch 12, latest loss 0.11162086576223373
Finished epoch 13, latest loss 0.10797242075204849
Finished epoch 14, latest loss 0.09547233581542969
Finished epoch 15, latest loss 0.07443880289793015
Finished epoch 16, latest loss 0.07245483249425888
Finished epoch 17, latest loss 0.08304374665021896
Finished epoch 18, latest loss 0.08149172365665436
Finished epoch 19, latest loss 0.07781846821

### Training Performance w/ SMOTE

In [53]:
bal_train_classifications = classify_data(X_bal_train_tensor)
    
bal_train_accuracy = accuracy(bal_train_classifications, y_bal_train_tensor)
print(f"Balanced Training Accuracy: {bal_train_accuracy}%") 

Balanced Training Accuracy: 92.73%


We observe a similar level of accuracy to our previous attempts, but given the lower loss during training, I expect better performance on the other metrics.

In [105]:
# bal_train_classifications already holds rounded 0/1 predictions, so the rounding inside calculate_pos_neg leaves them unchanged
bal_training_metrics = calculate_metrics(calculate_pos_neg(bal_train_classifications, y_bal_train_tensor))


print(f"Balanced Training Precision: {round(bal_training_metrics[0] * 100, 2)}%")
print(f"Balanced Training Recall: {round(bal_training_metrics[1] * 100, 2)}%")
print(f"Balanced Training F1 Score: {round(bal_training_metrics[2], 2)}")
print(f"Balanced Training False Positive Rate (FPR): {round(bal_training_metrics[3] * 100, 2)}%")
print(f"Balanced Training True Negative Rate (TNR): {round(bal_training_metrics[4] * 100, 2)}%")
print(f"Balanced Training False Negative Rate (FNR): {round(bal_training_metrics[5] * 100, 2)}%")

Balanced Training Precision: 89.6%
Balanced Training Recall: 77.42%
Balanced Training F1 Score: 0.83
Balanced Training False Positive Rate (FPR): 2.69%
Balanced Training True Negative Rate (TNR): 97.31%
Balanced Training False Negative Rate (FNR): 22.58%


We see a dramatic improvement in our metrics! We have a very strong positive predictor now.

Let's make sure this carries over to our testing data:

### Testing Model Post-SMOTE

In [109]:
X_bal_test_tensor = df_to_tensor(X_bal_test, outcome = False)
y_bal_test_tensor = df_to_tensor(y_bal_test, outcome = True)

In [111]:
bal_test_classifications = classify_data(X_bal_test_tensor)
    
bal_test_accuracy = accuracy(bal_test_classifications, y_bal_test_tensor)
print(f"Balanced Test Accuracy: {bal_test_accuracy}%") 

Balanced Test Accuracy: 92.03%


We have a similarly high test accuracy, but let's see if this carries over to the other metrics:

In [114]:
# bal_test_classifications already holds rounded 0/1 predictions, so the rounding inside calculate_pos_neg leaves them unchanged
bal_testing_metrics = calculate_metrics(calculate_pos_neg(bal_test_classifications, y_bal_test_tensor))


print(f"Balanced Testing Precision: {round(bal_testing_metrics[0] * 100, 2)}%")
print(f"Balanced Testing Recall: {round(bal_testing_metrics[1] * 100, 2)}%")
print(f"Balanced Testing F1 Score: {round(bal_testing_metrics[2], 2)}")
print(f"Balanced Testing False Positive Rate (FPR): {round(bal_testing_metrics[3] * 100, 2)}%")
print(f"Balanced Testing True Negative Rate (TNR): {round(bal_testing_metrics[4] * 100, 2)}%")
print(f"Balanced Testing False Negative Rate (FNR): {round(bal_testing_metrics[5] * 100, 2)}%")

Balanced Testing Precision: 88.14%
Balanced Testing Recall: 75.74%
Balanced Testing F1 Score: 0.81
Balanced Testing False Positive Rate (FPR): 3.07%
Balanced Testing True Negative Rate (TNR): 96.93%
Balanced Testing False Negative Rate (FNR): 24.26%


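Our performance on the test set is roughly the same as on our training data, which suggests we have a strong classifier on our hands! One caveat: because SMOTE was applied before the train/test split, the test set also contains synthetic positives, so these numbers may overstate how the model would perform on real customers. 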

Can we make it even better? 