## Imports and Data Cleaning

In [1]:
import pandas as pd 
import numpy as np
from sklearn.model_selection import train_test_split
import torch 
from torch import nn
from torch import optim 

pd.options.mode.chained_assignment = None



In [2]:
df = pd.read_csv("./data/bank-additional-full.csv", delimiter = ";")
df.head(5)

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


The description of the columns can be found at the UCI Machine Learning Repository, linked [here](https://archive.ics.uci.edu/dataset/222/bank+marketing).

In [3]:
df.columns

Index(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx',
       'cons.conf.idx', 'euribor3m', 'nr.employed', 'y'],
      dtype='object')

For this particular analysis, I am going to ignore the economic indicators, making the simplifying assumption that most people contacted in this campaign are not basing their decisions on factors like the consumer price index.

In [4]:
df = df[df.columns[~df.columns.isin(['emp.var.rate', 'cons.price.idx','cons.conf.idx', 'euribor3m', 'nr.employed'])]]
df.head(5)

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,307,1,999,0,nonexistent,no
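
The column filter above can also be written with `DataFrame.drop`, which some readers may find easier to scan. The sketch below compares the two forms on a tiny made-up frame (the values and the single `econ_cols` entry are illustrative only) and checks that they give the same result.

```python
import pandas as pd

# Toy stand-in for the real data; values are for illustration only.
toy = pd.DataFrame({"age": [56, 57], "euribor3m": [4.857, 4.857], "y": ["no", "no"]})
econ_cols = ["euribor3m"]

# Form used in this notebook: boolean mask over the column index.
via_mask = toy[toy.columns[~toy.columns.isin(econ_cols)]]

# Equivalent form: drop the columns by name.
via_drop = toy.drop(columns=econ_cols)

print(via_drop.columns.tolist())
print(via_mask.equals(via_drop))
```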


In [5]:
df['default'].value_counts()

no         32588
unknown     8597
yes            3
Name: default, dtype: int64

In [6]:
df['housing'].value_counts()

yes        21576
no         18622
unknown      990
Name: housing, dtype: int64

In [7]:
df['loan'].value_counts()

no         33950
yes         6248
unknown      990
Name: loan, dtype: int64

As we can see above, very few customers are recorded as having defaulted on previous loans (3 out of 41,188). So, to avoid dealing with the unknown values, we will assume that the vast majority of people have not defaulted and that this column therefore provides little predictive information.

We will also drop the unknowns from people who have taken housing or personal loans, as we have a sufficient sample size without the unknowns.

In [8]:
df = df.drop(columns = ['default'])

df = df[(df['housing'] != 'unknown') & (df['loan'] != 'unknown')]
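
To see how much data a filter like this removes, it can help to compare row counts before and after. The sketch below applies the same mask to a small made-up frame (the rows are illustrative, not taken from the dataset).

```python
import pandas as pd

# Toy stand-in for the real data; values are made up for illustration.
toy = pd.DataFrame({
    "housing": ["yes", "no", "unknown", "yes", "no"],
    "loan":    ["no", "unknown", "unknown", "yes", "no"],
})

# Same filter as above: keep rows where neither column is 'unknown'.
mask = (toy["housing"] != "unknown") & (toy["loan"] != "unknown")
toy_clean = toy[mask]

rows_dropped = len(toy) - len(toy_clean)
print(f"Dropped {rows_dropped} of {len(toy)} rows")
```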

In order to run our neural network, we have to make all of our variables numeric. We do so by turning any categorical variables into dummy variables (1 = variable is true, 0 = false), and, since the binary variables are stored as yes/no variables, we will convert those into 1/0 values as well. 

In [9]:
# converting categorical variables into dummies

df_dummy = pd.get_dummies(df, columns = ['job', 'marital', 'education', 'contact', 'month', 'day_of_week', 'poutcome'], prefix_sep = ': ')

# turning y/n variables into 1/0

df_dummy['housing'] = np.where(df['housing'].values == 'yes', 1, 0)
df_dummy['loan'] = np.where(df['loan'].values == 'yes', 1, 0)
df_dummy['y'] = np.where(df['y'].values == 'yes', 1, 0)

df_dummy.head(5)

Unnamed: 0,age,housing,loan,duration,campaign,pdays,previous,y,job: admin.,job: blue-collar,...,month: oct,month: sep,day_of_week: fri,day_of_week: mon,day_of_week: thu,day_of_week: tue,day_of_week: wed,poutcome: failure,poutcome: nonexistent,poutcome: success
0,56,0,0,261,1,999,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
1,57,0,0,149,1,999,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
2,37,1,0,226,1,999,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
3,40,0,0,151,1,999,0,0,1,0,...,0,0,0,1,0,0,0,0,1,0
4,56,0,1,307,1,999,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
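
As a small illustration of the two encoding steps, the sketch below runs them on a three-row toy frame (made-up values). Passing `dtype=int` is an addition that is not in the cell above; it asks `pd.get_dummies` for 0/1 integers explicitly.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; values are for illustration only.
toy = pd.DataFrame({
    "marital": ["married", "single", "married"],
    "housing": ["yes", "no", "yes"],
})

# One-hot encode the categorical column; dtype=int requests 0/1 integers.
toy_dummy = pd.get_dummies(toy, columns=["marital"], prefix_sep=": ", dtype=int)

# Map the yes/no column to 1/0.
toy_dummy["housing"] = np.where(toy["housing"] == "yes", 1, 0)

print(toy_dummy)
```

On pandas 2.0 and later, `get_dummies` returns boolean columns by default, so an explicit `dtype=int` may be needed to reproduce the 0/1 output shown above and to keep the later conversion to a float tensor straightforward.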


## Building the Neural Network

When building any model, we want to hold out part of the dataset rather than train on all of it. Otherwise, we would have no unseen data to test the model on and no way to detect overfitting. I have decided to use Scikit-Learn to split the dataset into 70% training data and 30% testing data. An 80-20 split is another popular option, but I prefer to have more testing data so that the estimate of out-of-sample performance is more reliable.

We make sure our outcome variable, y (whether or not the client subscribed to the term deposit, the goal of the campaign), is separated from our other predictor variables.  

In [10]:
X = df_dummy[df_dummy.columns[~df_dummy.columns.isin(['y'])]]
y = df_dummy['y']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, train_size = .70)
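
The sketch below shows the same 70/30 split on synthetic data, with one optional variation: the `stratify` argument, which is not used in the cell above, keeps the share of positive outcomes the same in both splits. The 10% positive rate is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 samples, 10% positive class (made-up ratio).
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 100 + [0] * 900)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.70, random_state=0, stratify=y_toy
)

print(len(X_tr), len(X_te))      # sizes of the two splits
print(y_tr.mean(), y_te.mean())  # positive rate in each split
```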

Now that we have the data split up, we want to build our model and train it on the training data. We use PyTorch because of the ease of converting data into tensors, and using tensors as inputs to models. 

In [11]:
def df_to_tensor(df, outcome):
    ''' Converts a DataFrame or Series into a NumPy array and then a float32 Tensor to be used in a PyTorch model 
        
        Params: 
            df (DataFrame or Series): Input data to be converted
            outcome (bool): Whether or not df is an outcome vector; if it is, it is reshaped into a column tensor of shape (N, 1) to match the model's output
        
        Returns:
            Tensor: dtype float32 Tensor; float32 is the default parameter dtype of torch.nn layers
            
    '''
    
    df_np = df.to_numpy() # convert to numpy so we can use torch.from_numpy
    if outcome: 
        return torch.from_numpy(df_np).reshape(-1, 1).to(torch.float32) # reshape turns the 1D outcome vector into an (N, 1) column
    return torch.from_numpy(df_np).to(torch.float32)

In [12]:
# converting training data into tensors

X_train_tensor = df_to_tensor(X_train, outcome = False)
y_train_tensor = df_to_tensor(y_train, outcome = True)
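
To make the resulting shapes concrete, the sketch below repeats the same conversion steps inline on a three-row toy example (made-up values) and prints the tensor shapes.

```python
import pandas as pd
import torch

# Toy stand-ins for the predictors and the outcome; values are illustrative.
features = pd.DataFrame({"age": [56, 57, 37], "housing": [0, 0, 1]})
labels = pd.Series([0, 0, 1], name="y")

# Predictors: (n_samples, n_features) float32 tensor.
X_t = torch.from_numpy(features.to_numpy()).to(torch.float32)

# Outcome: reshape the 1D vector into an (n_samples, 1) column.
y_t = torch.from_numpy(labels.to_numpy()).reshape(-1, 1).to(torch.float32)

print(X_t.shape, X_t.dtype)
print(y_t.shape, y_t.dtype)
```

The `(n_samples, 1)` shape matters because the model's final layer produces one value per sample in that same shape, and the loss function expects predictions and targets to line up.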

For our model, we will use PyTorch's `nn.Sequential` container, which chains layers so that the output of each one feeds into the next. Stacking several layers with nonlinear activation functions between them lets the model learn relationships that a single linear layer cannot.

For this model, we use two `nn.Linear` modules as our hidden layers, each followed by the ReLU activation function. Our output layer is a third linear layer followed by a Sigmoid function, which maps the output to a value between 0 and 1 that we can use for classification.

With regard to the number of neurons we use for our hidden layers, many rules of thumb have been proposed. Some have suggested that the number of neurons should be about $\frac{2}{3}$ the size of the input layer, and others no more than 2x the input layer, so we will split the difference and use 1.3x. This is an arbitrary decision, but since there is not much literature for guidance beyond these softer suggestions, we will make do with this in lieu of extensive guess and check.

In [40]:
# calculating the number of output neurons for our hidden layers 

N_i = len(X_train.columns) # number of features represents the number of the nodes in the input layer

N_h = int(1.3 * N_i)
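
As a worked example of the arithmetic, the sketch below computes the hidden-layer size for a hypothetical input width of 50 features (a made-up number, not the actual column count here) alongside a two-thirds lower bound and a 2x upper bound for comparison.

```python
n_inputs = 50  # hypothetical feature count, for illustration only

lower = int(2 / 3 * n_inputs)  # roughly two-thirds of the input size
upper = 2 * n_inputs           # twice the input size
chosen = int(1.3 * n_inputs)   # the 1.3x multiplier used in this notebook

print(lower, chosen, upper)
```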

In [41]:
# Building the model using PyTorch

model = nn.Sequential(
    nn.Linear(N_i, N_h),
    nn.ReLU(),
    nn.Linear(N_h, N_i),
    nn.ReLU(),
    nn.Linear(N_i, 1), # output layer takes the N_i outputs of the previous layer
    nn.Sigmoid())
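
A quick way to check an architecture like this is to pass a random batch through it and inspect the output. The sketch below builds a model of the same layout with small hypothetical layer sizes (`n_in` and `n_hidden` are made up for illustration) and confirms that it returns one value per sample between 0 and 1.

```python
import torch
from torch import nn

n_in, n_hidden = 8, 10  # hypothetical sizes, for illustration only

toy_model = nn.Sequential(
    nn.Linear(n_in, n_hidden),
    nn.ReLU(),
    nn.Linear(n_hidden, n_in),
    nn.ReLU(),
    nn.Linear(n_in, 1),
    nn.Sigmoid(),
)

batch = torch.randn(4, n_in)  # 4 random samples
with torch.no_grad():         # no gradients needed for a shape check
    probs = toy_model(batch)

print(probs.shape)
print(float(probs.min()), float(probs.max()))
```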

Since we are training a binary classifier (either the campaign is successful or it isn't), we use Binary Cross Entropy loss, which penalizes each predicted probability according to how far it is from the true 0/1 label. We use the Adam optimizer because it tends to converge quickly and copes well with sparse gradients, which we may encounter as a result of having a lot of dummy variables.
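
To make the loss concrete, the sketch below computes Binary Cross Entropy by hand on three made-up predictions and compares the result with `nn.BCELoss`. The labels and probabilities are illustrative only.

```python
import torch
from torch import nn

y_true = torch.tensor([[1.0], [0.0], [1.0]])  # made-up labels
y_prob = torch.tensor([[0.9], [0.2], [0.6]])  # made-up predicted probabilities

# BCE = -mean( y * log(p) + (1 - y) * log(1 - p) )
manual = -(y_true * torch.log(y_prob) + (1 - y_true) * torch.log(1 - y_prob)).mean()

# PyTorch's built-in version (mean reduction by default).
builtin = nn.BCELoss()(y_prob, y_true)

print(float(manual), float(builtin))
```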

As for training, we use 100 epochs, where an epoch is one full pass of the training data through the model. 100 is a conventional choice rather than a tuned value. The batch size is the number of samples run through the model before each parameter update, since we are using mini-batch gradient descent. Powers of 2 are common, and 32 is often recommended as an upper limit for batch size. We will use this upper limit as we are dealing with a high number of samples (N = ~40,000).

In [42]:
loss_fn = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

n_epochs = 100
batch_size = 32

for epoch in range(n_epochs):
    for i in range(0, len(X_train_tensor), batch_size):
        Xbatch = X_train_tensor[i:i+batch_size]
        y_pred = model(Xbatch)
        ybatch = y_train_tensor[i:i+batch_size]
        loss = loss_fn(y_pred, ybatch)
        optimizer.zero_grad() # clears gradients from the previous batch so they do not accumulate 
        loss.backward() # computes the gradient for the given batch 
        optimizer.step() # updates parameters based on gradient calculation 
    print(f'Finished epoch {epoch}, latest loss {loss}')

Finished epoch 0, latest loss 0.20889286696910858
Finished epoch 1, latest loss 0.19902357459068298
Finished epoch 2, latest loss 0.2047332525253296
Finished epoch 3, latest loss 0.21192559599876404
Finished epoch 4, latest loss 0.20794029533863068
Finished epoch 5, latest loss 0.19875772297382355
Finished epoch 6, latest loss 0.19160299003124237
Finished epoch 7, latest loss 0.19278623163700104
Finished epoch 8, latest loss 0.18989558517932892
Finished epoch 9, latest loss 0.18587341904640198
Finished epoch 10, latest loss 0.18531538546085358
Finished epoch 11, latest loss 0.1849142163991928
Finished epoch 12, latest loss 0.19510146975517273
Finished epoch 13, latest loss 0.18955069780349731
Finished epoch 14, latest loss 0.1840510219335556
Finished epoch 15, latest loss 0.18330061435699463
Finished epoch 16, latest loss 0.1837574541568756
Finished epoch 17, latest loss 0.18175312876701355
Finished epoch 18, latest loss 0.1751476228237152
Finished epoch 19, latest loss 0.1773135811090
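
The loop above slices the training tensors in the same order every epoch. A common variation, shown here only as a sketch on synthetic data, is to let `torch.utils.data.DataLoader` do the batching and reshuffle the samples each epoch. The tensor sizes are made up, and the training step itself is left as a comment.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data: 100 samples, 5 features.
X_toy = torch.randn(100, 5)
y_toy = torch.randint(0, 2, (100, 1)).float()

batch_size = 32
loader = DataLoader(TensorDataset(X_toy, y_toy), batch_size=batch_size, shuffle=True)

# Parameter updates per epoch = ceil(n_samples / batch_size).
batches_per_epoch = math.ceil(len(X_toy) / batch_size)
print(batches_per_epoch, len(loader))

for Xbatch, ybatch in loader:
    pass  # the forward pass, loss, backward pass and optimizer step would go here

print(Xbatch.shape)  # the last batch holds the leftover samples
```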

We compute the training accuracy of our model, which is the percentage of correctly labelled outcomes, to determine how well it classifies our desired outcome.

In [43]:
with torch.no_grad(): # disable gradient tracking; we only need predictions, not parameter updates
    y_pred_train = model(X_train_tensor)

train_accuracy = (y_pred_train.round() == y_train_tensor).float().mean()
print(f"Training Accuracy: {round(float(train_accuracy)*100, 2)}%") # have to convert accuracy from Tensor to float

Training Accuracy: 91.36%


This is a very encouraging training accuracy! However, it is quite high and could be a result of overfitting to our training data. 

To ensure this isn't the case, we will use our test data and see how the model performs on this set of data: 

In [44]:
# Converting test data to Tensors

X_test_tensor = df_to_tensor(X_test, outcome = False)
y_test_tensor = df_to_tensor(y_test, outcome = True)

In [45]:
# computing test accuracy 

with torch.no_grad():
    y_pred_test = model(X_test_tensor)

test_accuracy = (y_pred_test.round() == y_test_tensor).float().mean()
print(f"Test Accuracy: {round(float(test_accuracy)*100, 2)}%")

Test Accuracy: 90.94%


We have a similarly high test accuracy, suggesting that the model is not overfitting. 
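
Accuracy can be hard to interpret on its own when one outcome is much more common than the other, so it may be worth comparing it against a baseline that always predicts the majority class. The sketch below does this on synthetic labels. The 10% positive rate is an assumption for illustration, not a figure computed from this dataset.

```python
import torch

# Synthetic labels: 10% positives (made-up ratio, for illustration only).
y_true = torch.tensor([1.0] * 10 + [0.0] * 90).reshape(-1, 1)

# A "model" that always predicts the negative class.
y_pred = torch.zeros_like(y_true)

baseline_accuracy = (y_pred == y_true).float().mean()

# Recall on the positive class: share of actual positives that were caught.
true_pos = ((y_pred == 1) & (y_true == 1)).sum()
recall = true_pos / (y_true == 1).sum()

print(f"Baseline accuracy: {float(baseline_accuracy):.2f}, recall: {float(recall):.2f}")
```

Running the same two calculations on `y_test_tensor` and `y_pred_test.round()` would show how far the trained model sits above such a baseline.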