Michael Wilson

DSC-609 - Machine Learning

Module 3 - Kernelized Support Vector Classification

## Dataset

The dataset used for this assignment is the Bank Marketing Data Set, posted on Kaggle by user ruthgn.  It is a modified version of a dataset shared in the University of California, Irvine (UCI) Machine Learning Repository.

The dataset represents the outcomes of marketing campaigns intended to get clients to agree to a term deposit subscription.  The data is a copy of the UCI dataset with one input feature removed: the call duration.  This feature was removed because it was known to strongly affect the target classification: a duration of zero guaranteed that the client declined the term deposit subscription.

The dataset offers 19 independent variables that can be used to predict one dependent variable: whether or not the client has subscribed to a term deposit.  The 19 independent variables are initially defined as follows:

Bank Client Data:

1 - age (numeric)

2 - job : type of job (categorical: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown')

3 - marital : marital status (categorical: 'divorced', 'married', 'single', 'unknown'; note: 'divorced' means divorced or widowed)


4 - education (categorical: 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown')

5 - default: has credit in default? (categorical: 'no', 'yes', 'unknown')

6 - housing: has housing loan? (categorical: 'no', 'yes', 'unknown')

7 - loan: has personal loan? (categorical: 'no', 'yes', 'unknown')

Related with the last contact of the current campaign:

8 - contact: contact communication type (categorical: 'cellular', 'telephone')

9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', …, 'nov', 'dec')

10 - day_of_week: last contact day of the week (categorical: 'mon', 'tue', 'wed', 'thu', 'fri')

Other attributes:

11 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

12 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

13 - previous: number of contacts performed before this campaign and for this client (numeric)

14 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

Social and economic context attributes:

15 - emp.var.rate: employment variation rate - quarterly indicator (numeric)

16 - cons.price.idx: consumer price index - monthly indicator (numeric)

17 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)

18 - euribor3m: euribor 3 month rate - daily indicator (numeric)

19 - nr.employed: number of employees - quarterly indicator (numeric) (ruthgn, 2021)

From this dataset we will filter the data in order to simplify the analysis, restricting it to a subset of records and dropping some of the columns.  We are concerned with clients who hold a university degree, who weren't previously contacted (poutcome = nonexistent or pdays = 999) for this type of campaign, and for whom none of the retained categorical variables have a value of "unknown".

To perform this classification, we will fit Support Vector Machine, Logistic Regression, and Nearest Neighbors classifiers to the data.

In [1]:
#Import required packages

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

In [2]:
# Read in the file:

BankData = pd.read_csv(r'C:\Users\Mike\Documents\Grad School 2021\DSC-609 Machine Learning\bank-direct-marketing-campaigns.csv')

BankData.head()    

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [3]:
# Select Data

# Only education type of interest is those with University degrees
BankDataSelect = BankData[BankData.education == 'university.degree']
# Want to get previously unsolicited clients
BankDataSelect = BankDataSelect[BankDataSelect.poutcome == 'nonexistent']

In [4]:
# Eliminate rows with "unknown" responses for categorical variables (job, default, etc)

BankDataSelect = BankDataSelect[BankDataSelect.job != 'unknown']
BankDataSelect = BankDataSelect[BankDataSelect.marital != 'unknown']
BankDataSelect = BankDataSelect[BankDataSelect.default != 'unknown']
BankDataSelect = BankDataSelect[BankDataSelect.housing != 'unknown']
BankDataSelect = BankDataSelect[BankDataSelect.loan != 'unknown']
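
The row filters in cells 3 and 4 can also be written as one boolean mask. The block below is a minimal sketch of that idea on a hypothetical four-row frame (`toy`) that only borrows the column names of the bank data; it is not the notebook's own code, and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature frame; only the column names come from the bank data.
toy = pd.DataFrame({
    'education': ['university.degree', 'university.degree', 'high.school', 'university.degree'],
    'poutcome':  ['nonexistent', 'nonexistent', 'nonexistent', 'failure'],
    'job':       ['admin.', 'unknown', 'services', 'technician'],
    'marital':   ['married', 'single', 'married', 'single'],
    'default':   ['no', 'no', 'no', 'no'],
    'housing':   ['yes', 'no', 'no', 'yes'],
    'loan':      ['no', 'no', 'yes', 'no'],
})

# Categorical columns that must not contain "unknown"
categorical_cols = ['job', 'marital', 'default', 'housing', 'loan']

keep = (
    (toy['education'] == 'university.degree')             # university degree only
    & (toy['poutcome'] == 'nonexistent')                  # not previously contacted
    & toy[categorical_cols].ne('unknown').all(axis=1)     # no "unknown" in any listed column
)

# .copy() makes the result an independent frame that is safe to modify later
selected = toy.loc[keep].copy()
print(selected.index.tolist())
```

Only the first toy row passes all three conditions. Building the mask once keeps every filtering rule in one place, and the explicit `.copy()` avoids pandas warnings about modifying a view when columns are recoded afterwards.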

In [5]:
# Recode variables.  In lieu of the multiple job categories, we will transform that into a 0 or 1, with
# 0 being unemployed, and 1 being anything else, including retired.
# Columns are assigned back on an explicit copy; calling mask(..., inplace = True) on a
# column of a filtered frame may leave the DataFrame unchanged in recent pandas versions.

BankDataSelect = BankDataSelect.copy()

BankDataSelect['job'] = (BankDataSelect['job'] != 'unemployed').astype(int)
# Married = 1, not married = 0
BankDataSelect['marital'] = (BankDataSelect['marital'] == 'married').astype(int)
# Has credit in default / a housing loan / a personal loan = 1, does not = 0
yes_no = {'no': 0, 'yes': 1}
for col in ['default', 'housing', 'loan']:
    BankDataSelect[col] = BankDataSelect[col].map(yes_no)
# Code months by number (jan = 0, ..., dec = 11)
months = ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
BankDataSelect['month'] = BankDataSelect['month'].map({m: i for i, m in enumerate(months)})
# Code days by number (mon = 0, ..., fri = 4)
days = ['mon', 'tue', 'wed', 'thu', 'fri']
BankDataSelect['day_of_week'] = BankDataSelect['day_of_week'].map({d: i for i, d in enumerate(days)})
# Code output variable
BankDataSelect['y'] = BankDataSelect['y'].map(yes_no)

BankDataSelect = BankDataSelect.drop(['nr.employed','education','contact',
                                      'pdays','previous','poutcome'], axis =1)

#Convert to integer types (raises an error if any value was left unmapped)
int_cols = ['job', 'marital', 'default', 'housing', 'loan', 'month', 'day_of_week', 'y']
BankDataSelect = BankDataSelect.astype({col: int for col in int_cols})

BankDataSelect.describe()

Unnamed: 0,age,job,marital,default,housing,loan,month,day_of_week,campaign,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,y
count,8777.0,8777.0,8777.0,8777.0,8777.0,8777.0,8777.0,8777.0,8777.0,8777.0,8777.0,8777.0,8777.0,8777.0
mean,38.182067,0.978466,0.516122,0.0,0.546314,0.161103,5.984733,1.987695,2.646121,0.181987,93.522019,-40.028734,3.776501,0.115073
std,9.176353,0.145163,0.499768,0.0,0.497879,0.367647,2.104222,1.406809,2.828646,1.557595,0.542252,4.562828,1.659514,0.319129
min,20.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,1.0,-3.4,92.201,-50.8,0.634,0.0
25%,31.0,1.0,0.0,0.0,0.0,0.0,4.0,1.0,1.0,-1.8,93.2,-42.7,1.41,0.0
50%,36.0,1.0,1.0,0.0,1.0,0.0,6.0,2.0,2.0,1.1,93.444,-41.8,4.858,0.0
75%,44.0,1.0,1.0,0.0,1.0,0.0,7.0,3.0,3.0,1.4,93.994,-36.1,4.963,0.0
max,91.0,1.0,1.0,0.0,1.0,1.0,11.0,4.0,43.0,1.4,94.767,-26.9,5.045,1.0


Now that all of the remaining columns are numeric, we can build models.  First, we split the dataset into training and testing sets:

In [6]:
target = BankDataSelect['y']
Predictors = BankDataSelect.drop(['y'], axis = 1)

Predictors_train, Predictors_test, target_train, target_test = train_test_split(Predictors,
                                                                                target, random_state = 5)
print('Number of training records: \t', len(Predictors_train))
print('Number of testing records: \t', len(Predictors_test))

Number of training records: 	 6582
Number of testing records: 	 2195
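
With a rare positive class, a purely random split can leave the training and testing sets with slightly different class proportions. As an aside, the sketch below shows the `stratify` argument of `train_test_split`, which keeps the proportions equal in both splits. The arrays `X` and `y` are hypothetical stand-ins rather than the bank data, and this is an alternative to the split used in this notebook, not a description of it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 12 positives out of 100 records
y = np.array([1] * 12 + [0] * 88)
X = np.arange(100).reshape(-1, 1)  # placeholder single feature

# stratify=y asks for the same class proportions in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=5, stratify=y
)

print('Positive share, train: {:.3f}'.format(y_train.mean()))
print('Positive share, test:  {:.3f}'.format(y_test.mean()))
```

Here the 12 positives are divided 9 to 3 between the 75 training and 25 testing records, so both splits keep a 12% positive share.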


In [7]:
# Support Vector, default rbf kernel

Bank_svm = SVC(C = 1)
Bank_svm.fit(Predictors_train, target_train)
print('Training Accuracy: {:.4f}'.format(Bank_svm.score(Predictors_train, target_train)))
print('Test Set Accuracy: {:.4f}'.format(Bank_svm.score(Predictors_test, target_test)))

Training Accuracy: 0.8830
Test Set Accuracy: 0.8907


The support vector classifier, with a C value of 1, returns a model that predicts on the test data at roughly 89% accuracy.  At face value this is quite good, and it is slightly higher than the training accuracy of 88%, so there is no sign of overfitting.  Accuracy alone, however, does not show how well the model predicts the less common "yes" class.

In [8]:
# Standard Logistic Regression

Bank_LR = LogisticRegression(max_iter = 100000)
Bank_LR.fit(Predictors_train, target_train)

LR_test_predict = Bank_LR.predict(Predictors_test)

print('Training Accuracy: {:.4f}'.format(Bank_LR.score(Predictors_train, target_train)))
print('Test Set Accuracy: {:.4f}'.format(Bank_LR.score(Predictors_test, target_test)))


Training Accuracy: 0.8830
Test Set Accuracy: 0.8907


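Logistic Regression returns a model with exactly the same training and test accuracy as the support vector classifier.  This does not by itself show that the data is linearly separable; if it were, a linear model could reach near-perfect training accuracy.  Identical scores on both splits more plausibly mean that the two models are making the same, or nearly the same, predictions.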
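
When two models report identical accuracy, one way to probe it is to compare their predictions directly. The helper below, `compare_predictions`, is a hypothetical function written for illustration, and the two short prediction lists are invented. In this notebook, the inputs would be something like `Bank_svm.predict(Predictors_test)` and `LR_test_predict`.

```python
import numpy as np

def compare_predictions(pred_a, pred_b):
    """Summarize how often two sets of 0/1 predictions agree."""
    pred_a = np.asarray(pred_a)
    pred_b = np.asarray(pred_b)
    return {
        'agreement': float(np.mean(pred_a == pred_b)),  # share of identical predictions
        'positives_a': int(np.sum(pred_a == 1)),        # number of "yes" predictions, model A
        'positives_b': int(np.sum(pred_b == 1)),        # number of "yes" predictions, model B
    }

# Hypothetical predictions from two models that both always answer "no"
svm_pred = [0, 0, 0, 0, 0, 0]
lr_pred = [0, 0, 0, 0, 0, 0]

summary = compare_predictions(svm_pred, lr_pred)
print(summary)
```

An agreement of 1.0 together with zero positive predictions from both models would mean that neither model ever predicts "yes", which is a different situation from both models having learned the same useful decision boundary.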

In [9]:
# Nearest-neighbors classifier
k = [2,3,4,5,6,7,8,9,10]

for qty in k:
    Bank_knn = KNeighborsClassifier(n_neighbors = qty)
    Bank_knn.fit(Predictors_train, target_train)
    print('\nNumber of Neighbors = ', qty)
    print('Training Accuracy: {:.4f}'.format(Bank_knn.score(Predictors_train, target_train)))
    print('Test Set Accuracy: {:.4f}'.format(Bank_knn.score(Predictors_test, target_test)))


Number of Neighbors =  2
Training Accuracy: 0.9117
Test Set Accuracy: 0.8825

Number of Neighbors =  3
Training Accuracy: 0.9134
Test Set Accuracy: 0.8688

Number of Neighbors =  4
Training Accuracy: 0.8999
Test Set Accuracy: 0.8838

Number of Neighbors =  5
Training Accuracy: 0.9009
Test Set Accuracy: 0.8793

Number of Neighbors =  6
Training Accuracy: 0.8935
Test Set Accuracy: 0.8861

Number of Neighbors =  7
Training Accuracy: 0.8955
Test Set Accuracy: 0.8802

Number of Neighbors =  8
Training Accuracy: 0.8915
Test Set Accuracy: 0.8843

Number of Neighbors =  9
Training Accuracy: 0.8903
Test Set Accuracy: 0.8806

Number of Neighbors =  10
Training Accuracy: 0.8873
Test Set Accuracy: 0.8870


The test accuracies for the different numbers of neighbors all fall in a fairly tight range (about 87% to 89%), close to the accuracies for logistic regression and the support vector machine model.

Looking at all of the accuracy results, and then recalling the summary statistics for the target, suggests an explanation.  The mean of the target is 0.115, so only about 11.5% of the records are a "yes".  A test accuracy of 88% or more looks good in isolation, but being wrong about 11% of the time when the target class is also only about 11% of the data is a good reason to dig deeper and check for a class imbalance problem.  With the target class at 11.5%, a model could make an unchanging prediction of "no" for all instances and still achieve an accuracy of about 88%.
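
The arithmetic behind this observation can be checked with the figures already reported in this notebook. The sketch below pools the training and test accuracies of the support vector and logistic regression models, weighted by record count, and compares the result with the share of "no" records. The variable names are illustrative, and the inputs are rounded to the precision printed in the outputs.

```python
# Figures reported in the cell outputs of this notebook
n_train, n_test = 6582, 2195           # records in each split
train_acc, test_acc = 0.8830, 0.8907   # SVM and logistic regression accuracies
positive_share = 0.115073              # mean of y from describe()

# Accuracy over all records, weighting each split by its size
pooled_accuracy = (n_train * train_acc + n_test * test_acc) / (n_train + n_test)

# Accuracy of a model that always predicts "no"
majority_share = 1 - positive_share

print('Pooled accuracy: {:.4f}'.format(pooled_accuracy))
print('Share of "no":   {:.4f}'.format(majority_share))
```

The two values agree to within rounding, at about 0.885. That is consistent with those two models predicting "no" for essentially every record, although a confusion matrix on the actual predictions would be needed to confirm it.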

## References

ruthgn. (2021, October). Bank Marketing Data Set. Kaggle. Retrieved November 12, 2021, from https://www.kaggle.com/ruthgn/bank-marketing-data-set.