# Neural Network Modelling & SMOTE under/over sampling

This notebook investigates whether a Multi-layer Perceptron (MLP) neural network improves predictions of vaccine willingness on the Wave 1 data, i.e. whether neural network modelling improves accuracy, precision and recall scores.

In addition, SMOTE oversampling of the minority target classes is combined with random undersampling of the majority classes, to explore whether a mixed over/under sampling approach has a significant impact on model accuracy.

The notebook is split into the following sections:

- Package imports and premodel processing
- MLP and Logistic Regression Modelling
- MLP and Logistic Regression Optimisation
- Results Analysis


**Package imports and premodel processing**

In [1]:
# All Package imports
import numpy as np
import pandas as pd
import scipy.stats as stats
from sklearn.datasets import make_classification
from imblearn.pipeline import make_pipeline
import collections
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
import joblib
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score, cross_val_predict
import sklearn.metrics as metrics
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.decomposition import PCA
from scikitplot.decomposition import plot_pca_component_variance
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from scikitplot.metrics import plot_roc
from sklearn.metrics import plot_confusion_matrix, plot_roc_curve
from sklearn.neural_network import MLPClassifier

from matplotlib.colors import ListedColormap
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')

%config InlineBackend.figure_format = 'retina'

In [2]:
# Data Import
data = pd.read_csv("C:/Users/laure/OneDrive/Documents/Personal Admin Files/Stats or Career Stuff/General Assembly/Course Notes/Capstone Folder/Cleaned_data/wave_1_vaccine_intention_data.csv")
data.drop('Unnamed: 0', axis=1, inplace=True)
data.shape

(9587, 87)

In [3]:
# Preparation of the data for modelling - manually dropping first dummy columns using feature list and total column list

# Creation of total column list
w1columns = list(data.columns)
    
# Creation of var_list
var_list = ['DEMAGE', 'VAC_DEC',  'COV_KNOWL_1', 'COV_KNOWL_2',
     'COV_KNOWL_3', 'COV_KNOWL_4', 'COV_KNOWL_5', 'COV_KNOWL_6',
     'COV_KNOWL_7', 'DREAD', 'ANX_1', 'ANX_2', 'ANX_3', 'ANX_4', 'ANX_5',
     'ANX_6', 'DEMREG_East of England',
     'DEMREG_Greater London', 'DEMREG_North East', 'DEMREG_North West',
     'DEMREG_Northern Ireland', 'DEMREG_Scotland', 'DEMREG_South East',
     'DEMREG_South West', 'DEMREG_Wales', 'DEMREG_West Midlands',
     'DEMREG_Yorkshire and The Humber', 'DEMSEX_Male',
     'DEMEDU_2+ A levels or equivalents',
     'DEMEDU_5+ GCSE, O-levels, 1 A level, or equivalents',
     'DEMEDU_Apprenticeship', 'DEMEDU_No academic qualifications',
     'DEMEDU_Other',
     'DEMEDU_Undergraduate or postgraduate degree',
     'DEMWRK_Retired', 'DEMWRK_Student',
     'DEMWRK_Unable to work',
     'DEMWRK_Unemployed',
     'DEMWRK_Working full-time',
     'DEMWRK_Working part-time',
     'DEMREL_Christian', 'DEMREL_Muslim', 'DEMREL_Other',
     'DEMINC_Under £15,000', 'DEMINC_£15,000 to £24,999',
     'DEMINC_£25,000 to £34,999', 'DEMINC_£35,000 to £44,999',
     'DEMINC_£45,000 to £54,999', 'DEMINC_£55,000 to £64,999',
     'DEMINC_£65,000 to £99,999', 'COV_SHIELD_Yes',
     'COV_TRUST_1_National television',
     'COV_TRUST_2_Satellite / international television channels',
     'COV_TRUST_3_Radio', 'COV_TRUST_4_Newspapers',
     'COV_TRUST_5_Social media (Facebook, Twitter, etc)',
     'COV_TRUST_6_National public health authorities (such as the NHS or Public Health England / Wales)',
     'COV_TRUST_7_Healthcare workers',
     'COV_TRUST_8_International health authorities (such as The World Health Organization)',
     'COV_TRUST_9_Government websites',
     'COV_TRUST_10_The internet or search engines',
     'COV_TRUST_11_Family and friends',
     'COV_TRUST_12_Work, school, or college',
     'COV_TRUST_13_Other (please specify)', 'COV_VAC_SELF', 'target_1', 'target_2']

# Manually dropping first columns using total column and feature lists
drop_columns = []
for column in data.columns:
    if column not in var_list:
        print(column)
        drop_columns.append(column)
        
data.drop(drop_columns, axis=1, inplace=True)

DEMREG_East Midlands
DEMSEX_Female
DEMEDU_0-4 GCSE, O-levels, or equivalents
DEMWRK_Looking after the home
DEMREL_Atheist or agnostic
DEMINC_Over £100,000
COV_SHIELD_No
COV_TRUST_1_0
COV_TRUST_2_0
COV_TRUST_3_0
COV_TRUST_4_0
COV_TRUST_5_0
COV_TRUST_6_0
COV_TRUST_7_0
COV_TRUST_8_0
COV_TRUST_9_0
COV_TRUST_10_0
COV_TRUST_11_0
COV_TRUST_12_0
COV_TRUST_13_0


In [4]:
# Baseline accuracy: the share of the most frequent target class
y = data['COV_VAC_SELF']
y.value_counts(normalize=True).max()

0.5620110566391989

In [5]:
y.value_counts()

1    5388
2    2874
3     773
4     552
Name: COV_VAC_SELF, dtype: int64
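
The baseline accuracy reported above is the share of the most frequent class, i.e. the score of a model that always predicts class 1. A minimal sketch that reproduces it from the printed class counts (the variable names are illustrative and not part of the notebook):

```python
# Class counts copied from the value_counts() output above.
class_counts = {1: 5388, 2: 2874, 3: 773, 4: 552}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a model that always predicts the most frequent class.
baseline_accuracy = class_counts[majority_class] / total

print(total)
print(majority_class)
print(round(baseline_accuracy, 4))
```

Any model whose test accuracy falls below this figure is doing worse on raw accuracy than the constant prediction, which is why the later sections also look at per-class recall and ROC AUC.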

In [6]:
# Predictors: every retained column except the three target columns
target_columns = ['COV_VAC_SELF', 'target_1', 'target_2']
X = data[[column for column in var_list if column not in target_columns]]

In [7]:
# Data preprocessing - Train/test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=3)

# Data scaling
scaler = StandardScaler()
Xstd_train = scaler.fit_transform(X_train)
Xstd_test = scaler.transform(X_test)

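# Application of over and under sampling to the training data only.
# No random_state is set on either sampler, so the resampled set differs between runs.
under = RandomUnderSampler(sampling_strategy={1: 2000, 2: 2000})
over = SMOTE(sampling_strategy={3: 2000, 4: 2000})

# SMOTE runs first, then the majority classes are cut back
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)
X_smote, y_smote = pipeline.fit_resample(Xstd_train, y_train)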
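
To make the effect of the resampling step concrete, the sketch below works out how many rows are added or discarded per class. It assumes the training counts equal the full-data class counts minus the test-set support shown in the classification reports in this notebook; the variable names are illustrative.

```python
# Full-data class counts (value_counts output) and test-set support
# (the "support" column of the classification reports).
full_counts = {1: 5388, 2: 2874, 3: 773, 4: 552}
test_counts = {1: 1078, 2: 575, 3: 155, 4: 110}

# Targets passed to SMOTE and RandomUnderSampler in the cell above.
target = {1: 2000, 2: 2000, 3: 2000, 4: 2000}

train_counts = {k: full_counts[k] - test_counts[k] for k in full_counts}

# Positive change: synthetic rows added by SMOTE.
# Negative change: real rows discarded by the under-sampler.
change = {k: target[k] - train_counts[k] for k in train_counts}

print(train_counts)
print(change)
print(sum(target.values()))
```

Under these assumptions, roughly 69% of the class 3 rows and 78% of the class 4 rows in the resampled training set would be synthetic, while more than half of the class 1 rows are dropped.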

In [8]:
# Defining Modelling function. Returns all necessary metrics, appends to the score accumulator dictionary
# and prints out model results.

skf = StratifiedKFold(n_splits=5)

def model_run(model, xtr, xte, ytr, yte):
    """Function to run multiclass model set on train and test data, and return summary results & predictions
    (assigned to the score accumulation dictionary)
    
    :param model: Model with which to fit train data and generate test predictions
    :param xtr: Train data predictor variables
    :param xte: Test data predictor variables
    :param ytr: Train data target variable
    :param yte: Test data target variable
    :return: test data predictions"""
    
    model.fit(xtr, ytr)
    model_name = str(model)
    train_acc = model.score(xtr, ytr)
    # Note: when xtr has already been resampled, synthetic SMOTE rows can fall in both the
    # training and validation folds, so these cross-validation scores are likely optimistic.
    meancrossval = cross_val_score(model, xtr, ytr, cv=skf).mean()
    meancrossvalroc = cross_val_score(model, xtr, ytr, scoring='roc_auc_ovr', cv=skf).mean()
    test_acc = model.score(xte, yte)
    predictions = model.predict(xte)

    # Descriptive statistics on the integer-coded labels, not a formal bias-variance decomposition:
    # variance = spread of the predicted labels; bias_sq = mean squared distance between the
    # true labels and the mean predicted label.
    variance = round(predictions.var(), 4)
    bias_sq = round(np.mean((yte - predictions.mean())**2), 4)

    # Reported under the 'Variance/Bias' label, but computed as a sum
    varbias = variance + bias_sq

    print(f'Training accuracy score: {train_acc}')
    print(f'5-Fold Cross Val accuracy score: {meancrossval}')
    print(f'Test accuracy score: {test_acc}')
    print()
    print(f'5-Fold Cross Val ROCAUC score: {meancrossvalroc}')
    print(f"ROC_AUC_SCORE Test: {roc_auc_score(yte, model.predict_proba(xte), multi_class='ovr')}")
    print()
    print(confusion_matrix(yte, predictions))
    print()
    print(classification_report(yte, predictions))
    print()
    print()
    print(f'Variance: {variance}')
    print(f'Bias sq: {bias_sq}')
    print(f'Variance/Bias: {varbias}')
    
    scores['Model Name'].append(model_name)
    scores['Train Accuracy'].append(train_acc)
    scores['Cross Validation Accuracy'].append(meancrossval)
    scores['Test Accuracy'].append(test_acc)
    scores['Test Precision'].append(metrics.precision_score(yte, predictions, average='weighted'))
    scores['Test Recall'].append(metrics.recall_score(yte, predictions, average='weighted'))
    scores['Test F1 Score'].append(metrics.f1_score(yte, predictions, average='weighted'))
    scores['Cross Validation ROC AUC'].append(meancrossvalroc)
    scores['Test ROC AUC Score'].append(roc_auc_score(yte, model.predict_proba(xte), multi_class='ovr'))
    scores['Variance'].append(variance)
    scores['Bias'].append(bias_sq)
    scores['Variance/bias'].append(varbias)
    
    return predictions
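
The "Variance" and "Bias sq" figures produced by `model_run` are computed directly on the integer class labels. The toy example below, using made-up labels rather than notebook data, shows what the two quantities measure. Because `predictions` is one-dimensional, `predictions.var(axis=0).mean()` reduces to the plain variance of the predicted labels.

```python
import numpy as np

# Hypothetical true and predicted labels for six test rows (not notebook data).
y_true = np.array([1, 1, 2, 3, 4, 4])
y_pred = np.array([1, 2, 2, 4, 4, 1])

# Spread of the predicted labels around their own mean.
variance = y_pred.var()

# Mean squared distance of the true labels from the mean predicted label.
bias_sq = np.mean((y_true - y_pred.mean()) ** 2)

print(round(float(variance), 4), round(float(bias_sq), 4))
```

Both figures depend on the 1 to 4 coding of the target and on a single fitted model, so they are best read as rough descriptive statistics rather than as the classical bias-variance decomposition, which would require refitting the model on repeated resamples.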


In [9]:
# Initialising model scores dictionary
scores = {
    'Model Name': [],
    'Train Accuracy': [],
    'Cross Validation Accuracy': [],
    'Test Accuracy': [],
    'Test Precision': [],
    'Test Recall': [],
    'Test F1 Score': [],
    'Cross Validation ROC AUC': [],
    'Test ROC AUC Score': [],
    'Variance': [],
    'Bias': [],
    'Variance/bias': []
}


**MLP and Logistic Regression Modelling**

In [10]:
# Instantiating the models

log = LogisticRegression(multi_class='ovr')
clf = MLPClassifier(solver='lbfgs',
                   alpha=0.5,
                   hidden_layer_sizes=(8, 8, 8, 8, 8),
                   activation='relu',
                   random_state=1,
                   batch_size='auto',
                   max_iter=10000)
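
For a sense of the size of this network, the sketch below counts its trainable parameters. It assumes the 64 predictor columns selected for `X` and one output unit per target class (four classes); the variable names are illustrative.

```python
# Layer widths: 64 inputs, five hidden layers of 8 units, 4 output classes.
layer_sizes = [64, 8, 8, 8, 8, 8, 4]

# Each pair of adjacent layers contributes a weight matrix
# plus one bias term per receiving unit.
n_weights = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
n_biases = sum(layer_sizes[1:])

print(n_weights, n_biases, n_weights + n_biases)
```

Under these assumptions the network is small relative to the 8,000 resampled training rows, although it still has more free parameters than a one-vs-rest logistic regression on the same inputs (4 x 65 = 260 coefficients and intercepts).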


In [11]:
# Running Model comparison
model_list = [log, clf]

for model in model_list:
    print(f'Model: {model}')
    model_run(model, X_smote, Xstd_test, y_smote, y_test)
    print()
    print('-----------------------')
    print()

Model: LogisticRegression(multi_class='ovr')
Training accuracy score: 0.47
5-Fold Cross Val accuracy score: 0.44975
Test accuracy score: 0.45828988529718456

5-Fold Cross Val ROCAUC score: 0.7245071875000001
ROC_AUC_SCORE Test: 0.6671378008865614

[[641 171 151 115]
 [210 125 141  99]
 [ 46  26  41  42]
 [ 15  11  12  72]]

              precision    recall  f1-score   support

           1       0.70      0.59      0.64      1078
           2       0.38      0.22      0.28       575
           3       0.12      0.26      0.16       155
           4       0.22      0.65      0.33       110

    accuracy                           0.46      1918
   macro avg       0.35      0.43      0.35      1918
weighted avg       0.53      0.46      0.48      1918



Variance: 1.3373
Bias sq: 0.9084
Variance/Bias: 2.2457

-----------------------

Model: MLPClassifier(alpha=0.5, hidden_layer_sizes=(8, 8, 8, 8, 8), max_iter=10000,
              random_state=1, solver='lbfgs')


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


Training accuracy score: 0.610875
5-Fold Cross Val accuracy score: 0.532125
Test accuracy score: 0.4197080291970803

5-Fold Cross Val ROCAUC score: 0.7713220833333333
ROC_AUC_SCORE Test: 0.6161745685714102

[[500 295 193  90]
 [133 219 149  74]
 [ 32  56  36  31]
 [ 10  15  35  50]]

              precision    recall  f1-score   support

           1       0.74      0.46      0.57      1078
           2       0.37      0.38      0.38       575
           3       0.09      0.23      0.13       155
           4       0.20      0.45      0.28       110

    accuracy                           0.42      1918
   macro avg       0.35      0.38      0.34      1918
weighted avg       0.55      0.42      0.46      1918



Variance: 1.0641
Bias sq: 0.9735
Variance/Bias: 2.0376000000000003

-----------------------
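
The per-class recall and overall accuracy in the reports above can be recovered directly from the two confusion matrices. A minimal sketch, with the matrices copied from the output and illustrative helper names:

```python
import numpy as np

# Test-set confusion matrices copied from the output above (rows = true class 1-4).
cm_log = np.array([[641, 171, 151, 115],
                   [210, 125, 141,  99],
                   [ 46,  26,  41,  42],
                   [ 15,  11,  12,  72]])
cm_mlp = np.array([[500, 295, 193,  90],
                   [133, 219, 149,  74],
                   [ 32,  56,  36,  31],
                   [ 10,  15,  35,  50]])

def per_class_recall(cm):
    """Correct predictions per class divided by the number of true rows."""
    return np.diag(cm) / cm.sum(axis=1)

def accuracy(cm):
    """Share of all test rows that lie on the diagonal."""
    return np.trace(cm) / cm.sum()

print(np.round(per_class_recall(cm_log), 2), round(float(accuracy(cm_log)), 4))
print(np.round(per_class_recall(cm_mlp), 2), round(float(accuracy(cm_mlp)), 4))
```

Reading the matrices this way makes the trade-off explicit: the logistic regression recovers more of the class 4 respondents, while the MLP recovers more of class 2 at the cost of class 1.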





**MLP and Logistic Regression Optimisation**

In [12]:
# Neural Net Grid search w.r.t ROC AUC
# Note that the code has been commented out to avoid accidentally re-running a computationally expensive search

# ne_params = {'alpha': [0.001, 0.5, 0.75, 0.99],
#              'hidden_layer_sizes': [(20, 20, 20, 20), (15, 15, 15, 15), (10, 10, 10, 10), (5, 5, 5, 5), (10, 10, 10, 10, 10)],
#              'activation': ['logistic', 'tanh', 'relu']}

# ne = MLPClassifier(random_state=1, max_iter=10000)

# ne_gr = GridSearchCV(ne, ne_params, scoring='roc_auc_ovr', n_jobs=2, cv=5, verbose=1)
# ne_gr.fit(X_smote, y_smote)
# ne_best = ne_gr.best_estimator_
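
The cost of this search follows from the size of the grid. The sketch below enumerates the combinations using the values from the commented-out `ne_params` and the `cv=5` setting; the variable names are illustrative.

```python
from itertools import product

# Values copied from the commented-out ne_params grid above.
alphas = [0.001, 0.5, 0.75, 0.99]
hidden_layer_sizes = [(20, 20, 20, 20), (15, 15, 15, 15), (10, 10, 10, 10),
                      (5, 5, 5, 5), (10, 10, 10, 10, 10)]
activations = ['logistic', 'tanh', 'relu']
n_folds = 5  # cv=5 in the GridSearchCV call

candidates = list(product(alphas, hidden_layer_sizes, activations))
n_fits = len(candidates) * n_folds

print(len(candidates), n_fits)
```

With `GridSearchCV`'s default `refit=True`, one further fit of the best candidate on the full resampled training set follows the cross-validated fits.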

In [13]:
# joblib.dump(ne_best, 'best_network.jlib')

In [14]:
neural = joblib.load('best_network.jlib')

In [15]:
# LOGISTIC REGRESSION Grid search w.r.t ROC AUC.
# Note that the code has been commented out to avoid accidentally re-running a computationally expensive search

# log_params = {'l1_ratio': [0.1, 0.25, 0.5, 0.75, 0.9, 0.99],
#               'C': np.logspace(-3, 0, 50)}

# log = LogisticRegression(multi_class='ovr', penalty='elasticnet', solver='saga', max_iter=1000)

# reclog_gr = GridSearchCV(log, log_params, scoring='roc_auc_ovr', n_jobs=2, cv=5, verbose=1)
# reclog_gr.fit(X_smote, y_smote)
# logbest_rec = reclog_gr.best_estimator_
# joblib.dump(logbest_rec, 'underoverlogopt.jlib')
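
As a quick check on the search range, the sketch below rebuilds the `C` grid and locates the value reported for the tuned logistic regression in this notebook's comparison output (`C=0.044984326689694466`). The variable names are illustrative.

```python
import numpy as np

# Same grids as the commented-out log_params above.
C_grid = np.logspace(-3, 0, 50)
l1_ratios = [0.1, 0.25, 0.5, 0.75, 0.9, 0.99]

# C value reported for the tuned model in the comparison output.
selected_C = 0.044984326689694466
idx = int(np.argmin(np.abs(C_grid - selected_C)))

print(len(C_grid) * len(l1_ratios))  # number of candidate combinations
print(idx, C_grid[idx])
```

The selected value sits in the interior of the grid rather than at either end, which gives some reassurance that the range for `C` was not obviously too narrow.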

In [16]:
logbest = joblib.load('underoverlogopt.jlib')

In [17]:
# Running Model comparison
model_list = [logbest, neural]

for model in model_list:
    print(f'Model: {model}')
    model_run(model, X_smote, Xstd_test, y_smote, y_test)
    print()
    print('-----------------------')
    print()

Model: LogisticRegression(C=0.044984326689694466, l1_ratio=0.75, max_iter=1000,
                   multi_class='ovr', penalty='elasticnet', solver='saga')
Training accuracy score: 0.46775
5-Fold Cross Val accuracy score: 0.4505
Test accuracy score: 0.47132429614181437

5-Fold Cross Val ROCAUC score: 0.7238298958333333
ROC_AUC_SCORE Test: 0.6748391267895785

[[671 145 145 117]
 [214 119 141 101]
 [ 48  22  40  45]
 [ 16   9  11  74]]

              precision    recall  f1-score   support

           1       0.71      0.62      0.66      1078
           2       0.40      0.21      0.27       575
           3       0.12      0.26      0.16       155
           4       0.22      0.67      0.33       110

    accuracy                           0.47      1918
   macro avg       0.36      0.44      0.36      1918
weighted avg       0.54      0.47      0.49      1918



Variance: 1.3723
Bias sq: 0.897
Variance/Bias: 2.2693000000000003

-----------------------

Model: MLPClassifier(alpha=0.99, 

**Results Analysis**

The results suggest that, as in the original analysis (see the Wave 1 modelling and Wave 2 modelling notebooks), logistic regression performs best on accuracy, precision and recall when compared with the alternative estimator, in this case the MLP neural network, even after grid-search optimisation of the MLP's parameters.

The results for the MLP classifier again suggest overfitting to the training data: in the untuned run, training accuracy (0.61) and cross-validated ROC AUC (0.77) are markedly higher than the corresponding test scores (0.42 and 0.62).

Meanwhile, the logistic regression train and test scores are more closely aligned, and the model, both optimised and non-optimised, performs broadly in line with results elsewhere (see the separate notebooks referenced above). Although the logistic regression performs poorly here on accuracy (0.46 to 0.47 on the test set, below the 0.56 majority-class baseline), it scores well relative to the MLP classifier on recall for the priority classes 3 and 4, and on ROC AUC.
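
The size of the gap between training and test performance can be made explicit. The sketch below uses the scores printed for the untuned comparison, rounded to four decimal places; the dictionary layout is illustrative.

```python
# Scores copied from the untuned comparison output (rounded to 4 d.p.).
results = {
    'LogisticRegression': {'train_acc': 0.4700, 'test_acc': 0.4583,
                           'cv_auc': 0.7245, 'test_auc': 0.6671},
    'MLPClassifier':      {'train_acc': 0.6109, 'test_acc': 0.4197,
                           'cv_auc': 0.7713, 'test_auc': 0.6162},
}

# Gap between the score on the resampled training data and on the test set.
gaps = {
    name: {'acc_gap': round(r['train_acc'] - r['test_acc'], 4),
           'auc_gap': round(r['cv_auc'] - r['test_auc'], 4)}
    for name, r in results.items()
}

for name, gap in gaps.items():
    print(name, gap)
```

Part of each gap may reflect the difference in class mix between the balanced training set and the imbalanced test set rather than overfitting alone, so the comparison between the two models is more informative than either gap in isolation.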

The similarity between the findings here and in the main analysis suggests that combining under- and oversampling of the training data has limited impact, and that the MLP neural network does not outperform the preferred logistic regression model.