#[COE 197-Z] Project 1: Heart Disease Prediction Model

---

##Competition Score : *0.29193 Log Loss*
##Leaderboard Rank : *10th / 1444*
---

The following model was built both as a submission to the DrivenData [Machine Learning with a Heart](https://www.drivendata.org/competitions/54/machine-learning-with-a-heart/) competition and as a project for CoE 197-Z Deep Learning (2S, AY 2018-19).

The dataset was provided by DrivenData and is publicly available at the UCI Machine Learning Repository, linked [here](https://archive.ics.uci.edu/ml/datasets/heart+Disease).

---



**Table of Contents**

*  Preprocessing Data using Pandas
*  Implementing K-Fold Validation
*  Building the Deep Learning Model (3-Layer MLP)
*  Preparing Callbacks (Model Checkpoint, LR Scheduler)
*  Training the Model
*  Evaluating the Model and Submission



**Preprocess Data using Pandas**

Note: Since I had no prior background in preprocessing tabular data, this section is based mainly on the following reference: 
[Preprocessing Tabular Data](https://github.com/AnneDeGraaf/DrivenData_WarmUp_HeartDisease/blob/master/data_processing.py?fbclid=IwAR3Spxx1yyaRRpyO2yPeajdlv3SgcWuy-9ZwPLW5SPTWNIpzr0TFtph5h38).
Credits to [AnneDeGraaf](https://github.com/AnneDeGraaf/).

Pre-processing Summary:
*  Read the .csv datasets into pandas DataFrames.
*  Normalize numerical values.
*  Convert categorical data into one-hot vectors.
*  Categorical embedding was considered but not implemented, because the references consulted reported no significant effect on this particular dataset.
*  Save the pre-processed data to new .csv files to be loaded later.

In [0]:
import tensorflow as tf
import numpy as np
import pandas as pd

from keras.models import Model, load_model
from keras.layers import Dense, Dropout, Input, BatchNormalization
from keras.regularizers import l2
from keras.utils import to_categorical
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, LearningRateScheduler
from keras.constraints import unit_norm

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, StratifiedKFold

train_url = 'https://raw.githubusercontent.com/henritomas/CoE197-Z-Tomas-DL-Experiments/master/train_values.csv'
test_url = 'https://raw.githubusercontent.com/henritomas/CoE197-Z-Tomas-DL-Experiments/master/test_values.csv'
rawTrain = pd.read_csv(train_url)
rawTest = pd.read_csv(test_url)

# change categorical data into one-hot:
trainSlope_oneHot = pd.get_dummies(rawTrain['slope_of_peak_exercise_st_segment'], prefix='slope')
trainThal_oneHot = pd.get_dummies(rawTrain['thal'])
trainChestPain_oneHot = pd.get_dummies(rawTrain['chest_pain_type'], prefix='chestPain')
trainResting_oneHot = pd.get_dummies(rawTrain['resting_ekg_results'], prefix='restingEkg')
testSlope_oneHot = pd.get_dummies(rawTest['slope_of_peak_exercise_st_segment'], prefix='slope')
testThal_oneHot = pd.get_dummies(rawTest['thal'])
testChestPain_oneHot = pd.get_dummies(rawTest['chest_pain_type'], prefix='chestPain')
testResting_oneHot = pd.get_dummies(rawTest['resting_ekg_results'], prefix='restingEkg')

# replace categorical columns by one-hot
rawTrain.drop(['slope_of_peak_exercise_st_segment','thal','chest_pain_type','resting_ekg_results'], axis=1, inplace=True)
rawTrain = rawTrain.join([trainSlope_oneHot, trainThal_oneHot, trainChestPain_oneHot, trainResting_oneHot])
rawTest.drop(['slope_of_peak_exercise_st_segment','thal','chest_pain_type','resting_ekg_results'], axis=1, inplace=True)
rawTest = rawTest.join([testSlope_oneHot, testThal_oneHot, testChestPain_oneHot, testResting_oneHot])

# check for NaN's in dataset
print(rawTrain.isnull().values.any())
print(rawTest.isnull().values.any())

# apply z-score normalization to numerical data, using training-set statistics for both sets
numCols = ['resting_blood_pressure', 'serum_cholesterol_mg_per_dl', 'oldpeak_eq_st_depression', 'age', 'max_heart_rate_achieved']
for col in numCols:
    # compute the statistics once, before rawTrain[col] is overwritten
    mean, std = rawTrain[col].mean(), rawTrain[col].std()
    rawTest[col] = (rawTest[col] - mean) / std
    rawTrain[col] = (rawTrain[col] - mean) / std
    print(rawTrain[col].mean(), rawTrain[col].std()) # should be ~0 and ~1

# Storing processed data into new file
rawTrain.to_csv('../train_values_normalized.csv')
rawTest.to_csv('../test_values_normalized.csv')

False
False
4.354541418807558e-16 1.0
4.502571155424246e-17 1.0
6.1679056923619804e-18 0.9999999999999992
1.0986582014519779e-16 0.9999999999999994
5.896517841898053e-16 1.0000000000000004
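
The cell above builds the one-hot columns for the training and test sets separately, so it relies on both files containing the same category values. A minimal sketch of one way to guard against a mismatch is shown here. The frames `toy_train` and `toy_test` and their category values are made up for illustration and are not taken from the competition data.

```python
import pandas as pd

# Toy data (hypothetical): the test set happens to lack one category.
toy_train = pd.DataFrame({'thal': ['normal', 'reversible_defect', 'fixed_defect']})
toy_test = pd.DataFrame({'thal': ['normal', 'fixed_defect']})

# dtype=int gives 0/1 columns instead of booleans.
train_oh = pd.get_dummies(toy_train['thal'], prefix='thal', dtype=int)
test_oh = pd.get_dummies(toy_test['thal'], prefix='thal', dtype=int)

# Reindex the test columns against the training columns; a category
# missing from the test set becomes an all-zero column.
test_oh = test_oh.reindex(columns=train_oh.columns, fill_value=0)

print(list(test_oh.columns))
```

With this step, the model always sees the same input width for both sets, whichever categories appear in the test file.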


**Splitting Data for 8-Fold Cross Validation**
*  Use StratifiedKFold from scikit-learn to split the data into 8 folds.
*  Final processing of data: convert the training labels into one-hot vectors and drop the patient_id column from the training data.



In [0]:
def load_data_kfold(k):
  #Load pre-processed/normalized data; column 0 is the row index written by to_csv (for the labels file it is the patient id)
  train_labels_url = 'https://raw.githubusercontent.com/henritomas/CoE197-Z-Tomas-DL-Experiments/master/train_labels.csv'
  x_train = pd.read_csv('../train_values_normalized.csv', index_col=0)
  y_train = pd.read_csv(train_labels_url, index_col=0)
  
  folds = list(StratifiedKFold(n_splits=k, shuffle=True, random_state=1).split(x_train, y_train))
  
  #Reshape/Format data
  y_train = to_categorical(y_train) #one-hot encode the 0/1 labels
  x_train = x_train.drop('patient_id', axis=1) #Drops/deletes patient_id column
  
  return folds, x_train, y_train

k = 8
folds, x_train, y_train = load_data_kfold(k)
num_labels=2
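
To illustrate what the stratified split does, here is a small sketch on synthetic labels. The 180-sample size and the 100/80 class split are assumptions made for the example, and `toy_X` and `toy_y` are placeholder names. Only the `StratifiedKFold` settings are copied from the cell above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Assumed toy labels: 180 samples with a 100/80 class split (illustrative only).
toy_y = np.array([0] * 100 + [1] * 80)
toy_X = np.zeros((len(toy_y), 1))  # feature values do not affect the split

skf = StratifiedKFold(n_splits=8, shuffle=True, random_state=1)

val_sizes, val_positives = [], []
for fold, (train_idx, val_idx) in enumerate(skf.split(toy_X, toy_y)):
    val_sizes.append(len(val_idx))
    val_positives.append(int(toy_y[val_idx].sum()))
    print(fold, len(train_idx), len(val_idx), val_positives[-1])
```

Each validation fold ends up with 22 or 23 samples and the same number of positive cases, which is the property that makes fold-to-fold scores comparable.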

**Build the Deep Learning Model**

*   Set random seeds to constant values for reproducibility.

*   A 3-layer MLP was chosen because fully connected networks are better suited to classifying tabular data than CNNs or RNNs, which target image and sequential data respectively.

*   Batch size = 128, chosen to achieve a sharper but lower log loss.

*   L2 norm is applied as a regularizer, and unit norm as a constraint.

*   ReLU is the activation function for the hidden layers, softmax for the last layer.

*   Main reference for customizing the Adam optimizer: [here](https://www.kaggle.com/jasontsmith2718/predicting-heart-disease?fbclid=IwAR121u3BDOoqiwx-vSuFFiNJPP6VdfpYIYu1c61eM_mEe4gD4V1WD6fUz_s)

*   Introducing BatchNorm + Dropout made the model more accurate but less confident in its predictions (higher accuracy, but higher log loss), so they are not used in the final model.

*   Introducing weight initializers (He and Glorot in particular) gave a flatter loss at convergence but a slightly higher loss overall, so they were not used.
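
The trade-off between accuracy and log loss mentioned above can be reproduced on a tiny example. The sketch below uses made-up predictions, not model output. `sharp` and `hedged` are hypothetical names for a confident model with one mild mistake and a cautious model with no mistakes.

```python
import numpy as np

def log_loss(y_true, p_pos, eps=1e-15):
    """Mean binary log loss for 0/1 labels and predicted P(class 1)."""
    p = np.clip(np.asarray(p_pos, dtype=float), eps, 1 - eps)
    y = np.asarray(y_true, dtype=float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def accuracy(y_true, p_pos, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    preds = (np.asarray(p_pos) >= threshold).astype(int)
    return float(np.mean(preds == np.asarray(y_true)))

y_true = np.array([1, 0, 1, 0])
sharp = np.array([0.95, 0.05, 0.90, 0.55])   # one mild mistake, otherwise confident
hedged = np.array([0.60, 0.40, 0.65, 0.35])  # all correct, but close to 0.5

for name, p in [('sharp', sharp), ('hedged', hedged)]:
    print(name, accuracy(y_true, p), round(log_loss(y_true, p), 4))
```

The cautious predictions score higher on accuracy yet worse on log loss, which is why a competition scored on log loss can favor the less accurate model.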

In [0]:
#Reproducibility Seeds
np.random.seed(5318)
from tensorflow import set_random_seed
set_random_seed(5318)

batch_size = 128
epochs = 20

def build_model():
  #Network Parameters
  input_dim = (x_train.shape[1],) #required to be a tuple

  kreg = l2(0.0001)

  #Build Model
  inputs = Input(shape=input_dim)
  #In the functional API the input shape is inferred from `inputs`,
  #so the Dense layers do not need an input_dim argument.
  y = Dense(32,
            activation='relu',
            kernel_regularizer=kreg,
            kernel_constraint=unit_norm())(inputs)
  
  y = Dense(16,
            activation='relu',
            kernel_regularizer=kreg)(y)
  
  outputs = Dense(num_labels, activation='softmax',
                 kernel_regularizer=kreg,
                 kernel_constraint=unit_norm())(y)
  opt = Adam(lr=0.0001, 
             beta_1=0.9, 
             beta_2=0.999, 
             epsilon=1e-7, 
             decay=0.0, 
             amsgrad=False)
  model = Model(inputs=inputs, outputs=outputs)
  model.compile(loss='binary_crossentropy',
                optimizer=opt,
                metrics=['accuracy'])
  
  return model

model = build_model()
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_7 (InputLayer)         (None, 22)                0         
_________________________________________________________________
dense_19 (Dense)             (None, 32)                736       
_________________________________________________________________
dense_20 (Dense)             (None, 16)                528       
_________________________________________________________________
dense_21 (Dense)             (None, 2)                 34        
Total params: 1,298
Trainable params: 1,298
Non-trainable params: 0
_________________________________________________________________
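
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-output pair plus one bias per output. A short sketch of that arithmetic, using the layer widths shown in the summary (`dense_params` is a helper name introduced here):

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Input width followed by the units of each Dense layer, from the summary.
layer_sizes = [22, 32, 16, 2]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(counts, sum(counts))  # [736, 528, 34] 1298
```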


**Prepare Callbacks**

*   Uses Model Checkpoints to save the model at the epoch with the lowest validation loss. 
*   Uses Learning Rate Scheduler to tweak the learning rate at specific epochs.

In [0]:
#define LR scheduler
def scheduler(epoch):
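  #Keras passes a zero-based epoch index to this function
  if epoch <= 9:
    new_lr = 0.009962   #epochs 1-10: large steps
  elif epoch <= 11:
    new_lr = 0.0052     #epochs 11-12: intermediate steps
  else:
    new_lr = 0.0001     #epochs 13 onward: small steps
    
  return new_lr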
  

#checkpoint saves the model with the minimum val_loss
def get_callbacks(name_weights, patience_lr):
    mcp_save = ModelCheckpoint(name_weights, save_best_only=True, monitor='val_loss', mode='min', save_weights_only=False)
    lrate = LearningRateScheduler(scheduler)
    #reduce_lr_loss = ReduceLROnPlateau(monitor='loss', factor=0.1, patience=patience_lr, verbose=1, epsilon=1e-4, mode='min')
    return [mcp_save, lrate]
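
To make the schedule easier to read, the sketch below tabulates the learning rate for each of the 20 epochs. `lr_for_epoch` is a standalone restatement of `scheduler` so the block runs on its own. It assumes the zero-based epoch index that Keras passes to a `LearningRateScheduler` function.

```python
def lr_for_epoch(epoch):
    """Piecewise-constant schedule, restated from `scheduler` above."""
    if epoch <= 9:
        return 0.009962
    if epoch <= 11:
        return 0.0052
    return 0.0001

# epoch is zero-based, so epoch + 1 matches the "Epoch n/20" lines in the log.
schedule = [lr_for_epoch(epoch) for epoch in range(20)]
for epoch, lr in enumerate(schedule):
    print(epoch + 1, lr)
```

Because the callback sets the optimizer's learning rate at the start of every epoch, the `lr=0.0001` passed to `Adam` in `build_model` only matters from the thirteenth epoch onward, where the schedule returns the same value.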

**Train the Model (3-Layer MLP) with 8-Fold CV**

*  From previous training with 8-fold CV, it was determined that **Fold 3 returns the most accurate models**, so **training in this part is done only on the Fold 3 split**.

*  The learning rate is set large for the first few epochs to skip bad local minima, then reduced later so the model converges to good, sharp minima with lower loss.

*  ModelCheckpoint is set to save the model at the epoch where the validation loss is lowest.

*  The model is trained on 157 samples and validated on only 23 samples.

*  For better generalization, it was observed that the training loss should be around 0.32 log loss while the validation loss should be below 0.2. Training beyond this point appears to overfit the training data, and the competition score gets worse.


In [0]:
kfold_summary = {} #Save results of k-fold validation for each fold here
for j, (train_idx, val_idx) in enumerate(folds):
    
    #If not Fold 3, Skip.
    if j != 3:
      continue
    
    print('\nFold ',j)
    #since x_train is a pandas dataframe, to access its folds, "df.iloc[]" is required
    x_train_cv = x_train.iloc[train_idx] 
    y_train_cv = y_train[train_idx]
    x_valid_cv = x_train.iloc[val_idx]
    y_valid_cv= y_train[val_idx]
    
    name_weights = "final_model_fold" + str(j) + ".h5"
    callbacks = get_callbacks(name_weights=name_weights, patience_lr=2)
    model = build_model()

    history = model.fit(x_train_cv, y_train_cv,
                        validation_data=(x_valid_cv, y_valid_cv),
                        epochs=epochs,
                        batch_size=batch_size, 
                        callbacks=callbacks)
    
    kth_eval = model.evaluate(x_valid_cv, y_valid_cv)
    print(kth_eval)
    kfold_summary[j] = kth_eval
    


Fold  3
Train on 157 samples, validate on 23 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
[0.17539602518081665, 0.95652174949646]
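
With only 23 validation samples, the reported accuracy moves in coarse steps. A quick back-of-the-envelope check, using only the numbers printed in the log above:

```python
n_val = 23                    # validation samples in fold 3, from the training log
val_acc = 0.95652174949646    # accuracy printed by model.evaluate

step = 1 / n_val                  # accuracy change caused by a single sample
correct = round(val_acc * n_val)  # number of correctly classified samples

print(round(step, 4), correct, n_val - correct)
```

A single misclassified patient shifts the validation accuracy by roughly four percentage points, so differences between folds of this size should be read with some caution.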


**Evaluating the Model and Submission**

*  Evaluate the model on an approximation of the competition's test data.
*  Store the probability predictions in a .csv file for submission.
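
As a rough sketch of the second step, the snippet below builds a submission table from a softmax output and runs two basic sanity checks. The probabilities and patient ids are invented stand-ins, and `toy_proba`, `toy_ids` and `toy_submission` are hypothetical names. Column 1 of the softmax output is taken as the probability of the positive class, which follows from one-hot encoding 0/1 labels.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for model.predict output and patient ids (illustrative only).
toy_proba = np.array([[0.40, 0.60],
                      [0.90, 0.10],
                      [0.05, 0.95]])  # rows: [P(absent), P(present)]
toy_ids = ['a1', 'b2', 'c3']

toy_submission = pd.DataFrame({
    'patient_id': toy_ids,
    'heart_disease_present': toy_proba[:, 1],  # keep only P(class 1)
})

# Basic checks before writing the file.
assert toy_submission['heart_disease_present'].between(0, 1).all()
assert toy_submission['patient_id'].is_unique

# to_csv without a path returns the CSV text instead of writing a file.
csv_text = toy_submission.to_csv(index=False)
print(csv_text)
```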

In [0]:
#Loads Final Test Data used on competition evaluation
raw_z_test = pd.read_csv('../test_values_normalized.csv', index_col=0)
z_test = raw_z_test.drop('patient_id', axis=1) #Drops/deletes patient_id column
test_labels_url = 'https://raw.githubusercontent.com/henritomas/CoE197-Z-Tomas-DL-Experiments/master/test-labels.csv'
z_labels = pd.read_csv(test_labels_url, index_col=0)
z_labels = to_categorical(z_labels)

#Loads model at its best point and evaluates it on Final Test Data
model = load_model('final_model_fold3.h5')
  
scores = model.evaluate(z_test, z_labels, batch_size=batch_size)
print("val_loss: {} val_acc: {}".format(scores[0], scores[1]))

#Takes the model's predicted probability of heart disease presence
final_proba = model.predict(z_test)
hd_present_proba = [prob[1] for prob in final_proba]
print(hd_present_proba)

#Stores probability associated with each patient in a csv for submission
submission = pd.DataFrame({'heart_disease_present': hd_present_proba,
                            'patient_id': raw_z_test.patient_id.values})
submission = submission[['patient_id', 'heart_disease_present']]
submission.to_csv("my_submission.csv", index=False)

val_loss: 0.2834474444389343 val_acc: 0.8888888955116272
[0.6029585, 0.06941914, 0.95167613, 0.013983928, 0.95183206, 0.018221926, 0.14276318, 0.96516067, 0.16590169, 0.060598593, 0.16051987, 0.5993903, 0.3471231, 0.97468376, 0.12118039, 0.044034027, 0.009959205, 0.0302823, 0.9356648, 0.030572087, 0.9351477, 0.19196522, 0.2217533, 0.06961808, 0.46480715, 0.95400065, 0.10646738, 0.21047239, 0.6457642, 0.013986788, 0.9526828, 0.47091955, 0.8025575, 0.5152754, 0.23172657, 0.054696035, 0.34962296, 0.057299275, 0.13667071, 0.052689783, 0.9755053, 0.022186643, 0.94820076, 0.053005967, 0.94574356, 0.041678105, 0.102881394, 0.123889275, 0.16437013, 0.83646, 0.6097066, 0.02274888, 0.9915555, 0.0630765, 0.5575688, 0.06162806, 0.9090783, 0.06646026, 0.08949696, 0.5786575, 0.07392717, 0.9763367, 0.07060773, 0.9826621, 0.07142755, 0.8949409, 0.8755535, 0.61311483, 0.9055527, 0.8438389, 0.09458623, 0.9835993, 0.97225404, 0.988846, 0.99634224, 0.9802633, 0.9610296, 0.95830256, 0.17307587, 0.51778984,