### Heart Disease Predictions

**In this notebook we will create three different classification models, train them, test them on unseen data, and compare their performance.** \
The three models that we will use are a Support Vector Machine, a Decision Tree, and K-nearest neighbors.

The models and the data split are implemented with [scikit-learn](https://scikit-learn.org/stable/index.html).

In [2]:
import pandas as pd
import numpy as np
from sklearn import svm
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score, f1_score, precision_score, recall_score

In [3]:
# Let's first import the clean data
df_heart = pd.read_csv('data/data_clean.csv')
df_heart.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0
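
The rows above contain string-valued columns (`Sex`, `ChestPainType`, `RestingECG`, `ExerciseAngina`, `ST_Slope`) that have to be turned into numbers before a classifier can use them. As a minimal sketch of what one-hot encoding with `drop='first'` does, the toy frame below (a hypothetical `toy` DataFrame, not the notebook's data) reuses three `ChestPainType` values from the table:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame reusing three ChestPainType values from the rows above
toy = pd.DataFrame({'ChestPainType': ['ATA', 'NAP', 'ASY', 'ATA']})

# drop='first' removes the first category (in sorted order) to avoid a redundant column
encoder = OneHotEncoder(drop='first')
encoded = encoder.fit_transform(toy).toarray()

print(encoder.categories_[0])  # categories are sorted: ASY, ATA, NAP
print(encoded)                 # two columns (ATA, NAP); ASY is encoded as all zeros
```

With three categories, only two indicator columns remain, and the dropped category is represented implicitly by a row of zeros.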


In [4]:
# Let's now prepare the data so that it can be fed to the classifiers
categorical_columns = df_heart.select_dtypes(include=[object]).columns.values.tolist()
numerical_columns = [column for column in df_heart.columns.values.tolist() if column not in categorical_columns][:-1]

# Our data has several string-valued categorical features, so we one-hot encode them
oh_encoder = OneHotEncoder(drop='first')
categ_features_encoded = oh_encoder.fit_transform(df_heart[categorical_columns]).toarray()


# First we need to turn the dataset into features and labels and then both into numpy arrays
heart_data = np.hstack((df_heart[numerical_columns].to_numpy(), categ_features_encoded))
heart_labels = df_heart['HeartDisease'].to_numpy()

# Now we split the data into training, validation and test sets.
# The test set is 30% of the 30% hold-out, so the split is roughly 70-21-9
x_train, x_test, y_train, y_test = train_test_split(heart_data, heart_labels, test_size=0.3, random_state=1)
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=0.3, random_state=1)

# Now let's check the shapes of each set
print('Training set has ' + str(x_train.shape) + ' examples with ' + str(y_train.shape) + ' labels')
print('Validation set has ' + str(x_val.shape) + ' examples with ' + str(y_val.shape) + ' labels')
print('Test set has ' + str(x_test.shape) + ' examples with ' + str(y_test.shape) + ' labels')

Training set has (522, 15) examples with (522,) labels
Validation set has (156, 15) examples with (156,) labels
Test set has (68, 15) examples with (68,) labels
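
The sizes printed above follow from chaining two calls to `train_test_split`: the second call takes 30% of the 30% hold-out, which is about 9% of the full dataset. The sketch below reproduces the arithmetic on placeholder data with the same number of rows (746); `dummy_x`, `dummy_y` and `split_sizes` are hypothetical names used only for illustration. It also shows that a second `test_size` of `1/3` would land closer to 70-20-10.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the notebook's dataset
dummy_x = np.arange(746).reshape(-1, 1)
dummy_y = np.zeros(746, dtype=int)

def split_sizes(second_test_size):
    """Return (train, validation, test) sizes for two chained splits."""
    x_tr, x_hold, y_tr, y_hold = train_test_split(
        dummy_x, dummy_y, test_size=0.3, random_state=1)
    x_va, x_te, y_va, y_te = train_test_split(
        x_hold, y_hold, test_size=second_test_size, random_state=1)
    return len(x_tr), len(x_va), len(x_te)

print(split_sizes(0.3))    # same sizes as printed by the notebook
print(split_sizes(1 / 3))  # closer to a 70-20-10 split
```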


In [5]:
# Function which receives hyperparameters, a training set and data for prediction.
# It initializes an SVM model with the given hyperparameters, fits it to the training data
# and predicts labels for the prediction data
def fit_predict_svm(hyper_params, x_train, y_train, x_predict):
    svm_clf = svm.SVC(C=hyper_params[0], kernel=hyper_params[1])  # Initialize the model
    svm_clf.fit(x_train, y_train)  # Train the model using the training data
    predictions = svm_clf.predict(x_predict)  # Predict labels using trained model
    
    return predictions


# Function which receives true labels, predictions and an evaluation metric and returns the model's score on that metric
def evaluation_score(y, predictions, metric='acc'):
    if metric == 'acc':
        return accuracy_score(y, predictions)
    elif metric == 'f1':
        return f1_score(y, predictions)
    elif metric == 'precision':
        return precision_score(y, predictions)
    else:
        return recall_score(y, predictions)
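
Note that `evaluation_score` falls back to recall for any metric name it does not recognise, so a misspelled name is silently scored as recall. One possible alternative, sketched here under the assumption that failing loudly is preferable, is a lookup table that raises on unknown names. `METRICS` and `evaluation_score_strict` are hypothetical names, not part of the notebook.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical lookup table mapping metric names to scikit-learn scoring functions
METRICS = {
    'acc': accuracy_score,
    'f1': f1_score,
    'precision': precision_score,
    'recall': recall_score,
}

def evaluation_score_strict(y, predictions, metric='acc'):
    # Reject unknown names instead of silently falling back to another metric
    if metric not in METRICS:
        raise ValueError(f"Unknown metric {metric!r}; expected one of {sorted(METRICS)}")
    return METRICS[metric](y, predictions)

y_true = [1, 1, 1, 0]
y_pred = [1, 0, 1, 0]
print(evaluation_score_strict(y_true, y_pred, 'recall'))     # 2 of 3 positives are found
print(evaluation_score_strict(y_true, y_pred, 'precision'))  # both predicted positives are correct
```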

In [6]:
# Now that the data has been processed and split, let's create our model, fit it to the training data and test it
# We will begin with the SVM model

# We begin with a hyperparameter search to find the best performing config for this dataset
best_score = 0
for reg in [0.1, 0.3, 0.6, 1, 1.5, 2]:  # Values for the regularization parameter C
    for kernl in ['linear', 'poly', 'rbf']:  # Different kernels that can be used
        val_predictions = fit_predict_svm((reg, kernl), x_train, y_train, x_val)
        
        score = evaluation_score(y_val, val_predictions, 'recall')  # We use recall so that false negatives (missed disease cases) are penalised
        
        if score > best_score:
            best_hp = (reg, kernl)
            best_score = score

print('This is the best svm config (regularization parameter, kernel):', best_hp)
print()
print('Below the test performance of the above config can be seen:')
print(classification_report(y_test, fit_predict_svm(best_hp, x_train, y_train, x_test), 
                            target_names = ['No Disease', 'Heart Disease']))

This is the best svm config (regularization parameter, kernel): (0.1, 'linear')

Below the test performance of the above config can be seen:
               precision    recall  f1-score   support

   No Disease       0.87      0.81      0.84        32
Heart Disease       0.84      0.89      0.86        36

     accuracy                           0.85        68
    macro avg       0.85      0.85      0.85        68
 weighted avg       0.85      0.85      0.85        68
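
The search above scores each configuration on a single validation set. A different technique, not the one used in this notebook, is scikit-learn's `GridSearchCV`, which scores the same grid with cross-validation. The sketch below runs it on synthetic data from `make_classification` (the names `x_demo` and `y_demo` are placeholders), so its result says nothing about the heart disease dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data; not the heart disease dataset
x_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=1)

# Same grid as the manual loop: regularization parameter C and kernel
param_grid = {
    'C': [0.1, 0.3, 0.6, 1, 1.5, 2],
    'kernel': ['linear', 'poly', 'rbf'],
}

# 5-fold cross-validation, selecting the configuration with the highest mean recall
search = GridSearchCV(SVC(), param_grid, scoring='recall', cv=5)
search.fit(x_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

Cross-validation averages over several folds, which makes the selected configuration less dependent on one particular validation split, at the cost of more model fits.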



In [7]:
# We will now train and predict with the decision tree model

# Function which receives hyperparameters, a training set and data for prediction.
# It initializes a decision tree model with the given hyperparameters, fits it to the training data
# and predicts labels for the prediction data
def fit_predict_dtree(hyper_params, x_train, y_train, x_predict):
    tree_clf = DecisionTreeClassifier(random_state=hyper_params[0])  # Initialize the model
    tree_clf.fit(x_train, y_train)  # Train the model using the training data
    predictions = tree_clf.predict(x_predict)  # Predict labels using trained model
    
    return predictions

# For the decision tree we only fix random_state and keep the remaining hyperparameters at their defaults,
# thus we won't perform an automated hyperparameter search
print('Below the test performance of the decision tree can be seen:')
print(classification_report(y_test, fit_predict_dtree((0,), x_train, y_train, x_test), 
                            target_names = ['No Disease', 'Heart Disease']))

Below the test performance of the decision tree can be seen:
               precision    recall  f1-score   support

   No Disease       0.73      0.75      0.74        32
Heart Disease       0.77      0.75      0.76        36

     accuracy                           0.75        68
    macro avg       0.75      0.75      0.75        68
 weighted avg       0.75      0.75      0.75        68



In [31]:
# We will now train and make predictions using the K-nearest neighbors model

# Function which receives hyperparameters, a training set and data for prediction.
# It initializes a K-nearest neighbors model with the given hyperparameters, fits it to the training data
# and predicts labels for the prediction data
def fit_predict_knn(hyper_params, x_train, y_train, x_predict):
    knn_clf = KNeighborsClassifier(n_neighbors=hyper_params[0], weights=hyper_params[1])  # Initialize the model
    knn_clf.fit(x_train, y_train)  # Train the model using the training data
    predictions = knn_clf.predict(x_predict)  # Predict labels using trained model
    
    return predictions

# Again we will not do an automatic hyperparameter search; n_neighbors=9 and weights='distance' are set by hand
print('Below the test performance of the nearest neighbor classifier can be seen:')
print(classification_report(y_test, fit_predict_knn((9,'distance'), x_train, y_train, x_test), 
                            target_names = ['No Disease', 'Heart Disease']))

Below the test performance of the nearest neighbor classifier can be seen:
               precision    recall  f1-score   support

   No Disease       0.71      0.84      0.77        32
Heart Disease       0.83      0.69      0.76        36

     accuracy                           0.76        68
    macro avg       0.77      0.77      0.76        68
 weighted avg       0.78      0.76      0.76        68
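
The K-nearest neighbors hyperparameters above were set by hand. If a search were wanted, the validation loop used for the SVM could be adapted along these lines. This is only a sketch on synthetic data, and the candidate values for `n_neighbors` are assumptions, not values taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; not the heart disease dataset
x_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)
x_tr, x_va, y_tr, y_va = train_test_split(x_demo, y_demo, test_size=0.3, random_state=1)

best_knn_score, best_knn_hp = -1.0, None
for k in [3, 5, 7, 9, 11]:                     # assumed candidate neighbor counts
    for weighting in ['uniform', 'distance']:  # the two built-in weighting schemes
        knn = KNeighborsClassifier(n_neighbors=k, weights=weighting)
        knn.fit(x_tr, y_tr)
        score = recall_score(y_va, knn.predict(x_va))
        if score > best_knn_score:             # keep the first configuration with the highest recall
            best_knn_score, best_knn_hp = score, (k, weighting)

print('Best KNN config (n_neighbors, weights):', best_knn_hp)
```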

