## Introduction 

Medical and healthcare problems are a promising application area for machine learning. In this final project, I used Python and machine learning algorithms from the scikit-learn and XGBoost libraries to predict heart disease from various markers in patient health records. 

## The dataset 

I obtained my data from the "Heart Disease UCI" database on Kaggle. 

The database contains the health information for 303 patients, some with heart disease and some without.  It also includes the values of 13 different markers of cardiovascular health for the patients.  My goal was to train ML models on part of this dataset, and subsequently predict whether patients in the remaining data had heart disease or not based on the 13 markers of health. Such a model could help doctors predict heart disease in patients and support early detection of the disease. 

## Importing and processing the data

First, I import the relevant packages and load the data using pandas: 

In [27]:
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  

data = pd.read_csv("heart.csv")

After loading the data, we can use a few convenient functions to explore it. 

First, using data.head(), we can peek at the contents of the data.  We can see that the first 13 columns contain various pieces of health-related data about each of the patients.  The 14th column contains a 0 or 1: a 1 if the patient has heart disease or a 0 if the patient does not. 

The 13 features in this dataset are:  
- age 
- sex: (1 = male, 0 = female)
- cp: Chest pain experienced (1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic)
- trestbps: Resting blood pressure, in mm Hg
- chol: Cholesterol level, in mg/dl 
- fbs: Fasting blood sugar (Is > 120 mg/dl?, 1 = true; 0 = false)
- restecg: Resting electrocardiographic measurement (0 = normal, 1 = ST-T wave abnormality, 2 = probable or definite left ventricular hypertrophy by Estes' criteria)
- thalach: Maximum heart rate achieved
- exang: Exercise induced angina (1 = yes; 0 = no)
- oldpeak: ST depression induced by exercise relative to rest ('ST' relates to positions on the ECG plot)
- slope: Slope of the peak exercise ST segment (1: upsloping, 2: flat, 3: downsloping)
- ca: Number of major vessels (0-3)
- thal: Blood disorder called thalassemia (3 = normal, 6 = fixed defect, 7 = reversible defect) 

In [28]:
data.head()

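|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |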


Next, using data.shape, we can obtain the shape of this data matrix.

The dimensions turn out to be 303 x 14.  So, we have 303 rows and 14 columns.  The 303 rows indicate that there are 303 patients in the dataset, and the 14 columns represent the 13 features and 1 target (heart disease?). 

In [29]:
data.shape

(303, 14)

Next, we have to divide the data into features and the target.  I accomplish this by storing the data columns containing the features to train the model on in the variable X, and storing the data column containing the target (the last column) in the variable y.

In [30]:
X = data.drop('target', axis=1)  
y = data['target']

Next, we have to split the data into a training set to build the model, and a testing set to evaluate the model.  The scikit-learn package contains a useful function, train_test_split, which can be used to accomplish this splitting.  As shown below, I reserve 80% of the data for training purposes, and 20% for testing. 

In [31]:
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)  
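
The call above draws a new random split every time it runs. As a hedged sketch (not the code used for the results in this post), train_test_split can also be given a fixed `random_state` so the split is reproducible, and `stratify` so both parts keep the same class ratio. The arrays `X_demo` and `y_demo` and the class counts below are synthetic stand-ins, not the heart data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart data: 303 rows, 13 features, binary target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(303, 13))
y_demo = np.array([1] * 165 + [0] * 138)  # hypothetical class counts

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)    # sizes of the training and testing parts
print(y_tr.mean(), y_te.mean())  # fraction of class 1 in each part
```

With a fixed seed, rerunning the notebook would evaluate every model on the same 61 test rows, which makes runs comparable.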

Now, we have preprocessed the data and are ready to build and run machine learning models on it!

## Model 1: Support Vector Machine (SVM)

The first model I used is Support Vector Machine, or SVM. SVM is a commonly used supervised machine learning classification algorithm.  

To build an SVM model, I first import the SVC class from the scikit-learn package.  I am using the SVC (support vector classifier) class because I am performing a classification task.  The SVC constructor accepts several parameters; the only one I set here is the kernel type. I set it to "linear", which makes the model learn a linear decision boundary (a hyperplane) between the two classes. Other kernels, such as "rbf", allow non-linear boundaries. 

To train the algorithm on the training data, I call the fit method of the SVC class. 

In [40]:
from sklearn.svm import SVC  
svclassifier = SVC(kernel='linear')  
svclassifier.fit(X_train, y_train)  

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
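
To make the meaning of a linear kernel concrete, here is a minimal sketch on made-up 2-D points (not the heart data). With `kernel='linear'`, the fitted classifier exposes a weight vector `coef_` and an offset `intercept_`, and its decision score for a point is simply w . x + b. The names `X_demo`, `y_demo`, and `clf` are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D points (made up for illustration).
X_demo = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0],
                   [3.0, 3.0], [4.0, 3.5], [3.5, 4.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")
clf.fit(X_demo, y_demo)

# For a linear kernel, the decision score is w . x + b.
w = clf.coef_[0]
b = clf.intercept_[0]
manual_scores = X_demo @ w + b

# The hand-computed scores should match the library's decision_function.
print(np.allclose(manual_scores, clf.decision_function(X_demo)))
print(clf.predict(X_demo))  # positive score -> class 1, negative -> class 0
```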

Now that we've trained the model using the training data, we can use the model to predict heart disease in patients in the testing data.  I do this by calling the predict method of the SVC class.

In [33]:
y_pred = svclassifier.predict(X_test)

Finally, we can output a confusion matrix -- a table that is commonly used to evaluate the model's performance. 

In the confusion matrix, the first row represents the results for the class 0 (patients who do NOT have heart disease). The left value is the # of patients the model classified as 0 (not diseased), and the right value is the # of patients the model classified as 1 (diseased).  

The second row represents the results for class 1 (patients who DO have heart disease). The left value is the # of patients the model classified as 0 (not diseased), and the right value is the # of patients the model classified as 1 (diseased). 

We can see that overall, the model did a decent job.  The testing set consists of 61 patients (20% of 303).  Out of the 26 patients without heart disease, the model classified 18 of them (69%) correctly as having no disease.  Out of the 35 patients with heart disease, the model classified 34 of them (97.1%) correctly as having the disease. 
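
The row and column convention described above can be sketched by tallying a confusion matrix by hand. The labels below are made up for illustration; the layout (rows = true class, columns = predicted class) follows the description in the text.

```python
# Toy labels (made up for illustration): 1 = diseased, 0 = not diseased.
y_true_demo = [0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 1, 0, 1, 1, 0, 1]

# matrix[true_class][predicted_class] counts patients in each cell.
matrix = [[0, 0], [0, 0]]
for true_label, predicted_label in zip(y_true_demo, y_pred_demo):
    matrix[true_label][predicted_label] += 1

for row in matrix:
    print(row)
```

Here the first printed row covers the three truly healthy patients and the second row covers the four truly diseased patients.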

Scikit-learn's metrics module also contains a function to output several formal metrics to evaluate the model: precision, recall, and f1-score.  

What each of these metrics means: 

- *Precision*: Out of patients labeled as having/not having the disease, what percent of them actually had/didn't have the disease? 

This uses the left column for 0 and the right column for 1. 

For 0: 18/19, For 1: 34/42 

- *Recall*: Out of patients with/without the disease, what percent of them were classified correctly? 

This uses the top row for 0 and the bottom row for 1. 

For 0: 18/26, For 1: 34/35 

- *f1-score*:  Harmonic mean of the precision and recall.  F1-score = 2 * (Recall * Precision) / (Recall + Precision) 
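
As a worked check of the three definitions above, the sketch below recomputes them from the confusion matrix values reported for this SVM run (18, 8, 1, 34). The helper name `class_metrics` is my own, not part of scikit-learn.

```python
# Confusion matrix from the SVM run: rows = true class, columns = predicted class.
cm = [[18, 8],
      [1, 34]]

def class_metrics(cm, k):
    """Return (precision, recall, f1) for class k of a square confusion matrix."""
    correct = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column total: everyone labeled k
    actual_k = sum(cm[k])                    # row total: everyone truly in class k
    precision = correct / predicted_k
    recall = correct / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for k in (0, 1):
    p, r, f = class_metrics(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The printed values should agree with the per-class rows of the classification report for the same run.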

In [34]:
from sklearn.metrics import classification_report, confusion_matrix  
print(confusion_matrix(y_test,y_pred))  
print(classification_report(y_test,y_pred))

[[18  8]
 [ 1 34]]
              precision    recall  f1-score   support

           0       0.95      0.69      0.80        26
           1       0.81      0.97      0.88        35

   micro avg       0.85      0.85      0.85        61
   macro avg       0.88      0.83      0.84        61
weighted avg       0.87      0.85      0.85        61



## Model 2: Logistic Regression

The next model I used is Logistic Regression, another common machine learning algorithm. It passes a weighted sum of the features through the sigmoid function to produce a probability between 0 and 1, which is then thresholded to give a 0 or 1 classification.    
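
The sigmoid function mentioned above can be sketched in a few lines. This is an illustration of the idea only: the score `z` stands for the model's weighted sum of features, and the 0.5 threshold and the function names are assumptions for the example, not values taken from the fitted model.

```python
import math

def sigmoid(z):
    """Map any real-valued score z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(z, threshold=0.5):
    """Turn a score into a 0/1 label by thresholding its sigmoid."""
    return 1 if sigmoid(z) >= threshold else 0

for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), predict_label(z))
```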

To build a logistic regression model, I first import the LogisticRegression class from the scikit-learn library.  Then, I train the algorithm on the training data by calling the fit() method of the LogisticRegression class.  

Finally, I use the model to make predictions on the testing set by calling the predict() method of the LogisticRegression class.

In [35]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train,y_train)

y_pred2 =logreg.predict(X_test) 



As with SVM, I printed a confusion matrix and classification report to evaluate the performance of the Logistic Regression model. As a reminder, this is how each of the metrics was calculated:

- Precision: Out of patients labeled as having/not having the disease, what percent of them actually had/didn't have the disease?

This uses the left column for 0 and the right column for 1.

For 0: 18/18, For 1: 35/43 

- Recall: Out of patients with/without the disease, what percent of them were classified correctly?

This uses the top row for 0 and the bottom row for 1.

For 0: 18/26, For 1: 35/35  

- f1-score: Harmonic mean of the precision and recall. F1-score = 2 * (Recall * Precision) / (Recall + Precision)

In [36]:
from sklearn.metrics import classification_report, confusion_matrix  
print(confusion_matrix(y_test,y_pred2))  
print(classification_report(y_test,y_pred2))

[[18  8]
 [ 0 35]]
              precision    recall  f1-score   support

           0       1.00      0.69      0.82        26
           1       0.81      1.00      0.90        35

   micro avg       0.87      0.87      0.87        61
   macro avg       0.91      0.85      0.86        61
weighted avg       0.89      0.87      0.86        61



## Model 3: XGBoost (Tree Boosting)

The final model I used was XGBoost, an algorithm that predicts the target using an ensemble of decision trees. XGBoost builds the trees with gradient boosting: each new tree is fit to correct the errors of the trees before it, and the library is engineered for speed and performance. 
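
To give a feel for the boosting idea, here is a deliberately simplified sketch: one-split trees ("stumps") are fit one after another to the remaining errors under squared-error loss, on made-up 1-D data. This is not XGBoost's actual algorithm, which adds regularization and second-order gradient information, and all names here are illustrative.

```python
# Toy 1-D data (made up): feature values and 0/1 targets.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

learning_rate = 0.5
preds = [sum(ys) / len(ys)] * len(xs)  # start from the mean target
stumps = []
for _ in range(10):
    # Each round fits a new stump to what the current ensemble still gets wrong.
    residuals = [y - p for y, p in zip(ys, preds)]
    stump = fit_stump(xs, residuals)
    stumps.append(stump)
    preds = [p + learning_rate * stump_predict(stump, x)
             for p, x in zip(preds, xs)]

labels = [1 if p >= 0.5 else 0 for p in preds]
print(labels)
```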



In [37]:
from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(X_train,y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

Next, I make predictions: 

In [38]:
y_pred3 = model.predict(X_test)

Finally, the results:  

- Precision: Out of patients labeled as having/not having the disease, what percent of them actually had/didn't have the disease?

This uses the left column for 0 and the right column for 1.

For 0: 20/25, For 1: 30/36 

- Recall: Out of patients with/without the disease, what percent of them were classified correctly?

This uses the top row for 0 and the bottom row for 1.

For 0: 20/26, For 1: 30/35 

- f1-score: Harmonic mean of the precision and recall. F1-score = 2 * (Recall * Precision) / (Recall + Precision)

In [39]:
from sklearn.metrics import classification_report, confusion_matrix  
print(confusion_matrix(y_test,y_pred3))  
print(classification_report(y_test,y_pred3))

[[20  6]
 [ 5 30]]
              precision    recall  f1-score   support

           0       0.80      0.77      0.78        26
           1       0.83      0.86      0.85        35

   micro avg       0.82      0.82      0.82        61
   macro avg       0.82      0.81      0.81        61
weighted avg       0.82      0.82      0.82        61



## Conclusions 

The 3 algorithms I tried -- SVM, Logistic Regression, and XGBoost -- all performed well at predicting heart disease from the 13 features in the dataset after being trained on 80% of the 303 total patient cases. 

Comparison between the algorithms' performances: 
- We can compare the f1-scores of the 3 models, as the f1-score is a good "catch-all" metric to evaluate the performance of a model.  
- SVM had an f1-score of 80% for 0 and 88% for 1 
- LogReg had an f1-score of 82% for 0 and 90% for 1 
- XGBoost had an f1-score of 78% for 0 and 85% for 1 

Overall, all 3 models had quite high f1-scores, indicating strong predictive ability.  Interestingly, all 3 models had higher f1-scores for patients who had the disease than for patients who did not.  Finally, although in this run Logistic Regression performed better than SVM, which performed better than XGBoost, I can't read too much into these results.  Each time I ran the 3 algorithms, train_test_split drew a different random 80/20 split, and with only 61 test patients the rankings changed pretty much every time -- no model consistently outperformed the others. 
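
Because the rankings change from run to run, one possible follow-up (not something done in this project) is to average the f1-score over several train/test splits with k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for heart.csv, and the choice of 5 folds and `max_iter=1000` are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the heart data (303 x 13).
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

# Five stratified folds: every row is used for testing exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "SVM (linear)": SVC(kernel="linear"),
    "LogReg": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1")
    results[name] = scores
    print(f"{name}: mean f1 = {scores.mean():.2f} (+/- {scores.std():.2f})")
```

Reporting a mean and spread across folds gives a steadier basis for comparing models than a single 61-patient test set.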