This notebook uses pandas, NumPy, and scikit-learn to examine the public UCI heart disease dataset available from http://archive.ics.uci.edu/ml/datasets/heart+disease.


We will test how well the following classifiers predict the presence of heart disease from this dataset:
* Logistic Regression
* Random Forest
* Decision Trees
* Support Vector Machines


The dataset contains the following:

CONTINUOUS FEATURES
 * age – age in years
 * trestbps – resting blood pressure (in mm Hg on admission to the hospital)
 * chol – serum cholesterol in mg/dl
 * thalach – maximum heart rate achieved
 * oldpeak – ST depression induced by exercise relative to rest
 * ca – number of major vessels (0-3) colored by fluoroscopy; strictly a count, but treated as numeric here

CATEGORICAL FEATURES
 * sex – 1/0 for male/female
 * cp – chest pain type, Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic
 * fbs – fasting blood sugar > 120 mg/dl  (1 = true; 0 = false)
 * restecg – resting electrocardiographic results, Value 0: normal, Value 1: having ST-T wave abnormality – T wave inversions and/or ST elevation or depression of > 0.05 mV, Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
 * exang  – exercise induced angina (1 = yes; 0 = no)
 * slope – the slope of the peak exercise ST segment, Value 1: upsloping, Value 2: flat, Value 3: downsloping
 * thal – 3 = normal; 6 = fixed defect; 7 = reversible defect
 * num – diagnosis of heart disease (angiographic disease status), Value 0: < 50% diameter narrowing, Value 1: > 50% diameter narrowing in any major vessel (in the raw UCI file, attributes 59 through 68 describe the individual vessels: 0 is absent, 1 is present); this is our RESPONSE, modeled below through the binary column `num_binary`
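
The code below reads a `num_binary` column whose derivation is not shown in this notebook. A minimal sketch of one plausible derivation follows, assuming the raw `num` column uses the 0 to 4 grading of the Cleveland file and that any nonzero grade counts as disease present. The toy values and the frame name `raw` are illustrative only.

```python
import pandas as pd

# Toy values only; the real column comes from the Cleveland file.
raw = pd.DataFrame({"num": [0, 2, 1, 0, 4, 3]})

# Assumption: any nonzero grade is treated as "disease present".
raw["num_binary"] = (raw["num"] > 0).astype(int)
print(raw["num_binary"].tolist())
```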

## Pre-Processing

We will need the following libraries:

In [2]:
import pandas as pd
import numpy as np

Below we load the cleaned dataset and drop the original `num` column, since we model `num_binary` instead. We also one-hot encode the categorical features that have more than two levels (`cp`, `restecg`, `slope`, `thal`).

In [3]:
df = pd.read_csv("view_processed_cleveland.txt")
df = df.drop(["num"], axis=1)  # use num_binary not num

# one-hot encode the categorical features that have more than two levels
cat_cols = ['cp', 'restecg', 'slope', 'thal']
cats = pd.get_dummies(df[cat_cols].astype('category'))
df = df.drop(cat_cols, axis=1)
df = df.join(cats)
df.head()


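|   | age | sex | trestbps | chol | fbs | thalach | exang | oldpeak | ca | num_binary | ... | cp_4 | restecg_0 | restecg_1 | restecg_2 | slope_1 | slope_2 | slope_3 | thal_3 | thal_6 | thal_7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 145 | 233 | 1 | 150 | 0 | 2.3 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | 67 | 1 | 160 | 286 | 0 | 108 | 1 | 1.5 | 3 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 67 | 1 | 120 | 229 | 0 | 129 | 1 | 2.6 | 2 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 37 | 1 | 130 | 250 | 0 | 187 | 0 | 3.5 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 4 | 41 | 0 | 130 | 204 | 0 | 172 | 0 | 1.4 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |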
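
To make the encoding step concrete, here is a small sketch on a toy frame (the values and the name `toy` are illustrative, not taken from the dataset). Casting to `category` first makes `get_dummies` treat the integer codes as labels rather than magnitudes, and each original column becomes one indicator column per observed level.

```python
import pandas as pd

toy = pd.DataFrame({"age": [63, 67, 41], "cp": [1, 4, 2], "thal": [6, 3, 3]})

# Cast to category so the integer codes are treated as labels, not magnitudes.
dummies = pd.get_dummies(toy[["cp", "thal"]].astype("category"))
encoded = toy.drop(columns=["cp", "thal"]).join(dummies)
print(list(encoded.columns))
```

Only levels that actually occur get a column, which is why the toy frame has no `cp_3` or `thal_7`. For linear models it can also be worth passing `drop_first=True` to `get_dummies`, so that one redundant indicator per feature is omitted.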


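Below we create our train/test data with a 70%/30% split. Because no `random_state` is set, the exact split (and therefore the scores below) will vary from run to run.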

In [4]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.3)
train.shape, test.shape

((207, 23), (90, 23))
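
If repeatable numbers are wanted, one option is to fix the seed and stratify on the response so both parts keep a similar class balance. The sketch below uses synthetic data with the same row count as the split above; the seed value 42 and the names `toy`, `toy_train`, and `toy_test` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same row count as the split above (207 + 90).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(29, 78, size=297),
    "num_binary": rng.integers(0, 2, size=297),
})

# random_state makes the split repeatable; stratify keeps class proportions similar.
toy_train, toy_test = train_test_split(
    toy, test_size=0.3, random_state=42, stratify=toy["num_binary"]
)
print(toy_train.shape, toy_test.shape)
```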

## Logistic Regression

In [41]:
# lbfgs
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='lbfgs', max_iter=10000)  # tune max_iter hyperparameter to achieve convergence
model = model.fit(train.loc[:,df.columns!="num_binary"]
                  ,train["num_binary"])

Basic prediction, confusion matrix, and model score.

In [43]:
# predictions, confusion matrix, and accuracy (model.score returns the mean accuracy)
from sklearn.metrics import confusion_matrix
pred = model.predict(test.loc[:,df.columns!="num_binary"])  # calc predictions
print(confusion_matrix(pred, test["num_binary"]))  # predictions passed first: rows = predicted, columns = actual
score = model.score(test.loc[:,df.columns!="num_binary"],test["num_binary"])
print(f'{score*100:.5}%')


[[43 12]
 [ 2 33]]
84.444%
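
Accuracy alone hides how the errors are distributed. As a worked example, the matrix printed above can be unpacked by hand. Because the predictions were passed as the first argument, rows are predicted classes and columns are actual classes; the sketch also assumes that `num_binary = 1` marks disease.

```python
import numpy as np

# Matrix printed above: rows are predicted classes, columns are actual classes.
cm = np.array([[43, 12],
               [2, 33]])

tn, fn = cm[0]  # predicted 0: actually 0, actually 1
fp, tp = cm[1]  # predicted 1: actually 0, actually 1

accuracy = (tp + tn) / cm.sum()
sensitivity = tp / (tp + fn)  # share of actual positives that were caught
specificity = tn / (tn + fp)  # share of actual negatives correctly cleared
print(f"{accuracy:.3f} {sensitivity:.3f} {specificity:.3f}")
```

This prints `0.844 0.733 0.956`: the accuracy matches the score above, while 12 of the 45 positive cases in the test set were missed.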


In [44]:
# sag
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='sag', max_iter=10000)  # tune max_iter hyperparameter to achieve convergence
model = model.fit(train.loc[:,df.columns!="num_binary"]
                  ,train["num_binary"])

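Basic prediction, confusion matrix, and model score. Note that the cell below has execution count 54, later than the decision tree cells (52 and 53), and its output is identical to theirs, so it may have scored the decision tree still held in `model` rather than the sag fit; rerunning the cells in order would confirm.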

In [54]:
# predictions, confusion matrix, and accuracy (model.score returns the mean accuracy)
from sklearn.metrics import confusion_matrix
pred = model.predict(test.loc[:,df.columns!="num_binary"])  # calc predictions
print(confusion_matrix(pred, test["num_binary"]))  # predictions passed first: rows = predicted, columns = actual
score = model.score(test.loc[:,df.columns!="num_binary"],test["num_binary"])
print(f'{score*100:.5}%')

[[39 11]
 [ 6 34]]
81.111%
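
The scikit-learn documentation notes that fast convergence of the `sag` solver is only guaranteed when features are on roughly the same scale, which is not the case for raw columns such as `chol` and `oldpeak`. A minimal sketch of standardizing inside a pipeline follows, on synthetic data rather than the heart disease file; the name `scaled_sag` and the seed values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart disease features.
X, y = make_classification(n_samples=297, n_features=10, random_state=0)
X[:, 0] *= 200.0  # mimic one large-magnitude column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Scaling inside the pipeline means the scaler is fit on training rows only.
scaled_sag = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="sag", max_iter=1000, random_state=0),
)
scaled_sag.fit(X_train, y_train)
sag_score = scaled_sag.score(X_test, y_test)
print(f"{sag_score * 100:.5}%")
```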


## Random Forest

In [46]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators = 5000, random_state = 51)
model = model.fit(train.loc[:,df.columns!="num_binary"], train["num_binary"])

Basic prediction, confusion matrix, and model score.

In [47]:
# predictions, confusion matrix, and accuracy (model.score returns the mean accuracy)
from sklearn.metrics import confusion_matrix
pred = model.predict(test.loc[:,df.columns!="num_binary"])  # calc predictions
print(confusion_matrix(pred, test["num_binary"]))  # predictions passed first: rows = predicted, columns = actual
score = model.score(test.loc[:,df.columns!="num_binary"],test["num_binary"])
print(f'{score*100:.5}%')

[[39 11]
 [ 6 34]]
81.111%
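
A fitted random forest also exposes impurity-based feature importances, which give a rough view of which columns drive the predictions. The sketch below uses synthetic data with placeholder column names (`feature_0` and so on), so the ranking it prints says nothing about the heart disease features themselves.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in; the column names are placeholders.
X, y = make_classification(
    n_samples=297, n_features=6, n_informative=3, random_state=0
)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(6)])

forest = RandomForestClassifier(n_estimators=200, random_state=51).fit(X, y)

# feature_importances_ holds one non-negative weight per column, summing to 1.
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```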


## SVM

In [49]:
from sklearn import svm
model = svm.SVC(gamma='scale', kernel='linear')  # linear kernel; gamma is ignored by this kernel
model = model.fit(train.loc[:,df.columns!="num_binary"], train["num_binary"])

Basic prediction, confusion matrix, and model score.

In [50]:
# predictions, confusion matrix, and accuracy (model.score returns the mean accuracy)
from sklearn.metrics import confusion_matrix
pred = model.predict(test.loc[:,df.columns!="num_binary"])  # calc predictions
print(confusion_matrix(pred, test["num_binary"]))  # predictions passed first: rows = predicted, columns = actual
score = model.score(test.loc[:,df.columns!="num_binary"],test["num_binary"])
print(f'{score*100:.5}%')

[[41 11]
 [ 4 34]]
83.333%


## Decision Trees

In [52]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(criterion = "gini", random_state = 101,
                               max_depth=10, min_samples_leaf=10)
model = model.fit(train.loc[:,df.columns!="num_binary"], train["num_binary"])

Basic prediction, confusion matrix, and model score.

In [53]:
# predictions, confusion matrix, and accuracy (model.score returns the mean accuracy)
from sklearn.metrics import confusion_matrix
pred = model.predict(test.loc[:,df.columns!="num_binary"])  # calc predictions
print(confusion_matrix(pred, test["num_binary"]))  # predictions passed first: rows = predicted, columns = actual
score = model.score(test.loc[:,df.columns!="num_binary"],test["num_binary"])
print(f'{score*100:.5}%')

[[39 11]
 [ 6 34]]
81.111%
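
The test accuracies above differ by at most three patients out of 90, which is small relative to the variation a different random split could produce. One way to get a steadier comparison is k-fold cross-validation. The sketch below runs it on synthetic data with settings similar to those used above; the forest is shrunk to 200 trees to keep it quick, and the names `models` and `cv_scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease features and response.
X, y = make_classification(n_samples=297, n_features=10, random_state=0)

models = {
    "logistic (lbfgs)": LogisticRegression(solver="lbfgs", max_iter=10000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=51),
    "svm (linear)": SVC(kernel="linear"),
    "decision tree": DecisionTreeClassifier(
        max_depth=10, min_samples_leaf=10, random_state=101
    ),
}

cv_scores = {}
for name, estimator in models.items():
    # For classifiers, an integer cv gives stratified folds; the score is accuracy.
    scores = cross_val_score(estimator, X, y, cv=5)
    cv_scores[name] = scores
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Comparing the mean and spread across folds gives a better sense of whether one model is actually ahead than a single 90-row test set does.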
