## Heart Attack Predictions
William Hoppes
March 19, 2018
Today we’re going to diagnose heart disease!

Specifically, we’re looking at data on heart attacks from Cleveland in 1988. Patients would come in either with heart pain (angina) or some other concern. Doctors took information from the patient, then hooked them up to electrodes, had them run on a treadmill, and measured the results.


Based on these results, we want to determine whether the patient has heart disease. Note that, probably for insurance/liability reasons, we don’t care how serious the heart disease is.

Our final product should be a machine learning algorithm that takes in patient data and delivers a diagnosis of “Heart Disease” or “No Heart Disease”, and it should be as accurate as possible.

So let’s start by looking at our raw data.

In [2]:
heartAttack_raw<-read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data",header=FALSE)

head(heartAttack_raw, 10)

V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14
63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2
67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0,1
37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0,0
41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0,0
56,1,2,120,236,0,0,178,0,0.8,1,0.0,3.0,0
62,0,4,140,268,0,2,160,0,3.6,3,2.0,3.0,3
57,0,4,120,354,0,0,163,1,0.6,1,0.0,3.0,0
63,1,4,130,254,0,2,147,0,1.4,2,1.0,7.0,2
53,1,4,140,203,1,2,155,1,3.1,3,0.0,7.0,1


Uggggllly!

So before we even look at the data, we need to clean it up so we know what we’re looking at. This involves going through the ‘Data Set Description’ that came with the dataset and manually relabelling the values, i.e. data munging/cleaning.
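
Before relabelling anything, it can help to check how R typed each column. The sketch below is only an illustration: it assumes the file marks missing entries with `?` (V12 and V13 printing as `0.0` rather than `0` is a hint that they were not read as plain numbers), and the third row is a hypothetical one invented to show the effect, not a row from the data set.

```r
# Two rows copied from the head() output above, plus one HYPOTHETICAL row
# (the last) with "?" in V12 to show what a missing-value marker does.
txt <- "63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2
50,0,3,120,200,0,0,160,0,0.0,1,?,3.0,0"

# Default read: the "?" forces the whole column to text
as_read <- read.csv(text = txt, header = FALSE)

# Telling read.csv that "?" means missing keeps the column numeric
with_na <- read.csv(text = txt, header = FALSE, na.strings = "?")

class(as_read$V12)   # character or factor, depending on the R version
class(with_na$V12)   # numeric, with NA in the hypothetical row
```

Whether this matters depends on which columns end up being kept; it only affects columns that actually contain the marker.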

So let’s start with what our variables are. According to the data set description, we have 14 variables:

1. Age
2. Sex/Gender
3. Chest pain: four values (typical angina, atypical angina, non-anginal pain, and asymptomatic)
4. Resting Blood Pressure
5. Cholesterol
6. Fasting Blood Sugar > 120 mg/dl: this is T/F
7. Resting Electrocardiographic Results: three values (normal, ST-T wave abnormality, probable or definite left ventricular hypertrophy by Estes’ criteria)
8. Maximum Heart Rate Achieved
9. Exercise Induced Angina: this is T/F
10. ST depression induced by exercise relative to rest
11. Slope of the Peak Exercise Segment
12. Number of Major Vessels (0-3) colored by fluoroscopy
13. Thal: three values (normal, fixed defect, and reversible defect)
14. Diagnosis

At this point we need to make a call as to what data to include.

To provide some context, the full data set has 76 variables. Most studies use the set of 14 variables listed above, which means 62 variables were cut out.

Discussion: Should we use all the variables? Only some? If so, which ones?
.
.
.
.
.
.
.
.
.

So when I did this, I decided to cut out ST depression induced by exercise, Slope of Peak, Number of Major Vessels colored, and Thal. I did this because I just didn’t know what these were, and I’m uncomfortable including any data that I don’t understand.

Cleaning code:

In [3]:
colnames(heartAttack_raw)<-c("Age", "Sex", "Chest_Pain_Type", 
                         "Resting_Blood_Pressure", "Cholesterol",
                         "Fasting_Blood_Sugar120", 
                         "Resting_Electrocardiographic_Results",
                         "Maximum_Heart_Rate", "Exercise_Induced_Angina",
                         "oldpeak", "slope", "N_Heart_Vessels_Flourosopy", "thal", 
                         "Diagnosis")

# Cut: keep the first nine predictors plus Diagnosis (drops oldpeak, slope, vessels, thal)
heartAttack<-heartAttack_raw[,c(1:9,14)]

# Base cleaning: turn the numeric codes into labelled factors
heartAttack$Sex<-factor(heartAttack$Sex, levels=c(0,1), labels=c("Female", "Male"))
heartAttack$Chest_Pain_Type<-factor(heartAttack$Chest_Pain_Type,
                                    levels=c(1,2,3,4), 
                                    labels=c("Typical_Angina", "Atypical_Angina", 
                                             "Non-anginal_Pain", "Asymptomatic"))
heartAttack$Resting_Electrocardiographic_Results<-factor(heartAttack$Resting_Electrocardiographic_Results,
                                                         levels=c(0,1,2),
                                                         labels=c("Normal", "ST-T_Wave_Abnormality", 
                                                                  "Probable_Hypertrophy"))

# 0/1 flags become FALSE/TRUE
heartAttack$Fasting_Blood_Sugar120<-as.logical(heartAttack$Fasting_Blood_Sugar120)
heartAttack$Exercise_Induced_Angina<-as.logical(heartAttack$Exercise_Induced_Angina)
# Any value above 0 counts as heart disease, regardless of severity
heartAttack$Diagnosis<-heartAttack$Diagnosis>0
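
A quick sanity check after recoding is worth the few lines it takes. `factor()` turns any code that is missing from `levels` into `NA`, so an unexpected `NA` would mean the data holds a value the description did not list. The sketch below applies the same recoding to the first five rows of the `head()` output, typed in by hand so it runs offline; the variable names `sample_rows`, `sex`, `chest_pain`, and `diagnosis` are just for this illustration.

```r
# First five rows of the head() output above, typed in by hand so the
# check runs without a network connection.
txt <- "63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2
67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0,1
37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0,0
41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0,0"
sample_rows <- read.csv(text = txt, header = FALSE)

# Same recoding as the cleaning code, applied to the sample
sex <- factor(sample_rows$V2, levels = c(0, 1), labels = c("Female", "Male"))
chest_pain <- factor(sample_rows$V3, levels = c(1, 2, 3, 4),
                     labels = c("Typical_Angina", "Atypical_Angina",
                                "Non-anginal_Pain", "Asymptomatic"))
diagnosis <- sample_rows$V14 > 0

# An NA here would mean a code outside the documented levels
stopifnot(!anyNA(sex), !anyNA(chest_pain))

# Counts per category
table(sex)
table(diagnosis)
```

On these five rows, four patients are male and two have a diagnosis above 0. Running the same `anyNA()` and `table()` checks on the full `heartAttack` data frame is the real test.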

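You can also use lasso regression to decide which variables are relevant. Lasso is a form of regularization: it penalizes the sum of the absolute values of the coefficients, which pushes the coefficients of unhelpful variables to exactly zero.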
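
As a rough illustration of the lasso idea, here is a minimal sketch using the `glmnet` package. Both the package choice and the data are assumptions: the original analysis does not use `glmnet`, and the `toy` data frame is simulated (it is not the Cleveland data) so that the example runs on its own. The outcome is built to depend on `Age` and `Maximum_Heart_Rate` but not on `Noise`.

```r
library(glmnet)  # assumption: glmnet is installed; not used in the original

set.seed(1)

# Simulated stand-in for the cleaned data (NOT the Cleveland values)
n <- 200
toy <- data.frame(
  Age = rnorm(n, mean = 55, sd = 9),
  Maximum_Heart_Rate = rnorm(n, mean = 150, sd = 20),
  Noise = rnorm(n)
)
# The outcome depends on Age and Maximum_Heart_Rate only
p <- plogis(0.08 * (toy$Age - 55) - 0.04 * (toy$Maximum_Heart_Rate - 150))
toy$Diagnosis <- rbinom(n, size = 1, prob = p) == 1

# glmnet wants a numeric matrix; model.matrix also expands factors to dummies
x <- model.matrix(Diagnosis ~ ., data = toy)[, -1]  # drop the intercept column
y <- as.numeric(toy$Diagnosis)

# alpha = 1 selects the lasso (absolute-value) penalty;
# cross-validation picks the penalty strength lambda
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients at a conservative lambda; a "." marks a variable shrunk to zero
coefs <- coef(cv_fit, s = "lambda.1se")
print(coefs)
```

Variables whose coefficient is shrunk to zero are candidates for cutting. To try this on the real data, swap `toy` for `heartAttack`; the rest of the sketch should carry over unchanged, since `model.matrix` handles the factor columns.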