INTRODUCTION

According to the World Health Organization (WHO), cardiovascular disease is the leading cause of death globally. Over 17.9 million people die from the disease each year, and over 229 billion dollars are spent on cardiovascular medical care each year in the USA alone. Heart disease is associated with risk factors such as high blood pressure, high cholesterol and obesity, and can lead to heart failure, arrhythmia and heart attack.

Machine learning algorithms have vast clinical relevance because they take a set of data inputs and learn to predict output values, such as diagnoses, from them. Hence, our goal is to build and train a classification model, using the K-nearest neighbors (KNN) algorithm, that predicts whether an individual has heart disease based on various clinical attributes.

Research question: Which individuals are likely to have heart disease according to various clinical attributes?

*this is just me rambling feel free to edit/cut it out

PRELIMINARY DATA ANALYSIS

The "Hungarian heart disease" data set, obtained from the UCI Machine Learning Repository, contains 294 observations, each representing one patient, and 14 columns. Thirteen columns detail clinical attributes related to heart disease, and the last records whether or not the patient was diagnosed with heart disease on a scale of 0 to 4.

We began by loading the data set. Since the file does not include column names, we renamed the default columns (X1 to X14) with the corresponding variable names.

*do we need to narrate our work like this? 

In [15]:
library(tidyverse)
library(tidymodels)
set.seed(123)

untidy_heart_data <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data", col_names = FALSE)

# X1 holds the patient's age (28, 29, 30, ...), even though it is labelled Gender here.
untidy_heart_data_2 <- rename(untidy_heart_data, 
                        Gender = X1, 
                        Sex = X2, 
                        Cp = X3, 
                        Trestbps = X4, 
                        Chol = X5, 
                        Fbs = X6, 
                        Restecg = X7, 
                        Thalach = X8, 
                        Exang = X9, 
                        Oldpeak = X10, 
                        Slope = X11, 
                        Ca = X12, 
                        Thal = X13, 
                        Diagnosis = X14)            

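# Convert every column to a factor (this includes numeric measurements such as
# Trestbps and Chol) and print the result.
untidy_heart_data_3 <- untidy_heart_data_2 |>
    mutate(across(c(Gender:Diagnosis), as.factor))
untidy_heart_data_3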


Rows: 294 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): X4, X5, X6, X7, X8, X9, X11, X12, X13
dbl (5): X1, X2, X3, X10, X14

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Gender,Sex,Cp,Trestbps,Chol,Fbs,Restecg,Thalach,Exang,Oldpeak,Slope,Ca,Thal,Diagnosis
<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>
28,1,2,130,132,0,2,185,0,0,?,?,?,0
29,1,2,120,243,0,0,160,0,0,?,?,?,0
29,1,2,140,?,0,0,170,0,0,?,?,?,0
30,0,1,170,237,0,1,170,0,0,?,?,6,0
31,0,2,100,219,0,1,150,0,0,?,?,?,0
32,0,2,105,198,0,0,165,0,0,?,?,?,0
32,1,2,110,225,0,0,184,0,0,?,?,?,0
32,1,2,125,254,0,0,155,0,0,?,?,?,0
33,1,3,120,298,0,0,185,0,0,?,?,?,0
34,0,2,130,161,0,0,190,0,0,?,?,?,0
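
The "?" entries in the table above appear to mark missing values, which would explain why read_csv guessed character types for nine of the columns. As a minimal sketch of one way to handle this, assuming "?" really is the file's missing-value code, the rows can be read with `na = "?"` and numeric column types. The three rows are copied from the table above and supplied inline so the example runs without a network connection; the names `sample_rows` and `heart_sample` are hypothetical.

```r
library(readr)

# Three rows copied from the printed table, passed as literal data via I().
sample_rows <- I(paste(
  "28,1,2,130,132,0,2,185,0,0,?,?,?,0",
  "29,1,2,120,243,0,0,160,0,0,?,?,?,0",
  "29,1,2,140,?,0,0,170,0,0,?,?,?,0",
  sep = "\n"
))

# Assumption: "?" is the missing-value code, so it is parsed as NA
# and every column can then be read as a number.
heart_sample <- read_csv(
  sample_rows,
  col_names = paste0("X", 1:14),
  na = "?",
  col_types = cols(.default = col_double())
)

heart_sample
```

Read this way, the cholesterol column (X5) is numeric with an NA in the third row instead of being stored as text. The same two arguments could be passed when reading the full file from the URL.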


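Each row is one observation and each column is one variable, so the data set is already in tidy format. We can now divide it into a training set and a testing set; because set.seed was called earlier, the random split is reproducible. Note that the split below uses untidy_heart_data_2, in which missing values are still coded as "?" and several numeric variables are therefore stored as character, so these columns will need to be converted before modelling.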

In [24]:
split_heart_data <- initial_split(untidy_heart_data_2, prop = 0.75, strata = Diagnosis)
training_heart_data <- training(split_heart_data)
testing_heart_data <- testing(split_heart_data)
glimpse(training_heart_data) 
glimpse(testing_heart_data)

Rows: 220
Columns: 14
$ Gender    <dbl> 29, 29, 30, 32, 32, 34, 34, 35, 35, 35, 36, 36, 37, 37, 37, …
$ Sex       <dbl> 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, …
$ Cp        <dbl> 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 3, 3, 2, 3, 4, 2, 3, 4, 2, 3, …
$ Trestbps  <chr> "120", "140", "170", "105", "125", "150", "98", "120", "120"…
$ Chol      <chr> "243", "?", "237", "198", "254", "214", "220", "160", "308",…
$ Fbs       <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
$ Restecg   <chr> "0", "0", "1", "0", "0", "1", "0", "1", "2", "0", "0", "0", …
$ Thalach   <chr> "160", "170", "170", "165", "155", "168", "150", "185", "180…
$ Exang     <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
$ Oldpeak   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

We can now use the training data to create a summary table that shows the number and proportion of observations that do and do not have a heart disease diagnosis. As shown, 141 observations (64.09%) do not have a heart disease diagnosis and 79 observations (35.91%) do.

In [25]:
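# Count the observations in each diagnosis class and compute each class's share.
# Note: Percent is stored as a proportion between 0 and 1.
summary_heart_data <- training_heart_data |>
    group_by(Diagnosis) |>
    summarize(Count = n()) |>
    mutate(Percent = Count / sum(Count))
summary_heart_data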

Diagnosis,Count,Percent
<dbl>,<int>,<dbl>
0,141,0.6409091
1,79,0.3590909


METHODS

We will build our model by first creating a recipe that scales the predictor variables, and then combining that recipe with a KNN model specification in a workflow. Our classifier will be based on the following predictor variables: chest pain, blood pressure, resting electrocardiographic results... etc (need to state reasons why we chose those variables).
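
The recipe, model specification and workflow described above could look roughly like the following. This is only a sketch under stated assumptions, not our final method: `toy_training` is made-up data standing in for the cleaned training set, the two numeric predictors (Trestbps and Chol) are chosen only to keep the example short, centering is added alongside scaling, `neighbors = 5` is an arbitrary placeholder rather than a tuned value, and the kknn package is assumed to be installed.

```r
library(tidymodels)

set.seed(123)

# Made-up toy data for illustration only (NOT the real patient records).
toy_training <- tibble(
  Trestbps  = rnorm(60, mean = 130, sd = 15),    # resting blood pressure
  Chol      = rnorm(60, mean = 240, sd = 40),    # serum cholesterol
  Diagnosis = factor(rep(c(0, 1), times = 30))   # the outcome must be a factor
)

# Recipe: scale and center every predictor so that no single variable
# dominates the distance calculation used by KNN.
heart_recipe <- recipe(Diagnosis ~ Trestbps + Chol, data = toy_training) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# KNN model specification; neighbors = 5 is a placeholder, not a tuned value.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

# Workflow: bundle the recipe and the model, then fit on the training data.
heart_fit <- workflow() |>
  add_recipe(heart_recipe) |>
  add_model(knn_spec) |>
  fit(data = toy_training)

# Predicted class for each row (here the training rows, just to show the call).
toy_predictions <- predict(heart_fit, toy_training)
```

Bundling the preprocessing in the workflow means the same scaling is applied automatically whenever the fitted model is used to predict on new data.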

*not sure if this is what we are supposed to talk about here....

OUTCOMES AND SIGNIFICANCE 

We expect our classifier to be able to predict whether individuals have heart disease based on the clinical predictor variables it learns from in the training data.
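
One way to check this expectation would be to compare the classifier's predictions on the testing set with the true diagnoses. The sketch below shows how accuracy and a confusion matrix could be computed with yardstick (loaded by tidymodels). The labels in `toy_results` are made up purely for illustration and are not results from our model.

```r
library(tidymodels)

# Made-up true and predicted labels (illustration only, not real results).
toy_results <- tibble(
  truth     = factor(c(0, 0, 0, 0, 1, 1, 1, 1), levels = c(0, 1)),
  predicted = factor(c(0, 0, 0, 1, 1, 1, 0, 1), levels = c(0, 1))
)

# Accuracy: the share of predictions that match the true label (6 of 8 here).
toy_accuracy <- toy_results |>
  accuracy(truth = truth, estimate = predicted)

# Confusion matrix: counts of each combination of predicted and true class.
toy_confusion <- toy_results |>
  conf_mat(truth = truth, estimate = predicted)

toy_accuracy
toy_confusion
```

In a diagnostic setting the confusion matrix matters as much as overall accuracy, because missing a true case of heart disease and raising a false alarm have very different consequences.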

Future questions include whether we can create a classifier model that can predict the presence of a disease based on risk factors for another disease (correlation of two related diseases?). Further, we might ask whether variables outside of the Hungarian data set contribute to heart disease and could improve the accuracy of the model we created. We might also ask how these algorithms could be made more accessible in the health care industry, or even for at-home diagnosis. Finally, we might ask how practitioners could use classification algorithms such as ours to better treat patients with heart disease.

This model is significant because it could make diagnosis easier and faster, potentially without the need for an initial diagnosis by a clinical practitioner, and could help prevent life-threatening events by predicting the presence of heart disease from risk factors such as high blood pressure. Early diagnosis could substantially improve health care for heart disease patients with risk factors by allowing treatment and care to begin before the disease progresses to a fatal stage.