In [1]:
url="https://archive.ics.uci.edu/dataset/45/heart+disease"

In [4]:
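# Load the packages used throughout the analysis.
# janitor supplies clean_names(); if it is missing, install it once with
# install.packages("janitor") before running this cell.
library(tidyverse)
library(repr)
library(tidymodels)
library(janitor)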


# Presence of Heart Disease in Patients in Cleveland, Ohio

## Introduction

Heart disease is the leading cause of death in the United States, accounting for one in every five deaths (Multiple Cause of Death Data on CDC WONDER, n.d.). This makes the quick and accurate diagnosis of heart disease an extremely important topic of study. In 1989, a probability algorithm was created for the diagnosis of coronary artery disease (Detrano et al., 1989). The authors tested their algorithm on the test results of 303 patients from the Cleveland Clinic.
Here, we use two criteria commonly associated with heart disease to predict its presence in this sample of patients. High cholesterol can increase the risk of heart disease and is easily measured in routine blood tests, making it an important candidate predictor (High Cholesterol - Symptoms and Causes - Mayo Clinic, n.d.). The second variable we chose is resting blood pressure. High blood pressure, also known as hypertension, can lead to heart disease and is easily measured at home or during regular checkups (CDC, 2023).
Using these two routinely checked variables, we ask whether the presence of heart disease can be predicted in patients in Cleveland, Ohio, using classification.
The heart disease dataset used in our analysis has 303 patients and 14 variables (Janosi et al., 1988). As mentioned above, we use only two of them (cholesterol level and resting blood pressure) to see whether routinely checked criteria allow a quick determination of the presence of heart disease. The sample consists of patients from the Cleveland Clinic in Ohio and includes both male and female patients, with an average age of 54.


### Initial Data

In [3]:
set.seed(1234)

# Read the raw file, skipping its first 20 lines, and label the 14 columns.
main_data_column_2 <- read_table(
    "cleve.mod",
    col_names = c("Age", "Sex", "Chest Pain Type", "Resting Blood Pressure",
                  "Cholesterol", "Fasting Blood Sugar <120",
                  "Resting ECG Reading", "Max Heart Rate",
                  "Exercise Induced Angina (TRUE or FALSE)",
                  "Old Peak", "Slope", "Number Of Vessels Coloured", "thal", "Health"),
    skip = 20
)

# clean_names() (from janitor) converts the labels to snake_case,
# e.g. "Resting Blood Pressure" becomes resting_blood_pressure.
heart_data <- clean_names(main_data_column_2) |>
    mutate(health = as_factor(health))
head(heart_data)


### Summary Data

In [None]:
num_obs <- heart_data |>
    group_by(health) |>
    summarize(counts = n())
num_obs

# Mean of each predictor across all patients.
predictor_means <- heart_data |>
    select(resting_blood_pressure, cholesterol) |>
    summarize(across(resting_blood_pressure:cholesterol, mean))
predictor_means

### Initial Visualization

In [None]:

# Keep only the columns used in the analysis.
heart_data <- heart_data |>
    select(sex, resting_blood_pressure, cholesterol, health)

# Split into 75% training and 25% testing data, stratified by health status.
heart_split <- initial_split(heart_data, prop = 0.75, strata = health)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)

head(heart_train)
head(heart_test)

# Scatter plot of the training data, one panel per sex.
heart_plot <- heart_train |>
    ggplot(aes(x = resting_blood_pressure, y = cholesterol, color = health)) +
        geom_point() +
        labs(x = "Resting Blood Pressure", y = "Cholesterol", color = "Health") +
        theme(text = element_text(size = 20)) +
        ggtitle("Cholesterol vs Resting Blood Pressure") +
        facet_grid(. ~ sex)
heart_plot

## Methods and Results

We began by loading the raw data and assigning column names, then cleaned the names into a consistent format and converted the health status to a factor. We kept only sex, resting blood pressure, cholesterol, and health status, and split the data into training (75%) and testing (25%) sets, stratified by health status. To assess whether our two variables are good predictors of heart disease, we used K-nearest neighbors classification: we built a recipe with cholesterol and resting blood pressure as predictors, specified a nearest-neighbor model, and combined the two in a workflow. To choose K, we ran 10-fold cross-validation on the training data for K from 2 to 20 and plotted the estimated accuracy against K. We then fit the model with K = 10 on the training data, predicted the health status of the patients in the testing data with the predict function, and assessed those predictions with the metrics function and a confusion matrix. Finally, we plotted cholesterol against resting blood pressure for healthy and sick patients, alongside the predicted results, so that our findings can be seen clearly.
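
The core idea of K-nearest neighbors can be illustrated by hand. The following is a minimal base-R sketch, not the kknn implementation used in this notebook: it classifies one new observation by an unweighted majority vote among its K closest training points, which is the idea behind `weight_func = "rectangular"`. The data frame `toy_train`, the helper `knn_predict_one`, and all values are hypothetical and are not taken from `cleve.mod`.

```r
# Hypothetical toy training data (not values from cleve.mod).
toy_train <- data.frame(
  resting_blood_pressure = c(120, 130, 150, 160, 125, 155),
  cholesterol            = c(200, 210, 280, 300, 190, 290),
  health                 = factor(c("healthy", "healthy", "sick",
                                    "sick", "healthy", "sick"))
)

# Hypothetical helper: classify one new point by majority vote
# among its k nearest training points (Euclidean distance).
knn_predict_one <- function(train, new_point, k) {
  predictors <- as.matrix(train[, c("resting_blood_pressure", "cholesterol")])
  # Subtract the new point from every training row, column by column.
  diffs <- sweep(predictors, 2, new_point)
  distances <- sqrt(rowSums(diffs^2))
  # Indices of the k smallest distances.
  nearest <- order(distances)[seq_len(k)]
  # Each neighbor gets one equal vote.
  votes <- table(train$health[nearest])
  names(votes)[which.max(votes)]
}

new_patient <- c(resting_blood_pressure = 152, cholesterol = 285)
toy_prediction <- knn_predict_one(toy_train, new_patient, k = 3)
toy_prediction
```

In this toy example the three closest training points are all labeled sick, so the new observation is classified as sick.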


In [None]:
train_counts <- heart_train |>
    group_by(health) |>
    summarize(n = n()) 

train_counts

In [None]:
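# Recipe: predict health status from cholesterol and resting blood pressure.
heart_recipe <- recipe(health ~ cholesterol + resting_blood_pressure, data = heart_train)

# K-nearest neighbors specification with the number of neighbors left to tune.
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# Candidate values of K and 10-fold cross-validation folds of the training data.
k_vals <- tibble(neighbors = seq(2, 20))
heart_vfold <- vfold_cv(heart_train, v = 10, strata = health)

# Estimate the performance of each K across the folds.
knn_results <- workflow() |>
    add_recipe(heart_recipe) |>
    add_model(knn_tune) |>
    tune_grid(resamples = heart_vfold, grid = k_vals) |>
    collect_metrics()
head(knn_results)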
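
The recipe above passes the raw predictor values to the model. Distance-based methods are sensitive to the scale of their predictors, so standardization is often made explicit. The following is a minimal base-R sketch of what standardizing means, using a hypothetical data frame `toy` whose values are not taken from `cleve.mod`:

```r
# Hypothetical toy predictors (not values from cleve.mod).
toy <- data.frame(
  resting_blood_pressure = c(110, 130, 150, 170),
  cholesterol            = c(180, 240, 300, 360)
)

# scale() subtracts each column's mean and divides by its standard deviation.
toy_scaled <- as.data.frame(scale(toy))

# After standardizing, every column has mean 0 and standard deviation 1,
# so both predictors contribute to distances on the same footing.
colMeans(toy_scaled)
sapply(toy_scaled, sd)
```

In a tidymodels recipe, the corresponding steps would be `step_center(all_predictors())` and `step_scale(all_predictors())`. Whether adding them changes the results of this analysis has not been checked here.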


In [None]:
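# Keep only the accuracy rows; the name accuracies avoids confusion
# with yardstick's accuracy() function.
accuracies <- knn_results |>
    filter(.metric == "accuracy")

# Plot estimated accuracy against the number of neighbors.
cross_val_plot <- accuracies |>
    ggplot(aes(x = neighbors, y = mean)) +
        geom_point() +
        geom_line() +
        labs(x = "Neighbors", y = "Accuracy Estimate") +
        theme(text = element_text(size = 20))
cross_val_plot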
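
Besides reading the best K off the plot, it can be selected programmatically by taking the row with the highest mean accuracy. The following is a minimal base-R sketch on a hypothetical results table `toy_results`. Its numbers are invented for illustration and are not the cross-validation results of this analysis.

```r
# Hypothetical cross-validation summary (invented values, for illustration only).
toy_results <- data.frame(
  neighbors = c(3, 5, 7, 9),
  mean      = c(0.55, 0.58, 0.63, 0.60)
)

# Pick the number of neighbors with the highest estimated accuracy.
# which.max() returns the first maximum, so ties go to the smaller K listed first.
best_k <- toy_results$neighbors[which.max(toy_results$mean)]
best_k
```

With the real `knn_results` table, the same idea applies after filtering to the rows where `.metric == "accuracy"`.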

In [None]:
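# Final model specification with K = 10 neighbors.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 10) |>
    set_engine("kknn") |>
    set_mode("classification")

# Fit the workflow on the training data.
heart_fit <- workflow() |>
    add_recipe(heart_recipe) |>
    add_model(knn_spec) |>
    fit(data = heart_train)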
heart_fit

In [None]:
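# Predict the health status of the testing data and attach the true labels.
heart_test_predictions <- predict(heart_fit, heart_test) |>
    bind_cols(heart_test)
head(heart_test_predictions)

# Accuracy of the predictions on the testing data.
heart_prediction_accuracy <- heart_test_predictions |>
    metrics(truth = health, estimate = .pred_class) |>
    filter(.metric == "accuracy")
head(heart_prediction_accuracy)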

conf_matrix <- heart_test_predictions |>
    conf_mat(truth = health, estimate = .pred_class)
conf_matrix
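
Accuracy is the share of predictions that match the true labels, that is, the diagonal of the confusion matrix divided by the total count. The following is a minimal base-R sketch with hypothetical labels (`toy_truth` and `toy_estimate` are invented and are not results from this analysis). It also computes the majority-class baseline, a common reference point when judging whether a classifier is useful.

```r
# Hypothetical true and predicted labels (not results from this analysis).
lvls <- c("healthy", "sick")
toy_truth <- factor(
  c("healthy", "healthy", "sick", "sick", "healthy", "sick", "healthy", "sick"),
  levels = lvls
)
toy_estimate <- factor(
  c("healthy", "sick", "sick", "healthy", "healthy", "sick", "healthy", "sick"),
  levels = lvls
)

# Confusion matrix: rows are predictions, columns are true labels.
toy_conf <- table(Prediction = toy_estimate, Truth = toy_truth)
toy_conf

# Accuracy = correct predictions (the diagonal) / all predictions.
toy_accuracy <- sum(diag(toy_conf)) / sum(toy_conf)

# Baseline: accuracy of always predicting the most common true label.
toy_baseline <- max(table(toy_truth)) / length(toy_truth)

toy_accuracy
toy_baseline
```

A classifier whose accuracy is close to the baseline adds little beyond always guessing the most common class.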

In [None]:
heart_predict_plot <- heart_test_predictions |>
    ggplot(aes(x = resting_blood_pressure, y = cholesterol, color = .pred_class, shape = health)) +
    geom_point(size = 3) +
    labs(x = "Resting Blood Pressure", y = "Cholesterol", color = "Predicted Health", shape = "Actual Health") +
    theme(text = element_text(size=15)) +
    ggtitle("Cholesterol vs Blood Pressure Prediction Results")
heart_predict_plot

### Plots of initial data and predicted results

In [None]:
heart_data_plot <- heart_data |>
    ggplot(aes(x = resting_blood_pressure, y = cholesterol, color = health)) +
    geom_point() +
    labs(x = "Resting Blood Pressure", y = "Cholesterol", color = "Health") +
    theme(text = element_text(size=15)) +
    ggtitle("Cholesterol vs Resting Blood Pressure")
heart_data_plot

heart_predict_plot

## Discussion

Overall, we found that resting blood pressure was not an ideal way of classifying patients as healthy or sick with some form of heart disease. The healthy and sick patients overlap too much to draw any meaningful conclusions or justifiable assumptions from the data. This is reflected in the nearly zero correlation between resting blood pressure and health status. Furthermore, comparing the male and female data shows little relationship between sex and the prevalence of heart disease. However, it is worth noting that a large number of healthy female patients had a wide range of cholesterol levels, which suggests that cholesterol can vary considerably with little to no apparent impact on heart health, especially in females.

We did not expect these results: we were almost certain that cholesterol and blood pressure, either on their own or in combination, would have some effect on heart health. We were surprised that the data suggest resting blood pressure and cholesterol levels have minimal effect on the risk of developing some form of heart disease.
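
The correlation between a numeric predictor and a two-level health status, as mentioned in the discussion, can be checked by coding the status as 0/1 and calling `cor()`. The following is a minimal base-R sketch on a hypothetical data frame `toy`. Its values are invented, so the resulting number says nothing about the real data.

```r
# Hypothetical toy data (not values from cleve.mod).
toy <- data.frame(
  resting_blood_pressure = c(120, 140, 130, 150, 125, 135),
  health = factor(c("healthy", "sick", "sick", "healthy", "healthy", "sick"))
)

# Code health status as 1 for sick and 0 for healthy.
sick_indicator <- as.numeric(toy$health == "sick")

# Pearson correlation with a 0/1 variable (the point-biserial correlation).
# Values near 0 indicate little linear association.
toy_cor <- cor(toy$resting_blood_pressure, sick_indicator)
toy_cor
```

Applied to the real data, the same call would use the `resting_blood_pressure` and `health` columns of `heart_data`, with the indicator built from whichever label marks sick patients in that column.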

### Future Questions
Are there any variables we are not aware of that may have influenced whether or not the patients had heart disease? Heart disease is known to run in families, so do the patients’ families have any pre-existing history of heart disease? Family history could introduce factors that affect the outcome. Smoking is a well-established risk factor for heart disease, and accounting for it could improve the accuracy of the analysis. Excessive alcohol intake can also contribute to heart-related issues, so understanding this variable is important for a thorough analysis. Considering these lifestyle and behavioral factors can help identify potential correlations between certain habits and the presence of heart disease, and allows for a more holistic understanding of the multifaceted nature of cardiovascular health. As these factors are often modifiable, insights gained from their analysis can also inform potential interventions and lifestyle modifications for preventing or managing heart disease.


## References

Janosi, A., Steinbrunn, W., Pfisterer, M., & Detrano, R. (1988). Heart Disease. UCI Machine Learning Repository. https://doi.org/10.24432/C52P4X

CDC. (2023, August 29). High Blood Pressure Symptoms, Causes, and Problems | cdc.gov. Centers for Disease Control and Prevention. https://www.cdc.gov/bloodpressure/about.htm

Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J.-J., Sandhu, S., Guppy, K. H., Lee, S., & Froelicher, V. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. The American Journal of Cardiology, 64(5), 304–310. https://doi.org/10.1016/0002-9149(89)90524-9

High cholesterol - Symptoms and causes - Mayo Clinic. (n.d.). Retrieved December 5, 2023, from https://www.mayoclinic.org/diseases-conditions/high-blood-cholesterol/symptoms-causes/syc-20350800

Multiple Cause of Death Data on CDC WONDER. (n.d.). Retrieved December 4, 2023, from https://wonder.cdc.gov/mcd.html