# Visualizing the Effects of Cholesterol and Blood Pressure on Heart Disease

*Mikel Ibarra Gallardo, Akshat Karla, Caitlin Lichimo, JunYuan Liu*

## Introduction

Heart disease is the leading cause of death in the United States. The CDC estimates that about 1 in every 5 deaths in the country is caused by it, and that one person dies of cardiovascular disease every 34 seconds. Heart disease takes several forms, of which coronary artery disease (CAD), also called coronary heart disease, is the most common, and many factors contribute to its likelihood. Two contributing factors toward CAD are particularly influential: blood pressure and the level of cholesterol in plasma. Past studies have shown a correlation between each of these and heart disease. A study focusing on blood pressure found that it can be a major risk factor for CAD, especially for those who are over 35 and those diagnosed with hypertension (Stamler et al., 1989). Another study, which focused on cholesterol, found that it has a very strong correlation with CAD because low-density lipoproteins aid the formation of fatty plaque within the arteries (Grundy, 1986). Both the latter study and the CDC suggest that the most efficient ways to prevent high cholesterol are a change in diet, regular exercise, and reduced consumption of harmful substances (such as alcohol or tobacco). This research asks the following question: are cholesterol and/or blood pressure good predictors of heart disease?

We use the Heart Disease Data Set provided by UC Irvine. The data set is divided into 4 databases: Cleveland, Hungary, Switzerland, and the VA Long Beach; we use only the Cleveland data, which describes heart condition information for 303 individuals. The data set contains 76 attributes, but published experiments have used only a subset of 14 of them:

- Age: age in years
- Sex: sex (1 = male; 0 = female)
- Cp: chest pain type (value 1 = typical angina, value 2 = atypical angina, value 3 = non-anginal pain, value 4 = asymptomatic)
- Trestbps: resting blood pressure (in mm Hg on admission to the hospital)
- Chol: serum cholesterol in mg/dl
- Fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- Restecg: resting electrocardiographic results (value 0 = normal, value 1 = having ST-T wave abnormality, value 2 = showing probable or definite left ventricular hypertrophy by Estes’ criteria)
- Thalach: maximum heart rate achieved
- Exang: exercise induced angina (1 = yes; 0 = no)
- Oldpeak: ST depression induced by exercise relative to rest
- Slope: the slope of the peak exercise ST segment (value 1 = upsloping, value 2 = flat, value 3 = downsloping)
- Ca: number of major vessels (0-3) colored by fluoroscopy
- Thal: 3 = normal, 6 = fixed defect, 7 = reversible defect
- Num: diagnosis of heart disease (angiographic disease status) (value 0 = <50% diameter narrowing, value 1 = > 50% diameter narrowing)

## Method

In [None]:
library(tidyverse)
library(tidymodels)

In [None]:
# Read the processed Cleveland file; it has no header row and marks missing values with "?"
df_cleveland <- read_delim("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data",
                           delim = ",", na = "?",
                           col_names = c("Age", "Sex", "Cp", "Trestbps", "Chol", "Fbs", "Restecg", "Thalach", "Exang",
                                         "Oldpeak", "Slope", "Ca", "Thal", "Num"))

# Collapse the diagnosis into a binary factor (0 = no disease, 1 = disease present)
# and record which database the rows come from
df_cleveland <- df_cleveland |>
    mutate(Num2 = as.factor(if_else(Num > 0, 1, 0)),
           Location = "Cleveland")

df_cleveland
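
In the processed Cleveland file, `Num` ranges from 0 (no disease) to 4, and the cell above collapses it into the binary `Num2`. The following is a minimal base R sketch of one way to express that recoding as a direct comparison. The vector `num` holds hypothetical toy values, and `num` and `num2` are illustrative names rather than columns read from the file.

```r
# Hypothetical toy values standing in for the Num column
# (0 = no disease, 1 to 4 = disease present)
num <- c(0, 2, 1, 0, 4, 3)

# Any value above 0 counts as a positive diagnosis
num2 <- factor(as.integer(num > 0), levels = c(0, 1))

# Count how many toy rows fall in each class
table(num2)
```

Fixing the factor levels explicitly keeps both classes present even if a subset of the data happens to contain only one of them.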

In [None]:
BP_vs_Age_plot <- df_cleveland |>
    ggplot(aes(x=Age,y=Trestbps, color=factor(Num2))) + 
        geom_point() +
        labs(x = "Age (years)", y = "Resting Blood Pressure (mm Hg)", color = "Diagnosis of Heart Disease")
BP_vs_Age_plot

In [None]:
Chol_vs_Age_plot <- df_cleveland |>
    ggplot(aes(x=Age,y=Chol, color=factor(Num2))) + 
        geom_point() +
        labs(x = "Age (years)", y = "Serum Cholesterol (mg/dl)", color = "Diagnosis of Heart Disease")
Chol_vs_Age_plot

In [None]:
BP_vs_Chol_plot <- df_cleveland |>
    ggplot(aes(x=Chol,y=Trestbps, color=factor(Num2))) + 
        geom_point() +
        labs(x = "Serum Cholesterol (mg/dl)", y = "Resting Blood Pressure (mm Hg)", color = "Diagnosis of Heart Disease")
BP_vs_Chol_plot

In [None]:
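# normalization: standardize the numeric predictors to mean 0 and standard deviation 1
# (as.numeric() drops the matrix attributes that scale() returns, so the columns stay plain vectors)
df_cleveland <- df_cleveland |>
  mutate(Age = as.numeric(scale(Age, center = TRUE)),
         Trestbps = as.numeric(scale(Trestbps, center = TRUE)),
         Chol = as.numeric(scale(Chol, center = TRUE)))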

df_cleveland
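
Standardizing a column means subtracting its mean and dividing by its standard deviation. The recipe steps `step_scale()` and `step_center()` used later estimate those constants from the training data and reuse them on new data. The following is a minimal base R sketch of that idea on hypothetical readings; the vector names are illustrative and the values are not taken from the data set.

```r
# Hypothetical cholesterol readings (mg/dl), split by hand into training and test values
chol_train <- c(180, 210, 240, 270, 300)
chol_test  <- c(200, 260)

# Learn the centering and scaling constants from the training values only
train_mean <- mean(chol_train)
train_sd   <- sd(chol_train)

# Apply the same constants to both sets
chol_train_std <- (chol_train - train_mean) / train_sd
chol_test_std  <- (chol_test - train_mean) / train_sd

# On the training values this matches what scale() computes
all.equal(chol_train_std, as.numeric(scale(chol_train)))
```

Reusing the training constants on the test values keeps information about the test set out of the preprocessing step.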

In [None]:
# Set the seed before splitting so the train/test partition is reproducible
set.seed(1)
cleveland_split <- initial_split(df_cleveland, prop = 0.75, strata = Num2)
cleveland_train <- training(cleveland_split)
cleveland_test <- testing(cleveland_split)

In [None]:
set.seed(1)

cleveland_recipe <- recipe(Num2 ~ Age + Trestbps, data = cleveland_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

cleveland_vfold <- vfold_cv(cleveland_train, v = 5, strata = Num2)
gridvals <- tibble(neighbors = seq(from = 1, to = 80, by = 5))

cleveland_results <- workflow() |>
  add_recipe(cleveland_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = cleveland_vfold, grid = gridvals) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

cleveland_results
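
With `weight_func = "rectangular"`, each of the k nearest neighbors gets an equal vote. The following is a minimal base R sketch of that unweighted majority vote, assuming Euclidean distance and hypothetical standardized points. `knn_predict` is an illustrative helper, not a function from kknn or tidymodels, and it does not reproduce how the kknn engine is implemented.

```r
# Illustrative helper: classify one new point by an unweighted vote of its k nearest neighbors
knn_predict <- function(train_x, train_y, new_x, k) {
  # Euclidean distance from the new point to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Indices of the k closest training rows
  nearest <- order(dists)[seq_len(k)]
  # Count the votes per class; on a tie, the first factor level wins
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Hypothetical standardized (Age, Trestbps) pairs with made-up diagnoses
train_x <- matrix(c(-1.0, -1.0,
                    -0.8, -1.2,
                    -1.1, -0.9,
                     1.0,  1.0,
                     0.9,  1.2,
                     1.1,  0.8),
                  ncol = 2, byrow = TRUE)
train_y <- factor(c(0, 0, 0, 1, 1, 1), levels = c(0, 1))

# Classify two new points using their 3 nearest neighbors
knn_predict(train_x, train_y, c(0.9, 1.0), k = 3)
knn_predict(train_x, train_y, c(-1.0, -1.0), k = 3)
```

Because every neighbor counts equally, a larger k smooths the decision boundary but can blur the difference between the two classes.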

In [None]:
cleveland_bps_knn_plot <- cleveland_results |>
    ggplot(aes(x = neighbors, y = mean)) +
    geom_line() +
    geom_point() +
    labs(x = "Neighbors", y = "Predictor Accuracy") +
    theme(text = element_text(size = 12))

cleveland_bps_knn_plot
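
The number of neighbors can also be read programmatically from the tuning results instead of from the plot. The following is a minimal base R sketch, assuming a results table with the same `neighbors` and `mean` columns as `cleveland_results`. The accuracy values are hypothetical and are not the values obtained in this analysis.

```r
# Hypothetical cross-validation results in the same shape as cleveland_results
results <- data.frame(
  neighbors = c(1, 6, 11, 16, 21, 26),
  mean      = c(0.52, 0.55, 0.58, 0.61, 0.60, 0.59)
)

# Row with the highest mean cross-validated accuracy (the first one wins on ties)
best <- results[which.max(results$mean), ]
best_k <- best$neighbors
best_k
```

When several values of k give nearly the same accuracy, it can be reasonable to prefer one in a stable region of the curve rather than the single highest point.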

In [None]:
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 25) |>
            set_engine("kknn") |>
            set_mode("classification")

cleveland_recipe <- recipe(Num2 ~ Age + Trestbps, data = cleveland_train) |>
                        step_scale(all_predictors()) |>
                        step_center(all_predictors())

cleveland_fit <- workflow() |>
                    add_recipe(cleveland_recipe) |>
                    add_model(knn_spec) |>
                    fit(data = cleveland_train)
cleveland_fit

In [None]:
cleveland_prediction <- predict(cleveland_fit, cleveland_test)
cleveland_testing_set_prediction <- cbind(cleveland_test,cleveland_prediction)
cleveland_testing_set_prediction
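
Once predictions sit next to the true labels, accuracy is the share of rows where the two agree. The following is a minimal base R sketch on hypothetical labels; `truth` and `pred` stand in for the `Num2` and `.pred_class` columns, and the values are made up for illustration. In tidymodels, yardstick's `accuracy()` and `conf_mat()` report the same quantities.

```r
# Hypothetical true labels and predictions, standing in for Num2 and .pred_class
truth <- factor(c(0, 0, 1, 1, 0, 1, 0, 1, 1, 0), levels = c(0, 1))
pred  <- factor(c(0, 1, 1, 0, 0, 1, 0, 0, 1, 0), levels = c(0, 1))

# Accuracy: share of rows where the prediction matches the true label
accuracy <- mean(pred == truth)

# Confusion matrix: rows are true classes, columns are predicted classes
confusion <- table(truth = truth, predicted = pred)

accuracy
confusion
```

The confusion matrix shows which kind of mistake dominates, which a single accuracy figure hides.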

In [None]:
prediction_plot <- cleveland_testing_set_prediction |> ggplot(aes(x=Age, y=Trestbps, color=factor(`.pred_class`))) + 
        geom_point() +
        labs(x = "Age",y = "Blood Pressure", color = "Prediction of Getting Heart Disease") + ggtitle("Predicted Heart Disease")
prediction_plot

In [None]:
actual_plot <- cleveland_test |> ggplot(aes(x=Age, y=Trestbps, color=factor(Num2))) + 
        geom_point() +
        labs(x = "Age",y = "Blood Pressure", color = "Actually getting Heart Disease") + ggtitle("Actual Heart Disease")
actual_plot

In [None]:
set.seed(1)

cleveland_recipe <- recipe(Num2 ~ Age + Chol,data = cleveland_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

cleveland_vfold <- vfold_cv(cleveland_train, v = 5, strata = Num2)
gridvals <- tibble(neighbors = seq(from = 1, to = 80, by = 5))

cleveland_results <- workflow() |>
  add_recipe(cleveland_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = cleveland_vfold, grid = gridvals) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

cleveland_results


In [None]:
# Accuracy vs. number of neighbors for the Age + Chol model
cleveland_chol_knn_plot <- cleveland_results |>
    ggplot(aes(x = neighbors, y = mean)) +
    geom_line() +
    geom_point() +
    labs(x = "Neighbors", y = "Predictor Accuracy") +
    theme(text = element_text(size = 12))

cleveland_chol_knn_plot

In [None]:
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 35) |>
            set_engine("kknn") |>
            set_mode("classification")

cleveland_recipe <- recipe(Num2 ~ Age + Chol, data = cleveland_train) |>
                        step_scale(all_predictors()) |>
                        step_center(all_predictors())

cleveland_fit <- workflow() |>
                    add_recipe(cleveland_recipe) |>
                    add_model(knn_spec) |>
                    fit(data = cleveland_train)
cleveland_fit

In [None]:
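# Predict with the Age + Chol model; otherwise the plot would reuse the blood pressure model's predictions
cleveland_prediction <- predict(cleveland_fit, cleveland_test)
cleveland_testing_set_prediction <- cbind(cleveland_test, cleveland_prediction)

prediction_plot <- cleveland_testing_set_prediction |> ggplot(aes(x=Age, y=Chol, color=factor(`.pred_class`))) + 
        geom_point() +
        labs(x = "Age",y = "Cholesterol", color = "Prediction of Getting Heart Disease") + ggtitle("Predicted Heart Disease")
prediction_plot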

In [None]:
actual_plot <- cleveland_test |> ggplot(aes(x=Age, y=Chol, color=factor(Num2))) + 
        geom_point() +
        labs(x = "Age",y = "Cholesterol", color = "Actually getting Heart Disease") + ggtitle("Actual Heart Disease")
actual_plot

## Discussion

Based on our results, cholesterol is a stronger indicator for diagnosing heart disease than blood pressure. We determined this by looking at the scatterplots: the points on the plot for cholesterol vs. age are much more condensed than the points on the plot for blood pressure vs. age, and thus we see a stronger correlation (60% accuracy with the k-nearest neighbors algorithm). We expected to find that both cholesterol and blood pressure are significant indicators of heart disease, but we did not know which one would be stronger. Although cholesterol can be considered a strong indicator, it should not be used or treated as the only one. For simplicity, this project considers only two of the most common health issues among Americans. Future research should account for other indicative variables, such as the other attributes in the UC Irvine Heart Disease Data Set (e.g., chest pain, maximum heart rate, and number of major vessels). The more we understand about potential indicators of heart disease, the better we can treat at-risk patients.

The biggest consequence of this high rate of heart disease is that it is the leading cause of death in America, accounting for 1 in 5 deaths (note that although our sample is only from Cleveland, we interpret the results for the US as a whole). In addition, heart disease costs the country hundreds of billions of dollars in health care services, money that could instead be allocated to non-preventable medical needs. Reducing cholesterol and high blood pressure is critical to lowering the likelihood of developing heart disease. That is not to say that lowering them fully prevents heart disease, but it does lower the chances. Future research could focus on how people in the US can control these factors more efficiently, and on what can be done to aid them or to relieve them of the responsibility of doing so on their own, with a focus on improving health as a whole.

The CDC recommends several methods for preventing high blood pressure and cholesterol, all of which involve healthier living habits. Eating a healthier diet, maintaining a healthy weight, staying physically active, not smoking, drinking less alcohol, and sleeping more are the most effective ways of preventing high blood pressure and cholesterol (Centers for Disease Control and Prevention, 2022). Another important step is to seek professional medical help. These practices seem fairly simple and intuitive, so why is heart disease such an issue in a well-developed country like the US? One major reason is the capitalist state of the country, where the billions of dollars made from processed junk food take priority over the health of citizens; junk food is cheap, while healthy foods are becoming exceedingly expensive, which makes healthy eating inaccessible. Another reason is a general lack of education; processed food companies make their nutrition facts labels confusing enough that the average citizen would not understand how to interpret the numbers (Health US News, 2012).

Findings from further research should also be readily available to the general public and written in accessible language for easy interpretation. Clear and concise publications such as infographics are one way to relay information to the public effectively and to avoid medical misinformation and mistrust, along with outreach programs in schools and workplaces and better access to recreational and physical activities, health care, and nutrition.


## References

Centers for Disease Control and Prevention. (2022, October 14). Heart disease facts. Centers for Disease Control and Prevention. Retrieved October 24, 2022, from https://www.cdc.gov/heartdisease/facts.htm

Grundy, S. M. (1986). Cholesterol and coronary heart disease: A new era. JAMA: The Journal of the American Medical Association, 256(20), 2849–2858. https://doi.org/10.1001/jama.256.20.2849

Stamler, J., Neaton, J. D., & Wentworth, D. N. (1989). Blood pressure (systolic and diastolic) and risk of fatal coronary heart disease. Hypertension, 13(5_supplement). https://doi.org/10.1161/01.hyp.13.5_suppl.i2

UCI Machine Learning Repository: Heart disease data set. (n.d.). Retrieved October 28, 2022, from https://archive.ics.uci.edu/ml/datasets/Heart+Disease