Heart Disease Prediction 

### *Introduction*
The dataset our group focuses on is the Coronary Artery Heart Disease data, which records 14 attributes of patients in four regions: Cleveland, Hungary, Switzerland, and VA Long Beach. The attributes include age, stage of heart disease, blood pressure, and more.

Our project focuses on the following predictive question:
- Can we predict patients’ stage of heart disease in Cleveland using the predictors age, resting blood pressure, serum cholesterol, and ST depression induced by exercise relative to rest?

We aim to build a classification model that predicts the condition of patients by classifying them into one of five stages of heart disease (0-4). Stage 0 means that the patient has no heart disease, and stages 1 to 4 represent increasing severity of heart disease. We also want to know how accurate our model's predictions are.


In [1]:
library(repr)
library(tidyverse)
library(tidymodels)
library(GGally)
options(repr.matrix.max.rows = 6)


In [None]:
set.seed(3456)
url <- "https://raw.githubusercontent.com/Jessieec/Group-Proposal/main/heart_disease_uci.csv"
heart_data <- read_csv(url)
heart_data

### *Preliminary exploratory data analysis*

In [None]:
heart_data_wrangled <- heart_data |>
    select(age, dataset, trestbps, chol, oldpeak, num) |>
    rename(region = dataset, stage = num) |>
    filter(region == "Cleveland") |> 
    mutate(stage = as_factor(stage))
    
heart_data_wrangled

In [None]:
#training data
heart_data_split <- initial_split(heart_data_wrangled, prop = 0.75, strata = stage)
heart_data_training <- training(heart_data_split)
heart_data_testing <- testing(heart_data_split)
heart_data_training

In [None]:
# mean of selected predictors
heart_data_mean <- summarize(heart_data_training, 
                             age_mean = mean(age),
                             chol_mean = mean(chol), 
                             trestbps_mean = mean(trestbps),
                             oldpeak_mean = mean(oldpeak)) |>
                    pivot_longer(cols = age_mean:oldpeak_mean,
                                 names_to = "variables",
                                 values_to = "mean")
heart_data_mean

# number of observations for each class
heart_data_observations <- group_by(heart_data_training, stage) |>
    summarize(count = n())
heart_data_observations

# rows with missing data (a row counts if any of its columns is NA)
heart_data_missing <- heart_data_training |>
    filter(if_any(everything(), is.na)) |>
    count()
heart_data_missing

In [None]:
#graph for Mean Data
heart_data_mean_plot <- heart_data_mean |>
    ggplot(aes(x = variables, y = mean)) +
    geom_bar(stat = "identity") +
    labs(x = "Variable Names", y = "Mean of Variables") +
    ggtitle("Variables Chosen vs. Mean of Each Variable") +
    theme(text = element_text(size = 15))
heart_data_mean_plot

#### *Note About Oldpeak on the Mean Bar Chart:*
- Since we will scale the data later in our analysis, the comparatively small values of oldpeak will not be an issue.
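
To illustrate why scaling removes this concern, here is a minimal sketch using base R's `scale()` on a small made-up data frame. The values in `toy` are hypothetical and only mimic the rough magnitudes of our four predictors; they are not taken from the dataset.

```r
# Hypothetical values on roughly the same scales as the four predictors
# (illustration only, not rows from the heart disease data)
toy <- data.frame(
  age      = c(45, 54, 63, 70),
  trestbps = c(120, 130, 145, 160),
  chol     = c(200, 240, 260, 300),
  oldpeak  = c(0.0, 0.8, 1.6, 3.0)
)

# Standardize every column to mean 0 and standard deviation 1,
# the same end result as step_scale() plus step_center() in a recipe
toy_scaled <- as.data.frame(scale(toy))

# After standardizing, all four predictors are on the same footing,
# so oldpeak contributes to K-NN distances as much as chol does
round(colMeans(toy_scaled), 10)
apply(toy_scaled, 2, sd)
```

Without this step, distances between patients would be driven mostly by the variables with the largest spread, such as chol.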

In [None]:
# graph for number of observations
# This section helps us visualize the distribution of patients across the stages of heart disease
heart_data_observation_plot <- heart_data_observations |>
    ggplot(aes(x = stage, y = count)) +
    geom_bar(stat = "identity") +
    labs(x = "Stages of Heart Disease", y = "Number of Patients") +
    ggtitle("Number of Observations for Each Stage of Heart Disease") +
    theme(text = element_text(size = 15))
heart_data_observation_plot

In [1]:
set.seed(1234) 

options(repr.plot.height = 5, repr.plot.width = 6)

heart_vfold <- vfold_cv(data = heart_data_training, v = 5, strata = stage)

knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
            set_engine("kknn") |>
            set_mode("classification")

heart_recipe <- recipe(stage ~ age + trestbps + chol + oldpeak, data = heart_data_training) |>
                step_scale(all_predictors()) |>
                step_center(all_predictors())

values_1 <- tibble(neighbors = seq(from = 1, to = 100, by = 5))

heart_fit_1 <- workflow() |>
                add_model(knn_tune) |>
                add_recipe(heart_recipe) |>
                tune_grid(resamples = heart_vfold, grid = values_1) |>
                collect_metrics() |>
                filter(.metric == "accuracy")

cross_val_plot_1 <- ggplot(heart_fit_1, aes(x = neighbors, y = mean)) + 
                    geom_point() + 
                    geom_line() + 
                    labs(x = "Neighbors", y = "Accuracy Estimate")

cross_val_plot_1

values_2 <- tibble(neighbors = seq(from = 10, to = 15, by = 1))

heart_fit_2 <- workflow() |>
                add_model(knn_tune) |>
                add_recipe(heart_recipe) |>
                tune_grid(resamples = heart_vfold, grid = values_2) |>
                collect_metrics() |>
                filter(.metric == "accuracy")

cross_val_plot_2 <- ggplot(heart_fit_2, aes(x = neighbors, y = mean)) + 
                    geom_point() + 
                    geom_line() + 
                    labs(x = "Neighbors", y = "Accuracy Estimate")

cross_val_plot_2





We first tuned K over the range 1 to 100. The accuracy estimate peaked between 1 and 20, so we then narrowed the grid and plotted a smaller range. The highest accuracy estimate occurred at K = 14.
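
Instead of reading the best K off the plot, it can also be picked programmatically. The sketch below assumes a results table with the same `neighbors` and `mean` columns as `heart_fit_2`; the accuracy values in `toy_results` are hypothetical and are not our actual cross-validation results.

```r
# Hypothetical accuracy estimates (illustration only); in the notebook these
# would be the `neighbors` and `mean` columns of heart_fit_2
toy_results <- data.frame(
  neighbors = 10:15,
  mean      = c(0.55, 0.56, 0.56, 0.57, 0.58, 0.57)
)

# Select the K with the highest mean accuracy
# (which.max() returns the first maximum if there are ties)
best_k <- toy_results$neighbors[which.max(toy_results$mean)]
best_k
```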

In [None]:
set.seed(1234) 

heart_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 14) |>
                set_engine("kknn") |>
                set_mode("classification")

heart_final_fit <- workflow() |>
            add_recipe(heart_recipe) |>
            add_model(heart_spec) |>
            fit(data = heart_data_training)

heart_final_fit

In [None]:
# Set the seed. Don't remove this!
set.seed(1234)

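# Evaluate on the held-out testing data so the accuracy is not inflated
# by predicting on rows the model was trained on
heart_predictions <- heart_final_fit |>
    predict(heart_data_testing) |>
    bind_cols(heart_data_testing)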

heart_metrics <- heart_predictions |>
    metrics(truth = stage, estimate = .pred_class) |>
    filter(.metric == "accuracy")
   

heart_conf_mat <- heart_predictions |>
    conf_mat(truth = stage, estimate = .pred_class)

heart_metrics
heart_conf_mat

In [None]:
set.seed(1234)

patient <- tibble(age = 74, trestbps = 188, chol = 240, oldpeak = 2.3)

heart_predict <- predict(heart_final_fit, patient)

heart_predict

### *Method*

- We chose four variables as our predictors: age, resting blood pressure (trestbps), serum cholesterol (chol), and ST depression induced by exercise relative to rest (oldpeak). We removed character and categorical variables because they are difficult to use as predictors in K-NN classification. The “thalach” column is excluded because the maximum heart rate is a single momentary measurement. The chosen variables are better indicators because they represent the overall condition of a person.
- We will train a K-NN classification model to classify the stage of heart disease from 0 to 4 using our predictors.
- A scatterplot matrix will help us visualize the training data, with different colors representing disease stages, showing the relationship between the predictors and the stage of disease.
- Cross-validation will help us determine the best K-value. With that K-value, we can create the K-NN model specification by setting the mode to “classification” and the engine to “kknn”, and scale and center the predictors in a “recipe”. Then, we can use a workflow to fit the model to the training data.
- The prediction accuracy will be calculated by using the trained model to classify the testing data and then dividing the number of correct predictions by the total number of predictions.
- We will optimize the accuracy by testing different training and testing data splits, and then by choosing the model with the highest accuracy.
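
The accuracy calculation described above can be sketched in base R as follows. The `truth` and `predicted` vectors are hypothetical labels for eight imaginary patients, used only to show the arithmetic; they are not predictions from our model.

```r
# Hypothetical true and predicted stages for eight patients (illustration only)
stage_levels <- c("0", "1", "2", "3", "4")
truth     <- factor(c("0", "0", "1", "2", "0", "3", "1", "4"), levels = stage_levels)
predicted <- factor(c("0", "1", "1", "0", "0", "3", "0", "4"), levels = stage_levels)

# Accuracy = number of correct predictions / total number of predictions
accuracy <- sum(predicted == truth) / length(truth)

# Confusion matrix: rows are predicted stages, columns are true stages
confusion <- table(Predicted = predicted, Truth = truth)

accuracy
confusion
```

The diagonal of the confusion matrix counts the correct predictions, so its sum divided by the total number of patients gives the same accuracy value.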


### *Expected outcomes and significance*

*What do you expect to find?*
- We expect the results to show higher stages of heart disease in the elderly and individuals with higher blood pressure, cholesterol, and ST depression.

*What impact could such findings have?*
- Using models to predict heart disease stages is beneficial for early intervention.
- Sometimes, certain medical tests require a doctor's referral. Our model allows people to easily see for themselves if they are at risk for heart disease.

*What future questions could this lead to?*
- Is our model accurate for the general public (outside of our dataset)?
- How can we improve our model to use categorical and character variables as well to predict heart disease?
- Although our model focuses on Coronary Artery Disease, can it accurately predict the stages of other heart diseases in the general public?
