# Title: Predicting Risk of Heart Disease Using Classification #
Group 14:
Jackie Hagstrom,
Hannah Reyes,
Mikayla Ditosto,
Minal Nijhawan 

# Introduction #
The theme of our project is heart disease, and we use a dataset sourced from the UC Irvine Machine Learning Repository. The data combines information from four distinct databases: Cleveland, Hungary, Switzerland, and VA Long Beach. Our study concentrates on the Cleveland dataset, which classifies the presence of heart disease in a patient on a scale of 0-4 (0 indicating absence, and 4 indicating the greatest presence).

Through this project, we are trying to answer the question: What variables can be used to classify a person’s risk of heart disease in Cleveland? Given that the response variable, num, is categorical, we will be solving this as a classification problem. There are a total of 14 different variables provided in the data set that can help detect the level of risk of heart disease in a patient. Please see the bottom of this document for the variable meanings.

# Preliminary exploratory data analysis #

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)

In [None]:
# dataset was read into R
set.seed(2023)
cleveland_data <- read_csv("https://raw.githubusercontent.com/mikayladitosto/dsci-100-2023s1-group-14/main/processed.cleveland.data", col_names = FALSE)
colnames(cleveland_data) <- c("age",
                              "sex",
                              "cp",
                              "trestbps",
                              "chol",
                              "fbs",
                              "restecg",
                              "thalach",
                              "exang",
                              "oldpeak",
                              "slope",
                              "ca",
                              "thal",
                              "num")

# convert column types for classification and EDA; "?" marks a missing value
# in ca and thal, so it is turned into NA before those columns become numeric
cleveland_data <- cleveland_data |>
    mutate(num = as_factor(num),
           ca = as.numeric(na_if(ca, "?")),
           thal = as.numeric(na_if(thal, "?")))

# remove rows with missing (NA) values from the dataset
cleveland_data <- cleveland_data |>
    drop_na()

head(cleveland_data)

We downloaded the Cleveland dataset and uploaded it to our shared GitHub repository. We then passed the raw file URL to the read_csv function, which demonstrates that the dataset can be read from the web into R.
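As a hedged illustration of the reading step, the sketch below shows one way missing values coded as "?" could be handled at read time, using the `na` argument of read_csv. The three rows and the shortened column list are made up for the example and are not taken from the Cleveland file; `sample_rows`, `toy_data`, and `toy_complete` are hypothetical names.

```r
library(tidyverse)

# Three made-up rows in a comma-separated layout, with "?" marking missing values
sample_rows <- I("63,1,145,0,6\n67,1,160,?,3\n41,0,130,0,?\n")

# na = "?" tells read_csv to parse "?" as NA, so the columns stay numeric
toy_data <- read_csv(sample_rows,
                     col_names = c("age", "sex", "trestbps", "ca", "thal"),
                     na = "?",
                     show_col_types = FALSE)

# drop every row that has at least one missing value
toy_complete <- toy_data |>
    drop_na()

toy_complete
```

With this approach the columns never become character vectors, so no later type conversion is needed for them.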

In [None]:
# split the data into training and testing splits
cleveland_split <- initial_split(cleveland_data, prop = 0.75, strata = num)
cleveland_training <- training(cleveland_split)
cleveland_testing <- testing(cleveland_split)

In [None]:
num_obs <- nrow(cleveland_training)
cleveland_training |>
  group_by(num) |>
  summarize(
    count = n(),
    percentage = n() / num_obs * 100
  )

This table shows the number and percentage of observations in each category of heart disease. The categories are not equally represented, so the rarer classes can look disproportionately small in our graphs. During the classification analysis in our final project, we can use class balancing to ensure the rarer classes are equally represented.
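The class balancing mentioned above could be sketched as follows, assuming the themis package is installed and that oversampling with step_upsample is the balancing method chosen. The `toy` tibble and its values are invented for illustration; only the column names mirror our dataset.

```r
library(tidymodels)
library(themis)

set.seed(2023)

# Made-up imbalanced data: 8 observations of class "0" and 2 of class "1"
toy <- tibble(
    num = factor(c(rep("0", 8), rep("1", 2))),
    thalach = c(150, 162, 171, 158, 165, 149, 173, 160, 120, 132),
    oldpeak = c(0.0, 0.2, 0.0, 0.5, 0.1, 0.8, 0.0, 0.3, 2.1, 1.6)
)

# over_ratio = 1 resamples the rarer class until it matches the most common one
balanced <- recipe(num ~ thalach + oldpeak, data = toy) |>
    step_upsample(num, over_ratio = 1, skip = FALSE) |>
    prep() |>
    bake(new_data = NULL)

# count the observations per class after balancing
class_counts <- count(balanced, num)
class_counts
```

Oversampling duplicates existing rows rather than creating new information, so it should be applied to the training data only.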

In [None]:
cleveland_mean <- cleveland_training |>
    group_by(num) |>
    summarise(mean_age = mean(age),
              mean_sex = mean(sex),
              mean_cp = mean(cp),
              mean_trestbps = mean(trestbps),
              mean_chol = mean(chol),
              mean_fbs = mean(fbs),
              mean_restecg = mean(restecg),
              mean_thalach = mean(thalach),
              mean_exang = mean(exang),
              mean_oldpeak = mean(oldpeak),
              mean_slope = mean(slope),
              mean_ca = mean(ca),
              mean_thal = mean(thal))
cleveland_mean

cleveland_sd <- cleveland_training |>
    group_by(num) |>
    summarise(sd_age = sd(age),
              sd_sex = sd(sex),
              sd_cp = sd(cp),
              sd_trestbps = sd(trestbps),
              sd_chol = sd(chol),
              sd_fbs = sd(fbs),
              sd_restecg = sd(restecg),
              sd_thalach = sd(thalach),
              sd_exang = sd(exang),
              sd_oldpeak = sd(oldpeak),
              sd_slope = sd(slope),
              sd_ca = sd(ca),
              sd_thal = sd(thal))
cleveland_sd

In [None]:
options(repr.plot.height = 8, repr.plot.width = 10)
age_hist <- ggplot(cleveland_training, aes(x = age, fill = num)) +
  geom_histogram(position = "identity") +
  facet_grid(rows = vars(num)) +
  labs(x = "Age (years)",
       y = "Count",
       fill = "Heart Disease Diagnosis") +
  ggtitle("Relationship Between Age and Degree of Heart Disease") +
  theme(text = element_text(size = 18))
age_hist

age_plot <- cleveland_training |>
    ggplot(aes(y = age, x = num)) +
    geom_boxplot() +
    labs(x = "Degree of Heart Disease",
         y = "Age (years)") +
    theme(text = element_text(size = 15)) +
    ggtitle("Relationship between Age and Heart Disease")
age_plot

Our histogram shows no distinct relationship between age and num.

In [None]:
options(repr.plot.height = 8, repr.plot.width = 8)
thalach_plot <- cleveland_training |>
    ggplot(aes(x = thalach, fill = num)) +
    geom_histogram() +
    facet_grid(rows = vars(num)) +
    labs(x = "Maximum Heart Rate Achieved",
         y = "Count",
      fill = "Heart Disease Diagnosis") +
    theme(text = element_text(size = 16)) +
    ggtitle("Relationship between Maximum Heart Rate and Heart Disease")
thalach_plot

thalach_plot2 <- cleveland_training |>
    ggplot(aes(y = thalach, x = num)) +
    geom_boxplot() +
    labs(x = "Degree of Heart Disease",
         y = "Maximum Heart Rate Achieved") +
    theme(text = element_text(size = 15)) +
    ggtitle("Relationship between Maximum Heart Rate and Heart Disease")
thalach_plot2

? Explanation about max heart rate graph

In [None]:
chol_plot <- cleveland_training |>
    ggplot(aes(y = chol, x = num)) +
    geom_boxplot() +
    labs(x = "Degree of Heart Disease",
         y = "Cholesterol Level") +
    theme(text = element_text(size = 15)) +
    ggtitle("Relationship between Cholesterol Level and Heart Disease")
chol_plot

? Explanation for chol plot

In [None]:
bp_plot <- cleveland_training |>
    ggplot(aes(y = trestbps, x = num)) +
    geom_boxplot() +
    labs(x = "Degree of Heart Disease",
         y = "Resting Blood Pressure (mm Hg)") +
    theme(text = element_text(size = 15)) +
    ggtitle("Relationship between Resting Blood Pressure and Heart Disease")
bp_plot

In [None]:
trestbps_thalach_plot <- cleveland_training |>
    ggplot(aes(y = trestbps, x = thalach, color = num)) +
    geom_point() +
    labs(x = "Maximum Heart Rate Achieved",
         y = "Resting Blood Pressure (mm Hg)",
         color = "Heart Disease Diagnosis") +
    theme(text = element_text(size = 15)) +
    ggtitle("Resting Blood Pressure vs. Maximum Heart Rate by Degree of Heart Disease")
trestbps_thalach_plot

In [None]:
oldpeak_plot <- cleveland_training |>
    ggplot(aes(y = oldpeak, x = num)) +
    geom_boxplot() +
    labs(x = "Degree of Heart Disease",
         y = "ST Depression Induced By Exercise Relative To Rest") +
    theme(text = element_text(size = 15)) +
    ggtitle("Relationship between Exercise-Induced ST Depression and Heart Disease")
oldpeak_plot

Relationship here???

# Building the Classification Model #

In [None]:
#library(themis)

cleveland_recipe <- recipe(num ~ thalach + oldpeak, data = cleveland_training) |>
    #step_upsample(num, over_ratio = 1, skip = FALSE) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

cleveland_wkflw <- workflow() |>
    add_recipe(cleveland_recipe) |>
    add_model(knn_spec)

gridvals <- tibble(neighbors = seq(from = 1, to = 100, by = 3))
vfold_val <- vfold_cv(cleveland_training, v = 5, strata = num)

cleveland_results <- cleveland_wkflw |>
    tune_grid(resamples = vfold_val, grid = gridvals) |>
    collect_metrics() |>
    filter(.metric == "accuracy")
cleveland_results
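
A possible next step after cross-validation is to pick the number of neighbors with the highest mean accuracy, refit on the training set, and evaluate on the testing set. The sketch below is only an illustration of that workflow: it uses the built-in iris data as a stand-in so that it runs on its own, and names such as `best_k`, `knn_best`, and `final_fit` are hypothetical. It assumes the kknn package is installed.

```r
library(tidymodels)

set.seed(2023)

# iris is a stand-in dataset so this sketch is self-contained
iris_split <- initial_split(iris, prop = 0.75, strata = Species)
iris_training <- training(iris_split)
iris_testing <- testing(iris_split)

iris_recipe <- recipe(Species ~ Petal.Length + Petal.Width, data = iris_training) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# cross-validated accuracy for each candidate number of neighbors
results <- workflow() |>
    add_recipe(iris_recipe) |>
    add_model(knn_tune) |>
    tune_grid(resamples = vfold_cv(iris_training, v = 5, strata = Species),
              grid = tibble(neighbors = seq(from = 1, to = 15, by = 2))) |>
    collect_metrics() |>
    filter(.metric == "accuracy")

# keep the k with the highest mean accuracy (first one if several tie)
best_k <- results |>
    slice_max(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)

knn_best <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
    set_engine("kknn") |>
    set_mode("classification")

# refit on the full training set with the chosen k
final_fit <- workflow() |>
    add_recipe(iris_recipe) |>
    add_model(knn_best) |>
    fit(data = iris_training)

# accuracy on the held-out testing set
test_accuracy <- predict(final_fit, iris_testing) |>
    bind_cols(iris_testing) |>
    metrics(truth = Species, estimate = .pred_class) |>
    filter(.metric == "accuracy")
test_accuracy
```

The testing set is used only once, after k has been chosen, so the reported accuracy is not influenced by the tuning step.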

# Expected outcomes and significance #
From our research and findings, we expect that a lower maximum heart rate (thalach) is associated with heart disease and is therefore a possible indicator of it. We hope that our findings can validate other papers. These findings could encourage the population of Cleveland to change certain practices. Future research related to heart disease could go down different paths, such as nutrition or other lifestyle habits. For example, it could focus on foods that help reduce cholesterol levels, which could potentially reduce heart disease.

# Variables #
1. (age) -> age in years
2. (sex) -> sex (1 = male; 0 = female)
3. (cp) -> chest pain type
4. (trestbps) -> resting blood pressure (in mm Hg on admission to the hospital)
5. (chol) -> serum cholesterol in mg/dl
6. (fbs) -> fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. (restecg) -> resting electrocardiographic results
8. (thalach) -> maximum heart rate achieved
9. (exang) -> exercise induced angina (1 = yes; 0 = no)
10. (oldpeak) -> ST depression induced by exercise relative to rest
11. (slope) -> the slope of the peak exercise ST segment
    - Value 1: upsloping
    - Value 2: flat
    - Value 3: downsloping
12. (ca) -> number of major vessels (0-3) colored by fluoroscopy
13. (thal) -> 3 = normal; 6 = fixed defect; 7 = reversible defect
14. (num) -> (the predicted attribute) diagnosis of heart disease
    - Value 0: < 50% diameter narrowing
    - Value 1: > 50% diameter narrowing (in any major vessel)
