# ***Detecting presence of Heart Disease using classification***
#### Group Members: Ishan Kumar Singh, Tony Kashimori, Jeffrey Kim

### Introduction

Heart diseases, a global health concern, predominantly affect the elderly. With countless victims each year, early detection remains crucial. 

Our project investigates the likelihood of heart disease using key indicators: age, sex, resting blood pressure, cholesterol, and maximum heart rate.

The primary question is: **"Can age, sex, resting blood pressure, cholesterol, and maximum heart rate predict heart disease?"** 

The data set combines five heart datasets, offering the most extensive collection on heart disease prediction with 918 unique observations, focusing on key indicators to assess heart disease risk. Shedding light on these factors' roles, enhances our understanding of future preventive medical approaches.

## Methods

### Preliminary exploratory data analysis

In [29]:
### Run this cell before continuing. 
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

In [30]:
data <- read_csv("https://raw.githubusercontent.com/jeffreyykim/DSCI-project-009-40-Group_Contract/94fda1d002bf5ab24d2be98a7c63061a1dad7ab0/heart.csv")

[1mRows: [22m[34m918[39m [1mColumns: [22m[34m12[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (5): Sex, ChestPainType, RestingECG, ExerciseAngina, ST_Slope
[32mdbl[39m (7): Age, RestingBP, Cholesterol, FastingBS, MaxHR, Oldpeak, HeartDisease

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [31]:
# cleaning and wrangling the data, renaming the values in a column
data <- data |>
        mutate(HeartDisease = as_factor(HeartDisease)) |>
        mutate(HeartDisease = fct_recode(HeartDisease, "Positive" = "1", "Negative" = "0"))
# selecting only the relevant columns
selected_data <- select(data, HeartDisease, Cholesterol,  Age, RestingBP, MaxHR, Sex) |>
                 filter(Cholesterol != 0, MaxHR !=0, RestingBP != 0)
# spliting data into testing and training sets
data_split <- initial_split(selected_data, prop = 0.75, strata = HeartDisease)
training_data <- training(data_split)
testing_data <- testing(data_split)
training_data
testing_data

HeartDisease,Cholesterol,Age,RestingBP,MaxHR,Sex
<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
Negative,283,37,130,98,M
Negative,195,54,150,122,M
Negative,339,39,120,170,M
⋮,⋮,⋮,⋮,⋮,⋮
Positive,193,68,144,141,M
Positive,131,57,130,115,M
Positive,236,57,130,174,F


HeartDisease,Cholesterol,Age,RestingBP,MaxHR,Sex
<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
Negative,289,40,140,172,M
Negative,211,37,130,142,F
Positive,164,58,136,99,M
⋮,⋮,⋮,⋮,⋮,⋮
Positive,169,44,120,144,M
Positive,197,63,124,136,F
Negative,175,38,138,173,M


In [32]:
# characteristics of the training data
training_data |>
group_by(HeartDisease, Sex) |>
summarize(Mean_Cholesterol = mean(Cholesterol, na.rm = TRUE), Mean_Age = mean(Age, na.rm = TRUE), Mean_RestingBP = mean(RestingBP, na.rn = TRUE),
          Mean_MaxHR = mean(MaxHR, na.rm = TRUE))

[1m[22m`summarise()` has grouped output by 'HeartDisease'. You can override using the
`.groups` argument.


HeartDisease,Sex,Mean_Cholesterol,Mean_Age,Mean_RestingBP,Mean_MaxHR
<fct>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
Negative,F,255.3396,52.19811,127.7736,147.9811
Negative,M,234.7312,48.77419,130.2849,148.7796
Positive,F,277.5862,56.27586,145.2414,137.5517
Positive,M,250.3613,55.83193,136.1975,129.7479


In [44]:
set.seed(1)
knn_recipe <- recipe(HeartDisease ~ . , data = training_data) |>
                step_center(all_predictors(-Sex)) |>
                step_scale(all_predictors(-Sex)) 
knn_recipe

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
            set_engine("kknn") |>
            set_mode("classification")
knn_spec

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          5

Operations:

Centering for all_predictors(-Sex)
Scaling for all_predictors(-Sex)

K-Nearest Neighbor Model Specification (classification)

Main Arguments:
  neighbors = tune()
  weight_func = rectangular

Computational engine: kknn 


In [45]:
set.seed(1)

grid_vals <- tibble(neighbors = 1:200)

knn_vfold <- vfold_cv(training_data, v = 5, strata = HeartDisease)


knn_fit <- workflow() |>
            add_recipe(knn_recipe) |>
            add_model(knn_spec) |>
            tune_grid(resamples = knn_vfold, grid = grid_vals) |>
            collect_metrics() |>
            filter(.metric == "accuracy")


[31mx[39m [31mFold1: preprocessor 1/1:
  [1m[33mError[31m in `step_center()`:[22m
  [1mCaused by error in `prep()`:[22m
  [33m![31m Problem while evaluating `all_predictors(-Sex)`.
  [1mCaused by error in `all_predictors()`:[22m
  [33m![31m unused argument (-Sex)[39m

[31mx[39m [31mFold2: preprocessor 1/1:
  [1m[33mError[31m in `step_center()`:[22m
  [1mCaused by error in `prep()`:[22m
  [33m![31m Problem while evaluating `all_predictors(-Sex)`.
  [1mCaused by error in `all_predictors()`:[22m
  [33m![31m unused argument (-Sex)[39m

[31mx[39m [31mFold3: preprocessor 1/1:
  [1m[33mError[31m in `step_center()`:[22m
  [1mCaused by error in `prep()`:[22m
  [33m![31m Problem while evaluating `all_predictors(-Sex)`.
  [1mCaused by error in `all_predictors()`:[22m
  [33m![31m unused argument (-Sex)[39m

[31mx[39m [31mFold4: preprocessor 1/1:
  [1m[33mError[31m in `step_center()`:[22m
  [1mCaused by error in `prep()`:[22m
  [33m![31m Pr

ERROR: [1m[33mError[39m in `estimate_tune_results()`:[22m
[33m![39m All of the models failed. See the .notes column.
