# ***Detecting the Presence of Heart Disease Using Classification***
#### Group Members: Ishan Kumar Singh, Tony Kashimori, Jeffrey Kim

## Introduction

Heart disease is a global health concern that predominantly affects the elderly. Because it affects so many people each year, early detection remains crucial.

Our project investigates the likelihood of heart disease using key indicators: age, sex, resting blood pressure, cholesterol, and maximum heart rate.

The primary question is: **"Can age, sex, resting blood pressure, cholesterol, and maximum heart rate predict heart disease?"** 

The data set combines five heart disease datasets, offering the most extensive collection for heart disease prediction with 918 unique observations. Shedding light on the roles of these indicators can inform future preventive medical approaches.

## Methods

### Preliminary exploratory data analysis

In [7]:
### Run this cell before continuing.
library(tidyverse)   # attaches dplyr, readr, and forcats, among others
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)  # print at most 6 rows of each table

In [8]:
data <- read_csv("https://raw.githubusercontent.com/jeffreyykim/DSCI-project-009-40-Group_Contract/94fda1d002bf5ab24d2be98a7c63061a1dad7ab0/heart.csv")

Rows: 918 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Sex, ChestPainType, RestingECG, ExerciseAngina, ST_Slope
dbl (7): Age, RestingBP, Cholesterol, FastingBS, MaxHR, Oldpeak, HeartDisease

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [37]:
# Clean and wrangle the data: recode the numeric outcome as a labelled factor.
# The result is stored in a new object so that re-running this cell does not
# recode already-recoded data (which triggers an "Unknown levels" warning).
heart_data <- data |>
        mutate(HeartDisease = as_factor(HeartDisease)) |>
        mutate(HeartDisease = fct_recode(HeartDisease, "Positive" = "1", "Negative" = "0"))
# Keep only the relevant columns, drop rows where a measurement is recorded as 0,
# and encode sex as an integer indicator (1 = male, 0 = female)
selected_data <- select(heart_data, HeartDisease, Cholesterol, Age, RestingBP, MaxHR, Sex) |>
                 filter(Cholesterol != 0, MaxHR != 0, RestingBP != 0) |>
                 mutate(male = as.integer(Sex == "M")) |>
                 select(-Sex)
selected_data

# Split the data into training and testing sets; the seed makes the split reproducible
set.seed(1)
data_split <- initial_split(selected_data, prop = 0.75, strata = HeartDisease)
training_data <- training(data_split)
testing_data <- testing(data_split)
training_data
testing_data


HeartDisease,Cholesterol,Age,RestingBP,MaxHR,male
<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
Negative,289,40,140,172,1
Positive,180,49,160,156,0
Negative,283,37,130,98,1
⋮,⋮,⋮,⋮,⋮,⋮
Positive,131,57,130,115,1
Positive,236,57,130,174,0
Negative,175,38,138,173,1


HeartDisease,Cholesterol,Age,RestingBP,MaxHR,male
<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
Negative,289,40,140,172,1
Negative,283,37,130,98,1
Negative,208,54,110,142,1
⋮,⋮,⋮,⋮,⋮,⋮
Positive,193,68,144,141,1
Positive,131,57,130,115,1
Positive,236,57,130,174,0


HeartDisease,Cholesterol,Age,RestingBP,MaxHR,male
<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
Negative,195,54,150,122,1
Negative,339,39,120,170,1
Negative,237,45,130,170,0
⋮,⋮,⋮,⋮,⋮,⋮
Negative,342,55,132,166,0
Positive,197,63,124,136,0
Positive,176,59,164,90,1
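
As a quick sanity check on stratified splitting, the sketch below shows that the `strata` argument keeps the outcome's class proportions roughly the same in the training set. It uses a small hypothetical data frame (`toy_data`, with a made-up 60/40 class balance and a random predictor `x`) rather than the heart data, so the numbers are for illustration only.

```r
library(rsample)  # initial_split(), training(), testing()
library(dplyr)

set.seed(1)

# Hypothetical toy data: 60 "Negative" and 40 "Positive" cases (an assumption for illustration)
toy_data <- tibble(
    HeartDisease = factor(rep(c("Negative", "Positive"), times = c(60, 40))),
    x = rnorm(100)
)

toy_split <- initial_split(toy_data, prop = 0.75, strata = HeartDisease)

# Class proportions in the training portion should stay close to the original 60/40
train_props <- training(toy_split) |>
    count(HeartDisease) |>
    mutate(prop = n / sum(n))
train_props
```

The same `count()` and `mutate()` steps could be applied to `training_data` and `testing_data` to compare their class balance.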


In [38]:
# Characteristics of the training data, grouped by outcome and sex.
# `Sex` was replaced by the integer indicator `male` (1 = male, 0 = female) during wrangling.
training_data |>
    group_by(HeartDisease, male) |>
    summarize(Mean_Cholesterol = mean(Cholesterol, na.rm = TRUE),
              Mean_Age = mean(Age, na.rm = TRUE),
              Mean_RestingBP = mean(RestingBP, na.rm = TRUE),
              Mean_MaxHR = mean(MaxHR, na.rm = TRUE),
              .groups = "drop")


In [39]:
set.seed(1)
knn_recipe <- recipe(HeartDisease ~ . , data = training_data) |>
                step_center(all_predictors()) |>
                step_scale(all_predictors()) 
knn_recipe

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
            set_engine("kknn") |>
            set_mode("classification")
knn_spec

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          5

Operations:

Centering for all_predictors()
Scaling for all_predictors()

K-Nearest Neighbor Model Specification (classification)

Main Arguments:
  neighbors = tune()
  weight_func = rectangular

Computational engine: kknn 
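
K-nearest neighbors relies on distances between observations, so predictors on larger scales (such as cholesterol) would otherwise dominate the distance. The sketch below illustrates, in base R, the arithmetic that `step_center()` followed by `step_scale()` applies to each predictor: subtract the mean, then divide by the standard deviation. It uses only the first three rows of `selected_data` printed earlier, and the names `example_rows` and `standardized` are introduced here for illustration. In the recipe itself, the means and standard deviations are estimated from the data the recipe is trained on.

```r
# First three rows of selected_data as printed earlier (illustration only)
example_rows <- data.frame(
    Cholesterol = c(289, 180, 283),
    Age = c(40, 49, 37)
)

# Center each column at 0 and scale it to standard deviation 1
standardized <- as.data.frame(scale(example_rows, center = TRUE, scale = TRUE))
standardized

# Check: every standardized column has mean 0 and standard deviation 1
sapply(standardized, mean)
sapply(standardized, sd)
```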


In [41]:
set.seed(1)

grid_vals <- tibble(neighbors = 1:200)

knn_vfold <- vfold_cv(training_data, v = 5, strata = HeartDisease)


knn_fit <- workflow() |>
            add_recipe(knn_recipe) |>
            add_model(knn_spec) |>
            tune_grid(resamples = knn_vfold, grid = grid_vals) |>
            collect_metrics() |>
            filter(.metric == "accuracy")
knn_fit

neighbors,.metric,.estimator,mean,n,std_err,.config
<int>,<chr>,<chr>,<dbl>,<int>,<dbl>,<chr>
1,accuracy,binary,0.6617556,5,0.01930164,Preprocessor1_Model001
2,accuracy,binary,0.6617556,5,0.01930164,Preprocessor1_Model002
3,accuracy,binary,0.6491429,5,0.01553145,Preprocessor1_Model003
⋮,⋮,⋮,⋮,⋮,⋮,⋮
198,accuracy,binary,0.6993383,5,0.02176040,Preprocessor1_Model198
199,accuracy,binary,0.7011401,5,0.02095737,Preprocessor1_Model199
200,accuracy,binary,0.7011401,5,0.02095737,Preprocessor1_Model200
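
One way to use these results is to pick the number of neighbors with the highest cross-validated accuracy. The sketch below does this with `slice_max()` on a hypothetical tibble, `shown_metrics`, that holds only the six rows displayed above. The full table has 200 rows, so the value selected from the complete results may differ.

```r
library(dplyr)

# Only the six rows visible in the printed output (not the full 200-row table)
shown_metrics <- tibble(
    neighbors = c(1, 2, 3, 198, 199, 200),
    mean = c(0.6617556, 0.6617556, 0.6491429, 0.6993383, 0.7011401, 0.7011401)
)

# Keep the single row with the highest mean accuracy, then extract its number of neighbors
best_k <- shown_metrics |>
    slice_max(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)
best_k
```

On the full results, the same `slice_max()` call could be applied directly to `knn_fit`.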


In [16]:
install.packages("kknn")

Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done

