# Group 29 Project Proposal: Distinguishing the Presence and Absence of Heart Disease

## Introduction:

Heart disease is any condition that affects the structure or function of the heart. Key risk factors include high blood pressure, high cholesterol, and smoking. In the United States, heart disease and its complications (e.g. heart attacks) are the leading cause of death, accounting for about 659,000 deaths each year.

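To investigate how well heart disease can be detected from clinical measurements, we ask the question: "<b>How accurate is KNN classification at detecting the presence of heart disease?</b>" We will answer it by applying a K-nearest neighbors (KNN) algorithm to the processed Cleveland subset of the <i>Heart Disease Data Set</i> donated by David W. Aha. This subset contains 303 observations of 14 variables. All 14 are stored as numbers, but several (such as sex and chest pain type) are numerically coded categories rather than quantities. The variables are listed below: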

1. Age (in years)
2. Sex (1 = male; 0 = female)
3. Chest Pain Type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic)
4. Resting Blood Pressure (in mm Hg on admission to the hospital)
5. Serum Cholesterol in mg/dl
6. Fasting Blood Sugar > 120 mg/dl (1 = true; 0 = false)
7. Resting Electrocardiographic Results (0 = normal; 1 = having ST-T wave abnormality; 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
8. Maximum Heart Rate Achieved
9. Exercise-Induced Angina (1 = yes; 0 = no)
10. ST Depression induced by exercise relative to rest
11. The slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping)
12. The number of major vessels (0 - 3) colored by fluoroscopy
13. Thalassemia Blood Disorder (3 = normal; 6 = fixed defect; 7 = reversible defect)
    - this disorder causes red blood cells to carry less oxygen to the heart
14. Diagnosis of Heart Disease (0 = absence; 1 - 4 = presence, recoded to 1 in our analysis)

### Preliminary Exploratory Data Analysis


- Recoded the last column (`diagnosis`) to a binary outcome, following the dataset documentation:
    - values 1 - 4 indicate the presence of heart disease at increasing severity, so each was set to 1
    - a value of 0 indicates absence
- Converted the `diagnosis` column to a factor
- Converted two columns (`major_vessels` and `thal`) from character to numeric

In [2]:
library(tidyverse)
library(tidymodels)

In [4]:
# Read the processed Cleveland subset; the file has no header row, so column names are supplied.
# NOTE: the fourth column, resting_blood_sugar, actually holds resting blood pressure
# (item 4 in the variable list); the name is kept so that it matches the printed output.
cleveland_data <- read_csv("data/processed.cleveland.data",
                          col_names = c("age", "sex", "chest_pain_type", "resting_blood_sugar",
                                        "cholestorol", "fasting_blood_sugar", "electrocardio_results",
                                        "max_heart_rate", "exercise_induced_angina", "ST_depression",
                                        "ST_peak_slope", "major_vessels", "thal", "diagnosis"))
cleveland_data
# nrow(cleveland_data) returns 303

# selects the diagnosis column and replaces all values > 1 with 1 (0 = absence, 1 = presence)
# code inspired from https://datacornering.com/replace-r-data-frame-column-values-conditionally/
cleveland_data[c("diagnosis")][which(cleveland_data$diagnosis > 1), ] <- 1

# cleaning up columns to be the correct type
# major_vessels and thal are read as character; as.numeric() turns any
# non-numeric entry into NA and emits a coercion warning (as.numeric() has no na.rm argument)
cleveland_data <- cleveland_data %>%
    mutate(major_vessels = as.numeric(major_vessels)) %>%
    mutate(thal = as.numeric(thal)) %>%
    mutate(diagnosis = as.factor(diagnosis))

# KNN specification: unweighted vote ("rectangular") among the 10 nearest neighbors
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 10) %>%
    set_engine("kknn") %>%
    set_mode("classification")

# Standardize every predictor so that no single variable dominates the distance calculation
knn_recipe <- recipe(diagnosis ~ ., data = cleveland_data) %>%
    step_scale(all_predictors()) %>%
    step_center(all_predictors())

knn_workflow <- workflow() %>%
    add_recipe(knn_recipe) %>%
    add_model(knn_spec)

# Fit on all rows; this only demonstrates that the pipeline runs and is not an accuracy estimate
knn_fit <- knn_workflow %>%
    fit(data = cleveland_data)

# A hypothetical new patient to classify
new_patient <- tibble(age = 63, sex = 0, chest_pain_type = 3, resting_blood_sugar = 200,
                      cholestorol = 300, fasting_blood_sugar = 0, electrocardio_results = 2,
                      max_heart_rate = 180, exercise_induced_angina = 0, ST_depression = 0.9,
                      ST_peak_slope = 3, major_vessels = 2, thal = 6)

prediction <- predict(knn_fit, new_patient)

# 0 = No Heart Disease, 1 = Heart Disease
prediction


Parsed with column specification:
cols(
  age = col_double(),
  sex = col_double(),
  chest_pain_type = col_double(),
  resting_blood_sugar = col_double(),
  cholestorol = col_double(),
  fasting_blood_sugar = col_double(),
  electrocardio_results = col_double(),
  max_heart_rate = col_double(),
  exercise_induced_angina = col_double(),
  ST_depression = col_double(),
  ST_peak_slope = col_double(),
  major_vessels = col_character(),
  thal = col_character(),
  diagnosis = col_double()
)



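| age | sex | chest_pain_type | resting_blood_sugar | cholestorol | fasting_blood_sugar | electrocardio_results | max_heart_rate | exercise_induced_angina | ST_depression | ST_peak_slope | major_vessels | thal | diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 6.0 | 0 |
| 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 3.0 | 2 |
| 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 7.0 | 1 |
| 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | 3.0 | 0 |
| 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | 3.0 | 0 |
| 56 | 1 | 2 | 120 | 236 | 0 | 0 | 178 | 0 | 0.8 | 1 | 0.0 | 3.0 | 0 |
| 62 | 0 | 4 | 140 | 268 | 0 | 2 | 160 | 0 | 3.6 | 3 | 2.0 | 3.0 | 3 |
| 57 | 0 | 4 | 120 | 354 | 0 | 0 | 163 | 1 | 0.6 | 1 | 0.0 | 3.0 | 0 |
| 63 | 1 | 4 | 130 | 254 | 0 | 2 | 147 | 0 | 1.4 | 2 | 1.0 | 7.0 | 2 |
| 53 | 1 | 4 | 140 | 203 | 1 | 2 | 155 | 1 | 3.1 | 3 | 0.0 | 7.0 | 1 |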


“Problem with `mutate()` input `major_vessels`: NAs introduced by coercion”
“Problem with `mutate()` input `thal`: NAs introduced by coercion”


.pred_class
<fct>
1
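
The coercion warnings above arise because `major_vessels` and `thal` contain entries that are not numbers, which is also why `read_csv()` parsed both columns as character. In the processed Cleveland file those entries are `?` placeholders for missing values. A minimal sketch of one alternative, assuming it is acceptable to declare `?` as missing at read time through the `na` argument of `read_csv()`. The rows and the shortened column list below are made up for illustration and are not taken from the dataset:

```r
library(readr)

# Made-up rows in the same comma-separated layout (NOT real patient records).
# "?" marks a missing value, as in the processed Cleveland file.
toy_csv <- "60,1,0.0,6.0,0\n65,1,3.0,3.0,2\n50,1,?,3.0,0\n55,0,0.0,?,0\n"

toy_data <- read_csv(toy_csv,
                     col_names = c("age", "sex", "major_vessels", "thal", "diagnosis"),
                     na = "?")  # treat "?" as NA while parsing

# Both columns now parse as numeric, with NA where "?" appeared
toy_data

# Number of rows with at least one missing value
rows_with_missing <- sum(!complete.cases(toy_data))
rows_with_missing
```

If the real file were read this way, the two `as.numeric()` conversions and their warnings would likely become unnecessary, although the rows containing `NA` would still need to be handled before modelling.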


Using only training data, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data. 
- show the number of 
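
One possible shape for this summary table is sketched below. It assumes a 75/25 split stratified by `diagnosis` (the proportion and the seed are arbitrary choices, not decisions made in this proposal) and uses a small made-up data frame, `toy_data`, as a stand-in for the cleaned `cleveland_data`:

```r
library(dplyr)
library(rsample)

set.seed(2022)  # arbitrary seed so the split is reproducible

# Made-up stand-in for the cleaned data (NOT real patient records)
toy_data <- data.frame(
  age = round(rnorm(40, mean = 55, sd = 9)),
  max_heart_rate = round(rnorm(40, mean = 150, sd = 20)),
  major_vessels = c(NA, sample(0:3, 39, replace = TRUE)),
  diagnosis = factor(rep(c(0, 1), each = 20))
)

# Hold out 25% of rows for testing; stratify so both classes appear in each part
toy_split <- initial_split(toy_data, prop = 0.75, strata = diagnosis)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# One row per class: number of observations, predictor means,
# and number of rows with at least one missing value
train_summary <- toy_train %>%
  mutate(has_missing = !complete.cases(toy_train)) %>%
  group_by(diagnosis) %>%
  summarize(
    n = n(),
    mean_age = mean(age),
    mean_max_heart_rate = mean(max_heart_rate),
    rows_with_missing = sum(has_missing)
  )

train_summary
```

Only `toy_train` is summarized, so nothing about the held-out rows leaks into the exploratory analysis.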

Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.
- use one plot to visualize distributions of predictor variables (bar plot for last column?)
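
A rough sketch of two candidate plots, assuming `ggplot2` and a made-up training set, `toy_train`, in place of the real training rows. The choice of `max_heart_rate` as the predictor to display is only an example:

```r
library(ggplot2)

set.seed(2022)  # arbitrary seed for the made-up data

# Made-up training rows (NOT real patient records)
toy_train <- data.frame(
  max_heart_rate = c(rnorm(30, mean = 160, sd = 18), rnorm(30, mean = 140, sd = 22)),
  diagnosis = factor(rep(c(0, 1), each = 30))
)

# Overlaid histograms: distribution of one predictor, split by diagnosis
heart_rate_plot <- ggplot(toy_train, aes(x = max_heart_rate, fill = diagnosis)) +
  geom_histogram(bins = 15, alpha = 0.6, position = "identity") +
  labs(x = "Maximum heart rate achieved",
       y = "Number of patients",
       fill = "Diagnosis (0 = absence, 1 = presence)")

# Bar chart of the class balance in the last column
class_plot <- ggplot(toy_train, aes(x = diagnosis)) +
  geom_bar() +
  labs(x = "Diagnosis (0 = absence, 1 = presence)",
       y = "Number of patients")

heart_rate_plot
class_plot
```

The histogram shows whether a predictor separates the two classes at all, and the bar chart shows whether the classes are balanced, which matters when interpreting accuracy.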

### Methods
Explain how you will conduct either your data analysis and which variables/columns you will use. Note - you do not need to use all variables/columns that exist in the raw data set. In fact, that's often not a good idea. For each variable think: is this a useful variable for prediction?
Describe at least one way that you will visualize the results
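
Since the research question is about accuracy, one way the analysis could proceed is to fit the KNN workflow on training data only and score it on held-out test data. The sketch below is an assumption about how that might look, not the method this proposal has committed to: the data frame `toy_data`, the two predictors, the 75/25 split, and `neighbors = 5` are all placeholders.

```r
library(tidymodels)

set.seed(2022)  # arbitrary seed so the split is reproducible

# Made-up stand-in for the cleaned data (NOT real patient records)
toy_data <- data.frame(
  age = c(rnorm(50, mean = 50, sd = 8), rnorm(50, mean = 60, sd = 8)),
  max_heart_rate = c(rnorm(50, mean = 160, sd = 18), rnorm(50, mean = 140, sd = 22)),
  diagnosis = factor(rep(c(0, 1), each = 50))
)

toy_split <- initial_split(toy_data, prop = 0.75, strata = diagnosis)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# The recipe is built from the training rows only, so the scaling
# parameters are not influenced by the test set
knn_recipe <- recipe(diagnosis ~ ., data = toy_train) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

knn_fit <- workflow() %>%
  add_recipe(knn_recipe) %>%
  add_model(knn_spec) %>%
  fit(data = toy_train)

# Predict the held-out rows and compare against the true labels
test_predictions <- predict(knn_fit, toy_test) %>%
  bind_cols(toy_test)

test_accuracy <- test_predictions %>%
  metrics(truth = diagnosis, estimate = .pred_class) %>%
  filter(.metric == "accuracy")

test_accuracy
conf_mat(test_predictions, truth = diagnosis, estimate = .pred_class)
```

The confusion matrix is one candidate for visualizing the results, because it separates missed cases of heart disease from false alarms, which a single accuracy figure hides.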

### Expected Outcomes and Significance
What do you expect to find?
What impact could such findings have?
What future questions could this lead to?

### Citations
Centers for Disease Control and Prevention, https://www.cdc.gov/heartdisease/risk_factors.htm.

“Heart Disease Facts.” Centers for Disease Control and Prevention, Centers for Disease Control and Prevention, 7 Feb. 2022, https://www.cdc.gov/heartdisease/facts.htm. 