In [None]:
#load these before doing anything else
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)


Title: Clinical Determinants of Heart Disease

Introduction:
We are using the Heart Disease Dataset from Kaggle (link: https://www.kaggle.com/datasets/ineubytes/heart-disease-dataset).
The data set contains 14 columns with information on each patient's age, sex, chest pain, cholesterol level, and other clinical factors. The last column, target, takes the values 0 and 1 and indicates whether the patient has heart disease (0 = does not have heart disease; 1 = has heart disease). The dataset already appears to be in a tidy format.

Research question: We aim to predict whether an individual has heart disease (i.e., the target value) using 4 predictors: age, serum cholesterol in mg/dl (chol), maximum heart rate achieved (thalach), and resting blood pressure in mm Hg on admission to the hospital (trestbps).

We chose these four predictors because they seem clinically related to the risk of heart disease. Clinical guidelines classify cholesterol at or above 240 mg/dl as high. Systolic blood pressure over 140 mmHg and/or diastolic blood pressure over 90 mmHg is considered high blood pressure. A maximum heart rate is considered critically high if it exceeds 220 minus the patient's age. These factors have all been shown to increase the risk of heart disease in patients. We are analyzing our data with the help of these clinical standards.


Preliminary exploratory data analysis:

In [None]:
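#reading in our data set:
heart_data <- read_csv("data/heart (1).csv")

#keeping the 4 predictors and the class label; target must be a factor for classification
heart_data_1 <- heart_data |>
    select(age, trestbps, chol, thalach, target) |>
    mutate(target = as_factor(target))

#setting a seed so the random split is reproducible (the seed value is arbitrary)
set.seed(1)

#splitting it into training (75%) and test (25%) sets, stratified by target
heart_split <- initial_split(heart_data_1, prop = 0.75, strata = target)
heart_training <- training(heart_split)
heart_testing <- testing(heart_split)
heart_training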



In [None]:
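#extra exploratory information: mean of every column in the full data set
#(for columns coded as 0/1, such as target, the mean is the proportion of 1s)
mean_data <- heart_data |>
    map_df(mean, na.rm = TRUE)

mean_data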


In [None]:
#Section: summarizing the data into table(s)

# Cholesterol (chol column)

# target: 0=not diagnosed with heart disease
#         1= diagnosed with heart disease

# Note: High cholesterol is clinically diagnosed when a patient's cholesterol is equal to or over 240 mg/dl
high_chol <- heart_training |>
    select(chol, target) |>
    filter(chol >= 240) |>
    group_by(target) |>
    summarize(count = n())

high_chol


In [None]:
#Section: summarizing the data into table(s), part 2
#Maximum heart rate (thalach column)
normal_max_heart_rate <- heart_training |>
    mutate(normal_thalach = 220 - age)

high_heart_rate <- normal_max_heart_rate |>
    mutate(diff_thalach = normal_thalach - thalach) |>
    mutate(diff_thalach_2 = ifelse(diff_thalach >= 0, "no", "yes"))
high_heart_rate

#normal max heart rate is calculated using a fixed equation: 220 - age
#we compare the calculated (normal) max heart rate with the measured max heart rate (thalach column)
#comparing these two columns lets us quantify whether the max heart rate is critically high
#if the difference (normal_thalach - thalach) is negative, the max heart rate is considered high and potentially dangerous


In [None]:
#Section: Visualization for Maximum Heart Rate (thalach column)
high_heart_plot <- high_heart_rate |>
    ggplot(aes(x = diff_thalach_2, fill = target)) +
    geom_bar(stat = "count") +
    labs(x = "Whether patients have elevated max heart rate", y = "Count", fill = "Diagnosed with Heart Disease") +
    ggtitle("Heart Disease in relation to Maximum Heart Rate")
high_heart_plot

#second plot (same data):
#named heart_diff_plot rather than plot to avoid masking base R's plot() function
heart_diff_plot <- high_heart_rate |>
    ggplot(aes(x = target, y = diff_thalach, color = target)) +
    geom_point() +
    labs(x = "Heart Disease", y = "Normal max heart rate minus recorded max heart rate", color = "Diagnosed with Heart Disease") +
    ggtitle("Heart Disease in relation to Maximum Heart Rate")
heart_diff_plot

#We added both graphs because we wanted feedback on which graph visualizes this data better


Both graphs suggest that patients with a negative difference between their estimated "normal" maximum heart rate (220 - age) and their recorded maximum heart rate (i.e., a recorded maximum above the normal limit) tend to be diagnosed with heart disease (target = 1). In the first graph in particular, a greater proportion of patients with a maximum heart rate over the normal limit have been diagnosed with heart disease, compared with patients without an elevated maximum heart rate.
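The proportions read off the bar chart could also be computed directly. The sketch below assumes the high_heart_rate data frame built above, with its diff_thalach_2 and target columns; the helper name disease_proportions is hypothetical and is not part of the original analysis.

```r
library(tidyverse)

# Hypothetical helper: for each elevated-heart-rate group ("yes"/"no"),
# compute the share of patients in each target class.
disease_proportions <- function(data) {
    data |>
        count(diff_thalach_2, target) |>       # patients per group and class
        group_by(diff_thalach_2) |>
        mutate(proportion = n / sum(n)) |>     # share within each group
        ungroup()
}
```

Calling disease_proportions(high_heart_rate) would return one row per combination of group and diagnosis, so the two groups can be compared numerically rather than by eye.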

Methods:
We will select the age, trestbps (resting blood pressure), chol (cholesterol), thalach (maximum heart rate), and target (diagnosed with heart disease or not) columns. We will perform a K-nearest neighbours classification on the heart disease dataset, using the target column as our class label and the other 4 columns as our predictors. We will test a range of K values to find the one that gives the highest accuracy. We will then show the results of our prediction in one graph or a few linked graphs. We believe that age, resting blood pressure, cholesterol, and maximum heart rate are important factors in determining and predicting heart disease in an individual.
We will also create multiple scatter plots, for example plotting target (x-axis) against age (y-axis), to look for positive or negative trends.
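One way the K-selection step could look in tidymodels is sketched below. This is only an outline under stated assumptions, not our final analysis: the helper name tune_knn, the candidate K values, and the use of 5-fold cross-validation are all illustrative choices, and the "kknn" engine requires the kknn package to be installed.

```r
library(tidymodels)

# Hypothetical helper (illustrative only): tunes K for a KNN classifier with
# cross-validation and returns the estimated accuracy for each candidate K.
tune_knn <- function(training_data, k_values = seq(1, 51, by = 2), folds = 5) {
    # Standardize the predictors so no single variable dominates the distance
    knn_recipe <- recipe(target ~ age + trestbps + chol + thalach,
                         data = training_data) |>
        step_scale(all_predictors()) |>
        step_center(all_predictors())

    # KNN model specification with the number of neighbours left to be tuned
    knn_spec <- nearest_neighbor(weight_func = "rectangular",
                                 neighbors = tune()) |>
        set_engine("kknn") |>
        set_mode("classification")

    # Cross-validation folds drawn from the training data only
    cv_folds <- vfold_cv(training_data, v = folds, strata = target)

    workflow() |>
        add_recipe(knn_recipe) |>
        add_model(knn_spec) |>
        tune_grid(resamples = cv_folds, grid = tibble(neighbors = k_values)) |>
        collect_metrics() |>
        filter(.metric == "accuracy") |>
        arrange(desc(mean))   # best-performing K first
}
```

With the training set from the earlier cell, tune_knn(heart_training) would return a table sorted by mean accuracy, from which a K can be chosen before the final model is evaluated once on heart_testing. Setting a seed before the call keeps the folds reproducible.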


Expected outcomes and significance:
We expect that higher age, blood pressure, cholesterol, and heart rate will each increase the chances of heart disease.
A model like this could be a helpful way to make quick predictions of whether a patient is likely to have heart disease based on these factors. Of course, these predictions are not professional diagnoses; however, they could still serve as a baseline screening tool for medical professionals to use on patients.
This raises the further question of which other factors, and which combinations of factors, increase the chances of heart disease.
