# Introduction

Heart disease is common and can be fatal. Early prediction is key to preventing deaths, so finding a reliable way to predict the disease is important. The UCI Heart Disease dataset provides 13 attributes that can be used to predict heart disease, collected at four locations: Switzerland, Hungary, VA Long Beach, and Cleveland. It also records each patient's diagnosis: 0 for no disease, or 1 to 4 for disease of increasing severity.

https://archive.ics.uci.edu/dataset/45/heart+disease

This report uses the dataset to answer the following question:

## Predictive Question: To what extent do age, cholesterol level, resting blood pressure, and maximum heart rate help us create a predictive classifier for the diagnosis of heart disease?

The intention of this question is to see whether these relatively easily accessible measurements can predict the presence of heart disease. Some regions of the world do not have access to high-quality clinical data, so a good classifier that uses only easy-to-obtain variables could help many people in those regions.

## 13 key attributes
- Age: Age of the patient (years).
- Sex: Sex of the patient (1 = male, 0 = female).
- CP (Chest Pain Type): Type of chest pain.
- Trestbps (Resting Blood Pressure): Resting blood pressure (mm Hg).
- Chol (Serum Cholesterol): Serum cholesterol level (mg/dl).
- Fbs (Fasting Blood Sugar > 120 mg/dl): Whether fasting blood sugar exceeds 120 mg/dl (1 = true, 0 = false).
- Restecg (Resting Electrocardiographic Results): Results of the resting electrocardiogram.
- Thalach (Maximum Heart Rate): Maximum heart rate achieved (bpm).
- Exang (Exercise-Induced Angina): Whether exercise induces angina.
- Oldpeak (ST Depression Induced by Exercise): ST depression induced by exercise relative to rest.
- Slope (Slope of the Peak Exercise ST Segment): Slope of the peak exercise ST segment.
- Ca (Number of Major Vessels Colored by Fluoroscopy): Number of major vessels (0-3) colored by fluoroscopy.
- Thal: Thalassemia.
- Num: Diagnosis of heart disease, the outcome to predict (0 = no disease, 1-4 = presence of disease with increasing severity).






## Expectation
We do not expect the best possible result, because the model uses only easily accessible variables, which carry less detail than the full set of clinical measurements. However, some factors, such as cholesterol and heart rate, are known to be important, so we expect the model to reach a reasonable accuracy of around 70%.
We predict that age is a significant variable in determining whether a patient has heart disease: as age increases, the incidence of heart disease also increases.
We expect cholesterol and resting blood pressure to be positively correlated with age, and maximum heart rate to be negatively correlated with age.
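
Once the data is loaded, the expected correlations with age could be checked numerically. The following is a minimal sketch, not part of the original analysis: `age_correlations()` is a hypothetical helper, and the toy rows are invented purely to show the call. In the notebook it would be applied to the training data instead.

```r
# Pearson correlation of age with each of the other numeric columns in `df`.
# `age_correlations` is a hypothetical helper written for this illustration.
age_correlations <- function(df, predictors = c("tresbps", "chol", "thalach")) {
  sapply(predictors, function(p) cor(df$age, df[[p]], use = "complete.obs"))
}

# Toy rows invented only to demonstrate the call; they are NOT from the UCI data.
toy <- data.frame(
  age     = c(35, 45, 55, 65),
  tresbps = c(118, 125, 135, 142),
  chol    = c(190, 230, 220, 260),
  thalach = c(180, 165, 150, 135)
)

# Positive values support "increases with age"; negative values the opposite.
age_correlations(toy)
```

A value near 0 would mean the variable has little linear relationship with age, whatever the sign.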

In [None]:
# Install once if needed; ggplot2 and dplyr are loaded as part of tidyverse
# install.packages(c("tidyverse", "tidymodels", "patchwork"))

library(tidyverse)    # ggplot2, dplyr, readr, forcats
library(tidymodels)   # rsample, recipes, parsnip, tune, yardstick
library(patchwork)    # place ggplots side by side
library(repr)
library(knitr)
options(repr.matrix.max.rows = 6)


## Reading Data
We read the Cleveland data with read_csv(). The file has no header row, so the column names are supplied explicitly.

In [None]:

# No header row in the file, so column names are supplied; "?" marks missing values
cleveland <- read_csv("data/heart_disease/processed.cleveland.data",
                      col_names = c("age", "sex", "cp", "tresbps", "chol", "fbs", "restcg",
                                    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"),
                      na = c("", "NA", "?")) |>
  mutate(num = as_factor(num))


In [None]:
cleveland

Table 1

## Wrangling and Tidying the Data
I use select() to keep only age, tresbps, chol, and thalach, together with the outcome.
 Since we only want to know whether the patient has heart disease or not, I created a new column, diagnosis, using fct_recode(): 0 becomes "No", and 1, 2, 3, and 4 all become "Yes". <br>
 Yes: heart disease present <br>
 No: heart disease not present

In [None]:
heart_disease_selected <- cleveland |>
  mutate(diagnosis = fct_recode(num, "No" = "0", "Yes" = "1", "Yes" = "2", "Yes" = "3", "Yes" = "4")) |>
  select(age, tresbps, chol, thalach, diagnosis)

heart_disease_selected

Table 2

## Balance of Data
Now we check whether the two classes are balanced.

In [None]:
heart_disease_selected |>
  group_by(diagnosis) |>
  summarize(
    count = n(),
    percentage = n() / nrow(heart_disease_selected) * 100  # avoid hard-coding the row count
  )


Table 3

Table 3 shows that the data is roughly balanced: the numbers of "Yes" and "No" diagnoses are almost equal.

### Splitting Data
I split the data into a training set (75%) and a testing set (25%) for KNN classification, stratifying by diagnosis so that both sets keep similar class proportions.

In [None]:
set.seed(1)
heart_disease_split <- initial_split(heart_disease_selected, prop = 0.75, strata = diagnosis)
heart_disease_train <- training(heart_disease_split)
heart_disease_test <- testing(heart_disease_split)

heart_disease_train

Table 4

## Visualization

Using the training data, we investigate the relationship between age and cholesterol level, maximum heart rate, and resting blood pressure.
We use scatter plots to see whether the variables are correlated and whether the presence of heart disease varies with each variable.


In [None]:
hd_age_bp <- select(heart_disease_train, age, tresbps, diagnosis)
hd_age_bp 



Table 5

In [None]:
age_chol_plot <- ggplot(heart_disease_train, aes(x = age, y = chol)) +
                    # alpha is a fixed setting, so it belongs outside aes()
                    geom_point(aes(color = diagnosis, shape = diagnosis), alpha = 0.2) +
                    labs(x = "Age", y = "Cholesterol (mg/dl)", color = "Diagnosis of Heart Disease", shape = "Diagnosis of Heart Disease", caption = "Figure 1") +
                    ggtitle("Age vs Cholesterol Scatterplot")

age_chol_plot

In [None]:
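age_rate_plot <- ggplot(heart_disease_train, aes(x = age, y = thalach)) +
                    # alpha is a fixed setting, so it belongs outside aes()
                    geom_point(aes(color = diagnosis, shape = diagnosis), alpha = 0.2) +
                    # same title for color and shape so the two legends merge
                    labs(x = "Age", y = "Maximum Heart Rate (bpm)", color = "Diagnosis of Heart Disease", shape = "Diagnosis of Heart Disease", caption = "Figure 2") +
                    ggtitle("Age vs Maximum Heart Rate Scatterplot")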

age_rate_plot

In [None]:
age_bp_plot <- ggplot(heart_disease_train, aes(x = age, y = tresbps)) +
                    # alpha is a fixed setting, so it belongs outside aes()
                    geom_point(aes(color = diagnosis, shape = diagnosis), alpha = 0.2) +
                    labs(x = "Age", y = "Resting Blood Pressure (mm Hg)", color = "Diagnosis of Heart Disease", shape = "Diagnosis of Heart Disease", caption = "Figure 3") +
                    ggtitle("Age vs Resting Blood Pressure Scatterplot")

age_bp_plot

# Methods

I will use the KNN (k-nearest neighbors) classification method to build a model that answers my question: To what extent do age, cholesterol level, resting blood pressure, and maximum heart rate help us predict the diagnosis of heart disease?

Using all of the key attributes listed above would likely give the most helpful model. However, this project focuses only on easily accessible information, to see whether it is enough to predict heart disease, because some countries or regions do not have the equipment to collect the more complicated measurements. Most regions, however, have machines to measure cholesterol levels and resting blood pressure.

The variables used are:
- Age
- Chol (Cholesterol Levels)
- Trestbps (Resting Blood Pressure)
- Thalach (Maximum Heart Rate)


These are the four predictors used for the classification.

The method is to try multiple k values and tune k to find the value with the highest cross-validation accuracy. We then fit the model with that k and use the testing data to see how accurate it is with the predictors I chose. This tells us the accuracy we can expect from these predictors. We will also use a confusion matrix to see what kinds of mistakes the model makes.

One way to visualize the tuning is a line graph with the k value on the x-axis and accuracy on the y-axis.
We can also plot scatter plots of the predicted values and the true values for the test data to compare them visually.
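
The tuning idea described above can be sketched without tidymodels. This is only an illustration under stated assumptions: it uses the built-in `iris` data as a stand-in (not the heart data), the `class::knn()` function instead of the kknn engine, and a hand-rolled 5-fold split. The names `cv_accuracy`, `k_grid`, and `cv_results` are made up for this sketch.

```r
library(class)  # provides knn(); ships with standard R installations

set.seed(1)

# Stand-in data: two iris species and four numeric predictors (NOT the heart data).
dat <- droplevels(subset(iris, Species != "setosa"))
x <- scale(dat[, 1:4])  # scaled once up front for brevity
y <- dat$Species

# Assign each row to one of 5 folds at random.
folds <- sample(rep(1:5, length.out = nrow(x)))

# Mean cross-validated accuracy for a single value of k.
cv_accuracy <- function(k) {
  fold_acc <- sapply(1:5, function(f) {
    held_out <- folds == f
    pred <- knn(train = x[!held_out, ], test = x[held_out, ],
                cl = y[!held_out], k = k)
    mean(pred == y[held_out])
  })
  mean(fold_acc)
}

# Try a grid of k values and record the accuracy estimate for each.
k_grid <- seq(1, 25, by = 2)
cv_results <- data.frame(neighbors = k_grid,
                         mean_accuracy = sapply(k_grid, cv_accuracy))

# k with the highest estimated accuracy (the first one if several tie).
best_k_sketch <- cv_results$neighbors[which.max(cv_results$mean_accuracy)]
best_k_sketch
```

Plotting `mean_accuracy` against `neighbors` gives the line graph mentioned above.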

## Training the classifier

We used 'initial_split' to create the training and testing data, with 75% of the original data going to the training set. We use this training set, 'heart_disease_train', to train the classification model. We then create a recipe, center and scale all predictors, and find the best k value by plotting k against accuracy.

We create the model specification with the 'nearest_neighbor' function, setting 'weight_func' to "rectangular" so that each of the k nearest neighbors gets an equal vote. We set neighbors = tune() so that cross-validation can find the k value with the highest accuracy. The recipe, called 'heart_recipe', scales and centers each predictor so that all predictors carry the same weight in the distance calculation.
We use the 'vfold_cv()' function with v = 5 to perform 5-fold cross-validation, which splits the training set into 5 parts. We also use 'tune_grid' to fit the model for each value in the grid of k values.
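
To see why scaling and centering matter for a distance-based method, here is a small base R sketch. The three toy patients are invented for illustration and are not taken from the dataset. `scale()` standardizes each column to mean 0 and standard deviation 1, which is the same idea as the centering and scaling steps in the recipe.

```r
# Toy values invented only to illustrate scale effects (NOT from the dataset).
toy <- data.frame(
  age  = c(40, 70, 41),
  chol = c(200, 205, 260)
)

# Euclidean distances from patient 1 to patients 2 and 3 on the raw scale.
raw_dist <- as.matrix(dist(toy))[1, 2:3]

# The same distances after standardizing each column.
scaled_dist <- as.matrix(dist(scale(toy)))[1, 2:3]

raw_dist     # patient 3 is about twice as far as patient 2: the chol gap dominates
scaled_dist  # after standardizing, the two patients are roughly equally far away
```

On the raw scale, a 60 mg/dl cholesterol gap outweighs a 30-year age gap simply because cholesterol is measured in larger numbers. After standardizing, both gaps contribute on comparable terms.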


In [None]:
set.seed(9999)

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# The outcome belongs only on the left-hand side of the formula
heart_recipe <- recipe(diagnosis ~ age + tresbps + chol + thalach, data = heart_disease_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

heart_vfold <- vfold_cv(heart_disease_train, v = 5, strata = diagnosis)

k_vals <- tibble(neighbors = seq(from = 1, to = 50))

knn_results <- workflow() |>
    add_recipe(heart_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = heart_vfold, grid = k_vals) |>
    collect_metrics()

accuracies <- knn_results |>
    filter(.metric == "accuracy")



accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbors", y = "Accuracy Estimate", caption = "Figure 4") +
  theme(text = element_text(size = 12))

accuracy_vs_k


We assign the k value with the highest accuracy, which is 19, to best_k.

In [None]:
best_k <- accuracies |>
    arrange(desc(mean)) |>
    head(1) |>
     pull(neighbors)
best_k

As the graph above shows, K = 19 gives the highest cross-validation accuracy estimate, around 70%. The accuracy for K = 20 is close to this value, but I choose 19 as the best k value to train the model.

## Testing the data
Now I retrain the k-nn classifier on the full training set with the chosen k value, so that its quality can be evaluated on the test data.

In [None]:
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(heart_recipe) |>
  add_model(knn_spec) |>
  fit(data = heart_disease_train)

knn_fit

Next, we predict the diagnosis for the test set so that accuracy and precision can be calculated to evaluate the model.

In [None]:
heart_test_predictions <- predict(knn_fit, heart_disease_test) |>
  bind_cols(heart_disease_test)

heart_test_predictions



Table 6

### Prediction Plot Analysis
These plots compare the predictions with the actual diagnoses visually.




In [None]:
options(repr.plot.height = 5, repr.plot.width = 10)

# Each pair shows the predicted class next to the true diagnosis for the same test points
hd_chol_prediction_plot <- ggplot(heart_test_predictions, aes(x = age, y = chol, color = .pred_class)) +
                                geom_point(alpha = 0.5) +
                                labs(x = "Age", y = "Cholesterol (mg/dl)", color = "Predictions", title = "Predictions Plot, Cholesterol", caption = "Figure 5")
hd_chol_truth_plot <- ggplot(heart_test_predictions, aes(x = age, y = chol, color = diagnosis)) +
                                geom_point(alpha = 0.5) +
                                labs(x = "Age", y = "Cholesterol (mg/dl)", color = "Truth", title = "Truth Plot, Cholesterol", caption = "Figure 6")


hd_rate_prediction_plot <- ggplot(heart_test_predictions, aes(x = age, y = thalach, color = .pred_class)) +
                                geom_point(alpha = 0.5) +
                                labs(x = "Age", y = "Max Heart Rate (bpm)", color = "Predictions", title = "Predictions Plot, Max Heart Rate", caption = "Figure 7")
hd_rate_truth_plot <- ggplot(heart_test_predictions, aes(x = age, y = thalach, color = diagnosis)) +
                                geom_point(alpha = 0.5) +
                                labs(x = "Age", y = "Max Heart Rate (bpm)", color = "Truth", title = "Truth Plot, Max Heart Rate", caption = "Figure 8")


hd_trestbps_prediction_plot <- ggplot(heart_test_predictions, aes(x = age, y = tresbps, color = .pred_class)) +
                                geom_point(alpha = 0.5) +
                                labs(x = "Age", y = "Resting Blood Pressure (mm Hg)", color = "Predictions", title = "Predictions Plot, Blood Pressure", caption = "Figure 9")
hd_trestbps_truth_plot <- ggplot(heart_test_predictions, aes(x = age, y = tresbps, color = diagnosis)) +
                                geom_point(alpha = 0.5) +
                                labs(x = "Age", y = "Resting Blood Pressure (mm Hg)", color = "Truth", title = "Truth Plot, Blood Pressure", caption = "Figure 10")

hd_chol_prediction_plot + hd_chol_truth_plot 
hd_rate_prediction_plot + hd_rate_truth_plot
hd_trestbps_prediction_plot + hd_trestbps_truth_plot




### Analysis
These graphs plot age against cholesterol, maximum heart rate, and resting blood pressure for the test data, once colored by the predicted class and once colored by the true diagnosis. The points are the same in each pair of plots; only the colors differ, so comparing the two shows where the prediction was right or wrong. For example, in the prediction plot for maximum heart rate, the two predicted classes are clearly separated by heart rate. In the truth plot, however, the classes are mixed, and maximum heart rate alone does not clearly show whether a patient has heart disease. This makes the plots a good visualization tool for seeing where the predictions went wrong.
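
Another way to locate the mistakes is to label each test point by the kind of outcome, instead of comparing two plots by eye. The sketch below is an assumption-laden illustration: `toy_predictions` is a hypothetical stand-in with invented values, shaped like the prediction table (a `.pred_class` column next to `diagnosis`), and the `outcome` column is a name made up here.

```r
# Hypothetical stand-in for the prediction table (values invented for illustration).
toy_predictions <- data.frame(
  age         = c(45, 52, 61, 67, 58),
  thalach     = c(170, 160, 140, 120, 150),
  .pred_class = factor(c("No", "No", "Yes", "Yes", "No"), levels = c("No", "Yes")),
  diagnosis   = factor(c("No", "Yes", "Yes", "No", "No"), levels = c("No", "Yes"))
)

# Label each row: correct, a missed case (false negative), or a false alarm (false positive).
toy_predictions$outcome <- ifelse(
  toy_predictions$.pred_class == toy_predictions$diagnosis,
  "correct",
  ifelse(toy_predictions$diagnosis == "Yes", "missed disease", "false alarm")
)

table(toy_predictions$outcome)
```

Mapping `outcome` to color in a single scatter plot would then show directly where the missed cases sit.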

# Evaluation
Now we evaluate the model by calculating its accuracy and precision on the test set.

In [None]:
heart_test_predictions |>
  metrics(truth = diagnosis, estimate = .pred_class) |>
  filter(.metric == "accuracy")

heart_test_predictions |> pull(diagnosis) |> levels()

heart_test_predictions|>
     precision(truth = diagnosis, estimate = .pred_class, event_level="second")

Table 7, Table 8

## Confusion Matrix

In [None]:

heart_test_predictions |>
             conf_mat(truth = diagnosis, estimate = .pred_class)

## Analysis of Confusion Matrix (Accuracy, Precision and Recall)

Accuracy can be calculated from the confusion matrix using the formula: <br>
Accuracy = Number of Correct Predictions / Total Number of Predictions <br>
0.645 = 49 / 76

The accuracy is about 65%, meaning that about 35% of the patients in the test set are misclassified. This accuracy is moderate: it might serve some hospitals as an initial screening measure, but it is not high enough to decide whether a patient has heart disease. Because people's lives depend on the result, the model is not complete as it stands, and it needs to be improved, for example with additional parameters.

Precision <br>
Precision = True Positives / (True Positives + False Positives), where a positive prediction means that a person is predicted to have heart disease. <br>
0.667 = 16 / 24

Precision shows how many of the classifier's positive predictions were actually positive. The precision is approximately 67%, meaning that about 33% of the patients predicted to have heart disease do not actually have it.

Recall <br>
Recall = True Positives / (True Positives + False Negatives) <br>
0.457 = 16 / 35

The recall of the model is about 46%, which is poor: about 54% of the patients with heart disease are not identified by the model. False negatives are the worst kind of error here, because patients with heart disease would be left without treatment, which could make their condition worse.
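
As a worked check, the three metrics can be recomputed from the four confusion-matrix counts. The counts below are derived from the fractions quoted in this section (49 of 76 predictions correct, 16 of 24 positive predictions correct, 16 of 35 true cases found). The variable names are illustrative.

```r
# Counts implied by the fractions reported in this section.
tp <- 16        # predicted Yes, truly Yes
fp <- 24 - tp   # predicted Yes, truly No   (8)
fn <- 35 - tp   # predicted No,  truly Yes  (19)
tn <- 49 - tp   # predicted No,  truly No   (33)

accuracy  <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)

round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
# accuracy 0.645, precision 0.667, recall 0.457
```

The four counts add up to 76, the size of the test set, which confirms that the three reported fractions are consistent with one confusion matrix.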



# Discussion

#### Summarize what you found
The model's accuracy is 64.47%, meaning it classifies about 64.47% of the test cases correctly.
The precision is about 67%, indicating that when the model predicts heart disease, it is correct 66.7% of the time.
The recall is about 46%, showing that the model identifies fewer than half of the actual heart disease cases, which is not good enough for a heart disease model.
The scatter plots visualize the relationship between age and various health metrics (cholesterol, maximum heart rate, and resting blood pressure). From them, we found that resting blood pressure does not strongly affect the presence of heart disease: people with lower resting blood pressure have heart disease about as often as those with high resting blood pressure. The plots also show only a weak correlation between resting blood pressure and age, with the points widely spread.


#### Discuss whether this is what you expected to find.

The incidence of heart disease was expected to increase with age, and the scatter plots show that older patients do have heart disease more often. This was what we expected to find. However, the model could not capture this well. Although age correlated positively with resting blood pressure and cholesterol levels, these variables did not separate patients with heart disease from those without, so the model could not distinguish them reliably. It was unexpected that resting blood pressure gave so little useful information for deciding whether a person has heart disease.

Regarding the model's accuracy, we expected approximately 70%. The actual KNN model fell short at 65%. This is still close to what we expected, since we chose only easily accessible variables and did not expect them to be enough for a strong model. In that sense, the result matches our hypothesis to some degree. However, the recall of the model was only around 46%, which I did not expect. It means that there are many false negatives, and the recall is too low for the model to be good. This indicates that the model has to be improved.


#### Discuss what impact such findings could have.
The theme of this research is to see whether easily accessible parameters can form a good model with high prediction accuracy, so that people in regions without access to high-quality measurements can still predict the presence of heart disease. From this point of view, we found that these four variables alone are not sufficient: the recall was about 46%, meaning that the model produced many false negative predictions.
This is a useful result in itself. It shows that age, blood pressure, cholesterol level, and maximum heart rate on their own cannot form a good model, and in particular that resting blood pressure is not a good parameter for heart disease prediction. From these findings, we can move on to see whether other easily accessible data, such as sex, would contribute to a better prediction model. The high number of false negatives also raises the next question of why the model misses so many cases. At the same time, the model shows what machine learning is capable of: with an accuracy of 65%, it could still be of some use in regions where nothing better is available, and it could support early detection and prediction of the disease.


#### Discuss what future questions this could lead to.

What other parameters would enhance the accuracy of the model?
Would the model change if the data were not limited to Cleveland?


#### Answering the Predictive question
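Age, cholesterol level, resting blood pressure, and maximum heart rate on their own did not produce a good predictive classifier: the accuracy was about 65% and the recall about 46%.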

## Significance


These variables can be collected easily in most regions, so it would be valuable if this basic information could be used to predict heart disease. A model built this way could then potentially be applied to people worldwide, although the data used here is limited to Cleveland. If such a model works, people around the world could be helped. If not, it will be important for researchers to find out which easily accessible data are good predictors of heart disease.

## Further Question
Which factors that are relatively accessible in most regions and countries would improve the accuracy of the model?



