# Factors for Heart Disease Proposal

## Introduction

Many factors contribute to heart disease. The ones we consider here are a patient's age and sex, together with clinical measurements and symptoms such as chest pain type (cp), fasting blood sugar (fbs), and cholesterol (chol). The question we will try to answer is: can we predict a patient's heart disease diagnosis from these predictors? To answer it, we will use a heart disease data set in which each row is a patient, recording the patient's age, sex, and the symptoms and measurements related to heart disease.

## Reading the data into R

For our project, we need to read the heart_disease.xlsx file, located in our data folder. To do this, we will use read_excel from the readxl library, which we load alongside tidyverse and tidymodels (dplyr is already attached by tidyverse, so loading it again is harmless). We will name our data heart_data.

In [2]:
library(tidyverse)
library(tidymodels)
library(readxl)
library(dplyr)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.5     ✔ rsample

In [3]:
heart_data <- read_excel("data/heart_disease.xlsx")

heart_data |>
    slice(1:10)

age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<dbl>
63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
67,1,4,160,286,0,2,108,1,1.5,2,3,3,2
67,1,4,120,229,0,2,129,1,2.6,2,2,7,1
37,1,3,130,250,0,0,187,0,3.5,3,0,3,0
41,0,2,130,204,0,2,172,0,1.4,1,0,3,0
56,1,2,120,236,0,0,178,0,0.8,1,0,3,0
62,0,4,140,268,0,2,160,0,3.6,3,2,3,3
57,0,4,120,354,0,0,163,1,0.6,1,0,3,0
63,1,4,130,254,0,2,147,0,1.4,2,1,7,2
53,1,4,140,203,1,2,155,1,3.1,3,0,7,1


With the data loaded, we need to prepare it before applying our chosen method. The first thing to notice is that two columns that should be numeric, ca and thal, are being read as characters (<chr>). We use mutate to convert them to doubles, and at the same time convert num, our outcome, to a factor so that it can be used for classification. Entries that cannot be parsed as numbers become NA during the conversion, which is what the coercion warning below reports. We then use na.omit() to remove any rows with missing values, since we cannot use them for our predictions.

In [4]:
set.seed(9997)
heart_data <- read_excel("data/heart_disease.xlsx") |>
    mutate(ca = as.numeric(ca),      # stored as text in the file
           thal = as.numeric(thal),  # stored as text in the file
           num = as.factor(num)) |>  # outcome must be a factor for classification
    na.omit()                        # drop rows with missing values

heart_data |>
    slice(1:10)

ℹ In argument: `ca = as.numeric(ca)`.
! NAs introduced by coercion


age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
67,1,4,160,286,0,2,108,1,1.5,2,3,3,2
67,1,4,120,229,0,2,129,1,2.6,2,2,7,1
37,1,3,130,250,0,0,187,0,3.5,3,0,3,0
41,0,2,130,204,0,2,172,0,1.4,1,0,3,0
56,1,2,120,236,0,0,178,0,0.8,1,0,3,0
62,0,4,140,268,0,2,160,0,3.6,3,2,3,3
57,0,4,120,354,0,0,163,1,0.6,1,0,3,0
63,1,4,130,254,0,2,147,0,1.4,2,1,7,2
53,1,4,140,203,1,2,155,1,3.1,3,0,7,1
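
To make the coercion warning above concrete, here is a minimal sketch on a made-up data frame. It assumes that missing entries are stored as a non-numeric placeholder string; the `"?"` used here is an assumption for illustration, not something confirmed from the file. `as.numeric()` turns such strings into `NA`, and `na.omit()` then drops those rows.

```r
# Toy data (hypothetical values): `ca` is stored as text and one entry is a placeholder.
toy <- data.frame(
  age = c(63, 67, 41),
  ca  = c("0", "3", "?"),   # "?" is an assumed placeholder for a missing value
  stringsAsFactors = FALSE
)

# as.numeric() cannot parse "?", so it returns NA (the warning is silenced here).
toy$ca <- suppressWarnings(as.numeric(toy$ca))

# na.omit() removes every row that contains at least one NA.
toy_clean <- na.omit(toy)

n_dropped <- nrow(toy) - nrow(toy_clean)
print(toy_clean)
print(n_dropped)
```

Comparing the row count before and after `na.omit()` in the same way on heart_data would show how many patients are lost to missing values.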


## Creating the training and test set
In this section, the data is split into a training set and a test set, so that we can train our model and then measure its accuracy on data it has not seen. The rows are selected randomly but stratified on the outcome variable num (strata = num), so that each diagnosis category appears in similar proportions in both sets. We set prop = 0.8 because we want to use 80% of our data for training and keep the remaining 20% for testing. We set the seed to 9997 so that the split is reproducible.

In [5]:
set.seed(9997) # set a seed to be consistent
heart_split <- initial_split(heart_data, prop = 0.80, strata = num)  
heart_train <- training(heart_split)   
heart_test <- testing(heart_split)

heart_train |>
    slice(1:5)
heart_test |>
    slice(1:5)

age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
37,1,3,130,250,0,0,187,0,3.5,3,0,3,0
41,0,2,130,204,0,2,172,0,1.4,1,0,3,0
57,0,4,120,354,0,0,163,1,0.6,1,0,3,0
57,1,4,140,192,0,0,148,0,0.4,2,0,6,0


age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
56,1,2,120,236,0,0,178,0,0.8,1,0,3,0
56,1,3,130,256,1,2,142,1,0.6,2,1,6,2
54,1,4,140,239,0,0,160,0,1.2,1,0,3,0
58,1,3,132,224,0,2,173,0,3.2,1,2,7,3
59,1,4,135,234,0,0,161,0,0.5,2,0,7,0


We can now see that the training and test data contain different rows and are split into two separate tibbles that we can use with our model.
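
A quick way to check that stratification did what we wanted is to compare the class proportions in the two sets. The sketch below uses a made-up data frame (`toy`, with a two-level outcome `y`) rather than our heart data, so the names and numbers are illustrative only.

```r
library(rsample)

set.seed(9997)

# Hypothetical data: 100 rows, outcome y with a 60/40 class balance.
toy <- data.frame(
  x = rnorm(100),
  y = factor(rep(c("0", "1"), times = c(60, 40)))
)

# Stratified 80/20 split on the outcome, mirroring the call used for heart_data.
toy_split <- initial_split(toy, prop = 0.80, strata = y)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Class proportions should be close to each other in the two sets.
train_props <- prop.table(table(toy_train$y))
test_props  <- prop.table(table(toy_test$y))
print(train_props)
print(test_props)
```

The same two `prop.table(table(...))` calls on `heart_train$num` and `heart_test$num` would show whether each diagnosis category is represented similarly in both sets.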

## Summarizing training data

From here on, we will use only the training data for our analysis. First, we group the data by the column num and then summarise every numeric column by its mean, along with the number of patients in each group, so we can compare the averages across the diagnosis categories.

In [None]:
summary_table <- heart_train |>
      group_by(num) |>
      summarise(across(where(is.numeric), list(mean = ~mean(.x, na.rm = TRUE))),
            Count = n())

summary_table

With this data, it can be seen that patients who come back negative for heart disease (num = 0) have noticeably different means from the other categories. This is an important pattern, because it tells us which values to look at when deciding whether or not a patient has some form of heart disease. Other patterns in the table show differences between the individual diagnoses, which suggests that our model should also be able to distinguish which diagnosis a patient has.

## Visualizing Training Data

Now that the data has been summarized, it is also important to visualize it. Here, pivot_longer reshapes the data into a long format with one row per predictor value, so that facet_wrap can draw a separate histogram for each predictor. These histograms show the distribution of each predictor, which lets us see which variables are continuous and which are discrete.

In [1]:
# Reshape the training data to long format: one row per (patient, predictor) pair
heart_train_long <- heart_train |>
  pivot_longer(cols = -num, names_to = "Predictor", values_to = "Value")

# One histogram per predictor, each with its own axis scales
heart_train_long |>
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "red", color = "black", alpha = 0.7) +
  facet_wrap(~ Predictor, scales = "free") +
  labs(title = "Distribution of Predictor Variables",
       x = "Value",
       y = "Count") +
  theme_minimal()



The graphs show the distribution of each variable, which helps us see which predictors we may want to focus on. Five predictors stand out for their continuous distributions: age, chol, oldpeak, thalach, and trestbps. The variables ca, cp, exang, fbs, restecg, sex, slope, and thal, on the other hand, are all discrete. This tells us which variables we may want to rely on most, since a distance-based method such as K-nearest neighbours generally handles continuous data more naturally than coded categories.
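
One rough way to back up the visual impression is to count the distinct values in each column. The sketch below does this on a small data frame built from three columns of the first six rows shown earlier; the cut-off of 5 distinct values is an arbitrary assumption chosen for this tiny example, not a rule from our analysis.

```r
# Mini data set: chol, sex and cp from the first six rows of the table above.
toy <- data.frame(
  chol = c(233, 286, 229, 250, 204, 236),
  sex  = c(1, 1, 1, 1, 0, 1),
  cp   = c(1, 4, 4, 3, 2, 2)
)

# Number of distinct values per column.
n_distinct_vals <- sapply(toy, function(col) length(unique(col)))

# Assumed rule of thumb: treat a column as discrete if it has at most 5 distinct values.
is_discrete <- n_distinct_vals <= 5
print(n_distinct_vals)
print(is_discrete)
```

On the full training data a larger cut-off would be needed, since a coded variable can have a handful of levels while a continuous one typically has dozens of distinct values.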

## Methods
### Which variables/columns to use 
For this project, we're going to use every variable we have. They are all relevant and important factors in predicting the diagnosis of heart disease. 

### How we will conduct our data analysis 
We will use the K-nearest neighbours (KNN) algorithm to predict the diagnosis of heart disease. First, we will use cross-validation to estimate the accuracy for a range of K values and plot accuracy against the number of neighbours, using the template below, to choose the best K.
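
For intuition, the idea behind K-nearest neighbours can be sketched in a few lines of base R: standardize the predictors, compute the distance from a new observation to every training point, and take a majority vote among the K closest. This is an illustration on made-up data with a hypothetical helper `knn_predict_one()`, not the kknn implementation used in our analysis.

```r
# Hypothetical training data: two predictors and a binary label.
train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    8, 8,
                    8, 9,
                    9, 8),
                  ncol = 2, byrow = TRUE)
train_y <- factor(c("0", "0", "0", "1", "1", "1"))

# Hypothetical helper: classify one new point by majority vote of its k nearest neighbours.
knn_predict_one <- function(new_x, train_x, train_y, k) {
  # Standardize with the training means and standard deviations.
  mu <- colMeans(train_x)
  sdev <- apply(train_x, 2, sd)
  train_z <- sweep(sweep(train_x, 2, mu), 2, sdev, "/")
  new_z <- (new_x - mu) / sdev

  # Euclidean distance from the new point to every training point.
  dists <- sqrt(rowSums(sweep(train_z, 2, new_z)^2))

  # Labels of the k closest points, then the most frequent label
  # (ties go to the first factor level).
  nearest <- train_y[order(dists)[seq_len(k)]]
  names(which.max(table(nearest)))
}

pred_low  <- knn_predict_one(c(1.5, 1.5), train_x, train_y, k = 3)
pred_high <- knn_predict_one(c(8.5, 8.5), train_x, train_y, k = 3)
print(c(pred_low, pred_high))
```

Standardizing first matters because otherwise a predictor measured on a large scale, such as chol, would dominate the distance over one measured on a small scale, such as oldpeak.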

In [None]:
# KNN Steps
# Cross-Validation
heart_vfold <- vfold_cv(heart_train, v = 5, strata = num)
k_vals <- tibble(neighbors = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)) # K-Values you want to test out

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
            set_engine("kknn") |>
            set_mode("classification")

heart_recipe <- recipe(num ~ . , data = heart_train) |>
                    step_scale(all_predictors()) |>
                    step_center(all_predictors())


heart_fit <- workflow() |>
                     add_recipe(heart_recipe) |>
                     add_model(knn_spec) |>
                     tune_grid(resamples = heart_vfold, grid = k_vals)

heart_results <- collect_metrics(heart_fit)

accuracies <- heart_results |>
              filter(.metric == "accuracy")

cross_val_plot <- ggplot(accuracies, aes(x = neighbors, y = mean))+
       geom_point() +
       geom_line() +
       labs(x = "Neighbors (k)", y = "Accuracy Estimate") +
       theme(text = element_text(size = 20)) 
       # scale_x_continuous(breaks = seq(0, 14, by = 1)) +  # adjusting the x-axis
       # scale_y_continuous(limits = c(0.4, 1.0)) # adjusting the y-axis
cross_val_plot
# K = 10 is the best K - value
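
Rather than reading the best K off the plot by eye, it can also be extracted from the metrics table. The sketch below uses a made-up `toy_accuracies` data frame with the same `neighbors` and `mean` columns that collect_metrics returns; the accuracy values are invented for illustration and say nothing about our data.

```r
# Hypothetical accuracy estimates (invented numbers, not real results).
toy_accuracies <- data.frame(
  neighbors = c(2, 4, 6, 8, 10, 12),
  mean      = c(0.52, 0.55, 0.61, 0.58, 0.57, 0.56)
)

# Row with the highest mean cross-validated accuracy.
best_row <- toy_accuracies[which.max(toy_accuracies$mean), ]
best_k <- best_row$neighbors
print(best_k)
```

When several K values give nearly the same accuracy, it is still worth looking at the plot, since a K in a stable region of the curve may be preferable to an isolated peak.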

### How we will conduct our data analysis (Continued)
Second, we will evaluate the performance of the classifier on the test set by computing its accuracy with the metrics function and by producing a confusion matrix. We will change the proportion of the training data until we get the best accuracy.

In [None]:
# Build model specifications with the best k value
knn_spec_final <- nearest_neighbor(weight_func = "rectangular", neighbors = 10) |>
                  set_engine("kknn") |>
                  set_mode("classification")

# Fit on the training data only; the test set is held out for evaluation
final_fit <- workflow() |>
            add_recipe(heart_recipe) |>
            add_model(knn_spec_final) |>
            fit(data = heart_train)

heart_predictions <- predict(final_fit, heart_test) |>
                     bind_cols(heart_test)

heart_metrics <- heart_predictions |>
                 metrics(truth = num, estimate = .pred_class)

heart_conf_mat <- heart_predictions |>
                 conf_mat(truth = num, estimate = .pred_class)

heart_metrics
heart_conf_mat
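
To show what the accuracy and the confusion matrix measure, here is a small base R sketch on invented labels. The vectors `truth` and `pred` are hypothetical and are not output from our model.

```r
# Hypothetical true labels and predictions (invented, not model output).
truth <- factor(c("0", "0", "1", "1", "0", "1", "0", "1"), levels = c("0", "1"))
pred  <- factor(c("0", "1", "1", "1", "0", "0", "0", "1"), levels = c("0", "1"))

# Confusion matrix: rows are predictions, columns are true labels.
conf <- table(Prediction = pred, Truth = truth)

# Accuracy = correct predictions (the diagonal) divided by all predictions.
accuracy <- sum(diag(conf)) / sum(conf)
print(conf)
print(accuracy)
```

The off-diagonal cells are the misclassifications, so the confusion matrix shows not only how often the classifier is wrong but also which diagnoses it confuses with each other.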

### Expected outcomes and significance
Our primary goal is to build a K-nearest neighbours (KNN) model that can reliably predict a heart disease diagnosis from essential medical indicators. We want to identify key predictors, such as age, cholesterol, and maximum heart rate, to improve the model's accuracy. Choosing a suitable number of neighbours (K) will help to avoid both underfitting and overfitting, making the model more reliable. This work matters because it may support earlier identification and better clinical decision-making, allowing healthcare practitioners to identify and intervene with at-risk individuals sooner. If successful, it could contribute to medical prediction modelling and encourage further uses of machine learning in healthcare, perhaps saving lives by enabling timely and targeted treatment of heart disease.