## Heart Disease Prediction

## Introduction:

**Background Information**
<br>
Cardiovascular diseases affect a person's heart or blood vessels. They are the leading cause of death worldwide, which is why early diagnosis is important: it allows patients to get immediate treatment. Heart diseases are a specific group of cardiovascular diseases that affect the behaviour or structure of the heart. Several underlying health conditions and lifestyle factors, such as diabetes, obesity, high cholesterol, and high blood pressure, can increase the likelihood of heart disease. Possible symptoms and complications of heart disease include chest pain, heart attacks, heart failure, and strokes.

**Our Question:**
<br>
How accurately can we identify whether or not a patient has heart disease based on their age, resting blood pressure, and cholesterol level?



This dataset combines 5 smaller datasets (Cleveland, Hungarian, Switzerland, Long Beach VA, and the Statlog (Heart) Data Set), all originally sourced from the UCI Machine Learning Repository. It contains information on patients' health factors and whether or not each patient had heart disease. Important factors in this dataset include the patients' age, cholesterol level, and resting blood pressure.



## Preliminary exploratory data analysis:

In [None]:
# import needed libraries
library(tidyverse)
library(repr)
library(tidymodels)
library(gridExtra)

## Reading and scaling the data

In [None]:
# read dataset from web
heart <- read_csv("https://raw.githubusercontent.com/josephsoo/dsci_100_group_12/main/data/heart.csv")


# clean and wrangle into tidy format
# by turning HeartDisease into a factor and
# removing rows where Cholesterol = 0 (a value of 0 is treated here as missing)
heart <- heart |>
    mutate(HeartDisease = as_factor(HeartDisease)) |>
    filter(Cholesterol != 0)
head(heart)

# for reference: 1 = has heart disease, 0 = does not have heart disease

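# standardize (center and scale) the predictors and select relevant variables
# note: scale() uses the mean and standard deviation of the full dataset,
# so these statistics include rows that later end up in the test set;
# as.numeric() drops the matrix attributes that scale() returns
heart_scaled <- heart |>
    mutate(scaled_Cholesterol = as.numeric(scale(Cholesterol, center = TRUE)),
        scaled_RestingBP = as.numeric(scale(RestingBP, center = TRUE)),
        scaled_Age = as.numeric(scale(Age, center = TRUE))) |>
    select(HeartDisease, scaled_Cholesterol, scaled_RestingBP, scaled_Age)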

head(heart_scaled)
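
As a quick check of what `scale()` computes, the sketch below standardizes a small vector by hand and compares the result with `scale()`. The cholesterol values are made up for illustration and are not taken from the dataset.

```r
# Toy cholesterol values (hypothetical, not taken from the heart dataset)
chol <- c(180, 200, 220, 240, 260)

# standardize by hand: subtract the mean, then divide by the standard deviation
z_manual <- (chol - mean(chol)) / sd(chol)

# the same computation with scale(); as.numeric() drops the matrix attributes
z_scale <- as.numeric(scale(chol, center = TRUE))

# TRUE when both approaches agree
all.equal(z_manual, z_scale)
```

After standardizing, each variable has mean 0 and standard deviation 1, so no single predictor dominates a distance calculation because of its units.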

## Splitting the data

In [None]:
set.seed(3456) 

heart_split <- initial_split(heart_scaled, prop = 0.75, strata = HeartDisease)  
heart_train <- training(heart_split)   
heart_test <- testing(heart_split)
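
The scaling above is computed on the full dataset before the split. One alternative, sketched below under the assumption that a `recipes` preprocessing step is acceptable for this analysis, is to split first and learn the centering and scaling statistics from the training set only, so that the test set plays no part in preprocessing. The `toy` data frame and its values are simulated placeholders, not the heart dataset.

```r
library(tidymodels)

set.seed(3456)

# Simulated placeholder data (NOT the heart dataset)
toy <- tibble(
  HeartDisease = factor(rep(c(0, 1), each = 50)),
  Age = round(rnorm(100, mean = 54, sd = 9)),
  RestingBP = round(rnorm(100, mean = 132, sd = 18)),
  Cholesterol = round(rnorm(100, mean = 240, sd = 50))
)

# split first, keeping the class proportions similar in both sets
toy_split <- initial_split(toy, prop = 0.75, strata = HeartDisease)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# learn the mean and standard deviation from the training set only
toy_recipe <- recipe(HeartDisease ~ Age + RestingBP + Cholesterol, data = toy_train) |>
  step_normalize(all_predictors()) |>
  prep()

# apply the training-set statistics to both sets
train_scaled <- bake(toy_recipe, new_data = toy_train)
test_scaled <- bake(toy_recipe, new_data = toy_test)

head(train_scaled)
```

With this approach the training predictors have mean 0 and standard deviation 1, while the test predictors are only approximately standardized, which is expected.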

## Creating tables to summarize the training data

In [None]:
# Summarize the training data (exploratory data analysis):
# the number of observations in each class and the mean of each predictor.

# number of observations in each class of heart disease
heart_disease_count <- heart_train |> group_by(HeartDisease) |> summarize(n = n())
heart_disease_count

# averages of each predictor variable
predictor_means <- heart_train |> select(-HeartDisease) |> map_df(mean)
colnames(predictor_means) <- c("scaled_Cholesterol_mean", "scaled_RestingBP_mean", "scaled_Age_mean")
predictor_means
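
The summary tables above report class counts and predictor means. Rows with missing values can be counted in a similar way; the sketch below shows one possible approach on a small simulated table (placeholder values, not the heart dataset).

```r
library(tidyverse)

# Simulated placeholder rows (NOT the heart dataset), with two missing values
toy_train <- tibble(
  HeartDisease = factor(c(0, 1, 1, 0, 1)),
  scaled_Cholesterol = c(-0.5, NA, 1.2, 0.3, -1.0),
  scaled_RestingBP = c(0.1, 0.4, NA, -0.7, 0.2),
  scaled_Age = c(1.1, -0.3, 0.0, 0.6, -1.4)
)

# number of missing values in each column
missing_per_column <- toy_train |>
  summarize(across(everything(), ~ sum(is.na(.x))))
missing_per_column

# number of rows with at least one missing value
rows_with_missing <- sum(!complete.cases(toy_train))
rows_with_missing
```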

## Creating data visualizations of the training data

We have made 3 scatter plots, one for each pair of predictor variables, which show how the predictors are distributed and how they relate to one another. We used only 2D visualizations because 3D plots can be hard to read, which makes relationships difficult to identify. These plots let us see whether there are any general patterns in the data and compare the distributions of the predictor variables.

In [None]:
# Visualize the training data (exploratory data analysis):
# scatter plots of each pair of predictors, colored by heart disease status.
options(repr.plot.width = 20, repr.plot.height = 8)

cholesterol_resting_bp_plot <- heart_train |> ggplot(aes(x = scaled_Cholesterol, y = scaled_RestingBP, color = HeartDisease)) +
    geom_point() +
    labs(x = "Scaled Cholesterol", y = "Scaled Resting Blood Pressure", color = "Heart Disease", title = "Cholesterol vs Resting Blood Pressure")

cholesterol_age_plot <- heart_train |> ggplot(aes(x = scaled_Cholesterol, y = scaled_Age, color = HeartDisease)) +
    geom_point() +
    labs(x = "Scaled Cholesterol", y = "Scaled Age", color = "Heart Disease", title = "Cholesterol vs Age")

resting_bp_age_plot <- heart_train |> ggplot(aes(x = scaled_RestingBP, y = scaled_Age, color = HeartDisease)) +
    geom_point() +
    labs(x = "Scaled Resting Blood Pressure", y = "Scaled Age", color = "Heart Disease", title = "Resting Blood Pressure vs Age")

grid.arrange(cholesterol_resting_bp_plot, cholesterol_age_plot, resting_bp_age_plot, ncol=3)

## Distribution of predictors

We can also use histograms to visualize the distribution of each predictor variable and see what the data we are working with looks like.

In [None]:
cholesterol_plot <- heart_train |> ggplot(aes(x = scaled_Cholesterol)) +
    geom_histogram() +
    labs(x = "Scaled Cholesterol", y = "Count") +
    theme(text = element_text(size=20))

resting_bp_plot <- heart_train |> ggplot(aes(x = scaled_RestingBP)) +
    geom_histogram() +
    labs(x = "Scaled Resting Blood Pressure", y = "Count") +
    theme(text = element_text(size=20))

age_plot <- heart_train |> ggplot(aes(x = scaled_Age)) +
    geom_histogram() +
    labs(x = "Scaled Age", y = "Count") +
    theme(text = element_text(size=20))

cholesterol_plot
resting_bp_plot
age_plot

## Balancing

Here, we graph the number of observations in each class of the response variable (HeartDisease = 1 or 0).

In [None]:
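# count the observations in each class of the training data and plot them as bars
heart_train_count <- heart_train |>
    group_by(HeartDisease) |>
    summarize(n = n()) |>
    ggplot(aes(x = HeartDisease, y = n)) +
    geom_bar(stat = "identity") +
    labs(x = "Heart Disease", y = "Count")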

heart_train_count



Because the two classes have a close to equal number of observations, we should not need to worry about class imbalance.
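
The balance can also be expressed as class proportions rather than read off a bar chart. The sketch below uses hypothetical class labels, not the real counts from the heart dataset.

```r
library(tidyverse)

# Hypothetical class labels (NOT the real counts from the heart dataset)
toy_train <- tibble(HeartDisease = factor(c(rep(0, 52), rep(1, 48))))

# count each class and convert the counts to proportions
class_proportions <- toy_train |>
  count(HeartDisease) |>
  mutate(proportion = n / sum(n))
class_proportions
```

Proportions close to 0.5 for both classes would support treating the data as balanced.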

## Methods:

**Variables For Analysis**
<br>
We have 3 potential predictors:
- Age
- Cholesterol
- Resting Blood Pressure

and the variable we are predicting is:
- Heart Disease


**Method and Rationale**
<br>
Using every variable in a data set is rarely a good idea in classification problems, because irrelevant predictors can add noise and lower a model's accuracy.
Therefore, we are going to determine which subset of predictors gives our model the best performance. One way of doing so is to build up a model iteratively, adding one predictor per iteration and each time keeping the predictor that improves accuracy the most. This is known as forward selection.
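
A minimal sketch of forward selection is shown below. It is an illustration under several assumptions: the classifier is K-nearest neighbours with a fixed k = 5, accuracy is estimated with leave-one-out cross-validation via `class::knn.cv()`, and the data are simulated placeholders rather than the heart dataset. The helper `loocv_accuracy()` is a hypothetical name introduced for this example.

```r
library(class)

set.seed(3456)
n <- 200

# Simulated, already standardized placeholder predictors (NOT the heart dataset)
toy <- data.frame(
  Age = rnorm(n),
  RestingBP = rnorm(n),
  Cholesterol = rnorm(n)
)
# Hypothetical label that depends mostly on Age, so the toy problem has some signal
toy$HeartDisease <- factor(as.integer(toy$Age + 0.3 * rnorm(n) > 0))

# Hypothetical helper: leave-one-out accuracy of KNN for a given set of predictors
loocv_accuracy <- function(predictors, k = 5) {
  x <- toy[, predictors, drop = FALSE]
  pred <- knn.cv(train = x, cl = toy$HeartDisease, k = k)
  mean(pred == toy$HeartDisease)
}

candidates <- c("Age", "RestingBP", "Cholesterol")
selected <- character(0)
results <- data.frame(size = integer(0), predictors = character(0), accuracy = numeric(0))

# at each step, add the candidate that gives the highest accuracy
while (length(candidates) > 0) {
  accs <- sapply(candidates, function(p) loocv_accuracy(c(selected, p)))
  best <- names(which.max(accs))
  selected <- c(selected, best)
  candidates <- setdiff(candidates, best)
  results <- rbind(results, data.frame(
    size = length(selected),
    predictors = paste(selected, collapse = " + "),
    accuracy = max(accs)
  ))
}

results
```

Each row of `results` records the best model of a given size, so the subset with the highest estimated accuracy can be read directly from the table.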

Once we have found the subset of predictors that gives the highest accuracy, we will build a classification model using that subset and the best value of k.
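
The sketch below shows one way the best k could be chosen with `tidymodels`. It assumes a K-nearest neighbours classifier (through the `kknn` engine, which must be installed), 5-fold cross-validation, and a hypothetical grid of odd k values from 1 to 21. The data are simulated placeholders, not the heart dataset.

```r
library(tidymodels)

set.seed(3456)
n <- 200

# Simulated placeholder training data (NOT the heart dataset)
toy_train <- tibble(
  Age = rnorm(n, mean = 54, sd = 9),
  RestingBP = rnorm(n, mean = 132, sd = 18),
  Cholesterol = rnorm(n, mean = 240, sd = 50)
) |>
  mutate(HeartDisease = factor(as.integer(Age + rnorm(n, sd = 5) > 54)))

# standardize the predictors inside the recipe so each fold is scaled separately
toy_recipe <- recipe(HeartDisease ~ Age + RestingBP + Cholesterol, data = toy_train) |>
  step_normalize(all_predictors())

# KNN specification with the number of neighbours left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

toy_vfold <- vfold_cv(toy_train, v = 5, strata = HeartDisease)
k_grid <- tibble(neighbors = seq(1, 21, by = 2))

# cross-validated accuracy for every k in the grid
knn_results <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = toy_vfold, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# k with the highest mean cross-validated accuracy
best_k <- knn_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)
best_k
```

Using odd values of k avoids tied votes in a two-class problem, which is why the hypothetical grid skips even values.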

Some ways we can visualize the results:
- line graph of accuracy vs number of predictors
- scatter plot with background color of predicted class
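
As an illustration of the first visualization, the sketch below draws a line graph of accuracy against the number of predictors. The accuracy values are placeholders made up purely to show the plot layout and are not results from this analysis; the name `selection_results` is also hypothetical.

```r
library(tidyverse)

# Placeholder accuracies, made up only to show the plot layout (NOT results)
selection_results <- tibble(
  n_predictors = 1:3,
  accuracy = c(0.60, 0.65, 0.70)
)

# line graph with one point per model size
accuracy_plot <- selection_results |>
  ggplot(aes(x = n_predictors, y = accuracy)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 1:3) +
  labs(x = "Number of predictors", y = "Estimated accuracy",
       title = "Accuracy vs number of predictors")
accuracy_plot
```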

**Interpretation:**
<br>
As shown by the forward selection algorithm, the most accurate configuration for the model uses all three predictors. Forward selection therefore gives us a systematic, evidence-based way to arrive at the final set of predictors for our model.

## Expected outcomes and significance:

**What do you expect to find?**
- In this project, we expect to be able to use the predictors age, cholesterol, and resting blood pressure to answer the question: Will someone have heart disease?

**What impact could such findings have?**

- By knowing if someone will have heart disease, we could take note of people at risk and monitor them more closely. This might allow people who are at risk for heart disease to receive treatment before it becomes too late.

**What future questions could this lead to?**

- If we could identify factors that make people more likely to get heart disease, could we then change a person's lifestyle to reduce their chance of getting heart disease?

