# Project Proposal 

### Livleen Randhawa

## Introduction

The term heart disease refers to a type of disease that affects the heart and/or blood vessels. Risk factors for heart disease include high blood pressure and high cholesterol (National Cancer Institute, n.d.).

High blood pressure is linked to heart disease because it can narrow and damage the arteries that deliver blood to the heart (Mayo Clinic, 2023b). Although cholesterol is needed to build healthy cells, high levels of cholesterol lead to fatty deposits in blood vessels. As these deposits grow, they restrict blood flow, which links high cholesterol to heart disease (Mayo Clinic, 2023a).

The objective of this project is to use these risk factors to classify patients according to whether they are likely to have heart disease.

The question I will be addressing is: Can the blood pressure and cholesterol of a patient accurately predict whether they have heart disease or not?

## Preliminary Exploratory Data Analysis 

In [None]:
#add library
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

In [None]:
#set the seed 
set.seed(4567)

In [None]:
#reading in the data
#missing values are recorded as "?" in the raw file, so read them in as NA
cleveland_data <- read_csv("data/heart_disease/processed.cleveland.data", col_names = FALSE, na = "?")

In [None]:
#adding column names 
colnames(cleveland_data) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope","ca", "thal", "num")
cleveland_data

In [None]:
#making the target binary and storing it as a factor for classification
cleveland_data <- cleveland_data |>
                   mutate(num = as_factor(case_when(num >= 1 ~ "presence",
                                                    num == 0 ~ "absence")))
cleveland_data

In [None]:
#removing the columns that contain missing values (ca and thal)
cleveland_data <- cleveland_data |>
                  select(-ca, -thal)
cleveland_data

In [None]:
#splitting the data 
cleveland_split <- initial_split(cleveland_data, prop = 0.75, strata = num)
cleveland_train <- training(cleveland_split)
cleveland_test <- testing(cleveland_split)
cleveland_train

In [None]:
#summary of training data 
cleveland_train_explore <- cleveland_train |>
                            select(chol, trestbps, num) |>
                            group_by(num) |>
                            summarize(
                                count = n(),
                                mean_chol = mean(chol),
                                mean_trestbps = mean(trestbps)
                            )
cleveland_train_explore

In [None]:
#visualization of training data 
cleveland_plot <- ggplot(cleveland_train, aes(x = chol, y = trestbps, color = num)) +
        geom_point() +
        xlab("Serum Cholesterol (mg/dl)") +
        ylab("Resting Blood Pressure (mmHg)") +
        labs(color = "Heart Disease")
cleveland_plot 

## Methods

I am using processed.cleveland.data from the Heart Disease Database to predict whether a patient from Cleveland has heart disease. The columns are:

age: age in years

sex: sex (1 = male, 0 = female)

cp: chest pain type

trestbps: resting blood pressure in mmHg

chol: serum cholesterol in mg/dl

fbs: fasting blood sugar > 120 mg/dl? (1 = True, 0 = False)

restecg: resting electrocardiographic results

thalach: maximum heart rate achieved

exang: exercise-induced angina (1 = True, 0 = False)

oldpeak: ST depression induced by exercise, relative to rest

slope: the slope of the peak exercise ST segment (1 = upslope, 2 = flat, 3 = downslope)

ca: number of major vessels (0-3) colored by fluoroscopy

thal: (3 = normal, 6 = fixed defect, 7 = reversible defect)

num: diagnosis of heart disease (1, 2, 3, 4 = presence, 0 = absence)

To clean the data, I removed the two columns that contain missing values, ca and thal. Since num uses integers to distinguish presence (1, 2, 3, 4) from absence (0), and I only want to determine whether or not a patient has heart disease, I recoded num in place as a binary variable with the values absence and presence.
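One way to confirm which columns contain missing values before dropping them is to count the missing entries per column. The sketch below uses base R on a few made-up rows rather than the real file, and it assumes missing entries are written as "?" in the raw data. The names toy_text, toy, na_counts and cols_with_na are hypothetical and exist only for this illustration.

```r
# Toy rows in the same comma-separated layout as the raw file.
# The values and the reduced set of columns are made up for illustration.
toy_text <- "63,1,145,233,0,6\n67,1,160,286,3,?\n37,0,130,250,?,3"

toy <- read.csv(
  text = toy_text,
  header = FALSE,
  na.strings = "?",  # assumption: "?" marks a missing value
  col.names = c("age", "sex", "trestbps", "chol", "ca", "thal")
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Names of the columns that contain at least one missing value
cols_with_na <- names(na_counts)[na_counts > 0]
cols_with_na
```

The same idea should carry over to the tidyverse workflow, since read_csv() has an na argument that plays the role of na.strings here.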

I used initial_split() to split the data frame into 75% training and 25% testing data, stratifying by num, and I am using only the training set for the exploratory analysis.
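A quick check that stratification keeps the class balance similar in both sets is to compare class proportions after the split. This is a minimal sketch on made-up data, not the Cleveland data. The toy data frame and its 60/40 class balance are assumptions chosen only to make the effect easy to see.

```r
library(rsample)  # part of tidymodels; provides initial_split()

set.seed(4567)

# Toy data: 60 "absence" and 40 "presence" rows (made-up values)
toy <- data.frame(
  chol = rnorm(100, mean = 245, sd = 50),
  num = factor(rep(c("absence", "presence"), times = c(60, 40)))
)

# Same call pattern as in the analysis: 75% training, stratified by num
toy_split <- initial_split(toy, prop = 0.75, strata = num)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# With stratification, both tables should show roughly the original balance
prop.table(table(toy_train$num))
prop.table(table(toy_test$num))
```

Running the same two prop.table() calls on cleveland_train and cleveland_test would show whether the real split preserved the ratio of absence to presence.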

To summarize the training data, I grouped by num and then computed the count and the mean of each predictor I'm using, chol and trestbps. I noted that the mean cholesterol and mean blood pressure are higher for patients with heart disease than for patients without it.

To visualize relationships in the training data, I generated a scatter plot of the two predictors against each other, colouring the data points based on whether heart disease was absent or present.

## Expected outcomes and significance 

I expect to find that high cholesterol and blood pressure are predictors for heart disease. 

Being able to use an accurate classification system for heart disease could make it easier for doctors to diagnose patients, helping patients get treatment earlier.

Some future questions this could lead to are:

What measures can be taken to prevent heart disease?

Are there other predictors for heart disease with more significant relationships?

## References 

1. Mayo Foundation for Medical Education and Research. (2023a, January 11). High cholesterol. Mayo Clinic. https://www.mayoclinic.org/diseases-conditions/high-blood-cholesterol/symptoms-causes/syc-20350800
2. Mayo Foundation for Medical Education and Research. (2023b, November 28). How high blood pressure can affect the body. Mayo Clinic. https://www.mayoclinic.org/diseases-conditions/high-blood-pressure/in-depth/high-blood-pressure/art-20045868 
3. NCI Dictionary of Cancer terms. Comprehensive Cancer Information - NCI. (n.d.). https://www.cancer.gov/publications/dictionaries/cancer-terms/def/heart-disease 