# Predicting the Likelihood of Heart Disease within a Patient

## Introduction
Heart disease is a medical condition that affects the heart and blood vessels, preventing proper blood circulation. We plan to address the question: “Is there a correlation between the variables in the dataset and whether an individual has heart disease?” The dataset we selected contains recorded health variables for patients in Cleveland who tested either positive or negative for heart disease.

## Methods & Results


In [1]:
# Load all required packages
library(tidymodels)
library(tidyverse)
library(repr)
library(rvest)
library(readxl)
library(RColorBrewer)
library(cowplot)
     

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ recipes      1.0.1
✔ dials        1.0.0     ✔ rsample      1.0.0
✔ dplyr        1.0.9     ✔ tibble       3.1.7
✔ ggplot2      3.3.6     ✔ tidyr        1.2.0
✔ infer        1.0.2     ✔ tune         1.0.0
✔ modeldata    1.0.0     ✔ workflows    1.0.0
✔ parsnip      1.0.0     ✔ workflowsets 1.0.0
✔ purrr        0.3.4     ✔ yardstick    1.0.0

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()

In [21]:
# Read the CSV file from the UCI repository (the file has no header row)
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
heart_data <- read_csv(url, col_names = FALSE)

Rows: 303 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): X12, X13
dbl (12): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X14

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
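
The column specification above shows `X12` and `X13` parsed as character columns. Missing values in this dataset are recorded as `?`, which is the likely reason. As a minimal sketch, and assuming we wanted those entries treated as `NA` at read time, the `na` argument of `read_csv` could be used. The three-row `sample_text` below is a hypothetical stand-in for the real file, not actual patient data.

```r
library(readr)

# Hypothetical three-row sample in the same comma-separated layout;
# "?" marks a missing value, as in the Cleveland file.
sample_text <- "63.0,1.0,0.0\n67.0,?,3.0\n41.0,2.0,?\n"

# na = "?" tells read_csv to treat "?" as missing, so the affected
# columns are parsed as numbers instead of character strings.
demo <- read_csv(I(sample_text), col_names = FALSE, na = "?",
                 show_col_types = FALSE)

# Number of missing values in each column
colSums(is.na(demo))
```

Reading the data this way would change the later cells that compare against `"?"`, so it is shown only as an alternative, not as the approach used in this report.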


In [22]:
# Clean and wrangle the data
# Add meaningful column names.
# We renamed the original attribute "num" to "Heart_Disease" because "num" had little meaning

set.seed(1000)

heart_data <- rename(heart_data,
                     Age = X1,
                     Sex = X2,
                     Chest_Pain_Type = X3,
                     Resting_Blood_Pressure = X4,
                     Serum_Cholestoral = X5,
                     Fasting_Blood_Sugar = X6,
                     Resting_Electrocardiographic_Results = X7,
                     Maximum_Heart_Rate = X8,
                     Exercise_Induced_Angina = X9,
                     ST_Depression = X10,
                     Slope_Peak_excercise = X11,
                     Major_Vessels = X12,
                     Thalassemia = X13,
                     Heart_Disease = X14)



Because we are predicting whether a patient has heart disease, we convert the `Heart_Disease` column into a factor.

In [23]:
heart_data$Heart_Disease <- as.factor(heart_data$Heart_Disease)

We only want to know whether each patient tested positive or negative for heart disease. This means we only need the values 0 (negative) and 1 (positive). We reassigned the values 2, 3, and 4 to 1 because any value greater than 1 also means that the patient has heart disease.

In [24]:
# Recode levels 2, 3, and 4 as 1 (heart disease present)
heart_data$Heart_Disease[heart_data$Heart_Disease %in% c("2", "3", "4")] <- "1"
# Drop the now-unused levels so the factor has only the levels 0 and 1
heart_data$Heart_Disease <- droplevels(heart_data$Heart_Disease)
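
The recoding described above could also be done in one step on the raw numeric codes, before converting to a factor. The sketch below assumes a hypothetical vector `raw_codes` in place of the real column and produces a factor with exactly two levels.

```r
# Hypothetical raw diagnosis codes (0 = negative, 1 to 4 = positive)
raw_codes <- c(0, 2, 1, 0, 4, 3)

# Any code greater than 0 becomes 1; the result is a two-level factor
binary <- factor(ifelse(raw_codes > 0, 1, 0), levels = c(0, 1))

# Count of patients in each class
table(binary)
```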


Here we summarize the data in a single table, grouped by heart disease status (0 = negative, 1 = positive).

In [25]:
summary_table <- heart_data |>
        group_by(Heart_Disease) |>
        summarize(number_patients = n(),
                  mean_age = mean(Age, na.rm = TRUE),
                  median_age = median(Age, na.rm = TRUE),
                  mean_resting_blood_pressure = mean(Resting_Blood_Pressure, na.rm = TRUE),
                  median_resting_blood_pressure = median(Resting_Blood_Pressure, na.rm = TRUE),
                  mean_max_heart_rate = mean(Maximum_Heart_Rate, na.rm = TRUE),
                  median_max_heart_rate = median(Maximum_Heart_Rate, na.rm = TRUE),
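                  # Note: this counts every "?" entry in the whole data frame, not per group,
                  # so the same total appears in both rows of the table
                  number_rows_missing_data = sum(heart_data == "?"))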
summary_table

| Heart_Disease | number_patients | mean_age | median_age | mean_resting_blood_pressure | median_resting_blood_pressure | mean_max_heart_rate | median_max_heart_rate | number_rows_missing_data |
|---|---|---|---|---|---|---|---|---|
| `<fct>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` |
| 0 | 164 | 52.58537 | 52 | 129.25 | 130 | 158.378 | 161 | 6 |
| 1 | 139 | 56.6259 | 58 | 134.5683 | 130 | 139.259 | 142 | 6 |
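
In the summary above, `sum(heart_data == "?")` is evaluated on the whole data frame, which is why both groups report the same count. If a per-group count of rows with missing entries were wanted, one possible approach is to refer to the columns directly inside `summarize`. The sketch below uses a small hypothetical tibble `toy`, not the real data, and assumes the `?` entries occur only in `Major_Vessels` and `Thalassemia` (the two columns read in as character).

```r
library(dplyr)

# Hypothetical toy data with "?" marking missing entries
toy <- tibble(
  Heart_Disease = factor(c("0", "0", "1", "1")),
  Major_Vessels = c("0.0", "?", "1.0", "?"),
  Thalassemia   = c("3.0", "3.0", "?", "7.0")
)

# Column names inside summarize() refer to the current group only,
# so this counts rows with at least one "?" separately for each class.
missing_by_group <- toy |>
  group_by(Heart_Disease) |>
  summarize(rows_with_missing = sum(Major_Vessels == "?" | Thalassemia == "?"))

missing_by_group
```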


In [26]:
# Split the data into training (75%) and testing (25%) sets, stratified by Heart_Disease
heart_split <- initial_split(heart_data, prop = 0.75, strata = Heart_Disease)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)
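
Stratifying on `Heart_Disease` is meant to keep the share of positive and negative patients similar in the training and testing sets. A quick way to check this is to compare the class proportions after splitting. The sketch below uses a hypothetical 100-row tibble `toy` with simulated ages rather than the real dataset.

```r
library(rsample)
library(dplyr)

set.seed(1000)

# Hypothetical data: 60 negative and 40 positive patients with simulated ages
toy <- tibble(
  Heart_Disease = factor(rep(c("0", "1"), times = c(60, 40))),
  Age = round(rnorm(100, mean = 54, sd = 9))
)

# Same split settings as used for heart_data
toy_split <- initial_split(toy, prop = 0.75, strata = Heart_Disease)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Class proportions should be close to 0.6 / 0.4 in both sets
toy_train |> count(Heart_Disease) |> mutate(proportion = n / sum(n))
toy_test  |> count(Heart_Disease) |> mutate(proportion = n / sum(n))
```

The same two `count()` calls could be run on `heart_train` and `heart_test` to confirm that the split preserved the class balance.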


#### Table Legend
Table 1 <br>
Table 2 <br>
Table 3 <br>

#### Figure Legend
Figure 1 <br>
Figure 2 <br>
Figure 3 <br>

## Discussion




#### Conclusion & Future Areas of Investigation


## References
