# Predicting the Likelihood of Heart Disease within a Patient

## Introduction
Heart disease describes a group of serious medical conditions that affect the heart and blood vessels. There are several types of heart disease, including heart failure, arrhythmia, valvular heart disease, and congenital heart disease. Each involves a state in which blood cannot circulate properly through the heart. The resulting lack of blood flow can cause a range of symptoms that significantly affect the patient’s quality of life, including chest pain, fatigue, nausea, and shortness of breath. Heart disease is one of the leading causes of death, accounting for almost one-third of all deaths.

With our project, we plan to address the question: “Is there a correlation between the variables in the dataset and the outcome of whether individuals have heart disease?” The dataset we selected contains health records of patients who tested either positive or negative for heart disease. Using an algorithm, we plan to analyze the degree of correlation among the fourteen recorded variables for patients in Cleveland with and without heart disease. This analysis will let us determine whether the presence of heart disease can be detected from correlations between these variables. Our end goal for this group project is to use the dataset to predict, with a measurable level of certainty, whether a patient will test positive for heart disease.

## Preliminary Exploratory Data Analysis

In [2]:
#loading all packages
library(tidymodels)
library(tidyverse)
library(repr)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0


In [63]:
#read csv file from UCI
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
heart_data_unnamed <- read_csv(url, col_names = FALSE)

Rows: 303 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): X12, X13
dbl (12): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X14

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
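The column specification above reports `X12` and `X13` as character columns. The data preview later in this notebook shows a `?` in the `Major_Vessels` column, which suggests the file uses `?` as a missing-value placeholder. The following is a minimal sketch of how `read_csv` can be told to treat that placeholder as `NA`, using two rows copied from the preview rather than the full download; `sample_text` and `sample_rows` are names introduced here for illustration only.

```r
library(readr)

# Two rows copied from the dataset preview in this notebook;
# the second row has "?" in column 12.
sample_text <- I("63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0\n38,1,3,138,175,0,0,173,0,0.0,1,?,3.0,0\n")

# na = "?" asks read_csv to parse the placeholder as a missing value.
sample_rows <- read_csv(sample_text, col_names = FALSE, na = "?",
                        show_col_types = FALSE)

# With the placeholder handled, column 12 is parsed as a number
# and the missing entry becomes NA.
class(sample_rows$X12)
sum(is.na(sample_rows$X12))
```

The same `na = "?"` argument could be added to the `read_csv(url, ...)` call above, in which case later missing-data counts based on `is.na()` should reflect the `?` entries.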


In [64]:
# Cleaning and wrangling the data
# Add meaningful column names.
# We renamed the original attribute "num" to "Heart_Disease" because "num" conveys little meaning.

set.seed(1)

heart_data <- rename(heart_data_unnamed,
                     Age = X1,
                     Sex = X2,
                     Chest_Pain_Type = X3,
                     Resting_Blood_Pressure = X4,
                     Serum_Cholestoral = X5,
                     Fasting_Blood_Sugar = X6,
                     Resting_Electrocardiographic_Results = X7,
                     Maximum_Heart_Rate = X8,
                     Exercise_Induced_Angina = X9,
                     ST_Depression = X10,
                     Slope_Peak_excercise = X11,
                     Major_Vessels = X12,
                     Thalassemia = X13,
                     Heart_Disease = X14)

In [61]:
# We only want to know whether each patient tested positive or negative for heart disease,
# so we keep only the values 0 (negative) and 1 (positive).
# Values greater than 1 also indicate heart disease, so we reassign 2, 3, and 4 to 1.

heart_data$Heart_Disease[heart_data$Heart_Disease == "4"] <- "1"
heart_data$Heart_Disease[heart_data$Heart_Disease == "3"] <- "1"
heart_data$Heart_Disease[heart_data$Heart_Disease == "2"] <- "1"

heart_data

Age,Sex,Chest_Pain_Type,Resting_Blood_Pressure,Serum_Cholestoral,Fasting_Blood_Sugar,Resting_Electrocardiographic_Results,Maximum_Heart_Rate,Exercise_Induced_Angina,ST_Depression,Slope_Peak_excercise,Major_Vessels,Thalassemia,Heart_Disease
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>
63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,1
67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0,1
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
57,1,4,130,131,0,0,115,1,1.2,2,1.0,7.0,1
57,0,2,130,236,0,2,174,0,0.0,2,1.0,3.0,1
38,1,3,138,175,0,0,173,0,0.0,1,?,3.0,0
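The recoding above leaves `Heart_Disease` as a character column (`<chr>` in the preview). Classification models in tidymodels expect the outcome to be a factor, so one possible alternative is to recode and convert in a single `mutate()` step. This is a sketch under assumptions: `toy` and `toy_recoded` are hypothetical names, and the six outcome values are made up to cover the 0 to 4 range rather than taken from the dataset.

```r
library(dplyr)
library(tibble)

# Illustrative values only: the raw outcome column ranges from 0 to 4.
toy <- tibble(Heart_Disease = c(0, 2, 1, 0, 3, 4))

# Any value above 0 is treated as "has heart disease" (1);
# the result is stored as a factor with levels "0" and "1".
toy_recoded <- toy %>%
  mutate(Heart_Disease = factor(if_else(Heart_Disease > 0, 1, 0),
                                levels = c(0, 1)))

# Tabulate the two classes after recoding.
count(toy_recoded, Heart_Disease)
```

Applied to `heart_data`, the same `mutate()` call would replace the three assignment lines above, with the difference that the column becomes a factor instead of a character vector.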


In [65]:
# summarizing the data in one table

summary_table <- summarize(heart_data,
                   mean_age = mean(Age, na.rm = TRUE),
                   median_age = median(Age, na.rm = TRUE),
                   mean_resting_blood_pressure = mean(Resting_Blood_Pressure, na.rm = TRUE),
                   median_resting_blood_pressure = median(Resting_Blood_Pressure, na.rm = TRUE),
                   mean_max_heart_rate = mean(Maximum_Heart_Rate, na.rm = TRUE),
                   median_max_heart_rate = median(Maximum_Heart_Rate, na.rm = TRUE),
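                   # Class counts; the recoding cell above must be run first, otherwise the values 2 to 4 are left out of the "1" count
                   number_heart_disease_0 = sum(Heart_Disease == 0),
                   number_heart_disease_1 = sum(Heart_Disease == 1),
                   # Counts NA cells (not rows); "?" placeholders that were read as text are not counted
                   number_rows_missing_data = sum(is.na(heart_data)))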


summary_table


mean_age,median_age,mean_resting_blood_pressure,median_resting_blood_pressure,mean_max_heart_rate,median_max_heart_rate,number_heart_disease_0,number_heart_disease_1,number_rows_missing_data
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>
54.43894,56,131.6898,130,149.6073,153,164,55,0
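Because the question is whether the recorded variables differ between patients with and without heart disease, it may also help to summarize them per class instead of over the whole table. The sketch below uses only the six rows visible in the preview above, so its numbers are not dataset-wide results; `preview` and `by_class` are names introduced here for illustration.

```r
library(dplyr)
library(tibble)

# The six rows shown in the data preview (three columns only).
preview <- tibble(
  Age = c(63, 67, 67, 57, 57, 38),
  Maximum_Heart_Rate = c(150, 108, 129, 115, 174, 173),
  Heart_Disease = c("0", "1", "1", "1", "1", "0")
)

# One summary row per outcome class.
by_class <- preview %>%
  group_by(Heart_Disease) %>%
  summarize(n = n(),
            mean_age = mean(Age),
            mean_max_heart_rate = mean(Maximum_Heart_Rate))

by_class
```

Replacing `preview` with `heart_data` (after the recoding cell has run) would give the per-class counts and means for the full dataset.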


In [59]:
# split the data into training and testing sets; the exploratory plot should use only the training set

heart_split <- initial_split(heart_data, prop = 0.75, strata = Heart_Disease)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)

heart_train
heart_test

#use box plots
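
The `#use box plots` note above could be filled in along the following lines. This is a minimal sketch, not the final figure: it plots only the six preview rows shown earlier so that it runs on its own, and the choice of `Maximum_Heart_Rate` as the variable to compare is an assumption. The names `preview` and `heart_rate_boxplot` are introduced here for illustration.

```r
library(ggplot2)
library(tibble)

# The six rows shown in the data preview (two columns only).
preview <- tibble(
  Maximum_Heart_Rate = c(150, 108, 129, 115, 174, 173),
  Heart_Disease = c("0", "1", "1", "1", "1", "0")
)

# One box per outcome class, comparing the distribution of maximum heart rate.
heart_rate_boxplot <- ggplot(preview,
                             aes(x = Heart_Disease, y = Maximum_Heart_Rate)) +
  geom_boxplot() +
  labs(x = "Heart disease (0 = negative, 1 = positive)",
       y = "Maximum heart rate") +
  theme(text = element_text(size = 14))

heart_rate_boxplot
```

In the notebook itself, `preview` would be replaced by `heart_train` so that the plot reflects the training set only.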

