**Title: Heart Disease Predictor**

This dataset contains the records of 303 individuals in Cleveland. The diagnosis of heart disease is based on the variable called `num`. If the value of `num` is 0, the individual is not diagnosed with heart disease; if the value is 1 or larger, the individual is typically considered to have heart disease. The data consists of 14 variables, several of which are modifiable risk factors. This means they are related to lifestyle choices such as diet or physical activity rather than to genes. This project aims to build on my existing knowledge and deepen my understanding of heart disease by finding the correlations between those variables and heart disease, and then using them to build a heart disease prediction model.

**Question**

Is it possible to develop a predictive model to identify whether an individual has heart disease?

**Dataset**

* Datatype: numerical and categorical data
* Variables used:
  - `age` : age in years
  - `sex` : (1 = male; 0 = female)
  - `cp` : chest pain type
      - Value 1: typical angina
      - Value 2: atypical angina
      - Value 3: non-anginal pain
      - Value 4: asymptomatic
  - `trestbps` : resting blood pressure (in mm Hg on admission to the hospital)
  - `chol` : serum cholesterol in mg/dl
  - `exang` : exercise-induced angina (1 = yes; 0 = no)
  - `oldpeak` : ST depression induced by exercise relative to rest
  - `ca` : number of major vessels (0-3) colored by fluoroscopy
  - `num` : diagnosis of heart disease (angiographic disease status)
    - Value 0: < 50% diameter narrowing
    - Value 1 or higher: > 50% diameter narrowing
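
The rule described above (0 means no heart disease, 1 or larger means heart disease) can be turned into a two-level outcome column. The following is a minimal sketch, not the notebook's own method: the column name `diagnosis` and its labels are assumptions chosen for illustration, and the six `num` values are the ones shown by `head()` further down in this notebook.

```r
library(dplyr)

# `age` and `num` for the first six records shown later in this notebook
sample_rows <- tibble(
  age = c(63, 67, 67, 37, 41, 56),
  num = c(0L, 2L, 1L, 0L, 0L, 0L)
)

# Collapse num into a two-level factor: 0 -> "no disease", 1 or larger -> "disease".
# `diagnosis` is a hypothetical column name, not part of the original dataset.
labelled <- sample_rows |>
  mutate(
    diagnosis = factor(
      if_else(num >= 1, "disease", "no disease"),
      levels = c("no disease", "disease")
    )
  )

labelled
```

Storing the outcome as a factor with an explicit level order keeps the "no disease" class as the reference level, which most classification functions in R expect.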

In [1]:
library(tidyverse)

“package ‘ggplot2’ was built under R version 4.3.2”
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [2]:
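# The raw file has no header row, so supply the 14 column names explicitly
col_names <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
               "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

# Read the comma-separated Cleveland file into a data frame
data_cleveland <- read.table("data/heart_disease/processed.cleveland.data", sep = ",", col.names = col_names)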

In [3]:
data_cleveland |> head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<int>
1,63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
2,67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2
3,67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0,1
4,37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0,0
5,41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0,0
6,56,1,2,120,236,0,0,178,0,0.8,1,0.0,3.0,0
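
The `<chr>` type reported for `ca` and `thal` suggests that at least one entry in those columns is not a number. The sketch below assumes that missing values are coded as `?`; this is an assumption that should be checked first, for example with `unique(data_cleveland$ca)`. The small vector is illustrative only, and its `?` entry is made up to show the conversion.

```r
library(dplyr)

# Illustrative values only: three `ca` entries in the style shown above,
# plus a made-up "?" standing in for an assumed missing-value marker
ca_example <- tibble(ca = c("0.0", "3.0", "2.0", "?"))

# Turn the assumed marker into NA first, then convert the column to numeric
ca_clean <- ca_example |>
  mutate(ca = as.numeric(na_if(ca, "?")))

ca_clean
```

Replacing the marker before calling `as.numeric()` avoids the coercion warning that R would otherwise raise for the non-numeric string.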


In [4]:
data_cleveland <- data_cleveland |>
    select(-fbs, -restecg, -thalach, -slope, -thal)
data_cleveland

age,sex,cp,trestbps,chol,exang,oldpeak,ca,num
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<int>
63,1,1,145,233,0,2.3,0.0,0
67,1,4,160,286,1,1.5,3.0,2
67,1,4,120,229,1,2.6,2.0,1
37,1,3,130,250,0,3.5,0.0,0
41,0,2,130,204,0,1.4,0.0,0
56,1,2,120,236,0,0.8,0.0,0
62,0,4,140,268,0,3.6,2.0,3
57,0,4,120,354,1,0.6,0.0,0
63,1,4,130,254,0,1.4,1.0,2
53,1,4,140,203,1,3.1,0.0,1
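
As a first look at the correlations this project is after, the numeric columns can be compared against a 0/1 version of `num`. This is only a sketch under stated assumptions: it uses the ten rows printed above rather than all 303 records, so the resulting values are not meaningful estimates, and `has_disease` is a hypothetical helper column.

```r
library(dplyr)

# The ten rows printed above, restricted to the continuous columns and num
rows <- tibble(
  age      = c(63, 67, 67, 37, 41, 56, 62, 57, 63, 53),
  trestbps = c(145, 160, 120, 130, 130, 120, 140, 120, 130, 140),
  chol     = c(233, 286, 229, 250, 204, 236, 268, 354, 254, 203),
  oldpeak  = c(2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1),
  num      = c(0, 2, 1, 0, 0, 0, 3, 0, 2, 1)
)

# Pearson correlation of each predictor with a 0/1 disease indicator
# (`has_disease` is a hypothetical column: 1 when num >= 1, else 0)
correlations <- rows |>
  mutate(has_disease = as.integer(num >= 1)) |>
  summarise(across(c(age, trestbps, chol, oldpeak), ~ cor(.x, has_disease)))

correlations
```

The same pipeline can be pointed at `data_cleveland` once the full data is loaded, which is where the correlations become worth interpreting.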
