In [1]:
library(tidyverse)   # dplyr, ggplot2 and related packages for wrangling and plotting
library(data.table)  # fread() for fast file reading

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘data.table’


The following objects are masked from ‘package:lubridate’:

    hour, isoweek, mday, minute, month, quarter, second, wday, week,
    y

# 1. Data Description

The Cleveland Heart Disease dataset from 1988 was donated to the UCI Machine Learning Repository and contains 303 observations of patients from Cleveland, Ohio, USA. Some patients in the dataset have heart disease while some others do not. The dataset has 14 total variables. The description of each is as follows:

- `age`: The age in $\text{years}$.
- `sex`: The sex.
    - 0: female
    - 1: male
- `cp`: Type of chest pain.
    - 1: Typical angina
    - 2: Atypical angina
    - 3: Non-anginal pain
    - 4: Asymptomatic
- `trestbps`: Resting blood pressure in $\text{mm Hg}$.
- `chol`: Serum cholesterol in $\text{mg/dl}$.
- `fbs`: Is fasting blood sugar > 120 mg/dl?
    - 0: False
    - 1: True
- `restecg`: Resting electrocardiographic results
    - 0: Normal
    - 1: Having ST-T wave abnormality
    - 2: Showing probable or definite left ventricular hypertrophy by Estes' criteria
- `thalach`: Maximum heart rate achieved.
- `exang`: Exercise induced angina.
    - 0: No
    - 1: Yes
- `oldpeak`: ST depression induced by exercise relative to rest.
- `slope`: The slope of the peak exercise ST segment
    - 1: Upsloping
    - 2: Flat
    - 3: Downsloping
- `ca`: Number of major vessels (0-3) coloured by fluoroscopy. (Categorical)
- `thal`: Thalassemia.
    - 3: Normal
    - 6: Fixed defect
    - 7: Reversible defect
- `num`: Diagnosis of heart disease (angiographic disease status).
    - 0: Heart disease is absent
    - 1-4: Heart disease is present

# 2. Question

Because heart disease is a serious condition, it is important to be able to identify it when it is present in a patient. Our question therefore focuses on prediction: Which attributes are most effective in identifying heart disease, and how accurately can we identify heart disease using them? <br>
Since we are trying to predict heart disease, our response variable will be `num`. However, `num` is a categorical variable that ranges from 0 to 4, and we are not interested in predicting the type of heart disease, so we will recode its values as either `Yes` or `No`. The input variables will be some subset of the remaining 13 attributes in the dataset. Since we don't yet know which ones are effective and which may be redundant, the number of input variables is currently unknown.

# 3. Exploratory Data Analysis and Visualization

First we load the dataset.

In [2]:
col_names = c("age", "sex", "cp", "testbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

heart_data <- fread("data/processed.cleveland.data", col.names = col_names)
nrow(heart_data)
head(heart_data)

age,sex,cp,testbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<int>
63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2
67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0,1
37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0,0
41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0,0
56,1,2,120,236,0,0,178,0,0.8,1,0.0,3.0,0
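
The `<chr>` type reported for `ca` and `thal` hints that these columns contain at least one non-numeric entry. Below is a minimal sketch of how one might check for this, assuming that missing values are encoded as `?` (the output above does not confirm this). The `raw_text` string and the `toy_*` tables are made-up stand-ins for the real file and `heart_data`.

```r
library(data.table)

# Toy stand-in for the raw file, with "?" marking a missing value
# (an assumption about the encoding; the values are made up).
raw_text <- "63,0.0,6.0\n67,3.0,3.0\n52,?,7.0\n48,1.0,?"
toy_cols <- c("age", "ca", "thal")

# Without na.strings, the "?" entries force ca and thal to be read as character.
toy_chr <- fread(text = raw_text, header = FALSE, col.names = toy_cols)
sapply(toy_chr, class)

# Declaring "?" as NA lets fread parse the same columns as numeric.
toy_na <- fread(text = raw_text, header = FALSE, col.names = toy_cols,
                na.strings = "?")
sapply(toy_na, class)

# Count missing values per column, then keep only the complete rows.
colSums(is.na(toy_na))
toy_complete <- na.omit(toy_na)
nrow(toy_complete)
```

If the real file does use such a placeholder, handling it before calling `as.factor()` would keep it from becoming a factor level of its own.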


From part 1, we can see that there are a lot of categorical variables in this dataset. However, our current dataset stores many of these categorical columns as numeric, which would lead a model to treat their codes as quantities. To fix this, we can convert these columns to factors using `as.factor()`. Since `num` is our response variable and we need it to be binary, we should also change the values in `num` to `No` for a value of 0 and `Yes` for values 1-4.

In [3]:
heart_data <- heart_data %>%
    # Convert the categorical predictors from numeric/character codes to factors
    mutate(across(c(sex, cp, fbs, restecg, exang, slope, ca, thal), as.factor)) %>%
    # Collapse the response to two classes: 0 -> "No", 1-4 -> "Yes"
    mutate(num = if_else(num == 0, "No", "Yes"))

head(heart_data)

age,sex,cp,testbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
<dbl>,<fct>,<fct>,<dbl>,<dbl>,<fct>,<fct>,<dbl>,<fct>,<dbl>,<fct>,<fct>,<fct>,<chr>
63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,No
67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,Yes
67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0,Yes
37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0,No
41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0,No
56,1,2,120,236,0,0,178,0,0.8,1,0.0,3.0,No


Multicollinearity will be an issue that we have to look out for when fitting this model. To help identify covariates that may be collinear, I will visualize the pairwise relationships between all covariates in the dataset using `ggpairs` (from the `GGally` package). This visualization can help me decide whether to drop any covariates before fitting the model due to collinearity, which can make the resulting model more reliable.
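
The pairwise check described above could look roughly like the following sketch. `GGally` is not among the packages loaded earlier, so its availability is an assumption and the plot is drawn only if the package is installed. The simulated `toy` data frame (with made-up values) stands in for `heart_data`, and the 0.7 cut-off is an arbitrary illustration, not a rule from this analysis.

```r
set.seed(1)

# Simulated stand-in for a few numeric covariates; the values are made up.
n <- 100
toy <- data.frame(age = rnorm(n, mean = 54, sd = 9))
toy$thalach <- 200 - 0.9 * toy$age + rnorm(n, sd = 5)  # built to correlate with age
toy$chol    <- rnorm(n, mean = 245, sd = 50)           # independent of the others

# Pairwise Pearson correlations between the numeric covariates.
cor_matrix <- cor(toy)
round(cor_matrix, 2)

# Flag each pair (counted once) whose absolute correlation exceeds 0.7.
high <- which(abs(cor_matrix) > 0.7 & upper.tri(cor_matrix), arr.ind = TRUE)
high

# Draw the scatterplot matrix only if GGally is installed.
if (requireNamespace("GGally", quietly = TRUE)) {
    pairs_plot <- GGally::ggpairs(toy)
    print(pairs_plot)
}
```

Pearson correlation applies only to the numeric covariates. For the factor columns, `ggpairs` shows bar and box plots instead, so their association with other covariates has to be judged visually.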

# 4. Methods and Plan

To find a good subset of covariates to use in the model, I plan to use forward selection. This method is appropriate because it builds a candidate subset of each size by adding one covariate at a time, which matches the task of choosing which covariates to include. Forward selection itself does not add assumptions beyond those of the model being fitted. A weakness is that, because it is a greedy algorithm, the subsets it finds may not be optimal. However, it is much faster than brute-forcing every possible combination, which makes this an acceptable trade-off. To evaluate which subset of covariates performs best, I will use BIC as my criterion and choose the subset with the lowest BIC. I chose BIC because it penalizes more complex models, which helps guard against overfitting.
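
A minimal sketch of this plan, assuming a logistic regression fitted with `glm()` and base R's `step()` for the forward search. The plan above does not name a specific function, so both choices are assumptions. Setting `k = log(n)` makes `step()` apply the BIC penalty instead of its default AIC penalty. The simulated `toy` data and the covariates `x1`, `x2`, `x3` are hypothetical stand-ins for `heart_data` and its columns.

```r
set.seed(42)

# Simulated stand-in data: x1 and x2 drive the outcome, x3 is pure noise.
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
prob <- plogis(-0.5 + 1.5 * toy$x1 - 1.2 * toy$x2)
toy$num <- factor(ifelse(rbinom(n, 1, prob) == 1, "Yes", "No"),
                  levels = c("No", "Yes"))

# Start from the intercept-only model.
null_model <- glm(num ~ 1, data = toy, family = binomial)

# Add one covariate at a time; stop when no addition lowers the criterion.
# k = log(n) turns the AIC-style penalty used by step() into the BIC penalty.
forward_fit <- step(
    null_model,
    scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
    direction = "forward",
    k = log(n),
    trace = 0
)

# Covariates kept by the search and the BIC of the selected model.
formula(forward_fit)
BIC(forward_fit)
```

Because `step()` stops as soon as no single addition improves the criterion, it returns one model along the greedy path rather than a comparison across every subset size.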