First, we load the tidyverse package, which provides the functions we need for data analysis and visualization.

In [13]:
library(tidyverse)

Next, we read in the data from the file. Although the original source provides the dataset as a .data file, the file contains comma-separated values, so we use read_csv. Since the file has no header row, we name every column using the variable descriptions available at [https://archive.ics.uci.edu/dataset/45/heart+disease].
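To illustrate the pattern on a small scale, the sketch below reads three made-up rows of headerless comma-separated text and renames the default `X1`, `X2`, `X3` columns. The inline rows, the three column names chosen, and the `na = "?"` argument are assumptions for illustration only; they are not part of the analysis in this notebook.

```r
library(tidyverse)

# Made-up, headerless comma-separated rows (not taken from the real file).
toy_text <- I("63,1,6\n67,1,?\n41,0,3\n")

toy <- read_csv(toy_text,
                col_names = FALSE,       # no header row, so columns become X1, X2, X3
                na = "?",                # assumption: "?" marks a missing value
                show_col_types = FALSE) |>
    rename(age = X1, sex = X2, thal = X3)

toy
```

With `na = "?"`, the placeholder is read as `NA`, so the third column can be parsed as a number instead of text.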

In [15]:
data <- read_csv("processed.cleveland.data", col_names = FALSE) |>
    rename(age = X1, sex = X2, chest_pain = X3, resting_blood_pressure = X4,
           cholesterol = X5, fasting_blood_sugar = X6, # 1 if above 120 mg/dl, 0 otherwise
           resting_ecg = X7, max_heart_rate = X8, exercise_angina = X9,
           st_depression = X10, slope = X11, major_vessels = X12, thal = X13,
           diagnosis = X14)

Rows: 303 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): X12, X13
dbl (12): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X14

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.




Since we do not need every variable in the data set, we select only those we are interested in.

In [17]:
# Keep only age, sex, chest pain, resting blood pressure, cholesterol, fasting blood sugar, and diagnosis

data_selected <- select(data, age, sex, chest_pain, resting_blood_pressure, cholesterol, fasting_blood_sugar, diagnosis)

glimpse(data_selected)

Rows: 303
Columns: 7
$ age                    <dbl> 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56,…
$ sex                    <dbl> 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1,…
$ chest_pain             <dbl> 1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3,…
$ resting_blood_pressure <dbl> 145, 160, 120, 130, 130, 120, 140, 120, 130, 14…
$ cholesterol            <dbl> 233, 286, 229, 250, 204, 236, 268, 354, 254, 20…
$ fasting_blood_sugar    <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,…
$ diagnosis              <dbl> 0, 2, 1, 0, 0, 0, 3, 0, 2, 1, 0, 0, 2, 0, 0, 0,…


Since the data is already tidy, our next step is to explore it. To make the visualizations easier to read, the variables that are categorical but stored as numbers (sex, chest pain, and fasting blood sugar) are converted to factors. The meaning of each numerical code is given by the authors of the data set at [https://archive.ics.uci.edu/dataset/45/heart+disease].

In [12]:
data_converted <- data_selected |>
    mutate(sex = as_factor(sex)) |>
    mutate(sex = fct_recode(sex, "Male" = "1", "Female" = "0"))

data_converted <- data_converted |>
    mutate(chest_pain = as_factor(chest_pain)) |>
    mutate(chest_pain = fct_recode(chest_pain,
                            "Typical angina" = "1",
                            "Atypical angina" = "2",
                            "Non-anginal pain" = "3",
                            "Asymptomatic" = "4"))

data_converted <- data_converted |>
    mutate(fasting_blood_sugar = as_factor(fasting_blood_sugar)) |>
    mutate(fasting_blood_sugar = fct_recode(fasting_blood_sugar,
                            "> 120 mg/dl" = "1",
                            "< 120 mg/dl" = "0"))


glimpse(data_converted)

Rows: 303
Columns: 7
$ age                    <dbl> 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56,…
$ sex                    <fct> Male, Male, Male, Male, Female, Male, Female, F…
$ chest_pain             <fct> Typical angina, Asymptomatic, Asymptomatic, Non…
$ resting_blood_pressure <dbl> 145, 160, 120, 130, 130, 120, 140, 120, 130, 14…
$ cholesterol            <dbl> 233, 286, 229, 250, 204, 236, 268, 354, 254, 20…
$ fasting_blood_sugar    <fct> > 120 mg/dl, < 120 mg/dl, < 120 mg/dl, < 120 mg…
$ diagnosis              <dbl> 0, 2, 1, 0, 0, 0, 3, 0, 2, 1, 0, 0, 2, 0, 0, 0,…
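
A quick way to confirm that a recoding like this behaved as intended is to count the rows per level. The sketch below does so on a made-up five-row tibble rather than on `data_converted`; the toy values are illustrative assumptions.

```r
library(tidyverse)

# Made-up values standing in for the numerically coded columns.
toy <- tibble(sex = c(1, 1, 0, 1, 0),
              fasting_blood_sugar = c(1, 0, 0, 0, 1))

toy_converted <- toy |>
    mutate(sex = fct_recode(as_factor(sex), "Male" = "1", "Female" = "0"),
           fasting_blood_sugar = fct_recode(as_factor(fasting_blood_sugar),
                                            "> 120 mg/dl" = "1",
                                            "< 120 mg/dl" = "0"))

# One row per level with its count; an unexpected code would appear as an extra level.
count(toy_converted, sex)
```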


The output shows that the categorical values were successfully relabelled. Next, as part of our exploratory data analysis, we will summarize the data in a table that reports the number of observations in each diagnosis class, the means of the predictor variables we plan to use in the analysis, and the number of rows with missing data.
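
A summary table of this kind could be sketched as follows. The tibble is made up for illustration, and the column names `mean_age`, `mean_cholesterol`, and `rows_with_missing` are hypothetical choices, not results from this analysis.

```r
library(tidyverse)

# Made-up rows; NA marks a missing measurement.
toy <- tibble(age = c(63, 67, 41, 56, 62),
              cholesterol = c(233, NA, 204, 236, 268),
              diagnosis = c(0, 2, 0, 0, 2))

summary_table <- toy |>
    group_by(diagnosis) |>
    summarize(n = n(),                                            # observations per class
              mean_age = mean(age, na.rm = TRUE),                 # predictor means
              mean_cholesterol = mean(cholesterol, na.rm = TRUE),
              rows_with_missing = sum(is.na(age) | is.na(cholesterol)))

summary_table
```

Passing `na.rm = TRUE` keeps a single missing value from turning a whole class mean into `NA`, while `rows_with_missing` still reports how many rows were affected.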

We then explore the data further by visualizing it to discern any patterns. Using only the training data, we will produce at least one plot relevant to the planned analysis, for example one that compares the distributions of the predictor variables we plan to use.
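
One possible shape for such a plot is sketched below. The data is randomly generated, the 75/25 split made with `slice_sample` is an assumed stand-in for whatever training split the analysis ends up using, and the `id` column is a hypothetical helper added only to separate the two subsets.

```r
library(tidyverse)

set.seed(1)  # make the random toy data and split reproducible

# Made-up data standing in for the converted data frame.
toy <- tibble(id = 1:40,
              age = round(rnorm(40, mean = 55, sd = 9)),
              cholesterol = round(rnorm(40, mean = 245, sd = 50)),
              diagnosis = factor(sample(0:1, 40, replace = TRUE)))

# Assumed 75/25 split; only the training rows are plotted.
toy_train <- slice_sample(toy, prop = 0.75)
toy_test <- anti_join(toy, toy_train, by = "id")

# One histogram panel per predictor, coloured by diagnosis class.
predictor_plot <- toy_train |>
    pivot_longer(c(age, cholesterol), names_to = "predictor", values_to = "value") |>
    ggplot(aes(x = value, fill = diagnosis)) +
    geom_histogram(bins = 10, alpha = 0.6, position = "identity") +
    facet_wrap(vars(predictor), scales = "free_x") +
    labs(x = "Value", y = "Count", fill = "Diagnosis")

predictor_plot
```

Reshaping to long format with `pivot_longer` lets a single `facet_wrap` call draw every predictor, so adding another predictor only means adding its name to the column selection.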