# Exploring heart disease data obtained by the Cleveland Clinic

## Introduction
- Provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your proposal
- Clearly state the question you will try to answer with your project
- Identify and describe the dataset that will be used to answer the question

## Preliminary exploratory data analysis

First, we will load the tidyverse package to be able to perform data analysis and visualization.

In [1]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


Next, we read in the data. Although the original source provides the dataset as a .data file, the file contains comma-separated values, so we use read_csv. Because the file has no header row, we set col_names = FALSE and then rename every column using the variable descriptions available at [https://archive.ics.uci.edu/dataset/45/heart+disease].

In [2]:
data <- read_csv("processed.cleveland.data", col_names = FALSE) |>
    rename(age = X1, sex = X2, chest_pain = X3, resting_blood_pressure = X4,
           cholesterol = X5, fasting_blood_sugar = X6, # 1 if above 120 mg/dl, 0 otherwise
           resting_ecg = X7, max_heart_rate = X8, exercise_angina = X9,
           st_depression = X10, slope = X11, major_vessels = X12, thal = X13,
           diagnosis = X14)

Rows: 303 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): X12, X13
dbl (12): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X14

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Because we do not need every variable in the dataset, we select only those of interest: age, sex, resting blood pressure, cholesterol, fasting blood sugar, and diagnosis. The authors of the dataset indicate that diagnosis values 1 through 4 all denote a positive diagnosis, so we collapse them into a single level, 1, which makes diagnosis a binary variable.

In [6]:
data_selected <- select(data, age, sex, resting_blood_pressure, cholesterol, fasting_blood_sugar, diagnosis) |>
    mutate(diagnosis = as_factor(diagnosis)) |>
    mutate(diagnosis = fct_recode(diagnosis, "1" = "2", "1" = "3", "1" = "4"))

glimpse(data_selected)

Rows: 303
Columns: 6
$ age                    <dbl> 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56,…
$ sex                    <dbl> 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1,…
$ resting_blood_pressure <dbl> 145, 160, 120, 130, 130, 120, 140, 120, 130, 14…
$ cholesterol            <dbl> 233, 286, 229, 250, 204, 236, 268, 354, 254, 20…
$ fasting_blood_sugar    <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,…
$ diagnosis              <fct> 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0,…
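The collapsing step can also be written as a single comparison, since every value above 0 counts as a positive diagnosis. The sketch below is an alternative to the fct_recode call, not the notebook's method, and it runs on a made-up toy tibble rather than the real data.

```r
library(dplyr)

# Made-up diagnosis codes covering every value from 0 to 4.
toy <- tibble(diagnosis = c(0, 2, 1, 0, 3, 4, 0, 1))

toy_binary <- toy |>
    # TRUE/FALSE becomes 1/0, then a factor with both levels listed explicitly.
    mutate(diagnosis = factor(as.integer(diagnosis > 0), levels = c(0, 1)))

# Tally the observations in each class.
count(toy_binary, diagnosis)
```

Listing the levels explicitly keeps both classes in the factor even if a subset of the data happens to contain only one of them.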


Since the data is already tidy, our next step is to explore it. To make the visualizations easier to read, we convert the variables that represent categories but are stored as numbers (sex and fasting blood sugar) to factors with descriptive labels. The meanings of the numeric codes were provided by the authors of the dataset and are available at [https://archive.ics.uci.edu/dataset/45/heart+disease].

In [8]:
data_converted <- data_selected |>
    mutate(sex = as_factor(sex)) |>
    mutate(sex = fct_recode(sex, "Male" = "1", "Female" = "0"))

data_converted <- data_converted |>
    mutate(fasting_blood_sugar = as_factor(fasting_blood_sugar)) |>
    mutate(fasting_blood_sugar = fct_recode(fasting_blood_sugar,
                            "> 120 mg/dl" = "1",
                            "< 120 mg/dl" = "0"))

glimpse(data_converted)

Rows: 303
Columns: 6
$ age                    <dbl> 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56,…
$ sex                    <fct> Male, Male, Male, Male, Female, Male, Female, F…
$ resting_blood_pressure <dbl> 145, 160, 120, 130, 130, 120, 140, 120, 130, 14…
$ cholesterol            <dbl> 233, 286, 229, 250, 204, 236, 268, 354, 254, 20…
$ fasting_blood_sugar    <fct> > 120 mg/dl, < 120 mg/dl, < 120 mg/dl, < 120 mg…
$ diagnosis              <fct> 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0,…


Here we see that the categorical values were successfully recoded. Next, we summarize the data: the number of observations for each diagnosis, the number of males and females, the means of our numerical predictor variables (age, resting blood pressure, and cholesterol), and the number of rows with missing values.

In [12]:
number_of_observations <- data_selected |>
    group_by(diagnosis) |>
    summarize(count = n())

genders <- data_converted |>
    group_by(sex) |>
    summarize(count = n())

mean_values <- data_selected |>
    select(age, resting_blood_pressure, cholesterol) |>
    summarize(mean_age = mean(age), mean_blood_pressure = mean(resting_blood_pressure), mean_cholesterol = mean(cholesterol))

missing_data <- data_selected |>
    filter(age == -9.0 | sex == -9.0 | resting_blood_pressure == -9.0 |
           cholesterol == -9.0 | fasting_blood_sugar == -9.0 | diagnosis == -9.0) |>
    summarize(number_of_missing_values = n())

number_of_observations
genders
mean_values
missing_data

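| diagnosis | count |
|-----------|-------|
| `<fct>`   | `<int>` |
| 0         | 164   |
| 1         | 139   |


| sex     | count |
|---------|-------|
| `<fct>` | `<int>` |
| Female  | 97    |
| Male    | 206   |


| mean_age | mean_blood_pressure | mean_cholesterol |
|----------|---------------------|------------------|
| `<dbl>`  | `<dbl>`             | `<dbl>`          |
| 54.43894 | 131.6898            | 246.6931         |


| number_of_missing_values |
|--------------------------|
| `<int>`                  |
| 0                        |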
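The missing-value filter above looks for the code -9. If missing entries were instead read in as NA (an assumption, for example after declaring a marker through read_csv's na argument), the check could be written with is.na. The sketch below uses a made-up toy tibble and also shows count() as a shorthand for group_by() followed by summarize(n()).

```r
library(dplyr)

# Made-up rows with two NA entries, for illustration only.
toy <- tibble(age = c(63, 67, NA, 41),
              cholesterol = c(233, 286, 229, NA),
              diagnosis = factor(c(0, 1, 1, 0)))

# Shorthand for group_by(diagnosis) |> summarize(n = n()).
count(toy, diagnosis)

# Number of NA entries in each column.
na_per_column <- toy |>
    summarize(across(everything(), ~ sum(is.na(.x))))

# Number of rows with at least one NA in any column.
rows_with_na <- toy |>
    filter(if_any(everything(), is.na)) |>
    nrow()

na_per_column
rows_with_na
```

When NA values are present, mean() returns NA unless na.rm = TRUE is passed, so a check like this is worth running before computing the column means.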


So we see that no rows contain the missing-value code (-9) in any of the selected columns. Additionally, the classes are roughly balanced, with 139 positive diagnoses (diagnosis = 1) and 164 negative diagnoses (diagnosis = 0).

Next, we visualize the data to look for patterns.

Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do
(this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions
of each of the predictor variables you plan to use in your analysis.
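No training/testing split appears earlier in this notebook, so the sketch below first holds out part of the data and then plots only the retained rows. The 75% proportion, the seed, and the toy tibble are all assumptions chosen for illustration; in the notebook, data_converted would take the place of toy.

```r
library(dplyr)
library(ggplot2)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Made-up rows with columns named like those in data_converted.
toy <- tibble(
    age = c(45, 52, 60, 38, 71, 66, 49, 58),
    diagnosis = factor(c(0, 1, 1, 0, 1, 0, 0, 1))
)

# Keep 75% of the rows as training data (the proportion is an assumption).
toy_train <- slice_sample(toy, prop = 0.75)

# Overlaid histograms compare the age distribution of the two diagnosis classes.
age_plot <- ggplot(toy_train, aes(x = age, fill = diagnosis)) +
    geom_histogram(binwidth = 5, position = "identity", alpha = 0.5) +
    labs(x = "Age (years)", y = "Number of patients", fill = "Diagnosis")

age_plot
```

The same pattern could be repeated for resting_blood_pressure and cholesterol to compare each numerical predictor across the two classes.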