# Project Proposal: Group 59
### Jin Kim, Emily Ishii, Natasha Larin, Syed Haque

### Introduction

For our DSCI 100 Project, our group will be using the data found in the "processed.cleveland.data" file in the Heart Disease dataset directory. This dataset is composed of 14 fields:

* Age of patient (age) - years
* Sex of the patient (sex)
    * 1 = Male
    * 0 = Female
* Chest pain type (cp)
    * 1 = Typical Angina
    * 2 = Atypical Angina
    * 3 = Non-anginal Pain
    * 4 = Asymptomatic
* Resting blood pressure (trestbps) - mm Hg
* Serum cholesterol (chol) - mg/dl
* Fasting blood sugar > 120 mg/dl (fbs) - true or false
* Resting ECG results (restecg)
    * 0 = Normal
    * 1 = ST-T wave abnormality
    * 2 = Probable or definite left ventricular hypertrophy
* Maximum heart rate achieved (thalach) - bpm
* Exercise-induced angina (exang)
    * 0 = No
    * 1 = Yes
* ST depression induced by exercise relative to rest (oldpeak)
* Slope of the peak exercise ST segment (slope)
* Number of major vessels (0-3) colored by fluoroscopy (ca)
* Condition (thal)
    * 3 = Normal
    * 6 = Fixed defect
    * 7 = Reversible defect
* Diagnosis of heart disease (num)
    * 0 = No heart disease present
    * 1 to 4 = Heart disease present
    
These fields come from a patient's health report, so they can be used to predict the presence or severity of heart disease in a patient, which is what the variable **num** represents.

The question that we will be exploring is: **Which factors are most influential in predicting heart disease?**

### Preliminary Exploratory Data Analysis

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0

**READING IN AND TIDYING THE DATA**

In [None]:
# Column names, in the order the fields appear in the data file
names <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
patient_data <- read_csv("data/processed.cleveland.data", col_names = names) |>
    # Convert the categorical fields to factors and fbs to logical
    mutate(across(c(num, cp, sex, exang, slope, restecg, ca, thal), as_factor),
           fbs = as.logical(fbs)) |>
    # Keep only the continuous predictors (type double) plus the class label num
    select(age, trestbps, chol, thalach, oldpeak, num)
patient_data

Rows: 303 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): ca, thal
dbl (12): age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpea...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


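| age | trestbps | chol | thalach | oldpeak | num |
|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 63 | 145 | 233 | 150 | 2.3 | 0 |
| 67 | 160 | 286 | 108 | 1.5 | 2 |
| 67 | 120 | 229 | 129 | 2.6 | 1 |
| 37 | 130 | 250 | 187 | 3.5 | 0 |
| 41 | 130 | 204 | 172 | 1.4 | 0 |
| 56 | 120 | 236 | 178 | 0.8 | 0 |
| 62 | 140 | 268 | 160 | 3.6 | 3 |
| 57 | 120 | 354 | 163 | 0.6 | 0 |
| 63 | 130 | 254 | 147 | 1.4 | 2 |
| 53 | 140 | 203 | 155 | 3.1 | 1 |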
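
Because **num** is the class we want to predict, it may help to check how many observations fall into each class. The sketch below uses only the ten **num** values visible in the preview above, and it assumes that 0 means no disease and that any value from 1 to 4 means disease is present. The column name `disease` is a hypothetical label introduced for illustration:

```r
library(tidyverse)

# The ten num values visible in the preview (not the full 303 rows)
num_preview <- tibble(num = factor(c(0, 2, 1, 0, 0, 0, 3, 0, 2, 1)))

# Assumption: 0 = disease absent, 1 to 4 = disease present
class_counts <- num_preview |>
    mutate(disease = if_else(num == "0", "absent", "present")) |>
    count(disease)
class_counts
```

Running the same pipeline on `patient_data` would give the counts for the full dataset.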


**SUMMARIZING THE DATA**

In [12]:
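# Mean of each continuous predictor (the predictor names are dropped here
# because data_max already carries them)
data_mean <- patient_data |>
    select(-num) |>
    map_df(mean) |>
    pivot_longer(cols = age:oldpeak,
               names_to = "predictor",
               values_to = "mean") |>
    select(-predictor)
# Maximum of each predictor; this table keeps the predictor column
data_max <- patient_data |>
    select(-num) |>
    map_df(max) |>
    pivot_longer(cols = age:oldpeak,
               names_to = "predictor",
               values_to = "max")
# Minimum of each predictor
data_min <- patient_data |>
    select(-num) |>
    map_df(min) |>
    pivot_longer(cols = age:oldpeak,
               names_to = "predictor",
               values_to = "min") |>
    select(-predictor)
# All three tables list the predictors in the same order, so they can be bound by column
data_summary <- bind_cols(data_max, data_min, data_mean)
data_summary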

| predictor | max | min | mean |
|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` |
| age | 77.0 | 29 | 54.438944 |
| trestbps | 200.0 | 94 | 131.689769 |
| chol | 564.0 | 126 | 246.693069 |
| thalach | 202.0 | 71 | 149.607261 |
| oldpeak | 6.2 | 0 | 1.039604 |
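
The same kind of summary can also be produced in a single pipeline, which avoids relying on three separate tables having matching row order. This is only a sketch of an alternative, not the code that produced the table above. It runs on a hypothetical stand-in built from the first five rows of the data preview, so its numbers differ from the full-data summary:

```r
library(tidyverse)

# Stand-in data: the first five rows of the preview shown earlier
sample_data <- tibble(
    age      = c(63, 67, 67, 37, 41),
    trestbps = c(145, 160, 120, 130, 130),
    chol     = c(233, 286, 229, 250, 204),
    thalach  = c(150, 108, 129, 187, 172),
    oldpeak  = c(2.3, 1.5, 2.6, 3.5, 1.4),
    num      = factor(c(0, 2, 1, 0, 0))
)

# Reshape to one row per (predictor, value) pair, then summarize per predictor
sample_summary <- sample_data |>
    select(-num) |>
    pivot_longer(cols = everything(),
                 names_to = "predictor",
                 values_to = "value") |>
    group_by(predictor) |>
    summarize(max = max(value), min = min(value), mean = mean(value))
sample_summary
```

Note that `group_by` sorts the predictors alphabetically, so the rows appear in a different order than in the table above.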
