## **Chest Pain in Cleveland (Section 002, Group 31 Proposal)**

## Introduction

**Background:**
Heart disease is the leading cause of death in the United States, including in Cleveland, Ohio. Risk factors for heart disease include an unhealthy diet, high blood pressure, high cholesterol, and low physical activity, among others (https://www.cdc.gov/nchs/hus/topics/heart-disease-deaths.htm#:~:text=Heart%20disease%20has%20been%20the,excessive%20alcohol%20use%20(2)).

Chest pain, such as angina, is often associated with heart disease. Angina is caused by poor blood flow to the heart due to buildup of thick plaques on the inner walls of the arteries carrying blood to the heart (https://www.mayoclinic.org/diseases-conditions/chest-pain/symptoms-causes/syc-20370838).

Several genetic and lifestyle factors contribute to the risk of developing angina. Age is one of them: plaque buildup in the arteries tends to increase with age, so the risk of developing angina rises as well (https://www.nhlbi.nih.gov/health/angina/causes).


**Question:** What is the predicted chest pain type (typical angina, atypical angina, non-anginal pain, or asymptomatic) for a given heart disease patient, based on various medical factors (age, resting blood pressure, cholesterol, electrocardiographic results, maximum heart rate achieved, and possibly sex, fasting blood sugar, and exercise-induced angina)?

**Dataset:** Processed Cleveland heart disease dataset from https://archive.ics.uci.edu/ml/datasets/Heart+Disease.

## Preliminary Data Analysis

Here, we will show that our dataset can be read into R. We will also wrangle our data into a tidy format and visualize it. First, we will attach the necessary libraries.

In [3]:
library(tidyverse)
library(repr)
library(tidymodels)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0

Next, we will read in the data (stored in our notebook) and add column names.

In [5]:
#read data here
data <- read_csv("processed.cleveland.data", col_names=FALSE)

# rename the default X1..X14 columns to descriptive names
heart_data <- rename(data,
            age = X1,
            sex = X2,
            cp = X3,
            restbp = X4,
            chol = X5,
            fbs = X6, 
            restecg = X7,
            thalach = X8,
            exang = X9,
            oldpeak = X10,
            slope = X11,
            ca = X12,
            thal = X13,
            num = X14)
heart_data

Rows: 303 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): X12, X13
dbl (12): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X14

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


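| age `<dbl>` | sex `<dbl>` | cp `<dbl>` | restbp `<dbl>` | chol `<dbl>` | fbs `<dbl>` | restecg `<dbl>` | thalach `<dbl>` | exang `<dbl>` | oldpeak `<dbl>` | slope `<dbl>` | ca `<chr>` | thal `<chr>` | num `<dbl>` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 6.0 | 0 |
| 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 3.0 | 2 |
| 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 7.0 | 1 |
| 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | 3.0 | 0 |
| 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | 3.0 | 0 |
| 56 | 1 | 2 | 120 | 236 | 0 | 0 | 178 | 0 | 0.8 | 1 | 0.0 | 3.0 | 0 |
| 62 | 0 | 4 | 140 | 268 | 0 | 2 | 160 | 0 | 3.6 | 3 | 2.0 | 3.0 | 3 |
| 57 | 0 | 4 | 120 | 354 | 0 | 0 | 163 | 1 | 0.6 | 1 | 0.0 | 3.0 | 0 |
| 63 | 1 | 4 | 130 | 254 | 0 | 2 | 147 | 0 | 1.4 | 2 | 1.0 | 7.0 | 2 |
| 53 | 1 | 4 | 140 | 203 | 1 | 2 | 155 | 1 | 3.1 | 3 | 0.0 | 7.0 | 1 |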
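
The column specification reports `ca` and `thal` as character columns even though their values look numeric. A likely cause, assuming the file marks missing values with `?` (as we understand the UCI processed files to do), is that those markers block numeric parsing. The sketch below illustrates one way to handle this with `read_csv()`'s `na` argument, using two inline rows rather than the real file; the second row is made up for illustration, and the object names are hypothetical.

```r
library(readr)
library(tidyr)

# Two inline rows in the same 14-column layout: the first is copied from the
# data shown above; the second is a made-up row with "?" in the ca and thal positions.
sample_text <- "63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0\n52,1,3,138,223,0,0,169,0,0.0,1,?,?,0\n"

# na = "?" tells read_csv to treat the marker as missing,
# so every column can be parsed as numeric.
sample_data <- read_csv(I(sample_text), col_names = FALSE, na = "?",
                        show_col_types = FALSE)

# Keep only the rows that have no missing values.
complete_rows <- drop_na(sample_data)
complete_rows
```

Applied to the real file, the same `na = "?"` argument would go into the existing `read_csv("processed.cleveland.data", ...)` call. Since `ca` and `thal` are not among our planned predictors, dropping incomplete rows is optional for this analysis.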


Below is a description of each of the columns of data and what the labels correspond to:

1. `age` (years)
2. `sex` (1=male; 0=female)
3. `cp` is chest pain type. (1=typical angina; 2=atypical angina; 3=non-anginal pain; 4=asymptomatic)
4. `restbp` is resting blood pressure. (mm Hg)
5. `chol` is serum cholesterol. (mg/dl)
6. `fbs` is whether fasting blood sugar is greater than 120 mg/dl. (1=yes; 0=no)
7. `restecg` is resting electrocardiographic results. (0=normal; 1=ST-T wave abnormality; 2=probable or definite left ventricular hypertrophy)
8. `thalach` is maximum heart rate achieved.
9. `exang` is exercise-induced angina. (1=yes; 0=no) 
10. `oldpeak` is ST depression induced by exercise relative to rest.
11. `slope` is slope of the peak exercise ST segment. (1=upsloping; 2=flat; 3=downsloping)
12. `ca` is number of major vessels colored by fluoroscopy. (0-3)
13. `thal` is not explained on the dataset page; it is commonly interpreted as the result of a thallium stress test. (3=normal; 6=fixed defect; 7=reversible defect)
14. `num` is diagnosis of heart disease. (0=<50% diameter narrowing; 1=>50% diameter narrowing; the processed data also contains values 2-4, as in the rows above, which likewise indicate the presence of disease)

We will not be using all of the data to answer our research question, and some columns are untidy (categorical values are listed numerically, which may be unhelpful when trying to visualize and understand the data). Thus, we will tidy and wrangle the data we want in the coding cell below.

In [33]:
# select the variables used in the analysis and convert categorical codes to factors
# the selected columns all parsed as numeric above, so no rows are dropped here

filter_data <- heart_data |>
    select(cp, age, restbp, chol, restecg, thalach, sex, exang) |>
    mutate(cp = as_factor(cp),
           sex = as_factor(sex))

# mean age for each chest pain type
mean_age_by_cp <- filter_data |>
    group_by(cp) |>
    summarize(age = mean(age))
mean_age_by_cp

    

| cp `<fct>` | age `<dbl>` |
|---|---|
| 1 | 55.86957 |
| 2 | 51.36 |
| 3 | 53.69767 |
| 4 | 55.72222 |
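
Because `as_factor()` keeps the numeric codes as level names, tables and plots built from `filter_data` show 1 to 4 rather than the chest pain names. One option, sketched here on a few codes taken from the rows printed earlier, is to attach the labels from the column descriptions with base R's `factor()`. The object names `codes` and `labelled` are illustrative only.

```r
library(dplyr)

# A few cp and sex codes taken from the first rows of the data shown earlier.
codes <- tibble(cp = c(1, 4, 4, 3, 2), sex = c(1, 1, 1, 1, 0))

# Map each numeric code to the label given in the column descriptions.
labelled <- codes |>
  mutate(
    cp = factor(cp, levels = 1:4,
                labels = c("typical angina", "atypical angina",
                           "non-anginal pain", "asymptomatic")),
    sex = factor(sex, levels = c(0, 1), labels = c("female", "male"))
  )
labelled
```

The same `mutate()` call could replace the `as_factor()` step if labelled output is preferred.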


In [7]:
set.seed(2000)

# split the wrangled data so that cp is a factor for stratification and classification
heart_split <- initial_split(filter_data, prop = 0.75, strata = cp)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)

In [3]:
#using training data, summarize data into a table here
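
A possible shape for this summary is sketched below: one row per chest pain type, with the number of observations and the mean of each numeric predictor. To keep the sketch runnable on its own, it uses the ten rows printed earlier as a stand-in called `train_sample` (a hypothetical name); in the notebook, `heart_train` would take its place.

```r
library(dplyr)

# Stand-in for heart_train: the ten rows printed earlier (selected columns only).
train_sample <- tibble(
  cp      = factor(c(1, 4, 4, 3, 2, 2, 4, 4, 4, 4)),
  age     = c(63, 67, 67, 37, 41, 56, 62, 57, 63, 53),
  restbp  = c(145, 160, 120, 130, 130, 120, 140, 120, 130, 140),
  chol    = c(233, 286, 229, 250, 204, 236, 268, 354, 254, 203),
  thalach = c(150, 108, 129, 187, 172, 178, 160, 163, 147, 155)
)

# One row per chest pain type: class count plus the mean of each numeric predictor.
train_summary <- train_sample |>
  group_by(cp) |>
  summarize(
    n = n(),
    across(c(age, restbp, chol, thalach), mean, .names = "mean_{.col}")
  )
train_summary
```

The class counts are worth reporting because an imbalance between chest pain types affects how the accuracy of the classifier should be read.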

In [4]:
#using training data, visualize data here
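
One candidate visualization is a scatter plot of two numeric predictors with points coloured by chest pain type. The sketch below assumes age and maximum heart rate as the axes, which is our choice for illustration rather than a settled decision, and again uses the ten rows printed earlier as a stand-in (`train_sample`, a hypothetical name) for `heart_train`.

```r
library(ggplot2)

# Stand-in for heart_train: the ten rows printed earlier (three columns only).
train_sample <- data.frame(
  cp      = factor(c(1, 4, 4, 3, 2, 2, 4, 4, 4, 4)),
  age     = c(63, 67, 67, 37, 41, 56, 62, 57, 63, 53),
  thalach = c(150, 108, 129, 187, 172, 178, 160, 163, 147, 155)
)

# Scatter plot of maximum heart rate against age, coloured by chest pain type.
age_vs_thalach <- ggplot(train_sample, aes(x = age, y = thalach, colour = cp)) +
  geom_point() +
  labs(x = "Age (years)",
       y = "Maximum heart rate achieved",
       colour = "Chest pain type")
age_vs_thalach
```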

## Methods

**Conducting Data Analysis:**
1. Attach necessary libraries
2. Read in “processed Cleveland” data from https://archive.ics.uci.edu/ml/datasets/Heart+Disease, which is downloaded in the notebook
3. Tidy data
4. Split into training and testing set
5. Create a recipe (scaling predictors if needed), a K-nearest neighbors specification (with `tune()`, so we can find which *k* value to use), and a classifier workflow using the training set
6. Perform cross-validation and collect the accuracy metrics for a range of *k* values, then plot accuracy against *k* to find the ideal value
7. Once the ideal *k* value is found (one that yields a high accuracy estimate), use it in a new specification and refit the classifier
8. Predict the chest pain type of the test set observations with the refitted classifier (recipe plus new specification)
9. Evaluate the estimated accuracy of the classifier on the test set by comparing these predictions with the true labels

Variables used as predictors are age (`age`), resting blood pressure (`restbp`), cholesterol (`chol`), electrocardiographic results (`restecg`), maximum heart rate achieved (`thalach`), and possibly sex (`sex`), fasting blood sugar (`fbs`), and exercise-induced angina (`exang`).
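
Steps 4 to 9 can be sketched with tidymodels as follows. This is a minimal outline under stated assumptions, not our final analysis: it runs on synthetic stand-in data (`toy`, with randomly generated values) so that it works on its own, it uses only two predictors, and the grid of *k* values and the number of folds are arbitrary choices. It also assumes the `kknn` package is installed, since that is the engine behind `nearest_neighbor()`.

```r
library(tidymodels)

set.seed(2000)

# Synthetic stand-in data (NOT the Cleveland data): two numeric predictors and
# a four-level class label, generated only so the sketch runs on its own.
toy <- tibble(
  cp      = factor(sample(1:4, 200, replace = TRUE)),
  age     = rnorm(200, mean = 54, sd = 9),
  thalach = rnorm(200, mean = 150, sd = 23)
)

# Step 4: stratified 75/25 split into training and testing sets.
toy_split <- initial_split(toy, prop = 0.75, strata = cp)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Step 5: recipe that scales and centres the predictors (K-NN is distance-based),
# plus a model specification with the number of neighbours left to tune.
knn_recipe <- recipe(cp ~ ., data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Step 6: 5-fold cross-validation over a grid of candidate k values.
folds  <- vfold_cv(toy_train, v = 5, strata = cp)
k_grid <- tibble(neighbors = seq(1, 21, by = 4))

cv_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Step 7: pick the k with the highest cross-validated accuracy and refit.
best_k <- cv_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(
    nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
      set_engine("kknn") |>
      set_mode("classification")
  ) |>
  fit(data = toy_train)

# Steps 8 and 9: predict on the test set and compute the accuracy.
test_accuracy <- predict(knn_fit, toy_test) |>
  bind_cols(toy_test) |>
  metrics(truth = cp, estimate = .pred_class) |>
  filter(.metric == "accuracy")
test_accuracy
```

Because the labels in `toy` are random, the accuracy here should sit near chance level; the point is the structure of the workflow. Plotting `mean` against `neighbors` from `cv_results` gives the accuracy-versus-*k* plot described in step 6.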


**Visualizing results:** We will use a scatter plot, with points colour-coded by chest pain type.

## Expected Outcomes and Significance

**We expect to find:**
the type of chest pain that a patient with heart disease is most likely to have, based on other aspects of their medical information.

**What impact could our findings have?**
Predicting chest pain type can help the patient and doctor prepare to treat that symptom. Further data analysis could identify which factor(s) are most strongly associated with each chest pain type.

**What future questions could this lead to?**
- Does this classification model perform consistently on data from other parts of the world (e.g., Hungary or Switzerland)?
- What other variables could be used to predict chest pain type, and would adding them increase the estimated accuracy?
