# Title: Which variables are good at predicting whether or not someone has heart disease?

## Introduction (Kevin)
- Provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your proposal
- Clearly state the question you will try to answer with your project
- Identify and describe the dataset that will be used to answer the question


*INTRO TEXT GOES HERE*

## Preliminary exploratory data analysis (Chloe)
- Demonstrate that the dataset can be read from the web into R 
- Clean and wrangle your data into a tidy format
- Using only training data, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data. 
- Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.
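
The summary table described in the bullets above could be sketched as follows. This is a minimal illustration, not the project's actual table: `toy_train` is a hypothetical stand-in for the training split, and its values are made up purely to show the shape of the output. Only `dplyr` is assumed to be installed.

```r
library(dplyr)

# Hypothetical stand-in for the training split; values are made up for illustration only.
toy_train <- data.frame(
  disease = factor(c(0, 0, 1, 2, 0, 1)),
  thalach = c(150, NA, 108, 140, 160, 125),
  oldpeak = c(0.0, 1.4, 2.6, 1.0, 0.2, NA)
)

# One row per class: number of observations, predictor means (ignoring NAs),
# and how many rows in that class have at least one missing predictor value.
summary_table <- toy_train |>
  group_by(disease) |>
  summarize(
    n_obs = n(),
    mean_thalach = mean(thalach, na.rm = TRUE),
    mean_oldpeak = mean(oldpeak, na.rm = TRUE),
    rows_with_na = sum(is.na(thalach) | is.na(oldpeak))
  )

summary_table
```

Swapping `toy_train` for the real training data and extending the `summarize()` call to the chosen predictors would give one table that covers class counts, predictor means, and missing rows together.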

In [2]:
#PRELIMINARY EXPLORATORY DATA ANALYSIS CODE HERE



In [1]:
# load required packages
library(tidyverse)
library(readr)
library(tidymodels)
# read data
heart_data <- read_csv("data/processed.cleveland.data", 
                 col_names = c("age", "sex", "cp", "trestbps", 
                               "chol", "fbs", "restecg", "thalach", "exang", 
                               "oldpeak", "slope", "ca", "thal", "disease"), na = "?")
# convert numerically coded categorical columns to factors
heart_data <- heart_data |>
        mutate(sex = as_factor(sex), cp = as_factor(cp), fbs = as_factor(fbs), restecg = as_factor(restecg), 
               exang = as_factor(exang), slope = as_factor(slope), thal = as_factor(thal), disease = as_factor(disease))

head(heart_data)
set.seed(888)
# split data into training and testing data
heart_split <- initial_split(heart_data, prop = 0.75, strata = disease)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)
# locate missing data (training data only)
train_missing_column <- colSums(is.na(heart_train))
locate_missing_train <- which(!complete.cases(heart_train))
# count rows with at least one missing value (not the number of missing cells)
sum_missing_train <- sum(!complete.cases(heart_train))

cat("Summary of Missing Data in Training Data\n")
cat("Number of Missing Values in Each Column: \n")
train_missing_column
cat("Indices of Rows That Have Missing Data: ", locate_missing_train, "\n")
cat("Number of Rows That Have Missing Data: ", sum_missing_train, "\n")
# Frequency of each class (training data only)
cat("Frequency of Each Class (0 = Healthy, 1-4 = Sick): \n")
table(heart_train$disease)
# Proportion of each class (training data only)
cat("Proportion of Each Class (0 = Healthy, 1-4 = Sick): \n")
prop.table(table(heart_train$disease))
# Visualization (boxplots)
#compare thalach 
thalach_boxplot <- heart_train |>
  ggplot(aes(x = disease, y = thalach, group = disease)) +
  geom_boxplot(aes(fill = disease)) +
  labs(title = "Compare Maximum Heart Rate Achieved",
       x = "Diagnosis of Heart Disease", 
       y = "Maximum Heart Rate Achieved",
       fill = "Diagnosis of Heart Disease") +
       theme(text = element_text(size = 15))
thalach_boxplot

cat("The boxes show little overlap across diagnosis groups, suggesting that maximum heart rate achieved may be a useful predictor of heart disease.")


# compare serum cholesterol level across diagnosis groups
chol_boxplot <- heart_train |>
  ggplot(aes(x = disease, y = chol, group = disease)) +
  geom_boxplot(aes(fill = disease)) +
  labs(title = "Compare Serum Cholesterol Level",
       x = "Diagnosis of Heart Disease",
       y = "Serum Cholesterol Level (mg/dl)",
       fill = "Diagnosis of Heart Disease") +
       theme(text = element_text(size = 15))
chol_boxplot

cat("The boxes overlap substantially, suggesting that serum cholesterol level may not be a useful predictor of heart disease.")

# compare ST depression induced by exercise relative to rest across diagnosis groups
oldpeak_boxplot <- heart_train |>
  ggplot(aes(x = disease, y = oldpeak, group = disease)) +
  geom_boxplot(aes(fill = disease)) +
  labs(title = "Compare ST Depression Induced by Exercise Relative to Rest",
       x = "Diagnosis of Heart Disease",
       y = "ST Depression Induced by Exercise",
       fill = "Diagnosis of Heart Disease") +
       theme(text = element_text(size = 15))
oldpeak_boxplot

cat("The boxes show little overlap across diagnosis groups, suggesting that ST depression induced by exercise relative to rest may be a useful predictor of heart disease.")

# compare resting blood pressure across diagnosis groups
trestbps_boxplot <- heart_train |>
  ggplot(aes(x = disease, y = trestbps, group = disease)) +
  geom_boxplot(aes(fill = disease)) +
  labs(title = "Compare Resting Blood Pressure",
       x = "Diagnosis of Heart Disease",
       y = "Resting Blood Pressure (mm Hg)",
       fill = "Diagnosis of Heart Disease") +
       theme(text = element_text(size = 15))
trestbps_boxplot

cat("The boxes overlap substantially, suggesting that resting blood pressure may not be a useful predictor of heart disease.")
# Visualization (bar charts)
# proportion of each diagnosis group with exercise-induced angina and with high fasting blood sugar
prop_heart <- heart_train |>
    group_by(disease) |>
    summarize(prop_exang = mean(exang == 1), prop_fbs = mean(fbs == 1))
head(prop_heart)
exang_bar <- prop_heart |>
  ggplot(aes(x = disease, y = prop_exang, fill = disease)) +
  geom_bar(stat = "identity") +
  labs(title = "Compare Proportion with Exercise-Induced Angina",
       x = "Diagnosis of Heart Disease",
       y = "Proportion with Exercise-Induced Angina",
       fill = "Diagnosis of Heart Disease") +
       theme(text = element_text(size = 15))
exang_bar

cat("The proportion of people who have exercise-induced angina is generally higher in the groups diagnosed with heart disease.")

fbs_bar <- prop_heart |>
  ggplot(aes(x = disease, y = prop_fbs, fill = disease)) +
  geom_bar(stat = "identity") +
  labs(title = "Compare Proportion with Fasting Blood Sugar > 120 mg/dl",
       x = "Diagnosis of Heart Disease",
       y = "Proportion with Fasting Blood Sugar > 120 mg/dl",
       fill = "Diagnosis of Heart Disease") +
       theme(text = element_text(size = 15))
fbs_bar

cat("The proportion of people who have fasting blood sugar > 120 mg/dl is approximately the same across diagnosis groups.")



ERROR: Error: 'data/processed.cleveland.data' does not exist in current working directory ('/home/jovyan/work/DSCI-100-project-010-4/.ipynb_checkpoints').
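
The error above shows the relative path being resolved against the notebook's `.ipynb_checkpoints` directory. One way to avoid depending on the working directory, and to read the data from the web as the section asks, is to pass a URL to `read_csv`, which accepts URLs as well as local paths. The sketch below is a suggestion only: `read_heart` is a hypothetical helper, and `heart_url` is an assumed location for the file that should be verified before use.

```r
library(readr)

# Column names as used in the analysis cell above.
heart_cols <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                "thalach", "exang", "oldpeak", "slope", "ca", "thal", "disease")

# ASSUMPTION: commonly cited UCI repository location for this file; verify before relying on it.
heart_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"

# Hypothetical helper: `src` may be a local path or a URL.
# "?" marks missing values in the raw file, so it is read as NA.
read_heart <- function(src) {
  read_csv(src, col_names = heart_cols, na = "?", show_col_types = FALSE)
}

# Not run here because it needs network access:
# heart_data <- read_heart(heart_url)
```

If the local copy is preferred, the same helper can be called with a path, provided the notebook is opened from the project root rather than from `.ipynb_checkpoints`.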


## Methods (Tom):
- Explain how you will conduct your data analysis and which variables/columns you will use. Note: you do not need to use all variables/columns that exist in the raw data set. In fact, that's often not a good idea. For each variable, think: is this a useful variable for prediction?
- Describe at least one way that you will visualize the results

*METHODS TEXT GOES HERE*

## Expected outcomes and significance (Michael):
- What do you expect to find?
- What impact could such findings have?
- What future questions could this lead to?

*EXPECTED OUTCOMES TEXT GOES HERE*