# 🫀 HW5 Lab: Predicting heart disease with logistic regression

<img src="img/heart-disease.jpeg" alt="heart-disease" width="300" />

## ✅ Setup and data import
In this lab, we will attempt to predict heart disease using an aggregated [dataset of patients](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction).

In [None]:
# Load in additional functions
library(tidyverse)
library(lubridate)

# Print numbers with three significant digits,
# and don't use scientific notation.
options(digits = 3, scipen = 999)

# Format plots with a white background and dark features.
theme_set(theme_bw())

# Increase the default text size of plots.
# If you are *not* working in Google Colab, we recommend commenting
# out this line of code.
theme_update(text = element_text(size = 20))

# Increase the default plot width and height.
# If you are *not* working in Google Colab, we recommend commenting
# out this line of code.
options(repr.plot.width=12, repr.plot.height=8)

# Read in the data
data = read_csv('https://raw.githubusercontent.com/joshuagrossman/mse125-labs-public/main/week5/heart_failure.csv')

# peek at 10 random rows
sample_n(data, 10)

── Attaching packages ────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.5.0     ✔ purrr   1.0.2
✔ tibble  3.2.1     ✔ dplyr   1.1.4
✔ tidyr   1.3.0     ✔ stringr 1.5.1
✔ readr   2.1.4     ✔ forcats 1.0.0
── Conflicts ───────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘lubridate’


The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union


Rows: 918 Columns: 12
── Column specification ───────────────────────────────────────────────
Delimiter: ","
chr (5): Sex, ChestPainType, RestingECG, ExerciseAngina

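| Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<dbl>` | `<chr>` | `<dbl>` | `<chr>` | `<dbl>` |
| 53 | M | ASY | 120 | 246 | 0 | Normal | 116 | Y | 0.0 | Flat | 1 |
| 76 | F | NAP | 140 | 197 | 0 | ST | 116 | N | 1.1 | Flat | 0 |
| 67 | M | ASY | 120 | 237 | 0 | Normal | 71 | N | 1.0 | Flat | 1 |
| 58 | M | ASY | 100 | 213 | 0 | ST | 110 | N | 0.0 | Up | 0 |
| 53 | M | NAP | 130 | 246 | 1 | LVH | 173 | N | 0.0 | Up | 0 |
| 51 | M | ASY | 128 | 0 | 0 | Normal | 107 | N | 0.0 | Up | 0 |
| 66 | M | ASY | 112 | 212 | 0 | LVH | 132 | Y | 0.1 | Up | 1 |
| 63 | F | ASY | 108 | 269 | 0 | Normal | 169 | Y | 1.8 | Flat | 1 |
| 59 | F | ATA | 130 | 188 | 0 | Normal | 124 | N | 1.0 | Flat | 0 |
| 67 | M | ASY | 120 | 229 | 0 | LVH | 129 | Y | 2.6 | Flat | 1 |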


## 🚀 Exercise 1

a) Fit a linear probability model to the heart disease data to predict the likelihood that a patient has heart disease. 
- The model should only take into account the `Sex` column. 
- Print the summary of your model.
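
As a reminder of the mechanics, a linear probability model is ordinary least squares fit to a 0/1 outcome. The sketch below uses a small made-up data frame; `toy`, `outcome`, and `group` are hypothetical names chosen for illustration and are not columns of the lab data.

```r
# Hypothetical toy data (not the lab data):
# a binary outcome and a two-level categorical feature.
toy <- data.frame(
  outcome = c(0, 0, 1, 0, 1, 1, 1, 0),
  group   = c("A", "A", "A", "A", "B", "B", "B", "B")
)

# A linear probability model is lm() applied to a 0/1 outcome.
toy_lpm <- lm(outcome ~ group, data = toy)

# Coefficient estimates, standard errors, and p-values.
summary(toy_lpm)
```

R treats the first level of `group` (here `"A"`) as the reference level, so the summary reports an `(Intercept)` row and a `groupB` row.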

In [None]:
# Your code here for part a)!



b) How do you interpret the coefficients of your model? Answer in no more than one sentence per coefficient.

In [None]:
# Your written answer for part b) here!
# 
# 



c) Do male patients in the dataset have a significantly higher risk of heart disease than female patients in the dataset? Justify your response with a 95% confidence interval.
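
One way to read significance off a confidence interval is to check whether the interval for a coefficient excludes zero. The sketch below uses base R's `confint()`, which for an `lm` fit builds the interval from the t distribution; your course may instead use the estimate plus or minus roughly two standard errors, which gives a similar answer in large samples. The data and the names `toy`, `outcome`, and `group` are hypothetical.

```r
# Hypothetical toy data (not the lab data).
toy <- data.frame(
  outcome = c(0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1),
  group   = rep(c("A", "B"), each = 6)
)

toy_lpm <- lm(outcome ~ group, data = toy)

# 95% confidence intervals: one row per coefficient,
# with lower and upper bounds as columns.
ci <- confint(toy_lpm, level = 0.95)
ci

# The group difference is significant at the 5% level
# only if the interval for `groupB` excludes zero.
excludes_zero <- ci["groupB", 1] > 0 | ci["groupB", 2] < 0
excludes_zero
```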

In [None]:
# Your code here for part c)!



In [None]:
# Your written answer for part c) here!
# 
# 



d) Write code to show how you could calculate the `(Intercept)` and `SexM` coefficients without using the linear regression algorithm.

In [None]:
# Your code here for part d)!



## 🚀 Exercise 2

**For each of the parts of Exercise 2, justify your answer by training a linear probability model, calculating a 95% confidence interval, and writing a one-sentence explanation.**

a) Is each additional year of life significantly associated with a higher estimated risk of heart disease?

In [None]:
# Your code and written answer here for part a)!



b) For a male patient and a female patient of the same age, does the male patient have a significantly higher estimated risk of heart disease?

In [None]:
# Your code and written answer here for part b)!



c) Does the estimated change in risk per additional year of life significantly differ for male patients versus female patients? 
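
A question about whether a slope differs between two groups is usually framed with an interaction term. The sketch below shows the formula syntax on simulated data; `toy`, `x`, `group`, and `outcome` are hypothetical names, and the simulated probabilities are arbitrary.

```r
# Hypothetical simulated data (not the lab data).
set.seed(1)
n <- 200
toy <- data.frame(
  x     = runif(n, 30, 80),
  group = sample(c("A", "B"), n, replace = TRUE)
)

# Simulate a 0/1 outcome whose dependence on x differs by group.
p <- ifelse(toy$group == "A", 0.005, 0.010) * toy$x
toy$outcome <- rbinom(n, 1, p)

# `x * group` expands to x + group + x:group.
# The `x:groupB` coefficient estimates the difference in slopes
# between group B and the reference group A.
toy_lpm <- lm(outcome ~ x * group, data = toy)

# 95% confidence interval for the difference in slopes.
confint(toy_lpm)["x:groupB", ]
```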

In [None]:
# Your code and written answer here for part c)!



## 🚀 Exercise 3

For each of the 3 linear probability models from Exercise 2, fit a logistic regression model using the same features. 

> In other words, you should fit three logistic regression models.

Are your answers to Exercise 2 consistent with the output of the logistic regression models? Answer in no more than three sentences.
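
For reference, logistic regression in R uses the same formula interface as `lm()`, but is fit with `glm()` and `family = binomial`. The sketch below runs on simulated data with hypothetical names (`toy`, `x`, `outcome`). It uses `confint.default()`, which returns Wald intervals; this is one of several interval choices and may differ from the one used in lecture.

```r
# Hypothetical simulated data (not the lab data).
set.seed(2)
n <- 300
toy <- data.frame(x = rnorm(n))
toy$outcome <- rbinom(n, 1, plogis(-0.5 + 1.2 * toy$x))

# Logistic regression: glm() with a binomial family.
toy_logit <- glm(outcome ~ x, data = toy, family = binomial)
summary(toy_logit)

# Coefficients are on the log-odds scale;
# exponentiating converts them to odds ratios.
exp(coef(toy_logit))

# Wald 95% confidence intervals, also on the log-odds scale.
confint.default(toy_logit)
```

Because the coefficients are on the log-odds scale, their magnitudes are not directly comparable to linear probability model coefficients, but their signs and whether their intervals exclude zero can be compared.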

In [None]:
# Your code here!



In [None]:
# Your written answer here!
# 
# 



## 🚀 Exercise 4

a) Fit two models to the dataset:
1. A linear probability model accounting for all covariates in the data.
2. A logistic regression model accounting for all covariates in the data.

> Models like this are often called "fully saturated".

Print the model summary of each model.

In [None]:
# Your code here for part a)!



b) Create a plot to compare the estimated probabilities for the fully saturated logistic regression model and the fully saturated linear probability model. 

> The choice of plot is up to you, as long as your chosen plot allows for comparison of the probabilities. 

In no more than two sentences, compare the estimated probabilities of each model.
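
To compare the two kinds of model, the estimated probabilities first need to be on the same scale. A minimal sketch on simulated data (all names are hypothetical): `predict()` on an `lm` fit returns fitted values directly, while a `glm` fit needs `type = "response"` to return probabilities rather than log-odds.

```r
# Hypothetical simulated data (not the lab data).
set.seed(3)
n <- 300
toy <- data.frame(x = rnorm(n, sd = 2))
toy$outcome <- rbinom(n, 1, plogis(1.5 * toy$x))

toy_lpm   <- lm(outcome ~ x, data = toy)
toy_logit <- glm(outcome ~ x, data = toy, family = binomial)

# One row per observation, one column per model.
preds <- data.frame(
  lpm   = predict(toy_lpm),
  logit = predict(toy_logit, type = "response")
)

# Logistic predictions always lie between 0 and 1;
# linear probability model predictions are not constrained to.
range(preds$lpm)
range(preds$logit)
```

A data frame shaped like `preds` can then be passed to whichever `ggplot2` geometry you choose for the comparison.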

In [None]:
# Your code here for part b)!



In [None]:
# Your written answer for part b)!
# 
# 



## 🚀 Exercise 5

Using a threshold of 0.5, calculate and compare the accuracy of the linear probability model and the logistic regression model from Exercise 4.
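
As a small worked example of thresholding, the sketch below turns estimated probabilities into 0/1 predictions and scores them. The vectors `prob` and `label` are made up for illustration, and treating a probability exactly equal to the threshold as positive is an assumed convention.

```r
# Hypothetical estimated probabilities and true 0/1 labels
# (not taken from the lab data).
prob  <- c(0.10, 0.40, 0.35, 0.80, 0.65, 0.55, 0.20, 0.90)
label <- c(0,    0,    1,    1,    1,    0,    0,    1)

# Predict positive when the estimated probability
# is at least the threshold.
threshold <- 0.5
predicted <- as.integer(prob >= threshold)

# Accuracy is the share of predictions that match the true label.
accuracy <- mean(predicted == label)
accuracy  # 0.75: six of the eight predictions are correct
```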

In [None]:
# Your code here! 



## 🚀 Exercise 6

Plot the ROC curve for the logistic regression model from Exercise 5. 

> Plot the ROC curve manually with `ggplot2`. In other words, do not use an external package to generate the plot.

Identify the point on the curve that corresponds to the threshold of 0.5.

Provide a one-sentence description of how you identified the point corresponding to the 0.5 threshold.
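
An ROC curve is the set of (false positive rate, true positive rate) pairs obtained by sweeping the threshold. The sketch below computes those pairs by hand for made-up vectors `prob` and `label` (hypothetical, not from the lab data); it stops short of plotting.

```r
# Hypothetical estimated probabilities and true 0/1 labels.
prob  <- c(0.10, 0.40, 0.35, 0.80, 0.65, 0.55, 0.20, 0.90)
label <- c(0,    0,    1,    1,    1,    0,    0,    1)

thresholds <- seq(0, 1, by = 0.01)

# For each threshold: share of true positives flagged (TPR)
# and share of true negatives flagged (FPR).
roc <- data.frame(
  threshold = thresholds,
  tpr = sapply(thresholds, function(t) mean(prob[label == 1] >= t)),
  fpr = sapply(thresholds, function(t) mean(prob[label == 0] >= t))
)

# The row closest to a given threshold gives its location on the curve.
roc[which.min(abs(roc$threshold - 0.5)), ]
```

A data frame like `roc` can be passed to `ggplot()` with `fpr` on the x-axis and `tpr` on the y-axis, for example using `geom_path()`, and the single row for a chosen threshold can be added as a point.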

In [None]:
# Your code here!



In [None]:
# Your written answer here!
# 
# 



## 🚀 Exercise 7

a) In two sentences, describe what a false negative and a false positive are in the context of the logistic regression model from Exercises 4-6.

In [None]:
# Your written answer for Part a)
# 
# 



b) Suppose that a clinical expert tells you that a false negative is 4 times worse than a false positive for this problem. Find the optimal threshold for the logistic regression model using this information.
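
One possible reading of "optimal" here is the threshold that minimizes total misclassification cost, where each false negative counts `relative_cost` times as much as a false positive. The sketch below illustrates that idea on made-up vectors `prob` and `label` (hypothetical, not from the lab data); other formulations of the cost are possible.

```r
# Hypothetical estimated probabilities and true 0/1 labels.
prob  <- c(0.10, 0.40, 0.35, 0.80, 0.65, 0.55, 0.20, 0.90)
label <- c(0,    0,    1,    1,    1,    0,    0,    1)

relative_cost <- 4  # a false negative costs 4 times a false positive
thresholds <- seq(0, 1, by = 0.01)

# Total cost at each threshold:
# weighted false negatives plus false positives.
total_cost <- sapply(thresholds, function(t) {
  predicted <- prob >= t
  false_neg <- sum(!predicted & label == 1)
  false_pos <- sum(predicted & label == 0)
  relative_cost * false_neg + false_pos
})

# which.min() returns the first minimizer if several thresholds tie.
best_threshold <- thresholds[which.min(total_cost)]
best_threshold
```

Because false negatives are penalized more heavily, the cost-minimizing threshold in this toy example falls below 0.5, so more cases are flagged as positive.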

In [None]:
# Your code here for Part b!

relative_cost = 4
thresholds = seq(0, 1, by=0.01)



c) Using the threshold you computed in part b), compute the true positive rate and false positive rate, and interpret these values in the context of the problem.

In [None]:
# Your code here for Part c!



In [None]:
# Your written answer for part c!
# 
# 

