<a href="https://colab.research.google.com/github/meljel/meljel/blob/main/comp341_hw3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## COMP 341: Practical Machine Learning
## Homework Assignment 3: Heart Attack Prediction
### Due: Tuesday, October 3 at 11:59pm on Gradescope

Heart attacks are one of the leading causes of death both in the US and globally. Currently, we know that certain lifestyle factors (exercise, smoking, weight, diet, etc.) and pre-existing health conditions (arthritis, diabetes, etc.) are associated with the likelihood of having a heart attack, but the exact contribution of these factors remains unknown.

In this assignment we will attempt to predict whether or not someone has had a heart attack given a snapshot of their health information (mostly based on self-reported survey data) from the previous month. Helpful predictive models, as well as the feature importances we can extract from them, can have downstream positive effects for public health, allowing doctors to work together with patients to suggest early actions that reduce heart attack risk in individuals with high-risk features.

As always, fill in missing code following `# TODO:` comments or `####### YOUR CODE HERE ########` blocks and be sure to answer the short answer questions marked with `[WRITE YOUR ANSWER HERE]` in the text.

All code in this notebook will be run sequentially so make sure things work in order! Be sure to also use good coding practices (e.g., logical variable names, comments as needed, etc), and make plots that are clear and legible.

For this assignment, **15 points** are allocated for general coding quality:
* **10 points** for coding style
* **5 points** for code flow (accurate results when everything is run sequentially)

### Setup
First, we need to import some libraries that are necessary to complete the assignment.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

Add additional modules/libraries to import here (rather than wherever you first use them below):

In [None]:
# additional modules/libraries to import


We provide some code to get the data file for this assignment into your workspace below. You only need to do the following 4 steps once:
1. Go to 'My Drive' in your own Google Drive
2. Make a new folder named `comp341`
3. From the [Google Drive link](https://drive.google.com/file/d/1kSd84tQIMv_UlUEi5QvU_ZLn7WsdChoP/view?usp=sharing), click Download. You should now have a single file entitled `heart_health.csv` on your computer.
4. In the `comp341` folder you created in step 2, click `New -> File Upload` and select the `heart_health.csv` file from your computer.

Now, we will mount your Google Drive in Colab so that you can read the file in (you will need to do this each time your runtime restarts).

In [None]:
# note that this command will trigger a request from google to allow colab
# to access your files: you will need to accept the terms in order to access
# the files this way
from google.colab import drive
drive.mount('/content/drive')

# if you followed the instructions above exactly, heart_health.csv should be
# in comp341/; if your files are in a different directory
# on your Google Drive, you will need to change the path below accordingly
DATADIR = '/content/drive/My Drive/comp341/'

Now that your Google Drive is mounted, you can read the data in `heart_health.csv` into a pandas DataFrame:

In [None]:
df = pd.read_csv(DATADIR + "heart_health.csv")

### Part 0: Getting Familiar with the Data [13 points]
This time we are starting with tidy data (yay!), but we still need to get a feel for what is going on. One quick way is to check some basic dataset attributes, such as the number of individuals measured, the types of features, the number of missing values and where they occur, and the distribution of the labels (heart attack vs no heart attack).

Most of the features are relatively self explanatory. The `recent_x_consumption` features all specify the number of times `x` was consumed in the last month (whether it be alcohol, fruit, veggies, or fried foods).

Use the code area below to explore the data, and answer the two short answer questions that follow. (Note: You are free to take whichever approach you like to find the answers, i.e., the code itself will not be graded, but your answers to the short answer questions will be.)

In [None]:
# explore the data to answer the two short answer questions below
####### YOUR CODE HERE #########


**Short Answer Question:** How many individuals are in this dataset? [1 pt]

`[WRITE YOUR ANSWER HERE]`

**Short Answer Question:** List the nominal features, ordinal features, and numeric features in this data. [2 pts]

`[WRITE YOUR ANSWER HERE]`

Visualize the breakdown of heart attack status for 1 feature per category above (nominal, ordinal, numeric). Make use of descriptive/informative plots that make sense of the corresponding data type. [4 pts]

In [None]:
# TODO: visualize how heart attack relates to your choice of nominal, ordinal, and numeric feature


In [None]:
# TODO: how many individuals in this dataset have had a heart attack? [1 pt]


In [None]:
# TODO: data completeness: how many rows are affected by missing values? how many columns? [1 pt]


**Short Answer Question:** Without any additional coding, do you think the missing values determined above will be problematic during our classification task? Why or why not? [1 pt]

`[WRITE YOUR ANSWER HERE]`

**Short Answer Question:** Again without additional coding, which features do you think will be the most informative? Explain. (Don't panic: there are no wrong answers here as long as you've explained your rationale.) [1 pt]

`[WRITE YOUR ANSWER HERE]`

Before we begin running any methods, we need to set aside some samples as a test set so that we can evaluate how well our trained models are doing in the later sections, as well as a validation set to tune any hyperparameters. We first need to transform our data into a set of features and a set of labels (the column that holds the values we want to predict). Recall that by convention in ML, the features are stored in a matrix or DataFrame called X and the prediction labels in a vector called y.

In [None]:
# TODO: separate the features from the prediction labels (heart_attack) into
# two data frames, X and y, respectively [1 pt]


In [None]:
# this next snippet of code is provided so that everyone will have the same training, validation, and test splits
# the percentage of data to take for test is specified by test_size, and since rows are chosen at random
# the random_state parameter sets the seed that keeps the sets reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=341)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=341)
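
To see what proportions the two calls above produce, here is a small sketch on made-up data (the `X_demo` and `y_demo` arrays are hypothetical stand-ins, not the assignment data). Taking 20% for test and then 25% of the remaining 80% for validation should give a 60/20/20 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 rows, 3 features, binary labels
X_demo = np.arange(3000).reshape(1000, 3)
y_demo = np.arange(1000) % 2

# First split: hold out 20% of all rows as the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=341
)
# Second split: 25% of the remaining 80% (20% overall) becomes validation
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tr, y_tr, test_size=0.25, random_state=341
)

split_sizes = (len(X_tr), len(X_va), len(X_te))
print(split_sizes)  # (600, 200, 200)
```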

In [None]:
# TODO: fill in any missing values with the most frequently appearing value (mode) for each feature
# we want to do this after the test/train split so that the train and test datasets
# do not use the other set of data to impute its values [1 pt]
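
As an illustration of the idea in the comments above, the sketch below learns the fill values from the training rows only and reuses them for the validation rows. The column names and values are made up for the example, and pandas `mode`/`fillna` is just one possible approach.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames; the column names are NOT from heart_health.csv
train_toy = pd.DataFrame({
    "smoker": ["no", "no", "yes", np.nan],
    "checkups": [1.0, 2.0, 2.0, np.nan],
})
val_toy = pd.DataFrame({
    "smoker": [np.nan, "yes"],
    "checkups": [np.nan, 5.0],
})

# Learn the most frequent value of each column from the training rows only
train_modes = train_toy.mode().iloc[0]

# Apply the same training-derived fill values to every split
train_filled = train_toy.fillna(train_modes)
val_filled = val_toy.fillna(train_modes)
print(val_filled)
```

Because `train_modes` never looks at the validation rows, nothing about the held-out data leaks into the imputation.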


### Part 1: Decision Trees [25 points]
In this first part we will run decision trees, carving up the underlying data in different ways and observing changes in classification performance. Unless otherwise specified, for the remainder of the homework we will be evaluating classification performance using the accuracy of predictions made on the test set.

In [None]:
# TODO: run a decision tree (using default parameters) to predict heart attacks using numeric features only,
# and evaluate accuracy, as well as precision, recall, and F1 score for heart attack predictions on the validation dataset
# Note that for precision, recall, and F1, we are primarily interested in performance for predicting the heart attack cases,
# so report metrics for those cases (and not also for predicting the absence of heart attacks) [4 pts]
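
The sketch below shows how the four metrics can be computed for the positive class only, using hand-made labels. It assumes the heart attack class is encoded as `1`; if the labels in the data are encoded differently, `pos_label` would need to change accordingly.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 1 = heart attack, 0 = no heart attack
y_true_demo = [1, 0, 1, 1, 0, 0, 0, 0]
y_pred_demo = [1, 0, 0, 1, 1, 1, 0, 0]

# 2 true positives, 2 false positives, 1 false negative, 3 true negatives
acc = accuracy_score(y_true_demo, y_pred_demo)
prec = precision_score(y_true_demo, y_pred_demo, pos_label=1)
rec = recall_score(y_true_demo, y_pred_demo, pos_label=1)
f1 = f1_score(y_true_demo, y_pred_demo, pos_label=1)

print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
# accuracy=0.625 precision=0.500 recall=0.667 f1=0.571
```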


In [None]:
# TODO: convert any categorical features that need to be one-hot encoded [1 pt]
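
One detail worth keeping in mind when encoding separate splits is that a category may be absent from one of them. The sketch below uses `pd.get_dummies` on made-up data (the `sex` and `age` columns are hypothetical) and then aligns the validation columns to the training columns.

```python
import pandas as pd

# Hypothetical toy frames; the validation split happens to contain no "F" rows
train_cat = pd.DataFrame({"sex": ["F", "M", "F"], "age": [50, 61, 47]})
val_cat = pd.DataFrame({"sex": ["M", "M"], "age": [58, 39]})

train_encoded = pd.get_dummies(train_cat, columns=["sex"])
val_encoded = pd.get_dummies(val_cat, columns=["sex"])

# Add any indicator columns the validation split is missing (filled with 0)
# and put the columns in the same order as the training frame
val_encoded = val_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(val_encoded.columns))  # ['age', 'sex_F', 'sex_M']
```

Fitting scikit-learn's `OneHotEncoder` on the training split is another way to get consistent columns.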


In [None]:
# TODO: train a decision tree model (again default) with all categorical and numeric features
# and evaluate the accuracy, precision, recall, F1 on the validation set as you did earlier [2 pts]


**Short Answer Question:** Did you notice any performance improvements after adding the categorical features? Comment on why you think adding categorical features did or did not help. [1 pt]

`[WRITE YOUR ANSWER HERE]`

In [None]:
# TODO: scale all numeric features and retrain and re-test the decision tree model [2 pts]
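
A minimal sketch of scaling, assuming `StandardScaler` is the scaler of choice (the assignment does not prescribe one) and using made-up numbers. The scaler is fit on the training rows and the same transformation is then applied to the validation rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features on very different scales
train_num = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
val_num = np.array([[2.0, 400.0]])

# Learn the mean and standard deviation from the training rows only
scaler = StandardScaler().fit(train_num)

train_scaled = scaler.transform(train_num)
val_scaled = scaler.transform(val_num)  # reuses the training mean and std

print(train_scaled.mean(axis=0))  # approximately [0, 0]
print(val_scaled)
```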


**Short Answer Question:** Did you notice any performance improvements after scaling? Comment on why you think scaling did or did not help. [1 pt]

`[WRITE YOUR ANSWER HERE]`

In [None]:
# TODO: what is the depth of the resulting tree when you did not set any bounds? [1 pt]


As we see above, and as we saw in lecture, the default tree left unchecked can grow pretty large if there is no maximum depth set. Instead of choosing a random depth, let's set one by comparing accuracy on the training and validation sets across a range of depths.

In [None]:
# TODO: vary the maximum depth parameter from 4-35 and calculate the
# accuracy and f1 score on both the training and validation set
# NOTE: it is okay to use a for loop here (and may take a few minutes) [4 pts]
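
The loop structure might look roughly like the sketch below, which runs on synthetic data from `make_classification` rather than the heart health data. All variable names are hypothetical; only the depth range (4 through 35 inclusive) comes from the task above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

rows = []
for depth in range(4, 36):  # 4 through 35 inclusive
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    pred_tr, pred_va = tree.predict(X_tr), tree.predict(X_va)
    rows.append({
        "max_depth": depth,
        "train_acc": accuracy_score(y_tr, pred_tr),
        "val_acc": accuracy_score(y_va, pred_va),
        "train_f1": f1_score(y_tr, pred_tr),
        "val_f1": f1_score(y_va, pred_va),
    })

# One row per depth, ready for plotting
depth_results = pd.DataFrame(rows)
print(depth_results.head())
```

Collecting the scores in a DataFrame keeps the later plotting steps to a single call per metric.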


In [None]:
# TODO: make a single plot showing maximum depth vs accuracy on the training and validation data [1 pt]


In [None]:
# TODO: make a single plot showing maximum depth vs f1 score on the training and validation data [1 pt]


**Short Answer Question:** Which maximum depth parameter would you choose? [1 pt]

`[WRITE YOUR ANSWER HERE]`

In [None]:
# TODO: using the decision tree model with the optimal maximum depth,
# visualize the whole resulting tree [2 pts]


In [None]:
# TODO: still using the decision tree model with the optimal maximum depth,
# analyze the feature importances by making a sorted DataFrame of features
# and their corresponding importances [3 pts]
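
For reference, a fitted scikit-learn tree exposes its importances through the `feature_importances_` attribute. The sketch below pairs them with feature names on synthetic data; the feature names and the depth of 4 are placeholders, not a suggested answer.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with placeholder feature names
X_syn, y_syn = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_syn.shape[1])]

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_syn, y_syn)

# Pair each feature with its importance and sort from most to least important
importance_df = (
    pd.DataFrame({"feature": feature_names, "importance": tree.feature_importances_})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(importance_df)
```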


**Short Answer Question:** Looking at the feature importances determined above, does this match your initial expectations? Why or why not? [2 pts]

`[WRITE YOUR ANSWER HERE]`


### Part 2: Logistic Regression [27 points]
We achieved pretty good performance using decision trees. Now let's see what happens when we switch our ML algorithm to logistic regression. In this next section, we will explore logistic regression and look deeper into the model weights to better understand which data set features are important for heart attack prediction.

In [None]:
# TODO: run logistic regression with default parameters on all features,
# where numeric features are unscaled, and categorical features are one-hot encoded
# and calculate accuracy, precision, recall, and f1 on the validation set [2 pts]


In [None]:
# TODO: now run logistic regression with default parameters on all features,
# where numeric features are now scaled, and categorical features are one-hot encoded
# and calculate accuracy, precision, recall, and f1 on the validation set [2 pts]


**Short Answer Question:** Does scaling have an effect on the performance for logistic regression? Why or why not? [2 pts]

`[WRITE YOUR ANSWER HERE]`

In [None]:
# TODO: extract the feature coefficients from the above logistic regression model
# and display them in an easily parseable DataFrame [2 pts]


**Short Answer Question:** How would you interpret the feature coefficients you observe above? [3 pts]

`[WRITE YOUR ANSWER HERE]`

We were able to extract a nice set of feature weights above - let's see what effects different solvers and regularization schemes have on the feature weights.

In [None]:
# TODO: run logistic regression with the liblinear solver with L1 as well as L2 regularization,
# lbfgs with no regularization, and saga with elasticnet;
# use the parameter C=0.005 when there is regularization, and for elasticnet, use l1_ratio=0.5,
# extracting the coefficients each time for comparison, as well as their
# accuracy and precision on the validation set [10 pts]
# NOTE: it is okay to use for loops here; in fact, you are encouraged to do so
# versus repeating the same code multiple times
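
The looping pattern can be sketched for just the two `liblinear` configurations, on synthetic data and with a hypothetical `C=0.05` chosen for the toy example (not the `C=0.005` required above). The `lbfgs` and `saga` configurations are not covered here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with placeholder feature names
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_syn = StandardScaler().fit_transform(X_syn)
feature_names = [f"feature_{i}" for i in range(X_syn.shape[1])]

coef_table = {}
for penalty in ["l1", "l2"]:
    model = LogisticRegression(solver="liblinear", penalty=penalty, C=0.05)
    model.fit(X_syn, y_syn)
    coef_table[penalty] = model.coef_[0]  # binary problem: one row of coefficients

# One column of coefficients per regularization scheme
coef_df = pd.DataFrame(coef_table, index=feature_names)
n_zero = (coef_df == 0).sum()  # exactly-zero coefficients per scheme
print(coef_df.round(3))
print(n_zero)
```

Putting the coefficients side by side in one DataFrame makes it easier to spot which features an L1 penalty pushes to exactly zero.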


**Short Answer Question:** Do the different regularization methods change the feature weights? If so, which features are affected? [3 pts]

`[WRITE YOUR ANSWER HERE]`

**Short Answer Question:** Furthermore, based on the feature weights across the different regularization schemes above would you say that regularization might help prevent overfitting in this case? Explain. [3 pts]

`[WRITE YOUR ANSWER HERE]`


### Part 3: Linear Discriminant Analysis and Final Performance Comparisons [20 points]
In this section we will apply one more ML algorithm, LDA, and then make some final comparisons across all three different classification models.

In [None]:
# TODO: run LDA with default settings on the features (with numeric scaled, one-hot encoded categorical)
# and calculate accuracy, precision, recall, and f1 on the validation set [2 pts]


In [None]:
# TODO: extract the feature coefficients from the LDA model
# and display them in an easily parseable DataFrame [2 pts]


**Short Answer Question:** Can the coefficients that are extracted from LDA be interpreted as feature importances? Why or why not? [2 pts]

`[WRITE YOUR ANSWER HERE]`

In [None]:
# TODO: now run LDA with shrinkage enabled, using your choice of solver, and
# report accuracy, precision, recall, and f1 [2 pts]
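
One practical note: in scikit-learn, `shrinkage` is only supported by the `lsqr` and `eigen` solvers, not the default `svd` solver. The sketch below compares the two setups on synthetic data; the choice of `lsqr` and `shrinkage="auto"` is just one possible configuration.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

# Default settings: solver="svd", no shrinkage
lda_plain = LinearDiscriminantAnalysis().fit(X_tr, y_tr)

# Shrinkage requires solver="lsqr" or solver="eigen"
lda_shrunk = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto").fit(X_tr, y_tr)

val_acc = {
    "plain": lda_plain.score(X_va, y_va),
    "shrinkage": lda_shrunk.score(X_va, y_va),
}
print(val_acc)
```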


**Short Answer Question:** How does the performance of your non-shrinkage enabled LDA and shrinkage-enabled LDA compare? If there are differences, what is your interpretation of them? [2 pts]

`[WRITE YOUR ANSWER HERE]`

Now that you have run three different classification methods (decision trees, logistic regression, LDA) with various parameter settings, choose the best performing model for each method in terms of accuracy, as well as the best performing model in terms of precision (these may overlap but do not have to), and evaluate the accuracy and precision of all chosen models on the test set. This is your final model comparison and report of the models you explored.

In [None]:
# TODO: calculate accuracy and precision for the test set for the best model for each classification method
# based on accuracy and precision (which means your final comparison will include a minimum of 3 and a maximum of 6 models)
# and display your results in an easily interpretable DataFrame [6 pts]


**Short Answer Question:** Now that you have played around with these three different classification methods (decision trees, logistic regression, LDA) across different settings, which method do you think is best suited for the heart attack prediction problem here? Which would you prefer to use for identifying potential risk factors? Are these the same method or different? Why? [4 pts]

`[WRITE YOUR ANSWER HERE]`

### Bonus: Improving Feature Selection and Hyperparameter Tuning [Extra Credit: up to 10 points]

In class, we discussed how highly correlated features can sometimes affect the performance of our models. For up to 5 points extra credit, do a more thorough exploration (it is recommended to use visual aids as part of this!) of which features are correlated. Explore how removing correlated features affects the performance (as well as coefficient / feature importance estimates) of the methods you explored above.

In addition, for the sake of space here, we did not manage to do a thorough exploration of some important hyperparameters (e.g., ccp_alpha in decision trees, C for logistic regression, different shrinkage parameters for LDA). For up to 5 points extra credit, systematically explore a range of values for the 3 aforementioned hyperparameters in their corresponding methods (you should feel free to fix the other hyperparameters in this exploration). Plot the training and validation performance (accuracy and precision) and finally report the test performance for your best models.

In [None]:
# EXTRA CREDIT: TODO: explore the impact of correlated features [+5 pts]


In [None]:
# EXTRA CREDIT: TODO: explore more hyperparameters [+5 pts]


## To Submit
Download the notebook from Colab as a `.ipynb` notebook (`File > Download > Download .ipynb`) and upload it to the corresponding Gradescope assignment.