# Prediction of Heart Failure Mortality [100 points total]


**Heart Failure** (HF) is a life-threatening condition in which the heart fails to perfuse the body with blood.  In this prelab, we will predict patients' mortality from this condition using clinical data and **Logistic Regression**.  Unlike linear regression, logistic regression is used to predict binary or categorical outcomes.  The ability to predict mortality from past cases can help clinicians make better treatment decisions in the future.
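
As a minimal sketch of how logistic regression produces a binary prediction, the snippet below passes a linear score through the sigmoid function and thresholds the resulting probability at 0.5. The coefficients `beta_0` and `beta_1` and the inputs `x` are made-up illustrative values, not quantities estimated from the heart failure data.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept and slope for a single feature (illustrative only)
beta_0, beta_1 = -1.0, 2.0
x = np.array([-2.0, 0.0, 0.5, 2.0])

p = sigmoid(beta_0 + beta_1 * x)   # predicted probability of class 1
y_pred = (p >= 0.5).astype(int)    # threshold at 0.5 to get a binary label
print(p.round(3), y_pred)
```

A linear regression would return the raw score `beta_0 + beta_1 * x`, which is unbounded; the sigmoid is what turns it into a probability.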

### 1) Load Data [5 points]
We have given you a spreadsheet file `hf.xlsx` containing clinical data from nearly 300 patients.  The data comes from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/519/heart+failure+clinical+records). Each row represents a single patient and each column represents a clinical feature.  Descriptions of each feature can be found in the original paper [Ahmad *et al.* 2017](https://doi.org/10.1371/journal.pone.0181001) as well as [Chicco and Jurman 2020](https://doi.org/10.1186/s12911-020-1023-5). The last column, labeled **death_event**, indicates whether the condition was fatal for the patient. The **pandas** function `pd.read_excel` reads the spreadsheet into a `DataFrame`.

* Use `pd.read_excel` to read the data into a dataframe named **`df`**.  Then display the first 10 rows.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.options.display.max_columns = None # Tell pandas to display all columns

try:
    import os
    from google.colab import drive
    drive.mount("/content/drive/")
    os.chdir("/content/drive/My Drive/")
except ImportError:
    pass  # Not running in Google Colab; read files from the current working directory

"""
Write your code here
"""
df = pd.read_excel("hf.xlsx")
df.head(10)


Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,death_event
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1
5,90.0,1,47,0,40,1,204000.0,2.1,132,1,1,8,1
6,75.0,1,246,0,15,0,127000.0,1.2,137,1,0,10,1
7,60.0,1,315,1,60,0,454000.0,1.1,131,1,1,10,1
8,65.0,0,157,0,65,0,263358.03,1.5,138,0,0,10,1
9,80.0,1,123,0,35,1,388000.0,9.4,133,1,1,10,1


### 2) Data Cleaning [10 points]
 * In our dataset the **`time`** column represents the number of days after diagnosis that the patient followed up with a physician. Since this is not a clinical measure, remove this column from the data frame.  Then display the first 10 rows of the updated data frame.
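
One possible way to remove a column is `DataFrame.drop` with the `columns` argument, sketched here on a small made-up frame (`toy` is a hypothetical stand-in, not the real data). `drop` returns a new frame and leaves the original unchanged unless you reassign the result.

```python
import pandas as pd

# Hypothetical three-column frame used only for illustration
toy = pd.DataFrame({"age": [75.0, 55.0], "time": [4, 6], "death_event": [1, 1]})

toy_clean = toy.drop(columns=["time"])  # returns a copy without the "time" column
print(toy_clean.columns.tolist())
```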

In [None]:
"""
Write your code here
"""

# remove the time column


### 3) Specify Features and Target [10 points]
 * Define **`X`** as a dataframe consisting of all columns except the last one.  These will be our features [4 points].  
 * Then define target **`y`** as the last column containing mortality data [3 points].
 * Display the shape of your original dataframe `df` followed by `X` and `y` to verify that you did this correctly [3 points].
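
Positional indexing with `.iloc` is one way to separate the last column from the rest. The sketch below uses a made-up frame (`toy`, `X_toy`, and `y_toy` are hypothetical names) and prints the three shapes so the split can be checked.

```python
import pandas as pd

# Hypothetical frame: two feature columns followed by the target column
toy = pd.DataFrame({
    "age": [75.0, 55.0, 65.0],
    "smoking": [0, 0, 1],
    "death_event": [1, 1, 0],
})

X_toy = toy.iloc[:, :-1]  # every column except the last (a DataFrame)
y_toy = toy.iloc[:, -1]   # the last column only (a Series)
print(toy.shape, X_toy.shape, y_toy.shape)
```

The features should have one fewer column than the original frame, and the target should be one-dimensional with the same number of rows.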

In [None]:
"""
Write your code here
"""

### 4) Data Exploration [15 points]

We can look at the relationships between clinical features to get a general idea of how they relate to each other. The titles of the columns represent the clinical features.  
  * The classes are unbalanced, as there are fewer fatal cases than non-fatal ones. Report the percentage of cases that were fatal, rounded to 1 decimal place [5 points].
  * Plot a bar graph of the correlations of each feature with the target [10 points].
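
A sketch of both calculations on made-up data: for a 0/1 target, the mean is the fraction of positive cases, and `DataFrame.corrwith` gives the Pearson correlation of each column with a chosen Series. The frame `toy` and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical data: two features and a binary target
toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feature_b": [5.0, 3.0, 4.0, 1.0, 2.0],
    "target":    [0, 0, 0, 1, 1],
})

# Mean of a 0/1 column is the fraction of positives
pct_positive = round(100 * toy["target"].mean(), 1)

# Pearson correlation of each feature column with the target
corr_with_target = toy.drop(columns=["target"]).corrwith(toy["target"])
print(pct_positive)
print(corr_with_target)
```

Calling `corr_with_target.plot.bar()` on the resulting Series is one way to turn it into a bar graph.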

In [None]:
"""
Write your code here
"""

### 5) Standardization [10 points]

Numerical columns in **`X`** have very different magnitudes. When training a model, columns with larger magnitudes can dominate the learning process and bias the model towards those features. Therefore, **standardization** of features is needed to produce a less biased model. After standardization, each column has **zero mean** (mean or average = 0) and **unit variance** (variance = 1).  This is equivalent to converting each feature's values to their corresponding **z-scores**.

* Use sklearn's `StandardScaler` to standardize these columns. Then display the first 10 rows of the results of standardization.
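
The following sketch, on a small made-up frame, shows that `StandardScaler` computes the z-score $(x - \mu)/\sigma$ column by column, where $\sigma$ is the population standard deviation (`ddof=0`). The values and the name `toy` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical columns with very different magnitudes
toy = pd.DataFrame({
    "platelets": [265000.0, 162000.0, 327000.0, 204000.0],
    "serum_creatinine": [1.9, 1.3, 2.7, 2.1],
})

scaler = StandardScaler()
# fit_transform returns a NumPy array, so wrap it to keep the column names
toy_std = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns, index=toy.index)

# The same z-scores computed by hand (population standard deviation)
manual = (toy - toy.mean()) / toy.std(ddof=0)
print(np.allclose(toy_std, manual))
print(toy_std.mean().round(6), toy_std.std(ddof=0).round(6))
```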

In [None]:
from sklearn.preprocessing import StandardScaler

"""
Write your code here
"""

### 6) Train-Test Split [5 points]

Now, let's perform a **train-test** split on the data.


* Use **`train_test_split`** to perform a train-test split on `X` and `y`. Set the `test_size` to 0.5. Since the classes are unbalanced, we recommend that you use the **`stratify`** option to ensure that fatal cases are evenly distributed between the train and test sets.
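
To illustrate what `stratify` does, the sketch below splits a made-up unbalanced dataset in half and compares the fraction of positive labels in each half. The arrays and the `random_state=0` seed are assumptions added for reproducibility, not part of the assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 30% of them in the positive class
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([1] * 6 + [0] * 14)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.5,
    stratify=y_toy,    # keep the class proportions the same in both halves
    random_state=0,    # assumed seed so the split is repeatable
)
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, a random split of a small unbalanced dataset can leave one half with noticeably fewer positive cases than the other.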


In [None]:
from sklearn.model_selection import train_test_split
"""
Write your code here
"""

### 7) Train model [10 points]

We are now ready to train our model.  
* We have imported `LogisticRegression` from `sklearn.linear_model`. Train a logistic regression model on the **training** data [5 points].  
* Then report the accuracy of this model on the **testing** data using `accuracy_score` [5 points].
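
The fit-then-score pattern can be sketched on synthetic data as follows. The dataset comes from `make_classification`, and the names `X_syn`, `y_syn`, and the seeds are illustrative assumptions; only the training half is used for fitting, and accuracy is measured on the held-out half.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (not the heart failure data)
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.5, stratify=y_syn, random_state=0
)

model = LogisticRegression()   # default regularization, C=1
model.fit(X_tr, y_tr)          # learn coefficients from the training half only

y_hat = model.predict(X_te)    # predicted class labels for unseen samples
acc = accuracy_score(y_te, y_hat)
print(f"test accuracy: {acc:.3f}")
```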

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
"""
Write your code here
"""

### 8) Refine Hyperparameter [15 points]
A **hyperparameter** is a constant value that is specified before optimization.
One of the primary hyperparameters is the regularization constant $\lambda = \frac{1}{C}$. By default, `LogisticRegression` uses $C=1$.

 * Refine this model by estimating an optimal value for $C = \frac{1}{\lambda}$ using **k-fold Cross Validation** on the **training data**.  The class [StratifiedKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) from `sklearn.model_selection` splits the training data into $k$ folds that each preserve the class proportions, so the model can be trained on $k-1$ folds and validated on the remaining one. Use $k=5$ splits (folds) and several values for $C \in [0.001, 1000]$.  Calculate the mean accuracy for each $C$ value across the folds [10 points].
 * The best $C$ value is the one with the highest mean accuracy.  Display this value [5 points].
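
A minimal sketch of this search on synthetic data is given below. The log-spaced grid of seven $C$ values, `shuffle=True`, the seeds, and `max_iter=1000` are all assumptions chosen for illustration; other grids within $[0.001, 1000]$ would work equally well.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the training data
X_syn, y_syn = make_classification(n_samples=150, n_features=5, random_state=0)

C_values = np.logspace(-3, 3, 7)  # 0.001, 0.01, ..., 1000
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

mean_acc = []
for C in C_values:
    fold_acc = []
    for train_idx, val_idx in skf.split(X_syn, y_syn):
        model = LogisticRegression(C=C, max_iter=1000)
        model.fit(X_syn[train_idx], y_syn[train_idx])        # train on 4 folds
        y_hat = model.predict(X_syn[val_idx])                # validate on the 5th
        fold_acc.append(accuracy_score(y_syn[val_idx], y_hat))
    mean_acc.append(np.mean(fold_acc))

best_C = C_values[int(np.argmax(mean_acc))]  # first C with the highest mean accuracy
print(dict(zip(C_values, np.round(mean_acc, 3))), best_C)
```

When the data are pandas objects rather than NumPy arrays, the fold indices are positional, so rows would be selected with `.iloc[train_idx]` instead of plain brackets.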

In [None]:
from sklearn.model_selection import StratifiedKFold
"""
Write your code here
"""

### 9) Retrain Model [10 points]
 * Retrain your model on the **training** data using your best regularization hyperparameter $C$ from the previous step [5 points].
 * Then report the accuracy of this model on the **testing** data using `accuracy_score` [5 points].

In [None]:
"""
Write your code here
"""

### 10) Logistic Regression Coefficients [10 points]
The coefficient vector $\hat \beta$ from the model describes how each feature in $X$ relates to the target $y$. If $\hat \beta_k > 0$, then the $k$th feature is predicted to increase the likelihood of death, while $\hat \beta_k < 0$ means it is predicted to decrease that likelihood.
 * Use the model's `.coef_[0]` attribute to make a bar graph of the coefficient values for each feature.
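
One way to draw such a graph is sketched below on synthetic data. In scikit-learn the fitted attribute is named `coef_` (with a trailing underscore) and has shape `(1, n_features)` for a binary problem, so `coef_[0]` is the vector of per-feature coefficients. The feature names `f0` to `f3` are hypothetical placeholders.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic features with placeholder column names
X_arr, y_syn = make_classification(n_samples=200, n_features=4, random_state=0)
X_syn = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

model = LogisticRegression().fit(X_syn, y_syn)

# Pair each coefficient with its feature name
coefs = pd.Series(model.coef_[0], index=X_syn.columns)

fig, ax = plt.subplots()
ax.bar(coefs.index, coefs.values)
ax.axhline(0, color="black", linewidth=0.8)  # reference line separating signs
ax.set_ylabel("coefficient")
ax.tick_params(axis="x", rotation=45)
# In a notebook, the figure is rendered when the cell finishes running.
```

Because the coefficients are only comparable across features when the features share a scale, this kind of plot is most informative after standardization.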

In [None]:
"""
Write your code here
"""