# Names (Enter your names below)
**Your Name and JHED:** ...

**Partner's Name and JHED (If applicable):**  ...

# Lab 3: Prediction of Septic Shock in Patients

By **Benjamín Béjar Haro** and edited by **Kwame Kutten** and **Joseph Greenstein**

Sepsis is a life-threatening condition caused by an inflammatory immune response to an infection. It is the leading cause of death in hospitals and has a greater risk of mortality in its advanced state, also called *Septic Shock*. Early treatment of Septic Shock can dramatically increase the survival rate. Therefore, a prediction system capable of foreseeing Septic Shock would provide an early intervention window that has the potential to translate into improved patient outcomes. In this lab we look at the problem of Septic Shock prediction following the approach described in [Liu et al. 2019](https://doi.org/10.1038/s41598-019-42637-5). Your goal in this lab is to reproduce some of the results in the above paper. In particular, you will train a logistic regression model (referred to as GLM in the paper) to predict Septic Shock and will apply it to a test patient dataset.

In [None]:
# Import modules
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import datetime as date
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import KFold

# Fix random number generator for reproducibility
np.random.seed(0)

## Read Data
You have been provided with the curated data used in Liu *et al.*, which is a subset of the publicly available MIMIC-III database ([Johnson et al. 2016](https://doi.org/10.1038/sdata.2016.35)). The data corresponds to electronic health record data of a large population of patients, and consists of measured values over time for $28$ different features such as heart rate, blood pressure, respiratory rate, temperature, etc. Each data point represents a particular measurement in time for a particular patient. The data has been split into training and testing sets as described in Liu *et al.* and is provided to you in the form of `.csv` files. Inside those files, `x` columns correspond to feature values while the `y` column represents the label associated with a particular row of feature values. A label of $y=0$ means that the patient did not go into Septic Shock, while a label of $y=1$ indicates that the patient eventually went into Septic Shock.

In [None]:
#===========================================
# Read data. Change path if necessary
#===========================================
try:
    # Executes if running in Google Colab
    from google.colab import drive
    drive.mount('gdrive/')
    path = "gdrive/My Drive/" # Change path to location of data if necessary
except ImportError:
    # Executes if running locally (e.g. Anaconda)
    path = "./"

# Read training data
traindata = pd.read_csv('/'.join((path,'glm.training.data.csv')))
Xtrain = traindata.iloc[:,1:29].values # Rows are measurements (one patient at one time), columns are clinical indicators
ytrain = traindata.iloc[:,-1].values

# Read testing data
testdata = pd.read_csv('/'.join((path,'glm.test.data.csv')))
Xtest = testdata.iloc[:,1:29].values
ytest = testdata.iloc[:,-1].values

## 1. Normalize Data [5 points]
Normalize the training and test data such that each column has zero mean and unit standard deviation.  Then for both the training and test data use `np.isclose` to verify that the means and standard deviations of the normalized data are correct.
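
As a minimal sketch of the idea on synthetic data (not the lab data), the snippet below standardizes each column with its own mean and standard deviation and then checks the result with `np.isclose`. The helper name `zscore_columns` and the array `X_demo` are hypothetical, and the sketch assumes the population standard deviation (`ddof=0`) and that no column is constant.

```python
import numpy as np

def zscore_columns(X):
    """Return a copy of X with each column shifted to zero mean and scaled to unit std."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population standard deviation (ddof=0); assumes std > 0
    return (X - mean) / std

# Synthetic stand-in for a feature matrix (rows: measurements, columns: features)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 4))

X_norm = zscore_columns(X_demo)

# Verify that every column has mean close to 0 and standard deviation close to 1
means_ok = np.isclose(X_norm.mean(axis=0), 0.0).all()
stds_ok = np.isclose(X_norm.std(axis=0), 1.0).all()
print(means_ok, stds_ok)
```

Normalizing each dataset with its own statistics is what makes both checks pass exactly. If the test data were instead scaled with the training statistics, its column means and standard deviations would only be approximately 0 and 1.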

In [None]:
"""
Write your code here
"""

## 2. Train Generalized Linear (Logistic Regression) Model [15 points total]
 * Train a logistic regression model on the normalized data using [Stochastic Gradient Descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent). You should use the [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) from `sklearn.linear_model` with a `"log_loss"` loss, `"balanced"` class weights and an `"l1"` (lasso) penalty. Specify a value for the regularization parameter $\alpha \in (0,1]$ [10 points].
 * Plot the ROC curve and display the AUC [5 points].

In [None]:
"""
Write your code here
"""

## 3. Hyperparameter Tuning [20 points total] 
 * Refine this model by determining an optimal regularization hyperparameter $\alpha > 0$ which maximizes AUC via **5-fold cross validation** similar to Liu *et al.* You should use [KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) from `sklearn.model_selection` to split your training dataset into smaller chunks that you feed to the SGDClassifier. Try several values of $\alpha \in (0,1]$. To save time you may use a small number of iterations (e.g. 5) in this step for your SGDClassifier [15 points]. 
 * Then display your optimal $\alpha$ [5 points].

In [None]:
"""
Write your code here
"""

## 4. Retrain Model [20 points total]
 * Retrain your model using your optimal regularization hyperparameter $\alpha$ from the previous step [5 points].
 * Plot the ROC curve and display its AUC [5 points].
 * Find the **Operating Point**, the point on the ROC curve that gives the best trade-off between TPR and FPR, and add it to the plot [5 points].
 * Display the training data accuracy at the operating point [5 points].
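
One common way to pick an operating point, assumed here because the lab does not fix a criterion, is to take the ROC point closest to the ideal corner (FPR $=0$, TPR $=1$). The sketch below applies that rule to synthetic scores and then computes the accuracy at the resulting threshold. Youden's $J$ statistic (maximizing TPR $-$ FPR) is a frequently used alternative.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic labels and scores: positives receive higher scores on average
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=300)
scores = y_demo + rng.normal(scale=0.8, size=300)

fpr, tpr, thresholds = roc_curve(y_demo, scores)

# Euclidean distance from each ROC point to the ideal corner (FPR=0, TPR=1)
dist = np.sqrt(fpr**2 + (1.0 - tpr)**2)
op_idx = int(np.argmin(dist))
op_threshold = thresholds[op_idx]

# roc_curve treats a row as positive when its score is >= the threshold
y_pred = (scores >= op_threshold).astype(int)
accuracy = np.mean(y_pred == y_demo)
print(f"threshold = {op_threshold:.3f}, accuracy = {accuracy:.3f}")
```

The operating point can be marked on the ROC plot at the coordinates `(fpr[op_idx], tpr[op_idx])`.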

In [None]:
"""
Write your code here
"""

## 5. Plot Feature Weights [10 points]
The exponentiated coefficients from our model tell us how much each feature is weighted when making a prediction.  Find the weights by exponentiating the coefficients from your model (get the coefficients from the `coef_` attribute). Plot a bar graph of these weights with their corresponding names.  Your results should be similar to Figure 3 in Liu *et al.*
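
In logistic regression, $e^{w_j}$ is the factor by which the odds of the positive class are multiplied when feature $j$ increases by one unit (one standard deviation for normalized data), so a weight of $1$ means the feature has no effect. The sketch below illustrates this with hypothetical names and coefficients rather than values from a fitted model.

```python
import numpy as np

# Hypothetical feature names and coefficients, laid out with shape
# (1, n_features) like SGDClassifier.coef_ for a binary problem
names = np.array(["feature_a", "feature_b", "feature_c", "feature_d"])
coef = np.array([[0.7, 0.0, -0.4, 0.1]])

# Odds ratio per one-unit increase of each feature
weights = np.exp(coef.ravel())

# List features from the largest weight to the smallest
order = np.argsort(weights)[::-1]
for name, w in zip(names[order], weights[order]):
    print(f"{name}: {w:.3f}")
```

Features removed by the L1 penalty have a coefficient of $0$ and therefore a weight of exactly $1$. The arrays `names` and `weights` can be passed to `plt.bar` to draw the bar graph.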

In [None]:
featureNames = traindata.iloc[:,1:29].keys() # Names of features

"""
Write your code here
"""

## 6. Test Model on Patient Data [30 points total]
 * Use the patient column from the test dataset to determine the number of patients in our test dataset.  Use the test labels to create an array of this size which is $1$ if the patient went into Septic Shock *at any time* during their hospital stay and $0$ otherwise.  The model predicts that a patient goes into Septic Shock if their maximum probability (risk score) attained over their hospital stay exceeds some operating threshold.  Therefore you should also create a corresponding array which contains these maximum probabilities [15 points].  

 * Create an ROC curve using these arrays and display the AUC [5 points].
 * Find the operating point and add it to the plot [5 points].
 * Display the accuracy for test patients at the operating point [5 points].
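
The per-patient aggregation described in the first bullet can be sketched with a toy example. The patient IDs, row labels, and row probabilities below are made up for illustration. The sketch uses `np.unique` with `return_inverse=True` to map every row to a patient index and `np.maximum.at` to take a running maximum per patient.

```python
import numpy as np

# Toy row-level data: one entry per measurement, several rows per patient
patient_ids = np.array([101, 101, 101, 202, 202, 303, 303, 303])
row_labels = np.array([0, 0, 1, 0, 0, 0, 1, 1])
row_probs = np.array([0.10, 0.35, 0.80, 0.20, 0.15, 0.40, 0.70, 0.90])

# inverse[i] is the index into unique_ids of the patient that row i belongs to
unique_ids, inverse = np.unique(patient_ids, return_inverse=True)
n_patients = unique_ids.size

# Patient label is 1 if any of that patient's rows is labelled 1
patient_label = np.zeros(n_patients, dtype=int)
np.maximum.at(patient_label, inverse, row_labels)

# Patient risk score is the maximum predicted probability over the stay
patient_risk = np.zeros(n_patients)
np.maximum.at(patient_risk, inverse, row_probs)

print(n_patients, patient_label, patient_risk)
```

The resulting `patient_label` and `patient_risk` arrays have one entry per patient and can be passed to `roc_curve` in place of the row-level labels and scores.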

In [None]:
patientsCol = testdata['patient'].values # Patients column from test dataset

"""
Write your code here
"""