# Problem Set 4: Physiological Signals

In this problem set, we will explore how to work with multivariate time series data of vital signs from MIMIC-III.

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt

import pandas as pd
import numpy as np
import pickle as pk

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
np.random.seed(7)



# 1. Vital signals are vital

We have provided vitals information from the MIMIC-III dataset. In particular, we have provided patients' readings from the first 24 hours for the following vitals:
 - Heart rate in bpm 
 - Respiratory rate in breaths / minute 
 - Pulse oximetry 
 - Mean arterial pressure
 - Systolic blood pressure in mmHg 
 - Diastolic blood pressure in mmHg 
 - Glucose reading
 - Temperature

We have aggregated each vital signal into its min/max/mean for each hour. Because not all vital signs are measured at the same rate, a vital that was not measured in a given hour is recorded as NaN. If no vital signs at all were measured in an hour, the patient ICU stay has no row for that hour.
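To make the layout concrete, here is a minimal sketch of this kind of hourly aggregation on toy data. The column names (`icustay_id`, `hour`, `heartrate`, `glucose`) are assumptions for illustration and may not match those in vitals_24h.csv.

```python
import numpy as np
import pandas as pd

# Toy raw measurements. Column names are hypothetical, not taken from vitals_24h.csv.
raw = pd.DataFrame({
    "icustay_id": [1, 1, 1, 1, 2],
    "hour":       [0, 0, 1, 3, 0],
    "heartrate":  [80.0, 90.0, 85.0, 70.0, 100.0],
    "glucose":    [np.nan, 110.0, np.nan, np.nan, 95.0],
})

# One row per (stay, hour) with min/max/mean of every vital.
hourly = raw.groupby(["icustay_id", "hour"]).agg(["min", "max", "mean"])

# Flatten the (vital, statistic) column MultiIndex into names like heartrate_min.
hourly.columns = ["_".join(col) for col in hourly.columns]
hourly = hourly.reset_index()

print(hourly)
```

In this toy table, stay 1 has no glucose reading in hour 1, so its glucose columns are NaN for that row, and it has no row at all for hour 2 because nothing was measured then.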


In [381]:
df = pd.read_csv('vitals_24h.csv')

## 1.1 

How many rows are in vitals_24h.csv? For each of the vital types, how many readings are there per patient ICU stay on average? Plot the distribution of values across patient ICU stays for each vital sign using histograms with 40 bins.

In [2]:
# TODO: 1.1

## 1.2 

We want to model patient ICU stays for which we have hourly readings for the first 24 hours. Measurements that were not recorded have been marked as null.

 - How many patient ICU stays are included in this dataset in any form?
 - How many patient ICU stays show up with 24 hours of measurements, even if some of the hours have some missing values? 
 - How many patient ICU stays have 24 hours of measurements with no missing values? 
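The three counts above differ only in how strictly a stay is filtered. A minimal sketch on toy data, assuming hypothetical column names `icustay_id` and `hour` and using 3 hours in place of 24 to keep the table small:

```python
import numpy as np
import pandas as pd

N_HOURS = 3  # stands in for 24 in this toy example

# Hypothetical column names; stay 2 has a missing value, stay 3 is missing hour 1.
toy = pd.DataFrame({
    "icustay_id":     [1, 1, 1, 2, 2, 2, 3, 3],
    "hour":           [0, 1, 2, 0, 1, 2, 0, 2],
    "heartrate_mean": [80, 82, 81, 90, np.nan, 88, 70, 72],
})

# Stays present in any form.
n_stays = toy["icustay_id"].nunique()

# Stays with a row for every hour, regardless of NaNs inside the rows.
hours_per_stay = toy.groupby("icustay_id")["hour"].nunique()
full_length = hours_per_stay[hours_per_stay == N_HOURS].index.tolist()

# Stays with a row for every hour and no NaNs in any row.
clean_hours = toy.dropna().groupby("icustay_id")["hour"].nunique()
complete = clean_hours[clean_hours == N_HOURS].index.tolist()

print(n_stays, full_length, complete)
```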


In [3]:
# TODO: 1.2

## 1.3 

In order to increase the number of included patient ICU stays, we want to keep vital readings that appear >80% of the time over all rows in vitals_24h.csv. Which vitals would we keep? Which vitals would we remove?
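One way to measure how often a column is present is the fraction of non-null rows. The sketch below uses toy data with hypothetical column names, so the resulting split says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Toy hourly table; column names and values are made up for illustration.
toy = pd.DataFrame({
    "heartrate_mean": [80, 82, 81, 90, 88, 70, 72, 75, 77, 79],
    "glucose_mean":   [np.nan, 110, np.nan, np.nan, 95,
                       np.nan, np.nan, 100, np.nan, np.nan],
})

# Fraction of rows in which each column has a value.
coverage = toy.notna().mean()

keep = coverage[coverage > 0.8].index.tolist()
drop = coverage[coverage <= 0.8].index.tolist()

print(coverage)
print("keep:", keep, "drop:", drop)
```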


In [4]:
# TODO: 1.3

## 1.4 

For the rest of the problem set, we will keep vitals that appear >80% of the time across all rows, but that is only one way to handle missing data. Besides throwing out rows that are incomplete, what are three other ways we can impute missing data for multivariate time series modeling?

# 2. LSTM time

Let's build an LSTM!

## 2.1 

We are interested in using this time series information to model patient care. Keep vitals that appear in >80% of the rows. **How many patient ICU stays are in the dataset?** 

We want to analyze patient ICU stays that last longer than 48 hours by predicting from the first 24 hours, but vitals_24h.csv includes vitals from all patients in the MIMIC dataset. Join with demo.csv. **How many patients are left in the dataset now?**

In [5]:
# TODO: 2.1

Create a data tensor of shape: [n_samples, n_timesteps, n_features] for data labeled train, valid, and test. Data should be scaled using MinMaxScaler from sklearn using feature_range=(0,1) across the entire dataset before being separated into train, valid, and test. 

You may find it helpful to create column names of the form: t00_heartrate_min, t00_heartrate_max, …, t23_heartrate_max, …

We often find that sex and age are predictive features. Encode sex as an “is_female” feature. Include age and sex in the tensor: t00_is_female, t00_age, …, t23_is_female, t23_age. Note that t00_age, …, t23_age will all be the same for a given patient ICU stay, and likewise for the sex features. These features can be found in demo.csv.
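The following is a minimal sketch of one way to go from a long hourly table to a scaled tensor. It assumes every stay has a row for every hour with no missing values, uses 3 hours in place of 24, and invents the column names (`icustay_id`, `hour`, `age`, `gender`) and the "F"/"M" coding, which may differ from demo.csv.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

N_HOURS = 3  # stands in for 24 in this toy example
stays = [1, 2]

# Hypothetical hourly table: one row per (stay, hour), no missing values.
hourly = pd.DataFrame({
    "icustay_id":     np.repeat(stays, N_HOURS),
    "hour":           np.tile(np.arange(N_HOURS), len(stays)),
    "heartrate_mean": [80.0, 85.0, 90.0, 60.0, 70.0, 100.0],
})

# Hypothetical demographics table; the "F"/"M" coding is an assumption.
demo = pd.DataFrame({"icustay_id": stays, "age": [40.0, 70.0], "gender": ["F", "M"]})
demo["is_female"] = (demo["gender"] == "F").astype(float)

# Merging repeats the static features on every hourly row of a stay.
merged = hourly.merge(demo[["icustay_id", "age", "is_female"]], on="icustay_id")
merged = merged.sort_values(["icustay_id", "hour"])

features = ["heartrate_mean", "age", "is_female"]

# Scale each feature to [0, 1] over the whole dataset, as the problem asks.
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(merged[features])

# Rows are ordered by stay and then hour, so a plain reshape yields
# [n_samples, n_timesteps, n_features].
X = scaled.reshape(len(stays), N_HOURS, len(features))
print(X.shape)
```

The reshape is only valid because the rows are sorted by stay and hour and every stay contributes exactly `N_HOURS` rows.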

In [6]:
# TODO: create data tensor

## 2.2 

Build an LSTM using Keras for multivariate time series classification. Use an LSTM with:
 - 20 units
 - a dense layer with sigmoid activation
 - binary cross-entropy loss
 - Adam optimizer
 - Dropout of 0.2
 - 100 epochs
 - Batch size of 128

You may find these guides helpful:
 - [Multivariate time series forecasting for regression](https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/)
 - [Sequence classification using LSTMs](https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/)

What are the train loss, accuracy, and AUC? Experiment with Dropout = [0.0, 0.2, 0.5, 0.8] by comparing the AUC on the validation data. What Dropout value yields the best AUC on the validation data? Using this best Dropout value, what is the corresponding test AUC?
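As a starting point, the specification above could be wired up roughly as follows. This is a sketch under assumptions: it places dropout as a separate layer after the LSTM (the `dropout` argument of `LSTM` is another reasonable reading), it uses random data in place of the real tensor, and it trains for 2 epochs rather than 100 to keep the example quick. `build_lstm` is a hypothetical helper name.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Input, LSTM, Dense, Dropout
from sklearn.metrics import roc_auc_score


def build_lstm(n_timesteps, n_features, dropout=0.2):
    """Hypothetical helper: LSTM(20) -> Dropout -> Dense(1, sigmoid)."""
    model = Sequential()
    model.add(Input(shape=(n_timesteps, n_features)))
    model.add(LSTM(20))
    model.add(Dropout(dropout))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model


# Random stand-in data with shape [n_samples, n_timesteps, n_features].
rng = np.random.RandomState(7)
X_train = rng.rand(64, 24, 5).astype("float32")
y_train = rng.randint(0, 2, size=64)
y_train[:2] = [0, 1]  # make sure both classes are present for the AUC

model = build_lstm(n_timesteps=24, n_features=5, dropout=0.2)
model.fit(X_train, y_train, epochs=2, batch_size=128, verbose=0)

# AUC is computed from predicted probabilities, not thresholded labels.
train_prob = model.predict(X_train, verbose=0).ravel()
train_auc = roc_auc_score(y_train, train_prob)
print(train_auc)
```

To compare dropout values, the same helper can be called once per value and each fitted model scored with `roc_auc_score` on the validation set.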

In [7]:
# TODO: 2.2

## 2.3

Select two patient ICU stays that the LSTM predicts will end in patient mortality and two that it predicts will NOT end in patient mortality. Plot any vital features that are particularly meaningful over time and describe any noticeable differences.

# 3. Linear baseline

Whenever we use more sophisticated deep learning models, we want to compare against a simpler baseline to make sure the computational effort is worth it. 


## 3.1

How well does Logistic Regression perform with just the raw values from 2.2? You should use a matrix of shape [n_samples, n_timesteps * n_features], which is an unrolled version of the data tensor from earlier. You can keep age and sex in their time-multiplied form or only include them once. 
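Unrolling the tensor is a single reshape. The sketch below uses a small synthetic tensor to show where each (hour, feature) pair lands in the flattened matrix.

```python
import numpy as np

n_samples, n_timesteps, n_features = 4, 24, 3

# Synthetic tensor of shape [n_samples, n_timesteps, n_features].
X = np.arange(n_samples * n_timesteps * n_features, dtype=float).reshape(
    n_samples, n_timesteps, n_features)

# Unroll time and features into one axis.
X_flat = X.reshape(n_samples, n_timesteps * n_features)

# With NumPy's default row-major order, feature f at hour t ends up
# in column t * n_features + f.
t, f = 5, 1
print(X_flat.shape, X_flat[2, t * n_features + f] == X[2, t, f])
```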


In [8]:
# TODO: 3.1

## 3.2 

Tune the hyperparameters over C=[0.01, 0.1, 0.5, 1.0, 5.0] and penalty=['l1', 'l2'] by comparing the AUC on the validation data. What are the best hyperparameters on the validation data? What is the test AUC using these best hyperparameters? How does this compare to the results from 2.2?
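A plain loop over the grid is enough for a search this small. The sketch below runs on synthetic data and assumes the `liblinear` solver, chosen here because it supports both the l1 and l2 penalties; the data, split sizes, and variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the unrolled feature matrix and binary labels.
rng = np.random.RandomState(7)
X = rng.rand(300, 12)
y = (X[:, 0] + 0.3 * rng.randn(300) > 0.5).astype(int)

X_train, y_train = X[:200], y[:200]
X_valid, y_valid = X[200:], y[200:]

# Validation AUC for every (C, penalty) combination.
results = {}
for penalty in ["l1", "l2"]:
    for C in [0.01, 0.1, 0.5, 1.0, 5.0]:
        clf = LogisticRegression(C=C, penalty=penalty, solver="liblinear")
        clf.fit(X_train, y_train)
        valid_prob = clf.predict_proba(X_valid)[:, 1]
        results[(C, penalty)] = roc_auc_score(y_valid, valid_prob)

best_params = max(results, key=results.get)
print(best_params, results[best_params])
```

Only the validation split is used to pick `best_params`, so the test split stays untouched until the final evaluation.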

In [9]:
# TODO: 3.2