## Mortality Model Using MIMIC ICU Data*

This is an open-ended case study. Your goal is to develop a model that predicts mortality for ICU patients, as recorded in the `hospital_expire_flag` column. The model will be judged on two criteria:

- the `roc_auc_score` of your predictions: this measures how well your model ranks the cases from riskiest to least risky.
- the `log_loss` of your predictions: this measures how accurate your predicted probabilities of mortality are.
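
The two criteria reward different things, and a toy comparison makes this concrete. The labels and probabilities below are made up for illustration and are not drawn from the MIMIC table. Both sets of predictions order the patients identically, so they should get the same AUC, but the second set shrinks every probability toward zero and is penalized by log loss.

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

# Made-up labels and probabilities, purely for illustration.
y_true = np.array([0, 0, 1, 1])
p_a = np.array([0.10, 0.40, 0.35, 0.80])
# Same ordering as p_a, but every probability is ten times smaller.
p_b = np.array([0.01, 0.04, 0.035, 0.08])

auc_a, auc_b = roc_auc_score(y_true, p_a), roc_auc_score(y_true, p_b)
ll_a, ll_b = log_loss(y_true, p_a), log_loss(y_true, p_b)

# AUC depends only on the ranking, so it is the same for both.
print(f"AUC:      {auc_a:.3f} vs {auc_b:.3f}")
# Log loss depends on the probabilities themselves; p_b underestimates the deaths.
print(f"log loss: {ll_a:.3f} vs {ll_b:.3f}")
```

A model can therefore do well on one criterion and poorly on the other, which is why both are reported.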


Work through roughly the following steps:

1) Build a train/test split using the `random_state=42` given in the code below.

2) Explore the data. Write at least one reusable function that helps you explore it. Many columns have missing values, so you will have to decide how to handle them.

3) Build a simple "baseline" model. See how well you can predict mortality with just 3-5 variables. Build one baseline with Logistic Regression and another with a Random Forest.

4) Build a more complicated model. Use more variables and more sophisticated methods, and see how much you can improve on your baseline. Use at least one Penalized Logistic Regression model and one Random Forest or Gradient Boosted model.

5) "Engineer" at least one feature and demonstrate how it improves your model.
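
As a hint for step 5, here is a minimal sketch of what an engineered feature might look like. The function name `add_engineered_features` and the three derived columns are illustrative assumptions, not a required answer; the input column names come from the table loaded below, and the two demo rows reuse values from its first two rows.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df):
    """Return a copy of df with a few derived columns (illustrative choices only)."""
    out = df.copy()
    # Shock index: mean heart rate divided by mean systolic blood pressure.
    # A zero pressure is treated as missing to avoid dividing by zero.
    out["shock_index"] = out["heartrate_mean"] / out["sysbp_mean"].replace(0, np.nan)
    # Spread between the highest and lowest lactate values.
    out["lactate_range"] = out["lactate_max"] - out["lactate_min"]
    # Flag for whether lactate was recorded at all.
    out["lactate_missing"] = out["lactate_max"].isna().astype(int)
    return out


# Two rows copied from the head() output of the table.
demo_rows = pd.DataFrame({
    "heartrate_mean": [92.5, 83.6],
    "sysbp_mean": [159.375, 126.136364],
    "lactate_min": [1.9, np.nan],
    "lactate_max": [2.7, np.nan],
})
engineered = add_engineered_features(demo_rows)
print(engineered[["shock_index", "lactate_range", "lactate_missing"]])
```

To demonstrate an improvement, fit the same model with and without the new columns and compare `roc_auc_score` and `log_loss` on the test set.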


*MIMIC-III, a freely accessible critical care database. Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. Scientific Data (2016).
https://mimic.physionet.org


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss, roc_auc_score

pd.set_option("display.max_columns", 200)

In [2]:
# Load dataset derived from the MIMIC database

lab_aug_df = pd.read_csv("data/lab_vital_icu_table.csv")

In [3]:
lab_aug_df.head(10)

Unnamed: 0,subject_id,hadm_id,icustay_id,aniongap_min,aniongap_max,albumin_min,albumin_max,bicarbonate_min,bicarbonate_max,bilirubin_min,bilirubin_max,creatinine_min,creatinine_max,chloride_min,chloride_max,hematocrit_min,hematocrit_max,hemoglobin_min,hemoglobin_max,lactate_min,lactate_max,platelet_min,platelet_max,potassium_min,potassium_max,ptt_min,ptt_max,inr_min,inr_max,pt_min,pt_max,sodium_min,sodium_max,bun_min,bun_max,wbc_min,wbc_max,subject_id.1,hadm_id.1,icustay_id.1,gender,admittime,dischtime,los_hospital,age,ethnicity,admission_type,hospital_expire_flag,hospstay_seq,first_hosp_stay,intime,outtime,los_icu,icustay_seq,first_icu_stay,subject_id.2,hadm_id.2,icustay_id.2,heartrate_min,heartrate_max,heartrate_mean,sysbp_min,sysbp_max,sysbp_mean,diasbp_min,diasbp_max,diasbp_mean,meanbp_min,meanbp_max,meanbp_mean,resprate_min,resprate_max,resprate_mean,tempc_min,tempc_max,tempc_mean,spo2_min,spo2_max,spo2_mean
0,9,150750,220597,13.0,13.0,,,26.0,30.0,0.4,0.4,1.2,1.4,100.0,103.0,37.4,45.2,12.9,15.4,1.9,2.7,249.0,258.0,2.8,3.0,21.7,21.7,1.1,1.1,12.7,12.7,136.0,140.0,16.0,17.0,7.5,13.7,9,150750,220597,M,2149-11-09 13:06:00,2149-11-14 10:15:00,5.0,41.7887,UNKNOWN/NOT SPECIFIED,EMERGENCY,1,1,Y,2149-11-09 13:07:02,2149-11-14 20:52:14,5.0,1,Y,9,150750,220597,82.0,111.0,92.5,106.0,217.0,159.375,53.0,107.0,79.525,67.0,132.0,98.85,14.0,19.0,14.369565,35.500001,37.888887,37.049383,95.0,100.0,97.65
1,13,143045,263738,10.0,14.0,3.9,3.9,23.0,24.0,0.4,0.4,0.5,0.8,106.0,116.0,24.0,35.6,7.9,12.3,,,115.0,216.0,3.0,5.3,30.8,44.1,1.2,1.8,13.3,16.5,137.0,140.0,13.0,18.0,16.6,19.3,13,143045,263738,F,2167-01-08 18:43:00,2167-01-15 15:15:00,7.0,39.864,WHITE,EMERGENCY,0,1,Y,2167-01-08 18:44:25,2167-01-12 10:43:31,4.0,1,Y,13,143045,263738,60.0,124.0,83.6,102.0,151.0,126.136364,53.0,84.0,66.0,73.0,111.0,93.772727,11.0,25.0,15.32,35.944443,37.400002,36.653534,94.0,100.0,97.7
2,20,157681,264490,12.0,12.0,,,21.0,21.0,,,0.8,0.8,108.0,108.0,26.0,35.0,8.5,11.8,,,111.0,132.0,3.6,4.6,31.3,34.2,1.3,1.6,14.1,15.7,137.0,143.0,18.0,18.0,17.5,17.5,20,157681,264490,F,2183-04-28 09:45:00,2183-05-03 14:45:00,5.0,75.8757,WHITE,ELECTIVE,0,1,Y,2183-04-28 15:00:36,2183-04-29 16:13:48,1.0,1,Y,20,157681,264490,67.0,80.0,79.121951,81.0,167.0,127.825,36.0,76.0,54.5,52.0,102.0,75.058333,10.0,27.0,15.404762,35.900002,37.299999,36.545714,95.0,100.0,98.435897
3,28,162569,225559,13.0,13.0,,,23.0,23.0,,,0.9,1.0,109.0,112.0,26.0,41.0,8.6,13.6,,,137.0,150.0,3.8,5.2,31.8,32.0,1.1,1.4,13.0,15.1,136.0,141.0,13.0,17.0,6.9,6.9,28,162569,225559,M,2177-09-01 07:15:00,2177-09-06 16:00:00,5.0,74.3836,WHITE,ELECTIVE,0,1,Y,2177-09-01 09:32:26,2177-09-02 12:28:42,1.0,1,Y,28,162569,225559,74.0,103.0,88.428571,98.0,153.0,121.266667,38.0,61.0,47.933333,55.0,86.0,69.133333,9.0,32.0,16.677419,35.900002,37.700001,37.033333,92.0,100.0,96.419355
4,37,188670,213503,9.0,10.0,,,33.0,35.0,,,0.8,1.0,100.0,103.0,28.9,33.9,9.5,10.3,,,263.0,310.0,3.8,4.0,24.1,24.6,1.1,1.2,13.1,13.3,139.0,143.0,25.0,37.0,10.4,13.9,37,188670,213503,M,2183-08-21 16:48:00,2183-08-26 18:54:00,5.0,68.9269,WHITE,EMERGENCY,0,1,Y,2183-08-23 12:01:45,2183-08-24 15:22:53,1.0,1,Y,37,188670,213503,69.0,91.0,81.0,103.0,144.0,123.035714,33.0,60.0,48.428571,65.333298,82.0,73.29761,15.0,30.0,22.241379,36.833335,38.055556,37.333334,89.0,99.0,96.533333
5,71,111944,211832,13.0,30.0,3.6,4.7,17.0,26.0,0.4,0.5,0.5,0.8,101.0,113.0,31.1,37.3,11.3,13.1,2.9,8.3,215.0,295.0,3.2,3.9,26.1,26.1,1.1,1.1,13.1,13.1,141.0,145.0,4.0,16.0,15.0,27.0,71,111944,211832,F,2164-02-03 22:07:00,2164-02-08 14:00:00,5.0,36.5046,ASIAN,EMERGENCY,0,1,Y,2164-02-03 22:07:49,2164-02-06 18:47:31,3.0,1,Y,71,111944,211832,98.0,137.0,112.444444,94.0,157.0,114.62963,31.0,130.0,61.518519,58.0,139.0,79.222208,13.0,25.0,17.130435,35.722224,37.833332,37.351852,99.0,100.0,99.862069
6,72,156857,239612,18.0,18.0,,,20.0,20.0,3.5,3.5,0.6,0.6,105.0,105.0,40.2,47.1,13.9,15.9,,,160.0,193.0,5.6,5.6,,,,,,,137.0,137.0,22.0,22.0,15.4,18.4,72,156857,239612,M,2163-09-22 23:52:00,2163-09-29 11:40:00,7.0,0.0,BLACK/AFRICAN AMERICAN,NEWBORN,0,1,Y,2163-09-22 23:59:48,2163-09-29 12:21:37,7.0,1,Y,72,156857,239612,113.0,186.0,145.8,,,,,,,,,,,,,,,,,,
7,78,100536,233150,9.0,9.0,2.7,3.1,26.0,26.0,0.8,0.8,0.5,0.5,106.0,106.0,30.6,30.6,10.8,10.8,,,45.0,45.0,3.4,3.4,37.7,37.7,1.2,1.2,13.7,13.7,138.0,138.0,9.0,9.0,1.6,1.6,78,100536,233150,M,2177-02-14 00:16:00,2177-02-17 22:12:00,3.0,48.6253,BLACK/AFRICAN AMERICAN,EMERGENCY,0,1,Y,2177-02-14 04:10:26,2177-02-15 15:54:43,1.0,1,Y,78,100536,233150,56.0,73.0,63.117647,134.0,206.0,165.833333,78.0,119.0,98.777778,98.666702,145.0,121.129705,11.0,24.0,16.764706,36.333334,36.833335,36.577778,96.0,100.0,98.470588
8,88,123010,297289,13.0,18.0,,,19.0,26.0,,,0.7,1.0,102.0,112.0,32.2,37.3,10.7,12.3,5.4,6.3,22.0,161.0,3.3,4.2,28.5,30.0,1.2,1.5,13.4,14.5,137.0,146.0,7.0,10.0,4.8,12.3,88,123010,297289,M,2111-08-29 03:03:00,2111-09-03 14:24:00,5.0,28.6577,BLACK/AFRICAN AMERICAN,EMERGENCY,0,1,Y,2111-08-29 03:04:42,2111-08-30 21:08:09,1.0,1,Y,88,123010,297289,95.0,132.0,107.230769,102.0,160.0,138.807692,42.0,91.0,70.307692,63.0,111.0,91.884615,9.0,45.0,20.352941,35.722224,39.111112,37.810185,99.0,100.0,99.962963
9,95,160891,216431,13.0,17.0,,,23.0,26.0,,,0.9,1.3,105.0,106.0,36.1,40.4,11.7,13.2,,,340.0,425.0,4.0,4.4,21.5,24.8,1.0,1.1,12.2,12.6,141.0,141.0,13.0,22.0,7.1,8.1,95,160891,216431,M,2157-12-25 16:28:00,2157-12-27 15:25:00,2.0,44.1598,BLACK/AFRICAN AMERICAN,EMERGENCY,0,1,Y,2157-12-25 16:29:37,2157-12-26 10:12:30,1.0,1,Y,95,160891,216431,61.0,87.0,69.142857,118.0,134.0,126.0,69.0,87.0,77.928571,85.666702,101.333,93.952386,14.0,20.0,16.363636,35.722224,36.666667,36.305556,96.0,100.0,98.071429


## Test/train split done here

In [4]:
X_full = lab_aug_df.drop(['hospital_expire_flag'], axis=1)
y = lab_aug_df.hospital_expire_flag

X_train_full, X_test_full, y_train, y_test = train_test_split(X_full, y, test_size=.3, random_state=42)

In [5]:
## Verify the split was the same for you and your neighbor
np.mean(X_train_full.age), np.mean(X_test_full.age)

(65.77252070895523, 65.16924826431521)

## Do some Exploratory Data Analysis here

### Make a function that helps your EDA here
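
Here is one possible starting point, a sketch rather than the expected answer. The helper names `missing_summary` and `mortality_by_missingness` are made up for this example. The first reports how much of each column is missing; the second checks whether the outcome rate differs between rows where a value is missing and rows where it is observed.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Per-column missing counts and fractions, most-missing columns first."""
    out = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "frac_missing": df.isna().mean(),
        "dtype": df.dtypes.astype(str),
    })
    return out.sort_values("frac_missing", ascending=False)


def mortality_by_missingness(df, target):
    """For each partially missing column, compare the target rate when missing vs. observed."""
    rows = {}
    for col in df.columns:
        if col == target:
            continue
        is_na = df[col].isna()
        # Skip columns that are fully observed or fully missing.
        if is_na.any() and not is_na.all():
            rows[col] = {
                "n_missing": int(is_na.sum()),
                "rate_missing": df.loc[is_na, target].mean(),
                "rate_observed": df.loc[~is_na, target].mean(),
            }
    return pd.DataFrame.from_dict(rows, orient="index")


# Tiny made-up frame to show the call pattern.
demo = pd.DataFrame({
    "lactate_max": [1.0, np.nan, 3.0, np.nan],
    "age": [50, 60, 70, 80],
    "hospital_expire_flag": [0, 1, 0, 1],
})
print(missing_summary(demo))
print(mortality_by_missingness(demo, "hospital_expire_flag"))
```

In the notebook, the second helper could be called on `X_train_full.assign(hospital_expire_flag=y_train)` so that only training rows are inspected.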

## Build your baseline model here
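
The sketch below shows one way the two baselines could be set up. The five variables in `BASELINE_COLS`, the helper names, and the hyperparameters are all assumptions chosen for illustration. The demo data is synthetic so that the block runs without the CSV, and its scores say nothing about MIMIC.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed choice of five baseline variables; swap in whatever your EDA suggests.
BASELINE_COLS = ["age", "lactate_max", "bun_max", "heartrate_mean", "sysbp_min"]


def make_baselines():
    """Two small pipelines; both fill gaps with the training median."""
    return {
        "logistic": make_pipeline(
            SimpleImputer(strategy="median"),
            StandardScaler(),
            LogisticRegression(max_iter=1000),
        ),
        "random_forest": make_pipeline(
            SimpleImputer(strategy="median"),
            RandomForestClassifier(n_estimators=200, min_samples_leaf=20, random_state=42),
        ),
    }


def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit on the training rows and score predicted probabilities on the test rows."""
    model.fit(X_train, y_train)
    p = model.predict_proba(X_test)[:, 1]
    return {"roc_auc": roc_auc_score(y_test, p), "log_loss": log_loss(y_test, p)}


# Synthetic stand-in data (not MIMIC values).
rng = np.random.default_rng(0)
n = 600
X_demo = pd.DataFrame(rng.normal(size=(n, 5)), columns=BASELINE_COLS)
logit = 1.5 * X_demo["lactate_max"] + 1.0 * X_demo["age"] - 1.0
y_demo = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
X_demo = X_demo.mask(rng.random((n, 5)) < 0.1)  # knock out roughly 10% of values

baseline_results = {
    name: evaluate(model, X_demo.iloc[:400], y_demo.iloc[:400],
                   X_demo.iloc[400:], y_demo.iloc[400:])
    for name, model in make_baselines().items()
}
print(pd.DataFrame(baseline_results).T)
```

With the real data, the call would be `evaluate(model, X_train_full[BASELINE_COLS], y_train, X_test_full[BASELINE_COLS], y_test)`. Keeping imputation inside the pipeline means the medians are always computed from the rows the model was fit on.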

## Build your more complicated models here
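
A possible shape for this step is sketched below, again on synthetic stand-in data. It makes several assumptions you should check against your own EDA: identifier columns are dropped by name prefix, and the length-of-stay columns are excluded because they are only known once the stay has ended. The logistic model uses scikit-learn's default L2 penalty, with its strength `C` chosen by cross-validated log loss. Helper names and hyperparameter values are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

ID_PREFIXES = ("Unnamed", "subject_id", "hadm_id", "icustay_id")  # assumed identifiers
POST_OUTCOME = {"los_hospital", "los_icu"}  # assumed to be known only after the stay


def select_numeric_features(df):
    """Numeric columns, minus identifiers and columns assumed to leak the outcome."""
    numeric = df.select_dtypes(include="number").columns
    return [c for c in numeric
            if not c.startswith(ID_PREFIXES) and c not in POST_OUTCOME]


def penalized_logistic(cv=5):
    """L2-penalized logistic regression; smaller C means a stronger penalty."""
    pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median", add_indicator=True)),
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=2000)),
    ])
    grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
    return GridSearchCV(pipe, grid, scoring="neg_log_loss", cv=cv)


def boosted_trees():
    """Gradient boosting on median-imputed features plus missingness flags."""
    return make_pipeline(
        SimpleImputer(strategy="median", add_indicator=True),
        GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                   max_depth=3, random_state=42),
    )


# Synthetic stand-in data (not MIMIC values).
rng = np.random.default_rng(1)
n = 500
demo = pd.DataFrame(rng.normal(size=(n, 4)),
                    columns=["age", "lactate_max", "bun_max", "sysbp_min"])
demo["subject_id"] = np.arange(n)
demo["los_icu"] = rng.integers(1, 10, size=n)
demo["gender"] = rng.choice(["M", "F"], size=n)
y_demo = (rng.random(n) < 1 / (1 + np.exp(-(1.5 * demo["lactate_max"] - 1)))).astype(int)
demo.loc[rng.random(n) < 0.2, "lactate_max"] = np.nan

features = select_numeric_features(demo)
X_tr, X_te, y_tr, y_te = train_test_split(demo[features], y_demo,
                                          test_size=0.3, random_state=42)

results = {}
for name, model in {"penalized_lr": penalized_logistic(), "gbm": boosted_trees()}.items():
    model.fit(X_tr, y_tr)
    p = model.predict_proba(X_te)[:, 1]
    results[name] = {"roc_auc": roc_auc_score(y_te, p), "log_loss": log_loss(y_te, p)}
print(pd.DataFrame(results).T)
```

On the real table, `select_numeric_features(X_train_full)` gives a first candidate list. Categorical columns such as `gender` and `admission_type` are left out of this sketch and would need encoding before they could be added.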