## Preprocessing MIMIC

Now that we've explored the contents of the MIMIC database, let's put a heavier focus on the prediction tasks that we are trying to achieve:

* (1) mortality prediction
* (2) hospital readmission 
* (3) length of stay

We saw in the previous part that the number of available features is extremely large. For that reason, we have to select the columns and measurement types of the patient data carefully before we can do our analysis. In this part, we dive into feature selection and into preprocessing the data so that it is usable for our machine learning classifiers.

## Features

We shall consider the following features:

* Age at admission
* Gender
* Admission type
* Vital signs
    * Heart rate
    * Body temperature
    * Respiratory rate
    * Mean and systolic blood pressure
    * SpO2
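
Vital signs are charted many times per stay, so they have to be summarized before they can serve as one feature row per ICU stay. The following is a minimal sketch of one way to do that with pandas, assuming the measurements have been loaded in long format (one row per measurement). The toy values, the `itemid` codes, and the `ITEMID_TO_VITAL` mapping are hypothetical and only illustrate the reshaping step.

```python
import pandas as pd

# Toy long-format chart events: one row per measurement.
# The itemid codes and values below are made up for illustration.
events = pd.DataFrame({
    "icustay_id": [1, 1, 1, 1, 2, 2],
    "itemid":     [10, 10, 20, 20, 10, 20],
    "valuenum":   [80.0, 100.0, 97.0, 99.0, 60.0, 95.0],
})

# Hypothetical mapping from itemid to a readable vital-sign name.
ITEMID_TO_VITAL = {10: "heart_rate", 20: "spo2"}
events["vital"] = events["itemid"].map(ITEMID_TO_VITAL)

# Summarize every vital sign per ICU stay with min / mean / max,
# then move the vital-sign names from the row index into the columns.
vitals = (
    events.groupby(["icustay_id", "vital"])["valuenum"]
    .agg(["min", "mean", "max"])
    .unstack("vital")
)

# Flatten the (statistic, vital) column levels into names like heart_rate_mean.
vitals.columns = [f"{vital}_{stat}" for stat, vital in vitals.columns]
vitals = vitals.reset_index()

print(vitals)
```

Which summary statistics to keep (and over which time window) is a modeling choice; min, mean, and max are used here only as an example.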

Outcomes:

* Length of ICU & hospital stay
* Hospital re-admission
* Mortality 30 days post-discharge
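
The outcomes listed above can be derived from admission, discharge, and death timestamps. Below is a rough sketch on a toy table, not the definitions used later in this analysis. The column names follow the MIMIC-III `admissions` and `patients` tables, the values are made up, and the 30-day readmission window is an assumption (the text above does not fix one). Note that the mortality label as written here also counts deaths that occur during the hospital stay.

```python
import pandas as pd

# Toy admissions with the patient's date of death (dod) already joined in.
adm = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "hadm_id":    [100, 101, 200],
    "admittime":  pd.to_datetime(["2150-01-01", "2150-01-20", "2150-03-01"]),
    "dischtime":  pd.to_datetime(["2150-01-05", "2150-01-25", "2150-03-10"]),
    "dod":        pd.to_datetime([None, None, "2150-03-25"]),
}).sort_values(["subject_id", "admittime"])

window = pd.Timedelta(days=30)  # assumed window for readmission and mortality

# Length of hospital stay in days.
adm["hosp_los"] = (adm["dischtime"] - adm["admittime"]).dt.total_seconds() / 86400

# Readmission: the same patient is admitted again within the window after discharge.
next_admit = adm.groupby("subject_id")["admittime"].shift(-1)
adm["readmit_30d"] = ((next_admit - adm["dischtime"]) <= window).astype(int)

# Mortality: a death date exists and falls no later than 30 days after discharge.
adm["mortality_30d"] = (
    adm["dod"].notna() & (adm["dod"] <= adm["dischtime"] + window)
).astype(int)

print(adm[["hadm_id", "hosp_los", "readmit_30d", "mortality_30d"]])
```

Comparisons against a missing timestamp (`NaT`) evaluate to `False`, so the last admission of each patient and patients without a recorded death get a label of 0.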


To consider:

* Diagnoses

In [4]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import psycopg2
import getpass

# below imports are used to print out pretty pandas dataframes
from IPython.display import display, HTML

%matplotlib inline
plt.style.use('ggplot')

In [5]:
# SQL database config
sqluser = ''
dbname = 'MIMIC3'
schema_name = 'mimiciii'
hostname = ''
port = 5432
pwd = getpass.getpass()


In [18]:
# Connect to the local PostgreSQL copy of MIMIC
con = psycopg2.connect(dbname=dbname, user=sqluser, host=hostname, port=port, password=pwd)
cur = con.cursor()
cur.execute('SET search_path to ' + schema_name)

In [20]:
query = \
"""
WITH population as (
SELECT ce.icustay_id, ce.charttime, ce.itemid, ce.valuenum,
    adm.subject_id, adm.hadm_id,
    adm.admittime as hosp_admittime, adm.dischtime as hosp_dischtime, 
    i.first_careunit, 
    DENSE_RANK() over(PARTITION BY adm.hadm_id ORDER BY i.intime ASC) as icu_seq,
    p.dob, p.dod, i.intime as icu_intime, i.outtime as icu_outtime, 
    i.los as icu_los,
    round((EXTRACT(EPOCH FROM (adm.dischtime-adm.admittime))/60/60/24) :: NUMERIC, 4) as hosp_los, 
    p.gender, 
    round((EXTRACT(EPOCH FROM (adm.admittime-p.dob))/60/60/24/365.242) :: NUMERIC, 4) as age_hosp_in,
    round((EXTRACT(EPOCH FROM (i.intime-p.dob))/60/60/24/365.242) :: NUMERIC, 4) as age_icu_in,
    hospital_expire_flag,
    CASE WHEN p.dod IS NOT NULL 
        AND p.dod >= i.intime - interval '6 hour'
        AND p.dod <= i.outtime + interval '6 hour' THEN 1 
        ELSE 0 END AS icu_expire_flag
FROM chartevents ce
-- match each chart event to its own ICU stay
INNER JOIN icustays i
ON ce.icustay_id = i.icustay_id
INNER JOIN admissions adm
ON ce.hadm_id = adm.hadm_id
INNER JOIN patients p
ON adm.subject_id = p.subject_id
ORDER BY adm.subject_id, i.intime
)
SELECT first_careunit, icu_los, hosp_los, gender, age_hosp_in, age_icu_in, hospital_expire_flag, icu_expire_flag
FROM population
WHERE age_hosp_in >= 16;
"""

query_output = pd.read_sql_query(query, con)
query_output.head()
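
The age and ICU-mortality columns in the query come down to simple timestamp arithmetic: a time difference in seconds divided by the number of seconds in a year (using 365.242 days), and a check whether the date of death falls between ICU admission and discharge with six hours of slack on either side. As a sanity check, the same logic can be sketched in pandas. The toy rows below are made up; only the column names come from the query.

```python
import pandas as pd

# Toy ICU stays; the values are invented for illustration.
stays = pd.DataFrame({
    "dob":     pd.to_datetime(["2100-06-01", "2080-01-01"]),
    "intime":  pd.to_datetime(["2150-06-01 08:00", "2150-02-01 12:00"]),
    "outtime": pd.to_datetime(["2150-06-03 08:00", "2150-02-04 12:00"]),
    "dod":     pd.to_datetime([None, "2150-02-04 15:00"]),
})

# Same conversion as in the SQL: seconds -> years with 365.242 days per year.
SECONDS_PER_YEAR = 60 * 60 * 24 * 365.242
stays["age_icu_in"] = (
    (stays["intime"] - stays["dob"]).dt.total_seconds() / SECONDS_PER_YEAR
).round(4)

# Death counts as an ICU death if it falls inside the stay, allowing 6 hours of slack.
slack = pd.Timedelta(hours=6)
stays["icu_expire_flag"] = (
    stays["dod"].notna()
    & (stays["dod"] >= stays["intime"] - slack)
    & (stays["dod"] <= stays["outtime"] + slack)
).astype(int)

print(stays[["age_icu_in", "icu_expire_flag"]])
```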

## Exclusions

We exclude an ICU stay from the analysis if it meets any of the following criteria:

* Patients younger than 16 years
    * This removes neonates and children, who likely have different predictors than adults
* Second and later ICU stays of the same patient
    * Simplifies the analysis, which assumes independent observations
    * Repeated ICU stays of one patient are highly correlated; keeping only the first stay avoids having to model that
* Length of stay of less than 2 days
    * Helps remove false positives, i.e. patients who were placed in the ICU only as a precaution

To consider:
    
* Exclude patients based on hospital services
    * Yields a more homogeneous group of patients

In [8]:
query = \
"""
WITH co AS
(
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
, EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age
, RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order

FROM icustays icu
INNER JOIN patients pat
  ON icu.subject_id = pat.subject_id
LIMIT 10
)
SELECT
  co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
  , co.age
  , co.icustay_id_order
  
  , CASE
        WHEN co.icu_length_of_stay < 2 then 1
    ELSE 0 END
    AS exclusion_los
  , CASE
        WHEN co.age < 16 then 1
    ELSE 0 END
    AS exclusion_age
  , CASE 
        WHEN co.icustay_id_order != 1 THEN 1
    ELSE 0 END 
    AS exclusion_first_stay
FROM co
"""
df = pd.read_sql_query(query, con)
df.head()

| | subject_id | hadm_id | icustay_id | icu_length_of_stay | age | icustay_id_order | exclusion_los | exclusion_age | exclusion_first_stay |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 163353 | 243653 | 0.091829 | 0.002434 | 1 | 1 | 1 | 0 |
| 1 | 3 | 145834 | 211552 | 6.06456 | 76.526792 | 1 | 0 | 0 | 0 |
| 2 | 4 | 185777 | 294638 | 1.678472 | 47.845047 | 1 | 1 | 0 | 0 |
| 3 | 5 | 178980 | 214757 | 0.084444 | 0.000693 | 1 | 1 | 1 | 0 |
| 4 | 6 | 107064 | 228232 | 3.672917 | 65.942297 | 1 | 0 | 0 | 0 |
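
The query only flags stays; it does not drop them. One possible next step, sketched here on a hand-copied version of the five rows shown above, is to count how many stays each criterion flags and then keep only the stays with no flag set. The names `df_demo`, `flag_cols`, and `cohort` are introduced for this illustration only.

```python
import pandas as pd

# The exclusion flags of the five rows shown above, copied by hand.
df_demo = pd.DataFrame({
    "subject_id":           [2, 3, 4, 5, 6],
    "exclusion_los":        [1, 0, 1, 1, 0],
    "exclusion_age":        [1, 0, 0, 1, 0],
    "exclusion_first_stay": [0, 0, 0, 0, 0],
})

flag_cols = ["exclusion_los", "exclusion_age", "exclusion_first_stay"]

# How many stays does each criterion flag?
print(df_demo[flag_cols].sum())

# Keep only the stays for which every exclusion flag is 0.
cohort = df_demo[(df_demo[flag_cols] == 0).all(axis=1)]
print(cohort["subject_id"].tolist())
```

Keeping the flags as separate columns, rather than filtering inside the SQL, makes it easy to report how many stays each criterion removes.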
