# Patient Monitoring and Decision Support using Health Data

**Notebook by Olivier Nguyen (Github: [olinguyen](http://github.com/olinguyen))**

In the past decade, great efforts have been made to better use information technology in healthcare. Every day, large amounts of patient and clinical data from hospitals and clinics are generated and stored in electronic health records. However, few real-world applications have been developed in practice, because the data were previously inaccessible to researchers and because of the sheer complexity of analyzing large datasets [1].

With advances in data collection, machine learning and big data, the analysis of health data now makes it possible to provide real-time decision support for patients based on data from similar patients in similar past scenarios. The sheer volume of data available remains a challenge for clinicians and scientists. There is therefore a need to develop effective techniques and algorithms that fully exploit electronic health records to improve the quality of care for the individual patient.

For this Google Summer of Code 2017 data project, we will be exploring machine learning algorithms using the [Shogun Toolbox](http://shogun.ml/) applied to health data, more specifically the [MIMIC database](https://mimic.physionet.org/). The main objective is to exploit the rich information of the MIMIC database to build and evaluate models for

* (1) mortality prediction
* (2) hospital readmission 
* (3) length of stay

Before we dive into any machine learning or data analysis, we'll first explore the contents of the MIMIC database.

## MIMIC Database

The MIMIC database, short for Medical Information Mart for Intensive Care, holds a large amount of information relating to patients admitted to the intensive care unit (ICU) at a large care hospital. The dataset contains clinical data (demographics, diagnoses, laboratory values), time-stamped nurse-verified physiological measurements, documented progress notes by care providers, and much more.

Although access to the database is freely available to researchers around the world, Human Subjects training must be completed before handling the sensitive patient information. More details on the formal request can be found at the following [link](https://mimic.physionet.org/gettingstarted/access/).

The following code will assume a prior installation of the MIMIC database in a local Postgres database. The installation procedure can be followed [here](https://mimic.physionet.org/tutorials/install-mimic-locally-ubuntu/).

## Exploring the MIMIC database

Since MIMIC is a relational database containing tables of data on patients who stayed in the ICU, data will be extracted using SQL queries and then stored as Pandas dataframes for easy manipulation. To better understand the database, we'll recreate the tables from the MIMIC-III paper [2] to get a better idea of what type of data we are dealing with.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import psycopg2
import getpass
%matplotlib inline

In [2]:
# SQL database config
sqluser = ''
dbname = 'MIMIC3'
schema_name = 'mimiciii'
hostname = ''
port = 5432
pwd = getpass.getpass()

········


In [3]:
# Connect to local postgres version of mimic
con = psycopg2.connect(dbname=dbname, user=sqluser, host=hostname, port=port, password=pwd)
cur = con.cursor()
cur.execute('SET search_path to ' + schema_name)

A more complete description of the entire MIMIC database can be found [here](https://mimic.physionet.org/gettingstarted/overview), but you can find a quick overview below.

The tables are linked by identifiers with the suffix `ID`: `SUBJECT_ID` which represents a unique patient, `HADM_ID` which represents a unique patient hospital stay and `ICUSTAY_ID` which represents a unique patient ICU stay.

The following tables are used to define and track patient stays:

* `ADMISSIONS`: Every unique hospitalization for each patient in the database (defines HADM_ID)
* `CALLOUT`: Information regarding when a patient was cleared for ICU discharge and when the patient was actually discharged
* `ICUSTAYS`: Every unique ICU stay in the database (defines ICUSTAY_ID)
* `PATIENTS`: Every unique patient in the database (defines SUBJECT_ID)
* `SERVICES`: The clinical service under which a patient is registered
* `TRANSFERS`: Patient movement from bed to bed within the hospital, including ICU admission and discharge

A simple SQL query can then be written as follows. It selects every patient who is female or whose `subject_id` is 109 or 127.

```sql
SELECT *
FROM patients
WHERE subject_id = 109
OR subject_id = 127
OR gender = 'F';
```
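To see how the `OR` conditions combine, here is a minimal sketch that runs the same kind of filter against a tiny in-memory SQLite table. SQLite stands in for the local Postgres installation purely for illustration, the rows are made up, and `con_demo` is a hypothetical name chosen so it does not clash with the notebook's `con`.

```python
import sqlite3

# In-memory SQLite stands in for the Postgres MIMIC database (assumption);
# the four rows below are invented for illustration.
con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE patients (subject_id INTEGER, gender TEXT)")
con_demo.executemany(
    "INSERT INTO patients VALUES (?, ?)",
    [(109, "F"), (127, "M"), (200, "F"), (201, "M")],
)

# OR keeps a row if any condition holds: listed ids or female.
either = con_demo.execute(
    "SELECT subject_id FROM patients "
    "WHERE subject_id IN (109, 127) OR gender = 'F' ORDER BY subject_id"
).fetchall()

# AND keeps a row only if both conditions hold: listed ids that are female.
both = con_demo.execute(
    "SELECT subject_id FROM patients "
    "WHERE subject_id IN (109, 127) AND gender = 'F' ORDER BY subject_id"
).fetchall()

print(either)
print(both)
```

With `OR`, the male patient 127 is still returned because his id is listed; switching to `AND` would restrict the result to listed patients who are also female.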

## Joining admissions, icustays, and patients tables

We'll now combine multiple tables in order to obtain a more detailed description of the data. In particular, we'll extract the following columns that contain basic information regarding the patients in the database:

* Identifiers (subject_id, hadm_id, icustay_id)
* The ICU care unit 
* The patient's dates of birth and death
* The time the patient entered and left the ICU
* The length of stay at the hospital
* The age during the ICU and the hospital stay
* Whether the patient died in the ICU or in the hospital

In [9]:
query = \
"""
WITH population as (
SELECT a.subject_id, a.hadm_id, i.icustay_id, 
    a.admittime as hosp_admittime, a.dischtime as hosp_dischtime, 
    i.first_careunit, 
    DENSE_RANK() over(PARTITION BY a.hadm_id ORDER BY i.intime ASC) as icu_seq,
    p.dob, p.dod, i.intime as icu_intime, i.outtime as icu_outtime, 
    i.los as icu_los,
    round((EXTRACT(EPOCH FROM (a.dischtime-a.admittime))/60/60/24) :: NUMERIC, 4) as hosp_los, 
    p.gender, 
    round((EXTRACT(EPOCH FROM (a.admittime-p.dob))/60/60/24/365.242) :: NUMERIC, 4) as age_hosp_in,
    round((EXTRACT(EPOCH FROM (i.intime-p.dob))/60/60/24/365.242) :: NUMERIC, 4) as age_icu_in,
    hospital_expire_flag,
    CASE WHEN p.dod IS NOT NULL 
        AND p.dod >= i.intime - interval '6 hour'
        AND p.dod <= i.outtime + interval '6 hour' THEN 1 
        ELSE 0 END AS icu_expire_flag
FROM admissions a
INNER JOIN icustays i
ON a.hadm_id = i.hadm_id
INNER JOIN patients p
ON a.subject_id = p.subject_id
ORDER BY a.subject_id, i.intime
)
SELECT *
FROM population
WHERE age_hosp_in >= 16;
"""

query_output = pd.read_sql_query(query,con)
query_output.head()

| | subject_id | hadm_id | icustay_id | hosp_admittime | hosp_dischtime | first_careunit | icu_seq | dob | dod | icu_intime | icu_outtime | icu_los | hosp_los | gender | age_hosp_in | age_icu_in | hospital_expire_flag | icu_expire_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 145834 | 211552 | 2101-10-20 19:08:00 | 2101-10-31 13:58:00 | MICU | 1 | 2025-04-11 | 2102-06-14 | 2101-10-20 19:10:11 | 2101-10-26 20:43:09 | 6.0646 | 10.7847 | M | 76.5268 | 76.5268 | 0 | 0 |
| 1 | 4 | 185777 | 294638 | 2191-03-16 00:28:00 | 2191-03-23 18:41:00 | MICU | 1 | 2143-05-12 | NaT | 2191-03-16 00:29:31 | 2191-03-17 16:46:31 | 1.6785 | 7.759 | F | 47.845 | 47.845 | 0 | 0 |
| 2 | 6 | 107064 | 228232 | 2175-05-30 07:15:00 | 2175-06-15 16:00:00 | SICU | 1 | 2109-06-21 | NaT | 2175-05-30 21:30:54 | 2175-06-03 13:39:54 | 3.6729 | 16.3646 | F | 65.9407 | 65.9423 | 0 | 0 |
| 3 | 9 | 150750 | 220597 | 2149-11-09 13:06:00 | 2149-11-14 10:15:00 | MICU | 1 | 2108-01-26 | 2149-11-14 | 2149-11-09 13:07:02 | 2149-11-14 20:52:14 | 5.3231 | 4.8813 | M | 41.7902 | 41.7902 | 1 | 1 |
| 4 | 11 | 194540 | 229441 | 2178-04-16 06:18:00 | 2178-05-11 19:00:00 | SICU | 1 | 2128-02-22 | 2178-11-14 | 2178-04-16 06:19:32 | 2178-04-17 20:21:05 | 1.5844 | 25.5292 | F | 50.1483 | 50.1483 | 0 | 0 |
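The age columns come from the `EXTRACT(EPOCH FROM ...)/60/60/24/365.242` expressions in the query: the elapsed time in seconds is converted to days and then to years. The same arithmetic can be sketched in plain Python for the first row of the output. `age_in_years` is a hypothetical helper name, and the date of birth is assumed to be at midnight.

```python
from datetime import datetime

DAYS_PER_YEAR = 365.242  # same divisor as in the SQL query

def age_in_years(dob, when):
    """Elapsed time between two datetimes in years, rounded to 4 decimals."""
    seconds = (when - dob).total_seconds()
    return round(seconds / 60 / 60 / 24 / DAYS_PER_YEAR, 4)

# dob and hosp_admittime of subject_id 3 (first row of the output)
age = age_in_years(datetime(2025, 4, 11), datetime(2101, 10, 20, 19, 8))
print(age)  # 76.5268, matching age_hosp_in in that row
```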


Using the above columns, we can compute the following summary statistics of the dataset:

* Hospital admissions, no. (% of total admissions)
* Distinct ICU stays, no. (% of total unit stays)
* Age, yrs, median (IQR)
* Gender, male, percent of unit stays
* ICU length of stay, median days (IQR)
* Hospital length of stay, median days (IQR)
* ICU mortality, percent of unit stays
* Hospital mortality, percent of unit stays
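Hospital length of stay and hospital mortality in the list above rely on columns that repeat on every ICU stay of the same admission. The following is a minimal sketch of one way to summarise them, assuming a dataframe shaped like `query_output`; the rows in `toy` are made up, `median_iqr` is a hypothetical helper, and counting once per admission (rather than per unit stay) is a choice made here for illustration.

```python
import pandas as pd

# Made-up rows standing in for query_output.
toy = pd.DataFrame({
    "hadm_id": [1, 1, 2, 3, 4],
    "icu_seq": [1, 2, 1, 1, 1],
    "hosp_los": [4.0, 4.0, 10.0, 2.0, 6.0],
    "hospital_expire_flag": [0, 0, 1, 0, 0],
})

# hosp_los and hospital_expire_flag repeat on each ICU stay of an admission,
# so keep one row per admission before summarising.
admissions = toy.drop_duplicates("hadm_id")

def median_iqr(series):
    """Return (median, lower quartile, upper quartile) of a numeric Series."""
    q = series.quantile([0.25, 0.5, 0.75])
    return q.loc[0.5], q.loc[0.25], q.loc[0.75]

los_median, los_q1, los_q3 = median_iqr(admissions["hosp_los"])

# The mean of a 0/1 flag is the fraction of admissions ending in death.
hosp_mortality_pct = admissions["hospital_expire_flag"].mean() * 100

print(los_median, los_q1, los_q3, hosp_mortality_pct)
```

To report the figures as a percentage of unit stays instead, the `drop_duplicates` step would simply be skipped.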

## Distinct patients, no. (% of total patients)

In [8]:
print('\nTotal number patients: {}'.format(len(query_output.subject_id.unique())))

print('\nNumber of patients by first careunit:\n')
print(query_output[['first_careunit','subject_id']] \
                    .drop_duplicates(['subject_id']) \
                    .groupby('first_careunit').count())
    
print('\nProportion of total patients:\n')
print(query_output[['first_careunit','subject_id']] \
                    .drop_duplicates(['subject_id']) \
                    .groupby('first_careunit') \
                    .count()/len(query_output.subject_id.unique())*100)


Total number patients: 38597

Number of patients by first careunit:

                subject_id
first_careunit            
CCU                   5674
CSRU                  7611
MICU                 13649
SICU                  6372
TSICU                 5291

Proportion of total patients:

                subject_id
first_careunit            
CCU              14.700624
CSRU             19.719149
MICU             35.362852
SICU             16.509055
TSICU            13.708319


## Hospital admissions, no. (% of total admissions)

In [10]:
print('\nTotal hospital admissions: {}'\
    .format(len(query_output.hadm_id.unique())))

print('\nNumber of hospital admissions by first careunit:\n')
print(query_output[['first_careunit','hadm_id']] \
                    .drop_duplicates(['hadm_id']) \
                    .groupby('first_careunit').count())
    
print('\nProportion of total hospital admissions:\n')
print(query_output[['first_careunit','hadm_id']] \
                    .drop_duplicates(['hadm_id']) \
                    .groupby('first_careunit') \
                    .count()/len(query_output.hadm_id.unique())*100)


Total hospital admissions: 49785

Number of hospital admissions by first careunit:

                hadm_id
first_careunit         
CCU                7258
CSRU               8640
MICU              19770
SICU               8110
TSICU              6007

Proportion of total hospital admissions:

                  hadm_id
first_careunit           
CCU             14.578688
CSRU            17.354625
MICU            39.710756
SICU            16.290047
TSICU           12.065883


## Distinct ICU stays, no. (% of total unit stays)

In [11]:
print('\nTotal ICU stays: {}'\
    .format(len(query_output.icustay_id.unique())))

print('\nNumber of ICU stays by careunit:\n')
print(query_output[['first_careunit','icustay_id']] \
          .groupby('first_careunit').count())

print('\nProportion of total ICU stays:\n')
print(query_output[['first_careunit','icustay_id']] \
          .groupby('first_careunit') \
          .count()/len(query_output.icustay_id.unique())*100)


Total ICU stays: 53423

Number of ICU stays by careunit:

                icustay_id
first_careunit            
CCU                   7726
CSRU                  9311
MICU                 21087
SICU                  8891
TSICU                 6408

Proportion of total ICU stays:

                icustay_id
first_careunit            
CCU              14.461936
CSRU             17.428823
MICU             39.471763
SICU             16.642645
TSICU            11.994834
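The patient, admission and ICU stay breakdowns all repeat the same pattern: deduplicate on an identifier, count per care unit, and divide by the total. One possible way to express that once is a small helper. This is only a sketch; `count_and_share` is a hypothetical name and the `toy` rows are made up.

```python
import pandas as pd

def count_and_share(df, id_col, by="first_careunit"):
    """Count distinct id_col values per group and their share of the total (%)."""
    unique_rows = df[[by, id_col]].drop_duplicates(id_col)
    counts = unique_rows.groupby(by)[id_col].count()
    return pd.DataFrame({"n": counts, "percent": counts / counts.sum() * 100})

# Made-up rows standing in for query_output.
toy = pd.DataFrame({
    "first_careunit": ["MICU", "MICU", "CCU", "SICU"],
    "icustay_id": [10, 11, 12, 13],
})

summary = count_and_share(toy, "icustay_id")
print(summary)
```

Against the real dataframe, the call would look like `count_and_share(query_output, 'hadm_id')` or `count_and_share(query_output, 'subject_id')`.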


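## Age, yrs, median (IQR)

We report the median and interquartile range (IQR) of age rather than the mean, since patients older than 89 have their dates of birth shifted for de-identification and appear with ages of around 300 years, which would distort the mean.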
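The effect of these shifted ages on the two statistics can be illustrated with a handful of made-up values, where 300.0 stands in for one de-identified patient:

```python
from statistics import mean, median

# Made-up ages; 300.0 stands in for a patient whose date of birth was shifted.
ages = [45.0, 52.0, 66.0, 71.0, 300.0]

print(mean(ages))    # pulled upward by the shifted value
print(median(ages))  # depends only on the middle value, not on how large 300 is
```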

In [12]:
print('Median age, years: {} '.format(query_output.age_icu_in.median()))
print('Lower quartile age, years: {} '.format(query_output.age_icu_in.quantile(0.25)))
print('Upper quartile age, years: {} \n '.format(query_output.age_icu_in.quantile(0.75)))

print('Median age by careunit, years:\n ')
print(query_output[['first_careunit','age_icu_in']] \
      .groupby('first_careunit').median())

print('\nLower quartile by careunit, years:\n ')
print(query_output[['first_careunit','age_icu_in']] \
      .groupby('first_careunit').quantile(0.25))

print('\nUpper quartile by careunit, years:\n ')
print(query_output[['first_careunit','age_icu_in']] \
      .groupby('first_careunit').quantile(0.75))

Median age, years: 65.769 
Lower quartile age, years: 52.8361 
Upper quartile age, years: 77.80005 
 
Median age by careunit, years:
 
                age_icu_in
first_careunit            
CCU                70.5697
CSRU               67.9286
MICU               64.9124
SICU               63.5819
TSICU              59.6125

Lower quartile by careunit, years:
 
                age_icu_in
first_careunit            
CCU               58.44555
CSRU              58.32150
MICU              51.65590
SICU              51.44315
TSICU             42.45665

Upper quartile by careunit, years:
 
                age_icu_in
first_careunit            
CCU              80.541375
CSRU             76.727250
MICU             78.170950
SICU             76.457500
TSICU            75.601550


## Gender, percent of unit stays

In [13]:
print('Gender:\n')
print(query_output.loc[query_output.icu_seq==1].groupby('gender').gender.count())
print(query_output.loc[query_output.icu_seq==1].groupby('gender').gender.count() \
     /query_output.loc[query_output.icu_seq==1].gender.count()*100)

print('Gender by careunit:\n')
print(query_output.loc[query_output.icu_seq==1] \
    .groupby(['first_careunit','gender']).gender.count())

print('\nProportion by unit:\n')
print(query_output.loc[query_output.icu_seq==1] \
    .groupby(['first_careunit','gender']) \
    .gender.count()/query_output.loc[query_output.icu_seq==1] \
    .groupby(['first_careunit']).gender.count()*100)

Gender:

gender
F    21802
M    27983
Name: gender, dtype: int64
gender
F    43.792307
M    56.207693
Name: gender, dtype: float64
Gender by careunit:

first_careunit  gender
CCU             F          3055
                M          4203
CSRU            F          2963
                M          5677
MICU            F          9577
                M         10193
SICU            F          3859
                M          4251
TSICU           F          2348
                M          3659
Name: gender, dtype: int64

Proportion by unit:

first_careunit  gender
CCU             F         42.091485
                M         57.908515
CSRU            F         34.293981
                M         65.706019
MICU            F         48.442084
                M         51.557916
SICU            F         47.583231
                M         52.416769
TSICU           F         39.087731
                M         60.912269
Name: gender, dtype: float64
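The gender counts above are restricted to rows with `icu_seq == 1`, i.e. the first ICU stay of each hospital admission, which is why they add up to the number of admissions rather than the number of ICU stays. Below is a rough pandas sketch of that filter on made-up rows. It numbers stays with `cumcount` after sorting, which matches the query's `DENSE_RANK` only when the admission times within a hospital stay are distinct.

```python
import pandas as pd

# Made-up rows: admission 100 has two ICU stays, the others have one.
toy = pd.DataFrame({
    "hadm_id": [100, 100, 200, 300],
    "icustay_id": [1, 2, 3, 4],
    "icu_intime": pd.to_datetime([
        "2101-10-22 08:00", "2101-10-20 19:10",
        "2191-03-16 00:29", "2175-05-30 21:30",
    ]),
    "gender": ["M", "M", "F", "F"],
})

# Number the ICU stays of each admission in order of ICU admission time.
toy = toy.sort_values(["hadm_id", "icu_intime"])
toy["icu_seq"] = toy.groupby("hadm_id").cumcount() + 1

# One row per hospital admission: its earliest ICU stay.
first_stays = toy[toy["icu_seq"] == 1]
gender_pct = first_stays["gender"].value_counts(normalize=True) * 100

print(first_stays[["hadm_id", "icustay_id", "icu_seq"]])
print(gender_pct)
```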


## ICU length of stay, median days

In [14]:
print('Median ICU length of stay, days: {}'.format(query_output.icu_los.median()))

print('Median length of ICU stay by careunit, days:\n ')
print(query_output[['first_careunit','icu_los']] \
      .groupby('first_careunit').median())

Median ICU length of stay, days: 2.14425
Median length of ICU stay by careunit, days:
 
                icu_los
first_careunit         
CCU             2.19775
CSRU            2.15295
MICU            2.09560
SICU            2.25220
TSICU           2.11260


## ICU mortality, percent of unit stays

In [15]:
print('ICU mortality, number:\n')
print(query_output \
    .groupby(['icu_expire_flag']) \
    .icu_expire_flag.count())

print('\nICU mortality, %:\n')
print(query_output.groupby(['icu_expire_flag']) \
    .icu_expire_flag.count() / query_output.icu_expire_flag.count()*100)

print('\nICU mortality by careunit:\n')
print(query_output \
    .groupby(['first_careunit','icu_expire_flag']) \
    .icu_expire_flag.count())

print('\nProportion by unit:\n')
print(query_output \
    .groupby(['first_careunit','icu_expire_flag']) \
    .icu_expire_flag.count()/query_output \
    .groupby(['first_careunit']).icu_expire_flag.count()*100)

ICU mortality, number:

icu_expire_flag
0    48858
1     4565
Name: icu_expire_flag, dtype: int64

ICU mortality, %:

icu_expire_flag
0    91.454991
1     8.545009
Name: icu_expire_flag, dtype: float64

ICU mortality by careunit:

first_careunit  icu_expire_flag
CCU             0                   7041
                1                    685
CSRU            0                   9010
                1                    301
MICU            0                  18865
                1                   2222
SICU            0                   8078
                1                    813
TSICU           0                   5864
                1                    544
Name: icu_expire_flag, dtype: int64

Proportion by unit:

first_careunit  icu_expire_flag
CCU             0                  91.133834
                1                   8.866166
CSRU            0                  96.767265
                1                   3.232735
MICU            0                  89.462702
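
The `icu_expire_flag` counted in this section is defined in the query as a date of death falling within the ICU stay, extended by six hours on either side. The rule can be sketched in plain Python as follows; the function name mirrors the column but is otherwise a hypothetical helper.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=6)  # same tolerance as the SQL CASE expression

def icu_expire_flag(dod, icu_intime, icu_outtime):
    """Return 1 if the date of death falls within the ICU stay +/- 6 hours."""
    if dod is None:  # patient has no recorded date of death
        return 0
    return int(icu_intime - WINDOW <= dod <= icu_outtime + WINDOW)

# Values from the row for subject_id 9 in the joined table shown earlier.
flag = icu_expire_flag(
    datetime(2149, 11, 14),
    datetime(2149, 11, 9, 13, 7, 2),
    datetime(2149, 11, 14, 20, 52, 14),
)
print(flag)  # 1, as in the icu_expire_flag column of that row
```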
            

## Patient characteristics

Let's first see how many charted observations are available for each hospitalization. To do so, we'll use the `CHARTEVENTS` table, which contains the charted data, i.e. the electronic chart of a patient. The electronic chart holds a patient's routine vital signs and additional information relevant to their care. Each entry carries the `SUBJECT_ID`, `HADM_ID` and `ICUSTAY_ID` identifiers and is associated with an `ITEMID`, which identifies a single measurement type, a `CHARTTIME`, the timestamp at which the observation was made, and a `VALUE`, the value measured for that `ITEMID`. The description of each `ITEMID` can be found in the `D_ITEMS` table, which holds additional information about that measurement type; e.g. `ITEMID` 212 corresponds to a heart rate measurement.
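The link between `CHARTEVENTS` and `D_ITEMS` can be illustrated with a small in-memory SQLite example. This is only a sketch: SQLite stands in for the Postgres installation, the tables keep just a few of the real columns, the rows and the label text are made up, and `con_demo` is a hypothetical name.

```python
import sqlite3

# Toy versions of the two tables with invented rows (assumption).
con_demo = sqlite3.connect(":memory:")
con_demo.executescript("""
CREATE TABLE d_items (itemid INTEGER, label TEXT);
CREATE TABLE chartevents (hadm_id INTEGER, itemid INTEGER, charttime TEXT, value TEXT);
INSERT INTO d_items VALUES (212, 'Heart Rate');
INSERT INTO chartevents VALUES
    (100, 212, '2101-10-20 20:00:00', '88'),
    (100, 212, '2101-10-20 21:00:00', '92'),
    (200, 212, '2191-03-16 01:00:00', '75');
""")

# Attach the human-readable label to each charted heart rate observation.
rows = con_demo.execute("""
SELECT c.hadm_id, c.charttime, d.label, c.value
FROM chartevents c
INNER JOIN d_items d ON c.itemid = d.itemid
WHERE c.itemid = 212
ORDER BY c.hadm_id, c.charttime
""").fetchall()

# Number of charted observations per hospital admission.
obs_per_admission = dict(con_demo.execute(
    "SELECT hadm_id, count(*) FROM chartevents GROUP BY hadm_id"
).fetchall())

print(rows[0])
print(obs_per_admission)
```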

In [16]:
query = \
"""
WITH chartobs AS (
SELECT hadm_id, count(hadm_id) as obs
FROM chartevents
GROUP BY hadm_id)
SELECT avg(obs)
FROM chartobs;
"""

query_output = pd.read_sql_query(query,con)
query_output.head()

| | avg |
|---|---|
| 0 | 5774.418267 |


In [5]:
query = \
"""
SELECT *
FROM D_ITEMS;
"""

items_df = pd.read_sql_query(query,con)
print("Number of different measurement types:", len(items_df))
items_df.head()

Number of different measurement types: 12487


| | row_id | itemid | label | abbreviation | dbsource | linksto | category | unitname | param_type | conceptid |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 457 | 497 | Patient controlled analgesia (PCA) [Inject] | | carevue | chartevents | | | | |
| 1 | 458 | 498 | PCA Lockout (Min) | | carevue | chartevents | | | | |
| 2 | 459 | 499 | PCA Medication | | carevue | chartevents | | | | |
| 3 | 460 | 500 | PCA Total Dose | | carevue | chartevents | | | | |
| 4 | 461 | 501 | PCV Exh Vt (Obser) | | carevue | chartevents | | | | |


As seen above, the database contains a huge amount of data (over 12,000 different measurement types), with each hospitalization having close to 5,800 charted observations on average. For that reason, we'll have to carefully select the columns and measurement types from the patient data for our analysis.

## TODO 

* Plots 
* Table with features
* Reduce amount of tables, or try to be more concise
* MIMIC Postgres database setup tutorial
* SQL intro/tutorial
* Add better descriptions

## References

1. Ross, M. K., Wei Wei, and L. Ohno-Machado. "'Big data' and the electronic health record." Yearbook of medical informatics 9.1 (2014): 97.

2. Johnson, Alistair EW, et al. "MIMIC-III, a freely accessible critical care database." Scientific data 3 (2016).