In [1]:
# Load the notebook's CSS theme (Python 3: urllib2 became urllib.request)
from IPython.display import HTML
from urllib.request import urlopen
HTML(urlopen('https://gist.githubusercontent.com/mattlewissf/83989910849fdb4a04a72d431e84053f/raw/cefa015a9065665faccd0219774c7087be7d21a8/skeleton.css').read().decode('utf-8'))

### MIMIC Deep Dive - Getting Started with Our Model 
**[Intro](#intro)**   
**[30 Day Readmission](#30_day_readmission)**  
**[The MIMIC Dataset](#mimic_dataset)**  
**[Setting up the database](#setting_up_db)** 

<a id='intro'></a>

#### Intro: MIMIC and 30 day readmission prediction

What follows is a breakdown of how we built our 30 day readmission predictive model on the MIMIC-III dataset. We'll go into some depth on the decisions, errors, and processes involved in the project. While nothing below is particularly novel, we're pleased with the results, and the project shows that careful feature selection can lead to decent predictive results using relatively straightforward applications of machine learning tools and techniques. For a peek at what cutting-edge medical ML looks like, check out [Deep Patient](https://www.nature.com/articles/srep26094). 

<br></br>
**Some goals:**
* Use the MIMIC-III dataset to build a model that predicts whether a discharged patient will be readmitted within thirty days of discharge. 
* Use both demographic and diagnostic data to improve our predictive ability
* Learn more about the best way to evaluate and apply different classifiers

<br></br>

We'll be building our model in Python, working with commonly used libraries like [scikit-learn](http://scikit-learn.org/stable/) and [matplotlib](https://matplotlib.org/) to fit and describe our data. 
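
To make the first goal concrete, here is a minimal sketch of one way to derive a 30 day readmission label from admission records. The function `label_readmissions`, the tuple layout, and the sample values are hypothetical and for illustration only. The field order (patient id, admit time, discharge time) is an assumption about how admissions might be represented, not a description of the pipeline used in this project.

```python
from datetime import datetime, timedelta

# Hypothetical admissions: (subject_id, admit time, discharge time).
# These values are made up for illustration; they are not MIMIC records.
admissions = [
    (1, datetime(2150, 1, 1), datetime(2150, 1, 5)),
    (1, datetime(2150, 1, 20), datetime(2150, 1, 25)),  # 15 days after discharge
    (1, datetime(2150, 6, 1), datetime(2150, 6, 3)),    # more than 30 days later
    (2, datetime(2160, 3, 1), datetime(2160, 3, 4)),    # patient's only admission
]


def label_readmissions(rows, window_days=30):
    """Return {(subject_id, admit_time): bool}.

    A stay is labeled True when the same patient's next admission
    starts within `window_days` of this stay's discharge.
    """
    window = timedelta(days=window_days)

    # Group stays by patient.
    by_patient = {}
    for subject_id, admit, discharge in rows:
        by_patient.setdefault(subject_id, []).append((admit, discharge))

    labels = {}
    for subject_id, stays in by_patient.items():
        stays.sort()  # chronological order by admit time
        for i, (admit, discharge) in enumerate(stays):
            next_admit = stays[i + 1][0] if i + 1 < len(stays) else None
            labels[(subject_id, admit)] = (
                next_admit is not None and next_admit - discharge <= window
            )
    return labels


labels = label_readmissions(admissions)
```

The last stay for each patient is labeled `False` here simply because no later admission exists in the sample. Whether that is the right treatment (for example, for patients who died in hospital) is a modeling decision this sketch does not address.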

<br></br>

( Interested in the data? [Here’s how you can get access](https://mimic.physionet.org/gettingstarted) to the MIMIC-III database. )

<a id='mimic_dataset'></a>
### The MIMIC Dataset


For this project we'll be using the MIMIC-III dataset, which contains records for 58,000 hospital admissions for 38,645 adults and 7,875 infants, spanning June 2001 to October 2012. We're using the MIMIC-III dataset for a few reasons: 
* it is publicly available for free use
* it covers a large and diverse set of patients
* it has high resolution features

<br></br>

There's a ton of data in the MIMIC-III set that we probably won't be using: data around caregivers and chartevents; lab events and microbiology results; and the massive MIMIC waveform database that 'contains thousands of recordings of multiple physiologic signals ("waveforms") and time series of vital signs ("numerics") collected from bedside patient monitors in adult and neonatal intensive care units (ICUs).'

<br></br>

We'll be focusing on data at the patient, admission, and diagnosis levels. However, as part of the project we'll be rigging up our model to be able to import almost everything from MIMIC-III - we just won't end up actually pulling that data in most cases.



Because of how awesome MIMIC is, there’s a lot of research that uses this dataset. Here’s just a little bit of what people have done with MIMIC-III (and its predecessor, MIMIC-II): 

- [Optimizing patient dosing](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4157935/) 
- [Evaluating current severity scoring systems for things like septic shock](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3609896/) 
- [Research on ‘context-aware EHRs for heterogeneous medical events in a uniform space’](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5148810/) 

But beyond this, MIMIC gives us enough data to play around with different approaches to using machine learning on real healthcare data, and it occupies a useful middle ground: real-world data that isn't prohibitively messy.

#### Anonymity and the MIMIC dataset

Since MIMIC data comes from actual patients, work has gone into making sure that data that could potentially identify individual patients (think names, addresses, dates, medical record and other identification numbers) isn't included in the set. To quote the [paper in Scientific Data](https://www.nature.com/articles/sdata201635) that formally announced MIMIC-III:

<br></br>

"The deidentification process for structured data required the removal of all eighteen of the identifying data elements listed in HIPAA, including fields such as patient name, telephone number, address, and dates. In particular, dates were shifted into the future by a random offset for each individual patient in a consistent manner to preserve intervals, resulting in stays which occur sometime between the years 2100 and 2200. Time of day, day of the week, and approximate seasonality were conserved during date shifting. **Dates of birth for patients aged over 89 were shifted... these patients appear in the database with ages of over 300 years**."

<br></br>

We'll look into how the highlighted part affects our model later. 
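
To see why a single per-patient offset preserves intervals, here is a minimal sketch of the idea. It is not the actual MIMIC deidentification procedure: the function names, the offset range, and the sample dates are assumptions chosen for illustration, and the sketch makes no attempt to preserve seasonality. It also shows one simple way to spot the patients whose dates of birth were shifted, since their computed ages land far above any plausible value.

```python
import random
from datetime import datetime, timedelta


def shift_patient_dates(dates, rng):
    """Shift all of one patient's dates by a single random offset.

    The offset is a whole number of weeks, so weekday and time of day
    survive the shift. The range is arbitrary (illustration only).
    """
    offset = timedelta(weeks=rng.randint(5000, 9000))
    return [d + offset for d in dates]


def age_in_years(date_of_birth, admit_time):
    """Approximate age at admission in years."""
    return (admit_time - date_of_birth).days / 365.25


rng = random.Random(0)
admit = datetime(2008, 3, 3, 14, 30)
discharge = datetime(2008, 3, 10, 9, 0)
shifted_admit, shifted_discharge = shift_patient_dates([admit, discharge], rng)

# Length of stay is unchanged because both dates moved by the same amount.
original_stay = discharge - admit
shifted_stay = shifted_discharge - shifted_admit

# Hypothetical record for a patient whose date of birth was shifted:
# the computed age comes out at roughly 300 years, so a simple
# threshold is enough to flag it.
suspicious_age = age_in_years(datetime(1850, 1, 1), datetime(2151, 1, 1))
is_shifted_dob = suspicious_age > 89
```

Because every date for a patient moves by the same amount, anything computed as a difference between two of that patient's dates (length of stay, days between discharge and the next admission) is unaffected. Absolute dates, and ages for the oldest patients, are not meaningful.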

<a id='30_day_readmission'></a> 
#### Why work on a 30 day readmission model? 

- The 30 day readmission model is a pretty common use of administrative level data. There are provisions in the Affordable Care Act (ACA) that both punish and incentivize hospitals based on changes to this rate. Since other people have been using this measure on different datasets, we'll be able to compare our results to theirs. [Here's a good read](http://www.ajmc.com/journals/issue/2016/2016-vol22-n8/opinions-on-the-hospital-readmission-reduction-program-results-of-a-national-survey-of-hospital-leaders) if you want to dive deeper into opinions about the use of 30 day readmission rates. 
- 30 day readmission results are (somewhat) actionable in the real world, and have been used to inform care and support efforts. 



<a id='setting_up_db'></a>
#### Setting up a database from MIMIC

Each MIMIC table was downloaded as a flat .csv file, for a total of ~46 GB. We probably won't use many of the tables, but we'd like to be able to extract anything that we think might help our model later. 

<br></br>

Our next step is to build a database from these files - luckily, the MIMIC team has an [incredibly helpful repo](https://github.com/MIT-LCP/mimic-code/tree/master/buildmimic/postgres) with scripts that walk you through the process. We went with a Postgres database. After following these steps (and a ton of Googling around the edges), we had our database set up and ready to query. 

Let's see if it works: 

```
psql (9.5.6)
mimic=# SET search_path TO mimiciii;
SET
mimic=# SELECT COUNT(*) FROM patients;
 count 
-------
 46520
(1 row)

mimic=# \dt
               List of relations
  Schema  |        Name        | Type  | Owner 
----------+--------------------+-------+-------
 mimiciii | admissions         | table | mimic
 mimiciii | callout            | table | mimic
 mimiciii | caregivers         | table | mimic
 mimiciii | chartevents        | table | mimic
 mimiciii | chartevents_1      | table | mimic
 mimiciii | chartevents_10     | table | mimic
 mimiciii | chartevents_3      | table | mimic
 mimiciii | chartevents_4      | table | mimic
 mimiciii | transfers          | table | mimic
 
 etc...
 ```

Sweet! Similar queries convince us that we've set things up correctly. Now that we have the data, we need to get it into shape so that we can easily extract features from it. 
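
The row-count check from the `psql` session can also be scripted. Below is a minimal sketch, assuming any DB-API 2.0 connection. The helper `check_row_counts` and the `EXPECTED_COUNTS` dict are hypothetical names, and an in-memory `sqlite3` database stands in for Postgres so the example runs without a server. The only real figure used is the `patients` count of 46520 shown in the session output.

```python
import sqlite3

# Expected row counts. 46520 is the patients count from the psql session.
EXPECTED_COUNTS = {"patients": 46520}


def check_row_counts(conn, expected):
    """Compare COUNT(*) for each table against an expected value.

    `conn` is any DB-API 2.0 connection. Table names come from our own
    dict, never from user input, so formatting them into the query is
    acceptable in this sketch.
    Returns {table: (actual_count, matches_expected)}.
    """
    results = {}
    cur = conn.cursor()
    for table, want in expected.items():
        cur.execute("SELECT COUNT(*) FROM {}".format(table))
        got = cur.fetchone()[0]
        results[table] = (got, got == want)
    cur.close()
    return results


# Stand-in database so the sketch runs without a Postgres server.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE patients (subject_id INTEGER)")
demo.executemany("INSERT INTO patients VALUES (?)", [(i,) for i in range(3)])
demo_results = check_row_counts(demo, {"patients": 3})
```

Against the real database, the same function could be called with a Postgres connection, for example one opened with `psycopg2.connect(...)` (an assumption, since only `psql` is used in this walkthrough), after setting the search path to `mimiciii`, and with `EXPECTED_COUNTS` as the second argument.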



---

#####  **Next  |   [Preprocessing the Data]()**