In [1]:
from IPython.core.display import HTML
import urllib2
HTML(urllib2.urlopen('https://gist.githubusercontent.com/mattlewissf/83989910849fdb4a04a72d431e84053f/raw/cefa015a9065665faccd0219774c7087be7d21a8/skeleton.css').read())

#### MIMIC Deep Dive - Extracting Basic Features
**[Intro](#intro)**   
**[30 Day Readmission](#30_day_readmission)**  
**[The MIMIC Dataset](#mimic_dataset)**  
**[Setting up the database](#setting_up_db)** 

There's a ton of patient data in the MIMIC III database: prescriptions, lab events, microbiology results, information about caregivers and providers. One of our initial jobs is to figure out what we think will be useful in helping to improve our predictive model for emergency department readmissions, and what won't. 

<br></br>

Generally, we're going to be working on creating a useful process for *feature extraction*. Before we go into what we're actually looking to pull out of the MIMIC III dataset, it might be useful to define feature extraction, and how we're doing it (and what we're not doing).

We can understand a feature as anything that might be useful for helping to make a prediction, which in our case is: what is the probability that a patient will be readmitted to the ER within 30 days of discharge? For our purposes, a feature can be either binary (e.g. the presence or absence of a disease category within a time period), categorical (marital status), or continuous (age, or a previously seen readmission rate to the ER).

One general goal of feature engineering is to reduce the resources needed to describe a dataset, since analyzing a large and complex dataset takes a lot of computational power. Feature selection will also (ideally) help us avoid overfitting the data and improve our ability to predict on new data.

#### A note on 'manual' vs. automatic feature construction

We're intentionally selecting our features out of patient records, and while we might be collecting, transforming, or scoring that data to come up with a feature, we're very much using human intuition (and other research) to steer what we're including in the model. *Automatic feature learning* uses machine learning to discover features and representations from raw data, and thus reduces or removes the need for manual curation of features.

<br></br>

We won't (at least initially) be using any feature learning in our model, but it is something being used in state of the art approaches to healthcare data. Check out the super interesting [Deep Patient paper](http://www.nature.com/articles/srep26094) for a cool overview of an implementation of feature learning.

<a id='index_admission'></a>
#### Determining an Index Admission

Since we're building a model that wants to accurately predict whether a given admission for a person will result in a readmission, we'll need to figure out a way to select an index admission for each (qualified) person in our system. So what's an index admission? 

<br></br>

[CMS defines an index admission as](https://www.cms.gov/Medicare/Quality-Initiatives-Patient-Assessment-Instruments/MMS/downloads/MMSHospital-WideAll-ConditionReadmissionRate.pdf) "any eligible admission to an acute care hospital assessed in the measure for the outcome (readmitted or not within 30 days)." For our purposes, an index admission is an admission, chosen from an eligible patient's record, for which we can answer the question: did this person get readmitted within 30 days of being discharged? We'll be using this index admission as a sort of pin on the patient record, and it will help us to determine the correct value for things like:

* Patient age at time of index admission
* How long they were admitted 
* How many admissions they've had in the year before the index admission

<br></br>

For some patients, there will be only a single admission to work with; for others there may be a few, and we may want to randomly pick one. We can come back to refining this concept, but this is enough to go on for now, and we can pretty easily create one of our first extraction functions: 

In [None]:
# within extractors.py
import numpy as np

def assign_index_record(person):
    # look through all of the admissions for a person
    sub_records = person.visit_occurances
    # select only those who have admissions to work with
    if len(sub_records) != 0:
        # determine how many admissions there are, and pick a random one
        sub_records_sample = np.random.choice(len(sub_records))
        index_record = sub_records[sub_records_sample]
        # we're not going to work with newborns, so skip them without assigning
        if index_record.admission_type == 'NEWBORN':
            return
        # assign this admission to the person as the index_admission
        person.index_admission = index_record
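
One possible refinement, sketched here rather than taken from `extractors.py`, is to drop newborn admissions before sampling, so that a person with at least one eligible admission always gets an index admission. The `pick_index_admission` name, the `seed` parameter, the `visit_id` attribute, and the `SimpleNamespace` stand-ins for admission objects are all assumptions made for illustration.

```python
import random
from types import SimpleNamespace

def pick_index_admission(admissions, seed=None):
    """Return a random non-newborn admission, or None if there is none."""
    eligible = [a for a in admissions if a.admission_type != 'NEWBORN']
    if not eligible:
        return None
    # a seeded generator keeps the choice reproducible between runs
    return random.Random(seed).choice(eligible)

# stand-in admissions; real objects would come from the database layer
admissions = [
    SimpleNamespace(visit_id=1, admission_type='NEWBORN'),
    SimpleNamespace(visit_id=2, admission_type='EMERGENCY'),
    SimpleNamespace(visit_id=3, admission_type='ELECTIVE'),
]
index_admission = pick_index_admission(admissions, seed=0)
print(index_admission.visit_id, index_admission.admission_type)
```

Filtering first means the newborn check never discards a person who also has usable admissions, at the cost of changing which admissions can be sampled.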

<a id='basic_features'></a>
#### Extracting some basic features

And now that we've got an index admission, we can extract some of the other features from above:
* Age at index admission
* Length of index admission (in days) 
* Person gender



In [None]:
def get_person_index_age(person):
    try:
        index_record_date = person.index_admission.visit_start_date
    except AttributeError:
        # no index admission was assigned, so there is no age to compute
        print('person_index_age_error')
        return None
    person_age_at_index = index_record_date - person.DOB
    # convert to years, rounded to two decimal places
    return round(person_age_at_index.total_seconds() / (365.25 * 86400), 2)

def get_index_admission_length(person):
    index_admission = person.index_admission
    index_admission_length = index_admission.visit_end_date - index_admission.visit_start_date
    # convert to days first, so the comparison below is numeric rather than on a string
    length_in_days = round(index_admission_length.total_seconds() / 86400, 2)
    # clamp negative lengths (discharge recorded before admission) to zero
    return max(length_in_days, 0.0)

def get_person_gender(person):
    # encode gender as a binary feature; any other value returns None
    person_gender = person.gender
    if person_gender == "M":
        return 0
    if person_gender == "F":
        return 1
    return None
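
To sanity-check the date arithmetic, here is a small worked example with made-up dates (the `dob`, `admit`, and `discharge` values are invented for illustration, not taken from MIMIC). It repeats the same `timedelta` conversions used in the functions above.

```python
from datetime import datetime

SECONDS_PER_DAY = 86400

# made-up timestamps for a single admission
dob = datetime(2100, 3, 1)
admit = datetime(2150, 3, 1, 8, 0)
discharge = datetime(2150, 3, 4, 20, 0)

# length of stay: 3 days and 12 hours
length_in_days = round((discharge - admit).total_seconds() / SECONDS_PER_DAY, 2)

# age at admission, using the same 365.25-day year as above
age_in_years = round((admit - dob).total_seconds() / (365.25 * SECONDS_PER_DAY), 2)

print(length_in_days)  # 3.5
print(age_in_years)
```

Dividing by 365.25 days is an approximation, so the age can be off by a small fraction of a year around birthdays, which is likely fine for a model feature.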

And with our index admission set for each person, we can now look to see if a given patient did indeed get readmitted within a 30 day period of being discharged from this index admission. We're using the visit_end_date value from the admission, and using [relativedelta](http://dateutil.readthedocs.io/en/stable/relativedelta.html) to determine the period. 

In [None]:
from dateutil.relativedelta import relativedelta

def get_readmit_30(person):
    index_end = person.index_admission.visit_end_date
    period_end = index_end + relativedelta(days=30)
    # keep any admission that starts after the index discharge and before the window closes
    admissions_within_30_days = [admission for admission in person.visit_occurances
                                 if index_end < admission.visit_start_date < period_end]
    return 1 if admissions_within_30_days else 0
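
The window logic is easy to get subtly wrong at the edges, so a small check with invented dates can help. This sketch uses `datetime.timedelta(days=30)` from the standard library in place of `relativedelta` (for a fixed number of days the two give the same result), and `in_readmit_window` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

def in_readmit_window(index_end, next_start, days=30):
    """True if next_start falls strictly inside the window after index_end."""
    return index_end < next_start < index_end + timedelta(days=days)

index_end = datetime(2150, 3, 4, 20, 0)  # invented discharge time

print(in_readmit_window(index_end, index_end + timedelta(days=10)))  # True
print(in_readmit_window(index_end, index_end + timedelta(days=30)))  # False, boundary excluded
print(in_readmit_window(index_end, index_end + timedelta(days=45)))  # False
```

Because both comparisons are strict, an admission that starts exactly 30 days after discharge is not counted as a readmission.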

<a id='admission_type'></a>
### Extracting admission type

Something we also want to be able to feed the model is the type of admission that the index admission represents. There are only four possible values: elective, urgent, emergency, and newborn. We've already filtered out all cases of newborn, so we're left with just the first three. We'll represent each as its own binary feature.
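
As a rough sketch of that encoding, each remaining type becomes its own 0/1 column. The function name `get_admission_type_features` is hypothetical, and the uppercase labels assume that the stored values look like the `'NEWBORN'` string used earlier.

```python
# Hypothetical helper: one binary feature per remaining admission type.
ADMISSION_TYPES = ('URGENT', 'EMERGENCY', 'ELECTIVE')

def get_admission_type_features(admission_type):
    """Return a dict with one 0/1 flag per admission type."""
    return {name: int(admission_type == name) for name in ADMISSION_TYPES}

features = get_admission_type_features('EMERGENCY')
print(features)
```

Any value outside the three types (including a newborn admission that slipped through) ends up with all three flags set to 0.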


We'll be adding features as we go along, but for now let's just focus on this super basic set of features to extract. 

<br></br>


**Current features [v0]:**

| Type | Feature |
| --- | --- |
| Admission | person_index_age |
| | index_admission_length |
| | person_gender |
| | admission_rate |
| | Urgent |
| | Emergency |
| | Elective |
| y-value | readmit_30 |



<a id='marital_ethnicity'></a>
### Bucketing ethnicity and marital status features

We're interested in getting some additional demographic information out of the MIMIC data, specifically anything that gives us a clue about the marital status and ethnic background of the individual patient. This information isn't provided in the Patient table, but it is often reported on the individual admission records associated with a person.

First, let's take a look at what values we're actually going to see in the admissions table for ethnicity and marital status by going into the PostgreSQL interface and running some queries:

For **ethnicity**: 
```
SET search_path TO mimiciii;
mimic=# SELECT DISTINCT ethnicity FROM admissions; 
----------------------------------------------------------
 HISPANIC/LATINO - CUBAN
 HISPANIC/LATINO - MEXICAN
 UNKNOWN/NOT SPECIFIED
 BLACK/HAITIAN
 BLACK/AFRICAN AMERICAN
 HISPANIC/LATINO - DOMINICAN
 CARIBBEAN ISLAND
 HISPANIC/LATINO - GUATEMALAN
 HISPANIC/LATINO - CENTRAL AMERICAN (OTHER)
 HISPANIC OR LATINO
 MIDDLE EASTERN
 ASIAN - JAPANESE
 ASIAN
 PATIENT DECLINED TO ANSWER
 AMERICAN INDIAN/ALASKA NATIVE FEDERALLY RECOGNIZED TRIBE
 ASIAN - VIETNAMESE
 ASIAN - KOREAN
 BLACK/CAPE VERDEAN
 OTHER
...
(41 rows) 
```




And for **marital status**: 

```
mimic=# SELECT DISTINCT marital_status FROM admissions; 
  marital_status   
-------------------
 
 SEPARATED
 MARRIED
 DIVORCED
 UNKNOWN (DEFAULT)
 SINGLE
 WIDOWED
 LIFE PARTNER
(8 rows)
```

Great! However, one thing that we notice is that the ethnicity values are spread across a lot of fine-grained categories, and some patients have multiple values associated with ethnicity in the records. One approach we've seen in other papers ([such as Sushmita](https://www.aaai.org/ocs/index.php/WS/AAAIW16/paper/view/12669/12402)) was to generalize this data into fewer buckets, and we'll take that approach here too. We'll end up with the following ethnicity buckets:

- 'race_other'  
- 'white'  
- 'black'   
- 'latino'  
- 'asian'  
- 'multi_racial'   
- 'middle_eastern'  
- 'pacific_islander'   
- 'american_indian'  

<br></br>

To do this, we'll want to manually create a mapping dictionary (see the code here) for our 41 ethnicity values so we can match each separate record with a more general bucket (e.g. 'WHITE - RUSSIAN' --> 'white').
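
A minimal sketch of the bucketing idea follows. Instead of spelling out all 41 values, it matches on prefixes, which is a shortcut and not the hand-built dictionary described above. The `bucket_ethnicity` name and the prefix rules are assumptions. Buckets whose raw values are not shown in the query output (`multi_racial`, `pacific_islander`) are left out, and anything unmatched falls into `race_other`.

```python
# Assumed prefix rules, based only on the raw values visible in the query output.
ETHNICITY_PREFIXES = [
    ('WHITE', 'white'),
    ('BLACK', 'black'),
    ('HISPANIC', 'latino'),
    ('ASIAN', 'asian'),
    ('MIDDLE EASTERN', 'middle_eastern'),
    ('AMERICAN INDIAN', 'american_indian'),
]

def bucket_ethnicity(raw_value):
    """Map a raw ethnicity string to a general bucket name."""
    value = (raw_value or '').strip().upper()
    for prefix, bucket in ETHNICITY_PREFIXES:
        if value.startswith(prefix):
            return bucket
    # unknown, declined, and unmatched values all land here
    return 'race_other'

for raw in ('HISPANIC/LATINO - CUBAN', 'ASIAN - JAPANESE', 'UNKNOWN/NOT SPECIFIED'):
    print(raw, '-->', bucket_ethnicity(raw))
```

An explicit dictionary is more work to build but makes every decision visible, which matters when a value such as 'CARIBBEAN ISLAND' could reasonably go in more than one bucket.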

We'll also want to do something similar for marital status, so we'll end up collapsing a few values to get:
- 'unknown'  
- 'single'  
- 'cohab'   
- 'separated' 

Now that we've got these values, we can treat each of these items in the ethnicity / marital status buckets as binary features, and with each person represented as either having (1) or not having (0) that feature. 
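
Here is a hedged sketch of how the marital status values might be collapsed and then turned into binary features. The grouping in `MARITAL_MAP` is an assumption (for example, treating 'WIDOWED' and 'DIVORCED' as 'separated'), and `get_marital_features` is a hypothetical helper name. The original mapping may differ.

```python
MARITAL_BUCKETS = ('unknown', 'single', 'cohab', 'separated')

# Assumed grouping of the raw values returned by the query above.
MARITAL_MAP = {
    'SINGLE': 'single',
    'MARRIED': 'cohab',
    'LIFE PARTNER': 'cohab',
    'SEPARATED': 'separated',
    'DIVORCED': 'separated',
    'WIDOWED': 'separated',
    'UNKNOWN (DEFAULT)': 'unknown',
}

def get_marital_features(raw_value):
    """Return one 0/1 flag per marital status bucket."""
    # empty or unrecognized values fall back to 'unknown'
    bucket = MARITAL_MAP.get((raw_value or '').strip().upper(), 'unknown')
    return {name: int(name == bucket) for name in MARITAL_BUCKETS}

print(get_marital_features('LIFE PARTNER'))
print(get_marital_features(None))
```

Exactly one flag is set per person, so the empty value returned by the query is handled the same way as 'UNKNOWN (DEFAULT)'.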

<br></br> 

**Current Features [v1]:** 

| Type | Feature |
| --- | --- |
| Admission | person_id |
| | person_index_age |
| | index_admission_length |
| | person_gender |
| | admission_rate |
| | URGENT |
| | ELECTIVE |
| | EMERGENCY |
| Ethnicity | race_other |
| | white |
| | black |
| | latino |
| | asian |
| | multi_racial |
| | middle_eastern |
| | pacific_islander |
| | american_indian |
| Marital Status | unknown |
| | single |
| | cohab |
| | separated |

#####  **Next  |   [Constructing Features from Diagnostic Codes]()**