In [1]:
from IPython.core.display import HTML
import urllib2
HTML(urllib2.urlopen('https://gist.githubusercontent.com/mattlewissf/83989910849fdb4a04a72d431e84053f/raw/cefa015a9065665faccd0219774c7087be7d21a8/skeleton.css').read())

#### MIMIC Deep Dive - Extracting Features into a Pandas Dataframe
**[Intro](#intro)**   
**[30 Day Readmission](#30_day_readmission)**  
**[The MIMIC Dataset](#mimic_dataset)**  
**[Setting up the database](#setting_up_db)** 




It's almost time to run our extraction and move toward actually fitting our model and seeing how good our predictive ability is. Here's a quick look back at what we've done to get to this point: 

- Downloaded the data 
- Prepped the data
- Found that SQLAlchemy is really slow and moved to another ORM 
- Started writing extractors for basic features
- Thought more about what features we'd like to see 
- Built a way to apply Charlson codes to our admission records 
- Categorized dx codes into CCS categories 

<a id='pandas'></a>
#### Extracting features

With functions in place to determine and extract individual features, we now need a way to run all of them on a given patient record. A single function, `apply_extractors()`, takes a person record and returns a list of feature variables:


In [None]:
def apply_extractors(person, codeset):
    # Pick the index admission; skip persons who don't have one
    assign_index_record(person)
    if person.index_admission is None:
        return None
    # Skip newborn admissions
    if person.index_admission.admission_type == 'NEWBORN':
        return None
    # Skip records with an implausible age at index admission
    person_index_age = get_person_index_age(person)
    if person_index_age > 120:
        return None
    # Run the individual extractors
    person_id = person.person_id
    index_admission_length = get_index_admission_length(person)
    index_admission_type_features = get_index_admission_type(person)
    person_gender = get_person_gender(person)
    ethnicity_features = get_person_ethnicity(person)
    marital_features = get_person_marital(person)
    admission_rate = get_admission_rate(person)
    codes = get_person_icd_codes(person)
    # ETC.. 
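
The categorical extractors such as `get_person_marital()` aren't shown here, but the column setup further down suggests that each one yields one binary value per category. Here is a minimal sketch of that idea, assuming a hypothetical helper `one_hot()` and made-up category names (neither comes from the project's code):

```python
def one_hot(value, categories):
    """Return one 0/1 flag per category, in the order given."""
    return [1 if value == category else 0 for category in categories]


# Hypothetical category names, for illustration only
marital_categories = ["MARRIED", "SINGLE", "WIDOWED"]

marital_flags = one_hot("SINGLE", marital_categories)
print(marital_flags)
```

Keeping the flags in the same order as the category names is what lets them line up with the dataframe columns later on.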

#### Setting up a dataframe

Now that we've got a way of getting our features, the next step is to set up a Pandas dataframe to hold all of it.  [Pandas](http://pandas.pydata.org/pandas-docs/stable/) is "a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive." In short, Pandas sets up a tabular data scheme, a 'dataframe', and allows for easy (and powerful) data manipulation. Super awesome. 

<br></br>

We're going to want to set up a dataframe and then start extracting features directly into it. First, we'll build the list of column names that match our features, along with a placeholder row of zeros that the feature rows will be stacked onto:

In [None]:
import numpy as np
import pandas as pd

# Identifier and continuous columns first
df_columns = ["person_id", "person_index_age","index_admission_length","person_gender", "admission_rate"]
# One column per category for each categorical feature
df_columns.extend(index_admission_types.keys())
df_columns.extend(ethnicity_values.keys())
df_columns.extend(marital_values.keys())
df_columns.extend(charleston_values.keys())
df_columns.extend(sorted(codeset.dx_single_level_codes.keys()))
# Outcome label goes last
df_columns.append("readmit_30")
# Placeholder row of zeros, one per column
empty_col = [0 for x in df_columns]
np_data = np.array(empty_col)

Now that we've got the appropriate setup, we can start iterating through patient records and assigning features to different columns in the dataframe. 

In [None]:
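# within the extract_to_dataframe() function... 

    # Placeholder row of zeros so np.vstack has something to stack onto
    empty_col = [0 for x in df_columns]
    np_data = np.array(empty_col)

    for person in persons:
        # apply_extractors() returns None for excluded records
        features = apply_extractors(person, codeset)
        if features:
            np_data = np.vstack((np_data, features))

    # Drop the placeholder row before building the dataframe
    df = pd.DataFrame(data=np_data[1:,:], columns=df_columns)
    return df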
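
Calling `np.vstack()` inside the loop copies the whole array on every iteration, which adds up across tens of thousands of records. One alternative is to collect the rows in a plain list and build the dataframe once at the end. The sketch below uses stand-in data rather than real Person objects; `fake_extractor()` and the shortened column list are placeholders, not part of the project's code:

```python
import pandas as pd

# Shortened, illustrative column list
df_columns = ["person_id", "person_index_age", "readmit_30"]


def fake_extractor(person):
    """Stand-in for apply_extractors(): returns a feature list or None."""
    if person["age"] > 120:
        return None
    return [person["id"], person["age"], person["readmit"]]


persons = [
    {"id": 1, "age": 64, "readmit": 0},
    {"id": 2, "age": 300, "readmit": 0},  # skipped by the age check
    {"id": 3, "age": 47, "readmit": 1},
]

# Collect one feature list per kept record
rows = []
for person in persons:
    features = fake_extractor(person)
    if features is not None:
        rows.append(features)

# Build the dataframe in a single step; no placeholder row needed
df = pd.DataFrame(rows, columns=df_columns)
```

This also removes the need to slice off the row of zeros afterwards.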

So in short, we're going through the entire process here to extract features. For each patient record we iterate through (more than 40k in the full set), we: 
- Create an OMOP Person object and associated records (ICD codes, etc)
- Calculate an index admission, and assign that to the Person itself 
- Extract and calculate features based off of records with their index admission as anchor
- Determine whether they represented a readmit within 30 days of their index admission discharge
- Assign these features to the dataframe as binary or continuous values
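
The readmission check in the fourth step boils down to a date comparison. Here is a rough sketch, assuming we already have the index discharge time and the start times of any later admissions. The function name, arguments, and dates are hypothetical; the actual project works from OMOP records:

```python
from datetime import datetime, timedelta


def is_readmit_30(index_discharge, later_admit_times, window_days=30):
    """Return 1 if any later admission starts within window_days of the index discharge."""
    window = timedelta(days=window_days)
    for admit_time in later_admit_times:
        gap = admit_time - index_discharge
        # Only count admissions that start after the index discharge
        if timedelta(0) < gap <= window:
            return 1
    return 0


# Made-up dates for illustration
index_discharge = datetime(2012, 3, 10)
later_admits = [datetime(2012, 3, 25), datetime(2012, 9, 1)]

readmit_flag = is_readmit_30(index_discharge, later_admits)
```

Returning 0 or 1 rather than a boolean keeps the value consistent with the other binary columns.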

<br></br>

For all of the records in MIMIC, this can take a while...

<br></br>

<img src='https://media.giphy.com/media/LQoYS7mhDQkRG/giphy.gif' width="400" height="400" align='float' margin-right=50px />
<br></br>








And...  Almost...

<br></br>

Done! After doing our data munging and preprocessing, we're finally at the stage where all of our data is ready and sitting in a pandas dataframe:

```
35446 entries, 0 to 35445
memory usage: 15.7 MB
58 features
```

We can use `df.head()` to check out the first few lines.
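
As a quick illustration on a toy dataframe (the values below are made up, not MIMIC data), `df.head()` returns the first five rows by default, and `df.info()` prints the entry count, column dtypes, and memory usage:

```python
import pandas as pd

# Toy data for illustration only
toy_df = pd.DataFrame({
    "person_id": [1, 2, 3, 4, 5, 6],
    "person_index_age": [64, 47, 71, 55, 38, 80],
    "readmit_30": [0, 1, 0, 0, 1, 0],
})

first_rows = toy_df.head()  # first 5 rows by default
print(first_rows)

toy_df.info()  # prints entry count, dtypes, and memory usage
```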

#####  **Next  |   [Thinking about Classification]()**