In [1]:
from IPython.core.display import HTML
import urllib2
HTML(urllib2.urlopen('https://gist.githubusercontent.com/mattlewissf/83989910849fdb4a04a72d431e84053f/raw/cefa015a9065665faccd0219774c7087be7d21a8/skeleton.css').read())

#### MIMIC Deep Dive - Creating Features from Diagnosis Codes
**[Intro](#intro)**   
**[30 Day Readmission](#30_day_readmission)**  
**[The MIMIC Dataset](#mimic_dataset)**  
**[Setting up the database](#setting_up_db)** 

<a id='dx_codes'></a>
If we frame features as 'something that might be useful for prediction, based off of the data', we'll want to come up with useful ways to reframe the data that we already have. In the case of the MIMIC data set, we know a lot about individual patients: hospital admissions, prescriptions, diagnoses, and personal demographics. But one of the questions that we are trying to capture in the model is roughly: how sick is this patient?

<br></br>

There are roughly 14,000 individual ICD9CM codes, so representing each diagnosis as its own binary feature isn't practical for our model. Instead, we're going to look at two different ways we can use these diagnosis codes to represent the health of a patient: **comorbidity indexes** and **clustering existing conditions**. 

<a id='icd9_codes'></a>
#### Getting ICD9 codes from patient admissions

Before we can explore how to use comorbidity and condition clustering, we need a way to extract ICD9CM codes for a particular patient. Here's some code that goes through a person's admission records for a lookback period (by default, the 365 days before the index admission). We're using a `parse_dx_code` function from a CCS module we wrote (more on that later) to make sure that the code represents a real ICD9 code. 

In [None]:
from dateutil.relativedelta import relativedelta
# parse_dx_code comes from the CCS module described later in this post


def get_person_icd_codes(person, period=365):
    '''
    goes through conditions assigned to patient (and not to specific admission)
    - uses ccs module to parse icd9
    - only keeps visits that start within `period` days before the index admission
    '''
    index_start = person.index_admission.visit_start_date
    period_start = index_start - relativedelta(days=period)

    # group raw ICD9 codes by the admission they were recorded on
    conditions = {}
    for condition in person.conditions:
        conditions.setdefault(str(condition.admission_id), []).append(condition.icd9_code)

    codes = []
    for visit in person.visit_occurances:
        # keep visits that start inside the lookback window
        if period_start < visit.visit_start_date <= index_start:
            # a visit may have no recorded conditions, so fall back to an empty list
            for raw_code in conditions.get(str(visit.visit_occurance_id), []):
                try:
                    codes.append(parse_dx_code(raw_code))
                except TypeError:
                    print("Type Error", raw_code)
    return codes
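
The lookback window is the part of this function that is easiest to get wrong, so here is a minimal sketch of it in isolation. `in_lookback_window` is a hypothetical helper written for this example (it is not part of the project code), and it uses the standard library's `timedelta` rather than `relativedelta`, which is equivalent when the offset is a number of days.

```python
from datetime import date, timedelta


def in_lookback_window(visit_start, index_start, period=365):
    """Return True if a visit starts within `period` days before the index admission.

    Hypothetical helper for illustration: the window is open at its start
    and closed at the index admission date.
    """
    period_start = index_start - timedelta(days=period)
    return period_start < visit_start <= index_start


index_start = date(2012, 6, 1)
visits = [date(2011, 5, 1), date(2011, 12, 15), date(2012, 6, 1), date(2012, 7, 4)]

# only the visits inside the window (including the index admission itself) are kept
kept = [v for v in visits if in_lookback_window(v, index_start)]
print(kept)
```

Visits before the window opens and visits after the index admission are both dropped, which keeps information from the future out of the features.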

<a id='charlson'></a>
####  Exploring severity of illness with the Charlson Comorbidity Index

In particular, the concept of **comorbidity** is useful here: it attempts to combine a patient's individual (and usually serious) conditions into a single, predictive variable. There are a few different methods for calculating a comorbidity score, with the Charlson Comorbidity Index being perhaps the best known. For our purposes, we’re going to use the updated Charlson Comorbidity Index ('enhanced ICD9-CM'), as outlined in [Quan 2005](http://czresearch.com/dropbox/Quan_MedCare_2005v43p1130.pdf).

The Charlson Comorbidity method looks for different categories of diagnoses, as represented by ICD codes (in our case, ICD9 codes), in order to approximate an overall concept of comorbidity. Different conditions are given different weights based on their projected impact on mortality, and age is factored in as well, giving a total CACI (Charlson Age Comorbidity Index) score. 

![An online Elixhauser Index calculator](http://i.imgur.com/Sm7DYFx.png)

In order to extract Charlson categories from our patient data, we gathered ICD code definitions from this paper <sup>2</sup>. Because the codes are presented as ranges with wildcards (e.g. ICD9 codes '334.x-335.x'), we had to build a way to parse the code definitions and return all the relevant codes ([here's the code](https://github.com/mattlewissf/mimic/blob/5f5c2b494051639c777eea76fe670654832b7e55/mimic_package/data_model/definitions.py#L213)). 
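
To make the wildcard-and-range idea concrete, here is a minimal sketch. It is not the project's implementation (which expands each definition into full code lists): instead it matches a code against parsed prefix ranges. The function names are hypothetical, the definition string in the example is made up for illustration, and the sketch assumes both ends of a range are written at the same precision.

```python
def parse_definition(definition):
    """Turn a string like '334.x-335.x, 344.9' into (low_prefix, high_prefix) pairs."""
    ranges = []
    for part in definition.split(','):
        part = part.strip()
        if not part:
            continue
        low, _, high = part.partition('-')
        high = high or low  # a single code is a range of one
        # drop dots and trailing wildcards: '334.x' -> '334', '344.0' -> '3440'
        low = low.strip().replace('.', '').rstrip('xX')
        high = high.strip().replace('.', '').rstrip('xX')
        ranges.append((low, high))
    return ranges


def code_in_definition(code, ranges):
    """Check an ICD9 code (with or without a dot) against parsed ranges."""
    code = code.replace('.', '')
    for low, high in ranges:
        prefix = code[:len(low)]
        # compare only as many digits as the definition specifies
        if len(prefix) == len(low) and low <= prefix <= high:
            return True
    return False


ranges = parse_definition('334.x-335.x, 344.0-344.6, 344.9')
print(code_in_definition('33520', ranges))
print(code_in_definition('3447', ranges))
```

Matching on prefixes avoids enumerating every code, while expanding to full code sets (as the project does) makes later comparisons a simple set intersection.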

<br></br>


This gives us a set of full ICD9 codes for each category, and since we’ve already extracted the relevant ICD codes from each patient's admissions within a year of their index admission date, we're able to compare those codes with the Charlson categories. 

<br></br>

Next we wanted to calculate the actual CACI score for each patient. This is pretty straightforward: the presence of a given condition warrants 1, 2, 3, or 6 points based on severity.
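
As a rough sketch of the scoring step (not the project's code), the snippet below sums the original Charlson weights for the categories a patient belongs to and then adds age points. The names `CHARLSON_WEIGHTS`, `age_points`, and `caci_score` are hypothetical, and the age rule used here (one point per decade from age 50, capped at 4) is an assumed common convention rather than something taken from this post.

```python
# original Charlson weights, keyed by the category names used in the feature table
CHARLSON_WEIGHTS = {
    'Myocardial infarction': 1,
    'Congestive heart failure': 1,
    'Peripheral vascular disease': 1,
    'Cerebrovascular disease': 1,
    'Dementia': 1,
    'Chronic pulmonary disease': 1,
    'Rheumatologic disease': 1,
    'Peptic ulcer disease': 1,
    'Mild liver disease': 1,
    'Diabetes without chronic complications': 1,
    'Diabetes with chronic complications': 2,
    'Hemiplegia or paraplegia': 2,
    'Renal disease': 2,
    'Any malignancy, including leukemia and lymphoma': 2,
    'Moderate or severe liver disease': 3,
    'Metastatic solid tumor': 6,
    'AIDS/HIV': 6,
}


def age_points(age):
    """Assumed convention: one point per decade from age 50, capped at 4."""
    if age < 50:
        return 0
    return min(4, (int(age) - 40) // 10)


def caci_score(categories, age):
    """Sum the weights of the patient's categories, then add the age points."""
    comorbidity = sum(CHARLSON_WEIGHTS[c] for c in set(categories))
    return comorbidity + age_points(age)


score = caci_score(['Congestive heart failure', 'Metastatic solid tumor'], 67)
print(score)
```

Each category counts once no matter how many matching codes a patient has, which is why the categories are de-duplicated before summing.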

We used the Charlson scoring concept in two ways to generate features for our model: 

1. Determining 'membership' in a particular category (e.g. hypothyroidism) and giving a binary value

2. Calculating the appropriate CACI score for the patient and returning that as a continuous feature

<br></br>


Here it is as a snippet of the feature dataframe:

| ...| liver disease | hypothyroidism| metastatic cancer  | CACI Score |
| --- | ------------- |:------------- | ----------------  | -----------| 
|     | 0             | 0             | 0                 |         1  | 
|     | 0             | 1             |  0                |         3  | 
|     | 1             | 0             |  1                |          8 | 

**Current features [v2]**

| Type | Feature | Type | Feature |
| --- | --- | --- | --- |
| Admission | person_id | Charlson | AIDS/HIV |
|  | person_index_age |  | Cerebrovascular disease |
|  | index_admission_length |  | Rheumatologic disease |
|  | person_gender |  | Renal disease |
|  | admission_rate |  | Hemiplegia or paraplegia |
|  | URGENT |  | Peripheral vascular disease |
|  | ELECTIVE |  | Mild liver disease |
|  | EMERGENCY |  | Any malignancy, including leukemia and lymphoma |
| Ethnicity | race_other |  | Metastatic solid tumor |
|  | white |  | Chronic pulmonary disease |
|  | black |  | Congestive heart failure |
|  | latino |  | Peptic ulcer disease |
|  | asian |  | Myocardial infarction |
|  | multi_racial |  | Dementia |
|  | pacific_islander |  | Moderate or severe liver disease |
|  | american_indian |  | Diabetes with chronic complications |
|  | unknown |  | Diabetes without chronic complications |
| Marital Status | single |  | Charlson score |
|  | cohab |  |  |
|  | separated |  |  |





<a id='ccs_modules'></a>
#### Creating existing condition buckets using CCS

Another tool we can use to group diagnosis codes and better inform our model is CCS, developed by the [Healthcare Cost and Utilization Project (HCUP)](https://www.hcup-us.ahrq.gov/overview.jsp): 

```
Clinical Classifications Software (CCS) is a tool for clustering patient diagnoses and procedures into a manageable number of clinically meaningful categories... this "clinical grouper" makes it easier to quickly understand patterns of diagnoses and procedures so that health plans, policy makers, and researchers can analyze costs, utilization, and outcomes associated with particular illnesses and procedures. 
```

In short, CCS is a set of ICD9 code groupings that describe different categories of conditions. CCS can break the codes down at different levels of granularity: 

Table 1. Examples of single-level CCS diagnosis categories 

    98. Essential hypertension
    99. Hypertension with complications and secondary hypertension
    100. Acute myocardial infarction
    101. Coronary atherosclerosis and other heart disease 
    
Table 3. Examples of multi-level CCS diagnosis categories 

    7. Diseases of the circulatory system
      7.1. Hypertension
        7.1.1. Essential hypertension [98]
        7.1.2. Hypertension with complications and secondary hypertension [99]
          7.1.2.1. Hypertensive heart and/or renal disease
          7.1.2.2. Other hypertensive complications
      7.2. Diseases of the heart
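
The multi-level labels encode their own hierarchy, so a fine-grained label can be rolled up to any coarser level. Here is a minimal sketch of that idea; `ccs_multilevel_ancestors` is a hypothetical helper and assumes dotted labels like the ones in the table above.

```python
def ccs_multilevel_ancestors(label):
    """Return a multi-level CCS label together with all of its coarser parent levels."""
    # tolerate the trailing dot used in the table, e.g. '7.1.2.1.'
    parts = label.strip().strip('.').split('.')
    return ['.'.join(parts[:i]) for i in range(1, len(parts) + 1)]


levels = ccs_multilevel_ancestors('7.1.2.1.')
print(levels)
```

Choosing a shallower level gives fewer, broader features (for example everything under "Diseases of the circulatory system"), while a deeper level gives more specific but sparser ones.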


**CCS - a Python module for retrieving CCS code groups**

<br></br>

HCUP provides some code implementations of their CCS concept, but I hadn't found anything that makes it easy for someone working in Python to check codes against the CCS concept. 

<br></br>

So I built one: ***[CCS](https://github.com/mattlewissf/ccs)*** (it could use a flashier name). It lets you choose a level (single or multi-level) and get back sets of codes for those definitions. It provides both ICD9 and ICD10 codes. 

#### Using CCS to extract existing condition features

Using the CCS module, we can then check a patient's diagnosis codes against those in a particular CCS level, and return membership in particular groups. Here's a function we wrote that checks a user's codes against a codeset of CCS codes and returns membership in each group as a feature. Note that the features generated don't have names (the function can be used at multiple levels, which return different numbers of features). 


In [None]:
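import collections


def check_against_ccs(user_codes, codeset, code_type='dx', code_level='single'):
    '''
    checks a person's codes against the CCS groups for a code type and level
    - returns an ordered mapping of CCS group -> 1 (member) or 0 (not a member)
    '''
    # pick the CCS grouping for the requested code type (dx / px) and level
    mapper = {'dx': {'single': codeset.dx_single_level_codes, 'category': codeset.dx_category_level_codes, 'multi': codeset.dx_multilevel_codes}, 
              'px': {'single': codeset.px_single_level_codes, 'category': codeset.px_category_level_codes, 'multi': codeset.px_multilevel_codes}}
    ccs_groups = mapper[code_type][code_level]

    user_set = set(user_codes)
    # sort the groups so every person gets the features in the same order
    ccs_features = collections.OrderedDict()
    for group, group_codes in sorted(ccs_groups.items()):
        # 1 if the person has at least one code in this CCS group, else 0
        ccs_features[group] = 1 if user_set & group_codes else 0

    return ccs_features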

This gives us another set of features that represent clustered diagnoses. We'll represent these as binary features for each user.
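
Since the features that come back are keyed only by CCS group, one possible way to make them easier to read in a dataframe is to prefix each key with the code type and level. This is only a sketch: `name_ccs_features` is a hypothetical helper, and the three group keys in the example are the single-level categories from the table earlier, with made-up membership values.

```python
import collections


def name_ccs_features(ccs_features, code_type='dx', code_level='single'):
    """Prefix each CCS group key so it can serve as a readable column name."""
    named = collections.OrderedDict()
    for group, flag in ccs_features.items():
        named['ccs_{}_{}_{}'.format(code_type, code_level, group)] = flag
    return named


# illustrative membership values for three single-level CCS groups
features = collections.OrderedDict([('98', 0), ('99', 1), ('100', 1)])
named = name_ccs_features(features)
print(list(named.items()))
```

Including the level in the name also keeps single-level and multi-level features from colliding if both are added to the same dataframe.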

<br></br>

**Current Features [v3]**

| 	Type	|	Feature	| 	Type	| 	Feature	| 
|	---	| 	---	|	----	| 	---	| 
|	Admission                   	|	person_id	|	Charlson	|	Metastatic solid tumor	| 
|		|	person_index_age	|		|	Chronic pulmonary disease	| 
|		|	index_admission_length	|		|	Congestive heart failure	| 
|		|	person_gender	|		|	Peptic ulcer disease	| 
|		|	admission_rate	|		|	Myocardial infarction	| 
|		|	URGENT	|		|	Dementia	| 
|		|	ELECTIVE	|		|	Moderate or severe liver disease	| 
|		|	EMERGENCY	|		|	Diabetes with chronic complications	| 
|	Ethnicity	|	race_other	|		|	Diabetes without chronic complications	| 
|		|	white	|		|	charlson_score	| 
|		|	black	|	CCS Level 1	|	1	| 
|		|	latino	|		|	10	| 
|		|	asian	|		|	11	| 
|		|	multi_racial	|		|	12	| 
|		|	middle_eastern	|		|	13	| 
|		|	pacific_islander	|		|	14	| 
|		|	american_indian	|		|	15	| 
|		|	unknown	|		|	16	| 
|	Marital Status	|	single	|		|	17	| 
|		|	cohab	|		|	18	| 
|		|	separated	|		|	2	| 
|	Charlson	|	AIDS/HIV	|		|	3	| 
|		|	Cerebrovascular disease	|		|	4	| 
|		|	Rheumatologic disease	|		|	5	| 
|		|	Renal disease	|		|	6	| 
|		|	Hemiplegia or paraplegia	|		|	7	| 
|		|	Peripheral vascular disease	|		|	8	| 
|		|	Mild liver disease	|		|	9	| 
|		|	Any malignancy, including leukemia and lymphoma	|		|		| 



#####  **Next  |   [Pandas and Extraction]()**