# HW3: Clinical concepts

UMLS is a powerful (although flawed) tool for extracting clinical concepts from raw clinical text. Here we explore a subset of the MIMIC-III discharge summaries and use the extracted concepts to build relationships between diseases and symptoms.

# 0. Load data

In [1]:
import pandas as pd
%matplotlib inline
from matplotlib import pyplot as plt

import numpy as np

We have pulled ~10k patient discharge summaries; extracted concepts are provided for ~2k of those patients.

In [2]:
concepts = pd.read_csv('adult_dc_concepts.csv')

In [3]:
dc = pd.read_csv('adult_dc_summaries.csv')

Because of how clinical concepts were extracted, the index from the discharge summaries corresponds to the `index` column of the clinical concepts.

In [4]:
dc['index'] = dc.index
df = dc.merge(concepts, on='index', how='right')
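
To make the join above concrete, here is a minimal sketch on toy data. The frames `dc_toy` and `concepts_toy` and all of their values are made up for illustration; only the `index` key and the `how='right'` merge mirror the cell above.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (all values are invented).
dc_toy = pd.DataFrame({'icustay_id': [101, 102, 103],
                       'dc_chart': ['note a', 'note b', 'note c']})
concepts_toy = pd.DataFrame({'index': [0, 0, 2],
                             'preferred_name': ['Patients', 'Hospitals', 'Fever']})

# Expose the row position as a column so it can serve as the join key.
dc_toy['index'] = dc_toy.index

# how='right' keeps one row per extracted concept and drops summaries
# that have no concepts at all (row 1 in this toy example).
merged = dc_toy.merge(concepts_toy, on='index', how='right')
print(merged[['icustay_id', 'preferred_name']])
```

The result has one row per extracted concept, so a summary with many concepts is repeated many times, which is why `df` ends up far longer than `dc`.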

# 1. What are clinical concepts? How do they work?

We can do some basic statistics for our data.

In [5]:
print('{} patients'.format(len(df['index'].value_counts())))
print('{} extracted concepts'.format(len(df)))

2090 patients
1934817 extracted concepts


In [None]:
# TODO: 1.1 How many extracted concepts per patient on average? Plot the histogram.

Let's take a look at the table we have now. 

In [7]:
df.head(1)

| | subject_id | hadm_id | icustay_id | gender | age | mort_hosp | mort_icu | mort_oneyr | first_hosp_stay | first_icu_stay | ... | pos_info | mm | trigger | semtypes | patientid | preferred_name | score | location | tree_codes | cui |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 64 | 172056 | 232593 | 0 | 26.0 | 0 | 0 | 0 | 1 | 1 | ... | `16646/7;16444/7;16219/7;15898/7;[14554/7],[148...` | MMI | `["*^patient"-tx-41-"patient"-noun-0,"*^patient...` | `[podg]` | | Patients | 226.31 | TX | M01.643 | C0030705 |


In particular, a few columns are of interest:
 - `trigger`: the source word(s) from which the clinical concept was extracted
 - `semtypes`: the semantic type(s) of the extracted concept ([more info](https://metamap.nlm.nih.gov/Docs/SemanticTypes_2018AB.txt))
 - `preferred_name`: the human-readable name of the concept
 - `score`: the UMLS-assigned score of the extracted concept; larger means more confident
 - `dc_chart`: the raw discharge summary from which the concept was extracted
 - `cui`: the concept unique identifier (CUI) of the extracted concept
 
Let's take a look at the patient with `icustay_id = 232593`.

In [8]:
print('{} extracted concepts for icustay_id = 232593'.format(len(df[df['icustay_id'] == 232593])))

959 extracted concepts for icustay_id = 232593


In [10]:
df[df['icustay_id'] == 232593][['trigger', 'semtypes', 'preferred_name','cui']].head()

| | trigger | semtypes | preferred_name | cui |
|---|---|---|---|---|
| 0 | `["*^patient"-tx-41-"patient"-noun-0,"*^patient...` | `[podg]` | Patients | C0030705 |
| 1 | `["TO"-tx-20-"to"-adv-0,"TO"-tx-19-"to"-adv-0,"...` | `[geoa]` | Togo | C0040363 |
| 2 | `["TO"-tx-20-"to"-adv-0,"TO"-tx-19-"to"-adv-0,"...` | `[aapp,enzy]` | Tryptophanase | C0041260 |
| 3 | `["HOSPITAL"-tx-20-"hospital"-noun-0,"HOSPITAL"...` | `[hcro,mnob]` | Hospitals | C0019994 |
| 4 | `["yeast"-tx-15-"yeast"-noun-0,"yeast"-tx-14-"y...` | `[fngs]` | Saccharomyces cerevisiae | C0036025 |


In the output above, the `trigger` column shows the source word(s) behind each extracted clinical concept. For example, in row 4 the trigger word "yeast" yields the clinical concept `Saccharomyces cerevisiae` (`cui = C0036025`). According to the [Semantic Type guide](https://metamap.nlm.nih.gov/Docs/SemanticTypes_2018AB.txt), `semtypes=[fngs]` means this CUI is a fungus.

In [None]:
# TODO: 1.2 Analyze the patient with icustay_id = 232593, specifically why one concept has "Fruit" in the preferred_name column.

# 2. Diseases and symptoms

We are particularly interested in diseases and symptoms. 

In [139]:
df['semtypes'].value_counts().head()

[fndg]    198283
[qnco]    163079
[inpr]    136044
[qlco]    125190
[ftcn]     89938
Name: semtypes, dtype: int64

Looking through the [MetaMap documentation](https://metamap.nlm.nih.gov/Docs/SemanticTypes_2018AB.txt), we can see that `fndg` refers to a finding whereas `qnco` refers to a quantitative concept.
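
Note that `value_counts()` on the raw column treats a multi-type entry such as `[aapp,enzy]` as its own category. The sketch below shows one possible way to count each semantic type separately. It assumes `semtypes` is stored as a bracketed, comma-separated string (as it appears in the outputs above) and that `Series.explode` is available (pandas 0.25 or later); the toy frame is made up.

```python
import pandas as pd

# Toy semtypes column in the same bracketed string format as the data.
toy = pd.DataFrame({'semtypes': ['[fndg]', '[qnco]', '[aapp,enzy]',
                                 '[fndg]', '[hcro,mnob]']})

# Strip the brackets, split on commas, and give each type its own row.
types = (toy['semtypes']
         .str.strip('[]')
         .str.split(',')
         .explode())

# Count occurrences per individual semantic type.
type_counts = types.value_counts()
print(type_counts)
```

With this layout, a concept tagged with two types contributes one count to each of them.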

In [142]:
# TODO: 2.1 Which semtype corresponds to a disease and which semtype corresponds to a symptom?

Next we want to find the 5 most frequent diseases and the 5 most frequent symptoms.

In [151]:
# TODO: 2.2 What are the 5 most frequent diseases and the 5 most frequent symptoms? 

These extracted clinical concepts are not perfect (see 1.2). For example, the concept with `preferred_name = 'Infantile Neuroaxonal Dystrophy'` is surprisingly frequent.

We examine the source words (`trigger` in the data) that the clinical extraction relies on. The word "plan" is being mapped to "Infantile Neuroaxonal Dystrophy", which does not look right.

In [17]:
name = 'Infantile Neuroaxonal Dystrophy'
df[df['preferred_name'] == name][['trigger', 'semtypes', 'preferred_name']].head(3)

| | trigger | semtypes | preferred_name |
|---|---|---|---|
| 236 | `["PLAN"-tx-20-"plans"-verb-0]` | `[dsyn]` | Infantile Neuroaxonal Dystrophy |
| 8371 | `["PLAN"-tx-17-"plans"-noun-1]` | `[dsyn]` | Infantile Neuroaxonal Dystrophy |
| 9181 | `["PLAN"-tx-14-"plan"-noun-0,"PLAN"-tx-10-"plan...` | `[dsyn]` | Infantile Neuroaxonal Dystrophy |


If we look at the raw discharge summary, we see that the word "plan" is used in a normal context and not at all referring to "Infantile Neuroaxonal Dystrophy". We therefore choose to disregard this clinical concept entirely.

In [None]:
df[df['preferred_name'] == name]['dc_chart'].head(1).values
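
When inspecting many triggers, it can help to pull the matched source text out of the `trigger` string programmatically. The sketch below is one way to do that. The field layout (`"name"-location-position-"text"-part of speech-negation flag`) is an assumption inferred from the strings shown above, and `parse_trigger` is a hypothetical helper, not part of any library.

```python
import re

# One trigger entry, assumed layout: "NAME"-loc-pos-"text"-pos_tag-negated
TRIGGER_RE = re.compile(r'"([^"]*)"-(\w+)-(\d+)-"([^"]*)"-(\w*)-(\d)')

def parse_trigger(trigger):
    """Return a list of dicts, one per entry found in a trigger string."""
    entries = []
    for name, loc, pos, text, tag, neg in TRIGGER_RE.findall(trigger):
        entries.append({'name': name, 'location': loc, 'position': int(pos),
                        'text': text, 'part_of_speech': tag,
                        'negated': neg == '1'})
    return entries

parsed = parse_trigger('["PLAN"-tx-20-"plans"-verb-0]')
print(parsed[0]['text'])  # plans
```

Applied to the whole column (for example with `df['trigger'].apply(parse_trigger)`), this would make it easier to tabulate which source words drive a suspicious concept.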

We provide below a list of diseases and symptoms that we can ignore. Some are too broad (e.g. "Disease"), and some are mis-mapped. 

In [18]:
ignore_diseases = ['Disease',
                   'Communicable Diseases',
                   'Infantile Neuroaxonal Dystrophy',
                   'SHORT STATURE, ONYCHODYSPLASIA, FACIAL DYSMORPHISM, AND HYPOTRICHOSIS SYNDROME',
                   'SYNOVITIS, GRANULOMATOUS, WITH UVEITIS AND CRANIAL NEUROPATHIES (disorder)',
                   'Pneumocystis jiroveci pneumonia',
                   'Ventricular Fibrillation, Paroxysmal Familial, 1',
                   'Nuclear non-senile cataract',
                   'Macrophage Activation Syndrome',
                   'MYOTONIC DYSTROPHY 1',
                   'MICROCEPHALY, EPILEPSY, AND DIABETES SYNDROME',
                   'Illness (finding)',
                   'Oculocutaneous albinism type 1A',
                   'POLYARTERITIS NODOSA, CHILDHOOD-ONSET'
                  ]

ignore_symptoms = ['Discharge, body substance',
                   'Mass of body region',
                   'Clubbing',
                   'Symptoms',
                   'Signs and Symptoms'
                  ]


In [None]:
# TODO: 2.3 Pick 3 diseases and/or symptoms listed in `ignore_diseases` and `ignore_symptoms`. 
# What is the trigger word? Why did we choose to remove it? 

In [155]:
# TODO: 2.4 Besides removing problematic clinical extractions, what is another method of data cleaning for extracted diseases and symptoms?

# 3. Building a graph of medicine

Using our extracted diseases and symptoms, we can now build relational structures between the two. We want to construct a co-occurrence matrix of diseases and symptoms in which each cell `cooccur[i, j]` is the number of patients for whom disease i and symptom j co-occur.

In [None]:
topN = 100

# TODO: Get top 100 (or topN) most frequent diseases and symptoms
diseases = None
symptoms = None

# Remove diseases and symptoms that are erroneous
diseases = [d for d in diseases if d not in ignore_diseases]
symptoms = [s for s in symptoms if s not in ignore_symptoms]

# TODO: get unique icustay_ids to iterate through
user_ids = None

N_diseases, N_symptoms = len(diseases), len(symptoms)
N = len(user_ids)
print('{} {} {}'.format(N_diseases, N_symptoms, N))

# Size the matrix after filtering so its shape matches the final lists
cooccur = np.zeros((N_diseases, N_symptoms))

In [None]:
# Iterate through patient stays
for uid in user_ids:
    # Make a sub_df containing one patient's extracted clinical concepts
    sub_df = df[df['icustay_id'] == uid]
    
    # Iterate through diseases of interest
    for d_idx, d in enumerate(diseases):  
        # TODO: Check if disease is in the patient's extracted clinical concepts
        d_in_uid = None
        
        for s_idx, s in enumerate(symptoms):
            # TODO: Check if symptom is in the patient's extracted clinical concepts
            s_in_uid = None
            
            if d_in_uid and s_in_uid:
                cooccur[d_idx][s_idx] += 1
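
The nested loop above can also be expressed as a matrix product, which may be faster on the full data. The following is a sketch on toy data, not the required solution: the frame `toy`, the names `toy_diseases` and `toy_symptoms`, and all values are made up, and it assumes diseases and symptoms are identified by `preferred_name`.

```python
import numpy as np
import pandas as pd

# Toy concept table: one row per (patient stay, extracted concept).
toy = pd.DataFrame({
    'icustay_id':     [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'preferred_name': ['Pneumonia', 'Fever', 'Fever',
                       'Pneumonia', 'Fever', 'Coughing',
                       'Hypothyroidism', 'Fever', 'Coughing'],
})
toy_diseases = ['Pneumonia', 'Hypothyroidism']
toy_symptoms = ['Fever', 'Coughing']

# Patient-by-concept indicator: True if the concept appears at least once.
present = pd.crosstab(toy['icustay_id'], toy['preferred_name']) > 0

# reindex guards against names that never occur in the table.
D = present.reindex(columns=toy_diseases, fill_value=False).to_numpy().astype(int)
S = present.reindex(columns=toy_symptoms, fill_value=False).to_numpy().astype(int)

# Entry (i, j) counts the patients with both disease i and symptom j.
toy_cooccur = D.T @ S
print(toy_cooccur)
```

Because the indicator is thresholded at "appears at least once", repeated mentions within one stay (such as the duplicate "Fever" for stay 1) are counted only once per patient, matching the per-patient definition of `cooccur`.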

In [None]:
# Get counts of each disease and symptom

disease_ct = np.zeros(N_diseases)
symptom_ct = np.zeros(N_symptoms)

for d_idx, d in enumerate(diseases):   
    # TODO: Counts for each disease
    disease_ct[d_idx] = None
    
for s_idx, s in enumerate(symptoms):   
    # TODO: Counts for each symptom
    symptom_ct[s_idx] = None

In [158]:
# TODO: 3.1 Read https://www.nature.com/articles/sdata201432#f1 and compare our approach and their approach

In [159]:
# TODO: 3.2 How can we estimate the probabilities from our cooccurrence matrix and count arrays? How can we estimate the lift?
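
As a reference point for the arithmetic, here is a sketch using one common definition of lift, lift(S -> D) = P(D, S) / (P(D) P(S)), with probabilities estimated as counts divided by the number of patients. The definition chosen, the variable names, and all numbers are assumptions for illustration only; check them against the course material before relying on them.

```python
import numpy as np

# Toy counts (made up): 2 diseases x 3 symptoms over N_toy = 10 patients.
N_toy = 10
cooccur_toy = np.array([[4, 1, 2],
                        [1, 3, 0]], dtype=float)
disease_ct_toy = np.array([5, 4], dtype=float)
symptom_ct_toy = np.array([6, 4, 2], dtype=float)

# Empirical probabilities estimated from counts.
p_joint = cooccur_toy / N_toy        # P(D_i, S_j)
p_disease = disease_ct_toy / N_toy   # P(D_i)
p_symptom = symptom_ct_toy / N_toy   # P(S_j)

# lift[i, j] = P(D_i, S_j) / (P(D_i) * P(S_j)); values above 1 suggest the
# pair co-occurs more often than independence would predict.
lift = p_joint / np.outer(p_disease, p_symptom)

# Symptom indices for the first disease, ordered from highest to lowest lift.
ranked = np.argsort(-lift[0])
print(lift.round(3))
print(ranked)
```

A rare symptom can receive a very high lift from only a handful of co-occurrences, so it is worth looking at the raw counts alongside the ranking.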

In [None]:
# TODO: 3.3 Calculate the lift(S -> D) where D = "Pneumonia" or D = "Hypothyroidism" and report the top 5 symptoms by lift.

# 4. Not graded but please answer

In [None]:
# TODO: 4.1 How long did you spend on this problem set?
# TODO: 4.2 How many hours a week do you spend on this class (attending lecture, doing readings, working on problem sets) on average?