
# MIMIC-III EDA and prep

Cleaning and preparation of diagnostic notes and related data from the MIMIC-III dataset.

Dataset location:

    wget -r -N -c -np --user [username] --ask-password https://physionet.org/files/mimiciii/1.4/

## Steps

1. Import data
    * NOTEEVENTS.csv
    * DIAGNOSES_ICD.csv
2. DIAGNOSES_ICD prep
    * Action: For each HADM_ID, keep the record where SEQ_NUM == 1, which holds the primary diagnosis.
    * Data check: SEQ_NUM == 1 for all records, no nulls
    * Data check: HADM_ID should be unique across all records, no nulls
3. NOTEEVENTS prep
    * Action: Trim to only include HADM_ID and TEXT columns
    * Action: Drop rows where HADM_ID is null (231,836 rows)
    * Data check: TEXT has no null values
    * Data check: HADM_ID has no null values
4. Combine TEXT in NOTEEVENTS for each HADM_ID
    * Action: Create a new dataframe with one row per unique HADM_ID, with all of that admission's TEXT values combined
    * Data check: combined frame should have higher text length statistics (e.g. mean, max, stdev) than the uncombined frame
    * Data check: combined TEXT has no null values
    * Data check: unique HADM_ID across all records
    * Data check: HADM_ID has no null values
5. Output 1: Create dataframe where TEXT is not combined and each row has the ICD9 code for SEQ_NUM == 1
    * Data check: check that HADM_ID values all exist in both NOTES and DIAGNOSES before merge
    * Action: Merge NOTES and DIAGNOSES (with uncombined TEXT) 
    * Data check: only common HADM_ID should remain after merge
    * Data check: TEXT has no null values
    * Data check: ICD9_CODE has no null values
    * Data check: SEQ_NUM == 1 for all records 
    * Persist dataframe
6. Output 2: Create dataframe where TEXT is combined per HADM_ID and each row has the ICD9 code for SEQ_NUM == 1
    * Data check: check that HADM_ID values all exist in both NOTES and DIAGNOSES before merge
    * Action: Merge NOTES and DIAGNOSES (with combined TEXT)
    * Data check: only common HADM_ID should remain after merge
    * Data check: TEXT has no null values
    * Data check: ICD9_CODE has no null values
    * Data check: SEQ_NUM == 1 for all records 
    * Persist dataframe
7. Explore ICD9_CODE frequency in combined and separate NOTES data sets
8. Persist custom set of stopwords for later use
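
The steps above can be sketched on toy data as follows. This is a minimal illustration rather than the notebook's actual implementation: the frames and values are made up, and it assumes a combined-text output is wanted alongside the uncombined one (step 7 refers to both).

```python
import pandas as pd

# Toy stand-ins for DIAGNOSES_ICD and NOTEEVENTS (all values are made up)
diagnoses = pd.DataFrame({
    "HADM_ID": [100, 100, 200, 200, 300],
    "SEQ_NUM": [1.0, 2.0, 2.0, 1.0, 1.0],
    "ICD9_CODE": ["4019", "486", "5849", "4280", "51881"],
})
notes = pd.DataFrame({
    "HADM_ID": [100.0, 100.0, 200.0, None, 400.0],
    "TEXT": ["note a", "note b", "note c", "orphan note", "note d"],
})

# Step 2: keep the primary diagnosis (SEQ_NUM == 1) for each admission
primary = diagnoses.loc[diagnoses["SEQ_NUM"] == 1, ["HADM_ID", "SEQ_NUM", "ICD9_CODE"]]
assert primary["HADM_ID"].is_unique
assert primary.notna().all().all()

# Step 3: trim to HADM_ID and TEXT, then drop rows without an admission id
notes = notes[["HADM_ID", "TEXT"]].dropna(subset=["HADM_ID"])
notes = notes.astype({"HADM_ID": "int64"})
assert notes["TEXT"].notna().all()

# Step 4: one row per admission with all of its notes joined
combined = notes.groupby("HADM_ID", as_index=False)["TEXT"].agg("\n".join)
assert combined["HADM_ID"].is_unique
assert combined["TEXT"].str.len().mean() >= notes["TEXT"].str.len().mean()

# Outputs: an inner merge keeps only HADM_ID values present on both sides
separate_out = notes.merge(primary, on="HADM_ID", how="inner")
combined_out = combined.merge(primary, on="HADM_ID", how="inner")
assert (separate_out["SEQ_NUM"] == 1).all()
assert (combined_out["SEQ_NUM"] == 1).all()
```

An inner merge is used here so that admissions missing from either table drop out, which is what the "only common HADM_ID should remain" check expects.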


# Initialize environment    

In [1]:
import sys
import os

WORKING_DIR = os.getcwd()

# Magritte has the utility functions we will be using
# Set MAGRITTE_DIR to where you checked out the github repo
MAGRITTE_DIR = f'{WORKING_DIR}/../../magritte'
UTILITIES_DIR = f'{MAGRITTE_DIR}/utilities'

# Directory for loading and storing files
DATA_DIR = f'{WORKING_DIR}/../../data/mimiciii'

# Add UTILITIES_DIR to the path so its modules can be imported
sys.path.append(UTILITIES_DIR)

In [2]:
import pandas as pd
import DataUtils
import pickle

# Load Data (MIMIC-III Dataset)

In [3]:
%%time
# Tables used from MIMIC-III (only DIAGNOSES_ICD is loaded in this cell)
# 1) DIAGNOSES_ICD.csv.gz
# 2) NOTEEVENTS.csv.gz

DIAGNOSIS_ICD_DF = pd.read_csv(f'{DATA_DIR}/DIAGNOSES_ICD.csv.gz',
                              compression='gzip'
                             )


CPU times: user 151 ms, sys: 17 ms, total: 168 ms
Wall time: 170 ms
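
The cell above names NOTEEVENTS.csv.gz but loads only DIAGNOSES_ICD. A minimal sketch of reading the notes table follows, assuming the file is a gzipped CSV and that only the HADM_ID and TEXT columns are needed (step 3 trims to those two). `load_notes` is a hypothetical helper and is not part of DataUtils.

```python
import pandas as pd

def load_notes(path):
    """Read only HADM_ID and TEXT from a gzipped NOTEEVENTS export."""
    return pd.read_csv(
        path,
        compression="gzip",
        usecols=["HADM_ID", "TEXT"],  # skip the columns that are trimmed later anyway
    )

# Assumed usage, with DATA_DIR as defined earlier in the notebook:
# NOTEEVENTS_DF = load_notes(f'{DATA_DIR}/NOTEEVENTS.csv.gz')
```

Restricting `usecols` at read time avoids parsing columns that would be discarded immediately. HADM_ID will load as float64 while it still contains nulls.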


## DIAGNOSES_ICD summary

In [4]:
DataUtils.exploreDataframe(DIAGNOSIS_ICD_DF)

dataframe shape: (651047, 5)

dataframe info: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 651047 entries, 0 to 651046
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   ROW_ID      651047 non-null  int64  
 1   SUBJECT_ID  651047 non-null  int64  
 2   HADM_ID     651047 non-null  int64  
 3   SEQ_NUM     651000 non-null  float64
 4   ICD9_CODE   651000 non-null  object 
dtypes: float64(1), int64(3), object(1)
memory usage: 24.8+ MB
None

Null value count by column:


ROW_ID         0
SUBJECT_ID     0
HADM_ID        0
SEQ_NUM       47
ICD9_CODE     47
dtype: int64



First 5 in dataframe


Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,SEQ_NUM,ICD9_CODE
0,1297,109,172335,1.0,40301
1,1298,109,172335,2.0,486
2,1299,109,172335,3.0,58281
3,1300,109,172335,4.0,5855
4,1301,109,172335,5.0,4254



Last 5 in dataframe


Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,SEQ_NUM,ICD9_CODE
651042,639798,97503,188195,2.0,20280
651043,639799,97503,188195,3.0,V5869
651044,639800,97503,188195,4.0,V1279
651045,639801,97503,188195,5.0,5275
651046,639802,97503,188195,6.0,5569


In [5]:
DataUtils.showUniqueColVals(DIAGNOSIS_ICD_DF, 'ICD9_CODE')

Data type of column [ICD9_CODE] is: object
Total number of rows: 651047
Unique values in column: 6985 [percent unique: 1.0999999999999999%]
Null values in column: 47
List of unique values:
['40301' '486' '58281' ... 'E0070' '6940' '20930']

Top 5 records by frequency for ICD9_CODE
     ICD9_CODE  record_count
1962      4019         20703
2109      4280         13111
2098     42731         12891
2019     41401         12429
2957      5849          9119

Bottom 5 records by frequency for ICD9_CODE
     ICD9_CODE  record_count
6983     V9103             1
6854      V562             1
1336      3060             1
202      07953             1
1338      3062             1


(['4019', '4280', '42731', '41401', '5849'],
 ['V9103', 'V562', '3060', '07953', '3062'])
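
The internals of `DataUtils.showUniqueColVals` are not shown here. A rough plain-pandas equivalent of the frequency table, under the assumption that it counts rows per value and ignores nulls, might look like this (`code_frequencies` is a hypothetical name, and the toy data is made up):

```python
import pandas as pd

def code_frequencies(df, column):
    """Count rows per value, most frequent first (nulls are excluded)."""
    counts = df[column].value_counts()
    return counts.rename_axis(column).reset_index(name="record_count")

# Toy data for illustration only
toy = pd.DataFrame({"ICD9_CODE": ["4019", "4280", "4019", None, "486", "4019", "4280"]})
freq = code_frequencies(toy, "ICD9_CODE")
print(freq.head(2))   # most frequent codes
print(freq.tail(2))   # least frequent codes
```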


# DIAGNOSES_ICD explore and prep

In [7]:
# Info from original data prep
## labels = ['4019', '5849', '51881', '53081']
## diagnosis_icd_filter = diagnosis_icd.loc[diagnosis_icd['ICD9_CODE'].isin(labels)]
## diagnosis_icd_filter_drop = diagnosis_icd_filter.sort_values('SEQ_NUM').drop_duplicates('HADM_ID', keep='first')

In [8]:
# Info from original data prep
# labels = ['4019', '5849', '51881', '53081']

# 4019 is one of the research codes used
DIAGNOSIS_ICD_DF_4019 = DIAGNOSIS_ICD_DF.loc[DIAGNOSIS_ICD_DF['ICD9_CODE'].isin(['4019'])]
DataUtils.exploreDataframe(DIAGNOSIS_ICD_DF_4019, showRecords=1)

dataframe shape: (20703, 5)

dataframe info: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20703 entries, 53 to 651012
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   ROW_ID      20703 non-null  int64  
 1   SUBJECT_ID  20703 non-null  int64  
 2   HADM_ID     20703 non-null  int64  
 3   SEQ_NUM     20703 non-null  float64
 4   ICD9_CODE   20703 non-null  object 
dtypes: float64(1), int64(3), object(1)
memory usage: 970.5+ KB
None

Null value count by column:


ROW_ID        0
SUBJECT_ID    0
HADM_ID       0
SEQ_NUM       0
ICD9_CODE     0
dtype: int64



First 1 in dataframe


Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,SEQ_NUM,ICD9_CODE
53,1513,115,114585,12.0,4019



Last 1 in dataframe


Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,SEQ_NUM,ICD9_CODE
651012,639768,97488,161999,13.0,4019


In [9]:
# Get the 4019 rows where SEQ_NUM == 1 (i.e. 4019 is the primary diagnosis)
DIAGNOSIS_WHERE_SEQ_IS_1 = DIAGNOSIS_ICD_DF_4019.loc[DIAGNOSIS_ICD_DF_4019['SEQ_NUM'] == 1.0]
print(DIAGNOSIS_WHERE_SEQ_IS_1.head(5))

       ROW_ID  SUBJECT_ID  HADM_ID  SEQ_NUM ICD9_CODE
12341   10697         913   138570      1.0      4019
24200   38614        3466   157659      1.0      4019
41483   22137        1988   137710      1.0      4019
57583   41472        3747   112435      1.0      4019
77924   70574        6296   107183      1.0      4019


In [10]:
# Collect the HADM_IDs in the filtered dataset that have a SEQ_NUM == 1.0 row, so that all of their rows can be removed

# Get list of HADM_ID
HADM_ID_with_SEQ_1 = DIAGNOSIS_WHERE_SEQ_IS_1['HADM_ID'].unique()
print(f'There are {len(HADM_ID_with_SEQ_1)} unique HADM_ID having a 1.0 in the dataset')
print(f'HADM_ID where there is a SEQ_NUM==1: {HADM_ID_with_SEQ_1}')

There are 36 unique HADM_ID having a 1.0 in the dataset
HADM_ID where there is a SEQ_NUM==1: [138570 157659 137710 112435 107183 104644 101648 186291 175330 160898
 196106 155063 144308 137862 108356 102481 185828 135201 165198 144320
 137529 156768 175521 118532 167285 165975 105818 105254 158369 157074
 147717 152597 114043 188033 149121 151745]


In [12]:
# Drop every row whose HADM_ID has 4019 as its primary diagnosis (SEQ_NUM == 1)
indexNames = DIAGNOSIS_ICD_DF_4019[DIAGNOSIS_ICD_DF_4019['HADM_ID'].isin(HADM_ID_with_SEQ_1)].index
DIAGNOSIS_ICD_DF_4019 = DIAGNOSIS_ICD_DF_4019.drop(indexNames)

DataUtils.exploreDataframe(DIAGNOSIS_ICD_DF_4019, showRecords=1)

dataframe shape: (20667, 5)

dataframe info: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20667 entries, 53 to 651012
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   ROW_ID      20667 non-null  int64  
 1   SUBJECT_ID  20667 non-null  int64  
 2   HADM_ID     20667 non-null  int64  
 3   SEQ_NUM     20667 non-null  float64
 4   ICD9_CODE   20667 non-null  object 
dtypes: float64(1), int64(3), object(1)
memory usage: 1.5+ MB
None

Null value count by column:


ROW_ID        0
SUBJECT_ID    0
HADM_ID       0
SEQ_NUM       0
ICD9_CODE     0
dtype: int64



First 1 in dataframe


Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,SEQ_NUM,ICD9_CODE
53,1513,115,114585,12.0,4019



Last 1 in dataframe


Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,SEQ_NUM,ICD9_CODE
651012,639768,97488,161999,13.0,4019
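
The shape drops from 20703 to 20667 rows, a difference of 36, which matches the 36 HADM_IDs found above. The same exclusion can be written with a boolean mask and followed by a sanity check. This is a sketch on made-up toy data, assuming one 4019 row per admission:

```python
import pandas as pd

# Toy frame already filtered to a single ICD9 code (values are made up)
codes = pd.DataFrame({
    "HADM_ID": [10, 20, 30, 40],
    "SEQ_NUM": [1.0, 12.0, 1.0, 3.0],
    "ICD9_CODE": ["4019"] * 4,
})

# Admissions where the code is the primary diagnosis
primary_ids = codes.loc[codes["SEQ_NUM"] == 1, "HADM_ID"].unique()

# Keep only admissions where the code appears as a non-primary diagnosis
secondary_only = codes[~codes["HADM_ID"].isin(primary_ids)]

# No primary-diagnosis rows should remain
assert not (secondary_only["SEQ_NUM"] == 1).any()
# With one row per admission, rows removed equals admissions removed
assert len(codes) - len(secondary_only) == len(primary_ids)
```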


In [15]:
# Get all the records from the original DF matching one of the HADM_ID from above
df_example = DIAGNOSIS_ICD_DF[DIAGNOSIS_ICD_DF['HADM_ID'] == 161999]
DataUtils.exploreDataframe(df_example, showRecords=12)

dataframe shape: (24, 5)

dataframe info: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24 entries, 651000 to 651023
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   ROW_ID      24 non-null     int64  
 1   SUBJECT_ID  24 non-null     int64  
 2   HADM_ID     24 non-null     int64  
 3   SEQ_NUM     24 non-null     float64
 4   ICD9_CODE   24 non-null     object 
dtypes: float64(1), int64(3), object(1)
memory usage: 1.1+ KB
None

Null value count by column:


ROW_ID        0
SUBJECT_ID    0
HADM_ID       0
SEQ_NUM       0
ICD9_CODE     0
dtype: int64



First 12 in dataframe


Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,SEQ_NUM,ICD9_CODE
651000,639756,97488,161999,1.0,43411
651001,639757,97488,161999,2.0,3485
651002,639758,97488,161999,3.0,3484
651003,639759,97488,161999,4.0,430
651004,639760,97488,161999,5.0,34830
651005,639761,97488,161999,6.0,99731
651006,639762,97488,161999,7.0,51883
651007,639763,97488,161999,8.0,5990
651008,639764,97488,161999,9.0,34291
651009,639765,97488,161999,10.0,29181



Last 12 in dataframe


Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,SEQ_NUM,ICD9_CODE
651012,639768,97488,161999,13.0,4019
651013,639769,97488,161999,14.0,2724
651014,639770,97488,161999,15.0,25000
651015,639771,97488,161999,16.0,V5867
651016,639772,97488,161999,17.0,4280
651017,639773,97488,161999,18.0,3051
651018,639774,97488,161999,19.0,7843
651019,639775,97488,161999,20.0,0414
651020,639776,97488,161999,21.0,30391
651021,639777,97488,161999,22.0,E8798


## Business interpretation of ICD9_CODE and SEQ_NUM

The business rules for interpreting ICD9_CODE and SEQ_NUM are still an open question; see [MIT-LCP/mimic-code issue #199](https://github.com/MIT-LCP/mimic-code/issues/199).



# Scratchpad

## Explore differences in order of ops (ICD9_CODE filter, SEQ_NUM)

In [None]:
DataUtils.exploreDataframe(DIAGNOSIS_ICD_DF)

In [None]:
# Original data prep added a filter on ICD9 Codes and then sorted based on "keep first" not SEQ=1.0

# Info from original data prep
## labels = ['4019', '5849', '51881', '53081']
## diagnosis_icd_filter = diagnosis_icd.loc[diagnosis_icd['ICD9_CODE'].isin(labels)]
## diagnosis_icd_filter_drop = diagnosis_icd_filter.sort_values('SEQ_NUM').drop_duplicates('HADM_ID', keep='first')


# Expected record results
RECORDS_AFTER_FILTER = 43645
RECORDS_AFTER_KEEPFIRST = 32275

# ICD9_Codes used in research paper
labels = ['4019', '5849', '51881', '53081']

DIAGNOSIS_ICD_SANITY_orig_filter = DIAGNOSIS_ICD_DF.loc[DIAGNOSIS_ICD_DF['ICD9_CODE'].isin(labels)]
assert len(DIAGNOSIS_ICD_SANITY_orig_filter) == RECORDS_AFTER_FILTER


DIAGNOSIS_ICD_SANITY_orig_sort = DIAGNOSIS_ICD_SANITY_orig_filter.sort_values('SEQ_NUM', ascending=True).drop_duplicates('HADM_ID', keep='first')
assert len(DIAGNOSIS_ICD_SANITY_orig_sort) == RECORDS_AFTER_KEEPFIRST


# Explore data changes and output
print(f'')
print(f'SEQ_NUM after filter then keep_first')
_, _ = DataUtils.showUniqueColVals(DIAGNOSIS_ICD_SANITY_orig_sort, 'SEQ_NUM', showRecords=5)

print(f'')
print(f'HADM_ID after filter then keep_first')
_, _ = DataUtils.showUniqueColVals(DIAGNOSIS_ICD_SANITY_orig_sort, 'HADM_ID', showRecords=5)



In [None]:
# Compare the order of operations. The original (above) filters on ICD9_CODE first, then sorts and keeps the first record per HADM_ID.
# This cell reverses that: it sorts and keeps the first record per HADM_ID (the primary diagnosis), then filters on ICD9_CODE.


# Original data prep added a filter on ICD9 Codes and then sorted based on "keep first" not SEQ=1.0
## labels = ['4019', '5849', '51881', '53081']
## diagnosis_icd_filter = diagnosis_icd.loc[diagnosis_icd['ICD9_CODE'].isin(labels)]
## diagnosis_icd_filter_drop = diagnosis_icd_filter.sort_values('SEQ_NUM').drop_duplicates('HADM_ID', keep='first')


# Record counts from the original order of operations (kept for reference; not asserted in this cell)
RECORDS_AFTER_FILTER = 43645
RECORDS_AFTER_KEEPFIRST = 32275

# ICD9_Codes used in research paper
labels = ['4019', '5849', '51881', '53081']


# Filter recordset down keeping only 'first' record based on SEQ_NUM over HADM_ID
DIAGNOSIS_ICD_SANITY_sort = DIAGNOSIS_ICD_DF.sort_values('SEQ_NUM', ascending=True).drop_duplicates('HADM_ID', keep='first')

print(f'')
# Now keep only the ones with the right ICD9_CODE
DIAGNOSIS_ICD_SANITY_filter = DIAGNOSIS_ICD_SANITY_sort.loc[DIAGNOSIS_ICD_SANITY_sort['ICD9_CODE'].isin(labels)]


# Explore data changes and output
print(f'')
print(f'SEQ_NUM after keep_first then filter')
_, _ = DataUtils.showUniqueColVals(DIAGNOSIS_ICD_SANITY_filter, 'SEQ_NUM', showRecords=5)

print(f'')
print(f'HADM_ID after keep_first then filter')
_, _ = DataUtils.showUniqueColVals(DIAGNOSIS_ICD_SANITY_filter, 'HADM_ID', showRecords=5)
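
Why the order matters can be seen on a tiny made-up example. Filtering first and then keeping the lowest SEQ_NUM per HADM_ID can retain secondary diagnoses, while keeping the first record and then filtering retains only admissions whose primary diagnosis is one of the labels. The toy frame and the `keep_first` helper below are illustrative assumptions:

```python
import pandas as pd

labels = ["4019", "5849"]

# Toy data: admission 1 has 4019 only as a secondary diagnosis
toy = pd.DataFrame({
    "HADM_ID":   [1, 1, 2, 2],
    "SEQ_NUM":   [1.0, 2.0, 1.0, 2.0],
    "ICD9_CODE": ["486", "4019", "5849", "4019"],
})

def keep_first(df):
    """Keep the row with the lowest SEQ_NUM for each HADM_ID."""
    return df.sort_values("SEQ_NUM").drop_duplicates("HADM_ID", keep="first")

# Order A: filter on labels, then keep the first remaining row per admission
filter_then_first = keep_first(toy[toy["ICD9_CODE"].isin(labels)])

# Order B: keep the first row per admission, then filter on labels
first_then_filter = keep_first(toy)
first_then_filter = first_then_filter[first_then_filter["ICD9_CODE"].isin(labels)]

print(filter_then_first)  # includes admission 1 with SEQ_NUM 2.0
print(first_then_filter)  # only admission 2, whose primary diagnosis is 5849
```

In order A, admission 1 survives with a secondary diagnosis. In order B, it is dropped because its primary diagnosis (486) is not among the labels.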