# Data Preprocessing

- We need to import and transform the following MIMIC-III datasets:
  
  DIAGNOSES_ICD.csv: Each row in this file maps the hospitalization ID (HADM_ID) of a patient to a single ICD_9 code (ICD9_CODE).
  
  Example:
  
  |ROW_ID|SUBJECT_ID|**HADM_ID**|SEQ_NUM|**ICD9_CODE**|
  |--------|------------|---------|---------|-----------|
  |1297|109|172335|1|"0030"|
  |1299|109|172335|3|"0038"|
  |1301|109|173633|5|"0031"|
  
  
  The aim is to transform this dataset into a dataframe DIAGNOSES with HADM_ID as the index and the unique ICD_9 codes (6984 in total) as columns, representing a multi-hot encoding of the ICD_9 codes for each hospitalization.
  
  |HADM_ID|ICD9_CODE_0030|ICD9_CODE_0031|ICD9_CODE_0038|
  |-------|--------------|--------------|--------------|
  |172335|1|0|1|
  |173633|0|1|0|
  
  
  
  NOTEEVENTS.csv: Each row maps a HADM_ID (hospitalization ID) to a free-text discharge summary (TEXT) field.
  
  |ROW_ID|SUBJECT_ID|**HADM_ID**|CHARTDATE|CHARTTIME|STORETIME|CATEGORY|DESCRIPTION|CGID|ISERROR|**TEXT**|
  |------|----------|-----------|---------|---------|---------|--------|-----------|----|-------|--------|
  |174|22532|167853|2151-08-04|||Discharge summary|Report|||Admission Date:  [\*\*2151-7-16**]       Discharge Date:  [\*\*2151-8-4**] Service: ADDENDUM: RADIOLOGIC STUDIES:  Radiologic studies also included a chest| 
 
  The aim is to transform this into a dataframe with HADM_ID as the index and TEXT as the column.

  |HADM_ID|TEXT|
  |-------|----|
  |167853|Admission Date:  [\*\*2151-7-16**]       Discharge Date:  [\*\*2151-8-4**] Service: ADDENDUM: RADIOLOGIC STUDIES:  Radiologic studies also included a chest|

<br/>
<br/>

- The processed dataframes are then stored (as pickle files) for further use.

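- This notebook uses the RAPIDS framework (cuDF) and Dask DataFrame to speed up processing of large Pandas dataframes. Note that partitions created with `dd.from_pandas` are pandas-backed and run on CPU workers; GPU-backed partitions would require `dask_cudf`.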
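
The DIAGNOSES_ICD transformation described above can be sketched on the three example rows. This is a minimal illustration that uses `pd.crosstab`, which is an assumption of this sketch and not the method used in the cells of this notebook:

```python
import pandas as pd

# The three example rows from DIAGNOSES_ICD.csv shown above.
rows = pd.DataFrame({
    "HADM_ID": [172335, 172335, 173633],
    "ICD9_CODE": ["0030", "0038", "0031"],
})

# Count (HADM_ID, ICD9_CODE) pairs, cap the counts at 1, and convert to booleans.
multi_hot = (
    pd.crosstab(rows["HADM_ID"], rows["ICD9_CODE"])
    .clip(upper=1)
    .astype(bool)
    .add_prefix("ICD9_CODE_")
)

print(multi_hot)
```

Each row of `multi_hot` is one hospitalization, and each column marks whether that ICD_9 code was recorded for it.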

#### Check GPU Version

In [None]:
# Check GPU
!nvidia-smi
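
Outside a notebook, where the `!` shell escape is not available, a similar check can be made from plain Python. A minimal sketch, assuming `nvidia-smi` is on the `PATH` whenever an NVIDIA driver is installed; the helper name `gpu_summary` is hypothetical:

```python
import shutil
import subprocess


def gpu_summary():
    """Return a short 'name, memory' string per GPU, or None if no GPU tool is found."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    # A non-zero exit code usually means the driver is not reachable.
    return result.stdout.strip() if result.returncode == 0 else None


summary = gpu_summary()
print(summary if summary else "No NVIDIA GPU detected")
```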

# Setup
This setup script:

1. Checks to make sure that the GPU is RAPIDS compatible
1. Installs the **current stable version** of RAPIDSAI's core libraries using pip, which are:
  1. cuDF
  1. cuML
  1. cuGraph
  1. xgboost

**This will complete in about 3-4 minutes**

In [None]:
# This gets the RAPIDS-Colab install files and checks your GPU.  Run this and the next cell only.
# Please read the output of this cell.  If your Colab instance is not RAPIDS compatible, it will warn you and give you remediation steps.

!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py

## Critical Imports

In [None]:
# Critical imports
import cudf
import cuml
import os
import numpy as np
import pandas as pd
import dask.dataframe as dd

In [None]:
# Mount the project directory in Google Drive. (It is only intended to be run in a Colab environment.)

from google.colab import drive
drive.mount('drive')

In [None]:
# Define the base project directory.

PROJECT_DIR = 'drive/My Drive/cs598-dl/' # For Google drive only

# PROJECT_DIR = '../' # For local directory
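
Instead of commenting one of the two lines in and out, the choice could be made automatically. A minimal sketch, assuming that Colab preloads the `google.colab` package into the kernel; the flag name `IN_COLAB` is hypothetical:

```python
import sys

# Assumption: inside Colab the google.colab package is already imported.
IN_COLAB = "google.colab" in sys.modules

# Both paths are the ones given in the cell above.
PROJECT_DIR = "drive/My Drive/cs598-dl/" if IN_COLAB else "../"

print(PROJECT_DIR)
```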

In [None]:
# We first process MIMIC-III DIAGNOSES_ICD dataset.

In [None]:
# Read DIAGNOSES_ICD.csv from data directory, and pre-process.

diagnoses_df = pd.read_csv(PROJECT_DIR + 'data/DIAGNOSES_ICD.csv', usecols=['HADM_ID', 'ICD9_CODE'])
diagnoses_df = diagnoses_df.astype({'ICD9_CODE': 'string'})

# Collect all unique ICD_9 codes and create new DataFrame codes_df.
codes_df = pd.DataFrame(diagnoses_df['ICD9_CODE'].unique(), columns = ['ICD9_CODE'])

# Create DataFrame representing one-hot encoding of ICD_9 codes.
one_hot_enc_df = pd.get_dummies(codes_df, columns = ['ICD9_CODE'], dtype='bool')

# Join codes_df and one_hot_enc_df, based on index
codes_df = codes_df.join(one_hot_enc_df)

# Next we merge diagnoses_df and codes_df, to create our final form mapping each HADM_ID to a multi-hot encoding of ICD_9 codes.
# This is a very heavy operation due to the large number of rows in diagnoses_df and the large number of columns in codes_df.
# So we use Dask DataFrame to split the work into partitions that can be processed in parallel.
# Note: dd.from_pandas produces pandas-backed partitions, which run on CPU workers rather than on the GPU.

# Create Dask DataFrame from codes_df for partitioned processing.
codes_df = dd.from_pandas(codes_df, npartitions = 10)

# Create Dask DataFrame from diagnoses_df for partitioned processing.
diagnoses_df = dd.from_pandas(diagnoses_df, npartitions = 10)

# Merge diagnoses_df and codes_df based on column 'ICD9_CODE'. 
# Dask operations are lazy and do not materialize until 'compute()' method is invoked.
diagnoses_df = diagnoses_df.merge(codes_df, on='ICD9_CODE').compute()

diagnoses_df = diagnoses_df.drop(['ICD9_CODE'], axis = 1)

# This step groups all ICD_9 codes corresponding to a given HADM_ID and builds a multi-hot encoding.
diagnoses_df = diagnoses_df.groupby('HADM_ID').any().reset_index()

print(diagnoses_df)
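
The merge with the wide `codes_df` table is what makes the cell above expensive. As a hedged alternative, the same multi-hot table can be obtained by one-hot encoding `ICD9_CODE` directly and then grouping by `HADM_ID`. The sketch below compares both routes on toy data (the three example rows from the introduction); it is an illustration, not a measured replacement for the Dask version:

```python
import pandas as pd

toy = pd.DataFrame({
    "HADM_ID": [172335, 172335, 173633],
    "ICD9_CODE": ["0030", "0038", "0031"],
})

# Route 1: merge-based, mirroring the cell above (without Dask).
codes = pd.DataFrame(toy["ICD9_CODE"].unique(), columns=["ICD9_CODE"])
codes = codes.join(pd.get_dummies(codes, columns=["ICD9_CODE"], dtype="bool"))
via_merge = (
    toy.merge(codes, on="ICD9_CODE")
    .drop(columns=["ICD9_CODE"])
    .groupby("HADM_ID")
    .any()
    .reset_index()
)

# Route 2: one-hot encode the rows directly, then OR the rows of each HADM_ID.
via_dummies = (
    pd.get_dummies(toy, columns=["ICD9_CODE"], dtype="bool")
    .groupby("HADM_ID")
    .any()
    .reset_index()
)

# check_like=True ignores column and row order when comparing.
pd.testing.assert_frame_equal(via_merge, via_dummies, check_like=True)
print(via_dummies)
```

On the full dataset the intermediate one-hot frame has one row per diagnosis and one column per code, so memory use should be checked before relying on this route.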

In [None]:
import time

# Read DIAGNOSES_ICD.csv from data directory, and pre-process.

diagnoses_df = pd.read_csv(PROJECT_DIR + 'data/DIAGNOSES_ICD.csv', usecols=['HADM_ID', 'ICD9_CODE'])
diagnoses_df = diagnoses_df.astype({'ICD9_CODE': 'string'})

# Collect all unique ICD_9 codes and create new DataFrame codes_df.
codes_df = pd.DataFrame(diagnoses_df['ICD9_CODE'].unique(), columns = ['ICD9_CODE'])

# Create DataFrame representing one-hot encoding of ICD_9 codes.
one_hot_enc_df = pd.get_dummies(codes_df, columns = ['ICD9_CODE'], dtype='bool')

# Join codes_df and one_hot_enc_df, based on index
codes_df = codes_df.join(one_hot_enc_df)

sta = time.time()
# Merge diagnoses_df and codes_df based on column 'ICD9_CODE' using plain Pandas (no Dask),
# and time the merge for comparison with the Dask version above.
diagnoses_df = diagnoses_df.merge(codes_df, on='ICD9_CODE')
end = time.time()
print(end - sta)

diagnoses_df = diagnoses_df.drop(['ICD9_CODE'], axis = 1)

# This step groups all ICD_9 codes corresponding to a given HADM_ID and builds a multi-hot encoding.
diagnoses_df = diagnoses_df.groupby('HADM_ID').any().reset_index()

print(diagnoses_df)
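
The start/stop timing pattern used above can be wrapped in a small context manager so that several steps can be timed the same way. A minimal sketch; the helper name `timed` is hypothetical, and `time.perf_counter` is used here instead of `time.time` because it is a monotonic, high-resolution clock:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.3f} s")


# Toy workload standing in for the merge step.
with timed("toy workload"):
    total = sum(i * i for i in range(100_000))
```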

In [None]:
# Next we process MIMIC-III NOTEEVENTS dataset.

In [None]:
# Import dataset and pre-process.
notes_df = pd.read_csv(PROJECT_DIR + 'data/NOTEEVENTS.csv', usecols=['HADM_ID', 'CATEGORY', 'DESCRIPTION', 'TEXT'])
notes_df = notes_df.dropna()

# Keep only notes of category 'Discharge summary' and sub-type 'Report'.
notes_df = notes_df[(notes_df['CATEGORY'] == 'Discharge summary') & (notes_df['DESCRIPTION'] == 'Report')]
notes_df = notes_df.drop(['CATEGORY', 'DESCRIPTION'], axis=1)
notes_df = notes_df.astype({'HADM_ID': 'int64'})
notes_df = notes_df.drop_duplicates(subset = 'HADM_ID')
print(notes_df)
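
The effect of each filtering step above can be seen on a small, made-up frame. The rows below are invented for illustration (only the HADM_ID 167853 appears in the example at the top of this notebook):

```python
import pandas as pd

# Made-up rows covering the cases the filter has to handle.
toy_notes = pd.DataFrame({
    "HADM_ID": [167853.0, 167853.0, 167853.0, 100001.0, None],
    "CATEGORY": ["Discharge summary", "Discharge summary", "Discharge summary",
                 "Nursing", "Discharge summary"],
    "DESCRIPTION": ["Report", "Addendum", "Report", "Report", "Report"],
    "TEXT": ["first report", "addendum", "second report", "nursing note", "no admission"],
})

kept = toy_notes.dropna()  # drops the row without a HADM_ID
kept = kept[(kept["CATEGORY"] == "Discharge summary") & (kept["DESCRIPTION"] == "Report")]
kept = kept.drop(["CATEGORY", "DESCRIPTION"], axis=1)
kept = kept.astype({"HADM_ID": "int64"})
kept = kept.drop_duplicates(subset="HADM_ID")  # keeps the first report per admission

print(kept)
```

Note that `drop_duplicates` keeps only the first matching report per HADM_ID, so any later report for the same admission is discarded.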

In [None]:
# We next select the subset of rows in diagnoses_df and notes_df that share a common set of HADM_IDs,
# and remove the other rows from each DataFrame. Such rows cannot be used in training or testing.

In [None]:
# Collect all hadm_ids from diagnoses_df
hadm_ids_from_diagnoses_df = diagnoses_df.filter(items = ['HADM_ID'])

# Collect all hadm_ids from notes_df
hadm_ids_from_notes_df = notes_df.filter(items = ['HADM_ID'])

# Generate DataFrame with common set of HADM_IDs.
hadm_ids_df = hadm_ids_from_diagnoses_df.merge(hadm_ids_from_notes_df, how = 'inner')

# Filter rows in diagnoses_df by merging with DataFrame containing common HADM_IDs.
diagnoses_df = diagnoses_df.merge(hadm_ids_df, on='HADM_ID', how = 'inner')

# Similarly, filter rows in notes_df by merging with DataFrame containing common HADM_IDs.
notes_df = notes_df.merge(hadm_ids_df, on='HADM_ID', how = 'inner')
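
The same restriction to common HADM_IDs can also be expressed with `isin` instead of merges. A minimal sketch on made-up toy frames, shown as an alternative and not as the method used in the cell above:

```python
import pandas as pd

# Made-up frames: HADM_ID 1 has no note, HADM_ID 4 has no diagnoses.
toy_diagnoses = pd.DataFrame({"HADM_ID": [1, 2, 3], "ICD9_CODE_0030": [True, False, True]})
toy_notes = pd.DataFrame({"HADM_ID": [2, 3, 4], "TEXT": ["b", "c", "d"]})

# HADM_IDs present in both frames.
common_ids = set(toy_diagnoses["HADM_ID"]) & set(toy_notes["HADM_ID"])

toy_diagnoses = toy_diagnoses[toy_diagnoses["HADM_ID"].isin(common_ids)].reset_index(drop=True)
toy_notes = toy_notes[toy_notes["HADM_ID"].isin(common_ids)].reset_index(drop=True)

print(toy_diagnoses)
print(toy_notes)
```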

In [None]:
# Pickle diagnoses_df
diagnoses_df.to_pickle(PROJECT_DIR + 'data/DIAGNOSES.pkl')

# Pickle notes_df
notes_df.to_pickle(PROJECT_DIR + 'data/NOTES.pkl')
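
A quick round trip can confirm that a pickled dataframe comes back unchanged, including its boolean columns. A minimal sketch on a made-up toy frame, written to a temporary directory so that nothing in the project directory is touched:

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"HADM_ID": [172335, 173633], "ICD9_CODE_0030": [True, False]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")  # hypothetical file name
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares values, dtypes, index and columns.
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.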

In [None]:
diagnoses_df = Da