# Preprocessing

In this notebook, we create word embeddings which will be used to represent documents for our neural network. First, we try creating word embeddings using just the discharge summaries to be used in the study, and later, we create word embeddings based on all discharge summaries from MIMIC-III, as was done in the original study. We store our results in "word_vectors_from_all_discharge_summaries.wordvectors".

We then save all of the discharge summaries to be used in our replication study within a 3D tensor called study_corpus_tensor and store it in 'embedded_docs.pt'. study_corpus_tensor[i, :, :] is a 2-D tensor representing document i, and study_corpus_tensor[i, j, :] is an embedding vector that represents word j in document i. These document tensors are padded to the length of the longest document using vectors of all zeroes.

We also store a dataframe containing the discharge summaries to be used in our replication study, along with the phenotype labels for each document (which tell you if each patient has depression or not, alcoholism or not, and so on), in "labelled_corpus_df.csv".

## Setup

First, we import the required libraries, connect to Google Drive, move to the correct directory, and set some values which will be required later.

If the folder "DLH project" was shared with you, you will need to create a shortcut to it in your Google Drive to use its contents within Colab.

In [None]:
import numpy as np
import os
import pandas as pd
import re
import time
import sys
import torch

from gensim.test.utils import datapath
from gensim.models import KeyedVectors
from gensim.models import Word2Vec

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
os.chdir("/content/drive/MyDrive/DLH project")

In [None]:
embedding_vector_size = 100
num_study_docs = 1341

## Getting the Data

Here, we load NOTEEVENTS.csv (which contains all clinical documents from MIMIC-III) and annotations.csv (which contains labels telling us whether each discharge summary used in the study corresponds to each condition), then merge them to get a dataframe (labelled_corpus_df) containing the text and labels we will use in our replication study.

In some cases, a single combination of hospital admission ID and patient ID corresponded to multiple discharge summaries. We omitted all discharge summaries for these cases. The "chart.time" field within annotations.csv contained either a copy of Hospital.Admission.ID or just 999999 instead of a value corresponding to CHARTTIME within NOTEEVENTS.csv, which prevented us from matching annotations to documents in these cases.

For instance, if a combination of hospital admission ID and patient ID corresponded to two different discharge summaries, we would not know which discharge summary matched up to which set of labels.

Because of this, we use only 1341 documents in our replication study, compared to 1610 in the original.
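
As a compact illustration of the filtering rule described above, the sketch below drops every row whose ID pair occurs more than once. It uses a made-up toy dataframe (`toy_df` and its values are hypothetical) and `DataFrame.duplicated` with `keep=False`, which is one possible way to express the rule and not necessarily how the cells below do it.

```python
import pandas as pd

# Toy stand-in for the merged discharge summaries (all values are made up)
toy_df = pd.DataFrame({
    "HADM_ID": [100, 100, 200, 300],
    "SUBJECT_ID": [1, 1, 2, 3],
    "TEXT": ["summary A", "summary B", "summary C", "summary D"],
})

# keep=False marks every row whose (HADM_ID, SUBJECT_ID) pair occurs more than once
is_ambiguous = toy_df.duplicated(subset=["HADM_ID", "SUBJECT_ID"], keep=False)

# Keep only rows whose ID pair is unique, so labels can be matched unambiguously
unambiguous_df = toy_df[~is_ambiguous].reset_index(drop=True)

print(unambiguous_df["TEXT"].tolist())
```

Here the two summaries sharing the pair (100, 1) are both dropped, since neither can be matched to a specific set of labels.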

In [None]:
t1 = time.time()
clinical_notes_df = pd.read_csv("NOTEEVENTS.csv")
# CHARTDATE and CHARTTIME apparently have mixed types, but that does not matter for us
t2 = time.time()
print(t2 - t1) # Takes about 1 minute


54.101357221603394


In [None]:
print(clinical_notes_df.shape)
print(clinical_notes_df.head(1))

(2083180, 11)
   ROW_ID  SUBJECT_ID   HADM_ID   CHARTDATE CHARTTIME STORETIME  \
0     174       22532  167853.0  2151-08-04       NaN       NaN   

            CATEGORY DESCRIPTION  CGID  ISERROR  \
0  Discharge summary      Report   NaN      NaN   

                                                TEXT  
0  Admission Date:  [**2151-7-16**]       Dischar...  


In [None]:
clinical_notes_df["CATEGORY"].value_counts()

Nursing/other        822497
Radiology            522279
Nursing              223556
ECG                  209051
Physician            141624
Discharge summary     59652
Echo                  45794
Respiratory           31739
Nutrition              9418
General                8301
Rehab Services         5431
Social Work            2670
Case Management         967
Pharmacy                103
Consult                  98
Name: CATEGORY, dtype: int64

In [None]:
clinical_notes_df.columns

Index(['ROW_ID', 'SUBJECT_ID', 'HADM_ID', 'CHARTDATE', 'CHARTTIME',
       'STORETIME', 'CATEGORY', 'DESCRIPTION', 'CGID', 'ISERROR', 'TEXT'],
      dtype='object')

In [None]:
annotations_df = pd.read_csv("annotations.csv")

In [None]:
annotations_df.shape

(1610, 18)

In [None]:
annotations_df.head(1)

Unnamed: 0,Hospital.Admission.ID,subject.id,chart.time,cohort,Obesity,Non.Adherence,Developmental.Delay.Retardation,Advanced.Heart.Disease,Advanced.Lung.Disease,Schizophrenia.and.other.Psychiatric.Disorders,Alcohol.Abuse,Other.Substance.Abuse,Chronic.Pain.Fibromyalgia,Chronic.Neurological.Dystrophies,Advanced.Cancer,Depression,Dementia,Unsure
0,118003,3644,118003,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0


In [None]:
# Get all documents with a combination of hospital admission ID and subject ID matching one
# used in the study
merged_df = pd.merge(annotations_df, clinical_notes_df, left_on = ["Hospital.Admission.ID", "subject.id"], right_on = ["HADM_ID", "SUBJECT_ID"])

In [None]:
merged_df.shape

(56839, 29)

In [None]:
merged_df["CATEGORY"].value_counts()

Nursing/other        21066
Radiology            13135
Nursing               7520
Physician             5059
ECG                   5026
Discharge summary     1976
Respiratory           1155
Echo                  1021
Nutrition              325
General                290
Rehab Services         166
Social Work             64
Case Management         36
Name: CATEGORY, dtype: int64

In [None]:
# Only the discharge summaries are relevant
merged_df = merged_df[merged_df["CATEGORY"] == "Discharge summary"]

In [None]:
merged_df.shape

(1976, 29)

In [None]:
# Number of unique combinations of hospital admission ID and subject ID
merged_df.groupby(["HADM_ID", "SUBJECT_ID"]).size().reset_index().rename(columns = {0 : 'count'}).shape

(1560, 3)

In [None]:
temp = merged_df.groupby(["HADM_ID", "SUBJECT_ID"]).size().reset_index().rename(columns = {0 : 'count'})

# Number of unique combinations of hospital admission ID and subject ID
# which have more than one discharge summary
print(temp[temp["count"] > 1].shape)

# Number of unique combinations of hospital admission ID and subject ID
# which have only one discharge summary
print(temp[temp["count"] == 1].shape)

(219, 3)
(1341, 3)


In [None]:
# A dataframe containing the hospital admission ID and subject ID for all
# patients used in the study who have only one discharge summary in MIMIC-III.
# When a combination of IDs has several discharge summaries, we cannot tell which set of
# labels corresponds to which summary, because the chart.time field of annotations.csv
# contains either a copy of the hospital admission ID or 999999 instead of an actual time.
# We can therefore only use discharge summaries from patients with a single summary.
ids_for_non_duplicate = temp[temp["count"] == 1][["HADM_ID", "SUBJECT_ID"]]

In [None]:
# This contains the hospital admission ID, subject ID, discharge summary text, and labels
# for all of the patients we will be using in our study
labelled_corpus_df = pd.merge(merged_df, ids_for_non_duplicate, on = ["HADM_ID", "SUBJECT_ID"])
labelled_corpus_df = labelled_corpus_df[["HADM_ID",
                                         "SUBJECT_ID",
                                         "TEXT",
                                         "Advanced.Cancer",
                                         "Advanced.Heart.Disease",
                                         "Advanced.Lung.Disease",
                                         "Chronic.Neurological.Dystrophies",
                                         "Chronic.Pain.Fibromyalgia",
                                         "Alcohol.Abuse",
                                         "Other.Substance.Abuse",
                                         "Obesity",
                                         "Schizophrenia.and.other.Psychiatric.Disorders",
                                         "Depression"]]

In [None]:
labelled_corpus_df.head(1)

Unnamed: 0,HADM_ID,SUBJECT_ID,TEXT,Advanced.Cancer,Advanced.Heart.Disease,Advanced.Lung.Disease,Chronic.Neurological.Dystrophies,Chronic.Pain.Fibromyalgia,Alcohol.Abuse,Other.Substance.Abuse,Obesity,Schizophrenia.and.other.Psychiatric.Disorders,Depression
0,118003.0,3644,Admission Date: [**2200-4-7**] Discharge ...,0,0,0,0,1,0,0,0,0,1


In [None]:
labelled_corpus_df.iloc[0]["TEXT"]

"Admission Date:  [**2200-4-7**]     Discharge Date:  [**2200-4-10**]\n\nDate of Birth:   [**2146-9-21**]     Sex:  F\n\nService:  CARDIAC INTENSIVE CARE MEDICINE\n\nCHIEF COMPLAINT:  The patient was admitted to the Cardiac\nIntensive Care Unit Medicine Service on [**2200-4-7**], with the\nchief complaint of acute myocardial infarction and fever.\n\nHISTORY OF PRESENT ILLNESS:  The patient is a 53 year old\nwhite female with a history of coronary artery disease,\nhypertension, hypercholesterolemia and two pack per day\ntobacco use with previous coronary artery bypass graft\nsurgery presenting to an outside hospital on [**2200-4-6**], with a\ntwo day history of fevers and confusion.  The patient had a\nCT scan of the chest at that time which revealed pneumonia by\nreport in the left lower lobe.\n\nWhile in the outside hospital Emergency Department, the\npatient complained of chest pain.  The patient states that\nshe has had this pain for approximately two weeks with no\nrelief.  She was

This function was taken from the GitHub repository provided by the study authors. It is the only code we directly copied from them (their CNN code was written in Lua, and we had a hard time understanding it). It is used to clean and tokenize discharge summaries.

We included the .lower() call, although the study authors commented it out. We are not sure if they lowercased their documents elsewhere. However, when .lower() was omitted, many similar words (such as "ETOH", "etoh", "EToH", and so on) were treated as different words, which is probably bad for performance and vastly increases the vocabulary size.

We use this function on both the entire corpus of discharge summaries and the 1341 discharge summaries we use in our replication study.
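
To make the effect of lowercasing concrete, here is a minimal sketch using a simplified stand-in for the cleaning function (`simple_clean` is a hypothetical helper, not the authors' `clean_str`) and a made-up note. It counts how many distinct spellings of "etoh" end up in the vocabulary with and without lowercasing.

```python
import re

def simple_clean(text, lowercase=True):
    """Simplified stand-in for clean_str: keep letters, digits, and some
    punctuation, collapse whitespace, and optionally lowercase."""
    text = re.sub(r"[^A-Za-z0-9(),!?'`]", " ", text)
    text = re.sub(r"\s{2,}", " ", text).strip()
    return text.lower() if lowercase else text

# Made-up example sentence, not taken from MIMIC-III
note = "Pt reports ETOH use; denies etoh today. EToH level pending."

vocab_cased = set(simple_clean(note, lowercase=False).split())
vocab_lower = set(simple_clean(note, lowercase=True).split())

# Distinct vocabulary entries that are all spellings of the same word
etoh_variants_cased = {w for w in vocab_cased if w.lower() == "etoh"}
etoh_variants_lower = {w for w in vocab_lower if w.lower() == "etoh"}

print(len(etoh_variants_cased), len(etoh_variants_lower))
```

Without lowercasing, each capitalization would receive its own embedding vector; with lowercasing, they share one.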

In [None]:
# From the code provided by the study authors
# Note: commas are not removed; they are kept as separate tokens
# Note: separates "patient's" into "patient 's"; the purpose is unclear, but the pair should be caught with n-grams
def clean_str(string):
    """
    Tokenization/string cleaning.
    """
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " ( ", string)
    string = re.sub(r"\)", " ) ", string)
    string = re.sub(r"\?", " ? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip().lower()  # We add .lower() (commented out by the authors) because Word2Vec is case sensitive

clean_str(labelled_corpus_df.iloc[0]["TEXT"])

"admission date 2200 4 7 discharge date 2200 4 10 date of birth 2146 9 21 sex f service cardiac intensive care medicine chief complaint the patient was admitted to the cardiac intensive care unit medicine service on 2200 4 7 , with the chief complaint of acute myocardial infarction and fever history of present illness the patient is a 53 year old white female with a history of coronary artery disease , hypertension , hypercholesterolemia and two pack per day tobacco use with previous coronary artery bypass graft surgery presenting to an outside hospital on 2200 4 6 , with a two day history of fevers and confusion the patient had a ct scan of the chest at that time which revealed pneumonia by report in the left lower lobe while in the outside hospital emergency department , the patient complained of chest pain the patient states that she has had this pain for approximately two weeks with no relief she was given levofloxacin for apparent community acquired pneumonia and cardiac enzymes w

In [None]:
# Took about 25 seconds on the first run, but only about 1.6 seconds when run again later
t1 = time.time()
labelled_corpus_df["Cleaned Text"] = labelled_corpus_df.apply(lambda row : clean_str(row["TEXT"]), axis = 1)
labelled_corpus_df = labelled_corpus_df.drop(['TEXT'], axis=1)
t2 = time.time()
print(t2 - t1)

1.5384259223937988


In [None]:
labelled_corpus_df.iloc[100]["Cleaned Text"]

"admission date 2131 3 26 discharge date 2131 3 29 date of birth 2079 11 1 sex m service medicine allergies bactrim levaquin location ( un ) juice attending first name3 ( lf ) 348 chief complaint hypoglycemic episode , cough major surgical or invasive procedure history of present illness mr known lastname 7264 is a 51 year old male with past medical history significant for type i diabetes and mental retardation who was brought to ed from his group home after an episode of hypoglycemia with fsg of 37 and repeat fsg of 40 even after having his dinner ems was called and he was given 12 16 amp of dextrose enroute to hospital1 18 per caregivers , patient 's mental status was at usual baseline in the ed , initial vs were t age over 90 f , p 86 , bp 104 63 , rr 20 , o2 saturation rate is 97 room air glucose trend in ed included 0030 fs telephone fax ( 1 ) 7265 fs telephone fax ( 1 ) 7266 fs 173 he also had a fever to 103f , noted cough on exam and tachypnea to mid 30s range no abg was done in

In [None]:
# Save the dataframe containing the documents of interest and their labels
labelled_corpus_df.to_csv('labelled_corpus_df.csv')

In [None]:
# List of lists, where each internal list contains the words of a document
# Fairly fast
labelled_corpus_lol = [row.split() for row in labelled_corpus_df['Cleaned Text']]

In [None]:
len(labelled_corpus_lol) # 1341 documents

In [None]:
labelled_corpus_lol[0]

In [None]:
# Here, we tried using Word2Vec on the 1341 discharge summaries we plan to use in our replication study
# We are not doing this anymore, as we will use all discharge summaries within MIMIC-III to make our word vectors
# t_start = time.time()
# w2v_model = Word2Vec(labelled_corpus_lol,
#                      sg=0,
#                      window=10,
#                      negative=10,
#                      min_count=5,
#                      epochs=15,
#                      vector_size=embedding_vector_size,
#                      workers=3)
# t_end = time.time()

# print(t_end - t_start) #102.29641628265381s for 100 on Google Colab standard, 57.3120 seconds on high ram

## Embedding 

Here, we make embedding vectors for words based on all discharge summaries in MIMIC-III (as was done in the original paper). We select the discharge summaries from the dataframe containing all clinical notes within MIMIC-III, clean and tokenize them using the provided function, store them in a custom corpus class, then run Word2Vec on them.

Most of the parameters for Word2Vec were provided in the paper, but we used an embedding size of 100 because that is the default value and no value was specified for that hyperparameter in the paper.

We then do some exploration of the word embeddings, save them to "word_vectors_from_all_discharge_summaries.wordvectors", and save the embedded study documents to embedded_docs.pt.
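
To give a feel for one of these parameters: Word2Vec ignores words whose total frequency in the corpus is below `min_count`. The sketch below mimics that pruning with a plain counter on a made-up toy corpus. It only illustrates the idea and is not gensim's implementation; the threshold of 2 is an assumption chosen to keep the example small (the notebook uses 5).

```python
from collections import Counter

# Toy tokenized corpus (made-up tokens); the real corpus is all MIMIC-III discharge summaries
toy_corpus = [
    ["patient", "denies", "alcohol", "use"],
    ["patient", "reports", "alcohol", "use"],
    ["patient", "with", "history", "of", "alchol", "abuse"],
]

min_count = 2  # the notebook uses min_count=5; 2 suits this tiny example

# Count every token across all documents, then keep only the frequent ones
token_counts = Counter(token for doc in toy_corpus for token in doc)
kept_vocab = sorted(token for token, count in token_counts.items() if count >= min_count)

print(kept_vocab)
```

Rare tokens such as the one-off misspelling here receive no vector, which is why documents are later embedded using only words present in the vocabulary.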

In [None]:
# Try doing word2vec on all discharge summaries
all_discharge_summaries_df = clinical_notes_df.loc[clinical_notes_df["CATEGORY"] == "Discharge summary"]
all_discharge_summaries_df.shape

(59652, 11)

In [None]:
t_start = time.time()
# pandas emits a SettingWithCopyWarning here because all_discharge_summaries_df is a slice of
# clinical_notes_df; the new column is still added, and taking .copy() of the slice would avoid the warning
all_discharge_summaries_df["Cleaned Text"] = all_discharge_summaries_df.apply(lambda row : clean_str(row["TEXT"]), axis = 1)
t_end = time.time()
print(t_end - t_start) # About 58 seconds on Colab High RAM

57.78693175315857


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  all_discharge_summaries_df["Cleaned Text"] = all_discharge_summaries_df.apply(lambda row : clean_str(row["TEXT"]), axis = 1)


In [None]:
all_discharge_summaries_df.iloc[45000]

ROW_ID                                                      47631
SUBJECT_ID                                                  46744
HADM_ID                                                  106032.0
CHARTDATE                                              2173-08-02
CHARTTIME                                                     NaN
STORETIME                                                     NaN
CATEGORY                                        Discharge summary
DESCRIPTION                                                Report
CGID                                                          NaN
ISERROR                                                       NaN
TEXT            Admission Date:  [**2173-7-21**]              ...
Cleaned Text    admission date 2173 7 21 discharge date 2173 8...
Name: 45000, dtype: object

In [None]:
# Returns one "Cleaned Text" from all_discharge_summaries_df at a time, split into a list of words 
class MyCorpus:
    def __iter__(self):
        for i in range(all_discharge_summaries_df.shape[0]):
            # each row of the dataframe holds one document, with tokens separated by whitespace
            yield all_discharge_summaries_df["Cleaned Text"].iloc[i].split()

In [None]:
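# Behaves like a list of lists, where each list contains the words from one discharge summary in all_discharge_summaries_df,
# but the lists are generated one at a time instead of all being held in memory
# Because MyCorpus defines __iter__, it can be iterated over repeatedly, which Word2Vec needs for its multiple passes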
streaming_corpus = MyCorpus()
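
The reason for wrapping the corpus in a class instead of using a plain generator is that training needs several passes over the data (one to build the vocabulary, then one per epoch), and a generator is exhausted after its first pass. The sketch below shows the difference on toy data; `RestartableCorpus` is a hypothetical stand-in for `MyCorpus`, backed by a list instead of a dataframe.

```python
def token_generator(docs):
    # A generator object can be consumed only once
    for doc in docs:
        yield doc.split()

class RestartableCorpus:
    """Hypothetical stand-in for MyCorpus, backed by a plain list of strings."""
    def __init__(self, docs):
        self.docs = docs

    def __iter__(self):
        # Each call to iter() starts a fresh pass over the documents
        for doc in self.docs:
            yield doc.split()

docs = ["admission date", "chief complaint fever"]

# Count how many documents each approach yields on two consecutive passes
gen = token_generator(docs)
gen_passes = [len(list(gen)) for _ in range(2)]

corpus = RestartableCorpus(docs)
corpus_passes = [len(list(corpus)) for _ in range(2)]

print(gen_passes, corpus_passes)
```

The generator yields nothing on its second pass, while the class yields every document each time.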

In [None]:
# Make embedding vectors based on all discharge summaries
t_start = time.time()
w2v_model = Word2Vec(streaming_corpus,
                     sg=0,
                     window=10,
                     negative=10,
                     min_count=5,
                     epochs=15,
                     vector_size=embedding_vector_size,
                     workers=3)
t_end = time.time()

print(t_end - t_start) #2145.9494240283966 seconds on Colab Pro High RAM

2145.9494240283966


In [None]:
# Just get the vectors
word_vectors = w2v_model.wv
del w2v_model

# The commented-out line below was used when the vectors were trained on only the 1341 study documents
# word_vectors.save("word_vectors_from_study_documents.wordvectors")
# These vectors were trained on all discharge summaries in MIMIC-III
word_vectors.save("word_vectors_from_all_discharge_summaries.wordvectors")

In [None]:
# To avoid running word2vec again, load the saved vectors instead
# word_vectors = KeyedVectors.load("word_vectors_from_all_discharge_summaries.wordvectors", mmap='r')

In [None]:
print(word_vectors['alcohol'])

[ -3.1861129    5.262902    -6.3225594   -4.7982006    4.3269615
  -5.589623     2.9312115   -3.663884    -4.197892     1.2257946
  -5.577783    -1.996908     3.8154147    7.301901     1.1599406
   0.07475126  -3.0784452   -3.5749977   -2.5072606   -2.8457582
   4.8978014    1.5327842    9.970897     4.915571     3.5068972
   7.8685474   -4.914637     4.3651986   -1.2349755   -6.4273167
   6.3606176    1.1814938   -0.13996305   4.107138     6.580439
  -2.429423    -0.26516044   6.777397     1.209258     1.5307328
  -8.989547     0.40715852   6.21218    -13.365695    -1.2178051
   3.7284064   -8.925623    -3.7387733    0.29934537  -2.4166577
  -8.572873    -4.974907    -0.6738957   10.524975     0.45470825
   0.68279076  -5.4843254   -7.4348655    6.361214    12.754989
   0.35263175   1.6255226   -4.6482496    0.76110554   1.5838642
   4.86693      3.1606245   -4.481111    -0.16963749   1.643261
   6.9733257    8.063492     0.17277762   2.2409945    1.3575764
  -7.5169578   -2.1233299  

In [None]:
word_vectors.most_similar("alcohol", topn = 5)

[('etoh', 0.8409751057624817),
 ('substance', 0.8176356554031372),
 ('alchol', 0.7953281998634338),
 ('alchohol', 0.77773517370224),
 ('alochol', 0.7399619817733765)]
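
The scores returned by `most_similar` are cosine similarities between word vectors. As a rough sketch of what that number means, the snippet below computes cosine similarity by hand for made-up 3-dimensional vectors (the helper `cosine_similarity` and the vectors are illustrative assumptions, not part of gensim).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration only
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # points in the same direction as vec_a
vec_c = [-2.0, 1.0, 0.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score close to 0.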

In [None]:
print("Vocabulary Size")
print(len(word_vectors))

Vocabulary Size
56996


In [None]:
"alcohol" in word_vectors

True

In [None]:
"bee" in word_vectors

True

In [None]:
# Fairly fast
# Creates docs_as_tensors, which represents each document as a stack of embedding vectors, one row per word
# Words that are not in the Word2Vec vocabulary are skipped
t1 = time.time()
docs_as_tensors = [None] * num_study_docs # Number of documents we use

for i, doc in enumerate(labelled_corpus_lol):
    docs_as_tensors[i] = torch.stack([torch.from_numpy(word_vectors[word].copy()) for word in doc if (word in word_vectors)])
    
t2 = time.time()

print(t2 - t1)

print(len(docs_as_tensors))
print(docs_as_tensors[len(docs_as_tensors) - 1].shape)

25.293248653411865
1341
torch.Size([2485, 100])


In [None]:
max_length = 0
for doc_tensor in docs_as_tensors:
    if doc_tensor.shape[0] > max_length:
        max_length = doc_tensor.shape[0]

print(max_length)

5434


In [None]:
# Pad all documents to the maximum length of any of them
# Much faster on Google Colab Pro

t1 = time.time()

# study_corpus_tensor[i, :, :] will contain the representation of the i-th document
# padded to the length of the longest document
# study_corpus_tensor[i, j, :] will contain the embedding vector for the j-th in-vocabulary word in the i-th document,
# or all 0s, if j is past the end of the i-th document
study_corpus_tensor = torch.zeros((len(docs_as_tensors), max_length, embedding_vector_size))

for i, doc_tensor in enumerate(docs_as_tensors):
    study_corpus_tensor[i, 0:doc_tensor.shape[0], :] = doc_tensor

t2 = time.time()
print(t2 - t1)

2.6939125061035156
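
For comparison, PyTorch ships a helper that performs the same kind of zero padding. The sketch below applies `torch.nn.utils.rnn.pad_sequence` to toy tensors; the sizes are made up, and this is an alternative to the manual loop above, not the method used in this notebook.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

embedding_size = 4  # the notebook uses 100; 4 keeps the toy example small

# Three toy "documents" containing 2, 5, and 3 word vectors respectively
toy_docs = [torch.ones(n, embedding_size) for n in (2, 5, 3)]

# batch_first=True gives shape (num_docs, max_length, embedding_size);
# the padding value defaults to 0
padded = pad_sequence(toy_docs, batch_first=True)

print(padded.shape)
```

Rows beyond each document's true length are filled with zeros, matching the layout of study_corpus_tensor.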


In [None]:
study_corpus_tensor.shape

torch.Size([1341, 5434, 100])
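
Padding every document to the longest one makes the tensor fairly large. A quick back-of-the-envelope calculation, assuming the default float32 dtype (4 bytes per element), gives its approximate memory footprint:

```python
num_docs, max_length, embedding_size = 1341, 5434, 100
bytes_per_float32 = 4  # torch.zeros creates float32 tensors unless the default dtype is changed

num_elements = num_docs * max_length * embedding_size
size_gib = num_elements * bytes_per_float32 / 1024**3

print(num_elements)
print(round(size_gib, 2))
```

This comes to roughly 2.7 GiB, most of which is zero padding for documents far shorter than 5434 words.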

In [None]:
docs_as_tensors[58].shape

torch.Size([5434, 100])

In [None]:
# Save the documents recorded as stacks of word embeddings
# They are incomplete, as we used the word2vec embeddings from the 1341 documents under investigation, and not all
# discharge summaries in MIMIC-III
# torch.save(study_corpus_tensor, 'embedded_docs_incomplete.pt')

# No longer incomplete, used word vectors trained from all discharge summaries in MIMIC-III
torch.save(study_corpus_tensor, 'embedded_docs.pt')

In [None]:
labelled_corpus_df = pd.read_csv("labelled_corpus_df.csv")

In [None]:
study_corpus_tensor = torch.load("embedded_docs.pt")

In [None]:
labelled_corpus_df.head(1)

Unnamed: 0.1,Unnamed: 0,HADM_ID,SUBJECT_ID,Advanced.Cancer,Advanced.Heart.Disease,Advanced.Lung.Disease,Chronic.Neurological.Dystrophies,Chronic.Pain.Fibromyalgia,Alcohol.Abuse,Other.Substance.Abuse,Obesity,Schizophrenia.and.other.Psychiatric.Disorders,Depression,Cleaned Text
0,0,118003.0,3644,0,0,0,0,1,0,0,0,0,1,admission date 2200 4 7 discharge date 2200 4 ...
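
The extra "Unnamed: 0" column in the reloaded dataframe comes from the index, which `to_csv` writes as an unnamed first column by default. The sketch below shows the effect on a made-up one-row dataframe, and how passing `index=False` avoids it; it uses in-memory buffers instead of files so it can be run anywhere.

```python
import io
import pandas as pd

# Made-up one-row dataframe standing in for labelled_corpus_df
toy_df = pd.DataFrame({"HADM_ID": [118003.0], "Depression": [1]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the real columns are written
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

The extra column is harmless here, but it has to be ignored or dropped by any later code that selects columns by position.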


In [None]:
# Making sure we correctly padded with 0s
# torch.any flags a document if even one element is wrong
for i in range(num_study_docs):
    doc_length = docs_as_tensors[i].shape[0]
    if torch.any(study_corpus_tensor[i, 0:doc_length, :] != docs_as_tensors[i]):
        print("Copied wrong while padding document", i, "!")
    if torch.any(study_corpus_tensor[i, doc_length:, :] != 0):
        print("Did not pad document", i, "with 0s!")
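
When writing elementwise checks like this, the choice between `torch.all` and `torch.any` matters: `torch.all(a != b)` is true only if every element differs, while `torch.any(a != b)` is true as soon as one element differs. The toy sketch below, with made-up values, shows how a single corrupted element is reported by each.

```python
import torch

original = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
one_bad_value = original.clone()
one_bad_value[0, 0] = 99.0  # corrupt a single element

# all(): true only if every element differs, so one bad value goes unnoticed
flagged_by_all = bool(torch.all(one_bad_value != original))
# any(): true if at least one element differs
flagged_by_any = bool(torch.any(one_bad_value != original))
# torch.equal compares shapes and values in a single call
is_equal = torch.equal(one_bad_value, original)

print(flagged_by_all, flagged_by_any, is_equal)
```

For a sanity check that should catch any mismatch, `torch.any` on the inequality (or `not torch.equal(...)`) is the safer form.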

In [None]:
labelled_corpus_df["Depression"]

0       1
1       0
2       1
3       1
4       0
       ..
1336    0
1337    0
1338    0
1339    0
1340    0
Name: Depression, Length: 1341, dtype: int64