## Training `ELMo` on MIMIC data for Clinical Natural Language Processing

A short tutorial on how to train [AllenNLP's ELMo](https://allennlp.org/elmo) on [MIMIC](https://mimic.physionet.org/) data for medical/clinical Natural Language Processing. MIMIC is the most commonly used dataset for people doing NLP for medical/clinical purposes. Because CITI training is required to access MIMIC data, I will not be sharing the training files or the final model. However, this tutorial will guide you through the steps to train `ELMo` on MIMIC data yourself! Training `ELMo` from scratch is necessary for medical/clinical NLP because the language we find in clinical notes and medical write-ups is vastly different from the language of Wikipedia.

## Setup

To get started, `git clone` [this repository](https://github.com/hclent/bilm-tf), which is a fork of the original `BiLM` repository plus a copy of this Jupyter notebook, which generates the training data for ELMo from MIMIC.

Then set up your Python environment. I'd recommend creating a new anaconda environment with our `requirements.txt`.
Once you have done this, don't forget to also follow the installation instructions for `BiLM`:

    pip install tensorflow-gpu==1.2 h5py
    python setup.py install 
    
And I would suggest running the tests: `python -m unittest discover tests/` 

To train `ELMo` on MIMIC, we will complete the following steps from the `BiLM README`:


    1. Prepare input data and a vocabulary file.
    2. Train the biLM.
    3. Test (compute the perplexity of) the biLM on heldout data.
    4. Write out the weights from the trained biLM to a hdf5 file.
    5. See the instructions above for using the output from Step #4 in downstream models.


Let's get started by preparing the input data and vocabulary file. 

In [1]:
from n2c2_tokenizer import build_n2c2_tokenizer #Credit to Kelly (https://github.com/burgersmoke) and Jianlin (https://github.com/jianlins)
import time, os, sys, multiprocessing, nltk, itertools
from collections import Counter
from multiprocessing import Pool
from sqlalchemy import create_engine, MetaData, Table, select

In [2]:
'''Step 0: Initialize our tokenizer for MIMIC data'''

ENABLE_PYRUSH_SENTENCE_TOKENIZER = False

n2c2_tokenizer = build_n2c2_tokenizer(enable_pyrush_sentence_tokenizer = ENABLE_PYRUSH_SENTENCE_TOKENIZER,
                                     disable_custom_preprocessing = ENABLE_PYRUSH_SENTENCE_TOKENIZER, keep_token_strings=True)
#Here is an example:
tokenized_doc_example = n2c2_tokenizer.tokenize_document("I am a simple document. Here are my sentences. NLP is the best.")

print(tokenized_doc_example.sentence_tokens_list)

Building n2c2 tokenizer...
('.', '!')
Enabling NLTK Punkt for sentence tokenization...
Type of sentence tokenizer : <class 'nltk.tokenize.punkt.PunktSentenceTokenizer'>
Enabling custom preprocessing expressions.  Total : 8
Class type initialized for ClinicalSentenceTokenizer for sentence tokenization : <class 'nltk.tokenize.punkt.PunktSentenceTokenizer'>
Compiled 8 total preprocessing regular expressions
Class type initialized for IndexTokenizer for sentence tokenization: <class 'clinical_tokenizers.ClinicalSentenceTokenizer'>
[['I', 'am', 'a', 'simple', 'document', '.'], ['Here', 'are', 'my', 'sentences', '.'], ['NLP', 'is', 'the', 'best', '.']]


In [3]:
'''Step 1: Load the MIMIC data. I have my MIMIC data in an SQLite database. 
For how to do this, see: https://github.com/hclent/PyPatent/blob/master/readMimic.py'''

def getMimicTexts():
    '''
    Input: N/A
    Output: List[Strings] for all 2 million+ MIMIC texts **lowercase**. 
    We're going to use this List[Strings] to generate the data and sorted vocabulary files that are needed to train BiLM.
    '''
    t1 = time.time() #start timer
    
    engine = create_engine('sqlite:///mimic.db') #initiated database engine
    conn = engine.connect()
    metadata = MetaData(bind=engine) #init metadata. will be empty
    metadata.reflect(engine) #retrieve db info for metadata (tables, columns, types)
    mydata = Table('mydata', metadata)

    data: list[str] = []

    #Query db for text. Not efficient. You can only execute one statement at a time with SQLite. Soz bro.   
    s = select([mydata.c.TEXT]) 
    print(type(s))
    result = conn.execute(s)
    print(type(result))
    for row in result:
        #text
        the_text = row["TEXT"]
        keep_text = the_text.rstrip()
        lower_text = keep_text.lower() #lowercase v important.
        # NB: tokenization will happen later. It is too slow to *NOT* run in parallel. 
        data.append(lower_text)
    
    print(" * Finished step0: done in %0.3fs." % (time.time() - t1))
    #Takes less than 1 minute to load all MIMIC texts into memory.
    return data

In [4]:
list_of_all_docs = getMimicTexts()

<class 'sqlalchemy.sql.selectable.Select'>
<class 'sqlalchemy.engine.result.ResultProxy'>
 * Finished step0: done in 47.032s.
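
If SQLAlchemy is not available, the same load could be sketched with the standard library's `sqlite3` module. The snippet below assumes the same layout as above (a table named `mydata` with a `TEXT` column); the helper name `iter_mimic_texts` is hypothetical, and the demo builds a tiny in-memory database with made-up rows rather than touching real MIMIC data.

```python
import sqlite3
from typing import Iterator


def iter_mimic_texts(conn: sqlite3.Connection, table: str = "mydata") -> Iterator[str]:
    """Yield each note lowercased with trailing whitespace stripped.

    Assumes `table` has a TEXT column, as in the cell above. The table name is
    interpolated into the SQL, so only pass trusted values.
    """
    cursor = conn.execute(f'SELECT "TEXT" FROM "{table}"')
    for (text,) in cursor:
        if text is not None:  # skip NULL notes instead of crashing on .rstrip()
            yield text.rstrip().lower()


# Demo on a throwaway in-memory database (made-up rows, not real MIMIC data).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute('CREATE TABLE mydata ("TEXT" TEXT)')
demo_conn.executemany(
    "INSERT INTO mydata VALUES (?)",
    [("Pt is a 65 y/o M.  \n",), (None,), ("No acute distress.",)],
)
demo_texts = list(iter_mimic_texts(demo_conn))
print(demo_texts)
```

Because this is a generator, documents can be consumed one at a time; wrapping it in `list(...)` reproduces the all-in-memory behavior of `getMimicTexts()`.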


In [5]:
#### Example ##### 
print("* list_of_all_docs is a: ", type(list_of_all_docs))
print("* number docs in list_of_all_docs: ", len(list_of_all_docs))
print("* documents in list_of_all_docs are: ", type(list_of_all_docs[0]))
# print("* Example documents: ", list_of_all_docs[0])

* list_of_all_docs is a:  <class 'list'>
* number docs in list_of_all_docs:  2083180
* documents in list_of_all_docs are:  <class 'str'>


In [6]:
'''Step 2: Create a helper function to run with multiprocessing that will tokenize the document, 
prepare sentences in the pretty format that BiLM wants (for data.txt), 
and prepare the tokens that a Counter will later turn into the vocabulary (for vocab.txt)'''

def processText(document):
    '''
    Input: String of the document
    Output: {"tokens": [list of tokens], "sentences": [list of pretty sentences]} = This output will 
    be used to create vocab.txt and data.txt.
    
    Data.txt needs > The training data should be randomly split into many 
        training files, each containing one slice of the data. 
        Each file contains pre-tokenized and white space separated text, one sentence per line. 
        Don't include the <S> or </S> tokens in your training data.
    
    Vocab.txt needs > the vocabulary file should be sorted in descending order by token count in your training data. 
        The first three lines should be the special tokens (<S>, </S> and <UNK>), then the most common token in the training data, ending with the least common token.

    '''
    #tokenize
    tokenized = n2c2_tokenizer.tokenize_document(document).sentence_tokens_list #list of lists of tokens
    #format sentences for data.txt
    pretty_sentences = [' '.join(sentences) for sentences in tokenized]
    
    #flatten the list of lists into one list of strings 
    flatten = list(itertools.chain(*tokenized))
    
    return_dict = {'tokens': flatten, 'sentences': pretty_sentences}
        
    return return_dict

unique_words_example = processText(list_of_all_docs[0])
#print(unique_words_example["tokens"])
#print("#"*20)
#print(unique_words_example["sentences"])
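
To make the two output formats concrete, here is a small sketch that uses a stand-in tokenizer (a naive split on periods and whitespace) in place of `n2c2_tokenizer`. Both `toy_tokenize` and `to_bilm_formats` are hypothetical names for illustration only; real clinical text needs the real tokenizer.

```python
import itertools
from typing import Dict, List


def toy_tokenize(document: str) -> List[List[str]]:
    """Stand-in for n2c2_tokenizer: split on '.', then on whitespace."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    return [s.split() + ["."] for s in sentences]


def to_bilm_formats(tokenized: List[List[str]]) -> Dict[str, List[str]]:
    """Mirror processText: one space-joined line per sentence, plus a flat token list."""
    return {
        "sentences": [" ".join(tokens) for tokens in tokenized],
        "tokens": list(itertools.chain.from_iterable(tokenized)),
    }


formatted = to_bilm_formats(toy_tokenize("pt stable. no acute distress."))
print(formatted["sentences"])  # lines destined for the data file
print(formatted["tokens"])     # tokens destined for the vocabulary Counter
```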

In [7]:
'''Step 3: Run the helper function asynchronously with multiprocessing to create the vocab.txt and data.txt 
that is necessary to run BiLM.
'''

def createTrainingData(index_start, index_end, n_try):
    '''
    Input: 
       * index_start: Int = index of list_of_all_docs you are trying to start from. E.g. 0 would be the first element in the list.
       * index_end: Int = index of list_of_all_docs you want to stop on (exclusive). E.g. len(list_of_all_docs) would be the whole thing (-1 would drop the last document). *I DO NOT RECOMMEND THIS*
       * n_try: Int = This will be the suffix on the list of training files you create. I.e. mimic_data_1.txt, mimic_data_2.txt... 
    It's going to take a very long time if you try to process all of the Mimic data at once 
    (there are 2 million + documents in list_of_all_docs).
    So instead, we are going to break it up into bite sizes by indexing our list of documents and then 
    combine the vocabulary Counters to create the vocab.txt at the end.
    '''
    t1 = time.time() #start the timer
    
    pool_size = multiprocessing.cpu_count() #NOTE: Usin' all yer CPU's my friend. Change this if you want!!!!
    pool = Pool(pool_size)
    print('* created worker pools')
    results0 = pool.map_async(processText, list_of_all_docs[index_start:index_end]) 
    print('* initialized map_async to naiveSearchText function with docs')
    print('* did map to getSetOfWords function with docs. WITH async')
    pool.close()
    print('* closed pool')
    pool.join()
    print('* joined pool')
    list_of_dicts = [r for r in results0.get() if r is not None] # one dict per document
    print("Number of dictionaries created: ", len(list_of_dicts))

    """Step A: create data.txt: Should have 1 sentence per line"""
    #get all doc's sentences
    document_sentences = [s["sentences"]  for s in list_of_dicts]
    #flatten to one big list of sentences
    flatten_sents: list[str] = list(itertools.chain(*document_sentences))
    #output to mimic_data_<n_try>.txt
    name_of_file = "mimic_data_" + str(n_try) + ".txt"
    with open(name_of_file, "w") as out:
        for sent in flatten_sents:
            out.write(sent)
            out.write("\n")
    
    """Step B: create vocab.txt: Should have 1 token per line, as well as AllenNLP special tokens. 
    We're going to simply return this output_counter, so that we can sum multiple Counters to output 
    the vocab.txt"""
    all_tokens = [d["tokens"] for d in list_of_dicts]
    all_flatten = list(itertools.chain(*all_tokens))
    output_counter = Counter(all_flatten)
    
    print(" * Created mimic_data.txt & mimic_vocab.txt: done in %0.3fs." % (time.time() - t1))
    return output_counter
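
Rather than computing slice boundaries by hand, the (start, end) pairs passed to `createTrainingData` could be generated. This is a minimal sketch; `chunk_bounds` is a hypothetical helper, not part of `BiLM`.

```python
from typing import List, Tuple


def chunk_bounds(n_docs: int, n_chunks: int) -> List[Tuple[int, int]]:
    """Split range(n_docs) into n_chunks contiguous (start, end) slices.

    Ends are exclusive, matching Python slicing, and the last slice always
    ends at n_docs so no document is dropped.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    edges = [(i * n_docs) // n_chunks for i in range(n_chunks + 1)]
    return list(zip(edges[:-1], edges[1:]))


bounds = chunk_bounds(2083180, 4)
print(bounds)
```

Each pair could then be passed as `createTrainingData(start, end, n_try)`, with `n_try` counting up from 1.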



In [8]:
# 10 docs in 0.184s.
# 100 docs in 0.737s
# 1,000 docs in 5.684s.
# 10,000 docs in 59.077s.
'''Looks like it scales linearly! :) '''
#So 100k documents should take ~10 minutes
# 1 million docs should take ~ 1.6 hours
# 2 million should take ~ 3.3 hours 

###### A small example #####
v1 = createTrainingData(0, 10, 1)
v2 = createTrainingData(10, 20, 2)
dummy_vocab = v1  + v2
print(len(dummy_vocab))

* created worker pools
* initialized map_async to naiveSearchText function with docs
* did map to getSetOfWords function with docs. WITH async
* closed pool
* joined pool
Number of dictionaries created:  10
 * Created mimic_data.txt & mimic_vocab.txt: done in 9.051s.
* created worker pools
* initialized map_async to naiveSearchText function with docs
* did map to getSetOfWords function with docs. WITH async
* closed pool
* joined pool
Number of dictionaries created:  10
 * Created mimic_data.txt & mimic_vocab.txt: done in 3.545s.
5159
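
Adding two `Counter` objects sums the counts of shared tokens, which is why the per-chunk counters can be combined with `+`. A toy illustration with made-up tokens:

```python
from collections import Counter

chunk_a = Counter(["pt", "stable", ".", "pt"])
chunk_b = Counter(["pt", "afebrile", "."])

merged = chunk_a + chunk_b

print(len(merged))            # number of unique tokens (vocabulary size)
print(sum(merged.values()))   # total number of tokens seen
print(merged.most_common(2))  # most frequent tokens first
```

Note the difference between the two numbers: `len(...)` is the vocabulary size, while `sum(....values())` is the total token count.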


In [None]:
"""Now we are going to call createTrainingData() and combine all of the output_counters into one giant counter for the whole vocab.txt.
Change the indices below to match however much of MIMIC you plan to train on.
"""


##### The whole kitten kaboodle #####
first_quarter = (len(list_of_all_docs))*.25
print(first_quarter)
half_way = (len(list_of_all_docs))*.5
print(half_way)
three_quarters = (len(list_of_all_docs))*.75
print(three_quarters)
to_the_end = len(list_of_all_docs)
print(to_the_end)


output1 = createTrainingData(0, int(first_quarter), 1)  #the third arg of all of these should count sequentially up!
output2 = createTrainingData(int(first_quarter), int(half_way), 2)  #second data training file
output3 = createTrainingData(int(half_way), int(three_quarters), 3)  #third data training file
output4 = createTrainingData(int(three_quarters), int(to_the_end), 4)  #fourth data training file

final_vocabulary = output1 + output2 + output3 + output4
print("* LEN FINAL VOCABULARY (n_tokens_vocab): ", len(final_vocabulary))

#We have to add some specific tokens to the top of our vocab.txt to make AllenNLP happy
allen_specific = ['<S>','</S>','<UNK>'] 

#now output to vocab.txt
with open("mimic_vocab.txt", "w") as out:
    for special in allen_specific:
        out.write(special)
        out.write("\n")
    for token, count in final_vocabulary.most_common():
        out.write(token)
        out.write("\n")
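
Before training, it may be worth sanity-checking the vocabulary file against the format described earlier: special tokens first, one token per line, no duplicates. The checker below is a hypothetical helper (not part of `BiLM`), it assumes the file is UTF-8, and the demo writes toy files to a temporary directory instead of reading `mimic_vocab.txt`.

```python
import os
import tempfile
from typing import List

SPECIAL_TOKENS = ["<S>", "</S>", "<UNK>"]


def check_vocab_file(path: str) -> List[str]:
    """Return a list of problems found in a vocab file (empty list means OK)."""
    with open(path, encoding="utf-8") as handle:
        tokens = [line.rstrip("\n") for line in handle]
    problems = []
    if tokens[:3] != SPECIAL_TOKENS:
        problems.append("first three lines are not <S>, </S>, <UNK>")
    if len(set(tokens)) != len(tokens):
        problems.append("duplicate tokens")
    if any(not tok or any(ch.isspace() for ch in tok) for tok in tokens):
        problems.append("empty line or token containing whitespace")
    return problems


# Toy demo in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    good_path = os.path.join(tmp_dir, "good_vocab.txt")
    with open(good_path, "w", encoding="utf-8") as out:
        out.write("\n".join(SPECIAL_TOKENS + ["pt", "."]) + "\n")
    bad_path = os.path.join(tmp_dir, "bad_vocab.txt")
    with open(bad_path, "w", encoding="utf-8") as out:
        out.write("pt\n.\npt\n")
    good_problems = check_vocab_file(good_path)
    bad_problems = check_vocab_file(bad_path)

print(good_problems)
print(bad_problems)
```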

### Before you are done with this notebook....!!!!
There is one **very important** number you need to get out of here: the total number of tokens. In `train_elmo.py`, you will need to set `n_train_tokens` to this number + 3 for the AllenNLP special tokens!!!

In [None]:
print(" I M P O R T A N T !  !  ! ")
print("SET THIS TO 'n_train_tokens' in train_elmo.py: ", sum(final_vocabulary.values())+3) #adding 3 for the 3 unique AllenNLP tokens
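
As a cross-check, the token total can also be recomputed from the training files themselves, since each line holds whitespace-separated tokens. A sketch, assuming the files follow the `mimic_data_<n>.txt` naming used above and are UTF-8; `count_tokens` is a hypothetical helper, and the demo uses a temporary directory with toy files.

```python
import glob
import os
import tempfile


def count_tokens(pattern: str) -> int:
    """Count whitespace-separated tokens across all files matching a glob pattern."""
    total = 0
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                total += len(line.split())
    return total


# Toy demo: two small training files in a temporary directory.
toy_files = ["pt stable .\nno acute distress .\n", "afebrile .\n"]
with tempfile.TemporaryDirectory() as tmp_dir:
    for n_try, text in enumerate(toy_files, start=1):
        file_path = os.path.join(tmp_dir, f"mimic_data_{n_try}.txt")
        with open(file_path, "w", encoding="utf-8") as out:
            out.write(text)
    n_tokens = count_tokens(os.path.join(tmp_dir, "mimic_data_*.txt"))

print(n_tokens)
```

On the real files this count should match `sum(final_vocabulary.values())`, provided no token contains internal whitespace.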

## Step 2: Train biLM

    1. Prepare input data and a vocabulary file.
    --> 2. Train the biLM.
    3. Test (compute the perplexity of) the biLM on heldout data.
    4. Write out the weights from the trained biLM to a hdf5 file.
    5. See the instructions above for using the output from Step #4 in downstream models.

Now that the data files have been created, you are ready to train `biLM`!! Please see the `train_on_mimic.sh` script in the repo.

### Other observations about training `BiLM`
* Make sure you've run `python setup.py install`!
* Double check that you have updated `train_elmo.py` with `n_train_tokens` (the total token count for the MIMIC documents you are using, not the vocabulary size), `n_gpus` (your number of GPUs), and your batch size.
* Even if you don't have GPUs, `n_gpus` cannot be set to 0 or `training.py` will throw an error.
* If you see a warning like:
    """
    WARNING:tensorflow:Error encountered when serializing lstm_output_embeddings.
    Type is unsupported, or the types of the items don't match field type in CollectionDef.
    'list' object has no attribute 'name'
    """
  it may be the same issue as [this one](https://github.com/tflearn/tflearn/issues/190#issuecomment-231545279), in which case it probably does not matter.
