## Using FAST.AI for NLP

Exploring the MIMIC III data set medical notes.

Working with the full dataset proved impractical: almost every training step takes many hours (about 13 for initial training, and a predicted 14+ per epoch for fine-tuning).

Instead, this notebook works with just a 10% sample, though it is not yet clear whether that will be enough.

A few notes:
* See https://docs.fast.ai/text.transform.html#Tokenizer for details on what the various artificial tokens (e.g. xxup, xxmaj, etc.) mean
* Due to a change in the markdown package's private API, the 'doc' functionality (e.g. `doc(learn.lr_find)`) is currently broken. See https://github.com/fastai/fastai/commit/21faa5d187b2cccf2a48315d183c2863ed2cdc50
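
As a rough illustration of the `xxup` and `xxmaj` tokens mentioned above, here is a simplified sketch of the two capitalization rules. It is not fastai's implementation: `mark_caps` is a hypothetical helper, and it assumes the text has already been split on whitespace.

```python
# Simplified sketch of the capitalization markers (illustration only).
TK_UP, TK_MAJ = "xxup", "xxmaj"

def mark_caps(tokens):
    """Lowercase tokens, prefixing a marker that records the original casing."""
    out = []
    for tok in tokens:
        if tok.isupper() and len(tok) > 1:
            # whole token was upper case, e.g. "CPAP" -> xxup cpap
            out += [TK_UP, tok.lower()]
        elif tok[:1].isupper() and len(tok) > 1 and tok[1:].islower():
            # only the first letter was upper case, e.g. "Pt" -> xxmaj pt
            out += [TK_MAJ, tok.lower()]
        else:
            out.append(tok.lower())
    return out

tokens = "CCU NSG Transfer summary : Pt remains on CPAP".split()
marked = mark_caps(tokens)
print(" ".join(marked))
# xxup ccu xxup nsg xxmaj transfer summary : xxmaj pt remains on xxup cpap
```

The markers let the vocabulary store a single lowercase entry per word while still preserving casing information.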

In [1]:
from fastai.text import *
from sklearn.model_selection import train_test_split

In [2]:
# run this to see what has already been imported
#whos

In [3]:
# pandas doesn't understand ~, so provide full path
base_path = Path('/home/jupyter/mimic')
seed = 42
# previously used 48, which worked fine but never seemed to use even half of GPU memory; 64 was still on the small side
bs=128

While parsing a CSV and converting to a dataframe is pretty fast, loading a pickle file is much faster.

For load time and size comparison:
* `NOTEEVENTS.csv` is ~ 3.8GB in size
  ```
  CPU times: user 51.2 s, sys: 17.6 s, total: 1min 8s
  Wall time: 1min 47s
  ```
* `noteevents.pickle` is ~ 3.7 GB in size
  ```
  CPU times: user 2.28 s, sys: 3.98 s, total: 6.26 s
  Wall time: 6.26 s
  ```
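
The comparison above can be reproduced in miniature. The sketch below uses the standard library's `csv` and `pickle` modules on synthetic rows as a stand-in for pandas and `NOTEEVENTS.csv`; the file names and row count are made up for illustration, and the timings will differ from the real data.

```python
import csv
import pickle
import tempfile
import time
from pathlib import Path

# Synthetic stand-in for the notes table (hypothetical data).
rows = [{"ROW_ID": str(i), "TEXT": f"note text {i}"} for i in range(50_000)]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "notes.csv"
    pkl_path = Path(tmp) / "notes.pickle"

    # Write the same data in both formats.
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["ROW_ID", "TEXT"])
        writer.writeheader()
        writer.writerows(rows)
    with open(pkl_path, "wb") as f:
        pickle.dump(rows, f)

    # Time parsing the CSV.
    start = time.perf_counter()
    with open(csv_path, newline="") as f:
        from_csv = list(csv.DictReader(f))
    csv_seconds = time.perf_counter() - start

    # Time loading the pickle.
    start = time.perf_counter()
    with open(pkl_path, "rb") as f:
        from_pickle = pickle.load(f)
    pickle_seconds = time.perf_counter() - start

print(f"csv: {csv_seconds:.3f}s  pickle: {pickle_seconds:.3f}s")
```

Parsing text has to re-discover field boundaries and types on every load, while a pickle stores the already-built objects, which is likely why the gap is so large on the real file.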

In [10]:
%%time

filename = base_path/'noteevents.pickle'

if os.path.isfile(filename):
    orig_df = pd.read_pickle(filename)
else:
    print('Could not find noteevent pickle file; creating it')
    # run this the first time to convert the CSV to a pickle file
    orig_df = pd.read_csv(base_path/'NOTEEVENTS.csv', low_memory=False, memory_map=True)
    orig_df.to_pickle(filename)

CPU times: user 2.28 s, sys: 3.98 s, total: 6.26 s
Wall time: 6.26 s


Due to the data set size and for performance reasons, working with a 10% sample. Use the same random seed to get the same results from subsequent runs.

In [11]:
df = orig_df.sample(frac=0.1, random_state=seed)

In [12]:
df.head()

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,CHARTDATE,CHARTTIME,STORETIME,CATEGORY,DESCRIPTION,CGID,ISERROR,TEXT
1292716,1295263,2549,159440.0,2132-04-02,2132-04-02 13:09:00,2132-04-02 13:35:00,Nursing/other,Report,18566.0,,CCU NSG TRANSFER SUMMARY UPDATE: RESP FAILURE\...
1160271,1175599,29621,190624.0,2149-02-23,2149-02-23 03:27:00,,Radiology,CHEST (PORTABLE AP),,,[**2149-2-23**] 3:27 AM\n CHEST (PORTABLE AP) ...
1549380,1555118,22384,142591.0,2185-03-26,2185-03-26 17:58:00,2185-03-26 18:01:00,Nursing/other,Report,16985.0,,Respiratory Care\nPt remains intubated (#7.5 E...
7474,5743,690,152820.0,2182-09-14,,,Discharge summary,Report,,,Admission Date: [**2182-9-12**] Dischar...
2014768,2023163,25560,156143.0,2154-11-18,2154-11-18 10:44:00,2154-11-18 17:08:00,Nursing/other,Report,16888.0,,Neonatology\nOn exam pink active non-dysmorphi...


In [13]:
df.dtypes

ROW_ID           int64
SUBJECT_ID       int64
HADM_ID        float64
CHARTDATE       object
CHARTTIME       object
STORETIME       object
CATEGORY        object
DESCRIPTION     object
CGID           float64
ISERROR        float64
TEXT            object
dtype: object

In [14]:
df.shape

(208318, 11)

Split data into train and test sets, using the same random seed so subsequent runs generate the same split.

In [15]:
test_size = 0.333333333
train, test = train_test_split(df, test_size=test_size, random_state=seed)

In [16]:
train.shape

(138878, 11)

In [17]:
test.shape

(69440, 11)
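
The two shapes above can be sanity-checked by hand. The sketch below assumes that `train_test_split` rounds the test set size up and gives the remainder to the training set when only `test_size` is passed; that matches the numbers shown here, but treat it as an assumption about scikit-learn's behavior rather than a specification.

```python
import math

n_rows = 208318          # rows in the 10% sample (df.shape above)
test_size = 0.333333333  # fraction passed to train_test_split

# Assumed rule: test rows are rounded up, training rows are the remainder.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
# 138878 69440
```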

Code to build the initial version of the language model. With the full dataset this requires a **LOT** of RAM; having a **LOT** of CPU cores also helps it finish quickly.

Questions:

* why does this only seem to use CPU? (applies to both `TextClasDataBunch` and `TextList`)
* for 100% of the mimic noteevents data:
  * run out of memory at 32 GB, error at 52 GB, trying 72GB now... got down to only 440MB free; if crash again, increase memory
  * now at 20vCPU and 128GB RAM; ok up to 93%; got down to 22GB available
  * succeeded with 20CPU and 128GB RAM...
* try smaller batch size? will that reduce memory requirements?
* with 10% dataset sample, it seems I could get by with perhaps 32GB system RAM

For comparison:
* 10% language model is ~ 1.2 GB in size
  * Time to load existing language model:
    ```
    CPU times: user 3.29 s, sys: 844 ms, total: 4.14 s
    Wall time: 12.6 s
    ```
  * Time to build language model:
* 100% language model is...
  * Time to load existing language model:
    ```
    CPU times: user 3.29 s, sys: 844 ms, total: 4.14 s
    Wall time: 12.6 s
    ```
  * Time to build language model:

In [18]:
%%time

filename = base_path/'mimic_lm.pickle'
file = 'mimic_lm.pickle'

if os.path.isfile(filename):
    data_lm = load_data(base_path, file, bs=bs)
else:
    data_lm = (TextList.from_df(df, base_path, cols='TEXT')
               #second argument is the working path (not a CSV file name); df has several columns, actual text is in column TEXT
               .split_by_rand_pct(valid_pct=0.1, seed=seed)
               #We randomly split and keep 10% for validation
               .label_for_lm()
               #We want to do a language model so we label accordingly
               .databunch(bs=bs))
    data_lm.save(filename)

CPU times: user 3.29 s, sys: 844 ms, total: 4.14 s
Wall time: 12.6 s


If you need to view more data, run the appropriate line to make the display wider or show more columns:
```python
# default 20
pd.get_option('display.max_columns')
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_columns', None) # show all
# default 50
pd.get_option('display.max_colwidth')
pd.set_option('display.max_colwidth', -1) # show all (pandas >= 1.0 expects None instead of -1)
```

In [19]:
data_lm.show_batch()
# how to look at original version of text
#df[df['TEXT'].str.contains('being paralyzed were discussed', case=False)].TEXT

idx,text
0,pacs . xxmaj bp went back to 150 / 80 . xxup hr varies between 80s to low 90s at rest up to 1-teens with activity . xxup bp varies more widely between 1-teens / 70s at rest up to 170 / 90s with activity . xxmaj she continues on dilt 90 mg po qid . xxmaj she was xxup k+ replaced today . \n xxup resp : xxmaj
1,xxup teams . \n xxbos xxmaj respiratory note : \n xxmaj pt remains on xxup cpap overnight . xxmaj no changes made to the vent setting . xxmaj frequent sx due to very thick yellow secretions . xxmaj combivent inhaler given per order . xxmaj no xxup abg available . xxmaj no xxup rsbi . # 6.0 [ * * xxmaj last xxmaj name ( un ) *
2,"6.2 mg / dl \n xxmaj microbiology : xxup bal : colonization acinetobacter \n xxmaj blood cx [ * * 2 - 15 * * ] : [ * * xxmaj female xxmaj first xxmaj name ( un ) 474 * * ] albicans \n xxup ecg : pend \n xxmaj assessment and xxmaj plan \n xxup renal xxup failure , xxup acute ( xxup"
3,xxup id - will d / c amp / gent \n xxup neuro - will have a xxup hus next week . \n xxup social - will plan to meet with the family \n \n xxbos [ * * 2136 - 4 - 18 * * ] 11:24 xxup am \n xxup chest ( xxup portable xxup ap ) xxmaj clip # [ * * xxmaj
4,"showing pleural effusion . \n \n xxup technique : xxmaj informed consent was obtained from the patient 's son , [ * * xxmaj name ( xxup ni ) * * ] , \n over the phone . xxmaj this was witnessed by two physicians ( xxmaj dr. [ * * xxmaj last xxmaj name ( stitle ) 1886 * * ] and xxmaj dr. \n ["


In [14]:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)

In [18]:
learn.lr_find()
learn.recorder.plot(skip_end=15)

LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.


### Initial model training

The full data set took about 13 hours using the Nvidia P100, and was predicted to take about 25 hours with the T4.

The 10% sample is predicted to take about 1 hour (1:10) using the Nvidia P100.

In [20]:
# no idea how long nor how much resources this will take
# not sure 1e-2 is the right learning rate; maybe 1e-1 or between 1e-2 and 1e-1
# using t4
# progress bar says this will take around 24 hours... ran for about 52 minutes
# gpustat/nvidia-smi indicates currently only using about 5GB of GPU RAM
# using p100
# progress bar says this will take around 12 hours; took 13:16
# at start GPU using about 5GB RAM
# after about 8 hours GPU using about 7.5GB RAM.
# looks like I could increase batch size...
# with bs=64, still only seems to be using about 7GB GPU RAM after running for 15 minutes. 
# will check after a bit, but likely can increase batch size further

# fastai appends .pth when saving, so check for the file it actually writes
filename = base_path/'mimic_fit_head'

if os.path.isfile(str(filename) + '.pth'):
    learn.load(filename)
    print('loaded learner')
else:
    learn.fit_one_cycle(1, 5e-2, moms=(0.8,0.7))
    # save under the same name that is checked and loaded above
    learn.save(filename)

epoch,train_loss,valid_loss,accuracy,time
0,2.573480,2.400864,0.541587,1:08:18


In [44]:
# fastai automatically appends .pth to the filename, so do not include it yourself
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn.load(base_path/'mimic_fit_head')
print('done')

done
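
Because the `.pth` suffix is appended to whatever name is passed in, an existence check has to look for the suffixed file. A minimal sketch using only `pathlib`; `checkpoint_file` is a hypothetical helper, not part of fastai, and `touch()` stands in for an actual save.

```python
import tempfile
from pathlib import Path

def checkpoint_file(base_path, name):
    """Return the on-disk path for a checkpoint saved under `name`.

    The suffix is appended rather than swapped in with with_suffix(), so a
    name such as 'mimic_fine_tuned.pickle' becomes 'mimic_fine_tuned.pickle.pth'.
    """
    return Path(f"{Path(base_path) / name}.pth")

with tempfile.TemporaryDirectory() as tmp:
    before = checkpoint_file(tmp, "mimic_fit_head").is_file()
    checkpoint_file(tmp, "mimic_fit_head").touch()  # stand-in for saving a model
    after = checkpoint_file(tmp, "mimic_fit_head").is_file()
    dotted = checkpoint_file(tmp, "mimic_fine_tuned.pickle").name

print(before, after, dotted)
# False True mimic_fine_tuned.pickle.pth
```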


In [42]:
learn.show_results()

text,target,pred
xxbos [ * * 2159 - 9 - 15 * * ] 2:53 xxup pm \n xxup chest (,xxup portable xxup ap ) xxmaj clip # [ * * xxmaj clip xxmaj number ( xxmaj radiology ) xxunk,xxup portable xxup ap ) xxmaj clip # [ * * xxmaj clip xxmaj number ( xxmaj radiology ) xxunk
restarted during her hospital \n stay . xxmaj she appeared to be fairly euvolemic . xxmaj she may \n,require xxmaj lasix to be restarted during her rehab stay . \n \n 1 . xxmaj heme :,have \n monday . be started . the stay . . xxmaj xxmaj . xxmaj she : xxmaj hct
pacing wire is seen in the xxup ra \n and extending into the xxup rv . \n \n,xxup left xxup ventricle : xxmaj mild symmetric xxup lvh . xxmaj moderately dilated xxup lv cavity . \n,xxup ventricle : xxmaj normal symmetric xxup lvh with xxmaj normal dilated xxup rv cavity . xxmaj xxmaj no global
* * ] 05:08 xxup am \n [ * * 2141 - 2 - 10 * * ] 05:26,xxup am \n [ * * 2141 - 2 - 10 * * ] 11:17 xxup am \n,xxup xxup am \n xxup * * xxmaj - 3 - 4 * * ] \n xxup pm
xxrep 78 _ \n xxup final xxup report \n xxup indication : 37-year - old woman with xxup,copd and increasing dyspnea on exertion in \n the setting of chest pain . xxmaj evaluate for pulmonary embolism,"hiv , xxup dyspnea . exertion . the the lower of \n pain . \n \n evaluate for"


In [8]:
prev_cycles = 0
cycles_file = base_path/'num_iterations.pickle'

if os.path.isfile(cycles_file):
    with open(cycles_file, 'rb') as f:
        prev_cycles = pickle.load(f)
print('This model has been trained for', prev_cycles, 'epochs already')

This model has been trained for 6 epochs already
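
The epoch counter can also be wrapped in two small helpers so that it is updated once per completed epoch. This is only a sketch: `load_epochs` and `record_epoch` are hypothetical names, and a temporary directory stands in for `base_path`.

```python
import pickle
import tempfile
from pathlib import Path

def load_epochs(counter_file):
    """Return the stored epoch count, or 0 if nothing has been saved yet."""
    if not counter_file.is_file():
        return 0
    with open(counter_file, "rb") as f:
        return pickle.load(f)

def record_epoch(counter_file):
    """Increment the stored count by one and return the new total."""
    total = load_epochs(counter_file) + 1
    with open(counter_file, "wb") as f:
        pickle.dump(total, f)
    return total

with tempfile.TemporaryDirectory() as tmp:
    counter = Path(tmp) / "num_iterations.pickle"
    start = load_epochs(counter)                      # no file yet
    totals = [record_epoch(counter) for _ in range(3)]  # three finished epochs
    final = load_epochs(counter)

print(start, totals, final)
# 0 [1, 2, 3] 3
```

Writing the count after each epoch, rather than the planned total, keeps the file accurate if a long run is interrupted partway through.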


In [25]:
# if want to continue training existing model, set to True
# if want to start fresh from the initialized language model, set to False
continue_flag = False
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)

if continue_flag:
    file = 'mimic_fine_tuned_' + str(prev_cycles)
    learner_file = base_path/file
    if os.path.isfile(str(learner_file) + '.pth'):
        learn.load(learner_file)
        print('loaded existing learner from ', str(learner_file))
    else:
        print('existing learner file not found')    
else:
    prev_cycles = 0

learn.unfreeze()

num_cycles = 2
for n in range(num_cycles):
    #learn.fit_one_cycle(1, 5e-3, moms=(0.8,0.7))
    file = 'mimic_fine_tuned_' + str(prev_cycles + n + 1)
    learner_file = base_path/file
    learn.save(learner_file)
    # record the epochs completed so far (not the planned total),
    # so the counter stays correct if the loop is interrupted
    with open(cycles_file, 'wb') as f:
        pickle.dump(prev_cycles + n + 1, f)
    
print('completed', num_cycles, 'new training epochs')
print('completed', num_cycles + prev_cycles, 'total training epochs')

NameError: name 'data_lm' is not defined

In [17]:
# at batch size of 128 takes about 1:14:00 per epoch
#       GPU usage is about 14GB; RAM usage is about 10GB
# at batch size of 96 takes about 1:17:00 per epoch
#       GPU usage is about 9GB; RAM usage is about 10GB
# at batch size of 48 takes about 1:30:00 per epoch
#       GPU usage is about 5GB; RAM usage is about 10GB
#
# need to evaluate how changing the learning rate would alter training time or accuracy


learn.fit_one_cycle(8, 5e-3, moms=(0.8,0.7))
# 8 cycles gets from about 62.7% accuracy to 67.6% accuracy

epoch,train_loss,valid_loss,accuracy,time
0,1.926960,1.832659,0.627496,1:14:14
1,1.808083,1.755725,0.637424,1:14:15
2,1.747903,1.697741,0.645431,1:14:15
3,1.714081,1.652703,0.652703,1:14:19
4,1.637801,1.602961,0.660170,1:14:15
5,1.596906,1.553225,0.668557,1:14:14
6,1.572020,1.519172,0.674477,1:14:26
7,1.517364,1.510010,0.676342,1:14:14
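
For planning further runs, the table above can be summarized with a little arithmetic. The values below are copied from the output; `parse_hms` is a hypothetical helper.

```python
from datetime import timedelta

# Per-epoch wall times and accuracies copied from the training output above.
epoch_times = ["1:14:14", "1:14:15", "1:14:15", "1:14:19",
               "1:14:15", "1:14:14", "1:14:26", "1:14:14"]
accuracies = [0.627496, 0.637424, 0.645431, 0.652703,
              0.660170, 0.668557, 0.674477, 0.676342]

def parse_hms(text):
    """Convert an 'h:mm:ss' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

total = sum((parse_hms(t) for t in epoch_times), timedelta())
gain = accuracies[-1] - accuracies[0]

print(total, round(gain, 6))
# 9:54:12 0.048846
```

So the eight epochs cost just under ten hours of GPU time for a gain of roughly 4.9 percentage points of accuracy.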


In [None]:
learn.fit_one_cycle(1, 5e-3, moms=(0.8,0.7))
learn.save(base_path/'mimic_fine_tuned.pickle')

epoch,train_loss,valid_loss,accuracy,time


In [19]:
import glob
print(os.getcwd())
glob.glob(str(base_path/'mimic_fine_tuned*'))

/home/jupyter/fastai-notes


['/home/jupyter/mimic/mimic_fine_tuned.pickle.pth']

In [None]:
learn.load(base_path/'mimic_fine_tuned.pickle')

In [20]:
# test the language generation capabilities of this model (not the point, but is interesting)
TEXT = "For confirmation, she underwent CTA of the lung which was negative for pulmonary embolism"
N_WORDS = 40
N_SENTENCES = 2
print("\n".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))

For confirmation, she underwent CTA of the lung which was negative for pulmonary embolism 
  but showed no PE , but did show some pulmonary edema . She has had 
  some mild dyspnea on exertion but has improved . She was brought to the 
  ED for further evaluation
For confirmation, she underwent CTA of the lung which was negative for pulmonary embolism or dissection . 
  She was extubated today and given 1 unit of prbcs for Hct of 24 . She is 
  afebrile , HR in the 120s , BP stable . She is


In [21]:
learn.save_encoder('mimic_fine_tuned_enc.pickle')