# Install
Before we begin, you will need to install the following package if you don't already have it. Here is the install command:

**transformers**: `conda install -c conda-forge transformers`

It's important to note that my code differs from Kexin's because I [migrated](https://huggingface.co/transformers/migration.html) to [HuggingFace's](https://huggingface.co/transformers/index.html) new `transformers` module instead of `pytorch_pretrained_bert`, which the author used.

# Read this article for ClinicalBERT
https://arxiv.org/pdf/1904.05342.pdf
They develop ClinicalBert by applying BERT (bidirectional encoder representations from transformers) to clinical notes. 

```
@article{clinicalbert,
author = {Kexin Huang and Jaan Altosaar and Rajesh Ranganath},
title = {ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission},
year = {2019},
journal = {arXiv:1904.05342},
}
```

# How My Work Differs from the Author's
1. I am not pre-training ClinicalBERT because the author already pre-trained it on clinical notes and the model's weights are already available.
2. I am only working with early clinical notes. "Discharge summaries have predictive power for readmission. However, discharge summaries might be written after a patient has left the hospital. Therefore, discharge summaries are not actionable since doctors cannot intervene when a patient has left the hospital. Models that dynamically predict readmission in the early stages of a patient's admission are relevant to clinicians...a maximum of the first 48 or 72 hours of a patient's notes are concatenated. These concatenated notes are used to predict readmission."[pg 12](https://arxiv.org/pdf/1904.05342.pdf)


<img src="./images/fig1.png" width="800" />

In this example, care providers add notes to an electronic health record during a patient’s admission, and the model dynamically updates the patient’s risk of being readmitted within a 30-day window.


Boag et al. (2018) study the performance of the bag-of-words model, word2vec, and a Long Short-Term Memory Network (LSTM) model combined with word2vec on various tasks such as diagnosis prediction and mortality risk estimation. Word embedding models such as word2vec are trained using the local context of individual words, but as clinical notes are long and their words are interdependent (Zhang et al., 2018), these methods cannot capture long-range dependencies.

Clinical notes require capturing interactions between distant words.

In this work, they develop a model that can predict readmission dynamically. **Making a prediction using a discharge summary at the end of a stay means that there are fewer opportunities to reduce the chance of readmission. To build a clinically-relevant model, we define a task for predicting readmission at any timepoint since a patient was admitted.**

Medicine suffers from alarm fatigue (Sendelbach and Funk, 2013). This
means useful classification rules for medicine need to have high precision (positive predictive value).

Compared to a popular model of clinical text, word2vec, ClinicalBert more accurately captures clinical word similarity.

ClinicalBERT is a modified BERT model: Specifically, the representations are learned
using medical notes and further processed for downstream clinical tasks.
* The transformer encoder architecture is based on a self-attention mechanism
* The pre-training objective function for the model is defined using two unsupervised tasks: masked language modeling and next sentence prediction. 
* The text embeddings and model parameters are fit using stochastic optimization.

<img src="./images/fig2.png" width="800" />

ClinicalBert learns deep representations of clinical text using two unsupervised language modeling tasks: masked language modeling and
next sentence prediction

### Clinical Text Embeddings
A clinical note input to ClinicalBert is represented as a collection of tokens. In ClinicalBert, a token in a clinical note is computed as
the sum of the token embedding, a learned segment embedding, and a position embedding.
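
To make the sum concrete, here is a minimal sketch of how one input vector per token could be built. The vocabulary, the tiny hidden size and the random tables are all made up for illustration; in the real model these tables are learned and much larger.

```python
import random

random.seed(0)
HIDDEN = 8          # toy size (assumption); BERT-base uses 768
MAX_POSITIONS = 16  # toy size (assumption)
NUM_SEGMENTS = 2
VOCAB = {"[CLS]": 0, "patient": 1, "has": 2, "fever": 3, "[SEP]": 4}  # hypothetical vocabulary


def random_table(rows, cols):
    """Stand-in for a learned embedding matrix (values here are random)."""
    return [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


token_emb = random_table(len(VOCAB), HIDDEN)
segment_emb = random_table(NUM_SEGMENTS, HIDDEN)
position_emb = random_table(MAX_POSITIONS, HIDDEN)


def embed(tokens, segment_ids):
    """Return one vector per token: token + segment + position embedding."""
    vectors = []
    for position, (token, segment) in enumerate(zip(tokens, segment_ids)):
        t = token_emb[VOCAB[token]]
        s = segment_emb[segment]
        p = position_emb[position]
        vectors.append([a + b + c for a, b, c in zip(t, s, p)])
    return vectors


tokens = ["[CLS]", "patient", "has", "fever", "[SEP]"]
inputs = embed(tokens, segment_ids=[0, 0, 0, 0, 0])
```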

### Pre-training ClinicalBERT
The quality of learned representations of text depends on the text the model was trained on. BERT is trained on BooksCorpus and Wikipedia. However, these two datasets are distinct from clinical notes (where jargon and abbreviations are common). Also clinical notes have different syntax and grammar than common language in books or encyclopedias. It is hard to understand clinical notes without professional training.

ClinicalBERT improves over BERT on the MIMIC-III corpus of clinical notes for:
1. Accuracy of masked language modeling, a.k.a. predicting held-out tokens (86.80% vs. 56.80%).
2. Next sentence prediction (99.25% vs. 80.50%).

The pre-training objective function based on the two tasks is the sum of the log-likelihood of the masked tokens and the log-likelihood of the binary variable indicating whether two sentences are consecutive.
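
As a rough sketch of that objective for a single training example (the function name and the probabilities are made up for illustration; the real model computes these over batches of tokenized text):

```python
import math


def pretraining_log_likelihood(masked_token_probs, next_sentence_prob, is_next):
    """Masked-token log-likelihood plus next-sentence log-likelihood.

    masked_token_probs: probability the model gave to each true masked token
    next_sentence_prob: predicted probability that sentence B follows sentence A
    is_next: True if sentence B really follows sentence A
    """
    mlm_ll = sum(math.log(p) for p in masked_token_probs)
    nsp_ll = math.log(next_sentence_prob if is_next else 1.0 - next_sentence_prob)
    return mlm_ll + nsp_ll


ll = pretraining_log_likelihood([0.9, 0.6], next_sentence_prob=0.8, is_next=True)
loss = -ll  # training minimizes the negative log-likelihood
```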

### Fine-tuning ClinicalBERT
The model parameters are fine-tuned to maximize the log-likelihood of this binary classifier: equation (2)

##  Empirical Study II: 30-Day Hospital Readmission Prediction
Before the author even evaluated ClinicalBERT's performance as a model of readmission, **his initial experiment showed that the original BERT performed worse on the masked language modeling task on the MIMIC-III data as well as on the next sentence prediction task. This shows the need to develop models tailored to clinical data such as ClinicalBERT!**

<img src="./images/equ3.png" width="600" />

He finds that computing readmission probability using Equation (3) consistently outperforms predictions on each subsequence individually by 3–8%. This is because
1. Some subsequences (such as tokens corresponding to progress reports) do NOT contain information about readmission, whereas others do. The risk of readmission should be computed using subsequences that correlate with readmission risk, and **the effect of unimportant subsequences should be minimized**. This is accomplished by using the maximum probability over subsequences.
2. Noisy subsequences also mislead the model and decrease performance, so the average probability of readmission across subsequences is included as well. This leads to a trade-off between the mean and maximum probabilities of readmission in Equation (3).
3. If there are a large number of subsequences for a patient with many clinical notes, there is a higher probability of having a noisy maximum probability of readmission. This means longer sequences may need to have a larger weight on the mean prediction. We include this weight as the n/c scaling factor, with c adjusting for patients with many clinical notes.

Empirically, he found that c = 2 performs best on validation data.
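
Equation (3) only appears as an image here, so the sketch below assumes it has the form `(P_max + P_mean * n/c) / (1 + n/c)`, which is my reading of the paper. The function name is my own, and `subsequence_probs` stands for the per-subsequence readmission probabilities of one patient.

```python
def readmission_probability(subsequence_probs, c=2.0):
    """Combine per-subsequence probabilities into one patient-level probability.

    Assumed form of Equation (3): (P_max + P_mean * n / c) / (1 + n / c),
    where n is the number of subsequences and c is the scaling factor.
    """
    n = len(subsequence_probs)
    if n == 0:
        raise ValueError("need at least one subsequence probability")
    p_max = max(subsequence_probs)
    p_mean = sum(subsequence_probs) / n
    return (p_max + p_mean * n / c) / (1 + n / c)


few_notes = readmission_probability([0.2, 0.9, 0.4])        # max still matters a lot
many_notes = readmission_probability([0.2, 0.9, 0.4] * 10)  # pulled toward the mean
```

With more subsequences the n/c weight grows, so the result moves away from the (possibly noisy) maximum and toward the mean.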

### Evaluation
For validation and testing, 10% of the data is held out respectively, and 5-fold cross-validation is conducted. 

Each model is evaluated using three metrics:
1. AUROC
2. Area under the precision-recall curve
3. Recall at precision of 80%: For the readmission task, false positives are important. To minimize the number of false positives and thus minimize the risk of alarm fatigue, he sets the precision to 80% (in other words, 20% of the predicted positives are false positives) and uses the corresponding threshold to calculate recall. This leads to a clinically-relevant metric that enables us to build models that control the false positive rate.
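
Here is one way the third metric could be computed from labels and predicted scores. This is a plain-Python sketch with a made-up function name, not the author's evaluation code: it tries every score as a threshold and keeps the best recall among thresholds whose precision is at least 80%.

```python
def recall_at_precision(labels, scores, min_precision=0.8):
    """Best recall over all thresholds whose precision is >= min_precision."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("need at least one positive label")
    best_recall = 0.0
    for threshold in sorted(set(scores), reverse=True):
        # Labels of everything predicted positive at this threshold
        predicted = [y for y, s in zip(labels, scores) if s >= threshold]
        true_positives = sum(predicted)
        precision = true_positives / len(predicted)
        if precision >= min_precision:
            best_recall = max(best_recall, true_positives / total_positives)
    return best_recall


labels = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
scores = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
rp80 = recall_at_precision(labels, scores)
```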

### Models
* The training parameters are the entire encoder network, along with the classifier **`W`**
* Note that the data labels are imbalanced: negative labels are subsampled to balance the positive readmit labels
* ClinicalBert is trained for one epoch with batch size 4, and we use the Adam optimizer with learning rate 2 × 10⁻⁵
*  The ClinicalBert model settings are the same as in Section 3.
* The binary classifier is a linear layer of shape 768 × 1
* The maximum sequence length supported by the model is set to 512, and the model is first trained using shorter sequences.

<img src="./images/tab3.png" width="600" />

The table above shows that ClinicalBERT outperforms its competitors, such as Bag-of-words (top 5000 TF-IDF words as features) and BiLSTM/Word2Vec, in terms of precision and recall.

###  Readmission Prediction With Early Clinical Notes
Discharge summaries have predictive power for readmission. However, discharge summaries
might be written after a patient has left the hospital. Therefore, discharge summaries are
not actionable since doctors cannot intervene when a patient has left the hospital. Models
that dynamically predict readmission in the early stages of a patient’s admission are relevant to clinicians.

> **Note** that readmission predictions from a model are not actionable if a patient has been discharged. 

**24-48h**
* In the MIMIC-III data, admission and discharge times are available, but clinical notes do not have timestamps. This is why the table headings show a range; this range shows the cutoff time for notes fed to the model from early on in a patient’s admission. For example, in the 24–48h column, the model may only take as input a patient’s notes up to 36h because of that patient’s specific admission time.

**48-72h**
* For the second set of readmission prediction experiments, a maximum of the first 48 or 72 hours of a patient’s notes are concatenated. These concatenated notes are used to predict readmission. Since we separate notes into subsequences of the same length, the training set consists of all subsequences within a maximum of 72 hours, and the model is tested given only available notes within the first 48 or 72 hours of a patient’s admission.
* For testing 48 or 72-hour clinical note readmission prediction, patients that are discharged within 48 or 72 hours (respectively) are filtered out.

### Interpretable predictions in ClinicalBert
* ClinicalBert uses several self-attention mechanisms which can be used to inspect its predictions, by visualizing terms correlated with predictions of hospital readmission.
    * For every clinical note input to ClinicalBert, each self-attention mechanism computes a distribution over every term in a sentence, given a query.
    * **A high attention weight between a query and key token means the interaction between these tokens is predictive of readmission**.
    *  In the ClinicalBert encoder, there are 144 self-attention mechanisms (or, 12 multi-head attention mechanisms for each of the 12 transformer encoders). 
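
To illustrate the "distribution over every term, given a query" idea, here is a minimal sketch of scaled dot-product attention weights for a single head. The 2-dimensional query and key vectors are made up; in the real model they come from learned projections of the token representations.

```python
import math


def attention_weights(query, keys):
    """Softmax of scaled dot products between one query and every key."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


tokens = ["patient", "has", "chronic", "pain"]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]  # toy key vectors
query = [1.0, 0.0]                                       # toy query vector
weights = attention_weights(query, keys)
most_attended = tokens[weights.index(max(weights))]
```

The weights sum to 1, and it is this kind of weight that gets visualized when inspecting a prediction.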
  

### Preprocessing
ClinicalBert requires minimal preprocessing:
1. Words are converted to lowercase.
2. Line breaks are removed.
3. Carriage returns are removed.
4. De-identified brackets are removed.
5. Special characters like ==, −− are removed.

* The SpaCy sentence segmentation package is used to segment each note (Honnibal and Montani, 2017).
    * Since clinical notes don't follow rigid standard language grammar, we find rule-based segmentation has better results than dependency parsing-based segmentation.
    * Various segmentation signs that misguide rule-based segmenters are removed or replaced
        * For example 1.2 would be removed
        * M.D., dr. would be replaced with MD, Dr
    * Clinical notes can include various lab results and medications that also contain numerous rule-based separators, such as 20mg, p.o., q.d. (where q.d. means once a day and p.o. means to take by mouth).
        *  To address this, segments that have fewer than 20 words are fused into the previous segment so that they are not singled out as different sentences.
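
The fusing rule in the last bullet could look roughly like this. It is a sketch with a made-up function name that works on an already segmented list of strings; the actual segmentation with SpaCy is not shown.

```python
def fuse_short_segments(segments, min_words=20):
    """Append any segment with fewer than min_words words to the previous one."""
    fused = []
    for segment in segments:
        if fused and len(segment.split()) < min_words:
            fused[-1] = fused[-1] + " " + segment
        else:
            # Long enough, or there is no previous segment to fuse into
            fused.append(segment)
    return fused


long_a = " ".join(["word"] * 22)
long_b = " ".join(["text"] * 20)
result = fuse_short_segments([long_a, "aspirin 81mg p.o.", "q.d.", long_b])
```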

# Preprocess.py

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Convert Strings to Dates.
When converting dates, it is safer to pass an explicit datetime format.
Setting errors='coerce' tolerates missing or malformed dates:
any string that doesn't match the format becomes NaT (not a time).

In [2]:
# Load ADMISSIONS table from AWS S3 bucket

df_adm = pd.read_csv('ADMISSIONS.csv')

In [3]:
# Load ADMISSIONS table
# df_adm = pd.read_csv(
#     '/Users/nwams/Documents/Machine Learning Projects/Predicting-Hospital-Readmission-using-NLP/ADMISSIONS.csv')

In [4]:
df_adm.ADMITTIME = pd.to_datetime(df_adm.ADMITTIME, format='%Y-%m-%d %H:%M:%S', errors='coerce')
df_adm.DISCHTIME = pd.to_datetime(df_adm.DISCHTIME, format='%Y-%m-%d %H:%M:%S', errors='coerce')
df_adm.DEATHTIME = pd.to_datetime(df_adm.DEATHTIME, format='%Y-%m-%d %H:%M:%S', errors='coerce')
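
A quick way to see what errors='coerce' does is to run it on a tiny made-up Series (the values below are not from MIMIC):

```python
import pandas as pd

raw = pd.Series(["2101-10-20 19:08:00", "not a date", None])
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# The malformed string and the missing value both become NaT
print(parsed.isna().tolist())  # [False, True, True]
```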

Get the next unplanned admission date for each patient (if it exists).
First I'll sort the admissions by patient and admission time so the dates are in order.
Then I'll use the shift() function to get the next admission date.

In [5]:
df_adm = df_adm.sort_values(['SUBJECT_ID', 'ADMITTIME'])
df_adm = df_adm.reset_index(drop=True)
df_adm['NEXT_ADMITTIME'] = df_adm.groupby('SUBJECT_ID').ADMITTIME.shift(-1)
df_adm['NEXT_ADMISSION_TYPE'] = df_adm.groupby('SUBJECT_ID').ADMISSION_TYPE.shift(-1)

Since I want to predict unplanned re-admissions, I will filter out any future admissions that are ELECTIVE
so that only unplanned (EMERGENCY or URGENT) re-admissions are measured.
For rows whose next admission is ELECTIVE, replace NEXT_ADMITTIME with NaT and NEXT_ADMISSION_TYPE with NaN.

In [6]:
rows = df_adm.NEXT_ADMISSION_TYPE == 'ELECTIVE'
df_adm.loc[rows,'NEXT_ADMITTIME'] = pd.NaT
df_adm.loc[rows,'NEXT_ADMISSION_TYPE'] = np.nan

It's safer to sort right before the fill in case something I did above changed the order.

In [7]:
df_adm = df_adm.sort_values(['SUBJECT_ID','ADMITTIME'])

Backfill in the values that I removed. So copy the ADMITTIME from the last emergency 
and paste it in the NEXT_ADMITTIME for the previous emergency. 
So I am effectively ignoring/skipping the ELECTIVE admission row completely. 
Doing this will allow me to calculate the days until the next admission.

In [8]:
# Back fill within each patient. This will take a little while.
df_adm[['NEXT_ADMITTIME','NEXT_ADMISSION_TYPE']] = df_adm.groupby(['SUBJECT_ID'])[['NEXT_ADMITTIME','NEXT_ADMISSION_TYPE']].bfill()

# Calculate days until next admission
df_adm['DAYS_NEXT_ADMIT'] = (df_adm.NEXT_ADMITTIME - df_adm.DISCHTIME).dt.total_seconds()/(24*60*60)
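
To check my understanding of the shift / mask / backfill steps, here is the same logic on a made-up patient with three admissions (toy dates, not MIMIC data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "SUBJECT_ID": [1, 1, 1],
    "ADMISSION_TYPE": ["EMERGENCY", "ELECTIVE", "EMERGENCY"],
    "ADMITTIME": pd.to_datetime(["2100-01-01", "2100-01-10", "2100-01-20"]),
    "DISCHTIME": pd.to_datetime(["2100-01-05", "2100-01-12", "2100-01-25"]),
})
toy = toy.sort_values(["SUBJECT_ID", "ADMITTIME"]).reset_index(drop=True)

# Next admission for the same patient (NaT / NaN on the last row)
toy["NEXT_ADMITTIME"] = toy.groupby("SUBJECT_ID").ADMITTIME.shift(-1)
toy["NEXT_ADMISSION_TYPE"] = toy.groupby("SUBJECT_ID").ADMISSION_TYPE.shift(-1)

# Blank out next admissions that are ELECTIVE
elective_next = toy.NEXT_ADMISSION_TYPE == "ELECTIVE"
toy.loc[elective_next, "NEXT_ADMITTIME"] = pd.NaT
toy.loc[elective_next, "NEXT_ADMISSION_TYPE"] = np.nan

# Back fill so the first row points at the later EMERGENCY admission
cols = ["NEXT_ADMITTIME", "NEXT_ADMISSION_TYPE"]
toy[cols] = toy.groupby("SUBJECT_ID")[cols].bfill()

toy["DAYS_NEXT_ADMIT"] = (toy.NEXT_ADMITTIME - toy.DISCHTIME).dt.total_seconds() / (24 * 60 * 60)
```

The first admission ends up 15 days away from the next unplanned admission (Jan 5 to Jan 20) instead of 5 days away from the elective one, and the last admission has no next admission at all.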

### Remove NEWBORN admissions
According to the MIMIC site "Newborn indicates that the HADM_ID pertains to the patient's birth."

I will remove all NEWBORN admission types because in this project I'm not interested in studying births — my primary 
interest is EMERGENCY and URGENT admissions.
I will remove all admissions that have a DEATHTIME because in this project I'm studying re-admissions, not mortality. 
And a patient who died cannot be re-admitted.

In [9]:
df_adm = df_adm.loc[df_adm.ADMISSION_TYPE != 'NEWBORN']
df_adm = df_adm.loc[df_adm.DEATHTIME.isnull()]

### Make Output Label
For this problem, we are going to classify whether a patient will be readmitted within the next 30 days.
Therefore, we need to create a variable with the output label (1 = readmitted, 0 = not readmitted).

In [10]:
df_adm['OUTPUT_LABEL'] = (df_adm.DAYS_NEXT_ADMIT < 30).astype('int')
df_adm['DURATION'] = (df_adm['DISCHTIME']-df_adm['ADMITTIME']).dt.total_seconds()/(24*60*60)
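
One detail worth checking is what the comparison does with missing values and with exactly 30 days. A small sketch on made-up numbers:

```python
import numpy as np
import pandas as pd

days_next_admit = pd.Series([8.0, 30.0, 45.5, np.nan])
output_label = (days_next_admit < 30).astype("int")

# NaN (no later unplanned admission) compares as False, so it is labeled 0;
# exactly 30 days is also 0 because the comparison is strict
print(output_label.tolist())  # [1, 0, 0, 0]
```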

### Load NOTEEVENTS Table

In [11]:
# Load NOTEEVENTS table

df_notes = pd.read_csv('NOTEEVENTS.csv')



In [12]:
# Sort by SUBJECT_ID, HADM_ID, then CHARTDATE
df_notes = df_notes.sort_values(by=['SUBJECT_ID','HADM_ID','CHARTDATE'])
# Merge notes table to admissions table
df_adm_notes = pd.merge(df_adm[['SUBJECT_ID','HADM_ID','ADMITTIME','DISCHTIME','DAYS_NEXT_ADMIT','NEXT_ADMITTIME','ADMISSION_TYPE','DEATHTIME','OUTPUT_LABEL','DURATION']],
                        df_notes[['SUBJECT_ID','HADM_ID','CHARTDATE','TEXT','CATEGORY']],
                        on = ['SUBJECT_ID','HADM_ID'],
                        how = 'left')

In [13]:
# Grab date only, not the time (bracket assignment is needed to create a new column)
df_adm_notes['ADMITTIME_C'] = df_adm_notes.ADMITTIME.apply(lambda x: str(x).split(' ')[0])

df_adm_notes['ADMITTIME_C'] = pd.to_datetime(df_adm_notes.ADMITTIME_C, format = '%Y-%m-%d', errors = 'coerce')
df_adm_notes['CHARTDATE'] = pd.to_datetime(df_adm_notes.CHARTDATE, format = '%Y-%m-%d', errors = 'coerce')

  


Gather Discharge Summaries Only

In [14]:
# Gather Discharge Summaries Only
df_discharge = df_adm_notes[df_adm_notes['CATEGORY'] == 'Discharge summary']
# Some admissions have multiple discharge summaries; on examination these are replicated summaries, so keep only the last one
df_discharge = (df_discharge.groupby(['SUBJECT_ID','HADM_ID']).nth(-1)).reset_index()
df_discharge=df_discharge[df_discharge['TEXT'].notnull()]

Early notes: keep only the notes charted less than n days after admission

In [15]:
def less_n_days_data(df_adm_notes, n):
    # Keep notes charted fewer than n days after the admission date
    days_since_admit = (df_adm_notes['CHARTDATE'] - df_adm_notes['ADMITTIME_C']).dt.total_seconds() / (24 * 60 * 60)
    df_less_n = df_adm_notes[days_since_admit < n]
    df_less_n = df_less_n[df_less_n['TEXT'].notnull()]
    # Concatenate the notes of each admission and keep that admission's label
    df_concat = df_less_n.groupby('HADM_ID').agg(
        TEXT=('TEXT', ' '.join),
        OUTPUT_LABEL=('OUTPUT_LABEL', 'first')).reset_index()

    return df_concat

In [16]:
df_less_2 = less_n_days_data(df_adm_notes, 2)
df_less_3 = less_n_days_data(df_adm_notes, 3)

In [17]:
import re

def preprocess1(x):
    y = re.sub(r'\[(.*?)\]', '', x)  # remove de-identified brackets
    y = re.sub(r'[0-9]+\.', '', y)  # remove numbering such as 1. or 1.2. since the segmenter splits on it
    y = re.sub(r'dr\.', 'doctor', y)
    y = re.sub(r'm\.d\.', 'md', y)
    y = re.sub('admission date:', '', y)
    y = re.sub('discharge date:', '', y)
    y = re.sub('--|__|==', '', y)
    return y

In [18]:
from tqdm import tqdm, trange

In [19]:
def preprocessing(df_less_n):
    df_less_n['TEXT'] = df_less_n['TEXT'].fillna(' ')
    df_less_n['TEXT'] = df_less_n['TEXT'].str.replace('\n', ' ')
    df_less_n['TEXT'] = df_less_n['TEXT'].str.replace('\r', ' ')
    df_less_n['TEXT'] = df_less_n['TEXT'].apply(str.strip)
    df_less_n['TEXT'] = df_less_n['TEXT'].str.lower()

    df_less_n['TEXT'] = df_less_n['TEXT'].apply(lambda x: preprocess1(x))

    # Split each note into 318-word chunks for the readmission task
    rows = []
    for i in tqdm(range(len(df_less_n))):
        x = df_less_n.TEXT.iloc[i].split()
        label = df_less_n.OUTPUT_LABEL.iloc[i]
        hadm_id = df_less_n.HADM_ID.iloc[i]
        n = len(x) // 318
        for j in range(n):
            rows.append({'ID': hadm_id, 'TEXT': ' '.join(x[j * 318:(j + 1) * 318]), 'Label': label})
        # Keep the leftover words as a final chunk only if there are more than 10 of them
        if len(x) % 318 > 10:
            rows.append({'ID': hadm_id, 'TEXT': ' '.join(x[-(len(x) % 318):]), 'Label': label})

    # Build the DataFrame once at the end (DataFrame.append was removed in pandas 2.0)
    want = pd.DataFrame(rows, columns=['ID', 'TEXT', 'Label'])
    return want
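
The chunking rule on its own, pulled out into a small helper for illustration (`chunk_words` is a name I made up, it is not part of the notebook): full 318-word chunks are always kept, and the leftover tail is kept only if it has more than 10 words.

```python
def chunk_words(text, size=318, min_tail=10):
    """Split text into size-word chunks; drop a trailing chunk of <= min_tail words."""
    words = text.split()
    chunks = [" ".join(words[i * size:(i + 1) * size]) for i in range(len(words) // size)]
    tail = len(words) % size
    if tail > min_tail:
        chunks.append(" ".join(words[-tail:]))
    return chunks


note_650 = " ".join(["w"] * 650)  # 2 full chunks + 14 leftover words (kept)
note_640 = " ".join(["w"] * 640)  # 2 full chunks + 4 leftover words (dropped)
counts = (len(chunk_words(note_650)), len(chunk_words(note_640)), len(chunk_words("too short")))
```

Note that a note with 10 words or fewer produces no chunks at all, so very short notes drop out of the dataset.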

The preprocessing below for the Discharge, 2-Day and 3-Day stays took about 6.5 hours on my local machine (discharge=2.5hrs, 2-day=1.5 hrs and 3-day=2.5 hrs). 

Uncomment the lines below as needed (I've commented them out where I've already run preprocessing and pickled the files).

In [20]:
#df_discharge = preprocessing(df_discharge)
#df_less_2 = preprocessing(df_less_2)
df_less_3 = preprocessing(df_less_3)

In [21]:
import pickle

Let's pickle it for later use. Uncomment the code below to pickle your files. 

In [22]:
#df_discharge.to_pickle("./pickle/df_discharge.pkl")
#df_less_2.to_pickle("./pickle/df_less_2.pkl")
#df_less_3.to_pickle("./pickle/df_less_3.pkl")

Load the pickled files, if needed

In [23]:
#df_discharge = pd.read_pickle('./pickle/df_discharge.pkl')
#df_less_2 = pd.read_pickle('./pickle/df_less_2.pkl')
df_less_3 = pd.read_pickle('./pickle/df_less_3.pkl')

In [24]:
df_discharge.shape

(216954, 3)

In [25]:
df_less_2.shape

(277443, 3)

In [26]:
df_less_3.shape

(385724, 3)

Discharge has 216,954 rows. 

2-Day has 277,443 rows.

3-Day has 385,724 rows.