# CS598 Deep Learning for Healthcare - Final Project

*Author*: Michael Haines | mhaines2@illinois.edu

The code extracts a large dataset and trains several BERT-based models. It takes several hours to run to completion and should be run in an environment that has access to a GPU. Upload all data and code to Google Drive so that the notebook can be run from Google Colab.
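
Because training assumes a GPU, it can help to confirm that one is visible before starting a multi-hour run. The following is a minimal sketch, not part of the project code: the helper name `gpu_visible` is hypothetical, and it assumes an NVIDIA GPU (as on Colab GPU runtimes) exposed through the `nvidia-smi` command-line tool.

```python
import shutil
import subprocess


def gpu_visible():
    """Return True if `nvidia-smi` is installed and lists at least one GPU.

    Assumption: the runtime uses an NVIDIA GPU. Other accelerators
    (for example TPUs) are not detected by this check.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        # `nvidia-smi -L` prints one line per detected GPU.
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


if not gpu_visible():
    print("No NVIDIA GPU detected; training will likely be very slow.")
```

In Colab, a GPU runtime can be selected under Runtime > Change runtime type.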

## Pre-requisites

### Upload the Data to Google Drive

Assuming that you have cloned the repo and followed the instructions in the `README.md` to add the MIMIC-IV dataset to the correct folder, upload the whole repo to your Google Drive. At a minimum, it needs the following structure to run to completion:

```
├── data
│   ├── full
│   │   ├── dict
│   │   └── tokens
│   └── sample
│       ├── dict
│       └── tokens
├── models
│   ├── behrt_finetune_model.py
│   ├── behrt_finetune.py
│   ├── behrt_model.py
│   ├── behrt_no_d_model.py
│   ├── behrt_no_d_train.py
│   ├── behrt_pretrain_model.py
│   ├── behrt_pretrain.py
│   └── behrt_train.py
└── requirements.txt
```
Note that it will take some time to upload this data to Google Drive, as the saved datasets are quite large.
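
Since an incomplete upload may only surface as an error hours into a run, the layout above can be verified up front. This is a minimal sketch under stated assumptions: the helper `missing_paths` is hypothetical (not part of the repo), the lists simply mirror the tree shown above, and the check expects to be run with the project root as its starting point.

```python
from pathlib import Path

# Directories and files taken from the tree above.
REQUIRED_DIRS = [
    "data/full/dict",
    "data/full/tokens",
    "data/sample/dict",
    "data/sample/tokens",
    "models",
]
REQUIRED_FILES = [
    "models/behrt_finetune_model.py",
    "models/behrt_finetune.py",
    "models/behrt_model.py",
    "models/behrt_no_d_model.py",
    "models/behrt_no_d_train.py",
    "models/behrt_pretrain_model.py",
    "models/behrt_pretrain.py",
    "models/behrt_train.py",
    "requirements.txt",
]


def missing_paths(root="."):
    """Return the required paths (relative to `root`) that do not exist."""
    root = Path(root)
    missing = [d for d in REQUIRED_DIRS if not (root / d).is_dir()]
    missing += [f for f in REQUIRED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    problems = missing_paths(".")
    if problems:
        print("Missing from the project root:", *problems, sep="\n  ")
    else:
        print("Project layout looks complete.")
```

The check only confirms that the paths exist; it does not validate the contents of the `dict` and `tokens` folders.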

### Mount Google Drive

Please use Google Colab to run the notebook, as this ensures that the runtime has access to your Google Drive. Mount the Google Drive with the following code. The code assumes that you have uploaded the cloned repo to the root directory of your Google Drive. If you have uploaded the repo to a subdirectory, change `PROJECT_ROOT` accordingly:

In [None]:
PROJECT_ROOT = "cs598-final-project" # Change this if required

import os
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# Change working directory (Colab exposes the root of your Drive under "MyDrive")
os.chdir(f'/content/drive/MyDrive/{PROJECT_ROOT}')

## Install & Import Runtime Dependencies

In [None]:
!pip install -r requirements.txt

In [5]:
import sys
import importlib
import pandas as pd
import warnings

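# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Make the training modules in ./models importable
module_path = './models'
if module_path not in sys.path:
    sys.path.append(module_path)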

# Model training: import each module, then reload it so that edits made to the
# files in ./models are picked up without restarting the runtime.
import behrt_train
importlib.reload(behrt_train)
from behrt_train import *
import behrt_no_d_train
importlib.reload(behrt_no_d_train)
from behrt_no_d_train import *
import behrt_pretrain
importlib.reload(behrt_pretrain)
from behrt_pretrain import *
import behrt_finetune
importlib.reload(behrt_finetune)
from behrt_finetune import *
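
The import-and-reload pattern used for each training module can be factored into a small helper. This is a sketch only: `import_fresh` is a hypothetical name, and it relies solely on the standard-library `importlib` and `sys` modules.

```python
import importlib
import sys


def import_fresh(module_name):
    """Import `module_name`, reloading it if it was already imported.

    Reloading re-executes the module's source, so edits to the file are
    picked up without restarting the notebook runtime.
    """
    if module_name in sys.modules:
        return importlib.reload(sys.modules[module_name])
    return importlib.import_module(module_name)
```

With `./models` on `sys.path`, a call such as `behrt_train = import_fresh("behrt_train")` would stand in for the import and reload lines of one module. The helper does not replicate the `from ... import *` lines, so functions would be called through the module name, as the training cells in this notebook already do.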

## Train models

### Model #1: Public Checkpoint Fine-tuned on Sample Cohort

We train this model as a baseline for comparison. It uses a pre-trained BERT checkpoint available from the [pytorch-pretrained-bert package](https://pypi.org/project/pytorch-pretrained-bert/) and fine-tunes it on the EHR data.

We first train the model on a small sample to test the researchers' premise that BERT-based models trained on EHR data require large samples to be useful.

In [None]:
path="sample"
tokenized_src = pd.read_csv(f'./data/{path}/tokens/tokenized_src.csv', index_col=0)
tokenized_age = pd.read_csv(f'./data/{path}/tokens/tokenized_age.csv', index_col=0)
tokenized_gender = pd.read_csv(f'./data/{path}/tokens/tokenized_gender.csv', index_col=0)
tokenized_ethni = pd.read_csv(f'./data/{path}/tokens/tokenized_ethni.csv', index_col=0)
tokenized_ins = pd.read_csv(f'./data/{path}/tokens/tokenized_ins.csv', index_col=0)
tokenized_labels = pd.read_csv(f'./data/{path}/tokens/tokenized_labels.csv', index_col=0)
behrt_train.train_behrt(tokenized_src, tokenized_age, tokenized_gender, tokenized_ethni, tokenized_ins, tokenized_labels, path=path)
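
The six `read_csv` calls above follow a single naming pattern, so they could be wrapped in a loader that is reused for the `sample` and `full` cohorts. The sketch below is illustrative: `token_paths` and `load_tokens` are hypothetical helpers, not functions from the repo, and the file names are taken from the cell above.

```python
from pathlib import Path

# Table names in the order that train_behrt receives them in the cell above.
TOKEN_TABLES = ("src", "age", "gender", "ethni", "ins", "labels")


def token_paths(cohort, data_dir="./data"):
    """Map each table name to its CSV path for a cohort ("sample" or "full")."""
    tokens_dir = Path(data_dir) / cohort / "tokens"
    return {name: tokens_dir / f"tokenized_{name}.csv" for name in TOKEN_TABLES}


def load_tokens(cohort, data_dir="./data"):
    """Load all tokenized tables for a cohort into a dict of DataFrames."""
    paths = token_paths(cohort, data_dir)
    missing = [str(p) for p in paths.values() if not p.is_file()]
    if missing:
        # Fail early with a clear message instead of partway through loading.
        raise FileNotFoundError(f"Missing token files: {missing}")
    import pandas as pd  # imported late so the path check works without pandas
    return {name: pd.read_csv(p, index_col=0) for name, p in paths.items()}
```

Because dictionaries preserve insertion order, `behrt_train.train_behrt(*load_tokens("sample").values(), path="sample")` would pass the tables in the same order as the explicit call above.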

### Model #2: Public Checkpoint Fine-tuned on Full Cohort

This is the same model as #1, but trained on the full dataset.

In [None]:
path="full"
tokenized_src = pd.read_csv(f'./data/{path}/tokens/tokenized_src.csv', index_col=0)
tokenized_age = pd.read_csv(f'./data/{path}/tokens/tokenized_age.csv', index_col=0)
tokenized_gender = pd.read_csv(f'./data/{path}/tokens/tokenized_gender.csv', index_col=0)
tokenized_ethni = pd.read_csv(f'./data/{path}/tokens/tokenized_ethni.csv', index_col=0)
tokenized_ins = pd.read_csv(f'./data/{path}/tokens/tokenized_ins.csv', index_col=0)
tokenized_labels = pd.read_csv(f'./data/{path}/tokens/tokenized_labels.csv', index_col=0)
behrt_train.train_behrt(tokenized_src, tokenized_age, tokenized_gender, tokenized_ethni, tokenized_ins, tokenized_labels, path=path)

### Model #3: Public Checkpoint Fine-tuned on Full Cohort, Demographic Data Excluded

This is the same as Model #2, with the potentially sensitive data `age`, `insurance`, `ethnicity`, and `gender` excluded.

In [None]:
path="full"
tokenized_src = pd.read_csv(f'./data/{path}/tokens/tokenized_src.csv', index_col=0)
tokenized_labels = pd.read_csv(f'./data/{path}/tokens/tokenized_labels.csv', index_col=0)
behrt_no_d_train.train_behrt(tokenized_src, tokenized_labels, path=path)

### Pre-training BERT from Full Cohort

In this model, we use the raw BERT head from the [pytorch-pretrained-bert package](https://pypi.org/project/pytorch-pretrained-bert/) and perform the masked language modeling task ourselves.

In [None]:
path="full"
behrt_pretrain.pretrain_behrt(path=path)
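
For readers unfamiliar with the masked language modeling objective used in pre-training, the sketch below shows the standard BERT masking recipe on a list of tokens. It is an illustration only and is not taken from `behrt_pretrain.py`, whose masking details may differ. The names `mask_tokens`, `MASK_TOKEN`, and `IGNORE_LABEL`, and the 15% / 80-10-10 split, are assumptions based on the original BERT recipe.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed name of the special mask token
IGNORE_LABEL = None    # marks positions that the loss should skip


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels) for masked language modeling.

    Each token is selected with probability `mask_prob`. A selected token is
    replaced by MASK_TOKEN 80% of the time, by a random vocabulary token 10%
    of the time, and left unchanged 10% of the time. Labels hold the original
    token at selected positions and IGNORE_LABEL everywhere else.
    """
    if rng is None:
        rng = random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK_TOKEN)
            elif roll < 0.9:
                masked.append(rng.choice(vocab))
            else:
                masked.append(tok)
        else:
            labels.append(IGNORE_LABEL)
            masked.append(tok)
    return masked, labels
```

The model is then trained to predict the original token at each selected position, which lets it learn from the visit sequences without any outcome labels.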

### Model #4: Custom Checkpoint Fine-tuned on Full Cohort in Adversarial Setting

In this model, we use the pre-trained checkpoint that we just created and perform the fine-tuning task to replicate the model from the original study.

In [None]:
path="full"
behrt_finetune.finetune_behrt(path=path)