# Before you use this template

This is a recommended template for the project report. It only covers the general types of research in our paper pool, so feel free to edit it to better fit your project. You will iteratively update the same notebook submission for your draft and your final submission. Please check the project rubrics to get a sense of what is expected in the template.

---

# FAQ and Attentions
* Copy and move this template to your Google Drive. Name your notebook with your team ID (upper-left corner). Don't edit this original file.
* This template covers most questions we want to ask about your reproduction experiment. You don't need to follow the template exactly, but you should address the questions. Please feel free to customize your report accordingly.
* Every report must have runnable code and the necessary annotations (in text and code comments).
* The notebook is like a demo and only uses small-size data (a subset of the original data or processed data). The entire runtime of the notebook, including data reading, data processing, model training, printing, figure plotting, etc., must be within 8 min; otherwise, you may be penalized on the grade.
  * If the raw dataset is too large to be loaded, you can select a subset of the data and pre-process it, then upload the subset or processed data to Google Drive and load it in this notebook.
  * If the whole training takes too long to run, you can set the number of training epochs to a small number, e.g., 3, just to show that the training is runnable.
  * For model validation, you can train the model outside this notebook in advance, then load the pretrained model and use it for validation (display the figures, print the metrics).
* Post-processing is important! For post-processing of the results, please use plots/figures. The code to summarize results and plot figures may be tedious, but it won't be a waste of time since these figures can be used for your presentation. When plotting in code, the figures should have titles or captions if necessary (e.g., title your figure with "Figure 1. xxxx").
* There is no page limit for your notebook report. You can also use separate notebooks for the report; just make sure your grader can access and run/test them.
* If you use outside resources, please cite them (in any format). Include links to the resources if necessary.

# Mount Notebook to Google Drive
Upload the data, pretrained model, figures, etc. to your Google Drive, then mount this notebook to Google Drive. After that, you can access the resources freely.

Instruction: https://colab.research.google.com/notebooks/io.ipynb

Example: https://colab.research.google.com/drive/1srw_HFWQ2SMgmWIawucXfusGzrj1_U0q

Video: https://www.youtube.com/watch?v=zc8g8lGcwQU

In [1]:
from google.colab import drive
drive.mount('/content/drive')

ModuleNotFoundError: No module named 'google'

# Introduction
This is an introduction to your report; edit this text/markdown section to compose it. In this text/markdown cell, you should introduce:

*   Background of the problem
  * what type of problem: disease/readmission/mortality prediction, feature engineering, data processing, etc.
  * what is the importance/meaning of solving the problem
  * what is the difficulty of the problem
  * the state-of-the-art methods and their effectiveness



*   Paper explanation
  * what the paper proposed
  * what the innovations of the method are
  * how well the proposed method works (in its own metrics)
  * what the contribution to the research field is (referring to the Background above, how important the paper is to the problem)

## Background:

Current pretraining objectives in predictive EHR-based models are limited to predicting a fraction of the ICD codes within a patient’s visit, when in reality patients usually have multiple, often highly correlated diseases. In addition, current models are unable to accurately predict the timeline of correlated diagnoses, which could lead to missed opportunities in preventative care. Predictive tasks on healthcare data are challenging because the data is complex: it is high-dimensional, longitudinal, and often incomplete. The state-of-the-art methods for similar healthcare problems are transformer-based deep learning models trained on extensive datasets and fine-tuned for specific tasks. Despite their success, these models are often fine-tuned to predict a limited set of outcomes, thus overlooking the interconnected nature of various health conditions.



## Paper Explanation:

The paper presents "TransformEHR," a novel generative encoder-decoder model built on the transformer architecture and designed specifically for predicting future patient outcomes from their longitudinal EHRs. The authors used techniques such as visit masking and time embedding to achieve results that outperform other state-of-the-art models. For example, when testing their encoder-decoder model against an encoder-only model, the authors reported an “improvement of 95%CI: 0.74%–1.16%, p < 0.001 in AUROC across all diseases/outcomes tested.” While the paper reported strong results on a variety of both common and uncommon diseases, the authors noted that their work relates to predictive modeling studies focused on intentional self-harm. TransformEHR performed exceptionally well on this task and was estimated to reduce the incremental cost-effectiveness ratio by $109k per quality-adjusted life-year.

In [None]:
# code comment is used as inline annotations for your coding

# Scope of Reproducibility:

List hypotheses from the paper you will test and the corresponding experiments you will run.


1.   Hypothesis 1: Visit Masking vs Code Masking
  * This test will check whether masking all ICD codes of a visit yields significantly different results from masking only a selected fraction of the ICD codes per visit.
  * Both models will be encoder-decoder.
  * This would have to be done by changing the training process so that it masks the data on each run.
    * This is far more difficult, and we have not found the code the paper references for doing this.
2.  Hypothesis 2: Time Embedding vs Time Embedding Excluded
  * This test will check whether including time embedding (for temporal information about prior visits) yields significantly different results from excluding it.
  * Both models will be encoder-decoder.
  * This will be done by removing the contribution of `self.embed_positions`.
    * `x = inputs_embeds + embed_pos + embed_visit`
    * We should be able to drop `embed_pos` from this sum without changing any config.

3. (If time allows) Encoder-decoder vs encoder-only
  * This test compares the performance of an encoder-decoder model with that of an encoder-only model trained on the same dataset.
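
To make the difference between the two masking strategies in Hypothesis 1 concrete, here is a minimal sketch on a toy patient. It is not the authors' implementation: the helper names `mask_codes` and `mask_visit`, the masking probability of 0.15, and the choice of a single randomly selected visit are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"

def mask_codes(visits, mask_prob, rng):
    """Code masking: hide each ICD code independently with probability mask_prob."""
    return [[MASK if rng.random() < mask_prob else code for code in visit]
            for visit in visits]

def mask_visit(visits, rng):
    """Visit masking: hide every ICD code of one randomly chosen visit."""
    target = rng.randrange(len(visits))
    return [[MASK] * len(visit) if i == target else list(visit)
            for i, visit in enumerate(visits)]

# Toy patient with three visits; codes use the "<icd-version>-<icd-code>" format.
patient = [
    ["9-5723", "9-78959", "9-5715"],
    ["9-07071", "9-2875"],
    ["9-45829", "9-07044", "9-7994", "9-2761"],
]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
code_masked = mask_codes(patient, mask_prob=0.15, rng=rng)  # 0.15 is an assumed rate
visit_masked = mask_visit(patient, rng=rng)

print("code masking: ", code_masked)
print("visit masking:", visit_masked)
```

With code masking the model can still see the other codes of the same visit, whereas with visit masking it has to predict the full set of codes for that visit from the remaining visits only.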


You can insert images in this notebook text, [see this link](https://stackoverflow.com/questions/50670920/how-to-insert-an-inline-image-in-google-colaboratory-from-google-drive) and example below:

![sample_image.png](https://drive.google.com/uc?export=view&id=1g2efvsRJDxTxKz-OY3loMhihrEUdBxbc)



You can also use code to display images, see the code below.

The images must be saved in Google Drive first.


In [None]:
# no code is required for this section
'''
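if you want to use an image outside this notebook for explanation,
you can upload it to your google drive and show it with OpenCV or matplotlib
'''
import cv2
import matplotlib.pyplot as plt
from google.colab import drive

# mount this notebook to your google drive
drive.mount('/content/gdrive')

# define dirs to workspace and data
img_dir = '/content/gdrive/My Drive/Colab Notebooks/<path-to-your-image>'

# cv2.imshow needs a GUI window, which notebooks lack, so display with matplotlib
img = cv2.imread(img_dir)
plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))  # OpenCV loads images as BGR
plt.title("Title")
plt.axis("off")
plt.show()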


# Methodology

The methodology is the core of your project. It consists of runnable code with the necessary annotations to show the experiments you executed to test the hypotheses.

The methodology contains at least two subsections, **data** and **model**.





In [11]:
# !pip3 install torch
# !pip3 install transformers

# import  packages you need
import numpy as np
# from google.colab import drive
import torch
import transformers
import pickle

##  Data
Data includes raw data (MIMIC III tables), descriptive statistics (our homework questions), and data processing (feature engineering).
  * Source of the data: where the data is collected from; if data is synthetic or self-generated, explain how. If possible, please provide a link to the raw datasets.
  * Statistics: include basic descriptive statistics of the dataset like size, cross validation split, label distribution, etc.
  * Data process: how you manipulate the data, e.g., changing the class labels, splitting the dataset into train/valid/test, refining the dataset.
  * Illustration: printing results, plotting figures for illustration.
  * You can upload your raw dataset to Google Drive and mount this Colab to the same directory. If your raw dataset is too large, you can upload the processed dataset and have a code to load the processed dataset.

### Data Description
 * We start with the hospital admissions dataset and patient diagnoses dataset from MIMIC-IV. https://physionet.org/content/mimiciv/2.2/
 * We preprocess this raw data into local files. We format patient data as: `[([icd-version-icd-code], date time)]`
 * Our sample dataset has 180640 patients.
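
As a rough illustration of how descriptive statistics could be computed on this format, the sketch below summarizes a small in-memory dictionary shaped like our loaded data, `{patient_id: [((codes, [], []), date), ...]}`. The function name `describe`, the patient ids, and the toy records are assumptions for illustration only; the numbers it prints are not statistics of the real dataset.

```python
from collections import Counter
from statistics import mean

def describe(patients):
    """Summarize a {patient_id: [((codes, [], []), date), ...]} dictionary."""
    visits_per_patient = [len(visits) for visits in patients.values()]
    codes_per_visit = []
    version_counts = Counter()
    for visits in patients.values():
        for (codes, _, _), _date in visits:
            codes_per_visit.append(len(codes))
            # "9-5723" -> ICD version "9"
            version_counts.update(code.split("-", 1)[0] for code in codes)
    return {
        "num_patients": len(patients),
        "num_visits": sum(visits_per_patient),
        "mean_visits_per_patient": mean(visits_per_patient),
        "mean_codes_per_visit": mean(codes_per_visit),
        "codes_by_icd_version": dict(version_counts),
    }

# Toy records (hypothetical patients, not taken from the real dataset).
toy_patients = {
    "p1": [((["9-5723", "9-78959"], [], []), "2180-05-07"),
           ((["9-07071"], [], []), "2180-06-27")],
    "p2": [((["10-I10", "10-E119", "9-496"], [], []), "2150-01-01")],
}

stats = describe(toy_patients)
print(stats)
```

Calling `describe(patients)` on the dictionary built by the loading cell would give the same summary for the full sample.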


### Preprocess the data
 * Start with the raw data from MIMIC-IV
 * Map patients to their visits and diagnosis codes
 * Add timestamps or other relevant time information
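
The three steps can be tried without access to MIMIC-IV using a few in-memory rows. This is only a sketch: the rows, ids, and timestamps below are made up, and the tuples stand in for the columns that the real cell reads from `admissions.csv.gz` and `diagnoses_icd.csv.gz`.

```python
from collections import defaultdict

# Toy stand-ins for the two MIMIC-IV tables (values are made up).
admissions = [  # (hadm_id, dischtime)
    ("h1", "2180-05-07 17:15:00"),
    ("h2", "2180-06-27 18:49:00"),
]
diagnoses = [  # (subject_id, hadm_id, icd_code, icd_version)
    ("p1", "h1", "5723", "9"),
    ("p1", "h1", "78959", "9"),
    ("p1", "h2", "07071", "9"),
]

# Step 1-2: map each visit to its discharge date, then group codes by patient and date.
visit_to_date = {hadm_id: dischtime[:10] for hadm_id, dischtime in admissions}
by_patient = defaultdict(lambda: defaultdict(list))
for subject_id, hadm_id, icd_code, icd_version in diagnoses:
    by_patient[subject_id][visit_to_date[hadm_id]].append(f"{icd_version}-{icd_code}")

# Step 3: order visits by date and attach the date to each visit.
records = {
    pid: [((codes, [], []), date) for date, codes in sorted(visits.items())]
    for pid, visits in by_patient.items()
}
print(records["p1"])
```

The resulting `records` dictionary has the same shape as the data loaded later in this notebook.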


In [7]:
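from collections import defaultdict
from datetime import datetime
from itertools import islice
import pickle
import time

import pandas as pd

# Map each visit id (hadm_id) to its discharge date (YYYY-MM-DD)
data_df = pd.read_csv('/data/corpora_alpha/MIMIC/physionet.org/files/mimiciv/2.2/hosp/admissions.csv.gz', nrows=None, compression='gzip',
            dtype={'subject_id': str, 'hadm_id': str} )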
visitid2dischargedate = {}
for ind, row in data_df.iterrows():
    visitid2dischargedate[row['hadm_id']] = row['dischtime'][0:10]

print(min(visitid2dischargedate.values()))
print(max(visitid2dischargedate.values()))

# Build a nested dictionary {patient_id: {visit_date: [diagnosis codes]}}
data_df = pd.read_csv('/data/corpora_alpha/MIMIC/physionet.org/files/mimiciv/2.2/hosp/diagnoses_icd.csv.gz', nrows=None, compression='gzip',
            dtype={'subject_id': str, 'hadm_id': str, 'icd_code': str, 'icd_version': str} )
patients = defaultdict(lambda: defaultdict(list)) #lambda: "Not Present"
for ind, row in data_df.iterrows():
    hadm_id = row['hadm_id']
    scrssn = row['subject_id']
    visit_date = visitid2dischargedate[hadm_id]
    patients[scrssn][visit_date].append(row['icd_version'] +'-'+ row['icd_code']) # 687621183, 912831070

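# Count, per patient, the visits that contain at least one ICD-10 code
num_icd_pat = defaultdict(int)
for k,v in patients.items():
    for kv, vv in v.items():
        for icdcode in vv:
            if icdcode.startswith("10-"):
                num_icd_pat[k] += 1
                break

print(len(patients))     # total number of patients
print(len(num_icd_pat))  # patients with at least one ICD-10 visit

# Count patients with more than one ICD-10 visit
num_pos = 0
for k,v in num_icd_pat.items():
    if v > 1:
        num_pos += 1
print(num_pos)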

print("Done")

# Sort each patient's visits by date and collect the codes, visit dates, and code frequencies
def icd2cui(patients, logging_step=50000):
    dictionary = defaultdict(int)
    # cuis_li = []
    cuis_di = {}
    date_di = {}
    num_idx = 0
    for pssn,v in patients.items():
        num_idx += 1
        if num_idx%logging_step == 0:
            print("|{} - Processed {}".format(time.asctime(time.localtime(time.time())), num_idx), flush=True)
        cuis_di[pssn] = []
        cuis_li_tmp = []
        date_li_tmp = []
        for datetime_str in sorted(v.keys()): # sort by time
            datetime_object = datetime.strptime(datetime_str, '%Y-%m-%d') # make sure time str is correct
            infos = v[datetime_str]
            if len(infos) > 0:
                # cuis_di[pssn].append((cuis, ext_cuis, strs))
                cuis_li_tmp.append((infos, [], []))
                date_li_tmp.append(datetime_str)
            for cui_id in infos:
                dictionary[cui_id] += 1
        if len(cuis_li_tmp) > 0:
            cuis_di[pssn] = cuis_li_tmp
            date_di[pssn] = date_li_tmp
    return cuis_di, date_di, dictionary

# Putting everything together
patients_few = dict(islice(patients.items(), 0, 200))  # small subset for quick tests (unused below)
cuis, date, dictionary = icd2cui(patients, logging_step=50000)

# Save the preprocessed data into three separate files, plus the vocabulary file
dir_path = 'path/to/file'  # directory to save the preprocessed data
print("Number of cui in dictionary: {}".format(len(dictionary)), flush=True)
with open(dir_path + '/dict.txt', 'w') as handle:  # TODO
    # Special tokens and reserved [unused*] slots come first, then the ICD codes
    handle.write("[PAD]\n")
    for i in range(99):
        handle.write("[unused{}]\n".format(i))
    for token in ("[UNK]", "[CLS]", "[SEP]", "[MASK]"):
        handle.write(token + "\n")
    for i in range(99, 194):
        handle.write("[unused{}]\n".format(i))
    for k in dictionary:
        handle.write("{}\n".format(k))

# Save one pickled record per patient, in the same order as the keys file
print("Saving patient data...", flush=True)
with open(dir_path + '/value.pickle', 'wb') as f1, \
        open(dir_path + '/dates.pickle', 'wb') as f3, \
        open(dir_path + '/key.txt', 'w') as f2:
    for k, v in cuis.items():
        pickle.dump(v, f1, protocol=pickle.HIGHEST_PROTOCOL)
        pickle.dump(date[k], f3, protocol=pickle.HIGHEST_PROTOCOL)
        f2.write("{}\n".format(k))

print("Done")

FileNotFoundError: [Errno 2] No such file or directory: '/data/corpora_alpha/MIMIC/physionet.org/files/mimiciv/2.2/hosp/admissions.csv.gz'

### Loading the Data from 3 previously saved files

In [16]:
import pickle

dir_path = "./train_data" #edit with path file
do_date = True


# key.txt lists the patient ids in the same order as the pickled records
with open(dir_path + '/key.txt', 'r') as f2:
    keys = f2.readlines()

patients = {}
with open(dir_path + '/value.pickle', 'rb') as f1, open(dir_path + '/dates.pickle', 'rb') as f3:
    for key in keys:
        patient_idd = key.strip()
        each_visit = pickle.load(f1)
        # keep only the codes of each visit; the two extra slots stay empty
        f1obj = [(cuis, [], []) for (cuis, ext_cuis, strs) in each_visit]
        f3obj = pickle.load(f3)
        assert len(f1obj) == len(f3obj)
        # pair each visit with its date when do_date is set
        visits = list(zip(f1obj, f3obj)) if do_date else f1obj
        patients.setdefault(patient_idd, []).extend(visits)


print("number of patients in the sample dataset: ", len(patients))

print(patients[list(patients.keys())[0]])
print("format of the data is as follows [([icd-version-icd-code], date time)]")



number of patients in the sample dataset:  180640
[((['9-5723', '9-78959', '9-5715', '9-07070', '9-496', '9-29680', '9-30981', '9-V1582'], [], []), '2180-05-07'), ((['9-07071', '9-78959', '9-2875', '9-2761', '9-496', '9-5715', '9-V08', '9-3051'], [], []), '2180-06-27'), ((['9-45829', '9-07044', '9-7994', '9-2761', '9-78959', '9-2767', '9-3051', '9-V08', '9-V4986', '9-V462', '9-496', '9-29680', '9-5715'], [], []), '2180-07-25'), ((['9-07054', '9-78959', '9-V462', '9-5715', '9-2767', '9-2761', '9-496', '9-V08', '9-3051', '9-78791'], [], []), '2180-08-07')]
format of the data is as follows [([icd-version-icd-code], date time)]


##   Model
The model section includes the model definition (usually a class), model training, and other necessary parts.
  * Model architecture: layer number/size/type, activation function, etc.
  * Training objectives: loss function, optimizer, weight of each loss term, etc.
  * Others: whether the model is pretrained, Monte Carlo simulation for uncertainty analysis, etc.
  * The model code should have classes for the model, functions for model training, model validation, etc.
  * If your model training is done outside of this notebook, please upload the trained model here and develop a function to load and test it.

### Model Descriptions

* We will use `ICDBartForPreTraining` from `icdmodelbart.py`. `ICDBartForPreTraining` is an extension of `BartModel` used for pretraining on clinical data, and includes features for resizing token embeddings and handling masked language modeling objectives.
* The model stacks multiple encoder and decoder layers, following the BART architecture.

### Model Implementation/Set-Up

In [8]:
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize distributed training
# dist.init_process_group(backend='nccl',rank=0,world_size=1)


from importlib import reload
import icdmodelbart
import trainer
reload(icdmodelbart)
reload(trainer)


from icdmodelbart import ICDBartForPreTraining
from trainer import Trainer 
import torch
from transformers import BartConfig, BartModel, TrainingArguments
from transformers import DefaultDataCollator



config = BartConfig()
# Create an instance of the model
model = ICDBartForPreTraining(config)

# define training arguments
training_args = TrainingArguments(
    output_dir="./results",  # output directory
    overwrite_output_dir=True,  # overwrite the content of the output directory
    num_train_epochs=3,  # number of training epochs
    per_device_train_batch_size=8,  # batch size per device during training
    per_device_eval_batch_size=8,   # batch size for evaluation
    evaluation_strategy="epoch",  # evaluation strategy to adopt during training
    logging_dir="./logs",  # directory for storing logs
    logging_steps=100,  # log every 100 steps
    save_steps=100,  # save checkpoint every 100 steps
    save_total_limit=3,  # limit the total amount of checkpoints
    prediction_loss_only=True, # When enabled, only the prediction loss is calculated
)



# Create an instance of the Trainer
trainer = Trainer(
  model=model,
  args=training_args
)

print("Done")
# class my_model():
#   # use this class to define your model
#   pass

# model = my_model()
# loss_func = None
# optimizer = None

# def train_model_one_iter(model, loss_func, optimizer):
#   pass

# num_epoch = 10
# # model training loop: it is better to print the training/validation losses during the training
# for i in range(num_epoch):
#   train_model_one_iter(model, loss_func, optimizer)
#   train_loss, valid_loss = None, None
#   print("Train Loss: %.2f, Validation Loss: %.2f" % (train_loss, valid_loss))


BartEncoder: SinusoidalPositionalEmbedding(config.max_position_embeddings, embed_dim, self.padding_idx)
nn.Embedding(1460, embed_dim)
self.layernorm_embedding = LayerNorm(embed_dim) if config.normalize_embedding else nn.Identity()
self.layer_norm = LayerNorm(config.d_model) if config.normalize_before else None
BartDecoder: self.embed_positions = LearnedPositionalEmbedding(config.max_position_embeddings, config.d_model, self.padding_idx)
self.layernorm_embedding = LayerNorm(config.d_model) if config.normalize_embedding else nn.Identity()
self.layer_norm = LayerNorm(config.d_model) if config.normalize_before else None
world master -- one process only


##   Training

### Computational Requirements

### Implementation Code

In [5]:
# Training the model


from icdmodelbart import ICDBartForPreTraining
from trainer import Trainer

# -- instantiate a blank model
#    -- reuses the `config` (BartConfig) defined in the set-up cell above;
#       ICDBartForPreTraining requires it as a positional argument
model = ICDBartForPreTraining(config)

# NOTE: data_collator, test_data_collator, train_dataset, eval_dataset and
# test_datasets are expected to come from the authors' dataset.py, which we
# do not have yet (see Discussion), so they must be defined before this cell runs
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    test_collator=test_data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    test_datasets=test_datasets,
    prediction_loss_only=False
)

# -- setting up training values
train_dataloader = trainer.get_train_dataloader()
if trainer.args.max_steps > 0:
    t_total = trainer.args.max_steps
    num_train_epochs = (
        trainer.args.max_steps // (len(train_dataloader) // trainer.args.gradient_accumulation_steps) + 1
    )
else:
    t_total = int(len(train_dataloader) // trainer.args.gradient_accumulation_steps * trainer.args.num_train_epochs)
    num_train_epochs = trainer.args.num_train_epochs

optimizer, scheduler = trainer.get_optimizers(num_training_steps=t_total)

#  -- training loop (to be added)

## Evaluation

### Metrics Descriptions

`trainer.evaluate()` will return a dict containing:
* the eval loss
* the potential metrics computed from the predictions
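
The evaluation cell in this notebook also derives perplexity from the validation loss. As a minimal sketch of that relationship, assuming the loss is a mean cross-entropy in nats, perplexity is simply its exponential. The helper name `perplexity` and the loss value below are illustrative, not results from our model.

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)

# Hypothetical evaluation output; the value is made up for illustration.
eval_output = {"eval_loss": 2.0}
eval_output["perplexity"] = perplexity(eval_output["eval_loss"])
print(eval_output)

# Sanity check: a model that guesses uniformly over V tokens has loss ln(V),
# so its perplexity equals V.
vocab_size = 100
print(perplexity(math.log(vocab_size)))
```

A lower perplexity means the model spreads its probability mass over fewer candidate codes when predicting the masked positions.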

### Implementation Code

In [None]:
# Evaluation
import logging
import math
import os

logger = logging.getLogger(__name__)

results = {}
if training_args.do_eval:
    logger.info("*** Evaluate ***")

    eval_output = trainer.evaluate()

    perplexity = math.exp(eval_output["eval_00_valid_loss"])
    eval_output["perplexity"] = perplexity

    output_eval_file = os.path.join(training_args.output_dir, "eval_results_lm.txt")
    if trainer.is_world_master():
        with open(output_eval_file, "w") as writer:
            logger.info("***** Eval results *****")
            for key in sorted(eval_output.keys()):
                logger.info("  %s = %s", key, str(eval_output[key]))
                writer.write("%s = %s\n" % (key, str(eval_output[key])))

    results.update(eval_output)

# Results
In this section, you should finish training your model or load your trained model. That is a great experiment! You should share the results with others, along with the necessary metrics and figures.

Please test and report results for all experiments that you run with:

*   specific numbers (accuracy, AUC, RMSE, etc)
*   figures (loss shrinkage, outputs from GAN, annotation or label of sample pictures, etc)


In [None]:
# metrics to evaluate my model

# plot figures to better show the results

# it is better to save the numbers and figures for your presentation.

## Model comparison

In [None]:
# compare your model with others
# you don't need to re-run all other experiments; instead, you can directly cite the metrics/numbers in the paper

# Discussion

In this section, you should discuss your work and make a plan for future work. The discussion should address the following questions:
  * Assess whether the paper is reproducible or not.
  * Explain why it is not reproducible if your results are negative.
  * Describe “What was easy” and “What was difficult” during the reproduction.
  * Make suggestions to the authors or other reproducers on how to improve reproducibility.
  * What you will do in the next phase.



Since the code released with the paper is missing large parts of the relevant code, specifically functions from `dataset.py` such as the data collator functions and the `prepare_dataset` function, it is difficult to reproduce. Moving forward, we plan to implement our own versions of these functions for training and running the models. To improve the reproducibility of the paper, the authors could ideally publish their `dataset.py` with the complete, working functions.

**Plan for ablation studies:**

The `DataTrainingArguments` class has flags pertaining to the data we provide as input to our model for training and evaluation, which we can use for analyzing each ablation and how it affects the model. 

1) ***Visit Masking:*** For this ablation, we will adjust the `mlm` flag and/or the `do_poisoon_random_masking` flag. When set to True, `mlm` trains with BART masking by natural visit instead of masking with poisoon_random (lambda 4). Conversely, `do_poisoon_random_masking` decides whether to mask with poisoon_random (lambda 4) instead of by natural visit.

2) ***Time Embedding:*** To test this ablation, we will adjust the `do_pos_emb` argument, which decides whether or not to do positional embedding, and/or the `do_date_visit` argument, which decides whether or not to treat each visit embedding as a date (time embedding implemented) or a sequence number (like BERT, time embedding not implemented).

3) ***Encoder-Decoder vs Encoder Only:*** This ablation would be much more complex, since it involves modifying the architecture of the model. Currently, the model stacks several encoder and decoder layers (in standard BART, the encoder is bidirectional and the decoder is unidirectional). To compare this with an encoder-only model, we would need to remove the decoder layers and make predictions from the bidirectional encoder layers alone.
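
To illustrate what the `do_date_visit` choice in the time embedding ablation could mean in practice, the sketch below turns a patient's visit dates into one embedding index per visit, either from the dates or from the visit order. This is our assumption of the idea, not the authors' code: the function name `visit_positions` and the use of "days since the first visit" as the date-based index are hypothetical.

```python
from datetime import date

def visit_positions(visit_dates, use_dates):
    """Return one embedding index per visit.

    use_dates=True  -> days elapsed since the patient's first visit (assumed encoding)
    use_dates=False -> plain sequence numbers 0, 1, 2, ... (as in BERT-style positions)
    """
    parsed = [date.fromisoformat(d) for d in visit_dates]
    if use_dates:
        return [(d - parsed[0]).days for d in parsed]
    return list(range(len(parsed)))

# Visit dates of the example patient printed in the data loading section.
dates = ["2180-05-07", "2180-06-27", "2180-07-25", "2180-08-07"]

with_time = visit_positions(dates, use_dates=True)
without_time = visit_positions(dates, use_dates=False)
print(with_time)     # [0, 51, 79, 92]
print(without_time)  # [0, 1, 2, 3]
```

With sequence numbers, a visit one week later and a visit one year later look identical to the model, which is exactly the information the time embedding ablation removes.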


In [None]:
# no code is required for this section
'''
if you want to use an image outside this notebook for explanation,
you can read and plot it here like the Scope of Reproducibility
'''

# References

1. Yang, Z., Mitra, A., Liu, W. et al. TransformEHR: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records. Nat Commun 14, 7857 (2023). https://doi.org/10.1038/s41467-023-43715-z



# Feel free to add new sections