
Efficiency at Scale: Investigating the Performance of Diminutive Language Models in Clinical Tasks

Utility of different Parameter Efficient Fine-tuning (PEFT) strategies for clinical NLP tasks with models of varying scales and domain pre-training.

Table of Contents
  1. About The Project
  2. Getting Started
  3. License
  4. Citation

About The Project

This repository contains the code for the paper Efficiency at Scale: Investigating the Performance of Diminutive Language Models in Clinical Tasks; the paper is available here.

We explored the utility of different Parameter Efficient Fine-tuning (PEFT) strategies for clinical NLP tasks, using models of varying scales and domain pre-training to determine how the different methods interact with downstream task performance. Experiments were run on the MIMIC-III and i2b2 datasets.

(back to top)

Getting Started

This is a fairly simple repo which utilises the HuggingFace transformers and peft libraries to fine-tune models on clinical NLP tasks. The code is designed to be run on a local machine, and the datasets are not provided.

Prerequisites

The key libraries required are:

peft
transformers
torch
datasets
evaluate

These libraries evolve quickly, so exact package versions are pinned in the requirements.txt file.
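
A quick way to confirm the environment is set up, after installing the pinned versions from requirements.txt, is to import the core libraries and print their versions (a minimal sanity-check sketch, not part of the repository scripts):

# Sanity check: import the core libraries and print their installed versions.
import importlib

for name in ["peft", "transformers", "torch", "datasets", "evaluate"]:
    module = importlib.import_module(name)
    print(f"{name}: {getattr(module, '__version__', 'unknown')}")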

Datasets

All of the clinical downstream tasks were performed on the MIMIC-III and i2b2 datasets. MIMIC-III is a large, freely available database of de-identified health-related data from over forty thousand patients who stayed in intensive care units.

Both datasets require data use agreements, and users must request access before using the data.

Pre-processing

Follow the pre-processing steps below to prepare the datasets locally for use with the provided scripts. The process is somewhat manual, but all datasets should ultimately end up in the same global dataset directory.

MIMIC-III

To prepare the clinical outcome tasks, mortality prediction (MIMIC MP) and length of stay prediction (MIMIC LOS), we follow the steps provided by the original authors here.

i2b2

We generally follow the same pre-processing steps as the original papers for each dataset. For the i2b2 tasks, we follow the steps provided by the Facebook Research group here and subsequently here.

NER tasks

Once you have the data from the original preprocessing steps, we further process it into the HuggingFace datasets format using the datasets and transformers libraries.

For example, to process the i2b2 2010 NER data, we use the following script:

python load_i2b2_2010_ner.py

At present, you will need to change the directory paths in the script to point to the correct location of the data on your local machine, as well as the save location for the new HF dataset.
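
As a rough illustration of what such a conversion involves (a hedged sketch rather than the actual contents of load_i2b2_2010_ner.py; the label list, toy examples, and output path are placeholders), token/tag pairs are wrapped into a DatasetDict and saved to disk:

# Sketch: wrap pre-tokenised NER examples into a HuggingFace DatasetDict and
# save it to disk. Labels, toy examples, and the output path are placeholders,
# not the exact contents of load_i2b2_2010_ner.py.
from datasets import ClassLabel, Dataset, DatasetDict, Features, Sequence, Value

label_names = ["O", "B-problem", "I-problem", "B-test", "I-test", "B-treatment", "I-treatment"]
features = Features({
    "tokens": Sequence(Value("string")),
    "ner_tags": Sequence(ClassLabel(names=label_names)),
})

# Toy stand-ins for the output of the original i2b2 preprocessing.
train = {"tokens": [["Patient", "denies", "chest", "pain"]], "ner_tags": [[0, 0, 1, 2]]}
validation = {"tokens": [["CT", "scan", "was", "negative"]], "ner_tags": [[3, 4, 0, 0]]}

dataset = DatasetDict({
    "train": Dataset.from_dict(train, features=features),
    "validation": Dataset.from_dict(validation, features=features),
})
dataset.save_to_disk("datasets/I2B22010NER_hf_dataset")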

Dataset directory structure

The directory structure for the datasets should be as follows:

datasets
├── I2B22010NER_hf_dataset
│   ├── dataset_dict.json
│   ├── info
│   │   ├── dataset_info.json
│   │   └── state.json
│   ├── test
│   │   ├── dataset_info.json
│   │   └── state.json
│   ├── train
│   │   ├── dataset_info.json
│   │   └── state.json
│   └── validation
│       ├── dataset_info.json
│       └── state.json
├── i2b2-2010-RE
│   ├── dataset_dict.json
│   ├── train
│   │   ├── dataset_info.json
│   │   └── state.json
│   └── validation
│       ├── dataset_info.json
│       └── state.json
├── i2b2-2012_hf_dataset
│   ├── dataset_dict.json
│   ├── info
│   │   ├── dataset_info.json
│   │   └── state.json
│   ├── test
│   │   ├── dataset_info.json
│   │   └── state.json
│   ├── train
│   │   ├── dataset_info.json
│   │   └── state.json
│   └── validation
│       ├── dataset_info.json
│       └── state.json
├── i2b2-2014_hf_dataset
│   ├── dataset_dict.json
│   ├── info
│   │   ├── dataset_info.json
│   │   └── state.json
│   ├── test
│   │   ├── dataset_info.json
│   │   └── state.json
│   ├── train
│   │   ├── dataset_info.json
│   │   └── state.json
│   └── validation
│       ├── dataset_info.json
│       └── state.json
├── icd9-triage
│   ├── test.csv
│   ├── train.csv
│   └── valid.csv
├── MedNLI
│   ├── dataset_dict.json
│   ├── test
│   │   ├── dataset_info.json
│   │   └── state.json
│   ├── train
│   │   ├── dataset_info.json
│   │   └── state.json
│   └── validation
│       ├── dataset_info.json
│       └── state.json
└── mimic3-clinical-outcomes
    ├── los
    │   ├── LOS_WEEKS_adm_test.csv
    │   ├── LOS_WEEKS_adm_train.csv
    │   ├── LOS_WEEKS_adm_val.csv
    │   ├── test.csv
    │   ├── train.csv
    │   └── valid.csv
    └── mp
        ├── test.csv
        ├── train.csv
        └── valid.csv

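The HF-format directories above can be read back with load_from_disk, while the CSV-based tasks (icd9-triage and mimic3-clinical-outcomes) can be loaded as CSV splits. A minimal sketch, assuming the datasets root shown above:

# Sketch: loading the prepared datasets, assuming the directory structure above
# rooted at ./datasets.
from datasets import load_dataset, load_from_disk

# Tasks stored in HuggingFace on-disk format (e.g. the i2b2 NER tasks).
i2b2_ner = load_from_disk("datasets/I2B22010NER_hf_dataset")
print(i2b2_ner)  # DatasetDict with train/validation/test splits

# Tasks stored as CSV splits (e.g. MIMIC-III mortality prediction).
mimic_mp = load_dataset(
    "csv",
    data_files={
        "train": "datasets/mimic3-clinical-outcomes/mp/train.csv",
        "validation": "datasets/mimic3-clinical-outcomes/mp/valid.csv",
        "test": "datasets/mimic3-clinical-outcomes/mp/test.csv",
    },
)
print(mimic_mp)
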
Training

We use a relatively simple training script to fine-tune the models on the clinical tasks. The script is run from the command line, and the user can specify the model, task, and PEFT strategy to use. To make switching between datasets easy, dataset paths and related details are kept in a YAML file. At present, it is essential that you create this datasets.yaml file in the root directory of the project.

An example of the yaml file is provided below:

mimic-mp:
  training_data_dir: /mnt/sdd/efficient_ml_data/datasets/mimic3-clinical-outcomes/mp
  eval_data_dir: /mnt/sdd/efficient_ml_data/datasets/mimic3-clinical-outcomes/mp
  data_dir: ''
  training_file: train.csv
  validation_file: valid.csv
  test_file: test.csv
  task_type: SEQ_CLS
  label_name: hospital_expire_flag
  text_column: text
  remove_columns: [text]
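
To illustrate how such an entry can be consumed (a hedged sketch of the general pattern, not the exact logic inside peft_trainer.py; PyYAML is assumed to be installed), the task is looked up by name and its CSV splits loaded:

# Sketch: resolving the "mimic-mp" task from datasets.yaml and loading its splits.
# This mirrors the config fields above but is not the exact peft_trainer.py logic.
import os
import yaml
from datasets import load_dataset

with open("datasets.yaml") as f:
    task_config = yaml.safe_load(f)["mimic-mp"]

data_files = {
    "train": os.path.join(task_config["training_data_dir"], task_config["training_file"]),
    "validation": os.path.join(task_config["eval_data_dir"], task_config["validation_file"]),
    "test": os.path.join(task_config["eval_data_dir"], task_config["test_file"]),
}
dataset = load_dataset("csv", data_files=data_files)
print(dataset["train"].column_names)  # should include the text_column and label_name fields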

The training script can be run as follows; for example, to train a model on the MIMIC-III mortality prediction task with the LoRA PEFT strategy:

python peft_trainer.py  --model_name_or_path "$MODEL" --peft_method "LORA" --task "mimic-mp" --log_save_dir "$WHERE_YOU_WANT_LOGS" --ckpt_save_dir "$WHERE_YOU_WANT_CHECKPOINTS" --train_batch_size 32 --eval_batch_size 32 --max_epochs 5
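
For reference, wrapping a base model with LoRA via the peft library follows the pattern below (a sketch of the general approach, not a verbatim excerpt from peft_trainer.py; the base model name and hyperparameters are illustrative):

# Sketch: applying LoRA to a sequence-classification model with the peft library.
# The base model and LoRA hyperparameters here are examples only; swap in the
# compact or domain pre-trained model you want to fine-tune.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # matches task_type in datasets.yaml
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the weights are trainable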

License

Distributed under the MIT License. See LICENCE for more information.

(back to top)

Citation

@misc{taylor2024efficiency,
      title={Efficiency at Scale: Investigating the Performance of Diminutive Language Models in Clinical Tasks}, 
      author={Niall Taylor and Upamanyu Ghose and Omid Rohanian and Mohammadmahdi Nouriborji and Andrey Kormilitzin and David Clifton and Alejo Nevado-Holgado},
      year={2024},
      eprint={2402.10597},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
