Utility of different Parameter Efficient Fine-tuning (PEFT) strategies for clinical NLP tasks with models of varying scales and domain pre-training.
This repository contains the code for the following paper: Efficiency at Scale: Investigating the Performance of Diminutive Language Models in Clinical Tasks. The paper can be found here.
We explored the utility of different Parameter Efficient Fine-tuning (PEFT) strategies for clinical NLP tasks. We used models of varying scales and domain pre-training to determine how the different methods interact with downstream task performance. The downstream tasks are built on the MIMIC-III and i2b2 datasets.
This is a fairly simple repo which utilises the HuggingFace transformers and peft libraries to fine-tune models on clinical NLP tasks. The code is designed to be run on a local machine, and the datasets are not provided.
The key libraries required are:
peft
transformers
torch
datasets
evaluate
These libraries change frequently; the exact package versions used can be found in the requirements.txt file.
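If you are setting up a fresh environment, the dependencies can usually be installed with pip install -r requirements.txt, although the exact torch installation command may depend on your CUDA setup.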
All of the clinical downstream tasks were performed on the MIMIC-III and i2b2 datasets. MIMIC-III is a large, freely available database comprising de-identified health-related data associated with over forty thousand patients who stayed in intensive care units.
Both datasets require data use agreements, and users must request access to the data prior to use.
We follow the pre-processing steps below to prepare the datasets locally for use with the scripts provided. The process is a bit clunky, but all datasets should eventually end up in the same global dataset directory.
To prepare the clinical outcome tasks, mortality prediction (MIMIC MP) and length of stay prediction (MIMIC LOS), we follow the steps provided by the original authors here.
Generally speaking, we follow the same pre-processing steps as the original papers for each dataset. For the i2b2 tasks we follow the steps provided by the Facebook Research group here and subsequently here.
Once you have the data from the original preprocessing steps, we further process the data into the HuggingFace datasets format. This is done using the datasets and transformers libraries.
For example, to process the i2b2 2010 NER data, we use the following script:
python load_i2b2_2010_ner.py
At present, you will need to change the directory paths in the script to point to the correct location of the data on your local machine, as well as the save location for the new HF dataset.
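As a quick sanity check (this is not one of the repo's scripts), the saved dataset can be loaded back with the datasets library. The path below is a placeholder for wherever you chose to save the new HF dataset:

from datasets import load_from_disk

# Placeholder path: point this at the directory the processing script wrote to
dataset = load_from_disk("/path/to/datasets/I2B22010NER_hf_dataset")

print(dataset)              # shows the train/validation/test splits
print(dataset["train"][0])  # inspect a single example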
The directory structure for the datasets should be as follows:
datasets
├── I2B22010NER_hf_dataset
│   ├── dataset_dict.json
│   ├── info
│   │   ├── dataset_info.json
│   │   └── state.json
│   ├── test
│   │   ├── dataset_info.json
│   │   └── state.json
│   ├── train
│   │   ├── dataset_info.json
│   │   └── state.json
│   └── validation
│       ├── dataset_info.json
│       └── state.json
├── i2b2-2010-RE
│   ├── dataset_dict.json
│   ├── train
│   │   ├── dataset_info.json
│   │   └── state.json
│   └── validation
│       ├── dataset_info.json
│       └── state.json
├── i2b2-2012_hf_dataset
│   ├── dataset_dict.json
│   ├── info
│   │   ├── dataset_info.json
│   │   └── state.json
│   ├── test
│   │   ├── dataset_info.json
│   │   └── state.json
│   ├── train
│   │   ├── dataset_info.json
│   │   └── state.json
│   └── validation
│       ├── dataset_info.json
│       └── state.json
├── i2b2-2014_hf_dataset
│   ├── dataset_dict.json
│   ├── info
│   │   ├── dataset_info.json
│   │   └── state.json
│   ├── test
│   │   ├── dataset_info.json
│   │   └── state.json
│   ├── train
│   │   ├── dataset_info.json
│   │   └── state.json
│   └── validation
│       ├── dataset_info.json
│       └── state.json
├── icd9-triage
│   ├── test.csv
│   ├── train.csv
│   └── valid.csv
├── MedNLI
│   ├── dataset_dict.json
│   ├── test
│   │   ├── dataset_info.json
│   │   └── state.json
│   ├── train
│   │   ├── dataset_info.json
│   │   └── state.json
│   └── validation
│       ├── dataset_info.json
│       └── state.json
└── mimic3-clinical-outcomes
    ├── los
    │   ├── LOS_WEEKS_adm_test.csv
    │   ├── LOS_WEEKS_adm_train.csv
    │   ├── LOS_WEEKS_adm_val.csv
    │   ├── test.csv
    │   ├── train.csv
    │   └── valid.csv
    └── mp
        ├── test.csv
        ├── train.csv
        └── valid.csv
We use a relatively simple training script to fine-tune the models on the clinical tasks. The script is designed to be run from the command line, and the user can specify the model, task, and PEFT strategy to use. To facilitate switching between datasets, we use a yaml file containing the relevant dataset path details. At present, it is essential that you create this datasets.yaml file in the root directory of the project.
An example of the yaml file is provided below:
mimic-mp:
  training_data_dir: /mnt/sdd/efficient_ml_data/datasets/mimic3-clinical-outcomes/mp
  eval_data_dir: /mnt/sdd/efficient_ml_data/datasets/mimic3-clinical-outcomes/mp
  data_dir: ''
  training_file: train.csv
  validation_file: valid.csv
  test_file: test.csv
  task_type: SEQ_CLS
  label_name: hospital_expire_flag
  text_column: text
  remove_columns: [text]
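For reference, a config in this format can be read with pyyaml; the snippet below is an illustrative sketch rather than code taken from the repo, and the keys simply mirror the example above:

from pathlib import Path
import yaml  # provided by the pyyaml package

# Read the dataset configuration from the project root
config = yaml.safe_load(Path("datasets.yaml").read_text())

task_cfg = config["mimic-mp"]
train_file = Path(task_cfg["training_data_dir"]) / task_cfg["training_file"]
print(task_cfg["task_type"], train_file)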
The training script can be run as follows, for example to train a model on the MIMIC-III mortality prediction task with the LoRA PEFT strategy:
python peft_trainer.py \
  --model_name_or_path "$MODEL" \
  --peft_method "LORA" \
  --task "mimic-mp" \
  --log_save_dir "$WHERE_YOU_WANT_LOGS" \
  --ckpt_save_dir "$WHERE_YOU_WANT_CHECKPOINTS" \
  --train_batch_size 32 \
  --eval_batch_size 32 \
  --max_epochs 5
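For context on what the LORA option does, the sketch below shows how a LoRA adapter is typically attached to a sequence classification model with the peft library. This is an illustrative example rather than the contents of peft_trainer.py, and the base model name and hyperparameters are placeholders:

from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

# Placeholder base model; the paper experiments with models of varying scales
base_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# LoRA configuration for a sequence classification (SEQ_CLS) task
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,             # rank of the low-rank update matrices
    lora_alpha=16,   # scaling factor applied to the LoRA updates
    lora_dropout=0.1,
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable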
Distributed under the MIT License. See LICENCE for more information.
@misc{taylor2024efficiency,
title={Efficiency at Scale: Investigating the Performance of Diminutive Language Models in Clinical Tasks},
author={Niall Taylor and Upamanyu Ghose and Omid Rohanian and Mohammadmahdi Nouriborji and Andrey Kormilitzin and David Clifton and Alejo Nevado-Holgado},
year={2024},
eprint={2402.10597},
archivePrefix={arXiv},
primaryClass={cs.CL}
}