
LLM-NER-clinical-text

The purpose of this repository is to fine-tune promising LLMs on clinical text for the named entity recognition (NER) task.

1. Model

1.1 GatorTron

The first powerful model class we consider is GatorTron, a BERT-type model (https://huggingface.co/UFNLP/gatortron-medium).

Developed by a joint effort between the University of Florida and NVIDIA, GatorTron-Medium is a clinical language model of 3.9 billion parameters, pre-trained using a BERT architecture implemented in the Megatron package (https://github.com/NVIDIA/Megatron-LM). GatorTron-Medium was pre-trained on a dataset consisting of 82B words of de-identified clinical notes from the University of Florida Health System, 6.1B words from PubMed CC0, 2.5B words from WikiText, and 0.5B words of de-identified clinical notes from MIMIC-III.

The base model has 345 million parameters, while the medium one has 3.9 billion. More details are provided in the paper https://www.nature.com/articles/s41746-022-00742-2.
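
As a quick sketch of how these checkpoints can be used, the snippet below loads GatorTron-Medium for token classification with the Hugging Face transformers library. This is illustrative, not the repo's exact code; the label count is a placeholder that depends on the target dataset.

```python
# A minimal sketch: load GatorTron-Medium as a token-classification (NER) model.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "UFNLP/gatortron-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels is a placeholder; it must match the tag set of your NER dataset,
# e.g. 3 for O / B-Disease / I-Disease on a disease-only corpus.
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=3)
```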

1.2 GatorTronS

GatorTronS is related to GatorTron but was pre-trained on a different corpus.

Developed by a joint effort between the University of Florida and NVIDIA, GatorTronS is a clinical language model of 345 million parameters, pre-trained using a BERT architecture implemented in the Megatron package (https://github.com/NVIDIA/Megatron-LM). GatorTronS was pre-trained on a dataset consisting of 22B synthetic clinical words generated by GatorTronGPT (a Megatron GPT-3 model), 6.1B words from PubMed CC0, 2.5B words from WikiText, and 0.5B words of de-identified clinical notes from MIMIC-III.

The model has 345 million parameters. Details can be found at https://www.nature.com/articles/s41746-023-00958-w#code-availability.

2. Dataset

2.1 MedMentions

This dataset contains 4,392 abstracts released in PubMed® between January 2016 and January 2017. The abstracts were manually annotated for biomedical concepts. Details are provided in https://arxiv.org/pdf/1902.09476v1.pdf and the data is available at https://github.com/chanzuckerberg/MedMentions.

The preprocessed data for LLM training can be found at https://github.com/mhmdrdwn/medm/tree/main/data/built_data.
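
The exact file layout is documented in that repo. Assuming a common CoNLL-style format (one token and one BIO tag per line, blank lines separating sentences), a minimal reader could look like the sketch below; treat it as an illustration, not the repo's actual loader.

```python
# A minimal sketch, assuming CoNLL-style files: one "token tag" pair per line,
# blank lines between sentences. Verify against the built_data repo above.
def read_conll(path):
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                  # blank line closes a sentence
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
            else:
                token, tag = line.split()[:2]
                tokens.append(token)
                tags.append(tag)
    if tokens:                            # flush a final unterminated sentence
        sentences.append((tokens, tags))
    return sentences
```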

2.2 NCBI-disease

This dataset contains the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Details are provided at https://www.sciencedirect.com/science/article/pii/S1532046413001974?via%3Dihub.

The preprocessed data for LLM training can be found at https://huggingface.co/datasets/ncbi_disease.
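
Since this corpus is hosted on the Hugging Face Hub, it can be pulled directly with the datasets library, as in the short sketch below.

```python
# A minimal sketch: load the NCBI-disease corpus from the Hugging Face Hub.
# Each example carries word-level tokens and integer BIO tags
# (O / B-Disease / I-Disease).
from datasets import load_dataset

dataset = load_dataset("ncbi_disease")
example = dataset["train"][0]
print(example["tokens"])    # word-level tokens of one sentence
print(example["ner_tags"])  # one BIO tag id per token
```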

2.3 n2c2 2018 Track 2

Details of the dataset can be found at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7489085/.

The data for this shared task consisted of 505 discharge summaries drawn from the MIMIC-III (Medical Information Mart for Intensive Care-III) clinical care database. These records were selected using a query that searched for an adverse drug event (ADE) in the International Classification of Diseases code description of each record. The identified records were manually screened to contain at least one ADE and were annotated for the concept and relation types described in the paper. Each record in the dataset was annotated by two independent annotators, while a third annotator resolved conflicts.

3. Training setup

3.1 Environment

Here I use Python version 3.9.2. All the dependencies are listed in requirements.txt. You also need to install the repo as a package: `pip install -e .`.

3.2 Run the code

An example command to run the training code:

```bash
python3 src/models/train_model.py \
    --model_name 'UFNLP/gatortrons' \
    --data_dir '/home/ec2-user/SageMaker/LLM-NER-clinical-text/data/public/MedMentions/preprocessed-data/' \
    --batch_size 4 \
    --num_train_epochs 5 \
    --weight_decay 0.01 \
    --new_model_dir "/home/ec2-user/SageMaker/LLM-NER-clinical-text/models/medmentions/gatortrons/" \
    --path_umls_semtype '/home/ec2-user/SageMaker/LLM-NER-clinical-text/data/public/MedMentions/SemGroups_2018.txt'
```
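
One detail any BERT-style NER fine-tuning has to handle is aligning word-level BIO tags with sub-word tokens. The sketch below shows the standard Hugging Face transformers pattern; the function name and signature are illustrative, not the repo's actual API, and a fast tokenizer is required for word_ids().

```python
# A minimal sketch (illustrative, not the repo's API): align word-level BIO
# tags with the sub-word tokens a BERT-style tokenizer produces.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("UFNLP/gatortrons")

def align_labels(words, word_tags, label2id):
    """Tokenize one pre-split sentence and expand its tags to sub-words."""
    encoding = tokenizer(words, is_split_into_words=True, truncation=True)
    labels, previous_word = [], None
    for word_id in encoding.word_ids():
        if word_id is None:             # special tokens such as [CLS] / [SEP]
            labels.append(-100)         # -100 is ignored by the loss
        elif word_id != previous_word:  # first sub-word keeps the word's tag
            labels.append(label2id[word_tags[word_id]])
        else:                           # remaining sub-words are masked out
            labels.append(-100)
        previous_word = word_id
    encoding["labels"] = labels
    return encoding
```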

4. Results

The fine-tuned models and brief results can be found on my Hugging Face page: https://huggingface.co/longluu. You can also look at the notebooks folder for training and test results.
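
As a usage sketch, the fine-tuned checkpoints should work with the standard transformers pipeline. The model id below is a placeholder; pick an actual checkpoint from the page above.

```python
# A minimal sketch: run NER inference with one of the fine-tuned checkpoints.
# "longluu/<model-name>" is a placeholder id, not a real checkpoint name.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="longluu/<model-name>",
    aggregation_strategy="simple",  # merge sub-word pieces into entity spans
)
print(ner("The patient was diagnosed with type 2 diabetes mellitus."))
```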

Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

Project based on the cookiecutter data science project template. #cookiecutterdatascience
