
Compact Biomedical Transformers


This repository contains the code used for distillation and fine-tuning of compact biomedical transformers that have been introduced in the paper "On The Effectiveness of Compact Biomedical Transformers".

Abstract of the Research

Language models pre-trained on biomedical corpora, such as BioBERT, have recently shown promising results on downstream biomedical tasks. Many existing pre-trained models, on the other hand, are resource-intensive and computationally heavy owing to factors such as embedding size, hidden dimension, and number of layers. The natural language processing (NLP) community has developed numerous strategies to compress these models utilising techniques such as pruning, quantisation, and knowledge distillation, resulting in models that are considerably faster, smaller, and subsequently easier to use in practice. By the same token, in this paper we introduce six lightweight models, namely, BioDistilBERT, BioTinyBERT, BioMobileBERT, DistilBioBERT, TinyBioBERT, and CompactBioBERT which are obtained either by knowledge distillation from a biomedical teacher or continual learning on the PubMed dataset via the Masked Language Modelling (MLM) objective. We evaluate all of our models on three biomedical tasks and compare them with BioBERT-v1.1 to create efficient lightweight models that perform on par with their larger counterparts.

Available Models

Model Name              Is Distilled   #Layers   #Params   Huggingface Path
DistilBioBERT           True           6         65M       nlpie/distil-biobert
CompactBioBERT          True           6         65M       nlpie/compact-biobert
TinyBioBERT             True           4         15M       nlpie/tiny-biobert
BioDistilBERT-cased     False          6         65M       nlpie/bio-distilbert-cased
BioDistilBERT-uncased   False          6         65M       nlpie/bio-distilbert-uncased
BioTinyBERT             False          4         15M       nlpie/bio-tinybert
BioMobileBERT           False          24        25M       nlpie/bio-mobilebert
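
All of the above models can be loaded directly from the Hugging Face Hub. Below is a minimal sketch (the model name, example sentence, and the choice of AutoModelForMaskedLM are illustrative; for fine-tuning you would load a task-specific head such as AutoModelForTokenClassification instead):

import transformers as ts

# Load one of the compact models listed above from the Hugging Face Hub
modelName = "nlpie/distil-biobert"

tokenizer = ts.AutoTokenizer.from_pretrained(modelName)
model = ts.AutoModelForMaskedLM.from_pretrained(modelName)

# Encode an example biomedical sentence and run a forward pass
inputs = tokenizer("Aspirin is used to treat pain and fever.", return_tensors="pt")
outputs = model(**inputs)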

How to prepare your coding environment

First, install the required packages using the following command:

pip install transformers datasets seqeval evaluate

Second, clone this repository:

git clone https://github.com/nlpie-research/Compact-Biomedical-Transformers.git

Third, add the path of the cloned repository to your Python path so that its modules can be imported:

import sys
sys.path.append("PATH_TO_REPO/Compact-Biomedical-Transformers")

Fourth, download and extract the pre-processed datasets from the BioBERT GitHub repository using these commands:

wget http://nlp.dmis.korea.edu/projects/biobert-2020-checkpoints/datasets.tar.gz
tar -xvzf datasets.tar.gz

Run Models on NER

First, import the load_and_preprocess_dataset and train_and_evaluate functions from ner.py:

from ner import load_and_preprocess_dataset, train_and_evaluate

Then, specify the pre-trained model, dataset name, dataset path, and logging file path like this:

datasetName = "BC5CDR-chem"

modelPath = "nlpie/distil-biobert"
tokenizerPath = "nlpie/distil-biobert"

datasetPath = f"PATH_TO_DOWNLOADED_DATASET/datasets/NER/{datasetName}/"
logsPath = f"{datasetName}-logs.txt"

Next, load the pre-trained tokeniser from huggingface and call the load_and_preprocess_dataset function:

import transformers as ts

tokenizer = ts.AutoTokenizer.from_pretrained(tokenizerPath)

tokenizedTrainDataset, tokenizedValDataset, tokenizedTestDataset, compute_metrics, label_names = load_and_preprocess_dataset(
    datasetPath=datasetPath,
    tokenizer=tokenizer
)

Finally, call the train_and_evaluate function and wait for the results:

model, valResults, testResults = train_and_evaluate(lr=5e-5,
                                                    batchsize=16,
                                                    epochs=5,
                                                    tokenizer=tokenizer,
                                                    tokenizedTrainDataset=tokenizedTrainDataset,
                                                    tokenizedValDataset=tokenizedValDataset,
                                                    tokenizedTestDataset=tokenizedTestDataset,
                                                    compute_metrics=compute_metrics,
                                                    label_names=label_names,
                                                    logsPath=logsPath,
                                                    trainingArgs=None)

Note that you can either use our pre-defined TrainingArguments with your desired learning rate, batch size, and number of epochs, or pass your own custom TrainingArguments via the trainingArgs argument (which defaults to None).
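
For example, a custom configuration might look like the sketch below (the output directory and hyperparameter values are illustrative; all options are standard Hugging Face TrainingArguments), which you would then pass as trainingArgs=customArgs to train_and_evaluate:

# An illustrative custom configuration; pass it via trainingArgs=customArgs
customArgs = ts.TrainingArguments(
    output_dir="ner-checkpoints",       # illustrative checkpoint directory
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
)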

Run Models on QA

First, import load_and_preprocess_train_dataset, load_test_dataset, train, and evaluate from qa.py, along with the following libraries:

import transformers as ts

import torch
import torch.nn as nn
import torch.nn.functional as F

from qa import load_and_preprocess_train_dataset, load_test_dataset, train, evaluate

Then, specify the model, tokeniser, dataset paths, etc., as shown below:

modelPath = "nlpie/bio-distilbert-cased"
tokenizerPath = "nlpie/bio-distilbert-cased"

trainPath = "PATH_TO_DOWNLOADED_DATASET/datasets/QA/BioASQ/BioASQ-train-factoid-7b.json"
testPath = "PATH_TO_DOWNLOADED_DATASET/datasets/QA/BioASQ/BioASQ-test-factoid-7b.json"
goldenPath = "PATH_TO_DOWNLOADED_DATASET/datasets/QA/BioASQ/7B_golden.json"

logsPath = "qa_logs/"

Next, load the tokeniser and the train and test datasets:

tokenizer = ts.AutoTokenizer.from_pretrained(tokenizerPath)

trainDataset, tokenizedTrainDataset = load_and_preprocess_train_dataset(trainPath, 
                                                                        tokenizer,
                                                                        max_length=384, 
                                                                        stride=128)
                                                                        
testDataset = load_test_dataset(testPath)

Afterwards, train the model using the code below:

model = train(tokenizedTrainDataset,
              modelPath,
              tokenizer,
              learning_rate=3e-5,
              num_epochs=5,
              batch_size=16,
              training_args=None)

Please note that you can either use our pre-defined TrainingArguments or pass your own TrainingArguments to the training_args argument.
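
For instance, a custom configuration for the QA setup might look like this sketch (again, the output directory and values are only illustrative), passed to train via training_args=customArgs:

# An illustrative custom configuration; pass it via training_args=customArgs
customArgs = ts.TrainingArguments(
    output_dir="qa-checkpoints",        # illustrative checkpoint directory
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
)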

Finally, use the code below to make predictions on the test dataset and save them into a JSON file in the format expected by the evaluation script used in the BioASQ competition.

answersDict = evaluate(model,
                       tokenizer,
                       testDataset,
                       goldenPath,
                       logsPath,
                       top_k_predictions=5,
                       max_seq_len=384,
                       doc_stride=128)

Evaluation using BioASQ evaluation script

First, clone the BioASQ repository with the code below:

git clone https://github.com/BioASQ/Evaluation-Measures.git

Afterwards, use the following code for evaluation:

java -Xmx10G -cp $CLASSPATH:/FULL_PATH_TO_CLONED_REPO/Evaluation-Measures/flat/BioASQEvaluation/dist/BioASQEvaluation.jar evaluation.EvaluatorTask1b -phaseB -e 5 /FULL_PATH_TO_DOWNLOADED_DATASET/datasets/QA/BioASQ/7B_golden.json /FULL_PATH_TO_LOGS_FOLDER/qa_logs/prediction_7B_golden.json

Finally, you will get a result like the one below, in which the second to fourth numbers are the Strict Accuracy, Lenient Accuracy, and Mean Reciprocal Rank (MRR) scores, respectively.

1.0 0.2345679012345679 0.36419753086419754 0.28524397413286307 1.0 1.0 1.0 1.0 1.0 1.0
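
If you want to pick out those three scores programmatically, a small sketch like the one below can be used (it simply splits the printed line and reads the second to fourth values):

# Example output line from the BioASQ evaluation script
resultLine = "1.0 0.2345679012345679 0.36419753086419754 0.28524397413286307 1.0 1.0 1.0 1.0 1.0 1.0"

values = [float(v) for v in resultLine.split()]
strictAccuracy, lenientAccuracy, mrr = values[1:4]

print(f"Strict Accuracy: {strictAccuracy:.4f}")
print(f"Lenient Accuracy: {lenientAccuracy:.4f}")
print(f"MRR: {mrr:.4f}")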

Citation

@article{rohanian2023effectiveness,
  title={On the effectiveness of compact biomedical transformers},
  author={Rohanian, Omid and Nouriborji, Mohammadmahdi and Kouchaki, Samaneh and Clifton, David A},
  journal={Bioinformatics},
  volume={39},
  number={3},
  pages={btad103},
  year={2023},
  publisher={Oxford University Press}
}
