ELECTRAMed

A new pre-trained language representation model for biomedical NLP

Table of Contents

  1. About ELECTRAMed
  2. Download
  3. Datasets
  4. Models
  5. License
  6. Contacts
  7. Acknowledgements

About ELECTRAMed

Motivation
The overwhelming amount of biomedical scientific texts calls for the development of effective language models able to tackle a wide range of biomedical natural language processing (NLP) tasks. The most recent dominant approaches are domain-specific models, initialized with general-domain textual data and then trained on a variety of scientific corpora. However, it has been observed that for specialized domains in which large corpora exist, training a model from scratch with just in-domain knowledge may yield better results. Moreover, the increasing focus on the compute costs for pre-training recently led to the design of more efficient architectures, such as ELECTRA.

Results
We propose a pre-trained domain-specific language model, called ELECTRAMed, suited for the biomedical field. The novel approach inherits the learning framework of the general-domain ELECTRA architecture, as well as its computational advantages. Experiments performed on benchmark datasets for several biomedical NLP tasks support the usefulness of ELECTRAMed, which sets the novel state-of-the-art result on the BC5CDR corpus for named entity recognition, and provides the best outcome in 2 of the 5 runs of the 7th BioASQ-factoid Challenge for the question answering task.

Built With

  • TensorFlow 1.15.3 (pre-training)
  • TensorFlow 2.3.0 (fine-tuning)
  • Hugging Face Transformers 3.4.0

Download

Pre-training was performed leveraging the original ELECTRA code provided by Google Research.
The corpus used for pre-training was published by Peng et al., 2019 and consists of 28,714,373 PubMed abstracts (approximately 26 GB uncompressed), covering the entirety of PubMed up to 2018.
It is available for download in its preprocessed version here.
For more information about how the pre-training was performed, refer to the paper.

Currently, there is only one version of the model available. Its weights, vocabulary and config files can be downloaded from the Hugging Face model repository.
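Once the files are on the Hugging Face model hub, the checkpoint can be loaded directly with the Transformers library. The snippet below is a minimal sketch; the model identifier is a placeholder and should be replaced with the actual repository name on the hub.

```python
# Minimal sketch of loading ELECTRAMed with Hugging Face Transformers.
# MODEL_ID is a placeholder: substitute the actual repository name
# published on the Hugging Face model hub.
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "electramed-base"  # placeholder, not the real hub identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

# Encode a biomedical sentence and obtain contextual embeddings.
inputs = tokenizer("Aspirin inhibits platelet aggregation.", return_tensors="pt")
last_hidden_state = model(**inputs)[0]  # indexing works for tuple and ModelOutput returns
print(last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```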

Datasets

ELECTRAMed was fine-tuned and tested on three biomedical NLP tasks: named entity recognition (NER), relation extraction (RE) and question answering (QA).
The datasets used are detailed in the paper and are made available in electramed/data.
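As an illustration of how a fine-tuning run on one of these tasks could be set up, the sketch below builds a token-classification (NER) head on top of the pre-trained encoder. It is not the training script used in the paper: the model identifier, the BC5CDR-style label set and the toy example are assumptions for demonstration only.

```python
# Sketch of an NER fine-tuning setup with Hugging Face Transformers.
# Model ID, label set and data handling are illustrative assumptions.
import torch
from transformers import AutoTokenizer, ElectraForTokenClassification

MODEL_ID = "electramed-base"  # placeholder, not the real hub identifier
labels = ["O", "B-Chemical", "I-Chemical", "B-Disease", "I-Disease"]  # BC5CDR-style tags (assumed)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = ElectraForTokenClassification.from_pretrained(MODEL_ID, num_labels=len(labels))

# One toy example; a real run would read the corpus files from electramed/data
# and align word-level tags to sub-word tokens via the tokenizer's word ids.
words = ["Cisplatin", "induced", "nephrotoxicity", "."]
tags = [1, 0, 3, 0]  # indices into `labels`

encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Naive alignment that assumes one sub-word per word; special tokens and any
# extra sub-words are masked with -100 so they are ignored by the loss.
seq_len = encoding["input_ids"].shape[1]
label_ids = [-100] + tags + [-100] * (seq_len - len(tags) - 1)

outputs = model(**encoding, labels=torch.tensor([label_ids]))
loss = outputs[0]
loss.backward()  # an optimizer step / Trainer loop would follow in a real run
```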

Models

We will soon publish the fine-tuned models for NER, RE and QA.

License

Distributed under the MIT License. See LICENSE for more information.

Contacts

Giacomo Miolo - giacomo.miolo@mip.polimi.it
Giulio Mantoan - giulio.mantoan@mip.polimi.it
Carlotta Orsenigo - carlotta.orsenigo@polimi.it

Acknowledgements

We thank Dr. Sandra Coecke from the Joint Research Centre of the European Commission and Dr. Anna Beronius from Karolinska Institutet for the valuable and fruitful discussions that fostered a positive and encouraging environment and greatly contributed to the development of our work.
We also thank the authors of ELECTRA, SciBERT and BlueBERT for making their data and code publicly available.
