
Medical Concept Normalization in Clinical Trials with Drug and Disease Representation Learning

This repository contains supplementary materials for our papers "Medical Concept Normalization in Clinical Trials with Drug and Disease Representation Learning" (Bioinformatics) and "Drug and Disease Interpretation Learning with Biomedical Entity Representation Transformer" (ECIR 2021). We investigate the effectiveness of transferring concept normalization from the general biomedical domain to the clinical trials domain in a zero-shot setting, i.e., without labeled data in the target domain. We propose a simple and effective two-stage neural approach based on fine-tuned BERT architectures. In the first stage, we train a metric learning model that optimizes the relative similarity of mentions and concepts via a triplet loss. The model is trained on available labeled corpora of scientific abstracts to obtain vector embeddings of concept names and entity mentions. In the second stage, we find the concept name whose representation is closest to a given clinical mention in the embedding space. We evaluated several models, including state-of-the-art architectures, on a dataset of abstracts and a real-world dataset of trial records with interventions and conditions mapped to drug and disease terminologies. Extensive experiments validate the effectiveness of our approach in knowledge transfer from the scientific literature to clinical trials.
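For illustration, the snippet below is a minimal sketch of the second (ranking) stage under simple assumptions: mentions and concept names are encoded with a BERT model via mean pooling, and the concept with the highest cosine similarity to the mention is returned. The model identifier, pooling strategy, and example strings are illustrative only; the actual ranking logic lives in eval_bert_ranking.py and may differ.

# Sketch of the ranking stage: embed concept names and a mention with a
# BERT encoder, then pick the nearest concept by cosine similarity.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "dmis-lab/biobert-base-cased-v1.1"  # or a fine-tuned DILBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def embed(texts):
    # Mean-pool the last hidden states over non-padding tokens (assumed pooling).
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)           # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # (B, H)

concepts = ["Hypertensive disease", "Diabetes mellitus", "Myocardial infarction"]
mention = "high blood pressure"

concept_emb = torch.nn.functional.normalize(embed(concepts), dim=-1)
mention_emb = torch.nn.functional.normalize(embed([mention]), dim=-1)
scores = mention_emb @ concept_emb.T                       # cosine similarities
print(concepts[scores.argmax().item()])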

Evaluation & Results

Table 1. Out-of-domain performance of the proposed DILBERT model and baselines in terms of Acc@1 on the filtered test set of clinical trials (CT).

| Model | CT Condition, single concept | CT Condition, full set | CT Intervention, single concept | CT Intervention, full set |
|---|---|---|---|---|
| BioBERT ranking | 72.6 | 71.74 | 77.83 | 56.97 |
| BioSyn | 86.36 | - | 79.58 | - |
| DILBERT, random sampling | 85.73 | 84.85 | 82.54 | 81.16 |
| DILBERT, random + 2 parents | 86.74 | 86.36 | 81.84 | 79.14 |
| DILBERT, random + 5 parents | 87.12 | 86.74 | 81.67 | 79.14 |
| DILBERT, resampling | 85.22 | 84.63 | 81.67 | 80.21 |
| DILBERT, resampling + 5 siblings | 84.84 | 84.26 | 80.62 | 76.16 |

Table 2. In-domain performance of the proposed DILBERT model in terms of Acc@1 on the refined test set of the BioCreative V CDR corpus. For more details about the refined CDR corpus, please see our paper "Fair evaluation in concept normalization: a large-scale comparative analysis for BERT-based models".

| Model | CDR Disease | CDR Chemical |
|---|---|---|
| BioBERT ranking | 66.4 | 80.7 |
| BioSyn | 74.1 | 83.8 |
| DILBERT, random sampling | 75.5 | 81.4 |
| DILBERT, random + 2 parents | 75.0 | 81.2 |
| DILBERT, random + 5 parents | 73.5 | 81.4 |
| DILBERT, resampling | 75.8 | 83.3 |
| DILBERT, resampling + 5 siblings | 75.3 | 82.1 |

Figure 1. In-domain performance of the proposed DILBERT model in terms of Acc@1 on the refined test set of the BioCreative V CDR corpus using reduced dictionaries.

Requirements

$ pip install -r requirements.txt

Resources

Pretrained Model

We use the Hugging Face version of BioBERT v1.1 so that the pretrained model can be run with the PyTorch framework.
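The scripts below expect a local BERT checkpoint (--path_to_bert_model / --model_dir). One way to obtain it, assuming the standard Hugging Face release of BioBERT v1.1, is sketched here:

# Download BioBERT v1.1 from the Hugging Face hub and save it to a local
# directory that the training and evaluation scripts can point to.
from transformers import AutoTokenizer, AutoModel

name = "dmis-lab/biobert-base-cased-v1.1"
AutoTokenizer.from_pretrained(name).save_pretrained("biobert-v1.1")
AutoModel.from_pretrained(name).save_pretrained("biobert-v1.1")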

Datasets

We have made all datasets available.

Run

To run the full training and evaluation pipeline, use the run.sh script:

$ ./run.sh

Generating triplets

$ python data_utils/convert_to_triplet_dataset.py --input_data path/to/labeled/files \
                                     --vocab path/to/vocabulary \
                                     --save_to path/to/save/triplets/file \
                                     --path_to_bert_model path/to/bert/model \
                                     --hard \
                                     --hierarchy path/to/hierarchy/file \
                                     --hierarchy_aware
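As a rough illustration of the --hierarchy_aware idea, the hypothetical helper below samples negative concept names either at random or from the parents of the gold concept in the terminology hierarchy (cf. the "random + N parents" strategies in Table 1). The data structures and the output format of convert_to_triplet_dataset.py are assumptions, not its actual interface.

# Hypothetical sketch of hierarchy-aware negative sampling for triplets.
import random

def make_triplets(mention, gold_cui, vocab, hierarchy, n_random=5, n_parents=2):
    """vocab: {cui: concept name}; hierarchy: {cui: [parent cuis]} (assumed shapes)."""
    positive = vocab[gold_cui]
    negatives = []
    # Parent concepts act as "hard" negatives: close in the hierarchy, wrong label.
    for parent in hierarchy.get(gold_cui, [])[:n_parents]:
        if parent in vocab:
            negatives.append(vocab[parent])
    # Pad with random negatives drawn from the rest of the vocabulary.
    candidates = [name for cui, name in vocab.items() if cui != gold_cui]
    negatives += random.sample(candidates, k=min(n_random, len(candidates)))
    # Each triplet pairs the mention with the gold name and one negative name.
    return [(mention, positive, neg) for neg in negatives]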

Train

$ python train_sentence_bert.py --path_to_bert_model path/to/bert/model \
                                --data_folder path/to/folder/containing/triplet/file  \
                                --triplets_file triplet_file_name \
                                --output_dir path/to/save/model
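train_sentence_bert.py fine-tunes the encoder with a triplet objective. A minimal sketch of the same idea using the sentence-transformers library is shown below; the pooling choice, hyperparameters, and triplet format are assumptions, not the exact settings of the script.

# Sketch of metric learning with a triplet loss via sentence-transformers.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Wrap a BERT checkpoint as a sentence encoder with mean pooling (assumption).
word = models.Transformer("path/to/bert/model", max_seq_length=64)
pooling = models.Pooling(word.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word, pooling])

# Each triplet: (mention, positive concept name, negative concept name).
triplets = [
    InputExample(texts=["high blood pressure", "Hypertensive disease", "Diabetes mellitus"]),
]
loader = DataLoader(triplets, shuffle=True, batch_size=16)
loss = losses.TripletLoss(model=model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100,
          output_path="path/to/save/model")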

Evaluation

To evaluate the model, run the following command:

$ python eval_bert_ranking.py --model_dir path/to/bert/model \
                         --data_folder path/to/labeled/files \
                         --vocab path/to/vocabulary

Citing & Authors

Miftahutdinov Z., Kadurin A., Kudrin R., Tutubalina E. Drug and Disease Interpretation Learning with Biomedical Entity Representation Transformer // Advances in Information Retrieval. – 2021. – pp. 451-466. paper, preprint

@InProceedings{10.1007/978-3-030-72113-8_30,
 author="Miftahutdinov, Zulfat and Kadurin, Artur and Kudrin, Roman and Tutubalina, Elena",
 title="Drug and Disease Interpretation Learning with Biomedical Entity Representation Transformer",
 booktitle="Advances in  Information Retrieval",
 year="2021",
 publisher="Springer International Publishing",
 address="Cham",
 pages="451--466",
 isbn="978-3-030-72113-8"
}

Miftahutdinov Z., Kadurin A., Kudrin R., Tutubalina E. Medical concept normalization in clinical trials with drug and disease representation learning // Bioinformatics. – 2021. – Vol. 37. – No. 21. – pp. 3856-3864. paper

@article{10.1093/bioinformatics/btab474,
    author = {Miftahutdinov, Zulfat and Kadurin, Artur and Kudrin, Roman and Tutubalina, Elena},
    title = "{Medical concept normalization in clinical trials with drug and disease representation learning}",
    journal = {Bioinformatics},
    volume = {37},
    number = {21},
    pages = {3856-3864},
    year = {2021},
    month = {07},
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btab474},
    url = {https://doi.org/10.1093/bioinformatics/btab474},
    eprint = {https://academic.oup.com/bioinformatics/article-pdf/37/21/3856/41091512/btab474.pdf},
}

Tutubalina E., Kadurin A., Miftahutdinov Z. Fair evaluation in concept normalization: a large-scale comparative analysis for BERT-based models // Proceedings of the 28th International Conference on Computational Linguistics. – 2020. – pp. 6710-6716. paper, git

@inproceedings{tutubalina2020fair,
  title={Fair evaluation in concept normalization: a large-scale comparative analysis for BERT-based models},
  author={Tutubalina, Elena and Kadurin, Artur and Miftahutdinov, Zulfat},
  booktitle={Proceedings of the 28th International Conference on Computational Linguistics},
  pages={6710--6716},
  year={2020}
}
