
Medical Concept Normalization in Clinical Trials with Drug and Disease Representation Learning

This repository contains supplementary materials for our papers "Medical Concept Normalization in Clinical Trials with Drug and Disease Representation Learning" (Bioinformatics) and "Drug and Disease Interpretation Learning with Biomedical Entity Representation Transformer" (ECIR 2021). We investigate the effectiveness of transferring concept normalization from the general biomedical domain to the clinical trials domain in a zero-shot setting, i.e., without labeled data in the target domain. We propose a simple and effective two-stage neural approach based on fine-tuned BERT architectures. In the first stage, we train a metric learning model that optimizes the relative similarity of mentions and concepts via a triplet loss. The model is trained on available labeled corpora of scientific abstracts to obtain vector embeddings of concept names and entity mentions. In the second stage, we find the concept name whose representation is closest to a given clinical mention in the embedding space. We evaluated several models, including state-of-the-art architectures, on a dataset of abstracts and a real-world dataset of trial records with interventions and conditions mapped to drug and disease terminologies. Extensive experiments validate the effectiveness of our approach in knowledge transfer from the scientific literature to clinical trials.
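For illustration, the snippet below is a minimal sketch of the second (ranking) stage under simple assumptions: mentions and concept names are encoded with a BERT model via mean pooling, and the concept with the highest cosine similarity to the mention is returned. The model identifier, pooling strategy, and example strings are illustrative only; the actual ranking logic lives in eval_bert_ranking.py and may differ.

# Sketch of the ranking stage: embed concept names and a mention with a
# BERT encoder, then pick the nearest concept by cosine similarity.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "dmis-lab/biobert-base-cased-v1.1"  # or a fine-tuned DILBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def embed(texts):
    # Mean-pool the last hidden states over non-padding tokens (assumed pooling).
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)           # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # (B, H)

concepts = ["Hypertensive disease", "Diabetes mellitus", "Myocardial infarction"]
mention = "high blood pressure"

concept_emb = torch.nn.functional.normalize(embed(concepts), dim=-1)
mention_emb = torch.nn.functional.normalize(embed([mention]), dim=-1)
scores = mention_emb @ concept_emb.T                       # cosine similarities
print(concepts[scores.argmax().item()])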

Evaluation & Results

Table 1. Out-of-domain performance of the proposed DILBERT model and baselines in terms of Acc@1 on the filtered test set of clinical trials (CT).

| Model | CT Condition, single concept | CT Condition, full set | CT Intervention, single concept | CT Intervention, full set |
|---|---|---|---|---|
| BioBERT ranking | 72.6 | 71.74 | 77.83 | 56.97 |
| BioSyn | 86.36 | - | 79.58 | - |
| DILBERT, random sampling | 85.73 | 84.85 | 82.54 | 81.16 |
| DILBERT, random + 2 parents | 86.74 | 86.36 | 81.84 | 79.14 |
| DILBERT, random + 5 parents | 87.12 | 86.74 | 81.67 | 79.14 |
| DILBERT, resampling | 85.22 | 84.63 | 81.67 | 80.21 |
| DILBERT, resampling + 5 siblings | 84.84 | 84.26 | 80.62 | 76.16 |

Table 2. In-domain performance of the proposed DILBERT model in terms of Acc@1 on the refined test set of the BioCreative V CDR corpus. For more details about the refined CDR corpus, please see our paper "Fair evaluation in concept normalization: a large-scale comparative analysis for BERT-based models".

| Model | CDR Disease | CDR Chemical |
|---|---|---|
| BioBERT ranking | 66.4 | 80.7 |
| BioSyn | 74.1 | 83.8 |
| DILBERT, random sampling | 75.5 | 81.4 |
| DILBERT, random + 2 parents | 75.0 | 81.2 |
| DILBERT, random + 5 parents | 73.5 | 81.4 |
| DILBERT, resampling | 75.8 | 83.3 |
| DILBERT, resampling + 5 siblings | 75.3 | 82.1 |

Figure 1. In-domain performance of the proposed DILBERT model in terms of Acc@1 on the refined test set of the BioCreative V CDR corpus using reduced dictionaries.

Requirements

$ pip install -r requirements.txt

Resources

Pretrained Model

We use the Hugging Face version of BioBERT v1.1 so that the pretrained model can be run with the PyTorch framework.
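The scripts below expect a local BERT checkpoint (--path_to_bert_model / --model_dir). One way to obtain it, assuming the standard Hugging Face release of BioBERT v1.1, is sketched here:

# Download BioBERT v1.1 from the Hugging Face hub and save it to a local
# directory that the training and evaluation scripts can point to.
from transformers import AutoTokenizer, AutoModel

name = "dmis-lab/biobert-base-cased-v1.1"
AutoTokenizer.from_pretrained(name).save_pretrained("biobert-v1.1")
AutoModel.from_pretrained(name).save_pretrained("biobert-v1.1")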

Datasets

We have made all datasets available.

Run

To run the full training and evaluation pipeline, use the run.sh script:

$ ./run.sh

Generating triplets

$ python data_utils/convert_to_triplet_dataset.py --input_data path/to/labeled/files \
                                     --vocab path/to/vocabulary \
                                     --save_to path/to/save/triplets/file \
                                     --path_to_bert_model path/to/bert/model \
                                     --hard \
                                     --hierarchy path/to/hierarchy/file \
                                     --hierarchy_aware
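As a rough illustration of the --hierarchy_aware idea, the hypothetical helper below samples negative concept names either at random or from the parents of the gold concept in the terminology hierarchy (cf. the "random + N parents" strategies in Table 1). The data structures and the output format of convert_to_triplet_dataset.py are assumptions, not its actual interface.

# Hypothetical sketch of hierarchy-aware negative sampling for triplets.
import random

def make_triplets(mention, gold_cui, vocab, hierarchy, n_random=5, n_parents=2):
    """vocab: {cui: concept name}; hierarchy: {cui: [parent cuis]} (assumed shapes)."""
    positive = vocab[gold_cui]
    negatives = []
    # Parent concepts act as "hard" negatives: close in the hierarchy, wrong label.
    for parent in hierarchy.get(gold_cui, [])[:n_parents]:
        if parent in vocab:
            negatives.append(vocab[parent])
    # Pad with random negatives drawn from the rest of the vocabulary.
    candidates = [name for cui, name in vocab.items() if cui != gold_cui]
    negatives += random.sample(candidates, k=min(n_random, len(candidates)))
    # Each triplet pairs the mention with the gold name and one negative name.
    return [(mention, positive, neg) for neg in negatives]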

Train

$ python train_sentence_bert.py --path_to_bert_model path/to/bert/model \
                                --data_folder path/to/folder/containing/triplet/file  \
                                --triplets_file triplet_file_name \
                                --output_dir path/to/save/model
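train_sentence_bert.py fine-tunes the encoder with a triplet objective. A minimal sketch of the same idea using the sentence-transformers library is shown below; the pooling choice, hyperparameters, and triplet format are assumptions, not the exact settings of the script.

# Sketch of metric learning with a triplet loss via sentence-transformers.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Wrap a BERT checkpoint as a sentence encoder with mean pooling (assumption).
word = models.Transformer("path/to/bert/model", max_seq_length=64)
pooling = models.Pooling(word.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word, pooling])

# Each triplet: (mention, positive concept name, negative concept name).
triplets = [
    InputExample(texts=["high blood pressure", "Hypertensive disease", "Diabetes mellitus"]),
]
loader = DataLoader(triplets, shuffle=True, batch_size=16)
loss = losses.TripletLoss(model=model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100,
          output_path="path/to/save/model")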

Evaluation

To evaluate the model, run the following command:

$ python eval_bert_ranking.py --model_dir path/to/bert/model \
                         --data_folder path/to/labeled/files \
                         --vocab path/to/vocabulary

Citing & Authors

Miftahutdinov Z., Kadurin A., Kudrin R., Tutubalina E. Drug and Disease Interpretation Learning with Biomedical Entity Representation Transformer // Advances in Information Retrieval. – 2021. – pp. 451-466. paper, preprint

@InProceedings{10.1007/978-3-030-72113-8_30,
 author="Miftahutdinov, Zulfat and Kadurin, Artur and Kudrin, Roman and Tutubalina, Elena",
 title="Drug and Disease Interpretation Learning with Biomedical Entity Representation Transformer",
 booktitle="Advances in  Information Retrieval",
 year="2021",
 publisher="Springer International Publishing",
 address="Cham",
 pages="451--466",
 isbn="978-3-030-72113-8"
}

Miftahutdinov Z., Kadurin A., Kudrin R., Tutubalina E. Medical concept normalization in clinical trials with drug and disease representation learning // Bioinformatics. – 2021. – Vol. 37. – No. 21. – pp. 3856-3864. paper

@article{10.1093/bioinformatics/btab474,
    author = {Miftahutdinov, Zulfat and Kadurin, Artur and Kudrin, Roman and Tutubalina, Elena},
    title = "{Medical concept normalization in clinical trials with drug and disease representation learning}",
    journal = {Bioinformatics},
    volume = {37},
    number = {21},
    pages = {3856-3864},
    year = {2021},
    month = {07},
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btab474},
    url = {https://doi.org/10.1093/bioinformatics/btab474},
    eprint = {https://academic.oup.com/bioinformatics/article-pdf/37/21/3856/41091512/btab474.pdf},
}

Tutubalina E., Kadurin A., Miftahutdinov Z. Fair evaluation in concept normalization: a large-scale comparative analysis for BERT-based models // Proceedings of the 28th International Conference on Computational Linguistics. – 2020. – pp. 6710-6716. paper, git

@inproceedings{tutubalina2020fair,
  title={Fair evaluation in concept normalization: a large-scale comparative analysis for BERT-based models},
  author={Tutubalina, Elena and Kadurin, Artur and Miftahutdinov, Zulfat},
  booktitle={Proceedings of the 28th International Conference on Computational Linguistics},
  pages={6710--6716},
  year={2020}
}
