Fair Evaluation in Concept Normalization: a Large-scale Comparative Analysis for BERT-based Models

This repository contains additional materials for our paper "Fair Evaluation in Concept Normalization: a Large-scale Comparative Analysis for BERT-based Models". In biomedical research and healthcare, the entity linking problem is known as medical concept normalization (MCN). In this work, we perform a comparative evaluation of various benchmarks and study the efficiency of BERT-based models for linking three entity types across three domains: research abstracts, drug labels, and user-generated texts on drug therapy in English.

Evaluation & Results

Given the predefined train/test splits of six biomedical datasets, we found that approximately 78% of entity mentions in the test sets are textual duplicates of other mentions in the same test set or of mentions present in the train+dev sets. To obtain more realistic results, we present refined test sets without duplicates or exact overlaps (see Table 1). Please refer to our paper for details.
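
The refinement procedure is implemented in process_data.py (see the Preprocess section below). As a rough sketch, the share of exact textual overlaps can be estimated as follows; the file names and the one-lowercased-mention-per-line format are illustrative assumptions, not the repository's actual data format:

```python
# Sketch: estimate how many test mentions are exact textual duplicates of
# mentions seen in train/dev, or of other mentions in the test set itself.
# Assumes one entity mention per line in each file (hypothetical format).

def load_mentions(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip().lower() for line in f if line.strip()]

train_dev = set(load_mentions("train_mentions.txt")) | set(load_mentions("dev_mentions.txt"))
test = load_mentions("test_mentions.txt")

seen, duplicates = set(), 0
for mention in test:
    # A test mention counts as a duplicate if it already occurred in
    # train/dev or earlier in the test set.
    if mention in train_dev or mention in seen:
        duplicates += 1
    seen.add(mention)

print(f"{duplicates / len(test):.1%} of test mentions are exact duplicates")
```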

Single-terminology Evaluation on Refined Test Sets

Table 1

This table presents summary statistics of the corpora used in this study.

|  | NCBI Disease | BC5CDR Disease | BC5CDR Chem | BC2GN Gene | TAC 2017 ADR | SMM4H 2017 ADR |
| --- | --- | --- | --- | --- | --- | --- |
| domain | abstracts | abstracts | abstracts | abstracts | drug labels | tweets |
| entity type | disease | disease | chemicals | genes | ADRs | ADRs |
| terminology | MEDIC | MEDIC | CTD Chem | Entrez Gene | MedDRA | MedDRA |
| *number of pre-processed entity mentions* |  |  |  |  |  |  |
| full corpus | 6881 | 12850 | 15935 | 5712 | 13381 | 9150 |
| avg. length in chars | 20.37 | 14.88 | 11.27 | 8.35 | 17.28 | 11.69 |
| % with numerals | 5.74% | 0.11% | 7.32% | 62.46% | 1.62% | 2.52% |
| train set | 5134 | 4182 | 5203 | 2725 | 7038 | 6650 |
| dev set | 787 | 4244 | 5347 | - | - | - |
| test set | 960 | 4424 | 5385 | 2987 | 6343 | 2500 |
| refined test | 204 (21.2%) | 657 (14.9%) | 425 (7.9%) | 985 (32.9%) | 1,544 (24.3%) | 831 (33.3%) |
| *number of concepts* |  |  |  |  |  |  |
| train set \|T_1\| | 668 | 968 | 922 | 556 | 1517 | 472 |
| test set \|T_2\| | 203 | 669 | 617 | 670 | 1323 | 254 |
| refined test \|T_3\| | 140 | 438 | 351 | 642 | 857 | 201 |
| \|T_1 ∩ T_2\| | 136 | 457 | 368 | 55 | 867 | 218 |
| \|T_1 ∩ T_3\| | 76 | 226 | 102 | 27 | 401 | 165 |
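
The percentages in the "refined test" row are relative to the corresponding full test set; for example, for NCBI Disease the refined test set keeps 204 of the 960 official test mentions (≈21%).
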
Figure 1

This figure shows the differences in evaluation metrics between the full and refined test sets for the BioSyn and BERT ranking approaches.

Cross-terminology Evaluation on Refined Test Sets

Tables 2 & 3

Tables 2 and 3 report metrics for the cross-terminology evaluation mode. Table 2 shows the performance of BioSyn on the refined test sets in terms of accuracy@1; rows correspond to test sets, columns to training sets, and in-domain results lie on the diagonal.

Table 2

| Test set \ Train set | NCBI Dis | BC5CDR Dis | BC5CDR Chem | TAC ADR | BC2GN Gene | SMM4H ADR |
| --- | --- | --- | --- | --- | --- | --- |
| NCBI | 72.5 | 67.6 | 64.7 | 67.6 | 67.2 | 48.5 |
| CDR Dis | 74.7 | 74.1 | 73.4 | 74.9 | 73.1 | 58.3 |
| CDR Chem | 82.4 | 84.2 | 83.8 | 82.4 | 82.6 | 73.9 |
| TAC ADR | 74.3 | 77.5 | 70.1 | 83.2 | 69.9 | 51.5 |
| BC2GN | 83.1 | 81.7 | 83.7 | 82.6 | 85.8 | 73.2 |
| SMM4H ADR | 27.3 | 35.6 | 24.8 | 30.1 | 21.9 | 60.5 |

Table 3

Table 3 shows, for each test set (row), the difference in accuracy@1 between a model trained on the corpus in the given column and the in-domain model; in-domain accuracy is shown on the diagonal. For example, on the NCBI Disease test set, the model trained on BC5CDR Disease scores 67.6, i.e., 4.9 points below the in-domain result of 72.5.

| Test set \ Train set | NCBI Dis | BC5CDR Dis | BC5CDR Chem | BC2GN Gene | TAC ADR | SMM4H ADR |
| --- | --- | --- | --- | --- | --- | --- |
| NCBI Disease | 72.5 | -4.9 | -7.8 | -5.4 | -4.9 | -24.0 |
| BC5CDR Dis | +0.6 | 74.1 | -0.8 | -1.1 | +0.8 | -15.8 |
| BC5CDR Chem | -1.4 | +0.5 | 83.8 | -1.2 | -1.4 | -9.9 |
| BC2GN Gene | -2.6 | -4.1 | -2.1 | 85.8 | -3.1 | -12.6 |
| TAC ADR | -8.9 | -5.7 | -13.0 | -13.3 | 83.2 | -31.7 |
| SMM4H ADR | -33.2 | -24.9 | -35.7 | -38.6 | -30.4 | 60.5 |

Conclusion

We have presented the first comparative evaluation of medical concept normalization (MCN) datasets, studying the NCBI Disease, BC5CDR Disease & Chemical, BC2GN Gene, TAC 2017 ADR, and SMM4H 2017 ADR corpora. We performed an extensive evaluation of two BERT-based models on six datasets in two setups: with the official train/test splits and with the proposed refined test sets of entity mentions. Our evaluation shows a large divergence in performance between these two test sets, with an average accuracy difference of 15% for the state-of-the-art model BioSyn. We also performed a quantitative evaluation of BioSyn in the cross-terminology MCN task, where models were trained and evaluated on entity mentions of different types linked to concepts from different terminologies. Knowledge transfer is effective between diseases, chemicals, and genes, with an average drop of 2.53% accuracy on the NCBI, BC5CDR, and BC2GN sets. For the TAC and SMM4H sets with ADRs from drug labels and social media, BioSyn models trained on the four other corpora show a substantial decrease in performance (-10.2% and -33.1% accuracy, respectively) compared to in-domain models. To our surprise, these models still outperform the straightforward ranking baseline over BioBERT representations. We believe that refined datasets with cross-terminology evaluation can serve as a step toward reliable and large-scale evaluation of biomedical IE models.

Requirements

$ pip install -r requirements.txt

Resources

Pretrained Model

We use the Hugging Face version of BioBERT v1.1 so that the pretrained model can be run with the PyTorch framework.
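
For illustration, the checkpoint can be loaded with the transformers library; the local path below mirrors the one used in the evaluation command further down and is only an example:

```python
from transformers import AutoModel, AutoTokenizer

# Path to the PyTorch-converted BioBERT v1.1 checkpoint (example location);
# a Hugging Face model ID such as "dmis-lab/biobert-v1.1" should also work.
model_dir = "/data/pretrained_models/biobert_v1.1_pubmed_pytorch/"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModel.from_pretrained(model_dir)
```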

Datasets

We use the same datasets and preprocessing procedures as BioSyn, with the addition of the SMM4H 2017 dataset. All datasets are made available except TAC 2017 ADR, which cannot be shared due to licensing restrictions; for that corpus we provide the preprocessing scripts instead.

Preprocess

To obtain a refined test set from an official test set, simply run:

$ python process_data.py --train_data_folder /data/ncbi/processed_train \
                         --test_data_folder /data/ncbi/processed_test \
                         --save_to /data/ncbi/processed_test_refined

Train

To train BioSyn models, follow the instructions in the original BioSyn repository. The BERT ranking baseline does not require any training.

Evaluation

To evaluate trained BioSyn models, follow the instructions in the original BioSyn repository. To evaluate the BERT ranking baseline, run:

$ python eval_bert_ranking.py --model_dir /data/pretrained_models/biobert_v1.1_pubmed_pytorch/ \
                              --data_folder /data/ncbi/processed_test \
                              --vocab /data/ncbi/test_dictionary.txt
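
For reference, here is a minimal sketch of such a BERT-based ranking baseline: mentions and dictionary concept names are embedded with BioBERT and ranked by cosine similarity. The pooling choice ([CLS]) and the toy dictionary are assumptions; see eval_bert_ranking.py for the actual implementation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_dir = "/data/pretrained_models/biobert_v1.1_pubmed_pytorch/"  # example path
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModel.from_pretrained(model_dir).eval()

@torch.no_grad()
def embed(texts):
    # Use the [CLS] token embedding as a dense representation (pooling choice is an assumption).
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return model(**batch).last_hidden_state[:, 0]

# Hypothetical inputs: dictionary entries are (concept_id, concept_name) pairs.
dictionary = [("D003924", "type 2 diabetes mellitus"), ("D006973", "hypertension")]
mentions = ["high blood pressure"]

concept_vecs = torch.nn.functional.normalize(embed([name for _, name in dictionary]), dim=-1)
mention_vecs = torch.nn.functional.normalize(embed(mentions), dim=-1)

# Rank concepts by cosine similarity; accuracy@1 checks the top-ranked concept.
scores = mention_vecs @ concept_vecs.T
top1 = scores.argmax(dim=-1)
for mention, idx in zip(mentions, top1.tolist()):
    print(mention, "->", dictionary[idx][0])
```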

Citing & Authors

Tutubalina E., Kadurin A., Miftahutdinov Z. Fair Evaluation in Concept Normalization: a Large-scale Comparative Analysis for BERT-based Models // Proceedings of the 28th International Conference on Computational Linguistics. 2020. pp. 6710-6716. (https://www.aclweb.org/anthology/2020.coling-main.588)

BibTex

@inproceedings{tutubalina-etal-2020-fair,
    title = "Fair Evaluation in Concept Normalization: a Large-scale Comparative Analysis for {BERT}-based Models",
    author = "Tutubalina, Elena  and
      Kadurin, Artur  and
      Miftahutdinov, Zulfat",
    booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
    month = dec,
    year = "2020",
    address = "Barcelona, Spain (Online)",
    publisher = "International Committee on Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.coling-main.588",
    pages = "6710--6716",
 }
