Dyport is a novel benchmarking framework for evaluating biomedical hypothesis generation systems. Using curated datasets, our approach tests these systems under realistic conditions, enhancing the relevance of the evaluations. We integrate knowledge from the curated databases into a dynamic graph, accompanied by a method to quantify discovery importance. This assesses not only the accuracy of hypotheses but also their potential impact on biomedical research, which significantly extends traditional link prediction benchmarks. We demonstrate the applicability of our benchmarking process on several link prediction systems applied to biomedical semantic knowledge graphs. Our benchmarking system is flexible and designed for broad application in hypothesis generation quality verification, aiming to expand the scope of scientific discovery within the biomedical research community.
Arxiv preprint link: https://arxiv.org/abs/2312.03303
Full benchmark data is available here: Google Drive
The data is represented as `.csv` tables with pairwise concept associations stratified by the year of their first occurrence in the literature. The tables contain both positive and negative connections. Each `.csv` file has the following structure:
| | source | target | source_st | target_st | origin_db | mentions_diff | sem_scholar_cit | ig | jac_2nd_train | eig_cent_diff | betw_cent | IMP_manh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 42241 | C4078312__acalabrutinib | C4079830__venetoclax | Pharmacologic Substance; Organic Chemical | Pharmacologic Substance; Organic Chemical | rxnav | 62 | 975.0 | 0.074084 | 0.077561 | 0.000203 | 0.000019 | 0.956588 |
| 22993 | C2353951__dapagliflozin | C4524093__Euglycaemic diabetic ketoacidosis | Pharmacologic Substance; Organic Chemical | Disease or Syndrome | drugcentral | 30 | 427.0 | 0.103463 | 0.041605 | 0.000261 | 0.000050 | 0.965584 |
| 8080 | C1335212__PIK3CA gene | C4049141__copanlisib | Gene or Genome | Pharmacologic Substance; Organic Chemical | drugcentral; kegg | 17 | 1066.0 | 0.067049 | 0.025928 | 0.000923 | 0.000018 | 0.974659 |
where:

- `source` and `target` columns represent UMLS CUI terms and their corresponding preferred names (separated with `__`);
- `source_st` and `target_st` columns contain their semantic types;
- `origin_db` shows which biomedical database a particular connection came from.

The next columns represent the importance components:

- `mentions_diff`: number of times a particular association was mentioned in the literature;
- `sem_scholar_cit`: number of Semantic Scholar citations;
- `ig`: Integrated Gradients attribution score;
- `jac_2nd_train`: 2nd-order Jaccard similarity;
- `eig_cent_diff`: eigenvector centrality difference;
- `betw_cent`: betweenness centrality.

The last column, `IMP_manh`, is the final combined importance score (a number between 0 and 1).
Note: the importance components described above are the ones we reported in our paper. If you want to use your own custom component(s), you can calculate them for every connection (alongside the existing components) and simply add them to the list. `IMP_manh` should then be recalculated to obtain the final combined importance score for every connection. Please refer to the illustrated notebook, section "Calculating merged importance metric".
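As a minimal sketch of that recalculation step, assuming the combined score is a Manhattan-style (L1) aggregation of min-max-normalized components scaled to [0, 1] (the exact formula is defined in the notebook; the function names and the `my_custom_metric` component here are illustrative):

```python
def minmax(values):
    """Min-max normalize a list of raw component values to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def combined_importance(components):
    """components: dict mapping component name -> list of raw values
    (one value per connection). Normalizes each component, then takes
    the per-connection mean of the normalized values, i.e. an L1-style
    aggregation rescaled to [0, 1]."""
    normed = [minmax(vals) for vals in components.values()]
    n = len(normed)
    return [sum(row) / n for row in zip(*normed)]

# Toy example: two existing components plus one hypothetical custom one.
scores = combined_importance({
    "mentions_diff": [62, 30, 17],
    "sem_scholar_cit": [975.0, 427.0, 1066.0],
    "my_custom_metric": [0.1, 0.5, 0.9],  # user-added component
})
print([round(s, 3) for s in scores])  # one score per connection, in [0, 1]
```

Adding a custom component is then just one more key in the dictionary; the combined score automatically reflects it.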
We share the output scores for every model we tested (for both positive and negative samples) and compute ROC AUC scores based on different stratification strategies.
If you would like to use the benchmark, a possible course of action is:

1. Train your model. The model should accept two UMLS terms as input and output a number indicating the likelihood of the two terms being connected. The training data should come from scientific publications (or relevant literature) prior to the testing timestamp `<YEAR>`.
2. Download a table `all_model_scores_test_<YEAR>.csv` from Google Drive, where `<YEAR>` represents the year of interest and indicates the timestamp when the associations first appeared in the literature.
3. Compute your model's output for the associations from step 2 and calculate the ROC AUC score with, for example, this scikit-learn implementation.
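For step 3, scikit-learn's `roc_auc_score(labels, scores)` is the usual choice; as a dependency-free illustration, the same quantity can be computed via the Mann–Whitney U statistic (the probability that a random positive is scored above a random negative, with ties counting one half). The toy labels and scores below are made up:

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparisons (Mann-Whitney U). Equivalent
    to sklearn.metrics.roc_auc_score for binary labels."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example; in practice, labels come from the benchmark table
# and scores from your trained model.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.7, 0.4, 0.8, 0.6, 0.1]
print(roc_auc(labels, scores))  # 7 of 9 positive/negative pairs ranked correctly
```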
Dataset time split:
- train: 2015
- test: 2016
- importance: 2022
Table with ROC AUC scores based on importance stratification:

| | AGATHA | DistMult | Node2Vec | HolE | ComplEx | TransE |
|---|---|---|---|---|---|---|
| Low importance (< 0.41) | 0.774 | 0.760 | 0.755 | 0.623 | 0.577 | 0.555 |
| Medium importance (0.41–0.53) | 0.698 | 0.674 | 0.666 | 0.588 | 0.559 | 0.541 |
| High importance (> 0.53) | 0.638 | 0.623 | 0.603 | 0.566 | 0.551 | 0.533 |
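The stratified evaluation above can be sketched as follows, assuming each test association carries its `IMP_manh` value alongside a label and a model score; the 0.41 and 0.53 thresholds come from the table, while the helper names and the toy rows are illustrative:

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparisons (Mann-Whitney U)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratum(imp, low=0.41, high=0.53):
    """Map an IMP_manh value to its importance stratum."""
    if imp < low:
        return "low"
    return "medium" if imp <= high else "high"

def stratified_auc(rows):
    """rows: (label, model_score, imp_manh) tuples.
    Returns a ROC AUC score per importance stratum."""
    buckets = {}
    for label, score, imp in rows:
        buckets.setdefault(stratum(imp), []).append((label, score))
    return {name: roc_auc(*zip(*pairs)) for name, pairs in buckets.items()}

# Toy rows: (label, model score, importance).
rows = [
    (1, 0.9, 0.30), (0, 0.2, 0.35),  # low-importance stratum
    (1, 0.8, 0.45), (0, 0.6, 0.50),  # medium-importance stratum
    (1, 0.7, 0.60), (0, 0.9, 0.70),  # high-importance stratum
]
print(stratified_auc(rows))
```

The pattern in the table above, where AUC drops as importance rises, is what this per-stratum view is designed to surface.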
Want to upload your results? Open an issue on GitHub! Or reach out to me via email: tyagin at udel dot edu
This research was supported by NIH award #R01DA054992. The computational experiments were supported in part through the use of DARWIN computing system: DARWIN - A Resource for Computational and Data-intensive Research at the University of Delaware and in the Delaware Region, which is supported by NSF Grant #1919839.
I. Tyagin, I. Safro. Dyport: Dynamic Importance-based Hypothesis Generation Benchmarking Technique. 2023
```bibtex
@article{tyagin2023dyport,
  title={Dyport: Dynamic Importance-based Hypothesis Generation Benchmarking Technique},
  author={Ilya Tyagin and Ilya Safro},
  year={2023},
  eprint={2312.03303},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}
```