Dyport is a novel benchmarking framework for evaluating biomedical hypothesis generation systems. Using curated datasets, our approach tests these systems under realistic conditions, enhancing the relevance of the evaluations. We integrate knowledge from the curated databases into a dynamic graph, accompanied by a method to quantify discovery importance. This assesses not only the accuracy of hypotheses but also their potential impact on biomedical research, which significantly extends traditional link prediction benchmarks. We demonstrate the applicability of our benchmarking process on several link prediction systems applied to biomedical semantic knowledge graphs. Our benchmarking system is flexible and designed for broad application in hypothesis generation quality verification, aiming to expand the scope of scientific discovery within the biomedical research community.
Arxiv preprint link: https://arxiv.org/abs/2312.03303
Full benchmark data is available here: Google Drive
The data is represented as `.csv` tables with pairwise concept associations stratified by the year of their first occurrence in the literature. The tables contain both positive and negative connections. Each `.csv` file has the following structure:
| | source | target | source_st | target_st | origin_db | mentions_diff | sem_scholar_cit | ig | jac_2nd_train | eig_cent_diff | betw_cent | IMP_manh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 42241 | C4078312__acalabrutinib | C4079830__venetoclax | Pharmacologic Substance; Organic Chemical | Pharmacologic Substance; Organic Chemical | rxnav | 62 | 975.0 | 0.074084 | 0.077561 | 0.000203 | 0.000019 | 0.956588 |
| 22993 | C2353951__dapagliflozin | C4524093__Euglycaemic diabetic ketoacidosis | Pharmacologic Substance; Organic Chemical | Disease or Syndrome | drugcentral | 30 | 427.0 | 0.103463 | 0.041605 | 0.000261 | 0.000050 | 0.965584 |
| 8080 | C1335212__PIK3CA gene | C4049141__copanlisib | Gene or Genome | Pharmacologic Substance; Organic Chemical | drugcentral; kegg | 17 | 1066.0 | 0.067049 | 0.025928 | 0.000923 | 0.000018 | 0.974659 |
where:

- `source` and `target` columns represent UMLS CUI terms and their corresponding preferred names (separated with `__`);
- `source_st` and `target_st` columns contain their semantic types;
- `origin_db` shows which biomedical database a particular connection came from.

The next columns represent the importance components:

- `mentions_diff`: number of times a particular association was mentioned in the literature;
- `sem_scholar_cit`: number of Semantic Scholar citations;
- `ig`: Integrated Gradients attribution score;
- `jac_2nd_train`: 2nd-order Jaccard similarity;
- `eig_cent_diff`: eigenvector centrality difference;
- `betw_cent`: betweenness centrality.

The last column, `IMP_manh`, is the final combined importance score (a number between 0 and 1).
Note: the importance components described above are the ones we reported in our paper. If you want to use your own custom component(s), you can calculate them for every connection (alongside the existing components) and simply add them to the list. `IMP_manh` should then be recalculated to obtain the final combined importance score for every connection. Please refer to the illustrated notebook, section "Calculating merged importance metric".
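As a minimal sketch of that recalculation step, assuming the combined score is a Manhattan-style (L1) aggregation of min-max-normalized components scaled to [0, 1] (the exact formula is defined in the notebook; the function names and the `my_custom_metric` component here are illustrative):

```python
def minmax(values):
    """Min-max normalize a list of raw component values to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def combined_importance(components):
    """components: dict mapping component name -> list of raw values
    (one value per connection). Normalizes each component, then takes
    the per-connection mean of the normalized values, i.e. an L1-style
    aggregation rescaled to [0, 1]."""
    normed = [minmax(vals) for vals in components.values()]
    n = len(normed)
    return [sum(row) / n for row in zip(*normed)]

# Toy example: two existing components plus one hypothetical custom one.
scores = combined_importance({
    "mentions_diff": [62, 30, 17],
    "sem_scholar_cit": [975.0, 427.0, 1066.0],
    "my_custom_metric": [0.1, 0.5, 0.9],  # user-added component
})
print([round(s, 3) for s in scores])  # one score per connection, in [0, 1]
```

Adding a custom component is then just one more key in the dictionary; the combined score automatically reflects it.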
We share the output scores for every model we tested (for both positive and negative samples) and compute ROC AUC scores based on different stratification strategies.
If you would like to use the benchmark, a possible course of action is:

1. Train your model. The model should accept two UMLS terms as input and output a number indicating the likelihood of the two terms being connected. The training data should come from scientific publications (or relevant literature) prior to the testing timestamp `<YEAR>`.
2. Download a table `all_model_scores_test_<YEAR>.csv` from Google Drive, where `<YEAR>` represents the year of interest and indicates the timestamp when the associations first appeared in the literature.
3. Compute your model's output for the associations from step 2 and calculate the ROC AUC score with, for example, this scikit-learn implementation.
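For step 3, scikit-learn's `roc_auc_score(labels, scores)` is the usual choice; as a dependency-free illustration, the same quantity can be computed via the Mann–Whitney U statistic (the probability that a random positive is scored above a random negative, with ties counting one half). The toy labels and scores below are made up:

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparisons (Mann-Whitney U). Equivalent
    to sklearn.metrics.roc_auc_score for binary labels."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example; in practice, labels come from the benchmark table
# and scores from your trained model.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.7, 0.4, 0.8, 0.6, 0.1]
print(roc_auc(labels, scores))  # 7 of 9 positive/negative pairs ranked correctly
```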
Dataset time split:
- train: 2015
- test: 2016
- importance: 2022
Table with ROC AUC scores based on importance stratification:

| | AGATHA | DistMult | Node2Vec | HolE | ComplEx | TransE |
|---|---|---|---|---|---|---|
| Low importance (< 0.41) | 0.774 | 0.760 | 0.755 | 0.623 | 0.577 | 0.555 |
| Medium importance (0.41–0.53) | 0.698 | 0.674 | 0.666 | 0.588 | 0.559 | 0.541 |
| High importance (> 0.53) | 0.638 | 0.623 | 0.603 | 0.566 | 0.551 | 0.533 |
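The stratified evaluation above can be sketched as follows, assuming each test association carries its `IMP_manh` value alongside a label and a model score; the 0.41 and 0.53 thresholds come from the table, while the helper names and the toy rows are illustrative:

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparisons (Mann-Whitney U)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratum(imp, low=0.41, high=0.53):
    """Map an IMP_manh value to its importance stratum."""
    if imp < low:
        return "low"
    return "medium" if imp <= high else "high"

def stratified_auc(rows):
    """rows: (label, model_score, imp_manh) tuples.
    Returns a ROC AUC score per importance stratum."""
    buckets = {}
    for label, score, imp in rows:
        buckets.setdefault(stratum(imp), []).append((label, score))
    return {name: roc_auc(*zip(*pairs)) for name, pairs in buckets.items()}

# Toy rows: (label, model score, importance).
rows = [
    (1, 0.9, 0.30), (0, 0.2, 0.35),  # low-importance stratum
    (1, 0.8, 0.45), (0, 0.6, 0.50),  # medium-importance stratum
    (1, 0.7, 0.60), (0, 0.9, 0.70),  # high-importance stratum
]
print(stratified_auc(rows))
```

The pattern in the table above, where AUC drops as importance rises, is what this per-stratum view is designed to surface.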
Want to upload your results? Open an issue on GitHub! Or reach out to me via email: tyagin at udel dot edu
This research was supported by NIH award #R01DA054992. The computational experiments were supported in part through the use of DARWIN computing system: DARWIN - A Resource for Computational and Data-intensive Research at the University of Delaware and in the Delaware Region, which is supported by NSF Grant #1919839.
I. Tyagin, I. Safro. Dyport: Dynamic Importance-based Hypothesis Generation Benchmarking Technique. 2023
```bibtex
@article{tyagin2023dyport,
  title={Dyport: Dynamic Importance-based Hypothesis Generation Benchmarking Technique},
  author={Ilya Tyagin and Ilya Safro},
  year={2023},
  eprint={2312.03303},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}
```