
Dyport: Dynamic Importance-based Biomedical Hypothesis Generation Benchmarking Technique

Overview:

Dyport is a novel benchmarking framework for evaluating biomedical hypothesis generation systems. Using curated datasets, it tests these systems under realistic conditions, improving the relevance of the evaluation. We integrate knowledge from the curated databases into a dynamic graph, accompanied by a method to quantify discovery importance. As a result, Dyport assesses not only the accuracy of hypotheses but also their potential impact in biomedical research, which significantly extends traditional link prediction benchmarks. We demonstrate the benchmarking process on several link prediction systems applied to biomedical semantic knowledge graphs. The framework is flexible and designed for broad application in hypothesis generation quality verification, aiming to expand the scope of scientific discovery within the biomedical research community.

Arxiv preprint link: https://arxiv.org/abs/2312.03303

Data

Full benchmark data is available here: Google Drive

Data description

The data is provided as .csv tables with pairwise concept associations stratified by the year of their first occurrence in the literature. For each positive connection $e$ we compute a set of features describing its importance using network- and literature-based data. The final importance score is then calculated as the average percentile rank of the underlying features.
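The averaging step above can be sketched as follows. This is a minimal illustration of "average percentile rank of the underlying features", not the authors' exact implementation; the function names are mine:

```python
import numpy as np

def percentile_rank(values):
    """Percentile rank in (0, 1] of each value within one feature column."""
    order = np.argsort(np.argsort(values))
    return (order + 1) / len(values)

def importance_score(feature_matrix):
    """Average percentile rank across feature columns.

    `feature_matrix` has one row per connection and one column per
    importance component (sketch only; ties and scaling details may
    differ from the benchmark's actual computation).
    """
    ranks = np.column_stack([percentile_rank(col) for col in feature_matrix.T])
    return ranks.mean(axis=1)
```

A connection that ranks high on every component ends up with a score close to 1.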

Each .csv file has the following structure:

| | source | target | source_st | target_st | origin_db | mentions_diff | sem_scholar_cit | ig | jac_2nd_train | eig_cent_diff | betw_cent | IMP_manh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 42241 | C4078312__acalabrutinib | C4079830__venetoclax | Pharmacologic Substance; Organic Chemical | Pharmacologic Substance; Organic Chemical | rxnav | 62 | 975.0 | 0.074084 | 0.077561 | 0.000203 | 0.000019 | 0.956588 |
| 22993 | C2353951__dapagliflozin | C4524093__Euglycaemic diabetic ketoacidosis | Pharmacologic Substance; Organic Chemical | Disease or Syndrome | drugcentral | 30 | 427.0 | 0.103463 | 0.041605 | 0.000261 | 0.000050 | 0.965584 |
| 8080 | C1335212__PIK3CA gene | C4049141__copanlisib | Gene or Genome | Pharmacologic Substance; Organic Chemical | drugcentral; kegg | 17 | 1066.0 | 0.067049 | 0.025928 | 0.000923 | 0.000018 | 0.974659 |

where:

  • source and target columns represent UMLS CUI terms and their corresponding preferred names (separated with __);
  • source_st and target_st columns contain their semantic types;
  • origin_db shows what biomedical database a particular connection came from;

Next columns represent the importance components:

  • mentions_diff: number of times a particular association was mentioned in the literature;
  • sem_scholar_cit: number of Semantic Scholar citations;
  • ig: Integrated Gradients attribution score;
  • jac_2nd_train: 2nd order Jaccard similarity;
  • eig_cent_diff: eigenvector centrality difference;
  • betw_cent: betweenness centrality;

The last column, IMP_manh, is the final combined importance score (a number between 0 and 1).
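To work with these tables, the `source`/`target` values can be split back into CUI and preferred name. A small parsing sketch using one row from the sample above (the helper name `split_terms` is mine, not part of the benchmark code):

```python
import io
import pandas as pd

# One row from the sample table, trimmed to a few columns for brevity.
sample = io.StringIO(
    "id,source,target,origin_db\n"
    "42241,C4078312__acalabrutinib,C4079830__venetoclax,rxnav\n"
)

def split_terms(df, col):
    """Split 'CUI__preferred name' values into two separate columns."""
    df[[col + "_cui", col + "_name"]] = df[col].str.split("__", n=1, expand=True)
    return df

df = split_terms(pd.read_csv(sample, index_col=0), "source")
```

`n=1` keeps preferred names that themselves contain underscores intact.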

Note: the importance components described above are the ones reported in our paper. If you want to use your own custom component(s), calculate them for every connection (alongside the existing components) and simply add them to the list. IMP_manh should then be recalculated to obtain the final combined importance score for every connection. Please refer to this illustrated notebook, part "Calculating merged importance metric".
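Recalculating IMP_manh after adding a custom component might look like this. The column `my_component` is hypothetical, and the pandas-based recomputation is my sketch of the averaging described above, not the notebook's exact code:

```python
import pandas as pd

# Benchmark component columns plus a hypothetical custom one.
COMPONENTS = ["mentions_diff", "sem_scholar_cit", "ig",
              "jac_2nd_train", "eig_cent_diff", "betw_cent",
              "my_component"]

def recompute_importance(df, components=COMPONENTS):
    """Recompute IMP_manh as the average percentile rank of components."""
    pct = df[components].rank(pct=True)  # percentile rank per column
    df["IMP_manh"] = pct.mean(axis=1)
    return df
```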

Proposed Use Case

We share the output scores of every model we tested (for both positive and negative samples) and compute ROC AUC scores under different stratification strategies.

If you would like to use the benchmark, a possible course of action is:

  1. Train your model. The model should accept two UMLS terms as input and output a number indicating the likelihood of the two terms being connected. The training data should come from scientific publications (or relevant literature) prior to the testing timestamp <YEAR>.
  2. Download the table all_model_scores_test_<YEAR>.csv from Google Drive, where <YEAR> represents the year of interest and indicates the timestamp when the associations first appeared in the literature.
  3. Compute your model's output for the associations from step 2 and calculate the ROC AUC score with, for example, this scikit-learn implementation.
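The three steps above can be sketched as a small evaluation helper. `score_fn` stands for your trained model from step 1, and the exact column layout of the downloaded table is my assumption:

```python
from sklearn.metrics import roc_auc_score

def evaluate(score_fn, pairs, labels):
    """ROC AUC for a pairwise scoring function.

    `score_fn(source, target)` returns the predicted likelihood of a
    connection; `pairs` and binary `labels` (1 = real association,
    0 = negative sample) come from the downloaded benchmark table.
    """
    preds = [score_fn(s, t) for s, t in pairs]
    return roc_auc_score(labels, preds)
```

A model that consistently ranks true associations above negative samples yields an AUC close to 1.0.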

Current leaderboard

Dataset time split:

  • train: 2015
  • test: 2016
  • importance: 2022

Table with ROC AUC scores based on Importance stratification:

| | AGATHA | DistMult | Node2Vec | HolE | ComplEx | TransE |
|---|---|---|---|---|---|---|
| Low importance ($I_t < 0.41$) | 0.774 | 0.760 | 0.755 | 0.623 | 0.577 | 0.555 |
| Medium importance ($0.41 \leq I_t \leq 0.53$) | 0.698 | 0.674 | 0.666 | 0.588 | 0.559 | 0.541 |
| High importance ($I_t > 0.53$) | 0.638 | 0.623 | 0.603 | 0.566 | 0.551 | 0.533 |

Want to upload your results? Open an issue on GitHub! Or reach out to me via email: tyagin at udel dot edu

Acknowledgements

This research was supported by NIH award #R01DA054992. The computational experiments were supported in part through the use of DARWIN computing system: DARWIN - A Resource for Computational and Data-intensive Research at the University of Delaware and in the Delaware Region, which is supported by NSF Grant #1919839.

Citing Dyport

I. Tyagin, I. Safro. Dyport: Dynamic Importance-based Hypothesis Generation Benchmarking Technique. 2023

@article{tyagin2023dyport,
      title={Dyport: Dynamic Importance-based Hypothesis Generation Benchmarking Technique}, 
      author={Ilya Tyagin and Ilya Safro},
      year={2023},
      eprint={2312.03303},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
