Upgrade of Gene to Disease ingest mappings #709

Open
5 tasks
Tracked by #710
RichardBruskiewich opened this issue Jan 23, 2023 · 11 comments
Labels
enhancement New feature or request ingest

Comments

@RichardBruskiewich

RichardBruskiewich commented Jan 23, 2023

The Monarch graph has ingested HPOA (OMIM, Orphanet, MorbidMap, etc.) mappings, but these have some subtle issues of precision and completeness and appear to be generated from secondary data sources with challenging semantics. More importantly, the Monarch Initiative (and other related projects) has spawned numerous additional code bases that overlap heavily yet differ in design from one another, for example:

Closely related to the G2D mapping task are the underlying disease and phenotype ontology efforts:

The goal of this issue is a compare-and-contrast (tabular?) review of the relevant G2D input data parsing code bases, to identify a common, normalized (singular) approach for the ingest of Monarch knowledge graph G2D mappings. The review would aim to characterize the following for each reviewed code library:

  • Enumeration and general review of the composition of G2D-related input (knowledge) data files which are parsed by the library
  • Parsing heuristics ('rules') and algorithms internally encoded by the library
  • Enumeration and description of library output formats
  • Review of possible output formats (e.g. TSV?) for the Monarch KG construction pipeline, which could be added to the given library to allow optimal and complete capture of gene-to-disease knowledge (from OMIM, Orphanet, etc.) within the Monarch knowledge graphs
  • Review of the relationship of the library to MONDO and HPO
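
To make the idea of a common TSV output concrete, here is a minimal sketch of what a normalized G2D record and its TSV serialization could look like. Everything in it is an assumption for illustration only: the `G2DAssociation` class, its column names, the placeholder identifiers and the choice of predicate are hypothetical and are not taken from any of the reviewed libraries or from the Monarch pipeline.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class G2DAssociation:
    """Hypothetical normalized gene-to-disease record (illustrative only)."""
    subject: str                    # gene identifier (placeholder CURIE)
    predicate: str                  # relationship between gene and disease
    object: str                     # disease identifier (placeholder CURIE)
    primary_knowledge_source: str   # where the assertion came from


def to_tsv(rows):
    """Serialize records to a TSV string with one header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(G2DAssociation)],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


# Placeholder values, not real gene or disease identifiers.
example = [
    G2DAssociation(
        subject="GENE:0001",
        predicate="biolink:causes",
        object="DISEASE:0001",
        primary_knowledge_source="infores:example-source",
    )
]
print(to_tsv(example))
```

Keeping the knowledge source on every row is one possible way to preserve provenance when outputs from several libraries are later compared.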

Reviews Archive

https://drive.google.com/drive/folders/1ob6BiPuVcVGyO7kkNfTHjfoxGXAPbc5m

@RichardBruskiewich
Author

RichardBruskiewich commented Jan 23, 2023

@pnrobinson, @cmungall, @putmantime, @kevinschaper @matentzn ... I've 'assigned' you to this issue for the moment, simply to flag the issue for your kind feedback and augmentation.

I am otherwise initiating the review of the Phenol code (Peter, as I have questions about the code base, I'll coordinate with you and Daniel for guidance).

@RichardBruskiewich
Author

One ancient related issue (in the icebox): monarch-initiative/monarch-ingest#251

@matentzn
Member

Closely related to monarch-initiative/omim#80

@putmantime

@RichardBruskiewich
@matentzn offered to give you an overview of the Exomiser/Koza/Mondo situation regarding g2d.
We'd like to have a data call after this review process is complete to come up with and schedule the work for a generalized solution.

@RichardBruskiewich
Author

RichardBruskiewich commented Mar 17, 2023

@matentzn and @putmantime, thank you for the meeting on the 16th March 2023, to discuss this task and formulate a plan for its resolution. Briefly:

  1. Study and document all the ways that OMIM and Orphanet are being processed within the various code bases hosted by Monarch, to guide the creation of a more comprehensive Koza ingest of a more normalized set of Gene-to-Disease (G2D) and Phenotype-to-Disease (P2D) subject-predicate-object associations for the Monarch Graph.

  2. Goal: The Monarch team is attempting to document all the processes used across Monarch to capture G2D and P2D associations (specifically, from OMIM and Orphanet data), in order to identify how this is currently being done, to clarify the provenance of knowledge so that comparative analyses are easier, and to create a comprehensive G2D and P2D ingest for Monarch.

  3. To meet this goal, an inventory of existing Monarch-hosted (or used) project 'solutions' that have some component of parsing OMIM and Orphanet information into G2D and P2D subject-predicate-object associations will be reviewed. A tentative list of such 'solutions' is already compiled in the task plan (although more may be added if necessary) with identified "application experts" listed alongside. This list currently includes the following Monarch-affiliated applications: Exomiser, Phenol, HPOQC, MONDO OMIM ingest, Dipper and Koza itself.

  4. We will conduct a basic self-study of each 'solution' code base, with the aim of composing a basic architecture and data flow diagram, with brief supporting notes, to serve as a conversation piece with the "application experts" guiding the capture of suitable descriptions of each application with respect to the objective of capturing G2D and P2D associations.

  5. A common interview script of questions is formulated to be posed to each such "application expert" to drive the compilation of software and data characteristics of each application, and includes a request for (sample) 'dumps' of files containing data relating to G2D and P2D associations. An approximately 1 hour interview based on the script will be scheduled and convened with each identified application expert, to correct/refine the aforementioned application architecture and data flow diagram and document additional information relevant to the task goal.

  6. The resolution of this issue will be the documented answers to the aforementioned questions, the data dumps requested, and a first-order comparison of these applications and their data dumps against one another, to guide future Monarch G2D and P2D association Koza ingest design and implementation. These deliverables will be hosted in a secure Monarch private storage bucket for further Monarch team assessment.
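
As one possible shape for the "first-order comparison" in step 6, the sketch below compares sets of (gene, disease) pairs extracted from each application's data dump. It is a hedged illustration only: the `pairwise_overlap` helper, the names `tool_a` and `tool_b`, and the toy identifiers are all hypothetical, and it assumes each dump has already been reduced to normalized identifier pairs.

```python
from itertools import combinations


def pairwise_overlap(dumps):
    """Compare every pair of dumps.

    dumps: dict mapping an application name to a set of
    (gene_id, disease_id) tuples (assumed already normalized).
    Returns a dict keyed by (name_a, name_b) with overlap counts.
    """
    report = {}
    for first, second in combinations(sorted(dumps), 2):
        shared = dumps[first] & dumps[second]
        union = dumps[first] | dumps[second]
        report[(first, second)] = {
            "shared": len(shared),
            "only_first": len(dumps[first] - dumps[second]),
            "only_second": len(dumps[second] - dumps[first]),
            # Jaccard index; defined as 0.0 when both sets are empty.
            "jaccard": len(shared) / len(union) if union else 0.0,
        }
    return report


# Toy placeholder data, not real associations.
toy_dumps = {
    "tool_a": {("G1", "D1"), ("G2", "D2")},
    "tool_b": {("G2", "D2"), ("G3", "D3")},
}
for pair, stats in pairwise_overlap(toy_dumps).items():
    print(pair, stats)
```

Pairs that appear in only one dump are the natural starting points for the follow-up questions to each application expert.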

@sagehrke
Member

@madanucd this ticket may be of help to your G2D ingest assessment.

@RichardBruskiewich RichardBruskiewich removed their assignment Nov 14, 2023
@RichardBruskiewich
Author

@sagehrke I'm not sure what to make of this exercise now, after all the discussions so many months ago. We had a "70% solution", but I'm not sure what comes next.

@sagehrke
Member

Perhaps @madanucd and @kevinschaper can connect with you, @RichardBruskiewich, to see what next steps are regarding G2D review and any potential updates to ingest mappings.

@RichardBruskiewich
Author

RichardBruskiewich commented Jan 10, 2024

Given that my Monarch subaward budget is depleted, I can no longer contribute to the resolution of this issue.

@RichardBruskiewich RichardBruskiewich removed their assignment Jan 24, 2024
@sagehrke
Member

sagehrke commented Feb 1, 2024

Related to #707

@monicacecilia monicacecilia transferred this issue from monarch-initiative/monarch-ingest May 22, 2024
@pnrobinson
Member

phenol and hpoannotQC should be considered the source of truth. This pipeline outputs phenotype.hpoa, which does not contain genetic data. Other parts of phenol combine in the genetic data, and the result is used for the HPO website and API. The latter has recently been reworked by Mike; it could provide a more unified view of several ontologies and could be more easily adapted for Monarch (e.g., Uberon, Mondo, MAxO browsers).


8 participants