Aim 2.2.a. Evaluate consistency of relation and entity types used within BioPortal ontologies. #22

Closed
6 tasks done
caufieldjh opened this issue Mar 14, 2022 · 13 comments

Comments

@caufieldjh
Collaborator

caufieldjh commented Mar 14, 2022

Specifically:

  • Get total counts of KGX validation errors by type across all transforms (e.g., "X transforms have at least one error of type Y")
    • Which transforms did not work?
    • How many of those transformation issues are resolvable?
    • How many of those transformation issues are not worth resolving in this repository (e.g., something is structurally wrong with the ontology on BioPortal; the entry may be duplicated, a test, or otherwise broken)?
  • Enumerate usage of entity types by ontology (e.g., "X ontologies use biolink:OntologyClass")
  • Enumerate usage of relation types by ontology (e.g., "X ontologies use biolink:subclass_of")
@caufieldjh
Collaborator Author

Output of newly added script get_all_transform_stats.sh:

All processed ontologies:
910
All successful JSON transforms:
896
All successful KGX TSV transforms:
888
Transforms with at least one of the following errors:
MISSING_NODE_PROPERTY
0
MISSING_EDGE_PROPERTY
0
INVALID_NODE_PROPERTY
833
INVALID_EDGE_PROPERTY
796
INVALID_NODE_PROPERTY_VALUE_TYPE
28
INVALID_NODE_PROPERTY_VALUE
833
INVALID_EDGE_PROPERTY_VALUE_TYPE
0
INVALID_EDGE_PROPERTY_VALUE
796
MISSING_CATEGORY
0
INVALID_CATEGORY
888
Category 'OntologyClass' is a mixin in the Biolink Model
888
MISSING_EDGE_PREDICATE
0
INVALID_EDGE_PREDICATE
495
MISSING_NODE_CURIE_PREFIX
0
DUPLICATE_NODE
0
MISSING_NODE
0
INVALID_EDGE_TRIPLE
0
VALIDATION_SYSTEM_ERROR
0
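
For reference, the counts above have the form "X transforms have at least one error of type Y". Below is a minimal Python sketch of that kind of tally. It is not the contents of get_all_transform_stats.sh. The function name and the input format (a dict of ontology ID to validation log text, with error type names appearing as plain tokens) are assumptions for illustration. The sketch matches whole tokens, so a log that only contains INVALID_NODE_PROPERTY_VALUE is not also counted under INVALID_NODE_PROPERTY; whether the original script matches this way is not stated in this thread.

```python
import re
from collections import Counter


def count_transforms_with_error(logs, error_types):
    """Count how many logs contain each error type at least once.

    logs: dict mapping ontology ID -> validation log text (assumed format).
    error_types: list of error type names, e.g. "INVALID_CATEGORY".
    """
    counts = Counter({etype: 0 for etype in error_types})
    for text in logs.values():
        for etype in error_types:
            # Require that the name is not part of a longer identifier,
            # e.g. INVALID_NODE_PROPERTY inside INVALID_NODE_PROPERTY_VALUE.
            pattern = r"(?<![A-Z_])" + re.escape(etype) + r"(?![A-Z_])"
            if re.search(pattern, text):
                counts[etype] += 1
    return counts
```

Each transform contributes at most 1 to a given error type, no matter how many times the error appears in its log.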

The big take-home here is that entities in every transform get assigned biolink:OntologyClass, even though Biolink models OntologyClass as a class mixin rather than as a class type in its own right.

Do we know enough about each ontology to assign a more specific class to nodes?

There are other metaclasses, like [biolink:TaxonomicRank](https://w3id.org/biolink/vocab/TaxonomicRank) - these may still make sense to use in some contexts.
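
One way to scope this question is to list the nodes that carry nothing more specific than the mixin. The following is only a sketch: the function name is hypothetical, and it assumes each node's categories arrive as a single pipe-delimited string, as in a KGX TSV `category` column.

```python
def nodes_with_only_generic_categories(nodes, generic=frozenset({"biolink:OntologyClass"})):
    """Return IDs of nodes whose categories are all in `generic`.

    nodes: iterable of (node_id, category_string) pairs, where
    category_string is pipe-delimited (assumed input format).
    Nodes with no category at all are also returned.
    """
    flagged = []
    for node_id, category_string in nodes:
        categories = {c for c in category_string.split("|") if c}
        # A subset test also catches the empty set (no category assigned).
        if categories <= generic:
            flagged.append(node_id)
    return flagged
```

The `generic` set is a parameter so that other broad categories can be treated the same way if that turns out to be useful.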

@caufieldjh
Collaborator Author

Finding appropriate mappings to Biolink is a goal for KGX; that will help reduce the number of OntologyClass nodes.

@caufieldjh
Collaborator Author

caufieldjh commented Mar 14, 2022

Completely failed transforms:

| ID | Name | Issue |
| --- | --- | --- |
| NIFSTD | Neuroscience Information Framework (NIF) Standard Ontology | #15 |
| EXACT | An ontology for experimental actions | #15 ; Small, alpha status, last uploaded 2014 |
| DOID | Human Disease Ontology | #15 ; Unknown - would really expect this to work |
| ECOCORE | An ontology of core ecological entities | Empty? Error in Bioportal? Last updated Mar 10 2022 |
| ETHIOPIADISEASES | EthiopiaDiseaseList | Empty? Does not render on Bioportal |
| LC-CARRIERS | Library of Congress Carriers Scheme | Empty? In SKOS format; does not render on Bioportal |
| SCDO | Sickle Cell Disease Ontology | #15 |
| FENICS | Functional Epilepsy Nomenclature for Ion Channels | #15 ; Using webprotege: prefix (unsure if related to transform fail) |
| FOVT | FuTRES Ontology of Vertebrate Traits | #15 |
| PTRANS | Pathogen Transmission Ontology | #15 ; Does not render on Bioportal |
| TIMEBANK | Timebank Ontology | #15 |
| GSSO | Gender, Sex, and Sexual Orientation Ontology | #15 |
| CST | Cancer Staging Terms | Unknown ; Does not render on Bioportal |
| MARC-RELATORS | MARC Code List for Relators | Empty? Does not render on Bioportal |

@caufieldjh
Collaborator Author

caufieldjh commented Mar 15, 2022

Transforms translating to Obojson but not to KGX TSV:

| ID | Name | Issue |
| --- | --- | --- |
| PDRO | The Prescription of Drugs Ontology | Unknown CURIE prefix: file |
| VICO | Vaccination Informed Consent Ontology | Unknown CURIE prefix: file |
| IXNO | Interaction Ontology | Last updated in 2011; Unknown CURIE prefix: file |
| IDQA | Image and Data Quality Assessment Ontology | Unknown CURIE prefix: file |
| KTAO | Kidney Tissue Atlas Ontology | Unknown CURIE prefix: file |
| GAZ | Gazetteer | Unknown CURIE prefix: file; KG-OBO transforms GAZ w/o issue, see https://kg-hub.berkeleybop.io/kg-obo/gaz/no_version/ |
| CANONT | Upper-Level Cancer Ontology | Last updated in 2012; Unknown CURIE prefix: file |

These failures are generally caused by the OBONamespace being set to a local file path. In at least one case (VICO), the cause is instead a reference to another namespace (GAZ) that begins with file:.
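
To see which identifiers trigger the "Unknown CURIE prefix: file" message, something like the following could be run over the node or edge IDs of an affected transform. This is a sketch, not part of the existing pipeline; the function names are hypothetical, and it simply treats everything before the first colon as the prefix.

```python
def curie_prefix(identifier):
    """Return the text before the first colon, or None if there is no colon."""
    prefix, sep, _ = identifier.partition(":")
    return prefix if sep else None


def find_file_prefixed(identifiers):
    """Return the identifiers whose prefix is exactly 'file'."""
    return [i for i in identifiers if curie_prefix(i) == "file"]
```

For example, given made-up IDs such as `GAZ:00000448` and `file:/tmp/vico.owl#Consent`, only the second would be returned.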

@caufieldjh
Collaborator Author

See #23 for Unknown CURIE prefix: file issue.

@caufieldjh
Collaborator Author

caufieldjh commented Mar 18, 2022

With issues #15 and #23 resolved, the only remaining problematic transforms are:

  • ECOCORE (use current BioPortal submission)
  • ETHIOPIADISEASES (drop)
  • LC-CARRIERS (drop)
  • CST (drop)
  • MARC-RELATORS (drop)

@caufieldjh
Collaborator Author

ECOCORE has a new version on BioPortal - can just use this for now:
https://bioportal.bioontology.org/ontologies/ECOCORE/?p=summary

Can drop LC-CARRIERS and MARC-RELATORS.

@jvendetti
Member

Hi Harry.

ETHIOPIADISEASES

The latest submission in our system was corrupt. I recreated/reprocessed the submission so that the ontology is accessible again:

https://bioportal.bioontology.org/ontologies/ETHIOPIADISEASES?p=summary

CST

It looks like the end user uploaded an ontology source file for this entry, but we were never able to load the data into the triplestore, because our code errors out when we try to serialize to RDF/XML format with the following error:

org.semanticweb.owlapi.rdf.rdfxml.renderer.IllegalElementNameException: Illegal Element Name (Element Is Not A QName): http://www.w3.org/2000/01/rdf-schema#comment:

I think this one could probably be dropped for now.

@caufieldjh
Collaborator Author

Great - thanks @jvendetti !

@jvendetti
Member

Hi @caufieldjh. It turns out that the maintainers of ETHIOPIADISEASES told John that they no longer need this entry in BioPortal. I had originally reprocessed it, but I've now deleted the entry.

@caufieldjh
Collaborator Author

Great, thanks! One more off the list.

@caufieldjh
Collaborator Author

caufieldjh commented Apr 12, 2022

Updated statistics, including for types:

*** General ontology counts:
All processed ontologies:       910
All successful JSON transforms: 906
All successful KGX TSV transforms:      903
All transforms with KGX validation logs:        902
All transforms with ROBOT measure reports:      883
All transforms with ROBOT validation reports:   904
Ontologies with failed transforms:      
./transformed/ontologies/ETHIOPIADISEASES
./transformed/ontologies/LC-CARRIERS
./transformed/ontologies/CST
*** Transforms with at least one of the following errors:
MISSING_NODE_PROPERTY   0
MISSING_EDGE_PROPERTY   0
INVALID_NODE_PROPERTY   844
INVALID_EDGE_PROPERTY   807
INVALID_NODE_PROPERTY_VALUE_TYPE        31
INVALID_NODE_PROPERTY_VALUE     844
INVALID_EDGE_PROPERTY_VALUE_TYPE        0
INVALID_EDGE_PROPERTY_VALUE     807
MISSING_CATEGORY        0
INVALID_CATEGORY        902
Category 'OntologyClass' is a mixin in the Biolink Model        902
MISSING_EDGE_PREDICATE  0
INVALID_EDGE_PREDICATE  502
MISSING_NODE_CURIE_PREFIX       0
DUPLICATE_NODE  0
MISSING_NODE    0
INVALID_EDGE_TRIPLE     0
VALIDATION_SYSTEM_ERROR 0
*** Node type counts:
biolink:NamedThing      731
biolink:OntologyClass   903
biolink:BiologicalProcess       76
biolink:Cell    110
biolink:CellularComponent       46
biolink:ChemicalSubstance       119
biolink:Disease 15
biolink:Event   2
biolink:ExposureEvent   3
biolink:Gene    9
biolink:MolecularActivity       49
biolink:NamedThing      731
biolink:OntologyClass   903
biolink:OrganismalEntity        128
biolink:Pathway 6
biolink:PhenotypicFeature       44
biolink:Protein 79
biolink:SequenceFeature 56
biolink:SexQualifier    1
biolink:Source  2
biolink:TaxonomicRank   3
biolink:Unit    2
biolink:AnatomicalEntity        112
*** Edge type counts (i.e., predicate types):
biolink:related_to      376
biolink:subclass_of     899
biolink:part_of 52
biolink:inverseOf       408
biolink:subPropertyOf   449
biolink:has_part        165
biolink:has_participant 99
biolink:has_unit        29
biolink:preceded_by     69
biolink:has_attribute   76
biolink:positively_regulates    35
biolink:negatively_regulates    37

This includes all node types across all ontologies, and a selection of the more common predicate types.
Note that these are largely the result of type assignment by KGX.
As expected, nodes with biolink:NamedThing or biolink:OntologyClass are ubiquitous, suggesting that many may be re-assigned to more informative types.
Though predicate types appear more consistent, there is a long tail of sparsely-used types (not shown) across all ontologies.
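
For anyone wanting to reproduce the "X ontologies use category Y" style of count, here is a minimal Python sketch. It is not the script that generated the numbers above. It assumes KGX-style node TSVs with a `category` column whose multiple values are pipe-delimited, and the function names are hypothetical. The input is passed as text so the example is self-contained; the same idea would apply to the `predicate` column of an edges file.

```python
import csv
import io
from collections import Counter


def categories_in_nodes_tsv(tsv_text):
    """Return the set of categories used anywhere in one nodes TSV."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    found = set()
    for row in reader:
        for category in (row.get("category") or "").split("|"):
            if category:
                found.add(category)
    return found


def ontologies_per_category(nodes_by_ontology):
    """Count, for each category, how many ontologies use it at least once.

    nodes_by_ontology: dict mapping ontology ID -> nodes TSV text.
    """
    usage = Counter()
    for tsv_text in nodes_by_ontology.values():
        # Updating with a set adds 1 per category per ontology.
        usage.update(categories_in_nodes_tsv(tsv_text))
    return usage
```

Because each ontology contributes a set, a category is counted once per ontology regardless of how many nodes carry it.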

@caufieldjh
Collaborator Author

Closing issue as complete - reopen as needed


2 participants