`parse_yaml_predictions` doesn't correctly get predicate #353

serenalotreck · 2024-03-22T18:28:44Z

Not sure how widespread this formatting is, or if it's something specific to my schema, but it seems like the support for n-ary relations introduced in #349 breaks support for binary relations, so wanted to ask about it.

The relations in my YAML output are formatted as dictionaries that only contain the subject and object, and I was previously pulling the predicate from the type of the relation -- e.g. for gene_gene_interactions = [{'gene1': 'MAPK', 'gene2': 'SIPK'}], I would use the schema to convert gene_gene_interactions to GeneGeneInteraction and use that as the predicate.
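
For illustration, here is a minimal sketch of the kind of conversion described above. The helper name `slot_to_class_name` and the `predictions` dict are hypothetical and not part of ontogpt. The singularization is deliberately naive (it only drops one trailing "s"), and it assumes the dict's insertion order gives subject then object.

```python
import re


def slot_to_class_name(slot_name: str) -> str:
    """Naively turn a plural snake_case slot name into a singular CamelCase name."""
    # Assumption: the plural is formed with a single trailing "s".
    singular = re.sub(r"s$", "", slot_name)
    return "".join(part.capitalize() for part in singular.split("_"))


# Hypothetical parsed YAML output, shaped like the example above.
predictions = {"gene_gene_interactions": [{"gene1": "MAPK", "gene2": "SIPK"}]}

triples = []
for rel_type, relations in predictions.items():
    predicate = slot_to_class_name(rel_type)
    for rel in relations:
        # Relies on dict insertion order: first value is subject, second is object.
        subject, obj = rel.values()
        triples.append((subject, predicate, obj))

print(triples)  # [('MAPK', 'GeneGeneInteraction', 'SIPK')]
```

Looking the class name up from the schema would be more robust than string manipulation, since irregular plurals would break this helper.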

With the updated code, I get the following warnings for all relations in my dataset, and the resulting relations df is empty:

WARNING:ontogpt.io.csv_wrapper:Relation {'gene': 'AUTO:drought%20adaptation', 'organism': 'NCBITaxon:28950'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'gene': 'AUTO:N.%20dombeyi', 'organism': 'AUTO:drought%20adaptation'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'gene': 'AUTO:endophytes', 'organism': 'NCBITaxon:3193'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'gene1': 'AUTO:NFKB1', 'gene2': 'AUTO:RELA'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'gene': 'AUTO:NFKB1', 'protein': 'PR:000003293'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'gene': 'AUTO:sorghum', 'organism': 'NCBITaxon:4558'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'protein': 'AUTO:not%20found', 'molecule': 'AUTO:not%20found'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'gene': 'AUTO:Cucumis%20sativus', 'organism': 'NCBITaxon:3659'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'gene1': 'AUTO:geneA', 'gene2': 'AUTO:geneB'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'gene': 'GO:0005006', 'protein': 'PR:000003244'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'gene': 'AUTO:BRCA1', 'organism': 'NCBITaxon:9606'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'protein': 'AUTO:hemoglobin', 'organism': 'NCBITaxon:9606'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'gene': 'AUTO:BRCA1', 'molecule': 'CHEBI:32875'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'protein': 'AUTO:enzyme', 'molecule': 'AUTO:allosteric%20inhibitor'} missing part

Let me know if that's expected behavior!


caufieldjh · 2024-03-26T16:52:40Z

Hi @serenalotreck - that last PR doesn't actually add any support for n-ary relations, but you're right, it's still making a lot of assumptions about binary relations. Some objects with multiple attributes may not be relations, though, and the schemas intentionally don't require that level of modeling detail to be defined.

If I have a Drug class with the attributes brand_name and approval_status where both are strings, I could assume that all Drug objects should be translated to relations where brand_name and approval_status will be subject and object, respectively - but that may not be the intended model at all. Same for additional fields: if I have another field like updated_on then should that be a node property? An edge property? Probably not a predicate type but I don't really know that by the name alone.

We could just assume that everything is a triple, like in RDF, but that may defeat the purpose of LinkML, plus it's kind of messy.
So I'll propose a hybrid solution for now:

- If attributes are explicitly named subject, predicate, and object, they will be treated accordingly
  - predicate may or may not be present - if not, use class name
- Otherwise, assume dicts of 1:1 key:value pairs are triples
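
A rough sketch of how those rules could look in code, not the actual ontogpt implementation: the function name `relation_to_triple` is hypothetical, and "dicts of 1:1 key:value pairs" is interpreted here as a dict with exactly two keys whose values are scalars.

```python
from typing import Any, Optional, Tuple

Triple = Tuple[Any, str, Any]


def is_scalar(value: Any) -> bool:
    """True for values that are not nested containers."""
    return not isinstance(value, (dict, list, tuple, set))


def relation_to_triple(rel: dict, class_name: str) -> Optional[Triple]:
    """Map a relation dict to (subject, predicate, object), or None if unclear."""
    # Rule 1: explicitly named parts; fall back to the class name for the predicate.
    if "subject" in rel and "object" in rel:
        return rel["subject"], rel.get("predicate", class_name), rel["object"]
    # Rule 2 (assumed reading): two scalar-valued keys form a triple,
    # with the class name as predicate and insertion order giving the direction.
    if len(rel) == 2 and all(is_scalar(v) for v in rel.values()):
        subject, obj = rel.values()
        return subject, class_name, obj
    # Anything else is left for the caller to handle (e.g. log a warning).
    return None


print(relation_to_triple({"gene1": "AUTO:NFKB1", "gene2": "AUTO:RELA"}, "GeneGeneInteraction"))
print(relation_to_triple({"subject": "a", "object": "b"}, "SomeRelation"))
```

Note that under rule 2 a `Drug` object with only `brand_name` and `approval_status` would also come out as a triple, which is the ambiguity described above.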

This is a workaround. I will accept your changes in the linked PR then open another issue+PR to add the above solution, enabling other predicate parsing cases.

There's a more elegant solution here in which the annotations field can accept explicit directions for what should be subject, predicate, object, etc. - I'll open up another ticket for that.

Fixes #353 -- not sure how universal this fix is, so would appreciate feedback!

In particular, it seemed like the previous implementation assumed that the relation dictionaries would have the keys `"subject", "predicate", "object"`, which isn't the case for the schema I've been working with: the relation dictionary keys are entity types, and there's no entry for the predicate. For the moment, I just used the `rel_type` as the predicate in addition to the category.

Previously I had converted the lowercase, pluralized `rel_type` to the CamelCase, singular relation type from the schema with a function that didn't use SchemaView. I assume using SchemaView is what I should do here, but for some reason my SchemaView seems to be empty, even though my schema is properly formatted (at least, I assume it is, since it works when I run `ontogpt extract`!).

The `try/except` needs to be updated to catch n-ary relations -- just wanted to wait to do that until I was sure the initial changes were good. Thanks!

serenalotreck · 2024-03-26T17:11:38Z

Awesome, thanks so much!

serenalotreck mentioned this issue Mar 26, 2024

Debug relation formatting in csv_wrapper #354

Merged

caufieldjh closed this as completed in #354 Mar 26, 2024
