parse_yaml_predictions doesn't correctly get predicate #353

Closed
serenalotreck opened this issue Mar 22, 2024 · 2 comments · Fixed by #354

Comments

@serenalotreck
Contributor

Not sure how widespread this formatting is, or if it's something specific to my schema, but it seems like the support for n-ary relations introduced in #349 breaks support for binary relations, so wanted to ask about it.

The relations in my YAML output are formatted as dictionaries that only contain the subject and object, and I was previously pulling the predicate from the type of the relation -- e.g. for gene_gene_interactions = [{'gene1': 'MAPK', 'gene2': 'SIPK'}], I would use the schema to convert gene_gene_interactions to GeneGeneInteraction and use that as the predicate.
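A minimal sketch of that conversion, for illustration only. The helper name `relation_type_to_predicate` is hypothetical, and the singularization simply drops one trailing "s", which is an assumption that will not hold for irregular plurals; looking the class up in the schema would be more robust.

```python
def relation_type_to_predicate(rel_type: str) -> str:
    """Turn a snake_case, plural slot name into a CamelCase, singular class name."""
    # Naive singularization (assumption): drop a single trailing "s".
    singular = rel_type[:-1] if rel_type.endswith("s") else rel_type
    return "".join(part.capitalize() for part in singular.split("_"))


relations = {"gene_gene_interactions": [{"gene1": "MAPK", "gene2": "SIPK"}]}

triples = []
for rel_type, rel_list in relations.items():
    predicate = relation_type_to_predicate(rel_type)
    for rel in rel_list:
        # Relies on dict insertion order: first value is the subject,
        # second value is the object.
        subject, obj = rel.values()
        triples.append((subject, predicate, obj))

print(triples)  # [('MAPK', 'GeneGeneInteraction', 'SIPK')]
```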

With the updated code, I get the following for all relations in my dataset, and the resulting relations df is empty:

WARNING:ontogpt.io.csv_wrapper:Relation {'gene': 'AUTO:drought%20adaptation', 'organism': 'NCBITaxon:28950'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'gene': 'AUTO:N.%20dombeyi', 'organism': 'AUTO:drought%20adaptation'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'gene': 'AUTO:endophytes', 'organism': 'NCBITaxon:3193'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'gene1': 'AUTO:NFKB1', 'gene2': 'AUTO:RELA'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'gene': 'AUTO:NFKB1', 'protein': 'PR:000003293'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'gene': 'AUTO:sorghum', 'organism': 'NCBITaxon:4558'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'protein': 'AUTO:not%20found', 'molecule': 'AUTO:not%20found'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'gene': 'AUTO:Cucumis%20sativus', 'organism': 'NCBITaxon:3659'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'gene1': 'AUTO:geneA', 'gene2': 'AUTO:geneB'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'gene': 'GO:0005006', 'protein': 'PR:000003244'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'gene': 'AUTO:BRCA1', 'organism': 'NCBITaxon:9606'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'protein': 'AUTO:hemoglobin', 'organism': 'NCBITaxon:9606'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'gene': 'AUTO:BRCA1', 'molecule': 'CHEBI:32875'} missing part
WARNING:ontogpt.io.csv_wrapper:Relation {'protein': 'AUTO:enzyme', 'molecule': 'AUTO:allosteric%20inhibitor'} missing part

Let me know if that's expected behavior!
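One plausible way to end up with these warnings and an empty result is a parser that looks up fixed `subject`, `predicate`, and `object` keys and skips any relation that lacks one. The sketch below is hypothetical and is not the ontogpt code; `parse_relation`, `REQUIRED_KEYS`, and the logger name are made up for illustration.

```python
import logging

logger = logging.getLogger("example_parser")  # hypothetical logger name

REQUIRED_KEYS = ("subject", "predicate", "object")


def parse_relation(rel: dict):
    """Return (subject, predicate, object), or None if any fixed key is absent."""
    try:
        return tuple(rel[key] for key in REQUIRED_KEYS)
    except KeyError:
        # A dict keyed by entity types (gene1, gene2, ...) always lands here.
        logger.warning("Relation %s missing part", rel)
        return None


parsed = [parse_relation(r) for r in [{"gene1": "AUTO:NFKB1", "gene2": "AUTO:RELA"}]]
rows = [r for r in parsed if r is not None]
print(rows)  # [] because every relation was skipped
```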

@caufieldjh
Member

Hi @serenalotreck - that last PR doesn't actually add any support for n-ary relations, but you're right, it's still making a lot of assumptions about binary relations. Some objects with multiple attributes may not be relations, though, and the schemas intentionally don't require that level of modeling detail to be defined.

If I have a Drug class with the attributes brand_name and approval_status where both are strings, I could assume that all Drug objects should be translated to relations where brand_name and approval_status will be subject and object, respectively - but that may not be the intended model at all. Same for additional fields: if I have another field like updated_on then should that be a node property? An edge property? Probably not a predicate type but I don't really know that by the name alone.

We could just assume that everything is a triple, like in RDF, but that may defeat the purpose of LinkML, plus it's kind of messy.
So I'll propose a hybrid solution for now:

  • If attributes are explicitly named subject, predicate, and object, they will be treated accordingly
  • predicate may or may not be present - if not, use class name
  • Otherwise, assume dicts of 1:1 key:value pairs are triples

This is a workaround. I will accept your changes in the linked PR then open another issue+PR to add the above solution, enabling other predicate parsing cases.
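The three rules listed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the eventual implementation: `dict_to_triple` is a hypothetical name, "1:1 key:value pairs" is read here as a dict with exactly two string values, and using the class name as the predicate in that case is also an assumption.

```python
from typing import Optional, Tuple

Triple = Tuple[str, str, str]


def dict_to_triple(rel: dict, class_name: str) -> Optional[Triple]:
    """Apply the proposed rules to one relation dict; return None if none fits."""
    # Rule 1: explicitly named subject and object are used as such.
    if "subject" in rel and "object" in rel:
        # Rule 2: predicate is optional; fall back to the class name.
        return (rel["subject"], rel.get("predicate", class_name), rel["object"])
    # Rule 3 (assumed reading): a dict with exactly two string values is a
    # triple, with subject and object taken in insertion order.
    values = list(rel.values())
    if len(values) == 2 and all(isinstance(v, str) for v in values):
        return (values[0], class_name, values[1])
    # Anything else (e.g. three or more attributes) is left unparsed.
    return None
```

For example, `dict_to_triple({"gene1": "AUTO:NFKB1", "gene2": "AUTO:RELA"}, "GeneGeneInteraction")` gives `("AUTO:NFKB1", "GeneGeneInteraction", "AUTO:RELA")`, while a three-attribute `Drug` dict with `brand_name`, `approval_status`, and `updated_on` returns `None` instead of being forced into a triple.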

There's a more elegant solution here in which the annotations field can accept explicit directions for what should be subject, predicate, object, etc. - I'll open up another ticket for that.

caufieldjh added a commit that referenced this issue Mar 26, 2024
Fixes #353 -- not sure how universal this fix is, so would appreciate
feedback!

In particular, it seemed like the previous implementation assumed that
the relation dictionaries would have the keys `"subject", "predicate",
"object"`, which at least in the case of the schema I've been working
with, isn't the case; the relation dictionary keys are entity types, and
there's no entry for the predicate.

For the moment, I just used the `rel_type` as the predicate in addition
to the category; previously I had converted the lowercase, pluralized
`rel_type` to the CamelCase, singular relation type from the schema with
a function that didn't use SchemaView. I assume using SchemaView is what
I should do here, but for some reason my SchemaView seems to be empty,
even though my schema is properly formatted (at least, I assume it is,
since it works when I run `ontogpt extract`!).

The `try/except` needs to be updated to catch n-ary relations -- just
wanted to wait to do that until I was sure the initial changes were
good.

Thanks!
@serenalotreck
Contributor Author

Awesome, thanks so much!
