Expressing "Out of Scope" where the source concept is out of scope for the target vocabulary #245

Open
tompollard opened this issue Dec 2, 2022 · 6 comments
@tompollard

tompollard commented Dec 2, 2022

I'm sorry in advance for (1) not following the issue template and (2) using incorrect language when referring to SSSOM concepts. I am new to this and still getting familiar with correct terminology etc.

We (well, specifically @a-chahin) are creating SSSOM-formatted tables to map terms in the MIMIC-IV database to common vocabularies. For example, the following pull request maps MIMIC-IV lab results to LOINC: MIT-LCP/mimic-code#1418

@matentzn kindly reviewed this work and gave helpful advice on steps towards SSSOM compliance. During review, the following question was raised. For background, see: MIT-LCP/mimic-code#1418 (comment).

Question

Our source list includes a lot of junk terms that just don't make sense to map to LOINC.

Despite the source terms being meaningless, it is important for us to maintain a complete list of them in the mapping file. The question is how best to do this.

In cases where no target term is found, @matentzn suggests that standard practice is to define the predicate_id (e.g., skos:exactMatch) and then add a sssom:NoTermFound placeholder.

NoTermFound is perhaps true in our case, but something like sssom:OutOfScope would be more informative. This makes it clear that not only was no target found but that there is no desire to create a mapping.
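As a rough illustration of the placeholder approach described above, the following sketch writes a tiny SSSOM-style TSV in which an unmappable source term still gets a row. The identifiers `prefix1:A`, `prefix1:junk`, and `prefix2:X` are hypothetical, and only three columns are shown; a real SSSOM file would carry more metadata.

```python
import csv
import io

# Hypothetical curation result: None marks a source term with no suitable target.
candidate_targets = {
    "prefix1:A": "prefix2:X",
    "prefix1:junk": None,
}

COLUMNS = ["subject_id", "predicate_id", "object_id"]


def build_rows(targets):
    """Create one row per source term, using a placeholder when no target exists."""
    rows = []
    for subject_id, object_id in targets.items():
        rows.append({
            "subject_id": subject_id,
            "predicate_id": "skos:exactMatch",
            # Placeholder suggested in the thread; every source term stays listed.
            "object_id": object_id if object_id is not None else "sssom:NoTermFound",
        })
    return rows


rows = build_rows(candidate_targets)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
tsv_text = buffer.getvalue()
print(tsv_text)
```

The placeholder row keeps the list complete, but it cannot distinguish "we looked and found nothing" from "we never intended to map this".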

We would appreciate your thoughts on this!

@matentzn
Collaborator

matentzn commented Dec 3, 2022

I think this request makes sense - it speaks to the problem that people want to declare their mapping sets as in some sense “complete”. Say you want to align OMIM and NCIT, you may want to say: NCIT:ultrasoundDevice is out of scope.

Unfortunately, due to the nature of the standard, it will be a bit of a hack to include this in SSSOM, especially given the arbitrariness of defining a mapping predicate. I am wondering now what the cleanest way is. I see two paths:

  1. predicate_id: skos:MappingRelation (which is the mother of all mapping relations), object_id: sssom:OutOfScope
  2. predicate_id: semapv:ObjectTermIsOutOfScopeForSubjectSource, object_id: sssom:noTermFound

@cthoyt
Member

cthoyt commented Dec 3, 2022

> Unfortunately, due to the nature of the standard, it will be a bit of a hack to include this in SSSOM, especially given the arbitrariness of defining a mapping predicate. I am wondering now what the cleanest way is. I see two paths:
>
>   1. predicate_id: skos:MappingRelation (which is the mother of all mapping relations), object_id: sssom:OutOfScope

I guess this would look like:

| subject_id | predicate_id | object_id | some other columns |
| --- | --- | --- | --- |
| prefix1:A | skos:MappingRelation | sssom:OutOfScope | ... |

👎 If you consume this and just look at the mapping columns, this doesn't tell you for what semantic space prefix1:A is out of scope. I don't think you should infer this from the file itself or the metadata. I believe that all rows in SSSOM files should be atomic (i.e., don't need to be interpreted in the scope of other rows or the document where they came from) and further that all SPO triples should also be meaningful (though, not necessarily contextualized, e.g., with provenance or attribution) on their own.

>   2. predicate_id: semapv:ObjectTermIsOutOfScopeForSubjectSource, object_id: sssom:noTermFound

I guess this would look like:

| subject_id | predicate_id | object_id | some other columns |
| --- | --- | --- | --- |
| prefix1:A | semapv:ObjectTermIsOutOfScopeForSubjectSource | sssom:noTermFound | ... |

👎 I think this has the same issue as above, just using different vocabulary for the predicate and object.

Alternate proposal

If you want to say something is out of scope, then wouldn't it make more sense to have a relationship like sssom:OutOfScopePredicate as in

| subject_id | predicate_id | object_id | some other columns |
| --- | --- | --- | --- |
| prefix1:A | sssom:OutOfScopePredicate | bioregistry:prefix2 | ... |

This has the advantage of being directly interpretable from the atomic SPO triple
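To make the atomicity argument concrete, here is a minimal sketch that tries to recover the target semantic space from each proposed triple in isolation. The `Triple` type and the `out_of_scope_target` helper are hypothetical names introduced for illustration; the predicate and object identifiers are the ones proposed in this thread and are not part of any released vocabulary as far as this sketch assumes.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject_id: str
    predicate_id: str
    object_id: str


# Placeholder objects that say nothing about which semantic space was targeted.
PLACEHOLDERS = {"sssom:OutOfScope", "sssom:noTermFound"}


def out_of_scope_target(triple: Triple) -> Optional[str]:
    """Return the semantic space the subject is out of scope for,
    using only the triple itself (no file metadata, no other rows)."""
    if triple.object_id in PLACEHOLDERS:
        return None  # the target cannot be recovered from this row alone
    if triple.predicate_id == "sssom:OutOfScopePredicate":
        return triple.object_id
    return None


option_1 = Triple("prefix1:A", "skos:MappingRelation", "sssom:OutOfScope")
option_2 = Triple(
    "prefix1:A",
    "semapv:ObjectTermIsOutOfScopeForSubjectSource",
    "sssom:noTermFound",
)
alternate = Triple("prefix1:A", "sssom:OutOfScopePredicate", "bioregistry:prefix2")

for name, triple in [("option 1", option_1), ("option 2", option_2), ("alternate", alternate)]:
    print(name, "->", out_of_scope_target(triple))
```

Only the alternate form yields a target without consulting anything outside the row.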

@matentzn
Collaborator

matentzn commented Dec 3, 2022

The scenario is this:

Source 1:

  • s1:Apple
  • s2:Pear
  • s3:Sportscar

Source 2:

  • s4:Apple
  • s4:Pear

Note that Source 1 does not correspond to any particular ontology; it can be a value set (i.e., a list of terms from multiple ontologies). Source 2 is about fruits.

So the question is what to do with s3:Sportscar. We want to say it's out of scope for Source 2, a stronger statement than sssom:noTermFound.
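The scenario can be sketched as a small "complete" mapping set in which every Source 1 term appears exactly once. This is only an illustration: the out-of-scope decision is assumed to come from a curator, `sssom:OutOfScopePredicate` is the predicate name floated earlier in this thread, and `bioregistry:s4` is a hypothetical identifier for Source 2.

```python
source_1 = ["s1:Apple", "s2:Pear", "s3:Sportscar"]

# Curated results (assumed for this example).
curated_matches = {"s1:Apple": "s4:Apple", "s2:Pear": "s4:Pear"}
out_of_scope = {"s3:Sportscar"}

# Hypothetical identifier for the Source 2 semantic space.
TARGET_SPACE = "bioregistry:s4"


def complete_mapping_set(terms, matches, excluded, target_space):
    """Return one (subject, predicate, object) row per source term.

    Raises ValueError if a term has neither a match nor an explicit
    out-of-scope decision, so the result is complete by construction.
    """
    rows = []
    for term in terms:
        if term in matches:
            rows.append((term, "skos:exactMatch", matches[term]))
        elif term in excluded:
            rows.append((term, "sssom:OutOfScopePredicate", target_space))
        else:
            raise ValueError(f"{term} has neither a mapping nor an out-of-scope decision")
    return rows


rows = complete_mapping_set(source_1, curated_matches, out_of_scope, TARGET_SPACE)
for row in rows:
    print("\t".join(row))
```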

@cthoyt in light of this, how does your suggestion differ from my Option 2 above?

EDIT: this has been updated for clarity in light of @cthoyt's comments below.

@cthoyt
Member

cthoyt commented Dec 3, 2022

@matentzn I'm sorry but it's super confusing to follow talking about ontologies and examples with letters. For the benefit of the discussion, we should try and be a bit more concrete (e.g., I updated my last comment to have explicit examples of rows of SSSOM documents). Also I don't think there's any value in having a name for ontology O be any different from the prefix used to refer to terms from that ontology, since this just introduces more and more stuff to keep track of. Can we just say prefix1 and prefix2 to talk about two semantic spaces?

That being said, I don't think that you should have to look into the object_source to contextualize the (subject_id, predicate_id, object_id) (SPO) triple.

Further, if there's no mapping from a local unique identifier prefix1 to one in prefix2, how does it make sense that an SSSOM term can have the object source of prefix2? This seems like a hack and also just introduces consistency. Maybe I don't understand the object_source field, though. What advantage does it have past the identity of the object that's already encoded in the CURIE prefix?

@graybeal

graybeal commented Dec 6, 2022

> @cthoyt in light of this, how does your suggestion differ from my Option 2 above?

If your Option 2 is
predicate_id: semapv:ObjectTermIsOutOfScopeForSubjectSource, object_id: sssom:noTermFound
then
prefix1:A sssom:OutOfScopePredicate bioregistry:prefix2
is significantly different.

The first says prefix1:A is out of scope for what you call "SubjectSource" (but I think it should be for ObjectSource, unless I'm totally backwards; I'm assuming we are trying to map prefix1:A into the prefix2 list of terms). And it isn't clear what the object_id of sssom:noTermFound represents, though I can guess it represents a reason for the out-of-scope declaration (ObjectTermIsOutOfScopeForTargetSourceForReason). But that's redundant information anyway: if it's out of scope, there should always be no term found.

The second one says prefix1:A is out of scope with respect to the explicit target, which is identified by its prefix in the bioregistry. This meaning is immediate and transparent, or at least it seems that way to me.

It is true that if you have declared a single target in the header, then the person/software who is reading this statement could go back and retrieve the target ID, or more likely just not care about that detail. But I agree that should not be required.

But what if you are mapping from source prefix1:A and the terms are being mapped to targets prefix2 and prefix3 (etc.)? Now you don't know for which object the statement is being made, so we need to specify that in the triple. Arguably we should rule this situation out by saying that if there are multiple targets, the OutOfScope relationship must apply to all of them; otherwise you get into the odd situation of having to declare outOfScope relations to all your targets individually. (Mapping to multiple targets is a common use case, so I sure hope it isn't illegal in SSSOM.)
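A small sketch of the multi-target case, assuming the target is named explicitly in each triple. The rows and the `excluded_targets` helper are hypothetical, and bare `prefix2` / `prefix3` stand in for whatever identifier ends up denoting a semantic space.

```python
from collections import defaultdict

# Hypothetical rows from one mapping set that maps prefix1 terms to two targets.
rows = [
    ("prefix1:A", "skos:exactMatch", "prefix2:X"),
    ("prefix1:A", "sssom:OutOfScopePredicate", "prefix3"),
    ("prefix1:B", "sssom:OutOfScopePredicate", "prefix2"),
    ("prefix1:B", "sssom:OutOfScopePredicate", "prefix3"),
]


def excluded_targets(mapping_rows):
    """Group out-of-scope declarations by subject, one target per row."""
    result = defaultdict(set)
    for subject_id, predicate_id, object_id in mapping_rows:
        if predicate_id == "sssom:OutOfScopePredicate":
            result[subject_id].add(object_id)
    return dict(result)


print(excluded_targets(rows))
```

With explicit targets, prefix1:A can be mapped into prefix2 while being out of scope for prefix3, at the cost of one row per excluded target.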

Re @cthoyt's comments:

> if there's no mapping from a local unique identifier prefix1 to one in prefix2, how does it make sense that an SSSOM term can have the object source of prefix2? This seems like a hack and also just introduces consistency.

I didn't really follow this, even if I make the last word 'inconsistency'. In fact it seems to make an opposing argument to the post with the Alternate Proposal, which was the one I liked.

I guess 'prefix1:A outOfScope prefix2' is an anti-mapping, and so it makes sense for completeness to be able to say there is no relation between prefix1:A and prefix2. (The meta analog of a disjoint relationship…)

> bioregistry:prefix2

Why is 'bioregistry:prefix2' any clearer than 'prefix2'? We didn't say 'bioregistry:prefix1'. If we need it for one we need it for the other.

> I believe that all rows in SSSOM files should be atomic (i.e., don't need to be interpreted in the scope of other rows or the document where they came from)

I think I'm on the same page, except that prefixes by definition require context to define them, and all relations must be interpreted with respect to their appropriate context. But I think we agree here.

@matentzn
Collaborator

Thank you @cthoyt and @graybeal for your comments!

First of all, saying "ontology" is indeed too restrictive for subject and object sources; they are semantic spaces. While you, @cthoyt, strive towards making a prefix equal to its respective semantic space, we should still stay explicit about it and say "semantic space" when we talk to each other.

Here was your suggestion @cthoyt :

| subject_id | predicate_id | object_id | some other columns |
| --- | --- | --- | --- |
| prefix1:A | sssom:OutOfScopePredicate | bioregistry:prefix2 | ... |

This is actually a good, more explicit alternative to using sssom:noTermFound or some such. We could use this logic to fulfill @tompollard's request even more elegantly, by doing this:

| subject_id | predicate_id | object_id | some other columns |
| --- | --- | --- | --- |
| prefix1:A | semapv:NoMappingFoundIn | ss:y | ... |
| prefix1:B | semapv:ObjectTermIsOutOfScopeForSubjectSource | ss:y | ... |

ss:y should be the mapped resource (usually the object_source). We add semapv:NoMappingFoundIn to the semantic mapping vocabulary, meaning that the term is in principle in scope, but no suitable mapping was found. The second example, semapv:SubjectTermIsOutOfScopeForObjectSource, states that prefix1:B is a term that is part of the subject source but entirely out of scope for the object source (for example, a disease quality like "severe" may be in one disease vocabulary like HPO, while another disease vocabulary may have no disease qualities at all).
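One way a consumer might use these two proposed predicates is to split a mapping set into "mapped", "no mapping found", and "out of scope" buckets. The sketch below is an assumption-laden illustration: neither predicate is confirmed to exist in semapv, the `classify` helper and the `prefix1:C` row are hypothetical, and both spellings of the out-of-scope predicate that appear in this thread are accepted.

```python
from collections import Counter

NO_MAPPING = "semapv:NoMappingFoundIn"
# Both spellings used in this thread are treated as the same proposed predicate.
OUT_OF_SCOPE = {
    "semapv:SubjectTermIsOutOfScopeForObjectSource",
    "semapv:ObjectTermIsOutOfScopeForSubjectSource",
}


def classify(predicate_id: str) -> str:
    """Bucket a row by its predicate."""
    if predicate_id == NO_MAPPING:
        return "no_mapping_found"  # in scope in principle, but nothing suitable found
    if predicate_id in OUT_OF_SCOPE:
        return "out_of_scope"      # no mapping is intended at all
    return "mapped"


rows = [
    ("prefix1:A", "semapv:NoMappingFoundIn", "ss:y"),
    ("prefix1:B", "semapv:ObjectTermIsOutOfScopeForSubjectSource", "ss:y"),
    ("prefix1:C", "skos:exactMatch", "prefix2:C"),  # hypothetical ordinary mapping
]

summary = Counter(classify(predicate_id) for _, predicate_id, _ in rows)
print(dict(summary))
```

A summary like this lets a mapping set be reported as complete while still showing how many terms were deliberately left unmapped.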

I like the idea, thoughts?
