Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Contexts don't guarantee bi-directional mapping #11

Open
3 tasks done
cthoyt opened this issue Oct 20, 2022 · 7 comments
Open
3 tasks done

Contexts don't guarantee bi-directional mapping #11

cthoyt opened this issue Oct 20, 2022 · 7 comments

Comments

@cthoyt
Copy link
Collaborator

cthoyt commented Oct 20, 2022

In the BioPortal context file (https://github.com/linkml/prefixmaps/blob/main/src/prefixmaps/data/bioportal.csv#L199-L202), two prefixes share the same URI prefix. This seems to invalidate the claim on the README about bijectivity (https://github.com/linkml/prefixmaps/blob/main/README.md?plain=1#L13).

Maybe this is a curation oversight, which I bet @caufieldjh has already found, since I know he recently put a lot of effort into looking through this content. However, this isn't resolved when loading content through the package, which makes me think that the package should be more careful about checking the integrity of content. The following code illustrates:

from prefixmaps import load_context
context = load_context("bioportal")
for e in context.prefix_expansions:
    if e.prefix in {"INVERSEROLES", "ISO-15926-2_2003"}:
         print(e)

# Output:
# PrefixExpansion(context='bioportal', prefix='INVERSEROLES', namespace='http://rds.posccaesar.org/2008/02/OWL/ISO-15926-2_2003#', status=<StatusType.canonical: 'canonical'>)
# PrefixExpansion(context='bioportal', prefix='ISO-15926-2_2003', namespace='http://rds.posccaesar.org/2008/02/OWL/ISO-15926-2_2003#', status=<StatusType.canonical: 'canonical'>)

This means the assumptions in Context.as_dict() are also incorrect, since this naively iterates through the expansions and picks out the prefix/URI prefix (namespace) pairs.

I'm pretty stumped trying to understand the data structure used in this package, it seems like a lot of things that could be grouped are not. Have you considered using a JSON structure?

For example, the curies package has a lot of overlap in terms of needing to represent a group of related prefixes and URI prefixes while denoting which is the "canonical" prefix and "canonical" URI prefix. This data structure is described in the curies documentation and a more full example with the whole Bioregistry can be found here.

Background: I'm currently trying to implement a more principled import of a Context object from this package into a curies.Converter in biopragmatics/curies#22 and am stuck since there's no way to decide which of these two canonical records should be the actual canonical record.

To Do

@caufieldjh
Copy link
Contributor

caufieldjh commented Oct 20, 2022

two prefixes share the same URI prefix

I'm going to chalk that up to curation error - one ontology is a variation on the other, so the easy solution would just be to assign ISO-15926-2_2003 as the canonical prefix. I'll update accordingly (and probably catch some other issues in the process).

But yes, that means the assumption is that the input maps are bijective, without explicit validation.

@cthoyt
Copy link
Collaborator Author

cthoyt commented Oct 20, 2022

Great, thanks. I would say besides just implementing this fix, this issue requires adding some automated tests for all content that can be run on CI and will stop new data from being added that doesn't pass these requirements

@caufieldjh
Copy link
Contributor

Fixes to those maps are in d58f199

@cthoyt cthoyt mentioned this issue Oct 20, 2022
1 task
@cthoyt
Copy link
Collaborator Author

cthoyt commented Oct 25, 2022

Following adding tests in #15, this issue can be closed if there’s branch protection on the main branch and a rule that tests have to pass

@cmungall
Copy link
Member

@caufieldjh curated content should really go in a curated.yaml, the csvs should be entirely autogenerated. In fact we would eventually like to cede the curated maps upstream to bioregistry, but it is useful to be use a placeholder here for now.

@cthoyt
Copy link
Collaborator Author

cthoyt commented Oct 26, 2022

I added #26, which is also a blocker for closing this issue.

@caufieldjh
Copy link
Contributor

@cmungall do you mean something along the same lines of https://raw.githubusercontent.com/geneontology/go-site/master/metadata/db-xrefs.yaml ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants