Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use bioregistry as a resource for external prefix map via curies #396

Merged
merged 5 commits into from
Jul 20, 2023

Conversation

hrshdhgd
Copy link
Contributor

Fixes #395

@hrshdhgd hrshdhgd marked this pull request as ready for review July 19, 2023 16:08
@hrshdhgd hrshdhgd requested a review from matentzn July 19, 2023 16:08
Copy link
Collaborator

@matentzn matentzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beautiful PR! I think we need to move a bit more of our logic to curies, because the epm the way that you parse it will not be able to distinguish between preferred and non preferred options.

tests/test_parsers.py Show resolved Hide resolved
f"{key} is already in prefix map ({prefix_map[key]}, but with a different value than {v}"
)

prefix_map.update({(k, v) for k, v in contxt_external.items() if k not in prefix_map})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only line in this PR I can judge, the rest looks great.

In a second PR, we should consider dropping our expansion/contraction code and replacing it entirely with curies (basically, we would not need get_built_in_prefix_map() anymore) - we just create our Converter the way we want and then use the contract and expand methods.

For this here: can you tell me what part of the epm ends up in converter.prefix_map? Can you find a single example in epm with multiple prefixes and multiple uri_prefixes, and print the respective rows in the prefixmap?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So basically, since there is so much stuff in epm (multiple URI expansions), we need to be much more careful in how we use the content of the epm. We cant just use the prefixmap, because we lose which prefix should be preferred when their are multiple "fitting URIs". I think you should basically import the converter independently to parsers.py, and use it in place of the prefixmap logic. In "get_built_in_prefix_map()", you just leave the contxt, and drop that "external" thing.

Copy link
Contributor Author

@hrshdhgd hrshdhgd Jul 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this here: can you tell me what part of the epm ends up in converter.prefix_map? Can you find a single example in epm with multiple prefixes and multiple uri_prefixes, and print the respective rows in the prefixmap?

Every prefix seems to be unique and there are no duplicates (AFAIK). For each prefix there seems to be one and only one uri_prefix. The uri_prefix is the default value for the prefix_map. For e.g.:

'3dmet': 'http://www.3dmet.dna.affrc.go.jp/cgi/show_data.php?acc=' is the prefix_map for the reference:

{
        "prefix": "3dmet",
        "uri_prefix": "http://www.3dmet.dna.affrc.go.jp/cgi/show_data.php?acc=",
        "uri_prefix_synonyms": [
            "http://bio2rdf.org/3dmet:",
            "http://bioregistry.io/3dmet:",
            "http://identifiers.org/3dmet/",
            "http://identifiers.org/3dmet:",
            "http://n2t.net/3dmet:",
            "https://bio2rdf.org/3dmet:",
            "https://bioregistry.io/3dmet:",
            "https://identifiers.org/3dmet/",
            "https://identifiers.org/3dmet:",
            "https://n2t.net/3dmet:",
            "https://www.3dmet.dna.affrc.go.jp/cgi/show_data.php?acc="
        ]
    }

Copy link
Contributor Author

@hrshdhgd hrshdhgd Jul 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cant just use the prefixmap, because we lose which prefix should be preferred when their are multiple "fitting URIs".

Yes, you are correct. The uri_prefix_synonyms get unnoticed in the process I propose in this PR. This will need a major overhaul of the code to be entirely dependent on curies.Converter.

In a prefix_map which is a dict , the key which is the CURIE prefix has to be unique. How do we decide which one to choose: uri_prefix vs any one from the uri_prefix_synonyms? This should only be considered when a URI does not match the default URI prefix, then we look at the uri_prefix_synonyms to get the prefix. By design, the URI prefix should be whatever is in the uri_prefix.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the prefix map, what exactly is in it when there are multiple prefixes or and multiple uri prefixes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now even the internal sssom-schema context seems to exist in the bioregistry context. Does that mean this iteration we just make one source of truth i.e. bioregistry?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All exaxt prefix uri prefix pairs in the sssom schema context are already included in the bioregistry context, even if you don't merge the contxt variable?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I think this is basically the same as what was there before then when using the bimap. Lets merge this, and do the bigger refactoring relying entirely on the epm in a subsequent PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using uri_prefix should basically replicate exactly what was there before.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe sanity check some cases.

@@ -97,7 +97,7 @@ def get_default_metadata() -> Metadata:
:return: Metadata
"""
contxt = get_jsonld_context()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the sssom schema prefix map i guesss

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes ... And this will be gone in the next PR when we entirely rely on bioregistry.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but if we do this, I want a hard coded test in the testing framework that ensures that the whenever the epm is updated, the built-in prefixes are exactly what we expect them to be (which is what is in sssom context).

f"{key} is already in prefix map ({prefix_map[key]}, but with a different value than {v}"
)

prefix_map.update({(k, v) for k, v in contxt_external.items() if k not in prefix_map})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I think this is basically the same as what was there before then when using the bimap. Lets merge this, and do the bigger refactoring relying entirely on the epm in a subsequent PR.

@hrshdhgd hrshdhgd merged commit e266188 into master Jul 20, 2023
6 checks passed
@hrshdhgd hrshdhgd deleted the issue-395 branch July 20, 2023 13:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replace sssom.context.external with obo.epm.json (bioregistry)
2 participants