Use bioregistry as a resource for external prefix map via `curies` #396

hrshdhgd · 2023-07-19T15:42:58Z

Fixes #395

matentzn

Beautiful PR! I think we need to move a bit more of our logic to curies because, the way you parse the extended prefix map (epm), we will not be able to distinguish between preferred and non-preferred options.

tests/test_parsers.py

matentzn · 2023-07-19T17:21:40Z

src/sssom/context.py

-                        f"{key} is already in prefix map ({prefix_map[key]}, but with a different value than {v}"
-                    )
+
+    prefix_map.update({(k, v) for k, v in contxt_external.items() if k not in prefix_map})


This is the only line in this PR I can judge, the rest looks great.
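To make the behaviour of that line concrete, here is a minimal sketch with made-up values. The two dictionaries are illustrative stand-ins, not the real sssom or bioregistry contents. It shows that entries already in `prefix_map` win and that external entries only fill gaps.

```python
# Hypothetical stand-ins for the built-in map and the external (bioregistry) map.
prefix_map = {"skos": "http://www.w3.org/2004/02/skos/core#"}
contxt_external = {
    "skos": "http://example.org/other-skos/",  # conflicting value: skipped
    "3dmet": "http://www.3dmet.dna.affrc.go.jp/cgi/show_data.php?acc=",
}

# Same expression as in the diff. dict.update accepts any iterable of
# (key, value) pairs, so a set of tuples works here.
prefix_map.update({(k, v) for k, v in contxt_external.items() if k not in prefix_map})

for prefix, uri_prefix in sorted(prefix_map.items()):
    print(prefix, uri_prefix)
```

Note that a conflicting external value is skipped silently here, whereas the removed lines appear to have reported such conflicts.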

In a second PR, we should consider dropping our expansion/contraction code and replacing it entirely with curies (basically, we would not need get_built_in_prefix_map() anymore) - we just create our Converter the way we want and then use the contract and expand methods.

For this here: can you tell me what part of the epm ends up in converter.prefix_map? Can you find a single example in epm with multiple prefixes and multiple uri_prefixes, and print the respective rows in the prefixmap?

So basically, since there is so much in the epm (multiple URI expansions per prefix), we need to be much more careful about how we use its content. We can't just use the prefix map, because we lose which prefix should be preferred when there are multiple "fitting URIs". I think you should import the converter independently in parsers.py and use it in place of the prefix map logic. In "get_built_in_prefix_map()", you just keep the contxt and drop that "external" thing.

> For this here: can you tell me what part of the epm ends up in converter.prefix_map? Can you find a single example in epm with multiple prefixes and multiple uri_prefixes, and print the respective rows in the prefixmap?

Every prefix seems to be unique and there are no duplicates (AFAIK). For each prefix there seems to be exactly one uri_prefix, and that uri_prefix is the value used in the prefix_map. For example:

'3dmet': 'http://www.3dmet.dna.affrc.go.jp/cgi/show_data.php?acc=' is the prefix_map entry for the record:

{ "prefix": "3dmet", "uri_prefix": "http://www.3dmet.dna.affrc.go.jp/cgi/show_data.php?acc=", "uri_prefix_synonyms": [ "http://bio2rdf.org/3dmet:", "http://bioregistry.io/3dmet:", "http://identifiers.org/3dmet/", "http://identifiers.org/3dmet:", "http://n2t.net/3dmet:", "https://bio2rdf.org/3dmet:", "https://bioregistry.io/3dmet:", "https://identifiers.org/3dmet/", "https://identifiers.org/3dmet:", "https://n2t.net/3dmet:", "https://www.3dmet.dna.affrc.go.jp/cgi/show_data.php?acc=" ] }
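As a rough illustration of what gets lost, the sketch below flattens an abridged copy of that record into a plain prefix map. The helper `to_prefix_map` is hypothetical and written only for this example; it is not the curies or sssom implementation.

```python
# Abridged copy of the 3dmet record shown above (only two synonyms kept).
record = {
    "prefix": "3dmet",
    "uri_prefix": "http://www.3dmet.dna.affrc.go.jp/cgi/show_data.php?acc=",
    "uri_prefix_synonyms": [
        "http://identifiers.org/3dmet/",
        "https://bioregistry.io/3dmet:",
    ],
}


def to_prefix_map(records):
    """Hypothetical helper: keep only prefix -> canonical uri_prefix."""
    return {r["prefix"]: r["uri_prefix"] for r in records}


prefix_map = to_prefix_map([record])

# Every synonym is absent from the flattened map.
dropped = [s for s in record["uri_prefix_synonyms"] if s not in prefix_map.values()]
print(prefix_map)
print(dropped)
```

A URI that starts with one of the dropped synonyms could no longer be contracted to `3dmet:...` using the flattened map alone.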

> We can't just use the prefix map, because we lose which prefix should be preferred when there are multiple "fitting URIs".

Yes, you are correct. The uri_prefix_synonyms go unnoticed in the process I propose in this PR. Handling them will need a major overhaul of the code so that it depends entirely on curies.Converter.

In a prefix_map, which is a dict, the key (the CURIE prefix) has to be unique. How do we decide which value to choose: uri_prefix or one of the uri_prefix_synonyms? The synonyms should only be considered when a URI does not match the default URI prefix; only then do we look at the uri_prefix_synonyms to find the prefix. By design, the URI prefix should be whatever is in uri_prefix.
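The lookup order described here could be sketched as follows. This is a simplified stdlib-only illustration; `compress` and `expand` are hypothetical functions written for this example and only mimic the idea behind curies.Converter, whose actual matching strategy may differ.

```python
# Abridged record from the discussion above; assumed input shape.
RECORDS = [
    {
        "prefix": "3dmet",
        "uri_prefix": "http://www.3dmet.dna.affrc.go.jp/cgi/show_data.php?acc=",
        "uri_prefix_synonyms": ["http://identifiers.org/3dmet/"],
    }
]


def compress(uri, records):
    """Return a CURIE for uri, or None if no URI prefix matches."""
    # First pass: canonical URI prefixes only.
    for r in records:
        if uri.startswith(r["uri_prefix"]):
            return r["prefix"] + ":" + uri[len(r["uri_prefix"]):]
    # Second pass: fall back to the URI prefix synonyms.
    for r in records:
        for synonym in r.get("uri_prefix_synonyms", []):
            if uri.startswith(synonym):
                return r["prefix"] + ":" + uri[len(synonym):]
    return None


def expand(curie, records):
    """Always expand with the canonical uri_prefix, never with a synonym."""
    prefix, _, local_id = curie.partition(":")
    for r in records:
        if r["prefix"] == prefix:
            return r["uri_prefix"] + local_id
    return None


curie = compress("http://identifiers.org/3dmet/B00162", RECORDS)
uri = expand(curie, RECORDS)
```

With this rule, a URI written with a synonym still contracts to the same CURIE, and expanding that CURIE always yields the canonical form.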

So what exactly is in the prefix map when there are multiple prefixes and/or multiple URI prefixes?

Now even the internal sssom-schema context seems to exist in the bioregistry context. Does that mean that in this iteration we just make bioregistry the single source of truth?

So all exact prefix/URI-prefix pairs in the sssom schema context are already included in the bioregistry context, even if you don't merge the contxt variable?

OK, I think this is basically the same as what was there before, when we were using the bimap. Let's merge this and do the bigger refactoring, relying entirely on the epm, in a subsequent PR.

I think using uri_prefix should basically replicate exactly what was there before.

Maybe sanity check some cases.

matentzn · 2023-07-19T20:11:22Z

src/sssom/context.py

@@ -97,7 +97,7 @@ def get_default_metadata() -> Metadata:
    :return: Metadata
    """
    contxt = get_jsonld_context()


This is the sssom schema prefix map, I guess.

Yes ... And this will be gone in the next PR when we entirely rely on bioregistry.

OK, but if we do this, I want a hard-coded test in the testing framework that ensures that, whenever the epm is updated, the built-in prefixes are exactly what we expect them to be (which is what is in the sssom context).
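One possible shape for such a check is sketched below. The expected pairs are an illustrative subset chosen for this example, not the actual sssom context, and `check_built_in_prefixes` is a hypothetical helper rather than an existing sssom function.

```python
# Illustrative subset only; a real test would hard-code the full sssom context.
EXPECTED_BUILT_IN = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}


def check_built_in_prefixes(epm_prefix_map, expected=EXPECTED_BUILT_IN):
    """Return {prefix: (expected_uri, actual_uri)} for every mismatch."""
    return {
        prefix: (uri_prefix, epm_prefix_map.get(prefix))
        for prefix, uri_prefix in expected.items()
        if epm_prefix_map.get(prefix) != uri_prefix
    }


# A map that agrees with the expectations, plus an unrelated extra prefix.
good = dict(EXPECTED_BUILT_IN, ex="http://example.org/")
# A map where an epm update changed one of the built-in URI prefixes.
drifted = dict(good, skos="http://example.org/skos/")

mismatches_good = check_built_in_prefixes(good)
mismatches_drifted = check_built_in_prefixes(drifted)
```

In a test suite this could be asserted as "no mismatches", so that an epm update which changes a built-in prefix fails loudly and names the affected prefix.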

matentzn · 2023-07-19T20:12:44Z

src/sssom/context.py

-                        f"{key} is already in prefix map ({prefix_map[key]}, but with a different value than {v}"
-                    )
+
+    prefix_map.update({(k, v) for k, v in contxt_external.items() if k not in prefix_map})


OK, I think this is basically the same as what was there before, when we were using the bimap. Let's merge this and do the bigger refactoring, relying entirely on the epm, in a subsequent PR.

Use bioregistry as a resource for external prefix map via curies

1f42b47

hrshdhgd marked this pull request as ready for review July 19, 2023 16:08

hrshdhgd requested a review from matentzn July 19, 2023 16:08

hrshdhgd mentioned this pull request Jul 19, 2023

PR #345 continued monarch-initiative/mondo-ingest#347

Merged

4 tasks

removed unnecessary variable

66c5d7a

matentzn reviewed Jul 19, 2023

hrshdhgd added 3 commits July 19, 2023 14:22

Added a todo

4e6ee60

commented

81c7699

linted

a3f5426

matentzn approved these changes Jul 19, 2023

matentzn mentioned this pull request Jul 19, 2023

Replace CURIE inference mechanism with curies.Converter.from_extended_prefix_map #363

Closed

hrshdhgd merged commit e266188 into master Jul 20, 2023
6 checks passed

hrshdhgd deleted the issue-395 branch July 20, 2023 13:11
