-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use bioregistry as a resource for external prefix map via curies
#396
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Beautiful PR! I think we need to move a bit more of our logic to curies, because the epm the way that you parse it will not be able to distinguish between preferred and non preferred options.
f"{key} is already in prefix map ({prefix_map[key]}, but with a different value than {v}" | ||
) | ||
|
||
prefix_map.update({(k, v) for k, v in contxt_external.items() if k not in prefix_map}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the only line in this PR I can judge, the rest looks great.
In a second PR, we should consider dropping our expansion/contraction code and replacing it entirely with curies
(basically, we would not need get_built_in_prefix_map() anymore) - we just create our Converter
the way we want and then use the contract
and expand
methods.
For this here: can you tell me what part of the epm ends up in converter.prefix_map? Can you find a single example in epm with multiple prefixes and multiple uri_prefixes, and print the respective rows in the prefixmap?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So basically, since there is so much stuff in epm (multiple URI expansions), we need to be much more careful in how we use the content of the epm. We cant just use the prefixmap, because we lose which prefix should be preferred when their are multiple "fitting URIs". I think you should basically import the converter independently to parsers.py, and use it in place of the prefixmap logic. In "get_built_in_prefix_map()", you just leave the contxt, and drop that "external" thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this here: can you tell me what part of the epm ends up in converter.prefix_map? Can you find a single example in epm with multiple prefixes and multiple uri_prefixes, and print the respective rows in the prefixmap?
Every prefix
seems to be unique and there are no duplicates (AFAIK). For each prefix there seems to be one and only one uri_prefix
. The uri_prefix
is the default value for the prefix_map
. For e.g.:
'3dmet': 'http://www.3dmet.dna.affrc.go.jp/cgi/show_data.php?acc='
is the prefix_map for the reference:
{
"prefix": "3dmet",
"uri_prefix": "http://www.3dmet.dna.affrc.go.jp/cgi/show_data.php?acc=",
"uri_prefix_synonyms": [
"http://bio2rdf.org/3dmet:",
"http://bioregistry.io/3dmet:",
"http://identifiers.org/3dmet/",
"http://identifiers.org/3dmet:",
"http://n2t.net/3dmet:",
"https://bio2rdf.org/3dmet:",
"https://bioregistry.io/3dmet:",
"https://identifiers.org/3dmet/",
"https://identifiers.org/3dmet:",
"https://n2t.net/3dmet:",
"https://www.3dmet.dna.affrc.go.jp/cgi/show_data.php?acc="
]
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cant just use the prefixmap, because we lose which prefix should be preferred when their are multiple "fitting URIs".
Yes, you are correct. The uri_prefix_synonyms
get unnoticed in the process I propose in this PR. This will need a major overhaul of the code to be entirely dependent on curies.Converter
.
In a prefix_map
which is a dict , the key
which is the CURIE prefix has to be unique. How do we decide which one to choose: uri_prefix
vs any one from the uri_prefix_synonyms
? This should only be considered when a URI does not match the default URI prefix, then we look at the uri_prefix_synonyms
to get the prefix. By design, the URI prefix should be whatever is in the uri_prefix
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the prefix map, what exactly is in it when there are multiple prefixes or and multiple uri prefixes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now even the internal sssom-schema context seems to exist in the bioregistry
context. Does that mean this iteration we just make one source of truth i.e. bioregistry
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All exaxt prefix uri prefix pairs in the sssom schema context are already included in the bioregistry context, even if you don't merge the contxt variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I think this is basically the same as what was there before then when using the bimap. Lets merge this, and do the bigger refactoring relying entirely on the epm in a subsequent PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using uri_prefix
should basically replicate exactly what was there before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe sanity check some cases.
@@ -97,7 +97,7 @@ def get_default_metadata() -> Metadata: | |||
:return: Metadata | |||
""" | |||
contxt = get_jsonld_context() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the sssom schema prefix map i guesss
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes ... And this will be gone in the next PR when we entirely rely on bioregistry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, but if we do this, I want a hard coded test in the testing framework that ensures that the whenever the epm is updated, the built-in prefixes are exactly what we expect them to be (which is what is in sssom context).
f"{key} is already in prefix map ({prefix_map[key]}, but with a different value than {v}" | ||
) | ||
|
||
prefix_map.update({(k, v) for k, v in contxt_external.items() if k not in prefix_map}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I think this is basically the same as what was there before then when using the bimap. Lets merge this, and do the bigger refactoring relying entirely on the epm in a subsequent PR.
Fixes #395