HGNC as use case of multiple identifier complexities #19
Comments
We do not have a single URL for the different "IDs". I am currently redesigning the whole of the HGNC site and will therefore bear this in mind. I have read the whole issue above, so let me try to clarify some points from an HGNC perspective:

1.) The symbol URL should be avoided. It is only meant for teams that do not store the HGNC ID of a gene symbol report but still need to link to us. This approach is problematic because symbols can change over time while HGNC IDs do not. If you have the IDs, forget about the symbols for links.

2.) As discussed in the previous message, we do allow HGNC IDs to be entered using just the number; however, officially an HGNC ID for a gene symbol report is HGNC:\d+ and should be quoted as such. For historical reasons, and because a few resources keep stripping off the HGNC: prefix, we maintain a way to link to our gene symbol reports without the prefix, but again we strongly suggest using the HGNC: prefix.

3.) identifiers.org does not have the most up-to-date information regarding our resource. In January 2015 we replaced our previous gene families resource (http://www.genenames.org/genefamilies/a-z) with a new gene families resource which resides at a different URL (http://www.genenames.org/cgi-bin/genefamilies/). Identifiers.org does not show the new URL, and although a frozen version of the old gene family reports is still available to view, we warn people within those reports that they are not up to date and will be removed in future. I have contacted identifiers.org to update the hgnc.family rules to point to the new resource. The family IDs are now numerical and are not symbols (which in our mind was misleading, as families don't officially have symbols, only genes), and just like the gene symbol reports the ID will never change, stabilising the URL. The new rule we have asked identifiers.org to add is one where $id is a numerical value.
As I mentioned above, I am currently rebuilding the website and will bear in mind adding a URL that resolves to the correct report based on the ID given. However, to do this we will need to be strict about the HGNC: prefix for gene symbols and may have to create a prefix for the gene families as well, such as HGNCF:\d+ or HGNC:F\d+ etc. Identifiers.org is a great project; however, it is worth remembering that they rely on user participation to correct URLs and rules, which means that URLs and rules can be out of date, especially when certain projects (such as ours) were entered into the resource via other sources. From now on we will try to remember to notify identifiers.org of any changes that affect our URLs and IDs. Hope this helps. Kris Gray
HGNC is an example collection with four co-occurring identifier complexities:
1. Ambiguity about what $id even is.
The identifiers.org record above captures the fact that HGNC records exist in 3rd party databases but identifiers.org doesn't have a strong concept of a prefix; consequently it isn't possible to get to both "physical locations" of the entity using a single (equivalent) $id. In one case $id is prefixed, and in the other, it is not. HGNC, mercifully, honors both forms. However:
A stronger notion of prefix is the simplest thing that would help data integrators collapse the following as equivalent http identifiers, since 2674 is the invariant part of the ID. Given the identifiers.org data model, there is no way to determine whether http://identifiers.org/hgnc/hgnc:2674 points to the same entity as http://identifiers.org/hgnc/2674. This is why I favor developing a bare-curie based resolver like http://n2t.net/hgnc:2674 or, if identifiers.org is interested in doing so, http://identifiers.org/hgnc:2674
This would allow us to determine that all of these are talking about the same entity:
Authoritative sources:
Identifier resolvers:
Third party content providers
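The equivalence argued for above can be sketched in a few lines of Python. This is only an illustration, not how identifiers.org or n2t.net actually work: the helper name `to_bare_curie` is hypothetical, it assumes the local ID is always the last path segment of the URL, and it assumes a bare number refers to a gene report rather than a gene family.

```python
import re
from urllib.parse import urlparse

# Accepts "2674" or "hgnc:2674" (any case) and captures the invariant number.
_LOCAL_ID = re.compile(r"(?:hgnc:)?(\d+)", re.IGNORECASE)


def to_bare_curie(identifier: str) -> str:
    """Reduce a raw ID, a CURIE, or a resolver URL to the form 'hgnc:<number>'."""
    text = identifier.strip()
    if text.lower().startswith(("http://", "https://")):
        # Assumption: the local ID is the last path segment of the URL.
        text = urlparse(text).path.rstrip("/").rsplit("/", 1)[-1]
    match = _LOCAL_ID.fullmatch(text)
    if match is None:
        raise ValueError(f"not a numeric HGNC identifier: {identifier!r}")
    return f"hgnc:{match.group(1)}"


# The URL forms quoted in this issue, plus the raw and prefixed local IDs.
forms = [
    "http://identifiers.org/hgnc/hgnc:2674",
    "http://identifiers.org/hgnc/2674",
    "http://n2t.net/hgnc:2674",
    "http://identifiers.org/hgnc:2674",
    "HGNC:2674",
    "2674",
]
assert {to_bare_curie(f) for f in forms} == {"hgnc:2674"}
```

Symbols are deliberately rejected here, since they are not stable and cannot be reduced to the numeric ID without a lookup.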
2. Multiple entity types (Genes and Gene families)
3. Multiple identifier types (alphanumeric symbol and numeric ID)
4. Type-specific URL patterns combined with lack of deterministic typing in local ID
Consequently, you have to know what you're looking at before you can know where to resolve it. Note that the lack of deterministic typing in the local ID is not a problem unless you need type-specific URLs, the way HGNC does.
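To make point 4 concrete, here is a minimal sketch of what a client can and cannot infer from a local ID alone. The function name `candidate_types` and the type labels are hypothetical, and the rules are assumptions drawn from this thread: a prefixed `HGNC:<number>` is a gene ID, a bare number could be either a gene ID or a gene family ID, and anything else is treated as a symbol.

```python
import re


def candidate_types(local_id: str) -> list[str]:
    """List the entity/identifier types a local ID could plausibly denote."""
    text = local_id.strip()
    if not text:
        raise ValueError("empty identifier")
    if re.fullmatch(r"HGNC:\d+", text, re.IGNORECASE):
        # The prefix makes the type deterministic.
        return ["gene ID"]
    if re.fullmatch(r"\d+", text):
        # A bare number is ambiguous: genes and families both use numeric IDs.
        return ["gene ID", "gene family ID"]
    # Assumption: any other string is a symbol.
    return ["gene symbol"]


assert candidate_types("HGNC:2674") == ["gene ID"]
assert candidate_types("2674") == ["gene ID", "gene family ID"]
```

Whenever more than one candidate comes back, the caller cannot pick a type-specific URL pattern without outside knowledge, which is the problem described above.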
Sorry to bug you @KrisGray, you're listed on the HGNC github; could you comment as to whether there's a single URL that can be used across types of IDs in HGNC? (family, symbol, numeric ID) so that we can address at least number 4 on the list?
cc: @timclark, @jkunze