HGNC as use case of multiple identifier complexities #19

Open
jmcmurry opened this issue Apr 22, 2016 · 1 comment

jmcmurry commented Apr 22, 2016

HGNC is an example collection with four co-occurring identifier complexities:

1. Ambiguity about what $id even is.

[Screenshot of the identifiers.org record for HGNC, 2016-04-22]

The identifiers.org record above captures the fact that HGNC records exist in 3rd party databases but identifiers.org doesn't have a strong concept of a prefix; consequently it isn't possible to get to both "physical locations" of the entity using a single (equivalent) $id. In one case $id is prefixed, and in the other, it is not. HGNC, mercifully, honors both forms. However:

  1. Other data providers may not be as forgiving as HGNC is.
  2. More often than not, variation in the local ID pattern is precisely what the data provider relies on in order to redirect to the right type-specific path.

A stronger notion of prefix is the simplest thing that would help data integrators collapse the following as equivalent HTTP identifiers, since 2674 is the invariant part of the ID.

Given the identifiers.org data model, there is no way to determine whether http://identifiers.org/hgnc/hgnc:2674 points to the same entity as http://identifiers.org/hgnc/2674. This is why I favor developing a bare-CURIE-based resolver like http://n2t.net/hgnc:2674 or, if identifiers.org is interested in doing so, http://identifiers.org/hgnc:2674

This would allow us to determine that all of these are talking about the same entity:

Authoritative sources:
Identifier resolvers:
Third party content providers
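
To make the equivalence argument in item 1 concrete, here is a minimal sketch of collapsing the resolver URLs quoted above to one bare CURIE. The helper name `to_curie`, the lowercase `hgnc:` output form, and the "last path segment" heuristic are assumptions for illustration, not part of identifiers.org or n2t.net; the digit limit is borrowed from the identifiers.org `hgnc` pattern.

```python
import re

# Optional "hgnc:" prefix (any case) followed by the invariant numeric part.
HGNC_ID = re.compile(r"^(?:hgnc:)?(\d{1,5})$", re.IGNORECASE)


def to_curie(identifier):
    """Return 'hgnc:<number>' for a recognised HGNC gene ID form, else None.

    Hypothetical helper: only looks at the last path segment, so it handles
    resolver-style URLs and bare IDs, not provider URLs with query strings.
    """
    local_id = identifier.rstrip("/").rsplit("/", 1)[-1]
    match = HGNC_ID.match(local_id)
    if match is None:
        return None
    return "hgnc:" + match.group(1)


# All four forms collapse to the same key:
#   to_curie("http://identifiers.org/hgnc/hgnc:2674")  -> "hgnc:2674"
#   to_curie("http://identifiers.org/hgnc/2674")       -> "hgnc:2674"
#   to_curie("http://n2t.net/hgnc:2674")               -> "hgnc:2674"
#   to_curie("HGNC:2674")                              -> "hgnc:2674"
```

Note that this only works because the caller already knows the input is an HGNC gene ID; without a strong notion of prefix, a bare `2674` says nothing about which collection it belongs to.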

2. Multiple entity types (Genes and Gene families)

| Identifiers.org namespace | regex | URI |
| --- | --- | --- |
| hgnc | ^((HGNC or hgnc):)?\d{1,5}$ | http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=$id [Example: 2674] |
| hgnc.family | ^[A-Z0-9-]+(#[A-Z0-9-]+)?$ | http://www.genenames.org/genefamilies/$id [Example: PADI] |
| hgnc.symbol | ^[A-Za-z-0-9_]+(@)?$ | http://www.genenames.org/cgi-bin/gene_symbol_report?match=$id [Example: DAPK1] |

3. Multiple identifier types (alphanumeric symbol and numeric ID)

4. Type-specific URL patterns combined with lack of deterministic typing in local ID

Consequently, you have to know what you're looking at before you can know where to resolve it. Note that the lack of deterministic typing in the local ID is not a problem unless you need type-specific URLs the way HGNC does.
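
A rough illustration of item 4, using the three identifiers.org rules listed under item 2: a local ID alone often matches more than one pattern, so the pattern cannot pick the URL for you. The names `NAMESPACES`, `candidate_namespaces`, and `resolve` are made up for this sketch, and reading "(HGNC or hgnc)" as a regex alternation is an assumption.

```python
import re

# namespace -> (pattern, URL template), copied from the identifiers.org rules.
# Assumption: "(HGNC or hgnc)" is written here as the alternation (HGNC|hgnc).
NAMESPACES = {
    "hgnc": (
        r"^((HGNC|hgnc):)?\d{1,5}$",
        "http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=$id",
    ),
    "hgnc.family": (
        r"^[A-Z0-9-]+(#[A-Z0-9-]+)?$",
        "http://www.genenames.org/genefamilies/$id",
    ),
    "hgnc.symbol": (
        r"^[A-Za-z-0-9_]+(@)?$",
        "http://www.genenames.org/cgi-bin/gene_symbol_report?match=$id",
    ),
}


def candidate_namespaces(local_id):
    """List every namespace whose pattern accepts this local ID."""
    return [
        namespace
        for namespace, (pattern, _template) in NAMESPACES.items()
        if re.match(pattern, local_id)
    ]


def resolve(namespace, local_id):
    """Fill the URL template; the caller must already know the namespace."""
    _pattern, template = NAMESPACES[namespace]
    return template.replace("$id", local_id)


# candidate_namespaces("PADI")      -> ["hgnc.family", "hgnc.symbol"]
# candidate_namespaces("2674")      -> ["hgnc", "hgnc.family", "hgnc.symbol"]
# candidate_namespaces("HGNC:2674") -> ["hgnc"]
```

In this sketch only the prefixed form is unambiguous; a bare `2674` satisfies all three patterns, which is exactly why the type has to be known before the URL can be chosen.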


Sorry to bug you @KrisGray, you're listed on the HGNC GitHub; could you comment as to whether there's a single URL that can be used across types of IDs in HGNC (family, symbol, numeric ID), so that we can address at least number 4 on the list?

cc: @timclark, @jkunze

@KrisGray

We do not have a single URL for the different "IDs". I am currently redesigning the whole HGNC site and will therefore bear this in mind.

I have read the whole issue above, so let me try to clarify some points from an HGNC perspective:

1.) The symbol URL should be avoided. This is only to be used by teams that do not store the HGNC ID of a gene symbol report but say they need to link to us. This approach is problematic as symbols can change over time while HGNC IDs do not. If you have the IDs, forget about the symbols for links.

2.) As discussed in the previous message, we do allow HGNC IDs to be entered using just the number; however, we do say that officially an HGNC ID for a gene symbol report is HGNC:\d+ and should be quoted as such. For historical reasons, and because a few resources keep stripping off the HGNC: prefix, we maintain a way to link to our gene symbol reports without the prefix, but again we strongly suggest using the HGNC: prefix.

3.) identifiers.org does not have the most up-to-date information regarding our resource. In January 2015 we replaced our previous gene families resource (http://www.genenames.org/genefamilies/a-z) with a new gene families resource, which resides at a different URL (http://www.genenames.org/cgi-bin/genefamilies/). Identifiers.org does not show the new URL. Although a frozen version of the old gene family reports is still available to view, we warn people within the reports that they are not up to date and will be removed in future. I have contacted identifiers.org to update the hgnc.family rules to point to the new resource. The family IDs are now numerical and are not symbols (which in our mind was misleading, as families don't officially have symbols, only genes) and, just like the gene symbol report IDs, a family ID will never change, which stabilises the URL. The new rule we have asked identifiers.org to add is

http://www.genenames.org/cgi-bin/genefamilies/set/$id

where $id is a numerical value.

As I have mentioned above, I am currently rebuilding the website and will bear in mind adding a URL that will resolve to the correct report based on the ID given. However, to do this we will need to be strict on the HGNC: prefix for gene symbols and may have to create a prefix for the gene families as well, such as HGNCF:\d+ or HGNC:F\d+ etc.
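
For readers following along, a single entry point of this kind might be sketched as below. This is only an illustration under assumptions: `HGNCF:` is one of the candidate family prefixes floated above and has not been adopted, `route` and `ROUTES` are hypothetical names, and the family number in the usage note is a placeholder rather than a real family ID. The target URLs are the gene symbol report and gene family URLs quoted in this thread.

```python
import re

# (pattern, URL template) pairs; the prefix alone decides the report type.
# Assumption: "HGNCF:" is a hypothetical family prefix, not an official one.
ROUTES = [
    (
        re.compile(r"^HGNC:(\d+)$"),
        "http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=HGNC:{0}",
    ),
    (
        re.compile(r"^HGNCF:(\d+)$"),
        "http://www.genenames.org/cgi-bin/genefamilies/set/{0}",
    ),
]


def route(identifier):
    """Return the report URL for a strictly prefixed ID, or None."""
    for pattern, template in ROUTES:
        match = pattern.match(identifier)
        if match:
            return template.format(match.group(1))
    # Unprefixed or unknown IDs are rejected rather than guessed at.
    return None


# route("HGNC:2674") -> gene symbol report URL with hgnc_id=HGNC:2674
# route("HGNCF:123") -> gene family URL .../genefamilies/set/123 (placeholder number)
# route("2674")      -> None, because strict prefixes are required
```

Being strict about the prefix is what makes the typing deterministic: the same number can then exist as both a gene ID and a family ID without any ambiguity about which report to show.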

Identifiers.org is a great project; however, it is worth remembering that they rely on user participation to correct URLs and rules, which means that URLs and rules can be out of date, especially when certain projects (such as ourselves) were entered into the resource via other sources. From now on we will try to remember to notify identifiers.org of any changes that affect our URLs and IDs.

Hope this helps.

Kris Gray
