Expand compact identifiers to concept name with resolved hyperlink #220
Comments
The only difference I might consider in syntax is to be prefixed with the
Resolving names is a total pain in the butt! One solution could be to use the Ontology Lookup Service (OLS), but there are lots of discrepancies between what's indexed there and what's allowed in identifiers.org. One HUGE omission from the OLS that makes it far less viable as a solution is their refusal to index HGNC, Entrez, and other gene nomenclatures.
I've been working on making a generalizable server to index identifier->name mappings myself because I care a lot about this problem. I don't have any infrastructure to host a service, but as I mentioned in that other comment thread, PyOBO might be able to do the job. Maybe the next step I could take would be to put the code in place in a new repo that will use the whole pipeline to make a super-sized 3-column TSV with prefix, identifier, and name in each row that I could post to figshare, zenodo, or wherever. Then, anyone could consume it and build a simple lookup service on top. I could also make a small demo lookup service with Flask and an in-memory Python dictionary as a key-value store. I guess we'd have to keep in mind that this solution comes with the caveat that it's another service that needs external maintenance...
Maybe for the purposes of Manubot we could do a demo roll-out of this feature that just does ChEBI and HGNC, for example, and downloads the data from the source on the fly for each build (thus requiring no databases or web services). I would guess that on the manubot/python side, it wouldn't be so hard to have a function that mixes and matches what sources it uses to look up different CURIEs.
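To make the TSV idea concrete, here is a minimal sketch of loading a 3-column prefix/identifier/name file into an in-memory dictionary and looking up a CURIE. It uses only the Python standard library and is not PyOBO's actual code; the function names `load_names` and `lookup` and the two sample rows are assumptions made up for illustration.

```python
import csv
import io

# Hypothetical two-row sample of the 3-column TSV (prefix, identifier, name).
SAMPLE_TSV = "chebi\t53438\tiron(3+) sulfate\ndoid\t14330\tParkinson's disease\n"


def load_names(handle):
    """Read prefix/identifier/name rows into a nested dictionary."""
    names = {}
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        prefix, identifier, name = row
        names.setdefault(prefix.lower(), {})[identifier] = name
    return names


def lookup(names, curie):
    """Return the name for a CURIE such as 'CHEBI:53438', or None if unknown."""
    prefix, _, identifier = curie.partition(":")
    return names.get(prefix.lower(), {}).get(identifier)


names = load_names(io.StringIO(SAMPLE_TSV))
print(lookup(names, "CHEBI:53438"))
```

Keying the outer dictionary by lowercased prefix is one possible way to let different sources be mixed and matched per prefix; a real build would read the downloaded file instead of the inline sample.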
Update, I wrote a "resolver service" and added it here biopragmatics/pyobo@03329b7. It wasn't so much work since I already wrote the code that did the heavy lifting :) If you want to try it out, you can do
pip install git+https://github.com/pyobo/pyobo
python -m pyobo.apps.resolver
Note: it will take some time to download/parse resources the first time you ask for them. But after that it's relatively quick since it caches the id->name mappings as 2-column TSVs in the
I was wondering, though, what would be the licensing ramifications of making the super-TSV that I described before?
I see you've done a fair amount of manual work in
It would be great to rely on a service like this. There's not a huge number of dependencies, so we could make it an (optional) manubot dependency. Another option would be to set up a public server / API. @cthoyt what are your thoughts? Do you know of any ways we could turn this into a public API with very little maintenance required?
At least in the U.S., I don't think this file would be subject to copyright as it's rather factual, just like how a phone directory has been found not to be copyrightable. In addition, your use would likely fall under "fair use".
I'm looking into hosting this service on AWS. In the meantime, I've written a pipeline for creating the aforementioned TSV. I posted the results to . Lots of Rihanna references included.
Awesome. Linking to your blog post on the name resolver as well:
@cthoyt and I chatted today and he demonstrated his Ooh Na Na API for resolving CURIE names. It's currently hosted on an AWS instance whose IP address is aliased at http://curie.manubot.org/. Example output:
{
  "identifier": "14330",
  "miriam": "https://identifiers.org/DOID:14330",
  "name": "Parkinson's disease",
  "prefix": "doid",
  "query": "DOID:14330",
  "success": true
}
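As a rough illustration of how a client might consume a response like the one above, the sketch below parses the JSON and builds a markdown link from the `name` and `miriam` fields. It makes no network request. The helper name `to_markdown_link` is hypothetical, and the shape of a failed lookup (only `"success": false`) is an assumption, since the thread shows a successful response only.

```python
import json

# Example payload copied from the comment above.
RESPONSE = """{
  "identifier": "14330",
  "miriam": "https://identifiers.org/DOID:14330",
  "name": "Parkinson's disease",
  "prefix": "doid",
  "query": "DOID:14330",
  "success": true
}"""


def to_markdown_link(payload):
    """Build '[name](miriam URL)' from a resolver response, or None on failure."""
    record = json.loads(payload)
    if not record.get("success"):
        return None  # assumed failure shape: success is false or missing
    return "[{}]({})".format(record["name"], record["miriam"])


print(to_markdown_link(RESPONSE))
```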
As far as syntax goes, it would be a lot easier to support the following via a pandoc filter:
we used [iron](CHEBI:53438) to do the experiment
which could get converted to
we used [iron(3+) sulfate](https://identifiers.org/CHEBI:53438) to do the experiment
But there is also a question of whether we want to convert all named entities to hyperlinks. Or perhaps rather we should add a tooltip / hover text. Before going further, it would be good to know:
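The conversion proposed in this comment can be sketched on plain markdown text as follows. This is not a pandoc filter: it is a regex-based stand-in that shows the intended substitution only, whereas a real filter would operate on pandoc's AST. The `NAMES` table, the `CURIE_LINK` pattern, and the function name `expand_curies` are all assumptions for illustration; a real implementation would query a resolver rather than a hard-coded dictionary.

```python
import re

# Hypothetical lookup table; a real filter would query a resolver instead.
NAMES = {"CHEBI:53438": "iron(3+) sulfate"}

# Matches [text](PREFIX:ID). The character after the colon must not be "/",
# so ordinary URLs such as https://... are not treated as CURIEs.
CURIE_LINK = re.compile(r"\[([^\]]*)\]\(([A-Za-z][\w.]*:[^\s)/][^\s)]*)\)")


def expand_curies(text, names=NAMES):
    """Replace [label](CURIE) with [concept name](identifiers.org URL)."""

    def replace(match):
        curie = match.group(2)
        name = names.get(curie)
        if name is None:
            return match.group(0)  # leave unknown targets untouched
        return "[{}](https://identifiers.org/{})".format(name, curie)

    return CURIE_LINK.sub(replace, text)


print(expand_curies("we used [iron](CHEBI:53438) to do the experiment"))
```

Leaving unresolved targets unchanged means a failed lookup degrades to the author's original text instead of breaking the build.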
Most documents I have worked with were in XML format; however, looking at PubTator Central's example, the main idea is to surround text with a mark tag and provide at least the entity ID. If you want to make future documents text-mining friendly, I recommend including other information such as the type of entity and the offset of the mention itself. Ex:
We used <mark data-identifier='CHEBI:53438' data-offset=10 data-type='compound'> iron </mark> to do the experiment
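A small sketch of producing that kind of annotation follows. The function name `mark_mention` is hypothetical, and the offset convention used here (zero-based character position of the mention in the plain sentence) is an assumption that may not match PubTator Central's own convention.

```python
from html import escape


def mark_mention(text, mention, identifier, entity_type):
    """Wrap the first occurrence of `mention` in a <mark> tag with its offset."""
    offset = text.find(mention)
    if offset < 0:
        return escape(text)  # mention not present: return the escaped text only
    tag = "<mark data-identifier='{}' data-offset='{}' data-type='{}'>{}</mark>".format(
        escape(identifier), offset, escape(entity_type), escape(mention)
    )
    # Escape the surrounding text so the result is safe to embed in HTML.
    return escape(text[:offset]) + tag + escape(text[offset + len(mention):])


print(mark_mention("We used iron to do the experiment", "iron", "CHEBI:53438", "compound"))
```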
Thanks @danich1! Looks like the HTML
Here we'd want the output HTML to be
we used <data value="CHEBI:53438">iron(3+) sulfate</data> to do the experiment
The browser shows the
This seems pretty aligned with what we want to accomplish, but still does not indicate the "machine-readable translation" is a CURIE. I am hoping there is a standard we can find for this.
CC @andrewsu, @cmungall, @jmcmurry: we're looking for a semantic annotation standard for noting compact identifiers in HTML. Any insights (also tweeted)? @vincerubinetti any frontend constraints or ways you think we should proceed?
@dhimmel so the idea is that you could write a link however you wanted, then it would replace the name with the standard? I like the
One of the things I've put on my low-priority todo list is to include the data version with all prefix/identifier/name pairs (re: biopragmatics/pyobo#58).
Regarding @danich1's point about entity types - this is really really hard. Ontologies have hierarchies, but they don't all inherit from a standardized type vocabulary. For ChEBI, it might be possible to assign all things as either chemicals or roles based on the top level terms, but this kind of annotation would have to be done on a database/ontology basis. Do you have a controlled vocabulary of entity types that you prefer (like SBO, for example)? What will happen for entities that don't fall in that vocabulary?
@cthoyt implementing this will require decisions at levels:
We could add an HTML attribute in 2 with version information if available. Not a priority at this point, but if it's there, we'll find a way to preserve it.
Let's skip entity type for now. As long as a machine can extract IDs, entity types could always be assigned later by downstream users.
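As one illustration of a machine extracting IDs, the sketch below pulls identifier/text pairs out of HTML that uses the `<data value="...">` markup discussed earlier in this thread. It relies only on the standard library's `html.parser`; the class name `DataValueExtractor` and the helper `extract_identifiers` are hypothetical, and nested `<data>` elements are not handled.

```python
from html.parser import HTMLParser


class DataValueExtractor(HTMLParser):
    """Collect (value, text) pairs from <data value="..."> elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._value = None  # value attribute of the <data> element being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "data":
            self._value = dict(attrs).get("value")
            self._text = []

    def handle_data(self, data):
        if self._value is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "data" and self._value is not None:
            self.pairs.append((self._value, "".join(self._text)))
            self._value = None


def extract_identifiers(html):
    """Return a list of (identifier, displayed text) tuples found in `html`."""
    parser = DataValueExtractor()
    parser.feed(html)
    parser.close()
    return parser.pairs


print(extract_identifiers(
    'we used <data value="CHEBI:53438">iron(3+) sulfate</data> to do the experiment'
))
```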
Does using templating open up any better options for the markdown syntax? I haven't thought of any specific syntax proposals that I like. Something like
Possibly it'd be easier to implement, but I'd much rather have this as a pandoc filter like A pandoc filter should be able to handle any syntax, although figuring out how to modify the AST is not necessarily easy.
Just checking in - I finally got the über resolver service working properly at http://biolookup.io/. I know we sort of dropped the discussion for a year, but I'm sure you'll all be interested in this!
Suggested by @cthoyt in #218.
The proposal is to support including Compact Identifiers (CURIEs) in markdown like:
and have the rendered manuscript show
We could select any syntax, but the above one fits well with pandoc's existing link syntax, which has some magic like the implicit_header_references extension. It also is the syntax shown on the Wikipedia article, which mentions the brackets produce a "safe CURIE". @cthoyt do you have any comments on syntax / are you aware of alternatives?
I'm envisioning the substitution and hyperlinking occurring in a pandoc filter.
One thing I'm not sure of yet is how to get the concept name to replace the CURIE with. For example, iron(3+) sulfate above. We could use our existing citation infrastructure, which would use Zotero's translation-server to get the CSL JSON "title" for webpages. @cthoyt do you know of any ways to get standardized metadata for all CURIEs where we could reliably retrieve the concept name?