Expand compact identifiers to concept name with resolved hyperlink #220

Open
dhimmel opened this issue Apr 16, 2020 · 15 comments

Comments

@dhimmel
Member

dhimmel commented Apr 16, 2020

Suggested by @cthoyt in #218.

The proposal is to support including Compact Identifiers (CURIEs) in markdown like:

we used [CHEBI:53438] to do the experiment

and have the rendered manuscript show

we used iron(3+) sulfate to do the experiment

We could select any syntax, but the above one fits well with pandoc's existing link syntax, which has some magic like the implicit_header_references extension. It also is the syntax shown on the Wikipedia article, which mentions the brackets produce a "safe CURIE". @cthoyt do you have any comments on syntax / are you aware of alternatives?

I'm envisioning the substitution and hyperlinking occurring in a pandoc filter.
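As a rough illustration of the substitution step, here is a minimal text-level sketch. It uses a regular expression on the markdown source rather than a real pandoc filter, and both the pattern and the `names` lookup table are assumptions made for this example, not existing Manubot code.

```python
import re

# Assumed pattern: [PREFIX:ID] that is not already followed by "(",
# so existing links such as [text](url) are left alone.
# Real CURIE prefixes and identifiers may be more varied than this.
CURIE_PATTERN = re.compile(r"\[([A-Za-z][A-Za-z0-9._]*):([A-Za-z0-9._-]+)\](?!\()")


def expand_curies(text, names):
    """Replace [PREFIX:ID] with a markdown link labelled by the concept name.

    `names` is a hypothetical dict mapping CURIE strings to concept names.
    CURIEs without a known name are left untouched.
    """

    def replace(match):
        curie = f"{match.group(1)}:{match.group(2)}"
        name = names.get(curie)
        if name is None:
            return match.group(0)
        return f"[{name}](https://identifiers.org/{curie})"

    return CURIE_PATTERN.sub(replace, text)
```

Calling `expand_curies("we used [CHEBI:53438] to do the experiment", {"CHEBI:53438": "iron(3+) sulfate"})` yields a sentence whose link text is the concept name and whose target is the identifiers.org URL.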

One thing I'm not sure of yet is how to get the concept name to replace the CURIE with. For example, iron(3+) sulfate above. We could use our existing citation infrastructure, which would use Zotero's translation-server to get the CSL JSON "title" for webpages. @cthoyt do you know of any ways to get standardized metadata for all CURIEs where we could reliably retrieve the concept name?

@cthoyt
Contributor

cthoyt commented Apr 16, 2020

The only difference I might consider in syntax is to be prefixed with the @ symbol, like the other references to pubmed and so on (e.g., [@CHEBI:53438]).

Resolving names is a total pain in the butt! One solution could be to use the Ontology Lookup Service (OLS), but there are lots of discrepancies between what's indexed there and what's allowed in identifiers.org. One HUGE omission from the OLS that makes it a way less viable solution is their refusal to index HGNC, Entrez, and other gene nomenclatures.

I've been working on making a generalizable server to index identifier->name mappings myself because I care a lot about this problem. I don't have any infrastructure to host a service, but as I mentioned in that other comment thread, PyOBO might be able to do the job.

Maybe the next step I could take would be to put the code in place in a new repo that will use the whole pipeline to make a super-sized 3-column TSV with prefix, identifier, and name in each row that I could post to figshare, zenodo, or wherever. Then, anyone could consume it and build a simple lookup service on top. I could also make a small demo lookup service with Flask and an in-memory python dictionary as a key-value store. I guess we'd have to keep in mind that this solution comes with the caveat that it's another service that needs external maintenance...
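A minimal sketch of consuming such a TSV as an in-memory dictionary, assuming a headerless file with exactly three tab-separated columns (prefix, identifier, name). The function names and the two sample rows (taken from examples elsewhere in this thread) are illustrative only.

```python
import csv
import io


def load_names(tsv_file):
    """Build a {(prefix, identifier): name} dict from a 3-column TSV."""
    names = {}
    # QUOTE_NONE keeps quote characters inside names as literal text.
    for row in csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        prefix, identifier, name = row
        names[(prefix.lower(), identifier)] = name
    return names


# Stand-in for the real file: two illustrative rows.
sample = io.StringIO(
    "chebi\t53438\tiron(3+) sulfate\n"
    "doid\t14330\tParkinson's disease\n"
)
NAMES = load_names(sample)


def lookup(curie):
    """Return the name for a CURIE such as 'CHEBI:53438', or None if unknown."""
    prefix, _, identifier = curie.partition(":")
    return NAMES.get((prefix.lower(), identifier))
```

Keying on a lowercased prefix is a design choice for this sketch so that `CHEBI:53438` and `chebi:53438` resolve to the same entry.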

Maybe for the purposes of Manubot we could do a demo roll-out of this feature that just does ChEBI and HGNC, for example, and downloads the data from the source on the fly for each build (thus requiring no databases or web services). I would guess that on the manubot/python side, it wouldn't be so hard to have a function that mixes and matches what sources it uses to look up different CURIEs.
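One way such a mix-and-match function could look is a registry of per-prefix resolvers. This is only a sketch: `register`, `get_name`, and the tiny lookup tables are hypothetical, and the tables reuse ChEBI and DOID examples from this thread in place of data downloaded from each source. HGNC would be registered the same way.

```python
from typing import Callable, Dict, Optional

# A resolver takes a local identifier and returns a name, or None if unknown.
Resolver = Callable[[str], Optional[str]]

RESOLVERS: Dict[str, Resolver] = {}


def register(prefix: str, resolver: Resolver) -> None:
    """Associate a prefix (case-insensitive) with a lookup function."""
    RESOLVERS[prefix.lower()] = resolver


def get_name(curie: str) -> Optional[str]:
    """Dispatch a CURIE to the resolver registered for its prefix."""
    prefix, sep, identifier = curie.partition(":")
    resolver = RESOLVERS.get(prefix.lower())
    if not sep or resolver is None:
        return None  # not a CURIE, or no source registered for this prefix
    return resolver(identifier)


# Placeholder sources; a real build would download and parse each resource.
register("chebi", {"53438": "iron(3+) sulfate"}.get)
register("doid", {"14330": "Parkinson's disease"}.get)
```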

@cthoyt
Contributor

cthoyt commented Apr 16, 2020

Update, I wrote a "resolver service" and added it here biopragmatics/pyobo@03329b7. It wasn't so much work since I already wrote the code that did the heavy lifting :)

If you want to try it out, you can do

pip install git+https://github.com/pyobo/pyobo
python -m pyobo.apps.resolver

Note - it will take some time to download/parse resources the first time you ask for them. But after that it's relatively quick since it caches the id->name mappings as 2 column TSVs in the ~/.obo directory.

I was wondering, though, what would be the licensing ramifications of making the super-TSV that I described before?

@dhimmel
Member Author

dhimmel commented Apr 17, 2020

I see you've done a fair amount of manual work in metaregistry.json.

It would be great to rely on a service like this. There's not a huge number of dependencies, so we could make it an (optional) manubot dependency. Another option would be to set up a public server / API.

@cthoyt what are your thoughts? Do you know of any ways we could turn this into a public API with very little maintenance required?

what would be the licensing ramifications of making the super-TSV that I described before

At least in the U.S., I don't think this file would be subject to copyright as it's rather factual, just like how a phone directory has been found not to be copyrightable. In addition, your use would likely fall under "fair use".

@cthoyt
Contributor

cthoyt commented Apr 18, 2020

I'm looking into hosting this service on AWS.

In the meantime, I've written a pipeline for creating the aforementioned TSV. I posted the results to DOI. Lots of Rihanna references included.

@dhimmel
Member Author

dhimmel commented Apr 20, 2020

Awesome. Linking to your blog post on the name resolver as well:

@dhimmel
Member Author

dhimmel commented Apr 23, 2020

@cthoyt and I chatted today and he demonstrated his Ooh Na Na API for resolving CURIE names. It's currently hosted on an AWS instance whose IP address is aliased at http://curie.manubot.org/. Example output:

{
    "identifier": "14330",
    "miriam": "https://identifiers.org/DOID:14330",
    "name": "Parkinson's disease",
    "prefix": "doid",
    "query": "DOID:14330",
    "success": true
}
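A small sketch of how a client might consume a response shaped like the example output above. It parses only the JSON payload; the request itself is omitted because the endpoint path is not given here. Treating `"success": false` as the failure signal is an assumption, and `name_and_link` is a hypothetical helper.

```python
import json

# The example output from this comment, as a JSON string.
EXAMPLE = """
{
    "identifier": "14330",
    "miriam": "https://identifiers.org/DOID:14330",
    "name": "Parkinson's disease",
    "prefix": "doid",
    "query": "DOID:14330",
    "success": true
}
"""


def name_and_link(payload):
    """Return (name, url) from a resolver response, or None on failure."""
    record = json.loads(payload)
    if not record.get("success"):
        return None  # assumed: unsuccessful lookups report success = false
    return record["name"], record["miriam"]
```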

@dhimmel
Member Author

dhimmel commented Jun 2, 2020

As far as syntax goes, it would be a lot easier to support the following via a pandoc filter:

we used [iron](CHEBI:53438) to do the experiment

Which could get converted to

we used [iron(3+) sulfate](https://identifiers.org/CHEBI:53438) to do the experiment

But there is also a question of whether we want to convert all named entities to hyperlinks. Or perhaps rather we should add a tooltip / hover text. Before going further, it would be good to know:

  1. what standards the NLP / text mining community has for including named entity tags in documents. CC @danich1: aware of any standards for how to encode tagged entities in HTML?
  2. what user experience we envision for someone reading a sentence including a named entity. Do we want a hyperlink, a tooltip, or something else?

@danich1

danich1 commented Jun 2, 2020

what standards the NLP / text mining community has for including named entity tags in documents. CC @danich1: aware of any standards for how to encode tagged entities in HTML?

Most documents I have worked with were in XML format; however, looking at Pubtator Central's example, the main idea is to surround text with a mark tag and provide at least the entity ID. If you wanted to make future documents text mining friendly, I recommend including other information such as type of entity and the offset of the mention itself.

Ex:

We used <mark data-identifier='CHEBI:53438' data-offset=10 data-type='compound'> iron </mark> to do the experiment

@dhimmel
Member Author

dhimmel commented Jun 2, 2020

Thanks @danich1! Looks like the HTML <mark> element is used for highlighting. Another option might be the <data> element:

The HTML <data> element links a given content with a machine-readable translation.

Here we'd want the output HTML to be

we used <data value="CHEBI:53438">iron(3+) sulfate</data> to do the experiment
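A minimal sketch of generating that markup from a CURIE and a resolved name, escaping both so that names containing `<`, `&`, or quotes cannot break the HTML. The helper name `data_element` is hypothetical.

```python
from html import escape


def data_element(curie, name):
    """Wrap a concept name in a <data> element whose value is the CURIE."""
    value = escape(curie, quote=True)  # attribute context: escape quotes too
    text = escape(name, quote=False)   # text context: only &, <, > need escaping
    return f'<data value="{value}">{text}</data>'
```

For example, `data_element("CHEBI:53438", "iron(3+) sulfate")` returns the element shown above.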

The browser shows the value as ID upon hover:

Screenshot from 2020-06-02 12-52-10

This seems pretty aligned with what we want to accomplish, but still does not indicate the "machine-readable translation" is a CURIE. I am hoping there is a standard we can find for this. CC @andrewsu, @cmungall, @jmcmurry: we're looking for a semantic annotation standard for noting compact identifiers in HTML. Any insights (also tweeted)?

@vincerubinetti any frontend constraints or ways you think we should proceed?

@vincerubinetti
Collaborator

<data> is more appropriate for this circumstance I think, though if there is precedent for using <mark> in this particular situation, I don't think Manubot is using it anywhere else yet. There shouldn't be any conflict with either choice.

@cthoyt
Contributor

cthoyt commented Jun 2, 2020

@dhimmel so the idea is that you could write a link however you wanted, then it would replace the name with the standard? I like the <data> tag mockup you provided, but I also think that tooltips would be helpful.

One of the things I've put on my low-priority todo list is to include the data version with all prefix/identifier/name pairs (re: biopragmatics/pyobo#58)

Regarding @danich1's point about entity types - this is really really hard. Ontologies have hierarchies, but they don't all inherit from a standardized type vocabulary. For ChEBI, it might be possible to assign all things as either chemicals or roles based on the top level terms, but this kind of annotation would have to be done on a database/ontology basis. Do you have a controlled vocabulary of entity types that you prefer (like SBO, for example)? What will happen for entities that don't fall in that vocabulary?

@dhimmel
Member Author

dhimmel commented Jun 2, 2020

@cthoyt implementing this will require decisions at three levels:

  1. figuring out what markdown syntax to use for compact IDs. My ideas so far are [iron](CHEBI:53438) or [CHEBI:53438]. It would be good to collect feedback from people who know more about CommonMark and Pandoc.

  2. deciding how to encode the compact ID / entity name in the output HTML. For example the <data> element with a value attribute.

  3. deciding how the frontend should display entities with compact IDs. I think this can largely be done with CSS, but we need to design 2 in such a way that we can render names / IDs in an optimal way.

One of the things I've put on my low-priority todo list is to include the data version with all prefix/identifier/name pairs (re: biopragmatics/pyobo#58)

We could add an HTML attribute in 2 with version information if available. Not a priority at this point, but if it's there, we'll find a way to preserve it.

Regarding @danich1's point about entity types - this is really really hard

Let's skip entity type for now. As long as a machine can extract IDs, entity types could always be assigned later by downstream users.

@agitter
Member

agitter commented Jun 2, 2020

Does using templating open up any better options for the markdown syntax? I haven't thought of any specific syntax proposals that I like. Something like {{CHEBI:53438}} or {{CHEBI:53438.expanded}} could be replaced with <data value="CHEBI:53438">iron(3+) sulfate</data>.

@dhimmel
Member Author

dhimmel commented Jun 2, 2020

Does using templating open up any better options for the markdown syntax?

Possibly it'd be easier to implement, but I'd much rather have this as a pandoc filter like pandoc-manubot-cite (or part of it). This way we're less constrained to jinja2 in the future and the potential userbase is ~1000x.

A pandoc filter should be able to handle any syntax, although figuring out how to modify the AST is not necessarily easy.
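To give a feel for the AST modification, here is a sketch that walks pandoc's JSON AST as plain Python dicts and lists, with no filter library. It assumes the `[iron](CHEBI:53438)` syntax, which pandoc parses as a `Link` node whose target URL is the CURIE. The `names` table and both function names are hypothetical, and the CURIE pattern is deliberately narrow.

```python
import re

# Assumed CURIE shape; URLs such as https://... do not match because of "/".
CURIE_RE = re.compile(r"^[A-Za-z][A-Za-z0-9._]*:[A-Za-z0-9._-]+$")


def name_to_inlines(name):
    """Convert a plain name into pandoc Str/Space inline nodes."""
    inlines = []
    for i, word in enumerate(name.split(" ")):
        if i:
            inlines.append({"t": "Space"})
        inlines.append({"t": "Str", "c": word})
    return inlines


def rewrite_links(node, names):
    """Recursively rewrite Link nodes whose target is a known CURIE, in place.

    A Link node has the form {"t": "Link", "c": [attr, inlines, [url, title]]}.
    """
    if isinstance(node, list):
        for child in node:
            rewrite_links(child, names)
    elif isinstance(node, dict):
        if node.get("t") == "Link":
            attr, _inlines, target = node["c"]
            url, title = target
            name = names.get(url)
            if CURIE_RE.match(url) and name is not None:
                node["c"] = [
                    attr,
                    name_to_inlines(name),
                    [f"https://identifiers.org/{url}", title],
                ]
                return
        rewrite_links(node.get("c"), names)
```

A real JSON filter would read the document from stdin, call something like `rewrite_links(doc["blocks"], names)`, and write the modified document to stdout.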

@cthoyt
Contributor

cthoyt commented Sep 14, 2021

Just checking in - I finally got the über resolver service working properly at http://biolookup.io/. I know we sort of dropped the discussion for a year, but I'm sure you'll all be interested in this!


5 participants