pubtator bioc json error - fix bioc json reader? or add new format "pubtator bioc json"? #5

Open
joelduerksen opened this issue Jan 10, 2021 · 8 comments

Comments

@joelduerksen

I'm attempting to use bconv to convert BioC JSON to PubTator/TXT, but it throws an error (during span validation?). At a glance the format appears compliant, but maybe we need a new format called "pubtator bioc json"?

Files I'm attempting to convert can be found here

ftp://ftp.ncbi.nlm.nih.gov/pub/lu/CORD19/cord19-pubtator.json.tar

The first few lines of output/1.json seem to align with the BioC JSON format.


{
  "source": "PubTator",
  "date": "",
  "key": "BioC.key",
  "infons": {},
  "documents": [
    {
      "id": "xqhn0vbp",
      "infons": {},
      "passages": [
        {
          "offset": 0,
          "infons": {
.....

@lfurrer
Owner

lfurrer commented Jan 10, 2021

Hi Joel, is this PubTator Central?
Their BioC JSON looks funny. AFAIK there are no specs for BioC JSON besides the converter code by Don Comeau, which I've been sticking to.
If they provide BioC XML, you should give that a try; it looked fine when I last checked.

@lfurrer
Owner

lfurrer commented Jan 10, 2021

... unless this is a simple offset problem that can be fixed with the byte_offsets option; have you checked that?

@joelduerksen
Author

joelduerksen commented Jan 10, 2021 via email

@joelduerksen
Author

joelduerksen commented Jan 10, 2021 via email

@lfurrer
Owner

lfurrer commented Jan 10, 2021

I'm having a look at the files right now.
It seems that setting byte_offsets=False helps, but there are cases where it still breaks.
Concerning the error you see for XML: You need to pass a binary file handle here. XML is, technically, a binary format, not plain text (at least that's what lxml's author claims).
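To illustrate the point about binary handles: an XML file declares its own encoding in its first line, so the parser wants raw bytes and decodes them itself. The sketch below uses the standard library's ElementTree rather than lxml, and an in-memory byte stream stands in for a file opened with mode "rb"; the sample document is made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# A made-up XML document, encoded as Latin-1 and declaring that encoding itself.
xml_bytes = '<?xml version="1.0" encoding="ISO-8859-1"?><text>IL-1ß</text>'.encode("latin-1")

# The parser receives bytes and picks the decoder from the XML declaration.
# With a real file, the equivalent would be a handle from open(path, "rb").
root = ET.parse(io.BytesIO(xml_bytes)).getroot()
decoded = root.text
```

A handle opened in text mode would already have been decoded by Python, possibly with a different encoding than the one the file declares.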

@lfurrer
Owner

lfurrer commented Jan 10, 2021

Here's my quick analysis of the problem:
We definitely don't need a new format; the documents appear to be well-formed (my above suspicion about "funny" BioC-JSON does not apply). Rather, there is a mismatch in the interpretation of the BioC specs between bconv and PubTator. Also, the data contain some errors.

First, as I said before, you should turn off the byte_offsets option. The BioC specs dictate that offsets are calculated in bytes, but many tools disregard this detail and simply count Unicode codepoints, which arguably makes more sense; that's why bconv has an option for this.
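A small sketch of why the two ways of counting diverge as soon as the text contains non-ASCII characters. The sample sentence and the helper slice_by_bytes are made up for illustration and are not part of bconv.

```python
# Made-up sample text; "α" and "κ" each take two bytes in UTF-8.
text = "TNF-α activates NF-κB."
mention = "NF-κB"

# Offset counted in Unicode codepoints (what Python string indexing uses).
codepoint_offset = text.index(mention)

# Offset of the same mention counted in UTF-8 bytes.
byte_offset = len(text[:codepoint_offset].encode("utf-8"))

def slice_by_bytes(s, start, length):
    """Look up a span whose offset and length are given in UTF-8 bytes."""
    return s.encode("utf-8")[start:start + length].decode("utf-8", errors="replace")

# Reading a byte offset as if it were a codepoint offset shifts the span.
shifted = text[byte_offset:byte_offset + len(mention)]
```

Here the two offsets differ by one because a single two-byte character precedes the mention; in a full-text article the drift grows with every non-ASCII character, so a wrong setting fails on nearly every annotation after the first such character.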

Second, bconv is pretty strict in its span validation, because that's how you notice whether the byte_offsets option should be on or off. BioC annotations have a text and a location field, which is a bit redundant, so bconv uses the pair for a sanity check: it looks up the substring at the given location and compares it to the text value. Now it turns out that PubTator does some normalisation and stores the normalised version in text rather than the original one, so bconv barks at you.
Examples:

  • 3321.json: PubTator: "SARS CoV 2", original: "SARS‐CoV‐2"
  • 2025.json: PubTator: "TNF-a", original: "TNF-α"
  • 3097.json: PubTator: "IL-1b", original: "IL-1ß" (with German sharp-s for beta 🙈)
  • 785.json: PubTator: "off-camp", original: "analysis" (I'm not sure if this is a valid synonym or an error)

The BioC DTD says that the text field of annotations is "Typically the annotated text", so bconv's interpretation is possibly a bit too strict. I could add an option to skip validation, so these cases would pass, but then actual errors wouldn't be detected either.
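The sanity check described above can be sketched roughly as follows. This is not bconv's actual code; the function name, its parameters and the sample passage are assumptions made for illustration, and offsets are counted in codepoints.

```python
def check_annotation(passage_text, passage_offset, ann_text, ann_offset, ann_length):
    """Compare the annotation's text with the substring at its stated location.

    Returns (ok, found); found is None if the location lies outside the passage.
    """
    start = ann_offset - passage_offset
    end = start + ann_length
    if start < 0 or end > len(passage_text):
        return False, None
    found = passage_text[start:end]
    return found == ann_text, found

# Made-up passage with the U+2010 hyphens seen in the original of 3321.json.
passage = "Patients infected with SARS\u2010CoV\u20102 were included."
passage_offset = 100  # hypothetical document-level offset of this passage
ann_offset = passage_offset + passage.index("SARS")

# The normalised text fails the strict comparison; the original passes.
strict_normalised = check_annotation(passage, passage_offset, "SARS CoV 2", ann_offset, 10)
strict_original = check_annotation(passage, passage_offset, "SARS\u2010CoV\u20102", ann_offset, 10)
```

Skipping the comparison would let the normalised case through, but a span shifted by a wrong offset setting would pass unnoticed as well, which is the trade-off mentioned above.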

Third, there are errors in the data. A questionable case is the "off-camp"/"analysis" one above. A clear instance is the following:
1992.json contains two occurrences of "mefloquine", the second of which (at offset 20015) is annotated twice: once with the correct location and once with offset 18070 (the first occurrence), which is outside the paragraph at which it is anchored (starting at offset 19247). The same pattern can be seen for "fatty acid" in 2952.json. It seems like both cases appear in duplicate paragraphs or documents, which might be responsible for the spurious annotations.
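A spurious location like this can be caught without looking at the text at all, by checking that the annotation span lies inside its passage. The sketch below uses the offsets quoted for 1992.json; the passage length is not given above, so the value used here is a placeholder, and the function name is made up.

```python
def within_passage(passage_offset, passage_length, ann_offset, ann_length):
    """True if the annotation span lies entirely inside the passage span."""
    passage_end = passage_offset + passage_length
    return passage_offset <= ann_offset and ann_offset + ann_length <= passage_end

PASSAGE_OFFSET = 19247   # start of the paragraph, as quoted above
PASSAGE_LENGTH = 1000    # placeholder: the real length is not stated
MENTION_LENGTH = len("mefloquine")

correct = within_passage(PASSAGE_OFFSET, PASSAGE_LENGTH, 20015, MENTION_LENGTH)
spurious = within_passage(PASSAGE_OFFSET, PASSAGE_LENGTH, 18070, MENTION_LENGTH)
```

The annotation at 18070 fails for any passage length, since it starts before the paragraph does.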

In conclusion, I'm not so sure what to do.
I'm not convinced that all of these problems should be fixed at bconv's end.
You may want to reach out to the authors of CORD-19-PubTator. Chances are they want to fix problems like the last one in their pipeline.

@joelduerksen
Author

joelduerksen commented Jan 11, 2021 via email

@joelduerksen
Author

joelduerksen commented Jan 11, 2021 via email
