pubtator bioc json error - fix bioc json reader? or add new format "pubtator bioc json"? #5

joelduerksen · 2021-01-10T03:26:31Z

I'm attempting to use bconv to convert BioC JSON to pubtator/TXT, but it throws an error (in span validation?). At a glance the format appears compliant, but maybe we need a new format called "pubtator bioc json"?

Files I'm attempting to convert can be found here

ftp://ftp.ncbi.nlm.nih.gov/pub/lu/CORD19/cord19-pubtator.json.tar

The first few lines of output/1.json seem to align with the BioC JSON format.

{
  "source": "PubTator",
  "date": "",
  "key": "BioC.key",
  "infons": {},
  "documents": [
    {
      "id": "xqhn0vbp",
      "infons": {},
      "passages": [
        {
          "offset": 0,
          "infons": {
          .....

lfurrer · 2021-01-10T18:20:02Z

Hi Joel, is this PubTator Central?
Their BioC JSON looks funny. AFAIK there are no specs for BioC JSON besides the converter code by Don Comeau (https://github.com/ncbi-nlp/BioC-JSON), to which I've been sticking.
If they provide BioC XML, you should give that a try; it looked fine when I last checked.

lfurrer · 2021-01-10T18:22:46Z

... unless this is a simple offset problem that can be fixed with the byte_offsets option (https://github.com/lfurrer/bconv/wiki/BioC#options-1); have you checked that?

joelduerksen · 2021-01-10T20:47:42Z

Hi Lenz, yes, I believe these are generated directly by PubTator (and yes, most likely Central). More on that here: https://github.com/ncbi-nlp/PubTator-Covid19/ where they say "Pubtator annotations are provided for six entity types (gene/protein, drug/chemical, disease, cell type, species and genomic variants) in two formats (BioC JSON and BioC XML)."


joelduerksen · 2021-01-10T20:59:04Z

Hi Lenz, I have not looked into that option. Having mostly used the simple pubtator/TXT format, I'm not familiar with the inner workings of the JSON/XML formats. If you have any hints on how that option could help, let me know. Here are examples of the full errors I see.

JSON:

    >>> with open('/home/plastic/d2/downloads_other/cord19/output.json/1.json', encoding='utf8') as f:
    ...     coll = bconv.load(f, fmt='bioc_json')
    ...
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/__init__.py", line 77, in load
        return _load(loader, mode, source, id_)
      File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/__init__.py", line 84, in _load
        content = loader.load_one(source, id_)
      File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/_load.py", line 52, in load_one
        return self.collection(source, id_)
      File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/bioc.py", line 62, in collection
        collection.add_document(self._document(doc))
      File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/bioc.py", line 79, in _document
        doc.add_section(sec_type, text, offset, anno)
      File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/doc/document.py", line 426, in add_section
        section = Section(section_type, text, self, offset, entities)
      File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/doc/document.py", line 295, in __init__
        self.add_entities(entities)
      File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/doc/document.py", line 154, in add_entities
        sent.add_entities((entity,))
      File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/doc/document.py", line 194, in add_entities
        self._validate_spans(entity)
      File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/doc/document.py", line 211, in _validate_spans
        assert extracted[0] == entity.text, _mismatch()
    AssertionError: entity mention mismatch: rhinovirus vs. [', rhinovir']

I tried XML as well.

XML:

    >>> with open('/home/plastic/d2/downloads_other/cord19/output/1.xml', encoding='utf8') as f:
    ...     coll = bconv.load(f, fmt='bioc_xml')
    ...
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/__init__.py", line 77, in load
        return _load(loader, mode, source, id_)
      File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/__init__.py", line 84, in _load
        content = loader.load_one(source, id_)
      File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/_load.py", line 52, in load_one
        return self.collection(source, id_)
      File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/bioc.py", line 58, in collection
        coll_node, docs = self._parse_collection(source)
      File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/bioc.py", line 158, in _parse_collection
        first, docs = peek(self._iterparse(source))
      File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/util/iterate.py", line 36, in peek
        first = next(iterator)
      File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/bioc.py", line 170, in _iterparse
        for _, node in etree.iterparse(source, tag='document'):
      File "src/lxml/iterparse.pxi", line 209, in lxml.etree.iterparse.__next__
      File "src/lxml/iterparse.pxi", line 194, in lxml.etree.iterparse.__next__
      File "src/lxml/iterparse.pxi", line 222, in lxml.etree.iterparse._read_more_events
    TypeError: reading file objects must return bytes objects

I'm guessing we might need two new formats, pubtator/json and pubtator/xml, since you said it looked weird inside. This CORD-19 site is publishing regular updates but does not provide a pubtator/TXT download, hence the desire to convert.


lfurrer · 2021-01-10T21:03:50Z

I'm having a look at the files right now.
It seems that setting byte_offsets=False helps, but there are cases where it still breaks.
Concerning the error you see for XML: You need to pass a binary file handle here. XML is, technically, a binary format, not plain text (at least that's what lxml's author claims).
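The text-versus-binary distinction can be sketched as a small helper. This is only a sketch under assumptions: `open_for_format` and `load_collection` are hypothetical names, and forwarding `byte_offsets=False` through `bconv.load` is assumed from the option named in this thread, not checked against the bconv documentation.

```python
def open_for_format(path, fmt):
    """Open `path` in the mode the loader expects (hypothetical helper).

    Assumption from this thread: the BioC XML loader hands the file to
    lxml, which wants bytes, while the JSON loader accepts text.
    """
    if fmt == "bioc_xml":
        return open(path, "rb")           # binary handle for XML
    return open(path, encoding="utf8")    # text handle for JSON


def load_collection(path, fmt, **options):
    """Load a BioC file with bconv, e.g. options = {'byte_offsets': False}.

    Assumption: bconv.load forwards keyword options to the format loader.
    """
    import bconv  # imported lazily so the helper above works without bconv

    with open_for_format(path, fmt) as f:
        return bconv.load(f, fmt=fmt, **options)
```

With this, something like `load_collection('1.xml', 'bioc_xml', byte_offsets=False)` would avoid the `TypeError` above, provided the option is accepted as assumed.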

lfurrer · 2021-01-10T22:17:21Z

Here's my quick analysis of the problem:
We definitely don't need a new format; the documents appear to be well-formed (my above suspicion about "funny" BioC-JSON does not apply). Rather, there is a mismatch in the interpretation of the BioC specs between bconv and PubTator. Also, the data contain some errors.

First, as I said before, you should turn off the byte_offsets option. The BioC specs dictate that offsets are calculated in bytes, but many disregard this detail and simply count Unicode codepoints, which arguably makes more sense, and that's why there's an option for this in bconv.
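To illustrate why the two ways of counting diverge, here is a minimal sketch in plain Python (no bconv involved). The sample sentence and the helper name `byte_to_codepoint_offset` are made up for illustration; UTF-8 is assumed as the encoding.

```python
def byte_to_codepoint_offset(text, byte_offset, encoding="utf8"):
    """Convert an offset counted in encoded bytes to one counted in codepoints.

    Raises UnicodeDecodeError if byte_offset falls inside a multi-byte character.
    """
    prefix = text.encode(encoding)[:byte_offset]
    return len(prefix.decode(encoding))


text = "TNF-\u03b1 and IL-1\u00df levels"   # "TNF-α and IL-1ß levels"
mention = "IL-1\u00df"

codepoint_start = text.index(mention)
byte_start = len(text[:codepoint_start].encode("utf8"))

# The two counts diverge after the first non-ASCII character (α takes 2 bytes).
print(codepoint_start, byte_start)
```

Any annotation located after a non-ASCII character will therefore be shifted if the reader and the writer disagree about the counting unit.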

Second, bconv is pretty strict in its span validation, because that's how you notice that you should be turning the byte_offsets option on or off. The BioC annotations have a text and a location field, which is a bit redundant, so we can use it to do a sanity check by looking up the substring and comparing it to the text value. Now it turns out that PubTator is doing some normalisation and stores the normalised version in text rather than the original one, so bconv barks at you.
Examples:

3321.json: PubTator: "SARS CoV 2", original: "SARS‐CoV‐2"
2025.json: PubTator: "TNF-a", original: "TNF-α"
3097.json: PubTator: "IL-1b", original: "IL-1ß" (with German sharp-s for beta 🙈)
785.json: PubTator: "off-camp", original: "analysis" (I'm not sure if this is a valid synonym or an error)

The BioC DTD says that the text field of annotations is "Typically the annotated text", so bconv's interpretation is possibly a bit too strict. I could add an option to skip validation, so these cases would pass, but then actual errors wouldn't be detected either.
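The sanity check described above can be sketched roughly as follows. This is not bconv's actual code; `span_matches` is a hypothetical helper, the passage text is invented, and offsets are taken to be document-level codepoint offsets.

```python
def span_matches(passage_text, passage_offset, ann_offset, ann_length, ann_text):
    """Return True if the text at the annotated location equals ann_text."""
    start = ann_offset - passage_offset   # make the offset passage-relative
    end = start + ann_length
    return passage_text[start:end] == ann_text


# Invented passage using U+2010 hyphens, as in the 3321.json example.
passage = "Patients with SARS\u2010CoV\u20102 infection"
original = "SARS\u2010CoV\u20102"
start = passage.index(original)

exact = span_matches(passage, 0, start, len(original), original)
normalised = span_matches(passage, 0, start, len(original), "SARS CoV 2")
print(exact, normalised)
```

A strict comparison like this rejects the normalised mention even though its location is right, which is the trade-off discussed above: skipping the check would let such cases pass, but genuine offset errors would go unnoticed too.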

Third, there are errors in the data. A questionable case is the "off-camp"/"analysis" one above. A clear instance is the following:
1992.json contains two occurrences of "mefloquine", the second of which (at offset 20015) is annotated twice: once with the correct location and once with offset 18070 (the first occurrence), which is outside the paragraph at which it is anchored (starting at offset 19247). The same pattern can be seen for "fatty acid" in 2952.json. It seems like both cases appear in duplicate paragraphs or documents, which might be responsible for the spurious annotations.
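A check for annotations that fall outside their passage could look like the sketch below. The offsets 19247, 20015 and 18070 are the ones quoted above; the passage length is not given in this thread, so `PASSAGE_LENGTH` is a made-up value, and `inside_passage` is a hypothetical helper.

```python
def inside_passage(passage_offset, passage_length, ann_offset, ann_length):
    """Return True if the annotation span lies entirely within the passage."""
    passage_end = passage_offset + passage_length
    return passage_offset <= ann_offset and ann_offset + ann_length <= passage_end


PASSAGE_OFFSET = 19247               # from 1992.json, as quoted above
PASSAGE_LENGTH = 2000                # hypothetical: actual length not stated
MENTION_LENGTH = len("mefloquine")

for ann_offset in (20015, 18070):
    ok = inside_passage(PASSAGE_OFFSET, PASSAGE_LENGTH, ann_offset, MENTION_LENGTH)
    print(ann_offset, "inside" if ok else "outside passage")
```

Running such a check over a whole collection before conversion would separate these data errors from the normalisation mismatches.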

In conclusion, I'm not so sure what to do.
I'm not convinced that all of these problems should be fixed at bconv's end.
You may want to reach out to the authors of CORD-19-PubTator. Chances are they want to fix problems like the last one in their pipeline.

joelduerksen · 2021-01-11T15:37:12Z

I believe these files are generated by the creators of the pubtator/txt, pubtator/json and pubtator/xml formats (so arguing that they are creating their own formats/files wrongly might make for an interesting discussion): ftp://ftp.ncbi.nlm.nih.gov/pub/lu/


joelduerksen · 2021-01-11T17:13:03Z

I can't make sense of these offsets. They seem to be correct for the few I checked in the title field, but none of the entries I checked in the text was at the stated offset. I guess they follow some programmatic scheme that can't be checked in a text viewer (e.g. vi).

However, the problems go deeper, and I'll probably write to them about that first: comparing these files to the output of their own service shows that the files are missing annotation content as well.

https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocjson?pmids=19672853
https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/pubtator?pmids=19672853

I wrote to them about issues with another dataset, and while they didn't respond, it was fixed in the next update (coincidence, maybe, but...).
