pubtator bioc json error - fix bioc json reader? or add new format "pubtator bioc json"? #5
Hi Joel, is this PubTator Central? |
... unless this is a simple offset problem that can be fixed with the bytes_offset option <https://github.com/lfurrer/bconv/wiki/BioC#options-1>; have you checked that? |
Hi Lenz, yes, I believe these are generated directly by PubTator (and yes,
most likely Central). More on that here:
https://github.com/ncbi-nlp/PubTator-Covid19/ where they say
"Pubtator annotations are provided for six entity types (gene/protein,
drug/chemical, disease, cell type, species and genomic variants) in two
formats (BioC JSON and BioC XML)."
On Sun, Jan 10, 2021 at 1:20 PM Lenz Furrer ***@***.***> wrote:
Hi Joel, is this PubTator Central?
Their BioC JSON looks funny. AFAIK there's no specs for BioC JSON besides
the converter code by Don Comeau <https://github.com/ncbi-nlp/BioC-JSON>,
to which I've been sticking.
If they provide BioC XML, you should give that a try; it looked fine when
I last checked.
--
Joel L. Duerksen
joellduerksen@gmail.com
Home: 321-549-7210
Cell: 317-289-1036
|
Hi Lenz,
I have not looked into that option. Having mostly used the simple
pubtator/TXT format, I'm not familiar with the inner workings of the
JSON/XML formats.
If you have any hints on how that option could help, let me know.
Here are the full errors I see.
JSON
>>> with open('/home/plastic/d2/downloads_other/cord19/output.json/1.json', encoding='utf8') as f:
...     coll = bconv.load(f, fmt='bioc_json')
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/__init__.py", line 77, in load
    return _load(loader, mode, source, id_)
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/__init__.py", line 84, in _load
    content = loader.load_one(source, id_)
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/_load.py", line 52, in load_one
    return self.collection(source, id_)
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/bioc.py", line 62, in collection
    collection.add_document(self._document(doc))
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/bioc.py", line 79, in _document
    doc.add_section(sec_type, text, offset, anno)
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/doc/document.py", line 426, in add_section
    section = Section(section_type, text, self, offset, entities)
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/doc/document.py", line 295, in __init__
    self.add_entities(entities)
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/doc/document.py", line 154, in add_entities
    sent.add_entities((entity,))
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/doc/document.py", line 194, in add_entities
    self._validate_spans(entity)
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/doc/document.py", line 211, in _validate_spans
    assert extracted[0] == entity.text, _mismatch()
AssertionError: entity mention mismatch: rhinovirus vs. [', rhinovir']
I tried xml as well,
XML
>>> with open('/home/plastic/d2/downloads_other/cord19/output/1.xml', encoding='utf8') as f:
...     coll = bconv.load(f, fmt='bioc_xml')
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/__init__.py", line 77, in load
    return _load(loader, mode, source, id_)
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/__init__.py", line 84, in _load
    content = loader.load_one(source, id_)
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/_load.py", line 52, in load_one
    return self.collection(source, id_)
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/bioc.py", line 58, in collection
    coll_node, docs = self._parse_collection(source)
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/bioc.py", line 158, in _parse_collection
    first, docs = peek(self._iterparse(source))
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/util/iterate.py", line 36, in peek
    first = next(iterator)
  File "/home/plastic/anaconda3/lib/python3.8/site-packages/bconv/fmt/bioc.py", line 170, in _iterparse
    for _, node in etree.iterparse(source, tag='document'):
  File "src/lxml/iterparse.pxi", line 209, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 194, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 222, in lxml.etree.iterparse._read_more_events
TypeError: reading file objects must return bytes objects
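A note on the XML traceback: the TypeError is raised by lxml because the file was opened in text mode, so read() hands it str instead of bytes. The sketch below only illustrates that difference with the standard library; the sample XML and the helper name read_types are made up for illustration, and the bconv call in the closing comment is an untested assumption, not confirmed behaviour.

```python
import os
import tempfile

# Hypothetical stand-in for output/1.xml; not taken from the CORD-19 data.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<collection><document><id>1</id></document></collection>"
)


def read_types(path):
    """Return the types that read() yields in text mode and in binary mode."""
    with open(path, encoding="utf8") as f:  # text mode, as in the failing session
        text_type = type(f.read())
    with open(path, "rb") as f:  # binary mode, no encoding argument
        bytes_type = type(f.read())
    return text_type, bytes_type


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "1.xml")
    with open(path, "w", encoding="utf8") as f:
        f.write(SAMPLE)
    text_type, bytes_type = read_types(path)

print(text_type.__name__, bytes_type.__name__)

# Assumption (untested here): with bconv installed, passing the binary handle
# should get past this particular TypeError:
#     with open('output/1.xml', 'rb') as f:
#         coll = bconv.load(f, fmt='bioc_xml')
```

This only addresses the bytes/str complaint; the span validation issue seen with the JSON file may still show up afterwards.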
I'm guessing we might need two new formats, pubtator/json and pubtator/xml,
since you said it looked weird inside? The CORD-19 site publishes regular
updates but doesn't provide a pubtator/TXT download, hence the desire to
convert.
On Sun, Jan 10, 2021 at 1:22 PM Lenz Furrer ***@***.***> wrote:
... unless this is a simple offset problem that can be fixed with the
bytes_offset option <https://github.com/lfurrer/bconv/wiki/BioC#options-1>;
have you checked that?
|
I'm having a look at the files right now. |
I believe these files are generated by the creators of the pubtator/txt,
pubtator/json, pubtator/xml format. (so it might be an interesting
discussion to argue they are creating their own format/files wrong)
ftp://ftp.ncbi.nlm.nih.gov/pub/lu/
On Sun, Jan 10, 2021 at 5:17 PM Lenz Furrer ***@***.***> wrote:
Here's my quick analysis of the problem:
We definitely don't need a new format; the documents appear to be
well-formed (my above suspicion about "funny" BioC-JSON does not apply).
Rather, there is a mismatch in the interpretation of the BioC specs
between bconv and PubTator. Also, the data contain some errors.
*First*, as I said before, you should turn off the byte_offset option.
The BioC specs dictate that offsets are calculated in bytes, but many
disregard this detail and simply count Unicode codepoints, which arguably
makes more sense, and that's why there's an option for this in bconv.
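To make the byte/codepoint distinction concrete, here is a minimal sketch in plain Python (no bconv involved). The sample sentence is invented, not taken from 1.json; it contains one non-ASCII hyphen (U+2010, three bytes in UTF-8) before the mention. Reading a codepoint offset as if it were a byte offset shifts the span to the left, which happens to give the same kind of mismatch as in the AssertionError above. Whether that is the actual cause in the real file is an assumption.

```python
# Invented sample text; U+2010 (a non-ASCII hyphen) takes 3 bytes in UTF-8.
text = "CoV\u20102, rhinovirus found"
mention = "rhinovirus"
length = len(mention)

# Offset counted in Unicode codepoints (what many tools emit in practice).
cp_offset = text.index(mention)

# Offset counted in UTF-8 bytes (what the BioC specs ask for).
data = text.encode("utf-8")
byte_offset = data.index(mention.encode("utf-8"))

# Correct lookup: codepoint offset applied to the str.
as_codepoints = text[cp_offset:cp_offset + length]

# Wrong lookup: the same number applied to the byte string.
as_bytes = data[cp_offset:cp_offset + length].decode("utf-8")

print(cp_offset, byte_offset)      # the two offsets differ by 2
print(repr(as_codepoints))
print(repr(as_bytes))
```

Every non-ASCII character before a mention widens the gap between the two counts, so documents with many such characters drift further the deeper into the text you go.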
*Second*, bconv is pretty strict in its span validation, because that's
how you notice that you should be turning on or off the byte_offset
option. The BioC annotations have a text and a location field, which is a
bit redundant, so we can use it to do a sanity check by looking up the
substring and comparing it to the text value. Now it turns out that
PubTator is doing some normalisation and stores the normalised version in
text rather than the original one, so bconv barks at you.
Examples:
- 3321.json: PubTator: "SARS CoV 2", original: "SARS‐CoV‐2"
- 2025.json: PubTator: "TNF-a", original: "TNF-α"
- 3097.json: PubTator: "IL-1b", original: "IL-1ß" (with German sharp-s
for beta 🙈)
- 785.json: PubTator: "off-camp", original: "analysis" (I'm not sure
if this is a valid synonym or an error)
The BioC DTD says that the text field of annotations is "Typically the
annotated text", so bconv's interpretation is possibly a bit too strict.
I could add an option to skip validation, so these cases would pass, but
then actual errors wouldn't be detected either.
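One conceivable middle ground between strict validation and no validation is sketched below. It is a hypothetical heuristic, not something bconv does: classify_mismatch is an invented name, and the rule (same length, and the strings differ only where the original has a non-ASCII character) is an assumption derived from the four examples above. The hyphens in the SARS example are assumed to be U+2010.

```python
def classify_mismatch(annotated: str, extracted: str) -> str:
    """Label a disagreement between an annotation's text and its span.

    Hypothetical heuristic: returns "match", "normalised" (plausibly an
    ASCII-folded variant of the original) or "mismatch" (anything else).
    """
    if annotated == extracted:
        return "match"
    if len(annotated) != len(extracted):
        return "mismatch"
    for a, e in zip(annotated, extracted):
        # A difference at a position where the original is plain ASCII
        # cannot be explained by folding a non-ASCII character.
        if a != e and e.isascii():
            return "mismatch"
    return "normalised"


examples = [
    ("SARS CoV 2", "SARS\u2010CoV\u20102"),  # 3321.json
    ("TNF-a", "TNF-\u03b1"),                 # 2025.json
    ("IL-1b", "IL-1\u00df"),                 # 3097.json
    ("off-camp", "analysis"),                # 785.json
    ("rhinovirus", ", rhinovir"),            # shifted span from the traceback
]
for annotated, extracted in examples:
    print(annotated, "|", extracted, "->", classify_mismatch(annotated, extracted))
```

A rule like this would let the first three cases pass while still flagging "off-camp"/"analysis" and shifted spans, at the cost of accepting any same-length substitution of a non-ASCII character.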
*Third*, there are errors in the data. A questionable case is the
"off-camp"/"analysis" one above. A clear instance is the following:
1992.json contains two occurrences of "mefloquine", the second of which
(at offset 20015) is annotated twice: once with the correct location and
once with offset 18070 (the first occurrence), which is outside the
paragraph at which it is anchored (starting at offset 19247).
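The mefloquine case can be expressed as a simple bounds check, sketched here. The three offsets come from the description above; the passage length of 1000 is a made-up value for illustration (the real length is not given in this thread), and location_in_passage is a hypothetical helper, not part of bconv.

```python
def location_in_passage(passage_offset, passage_length, loc_offset, loc_length):
    """True if the annotated span lies entirely inside its passage.

    All offsets are document-level and must use the same unit
    (bytes or codepoints).
    """
    start = loc_offset - passage_offset
    return start >= 0 and start + loc_length <= passage_length


PASSAGE_OFFSET = 19247   # from the description of 1992.json
PASSAGE_LENGTH = 1000    # assumption: real length unknown
MENTION_LENGTH = len("mefloquine")

ok = location_in_passage(PASSAGE_OFFSET, PASSAGE_LENGTH, 20015, MENTION_LENGTH)
bad = location_in_passage(PASSAGE_OFFSET, PASSAGE_LENGTH, 18070, MENTION_LENGTH)
print(ok, bad)
```

The second location fails regardless of the assumed passage length, because 18070 lies before the passage even starts.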
In conclusion, I'm not so sure what to do.
I'm not convinced that all of these problems should be fixed at bconv's
end.
You may want to reach out to the authors of CORD-19-PubTator. Chances are
they want to fix problems like the last one in their pipeline.
|
I can't make sense of these offsets. They seem to be correct for the few
I checked in the title field, but none of the entries I checked in the text
was at the stated offset. I guess they come from some kind of programmatic
approach that can't be checked in a text viewer (e.g. vi).
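For checking offsets outside a text viewer, a short script along the following lines may help. It is a sketch under two assumptions: annotation offsets count from the start of the document (so the passage's own offset has to be subtracted before slicing), and they count Unicode codepoints rather than bytes. The passage dict is invented sample data following the BioC JSON layout, and check_passage is a hypothetical helper.

```python
# Invented passage in BioC JSON layout; not taken from the CORD-19 files.
passage = {
    "offset": 50,
    "text": "Patients received mefloquine daily.",
    "annotations": [
        {
            "id": "1",
            "infons": {"type": "Chemical"},
            "text": "mefloquine",
            "locations": [{"offset": 68, "length": 10}],
        }
    ],
}


def check_passage(passage):
    """Return (annotated text, text found at the location, match?) triples."""
    base = passage["offset"]
    text = passage["text"]
    results = []
    for anno in passage.get("annotations", []):
        for loc in anno["locations"]:
            start = loc["offset"] - base  # document offset -> passage offset
            found = text[start:start + loc["length"]]
            results.append((anno["text"], found, found == anno["text"]))
    return results


for annotated, found, matches in check_passage(passage):
    print(repr(annotated), repr(found), matches)
```

To run it on a real file, load it with json.load and loop over each document's "passages"; offsets in the title would look right even without the subtraction if the title passage starts at 0.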
However, the problems run deeper, and I'll probably write to them about that
first: comparing these files with the output of their own service shows that
the files are missing annotation content as well.
https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocjson?pmids=19672853
https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/pubtator?pmids=19672853
I wrote them about issues with another dataset, and while they didn't
respond, it was fixed in the next update. (coincidence, maybe, but...)
|
I'm attempting to use bconv to convert BioC JSON to pubtator/TXT, but it throws an error (during span validation?). At a glance the format appears compliant, but maybe we need a new format called pubtator bioc json?
The files I'm attempting to convert can be found here:
ftp://ftp.ncbi.nlm.nih.gov/pub/lu/CORD19/cord19-pubtator.json.tar
The first few lines of output/1.json seem to align with the BioC JSON format.
{
  "source": "PubTator",
  "date": "",
  "key": "BioC.key",
  "infons": {},
  "documents": [
    {
      "id": "xqhn0vbp",
      "infons": {},
      "passages": [
        {
          "offset": 0,
          "infons": {
.....