
1942 PDF test set #9

Closed
kermitt2 opened this issue Jun 6, 2018 · 4 comments
Labels
evaluation evaluation and tests results

kermitt2 commented Jun 6, 2018

As a reference point with the current version, we have 1071 PDFs failing out of 1942 when testing pdfalto with GROBID on the 1942-PDF PubMed Central set.
The errors are pdfalto failures (mostly) and not-well-formed XML, usually invalid XML characters in attribute values.
I will open separate issues, with test PDFs attached, for the different cases.
Note that the latest version of our pdf2xml fork modified for GROBID was 100% successful on this set, so pdfalto should be too :)
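For reference, a minimal sketch of the kind of per-file well-formedness check described above (this is an illustrative, hypothetical harness-side helper, not code from pdfalto or GROBID):

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_bytes: bytes) -> bool:
    """Return True if the byte string parses as well-formed XML."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False
```

A regression run can then count, per PDF, whether pdfalto produced output at all and whether that output passes this check.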


ghost commented Jun 7, 2018

Does this mean that the 1071 failing PDFs cannot be processed by pdfalto at all, or only not in combination with GROBID? I did see some errors concerning OCRised characters when trying to process a specific PDF; is that what you are talking about?


kermitt2 commented Jun 7, 2018

Yes, in combination with GROBID, which also makes it possible to test the output ALTO XML. This is a regression test against pdf2xml.
I am not talking about OCR of unresolved character codes, which was not present in pdf2xml anyway.


kermitt2 commented Jun 7, 2018

We're improving :)
We now have 356 PDFs out of 1942 with errors. Most of them are invalid XML characters in attribute content.
There are still some PDF parsing failures; I will update the corresponding issue with some examples.
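For context on the "invalid XML character in attribute content" errors: XML 1.0 only allows a restricted character range, so control characters coming from the PDF content stream make the output not well-formed. A sketch of stripping such characters (illustrative only, not pdfalto's actual fix):

```python
import re

# XML 1.0 "Char" production: #x9 | #xA | #xD | [#x20-#xD7FF]
#                            | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML10 = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def scrub_xml10(text: str) -> str:
    """Drop characters that are illegal in XML 1.0 content or attributes."""
    return _INVALID_XML10.sub("", text)
```

Escaping (e.g. `&#x0B;`) would not help here: these code points are illegal in XML 1.0 even as character references, so they must be removed or replaced.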

kermitt2 commented

I now have 100% success, great!!

Regarding the metrics, there is apparently some loss, 1-2% on field accuracy more or less everywhere. I will investigate whether it comes from the modification of the content stream or from problems with character composition.

@Aazhar Aazhar added the evaluation evaluation and tests results label Jun 12, 2018
@kermitt2 kermitt2 closed this as completed Apr 5, 2021