
1942 PDF test set #9

Closed
kermitt2 opened this issue Jun 6, 2018 · 4 comments
Labels
evaluation evaluation and tests results

kermitt2 commented Jun 6, 2018

As a reference point with the current version, we have 1071 PDFs failing out of 1942 when testing pdfalto with GROBID on the 1942-PDF PubMed Central set.
The errors are pdfalto failures (mostly) and not-well-formed XML, usually invalid XML characters in attribute values.
I will open separate issues, with test PDFs attached, for the different cases.
Note that the latest version of our pdf2xml fork modified for GROBID was 100% successful on this set, so pdfalto should be too :)
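For reference, a minimal sketch of the kind of per-file well-formedness check described above (this is an illustrative, hypothetical harness-side helper, not code from pdfalto or GROBID):

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_bytes: bytes) -> bool:
    """Return True if the byte string parses as well-formed XML."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False
```

A regression run can then count, per PDF, whether pdfalto produced output at all and whether that output passes this check.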


ghost commented Jun 7, 2018

Does this mean that the 1071 failing PDFs cannot be processed by pdfalto at all, or only not in combination with GROBID? I did see some errors concerning OCRised characters when trying to process a specific PDF; is that what you are talking about?


kermitt2 commented Jun 7, 2018

Yes, in combination with GROBID, which also makes it possible to test the output ALTO XML. This is a regression test against pdf2xml.
I am not talking about OCR of unresolved character codes, which was not present in pdf2xml anyway.


kermitt2 commented Jun 7, 2018

We're improving :)
We now have 356 PDFs out of 1942 with errors. Most of them are invalid XML characters in attribute content.
There are still some PDF parsing failures; I will update the corresponding issue with some examples.
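For context on the "invalid XML character in attribute content" errors: XML 1.0 only allows a restricted character range, so control characters coming from the PDF content stream make the output not well-formed. A sketch of stripping such characters (illustrative only, not pdfalto's actual fix):

```python
import re

# XML 1.0 "Char" production: #x9 | #xA | #xD | [#x20-#xD7FF]
#                            | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML10 = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def scrub_xml10(text: str) -> str:
    """Drop characters that are illegal in XML 1.0 content or attributes."""
    return _INVALID_XML10.sub("", text)
```

Escaping (e.g. `&#x0B;`) would not help here: these code points are illegal in XML 1.0 even as character references, so they must be removed or replaced.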

kermitt2 commented

I now have 100% success, great!!

Regarding the metrics, there is apparently some loss, 1-2% on field accuracy more or less everywhere. I will investigate whether it comes from the modification of the content stream or from problems with character composition.

@Aazhar Aazhar added the evaluation evaluation and tests results label Jun 12, 2018
@kermitt2 kermitt2 closed this as completed Apr 5, 2021