1942 PDF test set #9
Comments
Does this mean that the 1071 failing PDFs cannot be processed by pdfalto at all, or only in combination with GROBID? I did see some errors concerning OCR-ised characters when trying to process a specific PDF; is this what you are talking about?
Yes, in combination with GROBID, which also makes it possible to test the ALTO XML output. This is a regression test against pdf2xml.
We're improving :)
I now have 100% success, great! Regarding the metrics, there is apparently some loss, 1-2% on field accuracy more or less everywhere. I will investigate whether this comes from the modification of the content stream or from problems with character composition.
As a reference, with the current version 1071 PDFs out of 1942 fail when testing pdfalto with GROBID on the 1942-PDF PubMed Central set.
The errors are mostly pdfalto failures, plus some cases of not well-formed XML, usually caused by invalid XML characters in attributes.
I will open separate issues, with test PDFs attached, for the different cases.
Note that the latest version of our pdf2xml fork modified for GROBID was 100% successful on this set, so pdfalto should be too :)
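For anyone who wants to triage the "not well-formed XML" cases on their own files, here is a minimal Python sketch. It is not code from pdfalto or GROBID; the helper names and the sample `String`/`CONTENT` snippet are hypothetical and only illustrate the kind of check involved. It looks for characters outside the XML 1.0 `Char` range and tests well-formedness with the standard library parser.

```python
import re
import xml.etree.ElementTree as ET

# Everything outside the XML 1.0 "Char" production is illegal in a document,
# including inside attribute values.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (offset, code point) pairs for characters not allowed in XML 1.0."""
    return [(m.start(), ord(m.group())) for m in _INVALID_XML_CHARS.finditer(text)]


def is_well_formed(xml_text):
    """Return True if the standard library parser accepts the document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical snippet: a vertical tab (U+000B) inside an attribute value.
bad = '<String CONTENT="a\x0bb"/>'
good = '<String CONTENT="ab"/>'

print(find_invalid_xml_chars(bad))
print(is_well_formed(bad), is_well_formed(good))
```

Running the first helper over the raw output text gives the offsets of the offending characters, which can help locate the source glyphs in the PDF.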