PDF parsing failure #10
Comments
Possibly related to #5.
Fewer failures, but still some (I didn't count, but more than 100, I think). Here are a few examples:
What about using the same placeholder for all unresolved character codes above a certain number, to make pdfalto robust? Otherwise it will fail on some unusual PDFs that have complete embedded fonts with perhaps hundreds of unresolved codes.
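A minimal sketch of this idea, under one reading of the suggestion: each unresolved code gets its own placeholder until a limit on distinct unresolved codes is reached, and every further unresolved code shares a single placeholder. This is not pdfalto's actual code; the function name `map_codes`, the `max_unique` limit, the Private Use Area start point, and the choice of U+FFFD as the shared placeholder are all assumptions made for illustration.

```python
# Hypothetical illustration only; none of these names come from pdfalto.
SHARED_PLACEHOLDER = "\uFFFD"   # assumed shared placeholder (replacement character)
PRIVATE_USE_START = 0xE000      # assumed start of per-code placeholders


def map_codes(codes, font_map, max_unique=100):
    """Map character codes to text, tolerating unresolved codes.

    codes      -- iterable of integer character codes from the PDF
    font_map   -- dict of code -> Unicode string for resolved codes
    max_unique -- how many distinct unresolved codes get their own placeholder
    """
    assigned = {}  # unresolved code -> its individual placeholder
    out = []
    for code in codes:
        if code in font_map:
            out.append(font_map[code])
        elif code in assigned:
            out.append(assigned[code])
        elif len(assigned) < max_unique:
            # Still under the limit: give this code a distinct placeholder.
            assigned[code] = chr(PRIVATE_USE_START + len(assigned))
            out.append(assigned[code])
        else:
            # Over the limit: fall back to the one shared placeholder
            # instead of failing on the whole document.
            out.append(SHARED_PLACEHOLDER)
    return "".join(out)
```

With a limit in place, a font carrying hundreds of unresolved codes degrades to repeated placeholders rather than causing a parsing failure.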
I am only reporting PDF parser failures here :)
* Due to wrong mapping found in embedded fonts.
OK, these failures were caused by missing font metadata; this is fixed in commit 04618c4.
Here are some examples of PDF parsing failures from PubMed Central reusable set 1942.
Uploading trf0051-0558.pdf…
Uploading 1746-1340-18-26.pdf…
Uploading 1617-9625-9-4.pdf…