PDF parsing failure #10

Closed
kermitt2 opened this issue Jun 6, 2018 · 6 comments
Labels
bug Something isn't working

Comments

@kermitt2
Owner

kermitt2 commented Jun 6, 2018

Here are some examples of PDF parsing failures from PubMed Central reusable set 1942.
Uploading trf0051-0558.pdf…
Uploading 1746-1340-18-26.pdf…
Uploading 1617-9625-9-4.pdf…

@Aazhar Aazhar closed this as completed in 3e20ca9 Jun 7, 2018
@Aazhar
Collaborator

Aazhar commented Jun 7, 2018

possibly related to #5

@kermitt2 kermitt2 reopened this Jun 7, 2018
@kermitt2
Owner Author

kermitt2 commented Jun 7, 2018

Fewer failures, but still some (I didn't count, but I think more than 100). Here are a few examples:
Uploading cjem8_2p0057.pdf…
Uploading ott-4-059.pdf…
Uploading gmos34-715.pdf…

@kermitt2 kermitt2 added the bug Something isn't working label Jun 7, 2018
@Aazhar Aazhar closed this as completed in 3f0ef8c Jun 8, 2018
@kermitt2
Owner Author

kermitt2 commented Jun 8, 2018

What about using the same placeholder for all unresolved character codes above a certain number, in order to make pdfalto robust? Otherwise it will fail on unusual PDFs that embed complete fonts with perhaps hundreds of unresolved codes.
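
To illustrate the idea, here is a minimal sketch of one possible reading of the suggestion: character codes that cannot be resolved get distinct placeholders up to a fixed limit, and every further one shares a single fallback placeholder, so the mapping can never run out. This is not pdfalto's actual code. The class name `PlaceholderMapper`, the limit `max_distinct`, and the choice of the Unicode Private Use Area and U+FFFD as placeholders are all assumptions made for the example.

```python
# Hypothetical sketch; names and limits are illustrative, not pdfalto's API.
REPLACEMENT_CHAR = "\uFFFD"   # shared fallback placeholder (assumed choice)
PRIVATE_USE_START = 0xE000    # start of the Unicode Private Use Area


class PlaceholderMapper:
    """Assign placeholders to character codes that have no Unicode mapping."""

    def __init__(self, max_distinct=64):
        self.max_distinct = max_distinct
        self._assigned = {}  # (font name, char code) -> distinct placeholder

    def placeholder_for(self, font, code):
        key = (font, code)
        # Reuse the placeholder already given to this font/code pair.
        if key in self._assigned:
            return self._assigned[key]
        # Below the limit: hand out the next distinct private-use character.
        if len(self._assigned) < self.max_distinct:
            placeholder = chr(PRIVATE_USE_START + len(self._assigned))
            self._assigned[key] = placeholder
            return placeholder
        # Limit reached: fall back to the shared placeholder instead of failing.
        return REPLACEMENT_CHAR
```

With a scheme like this, a font carrying hundreds of unmapped codes would degrade to repeated fallback characters in the output rather than aborting the parse.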

@Aazhar
Copy link
Collaborator

Aazhar commented Jun 8, 2018

This is done for the prod version (master).
@kermitt2 the documents you referenced above should rather be referenced in issue #11.
I'll close this one.

@Aazhar Aazhar closed this as completed Jun 8, 2018
@kermitt2 kermitt2 reopened this Jun 8, 2018
@kermitt2
Owner Author

kermitt2 commented Jun 8, 2018

I only report PDF parser failures here :)

ccp_76_3_524.pdf
rev_117_1_210.pdf

Aazhar pushed a commit that referenced this issue Jun 11, 2018
* Due to wrong mapping found in embedded fonts.
@Aazhar
Collaborator

Aazhar commented Jun 11, 2018

OK, these failures were caused by missing font metadata; this is fixed in commit 04618c4.


2 participants