PDF parsing failure #10

kermitt2 · 2018-06-06T15:23:12Z

Here are some examples of PDF parsing failures from PubMed Central reusable set 1942.
Uploading trf0051-0558.pdf…
Uploading 1746-1340-18-26.pdf…
Uploading 1617-9625-9-4.pdf…

Aazhar · 2018-06-07T15:43:10Z

possibly related to #5

kermitt2 · 2018-06-07T22:29:29Z

Fewer failures, but still some (I didn't count, but more than 100 I think). Here are a few examples:
Uploading cjem8_2p0057.pdf…
Uploading ott-4-059.pdf…
Uploading gmos34-715.pdf…

kermitt2 · 2018-06-08T09:19:04Z

What about using the same placeholder for all unresolved character codes above a certain number, in order to make pdfalto robust? Otherwise it will fail on some unusual PDFs that have complete embedded fonts with perhaps hundreds of unresolved codes.

* Follows suggestion here.
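
To make the suggestion concrete, here is a minimal sketch of the idea, not pdfalto's actual code (pdfalto is not written in Python, and its real data structures are not shown in this thread). It assumes a hypothetical `PlaceholderMapper` class, an assumed cap of 64 distinct placeholders, and the Unicode Private Use Area as the placeholder range. Unresolved codes get their own placeholder until the cap is reached; every later unresolved code shares a single fallback character, so a font with hundreds of unmapped codes cannot exhaust the placeholder pool.

```python
# Hypothetical sketch: names, the cap, and the code point ranges are assumptions.
PLACEHOLDER_BASE = 0xE000        # start of the Unicode Private Use Area
MAX_UNIQUE_PLACEHOLDERS = 64     # assumed cap on distinct placeholders
SHARED_PLACEHOLDER = "\uFFFD"    # U+FFFD REPLACEMENT CHARACTER


class PlaceholderMapper:
    """Assigns placeholders to character codes that have no Unicode mapping."""

    def __init__(self, max_unique=MAX_UNIQUE_PLACEHOLDERS):
        self.max_unique = max_unique
        self._assigned = {}  # (font_name, char_code) -> placeholder character

    def resolve(self, font_name, char_code, unicode_map):
        # Use the font's own mapping whenever one exists.
        mapped = unicode_map.get(char_code)
        if mapped is not None:
            return mapped

        # Reuse a placeholder already handed out for this unresolved code.
        key = (font_name, char_code)
        if key in self._assigned:
            return self._assigned[key]

        # Below the cap: hand out a fresh, distinct placeholder.
        if len(self._assigned) < self.max_unique:
            placeholder = chr(PLACEHOLDER_BASE + len(self._assigned))
            self._assigned[key] = placeholder
            return placeholder

        # At or above the cap: fall back to the shared placeholder
        # instead of failing on the document.
        return SHARED_PLACEHOLDER
```

The trade-off is that codes beyond the cap are no longer distinguishable from one another in the output, which seems acceptable when the alternative is failing on the whole document.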

Aazhar · 2018-06-08T13:45:33Z

This is done for the prod version (master).
@kermitt2 the documents you referenced above should rather be referenced in issue #11.
I'll close this one.

kermitt2 · 2018-06-08T14:27:33Z

I only report PDF parser failures here :)

ccp_76_3_524.pdf
rev_117_1_210.pdf

* Due to wrong mapping found in embedded fonts.

Aazhar · 2018-06-11T12:48:48Z

OK, these failures were caused by missing font metadata; this is fixed by commit 04618c4.

Aazhar closed this as completed in 3e20ca9 Jun 7, 2018

kermitt2 reopened this Jun 7, 2018

kermitt2 added the bug label Jun 7, 2018

Aazhar pushed a commit that referenced this issue Jun 8, 2018

Attempt to workaround #10 (add more placeholders).

7006517

Aazhar closed this as completed in 3f0ef8c Jun 8, 2018

kermitt2 reopened this Jun 8, 2018

Aazhar referenced this issue Jun 8, 2018

Use unique placeholder in case of a long list of non unicode character.

3b01a78

* Follows suggestion here.

Aazhar closed this as completed Jun 8, 2018

kermitt2 reopened this Jun 8, 2018

Aazhar pushed a commit that referenced this issue Jun 11, 2018

Add workaround for invalid character #10.

7ffb8a0

* Due to wrong mapping found in embedded fonts.

kermitt2 closed this as completed Jun 11, 2018
