Error: File did not complete the page properly and may be damaged. #514
Comments
Are the files visually damaged in any way, or is it just the error message
that appears with no apparent error?
Are you using Ghostscript 9.51 or 9.52?
On Thu., Mar. 26, 2020, 02:27, tice17 wrote:
2014-07-26 Gardena Tuinslangoproller gebruiksaanwijzing.pdf
<https://github.com/jbarlow83/OCRmyPDF/files/4386250/2014-07-26.Gardena.Tuinslangoproller.gebruiksaanwijzing.pdf>
The files really show up empty. I included the faulty output file for the input file I provided.
As to the version of Ghostscript:
I'm seeing something very similar, if not identical, to this. Am running
The input PDF is 12 pages long. The output PDF is also 12 pages but pages 4, 7, 9, 11 and 12 are blank whilst the input has images on those pages. Taking page 12 as an example I see this in the logs:
I'm actually using it in anger inside a Python AWS Lambda but can reproduce as above. For now I'm going to try adding another logging handler to collect anything logged at ERROR and above, and then abort downstream processing if any such records exist at the end of a run. Unfortunately I'm unable to provide the PDF in question but would be happy to run further exploration.
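A minimal sketch of that workaround, using only the standard `logging` module. The names `ErrorCollector` and `run_with_error_check` are hypothetical, and attaching the handler to a logger named `"ocrmypdf"` is an assumption about where the Ghostscript message ends up; adjust the logger name if the records arrive elsewhere.

```python
import logging


class ErrorCollector(logging.Handler):
    """Keep every record logged at ERROR or above."""

    def __init__(self):
        super().__init__(level=logging.ERROR)
        self.records = []

    def emit(self, record):
        self.records.append(record)


def run_with_error_check(job, logger_name="ocrmypdf"):
    """Run job(); raise RuntimeError if anything was logged at ERROR or above.

    logger_name is an assumption: child loggers propagate to it by default.
    """
    logger = logging.getLogger(logger_name)
    collector = ErrorCollector()
    logger.addHandler(collector)
    try:
        result = job()
    finally:
        # Always detach the handler so repeated runs do not stack collectors.
        logger.removeHandler(collector)
    if collector.records:
        messages = "; ".join(r.getMessage() for r in collector.records)
        raise RuntimeError(f"errors logged during run: {messages}")
    return result
```

Here `job` would be a zero-argument callable that performs one conversion. Raising after the run, rather than inside `emit`, lets the conversion finish and keeps all error messages together for the caller.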
Can you reproduce the error in "native" ocrmypdf without Lambda, Docker or other sources of anger?
Interestingly enough, no. It works fine on macOS Monterey and also on a clean Ubuntu 22.04 VM. I've also fairly closely mimicked the Dockerfile steps in the Ubuntu VM and still end up with a working conversion. I don't have more time just now but will try to come back to this later.
A little more experimentation indicates that the PDF is malformed in a way that Ghostscript can't solve inside the container - possibly because of a font (still investigating). I've found that |
TL;DR: the GS 9.55 output on the Ubuntu VM (set up as close as possible to the container):
GS output from inside the current
There is a link to the font in the container, but the target is missing:
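For anyone who wants to check their own image for the same kind of problem, a small sketch that lists symbolic links whose targets do not exist. The function name `find_dangling_symlinks` is hypothetical, and which directory to scan is left to the caller, since the font directory depends on the image.

```python
import os


def find_dangling_symlinks(root):
    """Return (link, target) pairs under root whose target does not exist."""
    dangling = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # os.path.exists() follows the link, so it is False for a broken one.
            if os.path.islink(path) and not os.path.exists(path):
                dangling.append((path, os.readlink(path)))
    return dangling
```

Running it over the Ghostscript font directory inside the container and again on the working VM should show whether a font link is the only difference.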
On the VM we find what provides the font:
Adding this in the container fixes the issue. I'd still argue for |
Problem
After applying ocrmypdf to some 5000 existing PDFs, it turned out that 370 were empty after conversion. All of the empty output PDF files have one thing in common: the output contains the following warning:
"ERROR - 2: **** Error reading a content stream. The page may be incomplete.
Output may be incorrect.
**** Error: File did not complete the page properly and may be damaged.
Output may be incorrect."
Full message:
To Reproduce
The utility I used looped through a list of files found on my QNAP NAS. The Python snippet used within the loop:
The output was logged in a database table, so it was easy for me to study the number of files affected etc.
Example file
2014-07-26 Gardena Tuinslangoproller gebruiksaanwijzing.pdf
I have included one of the many failing input files as an example. The file has already been OCRed, like the majority of files on my network.
Please check any or all that apply about the test file:
Expected behavior
I would have expected the application NOT to overwrite the existing input PDF file when it could not successfully read the input. But as the output shows, the "overall" conclusion is that the process was successful:
which made ocrmypdf decide to overwrite the file...
I have not been able to find the error message inside the ocrmypdf code base, which leads me to assume it is in a utility that ocrmypdf builds on.
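Until that is resolved, one way to protect the originals from the calling side is to convert into a temporary file and replace the input only after a check passes. This is a sketch under assumptions: `convert_in_place`, `convert` and `is_valid` are hypothetical names, `convert(src, dst)` stands for whatever performs the OCR run, and `is_valid(dst)` stands for whatever check you trust (for example, that no error was logged during the run).

```python
import os
import tempfile


def convert_in_place(path, convert, is_valid):
    """Run convert(path, tmp); replace path only if is_valid(tmp) is true.

    Returns True if the original was replaced, False if it was left untouched.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the input so os.replace() stays on one filesystem.
    fd, tmp = tempfile.mkstemp(suffix=".pdf", dir=directory)
    os.close(fd)
    try:
        convert(path, tmp)
        if not is_valid(tmp):
            return False
        os.replace(tmp, path)
        return True
    finally:
        # Clean up the temp file on rejection or on an exception in convert().
        if os.path.exists(tmp):
            os.remove(tmp)
```

With this pattern a run that reports success but produces a blank file costs only a rejected temporary file, and the input stays as it was.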
System:
Additional context
The utility I wrote runs in a Docker container. It is based on https://hub.docker.com/r/jbarlow83/ocrmypdf.