Error: File did not complete the page properly and may be damaged. #514

Open
2 of 5 tasks
tice17 opened this issue Mar 26, 2020 · 8 comments

@tice17

tice17 commented Mar 26, 2020

Problem
After running ocrmypdf on some 5000 existing PDFs, it turned out that 370 were empty after conversion. All of the empty output PDFs have one thing in common: the output contains this warning:
"ERROR - 2: **** Error reading a content stream. The page may be incomplete.
Output may be incorrect.
**** Error: File did not complete the page properly and may be damaged.
Output may be incorrect."
Full message:

   INFO - Start processing 2 pages concurrently
   INFO - Using Tesseract OpenMP thread limit 2
  ERROR -    2:    **** Error reading a content stream. The page may be incomplete.
               Output may be incorrect.
   **** Error: File did not complete the page properly and may be damaged.
               Output may be incorrect.

  ERROR -    1:    **** Error reading a content stream. The page may be incomplete.
               Output may be incorrect.
   **** Error: File did not complete the page properly and may be damaged.
               Output may be incorrect.

   INFO - Optimize ratio: 1.00 savings: 0.0%
   INFO - Output file is a PDF/A-2B (as expected)

To Reproduce
The utility I used looped through a list of files found on my QNAP NAS. This is the Python snippet used within the loop:

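import subprocess

def process_pdf(file):
    # Requires ocrmypdf to be installed and on PATH.
    # Input and output are the same path, so the file is overwritten in place.
    command = 'ocrmypdf --deskew "{}" "{}"'.format(file, file)
    # getoutput() returns combined stdout/stderr and discards the exit status.
    out = subprocess.getoutput(command)
    return out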

The output was logged in a database table, so it was easy for me to study the number of files affected etc.

Example file

2014-07-26 Gardena Tuinslangoproller gebruiksaanwijzing.pdf
I have included one of the many failing input files as an example. The file has already been OCRed, like the majority of files on my network.

Please check any or all that apply about the test file:

  • This is the input file
  • The file contains no personal or confidential information
  • I am the copyright holder for this file
  • I permit this file to be included in the OCRmyPDF test suite under the CC-BY-SA 4.0 license
  • I am not the copyright holder, but this file is available under a free software license

Expected behavior
I would have expected the application NOT to overwrite the existing input PDF when it could not successfully read the input. But as the output shows, the "overall" conclusion is that the process was successful:

   INFO - Optimize ratio: 1.00 savings: 0.0%
   INFO - Output file is a PDF/A-2B (as expected)

which made ocrmypdf decide to overwrite the file...

I have not been able to find the error message inside the ocrmypdf code base, which leads me to assume it is in a utility that ocrmypdf builds on.
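
Independent of any change in ocrmypdf itself, a wrapper script could guard against this by writing to a temporary file and replacing the original only when the run looks clean. The following is a minimal sketch under stated assumptions, not a tested fix: the names `process_pdf_safely`, `log_reports_damage` and the injectable `run` parameter are made up for illustration, and the marker strings are copied from the messages quoted above. Because the log above ends with a success message despite the errors, the sketch does not rely on the exit status alone.

```python
import os
import subprocess
import tempfile

# Marker strings copied from the Ghostscript messages quoted in this issue.
DAMAGE_MARKERS = (
    "Error reading a content stream",
    "File did not complete the page properly",
)


def log_reports_damage(log_text):
    """Return True if the captured log contains one of the known markers."""
    return any(marker in log_text for marker in DAMAGE_MARKERS)


def process_pdf_safely(path, run=subprocess.run):
    """OCR `path` into a temporary file; replace the original only on a clean run.

    `run` is injectable (hypothetical parameter) so the logic can be
    exercised without ocrmypdf installed.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the input so os.replace stays
    # on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(suffix=".pdf", dir=directory)
    os.close(fd)
    try:
        result = run(
            ["ocrmypdf", "--deskew", path, tmp_path],
            capture_output=True,
            text=True,
        )
        log_text = (result.stdout or "") + (result.stderr or "")
        ok = result.returncode == 0 and not log_reports_damage(log_text)
        if ok:
            os.replace(tmp_path, path)
        return ok, log_text
    finally:
        # Remove the temporary file if it was not moved into place.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

With this shape, a run that prints either marker leaves the original file untouched and returns `(False, log_text)`, so the caller can log the failure and move on.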

System:

# uname -a
Linux 24efc6c5b453 4.14.24-qnap #1 SMP Fri Feb 14 01:03:29 CST 2020 x86_64 x86_64 x86_64 GNU/Linux
# ocrmypdf --version
9.6.1.post1+g0165255.d20200311

Additional context
The utility I wrote runs in a Docker container. It is based on https://hub.docker.com/r/jbarlow83/ocrmypdf.

@jbarlow83
Collaborator

jbarlow83 commented Mar 26, 2020 via email

@tice17
Author

tice17 commented Mar 26, 2020

The files really show up empty. I included the faulty output file for the input file I provided.
2014-07-26 Gardena Tuinslangoproller gebruiksaanwijzing.pdf
And to be complete: this was the output for that particular conversion:

   INFO - Start processing 4 pages concurrently
  ERROR -    2:    **** Error reading a content stream. The page may be incomplete.
               Output may be incorrect.
   **** Error: File did not complete the page properly and may be damaged.
               Output may be incorrect.

  ERROR -    3:    **** Error reading a content stream. The page may be incomplete.
               Output may be incorrect.
   **** Error: File did not complete the page properly and may be damaged.
               Output may be incorrect.

  ERROR -    4:    **** Error reading a content stream. The page may be incomplete.
               Output may be incorrect.
   **** Error: File did not complete the page properly and may be damaged.
               Output may be incorrect.

  ERROR -    1:    **** Error reading a content stream. The page may be incomplete.
               Output may be incorrect.
   **** Error: File did not complete the page properly and may be damaged.
               Output may be incorrect.

  ERROR -    5:    **** Error reading a content stream. The page may be incomplete.
               Output may be incorrect.
   **** Error: File did not complete the page properly and may be damaged.
               Output may be incorrect.

  ERROR -    6:    **** Error reading a content stream. The page may be incomplete.
               Output may be incorrect.
   **** Error: File did not complete the page properly and may be damaged.
               Output may be incorrect.

   INFO - Optimize ratio: 1.00 savings: 0.2%
   INFO - Output file is a PDF/A-2B (as expected)--

@tice17
Author

tice17 commented Mar 27, 2020

As to the version of ghostscript:

# apt list --installed | grep ghostscript

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

ghostscript/now 9.27~dfsg+0-0ubuntu3.1 amd64 [installed,local]

@sihil

sihil commented Mar 15, 2023

I'm seeing something very similar, if not identical, to this.

Am running ocrmypdf using:

docker run --rm -i jbarlow83/ocrmypdf -v 2 --force-ocr --output-type pdf - - <input.pdf >output.pdf

The input PDF is 12 pages long. The output PDF is also 12 pages but pages 4, 7, 9, 11 and 12 are blank whilst the input has images on those pages. Taking page 12 as an example I see this in the logs:

  INFO ocrmypdf._pipeline -   12  page already has text! - rasterizing text and running OCR anyway
  DEBUG ocrmypdf._pipeline -   12  Rasterize with png16m, rotation 0
  DEBUG ocrmypdf.subprocess -   12  Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=png16m', '-dFirstPage=12', '-dLastPage=12', '-r400.000000x400.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.8jgbl4vn/origin.pdf']
  DEBUG ocrmypdf.subprocess.gs -   12  stderr =    **** Error reading a content stream. The page may be incomplete.
               Output may be incorrect.
   **** Error: File did not complete the page properly and may be damaged.
               Output may be incorrect.

  ERROR ocrmypdf._exec.ghostscript -   12     **** Error reading a content stream. The page may be incomplete.
               Output may be incorrect.
   **** Error: File did not complete the page properly and may be damaged.
               Output may be incorrect.

  DEBUG PIL.PngImagePlugin -   12  STREAM b'IHDR' 16 13
  DEBUG PIL.PngImagePlugin -   12  STREAM b'iCCP' 41 2354
  DEBUG PIL.PngImagePlugin -   12  iCCP profile name b'default_rgb.icc'
  DEBUG PIL.PngImagePlugin -   12  Compression method 0
  DEBUG PIL.PngImagePlugin -   12  STREAM b'pHYs' 2407 9
  DEBUG PIL.PngImagePlugin -   12  STREAM b'tEXt' 2428 31
  DEBUG PIL.PngImagePlugin -   12  STREAM b'IDAT' 2471 8192
  DEBUG ocrmypdf._exec.ghostscript -   12  Rotating output by 0
  DEBUG PIL.PngImagePlugin -   12  STREAM b'IHDR' 16 13
  DEBUG PIL.PngImagePlugin -   12  STREAM b'iCCP' 41 2350
  DEBUG PIL.PngImagePlugin -   12  iCCP profile name b'ICC Profile'
  DEBUG PIL.PngImagePlugin -   12  Compression method 0
  DEBUG PIL.PngImagePlugin -   12  STREAM b'pHYs' 2403 9
  DEBUG PIL.PngImagePlugin -   12  STREAM b'IDAT' 2424 54246
  DEBUG ocrmypdf._pipeline -   12  resolution (399.9992, 399.9992)
  DEBUG ocrmypdf._pipeline -   12  convert
  DEBUG PIL.PngImagePlugin -   12  STREAM b'IHDR' 16 13
  DEBUG PIL.PngImagePlugin -   12  STREAM b'iCCP' 41 2350
  DEBUG PIL.PngImagePlugin -   12  iCCP profile name b'ICC Profile'
  DEBUG PIL.PngImagePlugin -   12  Compression method 0
  DEBUG PIL.PngImagePlugin -   12  STREAM b'pHYs' 2403 9
  DEBUG PIL.PngImagePlugin -   12  STREAM b'IDAT' 2424 54246
  DEBUG img2pdf -   12  PIL format = PNG
  DEBUG img2pdf -   12  imgformat = PNG
  DEBUG img2pdf -   12  input dpi = 400 x 400
  DEBUG img2pdf -   12  rotation = 0°
  DEBUG img2pdf -   12  input colorspace = RGB
  DEBUG img2pdf -   12  width x height = 3276px x 4648px
  DEBUG img2pdf -   12  read_images() embeds a PNG
  DEBUG ocrmypdf._pipeline -   12  convert done
  DEBUG ocrmypdf.subprocess -   12  Running: ['tesseract', '-l', 'eng', '-c', 'textonly_pdf=1', '/tmp/ocrmypdf.io.8jgbl4vn/000012_ocr.png', '/tmp/ocrmypdf.io.8jgbl4vn/000012_ocr_tess', 'pdf', 'txt']
  DEBUG ocrmypdf._graft -   12  Emplacement update
  DEBUG ocrmypdf._graft -   12  Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
  DEBUG ocrmypdf._graft -   12  Grafting
  DEBUG ocrmypdf._graft -   12  Page rotation: (content, auto) -> page = (0, 0) -> 0

I'm actually using it in anger inside a python AWS Lambda but can reproduce as above. For now I'm going to try adding another logging handler to collect anything logged at ERROR and above and then I will abort downstream processing in the case any exist at the end of a run.
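
The logging-handler idea could look roughly like this. It is a sketch only: `ErrorCollector` and `attach_collector` are hypothetical names, and it assumes the relevant records reach a logger under the `ocrmypdf` namespace in the calling process (the logger names in the log above, such as `ocrmypdf._exec.ghostscript`, suggest they do).

```python
import logging


class ErrorCollector(logging.Handler):
    """Keep every record logged at ERROR or above for later inspection."""

    def __init__(self):
        super().__init__(level=logging.ERROR)
        self.records = []

    def emit(self, record):
        self.records.append(record)


def attach_collector(logger_name="ocrmypdf"):
    """Attach a collector to the named logger and return it.

    Child loggers such as ocrmypdf._exec.ghostscript propagate to the
    parent logger by default, so one handler on the parent sees them all.
    """
    collector = ErrorCollector()
    logging.getLogger(logger_name).addHandler(collector)
    return collector
```

After the OCR run, a non-empty `collector.records` would be the signal to abort downstream processing. The handler should be removed again with `removeHandler` if the process handles more than one document.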

Unfortunately I'm unable to provide the PDF in question but would be happy to run further exploration.

@jbarlow83
Collaborator

Can you reproduce the error in "native" ocrmypdf without Lambda, Docker or other sources of anger?

@sihil

sihil commented Mar 16, 2023

Interestingly enough, no. Works fine on macOS Monterey. Also works fine on a clean Ubuntu 22.04 VM after apt install ocrmypdf.

I've also just fairly closely mimicked the docker file steps in the Ubuntu VM and also end up with a working conversion. Don't have more time just now but will try and come back to this later.

@sihil

sihil commented Mar 18, 2023

A little more experimentation indicates that the PDF is malformed in a way that Ghostscript can't solve inside the container - possibly because of a font (still investigating).

I've found that -dPDFSTOPONERROR causes Ghostscript to bail in this situation. Is this an option you'd consider adding, or having a flag that adds it, so that rather than outputting a corrupt PDF the whole process fails?
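
Until something like that exists, the same flag can be used in a pre-flight check before calling ocrmypdf. The sketch below is an assumption-laden illustration rather than anything ocrmypdf does: `build_preflight_command` and `pdf_renders_cleanly` are hypothetical helpers, the `nullpage` device is used so nothing is written, and it assumes Ghostscript exits with a non-zero status when `-dPDFSTOPONERROR` trips, which is worth confirming on the Ghostscript version in use.

```python
import subprocess


def build_preflight_command(pdf_path):
    """Build a Ghostscript command that interprets the PDF and discards output."""
    return [
        "gs",
        "-dQUIET",
        "-dSAFER",
        "-dBATCH",
        "-dNOPAUSE",
        "-dPDFSTOPONERROR",   # stop instead of continuing past PDF errors
        "-sDEVICE=nullpage",  # interpret pages without producing a file
        "-f",
        pdf_path,
    ]


def pdf_renders_cleanly(pdf_path, run=subprocess.run):
    """Return True if the pre-flight command exits with status 0.

    `run` is injectable (hypothetical parameter) so the function can be
    tested without Ghostscript installed.
    """
    result = run(build_preflight_command(pdf_path), capture_output=True, text=True)
    return result.returncode == 0
```

A caller would skip, or at least flag, any file for which `pdf_renders_cleanly` returns `False` instead of handing it to ocrmypdf.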

@sihil

sihil commented Mar 20, 2023

TL;DR: the fonts-droid-fallback package needs to be added to the container image.

GS 9.55 output on the Ubuntu VM (set up as close as possible to the container):

$ gs -dSAFER -dInterpolateControl=-1 -sDEVICE=png16m -dFirstPage=12 -dLastPage=12 -r400.000000x400.000000 -o - -sstdout=%stderr -dAutoRotatePages=/None -f property_information_form_TA6-2023-03-09-09-17-55-454676.pdf > output.pdf
GPL Ghostscript 9.55.0 (2021-09-27)
Copyright (C) 2021 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 12 through 12.
Page 12
Loading NimbusSans-Regular font from /usr/share/ghostscript/9.55.0/Resource/Font/NimbusSans-Regular... 4494652 2991617 2633824 1314847 4 done.
Loading NimbusRoman-Regular font from /usr/share/ghostscript/9.55.0/Resource/Font/NimbusRoman-Regular... 4560764 3191460 2674224 1339790 4 done.
Loading NimbusMonoPS-Regular font from /usr/share/ghostscript/9.55.0/Resource/Font/NimbusMonoPS-Regular... 4788476 3425108 2714624 1378808 4 done.
Loading D050000L font from /usr/share/ghostscript/9.55.0/Resource/Font/D050000L... 4902044 3535494 2734824 1393475 4 done.
Can't find CID font "TimesNewRomanPSMT".
Attempting to substitute CID font /Adobe-Identity for /TimesNewRomanPSMT, see doc/Use.htm#CIDFontSubstitution.
The substitute CID font "Adobe-Identity" is not provided either. attempting to use fallback CIDFont.See doc/Use.htm#CIDFontSubstitution.
Loading a TT font from /usr/share/ghostscript/9.55.0/Resource/CIDFSubst/DroidSansFallback.ttf to emulate a CID font Adobe-Identijbig2dec WARNING text region refers to no symbol dictionaries (segment 0x02)
ty ... Done.

GS output from inside the current jbarlow83/ocrmypdf container:

# gs -dSAFER -dInterpolateControl=-1 -sDEVICE=png16m -dFirstPage=12 -dLastPage=12 -r400.000000x400.000000 -o - -sstdout=%stderr -dAutoRotatePages=/None -f /app/property_information_form_TA6-2023-03-09-09-17-55-454676.pdf > output.pdf
GPL Ghostscript 9.55.0 (2021-09-27)
Copyright (C) 2021 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 12 through 12.
Page 12
Loading NimbusSans-Regular font from /usr/share/ghostscript/9.55.0/Resource/Font/NimbusSans-Regular... 4595652 3018313 2633824 1314847 4 done.
Loading NimbusRoman-Regular font from /usr/share/ghostscript/9.55.0/Resource/Font/NimbusRoman-Regular... 4661764 3218052 2674224 1339790 4 done.
Loading NimbusMonoPS-Regular font from /usr/share/ghostscript/9.55.0/Resource/Font/NimbusMonoPS-Regular... 4808676 3437972 2714624 1378808 4 done.
Loading D050000L font from /usr/share/ghostscript/9.55.0/Resource/Font/D050000L... 4922244 3548358 2734824 1393475 4 done.
Can't find CID font "TimesNewRomanPSMT".
Attempting to substitute CID font /Adobe-Identity for /TimesNewRomanPSMT, see doc/Use.htm#CIDFontSubstitution.
The substitute CID font "Adobe-Identity" is not provided either. attempting to use fallback CIDFont.See doc/Use.htm#CIDFontSubstitution.
The fallback CID font "CIDFallBack" is not provided.  Finally attempting to use ArtifexBullet.
   **** Error reading a content stream. The page may be incomplete.
               Output may be incorrect.
   **** Error: File did not complete the page properly and may be damaged.
               Output may be incorrect.

There is a symlink to the font in the container, but its target is missing:

/usr/share/ghostscript/9.55.0/Resource/CIDFSubst# ls -la
total 8
drwxr-xr-x  2 root root 4096 Feb 15 20:21 .
drwxr-xr-x 11 root root 4096 Feb 15 20:21 ..
lrwxrwxrwx  1 root root   58 Sep 26 14:05 DroidSansFallback.ttf -> ../../../../fonts/truetype/droid/DroidSansFallbackFull.ttf

/usr/share/ghostscript/9.55.0/Resource/CIDFSubst# ls -la ../../../../fonts/truetype/droid/
ls: cannot access '../../../../fonts/truetype/droid/': No such file or directory

On the VM we find what provides the font:

$ dpkg -S DroidSansFallbackFull.ttf
fonts-droid-fallback: /usr/share/fonts/truetype/droid/DroidSansFallbackFull.ttf

Adding this in the container fixes the issue. I'd still argue for -dPDFSTOPONERROR as this would have prevented corruption and is likely to do so in future cases too.
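
For anyone wanting to check their own image for the same problem, the manual `ls` steps above can be sketched as a small script. `find_dangling_symlinks` is a hypothetical helper, and the directory to scan is whatever Ghostscript resource directory the image has; the path in the usage note is the one from the listing above.

```python
import os


def find_dangling_symlinks(directory):
    """Return (name, target) pairs for symlinks in `directory` whose target is missing."""
    dangling = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # os.path.exists follows symlinks, so it is False for a broken link.
        if os.path.islink(path) and not os.path.exists(path):
            dangling.append((name, os.readlink(path)))
    return dangling
```

Calling `find_dangling_symlinks("/usr/share/ghostscript/9.55.0/Resource/CIDFSubst")` inside the container described above would be expected to report the `DroidSansFallback.ttf` link; an empty list means every link in that directory resolves.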
