KeyError: '/Contents' #154

Closed
feinerer opened this issue Apr 24, 2017 · 2 comments

Comments

@feinerer
Contributor

$ docker run --rm -v "$(pwd):/home/docker"   ocrmypdf --skip-text input.pdf output.pdf
   INFO - Tesseract v4.x.alpha found. OCRmyPDF support is experimental.
  ERROR - Traceback (most recent call last):
  File "/appenv/lib/python3.5/site-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/appenv/lib/python3.5/site-packages/ruffus/task.py", line 567, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/appenv/lib/python3.5/site-packages/ocrmypdf/pipeline.py", line 183, in repair_pdf
    pdfinfo = pdf_get_all_pageinfo(output_file)
  File "/appenv/lib/python3.5/site-packages/ocrmypdf/pageinfo.py", line 524, in pdf_get_all_pageinfo
    return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]
  File "/appenv/lib/python3.5/site-packages/ocrmypdf/pageinfo.py", line 524, in <listcomp>
    return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]
  File "/appenv/lib/python3.5/site-packages/ocrmypdf/pageinfo.py", line 496, in _pdf_get_pageinfo
    pageinfo['has_text'] = _page_has_text(pdf, page)
  File "/appenv/lib/python3.5/site-packages/ocrmypdf/pageinfo.py", line 468, in _page_has_text
    text = page.extractText()
  File "/appenv/lib/python3.5/site-packages/PyPDF2/pdf.py", line 2593, in extractText
    content = self["/Contents"].getObject()
  File "/appenv/lib/python3.5/site-packages/PyPDF2/generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
KeyError: '/Contents'

This happens on a Debian Jessie system running the latest Docker container (see above command line).

Unfortunately I cannot include the corresponding PDF as it contains private information.

If you need further information, please give me instructions in order to help you debug this issue. Thank you!

@jbarlow83
Collaborator

I'll add a check for this case in the next release.

The PDF is missing a data field (the page's /Contents entry) that is strictly optional but almost never omitted, and the third-party PyPDF2 library does not handle this.
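Roughly, such a check could look like the sketch below. It is only an illustration, not the actual OCRmyPDF change: `page_has_text` and `FakePage` are hypothetical names, and the page is assumed to behave like a PyPDF2 page object, i.e. a dict-like object with an `extractText()` method.

```python
def page_has_text(page):
    """Return True if the page appears to contain extractable text.

    Assumption: `page` is dict-like and has an extractText() method,
    as PyPDF2 page objects do. This is a sketch, not OCRmyPDF's code.
    """
    # A page without a /Contents entry has no content stream,
    # so there is nothing to extract and no text to find.
    if "/Contents" not in page:
        return False
    try:
        text = page.extractText()
    except KeyError:
        # Be defensive about other missing keys during extraction.
        return False
    return bool(text.strip())


class FakePage(dict):
    """Hypothetical stand-in for a PyPDF2 page, for demonstration only."""

    def extractText(self):
        # Raises KeyError when /Contents is absent, like the traceback above.
        return self["/Contents"]


empty_page = FakePage()
text_page = FakePage({"/Contents": "hello"})
```

With this guard, `page_has_text(empty_page)` reports no text instead of raising `KeyError: '/Contents'`.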

Try re-frying the PDF with Ghostscript, as this will likely insert the expected object. Note that this constructs a visually identical PDF and will re-encode JPEGs in the process.

gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=out.pdf in.pdf
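If the rewrite needs to be scripted for several files, the same command can be wrapped in Python. This is a minimal sketch using exactly the flags shown above; the function names are made up for illustration, and `gs` is assumed to be on the PATH.

```python
import subprocess


def ghostscript_rewrite_command(src, dst):
    """Build the Ghostscript command line that rewrites src into dst."""
    return [
        "gs",
        "-dBATCH",
        "-dNOPAUSE",
        "-q",
        "-sDEVICE=pdfwrite",
        "-sOutputFile=" + str(dst),
        str(src),
    ]


def rewrite_pdf(src, dst):
    """Run Ghostscript; raises CalledProcessError if gs exits non-zero."""
    subprocess.run(ghostscript_rewrite_command(src, dst), check=True)
```

Calling `rewrite_pdf("in.pdf", "out.pdf")` runs the command above; the rewritten file can then be passed to ocrmypdf.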

@feinerer
Contributor Author

Confirmed: using Ghostscript to rewrite the PDF suffices so that PyPDF2 can handle it.

A direct check in OCRmyPDF would be appreciated, to avoid the manual Ghostscript call.
