KeyError: '/Contents' #154

feinerer · 2017-04-24T15:17:21Z

$ docker run --rm -v "$(pwd):/home/docker"   ocrmypdf --skip-text input.pdf output.pdf
   INFO - Tesseract v4.x.alpha found. OCRmyPDF support is experimental.
  ERROR - Traceback (most recent call last):
  File "/appenv/lib/python3.5/site-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/appenv/lib/python3.5/site-packages/ruffus/task.py", line 567, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/appenv/lib/python3.5/site-packages/ocrmypdf/pipeline.py", line 183, in repair_pdf
    pdfinfo = pdf_get_all_pageinfo(output_file)
  File "/appenv/lib/python3.5/site-packages/ocrmypdf/pageinfo.py", line 524, in pdf_get_all_pageinfo
    return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]
  File "/appenv/lib/python3.5/site-packages/ocrmypdf/pageinfo.py", line 524, in <listcomp>
    return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]
  File "/appenv/lib/python3.5/site-packages/ocrmypdf/pageinfo.py", line 496, in _pdf_get_pageinfo
    pageinfo['has_text'] = _page_has_text(pdf, page)
  File "/appenv/lib/python3.5/site-packages/ocrmypdf/pageinfo.py", line 468, in _page_has_text
    text = page.extractText()
  File "/appenv/lib/python3.5/site-packages/PyPDF2/pdf.py", line 2593, in extractText
    content = self["/Contents"].getObject()
  File "/appenv/lib/python3.5/site-packages/PyPDF2/generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
KeyError: '/Contents'

This happens on a Debian Jessie system running the latest Docker container (see above command line).

Unfortunately I cannot include the corresponding PDF as it contains private information.

If you need further information, please give me instructions in order to help you debug this issue. Thank you!

jbarlow83 · 2017-04-24T22:08:48Z

I'll add a check for this case in the next release.

The PDF is missing a data field (the page's /Contents entry) that is strictly optional but almost never omitted, and the third-party PyPDF2 library does not handle its absence.

Try re-frying the PDF with Ghostscript, as this will likely insert the expected object. Note that this constructs a visually identical PDF but will re-encode JPEGs in the process.

gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=out.pdf in.pdf

feinerer · 2017-04-28T08:33:16Z

Confirmed: using Ghostscript to rewrite the PDF suffices so that PyPDF2 can handle it.

A direct check in OCRmyPDF would be appreciated, to avoid the manual Ghostscript call.
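For illustration only, such a check might look roughly like the sketch below. It assumes the page object behaves like a dict (as the traceback suggests for PyPDF2's page objects) and treats a page without a /Contents entry as having no text. The helper name `safe_extract_text` is hypothetical, and this is not necessarily what the actual fix in 6c8c1d8 does.

```python
def safe_extract_text(page):
    """Return the page's extracted text, or '' if it has no /Contents entry.

    Hypothetical helper: `page` is assumed to be a dict-like PyPDF2 page
    object that provides extractText(), as seen in the traceback above.
    """
    # A page with no /Contents entry has no content stream, hence no text.
    if "/Contents" not in page:
        return ""
    return page.extractText()
```

A caller such as `_page_has_text` could then test whether the returned string is non-empty instead of calling `page.extractText()` directly.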

jbarlow83 closed this as completed in 6c8c1d8 Apr 28, 2017

feinerer mentioned this issue Apr 29, 2017

AttributeError: 'NoneType' object has no attribute 'getObject' #156

Closed
