$ docker run --rm -v "$(pwd):/home/docker" ocrmypdf --skip-text input.pdf output.pdf
INFO - Tesseract v4.x.alpha found. OCRmyPDF support is experimental.
ERROR - Traceback (most recent call last):
  File "/appenv/lib/python3.5/site-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/appenv/lib/python3.5/site-packages/ruffus/task.py", line 567, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/appenv/lib/python3.5/site-packages/ocrmypdf/pipeline.py", line 183, in repair_pdf
    pdfinfo = pdf_get_all_pageinfo(output_file)
  File "/appenv/lib/python3.5/site-packages/ocrmypdf/pageinfo.py", line 524, in pdf_get_all_pageinfo
    return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]
  File "/appenv/lib/python3.5/site-packages/ocrmypdf/pageinfo.py", line 524, in <listcomp>
    return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]
  File "/appenv/lib/python3.5/site-packages/ocrmypdf/pageinfo.py", line 496, in _pdf_get_pageinfo
    pageinfo['has_text'] = _page_has_text(pdf, page)
  File "/appenv/lib/python3.5/site-packages/ocrmypdf/pageinfo.py", line 468, in _page_has_text
    text = page.extractText()
  File "/appenv/lib/python3.5/site-packages/PyPDF2/pdf.py", line 2593, in extractText
    content = self["/Contents"].getObject()
  File "/appenv/lib/python3.5/site-packages/PyPDF2/generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
KeyError: '/Contents'
This happens on a Debian Jessie system running the latest Docker container (see above command line).
Unfortunately I cannot include the corresponding PDF as it contains private information.
If you need further information, please give me instructions in order to help you debug this issue. Thank you!
I'll add a check for this case in the next release.
The PDF is missing a data field (/Contents) that is strictly optional but almost never omitted, and the third-party PyPDF2 library does not handle this.
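For illustration only, here is a minimal sketch of what such a check could look like. It is not the actual OCRmyPDF fix: `page_has_text` and `FakePage` are hypothetical names, and the sketch assumes only that the page object behaves like a dict with an `extractText()` method, as the PyPDF2 page in the traceback does.

```python
def page_has_text(page):
    """Return True if the page appears to contain extractable text.

    Hypothetical helper: assumes `page` is dict-like and has an
    extractText() method, as in the traceback above.
    """
    # '/Contents' is optional in a page dictionary. A page without it has
    # no content stream, so treat it as having no text instead of letting
    # extractText() raise KeyError.
    if "/Contents" not in page:
        return False
    return bool(page.extractText().strip())


class FakePage(dict):
    """Stand-in for a PDF page object, used only to demonstrate the guard."""

    def extractText(self):
        # Mimic the failing lookup from the traceback, then return the
        # stored string as the "extracted" text.
        return self["/Contents"]


blank_page = FakePage()                              # no '/Contents' entry
text_page = FakePage({"/Contents": "Hello, world"})
```

With this guard, `page_has_text(blank_page)` returns `False` rather than raising, while a page with a content stream is still inspected as before.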
Try re-frying the PDF with Ghostscript, as this would likely insert the expected object. Note that this constructs a visually identical PDF and will re-encode JPEGs in the process.
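A rough sketch of one way to do the re-fry, assuming Ghostscript is installed and available as `gs` on your PATH. The helper names and file names are made up for the example, and there is no guarantee that the rewritten file will contain the missing object.

```python
import subprocess


def ghostscript_refry_command(src, dst):
    """Build the argument list for rewriting `src` into `dst` with Ghostscript.

    Hypothetical helper. Uses Ghostscript's pdfwrite device; the -o option
    sets the output file and makes gs run non-interactively.
    """
    return ["gs", "-o", dst, "-sDEVICE=pdfwrite", src]


def refry_pdf(src, dst):
    """Run Ghostscript on `src`; raises CalledProcessError if gs fails."""
    subprocess.run(ghostscript_refry_command(src, dst), check=True)
```

Calling `refry_pdf("input.pdf", "refried.pdf")` writes a new file that can then be passed to ocrmypdf in place of the original.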