Syntax errors in PDFs created by Xsane cause ocrmypdf to fail with errors #61

KBDCALLS · 2016-03-19T18:38:24Z

ocrmypdf out.pdf o1.pdf
--- Logging error ---
Traceback (most recent call last):
File "/usr/lib/python3.4/logging/__init__.py", line 978, in emit
msg = self.format(record)
File "/usr/lib/python3.4/logging/__init__.py", line 828, in format
return fmt.format(record)
File "/usr/lib/python3.4/logging/__init__.py", line 565, in format
record.message = record.getMessage()
File "/usr/lib/python3.4/logging/__init__.py", line 326, in getMessage
msg = str(self.msg)
File "/usr/local/lib/python3.4/dist-packages/ruffus/ruffus_exceptions.py", line 127, in __str__
message += self.get_nth_exception_str (ii)
File "/usr/local/lib/python3.4/dist-packages/ruffus/ruffus_exceptions.py", line 116, in get_nth_exception_str
task_name, job_name, exception_name, exception_value, exception_stack = self.args[nn]
ValueError: too many values to unpack (expected 5)
Call stack:
File "/usr/lib/python3.4/threading.py", line 888, in _bootstrap
self._bootstrap_inner()
File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
self.run()
File "/usr/lib/python3.4/threading.py", line 868, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.4/multiprocessing/managers.py", line 195, in handle_request
result = func(c, *args, **kwds)
File "/usr/lib/python3.4/multiprocessing/managers.py", line 392, in accept_connection
self.serve_client(c)
File "/usr/lib/python3.4/multiprocessing/managers.py", line 241, in serve_client
res = function(*args, **kwds)
Message: RethrownJobError('ocrmypdf.main.repair_pdf', 'Job = [out.pdf -> .../com.github.ocrmypdf.frjgsbgy/out.repaired.pdf, <ocrmypdf.main.WrappedLogger>, [], <_thread.lock>]', 'builtins.ValueError', "(invalid literal for int() with base 16: b'XT')", 'Traceback (most recent call last):\n File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions\n register_cleanup, touch_files_only)\n File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files\n ret_val = user_defined_work_func(*params)\n File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/main.py", line 392, in repair_pdf\n pdfinfo.extend(pdf_get_all_pageinfo(output_file))\n File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/pageinfo.py", line 314, in pdf_get_all_pageinfo\n return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]\n File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/pageinfo.py", line 314, in \n return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]\n File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/pageinfo.py", line 283, in _pdf_get_pageinfo\n pageinfo['has_text'] = _page_has_text(pdf, page)\n File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/pageinfo.py", line 255, in _page_has_text\n text = page.extractText()\n File "/usr/local/lib/python3.4/dist-packages/PyPDF2/pdf.py", line 2566, in extractText\n content = ContentStream(content, self.pdf)\n File "/usr/local/lib/python3.4/dist-packages/PyPDF2/pdf.py", line 2645, in init\n self.__parseContentStream(stream)\n File "/usr/local/lib/python3.4/dist-packages/PyPDF2/pdf.py", line 2677, in __parseContentStream\n operands.append(readObject(stream, None))\n File "/usr/local/lib/python3.4/dist-packages/PyPDF2/generic.py", line 68, in readObject\n return readHexStringFromStream(stream)\n File "/usr/local/lib/python3.4/dist-packages/PyPDF2/generic.py", line 312, in readHexStringFromStream\n txt += chr(int(x, base=16))\nValueError: 
invalid literal for int() with base 16: b'XT'\n')
Arguments: ()

This is the output for a PDF that I created with Xsane. The same file works fine with OCRmyPDF 2.2.

KBDCALLS · 2016-03-19T18:39:52Z

I am using Debian 8.3 AMD64

jbarlow83 · 2016-03-19T20:04:19Z

There's something unusual about that PDF. Can you upload a copy to Dropbox
(or equivalent) and send me a link?

KBDCALLS · 2016-03-20T17:50:16Z

Here is a PDF. I created it with Xsane 0.998.

out0002.pdf

jbarlow83 · 2016-03-20T20:47:36Z

Xsane does not seem to produce syntactically valid PDFs.

You can fix it with Ghostscript which seems tolerant of the problems, and then run it through ocrmypdf.

gs -o "$output" -sDEVICE=pdfwrite "$input"
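
The two-step workaround above can be scripted. This is a minimal sketch, assuming `gs` and `ocrmypdf` are both on the PATH. The function names `build_commands` and `repair_then_ocr` and the `.gs-repaired.pdf` naming scheme are illustrative only and are not part of either tool.

```python
import subprocess
from pathlib import Path


def build_commands(src, dst):
    """Return the Ghostscript and ocrmypdf command lines as argument lists."""
    src, dst = Path(src), Path(dst)
    # Hypothetical naming scheme for the intermediate, Ghostscript-rewritten file.
    repaired = dst.with_name(dst.stem + ".gs-repaired.pdf")
    # Same flags as the shell command above: -o sets the output, pdfwrite re-emits the PDF.
    gs_cmd = ["gs", "-o", str(repaired), "-sDEVICE=pdfwrite", str(src)]
    ocr_cmd = ["ocrmypdf", str(repaired), str(dst)]
    return gs_cmd, ocr_cmd


def repair_then_ocr(src, dst):
    """Rewrite src with Ghostscript, then OCR the rewritten copy into dst."""
    for cmd in build_commands(src, dst):
        # check=True raises CalledProcessError if either tool exits non-zero.
        subprocess.run(cmd, check=True)
```

Calling `repair_then_ocr("out.pdf", "o1.pdf")` would leave the intermediate file next to the output so it can be inspected if the OCR step still fails.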

Details on the nature of the errors:

qpdf finds and fixes some of the problems, but doesn't seem to fix everything. Neither poppler nor PyPDF2, the library I rely on for PDF parsing, can parse the file. Xsane takes the unusual approach of embedding a large image as an inline image (normally used only for small images) rather than as a proper PDF object.
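
To check whether a page uses this inline-image pattern, one could scan its decompressed content stream for the `BI ... ID ... EI` operators that delimit an inline image. The following is a rough heuristic sketch, not how ocrmypdf or PyPDF2 handle it; the name `inline_image_sizes` is hypothetical, and obtaining the decompressed stream bytes is left to whatever PDF tool is at hand.

```python
import re

# BI <image dictionary> ID <binary data> EI, with each keyword delimited by whitespace.
_INLINE_IMAGE = re.compile(rb"(?:^|\s)BI\s(.*?)\sID\s(.*?)\sEI(?=\s|$)", re.DOTALL)


def inline_image_sizes(content_stream):
    """Return the byte length of each inline image's data in a content stream.

    Heuristic only: binary image data may itself contain the bytes ' EI ',
    in which case the reported length is too short.
    """
    return [len(m.group(2)) for m in _INLINE_IMAGE.finditer(content_stream)]
```

A full-page scan stored this way would show up as a single very large entry in the returned list, whereas a page that references a separate image object would return an empty list.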

jbarlow83 · 2016-04-28T07:58:24Z

Should be fixed in 4.1

jbarlow83 · 2016-04-28T20:38:59Z

Sorry, it's not completely fixed, but at least the error message is now somewhat coherent.

It will work in this mode:

ocrmypdf --pdf-renderer tesseract ...
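
For batch jobs, the fallback can be automated: try the default run first and retry in this mode only if it fails. This is a sketch under two assumptions: that ocrmypdf exits with a non-zero status on failure, and that the installed version accepts `--pdf-renderer tesseract` (the flag is taken from the command above). The name `ocr_with_fallback` is illustrative.

```python
import subprocess


def ocr_with_fallback(src, dst, run=subprocess.run):
    """Run ocrmypdf; on failure, retry with --pdf-renderer tesseract.

    `run` is injectable so the retry logic can be exercised without
    ocrmypdf installed. Returns the argument list that succeeded.
    """
    base = ["ocrmypdf", src, dst]
    try:
        run(base, check=True)
        return base
    except subprocess.CalledProcessError:
        # Fallback mode suggested in this thread for Xsane-produced PDFs.
        fallback = ["ocrmypdf", "--pdf-renderer", "tesseract", src, dst]
        run(fallback, check=True)
        return fallback
```

If the fallback also fails, the second `CalledProcessError` propagates to the caller, so a batch script can log the file and move on.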

I recommend manually patching tesseract's OCR font, which fixes problems with PDFs produced by tesseract. Just replace the tessdata/pdf.ttf with this file from upstream.

https://github.com/tesseract-ocr/tesseract/pull/220/files

jbarlow83 · 2016-12-03T09:18:19Z

I believe the underlying issue was in PyPDF2 1.25.1, which was the best available version in April. Under PyPDF2 1.26.0 this issue seems to be fixed. I tested with ocrmypdf 4.3.3 and it works.

jbarlow83 changed the title ~~What's wrong with OCRmyPDF 4.07~~ Syntax errors in PDFs created by Xsane cause ocrmypdf to fail with errors Mar 21, 2016

jbarlow83 mentioned this issue Apr 3, 2016

ruffus sometimes throws exceptions in RethrownJobError cgat-developers/ruffus#65

Closed

OCRmyPDF-issuebot mentioned this issue Jul 15, 2016

original images not kept unaltered #8

Closed

jbarlow83 closed this as completed Dec 3, 2016
