Syntax errors in PDFs created by Xsane cause ocrmypdf to fail with errors #61

Closed
KBDCALLS opened this issue Mar 19, 2016 · 7 comments

@KBDCALLS

ocrmypdf out.pdf o1.pdf
--- Logging error ---
Traceback (most recent call last):
File "/usr/lib/python3.4/logging/__init__.py", line 978, in emit
msg = self.format(record)
File "/usr/lib/python3.4/logging/__init__.py", line 828, in format
return fmt.format(record)
File "/usr/lib/python3.4/logging/__init__.py", line 565, in format
record.message = record.getMessage()
File "/usr/lib/python3.4/logging/__init__.py", line 326, in getMessage
msg = str(self.msg)
File "/usr/local/lib/python3.4/dist-packages/ruffus/ruffus_exceptions.py", line 127, in __str__
message += self.get_nth_exception_str (ii)
File "/usr/local/lib/python3.4/dist-packages/ruffus/ruffus_exceptions.py", line 116, in get_nth_exception_str
task_name, job_name, exception_name, exception_value, exception_stack = self.args[nn]
ValueError: too many values to unpack (expected 5)
Call stack:
File "/usr/lib/python3.4/threading.py", line 888, in _bootstrap
self._bootstrap_inner()
File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
self.run()
File "/usr/lib/python3.4/threading.py", line 868, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.4/multiprocessing/managers.py", line 195, in handle_request
result = func(c, *args, **kwds)
File "/usr/lib/python3.4/multiprocessing/managers.py", line 392, in accept_connection
self.serve_client(c)
File "/usr/lib/python3.4/multiprocessing/managers.py", line 241, in serve_client
res = function(*args, **kwds)
Message: RethrownJobError('ocrmypdf.main.repair_pdf', 'Job = [out.pdf -> .../com.github.ocrmypdf.frjgsbgy/out.repaired.pdf, <ocrmypdf.main.WrappedLogger>, [], <_thread.lock>]', 'builtins.ValueError', "(invalid literal for int() with base 16: b'XT')", 'Traceback (most recent call last):\n File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions\n register_cleanup, touch_files_only)\n File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files\n ret_val = user_defined_work_func(*params)\n File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/main.py", line 392, in repair_pdf\n pdfinfo.extend(pdf_get_all_pageinfo(output_file))\n File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/pageinfo.py", line 314, in pdf_get_all_pageinfo\n return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]\n File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/pageinfo.py", line 314, in \n return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]\n File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/pageinfo.py", line 283, in _pdf_get_pageinfo\n pageinfo['has_text'] = _page_has_text(pdf, page)\n File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/pageinfo.py", line 255, in _page_has_text\n text = page.extractText()\n File "/usr/local/lib/python3.4/dist-packages/PyPDF2/pdf.py", line 2566, in extractText\n content = ContentStream(content, self.pdf)\n File "/usr/local/lib/python3.4/dist-packages/PyPDF2/pdf.py", line 2645, in init\n self.__parseContentStream(stream)\n File "/usr/local/lib/python3.4/dist-packages/PyPDF2/pdf.py", line 2677, in __parseContentStream\n operands.append(readObject(stream, None))\n File "/usr/local/lib/python3.4/dist-packages/PyPDF2/generic.py", line 68, in readObject\n return readHexStringFromStream(stream)\n File "/usr/local/lib/python3.4/dist-packages/PyPDF2/generic.py", line 312, in readHexStringFromStream\n txt += chr(int(x, base=16))\nValueError: 
invalid literal for int() with base 16: b'XT'\n')
Arguments: ()

This is the output for a PDF that I created with Xsane. The same file works fine with OCRmyPDF 2.2.

@KBDCALLS
Author

I am using Debian 8.3 AMD64

@jbarlow83
Collaborator

There's something unusual about that PDF. Can you upload a copy to Dropbox
(or equivalent) and send me a link?


@KBDCALLS
Author

Here is a PDF. I created it with Xsane 0.998.

out0002.pdf

@jbarlow83
Collaborator

Xsane does not seem to produce syntactically valid PDFs.

You can fix it with Ghostscript which seems tolerant of the problems, and then run it through ocrmypdf.

gs -o "$output" -sDEVICE=pdfwrite "$input"
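
For batch use, the workaround above can be wrapped in a small script. This is only a sketch: it assumes `gs` and `ocrmypdf` are both on PATH, and the function names and the intermediate file name are hypothetical choices made for illustration, not part of ocrmypdf.

```python
import subprocess
from pathlib import Path


def build_gs_command(input_pdf, output_pdf):
    """Return the Ghostscript rewrite command from this thread as an argument list."""
    return ["gs", "-o", str(output_pdf), "-sDEVICE=pdfwrite", str(input_pdf)]


def build_ocr_command(input_pdf, output_pdf):
    """Return a plain `ocrmypdf input output` invocation."""
    return ["ocrmypdf", str(input_pdf), str(output_pdf)]


def repair_then_ocr(input_pdf, output_pdf):
    """Rewrite the PDF with Ghostscript, then OCR the rewritten copy.

    The intermediate file name is a hypothetical choice for this sketch.
    check=True makes either step raise CalledProcessError on failure.
    """
    input_pdf = Path(input_pdf)
    repaired = input_pdf.with_name(input_pdf.stem + ".gs-repaired.pdf")
    subprocess.run(build_gs_command(input_pdf, repaired), check=True)
    subprocess.run(build_ocr_command(repaired, output_pdf), check=True)
    return repaired
```

Calling `repair_then_ocr("out.pdf", "o1.pdf")` would leave the Ghostscript output next to the input so it can be inspected if the OCR step still fails.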

Details on the nature of the errors:

qpdf finds and fixes some problems, but doesn't seem to fix everything. poppler doesn't understand the file, and neither does PyPDF2, the library I rely on for PDF parsing. Xsane takes the unusual approach of embedding a large image as an inline image (normally used for small images only) rather than as a proper PDF object.
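
The final frame of the traceback fails on `chr(int(x, base=16))` with the bytes `b'XT'`. A minimal sketch of that failure, under one assumption the thread does not confirm: that the parser wandered into the inline image's binary data and tried to read part of it as a hex string. The helper names are hypothetical.

```python
def decode_hex_pair(pair):
    """Decode two hex digits the same way the failing traceback line does."""
    return chr(int(pair, base=16))


def try_decode(pair):
    """Return the decoded character, or the error text if the pair is not hex."""
    try:
        return decode_hex_pair(pair)
    except ValueError as exc:
        return "ValueError: {}".format(exc)


print(try_decode(b"41"))  # a valid hex pair
print(try_decode(b"XT"))  # the bytes reported in the traceback
```

Any byte pair outside `0-9A-Fa-f` raises the same `ValueError`, which is why raw image bytes are a likely trigger.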

@jbarlow83 changed the title from "What's wrong with OCRmyPDF 4.07" to "Syntax errors in PDFs created by Xsane cause ocrmypdf to fail with errors" on Mar 21, 2016
@jbarlow83
Collaborator

Should be fixed in 4.1

@jbarlow83
Collaborator

Sorry, it's not completely fixed, but at least the error message is now somewhat coherent.

It will work in this mode:

ocrmypdf --pdf-renderer tesseract ...

I recommend manually patching tesseract's OCR font, which fixes problems with PDFs produced by tesseract. Just replace the tessdata/pdf.ttf with this file from upstream.

https://github.com/tesseract-ocr/tesseract/pull/220/files
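
If you would rather script the font swap, something like the following would do it while keeping a backup. It is a sketch only: the location of the tessdata directory varies by installation, so both paths are passed in by the caller, and the function name and the `.orig` backup name are hypothetical.

```python
import shutil
from pathlib import Path


def replace_pdf_ttf(tessdata_dir, new_font):
    """Back up tessdata/pdf.ttf, then copy the replacement font over it.

    Both paths come from the caller; nothing here knows where
    tesseract is installed.
    """
    target = Path(tessdata_dir) / "pdf.ttf"
    backup = target.with_name("pdf.ttf.orig")
    # Keep the first original only, so repeated runs do not overwrite it.
    if target.exists() and not backup.exists():
        shutil.copy2(str(target), str(backup))
    shutil.copy2(str(new_font), str(target))
    return backup
```

Writing into a system tessdata directory usually needs root privileges, and restoring the original is a matter of copying `pdf.ttf.orig` back.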

@jbarlow83
Collaborator

I believe the underlying issue was in PyPDF2 1.25.1, which was the best available version in April. Under PyPDF2 1.26.0 this issue seems to be fixed. Tested with ocrmypdf 4.3.3 and it works.
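
To check whether an installed PyPDF2 is at least the version mentioned above, a simple comparison is enough. This sketch assumes plain dotted numeric version strings; the helper names are hypothetical.

```python
def parse_version(text):
    """Turn a dotted version such as '1.26.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def pypdf2_is_new_enough(installed, minimum="1.26.0"):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)
```

The installed version string can be read from the output of `pip show PyPDF2` and passed in as `installed`.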
