Syntax errors in PDFs created by Xsane cause ocrmypdf to fail with errors #61
I am using Debian 8.3 AMD64.
There's something unusual about that PDF. Can you upload a copy to Dropbox?
Here is a PDF. I created it with Xsane 0.998.
Xsane does not seem to produce syntactically valid PDFs. You can fix it with Ghostscript which seems tolerant of the problems, and then run it through ocrmypdf.
Details on the nature of the errors: qpdf finds and fixes some problems, but doesn't seem to fix everything. Neither poppler nor PyPDF2, the library I rely on for PDF parsing, can parse the file. Xsane takes the unusual approach of storing a large image as an inline image (normally used for small images only) rather than as a proper PDF object.
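For anyone who wants to script the workaround above, here is a minimal sketch, assuming `gs` and `ocrmypdf` are both on the PATH. The function names and the intermediate file name are made up for illustration. `-o` and `-sDEVICE=pdfwrite` are standard Ghostscript options, but whether the rewritten file is clean enough will depend on the input.

```python
import shutil
import subprocess


def ghostscript_rewrite_cmd(src, dst, gs_binary="gs"):
    """Build a Ghostscript command that rewrites src as a fresh PDF at dst."""
    return [
        gs_binary,
        "-o", dst,            # output file; -o also implies -dBATCH -dNOPAUSE
        "-sDEVICE=pdfwrite",  # re-emit the document through the PDF writer
        src,
    ]


def ocrmypdf_cmd(src, dst):
    """Build the plain ocrmypdf invocation (input file, then output file)."""
    return ["ocrmypdf", src, dst]


def repair_and_ocr(src, repaired, final):
    """Run Ghostscript first, then ocrmypdf on the repaired file."""
    for tool in ("gs", "ocrmypdf"):
        if shutil.which(tool) is None:
            raise FileNotFoundError(tool + " not found on PATH")
    # check=True raises CalledProcessError if either tool exits non-zero.
    subprocess.run(ghostscript_rewrite_cmd(src, repaired), check=True)
    subprocess.run(ocrmypdf_cmd(repaired, final), check=True)
```

Calling `repair_and_ocr("out.pdf", "out.gs.pdf", "o1.pdf")` would run both steps in order; `out.gs.pdf` is just a placeholder name for the intermediate file.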
Should be fixed in 4.1.
Sorry, it's not completely fixed, but at least the error message is now somewhat coherent. It will work in this mode:
I recommend manually patching tesseract's OCR font, which fixes problems with PDFs produced by tesseract. Just replace the tessdata/pdf.ttf with this file from upstream.
I believe the underlying issue was in PyPDF2 1.25.1, which was the best version available in April. Under PyPDF2 1.26.0 this issue seems to be fixed. Tested with ocrmypdf 4.3.3 and it works.
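If you want to check whether an installation is past the version mentioned above, a small comparison like the following may help. This is only a sketch: the helper names are invented here, and the 1.26.0 threshold is taken from the comment above rather than from any changelog.

```python
import re


def version_tuple(version):
    """Turn "1.26.0" into (1, 26, 0), keeping only leading digits of each part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)


def has_fixed_pypdf2(installed, fixed_in="1.26.0"):
    """Return True if the installed version is at least fixed_in."""
    return version_tuple(installed) >= version_tuple(fixed_in)
```

In practice you would pass the installed version string, for example `has_fixed_pypdf2(PyPDF2.__version__)`, assuming your PyPDF2 build exposes that attribute.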
ocrmypdf out.pdf o1.pdf
--- Logging error ---
Traceback (most recent call last):
File "/usr/lib/python3.4/logging/__init__.py", line 978, in emit
msg = self.format(record)
File "/usr/lib/python3.4/logging/__init__.py", line 828, in format
return fmt.format(record)
File "/usr/lib/python3.4/logging/__init__.py", line 565, in format
record.message = record.getMessage()
File "/usr/lib/python3.4/logging/__init__.py", line 326, in getMessage
msg = str(self.msg)
File "/usr/local/lib/python3.4/dist-packages/ruffus/ruffus_exceptions.py", line 127, in __str__
message += self.get_nth_exception_str (ii)
File "/usr/local/lib/python3.4/dist-packages/ruffus/ruffus_exceptions.py", line 116, in get_nth_exception_str
task_name, job_name, exception_name, exception_value, exception_stack = self.args[nn]
ValueError: too many values to unpack (expected 5)
Call stack:
File "/usr/lib/python3.4/threading.py", line 888, in _bootstrap
self._bootstrap_inner()
File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
self.run()
File "/usr/lib/python3.4/threading.py", line 868, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.4/multiprocessing/managers.py", line 195, in handle_request
result = func(c, *args, **kwds)
File "/usr/lib/python3.4/multiprocessing/managers.py", line 392, in accept_connection
self.serve_client(c)
File "/usr/lib/python3.4/multiprocessing/managers.py", line 241, in serve_client
res = function(*args, **kwds)
Message: RethrownJobError('ocrmypdf.main.repair_pdf', 'Job = [out.pdf -> .../com.github.ocrmypdf.frjgsbgy/out.repaired.pdf, <ocrmypdf.main.WrappedLogger>, [], <_thread.lock>]', 'builtins.ValueError', "(invalid literal for int() with base 16: b'XT')", 'Traceback (most recent call last):\n File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions\n register_cleanup, touch_files_only)\n File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files\n ret_val = user_defined_work_func(*params)\n File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/main.py", line 392, in repair_pdf\n pdfinfo.extend(pdf_get_all_pageinfo(output_file))\n File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/pageinfo.py", line 314, in pdf_get_all_pageinfo\n return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]\n File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/pageinfo.py", line 314, in \n return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]\n File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/pageinfo.py", line 283, in _pdf_get_pageinfo\n pageinfo['has_text'] = _page_has_text(pdf, page)\n File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/pageinfo.py", line 255, in _page_has_text\n text = page.extractText()\n File "/usr/local/lib/python3.4/dist-packages/PyPDF2/pdf.py", line 2566, in extractText\n content = ContentStream(content, self.pdf)\n File "/usr/local/lib/python3.4/dist-packages/PyPDF2/pdf.py", line 2645, in init\n self.__parseContentStream(stream)\n File "/usr/local/lib/python3.4/dist-packages/PyPDF2/pdf.py", line 2677, in __parseContentStream\n operands.append(readObject(stream, None))\n File "/usr/local/lib/python3.4/dist-packages/PyPDF2/generic.py", line 68, in readObject\n return readHexStringFromStream(stream)\n File "/usr/local/lib/python3.4/dist-packages/PyPDF2/generic.py", line 312, in readHexStringFromStream\n txt += chr(int(x, base=16))\nValueError: 
invalid literal for int() with base 16: b'XT'\n')
Arguments: ()
This is the output for a PDF that I created with Xsane. With OCRmyPDF 2.2 the same file is processed without problems.
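The last frames of the traceback show a hex-string reader calling `int(x, base=16)` on `b'XT'`. A simplified reader reproduces the same error message when it is handed bytes that are not hex digits. This is an illustration only, not PyPDF2's actual code, and it is an assumption that the parser reached this state by treating a `<` inside non-text data as the start of a hex string.

```python
def read_hex_string(data):
    """Decode a PDF hex string such as b"<48656C6C6F>" into text.

    Simplified illustration; not PyPDF2's implementation.
    """
    if not data.startswith(b"<"):
        raise ValueError("hex strings start with '<'")
    end = data.find(b">")
    if end == -1:
        raise ValueError("unterminated hex string")
    # Whitespace between hex digits is ignored.
    digits = bytes(b for b in data[1:end] if not chr(b).isspace())
    if len(digits) % 2:
        digits += b"0"  # an odd final digit is padded with a trailing 0
    text = ""
    for i in range(0, len(digits), 2):
        pair = digits[i:i + 2]
        text += chr(int(pair, 16))  # raises ValueError on non-hex bytes
    return text
```

With a well-formed operand, `read_hex_string(b"<48656C6C6F>")` returns `"Hello"`. With `b"<XT>"` it raises `ValueError: invalid literal for int() with base 16: b'XT'`, which matches the message in the log above.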