recent python-ruffus error #140

sagittarius06 · 2017-03-05T14:33:06Z

On Arch Linux, ocrmypdf recently stopped working.

$ ocrmypdf -l fra scan0196.pdf out.pdf
ERROR - Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/usr/lib/python3.6/site-packages/ruffus/task.py", line 567, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/usr/lib/python3.6/site-packages/ocrmypdf/pipeline.py", line 497, in ocr_tesseract_hocr
    log=log
  File "/usr/lib/python3.6/site-packages/ocrmypdf/exec/tesseract.py", line 232, in generate_hocr
    universal_newlines=True, timeout=timeout)
  File "/usr/lib/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 405, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/usr/lib/python3.6/subprocess.py", line 836, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib/python3.6/subprocess.py", line 1533, in _communicate
    self.stdout.errors)
  File "/usr/lib/python3.6/subprocess.py", line 735, in _translate_newlines
    data = data.decode(encoding, errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 155: invalid start byte

jbarlow83 · 2017-03-05T19:33:27Z

It seems that tesseract printed an invalid character to its standard output. Maybe this is a tesseract 3.05 issue as that was just released. Please send the file if possible.
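For anyone who wants to see the failure in isolation, here is a minimal sketch. The byte string is made up for illustration (it is not real tesseract output) and `try_decode` is a hypothetical helper. It only shows that a lone 0x84 byte, which UTF-8 reserves for continuation bytes, cannot be decoded as the start of a character.

```python
# Made-up output: valid ASCII followed by a stray 0x84 byte.
raw = b"Tesseract says: \x84"


def try_decode(data: bytes) -> str:
    """Decode bytes as UTF-8, or describe where decoding failed."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the offending byte, exc.reason the cause.
        return f"decode failed at position {exc.start}: {exc.reason}"


message = try_decode(raw)
print(message)
```

This is the same exception that subprocess raises in the traceback above when it is asked to return text (universal_newlines=True) and the child process writes bytes that are not valid UTF-8.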

sagittarius06 · 2017-03-05T20:03:04Z

Here is an example that fails, scanned on my HP 8600:
scan0197.pdf

For info:

$ unpaper -version
6.1
$ tesseract --version
tesseract 3.05.00
leptonica-1.74
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.1) : libpng 1.6.28 : libtiff 4.0.7 : zlib 1.2.11 : libwebp 0.5.2

sagittarius06 · 2017-03-08T08:16:45Z

Issue resolved with latest update.

It seems the cause was the tesseract language files. Output of pkgInstallDateLister --explicit:

tesseract-data-deu-1:3.04.00-1 2017-03-07 11:30:05
tesseract-data-eng-1:3.04.00-1 2017-03-07 11:30:06
tesseract-data-fra-1:3.04.00-1 2017-03-07 11:30:06

rennefJ · 2017-11-28T14:57:31Z

I am seeing exactly the same issue. I am using the Homebrew version on macOS 10.13.1.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 155: invalid start byte
I haven't used it in some time, so I am not sure when it stopped working.
ocrmypdf version is 5.4.3 and tesseract version is 3.05.01.
If it really is the tesseract data files, I am using the most recent ones for the 3.05 release.

Any suggestions on how to fix it are appreciated.

jbarlow83 · 2017-11-28T22:19:19Z

Do you have a file and command line that demonstrates the issue? Or do all files and arguments seem to fail? Can you run tesseract on an image on its own?

rennefJ · 2017-11-29T12:16:33Z

Thank you for asking these questions.
I was able to solve the issue.
The error only occurred when selecting German as language.
It turns out I was using the German tessdata file for the 4.0 branch instead of the 3.05 branch.
This was just my mistake: since I don't know how to download the German language file automatically, I always download it manually.

jbarlow83 · 2017-11-29T23:59:11Z

@rennefJ I added a change to v5.4.4 that should print a helpful error instead of suppressing the error from tesseract. Could you test it again with v5.4.4 and let me know what happens? I was not able to replicate it exactly by replacing the 3.05 tessdata with 4.00.
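As a rough illustration of that kind of change (this is not ocrmypdf's actual code), a wrapper can capture the child's output as raw bytes and decode it itself, so that a bad byte produces a hint rather than a traceback. `run_and_decode`, `HINT`, and `fake_tool` are hypothetical names. The hint wording is borrowed from the error message quoted later in this thread, and a child Python process stands in for tesseract.

```python
import subprocess
import sys

HINT = (
    "command line output was not utf-8. This usually means Tesseract's "
    "language packs do not match the installed version of Tesseract."
)


def run_and_decode(args, timeout=None):
    """Run a command and return (text, was_valid_utf8)."""
    # Capture raw bytes. With universal_newlines=True, subprocess decodes
    # internally and raises before the caller can react.
    raw = subprocess.check_output(args, stderr=subprocess.STDOUT, timeout=timeout)
    try:
        return raw.decode("utf-8"), True
    except UnicodeDecodeError:
        # Replace undecodable bytes with U+FFFD so the rest of the output survives.
        return raw.decode("utf-8", errors="replace"), False


# Stand-in for tesseract: a child process that emits a stray 0x84 byte.
fake_tool = [sys.executable, "-c", "import sys; sys.stdout.buffer.write(b'ok \\x84')"]
text, valid = run_and_decode(fake_tool)
if not valid:
    print("ERROR - [tesseract] " + HINT, file=sys.stderr)
```

Decoding with errors="replace" keeps the pipeline going, which trades strictness for the chance to still produce an output file.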

rennefJ · 2017-12-04T08:43:52Z

@jbarlow83 I updated to v5.4.4 and ran it with the wrong tessdata file again.
What happens is that it puts out 100k lines of text on the console. The first line is the following error message:

ERROR - 1: [tesseract] command line output was not utf-8. This usually means Tesseract's language packs do not match the installed version of Tesseract.

The rest is INFO level messages. I have attached the console output as a compressed text file.
ocymypdf_testlog.txt.zip
In the end it creates the output pdf, which it did not do before, and writes:

INFO - Output file is a PDF/A-2B (as expected)

The first line is a useful error message, but it could get drowned out by all the other output.

sagittarius06 closed this as completed Mar 8, 2017

jbarlow83 mentioned this issue Nov 30, 2017: NixOS packaging issues #202 (Closed)
