recent python-ruffus error #140

Closed
sagittarius06 opened this issue Mar 5, 2017 · 8 comments

@sagittarius06

On Arch Linux, ocrmypdf recently stopped working.

$ ocrmypdf -l fra scan0196.pdf out.pdf
ERROR - Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/usr/lib/python3.6/site-packages/ruffus/task.py", line 567, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/usr/lib/python3.6/site-packages/ocrmypdf/pipeline.py", line 497, in ocr_tesseract_hocr
    log=log
  File "/usr/lib/python3.6/site-packages/ocrmypdf/exec/tesseract.py", line 232, in generate_hocr
    universal_newlines=True, timeout=timeout)
  File "/usr/lib/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 405, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/usr/lib/python3.6/subprocess.py", line 836, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib/python3.6/subprocess.py", line 1533, in _communicate
    self.stdout.errors)
  File "/usr/lib/python3.6/subprocess.py", line 735, in _translate_newlines
    data = data.decode(encoding, errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 155: invalid start byte
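
For anyone reading along: the last frame of the traceback can be reproduced without Tesseract. This is a minimal sketch, assuming only that the child process emitted a stray 0x84 byte somewhere in its output. The byte string `raw` and both helper functions are made up for illustration and are not taken from ocrmypdf.

```python
# Hypothetical stand-in for the raw bytes captured from the child process.
raw = b"Tesseract Open Source OCR Engine\n\x84 not valid utf-8\n"


def decode_strict(data: bytes) -> str:
    # Strict UTF-8 decoding, like the data.decode(encoding, errors) call
    # in the final traceback frame when the encoding is utf-8.
    return data.decode("utf-8")


def decode_lenient(data: bytes) -> str:
    # Undecodable bytes become U+FFFD instead of raising.
    return data.decode("utf-8", errors="replace")


try:
    decode_strict(raw)
    strict_failed = False
    failure = ""
except UnicodeDecodeError as exc:
    strict_failed = True
    failure = str(exc)

lenient = decode_lenient(raw)
print(failure)
print(repr(lenient))
```

0x84 is a UTF-8 continuation byte, so it cannot start a character, which is why the strict decode reports "invalid start byte".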

@jbarlow83
Collaborator

jbarlow83 commented Mar 5, 2017 via email

@sagittarius06
Author

sagittarius06 commented Mar 5, 2017

Here is an example that fails, scanned on my HP 8600:
scan0197.pdf

For info:

$ unpaper -version
6.1
$ tesseract --version
tesseract 3.05.00
leptonica-1.74
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.1) : libpng 1.6.28 : libtiff 4.0.7 : zlib 1.2.11 : libwebp 0.5.2

@sagittarius06
Author

sagittarius06 commented Mar 8, 2017

Issue resolved with latest update.

It seems it was caused by the tesseract language files. Output of pkgInstallDateLister --explicit:

tesseract-data-deu-1:3.04.00-1 2017-03-07 11:30:05
tesseract-data-eng-1:3.04.00-1 2017-03-07 11:30:06
tesseract-data-fra-1:3.04.00-1 2017-03-07 11:30:06

@rennefJ

rennefJ commented Nov 28, 2017

I am seeing exactly the same issue. I am using the Homebrew version on macOS 10.13.1.
'UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 155: invalid start byte'
I haven't used it in some time, so I am not sure when it stopped working.
The ocrmypdf version is 5.4.3 and the tesseract version is 3.05.01.
If it really is the tesseract data files, I am using the most recent ones for the 3.05 release.

Any suggestions on how to fix it are appreciated.

@jbarlow83
Collaborator

jbarlow83 commented Nov 28, 2017 via email

@rennefJ

rennefJ commented Nov 29, 2017

Thank you for asking these questions.
I was able to solve the issue.
The error only occurred when selecting German as the language.
It turns out I was using the German tessdata file for the 4.0 branch instead of the 3.05 branch.
This was just my mistake: since I don't know how to download the German language file automatically, I always download it manually.

@jbarlow83
Collaborator

@rennefJ I added a change to v5.4.4 that should print a helpful error instead of suppressing the error from tesseract. Could you test it again with v5.4.4 and let me know what happens? I was not able to replicate it exactly by replacing the 3.05 tessdata with 4.00.
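
For readers wondering what that kind of change looks like in general, here is a rough sketch. It is not the actual v5.4.4 code. The function `run_and_decode`, the logger name, and the stand-in child command are all hypothetical. The idea is to capture the child's output as bytes, decode it explicitly, and log a readable message if the bytes are not UTF-8 instead of letting the `UnicodeDecodeError` escape.

```python
import logging
import subprocess
import sys

log = logging.getLogger("sketch")  # hypothetical logger name


def run_and_decode(args, timeout=None):
    """Run a command, capture its output as bytes, and decode defensively."""
    proc = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        timeout=timeout,
    )
    try:
        return proc.stdout.decode("utf-8")
    except UnicodeDecodeError:
        # Report the problem in plain words, then fall back to a lossy decode
        # so the caller still gets something to work with.
        log.error(
            "command line output was not utf-8; "
            "the language packs may not match the installed version"
        )
        return proc.stdout.decode("utf-8", errors="replace")


# Stand-in child process that writes one invalid byte; no Tesseract needed.
child = [sys.executable, "-c", "import sys; sys.stdout.buffer.write(b'ok \\x84 done')"]
text = run_and_decode(child)
print(repr(text))
```

Decoding after the process has finished keeps the failure in one place, so it can be turned into a log message rather than a traceback from inside `subprocess`.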

@rennefJ

rennefJ commented Dec 4, 2017

@jbarlow83 I updated to v5.4.4 and ran with the wrong tessdata file again.
What happens is that it puts out 100k lines of text on the console. The first line is the following error message:

ERROR - 1: [tesseract] command line output was not utf-8. This usually means Tesseract's language packs do not match the installed version of Tesseract.

The rest is INFO level messages. I have attached the console output as a compressed text file.
ocymypdf_testlog.txt.zip
In the end it creates the output pdf, which it did not do before, and writes:

INFO - Output file is a PDF/A-2B (as expected)

The first line is a useful error message, but it could get drowned out by all the other output.
