recent python-ruffus error #140
Comments
It seems that tesseract printed an invalid character to its standard
output. Maybe this is a tesseract 3.05 issue as that was just released.
Please send the file if possible.
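To illustrate the failure mode described above, here is a minimal sketch (not ocrmypdf's actual code). It uses a stand-in child process that writes the byte 0x84 to standard output, as tesseract apparently did. The names `child`, `run_strict`, and `run_tolerant` are made up for this example, and `encoding="utf-8"` is passed explicitly so the result does not depend on the locale, whereas the traceback shows `universal_newlines=True`.

```python
import subprocess
import sys

# Stand-in for tesseract (an assumption for illustration): a child process
# that writes 0x84, which is not a valid UTF-8 start byte, to stdout.
child = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(b'ok \\x84 done\\n')",
]


def run_strict():
    # Text mode with strict UTF-8 decoding: raises UnicodeDecodeError
    # when the child's output contains an invalid byte.
    return subprocess.check_output(child, encoding="utf-8")


def run_tolerant():
    # Capture raw bytes, then decode while replacing invalid bytes with
    # U+FFFD so the rest of the output is still readable.
    raw = subprocess.check_output(child)
    return raw.decode("utf-8", errors="replace")
```

Calling `run_strict()` fails in the same way as the traceback in this issue, while `run_tolerant()` returns the text with the bad byte replaced, which can help when inspecting what the external program actually printed.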
|
Here is an example that fails, scanned on my HP 8600. For info: $ unpaper -version |
Issue resolved with the latest update. It seems it was caused by the tesseract language files:
$ pkgInstallDateLister --explicit
tesseract-data-deu-1:3.04.00-1 2017-03-07 11:30:05 |
I am seeing exactly the same issue. I am using the Homebrew version on macOS 10.13.1. Any suggestions on how to fix it are appreciated. |
Do you have a file and command line that demonstrates the issue? Or do all
files and arguments seem to fail?
Can you run tesseract on an image on its own?
…On Nov 28, 2017 06:57, "rennefJ" ***@***.***> wrote:
I am seeing exactly the same issue. I am using the homebrew version on
macOS 10.13.1.
'UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 155:
invalid start byte'
I haven't used it in some time, so I am not sure since when it does not
work anymore.
ocrmypdf version 5.4.3 and tesseract version is 3.05.01
If it is really the tesseract data files, I am using the most recent ones
for the 3.05 release.
Any suggestions on how to fix it are appreciated.
|
Thank you for asking these questions. |
@rennefJ I added a change to v5.4.4 that should print a helpful error instead of suppressing the error from tesseract. Could you test it again with v5.4.4 and let me know what happens? I was not able to replicate it exactly by replacing the 3.05 tessdata with 4.00. |
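As a rough sketch of the general idea of surfacing a readable error instead of a bare `UnicodeDecodeError`, something like the following could be used. This is not the actual change in v5.4.4; `OcrOutputError`, `run_and_decode`, and `demo_cmd` are hypothetical names, and the demo command is a stand-in for tesseract.

```python
import subprocess
import sys


class OcrOutputError(Exception):
    """Hypothetical exception for undecodable output from an external tool."""


def run_and_decode(args):
    # Capture stdout and stderr together as raw bytes.
    proc = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    try:
        return proc.stdout.decode("utf-8")
    except UnicodeDecodeError as e:
        # Include a lossy preview so the user can see what the tool printed.
        preview = proc.stdout.decode("utf-8", errors="replace")
        bad_byte = proc.stdout[e.start]
        raise OcrOutputError(
            f"{args[0]} produced output that is not valid UTF-8 "
            f"(byte 0x{bad_byte:02x} at position {e.start}). "
            f"Output was:\n{preview}"
        ) from e


# Stand-in command (an assumption) that emits an invalid byte.
demo_cmd = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(b'ok \\x84 done\\n')",
]
```

Calling `run_and_decode(demo_cmd)` raises `OcrOutputError` with the offending byte, its position, and a readable preview of the output, which keeps the cause visible even when the log also contains many other messages.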
@jbarlow83 I updated to v5.4.4 and ran with the wrong tessdata file again.
The rest is INFO level messages. I have attached the console output as a compressed text file.
The first line is a useful error message, but it could get drowned out by all the other output. |
On Arch Linux, ocrmypdf recently stopped working.
$ ocrmypdf -l fra scan0196.pdf out.pdf
ERROR - Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
register_cleanup, touch_files_only)
File "/usr/lib/python3.6/site-packages/ruffus/task.py", line 567, in job_wrapper_io_files
ret_val = user_defined_work_func(*params)
File "/usr/lib/python3.6/site-packages/ocrmypdf/pipeline.py", line 497, in ocr_tesseract_hocr
log=log
File "/usr/lib/python3.6/site-packages/ocrmypdf/exec/tesseract.py", line 232, in generate_hocr
universal_newlines=True, timeout=timeout)
File "/usr/lib/python3.6/subprocess.py", line 336, in check_output
**kwargs).stdout
File "/usr/lib/python3.6/subprocess.py", line 405, in run
stdout, stderr = process.communicate(input, timeout=timeout)
File "/usr/lib/python3.6/subprocess.py", line 836, in communicate
stdout, stderr = self._communicate(input, endtime, timeout)
File "/usr/lib/python3.6/subprocess.py", line 1533, in _communicate
self.stdout.errors)
File "/usr/lib/python3.6/subprocess.py", line 735, in _translate_newlines
data = data.decode(encoding, errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 155: invalid start byte