UnicodeDecodeErrors when processing certain images #26

Closed
davidelstone opened this issue Nov 8, 2015 · 1 comment

Comments

@davidelstone

I ran into UnicodeDecodeErrors when processing some images, particularly ones where the type isn't very clear.

It seems that tesseract is writing some bytes that Python can't decode with the codec it picks by default. I did a bit of digging, and it seems the standard output from tesseract is encoded in UTF-8:

https://code.google.com/p/tesseract-ocr/wiki/FAQ#What_output_formats_can_Tesseract_produce?
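A minimal sketch of the suspected failure mode, assuming the reader's default codec is not UTF-8. The `ascii` codec here is only a stand-in for whatever default the platform picks, and the sample string is made up for illustration; it is not real tesseract output.

```python
# Bytes as a UTF-8 producer would write them for the text "naïve".
utf8_bytes = "naïve".encode("utf-8")

try:
    # Stand-in for a non-UTF-8 default codec (an assumption for this sketch).
    utf8_bytes.decode("ascii")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print("decode failed:", exc.reason)

# Decoding with the codec the bytes were actually written in succeeds.
recovered = utf8_bytes.decode("utf-8")
print(recovered)
```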

Changing the open() call in pytesseract.py to state the expected encoding explicitly prevents the error, but it also changes the print behaviour (returns encoded bytes instead of a string):

f = open(output_file_name, encoding='utf-8', errors='replace')

Changing the return statement to encode the output to ASCII and then decode it keeps the current output unchanged (as far as I can see; I tested it with a few working images) and also suppresses the UnicodeDecodeErrors.

return f.read().strip().encode('ascii', 'replace').decode('ascii', 'replace')

This seems to work, and it seems to allow any image to be processed with some output, however useful that output may be.
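The two changes above can be sketched together as a standalone script. This is only an illustration under assumptions: `read_ocr_output` and `to_ascii` are hypothetical names, not functions in pytesseract, and a temporary file with a stray byte stands in for tesseract's output file. Note that the ASCII round trip turns every non-ASCII character into `?`, so accented characters are lost.

```python
import os
import tempfile


def read_ocr_output(path):
    """Read a text file as UTF-8, replacing undecodable bytes with U+FFFD."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read().strip()


def to_ascii(text):
    """Replace every non-ASCII character with '?' so the result is plain ASCII."""
    return text.encode("ascii", "replace").decode("ascii", "replace")


# Simulated output file: valid UTF-8 ("café ") followed by a stray 0xff byte.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write("café ".encode("utf-8") + b"\xff\n")
    sample_path = tmp.name

try:
    decoded = read_ocr_output(sample_path)  # first change: explicit encoding
    ascii_only = to_ascii(decoded)          # second change: ASCII round trip
finally:
    os.remove(sample_path)

print(repr(decoded))
print(repr(ascii_only))
```

With only the first change, non-ASCII text survives and bad bytes become U+FFFD; the second change trades that fidelity for output that is guaranteed to be ASCII.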

Thoughts welcome - not sure if it's the best way to go about resolving the errors.

@qacollective

I too had to make the UTF-8 change (the first, not the second) to pytesseract.py to get it working, for the same reasons. Getting a stream of bytes back is no problem for me because I can convert it into what I need using .encode and .decode.
