I too had to make the UTF-8 change (the first, not the second) to pytesseract.py to get it working for the same reasons. A stream of bytes being returned is no problem for me because I know how to convert that into what I need using .encode and .decode.
I ran into instances of a UnicodeDecodeError when processing some images, particularly ones where the type isn't very clear.
It looks like Tesseract is emitting some bytes that Python can't decode with its default codec. After a bit of digging, it seems that Tesseract's standard output is encoded in UTF-8:
https://code.google.com/p/tesseract-ocr/wiki/FAQ#What_output_formats_can_Tesseract_produce?
Altering the file open call in pytesseract.py to state the expected encoding explicitly prevents the error, but it also changes the print behaviour (returns encoded bytes instead of a string):
f = open(output_file_name, encoding='utf-8', errors='replace')
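For anyone who wants to try this outside pytesseract, here is a minimal sketch of the same open call on Python 3. The helper name `read_ocr_output` and the sample bytes are made up for illustration; they are not part of pytesseract. Note that on Python 3 this returns a `str`, and Python 2's built-in `open` does not accept an `encoding` argument at all, so the bytes-versus-string behaviour described above may depend on the interpreter version.

```python
import os
import tempfile


def read_ocr_output(path):
    """Hypothetical helper: read a text file as UTF-8, tolerating bad bytes.

    errors='replace' swaps every undecodable byte for U+FFFD instead of
    raising UnicodeDecodeError.
    """
    with open(path, encoding='utf-8', errors='replace') as f:
        return f.read().strip()


# Simulate an output file: valid UTF-8 ("caf\u00e9") followed by a stray 0xFF byte.
with tempfile.NamedTemporaryFile(suffix='.txt', delete=False) as tmp:
    tmp.write(b'caf\xc3\xa9 \xff text\n')
    sample_path = tmp.name

text = read_ocr_output(sample_path)
os.remove(sample_path)

print(repr(text))  # the invalid byte shows up as '\ufffd'
```

The valid non-ASCII character survives, and only the undecodable byte is replaced.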
Altering the return statement to encode the output to ASCII and then decode it keeps the current output unchanged (as far as I can see; I tested it with a few working images) and also suppresses the UnicodeDecodeErrors:
return f.read().strip().encode('ascii', 'replace').decode('ascii', 'replace')
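To see what that round trip does to the text, here is a small standalone sketch on Python 3. The function name `to_ascii_text` and the sample string are assumptions for illustration only. Every character outside ASCII, including legitimate accented letters, comes back as `?`.

```python
def to_ascii_text(text):
    """Hypothetical helper mirroring the return line above.

    Encoding to ASCII with 'replace' turns each non-ASCII character into
    b'?'; decoding the result then always succeeds.
    """
    return text.strip().encode('ascii', 'replace').decode('ascii', 'replace')


sample = 'caf\u00e9 \ufffd text'   # an accented letter plus a replacement character
result = to_ascii_text(sample)

print(result)  # caf? ? text
```

So the errors go away, but at the cost of any non-ASCII characters that Tesseract recognised correctly, which matters for languages other than English.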
This seems to work, and it seems to allow any image to be processed and produce some output, however useful that output may be.
Thoughts welcome; I'm not sure this is the best way to resolve the errors.