UnicodeDecodeErrors when processing certain images #26

davidelstone · 2015-11-08T11:17:47Z

I ran into instances of a UnicodeDecodeError when processing some images, particularly ones where the type isn't very clear.

It seems that tesseract is writing bytes that Python cannot decode with the codec it picks by default. I did a bit of digging, and it looks like the standard output from tesseract is encoded in UTF-8:

https://code.google.com/p/tesseract-ocr/wiki/FAQ#What_output_formats_can_Tesseract_produce?
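To illustrate the failure, here is a minimal sketch of what happens when UTF-8 bytes are decoded with a different codec. The choice of cp1252 as the platform default is an assumption made for the example (the actual default depends on the OS and locale), and the sample character is hypothetical rather than taken from a real OCR run.

```python
# UTF-8 bytes for a right double quotation mark (U+201D),
# the kind of character OCR output can contain.
raw = "\u201d".encode("utf-8")

# Decoding with the matching codec round-trips cleanly.
decoded_ok = raw.decode("utf-8")

# Decoding with cp1252 (assumed here to be the platform default) fails,
# because one of the bytes has no mapping in that code page.
try:
    raw.decode("cp1252")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

The same bytes are valid under one codec and invalid under another, which is why the error only shows up on some images and some machines.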

Altering the file open command in pytesseract.py to explicitly state the expected encoding prevents the error, but also changes the print behaviour (returns encoded bytes instead of a string):

f = open(output_file_name, encoding='utf-8', errors='replace')
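As a self-contained sketch of that change, the snippet below reads a file with an explicit encoding. It assumes Python 3, where the built-in open() accepts encoding and errors; the helper name read_ocr_output and the sample bytes are hypothetical and are not part of pytesseract.

```python
import os
import tempfile


def read_ocr_output(path):
    """Read a text file as UTF-8, replacing undecodable bytes with U+FFFD."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read().strip()


# Write sample bytes: valid UTF-8 for "café", then a stray invalid byte (0xff).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"caf\xc3\xa9 \xff\n")
    sample_path = tmp.name

try:
    text = read_ocr_output(sample_path)
finally:
    os.remove(sample_path)

print(repr(text))
```

With errors='replace', an invalid byte becomes the replacement character instead of raising, so the read always succeeds.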

Altering the return statement to encode the output to ASCII and then decode it keeps the current output unchanged (as far as I can see, having tested it with a few working images) and also suppresses the UnicodeDecodeErrors.

return f.read().strip().encode('ascii', 'replace').decode('ascii', 'replace')
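To show what that round trip does to the text, here is a small sketch applied to a plain string instead of a file. The function name to_ascii and the sample input are hypothetical.

```python
def to_ascii(text):
    """Strip whitespace, then replace every non-ASCII character with '?'."""
    return text.strip().encode("ascii", "replace").decode("ascii", "replace")


# Sample input with an accented letter and curly quotes.
cleaned = to_ascii("  caf\u00e9 \u201cquoted\u201d \n")
print(cleaned)
```

The trade-off is that every non-ASCII character is lost, so this may suit English-only text but would damage output in other languages.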

This seems to work, and it seems to allow any image to be processed and produce some output, however useful that output may be.

Thoughts welcome - not sure if it's the best way to go about resolving the errors.


qacollective · 2017-02-04T10:07:24Z

I too had to make the UTF-8 change (the first, not the second) to pytesseract.py to get it working for the same reasons. A stream of bytes being returned is no problem for me because I know how to convert that into what I need using .encode and .decode.

chicocvenancio mentioned this issue Sep 29, 2016

Pytesseract: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2: character maps to <undefined> #44

Closed

bozhodimitrov mentioned this issue Apr 12, 2017

Decode tesseract's output as UTF-8 #33

Merged

bozhodimitrov closed this as completed Apr 12, 2017
