IOError for OSD (psm 0) #75

movaid7 · 2017-11-14T21:49:03Z

Running pytesseract on Raspbian, python 2.7, tesseract 4.00 (4.00.00dev-625-g9c2fa0d).

My code is as follows:

ret = image_to_string(im,lang='eng',boxes=False, config="-psm 0")

Error:

Traceback (most recent call last):
  File "/home/pi/Vocable/TESTING.py", line 114, in <module>
    ret = image_to_string(im,lang='eng',boxes=False, config="-psm 0")
  File "/usr/local/lib/python2.7/dist-packages/pytesseract/pytesseract.py", line 126, in image_to_string
    f = open(output_file_name, 'rb')
IOError: [Errno 2] No such file or directory: '/tmp/tess_lN5JlN.txt'

If I run code with psm 1 [Recognition with OSD], I receive no errors but the upside down text is simply treated as right-side-up text, producing garbage results. (This was tested on an inverted test.png)

Essentially text recognition works but OSD does not.

movaid7 · 2017-11-14T22:00:35Z

Do I need to just specify the location of OSD trained data and if so, how do I specify it?

bozhodimitrov · 2017-11-14T22:26:11Z

Hi @movaid7 - thanks for reporting the issue.
Let me check which arguments pytesseract passes to the tesseract executable.
By the way, which version of pytesseract are you using on Raspbian?

movaid7 · 2017-11-15T01:25:06Z

Hi. @int3l. I'm currently running pytesseract v0.1.7
[pip installed it without cloning repo]

bozhodimitrov · 2017-11-15T01:41:36Z

Thank you @movaid7 , it looks like I am familiar with your problem.
I have some time, so I will try to resolve this issue.

Can I ask you for a favor? Please post the names of the pytesseract temp files in your /tmp folder.
They should start with the "tess_" prefix.
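A quick way to gather that list is sketched below. The helper name list_tess_temp_files is made up for illustration and is not part of pytesseract; it only assumes the temp files live in the directory reported by Python's tempfile module.

```python
import glob
import os
import tempfile


def list_tess_temp_files(directory=None):
    """Return sorted paths of files whose names start with 'tess_'."""
    # Default to the directory Python's tempfile module uses (often /tmp).
    if directory is None:
        directory = tempfile.gettempdir()
    return sorted(glob.glob(os.path.join(directory, "tess_*")))


if __name__ == "__main__":
    for path in list_tess_temp_files():
        print(path)
```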

It seems that the pytesseract logic doesn't handle the tesseract output filenames correctly.
I need to know the file name format (and the file extension, if there is one).

We have two PRs for resolving this - #67 and #40.
But they are not refactored very well.

movaid7 · 2017-11-15T02:08:13Z

Sure. No problem.

├── tess_28Hb8s.bmp
├── tess_4_2VtD.osd
├── tess_7_4nr4.bmp
├── tess_BivtYq.txt
├── tess_dxX6fq.osd
├── tess_eaGs5_.txt
├── tess_FOZuxX.bmp
├── tess_hdCcMS.txt
├── tess_lN5JlN.osd
├── tess_N8B5yi.bmp
├── tess_NAcmkV.txt
├── tess_qEhWQb.txt
├── tess_RlIPP1.osd
├── tess_UEerR7.txt
├── tess_vbtZwF.bmp
├── tess_wDn6B5.osd
├── tess_YPDuMH.bmp
└── tess_ZCr9Tk.osd

bozhodimitrov · 2017-11-15T02:21:18Z

Nice. Those files are leftovers.
Can you check the content of tess_lN5JlN.osd? I need to know whether it is a plain-text file with the raw recognized text or some kind of binary, XML, or other file.

Also, if you don't need the temp files, can you delete them all and then run the failing example once?
Then post the list of created files again.
Thanks for the help!

bozhodimitrov · 2017-11-15T03:19:21Z

Also, about your question related to OSD: see the OSD information in the Tesseract documentation and the list of Tesseract PSM modes.

Basically, you should have your OSD trained data file (osd.traineddata) in the same directory as the language traineddata file.

By the way, try this command and check whether tesseract creates a txt temp file with the psm 0 option:

ret = image_to_string(im, lang='eng', config="-psm 0 txt")

movaid7 · 2017-11-15T17:02:12Z

Deleting the temp files and running the function had no effect on the outcome.

The .osd files that are produced are formatted like this:

Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 0.79
Script: Latin
Script confidence: 9.80

The osd above was produced from an inverted picture that was not recognised as such. This is the reason for the low orientation confidence level.

Running the command
ret = image_to_string(im, lang='eng', config="-psm 0 txt")
produced just a temp .osd file and no .txt file. I would assume that running any command with psm 0 should not be producing a txt file, as the command performs no OCR. I think pytesseract is incorrect in seeking a txt file here.

I also tried a picture rotated by 90° and ran a command with psm 1 (OSD+OCR), and it worked really well: the picture orientation was corrected. The .osd file produced for that particular operation was:

Page number: 0
Orientation in degrees: 90
Rotate: 270
Orientation confidence: 13.54
Script: Latin
Script confidence: 37.71
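Since the .osd output is plain "Key: value" text, it is easy to read from Python. The following is only a sketch: parse_osd is a hypothetical helper, not a pytesseract function, and it assumes the field layout shown above.

```python
def parse_osd(text):
    """Parse 'Key: value' lines from a Tesseract .osd file into a dict."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank or malformed lines
        value = value.strip()
        # Convert numeric values; leave names such as "Latin" as strings.
        try:
            value = int(value)
        except ValueError:
            try:
                value = float(value)
            except ValueError:
                pass
        result[key.strip()] = value
    return result


sample = """Page number: 0
Orientation in degrees: 90
Rotate: 270
Orientation confidence: 13.54
Script: Latin
Script confidence: 37.71
"""
osd = parse_osd(sample)
print(osd["Rotate"], osd["Script"])
```

A caller could then compare the "Orientation confidence" value against a threshold of its own choosing before trusting the detected orientation.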

bozhodimitrov · 2017-11-15T17:37:49Z

pytesseract incorrectly looks for a txt file, but at the moment this is by design for the image_to_string function. There is no logic that limits the lookup when psm 0 is used.
So the control is in the user's hands.

I'm not sure if we should alter the logic, because this is a config behavior.
Remember that pytesseract is just a wrapper.

If we apply additional control, the new version of the library might be incompatible with some of the existing user code bases.

Maybe we can make the library more granular by separating the preparation logic into different functions. This way we can have functions like image_to_osd, image_to_pdf, and so on.

Or, even better, we can add a dynamic approach like image_to_file, where you specify what you want.
This function would return the paths to the created output files.
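To make the image_to_file idea concrete, here is a rough sketch. Both function names are hypothetical and this is not how pytesseract implements anything; it assumes a `tesseract` executable on PATH and that every output file shares the output base name.

```python
import glob
import shlex
import subprocess


def collect_output_files(output_base):
    """Return every file written for this output base, whatever its extension."""
    return sorted(glob.glob(output_base + ".*"))


def image_to_file(input_path, output_base, config=""):
    """Hypothetical helper: run tesseract, then report the files it created."""
    # Assumes a `tesseract` executable is on PATH; the CLI takes the image,
    # the output base name, then any extra options.
    command = ["tesseract", input_path, output_base] + shlex.split(config)
    subprocess.check_call(command)
    return collect_output_files(output_base)
```

With this shape, a psm 0 run would simply return the path of the .osd file instead of failing on a missing .txt file.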

chingjunehao · 2017-11-18T03:23:56Z

I have the same issue too, but the tmp folder in the same folder as my code contains nothing.
Is it that the pytesseract code doesn't keep the temporary text file?

movaid7 · 2017-11-18T12:14:05Z

No, @chingjunehao
Tesseract creates a txt file of recognised text in the tmp folder. Since psm 0 is "Orientation and script detection (OSD) only", no recognition is performed and as a result no txt file is created by Tesseract.

The only file created by a psm 0 command is a .osd file that contains osd details, for example:

Orientation: 0
Orientation in degrees: 0
Orientation confidence: 22.31
Script: 1
Script confidence: 36.67

The issue is that pytesseract is looking for a txt file that never existed.
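As an illustration of the mismatch, a lookup that took the page segmentation mode into account might look like the sketch below. expected_extension is a made-up helper, not pytesseract code, and it only encodes what was observed in this thread: psm 0 yields a .osd file, otherwise a .box or .txt file.

```python
import shlex


def expected_extension(config="", boxes=False):
    """Hypothetical helper: guess which output file a run should produce."""
    args = shlex.split(config)
    psm = None
    # Look for "-psm N" or "--psm N" among the config arguments.
    for flag, value in zip(args, args[1:]):
        if flag in ("-psm", "--psm"):
            psm = value
    if psm == "0":
        return ".osd"  # OSD only: no recognition, so no .txt is written
    return ".box" if boxes else ".txt"
```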

@int3l I do think dedicated functions could be useful but the problem is what do you return for osd only? Just a text dump of the .osd file contents?

bozhodimitrov · 2017-11-18T16:35:22Z

@movaid7 I suggest leaving the image_to_string function as it is (logically): it returns text or raises an exception when there is no txt file.
And we can add a new function, image_to_file, that returns a list of the temp file paths created by tesseract.
(In this case, the user should parse the files manually according to their use case.)

chingjunehao · 2017-11-19T13:49:04Z

@movaid7 I used psm 10, which is "Treat the image as a single character." It should return the first character detected in my image (a .png file), but my tmp folder is empty.
Then I tried the method suggested by @int3l: I made a function that changes to the path where I want the output_file_name to go.

import os

def image_to_file(ofn):
    # os.chdir only changes the working directory; ofn is returned unchanged.
    os.chdir("/home/chingjunehao/Downloads/tmp")
    return ofn

# The function in image_to_string

input_file_name = "%s.bmp" % tempnam()

output_file_name_base = tempnam()
if not boxes:
    output_file_name = '%s.txt' % output_file_name_base
else:
    output_file_name = '%s.box' % output_file_name_base
image_to_file(output_file_name)

With this, isn't the file supposed to be in the path that I set?
I still end up with this problem:

Traceback (most recent call last):
  File "./Main.py", line 584, in <module>
    main()
  File "./Main.py", line 277, in main
    result111 = pytesseract.image_to_string(img10,lang = 'eng', config="-psm 10")
  File "/home/chingjunehao/.local/lib/python2.7/site-packages/pytesseract/pytesseract.py", line 131, in image_to_string
    f = open(output_file_name, 'rb')
IOError: [Errno 2] No such file or directory: '/tmp/tess_ZUXd0Y.txt'

movaid7 · 2017-11-19T14:32:57Z

@chingjunehao try running it with "boxes=False" and see whether a txt file is created at either of the two locations (the default tmp folder and the one you created within Downloads).

bozhodimitrov · 2017-11-19T18:01:24Z

@chingjunehao you can check which temp folder path is currently used with the following code:

import tempfile
print(tempfile.gettempdir())

That is the path that pytesseract passes to tesseract in its CLI arguments.

chingjunehao · 2017-11-20T00:14:19Z

@movaid7 no files are created after running, but previously there were .bmp files in the default tmp folder.
@int3l the path that was printed is /tmp, which means it runs in the same folder as my code.

bozhodimitrov · 2018-01-12T07:06:50Z

Since now we have the more generic functionality with #85, I'm resolving this issue.
@movaid7, @chingjunehao I hope that the new implementation is good enough for your use cases.
Please feel free to reopen it if you have any other comments or suggestions.

abelsaug · 2018-07-09T20:00:49Z

The orientation parameter is not available with image_to_data, right?

bozhodimitrov mentioned this issue Jan 12, 2018

Added verbose option that returns detailed output from tesseract run #85

Merged

bozhodimitrov added the Feature Request label Jan 12, 2018

bozhodimitrov closed this as completed Jan 12, 2018
