OSD #105

Closed
ricardomga opened this issue Feb 27, 2018 · 15 comments
Comments

@ricardomga

ricardomga commented Feb 27, 2018

Hello,
Is it possible to use psm 0 to get the OSD information? I am getting an error when I try this.

  • The code
pytesseract.image_to_string(
            img,
            lang='por',
            config='--tessdata-dir "./tessdata/" -psm 0',
            output_type='dict'
)
  • The error
pytesseract.pytesseract.TesseractError: (1, "read_params_file: Can't open 0 read_params_file: Can't open txt Tesseract Open Source OCR Engine v4.0.0-alpha.20180109 with Leptonica Warning. Invalid resolution 0 dpi. Using 70 instead. Warning. Invalid resolution 0 dpi. Using 70 instead. Too few characters. Skipping this page Error during processing.")
@bozhodimitrov
Collaborator

bozhodimitrov commented Feb 27, 2018

Did you first try the equivalent CLI command in a terminal/shell?
As far as I can tell, the documentation specifies the --psm NUM format instead of -psm.
Can you try it?
If this is the case, I will update the documented line in the README file.

PS: Also for the output_type, try to use the pytesseract.Output class attributes instead of hard-coding it, because the notation can change in the future and this can break your code. :)
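
A minimal sketch of both suggestions, assuming the Tesseract 4.x option syntax. The helper `build_config` is hypothetical (it is not part of pytesseract) and only assembles the config string with the double-dash `--psm` form. The pytesseract call is left as a comment because it needs Tesseract and an image; whether mode 0 output can be read back this way is a separate question taken up further down the thread.

```python
import shlex

def build_config(psm, tessdata_dir=None):
    """Assemble a Tesseract config string using the --psm NUM form.

    Hypothetical helper for illustration; not part of pytesseract.
    """
    parts = []
    if tessdata_dir is not None:
        # Quote the directory so paths with spaces survive tokenization.
        parts.append("--tessdata-dir " + shlex.quote(tessdata_dir))
    parts.append("--psm {}".format(int(psm)))
    return " ".join(parts)

config = build_config(0, "./tessdata/")
print(config)
print(shlex.split(config))

# Intended use (requires Tesseract and an image, so not executed here):
#   import pytesseract
#   pytesseract.image_to_string(img, lang='por', config=config,
#                               output_type=pytesseract.Output.DICT)
```

Using `pytesseract.Output.DICT` rather than the literal `'dict'` keeps the call working if the underlying string values change.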

@ricardomga
Author

ricardomga commented Feb 27, 2018

Thank you for the help in advance.
You were right. Now it gives me the following error:

line 116, in run_tesseract
    raise TesseractError(status_code, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (3221225477, '')

I think it is because --psm 0 creates an .osd file instead of a .txt file.

@bozhodimitrov
Collaborator

That's correct, but what is the raw output message from the tesseract command itself?
Can you run it and let us know the result?

I guess we need an additional function (or extra functionality) for the different PSM modes, since the output format is not txt.

@ricardomga
Author

  • The tesseract command:
tesseract img.jpg out --psm 0
  • Output:
Tesseract Open Source OCR Engine v4.0.0-alpha.20180109 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 346
Warning. Invalid resolution 0 dpi. Using 70 instead.
  • File output(out.osd):
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 13.95
Script: Latin
Script confidence: 6.67

@bozhodimitrov
Collaborator

The warning is familiar to me. The problem is that tesseract doesn't return exit code 0 in that case, which is not nice :D

But the warning itself means that the image is missing resolution (DPI) metadata. Converting the image beforehand may help.

At the moment we need to adjust the pytesseract logic in two places in order to read and return the content of the .osd file.

Can you report the exit code of the tesseract command?
You can check it with the following command right after running tesseract:

echo $?

Also try converting the image to work around the tesseract warning.

@ricardomga
Author

The exit code when run in the terminal is 0, but when run through pytesseract it is not. Do you have any idea why?
Is there any way of knowing what exit code 3221225477 means?
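
One way to read such a code, assuming the process ran on Windows (the thread does not say so explicitly), is to print it in hexadecimal. Large exit codes of this kind are typically NTSTATUS values, and 0xC0000005 is the value Windows documents as STATUS_ACCESS_VIOLATION, which would mean the process crashed rather than exited normally.

```python
exit_code = 3221225477

# Windows reports NTSTATUS values as unsigned 32-bit exit codes,
# so the hexadecimal form is usually easier to look up.
hex_code = "0x{:08X}".format(exit_code)
print(hex_code)

# The same value as a signed 32-bit integer, as some tools display it.
signed = exit_code - (1 << 32) if exit_code >= (1 << 31) else exit_code
print(signed)
```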

@bozhodimitrov
Collaborator

bozhodimitrov commented Feb 27, 2018

You can patch the pytesseract library temporarily on line 133 and you can print the command with:

print(' '.join(command))

PS: We have a new function image_to_osd. You can try your example images with it.
Feel free to reopen if you have any other comments/questions.

@me-suzy

me-suzy commented Oct 8, 2022

You can patch the pytesseract library temporarily on line 133 and you can print the command with:

print(' '.join(command))

PS: We have a new function image_to_osd. You can try your example images with it. Feel free to reopen if you have any other comments/questions.

your pytesseract.py doesn't exist anymore. Please upload again.

@stefan6419846
Contributor

stefan6419846 commented Oct 8, 2022

your pytesseract.py doesn't exist anymore. Please upload again.

The comment is from 2018, so things might have changed.

The file still exists, although the directory structure has been migrated and this file is available at https://github.com/madmaze/pytesseract/blob/master/pytesseract/pytesseract.py now. At the moment, you will have to add the print statement to this line:

@me-suzy

me-suzy commented Oct 8, 2022

OK, thanks, I downloaded and replaced the file.

Now I have another problem with this pytesseract.py:

    import pytesseract
  File "C:\Users\Castel\AppData\Roaming\Python\Python310\site-packages\pytesseract\__init__.py", line 70
    <title>pytesseract/__init__.py at master · madmaze/pytesseract</title>
                                             ^
SyntaxError: invalid character '·' (U+00B7)

[image attached]

@stefan6419846
Contributor

You did not download the actual (raw) file, but the rendered HTML code from GitHub.
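
A quick way to tell the two apart before replacing a file in site-packages is sketched below. `check_downloaded_module` is a hypothetical helper that only looks at the start of the text and then tries to parse it as Python. The raw file is what GitHub serves behind the "Raw" button on the file page.

```python
import ast

def check_downloaded_module(source):
    """Return 'html', 'python', or 'invalid' for a downloaded file's text."""
    head = source.lstrip().lower()
    # A saved GitHub page starts with an HTML doctype or <html> tag.
    if head.startswith("<!doctype html") or head.startswith("<html"):
        return "html"
    try:
        ast.parse(source)  # raises SyntaxError if this is not valid Python
    except SyntaxError:
        return "invalid"
    return "python"

print(check_downloaded_module("<!DOCTYPE html>\n<title>pytesseract</title>"))
print(check_downloaded_module("import subprocess\n"))
```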

@me-suzy

me-suzy commented Oct 8, 2022

your pytesseract.py doesn't exist anymore. Please upload again.

The comment is from 2018, so things might have changed.

The file still exists, although the directory structure has been migrated and this file is available at https://github.com/madmaze/pytesseract/blob/master/pytesseract/pytesseract.py now. At the moment, you will have to add the print statement to this line:

OK, I downloaded the file and changed the line, but I get the same error.

Can you please attach the file after editing it with one of these two commands?

    print(' '.join(command))
    print(' '.join(cmd_args))

I really don't understand where to change the file. I changed it many times and it didn't work. Please edit it and attach the new version here.

@stefan6419846
Contributor

I still do not get what you want to achieve: What is your intent with commenting on this old issue and trying to do some changes there? If it is related to #455, please answer the actual questions there. You will (usually) never be able to fix an issue by just printing anything to the terminal - the print(' '.join(cmd_args)) just allows you to see what pytesseract calls Tesseract with to further debug possible call issues.
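
To illustrate what that print statement is for, here is a standalone sketch of the same idea outside pytesseract: print the argument list, run it, and look at the exit code and stderr. `run_and_report` is a hypothetical helper, and the Python interpreter is used as a stand-in command so the sketch runs without Tesseract installed.

```python
import subprocess
import sys

def run_and_report(cmd_args):
    """Print the command line, run it, and return (exit code, stdout, stderr)."""
    print(" ".join(cmd_args))
    proc = subprocess.run(cmd_args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return (
        proc.returncode,
        proc.stdout.decode(errors="replace"),
        proc.stderr.decode(errors="replace"),
    )

# Stand-in command; a real call would look like
# ["tesseract", "img.jpg", "out", "--psm", "0"].
code, out, err = run_and_report([sys.executable, "-c", "import sys; sys.exit(3)"])
print(code)
```

The printed command can then be pasted into a terminal to compare its behaviour with the call made from Python.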

@me-suzy

me-suzy commented Oct 8, 2022

OK, this is my problem, and I don't know what to do. My Python code (which converts a PDF to a text file with OCR) is very good, but it cannot succeed because of this error.

Please tell me how to fix it.

[image attached]

@stefan6419846
Contributor

As mentioned previously, please keep these issues separate. The issue from your last comment is already being discussed in #455, and asking twice will not change anything about the support.
