Method for adding config settings #141

Closed
Jeremiah-England opened this issue Jul 31, 2018 · 18 comments

Comments

@Jeremiah-England

Jeremiah-England commented Jul 31, 2018

Look mama, no config files!

I was wrestling with config files for some of the settings when I ran across this Google Groups discussion about using Tesseract from Java, and it made my mouth water. Here's a code snippet from their discussion:

tesseract = new Tesseract();                      
tesseract.setOcrEngineMode(TessAPI.TessOcrEngineMode.OEM_TESSERACT_ONLY);
tesseract.setPageSegMode(7);
tesseract.setTessVariable("load_system_dawg", "0");
tesseract.setTessVariable("load_freq_dawg", "0");
tesseract.setTessVariable("load_punc_dawg", "0");
tesseract.setTessVariable("load_number_dawg", "0");

At first you may think: well, that's cool, I guess, but you can do the same thing by defining a long string of config options and passing it whenever you need it. For example, '--psm 10 --oem 3 -c load_system_dawg=0 -c load_freq_dawg=0 -c load_punc_dawg=0 . . .'
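
As a rough sketch of the "long string of configs" approach, the helper below assembles such a string in Python. It assumes, as the tesseract command line expects, that every variable is introduced by its own -c flag. build_config is a hypothetical name used for illustration, not part of pytesseract.

```python
import shlex


def build_config(psm=None, oem=None, variables=None):
    """Assemble a tesseract option string (hypothetical helper).

    Each entry in `variables` becomes its own `-c name=value` pair.
    """
    parts = []
    if psm is not None:
        parts += ["--psm", str(psm)]
    if oem is not None:
        parts += ["--oem", str(oem)]
    for name, value in (variables or {}).items():
        parts += ["-c", f"{name}={value}"]
    # Quote each token so the string survives shell-style splitting.
    return " ".join(shlex.quote(part) for part in parts)


config = build_config(
    psm=10,
    oem=3,
    variables={
        "load_system_dawg": 0,
        "load_freq_dawg": 0,
        "load_punc_dawg": 0,
    },
)
print(config)
```

The resulting string could then be passed as the config argument, e.g. pytesseract.image_to_string(image, config=config). This only covers parameters that -c is able to change.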

However, the Tesseract documentation mentions that you can't change 'init only' parameters with the tesseract executable's -c option, and those 'init only' parameters include some of the ones I've been messing with. I think most people would agree that it would be nice to set config-file variables directly in Python with a set_config_variable method instead of having to go make a config file. Since some of the variables set in the code above are in fact 'init only', the Java guys must be creating a config file from Java code (I did not sniff through their code to verify this, however).

I haven't done it yet because I'm not too familiar with the code inside pytesseract, but creating a temporary config file and filling it through a set_config_variable method doesn't seem very hard from my perspective. Here's the high-level logic I'm thinking about:

  • When pytesseract is imported, check the config folder to see if a temp.txt file exists. If so, wipe it clean. If not, create one.
  • When someone calls the tsr.set_config_variable method, just write the variable, a space, and the value on a new line in the temp.txt file.
  • You could also have a method to delete the variable from the file and thus return tesseract to the default.
  • When any of the OCR functions are called, if the user does not manually supply another config file, use the temp.txt as the config file unless it's empty.
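
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: TempConfig and its methods are hypothetical names (nothing like this exists in pytesseract), the file is created in the system temp directory rather than in tesseract's config folder, and the file format is taken to be one "name value" pair per line, as described in the list.

```python
import os
import tempfile


class TempConfig:
    """Hypothetical helper that keeps tesseract variables in a temp config file."""

    def __init__(self, path=None):
        if path is None:
            # Assumption: a file in the system temp dir instead of tessdata/configs.
            fd, path = tempfile.mkstemp(prefix="pytesseract_", suffix=".txt")
            os.close(fd)
        self.path = path
        self.variables = {}
        self._write()  # start from an empty file, wiping any old content

    def set_config_variable(self, name, value):
        """Set (or overwrite) a variable and rewrite the file."""
        self.variables[name] = str(value)
        self._write()

    def unset_config_variable(self, name):
        """Drop a variable so tesseract falls back to its default."""
        self.variables.pop(name, None)
        self._write()

    def _write(self):
        # One "name value" pair per line.
        with open(self.path, "w", encoding="utf-8") as fh:
            for name, value in self.variables.items():
                fh.write(f"{name} {value}\n")

    def as_config_argument(self):
        """Return the file path for config=..., or '' when nothing is set.

        Assumes the path contains no spaces, since the config string is
        split into separate command-line arguments.
        """
        return self.path if self.variables else ""


cfg = TempConfig()
cfg.set_config_variable("load_system_dawg", 0)
cfg.set_config_variable("load_freq_dawg", 0)
cfg.unset_config_variable("load_freq_dawg")
```

The path from cfg.as_config_argument() would then be handed over as the config argument, e.g. pytesseract.image_to_string(image, config=cfg.as_config_argument()). Whether tesseract accepts a full path there, rather than only a name inside its configs directory, is an assumption worth checking against your tesseract version.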

Why this would be a good feature:

  • For me and others like me who wrote their first line of code 8 months ago, even little trips to the back end of config files or source code can be confusing and take lots of time.
  • There are a lot of super ridiculously lazy people out there, just like me, who would rather not know anything about how the programs and libraries they're using work, but just want to use them to make other interesting applications.

But maybe it's actually not very easy to implement. Is this actually possible?

@Mebus

Mebus commented Aug 1, 2018

This would be great, I wish I was able to do something like this:

https://stackoverflow.com/questions/4944830/how-to-make-tesseract-to-recognize-only-numbers-when-they-are-mixed-with-letter

Mebus

@bozhodimitrov
Collaborator

bozhodimitrov commented Aug 1, 2018

Hi, thank you very much for the proposal.
This can be implemented, and in fact you can implement it yourself with custom logic.
At the end of the day, you can write your own logic for handling config files and then pass the resulting config file via the config argument.

As far as integrating this into pytesseract goes: if I have some free time, I will try to implement the logic for this. The only "problematic" part is where to store this temp config.

And btw, we can have this nice python approach:

config = pytesseract.temp_config(path='<custom_filepath>')
config.set_variables({'key': 'value'})
pytesseract.image_to_string('<image_filepath>', config=config)

@Debjoy10

@int3l the above method is giving an error AttributeError: module 'pytesseract' has no attribute 'temp_config'. Any solutions?

@bozhodimitrov
Collaborator

@Debjoy10 it is not implemented yet. This is a feature request.

@Raghwendra-Dey

@int3l do we have any workaround for doing this for the time being?

@bozhodimitrov
Collaborator

bozhodimitrov commented Jul 27, 2019

@Raghwendra-Dey please take a look at the README documentation and the example configurations.
You should use the specific tesseract command options - but for this you should go to the Tesseract Wiki Documentation.

@Raghwendra-Dey

@int3l we were looking for a way to modify variables like editor_image_word_bb_color, editor_word_height, editor_word_width, etc., but there is no way to do it in pytesseract, though tesseract has its workaround...
https://guides.gdpicture.com/content/Affecting%20Tesseract%20OCR%20engine%20with%20special%20parameters.html

@bozhodimitrov
Collaborator

I am not very familiar with tesseract custom config files, where you can add these options and then pass the custom config file to pytesseract via the config argument.

Maybe you should ask in the Tesseract Github Issue Tracker. Pytesseract is just a thin wrapper around the tesseract executable.

@Debjoy10

Can you tell me more about the config argument? I have made a config file but am finding it difficult to use it in pytesseract.
Thanks in advance.

@Raghwendra-Dey

@int3l ??

@bozhodimitrov
Collaborator

Take a look at the Tesseract OCR documentation and example config files.
Closing this since it is not a pytesseract specific issue. Please ask in the Tesseract Github Issue Tracker.

@Debjoy10

But my doubt is: I want to use the config file in pytesseract. Does pytesseract provide a way to do that conveniently (inside the code)?

@Debjoy10

Apologies for the previous comment. It was mistakenly pasted here.

@bozhodimitrov bozhodimitrov reopened this Jul 28, 2019
@bozhodimitrov
Collaborator

bozhodimitrov commented Jul 28, 2019

@Debjoy10 Sorry, I see what you mean - In that case try to specify the name of the config file as second argument (string) to pytesseract.pytesseract.run_and_get_output instead of using pytesseract.image_to_string.

This function allows a lot more control, but it is not "public", although you can use it.
And you are right: at the moment, this is a bit of a limitation of pytesseract itself.

bozhodimitrov pushed a commit that referenced this issue Jul 28, 2019
So it can be possible to omit it in cases, where it's not needed.
In general It will help with #141
@bozhodimitrov
Collaborator

Soon it will be possible to import run_and_get_output directly from pytesseract.run_and_get_output.

@Debjoy10

Debjoy10 commented Aug 4, 2019

Thanks for the reply. However, what I wanted to use was pytesseract.image_to_data. Is there a workaround for that too?
And with pytesseract.run_and_get_output, do we need to provide the path to the config file, or can it retrieve that file from the configs directory of tesseract?

@EricPHamilton

I was able to supply my own config file by using the following:
("words" is the name of my config file)
pytesseract.run_and_get_output(im, extension="txt", config="words")

@naourass

Where do you save the "words" custom config file?


7 participants