
Use global TessAPI instances with parallel processing #1008

Open
jrobble opened this issue Nov 22, 2019 · 2 comments


jrobble commented Nov 22, 2019

From here:

From @jrobble:

Throughout this function I see that you make sure that when performing parallel processing you always get a new TessAPI and remove it when you're done with it. Otherwise, you try to use the cached models in tess_api_map. Please add a comment above the function signature explaining this and the motivation behind it.

I believe you were motivated to do this to avoid deadlock. Sorry if we've discussed this before, but I just want to make sure that the positive impact of parallel processing outweighs the negative impact of initializing new TessAPIs.

Maybe I'm mistaken, but did you talk about a potential way for these TessAPI instances to be reused when performing parallel processing? I'm not sure I'm looking at the most recent code.

If not, when you run two tests back to back with the same image file and languages (so that the language models are cached between jobs), what is the performance difference when the number of parallel language threads is set to the number of languages, vs. not using parallel processing at all? Is there a point where the cost outweighs the benefit?

From @hhuangMITRE:

Great questions.

In my last test comparing global APIs vs. parallel processing, I found that parallel processing (using APIs started in each local thread) still shaved off ~0.2 seconds in my VM (running 2 CPU cores). Fortunately, it seems the performance benefit of the parallel OCR runs still outweighs the cost of initializing each API locally.

I'm still trying to resolve the deadlock issues that occur when using global APIs to process scripts in parallel.

For individual script runs (image-mode only), I believe it is possible to use the global APIs, but I've encountered odd resource issues so far when using global APIs in that case. Global APIs in parallel script mode occasionally work; other times Tesseract complains that a resource isn't properly shut down in the threads. I have an idea to resolve this by passing global APIs directly into each thread, which will hopefully fix the problem.

However, I found that it's not possible at all for the PDF processing mode to use global APIs, since two threads may potentially wind up using the same Tesseract API at once.

From @jrobble:

Thanks for the intel. ~0.2 seconds isn't a lot of time, but certainly better than nothing. I'm wondering how that scales with number of CPUs and number of OCR threads.

"I have an idea to resolve this by passing global APIs directly into each thread, which hopefully resolves this problem." - What's the alternative way to reuse them other than passing them directly into each thread?

"However, I found that it's not possible at all for the PDF processing mode to use global APIs, since two threads may potentially wind up using the same Tesseract API at once." - Yeah, I anticipated this. One option is to cache multiples of each model type. Is that worth considering?

From @hhuangMITRE:

Currently each thread starts and stops its own Tesseract API. When I tried to store the APIs inside the global API cache, there were complaints about memory not being properly released after the threads finished processing images.

Since each Tesseract API is started directly within a thread, it's possible each thread also expects the API to be released before the thread is closed (and Tesseract throws memory errors otherwise). Hence, we might be able to resolve the issue by initializing each API outside the thread and passing it in instead.

Also, yes, I think we can cache multiple copies of the same Tesseract API. If we pass a unique API to each thread as it starts up, we can avoid the deadlocking issues and potentially resolve the global API issue as well. We can also pair each API with an atomic boolean (or another atomic variable) to prevent multiple threads from trying to access the same API.

I think it's worth exploring both options in our next PR.


hhuangMITRE commented May 4, 2021

Bumped this issue as we're currently discussing global API instances for PDF and (potentially) video threads.

Refer to the last part of this comment.


jrobble commented May 7, 2021

I ran tests (VideoProcessingTest) on a 255-frame 320x240 video. It had no text, only faces:

std::map<std::string,std::string> custom_properties = {
   {"TESSERACT_LANGUAGE", "eng, script/Latin"},
   {"ENABLE_OSD_AUTOMATION", "FALSE"},
   {"MAX_PARALLEL_SCRIPT_THREADS", "4"}
};

Took 55 sec. 87.99% of all time was spent in TessApiWrapper::TessApiWrapper().

std::map<std::string,std::string> custom_properties = {
   {"TESSERACT_LANGUAGE", "eng, script/Latin"},
   {"ENABLE_OSD_AUTOMATION", "FALSE"},
   {"MAX_PARALLEL_SCRIPT_THREADS", "1"}
};

Took 10 sec. 0.57% of all time was spent in TessApiWrapper::TessApiWrapper().


I took test-video-detection.avi and made it longer. The same 3 frames repeat 100 times. I ran these tests:

std::map<std::string,std::string> custom_properties = {
   {"TESSERACT_LANGUAGE", "eng, script/Latin"},
   {"ENABLE_OSD_AUTOMATION", "FALSE"},
   {"MAX_PARALLEL_SCRIPT_THREADS", "4"}
};

Took 8 min 32 sec. 12.53% (1 min 4.153 sec.) of all time was spent in TessApiWrapper::TessApiWrapper().

std::map<std::string,std::string> custom_properties = {
   {"TESSERACT_LANGUAGE", "eng, script/Latin"},
   {"ENABLE_OSD_AUTOMATION", "FALSE"},
   {"MAX_PARALLEL_SCRIPT_THREADS", "1"}
};

Took 11 min 29 sec. 0.01% (0.0689 sec) of all time was spent in TessApiWrapper::TessApiWrapper().


Two lessons:

  1. In the current implementation, if the video has a lot of text, use parallel script threads; if not, use serial processing.

  2. There are non-trivial gains to be made by reusing Tesseract APIs when processing in parallel.

@jrobble jrobble modified the milestones: Milestone 1, Milestone 2 May 14, 2021
@jrobble jrobble modified the milestones: Milestone 2, Milestone 3 Aug 29, 2022