
Use global TessAPI instances with parallel processing #1008

Open
jrobble opened this issue Nov 22, 2019 · 2 comments


jrobble commented Nov 22, 2019

From here:

From @jrobble:

Throughout this function I see that you make sure that when performing parallel processing you always get a new TessAPI and remove it when you're done with it. Otherwise, you try to use the cached models in tess_api_map. Please add a comment above the function signature explaining this and the motivation behind it.

I believe you were motivated to do this to avoid deadlock. Sorry if we've discussed this before, but I just want to make sure that the positive impact of parallel processing outweighs the negative impact of initializing new TessAPIs.

Maybe I'm mistaken, but did you talk about a potential way for these TessAPI instances to be reused when performing parallel processing? I'm not sure I'm looking at the most recent code.

If not, when you run two tests back to back with the same image file and languages (so that the language models are cached between jobs), what is the performance difference when the number of parallel language threads is set to the number of languages, vs. not using parallel processing at all? Is there a point where the cost outweighs the benefit?

From @hhuangMITRE:

Great questions.

In my last test comparing global APIs vs. parallel processing, I found that parallel processing (using APIs started in each local thread) still shaved off ~0.2 seconds in my VM (running 2 CPU cores). Fortunately, it seems the performance benefit of the parallel OCR runs still outweighs the cost of initializing each API locally.

I'm still trying to resolve the deadlock issues that occur when using global APIs to process scripts in parallel.

For individual script runs (image-mode only), I believe it is possible to use the global APIs, but I've encountered odd resource issues so far when using global APIs in that case. Global APIs in parallel script mode occasionally work; other times Tesseract complains that a resource isn't properly shut down in the threads. I have an idea to resolve this by passing global APIs directly into each thread, which will hopefully fix the problem.

However, I found that it's not possible at all for the PDF processing mode to use global APIs, since two threads may potentially wind up using the same Tesseract API at once.

From @jrobble:

Thanks for the intel. ~0.2 seconds isn't a lot of time, but certainly better than nothing. I'm wondering how that scales with number of CPUs and number of OCR threads.

"I have an idea to resolve this by passing global APIs directly into each thread, which hopefully resolves this problem." - What's the alternative way to reuse them other than passing them directly into each thread?

"However, I found that it's not possible at all for the PDF processing mode to use global APIs, since two threads may potentially wind up using the same Tesseract API at once." - Yeah, I anticipated this. One option is to cache multiples of each model type. Is that worth considering?

From @hhuangMITRE:

Currently each thread starts and stops its own Tesseract API. When I tried to store the APIs inside the global API cache, there were complaints about memory not being properly released after the threads finished processing images.

Since each Tesseract API is started directly within a thread, it's possible each thread also expects the API to be released before the thread is closed (and Tesseract throws memory errors otherwise). Hence, we might be able to resolve the issue by initializing each API outside the thread and passing it in instead.

Also, yes, I think we can cache multiple copies of the same Tesseract API. If we pass a unique API to each thread as it starts up, we can avoid the deadlocking issues and potentially resolve the global API issue as well. We can also pair each API with an atomic boolean (or another atomic variable) to prevent multiple threads from trying to access the same API.

I think it's worth exploring both options in our next PR.


hhuangMITRE commented May 4, 2021

Bumped this issue as we're currently discussing global API instances for PDF and (potentially) video threads.

Refer to the last part of this comment.


jrobble commented May 7, 2021

I ran tests (VideoProcessingTest) on a 255-frame 320x240 video. It had no text, only faces:

std::map<std::string,std::string> custom_properties = {
   {"TESSERACT_LANGUAGE", "eng, script/Latin"},
   {"ENABLE_OSD_AUTOMATION", "FALSE"},
   {"MAX_PARALLEL_SCRIPT_THREADS", "4"}
};

Took 55 sec. 87.99% of all time was spent in TessApiWrapper::TessApiWrapper().

std::map<std::string,std::string> custom_properties = {
   {"TESSERACT_LANGUAGE", "eng, script/Latin"},
   {"ENABLE_OSD_AUTOMATION", "FALSE"},
   {"MAX_PARALLEL_SCRIPT_THREADS", "1"}
};

Took 10 sec. 0.57% of all time was spent in TessApiWrapper::TessApiWrapper().


I took test-video-detection.avi and made it longer. The same 3 frames repeat 100 times. I ran these tests:

std::map<std::string,std::string> custom_properties = {
   {"TESSERACT_LANGUAGE", "eng, script/Latin"},
   {"ENABLE_OSD_AUTOMATION", "FALSE"},
   {"MAX_PARALLEL_SCRIPT_THREADS", "4"}
};

Took 8 min 32 sec. 12.53% (1 min 4.153 sec.) of all time was spent in TessApiWrapper::TessApiWrapper().

std::map<std::string,std::string> custom_properties = {
   {"TESSERACT_LANGUAGE", "eng, script/Latin"},
   {"ENABLE_OSD_AUTOMATION", "FALSE"},
   {"MAX_PARALLEL_SCRIPT_THREADS", "1"}
};

Took 11 min 29 sec. 0.01% (0.0689 sec) of all time was spent in TessApiWrapper::TessApiWrapper().


Two lessons:

  1. In the current implementation, if the video has a lot of text, use parallel script threads; if not, use serial processing.

  2. There are non-trivial gains to be made by reusing Tesseract APIs when processing in parallel.

@jrobble jrobble modified the milestones: Milestone 1, Milestone 2 May 14, 2021
@jrobble jrobble modified the milestones: Milestone 2, Milestone 3 Aug 29, 2022