From @jrobble:

Throughout this function I see that when performing parallel processing you always create a new TessAPI and release it when you're done with it; otherwise, you use the cached models in tess_api_map. Please add a comment above the function signature explaining this and the motivation behind it.
I believe you were motivated to do this to avoid deadlock. Sorry if we've discussed this before, but I just want to make sure that the positive impact of parallel processing outweighs the negative impact of initializing new TessAPIs.
Maybe I'm mistaken, but did you talk about a potential way for these TessAPI instances to be reused when performing parallel processing? I'm not sure I'm looking at the most recent code.
If not: when you run two tests back to back with the same image file and languages (so that the language models are cached between jobs), what is the performance difference between setting the number of parallel language threads to the number of languages and not using parallel processing at all? Is there a point where the cost outweighs the benefit?
From @hhuangMITRE:

In my last test comparing global APIs vs. parallel processing, I found that parallel processing (using thread-locally started APIs) still shaved off ~0.2 seconds in my VM (which has 2 CPU cores). Fortunately, it seems the performance benefit of the parallel OCR runs still outweighs the cost of initializing each API locally.
I'm currently still trying to resolve the deadlock issues when using global APIs when processing scripts in parallel.
For individual script runs (image-mode only), I believe it is possible to use the global APIs, but so far I've encountered odd resource issues when doing so. Global APIs in parallel script mode occasionally work; other times Tesseract complains that a resource wasn't properly shut down in one of the threads. I have an idea to resolve this by passing the global APIs directly into each thread, which will hopefully fix the problem.
However, I found that it's not possible at all for the PDF processing mode to use global APIs, since two threads may potentially wind up using the same Tesseract API at once.
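To make the current behavior concrete, here is a minimal sketch of the per-thread pattern described above. `TessAPI` and `run_with_local_apis` are hypothetical stand-ins (the real wrapper presumably holds a `tesseract::TessBaseAPI`); the point is only the ownership pattern: each thread constructs and destroys its own instance, so no two threads ever share one.

```cpp
#include <string>
#include <thread>
#include <vector>

struct TessAPI {                       // hypothetical stand-in for the real wrapper
    explicit TessAPI(std::string l) : lang(std::move(l)) {}
    std::string run_ocr(const std::string& img) { return "[" + lang + "] " + img; }
    std::string lang;
};

// Each worker constructs its own API inside the thread and lets it be
// destroyed when its work is done, so no two threads ever touch the
// same instance (avoiding the shared-API problem in PDF mode).
std::vector<std::string> run_with_local_apis(
        const std::vector<std::string>& langs, const std::string& image) {
    std::vector<std::string> results(langs.size());
    std::vector<std::thread> workers;
    for (size_t i = 0; i < langs.size(); ++i)
        workers.emplace_back([&, i] {
            TessAPI api(langs[i]);               // initialized inside the thread
            results[i] = api.run_ocr(image);
        });                                       // destroyed on thread exit
    for (auto& t : workers) t.join();
    return results;
}
```

The cost being discussed is the `TessAPI` constructor running once per thread per job, even when the same language model was just loaded by a previous job.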
From @jrobble:

Thanks for the intel. ~0.2 seconds isn't a lot of time, but it's certainly better than nothing. I'm wondering how that scales with the number of CPUs and the number of OCR threads.
"I have an idea to resolve this by passing global APIs directly into each thread, which hopefully resolves this problem." - What's the alternative way to reuse them other than passing them directly into each thread?
"However, I found that it's not possible at all for the PDF processing mode to use global APIs, since two threads may potentially wind up using the same Tesseract API at once." - Yeah, I anticipated this. One option is to cache multiples of each model type. Is that worth considering?
From @hhuangMITRE:

Currently, each thread starts and stops its own Tesseract API. When I tried to store them in the global API cache, there were complaints about memory not being properly released after the threads finished processing images.
Since each Tesseract API is started directly within the thread, each thread may also expect the API to be released before the thread is closed (and Tesseract throws memory errors otherwise). Hence, we might be able to resolve the issue by initializing each API outside the thread and passing it in instead.
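A minimal sketch of that alternative, with the same hypothetical `TessAPI` stand-in as before: the parent owns construction and teardown, and the threads only borrow the instances, so nothing is created or destroyed inside a thread.

```cpp
#include <string>
#include <thread>
#include <vector>

struct TessAPI {                       // hypothetical stand-in for the real wrapper
    explicit TessAPI(std::string l) : lang(std::move(l)) {}
    std::string run_ocr(const std::string& img) { return "[" + lang + "] " + img; }
    std::string lang;
};

// The caller owns the APIs (e.g. pulled from the global cache); each
// worker borrows exactly one by reference, so the threads never
// construct or destroy an API themselves.
std::vector<std::string> run_with_borrowed_apis(
        std::vector<TessAPI>& apis, const std::string& image) {
    std::vector<std::string> results(apis.size());
    std::vector<std::thread> workers;
    for (size_t i = 0; i < apis.size(); ++i)
        workers.emplace_back([&, i] { results[i] = apis[i].run_ocr(image); });
    for (auto& t : workers) t.join();
    return results;
}
```

Because the APIs outlive the threads here, whether this resolves the "resource not properly shut down" errors depends on whether Tesseract's complaint is really about teardown happening on a different thread than startup.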
Also, yes, I think we can cache multiple copies of the same Tesseract API. If we pass a unique API to each thread as it starts up, we can avoid the deadlocking issues and potentially resolve the global API issue as well. We can also pair each API with an atomic boolean (or another atomic variable) to prevent multiple threads from trying to access the same API.
I think it's worth exploring both options in our next PR.