Multithreading in diart.benchmark #85

Closed
juanmc2005 opened this issue Aug 31, 2022 · 5 comments
Labels
feature New feature or request
Comments

@juanmc2005
Owner

Problem

Running a benchmark on a huge dataset can take a lot of time. One of the main bottlenecks is that files are processed sequentially.

Idea

Make diart.benchmark (and hence diart.tune) run concurrently on many files at once with a predefined number of workers.
It would be great if progress bars could be kept, otherwise we need to find a good solution to show progress.
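
A minimal sketch of the idea, assuming a hypothetical `process_file` function that stands in for whatever the benchmark does per file (it is not part of diart's API). It uses the standard library's `concurrent.futures` with a fixed number of workers and counts finished files, which is the simplest form of progress reporting:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_file(path: str) -> int:
    """Hypothetical stand-in for running the pipeline on one file."""
    return len(path)  # placeholder "result" so the sketch is runnable


def run_benchmark(paths: list[str], num_workers: int = 4) -> dict[str, int]:
    """Process files concurrently with a fixed number of workers."""
    results: dict[str, int] = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Submit one task per file and remember which future maps to which file
        futures = {pool.submit(process_file, p): p for p in paths}
        for done, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            print(f"{done}/{len(paths)} files processed")
    return results


files = ["a.wav", "bb.wav", "ccc.wav"]
report = run_benchmark(files, num_workers=2)
```

Threads are used here only to keep the sketch short. `ProcessPoolExecutor` exposes the same executor interface if the per-file work turns out to be CPU-bound Python code.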

Another potential problem is having N copies of the segmentation and embedding models in memory, but since they're stateless there should be a workaround to share them. However, I would accept a first version with N models in RAM anyway and think about potential improvements afterwards.
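
With thread-based workers, sharing is straightforward because all threads see the same objects. The sketch below uses a hypothetical `DummyModel` class in place of the real segmentation and embedding models, loads one instance, and reuses it in every worker. Whether the real models are safe to call from several threads at once is an assumption that would need checking:

```python
from concurrent.futures import ThreadPoolExecutor


class DummyModel:
    """Hypothetical stateless model; counts how many copies were created."""

    instances = 0

    def __init__(self) -> None:
        DummyModel.instances += 1

    def __call__(self, x: float) -> float:
        return 2 * x  # no internal state is modified


shared_model = DummyModel()  # loaded once, before any worker starts


def process_file(value: float) -> float:
    # Every worker thread calls the same shared instance
    return shared_model(value)


with ThreadPoolExecutor(max_workers=4) as pool:
    outputs = list(pool.map(process_file, [1.0, 2.0, 3.0]))
```

With process-based workers each process has its own memory, so this simple form of sharing does not carry over directly.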

See RxPY concurrency

juanmc2005 added the feature label (New feature or request) on Aug 31, 2022
@juanmc2005
Owner Author

juanmc2005 commented Sep 13, 2022

For progress bars, see p_tqdm, tqdm with locks

@hbredin
Collaborator

hbredin commented Sep 13, 2022

Alternative: rich

@juanmc2005
Owner Author

There are two options for progress bars:

  1. A single bar where 1 iteration = 1 file (p_tqdm, rich)
  2. Multiple bars where 1 bar = 1 file, and 1 iteration = 1 chunk/batch (tqdm with locks)

I would accept both but strongly prefer the second.
I'm sure there's also a workaround for rich.
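
The difference between the two options is mostly one of granularity. The sketch below is independent of tqdm and rich and uses only the standard library: a hypothetical `ProgressTracker` records per-chunk updates under a lock and can report either view. The file names and chunk counts are made up for illustration:

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ProgressTracker:
    """Thread-safe counters; a real bar (tqdm, rich) would render these."""

    def __init__(self, chunks_per_file: dict[str, int]) -> None:
        self._lock = threading.Lock()
        self.total_chunks = dict(chunks_per_file)
        self.done_chunks = {name: 0 for name in chunks_per_file}

    def advance(self, name: str) -> None:
        # The lock keeps concurrent updates from interleaving
        with self._lock:
            self.done_chunks[name] += 1

    def files_done(self) -> int:
        """Option 1: a single bar where 1 iteration = 1 file."""
        with self._lock:
            return sum(
                self.done_chunks[n] == t for n, t in self.total_chunks.items()
            )

    def per_file(self) -> dict[str, str]:
        """Option 2: one bar per file where 1 iteration = 1 chunk."""
        with self._lock:
            return {
                n: f"{self.done_chunks[n]}/{t}"
                for n, t in self.total_chunks.items()
            }


def process_file(name: str, tracker: ProgressTracker) -> None:
    for _ in range(tracker.total_chunks[name]):
        tracker.advance(name)  # one step per chunk


tracker = ProgressTracker({"a.wav": 3, "b.wav": 5})
with ThreadPoolExecutor(max_workers=2) as pool:
    for name in tracker.total_chunks:
        pool.submit(process_file, name, tracker)
```

Once the pool has shut down, `tracker.files_done()` gives the coarse view and `tracker.per_file()` the fine-grained one; the second carries strictly more information, which is one way to read the stated preference.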

@juanmc2005
Owner Author

I've been working on this lately.

Rich works well with multithreading, but for some reason it's extremely slow to spawn new workers (maybe because of the GIL?).
After moving to multiprocessing, Rich no longer works with multiple bars because the instance of Progress can't be shared between processes. The only solution I found for this was to use tqdm with locks.

Whenever multiprocessing is not needed, rich is used by default. I'm also implementing it in a way that users can manually choose the progress bar they want.
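
The default described above can be sketched as a small selection function. The function name, its parameters, and the string labels are assumptions made for illustration; they are not the implementation from the linked pull request:

```python
from typing import Optional


def choose_progress_bar(
    use_multiprocessing: bool, requested: Optional[str] = None
) -> str:
    """Return the name of the progress bar backend to use.

    Hypothetical helper: "rich" and "tqdm" are plain labels here,
    not references to diart's actual classes.
    """
    if requested is not None:
        if requested not in ("rich", "tqdm"):
            raise ValueError(f"Unknown progress bar: {requested!r}")
        return requested  # an explicit user choice always wins
    # Default: rich unless multiprocessing is needed, then tqdm with locks
    return "tqdm" if use_multiprocessing else "rich"
```

For example, `choose_progress_bar(False)` picks rich, `choose_progress_bar(True)` falls back to tqdm, and passing `requested="rich"` overrides the default either way.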

@juanmc2005
Owner Author

Implemented in #124

2 participants