Task library slowdown if the number of domains is greater than 8 #8
Can you also link to where the game_of_life sources are available? This would be useful for reproduction. I've fixed the issue with more minor GCs in a recent commit, aca9ea9. However, I still noticed a similar performance difference as earlier. Can you confirm this?
game_of_life (with task API): the above code contains print statements; when comparing the time taken by each version, I've omitted the printing part. I haven't reproduced it with the updated version of
[Timing table: Time Pre / Time Post]
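For reproduction purposes, here is a minimal sketch of what a Task-based game_of_life step could look like. The actual benchmark sources are not linked in this thread, so the grid size, iteration count, and domain count below are illustrative assumptions, and the labelled argument of `setup_pool` follows the current domainslib API (the label has changed across versions):

```ocaml
(* Hypothetical sketch of a Task-based Game of Life; the real benchmark
   sources are not linked here.  Assumes a recent domainslib, where
   setup_pool ~num_domains spawns that many *additional* domains. *)
module T = Domainslib.Task

let size = 1024
let num_domains = 8

(* Count live neighbours of (i, j), treating out-of-bounds cells as dead. *)
let neighbours board i j =
  let at x y =
    if x < 0 || y < 0 || x >= size || y >= size then 0 else board.(x).(y)
  in
  at (i - 1) (j - 1) + at (i - 1) j + at (i - 1) (j + 1)
  + at i (j - 1) + at i (j + 1)
  + at (i + 1) (j - 1) + at (i + 1) j + at (i + 1) (j + 1)

(* One generation: rows are distributed over the pool's domains. *)
let step pool src dst =
  T.parallel_for pool ~start:0 ~finish:(size - 1) ~body:(fun i ->
      for j = 0 to size - 1 do
        let n = neighbours src i j in
        dst.(i).(j) <- (if n = 3 || (n = 2 && src.(i).(j) = 1) then 1 else 0)
      done)

let () =
  let pool = T.setup_pool ~num_domains:(num_domains - 1) () in
  let a = Array.make_matrix size size 0 in
  let b = Array.make_matrix size size 0 in
  T.run pool (fun () ->
      for _ = 1 to 100 do
        step pool a b;
        step pool b a
      done);
  T.teardown_pool pool
```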
Comparing Channels vs. Task, the only difference I see is that the channels' Process 0 handles the interrupt while the task implementation doesn't.
Ok, this is quite useful. Is there any difference in minor/major allocations or collections? It is not ideal that domain 0 is up to something in the task library at 24 domains. Any insights on how to improve this?
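One straightforward way to answer the allocation question is to snapshot the stdlib GC counters around each benchmark run; a sketch (note that on OCaml 5, `Gc.quick_stat` reports the calling domain's counters only, so a full picture needs per-domain reporting):

```ocaml
(* Snapshot GC counters around a benchmark run.  On OCaml 5,
   Gc.quick_stat reflects only the calling domain. *)
let with_gc_report f =
  let before = Gc.quick_stat () in
  let result = f () in
  let after = Gc.quick_stat () in
  Printf.printf "minor collections: %d\n"
    (after.Gc.minor_collections - before.Gc.minor_collections);
  Printf.printf "major collections: %d\n"
    (after.Gc.major_collections - before.Gc.major_collections);
  Printf.printf "minor words allocated: %.0f\n"
    (after.Gc.minor_words -. before.Gc.minor_words);
  Printf.printf "words promoted: %.0f\n"
    (after.Gc.promoted_words -. before.Gc.promoted_words);
  result
```

Wrapping the channel and task versions in `with_gc_report` would make the minor/major collection counts directly comparable.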
I took a brief look at what's going on using perf. Task-based one:
Channel-based one:
The task-based one executes fewer instructions per cycle. I wonder where the stalls might be? @sadiqj
Sorry 😅️, I have no insights for improving this.
Here's an oddity: I'm running
@sadiqj that is encouraging but odd. I could very well believe that the slowdown we witness only occurs on our machines. We may dig in a little to remove any potential overheads.
Hi, I noticed this issue as well. Just for fun, I built a simple ray tracer and parallelized it using

Concretely, I can generate my reference image in ~3s with a purely sequential run. If I use 2 domains, I obtain results in just under ~2s. With six domains, I get down to ~1.7s. But with seven rendering domains, the time jumps an order of magnitude, to 18s. If I ask for eight rendering domains, it jumps to 35s. Interestingly, it then seems to flatten out: I've tried asking for as many as 20 domains, and the running time hovers around 35s.

Fwiw, I am running this on a Windows laptop (4 physical cores / 8 logical) using Debian inside WSL2. As a result, my ability to see exactly what is going on is limited (e.g., I don't think I can use perf inside WSL2, but I'm not totally confident). Anyway, I just thought I'd share this on the off chance it turns out to be helpful. Happy to provide more info if needed!
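As a side note on sizing, one could clamp the pool at the hardware's parallelism. A sketch assuming OCaml 5's `Domain.recommended_domain_count` (which reports the *logical* core count, so on a 4-physical/8-logical machine half of it may be closer to the physical core count):

```ocaml
(* Sketch: clamp the requested pool size to the hardware's parallelism.
   Domain.recommended_domain_count is available from OCaml 5.0. *)
module T = Domainslib.Task

let make_pool ?(requested = max_int) () =
  let hw = Domain.recommended_domain_count () in
  let n = max 1 (min requested hw) in
  (* The pool spawns n - 1 additional domains; the caller is the n-th. *)
  T.setup_pool ~num_domains:(n - 1) ()
```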
Actually, I was able to get perf going in WSL2. Using
Whereas, if I run
I'm still not sure what the underlying story is, but hopefully this points in the right direction?
Domainslib has evolved quite a bit since this issue was reported for the game of life benchmark. Notably, the task library now has a Chase-Lev work-stealing deque underneath for scheduling tasks (#29) and uses effect handlers to manage task creation (#51). I ran the same benchmarks again on domainslib.0.4.0; here are the numbers for
I think it's safe to say we no longer witness a slowdown at higher core counts, hence I'm closing this issue. @dalev thanks for the input. Since your machine has 4 physical cores / 8 hardware threads, asking for more than 8 domains is sure to slow the program down. Ideally, the number of domains should equal the number of physical cores for best results. You might be interested in https://github.com/ocaml-multicore/ocaml-multicore/wiki/Concurrency-and-parallelism-design-notes. I'd suggest trying your benchmark again on the latest version of domainslib, and feel free to open a new issue if needed.
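For readers unfamiliar with the scheduler change mentioned above: the actual queue from #29 is the lock-free Chase-Lev deque, but the discipline it implements can be illustrated with a deliberately simplified, mutex-based sketch (the owner pushes and pops at one end; idle domains steal the oldest tasks from the other):

```ocaml
(* Simplified, mutex-based illustration of the work-stealing discipline;
   the actual deque in #29 is lock-free and far more efficient. *)
type 'a deque = {
  mutable items : 'a list;   (* head = owner end, newest task first *)
  lock : Mutex.t;
}

let make () = { items = []; lock = Mutex.create () }

let with_lock d f =
  Mutex.lock d.lock;
  Fun.protect ~finally:(fun () -> Mutex.unlock d.lock) f

(* Owner end: LIFO, keeping recently spawned (cache-hot) tasks local. *)
let push d x = with_lock d (fun () -> d.items <- x :: d.items)

let pop d =
  with_lock d (fun () ->
      match d.items with
      | [] -> None
      | x :: rest -> d.items <- rest; Some x)

(* Thief end: FIFO, so a stealing domain takes the oldest task, which
   tends to represent the largest remaining chunk of work. *)
let steal d =
  with_lock d (fun () ->
      match List.rev d.items with
      | [] -> None
      | oldest :: rev_rest -> d.items <- List.rev rev_rest; Some oldest)
```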
Below are two examples; each runs the Channels version and then the Tasks version, in that order.
The first one uses 8 domains.
The second one uses 12 domains.
The numbers and the trace file show a different distribution of minor GC calls when using the task library with more than 8 domains (e.g., 12). (The picture doesn't reveal all the processes, but the behaviour looks similar across all of them.)
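For context, a minimal sketch of the two styles being compared; the real benchmark bodies are not reproduced in this thread, so `work`, the job counts, and the -1 stop sentinel are illustrative assumptions:

```ocaml
module C = Domainslib.Chan
module T = Domainslib.Task

let work i = ignore i   (* placeholder for one unit of benchmark work *)

(* Channels style: one explicitly spawned domain per worker,
   coordinated over Domainslib.Chan; -1 is a stop sentinel. *)
let channels_version num_domains n =
  let jobs = C.make_bounded n in
  let finished = C.make_unbounded () in
  let workers =
    List.init num_domains (fun _ ->
        Domain.spawn (fun () ->
            let rec loop () =
              match C.recv jobs with
              | -1 -> ()
              | i -> work i; C.send finished (); loop ()
            in
            loop ()))
  in
  for i = 0 to n - 1 do C.send jobs i done;
  for _ = 1 to n do C.recv finished done;
  List.iter (fun _ -> C.send jobs (-1)) workers;
  List.iter Domain.join workers

(* Tasks style: a single pool; the scheduler distributes iterations. *)
let tasks_version num_domains n =
  let pool = T.setup_pool ~num_domains:(num_domains - 1) () in
  T.run pool (fun () ->
      T.parallel_for pool ~start:0 ~finish:(n - 1) ~body:work);
  T.teardown_pool pool
```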