Task library slowdown if the number of domains is greater than 8 #8

shubhamkumar13 · 2020-04-28T16:56:35Z

Below are two examples; each runs the channels version and then the tasks version, in that order.

The first one uses 8 domains.

Channels

 OCAMLRUNPARAM="S=0x400" taskset --cpu-list 2-13 chrt -r 1 _build/default/game_of_life_multicore.exe 8 256

 **** GC stats ****
 Minor words:		93193
 Promoted words:	462
 Major words:		2101822
 Minor collections:	24u
 Major collections:	6

Tasks

 OCAMLRUNPARAM="S=0x400" taskset --cpu-list 2-13 chrt -r 1 _build/default/game_of_life_multicore_task.exe 8 256

**** GC stats ****
Minor words:		95355
Promoted words:		302
Major words:		2101662
Minor collections:	23u
Major collections:	6

The second one uses 12 domains.

Channels

 OCAMLRUNPARAM="S=0x400" taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/game_of_life_multicore.exe 12 256 

 **** GC stats ****
 Minor words:		144641
 Promoted words:	614
 Major words:		2101990
 Minor collections:	23u
 Major collections:	6

Tasks

 OCAMLRUNPARAM="S=0x400" taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/game_of_life_multicore_task.exe 12 256

 **** GC stats ****
 Minor words:		7690302
 Promoted words:	4392
 Major words:		2105768
 Minor collections:	57u
 Major collections:	6

Both the numbers and the trace file show a different distribution of minor GC calls when the task library is used with more than 8 domains, for example 12. (The picture doesn't show all the processes, but the behaviour looks similar in all of them.)
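To put the GC difference in numbers, here is a small sketch that compares the minor-heap counters of the four runs above. It is only arithmetic on the reported values; the dictionary layout and the function name `tasks_to_channels_ratio` are made up for this illustration.

```python
# GC counters copied from the runs above:
# (implementation, domains) -> (minor words, minor collections).
gc_stats = {
    ("channels", 8): (93193, 24),
    ("tasks", 8): (95355, 23),
    ("channels", 12): (144641, 23),
    ("tasks", 12): (7690302, 57),
}


def tasks_to_channels_ratio(domains):
    """Return (minor-words ratio, minor-collections ratio) of tasks over channels."""
    task_words, task_colls = gc_stats[("tasks", domains)]
    chan_words, chan_colls = gc_stats[("channels", domains)]
    return task_words / chan_words, task_colls / chan_colls


for domains in (8, 12):
    words, colls = tasks_to_channels_ratio(domains)
    print(f"{domains} domains: {words:.1f}x minor words, {colls:.1f}x minor collections")
```

At 8 domains both ratios are close to 1, while at 12 domains the tasks run allocates roughly 53 times as many minor words and performs about 2.5 times as many minor collections as the channels run.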


kayceesrk · 2020-04-30T08:34:59Z

Can you also link to where the game_of_life sources are available? This would be useful for reproduction.

I've fixed the issue with more minor GCs in a recent commit, aca9ea9. However, I still noticed a similar performance difference to the earlier one. Can you confirm this?

shubhamkumar13 · 2020-04-30T10:45:03Z

game_of_life (with task api)
game_of_life (with channels)

The above code contains print statements. When comparing the time taken by each version, I omitted the printing part.

I haven't reproduced it with the updated version of recv_poll, but I'll do it asap.

shubhamkumar13 · 2020-05-01T11:38:06Z

Time Pre aca9ea9

Domains	Channels	Tasks
1	50.486	50.612
2	25.355	25.359
4	12.763	12.742
8	6.472	6.441
12	4.416	4.502
16	3.376	3.322
20	2.794	2.842
24	2.357	3.02

Time Post aca9ea9

Domains	Channels	Tasks
1	45.187	44.37
2	23.105	22.286
4	11.663	11.201
8	5.952	5.668
12	4.078	3.976
16	3.102	2.927
20	2.534	2.513
24	2.165	2.67

The aca9ea9 commit brings the running times of channels and tasks on par. The case where the number of domains is 24 is the only exception, as far as I can see.
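As a rough check of that claim, the sketch below computes how much longer the tasks version takes than the channels version at each domain count in the "Time Post aca9ea9" table. The names `post_aca9ea9` and `task_overhead_percent` are illustrative only, and the times are used in whatever unit the table reports.

```python
# Post-aca9ea9 timings copied from the table above: domains -> (channels, tasks).
post_aca9ea9 = {
    1: (45.187, 44.37),
    2: (23.105, 22.286),
    4: (11.663, 11.201),
    8: (5.952, 5.668),
    12: (4.078, 3.976),
    16: (3.102, 2.927),
    20: (2.534, 2.513),
    24: (2.165, 2.67),
}


def task_overhead_percent(channels, tasks):
    """Extra running time of tasks relative to channels, as a percentage."""
    return (tasks - channels) / channels * 100.0


overheads = {d: task_overhead_percent(c, t) for d, (c, t) in post_aca9ea9.items()}
slower_with_tasks = [d for d, pct in overheads.items() if pct > 0]

for d, pct in overheads.items():
    print(f"{d:>2} domains: tasks {pct:+.1f}% vs channels")
print("tasks slower at:", slower_with_tasks)
```

With these numbers, tasks is slightly faster than channels at every domain count except 24, where it is about 23% slower.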

Channels :

Task :

The only difference I see is that Process 0 handles interrupts in the channels implementation, while it doesn't in the task implementation.

kayceesrk · 2020-05-01T12:04:20Z

Ok. This is quite useful. Is there any difference in minor/major allocation / collections? It is not ideal that domain 0 is up to something in the task library at 24 domains. Any insights on how to improve this?

kayceesrk · 2020-05-01T13:02:47Z

I took a brief look at what's going on using perf:

Task based one:

$ sudo perf stat -d -d -d taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/life.exe 8 32 > /dev/null

 Performance counter stats for 'taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/life.exe 8 32':

       6451.373141      task-clock (msec)         #    7.024 CPUs utilized          
               278      context-switches          #    0.043 K/sec                  
                 7      cpu-migrations            #    0.001 K/sec                  
             4,762      page-faults               #    0.738 K/sec                  
   14,10,37,73,143      cycles                    #    2.186 GHz                      (61.81%)
   10,92,49,34,212      instructions              #    0.77  insn per cycle           (69.69%)
    1,95,94,05,759      branches                  #  303.719 M/sec                    (69.70%)
      34,82,03,068      branch-misses             #   17.77% of all branches          (69.67%)
    4,20,22,61,296      L1-dcache-loads           #  651.375 M/sec                    (69.69%)
       1,69,14,320      L1-dcache-load-misses     #    0.40% of all L1-dcache hits    (69.78%)
          2,28,921      LLC-loads                 #    0.035 M/sec                    (69.49%)
            14,994      LLC-load-misses           #    6.55% of all LL-cache hits     (69.18%)
   <not supported>      L1-icache-loads                                             
          8,87,531      L1-icache-load-misses                                         (68.80%)
    4,21,38,41,980      dTLB-loads                #  653.170 M/sec                    (68.40%)
            60,833      dTLB-load-misses          #    0.00% of all dTLB cache hits   (60.81%)
          3,85,245      iTLB-loads                #    0.060 M/sec                    (61.15%)
             8,306      iTLB-load-misses          #    2.16% of all iTLB cache hits   (61.51%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

       0.918466448 seconds time elapsed

Channel based one:

$ sudo perf stat -d -d -d taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/life_chan.exe 8 32 > /dev/null

 Performance counter stats for 'taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/life_chan.exe 8 32':

       5913.120151      task-clock (msec)         #    6.956 CPUs utilized          
               263      context-switches          #    0.044 K/sec                  
                 7      cpu-migrations            #    0.001 K/sec                  
             4,769      page-faults               #    0.807 K/sec                  
   12,92,58,46,442      cycles                    #    2.186 GHz                      (61.07%)
   10,84,83,09,319      instructions              #    0.84  insn per cycle           (68.80%)
    1,96,65,42,992      branches                  #  332.573 M/sec                    (68.88%)
      35,05,69,980      branch-misses             #   17.83% of all branches          (68.91%)
    4,16,47,99,597      L1-dcache-loads           #  704.332 M/sec                    (68.91%)
       1,70,57,139      L1-dcache-load-misses     #    0.41% of all L1-dcache hits    (69.36%)
          2,25,617      LLC-loads                 #    0.038 M/sec                    (69.59%)
            22,118      LLC-load-misses           #    9.80% of all LL-cache hits     (69.58%)
   <not supported>      L1-icache-loads                                             
          6,55,354      L1-icache-load-misses                                         (69.65%)
    4,17,28,41,970      dTLB-loads                #  705.692 M/sec                    (69.57%)
            17,932      dTLB-load-misses          #    0.00% of all dTLB cache hits   (61.53%)
          3,76,332      iTLB-loads                #    0.064 M/sec                    (61.52%)
            29,448      iTLB-load-misses          #    7.83% of all iTLB cache hits   (61.44%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

       0.850127551 seconds time elapsed

The task-based one executes fewer instructions per cycle. Wonder where the stalls might be? @sadiqj
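For reference, the instructions-per-cycle figures can be recomputed from the raw counters in the two perf outputs above. This is only a sketch: `parse_count` and `ipc` are hypothetical helper names, and the comma stripping assumes the counters are printed with locale-specific digit grouping, as they appear here.

```python
def parse_count(text):
    """Parse a perf counter printed with digit-group commas, e.g. '14,10,37,73,143'."""
    return int(text.replace(",", ""))


# Raw counters copied from the two perf stat outputs above.
counters = {
    "task": {"cycles": "14,10,37,73,143", "instructions": "10,92,49,34,212"},
    "channel": {"cycles": "12,92,58,46,442", "instructions": "10,84,83,09,319"},
}


def ipc(name):
    """Instructions per cycle for the named run."""
    run = counters[name]
    return parse_count(run["instructions"]) / parse_count(run["cycles"])


for name in counters:
    print(f"{name}: {ipc(name):.2f} insn per cycle")

# How many more cycles the task run needs than the channel run.
cycle_ratio = parse_count(counters["task"]["cycles"]) / parse_count(counters["channel"]["cycles"])
print(f"task/channel cycles: {cycle_ratio:.2f}")
```

The two runs retire almost the same number of instructions, but the task-based one needs about 9% more cycles, which is what the lower IPC (0.77 vs 0.84) reflects.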

shubhamkumar13 · 2020-05-01T16:52:24Z

> Ok. This is quite useful. Is there any difference in minor/major allocation / collections? It is not ideal that domain 0 is up to something in the task library at 24 domains. Any insights on how to improve this?

Sorry 😅️, I had no insights for improving this.

sadiqj · 2020-05-01T19:26:55Z

Here's an oddity: I'm running domainslib 0.2, which doesn't have aca9ea9, and yet I see very little difference in performance between task and chan on a dual 36 core machine:

time ./game_of_life_multicore_task.exe 4 256 > /dev/null

real	0m7.614s
user	0m29.458s
sys	0m0.036s
--
time ./game_of_life_multicore_chan.exe 4 256 > /dev/null

real	0m7.347s
user	0m28.455s
sys	0m0.029s
--
time ./game_of_life_multicore_task.exe 8 256 > /dev/null

real	0m4.167s
user	0m31.413s
sys	0m0.052s
--
time ./game_of_life_multicore_chan.exe 8 256 > /dev/null

real	0m4.103s
user	0m31.130s
sys	0m0.052s
--
time ./game_of_life_multicore_task.exe 16 256 > /dev/null

real	0m2.595s
user	0m36.173s
sys	0m0.064s
--
time ./game_of_life_multicore_chan.exe 16 256 > /dev/null

real	0m2.656s
user	0m36.330s
sys	0m0.049s
--
time ./game_of_life_multicore_task.exe 32 256 > /dev/null

real	0m2.084s
user	0m39.834s
sys	0m0.133s
--
time ./game_of_life_multicore_chan.exe 32 256 > /dev/null

real	0m2.063s
user	0m39.497s
sys	0m0.104s
--
time ./game_of_life_multicore_task.exe 48 256 > /dev/null

real	0m1.726s
user	0m48.704s
sys	0m0.146s
--
time ./game_of_life_multicore_chan.exe 48 256 > /dev/null

real	0m1.761s
user	0m47.047s
sys	0m0.152s
--
time ./game_of_life_multicore_task.exe 64 256 > /dev/null

real	0m1.593s
user	0m56.907s
sys	0m0.146s
--
time ./game_of_life_multicore_chan.exe 64 256 > /dev/null

real	0m1.545s
user	0m52.133s
sys	0m0.145s
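To quantify "very little difference", here is a sketch that parses the `real` times above and prints the relative gap between the task and chan runs. The helper names `to_seconds` and `relative_gap_percent` are assumptions made for this example, and only the `XmY.YYYs` format shown in the output is handled.

```python
import re

# "real" times copied from the `time` output above: domains -> (task, chan).
real_times = {
    4: ("0m7.614s", "0m7.347s"),
    8: ("0m4.167s", "0m4.103s"),
    16: ("0m2.595s", "0m2.656s"),
    32: ("0m2.084s", "0m2.063s"),
    48: ("0m1.726s", "0m1.761s"),
    64: ("0m1.593s", "0m1.545s"),
}


def to_seconds(stamp):
    """Convert a shell `time` stamp such as '0m7.614s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unexpected time format: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def relative_gap_percent(task, chan):
    """Wall-clock time of task relative to chan, as a signed percentage."""
    return (to_seconds(task) - to_seconds(chan)) / to_seconds(chan) * 100.0


gaps = {d: relative_gap_percent(task, chan) for d, (task, chan) in real_times.items()}
for d, gap in gaps.items():
    print(f"{d:>2} domains: task {gap:+.1f}% vs chan")
```

On these runs the gap stays under 4% in either direction, and the task version is marginally faster at 16 and 48 domains.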

kayceesrk · 2020-05-02T07:15:44Z

@sadiqj that is encouraging but odd. I could very well believe that the slowdown that we witness only occurs on our machines. We may dig in a little to remove any potential overheads.

dalev · 2021-02-19T04:38:44Z

Hi, I noticed this issue as well. Just for fun, I built a simple ray tracer and parallelized it using domainslib. As long as I use no more than six domains for rendering (plus one domain to tick the console progress bar), things work just fine. As soon as I ask for seven rendering domains, the program slows down significantly.

Concretely, I can generate my reference image in ~3s using a purely sequential run. If I use 2 domains, I obtain results in just under ~2s. With six domains, I get down to ~1.7s.

But with seven rendering domains, the time jumps an order of magnitude: 18s. If I ask for eight rendering domains, the time jumps to 35s. But then, interestingly, it seems to flatten out. I've tried asking for as many as 20 domains, and the running time hovers around 35s.

Fwiw, I am running this on a Windows laptop (4 physical cores / 8 logical) using Debian inside of WSL2. As a result, my ability to see what exactly is going on is limited (e.g., I don't think I can use perf inside WSL2, but not totally confident).

Anyways, I just thought I'd share on the off chance that this turns out to be helpful. Happy to try to provide more info if needed!

dalev · 2021-02-20T16:37:22Z

Actually, I was able to get perf going in WSL2. Using time perf record -g -- _build/default/bin/main.exe -max-threads 7, the elapsed time is about 18 seconds, and perf report shows this at the top:

34.78% 34.78% main.exe main.exe [.] caml_try_run_on_all_domains_with_spin_work

21.66% 21.66% main.exe main.exe [.] caml_gc_log

18.31% 0.00% main.exe [unknown] [.] 0x00000000000004fd

16.94% 0.00% main.exe [unknown] [.] 0x1074c08548fffff9

15.27% 0.00% main.exe [unknown] [.] 0x00000000000007fd

15.27% 0.00% main.exe [unknown] [.] 0x000055844f643238

12.13% 12.13% main.exe main.exe [.] major_collection_slice

5.29% 5.29% main.exe main.exe [.] caml_try_stw_empty_minor_heap_on_all_domains

4.18% 4.18% main.exe main.exe [.] caml_empty_minor_heaps_once

3.21% 3.21% main.exe main.exe [.] caml_try_run_on_all_domains

3.16% 3.16% main.exe main.exe [.] caml_adopt_orphaned_work

2.12% 2.12% main.exe main.exe [.] caml_final_update_first

Whereas, if I run time perf record -g -- _build/default/bin/main.exe -max-threads 6, the elapsed time is only 1.8 seconds, and perf report shows:

38.66% 0.00% main.exe [unknown] [.] 0x00000000000004fd

33.62% 0.00% main.exe [unknown] [.] 0x00000000000007fd

33.62% 0.00% main.exe [unknown] [.] 0x00005566d1ee6238

13.98% 13.98% main.exe main.exe [.] caml_curry2

11.77% 11.77% main.exe main.exe [.] camlPath_tracer__Affine__fun_3663
... elided ...

2.24% 2.23% main.exe main.exe [.] caml_try_run_on_all_domains_with_spin_work

I'm still not sure what the underlying story is, but hopefully this points in the right direction?

Sudha247 · 2021-12-24T07:08:54Z

Domainslib has evolved quite a bit since this issue was reported for the game of life benchmark. Notably, the task library now has a Chase-Lev work-stealing queue underneath for scheduling tasks (#29) and uses effect handlers to manage task creation (#51).

I ran the same benchmarks again on domainslib.0.4.0 and here are the numbers for 256 2048:

Cores	Task	Chan
1	49.19	60.06
2	25.94	36.19
4	12.86	23.38
8	6.72	17.32
12	4.68	15.42
16	3.65	14.44
20	3	14.05
24	2.7	13.42

I think it's safe to say we no longer see a slowdown at higher core counts, hence closing this issue.
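The scaling in the table above can be summarised as speedup over the 1-core run and parallel efficiency (speedup divided by cores). The sketch below is plain arithmetic on the reported numbers; `timings`, `speedup` and `efficiency` are names chosen for this illustration.

```python
# domainslib.0.4.0 timings copied from the table above: cores -> (task, chan).
timings = {
    1: (49.19, 60.06),
    2: (25.94, 36.19),
    4: (12.86, 23.38),
    8: (6.72, 17.32),
    12: (4.68, 15.42),
    16: (3.65, 14.44),
    20: (3.0, 14.05),
    24: (2.7, 13.42),
}

TASK, CHAN = 0, 1


def speedup(column, cores):
    """Speedup over the 1-core run of the same column."""
    return timings[1][column] / timings[cores][column]


def efficiency(column, cores):
    """Parallel efficiency: speedup divided by the number of cores."""
    return speedup(column, cores) / cores


for cores in timings:
    print(
        f"{cores:>2} cores: task {speedup(TASK, cores):5.1f}x "
        f"({efficiency(TASK, cores):.0%}), chan {speedup(CHAN, cores):4.1f}x "
        f"({efficiency(CHAN, cores):.0%})"
    )
```

At 24 cores the task version reaches roughly an 18x speedup and keeps improving at every step, whereas the channel version reaches about 4.5x.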

@dalev thanks for the input. Since your machine has 4 cores / 8 threads, having more than 8 domains is sure to slow down the program. Ideally, the number of domains should equal the number of physical cores for best results. You might be interested in https://github.com/ocaml-multicore/ocaml-multicore/wiki/Concurrency-and-parallelism-design-notes. I'd suggest trying your benchmark again on the latest version of domainslib. Feel free to open a new issue if needed.

kayceesrk added this to To do in Domains-only Multicore OCaml Apr 30, 2020

kayceesrk moved this from To do to To do: Performance in Domains-only Multicore OCaml May 1, 2020

edwintorok mentioned this issue Oct 6, 2021

Task.await deadlock (task finishes but await never returns) #47

Closed

Sudha247 closed this as completed Dec 24, 2021
