Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Task library slowdown if the number of domains is greater than 8 #8

Closed
shubhamkumar13 opened this issue Apr 28, 2020 · 11 comments
Closed

Comments

@shubhamkumar13
Copy link

shubhamkumar13 commented Apr 28, 2020

Below are 2 examples, both run Channels' and Tasks' respectively in order

  • The first one uses 8 domains.

    • Channels
     OCAMLRUNPARAM="S=0x400" taskset --cpu-list 2-13 chrt -r 1 _build/default/game_of_life_multicore.exe 8 256
    
     **** GC stats ****
     Minor words:		93193
     Promoted words:	462
     Major words:		2101822
     Minor collections:	24u
     Major collections:	6
    

    Screenshot from 2020-04-28 22-18-36

    • Tasks
    OCAMLRUNPARAM="S=0x400" taskset --cpu-list 2-13 chrt -r 1 _build/default /game_of_life_multicore_task.exe 8 256
    
    **** GC stats ****
    Minor words:		95355
    Promoted words:		302
    Major words:		2101662
    Minor collections:	23u
    Major collections:	6
    

    Screenshot from 2020-04-28 22-18-03

  • The second one uses 12 domains

    • Channels
     OCAMLRUNPARAM="S=0x400" taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/game_of_life_multicore.exe 12 256 
    
     **** GC stats ****
     Minor words:		144641
     Promoted words:	614
     Major words:		2101990
     Minor collections:	23u
     Major collections:	6
    

    Screenshot from 2020-04-28 22-10-21

    • Tasks
     OCAMLRUNPARAM="S=0x400" taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/game_of_life_multicore_task.exe 12 256
    
     **** GC stats ****
     Minor words:		7690302
     Promoted words:	4392
     Major words:		2105768
     Minor collections:	57u
     Major collections:	6
    

    Screenshot from 2020-04-28 22-08-58

The numbers and the trace file have a different distribution of minor gc calls when one is using the task library and the number of domains is something greater than 8, like 12. (The picture doesn't reveal all the processes, but the behaviour looks similar in all the processes)

@kayceesrk
Copy link
Contributor

kayceesrk commented Apr 30, 2020

Can you also link to where the game_of_life sources are available? This would be useful for reproduction.

I've fixed the issue with more minor GCs in a recent commit. aca9ea9. However, I still noticed similar performance difference as earlier. Can you confirm this?

@shubhamkumar13
Copy link
Author

shubhamkumar13 commented Apr 30, 2020

game_of_life (with task api)
game_of_life (with channels)

The above code contains print statements. When comparing the time taken by each code I've omitted the printing part.

I haven't reproduced it with the updated version of recv_poll, but I'll do it asap.

@kayceesrk kayceesrk moved this from To do to To do: Performance in Domains-only Multicore OCaml May 1, 2020
@shubhamkumar13
Copy link
Author

shubhamkumar13 commented May 1, 2020

Time Pre aca9ea9

Domains Channels Tasks
1 50.486 50.612
2 25.355 25.359
4 12.763 12.742
8 6.472 6.441
12 4.416 4.502
16 3.376 3.322
20 2.794 2.842
24 2.357 3.02

Time Post aca9ea9

Domains Channels Tasks
1 45.187 44.37
2 23.105 22.286
4 11.663 11.201
8 5.952 5.668
12 4.078 3.976
16 3.102 2.927
20 2.534 2.513
24 2.165 2.67

The aca9ea9 commit makes channels' and tasks' running time at par. The case when the number of domains = 24 is the only exception as far as I can see.

Channels :

Screenshot from 2020-05-01 17-03-51

Task :

Screenshot from 2020-05-01 17-04-13

The only difference I see is that the channels' Process 0 handles interrupt while the task's implementation doesn't

@kayceesrk
Copy link
Contributor

Ok. This is quite useful. Is there any difference in minor/major allocation / collections? It is not ideal that domain 0 is up to something in the task library at 24 domains. Any insights on how to improve this?

@kayceesrk
Copy link
Contributor

I took a brief look at what's going on using perf

Task based one:

$ sudo perf stat -d -d -d taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/life.exe 8 32 > /dev/null

 Performance counter stats for 'taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/life.exe 8 32':

       6451.373141      task-clock (msec)         #    7.024 CPUs utilized          
               278      context-switches          #    0.043 K/sec                  
                 7      cpu-migrations            #    0.001 K/sec                  
             4,762      page-faults               #    0.738 K/sec                  
   14,10,37,73,143      cycles                    #    2.186 GHz                      (61.81%)
   10,92,49,34,212      instructions              #    0.77  insn per cycle           (69.69%)
    1,95,94,05,759      branches                  #  303.719 M/sec                    (69.70%)
      34,82,03,068      branch-misses             #   17.77% of all branches          (69.67%)
    4,20,22,61,296      L1-dcache-loads           #  651.375 M/sec                    (69.69%)
       1,69,14,320      L1-dcache-load-misses     #    0.40% of all L1-dcache hits    (69.78%)
          2,28,921      LLC-loads                 #    0.035 M/sec                    (69.49%)
            14,994      LLC-load-misses           #    6.55% of all LL-cache hits     (69.18%)
   <not supported>      L1-icache-loads                                             
          8,87,531      L1-icache-load-misses                                         (68.80%)
    4,21,38,41,980      dTLB-loads                #  653.170 M/sec                    (68.40%)
            60,833      dTLB-load-misses          #    0.00% of all dTLB cache hits   (60.81%)
          3,85,245      iTLB-loads                #    0.060 M/sec                    (61.15%)
             8,306      iTLB-load-misses          #    2.16% of all iTLB cache hits   (61.51%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

       0.918466448 seconds time elapsed

Channel based one:

$ sudo perf stat -d -d -d taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/life_chan.exe 8 32 > /dev/null

 Performance counter stats for 'taskset --cpu-list 2-13,16-27 chrt -r 1 _build/default/life_chan.exe 8 32':

       5913.120151      task-clock (msec)         #    6.956 CPUs utilized          
               263      context-switches          #    0.044 K/sec                  
                 7      cpu-migrations            #    0.001 K/sec                  
             4,769      page-faults               #    0.807 K/sec                  
   12,92,58,46,442      cycles                    #    2.186 GHz                      (61.07%)
   10,84,83,09,319      instructions              #    0.84  insn per cycle           (68.80%)
    1,96,65,42,992      branches                  #  332.573 M/sec                    (68.88%)
      35,05,69,980      branch-misses             #   17.83% of all branches          (68.91%)
    4,16,47,99,597      L1-dcache-loads           #  704.332 M/sec                    (68.91%)
       1,70,57,139      L1-dcache-load-misses     #    0.41% of all L1-dcache hits    (69.36%)
          2,25,617      LLC-loads                 #    0.038 M/sec                    (69.59%)
            22,118      LLC-load-misses           #    9.80% of all LL-cache hits     (69.58%)
   <not supported>      L1-icache-loads                                             
          6,55,354      L1-icache-load-misses                                         (69.65%)
    4,17,28,41,970      dTLB-loads                #  705.692 M/sec                    (69.57%)
            17,932      dTLB-load-misses          #    0.00% of all dTLB cache hits   (61.53%)
          3,76,332      iTLB-loads                #    0.064 M/sec                    (61.52%)
            29,448      iTLB-load-misses          #    7.83% of all iTLB cache hits   (61.44%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

       0.850127551 seconds time elapsed

The task-based one executes fewer instructions per cycle. Wonder where the stalls might be? @sadiqj

@shubhamkumar13
Copy link
Author

Ok. This is quite useful. Is there any difference in minor/major allocation / collections? It is not ideal that domain 0 is up to something in the task library at 24 domains. Any insights on how to improve this?

Sorry 😅️, I had no insights for improving this.

@sadiqj
Copy link

sadiqj commented May 1, 2020

Here's an oddity, I'm running domainslib 0.2 which doesn't have aca9ea9 and yet I see very little difference in performance between task and chan on a dual 36 core machine:

time ./game_of_life_multicore_task.exe 4 256 > /dev/null

real	0m7.614s
user	0m29.458s
sys	0m0.036s
--
time ./game_of_life_multicore_chan.exe 4 256 > /dev/null

real	0m7.347s
user	0m28.455s
sys	0m0.029s
--
time ./game_of_life_multicore_task.exe 8 256 > /dev/null

real	0m4.167s
user	0m31.413s
sys	0m0.052s
--
time ./game_of_life_multicore_chan.exe 8 256 > /dev/null

real	0m4.103s
user	0m31.130s
sys	0m0.052s
--
time ./game_of_life_multicore_task.exe 16 256 > /dev/null

real	0m2.595s
user	0m36.173s
sys	0m0.064s
--
time ./game_of_life_multicore_chan.exe 16 256 > /dev/null

real	0m2.656s
user	0m36.330s
sys	0m0.049s
--
time ./game_of_life_multicore_task.exe 32 256 > /dev/null

real	0m2.084s
user	0m39.834s
sys	0m0.133s
--
time ./game_of_life_multicore_chan.exe 32 256 > /dev/null

real	0m2.063s
user	0m39.497s
sys	0m0.104s
--
time ./game_of_life_multicore_task.exe 48 256 > /dev/null

real	0m1.726s
user	0m48.704s
sys	0m0.146s
--
time ./game_of_life_multicore_chan.exe 48 256 > /dev/null

real	0m1.761s
user	0m47.047s
sys	0m0.152s
--
time ./game_of_life_multicore_task.exe 64 256 > /dev/null

real	0m1.593s
user	0m56.907s
sys	0m0.146s
--
time ./game_of_life_multicore_chan.exe 64 256 > /dev/null

real	0m1.545s
user	0m52.133s
sys	0m0.145s

@kayceesrk
Copy link
Contributor

@sadiqj that is encouraging but odd. I could very well believe that the slowdown that we witness only occurs on our machines. We may dig in a little to remove any potential overheads.

@dalev
Copy link

dalev commented Feb 19, 2021

Hi, I noticed this issue as well. Just for fun, I built a simple ray tracer and parallelized it using domainslib. As long as I use no more than six domains for rendering (plus one domain to tick the console progress bar), things work just fine. As soon as I ask for seven rendering domains, the program slows down significantly.

Concretely, I can generate my reference image in ~3s using a purely sequential run. If I use 2 domains, I obtain results in just under ~2s. With six domains, I get down to ~1.7s.

But with seven rendering domains, the time jumps an order of magnitude: 18s. If I ask for eight rendering domains, the time jumps to 35s. But then, interestingly, it seems to flatten out. I've tried asking for as many as 20 domains, and the running time hovers around 35s.

Fwiw, I am running this on a Windows laptop (4 physical cores / 8 logical) using Debian inside of WSL2. As a result, my ability to see what exactly is going on is limited (e.g., I don't think I can use perf inside WSL2, but not totally confident).

Anyways, I just thought I'd share on the off chance that this turns out to be helpful. Happy to try to provide more info if needed!

@dalev
Copy link

dalev commented Feb 20, 2021

Actually, I was able to get perf going in WSL2. Using time perf record -g -- _build/default/bin/main.exe -max-threads 7, the elapsed time is about 18 seconds, and perf report shows this at the top:

  • 34.78% 34.78% main.exe main.exe [.] caml_try_run_on_all_domains_with_spin_work
  • 21.66% 21.66% main.exe main.exe [.] caml_gc_log
  • 18.31% 0.00% main.exe [unknown] [.] 0x00000000000004fd
  • 16.94% 0.00% main.exe [unknown] [.] 0x1074c08548fffff9
  • 15.27% 0.00% main.exe [unknown] [.] 0x00000000000007fd
  • 15.27% 0.00% main.exe [unknown] [.] 0x000055844f643238
  • 12.13% 12.13% main.exe main.exe [.] major_collection_slice
  • 5.29% 5.29% main.exe main.exe [.] caml_try_stw_empty_minor_heap_on_all_domains
  • 4.18% 4.18% main.exe main.exe [.] caml_empty_minor_heaps_once
  • 3.21% 3.21% main.exe main.exe [.] caml_try_run_on_all_domains
  • 3.16% 3.16% main.exe main.exe [.] caml_adopt_orphaned_work
  • 2.12% 2.12% main.exe main.exe [.] caml_final_update_first

Whereas, if I run time perf record -g -- _build/default/bin/main.exe -max-threads 6, the elapsed time is only 1.8 seconds, and perf report shows:

  • 38.66% 0.00% main.exe [unknown] [.] 0x00000000000004fd
  • 33.62% 0.00% main.exe [unknown] [.] 0x00000000000007fd
  • 33.62% 0.00% main.exe [unknown] [.] 0x00005566d1ee6238
  • 13.98% 13.98% main.exe main.exe [.] caml_curry2
  • 11.77% 11.77% main.exe main.exe [.] camlPath_tracer__Affine__fun_3663
    ... elided ...
  • 2.24% 2.23% main.exe main.exe [.] caml_try_run_on_all_domains_with_spin_work

I'm still not sure what the underlying story is, but hopefully this points in the right direction?

@Sudha247
Copy link
Contributor

Domainslib has evolved quite a bit since this issue was reported for game of life benchmark. Notably, the task library has a Chase-Lev work-stealing queue underneath for scheduling tasks (#29) and uses effect handlers to manage task creation (#51).

I ran the same benchmarks again on domainslib.0.4.0 and here are the numbers for 256 2048:

Cores Task Chan
1 49.19 60.06
2 25.94 36.19
4 12.86 23.38
8 6.72 17.32
12 4.68 15.42
16 3.65 14.44
20 3 14.05
24 2.7 13.42

I think it's safe to say we no longer witness slowdown on higher number of cores, hence closing this issue.


@dalev thanks for the inputs. Since your machine contains 4 cores / 8 threads, having more than 8 domains is sure to slow down the program. Ideally, the number of domains = number of physical cores for best results. You might be interested in https://github.com/ocaml-multicore/ocaml-multicore/wiki/Concurrency-and-parallelism-design-notes. I'd suggest trying your benchmark again on the latest version of domainslib. Feel free to make a new issue if needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Domains-only Multicore OCaml
  
To do: Performance
Development

No branches or pull requests

5 participants