Help needed for a NUMA-effects reproducer #865

ikabadzhov · 2022-08-02T18:55:52Z

Dear experts,
My question is whether there is a TBB benchmark that shows NUMA effects. I tried map-reduce operations on long vectors, but I never saw any NUMA effects. Do I understand correctly from https://link.springer.com/chapter/10.1007/978-1-4842-4398-5_20 that TBB has its own mechanism to optimally decide where work is done?

Note that with OpenMP, NUMA problems are very visible.

ikabadzhov · 2022-08-04T15:15:33Z

Update: Following the answer from https://community.intel.com/t5/Intel-oneAPI-Threading-Building/What-is-the-current-state-of-art-solution-to-NUMA-effects-with/m-p/1405677/emcs_t/S2h8ZW1haWx8dG9waWNfc3Vic2NyaXB0aW9ufEw2REUwNjBVNUtMSjhOfDE0MDU2Nzd8U1VCU0NSSVBUSU9OU3xoSw#M15138, I ran this benchmark: https://github.com/Apress/pro-TBB/blob/master/ch20/fig_20_05.cpp. The only change is that instead of tbb::task_scheduler_init init{nth}, I now use oneapi::tbb::task_arena arena(nth);. I also set the default vsize = 1000000000;.

The reason for the change was this compile error:

error: 'task_scheduler_init' is not a member of 'tbb'
   tbb::task_scheduler_init init{nth};
        ^~~~~~~~~~~~~~~~~~~

I am comparing the running time of tasks pinned to 1 NUMA domain and to 2 NUMA domains, e.g.:

2 NUMA domains:

 Performance counter stats for 'taskset -c 12,13,14,15 ./gif' (3 runs):

          13394.65 msec task-clock:u              #    1.457 CPUs utilized            ( +-  1.33% )
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
           3346764      page-faults:u             #  250.228 K/sec                    ( +-  9.11% )
       10789002582      cycles:u                  #    0.807 GHz                      ( +-  0.64% )
       11008077434      instructions:u            #    1.01  insn per cycle           ( +-  0.00% )
        1504179394      branches:u                #  112.463 M/sec                    ( +-  0.02% )
             33761      branch-misses:u           #    0.00% of all branches          ( +-  0.22% )

             9.192 +- 0.328 seconds time elapsed  ( +-  3.57% )

Time: 1.50884 seconds; Bandwidth: 15906.2MB/s
Time: 1.50178 seconds; Bandwidth: 15981MB/s
Time: 1.48301 seconds; Bandwidth: 16183.3MB/s
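
As a sanity check on the printed bandwidth figures: they are consistent with a triad over three arrays of doubles, i.e. 24 bytes moved per element with vsize = 1000000000. The sketch below reproduces that arithmetic. The 24-bytes-per-element model is an assumption inferred from the numbers above, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumed model: a[i] = b[i] + alpha * c[i] touches three doubles per
// element, so 3 * sizeof(double) = 24 bytes per element on typical platforms.
double triad_bandwidth_mb_per_s(std::size_t vsize, double seconds) {
  assert(seconds > 0.0);
  const double bytes = 3.0 * sizeof(double) * static_cast<double>(vsize);
  return bytes / seconds / 1e6;  // MB/s with 1 MB = 10^6 bytes
}
```

Under this model, triad_bandwidth_mb_per_s(1000000000, 1.50884) evaluates to roughly 15906 MB/s, which matches the first line of the output.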

1 NUMA domain:

 Performance counter stats for 'taskset -c 14,15,16,17 ./gif' (3 runs):

          14537.66 msec task-clock:u              #    1.432 CPUs utilized            ( +-  0.33% )
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
           4724987      page-faults:u             #  322.927 K/sec                    ( +-  0.20% )
       10230499278      cycles:u                  #    0.699 GHz                      ( +-  0.23% )
       11009327851      instructions:u            #    1.07  insn per cycle           ( +-  0.00% )
        1505533689      branches:u                #  102.895 M/sec                    ( +-  0.00% )
             32679      branch-misses:u           #    0.00% of all branches          ( +-  2.77% )

           10.1549 +- 0.0329 seconds time elapsed  ( +-  0.32% )

I should have minimal noise in my system.

vossmjp · 2022-08-11T13:31:58Z

@ikabadzhov have you seen OpenMP showing better bandwidth than oneTBB, or are you just looking for a case that highlights that oneTBB can, in some configurations, show poor performance without its NUMA features activated? There are some key differences in how oneTBB and OpenMP behave by default. By default, oneTBB does not pin threads to cores, while OpenMP does. By default, oneTBB uses work-stealing and auto-partitioning to balance the load across cores, while OpenMP uses static partitioning into roughly equal-sized chunks and then a repeatable static scheduling of those chunks to pinned threads. Depending on the workload, these defaults can show different tolerance to NUMA effects. You can mimic many of these OpenMP defaults in oneTBB by using NUMA-aware task_arenas and/or static partitioners.

isaevil · 2022-10-05T11:25:33Z

@ikabadzhov is this issue still relevant for you? Could you please respond?

ikabadzhov · 2022-10-10T16:33:26Z

Thanks a lot for the replies. The issue can be closed. My main goal was to observe NUMA effects in ROOT's dataframe. For that I used tests from here. After inspecting the behaviour of ROOT with and without a NUMA-aware TBB arena, my brief conclusions are:

1. We see NUMA effects (within ROOT's dataframe) only at low core counts, and those effects decrease proportionally as the number of cores increases.
2. Applying a NUMA-aware TBB arena does help at the lower core counts.

We did not end up applying the NUMA-aware mechanism, because:

1. The most interesting cases (high core counts) already perform well.
2. There were several limitations that we bypassed manually, for instance having several nested ForEach calls within a vector of arenas. For the purpose of the benchmarks we restructured our end to have only a single ForEach call.
3. It would require a newer TBB version (which might not be available).

I might be missing something, but reason 1 was the main motivation.

ikabadzhov closed this as completed Oct 10, 2022
