Help needed for a NUMA-effects reproducer #865

Closed
ikabadzhov opened this issue Aug 2, 2022 · 4 comments
Comments

@ikabadzhov

Dear experts,
My question is whether there is a TBB benchmark that exposes NUMA effects. I have tried map-reduce operations on long vectors, but I never saw any NUMA effects. Do I understand correctly from https://link.springer.com/chapter/10.1007/978-1-4842-4398-5_20 that TBB has its own mechanism to optimally decide where work is done?

Note that with OpenMP, NUMA problems are very visible.

@ikabadzhov
Author

Update: Following the answer from https://community.intel.com/t5/Intel-oneAPI-Threading-Building/What-is-the-current-state-of-art-solution-to-NUMA-effects-with/m-p/1405677/emcs_t/S2h8ZW1haWx8dG9waWNfc3Vic2NyaXB0aW9ufEw2REUwNjBVNUtMSjhOfDE0MDU2Nzd8U1VCU0NSSVBUSU9OU3xoSw#M15138, I ran this benchmark: https://github.com/Apress/pro-TBB/blob/master/ch20/fig_20_05.cpp. I made only two changes: instead of tbb::task_scheduler_init init{nth}; I now use oneapi::tbb::task_arena arena(nth);, and I set the default to vsize = 1000000000;.

The reason for the first change was this compile error:

error: 'task_scheduler_init' is not a member of 'tbb'
   tbb::task_scheduler_init init{nth};
        ^~~~~~~~~~~~~~~~~~~

What I am doing is comparing the running time of tasks pinned to cores in one NUMA domain against tasks pinned to cores spanning two NUMA domains, e.g.:

  • 2 NUMA domains:
 Performance counter stats for 'taskset -c 12,13,14,15 ./gif' (3 runs):

          13394.65 msec task-clock:u              #    1.457 CPUs utilized            ( +-  1.33% )
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
           3346764      page-faults:u             #  250.228 K/sec                    ( +-  9.11% )
       10789002582      cycles:u                  #    0.807 GHz                      ( +-  0.64% )
       11008077434      instructions:u            #    1.01  insn per cycle           ( +-  0.00% )
        1504179394      branches:u                #  112.463 M/sec                    ( +-  0.02% )
             33761      branch-misses:u           #    0.00% of all branches          ( +-  0.22% )

             9.192 +- 0.328 seconds time elapsed  ( +-  3.57% )

Time: 1.50884 seconds; Bandwidth: 15906.2MB/s
Time: 1.50178 seconds; Bandwidth: 15981MB/s
Time: 1.48301 seconds; Bandwidth: 16183.3MB/s
  • 1 NUMA domain:
 Performance counter stats for 'taskset -c 14,15,16,17 ./gif' (3 runs):

          14537.66 msec task-clock:u              #    1.432 CPUs utilized            ( +-  0.33% )
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
           4724987      page-faults:u             #  322.927 K/sec                    ( +-  0.20% )
       10230499278      cycles:u                  #    0.699 GHz                      ( +-  0.23% )
       11009327851      instructions:u            #    1.07  insn per cycle           ( +-  0.00% )
        1505533689      branches:u                #  102.895 M/sec                    ( +-  0.00% )
             32679      branch-misses:u           #    0.00% of all branches          ( +-  2.77% )

           10.1549 +- 0.0329 seconds time elapsed  ( +-  0.32% )

I should have minimal noise in my system.
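
As a sanity check on the numbers above, the derived columns can be recomputed by hand. The sketch below assumes that the benchmark streams three arrays of `double` (an inference: 3 × 8 bytes × vsize divided by the reported time matches the printed bandwidth) and that 1 MB means 10^6 bytes. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Bandwidth in MB/s (1 MB = 1e6 bytes) for a kernel that streams `arrays`
// arrays of `n` doubles in `seconds`. The default of 3 arrays is an
// assumption inferred from the reported numbers, not read from the source.
double bandwidth_mb_per_s(std::size_t n, double seconds, int arrays = 3) {
    const double bytes = static_cast<double>(n) * sizeof(double) * arrays;
    return bytes / seconds / 1e6;
}

// The "CPUs utilized" column of perf stat: task-clock time divided by
// wall-clock (elapsed) time.
double cpus_utilized(double task_clock_msec, double elapsed_sec) {
    return task_clock_msec / 1e3 / elapsed_sec;
}
```

With vsize = 1000000000 and a time of 1.50884 s, `bandwidth_mb_per_s` gives about 15906 MB/s, and `cpus_utilized(13394.65, 9.192)` gives about 1.457, both in line with the output above. Note that the elapsed time reported by perf covers the whole process, while the "Time:" lines presumably cover only the timed kernel.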

@vossmjp

vossmjp commented Aug 11, 2022

@ikabadzhov have you seen OpenMP showing better bandwidth than oneTBB, or are you just looking for a case that highlights that oneTBB can, in some configurations, show poor performance without its NUMA features activated? There are some key differences in how oneTBB and OpenMP behave by default. By default, oneTBB does not pin threads to cores, while OpenMP does. By default, oneTBB uses work-stealing and auto-partitioning to balance the load across cores, while OpenMP uses static partitioning into roughly equal-sized chunks and then a repeatable static scheduling of those chunks to pinned threads. Depending on the workload, these defaults can show different tolerance to NUMA effects. You can mimic many of these OpenMP defaults in oneTBB by using NUMA-aware task_arenas and/or static partitioners.

@isaevil
Contributor

isaevil commented Oct 5, 2022

@ikabadzhov is this issue still relevant for you? Could you please respond?

@ikabadzhov
Author

Thanks a lot for the replies. The issue can be closed. My main goal was to observe NUMA effects in ROOT's dataframe. For that I used tests from here. After inspecting the behaviour of ROOT with and without a NUMA-aware TBB arena, our brief conclusions are:

  1. We see NUMA effects (within ROOT's dataframe) only at low core counts, and those effects decrease proportionally as the number of cores increases.
  2. Applying a NUMA-aware TBB arena does help at the lower core counts.

We did not end up applying the NUMA-aware mechanism, because:

  1. The most interesting cases (high core counts) already perform well.
  2. There were several limitations that we bypassed manually. For instance, we had several nested ForEach calls within a vector of arenas; for the purpose of the benchmarks we restructured things on our end to have only a single ForEach call.
  3. It would require a newer TBB version (which might not be available).

I might be missing something, but point 1 was the main motivation.


3 participants