[OpenMP] Adding a throttling threshold to bound dependent tasking mem… #82274
Conversation
✅ With the latest revision this PR passed the C/C++ code formatter.
@jprotze We exchanged emails about this on 21/11/2023; you were concerned about the performance impact of adding a global atomic counter. I suggested adding a compile-time option and disabling this new throttling parameter by default.
…ed deprecated 'master' by 'single' in unit tests
I rebased the patch to f7c2e5f, because your branch did not build with my build configuration. Then I built LLVM for these two versions.
First, I saw a flaky livelock when running taskbench from epcc-openmp-microbenchmarks. I could only reproduce the livelock with your patch, not with the version from main. Your patch seems to introduce a race condition in task scheduling or the barrier logic.
Second, I see a significant performance impact from this patch. I execute the Fibonacci code below with 96 threads on our machine and get 2.459s (with the patch) vs. 0.068s (without the patch), a 36x runtime increase.
Even execution with a single thread is significantly impacted (0.670s vs. 0.500s), and the serial execution is faster than the execution with 48 threads.
The crucial thing is: never put anything on the code path for included tasks (if(0)).
```c
#include <stdio.h>
#include <stdlib.h>

int fib(int n) {
  int i, j;
  if (n < 2) {
    return n;
  } else {
#pragma omp task shared(i) if (n > 15)
    i = fib(n - 1);
#pragma omp task shared(j) if (n > 15)
    j = fib(n - 2);
    if (n > 15) {
#pragma omp taskwait
    }
    return i + j;
  }
}

int main(int argc, char **argv) {
  int n = 5;
  if (argc > 1)
    n = atoi(argv[1]);
#pragma omp parallel
#pragma omp single
  {
    printf("fib(%i) = %i\n", n, fib(n));
  }
  return 0;
}
```
Compiled as:
```
clang -fopenmp -g -O3 fib-if0.c
```
Executed as:
```
time env OMP_PLACES="cores" OMP_PROC_BIND=close OMP_NUM_THREADS=96 ~/testdir/openmp/a.out 34
```
openmp/runtime/src/kmp_tasking.cpp (outdated)
```cpp
    __kmp_task_is_allowed(gtid, __kmp_task_stealing_constraint, taskdata,
  if (__kmp_enable_task_throttling &&
      TCR_4(thread_data->td.td_deque_ntasks) >=
          __kmp_task_maximum_ready_per_thread) {
    if (__kmp_task_is_allowed(gtid, __kmp_task_stealing_constraint, taskdata,
```
The logic here seems broken. Expanding the task queue is only necessary if it is not large enough.
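For clarity, a minimal sketch of the control flow this suggests; the helpers `deque_is_full`, `expand_deque`, and `push_task_to_deque` are illustrative placeholders, not actual kmp_tasking.cpp identifiers:

```c
/* Illustrative sketch, not the actual kmp_tasking.cpp code: the deque is
 * expanded only when it is genuinely full and the task must be deferred. */
int push_or_throttle(/* thread/task state elided */) {
  if (deque_is_full(thread_data)) { /* hypothetical helper */
    if (__kmp_enable_task_throttling &&
        __kmp_task_is_allowed(gtid, __kmp_task_stealing_constraint, taskdata,
                              thread->th.th_current_task)) {
      /* Deque full and throttling enabled: do not defer, the caller
       * executes the task inline instead. */
      return TASK_NOT_PUSHED;
    }
    /* Deque full but the task must be deferred: grow the deque first. */
    expand_deque(thread_data); /* hypothetical helper */
  }
  push_task_to_deque(thread_data, taskdata); /* hypothetical helper */
  return TASK_SUCCESSFULLY_PUSHED;
}
```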
I pulled from main and updated the patch with your feedback. Thanks.
Running on a 16-core Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz with OMP_PLACES="cores" OMP_PROC_BIND=close OMP_NUM_THREADS=16.
On performance
Using the Fibonacci program you provided:

Compiling with `#define KMP_COMPILE_GLOBAL_TASK_THROTTLING 0` and running with `KMP_ENABLE_TASK_THROTTLING=0`:
```
$ time ./a.out 34
fib(34) = 5702887
real 0m0.210s
```
Compiling with `#define KMP_COMPILE_GLOBAL_TASK_THROTTLING 0` and running with `KMP_ENABLE_TASK_THROTTLING=1` (= current default behavior):
```
$ time ./a.out 34
fib(34) = 5702887
real 0m0.201s
```
Compiling with `#define KMP_COMPILE_GLOBAL_TASK_THROTTLING 1` and running with `KMP_ENABLE_TASK_THROTTLING=0`:
```
$ time ./a.out 34
fib(34) = 5702887
real 0m0.210s
```
Compiling with `#define KMP_COMPILE_GLOBAL_TASK_THROTTLING 1` and running with `KMP_ENABLE_TASK_THROTTLING=1`:
```
$ time ./a.out 34
fib(34) = 5702887
real 0m7.786s
```
This makes me think of adding an additional run-time option KMP_ENABLE_GLOBAL_TASK_THROTTLING to disable only the "global throttling" at run time even when it is compiled in, so users can still use the current "per-thread throttling".
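A rough sketch of that gating (every global-throttling identifier below is hypothetical, mirroring the naming style of existing switches such as __kmp_enable_task_throttling; only the compile-time guard exists in this patch):

```c
#if KMP_COMPILE_GLOBAL_TASK_THROTTLING
/* Hypothetical flag parsed from the proposed KMP_ENABLE_GLOBAL_TASK_THROTTLING
 * environment variable, and a hypothetical global task limit. */
extern int __kmp_enable_global_task_throttling;
extern int __kmp_task_maximum_ready_global;

static int global_throttle_needed(int global_ready_tasks) {
  /* Consult the global counter only when both the compile-time and the
   * run-time switches are on, so per-thread throttling keeps working even
   * when global throttling is compiled in but disabled at run time. */
  return __kmp_enable_global_task_throttling &&
         global_ready_tasks >= __kmp_task_maximum_ready_global;
}
#endif
```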
On livelock
I could not reproduce it on the above machine and configuration using taskbench from epcc-openmp-microbenchmarks (I ran it 200+ times using the default taskbench options).
I built LLVM using -DCMAKE_BUILD_TYPE=RelWithDebInfo, which ends up being gcc-12 with -O2 -g.
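(For reference, a typical configuration with that build type looks something like the following; the generator, paths, and project/runtime selection are assumptions, not the exact command used:)
```
cmake -G Ninja ../llvm \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DLLVM_ENABLE_PROJECTS="clang" \
  -DLLVM_ENABLE_RUNTIMES="openmp"
```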
On unit tests
In the current MR state, the test omp_throttling_max.c will fail if "global throttling" is not compiled in (#define KMP_COMPILE_GLOBAL_TASK_THROTTLING 0). Any advice on disabling the test in that case?
Oh, and...
On the broken logic
Yes, I broke that, sorry! I reverted the modifications.
As far as my understanding goes, this section of the runtime is related to deferring tasks that have a priority. Such tasks are stored in per-team queues (not per-thread queues, as for priority-0 tasks), so it makes sense to ignore the new KMP_TASK_MAXIMUM_READY_PER_THREAD parameter here anyway.
The current main branch behavior is preserved (if the queue for the given priority is full and throttling is enabled, throttle; otherwise resize the queue).
Performance
An alternative approach might be to count only tasks enqueued/added to the dependence graph, not tasks generated/freed. This would resolve the performance impact for if(0) tasks.
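A sketch of that idea (all names are hypothetical; the point is that the atomic is touched only on enqueue/dequeue, never on the inline path taken by if(0) tasks):

```c
#include <stdatomic.h>

/* Hypothetical global counter of deferred tasks: incremented only when a
 * task is actually enqueued or added to the dependence graph, decremented
 * when it leaves. Included tasks (if(0)) run inline and never touch it. */
static atomic_int deferred_task_count;

static void on_task_deferred(void) {
  atomic_fetch_add_explicit(&deferred_task_count, 1, memory_order_relaxed);
}

static void on_task_removed(void) {
  atomic_fetch_sub_explicit(&deferred_task_count, 1, memory_order_relaxed);
}

static int should_throttle(int limit) {
  return atomic_load_explicit(&deferred_task_count, memory_order_relaxed) >=
         limit;
}
```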
Unit Tests
Make the define a CMake option (check, for example, LIBOMP_OMPT_SUPPORT). Follow the flow of LIBOMP_OMPT_SUPPORT down to the tests. Finally the option ends up as a lit feature, and we can use it as // REQUIRES: ompt in OMPT tests, as in the sketch below.
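Concretely, once the CMake option is threaded through to lit, the new test could be gated the same way (the feature name below is a placeholder for whatever the option ends up exporting as a lit feature):

```c
// REQUIRES: global-task-throttling  /* hypothetical lit feature name */
// RUN: %libomp-compile-and-run
```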
Broken Logic
I think the function should still consider the global task limit and not enqueue the task if the limit is reached.
PS: it would be better to keep general discussion at the global level and not in a code comment ;)
@jprotze Agreed on adding a throttling parameter based on the number of tasks added to TDGs.
The only tasks you will not count this way are tasks that have started executing and are on the execution stack of a thread.
…unit tests and epcc+fib
Please refer to https://reviews.llvm.org/D158416