[OpenMP] Adding a throttling threshold to bound dependent tasking mem… #82274
Conversation
✅ With the latest revision this PR passed the C/C++ code formatter.
@jprotze We exchanged emails about this on 21/11/2023; you were concerned about the performance impact of adding a global atomic counter. I suggested adding a compile-time option and disabling this new throttling parameter by default.
…ed deprecated 'master' by 'single' in unit tests
I rebased the patch to f7c2e5f, because your branch did not build with my build configuration. Then I built LLVM for these two versions.
First, I saw a flaky livelock when running taskbench from epcc-openmp-microbenchmarks. I could only reproduce the livelock with your patch, not with the version from main. Your patch seems to introduce a race condition in task scheduling or the barrier logic.
Second, I see a significant performance impact from this patch. I execute the Fibonacci code below with 96 threads on our machine and get 2.459s (with the patch) vs. 0.068s (without the patch), a 36x runtime increase.
Even execution with a single thread is significantly impacted (0.670s vs. 0.500s), and the serial execution is faster than the execution with 48 threads.
The crucial thing is: never put anything on the code path for included tasks (if(0)).
```c
#include <stdio.h>
#include <stdlib.h>

int fib(int n) {
  int i, j;
  if (n < 2) {
    return n;
  } else {
#pragma omp task shared(i) if (n > 15)
    i = fib(n - 1);
#pragma omp task shared(j) if (n > 15)
    j = fib(n - 2);
    if (n > 15) {
#pragma omp taskwait
    }
    return i + j;
  }
}

int main(int argc, char **argv) {
  int n = 5;
  if (argc > 1)
    n = atoi(argv[1]);
#pragma omp parallel
#pragma omp single
  {
    printf("fib(%i) = %i\n", n, fib(n));
  }
  return 0;
}
```
Compiled as:
```
clang -fopenmp -g -O3 fib-if0.c
```
Executed as:
```
time env OMP_PLACES="cores" OMP_PROC_BIND=close OMP_NUM_THREADS=96 ~/testdir/openmp/a.out 34
```
openmp/runtime/src/kmp_tasking.cpp (outdated)
```cpp
    __kmp_task_is_allowed(gtid, __kmp_task_stealing_constraint, taskdata,
  if (__kmp_enable_task_throttling &&
      TCR_4(thread_data->td.td_deque_ntasks) >=
          __kmp_task_maximum_ready_per_thread) {
    if (__kmp_task_is_allowed(gtid, __kmp_task_stealing_constraint, taskdata,
```
The logic here seems broken. Expanding the task queue is only necessary if it is not large enough.
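For clarity, a minimal sketch of the control flow this suggests; the helpers `deque_is_full`, `expand_deque`, and `push_task_to_deque` are illustrative placeholders, not actual kmp_tasking.cpp identifiers:

```c
/* Illustrative sketch, not the actual kmp_tasking.cpp code: the deque is
 * expanded only when it is genuinely full and the task must be deferred. */
int push_or_throttle(/* thread/task state elided */) {
  if (deque_is_full(thread_data)) { /* hypothetical helper */
    if (__kmp_enable_task_throttling &&
        __kmp_task_is_allowed(gtid, __kmp_task_stealing_constraint, taskdata,
                              thread->th.th_current_task)) {
      /* Deque full and throttling enabled: do not defer, the caller
       * executes the task inline instead. */
      return TASK_NOT_PUSHED;
    }
    /* Deque full but the task must be deferred: grow the deque first. */
    expand_deque(thread_data); /* hypothetical helper */
  }
  push_task_to_deque(thread_data, taskdata); /* hypothetical helper */
  return TASK_SUCCESSFULLY_PUSHED;
}
```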
I pulled from main and updated the patch with your feedback. Thanks.
Running on a 16-core Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz with OMP_PLACES="cores" OMP_PROC_BIND=close OMP_NUM_THREADS=16.
On performance
Using the Fibonacci program you provided:

Compiling with `#define KMP_COMPILE_GLOBAL_TASK_THROTTLING 0` and running with `KMP_ENABLE_TASK_THROTTLING=0`:
```
$ time ./a.out 34
fib(34) = 5702887
real 0m0.210s
```
Compiling with `#define KMP_COMPILE_GLOBAL_TASK_THROTTLING 0` and running with `KMP_ENABLE_TASK_THROTTLING=1` (= current default behavior):
```
$ time ./a.out 34
fib(34) = 5702887
real 0m0.201s
```
Compiling with `#define KMP_COMPILE_GLOBAL_TASK_THROTTLING 1` and running with `KMP_ENABLE_TASK_THROTTLING=0`:
```
$ time ./a.out 34
fib(34) = 5702887
real 0m0.210s
```
Compiling with `#define KMP_COMPILE_GLOBAL_TASK_THROTTLING 1` and running with `KMP_ENABLE_TASK_THROTTLING=1`:
```
$ time ./a.out 34
fib(34) = 5702887
real 0m7.786s
```
This makes me think of adding an additional run-time option KMP_ENABLE_GLOBAL_TASK_THROTTLING to disable only the "global throttling" at run time even when it is compiled in, so users can still use the current "per-thread throttling".
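A rough sketch of that gating (every global-throttling identifier below is hypothetical, mirroring the naming style of existing switches such as __kmp_enable_task_throttling; only the compile-time guard exists in this patch):

```c
#if KMP_COMPILE_GLOBAL_TASK_THROTTLING
/* Hypothetical flag parsed from the proposed KMP_ENABLE_GLOBAL_TASK_THROTTLING
 * environment variable, and a hypothetical global task limit. */
extern int __kmp_enable_global_task_throttling;
extern int __kmp_task_maximum_ready_global;

static int global_throttle_needed(int global_ready_tasks) {
  /* Consult the global counter only when both the compile-time and the
   * run-time switches are on, so per-thread throttling keeps working even
   * when global throttling is compiled in but disabled at run time. */
  return __kmp_enable_global_task_throttling &&
         global_ready_tasks >= __kmp_task_maximum_ready_global;
}
#endif
```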
On livelock
I could not reproduce it on the above machine and configuration using taskbench from epcc-openmp-microbenchmarks (I ran it 200+ times using the default taskbench options).
I built LLVM using -DCMAKE_BUILD_TYPE=RelWithDebInfo, which ends up being gcc-12 with -O2 -g.
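(For reference, a typical configuration with that build type looks something like the following; the generator, paths, and project/runtime selection are assumptions, not the exact command used:)
```
cmake -G Ninja ../llvm \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DLLVM_ENABLE_PROJECTS="clang" \
  -DLLVM_ENABLE_RUNTIMES="openmp"
```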
On unit tests
In the current MR state, the test omp_throttling_max.c will fail if "global throttling" is not compiled in (#define KMP_COMPILE_GLOBAL_TASK_THROTTLING 0). Any advice on disabling the test in that case?
Oh, and...
On the broken logic
Yes, I broke that, sorry! I reverted the modifications.
As far as my understanding goes, this section of the runtime is related to deferring tasks that have a priority. Such tasks are stored in per-team queues (not per-thread queues, as for priority-0 tasks), so it makes sense to ignore the new KMP_TASK_MAXIMUM_READY_PER_THREAD parameter here anyway.
The current main branch behavior is preserved (if the queue for the given priority is full and throttling is enabled, throttle; otherwise resize the queue).
Performance
An alternative approach might be to count only tasks enqueued/added to the dependence graph, not tasks generated/freed. This would resolve the performance impact for if(0) tasks.
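A sketch of that idea (all names are hypothetical; the point is that the atomic is touched only on enqueue/dequeue, never on the inline path taken by if(0) tasks):

```c
#include <stdatomic.h>

/* Hypothetical global counter of deferred tasks: incremented only when a
 * task is actually enqueued or added to the dependence graph, decremented
 * when it leaves. Included tasks (if(0)) run inline and never touch it. */
static atomic_int deferred_task_count;

static void on_task_deferred(void) {
  atomic_fetch_add_explicit(&deferred_task_count, 1, memory_order_relaxed);
}

static void on_task_removed(void) {
  atomic_fetch_sub_explicit(&deferred_task_count, 1, memory_order_relaxed);
}

static int should_throttle(int limit) {
  return atomic_load_explicit(&deferred_task_count, memory_order_relaxed) >=
         limit;
}
```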
Unit Tests
Make the define a CMake option (check, for example, LIBOMP_OMPT_SUPPORT). Follow the flow of LIBOMP_OMPT_SUPPORT down to the tests. Finally the option ends up as a lit feature, and we can use it as // REQUIRES: ompt in OMPT tests, as in the sketch below.
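Concretely, once the CMake option is threaded through to lit, the new test could be gated the same way (the feature name below is a placeholder for whatever the option ends up exporting as a lit feature):

```c
// REQUIRES: global-task-throttling  /* hypothetical lit feature name */
// RUN: %libomp-compile-and-run
```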
Broken Logic
I think the function should still consider the global task limit and not enqueue the task if the limit is reached.
PS: it would be better to keep general discussion at the global level and not in a code comment ;)
@jprotze Agreed on adding a throttling parameter based on the number of tasks added to TDGs.
The only tasks you will not count this way are tasks that have started executing and are on the execution stack of a thread.
…unit tests and epcc+fib
Please refer to https://reviews.llvm.org/D158416