Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[OpenMP] Adding a throttling threshold to bound dependent tasking mem… #82274

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

rpereira-dev
Copy link

Copy link

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be
notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write
permissions for the repository. In which case you can instead tag reviewers by
name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review
by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate
is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot llvmbot added the openmp:libomp OpenMP host runtime label Feb 19, 2024
Copy link

github-actions bot commented Feb 19, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@rpereira-dev
Copy link
Author

rpereira-dev commented Feb 19, 2024

@jprotze We exchanged by mail on this on 21/11/2023 - you were concerned about performances on adding a global atomic counter

I suggested adding a compile-time option and disable this new throttling parameter it by default.
It is not implemented in this patch, should I add the compile-time option ?

…ed deprecated 'master' by 'single' in unit tests
Copy link
Collaborator

@jprotze jprotze left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rebased the patch to f7c2e5f, because your branch did not build with my build configuration. Then I build llvm for these two versions.

First, I saw a flaky livelock, when running taskbench from epcc-openmp-microbenchmarks. I could only reproduce the livelock with your patch, but not with the version from main. Your patch seems to introduce a race condition on task scheduling or the barrier logic.

Then, I see a significant performance impact by this patch. I execute below Fibonacci code with 96 threads on our machine and get 2,459s (with the patch) vs 0,068s (without the patch), which is a 36x runtime increase.
Even execution with a single thread is significantly impacted (0.670s vs. 0.500s). And the serial execution is faster than the execution with 48 threads.

The crucial thing is: never put anything on the code path for included tasks (if(0)).

#include <stdio.h>
#include <stdlib.h>
int fib(int n) {
  int i, j;
  if (n<2) {
    return n;
  } else {
    #pragma omp task shared(i) if(n>15)
    i=fib(n-1);
    #pragma omp task shared(j) if(n>15)
    j=fib(n-2);
    if (n>15) {
      #pragma omp taskwait
    }
    return i+j;
  }
}

int main(int argc, char** argv) {
  int n = 5;
  if (argc>1) 
    n = atoi(argv[1]);
  #pragma omp parallel
  #pragma omp single
  {
    printf("fib(%i) = %i\n", n, fib(n));
  }
  return 0;
}

Compiled as:

clang -fopenmp -g -O3 fib-if0.c

Executed as:

time env OMP_PLACES="cores" OMP_PROC_BIND=close OMP_NUM_THREADS=96 ~/testdir/openmp/a.out 34

__kmp_task_is_allowed(gtid, __kmp_task_stealing_constraint, taskdata,
if (__kmp_enable_task_throttling && TCR_4(thread_data->td.td_deque_ntasks) >=
__kmp_task_maximum_ready_per_thread) {
if (__kmp_task_is_allowed(gtid, __kmp_task_stealing_constraint, taskdata,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic here seems broken. Expanding the task queue is only necessary, if it is not large enough.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pulled from main and updated the patch with your feedback. Thanks.
Running on 16-cores Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz with OMP_PLACES="cores" OMP_PROC_BIND=close OMP_NUM_THREADS=16

On performances

Using the fibonacci program you provided

Compiling with #define KMP_COMPILE_GLOBAL_TASK_THROTTLING 0 and running with KMP_ENABLE_TASK_THROTTLING=0

$ time ./a.out 34
fib(34) = 5702887
real	0m0.210s

Compiling with #define KMP_COMPILE_GLOBAL_TASK_THROTTLING 0 and running with KMP_ENABLE_TASK_THROTTLING=1 (= current default behavior)

$ time ./a.out 34
fib(34) = 5702887
real	0m0.201s

Compiling with #define KMP_COMPILE_GLOBAL_TASK_THROTTLING 1 and running with KMP_ENABLE_TASK_THROTTLING=0

$ time ./a.out 34
fib(34) = 5702887
real	0m0.210s

Compiling with #define KMP_COMPILE_GLOBAL_TASK_THROTTLING 1 and running with KMP_ENABLE_TASK_THROTTLING=1

$ time ./a.out 34
fib(34) = 5702887
real	0m7.786s

This makes me think of adding an additional run-time option KMP_ENABLE_GLOBAL_TASK_THROTTLING to disable only "global throttling" at run-time even if its compiled, so users can still use the current "per-thread throttling"

On livelock

I could not reproduce on the above machine and configuration using taskbench from epcc-openmp-microbenchmarks.
(I ran 200+ times using default taskbench options)
I built LLVM using -DCMAKE_BUILD_TYPE=RelWithDebInfo that ends up being a gcc-12 with -O2 -g

On unit tests

In the current MR state, the test omp_throttling_max.c will fail if "global throttling" is not compiled (#define KMP_COMPILE_GLOBAL_TASK_THROTTLING 0) - any advices on disabling the test in such case ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh and...

On the broken logic

Yes I broke that sorry! I reverted modifications.
As far as my understanding goes, this section of the runtime is related to deferring tasks with a priority.
Such tasks are stored into "per team" queues (and not "per thread" as for 0-priority tasks) - so it makes sense to ignore the newly KMP_TASK_MAXIMUM_READY_PER_THREAD parameter anyway

The current main branch behavior is preserved (if the queue for the given priority is full and throttling is enabled, throttle, else resize the queue)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance
An alternative approach might be to only count tasks enqueued/added to the dependence graph, not tasks generated/freed. This would resolve the performance impact for if0 tasks.

Unit Tests
Make the define a cmake option (check for example LIBOMP_OMPT_SUPPORT). Follow the flow of LIBOMP_OMPT_SUPPORT to the tests. Finally the option ends up as a lit feature and we can use it as // REQUIRES: ompt in ompt tests.

Broken Logic
I think, the function should still consider the global task limit and not enqueue the task if the limit is reached.

PS: it would be better to keep general discussion at the global level and not in a code comment ;)

@rpereira-dev
Copy link
Author

rpereira-dev commented Feb 27, 2024

@jprotze Agreeing on adding a throttling parameter based on the number of tasks added to TDGs.
Better for performances, and mixed with the existing "per-thread-ready-tasks" threshold, it should fix the initial concern on memory footprint, at least on my motivating use-cases that follows a single-producer/multi-consumer pattern.
I'll be working on it.
Thanks for the advices on unit testing

@jprotze
Copy link
Collaborator

jprotze commented Feb 27, 2024

The only tasks you will not count this way are tasks that started executing and are on the execution stack of a thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
openmp:libomp OpenMP host runtime
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants