Hi All,
When LIBOMP_OMPT_SUPPORT=TRUE, I cannot collect correct backtraces with sampling-based profiling in NVIDIA Nsight Systems (nsys). This profiler uses the same method as perf, so perf is most likely affected as well. The issue affects AArch64, but not x86_64. 32-bit ARM might be affected too, but I have no platform to test on. Recompiling libomp with LIBOMP_OMPT_SUPPORT=FALSE avoids the problem, but obviously breaks collection of OMPT events.
Backtrace with LIBOMP_OMPT_SUPPORT=TRUE:
Call stack at 0.0311147s:
omp_example!std::chrono::duration<...>::count() const
omp_example!work()
omp_example!main
omp_example!main
libomp.so!__kmp_invoke_microtask
[Unknown]!0x1
[Broken backtraces]![Broken backtraces]
Backtrace with LIBOMP_OMPT_SUPPORT=FALSE:
Call stack at 0.0274804s:
omp_example!std::chrono::duration<...>::duration<...>(...)
omp_example!work()
omp_example!main
omp_example!main
libomp.so!__kmp_invoke_microtask
libomp.so!__kmp_invoke_task_func
libomp.so!__kmp_fork_call
libomp.so!__kmpc_fork_call
omp_example!main
libc.so.6!__libc_start_call_main
libc.so.6!__libc_start_main@@GLIBC_2
omp_example!_start
[Broken backtraces]![Broken backtraces]
I took a look at the __kmp_invoke_microtask code for AArch64. I am not knowledgeable about frame pointers or AArch64 assembly, but I think the issue is that when OMPT is enabled, x19 and x20 are pushed to the stack between saving the frame pointer register (x29) to the stack and copying the stack pointer into the frame pointer register. As a result, x29 ends up pointing at the saved x19/x20 pair rather than at the memory where the previous frame pointer was saved. As an experiment, I changed the order of operations:
diff --git a/openmp/runtime/src/z_Linux_asm.S b/openmp/runtime/src/z_Linux_asm.S
index 0bf9f07a13f1..d641efcef2a8 100644
--- a/openmp/runtime/src/z_Linux_asm.S
+++ b/openmp/runtime/src/z_Linux_asm.S
@@ -1358,10 +1358,10 @@ __tid = 8
PACBTI_C
stp x29, x30, [sp, #-16]!
+ mov x29, sp
# if OMPT_SUPPORT
stp x19, x20, [sp, #-16]!
# endif
- mov x29, sp
orr w9, wzr, #1
add w9, w9, w3, lsr #1
@@ -1413,11 +1413,11 @@ KMP_LABEL(kmp_0):
KMP_LABEL(kmp_1):
blr x8
orr w0, wzr, #1
- mov sp, x29
# if OMPT_SUPPORT
str xzr, [x19]
ldp x19, x20, [sp], #16
# endif
+ mov sp, x29
ldp x29, x30, [sp], #16
PACBTI_RET
ret
I recompiled with LIBOMP_OMPT_SUPPORT=TRUE and got both backtraces and OMPT working for my toy example.
Backtrace with the experimental change:
Call stack at 0.0358592s:
omp_example!std::chrono::duration<...>::count() const
omp_example!bool std::chrono::operator< <long, std::ratio<1l, 1000000000l>, long, std::ratio<1l, 1l> >(...)
omp_example!work()
omp_example!main
omp_example!main
libomp.so!__kmp_invoke_microtask
libomp.so!__kmp_invoke_task_func
libomp.so!__kmp_fork_call
libomp.so!__kmpc_fork_call
omp_example!main
libc.so.6!__libc_start_call_main
libc.so.6!__libc_start_main@@GLIBC_2
omp_example!_start
[Broken backtraces]![Broken backtraces]
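To illustrate why the position of `mov x29, sp` might matter, here is a minimal simulation of a frame-pointer walk. It is only a sketch under assumptions: it assumes the profiler follows the chain of saved x29 values and reads the return address from the slot next to each one. The stack is modeled as an array of 8-byte slots, and all names (`build_stack`, `walk`), slot indices, and addresses are hypothetical, not taken from libomp or from any profiler.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simulated stack: each element is one 8-byte slot, higher index = older data.
// A frame record is two adjacent slots: [fp] = caller's x29, [fp + 1] = saved x30.
using Stack = std::vector<std::uint64_t>;

constexpr std::uint64_t kStartFp = 6;  // x29 inside the microtask (hypothetical)

// Build the stack as seen while the microtask runs.
// fp_before_callee_saves == true : "mov x29, sp" right after "stp x29, x30" (experimental order)
// fp_before_callee_saves == false: "mov x29, sp" after "stp x19, x20" (current OMPT order)
Stack build_stack(bool fp_before_callee_saves) {
    Stack mem(16, 0);
    // Caller of the runtime function: outermost frame record, chain ends with fp == 0.
    mem[12] = 0;
    mem[13] = 0x1000;  // hypothetical return address
    // Runtime function prologue: stp x29, x30, [sp, #-16]!
    mem[10] = 12;      // caller's frame pointer
    mem[11] = 0x2000;  // hypothetical return address into the caller
    // OMPT only: stp x19, x20, [sp, #-16]!
    mem[8] = 0x7777;   // arbitrary x19 contents, not a frame pointer
    mem[9] = 0xBAD;    // arbitrary x20 contents, not a return address
    // x29 of the runtime function, as saved by the microtask's own prologue.
    const std::uint64_t runtime_fp = fp_before_callee_saves ? 10 : 8;
    mem[kStartFp] = runtime_fp;
    mem[kStartFp + 1] = 0x3000;  // hypothetical return address into the runtime function
    return mem;
}

// Follow saved frame pointers, collecting the return address of each frame.
std::vector<std::uint64_t> walk(const Stack& mem, std::uint64_t fp,
                                std::size_t max_depth = 16) {
    std::vector<std::uint64_t> pcs;
    while (fp != 0 && fp + 1 < mem.size() && pcs.size() < max_depth) {
        pcs.push_back(mem[fp + 1]);  // slot next to the saved fp is treated as x30
        fp = mem[fp];                // saved x29 is treated as the next frame record
    }
    return pcs;
}
```

In this model, `walk(build_stack(true), kStartFp)` reaches the outermost frame, while `walk(build_stack(false), kStartFp)` treats the saved x19/x20 pair as a frame record, reports a bogus address, and then stops. That would be consistent with the truncated backtrace above, but I have not verified that this is what the profiler actually does.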
Toy example:
#include <chrono>
#include <iostream>

void work() {
  const auto start = std::chrono::steady_clock::now();
  auto now = std::chrono::steady_clock::now();
  int i = 0;
  while ((now - start) < std::chrono::seconds{1}) {
    i = (i + 7) * 74 % 211;
    now = std::chrono::steady_clock::now();
  }
  std::cout << "i = " << i << '\n';
}

int main() {
#pragma omp parallel for num_threads(8)
  for (int i = 0; i < 8; ++i) {
    work();
  }
  return 0;
}
Compiled and profiled with:
/home/dowj/local/llvm/ompt_fixed/bin/clang++ -g -fopenmp=libomp -o CMakeFiles/omp_example.dir/omp_example.cpp.o -c /home/dowj/sources/omp_example/omp_example.cpp
LD_LIBRARY_PATH=/home/dowj/local/llvm/ompt_fixed/lib/aarch64-unknown-linux-gnu:$LD_LIBRARY_PATH nsys profile --output=gh200_debug_ompt_fixed --backtrace=fp --trace=openmp --force-overwrite=true ./omp_example
Could someone with better understanding of frame pointers and OMPT take a look?