[OpenMP] Offload incorrectly calls cuda from a static initializer #74507
Thank you, Felix, for raising the issue with LLVM. We will keep a watch on this. Thanks again. |
We observed similar issues with |
cc @jhuber6 here as he is working on linking plugins statically so could potentially touch this part |
Sorry, I never noticed this issue before. Can you provide a command line that reproduces it on your end? I recently had to contend with destructor behavior in #87009, which amounted to calling our deinitialization function from
Changing this would be a lot of pain, since we'd basically need to delay library loading until the first target region is locked, and then that would prevent us from being able to unregister the libraries in the same phase ordering. I've used nsys on |
Works as expected with NSYS on my |
I think Nsight Systems doesn't show the warning/error, but Nsight Compute does. I've mentioned this in #85770. Well, I seemingly broke my LLVM installation while updating to a newer LLVM. Here's the output from LLVM 16 & Nsight Compute from CUDA 12.2:
$ cat test.c
int main( void )
{
#pragma omp target
{}
}
$ clang -fopenmp --offload-arch=native test.c
clang-16: warning: CUDA version is newer than the latest partially supported version 11.8 [-Wunknown-cuda-version]
clang-16: warning: CUDA version is newer than the latest partially supported version 11.8 [-Wunknown-cuda-version]
clang-16: warning: CUDA version is newer than the latest partially supported version 11.8 [-Wunknown-cuda-version]
$ ncu --version
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2023 NVIDIA Corporation
Version 2023.2.0.0 (build 32895467) (public-release)
$ ncu ./a.out
==ERROR== The application returned an error code (11).
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.
This matches what the user was experiencing in the initial report. NVIDIA has added the error message in newer versions. |
Is it only the device initialization that's problematic? The description seems to state that any calls to |
I see it now using |
The mentioned post in the NVIDIA developer forums points to using LLVM 17.0.5, which would be before #74397, if you meant this PR.
I can't answer that question unfortunately. I know that we also call |
It would be very difficult to provide this interface without relying on some kind of static initializer. CUDA itself creates a static initializer with default priority that calls
We had to use
Looking at the other issue, I think that the OMPT case should at least be simplified with #87009 since we will no longer be initializing OMPT multiple times within dynamic library loads. I don't know if that will improve your experience or not, we'll see. |
This will fortunately not be an issue with our next major Score-P release anymore, as we thought a lot about how we may be able to work around this issue and have found a way to initialize our adapters at a later point. With LLVM 18 and onward, we will still have issues tracking CUDA events through CUPTI when users are writing OpenMP target code (partially because of that later initialization), but that's a separate issue (#85770). |
Also I mistakenly said that |
Sure, I can reproduce it with these steps:
$ cat test.c
int main( void )
{
#pragma omp target
{}
}
$ clang --version
clang version 18.1.2
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/apps/software/Core/Compilers/LLVM/18.1.2/bin
$ clang -fopenmp --offload-arch=sm_75 test.c
$ ncu --version
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2023 NVIDIA Corporation
Version 2024.1.0.0 (build 33681293) (public-release)
$ ncu ./a.out
==ERROR== Cuda is initialized before the tool, e.g. by calling a Cuda API from a static initializer.
==ERROR== Initializing Cuda during program initialization results in undefined behavior.
==ERROR== See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#initialization
==ERROR== The application returned an error code (1).
==WARNING== No kernels were profiled.
I'm using CUDA 12.4 on Ubuntu 22.04 LTS, with CUDA installed via their
|
Here's what I get with my build:
$ cat input.c
#include <omp.h>
int main() {
int isDevice = 0;
#pragma omp target map(from : isDevice)
{ isDevice = omp_is_initial_device(); }
return isDevice;
}
$ clang --version
clang version 19.0.0git
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/jhuber/Documents/llvm/clang/bin
$ clang -fopenmp --offload-arch=sm_89 input.c
$ ncu --version
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2023 NVIDIA Corporation
Version 2024.1.0.0 (build 33681293) (public-release)
$ ncu ./a.out
==PROF== Connected to process 1331 (/home/jhuber/Documents/llvm/llvm-project/build/a.out)
==PROF== Profiling "__omp_offloading_10302_aec9ca..." - 0: 0%....50%....100% - 8 passes
==PROF== Disconnected from process 1331
[1331] a.out@127.0.0.1
__omp_offloading_10302_aec9caf_main_l5 (1, 1, 1)x(128, 1, 1), Context 1, Stream 13, Device 0, CC 8.9
Section: GPU Speed Of Light Throughput
----------------------- ------------- ------------
Metric Name Metric Unit Metric Value
----------------------- ------------- ------------
DRAM Frequency cycle/nsecond 6.76
SM Frequency cycle/usecond 716.36
Elapsed Cycles cycle 16,689
Memory Throughput % 0.35
DRAM Throughput % 0.35
Duration usecond 23.30
L1/TEX Cache Throughput % 2.88
L2 Cache Throughput % 0.22
SM Active Cycles cycle 312.96
Compute (SM) Throughput % 0.10
----------------------- ------------- ------------ |
Interesting. I'll build a trunk version and check if the issue still occurs. |
Sure, let me know. This seems to work with and without my patch that removes the |
This seems to make a difference. On my personal machine, also with Arch Linux:
$ cat /etc/os-release | head -n 1
NAME="EndeavourOS"
$ clang --version
clang version 17.0.6
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
$ ncu --version
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2023 NVIDIA Corporation
Version 2024.1.0.0 (build 33681293) (public-release)
$ clang -fopenmp --offload-arch=sm_80 && ncu ./a.out
==PROF== Connected to process 63198 (/home/jreuter/a.out)
==PROF== Profiling "__omp_offloading_1030b_4c99da..." - 0: 0%....50%....100% - 8 passes
==PROF== Disconnected from process 63198
[63198] a.out@127.0.0.1
__omp_offloading_1030b_4c99da_main_l3 (1, 1, 1)x(128, 1, 1), Context 1, Stream 13, Device 0, CC 8.6
Section: GPU Speed Of Light Throughput
----------------------- ------------- ------------
Metric Name Metric Unit Metric Value
----------------------- ------------- ------------
DRAM Frequency cycle/nsecond 6,78
SM Frequency cycle/nsecond 1,50
Elapsed Cycles cycle 29.904
Memory Throughput % 7,12
DRAM Throughput % 7,12
Duration usecond 19,97
L1/TEX Cache Throughput % 2,43
L2 Cache Throughput % 2,29
SM Active Cycles cycle 610,13
Compute (SM) Throughput % 0,08
----------------------- ------------- ------------
At the same time, LLVM 19.0git also seems to fix the issue. Even on my Ubuntu machine, I'm able to run:
$ cat test.c
int main( void )
{
#pragma omp target
{}
}
$ clang --version
clang version 19.0.0git (https://github.com/llvm/llvm-project.git e9e4ab8147f0b8a8d1c8f1a70bef3999ffeeb651)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/jreuter/Projects/Compilers/llvm-project/_build/_install/bin
$ clang -fopenmp --offload-arch=sm_75 test.c
$ ncu ./a.out
==PROF== Connected to process 229629 (/home/jreuter/a.out)
==PROF== Profiling "__omp_offloading_10303_1b815e..." - 0: 0%....50%....100% - 9 passes
==PROF== Disconnected from process 229629
[229629] a.out@127.0.0.1
__omp_offloading_10303_1b815e3_main_l3 (1, 1, 1)x(128, 1, 1), Context 1, Stream 13, Device 0, CC 7.5
Section: GPU Speed Of Light Throughput
----------------------- ------------- ------------
Metric Name Metric Unit Metric Value
----------------------- ------------- ------------
DRAM Frequency cycle/nsecond 4,71
SM Frequency cycle/nsecond 1,05
Elapsed Cycles cycle 13.924
Memory Throughput % 2,82
DRAM Throughput % 2,82
Duration usecond 13,15
L1/TEX Cache Throughput % 3,24
L2 Cache Throughput % 0,89
SM Active Cycles cycle 609,88
Compute (SM) Throughput % 0,31
----------------------- ------------- ------------
OPT This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full
waves across all SMs. Look at Launch Statistics for more details.
@fs-nv: Are you able to check if this also solves the issue for you? |
Interesting. It's possible 421085f changed something, since I moved the initializer from priority 1 to priority 101, and anything outside of |
There is certainly a difference. Checking when cuInit is called with LLVM 19git (e9e4ab8):
Breakpoint 1, 0x00007ffff4c304a0 in cuInit () from /opt/apps/software/Core/Libraries/CUDA/12.4.0/nsight-compute-2024.1.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
(gdb) bt
#0 0x00007ffff4c304a0 in cuInit ()
from /opt/apps/software/Core/Libraries/CUDA/12.4.0/nsight-compute-2024.1.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
#1 0x00007ffff42c63c4 in llvm::omp::target::plugin::CUDAPluginTy::initImpl() ()
from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#2 0x00007ffff42e26df in llvm::omp::target::plugin::GenericPluginTy::init() ()
from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#3 0x00007ffff42ed37e in llvm::omp::target::plugin::PluginTy::PluginTy() ()
from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#4 0x00007ffff42e8899 in __tgt_rtl_init_plugin ()
from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#5 0x00007ffff7f4395d in PluginAdaptorTy::init() ()
from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#6 0x00007ffff7f4319b in PluginAdaptorTy::create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#7 0x00007ffff7f4476e in PluginManager::init() ()
from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#8 0x00007ffff7f407e0 in initRuntime() ()
from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#9 0x00007ffff7f2a6bc in __tgt_register_lib ()
from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#10 0x0000555555555099 in omp_offloading.descriptor_reg ()
#11 0x00007ffff4029ebb in call_init (env=<optimized out>, argv=0x7fffffffc998, argc=1) at ../csu/libc-start.c:145
#12 __libc_start_main_impl (main=0x5555555551a0 <main>, argc=1, argv=0x7fffffffc998, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>, stack_end=0x7fffffffc988) at ../csu/libc-start.c:379
#13 0x00005555555550d5 in _start ()
LLVM 18:
Breakpoint 1, 0x00007ffff4c304a0 in cuInit () from /opt/apps/software/Core/Libraries/CUDA/12.4.0/nsight-compute-2024.1.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
(gdb) bt
#0 0x00007ffff4c304a0 in cuInit ()
from /opt/apps/software/Core/Libraries/CUDA/12.4.0/nsight-compute-2024.1.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
#1 0x00007ffff42f8d04 in llvm::omp::target::plugin::CUDAPluginTy::initImpl() ()
from /opt/apps/software/Core/Compilers/LLVM/18.1.2/lib/../lib/libomptarget.rtl.cuda.so
#2 0x00007ffff42cffcf in llvm::omp::target::plugin::GenericPluginTy::init() ()
from /opt/apps/software/Core/Compilers/LLVM/18.1.2/lib/../lib/libomptarget.rtl.cuda.so
#3 0x00007ffff42d7a7e in llvm::omp::target::plugin::Plugin::Plugin() ()
from /opt/apps/software/Core/Compilers/LLVM/18.1.2/lib/../lib/libomptarget.rtl.cuda.so
#4 0x00007ffff42d0579 in __tgt_rtl_init_plugin ()
from /opt/apps/software/Core/Compilers/LLVM/18.1.2/lib/../lib/libomptarget.rtl.cuda.so
#5 0x00007ffff7f44efc in PluginAdaptorTy::init() () from /opt/apps/software/Core/Compilers/LLVM/18.1.2/lib/libomptarget.so.18.1
#6 0x00007ffff7f447f6 in PluginAdaptorTy::create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /opt/apps/software/Core/Compilers/LLVM/18.1.2/lib/libomptarget.so.18.1
#7 0x00007ffff7f45e29 in PluginManager::init() () from /opt/apps/software/Core/Compilers/LLVM/18.1.2/lib/libomptarget.so.18.1
#8 0x00007ffff7f42bc5 in init() () from /opt/apps/software/Core/Compilers/LLVM/18.1.2/lib/libomptarget.so.18.1
#9 0x00007ffff7fc947e in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffa088,
env=env@entry=0x7fffffffa098) at ./elf/dl-init.c:70
#10 0x00007ffff7fc9568 in call_init (env=0x7fffffffa098, argv=0x7fffffffa088, argc=1, l=<optimized out>) at ./elf/dl-init.c:33
#11 _dl_init (main_map=0x7ffff7ffe2e0, argc=1, argv=0x7fffffffa088, env=0x7fffffffa098) at ./elf/dl-init.c:117
#12 0x00007ffff7fe32ca in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#13 0x0000000000000001 in ?? ()
#14 0x00007fffffffa4f6 in ?? ()
#15 0x0000000000000000 in ?? () |
That's a surprise. Could you do me a favor and revert 421085f locally, then rebuild your test application to see if that makes a difference? To my knowledge that's the only change within the timeframe that affected the startup. Either that or #80460. I could probably try to dissect this more, but in general the direction the library is heading is definitely more sane, so I'm glad the changes I'm making seem to have actually fixed something I wasn't even aware of. |
It seems like #80499 changed this. When reverting this commit, we get the old behavior:
Breakpoint 1, 0x00007ffff4c304a0 in cuInit () from /opt/apps/software/Core/Libraries/CUDA/12.4.0/nsight-compute-2024.1.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
(gdb) bt
#0 0x00007ffff4c304a0 in cuInit ()
from /opt/apps/software/Core/Libraries/CUDA/12.4.0/nsight-compute-2024.1.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
#1 0x00007ffff42c63c4 in llvm::omp::target::plugin::CUDAPluginTy::initImpl() ()
from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#2 0x00007ffff42e26df in llvm::omp::target::plugin::GenericPluginTy::init() ()
from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#3 0x00007ffff42ed37e in llvm::omp::target::plugin::PluginTy::PluginTy() ()
from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#4 0x00007ffff42e8899 in __tgt_rtl_init_plugin ()
from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#5 0x00007ffff7f4366d in PluginAdaptorTy::init() ()
from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#6 0x00007ffff7f42eab in PluginAdaptorTy::create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#7 0x00007ffff7f4447e in PluginManager::init() ()
from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#8 0x00007ffff7f40578 in init() ()
from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#9 0x00007ffff7fc947e in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffc528,
env=env@entry=0x7fffffffc538) at ./elf/dl-init.c:70
#10 0x00007ffff7fc9568 in call_init (env=0x7fffffffc538, argv=0x7fffffffc528, argc=1, l=<optimized out>) at ./elf/dl-init.c:33
#11 _dl_init (main_map=0x7ffff7ffe2e0, argc=1, argv=0x7fffffffc528, env=0x7fffffffc538) at ./elf/dl-init.c:117
#12 0x00007ffff7fe32ca in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#13 0x0000000000000001 in ?? ()
#14 0x00007fffffffc959 in ?? ()
#15 0x0000000000000000 in ?? ()
(gdb) continue
Continuing.
==ERROR== Cuda is initialized before the tool, e.g. by calling a Cuda API from a static initializer.
==ERROR== Initializing Cuda during program initialization results in undefined behavior.
==ERROR== See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#initialization
[Inferior 1 (process 255049) exited with code 01]
$ git log --oneline
6bd1fc01757d (HEAD -> main) Revert "[Libomptarget] Remove global ctor and use reference counting (#80499)"
e9e4ab8147f0 (origin/main, origin/HEAD) Revert "Reland "[PassManager] Support MachineFunctionProperties (#83668)"" (#87138)
421085f did not change anything. I haven't tried to revert #80460, as I don't have enough time right now due to Easter stuff 😄 |
Ah, sorry, that's the one I should have linked instead of #80460. It was the follow-up that actually changed the runtime. Very nice to see that my insistence on rewriting this stuff to use a deterministic init/deinit order fixed problems like these. Future patches will improve this further, so that we don't have any global behavior in the OpenMP runtime itself and only a single hook to call OMPT. I'm unsure if we should close this issue, since I don't fully understand what NVIDIA's assertions are here, but it seems to work, so I'll leave that to your discretion. |
Oh yeah, I totally forgot that the patch in question removed the global constructor from |
Hi, thanks for all the info, and sorry for the delay in validating this. I will try to get to it ASAP. Could someone clarify, though, which exact version I should be testing with? |
$ git clone <llvm-repo>
$ cd llvm-project && mkdir build && cd build
$ cmake ../llvm -DCMAKE_INSTALL_PREFIX=<path> -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_ENABLE_RUNTIMES="openmp" -DLLVM_OPTIMIZED_TABLEGEN=ON -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_TARGETS_TO_BUILD="NVPTX;X86" -G Ninja
$ ninja install
This should work as a minimal configuration to build a trunk LLVM compiler with OpenMP. |
We tried with this version and found no issues when using ncu on apps compiled with it. Thanks for fixing this. |
Originating from this Nsight Compute profiler forum post, I believe LLVM's OpenMP CUDA offloading incorrectly calls the CUDA API from its library's static initializer. This can easily be seen when inspecting the call stack at cuInit for a simple application built with clang++ -fopenmp. According to the CUDA documentation, this is not allowed:
While this may happen to work in isolation, NVIDIA's CUDA profiler, Nsight Compute, relies on applications using this interface correctly with respect to program and library initialization. In this case, the profiler will either crash or at least fail to profile: it suspends the app in the first CUDA API call, and with this API call happening during program initialization, the application is never even fully initialized, which prevents the profiler's frontend process from attaching to it.
The request here is to change LLVM's usage of CUDA in OpenMP offloading to avoid CUDA API calls in static library or program initializers, to comply with the CUDA documentation and avoid undefined behavior caused by unspecified library initialization ordering.
This is the sample application, but any similar one would probably reproduce this, too:
It is compiled as below, where sm_75 may need to be replaced with the actual target GPU architecture.