
[OpenMP] Offload incorrectly calls cuda from a static initializer #74507

Closed
fs-nv opened this issue Dec 5, 2023 · 28 comments
Labels
cuda openmp:libomptarget OpenMP offload runtime

Comments

@fs-nv

fs-nv commented Dec 5, 2023

Originating from this Nsight Compute profiler forum post, I believe LLVM's OpenMP CUDA offloading incorrectly calls the CUDA API from its library's static initializer. This is easily seen by inspecting the call stack at cuInit for a simple application built with clang++ -fopenmp.

#0  0x00007ffff28d9310 in cuInit () from /lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007ffff42f45f0 in llvm::omp::target::plugin::CUDAPluginTy::initImpl() () from /opt/lib/libomptarget.rtl.cuda.so
#2  0x00007ffff42ff82f in llvm::omp::target::plugin::GenericPluginTy::init() () from /opt/lib/libomptarget.rtl.cuda.so
#3  0x00007ffff42f6529 in llvm::omp::target::plugin::Plugin::Plugin() () from /opt/lib/libomptarget.rtl.cuda.so
#4  0x00007ffff42ffa8e in __tgt_rtl_init_plugin () from /opt/lib/libomptarget.rtl.cuda.so
#5  0x00007ffff7ae374d in RTLsTy::attemptLoadRTL(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RTLInfoTy&) () from /opt/lib/libomptarget.so.18git
#6  0x00007ffff7ae3435 in RTLsTy::loadRTLs() () from /opt/lib/libomptarget.so.18git
#7  0x00007ffff7ae31a8 in init() () from /opt/lib/libomptarget.so.18git
#8  0x00007ffff7fe0b9a in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffe2b8, env=env@entry=0x7fffffffe2c8) at dl-init.c:72
#9  0x00007ffff7fe0ca1 in call_init (env=0x7fffffffe2c8, argv=0x7fffffffe2b8, argc=1, l=<optimized out>) at dl-init.c:30
#10 _dl_init (main_map=0x7ffff7ffe190, argc=1, argv=0x7fffffffe2b8, env=0x7fffffffe2c8) at dl-init.c:119
#11 0x00007ffff7fd013a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#12 0x0000000000000001 in ?? ()
#13 0x00007fffffffe574 in ?? ()
#14 0x0000000000000000 in ?? ()

According to the CUDA documentation, this is not allowed:

The CUDA interfaces use global state that is initialized during host program initiation and destroyed during host program termination. The CUDA runtime and driver cannot detect if this state is invalid, so using any of these interfaces (implicitly or explicitly) during program initiation (or termination after main) will result in undefined behavior.

While this may happen to work in isolation, NVIDIA's CUDA profiler, Nsight Compute, relies on applications using this interface correctly with respect to program and library initialization. In this case, the profiler either crashes or at least fails to profile: it suspends the app at the first CUDA API call, and since that call happens during program initialization, the application never finishes initializing, which prevents the profiler's frontend process from attaching to it.

The request here is to change LLVM's usage of CUDA in OpenMP offloading so that no CUDA API calls are made from static library or program initializers, complying with the CUDA documentation and avoiding the undefined behavior caused by unspecified library initialization ordering.

This is the sample application, but any similar one would probably reproduce this, too:

#include <cstdio>
#include <cstdlib>
#include <omp.h>

void saxpy(float a, float* x, float* y, int sz) {
	double tb = omp_get_wtime();
#pragma omp target teams distribute parallel for map(to:x[0:sz]) map(tofrom:y[0:sz])
	for (int i = 0; i < sz; i++) {
		y[i] = a * x[i] + y[i];
	}
	double te = omp_get_wtime();
	printf("Time of kernel: %lf\n", te - tb);
}

int main() {
	auto x = (float*) malloc(1000 * sizeof(float));
	auto y = (float*) calloc(1000, sizeof(float));

	for (int i = 0; i < 1000; i++) {
		x[i] = i;
	}

	saxpy(42, x, y, 1000);

	free(x);
	free(y);
	return 0;
}

It is compiled as below, where sm_75 may need to be replaced with the actual target GPU architecture.

clang++ -O3 -fopenmp -fopenmp-targets=nvptx64 saxpy.cpp -o saxpy --offload-arch=sm_75 -fopenmp-offload-mandatory
@EugeneZelenko EugeneZelenko added cuda openmp:libomptarget OpenMP offload runtime and removed new issue labels Dec 5, 2023
@jebalunode

Thank you, Felix, for raising this issue with LLVM. I will keep a watch on it. Thanks again.

@Thyre

Thyre commented Dec 7, 2023

We observed similar issues with libomptarget.so and OMPT in Score-P, where the early initialization causes both our CUDA and ROCm adapters to crash.

#69318

@shiltian shiltian changed the title openmp cuda offload incorrectly calls cuda from a static initializer [OpenMP] Offload incorrectly calls cuda from a static initializer Mar 29, 2024
@shiltian shiltian assigned jhuber6 and unassigned jhuber6 Mar 29, 2024
@shiltian
Contributor

cc @jhuber6 here, as he is working on linking the plugins statically and could potentially touch this part

@jhuber6
Contributor

jhuber6 commented Mar 29, 2024

Sorry, I never noticed this issue before.

Can you provide a command line that reproduces it on your end? I recently had to contend with destructor behavior in #87009, which amounted to calling our deinitialization function from atexit after CUDA has registered its own. I haven't seen any issues with the constructor portion, however, but that doesn't mean it's sound. The behavior after the above patch lands should at least be deterministic, as we no longer rely on whatever order dynamic libraries happen to be torn down in.

Changing this would be a lot of pain, since we'd basically need to delay library loading until the first target region is launched, and then that would prevent us from unregistering the libraries in the same phase ordering. I've used nsys on libomptarget before without issue, so what are you hitting?

@jhuber6
Contributor

jhuber6 commented Mar 30, 2024

> nsys nvprof saxpy | grep 'omp_offloading' -C2 -B2             
WARNING: saxpy and any of its children processes will be profiled.

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                        Name                      
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ------------------------------------------------
    100.0            2,560          1   2,560.0   2,560.0     2,560     2,560          0.0  __omp_offloading_10302_aec9c22__Z5saxpyfPfS_i_l9

Works as expected with NSYS on my sm_89 card.

@Thyre

Thyre commented Mar 30, 2024

> nsys nvprof saxpy | grep 'omp_offloading' -C2 -B2             
WARNING: saxpy and any of its children processes will be profiled.

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                        Name                      
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ------------------------------------------------
    100.0            2,560          1   2,560.0   2,560.0     2,560     2,560          0.0  __omp_offloading_10302_aec9c22__Z5saxpyfPfS_i_l9

Works as expected with NSYS on my sm_89 card.

I think Nsight Systems doesn't show the warning/error, but Nsight Compute does. I've mentioned this in #85770.
Will update this with a reproducer in a moment.

Well, I seemingly broke my LLVM installation while updating to a newer LLVM. Here's the output from LLVM 16 and Nsight Compute from CUDA 12.2:

$ cat test.c
int main( void )
{
    #pragma omp target
        {}
}
$ clang -fopenmp --offload-arch=native test.c
clang-16: warning: CUDA version is newer than the latest partially supported version 11.8 [-Wunknown-cuda-version]
clang-16: warning: CUDA version is newer than the latest partially supported version 11.8 [-Wunknown-cuda-version]
clang-16: warning: CUDA version is newer than the latest partially supported version 11.8 [-Wunknown-cuda-version]
$ ncu --version
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2023 NVIDIA Corporation
Version 2023.2.0.0 (build 32895467) (public-release)
$ ncu ./a.out 
==ERROR== The application returned an error code (11).
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

This matches what the user was experiencing in the initial report. NVIDIA has added the error message in newer versions.

@jhuber6
Contributor

jhuber6 commented Mar 30, 2024

I think Nsight Systems doesn't show the warning/error, but Nsight Compute does. I've mentioned this in #85770. Will update this with a reproducer in a moment.

Is it only the device initialization that's problematic? The description seems to state that any calls to cuInit are questionable.

@jhuber6
Contributor

jhuber6 commented Mar 30, 2024

I see it now using sudo ncu ./a.out. Also confirmed that the above patch does nothing to change that behavior. Did this only start happening when @jdoerfert made us eagerly initialize the plugins?

@Thyre

Thyre commented Mar 30, 2024

I see it now using sudo ncu ./a.out. Also confirmed that the above patch does nothing to change that behavior. Did this only start happening when @jdoerfert made us eagerly initialize the plugins?

The mentioned post in the NVIDIA developer forums points to using LLVM 17.0.5, which would be before #74397, if you meant this PR.
In Score-P, we noticed issues with OMPT + CUDA & HIP starting with LLVM 16 (which pointed me to 5b67bce).

Is it only the device initialization that's problematic? The description seems to state that any calls to cuInit are questionable.

I can't answer that question unfortunately. I know that we also call cuInit in our CUDA adapter, but don't have a stack trace ready for where our tool actually crashed with older LLVM versions.

@jhuber6
Contributor

jhuber6 commented Mar 30, 2024

The mentioned post in the NVIDIA developer forums points to using LLVM 17.0.5, which would be before #74397, if you meant this PR.
In Score-P, we noticed issues with OMPT + CUDA & HIP starting with LLVM 16 (which pointed me to 5b67bce).

It would be very difficult to provide this interface without relying on some kind of static initializer. CUDA itself creates a static initializer with default priority that calls __cudaRegisterFatBinary. So presumably this function doesn't need to initialize anything else. Unfortunately I can only guess at what goes on inside of NVIDIA's black box here. Right now what we do is basically


void deinit() { ... }

[[gnu::constructor(101)]] void init() {
  ...
  cuInit(0);
  ...
  atexit(&deinit);
}

We had to use atexit for similar reasons: because CUDA uses some global destructors (I think libcrypto is the one I can see in the logs), we need to make sure that we destroy the runtime before those run.

I can't answer that question unfortunately. I know that we also call cuInit in our CUDA adapter, but don't have a stack trace ready for where our tool actually crashed with older LLVM versions.

Looking at the other issue, I think that the OMPT case should at least be simplified with #87009 since we will no longer be initializing OMPT multiple times within dynamic library loads. I don't know if that will improve your experience or not, we'll see.

@Thyre

Thyre commented Mar 30, 2024

I can't answer that question unfortunately. I know that we also call cuInit in our CUDA adapter, but don't have a stack trace ready for where our tool actually crashed with older LLVM versions.

Looking at the other issue, I think that the OMPT case should at least be simplified with #87009 since we will no longer be initializing OMPT multiple times within dynamic library loads. I don't know if that will improve your experience or not, we'll see.

This will fortunately not be an issue with our next major Score-P release anymore, as we thought a lot about how we may be able to work around this issue and have found a way to initialize our adapters at a later point. With LLVM 18 and onward, we will still have issues tracking CUDA events through CUPTI when users are writing OpenMP target code (partially because of that later initialization), but that's a separate issue (#85770).

@jhuber6
Contributor

jhuber6 commented Mar 30, 2024

Also, I mistakenly said that sudo ncu ./a.out didn't work; this was actually due to it failing to load a shared library, since it didn't match my library path internally. After fixing that, it worked and printed a bunch of stuff. If you have a small reproducer, that would help.

@Thyre

Thyre commented Mar 30, 2024

Sure, I can reproduce it with these steps:

$ cat test.c                                      
int main( void )
{
    #pragma omp target
        {}
}
$ clang --version
clang version 18.1.2
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/apps/software/Core/Compilers/LLVM/18.1.2/bin
$ clang -fopenmp --offload-arch=sm_75 test.c
$ ncu --version
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2023 NVIDIA Corporation
Version 2024.1.0.0 (build 33681293) (public-release)
$ ncu ./a.out
==ERROR== Cuda is initialized before the tool, e.g. by calling a Cuda API from a static initializer.
==ERROR== Initializing Cuda during program initialization results in undefined behavior.
==ERROR== See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#initialization
==ERROR== The application returned an error code (1).
==WARNING== No kernels were profiled.

I'm using CUDA 12.4 on Ubuntu 22.04LTS, with CUDA installed via their .run file and LLVM built manually with the following flags:

cmake -G "Ninja" ../llvm -DCMAKE_BUILD_TYPE=Release \
                    -DCMAKE_C_COMPILER=gcc-12 \
                    -DCMAKE_CXX_COMPILER=g++-12 \
                    -DLIBOMPTARGET_ENABLE_DEBUG:Bool=On \
                    -DLLVM_ENABLE_PROJECTS="clang;lld" \
                    -DLLVM_ENABLE_RUNTIMES:STRING="libunwind;libcxxabi;libcxx;compiler-rt;openmp" \
                    -DLLVM_LINK_LLVM_DYLIB:BOOL=On \
                    -DCLANG_LINK_CLANG_DYLIB:BOOL=On \
                    -DLLVM_CCACHE_BUILD=Off  \
                    -DLLVM_ENABLE_ASSERTIONS:BOOL=ON \
                    -DLLVM_PARALLEL_LINK_JOBS=1 \
                    -DLLVM_TARGETS_TO_BUILD="X86;NVPTX;" \
                    -DLLVM_ENABLE_PLUGINS:BOOL=On \
                    -DCMAKE_INSTALL_PREFIX=/opt/apps/software/Core/Compilers/LLVM/18.1.2/ \
                    -DLLVM_ENABLE_RTTI:Bool=On

@jhuber6
Contributor

jhuber6 commented Mar 30, 2024

Sure, I can reproduce it with these steps:

Here's what I get with my build

$ cat input.c 
#include <omp.h>

int main() {
  int isDevice = 0;
#pragma omp target map(from : isDevice)
  { isDevice = omp_is_initial_device(); }
  return isDevice;
}
$ clang --version
clang version 19.0.0git
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/jhuber/Documents/llvm/clang/bin
$ clang -fopenmp --offload-arch=sm_89 input.c 
$ ncu --version
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2023 NVIDIA Corporation
Version 2024.1.0.0 (build 33681293) (public-release)
$ ncu ./a.out 
==PROF== Connected to process 1331 (/home/jhuber/Documents/llvm/llvm-project/build/a.out)
==PROF== Profiling "__omp_offloading_10302_aec9ca..." - 0: 0%....50%....100% - 8 passes
==PROF== Disconnected from process 1331
[1331] a.out@127.0.0.1
  __omp_offloading_10302_aec9caf_main_l5 (1, 1, 1)x(128, 1, 1), Context 1, Stream 13, Device 0, CC 8.9
    Section: GPU Speed Of Light Throughput
    ----------------------- ------------- ------------
    Metric Name               Metric Unit Metric Value
    ----------------------- ------------- ------------
    DRAM Frequency          cycle/nsecond         6.76
    SM Frequency            cycle/usecond       716.36
    Elapsed Cycles                  cycle       16,689
    Memory Throughput                   %         0.35
    DRAM Throughput                     %         0.35
    Duration                      usecond        23.30
    L1/TEX Cache Throughput             %         2.88
    L2 Cache Throughput                 %         0.22
    SM Active Cycles                cycle       312.96
    Compute (SM) Throughput             %         0.10
    ----------------------- ------------- ------------

@Thyre

Thyre commented Mar 30, 2024

Interesting.. I'll build a trunk version and check if the issue still occurs.

@jhuber6
Contributor

jhuber6 commented Mar 30, 2024

Interesting.. I'll build a trunk version and check if the issue still occurs.

Sure, let me know. This seems to happen with and without my patch that removes the dlopen mess. However, if @fs-nv is correct and static constructors are the problem, then I see no reason it would change between versions, since we've pretty much always done it this way. Tough to say. I'm on Arch Linux with glibc 2.39, if that's important.

@Thyre

Thyre commented Mar 30, 2024

This seems to make a difference. On my personal machine, also with Arch Linux, ncu works perfectly fine with the repository Clang (17.0.6) and CUDA.

$ cat /etc/os-release | head -n 1
NAME="EndeavourOS"
$ clang --version
clang version 17.0.6
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
$ ncu --version
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2023 NVIDIA Corporation
Version 2024.1.0.0 (build 33681293) (public-release)
$ clang -fopenmp --offload-arch=sm_80 && ncu ./a.out
==PROF== Connected to process 63198 (/home/jreuter/a.out)
==PROF== Profiling "__omp_offloading_1030b_4c99da..." - 0: 0%....50%....100% - 8 passes
==PROF== Disconnected from process 63198
[63198] a.out@127.0.0.1
  __omp_offloading_1030b_4c99da_main_l3 (1, 1, 1)x(128, 1, 1), Context 1, Stream 13, Device 0, CC 8.6
    Section: GPU Speed Of Light Throughput
    ----------------------- ------------- ------------
    Metric Name               Metric Unit Metric Value
    ----------------------- ------------- ------------
    DRAM Frequency          cycle/nsecond         6,78
    SM Frequency            cycle/nsecond         1,50
    Elapsed Cycles                  cycle       29.904
    Memory Throughput                   %         7,12
    DRAM Throughput                     %         7,12
    Duration                      usecond        19,97
    L1/TEX Cache Throughput             %         2,43
    L2 Cache Throughput                 %         2,29
    SM Active Cycles                cycle       610,13
    Compute (SM) Throughput             %         0,08
    ----------------------- ------------- ------------

At the same time, LLVM 19.0git also seems to fix the issue. Even on my Ubuntu machine, I'm able to run ncu just fine.

$ cat test.c
int main( void )
{
    #pragma omp target
        {}
}
$ clang --version
clang version 19.0.0git (https://github.com/llvm/llvm-project.git e9e4ab8147f0b8a8d1c8f1a70bef3999ffeeb651)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/jreuter/Projects/Compilers/llvm-project/_build/_install/bin
$ clang -fopenmp --offload-arch=sm_75 test.c
$ ncu ./a.out
==PROF== Connected to process 229629 (/home/jreuter/a.out)
==PROF== Profiling "__omp_offloading_10303_1b815e..." - 0: 0%....50%....100% - 9 passes
==PROF== Disconnected from process 229629
[229629] a.out@127.0.0.1
  __omp_offloading_10303_1b815e3_main_l3 (1, 1, 1)x(128, 1, 1), Context 1, Stream 13, Device 0, CC 7.5
    Section: GPU Speed Of Light Throughput
    ----------------------- ------------- ------------
    Metric Name               Metric Unit Metric Value
    ----------------------- ------------- ------------
    DRAM Frequency          cycle/nsecond         4,71
    SM Frequency            cycle/nsecond         1,05
    Elapsed Cycles                  cycle       13.924
    Memory Throughput                   %         2,82
    DRAM Throughput                     %         2,82
    Duration                      usecond        13,15
    L1/TEX Cache Throughput             %         3,24
    L2 Cache Throughput                 %         0,89
    SM Active Cycles                cycle       609,88
    Compute (SM) Throughput             %         0,31
    ----------------------- ------------- ------------

    OPT   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full      
          waves across all SMs. Look at Launch Statistics for more details.                                                                                                                                   

@fs-nv: Are you able to check if this also solves the issue for you?

@jhuber6
Contributor

jhuber6 commented Mar 30, 2024

This seems to make a difference. On my personal machine, also with Arch Linux, ncu works perfectly fine with the repository Clang (17.0.6) and CUDA.
@fs-nv: Are you able to check if this also solves the issue for you?

Interesting, it's possible 421085f changed something, since I moved the initializer from priority 1 to priority 101; constructor priorities below 101 aren't really allowed outside of libc.

@Thyre

Thyre commented Mar 30, 2024

Interesting, it's possible 421085f changed something since I moved the initializer from priority 1 to priority 101, since anything outside of libc less than 101 isn't really allowed.

There is certainly a difference. Checking when cuInit() is called for the first time, it is now in _start instead of _dl_start_user. This should also fix #69318, as it doesn't cause Score-P v8.4 to crash anymore.

LLVM 19git (e9e4ab8):

Breakpoint 1, 0x00007ffff4c304a0 in cuInit () from /opt/apps/software/Core/Libraries/CUDA/12.4.0/nsight-compute-2024.1.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
(gdb) bt
#0  0x00007ffff4c304a0 in cuInit ()
   from /opt/apps/software/Core/Libraries/CUDA/12.4.0/nsight-compute-2024.1.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
#1  0x00007ffff42c63c4 in llvm::omp::target::plugin::CUDAPluginTy::initImpl() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#2  0x00007ffff42e26df in llvm::omp::target::plugin::GenericPluginTy::init() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#3  0x00007ffff42ed37e in llvm::omp::target::plugin::PluginTy::PluginTy() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#4  0x00007ffff42e8899 in __tgt_rtl_init_plugin ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#5  0x00007ffff7f4395d in PluginAdaptorTy::init() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#6  0x00007ffff7f4319b in PluginAdaptorTy::create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#7  0x00007ffff7f4476e in PluginManager::init() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#8  0x00007ffff7f407e0 in initRuntime() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#9  0x00007ffff7f2a6bc in __tgt_register_lib ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#10 0x0000555555555099 in omp_offloading.descriptor_reg ()
#11 0x00007ffff4029ebb in call_init (env=<optimized out>, argv=0x7fffffffc998, argc=1) at ../csu/libc-start.c:145
#12 __libc_start_main_impl (main=0x5555555551a0 <main>, argc=1, argv=0x7fffffffc998, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7fffffffc988) at ../csu/libc-start.c:379
#13 0x00005555555550d5 in _start ()

LLVM 18:

Breakpoint 1, 0x00007ffff4c304a0 in cuInit () from /opt/apps/software/Core/Libraries/CUDA/12.4.0/nsight-compute-2024.1.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
(gdb) bt
#0  0x00007ffff4c304a0 in cuInit ()
   from /opt/apps/software/Core/Libraries/CUDA/12.4.0/nsight-compute-2024.1.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
#1  0x00007ffff42f8d04 in llvm::omp::target::plugin::CUDAPluginTy::initImpl() ()
   from /opt/apps/software/Core/Compilers/LLVM/18.1.2/lib/../lib/libomptarget.rtl.cuda.so
#2  0x00007ffff42cffcf in llvm::omp::target::plugin::GenericPluginTy::init() ()
   from /opt/apps/software/Core/Compilers/LLVM/18.1.2/lib/../lib/libomptarget.rtl.cuda.so
#3  0x00007ffff42d7a7e in llvm::omp::target::plugin::Plugin::Plugin() ()
   from /opt/apps/software/Core/Compilers/LLVM/18.1.2/lib/../lib/libomptarget.rtl.cuda.so
#4  0x00007ffff42d0579 in __tgt_rtl_init_plugin ()
   from /opt/apps/software/Core/Compilers/LLVM/18.1.2/lib/../lib/libomptarget.rtl.cuda.so
#5  0x00007ffff7f44efc in PluginAdaptorTy::init() () from /opt/apps/software/Core/Compilers/LLVM/18.1.2/lib/libomptarget.so.18.1
#6  0x00007ffff7f447f6 in PluginAdaptorTy::create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /opt/apps/software/Core/Compilers/LLVM/18.1.2/lib/libomptarget.so.18.1
#7  0x00007ffff7f45e29 in PluginManager::init() () from /opt/apps/software/Core/Compilers/LLVM/18.1.2/lib/libomptarget.so.18.1
#8  0x00007ffff7f42bc5 in init() () from /opt/apps/software/Core/Compilers/LLVM/18.1.2/lib/libomptarget.so.18.1
#9  0x00007ffff7fc947e in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffa088, 
    env=env@entry=0x7fffffffa098) at ./elf/dl-init.c:70
#10 0x00007ffff7fc9568 in call_init (env=0x7fffffffa098, argv=0x7fffffffa088, argc=1, l=<optimized out>) at ./elf/dl-init.c:33
#11 _dl_init (main_map=0x7ffff7ffe2e0, argc=1, argv=0x7fffffffa088, env=0x7fffffffa098) at ./elf/dl-init.c:117
#12 0x00007ffff7fe32ca in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#13 0x0000000000000001 in ?? ()
#14 0x00007fffffffa4f6 in ?? ()
#15 0x0000000000000000 in ?? ()

@jhuber6
Contributor

jhuber6 commented Mar 30, 2024

Interesting, it's possible 421085f changed something since I moved the initializer from priority 1 to priority 101, since anything outside of libc less than 101 isn't really allowed.

There is certainly a difference. Checking when cuInit() is called for the first time, it is now in _start instead of _dl_start_user. This should also fix #69318, as it doesn't cause Score-P v8.4 to crash anymore.

That's a surprise, could you do me a favor and revert 421085f locally and then rebuild your test application to see if that makes a difference? To my knowledge that's the only change within the timeframe that affected the startup. Either that or #80460.

I could probably try to dissect this more, but in general the direction the library is heading is definitely more sane so I'm glad the changes I'm making seem to have actually fixed something I wasn't even aware of.

@Thyre

Thyre commented Mar 30, 2024

It seems like #80499 changed this. When reverting this commit, we get the old behavior:

Breakpoint 1, 0x00007ffff4c304a0 in cuInit () from /opt/apps/software/Core/Libraries/CUDA/12.4.0/nsight-compute-2024.1.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
(gdb) bt
#0  0x00007ffff4c304a0 in cuInit ()
   from /opt/apps/software/Core/Libraries/CUDA/12.4.0/nsight-compute-2024.1.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
#1  0x00007ffff42c63c4 in llvm::omp::target::plugin::CUDAPluginTy::initImpl() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#2  0x00007ffff42e26df in llvm::omp::target::plugin::GenericPluginTy::init() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#3  0x00007ffff42ed37e in llvm::omp::target::plugin::PluginTy::PluginTy() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#4  0x00007ffff42e8899 in __tgt_rtl_init_plugin ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#5  0x00007ffff7f4366d in PluginAdaptorTy::init() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#6  0x00007ffff7f42eab in PluginAdaptorTy::create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#7  0x00007ffff7f4447e in PluginManager::init() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#8  0x00007ffff7f40578 in init() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#9  0x00007ffff7fc947e in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffc528, 
    env=env@entry=0x7fffffffc538) at ./elf/dl-init.c:70
#10 0x00007ffff7fc9568 in call_init (env=0x7fffffffc538, argv=0x7fffffffc528, argc=1, l=<optimized out>) at ./elf/dl-init.c:33
#11 _dl_init (main_map=0x7ffff7ffe2e0, argc=1, argv=0x7fffffffc528, env=0x7fffffffc538) at ./elf/dl-init.c:117
#12 0x00007ffff7fe32ca in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#13 0x0000000000000001 in ?? ()
#14 0x00007fffffffc959 in ?? ()
#15 0x0000000000000000 in ?? ()
(gdb) continue
Continuing.
==ERROR== Cuda is initialized before the tool, e.g. by calling a Cuda API from a static initializer.
==ERROR== Initializing Cuda during program initialization results in undefined behavior.
==ERROR== See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#initialization
[Inferior 1 (process 255049) exited with code 01]
$ git log --oneline
6bd1fc01757d (HEAD -> main) Revert "[Libomptarget] Remove global ctor and use reference counting (#80499)"
e9e4ab8147f0 (origin/main, origin/HEAD) Revert "Reland "[PassManager] Support MachineFunctionProperties (#83668)"" (#87138)

421085f did not change anything. I haven't tried to revert #80460 as I don't have enough time right now due to Easter stuff 😄

@jhuber6
Contributor

jhuber6 commented Mar 30, 2024

It seems like #80499 changed this. When reverting this commit, we get the old behavior:

Breakpoint 1, 0x00007ffff4c304a0 in cuInit () from /opt/apps/software/Core/Libraries/CUDA/12.4.0/nsight-compute-2024.1.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
(gdb) bt
#0  0x00007ffff4c304a0 in cuInit ()
   from /opt/apps/software/Core/Libraries/CUDA/12.4.0/nsight-compute-2024.1.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
#1  0x00007ffff42c63c4 in llvm::omp::target::plugin::CUDAPluginTy::initImpl() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#2  0x00007ffff42e26df in llvm::omp::target::plugin::GenericPluginTy::init() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#3  0x00007ffff42ed37e in llvm::omp::target::plugin::PluginTy::PluginTy() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#4  0x00007ffff42e8899 in __tgt_rtl_init_plugin ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#5  0x00007ffff7f4366d in PluginAdaptorTy::init() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#6  0x00007ffff7f42eab in PluginAdaptorTy::create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#7  0x00007ffff7f4447e in PluginManager::init() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#8  0x00007ffff7f40578 in init() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#9  0x00007ffff7fc947e in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffc528, 
    env=env@entry=0x7fffffffc538) at ./elf/dl-init.c:70
[quoted backtrace frames #10–#15, ==ERROR== output, and git log identical to the previous comment]

Ah, sorry, that's the one I should have linked instead of #80460. It was the follow-up that actually changed the runtime. Very nice to see that my insistence on rewriting this stuff to use a deterministic init / deinit order fixed problems like these. Future patches will improve this further so that we don't have any global behavior in the OpenMP runtime itself, and so that there is only a single hook to call OMPT.

I'm unsure whether we should close this issue, since I don't fully understand what Nvidia's assertions are here, but it seems to work, so I'll leave that to your discretion.

@Thyre commented Mar 30, 2024

I would leave this to @fs-nv, as he opened the issue initially and probably knows a lot more about what Nvidia expects of a runtime implementing things related to CUDA. For us, the issue seems to be fixed. I closed my issue #69318 with a comment.

@jhuber6 (Contributor) commented Mar 30, 2024

Oh yeah, I totally forgot that the patch in question removed the global constructor from libomptarget itself, tying it to the user's initialization rather than to the dynamic library's initialization. This is why the constructors are now called from _start via the user's __init_array_start[] array instead of inside the dynamic loader. Funny how I can forget the details of patches I made just a month ago.

@fs-nv (Author) commented Apr 5, 2024

Hi, thanks for all the info, and sorry for the delay in validating this. I will try to get to it ASAP. Could someone however clarify for me please what is the exact version I should be testing with?

@Thyre commented Apr 5, 2024

@fs-nv There's no release with this change yet, but if you compile any main commit after ea174c0 (February 22nd), it should work.

@jhuber6 (Contributor) commented Apr 7, 2024

Hi, thanks for all the info, and sorry for the delay in validating this. I will try to get to it ASAP. Could someone however clarify for me please what is the exact version I should be testing with?

$ git clone <llvm-repo>
$ cd llvm-project && mkdir build && cd build
$ cmake ../llvm -DCMAKE_INSTALL_PREFIX=<path> -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_ENABLE_RUNTIMES="openmp" -DLLVM_OPTIMIZED_TABLEGEN=ON -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_TARGETS_TO_BUILD="NVPTX;X86" -G Ninja
$ ninja install

This should work as a minimal configuration to build a trunk LLVM compiler with OpenMP.

@fs-nv (Author) commented Apr 10, 2024

We tried with this version and found no issues when using ncu on apps compiled with it. Thanks for fixing this.

@fs-nv fs-nv closed this as completed Apr 10, 2024
Labels
cuda openmp:libomptarget OpenMP offload runtime