
[OpenMP] Offload incorrectly calls cuda from a static initializer #74507

Closed
fs-nv opened this issue Dec 5, 2023 · 28 comments
Labels
cuda openmp:libomptarget OpenMP offload runtime

Comments

@fs-nv

fs-nv commented Dec 5, 2023

Originating from this Nsight Compute profiler forum post, I believe LLVM's OpenMP CUDA offloading incorrectly calls the CUDA API from its library's static initializer. This is easily seen by inspecting the call stack at cuInit for a simple application built with clang++ -fopenmp.

#0  0x00007ffff28d9310 in cuInit () from /lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007ffff42f45f0 in llvm::omp::target::plugin::CUDAPluginTy::initImpl() () from /opt/lib/libomptarget.rtl.cuda.so
#2  0x00007ffff42ff82f in llvm::omp::target::plugin::GenericPluginTy::init() () from /opt/lib/libomptarget.rtl.cuda.so
#3  0x00007ffff42f6529 in llvm::omp::target::plugin::Plugin::Plugin() () from /opt/lib/libomptarget.rtl.cuda.so
#4  0x00007ffff42ffa8e in __tgt_rtl_init_plugin () from /opt/lib/libomptarget.rtl.cuda.so
#5  0x00007ffff7ae374d in RTLsTy::attemptLoadRTL(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RTLInfoTy&) () from /opt/lib/libomptarget.so.18git
#6  0x00007ffff7ae3435 in RTLsTy::loadRTLs() () from /opt/lib/libomptarget.so.18git
#7  0x00007ffff7ae31a8 in init() () from /opt/lib/libomptarget.so.18git
#8  0x00007ffff7fe0b9a in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffe2b8, env=env@entry=0x7fffffffe2c8) at dl-init.c:72
#9  0x00007ffff7fe0ca1 in call_init (env=0x7fffffffe2c8, argv=0x7fffffffe2b8, argc=1, l=<optimized out>) at dl-init.c:30
#10 _dl_init (main_map=0x7ffff7ffe190, argc=1, argv=0x7fffffffe2b8, env=0x7fffffffe2c8) at dl-init.c:119
#11 0x00007ffff7fd013a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#12 0x0000000000000001 in ?? ()
#13 0x00007fffffffe574 in ?? ()
#14 0x0000000000000000 in ?? ()

According to the CUDA documentation, this is not allowed:

The CUDA interfaces use global state that is initialized during host program initiation and destroyed during host program termination. The CUDA runtime and driver cannot detect if this state is invalid, so using any of these interfaces (implicitly or explicitly) during program initiation (or termination after main) will result in undefined behavior.

While this may happen to work in isolation, NVIDIA's CUDA profiler, Nsight Compute, relies on applications using this interface correctly with respect to program and library initialization. In this case, the profiler either crashes or at least fails to profile: it suspends the app at the first CUDA API call, and since that call happens during program initialization, the application never finishes initializing, which prevents the profiler's frontend process from attaching to it.

The request here is to change LLVM's usage of CUDA in OpenMP offloading so that no CUDA API calls are made from static library or program initializers, complying with the CUDA documentation and avoiding the undefined behavior caused by unspecified library initialization ordering.

This is the sample application, but any similar one would probably reproduce this, too:

#include <cstdio>
#include <cstdlib>
#include <omp.h>

void saxpy(float a, float* x, float* y, int sz) {
	double tb = omp_get_wtime();
#pragma omp target teams distribute parallel for map(to:x[0:sz]) map(tofrom:y[0:sz])
	for (int i = 0; i < sz; i++) {
		y[i] = a * x[i] + y[i];
	}
	double te = omp_get_wtime();
	printf("Time of kernel: %lf\n", te - tb);
}

int main() {
	auto x = (float*) malloc(1000 * sizeof(float));
	auto y = (float*) calloc(1000, sizeof(float));

	for (int i = 0; i < 1000; i++) {
		x[i] = i;
	}

	saxpy(42, x, y, 1000);

	free(x);
	free(y);
	return 0;
}

It is compiled as below, where sm_75 may need to be replaced with the actual target GPU architecture.

clang++ -O3 -fopenmp -fopenmp-targets=nvptx64 saxpy.cpp -o saxpy --offload-arch=sm_75 -fopenmp-offload-mandatory
@EugeneZelenko EugeneZelenko added cuda openmp:libomptarget OpenMP offload runtime and removed new issue labels Dec 5, 2023
@jebalunode

Thank you, Felix, for raising this issue with LLVM. I will keep a watch on it. Thanks again.

@Thyre

Thyre commented Dec 7, 2023

We observed similar issues with libomptarget.so and OMPT in Score-P, where the early initialization causes both our CUDA and ROCm adapters to crash.

#69318

@shiltian shiltian changed the title openmp cuda offload incorrectly calls cuda from a static initializer [OpenMP] Offload incorrectly calls cuda from a static initializer Mar 29, 2024
@shiltian shiltian assigned jhuber6 and unassigned jhuber6 Mar 29, 2024
@shiltian
Contributor

cc @jhuber6 here, as he is working on linking the plugins statically and could potentially touch this part

@jhuber6
Contributor

jhuber6 commented Mar 29, 2024

Sorry, I never noticed this issue before.

Can you provide a command line that reproduces it on your end? I recently had to contend with destructor behavior in #87009, which amounted to calling our deinitialization function from atexit after CUDA has registered its own. I haven't seen any issues with the constructor portion, however, but that doesn't mean it's sound. The behavior after the above patch lands should at least be deterministic, as we no longer rely on whatever order dynamic libraries happen to be torn down in.

Changing this would be a lot of pain, since we'd basically need to delay library loading until the first target region is launched, and then that would prevent us from unregistering the libraries in the same phase ordering. I've used nsys on libomptarget before without issue, so what are you hitting?

@jhuber6
Contributor

jhuber6 commented Mar 30, 2024

> nsys nvprof saxpy | grep 'omp_offloading' -C2 -B2             
WARNING: saxpy and any of its children processes will be profiled.

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                        Name                      
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ------------------------------------------------
    100.0            2,560          1   2,560.0   2,560.0     2,560     2,560          0.0  __omp_offloading_10302_aec9c22__Z5saxpyfPfS_i_l9

Works as expected with NSYS on my sm_89 card.

@Thyre

Thyre commented Mar 30, 2024

> nsys nvprof saxpy | grep 'omp_offloading' -C2 -B2             
WARNING: saxpy and any of its children processes will be profiled.

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                        Name                      
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ------------------------------------------------
    100.0            2,560          1   2,560.0   2,560.0     2,560     2,560          0.0  __omp_offloading_10302_aec9c22__Z5saxpyfPfS_i_l9

Works as expected with NSYS on my sm_89 card.

I think Nsight Systems doesn't show the warning/error, but Nsight Compute does. I've mentioned this in #85770.
Will update this with a reproducer in a moment.

Well, I seemingly broke my LLVM installation while updating to a newer LLVM. Here's the output from LLVM 16 and Nsight Compute from CUDA 12.2:

$ cat test.c
int main( void )
{
    #pragma omp target
        {}
}
$ clang -fopenmp --offload-arch=native test.c
clang-16: warning: CUDA version is newer than the latest partially supported version 11.8 [-Wunknown-cuda-version]
clang-16: warning: CUDA version is newer than the latest partially supported version 11.8 [-Wunknown-cuda-version]
clang-16: warning: CUDA version is newer than the latest partially supported version 11.8 [-Wunknown-cuda-version]
$ ncu --version
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2023 NVIDIA Corporation
Version 2023.2.0.0 (build 32895467) (public-release)
$ ncu ./a.out 
==ERROR== The application returned an error code (11).
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

This matches what the user was experiencing in the initial report. NVIDIA has added the error message in newer versions.

@jhuber6
Contributor

jhuber6 commented Mar 30, 2024

I think Nsight Systems doesn't show the warning/error, but Nsight Compute does. I've mentioned this in #85770. Will update this with a reproducer in a moment.

Is it only the device initialization that's problematic? The description seems to state that any calls to cuInit are questionable.

@jhuber6
Contributor

jhuber6 commented Mar 30, 2024

I see it now using sudo ncu ./a.out. Also confirmed that the above patch does nothing to change that behavior. Did this only start happening when @jdoerfert made us eagerly initialize the plugins?

@Thyre

Thyre commented Mar 30, 2024

I see it now using sudo ncu ./a.out. Also confirmed that the above patch does nothing to change that behavior. Did this only start happening when @jdoerfert made us eagerly initialize the plugins?

The mentioned post in the NVIDIA developer forums points to using LLVM 17.0.5, which would be before #74397, if you meant this PR.
In Score-P, we noticed issues with OMPT + CUDA & HIP starting with LLVM 16 (which pointed me to 5b67bce).

Is it only the device initialization that's problematic? The description seems to state that any calls to cuInit are questionable.

I can't answer that question unfortunately. I know that we also call cuInit in our CUDA adapter, but don't have a stack trace ready for where our tool actually crashed with older LLVM versions.

@jhuber6
Contributor

jhuber6 commented Mar 30, 2024

The mentioned post in the NVIDIA developer forums points to using LLVM 17.0.5, which would be before #74397, if you meant this PR.
In Score-P, we noticed issues with OMPT + CUDA & HIP starting with LLVM 16 (which pointed me to 5b67bce).

It would be very difficult to provide this interface without relying on some kind of static initializer. CUDA itself creates a static initializer with default priority that calls __cudaRegisterFatBinary. So presumably this function doesn't need to initialize anything else. Unfortunately I can only guess at what goes on inside of NVIDIA's black box here. Right now what we do is basically


void deinit() { ... }

[[gnu::constructor(101)]] void init() {
  ...
  cuInit(0);
  ...
  atexit(&deinit);
}

We had to use atexit for similar reasons: because CUDA uses some global destructors (I think libcrypto is the one I can see in the logs), we need to make sure that we destroy the runtime before those run.

I can't answer that question unfortunately. I know that we also call cuInit in our CUDA adapter, but don't have a stack trace ready for where our tool actually crashed with older LLVM versions.

Looking at the other issue, I think that the OMPT case should at least be simplified with #87009 since we will no longer be initializing OMPT multiple times within dynamic library loads. I don't know if that will improve your experience or not, we'll see.

@Thyre

Thyre commented Mar 30, 2024

I can't answer that question unfortunately. I know that we also call cuInit in our CUDA adapter, but don't have a stack trace ready for where our tool actually crashed with older LLVM versions.

Looking at the other issue, I think that the OMPT case should at least be simplified with #87009 since we will no longer be initializing OMPT multiple times within dynamic library loads. I don't know if that will improve your experience or not, we'll see.

This will fortunately not be an issue with our next major Score-P release anymore, as we thought a lot about how we may be able to work around this issue and have found a way to initialize our adapters at a later point. With LLVM 18 and onward, we will still have issues tracking CUDA events through CUPTI when users are writing OpenMP target code (partially because of that later initialization), but that's a separate issue (#85770).

@jhuber6
Contributor

jhuber6 commented Mar 30, 2024

Also, I mistakenly said that sudo ncu ./a.out didn't work; this was actually due to it failing to load a shared library, since it didn't match my library path internally. After fixing that, it worked and printed a bunch of stuff. If you have a small reproducer, that would help.

@Thyre

Thyre commented Mar 30, 2024

Sure, I can reproduce it with these steps:

$ cat test.c                                      
int main( void )
{
    #pragma omp target
        {}
}
$ clang --version
clang version 18.1.2
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/apps/software/Core/Compilers/LLVM/18.1.2/bin
$ clang -fopenmp --offload-arch=sm_75 test.c
$ ncu --version
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2023 NVIDIA Corporation
Version 2024.1.0.0 (build 33681293) (public-release)
$ ncu ./a.out
==ERROR== Cuda is initialized before the tool, e.g. by calling a Cuda API from a static initializer.
==ERROR== Initializing Cuda during program initialization results in undefined behavior.
==ERROR== See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#initialization
==ERROR== The application returned an error code (1).
==WARNING== No kernels were profiled.

I'm using CUDA 12.4 on Ubuntu 22.04LTS, with CUDA installed via their .run file and LLVM built manually with the following flags:

cmake -G "Ninja" ../llvm -DCMAKE_BUILD_TYPE=Release \
                    -DCMAKE_C_COMPILER=gcc-12 \
                    -DCMAKE_CXX_COMPILER=g++-12 \
                    -DLIBOMPTARGET_ENABLE_DEBUG:Bool=On \
                    -DLLVM_ENABLE_PROJECTS="clang;lld" \
                    -DLLVM_ENABLE_RUNTIMES:STRING="libunwind;libcxxabi;libcxx;compiler-rt;openmp" \
                    -DLLVM_LINK_LLVM_DYLIB:BOOL=On \
                    -DCLANG_LINK_CLANG_DYLIB:BOOL=On \
                    -DLLVM_CCACHE_BUILD=Off  \
                    -DLLVM_ENABLE_ASSERTIONS:BOOL=ON \
                    -DLLVM_PARALLEL_LINK_JOBS=1 \
                    -DLLVM_TARGETS_TO_BUILD="X86;NVPTX;" \
                    -DLLVM_ENABLE_PLUGINS:BOOL=On \
                    -DCMAKE_INSTALL_PREFIX=/opt/apps/software/Core/Compilers/LLVM/18.1.2/ \
                    -DLLVM_ENABLE_RTTI:Bool=On

@jhuber6
Contributor

jhuber6 commented Mar 30, 2024

Sure, I can reproduce it with these steps:

Here's what I get with my build

$ cat input.c 
#include <omp.h>

int main() {
  int isDevice = 0;
#pragma omp target map(from : isDevice)
  { isDevice = omp_is_initial_device(); }
  return isDevice;
}
$ clang --version
clang version 19.0.0git
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/jhuber/Documents/llvm/clang/bin
$ clang -fopenmp --offload-arch=sm_89 input.c 
$ ncu --version
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2023 NVIDIA Corporation
Version 2024.1.0.0 (build 33681293) (public-release)
$ ncu ./a.out 
==PROF== Connected to process 1331 (/home/jhuber/Documents/llvm/llvm-project/build/a.out)
==PROF== Profiling "__omp_offloading_10302_aec9ca..." - 0: 0%....50%....100% - 8 passes
==PROF== Disconnected from process 1331
[1331] a.out@127.0.0.1
  __omp_offloading_10302_aec9caf_main_l5 (1, 1, 1)x(128, 1, 1), Context 1, Stream 13, Device 0, CC 8.9
    Section: GPU Speed Of Light Throughput
    ----------------------- ------------- ------------
    Metric Name               Metric Unit Metric Value
    ----------------------- ------------- ------------
    DRAM Frequency          cycle/nsecond         6.76
    SM Frequency            cycle/usecond       716.36
    Elapsed Cycles                  cycle       16,689
    Memory Throughput                   %         0.35
    DRAM Throughput                     %         0.35
    Duration                      usecond        23.30
    L1/TEX Cache Throughput             %         2.88
    L2 Cache Throughput                 %         0.22
    SM Active Cycles                cycle       312.96
    Compute (SM) Throughput             %         0.10
    ----------------------- ------------- ------------

@Thyre

Thyre commented Mar 30, 2024

Interesting.. I'll build a trunk version and check if the issue still occurs.

@jhuber6
Contributor

jhuber6 commented Mar 30, 2024

Interesting.. I'll build a trunk version and check if the issue still occurs.

Sure, let me know. This seems to happen with and without my patch that removes the dlopen mess. However, if @fs-nv is correct and static constructors are the problem, then I see no reason it would change between versions, since we've pretty much always done it this way. Tough to say. I'm on Arch Linux with glibc 2.39, if that's important.

@Thyre

Thyre commented Mar 30, 2024

This seems to make a difference. On my personal machine, also with Arch Linux, ncu works perfectly fine with the repository Clang (17.0.6) and CUDA.

$ cat /etc/os-release | head -n 1
NAME="EndeavourOS"
$ clang --version
clang version 17.0.6
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
$ ncu --version
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2023 NVIDIA Corporation
Version 2024.1.0.0 (build 33681293) (public-release)
$ clang -fopenmp --offload-arch=sm_80 && ncu ./a.out
==PROF== Connected to process 63198 (/home/jreuter/a.out)
==PROF== Profiling "__omp_offloading_1030b_4c99da..." - 0: 0%....50%....100% - 8 passes
==PROF== Disconnected from process 63198
[63198] a.out@127.0.0.1
  __omp_offloading_1030b_4c99da_main_l3 (1, 1, 1)x(128, 1, 1), Context 1, Stream 13, Device 0, CC 8.6
    Section: GPU Speed Of Light Throughput
    ----------------------- ------------- ------------
    Metric Name               Metric Unit Metric Value
    ----------------------- ------------- ------------
    DRAM Frequency          cycle/nsecond         6,78
    SM Frequency            cycle/nsecond         1,50
    Elapsed Cycles                  cycle       29.904
    Memory Throughput                   %         7,12
    DRAM Throughput                     %         7,12
    Duration                      usecond        19,97
    L1/TEX Cache Throughput             %         2,43
    L2 Cache Throughput                 %         2,29
    SM Active Cycles                cycle       610,13
    Compute (SM) Throughput             %         0,08
    ----------------------- ------------- ------------

At the same time, LLVM 19.0git also seems to fix the issue. Even on my Ubuntu machine, I'm able to run ncu just fine.

$ cat test.c
int main( void )
{
    #pragma omp target
        {}
}
$ clang --version
clang version 19.0.0git (https://github.com/llvm/llvm-project.git e9e4ab8147f0b8a8d1c8f1a70bef3999ffeeb651)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/jreuter/Projects/Compilers/llvm-project/_build/_install/bin
$ clang -fopenmp --offload-arch=sm_75 test.c
$ ncu ./a.out
==PROF== Connected to process 229629 (/home/jreuter/a.out)
==PROF== Profiling "__omp_offloading_10303_1b815e..." - 0: 0%....50%....100% - 9 passes
==PROF== Disconnected from process 229629
[229629] a.out@127.0.0.1
  __omp_offloading_10303_1b815e3_main_l3 (1, 1, 1)x(128, 1, 1), Context 1, Stream 13, Device 0, CC 7.5
    Section: GPU Speed Of Light Throughput
    ----------------------- ------------- ------------
    Metric Name               Metric Unit Metric Value
    ----------------------- ------------- ------------
    DRAM Frequency          cycle/nsecond         4,71
    SM Frequency            cycle/nsecond         1,05
    Elapsed Cycles                  cycle       13.924
    Memory Throughput                   %         2,82
    DRAM Throughput                     %         2,82
    Duration                      usecond        13,15
    L1/TEX Cache Throughput             %         3,24
    L2 Cache Throughput                 %         0,89
    SM Active Cycles                cycle       609,88
    Compute (SM) Throughput             %         0,31
    ----------------------- ------------- ------------

    OPT   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full      
          waves across all SMs. Look at Launch Statistics for more details.                                                                                                                                   

@fs-nv: Are you able to check if this also solves the issue for you?

@jhuber6
Contributor

jhuber6 commented Mar 30, 2024

This seems to make a difference. On my personal machine, also with Arch Linux, ncu works perfectly fine with the repository Clang (17.0.6) and CUDA.
@fs-nv: Are you able to check if this also solves the issue for you?

Interesting, it's possible 421085f changed something, since I moved the initializer from priority 1 to priority 101; constructor priorities below 101 aren't really allowed outside of libc.

@Thyre

Thyre commented Mar 30, 2024

Interesting, it's possible 421085f changed something since I moved the initializer from priority 1 to priority 101, since anything outside of libc less than 101 isn't really allowed.

There is certainly a difference. Checking when cuInit() is called for the first time, it is now in _start instead of _dl_start_user. This should also fix #69318, as it doesn't cause Score-P v8.4 to crash anymore.

LLVM 19git (e9e4ab8):

Breakpoint 1, 0x00007ffff4c304a0 in cuInit () from /opt/apps/software/Core/Libraries/CUDA/12.4.0/nsight-compute-2024.1.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
(gdb) bt
#0  0x00007ffff4c304a0 in cuInit ()
   from /opt/apps/software/Core/Libraries/CUDA/12.4.0/nsight-compute-2024.1.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
#1  0x00007ffff42c63c4 in llvm::omp::target::plugin::CUDAPluginTy::initImpl() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#2  0x00007ffff42e26df in llvm::omp::target::plugin::GenericPluginTy::init() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#3  0x00007ffff42ed37e in llvm::omp::target::plugin::PluginTy::PluginTy() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#4  0x00007ffff42e8899 in __tgt_rtl_init_plugin ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#5  0x00007ffff7f4395d in PluginAdaptorTy::init() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#6  0x00007ffff7f4319b in PluginAdaptorTy::create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#7  0x00007ffff7f4476e in PluginManager::init() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#8  0x00007ffff7f407e0 in initRuntime() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#9  0x00007ffff7f2a6bc in __tgt_register_lib ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#10 0x0000555555555099 in omp_offloading.descriptor_reg ()
#11 0x00007ffff4029ebb in call_init (env=<optimized out>, argv=0x7fffffffc998, argc=1) at ../csu/libc-start.c:145
#12 __libc_start_main_impl (main=0x5555555551a0 <main>, argc=1, argv=0x7fffffffc998, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7fffffffc988) at ../csu/libc-start.c:379
#13 0x00005555555550d5 in _start ()

LLVM 18:

Breakpoint 1, 0x00007ffff4c304a0 in cuInit () from /opt/apps/software/Core/Libraries/CUDA/12.4.0/nsight-compute-2024.1.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
(gdb) bt
#0  0x00007ffff4c304a0 in cuInit ()
   from /opt/apps/software/Core/Libraries/CUDA/12.4.0/nsight-compute-2024.1.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
#1  0x00007ffff42f8d04 in llvm::omp::target::plugin::CUDAPluginTy::initImpl() ()
   from /opt/apps/software/Core/Compilers/LLVM/18.1.2/lib/../lib/libomptarget.rtl.cuda.so
#2  0x00007ffff42cffcf in llvm::omp::target::plugin::GenericPluginTy::init() ()
   from /opt/apps/software/Core/Compilers/LLVM/18.1.2/lib/../lib/libomptarget.rtl.cuda.so
#3  0x00007ffff42d7a7e in llvm::omp::target::plugin::Plugin::Plugin() ()
   from /opt/apps/software/Core/Compilers/LLVM/18.1.2/lib/../lib/libomptarget.rtl.cuda.so
#4  0x00007ffff42d0579 in __tgt_rtl_init_plugin ()
   from /opt/apps/software/Core/Compilers/LLVM/18.1.2/lib/../lib/libomptarget.rtl.cuda.so
#5  0x00007ffff7f44efc in PluginAdaptorTy::init() () from /opt/apps/software/Core/Compilers/LLVM/18.1.2/lib/libomptarget.so.18.1
#6  0x00007ffff7f447f6 in PluginAdaptorTy::create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /opt/apps/software/Core/Compilers/LLVM/18.1.2/lib/libomptarget.so.18.1
#7  0x00007ffff7f45e29 in PluginManager::init() () from /opt/apps/software/Core/Compilers/LLVM/18.1.2/lib/libomptarget.so.18.1
#8  0x00007ffff7f42bc5 in init() () from /opt/apps/software/Core/Compilers/LLVM/18.1.2/lib/libomptarget.so.18.1
#9  0x00007ffff7fc947e in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffa088, 
    env=env@entry=0x7fffffffa098) at ./elf/dl-init.c:70
#10 0x00007ffff7fc9568 in call_init (env=0x7fffffffa098, argv=0x7fffffffa088, argc=1, l=<optimized out>) at ./elf/dl-init.c:33
#11 _dl_init (main_map=0x7ffff7ffe2e0, argc=1, argv=0x7fffffffa088, env=0x7fffffffa098) at ./elf/dl-init.c:117
#12 0x00007ffff7fe32ca in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#13 0x0000000000000001 in ?? ()
#14 0x00007fffffffa4f6 in ?? ()
#15 0x0000000000000000 in ?? ()

@jhuber6
Contributor

jhuber6 commented Mar 30, 2024

Interesting, it's possible 421085f changed something since I moved the initializer from priority 1 to priority 101, since anything outside of libc less than 101 isn't really allowed.

There is certainly a difference. Checking when cuInit() is called for the first time, it is now in _start instead of _dl_start_user. This should also fix #69318, as it doesn't cause Score-P v8.4 to crash anymore.

That's a surprise, could you do me a favor and revert 421085f locally and then rebuild your test application to see if that makes a difference? To my knowledge that's the only change within the timeframe that affected the startup. Either that or #80460.

I could probably try to dissect this more, but in general the direction the library is heading is definitely more sane so I'm glad the changes I'm making seem to have actually fixed something I wasn't even aware of.

@Thyre

Thyre commented Mar 30, 2024

It seems like #80499 changed this. When reverting this commit, we get the old behavior:

Breakpoint 1, 0x00007ffff4c304a0 in cuInit () from /opt/apps/software/Core/Libraries/CUDA/12.4.0/nsight-compute-2024.1.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
(gdb) bt
#0  0x00007ffff4c304a0 in cuInit ()
   from /opt/apps/software/Core/Libraries/CUDA/12.4.0/nsight-compute-2024.1.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
#1  0x00007ffff42c63c4 in llvm::omp::target::plugin::CUDAPluginTy::initImpl() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#2  0x00007ffff42e26df in llvm::omp::target::plugin::GenericPluginTy::init() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#3  0x00007ffff42ed37e in llvm::omp::target::plugin::PluginTy::PluginTy() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#4  0x00007ffff42e8899 in __tgt_rtl_init_plugin ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#5  0x00007ffff7f4366d in PluginAdaptorTy::init() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#6  0x00007ffff7f42eab in PluginAdaptorTy::create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#7  0x00007ffff7f4447e in PluginManager::init() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#8  0x00007ffff7f40578 in init() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#9  0x00007ffff7fc947e in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffc528, 
    env=env@entry=0x7fffffffc538) at ./elf/dl-init.c:70
#10 0x00007ffff7fc9568 in call_init (env=0x7fffffffc538, argv=0x7fffffffc528, argc=1, l=<optimized out>) at ./elf/dl-init.c:33
#11 _dl_init (main_map=0x7ffff7ffe2e0, argc=1, argv=0x7fffffffc528, env=0x7fffffffc538) at ./elf/dl-init.c:117
#12 0x00007ffff7fe32ca in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#13 0x0000000000000001 in ?? ()
#14 0x00007fffffffc959 in ?? ()
#15 0x0000000000000000 in ?? ()
(gdb) continue
Continuing.
==ERROR== Cuda is initialized before the tool, e.g. by calling a Cuda API from a static initializer.
==ERROR== Initializing Cuda during program initialization results in undefined behavior.
==ERROR== See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#initialization
[Inferior 1 (process 255049) exited with code 01]
$ git log --oneline
6bd1fc01757d (HEAD -> main) Revert "[Libomptarget] Remove global ctor and use reference counting (#80499)"
e9e4ab8147f0 (origin/main, origin/HEAD) Revert "Reland "[PassManager] Support MachineFunctionProperties (#83668)"" (#87138)

421085f did not change anything. I haven't tried to revert #80460 as I don't have enough time right now due to Easter stuff 😄

@jhuber6
Contributor

jhuber6 commented Mar 30, 2024

It seems like #80499 changed this. When reverting this commit, we get the old behavior:

Breakpoint 1, 0x00007ffff4c304a0 in cuInit () from /opt/apps/software/Core/Libraries/CUDA/12.4.0/nsight-compute-2024.1.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
(gdb) bt
#0  0x00007ffff4c304a0 in cuInit ()
   from /opt/apps/software/Core/Libraries/CUDA/12.4.0/nsight-compute-2024.1.0/target/linux-desktop-glibc_2_11_3-x64/./libcuda-injection.so
#1  0x00007ffff42c63c4 in llvm::omp::target::plugin::CUDAPluginTy::initImpl() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#2  0x00007ffff42e26df in llvm::omp::target::plugin::GenericPluginTy::init() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#3  0x00007ffff42ed37e in llvm::omp::target::plugin::PluginTy::PluginTy() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#4  0x00007ffff42e8899 in __tgt_rtl_init_plugin ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#5  0x00007ffff7f4366d in PluginAdaptorTy::init() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#6  0x00007ffff7f42eab in PluginAdaptorTy::create(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#7  0x00007ffff7f4447e in PluginManager::init() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#8  0x00007ffff7f40578 in init() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#9  0x00007ffff7fc947e in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffc528, 
    env=env@entry=0x7fffffffc538) at ./elf/dl-init.c:70
[quoted backtrace frames #10–#15, ==ERROR== output, and git log identical to the previous comment]

Ah, sorry, that's the one I should have linked instead of #80460. It was the follow-up that actually changed the runtime. Very nice to see that my insistence on rewriting this stuff to use a deterministic init / deinit order fixed problems like these. Future patches will improve this further so that we don't have any global behavior in the OpenMP runtime itself, and so that there is only a single hook to call OMPT.

I'm unsure whether we should close this issue, since I don't fully understand what Nvidia's assertions are here, but it seems to work, so I'll leave that to your discretion.

@Thyre commented Mar 30, 2024

I would leave this to @fs-nv, as he opened the issue initially and probably knows a lot more about what Nvidia expects of a runtime implementing things related to CUDA. For us, the issue seems to be fixed. I closed my issue #69318 with a comment.

@jhuber6 (Contributor) commented Mar 30, 2024

Oh yeah, I totally forgot that the patch in question removed the global constructor from libomptarget itself, tying it to the user's initialization rather than to the dynamic library's initialization. This is why the constructors are now called from _start via the user's __init_array_start[] array instead of inside the dynamic loader. Funny how I can forget the details of patches I made just a month ago.

@fs-nv (Author) commented Apr 5, 2024

Hi, thanks for all the info, and sorry for the delay in validating this. I will try to get to it ASAP. Could someone however clarify for me please what is the exact version I should be testing with?

@Thyre commented Apr 5, 2024

@fs-nv There's no release with this change yet, but if you compile any main commit after ea174c0 (February 22nd), it should work.

@jhuber6 (Contributor) commented Apr 7, 2024

Hi, thanks for all the info, and sorry for the delay in validating this. I will try to get to it ASAP. Could someone however clarify for me please what is the exact version I should be testing with?

$ git clone <llvm-repo>
$ cd llvm-project && mkdir build && cd build
$ cmake ../llvm -DCMAKE_INSTALL_PREFIX=<path> -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_ENABLE_RUNTIMES="openmp" -DLLVM_OPTIMIZED_TABLEGEN=ON -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_TARGETS_TO_BUILD="NVPTX;X86" -G Ninja
$ ninja install

This should work as a minimal configuration to build a trunk LLVM compiler with OpenMP.

@fs-nv (Author) commented Apr 10, 2024

We tried with this version and found no issues when using ncu on apps compiled with it. Thanks for fixing this.

@fs-nv fs-nv closed this as completed Apr 10, 2024
Labels
cuda openmp:libomptarget OpenMP offload runtime