[OpenMP] offload to amdgpu hit "Input is not an ELF file" error #77798

Closed
ye-luo opened this issue Jan 11, 2024 · 1 comment · Fixed by #77828
Labels
openmp:libomptarget OpenMP offload runtime

Comments

@ye-luo
Contributor

ye-luo commented Jan 11, 2024

llvm 164f85d
SLES 15 SP4 ROCM 5.7.0

Use https://github.com/ye-luo/miniqmc

mkdir build_amdgpu; cd build_amdgpu
cmake -DCMAKE_CXX_COMPILER=clang++ -DQMC_ENABLE_ROCM=ON -DCMAKE_CXX_FLAGS=--gcc-toolchain=/soft/compilers/gcc/12.2.0/x86_64-suse-linux -DENABLE_OFFLOAD=ON -DQMC_GPU_ARCHS=gfx90a ..
cd src/Particle/tests
make -j32 test_distance_table
./test_distance_table

The run prints the following error at the end:

test_distance_table: /gpfs/jlse-fs0/users/yeluo/opt/llvm-clang/llvm-project-nightly/openmp/libomptarget/plugins-nextgen/common/src/GlobalHandler.cpp:31: Expected<ELF64LEObjectFile> llvm::omp::target::plugin::GenericGlobalHandlerTy::getELFObjectFile(DeviceImageTy &): Assertion `utils::elf::isELF(Image.getMemoryBuffer().getBuffer()) && "Input is not an ELF file"' failed.
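
For readers unfamiliar with the check: the assertion fires when the image buffer no longer looks like an ELF object. Every ELF file begins with the four bytes 0x7f, 'E', 'L', 'F', so a minimal version of such a test can be sketched as below. This is an illustration only; `looksLikeELF` is a hypothetical name, and the real `utils::elf::isELF` in libomptarget may check more than the magic number.

```cpp
#include <string_view>

// Hypothetical helper, not the libomptarget implementation.
// Returns true if the buffer starts with the four-byte ELF magic number:
// 0x7f followed by the ASCII characters 'E', 'L', 'F'.
bool looksLikeELF(std::string_view Buffer) {
  // The literal is split so that 'E' is not parsed as part of the hex escape.
  constexpr std::string_view Magic("\x7f" "ELF", 4);
  return Buffer.size() >= Magic.size() &&
         Buffer.substr(0, Magic.size()) == Magic;
}
```

A buffer whose backing memory has been unmapped or overwritten would fail a check of this kind, which is consistent with the failure only showing up during teardown.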

The backtrace shows:

Thread 1 "test_distance_t" received signal SIGABRT, Aborted.
0x00007fffef42fd2b in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007fffef42fd2b in raise () from /lib64/libc.so.6
#1  0x00007fffef4313e5 in abort () from /lib64/libc.so.6
#2  0x00007fffef427c6a in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007fffef427cf2 in __assert_fail () from /lib64/libc.so.6
#4  0x00007fffde3fa529 in llvm::omp::target::plugin::GenericGlobalHandlerTy::isSymbolInImage(llvm::omp::target::plugin::GenericDeviceTy&, llvm::omp::target::plugin::DeviceImageTy&, llvm::StringRef) ()
   from /soft/compilers/llvm/master-nightly/lib/libomptarget.rtl.amdgpu.so
#5  0x00007fffde3e3a6f in llvm::omp::target::plugin::AMDGPUDeviceTy::callGlobalCtorDtorCommon(llvm::omp::target::plugin::GenericPluginTy&, llvm::omp::target::plugin::DeviceImageTy&, char const*) ()
   from /soft/compilers/llvm/master-nightly/lib/libomptarget.rtl.amdgpu.so
#6  0x00007fffde3c34f0 in llvm::omp::target::plugin::AMDGPUDeviceTy::callGlobalDestructors(llvm::omp::target::plugin::GenericPluginTy&, llvm::omp::target::plugin::DeviceImageTy&) ()
   from /soft/compilers/llvm/master-nightly/lib/libomptarget.rtl.amdgpu.so
#7  0x00007fffde3eb0e5 in llvm::omp::target::plugin::GenericDeviceTy::deinit(llvm::omp::target::plugin::GenericPluginTy&) ()
   from /soft/compilers/llvm/master-nightly/lib/libomptarget.rtl.amdgpu.so
#8  0x00007fffde3f02f2 in llvm::omp::target::plugin::GenericPluginTy::deinitDevice(int) ()
   from /soft/compilers/llvm/master-nightly/lib/libomptarget.rtl.amdgpu.so
#9  0x00007fffde3e301d in llvm::omp::target::plugin::Plugin::deinit() ()
   from /soft/compilers/llvm/master-nightly/lib/libomptarget.rtl.amdgpu.so
#10 0x00007fffde3e2c58 in llvm::omp::target::plugin::Plugin::~Plugin() ()
   from /soft/compilers/llvm/master-nightly/lib/libomptarget.rtl.amdgpu.so
#11 0x00007fffef4331be in __cxa_finalize () from /lib64/libc.so.6
#12 0x00007fffde3be373 in __do_global_dtors_aux () from /soft/compilers/llvm/master-nightly/lib/libomptarget.rtl.amdgpu.so
#13 0x00007fffffffbcd0 in ?? ()
#14 0x00007ffff7fe29b3 in _dl_fini () from /lib64/ld-linux-x86-64.so.2
@llvmbot
Collaborator

llvmbot commented Jan 11, 2024

@llvm/issue-subscribers-openmp

Author: Ye Luo (ye-luo)


jhuber6 added a commit to jhuber6/llvm-project that referenced this issue Jan 11, 2024
Summary:
The constructor and destructor handling does a quick symbol lookup in the
ELF to determine whether they need to be run on the GPU. This lets us
avoid the much slower lookup through the vendor API.

One problem occurs with how we handle the lifetime of these images.
Right now there is no invariant to specify the lifetime of the
underlying binary image that is loaded. In the typical case, this comes
from the binary itself in the `.llvm.offloading` section, meaning that
the lifetime of the binary should match the executable itself. This
would work fine, if it weren't for the fact that the plugin is loaded
via `dlopen` and can have a teardown order out of sync with the main
executable.

This was likely what was occurring when this failed on some systems but
not others. A potential solution would be to simply copy images into
memory so the runtime does not rely on external references. Another
would be to manually zero these out after initialization so as to prevent
this mistake from happening accidentally. The former has the benefit of
making some checks easier, and allowing constant initialization to be
done on the ELF itself (normally we can't do this because writing to a
constant section, e.g. `.llvm.offloading`, is a segfault). The downside
would be the extra time required to copy the image in bulk (although we
are likely doing this in the vendor runtimes as well).
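
The first alternative above (copying the image so the runtime owns its bytes) could look roughly like the following. This is a sketch under assumptions: `OwnedImage` is a hypothetical stand-in and not the plugin's `DeviceImageTy`, and the patch described here did not take this route.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a device image that owns a private copy of
// its bytes instead of pointing into the host executable.
class OwnedImage {
public:
  // Copy the bytes up front so later reads do not depend on the lifetime
  // of the caller's storage. This is the bulk-copy cost mentioned above.
  OwnedImage(const unsigned char *Start, std::size_t Size)
      : Bytes(Start, Start + Size) {
    assert((Start != nullptr || Size == 0) && "null image with non-zero size");
  }

  const unsigned char *data() const { return Bytes.data(); }
  std::size_t size() const { return Bytes.size(); }

private:
  std::vector<unsigned char> Bytes;
};
```

Because the copy lives in writable heap memory, it can also be modified in place, which is not possible for data in a read-only section.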

This patch went with a quick solution: set a boolean value at
initialization time that records whether we need to call destructors.
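
As a rough illustration of that approach (not the actual patch), the idea is to inspect the image once while it is known to be valid and to have teardown consult only the cached result. All names below, including `Device`, `HasDtors`, and the `"fini_symbol"` marker, are hypothetical, and the substring search merely stands in for a real ELF symbol lookup.

```cpp
#include <string_view>

// Hypothetical device type; illustrative only, not the plugin's API.
class Device {
public:
  // Look at the image once, during initialization, and remember whether
  // any teardown work will be needed. The substring search is a stand-in
  // for a real symbol lookup in the ELF.
  void init(std::string_view Image) {
    HasDtors = Image.find("fini_symbol") != std::string_view::npos;
  }

  // Teardown reads only the cached flag and never touches the image, so
  // it stays safe even if the image memory is already gone by this point.
  // Returns true if destructors were run.
  bool deinit() {
    if (!HasDtors)
      return false;
    ++DtorRuns; // Stand-in for launching the destructor kernel.
    return true;
  }

  int DtorRuns = 0;

private:
  bool HasDtors = false;
};
```

The trade-off is that the flag only covers this one query; any other code path that reads the image late in teardown would still need the image to be alive.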

Fixes: llvm#77798
@EugeneZelenko EugeneZelenko added openmp:libomptarget OpenMP offload runtime and removed openmp labels Jan 11, 2024
justinfargnoli pushed a commit to justinfargnoli/llvm-project that referenced this issue Jan 28, 2024
…lvm#77828)
