[AMDGPU][SILowerSGPRSpills][OpenMP] Commit breaks OpenMC on AMD MI250 #63983

Open
jtramm opened this issue Jul 20, 2023 · 17 comments

@jtramm

jtramm commented Jul 20, 2023

Commit 7a98f08 breaks the OpenMC app when running on AMD GPUs via OpenMP offloading. When this commit is reverted, OpenMC runs correctly.

The behavior we see in OpenMC with this commit varies. Sometimes it runs cleanly but produces slightly incorrect results, and sometimes it crashes at runtime with a variety of memory-related errors, e.g.:

AMDGPU fatal error 1: Memory access fault by GPU 2 (agent 0x56135fcf2f00) at virtual address 0x7fe62f1f5000. Reasons: Unknown (0)

OpenMC can be downloaded, compiled, installed, and tested for correctness via script at: https://github.com/jtramm/openmc_offloading_builder

I'm also happy to test out any proposed patches. In the interim, I'd vote that we revert 7a98f08 in main until a fix is found.

Another notable issue with 7a98f08 is that it increases OpenMC's compile time from about 10 minutes up to 15 minutes. No significant performance gains are noted from the patch, so in OpenMC's case at least, the extra compile time doesn't seem to be worth it.

@llvmbot
Collaborator

llvmbot commented Jul 20, 2023

@llvm/issue-subscribers-backend-amdgpu

@cdevadas
Collaborator

Is this crash specific to only MI250 GPUs?

@arsenm
Contributor

arsenm commented Jul 20, 2023

As a long shot you can try https://reviews.llvm.org/D145329

@jtramm
Author

jtramm commented Jul 21, 2023

  • The crash does not appear to be specific to the MI250. OpenMC crashes or misbehaves on the MI100 in the same way it does on the MI250.

  • I tested the patch at https://reviews.llvm.org/D145329 (at least the parts that are not in llvm/test) and OpenMC behaved the same, crashing or producing incorrect answers.

@yashssh
Contributor

yashssh commented Jul 21, 2023

Hi @jtramm, I can reproduce the error that you shared; the command I used was ./build_openmc.sh small. Can you guide me on how to extract the GPU kernels from this test?

@jtramm
Author

jtramm commented Jul 21, 2023

Glad to hear you're able to reproduce the error!

At first glance, it seems like it would be tricky to extract kernels from OpenMC, as each one is quite large and accesses many complex hierarchies of data structures. There are probably 10k lines of code that run on the device, and much of the host code is spent initializing data structures to be passed to the device for the kernels to use. Manually reducing the program would be a huge effort, and might only shrink it by a modest fraction.

Are there perhaps automated tools for cloning memory states and re-running kernels that could automate the extraction process?

@yashssh
Contributor

yashssh commented Jul 21, 2023

Thanks, John! I can try reducing the program but the first step of isolating the test from the build environment seems tricky to me. Will it be possible to get a standalone reproducer?

@jtramm
Author

jtramm commented Jul 21, 2023

Once OpenMC is compiled and your environment is set up, you can navigate to one of the progression tests in the benchmark repository, e.g., openmc_offloading_benchmarks/progression_tests/small, and then simply run OpenMC as:

openmc --event

To know if it ran correctly or not, you can compare OpenMC's output against the expected_results.txt file in each problem's directory.
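
Since the failure is intermittent (sometimes a crash, sometimes a clean exit with wrong results), repeating the run and tallying outcomes can make the comparison less tedious. The following is a minimal sketch, not part of the builder scripts. It assumes `openmc` is on your PATH and that a nonzero exit status or the "Memory access fault" text marks a crash. The names `classify`, `tally_runs`, and the `run_logs` directory are hypothetical. The format of expected_results.txt is not described in this thread, so the sketch only saves each run's output for you to compare against that file by hand.

```python
import subprocess
from collections import Counter
from pathlib import Path

# Text taken from the runtime error quoted in this issue.
FAULT_MARKER = "Memory access fault"


def classify(returncode, output):
    """Label one run as 'crash' or 'clean-exit'.

    A clean exit does NOT mean the results are right; the saved log
    still has to be checked against expected_results.txt.
    """
    if returncode != 0 or FAULT_MARKER in output:
        return "crash"
    return "clean-exit"


def tally_runs(test_dir, runs=10, log_dir="run_logs"):
    """Run `openmc --event` several times in test_dir and count outcomes."""
    logs = Path(log_dir)
    logs.mkdir(exist_ok=True)
    counts = Counter()
    for i in range(runs):
        proc = subprocess.run(
            ["openmc", "--event"],
            cwd=test_dir,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep GPU fault messages with the output
            text=True,
        )
        # Save the combined output so clean exits can be compared later.
        (logs / f"run_{i:02d}.log").write_text(proc.stdout)
        counts[classify(proc.returncode, proc.stdout)] += 1
    return counts
```

For example, `tally_runs("openmc_offloading_benchmarks/progression_tests/small")` returns a counter of crashes versus clean exits and leaves one log per run in `run_logs/`.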

@yashssh
Contributor

yashssh commented Jul 24, 2023

If I understand correctly, openmc --event executes the openmc binary with the inputs in the openmc_offloading_benchmarks/progression_tests/small directory, where openmc is the one big executable built along with the OpenMC library.
Since OpenMC already has its own test suite, is it possible that one of the tests in that suite is failing because of these changes? If so, that would be a lot easier to debug. I tried running the OpenMC tests locally following the steps here, but I can't seem to get the setup right. Can you check whether any of those tests is failing while I continue looking into the openmc_offloading_benchmarks/progression_tests/small failure?

@jtramm
Author

jtramm commented Jul 24, 2023

Yes, building OpenMC produces a .so library file that contains most of the OpenMC code, plus the openmc executable, which is just a simple main that loads and runs things in the .so. When you run openmc --event, it runs OpenMC using the .xml files (and the cross sections file from the environment) as inputs.

I don't think it would be useful to run OpenMC's included regression tests. If the simple pincell model at openmc_offloading_benchmarks/progression_tests/small is failing, then nearly all of OpenMC's other regression tests will also fail. The openmc_offloading_benchmarks/progression_tests/small model is about as simple as an input gets. The unit tests are for the Python interface, so they would not have any device code in them.

@yashssh
Contributor

yashssh commented Jul 26, 2023

Is the debug build broken? I can't build the library after adding -DCMAKE_BUILD_TYPE=[Debug|RelWithDebInfo] to the ./build_openmc script. -DCMAKE_BUILD_TYPE=Release works fine. Attaching the stack trace.

[ 66%] Linking CXX shared library lib/libopenmc.so                                                                                                                                                                                                                   
/usr/local/lib/python3.8/dist-packages/cmake/data/bin/cmake -E cmake_link_script CMakeFiles/libopenmc.dir/link.txt --verbose=1                                                                                                                                       
/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/bin/clang++ -fPIC -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908  -fopenmp -fopenmp-cuda-mode -Dgsl_CONFIG_CONTRACT_CHE
CKING_OFF -Wno-tautological-constant-compare -Wno-openmp-mapping -g -shared -Wl,-soname,libopenmc.so -o lib/libopenmc.so CMakeFiles/libopenmc.dir/Unity/unity_0_cxx.cxx.o  -Wl,-rpath,/usr/lib/x86_64-linux-gnu/hdf5/serial::::::::::::::::::::::::::::::::::::::::::
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: /usr/lib/x86_64-linux-gnu/hdf5/serial/libhdf5.so /usr/lib/x86_64-linux-gnu/libpthread.so /usr/lib/x86_64-linux-gnu/libsz.so /usr/lib/x86_64-linux-gnu/libz.so /usr/lib/x86_64-linux-gnu/libdl.
so -lm /usr/lib/x86_64-linux-gnu/hdf5/serial/libhdf5_hl.so lib/libpugixml.a /usr/local/lib/libfmt.a                                                                                                                                                                  
clang-linker-wrapper: /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-project/llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp:532: void (anonymous namespace)::SIOptimizeVGPRLiveRange::optimizeLiveRange(Register, MachineBasicBlock *, MachineBasicBlock *, MachineBasicBlock *, SmallSetVector<MachineBasicBlock *, 16> &) const: Assertion `!O.readsReg()' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.                                                                                                                                                          
Stack dump:                                                                                                                                                                                                                                                          
0.      Program arguments: /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/bin/clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --device-debug --linker-path=/usr/bin/ld -- -z relro --hash-styl
e=gnu --eh-frame-hdr -m elf_x86_64 -shared -o lib/libopenmc.so /lib/x86_64-linux-gnu/crti.o /usr/lib/gcc/x86_64-linux-gnu/9/crtbeginS.o -L/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/bin/../lib/x86_64
-unknown-linux-gnu -L/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/lib/clang/17/lib/x86_64-unknown-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/9 -L/usr/lib/gcc/x86_64-linux-gnu/9/../../../../lib64 -L/lib
/x86_64-linux-gnu -L/lib/../lib64 -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib64 -L/lib -L/usr/lib -L/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/lib -L/long_pathname_so_that_rpms_can_package_the_deb
ug_info/src/extlibs/openmc/llvm-install/lib -L/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc/llvm-install/lib -L. -soname libopenmc.so CMakeFiles/libopenmc.dir/Unity/unity_0_cxx.cxx.o -rpath /usr/lib/x86_64-linux-gnu/hdf5/serial::::::
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: /usr/lib/x86_64-linux-gnu/hdf5/serial/libhdf5.so /usr/lib/x86_64-linux-gnu/libpthread.so /usr/lib/x86_64-linux-gnu/libsz.so /usr/lib/x86_64-linux-gnu/libz
.so /usr/lib/x86_64-linux-gnu/libdl.so -lm /usr/lib/x86_64-linux-gnu/hdf5/serial/libhdf5_hl.so lib/libpugixml.a /usr/local/lib/libfmt.a -lstdc++ -lm -lomp -lomptarget -lomptarget.devicertl -L/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/ope
nmc_offloading_builder/llvm-install/lib -lgcc_s -lgcc -lpthread -lc -lgcc_s -lgcc /usr/lib/gcc/x86_64-linux-gnu/9/crtendS.o /lib/x86_64-linux-gnu/crtn.o                                                                                                             
1.      Running pass 'CallGraph Pass Manager' on module 'ld-temp.o'.                                                                                                                                                                                                 
2.      Running pass 'SI Optimize VGPR LiveRange' on function '@__omp_offloading_39_35d4096__ZN6openmc31process_advance_particle_eventsEv_l252'                                                                                                                      
 #0 0x0000000002fb1d38 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/bin/clang-linker-wrapper+0x2fb1d38)                                            
 #1 0x0000000002fafb3e llvm::sys::RunSignalHandlers() (/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/bin/clang-linker-wrapper+0x2fafb3e)                                                                 
 #2 0x0000000002fb24ed SignalHandler(int) Signals.cpp:0:0                                                                                                                                                                                                            
 #3 0x00007ff0aa421420 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x14420)                                                                                                                                                                                  
 #4 0x00007ff0a9eb400b raise /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/raise.c:51:1                                                                                                                                                           
 #5 0x00007ff0a9e93859 abort /build/glibc-SzIz7B/glibc-2.31/stdlib/abort.c:81:7                                                                                                                                                                                      
 #6 0x00007ff0a9e93729 get_sysdep_segment_value /build/glibc-SzIz7B/glibc-2.31/intl/loadmsgcat.c:509:8                                                                                                                                                               
 #7 0x00007ff0a9e93729 _nl_load_domain /build/glibc-SzIz7B/glibc-2.31/intl/loadmsgcat.c:970:34                                                                                                                                                                       
 #8 0x00007ff0a9ea4fd6 (/lib/x86_64-linux-gnu/libc.so.6+0x33fd6)                                                                                                                                                                                                     
 #9 0x0000000002405184 (anonymous namespace)::SIOptimizeVGPRLiveRange::runOnMachineFunction(llvm::MachineFunction&) SIOptimizeVGPRLiveRange.cpp:0:0                                                                                                                  
#10 0x00000000030b09f0 llvm::MachineFunctionPass::runOnFunction(llvm::Function&) (/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/bin/clang-linker-wrapper+0x30b09f0)                                      
#11 0x00000000029d1627 llvm::FPPassManager::runOnFunction(llvm::Function&) (/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/bin/clang-linker-wrapper+0x29d1627)                                            
#12 0x0000000002be0a51 (anonymous namespace)::CGPassManager::runOnModule(llvm::Module&) CallGraphSCCPass.cpp:0:0                                                                                                                                                     
#13 0x00000000029d20a7 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/bin/clang-linker-wrapper+0x29d20a7)                                              
#14 0x0000000003545a5b codegen(llvm::lto::Config const&, llvm::TargetMachine*, std::function<llvm::Expected<std::unique_ptr<llvm::CachedFileStream, std::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>, unsigned int, llvm::Module&, llvm::ModuleSummaryIndex const&) LTOBackend.cpp:0:0
#15 0x0000000003544a62 llvm::lto::backend(llvm::lto::Config const&, std::function<llvm::Expected<std::unique_ptr<llvm::CachedFileStream, std::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>, unsigned int, llvm::Module&, llvm::ModuleSummaryIndex&) (/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/bin/clang-linker-wrapper+0x3544a62)
#16 0x0000000003516f90 llvm::lto::LTO::runRegularLTO(std::function<llvm::Expected<std::unique_ptr<llvm::CachedFileStream, std::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>) (/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/bin/clang-linker-wrapper+0x3516f90)
#17 0x00000000035165b7 llvm::lto::LTO::run(std::function<llvm::Expected<std::unique_ptr<llvm::CachedFileStream, std::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>, std::function<llvm::Expected<std::function<llvm::Expected<std::unique_ptr<llvm::CachedFileStream, std::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>> (unsigned int, llvm::StringRef, llvm::Twine const&)>) (/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/bin/clang-linker-wrapper+0x35165b7)
#18 0x0000000002098508 (anonymous namespace)::linkBitcodeFiles(llvm::SmallVectorImpl<llvm::object::OffloadFile>&, llvm::SmallVectorImpl<llvm::StringRef>&, llvm::opt::ArgList const&) ClangLinkerWrapper.cpp:0:0
#19 0x000000000209234e llvm::Error (anonymous namespace)::linkAndWrapDeviceFiles(llvm::SmallVectorImpl<llvm::object::OffloadFile>&, llvm::opt::InputArgList const&, char**, int)::$_0::operator()<llvm::SmallVector<llvm::object::OffloadFile, 3u>>(llvm::SmallVector<llvm::object::OffloadFile, 3u>&) const ClangLinkerWrapper.cpp:0:0
#20 0x0000000002089c2d (anonymous namespace)::linkAndWrapDeviceFiles(llvm::SmallVectorImpl<llvm::object::OffloadFile>&, llvm::opt::InputArgList const&, char**, int) ClangLinkerWrapper.cpp:0:0
#21 0x000000000208573a main (/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/bin/clang-linker-wrapper+0x208573a)
#22 0x00007ff0a9e95083 __libc_start_main /build/glibc-SzIz7B/glibc-2.31/csu/../csu/libc-start.c:342:3
#23 0x000000000208452e _start (/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/bin/clang-linker-wrapper+0x208452e)
 #0 0x0000000002fb1d38 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/bin/clang-linker-wrapper+0x2fb1d38)
 #1 0x0000000002fafb75 llvm::sys::RunSignalHandlers() (/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/bin/clang-linker-wrapper+0x2fafb75)
 #2 0x0000000002fb24ed SignalHandler(int) Signals.cpp:0:0
 #3 0x00007ff0aa421420 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x14420)
 #4 0x00007ff0a9eb400b raise /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/raise.c:51:1
 #5 0x00007ff0a9e93859 abort /build/glibc-SzIz7B/glibc-2.31/stdlib/abort.c:81:7
 #6 0x00007ff0a9e93729 get_sysdep_segment_value /build/glibc-SzIz7B/glibc-2.31/intl/loadmsgcat.c:509:8
 #7 0x00007ff0a9e93729 _nl_load_domain /build/glibc-SzIz7B/glibc-2.31/intl/loadmsgcat.c:970:34
 #8 0x00007ff0a9ea4fd6 (/lib/x86_64-linux-gnu/libc.so.6+0x33fd6)
 #9 0x0000000002405184 (anonymous namespace)::SIOptimizeVGPRLiveRange::runOnMachineFunction(llvm::MachineFunction&) SIOptimizeVGPRLiveRange.cpp:0:0
#10 0x00000000030b09f0 llvm::MachineFunctionPass::runOnFunction(llvm::Function&) (/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/bin/clang-linker-wrapper+0x30b09f0)
#11 0x00000000029d1627 llvm::FPPassManager::runOnFunction(llvm::Function&) (/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/bin/clang-linker-wrapper+0x29d1627)
#12 0x0000000002be0a51 (anonymous namespace)::CGPassManager::runOnModule(llvm::Module&) CallGraphSCCPass.cpp:0:0
#13 0x00000000029d20a7 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/bin/clang-linker-wrapper+0x29d20a7)
#14 0x0000000003545a5b codegen(llvm::lto::Config const&, llvm::TargetMachine*, std::function<llvm::Expected<std::unique_ptr<llvm::CachedFileStream, std::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>, unsigned int, llvm::Module&, llvm::ModuleSummaryIndex const&) LTOBackend.cpp:0:0
#15 0x0000000003544a62 llvm::lto::backend(llvm::lto::Config const&, std::function<llvm::Expected<std::unique_ptr<llvm::CachedFileStream, std::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>, unsigned int, llvm::Module&, llvm::ModuleSummaryIndex&) (/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/bin/clang-linker-wrapper+0x3544a62)
#16 0x0000000003516f90 llvm::lto::LTO::runRegularLTO(std::function<llvm::Expected<std::unique_ptr<llvm::CachedFileStream, std::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>) (/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/bin/clang-linker-wrapper+0x3516f90)
#17 0x00000000035165b7 llvm::lto::LTO::run(std::function<llvm::Expected<std::unique_ptr<llvm::CachedFileStream, std::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>, std::function<llvm::Expected<std::function<llvm::Expected<std::unique_ptr<llvm::CachedFileStream, std::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>> (unsigned int, llvm::StringRef, llvm::Twine const&)>) (/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/bin/clang-linker-wrapper+0x35165b7)
#18 0x0000000002098508 (anonymous namespace)::linkBitcodeFiles(llvm::SmallVectorImpl<llvm::object::OffloadFile>&, llvm::SmallVectorImpl<llvm::StringRef>&, llvm::opt::ArgList const&) ClangLinkerWrapper.cpp:0:0
#19 0x000000000209234e llvm::Error (anonymous namespace)::linkAndWrapDeviceFiles(llvm::SmallVectorImpl<llvm::object::OffloadFile>&, llvm::opt::InputArgList const&, char**, int)::$_0::operator()<llvm::SmallVector<llvm::object::OffloadFile, 3u>>(llvm::SmallVector<llvm::object::OffloadFile, 3u>&) const ClangLinkerWrapper.cpp:0:0
#20 0x0000000002089c2d (anonymous namespace)::linkAndWrapDeviceFiles(llvm::SmallVectorImpl<llvm::object::OffloadFile>&, llvm::opt::InputArgList const&, char**, int) ClangLinkerWrapper.cpp:0:0
#21 0x000000000208573a main (/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/bin/clang-linker-wrapper+0x208573a)
#22 0x00007ff0a9e95083 __libc_start_main /build/glibc-SzIz7B/glibc-2.31/csu/../csu/libc-start.c:342:3
#23 0x000000000208452e _start (/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/bin/clang-linker-wrapper+0x208452e)
clang++: error: unable to execute command: Aborted (core dumped)
clang++: error: linker command failed due to signal (use -v to see invocation)
make[2]: *** [CMakeFiles/libopenmc.dir/build.make:106: lib/libopenmc.so] Error 1
make[2]: Leaving directory '/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/openmc/build'
make[1]: *** [CMakeFiles/Makefile2:188: CMakeFiles/libopenmc.dir/all] Error 2
make[1]: Leaving directory '/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/openmc/build'
make: *** [Makefile:136: all] Error 2

@jtramm
Author

jtramm commented Jul 27, 2023

I'm not sure about the LLVM crash, but if you want to add debugging flags to the OpenMC build, you can enable them by adding -Ddebug=on to your cmake line for OpenMC. You can edit the debugging flags at https://github.com/exasmr/openmc/blob/0d5b181a1d82d7d0073e5d0532f02212019c01dd/CMakeLists.txt#L122

@jdoerfert
Member

jdoerfert commented Jul 27, 2023

> Is debug build broken? I can't build the library after adding -DCMAKE_BUILD_TYPE=[Debug|RelWithDebInfo] to the ./build_openmc script. -DCMAKE_BUILD_TYPE=Release works fine. Attaching stack trace.

/long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-install/bin/clang++ -fPIC -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908  -fopenmp -fopenmp-cuda-mode -Dgsl_CONFIG_CONTRACT_CHE
CKING_OFF -Wno-tautological-constant-compare -Wno-openmp-mapping -g -shared -Wl,-soname,libopenmc.so -o lib/libopenmc.so CMakeFiles/libopenmc.dir/Unity/unity_0_cxx.cxx.o  -Wl,-rpath,/usr/lib/x86_64-linux-gnu/hdf5/serial::::::::::::::::::::::::::::::::::::::::::
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: /usr/lib/x86_64-linux-gnu/hdf5/serial/libhdf5.so /usr/lib/x86_64-linux-gnu/libpthread.so /usr/lib/x86_64-linux-gnu/libsz.so /usr/lib/x86_64-linux-gnu/libz.so /usr/lib/x86_64-linux-gnu/libdl.
so -lm /usr/lib/x86_64-linux-gnu/hdf5/serial/libhdf5_hl.so lib/libpugixml.a /usr/local/lib/libfmt.a                                                                                                                                                                  
clang-linker-wrapper: /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/openmc_offloading_builder/llvm-project/llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp:532: void (anonymous namespace)::SIOptimizeVGPRLiveRange::optimizeLiveRange(Register, MachineBasicBlock *, MachineBasicBlock *, MachineBasicBlock *, SmallSetVector<MachineBasicBlock *, 16> &) const: Assertion `!O.readsReg()' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.                                                                                                                                                                                                                                                      

@yashssh You might want to do what the error suggests: file a bug for the AMDGPU backend crash here.

@yashssh
Contributor

yashssh commented Jul 27, 2023

> I'm not sure about the llvm crash, but if you want to add debugging flags to the OpenMC build, these can be enabled by adding -Ddebug=on to your cmake line for OpenMC.

I built the library with this flag, but when I load it inside gdb/rocgdb I see "Reading symbols from openmc ... (No debugging symbols found in openmc)". Is it supposed to be like that?

@yashssh
Contributor

yashssh commented Jul 27, 2023

> @yashssh You might want to do what the error suggests: file a bug for the AMDGPU backend crash here.

Opened issue #64163

@yashssh
Contributor

yashssh commented Aug 1, 2023

I'm stuck trying to extract any meaningful information to proceed further. Loading the binaries in rocgdb didn't work, as highlighted in previous comments. I also tried setting all the HIP environment variables but didn't see anything I could use. Any pointers on how I can proceed? Maybe convert it to an assert failure or something else that's easy to pinpoint as a compiler failure?

@jdoerfert @shiltian

@arsenm
Contributor

arsenm commented Aug 1, 2023

> I'm stuck in trying to extract any meaningful information to proceed further. Loading the binaries in RocGdb didn't work as highlighted in previous comments, I also tried setting up all the hip environment variables but didn't see anything I can use. Any pointers on how I can proceed? Maybe convert it to an assert failure or something else that's easy to pinpoint as a compiler failure?

You don't need debug info to load in rocgdb. Just remove the -g compile flags, you can still at least see what kernel is executing. Also, you could start by fixing the debug info assert (these happen regularly and usually aren't that complex to fix)
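
One way to act on this suggestion is to run the failing test under rocgdb in batch mode and capture the backtrace at the point of the fault. The sketch below is an illustration only, not something from this thread. It assumes `rocgdb` and `openmc` are on your PATH and relies only on standard gdb options (`--batch`, `-ex`, `--args`) and commands (`run`, `bt`, `info threads`), which rocgdb inherits from gdb. The helper names `rocgdb_command` and `run_under_rocgdb` are hypothetical, and how much kernel detail the backtrace shows without debug info is not guaranteed.

```python
import shlex
import subprocess


def rocgdb_command(program_args, gdb_commands=("run", "bt", "info threads")):
    """Build a batch-mode rocgdb invocation for the given program."""
    cmd = ["rocgdb", "--batch"]
    for gdb_cmd in gdb_commands:
        # Each -ex command runs in order; bt executes once `run` stops.
        cmd += ["-ex", gdb_cmd]
    # Everything after --args is the program and its own arguments.
    return cmd + ["--args", *program_args]


def run_under_rocgdb(test_dir, program_args=("openmc", "--event")):
    """Run the program under rocgdb in test_dir and return the debugger output."""
    cmd = rocgdb_command(program_args)
    print("running:", shlex.join(cmd))
    proc = subprocess.run(
        cmd,
        cwd=test_dir,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return proc.stdout
```

Calling `run_under_rocgdb("openmc_offloading_benchmarks/progression_tests/small")` returns the debugger output as text, which can then be searched for the name of the kernel that was executing when the fault occurred.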
