[AMDGPU][SILowerSGPRSpills][OpenMP] Commit breaks OpenMC on AMD MI250 #63983
Comments
@llvm/issue-subscribers-backend-amdgpu |
Is this crash specific to only MI250 GPUs? |
As a long shot you can try https://reviews.llvm.org/D145329 |
Hi @jtramm, I can see the error that you shared, command I used |
Glad to hear you're able to reproduce the error! At first glance, it seems like it would be tricky to extract kernels from OpenMC, as they are each quite large and involve accessing many complex hierarchies of data structures. There's probably 10k lines of code that runs on device, and much of the host code is spent initializing data structures to be passed to the device for the kernels to use. If looking to manually reduce the program size, it would be a huge effort, and may only reduce the program size by a modest fraction. There may be automated tools for cloning memory states and re-running kernels that could be used to automate the extraction process? |
Thanks, John! I can try reducing the program but the first step of isolating the test from the build environment seems tricky to me. Will it be possible to get a standalone reproducer? |
Once OpenMC is compiled and your environment is set up, you can navigate to one of the progression tests in the benchmark repository, e.g.,
To know if it ran correctly or not, you can compare OpenMC's output against the |
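Since the failure sometimes shows up only as slightly incorrect results, a numeric comparison with a tolerance can be more useful than a plain text diff. The following is a minimal sketch of such a check, not part of OpenMC or the builder script: the function names, the tolerance, and the sample output lines in the usage note are all hypothetical, and it assumes both outputs are plain text containing the same numbers in the same order.

```python
import math
import re

# Matches plain and scientific-notation numbers such as 1.23456, .5, 42, 1e-5.
_NUMBER = re.compile(r"[-+]?(?:\d+\.\d*|\.\d+|\d+)(?:[eE][-+]?\d+)?")


def extract_numbers(text):
    """Return every numeric token found in the text, in order, as floats."""
    return [float(tok) for tok in _NUMBER.findall(text)]


def outputs_match(actual_text, expected_text, rel_tol=1e-6):
    """Compare two text outputs number by number.

    Returns False if the outputs contain a different count of numbers or if
    any pair differs by more than the (assumed) relative tolerance.
    """
    actual = extract_numbers(actual_text)
    expected = extract_numbers(expected_text)
    if len(actual) != len(expected):
        return False
    return all(
        math.isclose(a, e, rel_tol=rel_tol) for a, e in zip(actual, expected)
    )
```

For example, with made-up output lines, `outputs_match("k-effective = 1.23456 +/- 0.00012", "k-effective = 1.23999 +/- 0.00012")` returns `False`, flagging a run that finished cleanly but drifted from the reference.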
If I understand correctly |
Yes, when building openmc, a I don't think it would be useful to run OpenMC's included regression tests. If the simple pincell model at |
Is debug build broken? I can't build the library after adding
|
I'm not sure about the llvm crash, but if you want to add debugging flags to the OpenMC build, these can be enabled by adding |
@yashssh You might want to do what the error suggests and file a bug for the AMDGPU backend crash here. |
I built the library with this flag but when I load it inside gdb/rocgdb I see |
I'm stuck trying to extract any meaningful information to proceed further. Loading the binaries in rocgdb didn't work, as noted in my previous comments. I also tried setting all the HIP environment variables but didn't see anything I could use. Any pointers on how I can proceed? Maybe convert it to an assert failure or something else that's easy to pinpoint as a compiler failure? |
You don't need debug info to load in rocgdb. Just remove the -g compile flags, you can still at least see what kernel is executing. Also, you could start by fixing the debug info assert (these happen regularly and usually aren't that complex to fix) |
Commit 7a98f08 breaks the OpenMC app when running on AMD GPUs via OpenMP offloading. When this commit is reverted, OpenMC runs correctly.
The behavior we see with OpenMC with this commit varies. Sometimes it runs cleanly but produces slightly incorrect results, and sometimes OpenMC crashes at runtime with a variety of memory-related errors, e.g.:
OpenMC can be downloaded, compiled, installed, and tested for correctness via the script at: https://github.com/jtramm/openmc_offloading_builder
I'm also happy to test out any proposed patches. In the interim, I'd vote that we revert 7a98f08 in main until a fix is found.
Another notable issue with 7a98f08 is that it increases OpenMC's compile time from about 10 minutes up to 15 minutes. No significant performance gains are noted from the patch, so in OpenMC's case at least, the extra compile time doesn't seem to be worth it.