[OpenMP][libomptarget] Fix potential atomics ordering bug #70503
This addresses a potential ordering bug in the AMDGPU plugin that may cause an assertion error at runtime, due to the Slot signal not being updated. Original author @carlobertolli
@llvm/pr-subscribers-backend-amdgpu
Author: Jan Patrick Lehr (jplehr)
Changes
This addresses a potential ordering bug in the AMDGPU plugin that may cause an assertion error at runtime, due to the Slot signal not being updated.
Full diff: https://github.com/llvm/llvm-project/pull/70503.diff
1 Files Affected:
diff --git a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
index 756c5003b0d542c..c1502d680a3170b 100644
--- a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
+++ b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
@@ -1035,14 +1035,14 @@ struct AMDGPUStreamTy {
/// should be executed. Notice we use the post action mechanism to codify the
/// asynchronous operation.
static bool asyncActionCallback(hsa_signal_value_t Value, void *Args) {
- StreamSlotTy *Slot = reinterpret_cast<StreamSlotTy *>(Args);
- assert(Slot && "Invalid slot");
- assert(Slot->Signal && "Invalid signal");
-
// This thread is outside the stream mutex. Make sure the thread sees the
// changes on the slot.
std::atomic_thread_fence(std::memory_order_acquire);
+ StreamSlotTy *Slot = reinterpret_cast<StreamSlotTy *>(Args);
+ assert(Slot && "Invalid slot");
+ assert(Slot->Signal && "Invalid signal");
+
// Peform the operation.
if (auto Err = Slot->performAction())
FATAL_MESSAGE(1, "Error peforming post action: %s",
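For readers following the reordering in the diff, here is a minimal standalone sketch of fence-based publication in standard C++. It is not the plugin's code: `SlotSketch`, `consumeSlot`, `runSketch`, and the `Published` flag are hypothetical names, and the flag only stands in for whatever atomic operation hands the slot over to the callback thread. The sketch shows the general rule the patch relies on: an acquire fence orders only the reads that come after it, so a non-atomic field such as `Signal` should be read after the fence, while casting and null-checking the pointer itself touches no slot memory.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for the slot; only the field discussed here.
struct SlotSketch {
  int *Signal = nullptr; // plain, non-atomic field
};

// Consumer side, mirroring the ordering proposed in the patch.
bool consumeSlot(void *Args, const std::atomic<bool> &Published) {
  // Casting and null-checking the pointer reads no slot memory,
  // so it does not depend on the fence.
  SlotSketch *Slot = reinterpret_cast<SlotSketch *>(Args);
  assert(Slot && "Invalid slot");

  // Assumption: some atomic load observes the producer's store.
  // A fence alone, without such a load, establishes no ordering.
  while (!Published.load(std::memory_order_relaxed))
    std::this_thread::yield();

  // Reads after this fence see everything written before the
  // producer's release fence.
  std::atomic_thread_fence(std::memory_order_acquire);

  // Only now is it safe to read the slot's fields.
  return Slot->Signal != nullptr;
}

// Runs one producer/consumer round trip; returns what the consumer saw.
bool runSketch() {
  SlotSketch Slot;
  int SignalStorage = 0;
  std::atomic<bool> Published{false};

  std::thread Producer([&] {
    Slot.Signal = &SignalStorage;                        // plain write
    std::atomic_thread_fence(std::memory_order_release); // publish it
    Published.store(true, std::memory_order_relaxed);
  });

  bool Seen = consumeSlot(&Slot, Published);
  Producer.join();
  return Seen;
}
```

Calling `runSketch()` from a test or `main` returns `true` under these assumptions. Whether the HSA runtime already provides the equivalent ordering before it invokes the callback is a separate question that the sketch does not answer.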
Could this patch fix #65811?
I don't think the fences in the plugin are doing anything useful, and they should be removed. HSA is responsible for manipulating the fence.
StreamSlotTy *Slot = reinterpret_cast<StreamSlotTy *>(Args);
assert(Slot && "Invalid slot");
This part of the assert can stay in the function prolog
// This thread is outside the stream mutex. Make sure the thread sees the
// changes on the slot.
std::atomic_thread_fence(std::memory_order_acquire);

StreamSlotTy *Slot = reinterpret_cast<StreamSlotTy *>(Args);
assert(Slot && "Invalid slot");
assert(Slot->Signal && "Invalid signal");
I'm assuming the Signal field is some kind of atomic type?
Is this still relevant?
Will need to reevaluate. We have seen sporadic assertion errors in the buildbot about the Slot or the Signal.
There are known issues in HSA/driver causing assertion failures around slot/signal in the plugin. ROCm/ROCm#2616
Yes, the known issue is about the signal dependencies and their synchronization. I remember that we were seeing the assertion about the But the assertion error on the
I’m going to close this PR and, should the need arise, open a new one.