Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[OpenMP][libomptarget] Fix potential atomics ordering bug #70503

Closed
wants to merge 1 commit into from

Conversation

jplehr
Copy link
Contributor

@jplehr jplehr commented Oct 27, 2023

This addresses a potential ordering bug in the AMDGPU plugin that may cause an assertion error at runtime, due to the Slot signal not being updated.
Original author @carlobertolli

This addresses a potential ordering bug in the AMDGPU plugin that may
cause an assertion error at runtime, due to the Slot signal not being
updated.
Original author @carlobertolli
@llvmbot
Copy link

llvmbot commented Oct 27, 2023

@llvm/pr-subscribers-backend-amdgpu

Author: Jan Patrick Lehr (jplehr)

Changes

This addresses a potential ordering bug in the AMDGPU plugin that may cause an assertion error at runtime, due to the Slot signal not being updated.
Original author @carlobertolli


Full diff: https://github.com/llvm/llvm-project/pull/70503.diff

1 Files Affected:

  • (modified) openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp (+4-4)
diff --git a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
index 756c5003b0d542c..c1502d680a3170b 100644
--- a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
+++ b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
@@ -1035,14 +1035,14 @@ struct AMDGPUStreamTy {
   /// should be executed. Notice we use the post action mechanism to codify the
   /// asynchronous operation.
   static bool asyncActionCallback(hsa_signal_value_t Value, void *Args) {
-    StreamSlotTy *Slot = reinterpret_cast<StreamSlotTy *>(Args);
-    assert(Slot && "Invalid slot");
-    assert(Slot->Signal && "Invalid signal");
-
     // This thread is outside the stream mutex. Make sure the thread sees the
     // changes on the slot.
     std::atomic_thread_fence(std::memory_order_acquire);
 
+    StreamSlotTy *Slot = reinterpret_cast<StreamSlotTy *>(Args);
+    assert(Slot && "Invalid slot");
+    assert(Slot->Signal && "Invalid signal");
+
     // Peform the operation.
     if (auto Err = Slot->performAction())
       FATAL_MESSAGE(1, "Error peforming post action: %s",

@shiltian
Copy link
Contributor

This patch looks able to fix #65811?

@carlobertolli
Copy link
Member

carlobertolli commented Oct 27, 2023

This patch looks able to fix #65811?

Partially. The problem @ye-luo reported is still there after this patch is applied but this change is necessary as a first progress towards resolution of the bug.

@ye-luo
Copy link
Contributor

ye-luo commented Oct 27, 2023

I don't think the fence in the plugin is doing anything useful and they should be removed. hsa is responsible for manipulating the fence.

Comment on lines +1042 to +1043
StreamSlotTy *Slot = reinterpret_cast<StreamSlotTy *>(Args);
assert(Slot && "Invalid slot");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part of the assert can stay in the function prolog

// This thread is outside the stream mutex. Make sure the thread sees the
// changes on the slot.
std::atomic_thread_fence(std::memory_order_acquire);

StreamSlotTy *Slot = reinterpret_cast<StreamSlotTy *>(Args);
assert(Slot && "Invalid slot");
assert(Slot->Signal && "Invalid signal");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming the Signal field is some kind of atomic type?

Copy link
Contributor

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still relevant?

@jplehr
Copy link
Contributor Author

jplehr commented Feb 7, 2024

Will need to reevaluate. We have seen sporadic assertion errors in the buildbot about the Slot or the Signal.

@ye-luo
Copy link
Contributor

ye-luo commented Feb 7, 2024

Will need to reevaluate. We have seen sporadic assertion errors in the buildbot about the Slot or the Signal.

There are known issues in HSA/driver causing assertion failure around slot/signal in the plugin. ROCm/ROCm#2616
AMD already reordered the lines https://github.com/ROCm/llvm-project/blob/364e1dce640fa1ad38a5d46ad8d4adc3dc6c29aa/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp#L1662 and the issue remains.

@jplehr
Copy link
Contributor Author

jplehr commented Feb 9, 2024

Yes, the known issue is about the signal dependencies and their synchronization. I remember that we were seeing the assertion about the Slot failing on the buildbots sometimes. I'm not sure that is related to the missed synchronization.

But the assertion error on the Slot happened very rarely, so I would need to re-evaluate.

@jplehr
Copy link
Contributor Author

jplehr commented Apr 23, 2024

I’m going to close this PR and, should the need arise, open a new one.

@jplehr jplehr closed this Apr 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants