Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[libomptarget][nextgen-plugin] Use SCRELEASE/SCACQUIRE in packet headers #85678

Merged
merged 1 commit into from
Mar 20, 2024

Conversation

doru1004
Copy link
Contributor

@doru1004 doru1004 commented Mar 18, 2024

This patch updates the construction of packet headers to replace the usage of ACQUIRE/RELEASE with SCACQUIRE/SCRELEASE which is now recommended.
The patch also fixes a potential source of problems by ensuring the atomic store operation refers to the header field directly and doesn't rely on it being the first field in the packet struct.

Documentation:

HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE
Acquire fence scope. The value of this sub-field determines the scope and type of the memory fence
operation applied before the packet enters the active phase. An acquire fence ensures that any
subsequent global segment or image loads by any unit of execution that belongs to a dispatch that has
not yet entered the active phase on any queue of the same kernel agent, sees any data previously
released at the scopes specified by the acquire fence. The value of this sub-field must be one of hsa_
fence_scope_t.
HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE
Deprecated: Renamed as HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE.
HSA_PACKET_HEADER_SCRELEASE_FENCE_SCOPE
Release fence scope. The value of this sub-field determines the scope and type of the memory fence
operation applied after kernel completion but before the packet is completed. A release fence makes any
global segment or image data that was stored by any unit of execution that belonged to a dispatch that
has completed the active phase on any queue of the same kernel agent visible in all the scopes specified
by the release fence. The value of this sub-field must be one of hsa_fence_scope_t.
HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE
Deprecated: Renamed as HSA_PACKET_HEADER_SCRELEASE_FENCE_SCOPE.

Source: https://hsafoundation.com/wp-content/uploads/2021/02/HSA-Runtime-1.2.pdf

@llvmbot
Copy link
Collaborator

llvmbot commented Mar 18, 2024

@llvm/pr-subscribers-libc

@llvm/pr-subscribers-backend-amdgpu

Author: Gheorghe-Teodor Bercea (doru1004)

Changes

This patch updates the construction of packet headers to replace the usage of ACQUIRE/RELEASE with SCACQUIRE/SCRELEASE which is now recommended.
The patch also fixes a potential source of problems by ensuring the atomic store operation refers to the header field directly and doesn't rely on it being the first field in the packet struct.


Full diff: https://github.com/llvm/llvm-project/pull/85678.diff

1 Files Affected:

  • (modified) openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp (+10-11)
diff --git a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
index fce7454bf2800d..886d8732d088fc 100644
--- a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
+++ b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
@@ -826,15 +826,14 @@ struct AMDGPUQueueTy {
   /// Assumes the queue lock is acquired.
   void publishKernelPacket(uint64_t PacketId, uint16_t Setup,
                            hsa_kernel_dispatch_packet_t *Packet) {
-    uint32_t *PacketPtr = reinterpret_cast<uint32_t *>(Packet);
-
-    uint16_t Header = HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE;
-    Header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE;
-    Header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE;
+    uint16_t Header =
+        (HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE) |
+        (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE) |
+        (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCRELEASE_FENCE_SCOPE);
 
     // Publish the packet. Do not modify the package after this point.
     uint32_t HeaderWord = Header | (Setup << 16u);
-    __atomic_store_n(PacketPtr, HeaderWord, __ATOMIC_RELEASE);
+    __atomic_store_n((uint32_t *)&Packet->header, HeaderWord, __ATOMIC_RELEASE);
 
     // Signal the doorbell about the published packet.
     hsa_signal_store_relaxed(Queue->doorbell_signal, PacketId);
@@ -845,15 +844,15 @@ struct AMDGPUQueueTy {
   /// barrier dependencies (signals) are satisfied. Assumes the queue is locked
   void publishBarrierPacket(uint64_t PacketId,
                             hsa_barrier_and_packet_t *Packet) {
-    uint32_t *PacketPtr = reinterpret_cast<uint32_t *>(Packet);
     uint16_t Setup = 0;
-    uint16_t Header = HSA_PACKET_TYPE_BARRIER_AND << HSA_PACKET_HEADER_TYPE;
-    Header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE;
-    Header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE;
+    uint16_t Header =
+        (HSA_PACKET_TYPE_BARRIER_AND << HSA_PACKET_HEADER_TYPE) |
+        (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE) |
+        (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCRELEASE_FENCE_SCOPE);
 
     // Publish the packet. Do not modify the package after this point.
     uint32_t HeaderWord = Header | (Setup << 16u);
-    __atomic_store_n(PacketPtr, HeaderWord, __ATOMIC_RELEASE);
+    __atomic_store_n((uint32_t *)&Packet->header, HeaderWord, __ATOMIC_RELEASE);
 
     // Signal the doorbell about the published packet.
     hsa_signal_store_relaxed(Queue->doorbell_signal, PacketId);

@dhruvachak
Copy link
Contributor

Can you cite the source of this recommendation? I know that https://hsafoundation.com/wp-content/uploads/2021/02/HSA-Runtime-1.2.pdf recommends SCACQUIRE/SCRELEASE but I also see ROCm docs using ACQUIRE/RELEASE https://doc-sep4.readthedocs.io/en/latest/Tutorial/Optimizing-Dispatches.html

@doru1004
Copy link
Contributor Author

Can you cite the source of this recommendation? I know that https://hsafoundation.com/wp-content/uploads/2021/02/HSA-Runtime-1.2.pdf recommends SCACQUIRE/SCRELEASE but I also see ROCm docs using ACQUIRE/RELEASE https://doc-sep4.readthedocs.io/en/latest/Tutorial/Optimizing-Dispatches.html.

I updated the comment. Please let me know if this clarifies it. The HSA Runtime documents mentions the ACQUIRE/RELEASE versions as being deprecated.

@@ -826,15 +826,14 @@ struct AMDGPUQueueTy {
/// Assumes the queue lock is acquired.
void publishKernelPacket(uint64_t PacketId, uint16_t Setup,
hsa_kernel_dispatch_packet_t *Packet) {
uint32_t *PacketPtr = reinterpret_cast<uint32_t *>(Packet);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this removed and replaced with a C-style cast on a 16-byte value?

@jhuber6
Copy link
Contributor

jhuber6 commented Mar 18, 2024

I don't think we need to worry about the header moving honestly, that would be an incredibly ABI breaking change and it's called header for a reason. We already rely implicitly on this storing the two 16-byte values that are right next to eachother.

Right now we have "store 32-bytes at the beginning of the struct" and this turns that into "Store 32-bytes at a 16-byte pointer assuming that the out of bounds access is correct." We also do not want to make this two-16 byte stores because this should happen in a single operation. So I think it's fine the way it is.

We should probably change the name though, the documentation says it's depcreated and just makes them the same value anyway so it's NFC. From the HSA headers:

HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE = 9,
HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE = 9,

@doru1004
Copy link
Contributor Author

I don't think we need to worry about the header moving honestly, that would be an incredibly ABI breaking change and it's called header for a reason. We already rely implicitly on this storing the two 16-byte values that are right next to eachother.

Right now we have "store 32-bytes at the beginning of the struct" and this turns that into "Store 32-bytes at a 16-byte pointer assuming that the out of bounds access is correct." We also do not want to make this two-16 byte stores because this should happen in a single operation. So I think it's fine the way it is.

We should probably change the name though, the documentation says it's depcreated and just makes them the same value anyway so it's NFC. From the HSA headers:

HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE = 9,
HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE = 9,

This is your code no?

// Initialize the packet header and set the doorbell signal to begin execution
  // by the HSA runtime.
  uint16_t header =
      (HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE) |
      (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE) |
      (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCRELEASE_FENCE_SCOPE);
  uint32_t header_word = header | (setup << 16u);
  __atomic_store_n((uint32_t *)&packet->header, header_word, __ATOMIC_RELEASE);
  hsa_signal_store_relaxed(queue->doorbell_signal, packet_id);

@jhuber6
Copy link
Contributor

jhuber6 commented Mar 19, 2024

I don't think we need to worry about the header moving honestly, that would be an incredibly ABI breaking change and it's called header for a reason. We already rely implicitly on this storing the two 16-byte values that are right next to eachother.
Right now we have "store 32-bytes at the beginning of the struct" and this turns that into "Store 32-bytes at a 16-byte pointer assuming that the out of bounds access is correct." We also do not want to make this two-16 byte stores because this should happen in a single operation. So I think it's fine the way it is.
We should probably change the name though, the documentation says it's depcreated and just makes them the same value anyway so it's NFC. From the HSA headers:

HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE = 9,
HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE = 9,

This is your code no?

// Initialize the packet header and set the doorbell signal to begin execution
  // by the HSA runtime.
  uint16_t header =
      (HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE) |
      (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE) |
      (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCRELEASE_FENCE_SCOPE);
  uint32_t header_word = header | (setup << 16u);
  __atomic_store_n((uint32_t *)&packet->header, header_word, __ATOMIC_RELEASE);
  hsa_signal_store_relaxed(queue->doorbell_signal, packet_id);

It is, I should probably update it as well. I think I originally copied part of that from the old plugins somewhere.

@doru1004
Copy link
Contributor Author

It is, I should probably update it as well. I think I originally copied part of that from the old plugins somewhere.

That's ok no problem, I can make it consistent.

@jhuber6
Copy link
Contributor

jhuber6 commented Mar 19, 2024

It is, I should probably update it as well. I think I originally copied part of that from the old plugins somewhere.

That's ok no problem, I can make it consistent.

diff --git a/libc/utils/gpu/loader/amdgpu/Loader.cpp b/libc/utils/gpu/loader/amdgpu/Loader.cpp
index e3911eda2bd8..7fd45acfd47e 100644
--- a/libc/utils/gpu/loader/amdgpu/Loader.cpp
+++ b/libc/utils/gpu/loader/amdgpu/Loader.cpp
@@ -276,7 +276,8 @@ hsa_status_t launch_kernel(hsa_agent_t dev_agent, hsa_executable_t executable,
       (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE) |
       (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCRELEASE_FENCE_SCOPE);
   uint32_t header_word = header | (setup << 16u);
-  __atomic_store_n((uint32_t *)&packet->header, header_word, __ATOMIC_RELEASE);
+  __atomic_store_n(reinterpret_cast<uint32_t *>(packet), header_word,
+                   __ATOMIC_RELEASE);
   hsa_signal_store_relaxed(queue->doorbell_signal, packet_id);
 
   // Wait until the kernel has completed execution on the device. Periodically

Is what I'd do.

@doru1004 doru1004 merged commit 927308a into llvm:main Mar 20, 2024
5 checks passed
@doru1004 doru1004 deleted the fix-header-setting branch March 20, 2024 15:22
@doru1004 doru1004 restored the fix-header-setting branch March 20, 2024 15:38
doru1004 added a commit that referenced this pull request Mar 20, 2024
chencha3 pushed a commit to chencha3/llvm-project that referenced this pull request Mar 23, 2024
…ers (llvm#85678)

This patch updates the construction of packet headers to replace the
usage of ACQUIRE/RELEASE with SCACQUIRE/SCRELEASE which is now
recommended.
The patch also ensures consistency across kernel dispatches.
chencha3 pushed a commit to chencha3/llvm-project that referenced this pull request Mar 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants