
AMDGPU/GFX12: Insert waitcnts before stores with scope_sys #82996

Merged
merged 1 commit into llvm:main on Feb 28, 2024

Conversation

@petar-avramovic (Collaborator) commented Feb 26, 2024

Insert waitcnts for loads and atomics before stores with system scope.
Scope is a field in the instruction encoding and corresponds to the desired
coherence level in the cache hierarchy.
Intrinsic stores can set the scope in the cache policy operand.
If the volatile keyword is used on a generic store, the memory legalizer will
set the scope to system; otherwise, generic stores get the lowest scope level
by default.
Waitcnts are not required if it is guaranteed that the memory is cached;
for example, Vulkan shaders can guarantee this.
TODO: implement a flag for frontends to give us a hint not to insert waits.
The Vulkan flag is expected to be implemented as a vulkan:private MMRA.
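
For illustration, here are the two ways a store ends up with system scope. The intrinsic form is taken from the test added in this patch (its last argument, the cache policy operand with value 24, is what the test prints as scope:SCOPE_SYS); the volatile generic-store form is a sketch of the case described above:

; Intrinsic store: the cache policy operand (last argument) carries the
; scope bits; the value 24 is printed as scope:SCOPE_SYS on GFX12.
call void @llvm.amdgcn.struct.buffer.store.i32(i32 %val, <4 x i32> %rsrc,
    i32 %vindex, i32 %voffset, i32 %soffset, i32 24)

; Generic store: volatile causes the memory legalizer to raise the scope
; to system (SCOPE_SYS).
store volatile i32 %val, ptr addrspace(1) %out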

@llvmbot (Collaborator) commented Feb 26, 2024

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Petar Avramovic (petar-avramovic)

Changes

Insert waitcnts for loads and atomics before stores with system scope.
Scope is a field in the instruction encoding and corresponds to the desired
coherence level in the cache hierarchy. Only intrinsic stores can set the
scope. Currently there is no reliable way to set the scope on generic stores;
they default to the lowest scope level.
Waitcnts are not required if it is guaranteed that the memory is cached;
for example, Vulkan shaders can guarantee this.
TODO: implement a flag for frontends to give us a hint not to insert waits.
The Vulkan flag is expected to be implemented as a vulkan:private MMRA.


Full diff: https://github.com/llvm/llvm-project/pull/82996.diff

5 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.h (+2)
  • (modified) llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp (+38)
  • (modified) llvm/lib/Target/AMDGPU/SOPInstructions.td (+1)
  • (added) llvm/test/CodeGen/AMDGPU/wait-for-stores-with-scope_sys.ll (+16)
  • (added) llvm/test/CodeGen/AMDGPU/wait-for-stores-with-scope_sys.mir (+22)
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.h b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
index d774826c1d08c0..a8a33a5fecb413 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
@@ -949,6 +949,8 @@ class SIInstrInfo final : public AMDGPUGenInstrInfo {
       return AMDGPU::S_WAIT_BVHCNT;
     case AMDGPU::S_WAIT_DSCNT_soft:
       return AMDGPU::S_WAIT_DSCNT;
+    case AMDGPU::S_WAIT_KMCNT_soft:
+      return AMDGPU::S_WAIT_KMCNT;
     default:
       return Opcode;
     }
diff --git a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
index f62e808b33e42b..8dba1fc8398442 100644
--- a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@@ -312,6 +312,10 @@ class SICacheControl {
                                               SIMemOp Op, bool IsVolatile,
                                               bool IsNonTemporal) const = 0;
 
+  virtual bool expandSystemScopeStore(MachineBasicBlock::iterator &MI) const {
+    return false;
+  };
+
   /// Inserts any necessary instructions at position \p Pos relative
   /// to instruction \p MI to ensure memory instructions before \p Pos of kind
   /// \p Op associated with address spaces \p AddrSpace have completed. Used
@@ -603,6 +607,8 @@ class SIGfx12CacheControl : public SIGfx11CacheControl {
                                       SIAtomicAddrSpace AddrSpace, SIMemOp Op,
                                       bool IsVolatile,
                                       bool IsNonTemporal) const override;
+
+  bool expandSystemScopeStore(MachineBasicBlock::iterator &MI) const override;
 };
 
 class SIMemoryLegalizer final : public MachineFunctionPass {
@@ -2381,6 +2387,34 @@ bool SIGfx12CacheControl::enableVolatileAndOrNonTemporal(
   return Changed;
 }
 
+bool SIGfx12CacheControl::expandSystemScopeStore(
+    MachineBasicBlock::iterator &MI) const {
+
+  MachineOperand *CPol = TII->getNamedOperand(*MI, OpName::cpol);
+  if (CPol && ((CPol->getImm() & CPol::SCOPE) == CPol::SCOPE_SYS)) {
+    // Stores with system scope (SCOPE_SYS) need to wait for:
+    // - loads or atomics(returning) - wait for {LOAD|SAMPLE|BVH|KM}CNT==0
+    // - non-returning-atomics       - wait for STORECNT==0
+    //   TODO: SIInsertWaitcnts will not always be able to remove STORECNT waits
+    //   since it does not distinguish atomics-with-return from regular stores.
+
+    // There is no need to wait if memory is cached (mtype != UC).
+    // For example shader-visible memory is cached.
+    // TODO: implement flag for frontend to give us a hint not to insert waits.
+    MachineBasicBlock &MBB = *MI->getParent();
+    DebugLoc DL = MI->getDebugLoc();
+
+    BuildMI(MBB, MI, DL, TII->get(S_WAIT_LOADCNT_soft)).addImm(0);
+    BuildMI(MBB, MI, DL, TII->get(S_WAIT_SAMPLECNT_soft)).addImm(0);
+    BuildMI(MBB, MI, DL, TII->get(S_WAIT_BVHCNT_soft)).addImm(0);
+    BuildMI(MBB, MI, DL, TII->get(S_WAIT_KMCNT_soft)).addImm(0);
+    BuildMI(MBB, MI, DL, TII->get(S_WAIT_STORECNT_soft)).addImm(0);
+    return true;
+  }
+
+  return false;
+}
+
 bool SIMemoryLegalizer::removeAtomicPseudoMIs() {
   if (AtomicPseudoMIs.empty())
     return false;
@@ -2467,6 +2501,10 @@ bool SIMemoryLegalizer::expandStore(const SIMemOpInfo &MOI,
   Changed |= CC->enableVolatileAndOrNonTemporal(
       MI, MOI.getInstrAddrSpace(), SIMemOp::STORE, MOI.isVolatile(),
       MOI.isNonTemporal());
+
+  // GFX12 specific, scope(desired coherence domain in cache hierarchy) is
+  // instruction field, do not confuse it with atomic scope.
+  Changed |= CC->expandSystemScopeStore(MI);
   return Changed;
 }
 
diff --git a/llvm/lib/Target/AMDGPU/SOPInstructions.td b/llvm/lib/Target/AMDGPU/SOPInstructions.td
index 0fe2845f8edc31..b5de311f8c58ce 100644
--- a/llvm/lib/Target/AMDGPU/SOPInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SOPInstructions.td
@@ -1601,6 +1601,7 @@ let SubtargetPredicate = isGFX12Plus in {
   def S_WAIT_SAMPLECNT_soft : SOPP_Pseudo <"s_soft_wait_samplecnt", (ins s16imm:$simm16), "$simm16">;
   def S_WAIT_BVHCNT_soft : SOPP_Pseudo <"s_soft_wait_bvhcnt", (ins s16imm:$simm16), "$simm16">;
   def S_WAIT_DSCNT_soft : SOPP_Pseudo <"s_soft_wait_dscnt", (ins s16imm:$simm16), "$simm16">;
+  def S_WAIT_KMCNT_soft : SOPP_Pseudo <"s_soft_wait_kmcnt", (ins s16imm:$simm16), "$simm16">;
 }
 
 def S_SETHALT : SOPP_Pseudo <"s_sethalt" , (ins i32imm:$simm16), "$simm16",
diff --git a/llvm/test/CodeGen/AMDGPU/wait-for-stores-with-scope_sys.ll b/llvm/test/CodeGen/AMDGPU/wait-for-stores-with-scope_sys.ll
new file mode 100644
index 00000000000000..ed06b8aef6cac9
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/wait-for-stores-with-scope_sys.ll
@@ -0,0 +1,16 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
+; RUN: llc -march=amdgcn -mcpu=gfx1200 -verify-machineinstrs < %s | FileCheck -check-prefix=GFX12 %s
+; RUN: llc -global-isel -march=amdgcn -mcpu=gfx1200 -verify-machineinstrs < %s | FileCheck -check-prefix=GFX12 %s
+
+define amdgpu_ps void @intrinsic_store_system_scope(i32 %val, <4 x i32> inreg %rsrc, i32 %vindex, i32 %voffset, i32 inreg %soffset) {
+; GFX12-LABEL: intrinsic_store_system_scope:
+; GFX12:       ; %bb.0:
+; GFX12-NEXT:    buffer_store_b32 v0, v[1:2], s[0:3], s4 idxen offen scope:SCOPE_SYS
+; GFX12-NEXT:    s_nop 0
+; GFX12-NEXT:    s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX12-NEXT:    s_endpgm
+  call void @llvm.amdgcn.struct.buffer.store.i32(i32 %val, <4 x i32> %rsrc, i32 %vindex, i32 %voffset, i32 %soffset, i32 24)
+  ret void
+}
+
+declare void @llvm.amdgcn.struct.buffer.store.i32(i32, <4 x i32>, i32, i32, i32, i32 immarg)
diff --git a/llvm/test/CodeGen/AMDGPU/wait-for-stores-with-scope_sys.mir b/llvm/test/CodeGen/AMDGPU/wait-for-stores-with-scope_sys.mir
new file mode 100644
index 00000000000000..463a70a555a192
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/wait-for-stores-with-scope_sys.mir
@@ -0,0 +1,22 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 4
+# RUN: llc -mtriple=amdgcn -mcpu=gfx1200 -run-pass=si-memory-legalizer  %s -o - | FileCheck -check-prefix=GFX12 %s
+
+---
+name: intrinsic_store_system_scope
+body: |
+  bb.0:
+    liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $sgpr4, $vgpr0, $vgpr1, $vgpr2
+
+    ; GFX12-LABEL: name: intrinsic_store_system_scope
+    ; GFX12: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $sgpr4, $vgpr0, $vgpr1, $vgpr2
+    ; GFX12-NEXT: {{  $}}
+    ; GFX12-NEXT: S_WAIT_LOADCNT_soft 0
+    ; GFX12-NEXT: S_WAIT_SAMPLECNT_soft 0
+    ; GFX12-NEXT: S_WAIT_BVHCNT_soft 0
+    ; GFX12-NEXT: S_WAIT_KMCNT_soft 0
+    ; GFX12-NEXT: S_WAIT_STORECNT_soft 0
+    ; GFX12-NEXT: BUFFER_STORE_DWORD_VBUFFER_BOTHEN_exact killed renamable $vgpr0, killed renamable $vgpr1_vgpr2, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, killed renamable $sgpr4, 0, 24, 0, implicit $exec :: (dereferenceable store (s32), align 1, addrspace 8)
+    ; GFX12-NEXT: S_ENDPGM 0
+    BUFFER_STORE_DWORD_VBUFFER_BOTHEN_exact killed renamable $vgpr0, killed renamable $vgpr1_vgpr2, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, killed renamable $sgpr4, 0, 24, 0, implicit $exec :: (dereferenceable store (s32), align 1, addrspace 8)
+    S_ENDPGM 0
+...


// GFX12 specific, scope(desired coherence domain in cache hierarchy) is
// instruction field, do not confuse it with atomic scope.
Changed |= CC->expandSystemScopeStore(MI);
Contributor:
Is this also needed for atomic stores? They return early on line 2495, so they won't hit this code.

Collaborator (Author):
As far as I know, no; it is needed only for non-atomic stores.

Contributor:
I don't see why it wouldn't be needed on a store-release? A store-release already waits, but it doesn't (explicitly) wait on all the counters that expandSystemScopeStore needs to wait on.

// For example shader-visible memory is cached.
// TODO: implement flag for frontend to give us a hint not to insert waits.
MachineBasicBlock &MBB = *MI->getParent();
DebugLoc DL = MI->getDebugLoc();
Contributor:
const ref
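
Presumably meaning to bind by const reference to avoid a copy, i.e. something like:

const DebugLoc &DL = MI->getDebugLoc();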

@@ -0,0 +1,16 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
; RUN: llc -march=amdgcn -mcpu=gfx1200 -verify-machineinstrs < %s | FileCheck -check-prefix=GFX12 %s
Contributor:
-global-isel=0
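
Presumably suggesting that the first RUN line select the SelectionDAG path explicitly, i.e. a sketch of the amended line:

; RUN: llc -global-isel=0 -march=amdgcn -mcpu=gfx1200 -verify-machineinstrs < %s | FileCheck -check-prefix=GFX12 %s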

MachineBasicBlock::iterator &MI) const {

MachineOperand *CPol = TII->getNamedOperand(*MI, OpName::cpol);
if (CPol && ((CPol->getImm() & CPol::SCOPE) == CPol::SCOPE_SYS)) {
Contributor:
invert condition & early return
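
That is, a shape along these lines (matching the revised code quoted later in this thread):

MachineOperand *CPol = TII->getNamedOperand(*MI, OpName::cpol);
if (!CPol || ((CPol->getImm() & CPol::SCOPE) != CPol::SCOPE_SYS))
  return false;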

@petar-avramovic force-pushed the scope_sys_store_waitcnts branch 2 times, most recently from b41acdc to 498816b on February 27, 2024 at 17:00

github-actions bot commented Feb 27, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@petar-avramovic (Collaborator, Author) commented:

Volatile stores set scope_sys; insert waits there as well.

@@ -2364,6 +2396,9 @@ bool SIGfx12CacheControl::enableVolatileAndOrNonTemporal(
if (IsVolatile) {
Changed |= setScope(MI, AMDGPU::CPol::SCOPE_SYS);

if (Op == SIMemOp::STORE)
Changed |= insertWaitsBeforeSystemScopeStore(MI);
Contributor:
It is a bit messy that we need this extra call to insertWaitsBeforeSystemScopeStore here, because the call to insertWait below modifies MI so it no longer refers to the store. But I guess it is OK.

// TODO: SIInsertWaitcnts will not always be able to remove STORECNT waits
// since it does not distinguish atomics-with-return from regular stores.
// There is no need to wait if memory is cached (mtype != UC).
// For example shader-visible memory is cached.
Contributor:
I don't understand the statement that "shader-visible memory is cached". Surely we are compiling a shader, so any memory the shader refers to is "shader-visible", so why do we need to worry about uncached memory?

MachineBasicBlock::iterator &MI) const {
MachineOperand *CPol = TII->getNamedOperand(*MI, OpName::cpol);
if (!CPol || ((CPol->getImm() & CPol::SCOPE) != CPol::SCOPE_SYS))
return false;
Contributor:
Returning early no longer makes this code any shorter or clearer.

@petar-avramovic merged commit 3e35ba5 into llvm:main on Feb 28, 2024
3 of 4 checks passed