[AMDGPU] Add flag to force emit s_nop #117839
base: main
Conversation
Change-Id: I1bddb498d73e0138d14a8cb312082e8794da2e47
@llvm/pr-subscribers-backend-amdgpu
Author: Jeffrey Byrnes (jrbyrnes)
Changes
We have amdgpu-waitcnt-forcezero, but that causes a variable wait based on outstanding memory instructions. This allows specifying a fixed number of stall cycles per instruction, which is useful for debugging. I'm not sure if there is already an equivalent feature.
Full diff: https://github.com/llvm/llvm-project/pull/117839.diff
3 Files Affected:
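For context on how the padding shows up in the generated code: each s_nop immediate N stalls for N+1 cycles, so a request for more than 8 cycles is emitted as a run of s_nop instructions. This is why the GCN-NOP20 checks below show s_nop 7, s_nop 7, s_nop 3 for a 20-cycle padding. A minimal C++ sketch of that chunking, assuming an insertNoops-style helper (the function name and signature here are illustrative, not taken from this patch):

// Sketch only: materialize `Quantity` stall cycles as S_NOP instructions.
// Each S_NOP immediate N stalls for N+1 cycles, i.e. at most 8 cycles per nop,
// which yields the 7/7/3 pattern seen below for a 20-cycle request.
static void emitStallNops(MachineBasicBlock &MBB,
                          MachineBasicBlock::iterator InsertPt,
                          const SIInstrInfo *TII, unsigned Quantity) {
  const DebugLoc &DL = MBB.findDebugLoc(InsertPt);
  while (Quantity > 0) {
    unsigned Chunk = std::min(Quantity, 8u); // 20 -> chunks of 8, 8, 4
    Quantity -= Chunk;
    BuildMI(MBB, InsertPt, DL, TII->get(AMDGPU::S_NOP))
        .addImm(Chunk - 1);                  // immediates 7, 7, 3
  }
}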
diff --git a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
index 44afccb0690d0d..870db75bc65b1b 100644
--- a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
@@ -24,8 +24,8 @@ using namespace llvm;
namespace {
-struct MFMAPaddingRatioParser : public cl::parser<unsigned> {
- MFMAPaddingRatioParser(cl::Option &O) : cl::parser<unsigned>(O) {}
+struct PaddingRatioParser : public cl::parser<unsigned> {
+ PaddingRatioParser(cl::Option &O) : cl::parser<unsigned>(O) {}
bool parse(cl::Option &O, StringRef ArgName, StringRef Arg, unsigned &Value) {
if (Arg.getAsInteger(0, Value))
@@ -40,7 +40,7 @@ struct MFMAPaddingRatioParser : public cl::parser<unsigned> {
} // end anonymous namespace
-static cl::opt<unsigned, false, MFMAPaddingRatioParser>
+static cl::opt<unsigned, false, PaddingRatioParser>
MFMAPaddingRatio("amdgpu-mfma-padding-ratio", cl::init(0), cl::Hidden,
cl::desc("Fill a percentage of the latency between "
"neighboring MFMA with s_nops."));
@@ -49,6 +49,11 @@ static cl::opt<unsigned> MaxExhaustiveHazardSearch(
"amdgpu-max-exhaustive-hazard-search", cl::init(128), cl::Hidden,
cl::desc("Maximum function size for exhausive hazard search"));
+static cl::opt<unsigned, false, PaddingRatioParser>
+ NopPadding("amdgpu-snop-padding", cl::Hidden,
+ cl::desc("Insert a s_nop between every instruction for a given "
+ "number of cycles."));
+
//===----------------------------------------------------------------------===//
// Hazard Recognizer Implementation
//===----------------------------------------------------------------------===//
@@ -325,7 +330,9 @@ unsigned GCNHazardRecognizer::PreEmitNoops(MachineInstr *MI) {
unsigned W = PreEmitNoopsCommon(MI);
fixHazards(MI);
CurrCycleInstr = nullptr;
- return W;
+ unsigned NopPad =
+ NopPadding.getNumOccurrences() && !MI->isTerminator() ? NopPadding : 0;
+ return std::max(W, NopPad);
}
unsigned GCNHazardRecognizer::PreEmitNoopsCommon(MachineInstr *MI) {
diff --git a/llvm/test/CodeGen/AMDGPU/snop-padding-terminator.mir b/llvm/test/CodeGen/AMDGPU/snop-padding-terminator.mir
new file mode 100644
index 00000000000000..6960ac5b0b4bf6
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/snop-padding-terminator.mir
@@ -0,0 +1,71 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 --amdgpu-snop-padding=0 --run-pass=post-RA-hazard-rec %s -o - | FileCheck -check-prefixes=GCN %s
+# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 --amdgpu-snop-padding=1 --run-pass=post-RA-hazard-rec %s -o - | FileCheck -check-prefixes=GCN-NOP1 %s
+# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 --amdgpu-snop-padding=20 --run-pass=post-RA-hazard-rec %s -o - | FileCheck -check-prefixes=GCN-NOP20 %s
+
+---
+name: waitcnt-debug-non-first-terminators
+liveins:
+machineFunctionInfo:
+ isEntryFunction: true
+body: |
+ ; GCN-LABEL: name: waitcnt-debug-non-first-terminators
+ ; GCN: bb.0:
+ ; GCN-NEXT: successors: %bb.1(0x40000000), %bb.2(0x40000000)
+ ; GCN-NEXT: {{ $}}
+ ; GCN-NEXT: S_CBRANCH_SCC1 %bb.1, implicit $scc
+ ; GCN-NEXT: S_BRANCH %bb.2, implicit $scc
+ ; GCN-NEXT: {{ $}}
+ ; GCN-NEXT: bb.1:
+ ; GCN-NEXT: successors: %bb.2(0x80000000)
+ ; GCN-NEXT: {{ $}}
+ ; GCN-NEXT: S_NOP 0
+ ; GCN-NEXT: {{ $}}
+ ; GCN-NEXT: bb.2:
+ ; GCN-NEXT: S_NOP 0
+ ;
+ ; GCN-NOP1-LABEL: name: waitcnt-debug-non-first-terminators
+ ; GCN-NOP1: bb.0:
+ ; GCN-NOP1-NEXT: successors: %bb.1(0x40000000), %bb.2(0x40000000)
+ ; GCN-NOP1-NEXT: {{ $}}
+ ; GCN-NOP1-NEXT: S_CBRANCH_SCC1 %bb.1, implicit $scc
+ ; GCN-NOP1-NEXT: S_BRANCH %bb.2, implicit $scc
+ ; GCN-NOP1-NEXT: {{ $}}
+ ; GCN-NOP1-NEXT: bb.1:
+ ; GCN-NOP1-NEXT: successors: %bb.2(0x80000000)
+ ; GCN-NOP1-NEXT: {{ $}}
+ ; GCN-NOP1-NEXT: S_NOP 0
+ ; GCN-NOP1-NEXT: S_NOP 0
+ ; GCN-NOP1-NEXT: {{ $}}
+ ; GCN-NOP1-NEXT: bb.2:
+ ; GCN-NOP1-NEXT: S_NOP 0
+ ; GCN-NOP1-NEXT: S_NOP 0
+ ;
+ ; GCN-NOP20-LABEL: name: waitcnt-debug-non-first-terminators
+ ; GCN-NOP20: bb.0:
+ ; GCN-NOP20-NEXT: successors: %bb.1(0x40000000), %bb.2(0x40000000)
+ ; GCN-NOP20-NEXT: {{ $}}
+ ; GCN-NOP20-NEXT: S_CBRANCH_SCC1 %bb.1, implicit $scc
+ ; GCN-NOP20-NEXT: S_BRANCH %bb.2, implicit $scc
+ ; GCN-NOP20-NEXT: {{ $}}
+ ; GCN-NOP20-NEXT: bb.1:
+ ; GCN-NOP20-NEXT: successors: %bb.2(0x80000000)
+ ; GCN-NOP20-NEXT: {{ $}}
+ ; GCN-NOP20-NEXT: S_NOP 7
+ ; GCN-NOP20-NEXT: S_NOP 7
+ ; GCN-NOP20-NEXT: S_NOP 3
+ ; GCN-NOP20-NEXT: S_NOP 0
+ ; GCN-NOP20-NEXT: {{ $}}
+ ; GCN-NOP20-NEXT: bb.2:
+ ; GCN-NOP20-NEXT: S_NOP 7
+ ; GCN-NOP20-NEXT: S_NOP 7
+ ; GCN-NOP20-NEXT: S_NOP 3
+ ; GCN-NOP20-NEXT: S_NOP 0
+ bb.0:
+ S_CBRANCH_SCC1 %bb.1, implicit $scc
+ S_BRANCH %bb.2, implicit $scc
+ bb.1:
+ S_NOP 0
+ bb.2:
+ S_NOP 0
+...
diff --git a/llvm/test/CodeGen/AMDGPU/snop-padding.ll b/llvm/test/CodeGen/AMDGPU/snop-padding.ll
new file mode 100644
index 00000000000000..254507a54a2f75
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/snop-padding.ll
@@ -0,0 +1,124 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn -mcpu=gfx908 < %s | FileCheck -enable-var-scope -check-prefixes=GCN %s
+; RUN: llc -mtriple=amdgcn --amdgpu-snop-padding=0 -mcpu=gfx908 < %s | FileCheck -check-prefixes=GCN %s
+; RUN: llc -mtriple=amdgcn --amdgpu-snop-padding=1 -mcpu=gfx908 < %s | FileCheck -check-prefixes=GCN-NOP1 %s
+; RUN: llc -mtriple=amdgcn --amdgpu-snop-padding=20 -mcpu=gfx908 < %s | FileCheck -check-prefixes=GCN-NOP20 %s
+
+
+define amdgpu_kernel void @fadd_v4f32(ptr addrspace(1) %out, ptr addrspace(1) %in) #0 {
+; GCN-LABEL: fadd_v4f32:
+; GCN: ; %bb.0:
+; GCN-NEXT: s_load_dwordx4 s[8:11], s[4:5], 0x24
+; GCN-NEXT: v_mov_b32_e32 v4, 0
+; GCN-NEXT: s_waitcnt lgkmcnt(0)
+; GCN-NEXT: s_load_dwordx8 s[0:7], s[10:11], 0x0
+; GCN-NEXT: s_waitcnt lgkmcnt(0)
+; GCN-NEXT: v_mov_b32_e32 v0, s7
+; GCN-NEXT: v_mov_b32_e32 v1, s6
+; GCN-NEXT: v_mov_b32_e32 v5, s5
+; GCN-NEXT: v_mov_b32_e32 v6, s4
+; GCN-NEXT: v_add_f32_e32 v3, s3, v0
+; GCN-NEXT: v_add_f32_e32 v2, s2, v1
+; GCN-NEXT: v_add_f32_e32 v1, s1, v5
+; GCN-NEXT: v_add_f32_e32 v0, s0, v6
+; GCN-NEXT: global_store_dwordx4 v4, v[0:3], s[8:9]
+; GCN-NEXT: s_endpgm
+;
+; GCN-NOP1-LABEL: fadd_v4f32:
+; GCN-NOP1: ; %bb.0:
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: s_load_dwordx4 s[8:11], s[4:5], 0x24
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: v_mov_b32_e32 v4, 0
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: s_waitcnt lgkmcnt(0)
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: s_load_dwordx8 s[0:7], s[10:11], 0x0
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: s_waitcnt lgkmcnt(0)
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: v_mov_b32_e32 v0, s7
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: v_mov_b32_e32 v1, s6
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: v_mov_b32_e32 v5, s5
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: v_mov_b32_e32 v6, s4
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: v_add_f32_e32 v3, s3, v0
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: v_add_f32_e32 v2, s2, v1
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: v_add_f32_e32 v1, s1, v5
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: v_add_f32_e32 v0, s0, v6
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: global_store_dwordx4 v4, v[0:3], s[8:9]
+; GCN-NOP1-NEXT: s_endpgm
+;
+; GCN-NOP20-LABEL: fadd_v4f32:
+; GCN-NOP20: ; %bb.0:
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: s_load_dwordx4 s[8:11], s[4:5], 0x24
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: v_mov_b32_e32 v4, 0
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: s_waitcnt lgkmcnt(0)
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: s_load_dwordx8 s[0:7], s[10:11], 0x0
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: s_waitcnt lgkmcnt(0)
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: v_mov_b32_e32 v0, s7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: v_mov_b32_e32 v1, s6
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: v_mov_b32_e32 v5, s5
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: v_mov_b32_e32 v6, s4
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: v_add_f32_e32 v3, s3, v0
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: v_add_f32_e32 v2, s2, v1
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: v_add_f32_e32 v1, s1, v5
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: v_add_f32_e32 v0, s0, v6
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: global_store_dwordx4 v4, v[0:3], s[8:9]
+; GCN-NOP20-NEXT: s_endpgm
+ %b_ptr = getelementptr <4 x float>, ptr addrspace(1) %in, i32 1
+ %a = load <4 x float>, ptr addrspace(1) %in, align 16
+ %b = load <4 x float>, ptr addrspace(1) %b_ptr, align 16
+ %result = fadd <4 x float> %a, %b
+ store <4 x float> %result, ptr addrspace(1) %out, align 16
+ ret void
+}
Change-Id: Ia83c67144b7c23fdec2912c72dd30956eba17f03
Will this put nops inside bundles? See mask_hazard_getpc2
in test/CodeGen/AMDGPU/valu-mask-write-hazard.mir
for an example.
unsigned NopPad =
    NopPadding.getNumOccurrences() && !MI->isTerminator() ? NopPadding : 0;
Why don't you want a nop before a terminator?
There can't be one between terminators.
Because of a rule that all terminators have to be consecutive at the end of the block?
You could still have a nop before the first terminator.
For debugging purposes, I'd say that would be fine, because if a hazard happens, the first instruction in the following BB has a nop before it.
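A minimal sketch of the alternative raised above (allow padding before the first terminator, but never between terminators, so terminators stay contiguous at the end of the block); the helper name is hypothetical and not part of this patch:

// Hypothetical guard: permit padding before the first terminator of a block,
// but never between terminators.
static bool canPadBefore(const MachineInstr &MI) {
  if (!MI.isTerminator())
    return true;
  const MachineBasicBlock *MBB = MI.getParent();
  // MI is a terminator, so getFirstTerminator() cannot be end() here.
  return &MI == &*MBB->getFirstTerminator();
}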
Change-Id: Icdd0c8777ce360a8fcb4dc599f95263a59b8fc4f
Change-Id: I7de1185831662e8e8b8b5612c52f703a05a62d88