Skip to content

Conversation

jrbyrnes
Copy link
Contributor

We have amdgpu-waitcnt-forcezero but that causes variable wait based on outstanding memory instructions.

This allows to specifiy a certain number of stall cycles per instruction, which is useful for debugging. I'm not sure if there already is an equivalent feature.

Change-Id: I1bddb498d73e0138d14a8cb312082e8794da2e47
@llvmbot
Copy link
Member

llvmbot commented Nov 27, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Jeffrey Byrnes (jrbyrnes)

Changes

We have amdgpu-waitcnt-forcezero but that causes variable wait based on outstanding memory instructions.

This allows to specifiy a certain number of stall cycles per instruction, which is useful for debugging. I'm not sure if there already is an equivalent feature.


Full diff: https://github.com/llvm/llvm-project/pull/117839.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp (+11-4)
  • (added) llvm/test/CodeGen/AMDGPU/snop-padding-terminator.mir (+71)
  • (added) llvm/test/CodeGen/AMDGPU/snop-padding.ll (+124)
diff --git a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
index 44afccb0690d0d..870db75bc65b1b 100644
--- a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
@@ -24,8 +24,8 @@ using namespace llvm;
 
 namespace {
 
-struct MFMAPaddingRatioParser : public cl::parser<unsigned> {
-  MFMAPaddingRatioParser(cl::Option &O) : cl::parser<unsigned>(O) {}
+struct PaddingRatioParser : public cl::parser<unsigned> {
+  PaddingRatioParser(cl::Option &O) : cl::parser<unsigned>(O) {}
 
   bool parse(cl::Option &O, StringRef ArgName, StringRef Arg, unsigned &Value) {
     if (Arg.getAsInteger(0, Value))
@@ -40,7 +40,7 @@ struct MFMAPaddingRatioParser : public cl::parser<unsigned> {
 
 } // end anonymous namespace
 
-static cl::opt<unsigned, false, MFMAPaddingRatioParser>
+static cl::opt<unsigned, false, PaddingRatioParser>
     MFMAPaddingRatio("amdgpu-mfma-padding-ratio", cl::init(0), cl::Hidden,
                      cl::desc("Fill a percentage of the latency between "
                               "neighboring MFMA with s_nops."));
@@ -49,6 +49,11 @@ static cl::opt<unsigned> MaxExhaustiveHazardSearch(
     "amdgpu-max-exhaustive-hazard-search", cl::init(128), cl::Hidden,
     cl::desc("Maximum function size for exhausive hazard search"));
 
+static cl::opt<unsigned, false, PaddingRatioParser>
+    NopPadding("amdgpu-snop-padding", cl::Hidden,
+               cl::desc("Insert a s_nop between every instruction for a given "
+                        "number of cycles."));
+
 //===----------------------------------------------------------------------===//
 // Hazard Recognizer Implementation
 //===----------------------------------------------------------------------===//
@@ -325,7 +330,9 @@ unsigned GCNHazardRecognizer::PreEmitNoops(MachineInstr *MI) {
   unsigned W = PreEmitNoopsCommon(MI);
   fixHazards(MI);
   CurrCycleInstr = nullptr;
-  return W;
+  unsigned NopPad =
+      NopPadding.getNumOccurrences() && !MI->isTerminator() ? NopPadding : 0;
+  return std::max(W, NopPad);
 }
 
 unsigned GCNHazardRecognizer::PreEmitNoopsCommon(MachineInstr *MI) {
diff --git a/llvm/test/CodeGen/AMDGPU/snop-padding-terminator.mir b/llvm/test/CodeGen/AMDGPU/snop-padding-terminator.mir
new file mode 100644
index 00000000000000..6960ac5b0b4bf6
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/snop-padding-terminator.mir
@@ -0,0 +1,71 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 --amdgpu-snop-padding=0 --run-pass=post-RA-hazard-rec %s -o - | FileCheck -check-prefixes=GCN %s
+# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 --amdgpu-snop-padding=1 --run-pass=post-RA-hazard-rec %s -o - | FileCheck -check-prefixes=GCN-NOP1 %s
+# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 --amdgpu-snop-padding=20 --run-pass=post-RA-hazard-rec %s -o - | FileCheck -check-prefixes=GCN-NOP20 %s
+
+---
+name: waitcnt-debug-non-first-terminators
+liveins:
+machineFunctionInfo:
+  isEntryFunction: true
+body:             |
+  ; GCN-LABEL: name: waitcnt-debug-non-first-terminators
+  ; GCN: bb.0:
+  ; GCN-NEXT:   successors: %bb.1(0x40000000), %bb.2(0x40000000)
+  ; GCN-NEXT: {{  $}}
+  ; GCN-NEXT:   S_CBRANCH_SCC1 %bb.1, implicit $scc
+  ; GCN-NEXT:   S_BRANCH %bb.2, implicit $scc
+  ; GCN-NEXT: {{  $}}
+  ; GCN-NEXT: bb.1:
+  ; GCN-NEXT:   successors: %bb.2(0x80000000)
+  ; GCN-NEXT: {{  $}}
+  ; GCN-NEXT:   S_NOP 0
+  ; GCN-NEXT: {{  $}}
+  ; GCN-NEXT: bb.2:
+  ; GCN-NEXT:   S_NOP 0
+  ;
+  ; GCN-NOP1-LABEL: name: waitcnt-debug-non-first-terminators
+  ; GCN-NOP1: bb.0:
+  ; GCN-NOP1-NEXT:   successors: %bb.1(0x40000000), %bb.2(0x40000000)
+  ; GCN-NOP1-NEXT: {{  $}}
+  ; GCN-NOP1-NEXT:   S_CBRANCH_SCC1 %bb.1, implicit $scc
+  ; GCN-NOP1-NEXT:   S_BRANCH %bb.2, implicit $scc
+  ; GCN-NOP1-NEXT: {{  $}}
+  ; GCN-NOP1-NEXT: bb.1:
+  ; GCN-NOP1-NEXT:   successors: %bb.2(0x80000000)
+  ; GCN-NOP1-NEXT: {{  $}}
+  ; GCN-NOP1-NEXT:   S_NOP 0
+  ; GCN-NOP1-NEXT:   S_NOP 0
+  ; GCN-NOP1-NEXT: {{  $}}
+  ; GCN-NOP1-NEXT: bb.2:
+  ; GCN-NOP1-NEXT:   S_NOP 0
+  ; GCN-NOP1-NEXT:   S_NOP 0
+  ;
+  ; GCN-NOP20-LABEL: name: waitcnt-debug-non-first-terminators
+  ; GCN-NOP20: bb.0:
+  ; GCN-NOP20-NEXT:   successors: %bb.1(0x40000000), %bb.2(0x40000000)
+  ; GCN-NOP20-NEXT: {{  $}}
+  ; GCN-NOP20-NEXT:   S_CBRANCH_SCC1 %bb.1, implicit $scc
+  ; GCN-NOP20-NEXT:   S_BRANCH %bb.2, implicit $scc
+  ; GCN-NOP20-NEXT: {{  $}}
+  ; GCN-NOP20-NEXT: bb.1:
+  ; GCN-NOP20-NEXT:   successors: %bb.2(0x80000000)
+  ; GCN-NOP20-NEXT: {{  $}}
+  ; GCN-NOP20-NEXT:   S_NOP 7
+  ; GCN-NOP20-NEXT:   S_NOP 7
+  ; GCN-NOP20-NEXT:   S_NOP 3
+  ; GCN-NOP20-NEXT:   S_NOP 0
+  ; GCN-NOP20-NEXT: {{  $}}
+  ; GCN-NOP20-NEXT: bb.2:
+  ; GCN-NOP20-NEXT:   S_NOP 7
+  ; GCN-NOP20-NEXT:   S_NOP 7
+  ; GCN-NOP20-NEXT:   S_NOP 3
+  ; GCN-NOP20-NEXT:   S_NOP 0
+  bb.0:
+    S_CBRANCH_SCC1 %bb.1, implicit $scc
+    S_BRANCH %bb.2, implicit $scc
+  bb.1:
+    S_NOP 0
+  bb.2:
+    S_NOP 0
+...
diff --git a/llvm/test/CodeGen/AMDGPU/snop-padding.ll b/llvm/test/CodeGen/AMDGPU/snop-padding.ll
new file mode 100644
index 00000000000000..254507a54a2f75
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/snop-padding.ll
@@ -0,0 +1,124 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn -mcpu=gfx908 < %s | FileCheck -enable-var-scope -check-prefixes=GCN %s
+; RUN: llc -mtriple=amdgcn  --amdgpu-snop-padding=0 -mcpu=gfx908 < %s | FileCheck -check-prefixes=GCN %s
+; RUN: llc -mtriple=amdgcn  --amdgpu-snop-padding=1 -mcpu=gfx908 < %s | FileCheck -check-prefixes=GCN-NOP1 %s
+; RUN: llc -mtriple=amdgcn  --amdgpu-snop-padding=20 -mcpu=gfx908 < %s | FileCheck -check-prefixes=GCN-NOP20 %s
+
+
+define amdgpu_kernel void @fadd_v4f32(ptr addrspace(1) %out, ptr addrspace(1) %in) #0 {
+; GCN-LABEL: fadd_v4f32:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x24
+; GCN-NEXT:    v_mov_b32_e32 v4, 0
+; GCN-NEXT:    s_waitcnt lgkmcnt(0)
+; GCN-NEXT:    s_load_dwordx8 s[0:7], s[10:11], 0x0
+; GCN-NEXT:    s_waitcnt lgkmcnt(0)
+; GCN-NEXT:    v_mov_b32_e32 v0, s7
+; GCN-NEXT:    v_mov_b32_e32 v1, s6
+; GCN-NEXT:    v_mov_b32_e32 v5, s5
+; GCN-NEXT:    v_mov_b32_e32 v6, s4
+; GCN-NEXT:    v_add_f32_e32 v3, s3, v0
+; GCN-NEXT:    v_add_f32_e32 v2, s2, v1
+; GCN-NEXT:    v_add_f32_e32 v1, s1, v5
+; GCN-NEXT:    v_add_f32_e32 v0, s0, v6
+; GCN-NEXT:    global_store_dwordx4 v4, v[0:3], s[8:9]
+; GCN-NEXT:    s_endpgm
+;
+; GCN-NOP1-LABEL: fadd_v4f32:
+; GCN-NOP1:       ; %bb.0:
+; GCN-NOP1-NEXT:    s_nop 0
+; GCN-NOP1-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x24
+; GCN-NOP1-NEXT:    s_nop 0
+; GCN-NOP1-NEXT:    v_mov_b32_e32 v4, 0
+; GCN-NOP1-NEXT:    s_nop 0
+; GCN-NOP1-NEXT:    s_waitcnt lgkmcnt(0)
+; GCN-NOP1-NEXT:    s_nop 0
+; GCN-NOP1-NEXT:    s_load_dwordx8 s[0:7], s[10:11], 0x0
+; GCN-NOP1-NEXT:    s_nop 0
+; GCN-NOP1-NEXT:    s_waitcnt lgkmcnt(0)
+; GCN-NOP1-NEXT:    s_nop 0
+; GCN-NOP1-NEXT:    v_mov_b32_e32 v0, s7
+; GCN-NOP1-NEXT:    s_nop 0
+; GCN-NOP1-NEXT:    v_mov_b32_e32 v1, s6
+; GCN-NOP1-NEXT:    s_nop 0
+; GCN-NOP1-NEXT:    v_mov_b32_e32 v5, s5
+; GCN-NOP1-NEXT:    s_nop 0
+; GCN-NOP1-NEXT:    v_mov_b32_e32 v6, s4
+; GCN-NOP1-NEXT:    s_nop 0
+; GCN-NOP1-NEXT:    v_add_f32_e32 v3, s3, v0
+; GCN-NOP1-NEXT:    s_nop 0
+; GCN-NOP1-NEXT:    v_add_f32_e32 v2, s2, v1
+; GCN-NOP1-NEXT:    s_nop 0
+; GCN-NOP1-NEXT:    v_add_f32_e32 v1, s1, v5
+; GCN-NOP1-NEXT:    s_nop 0
+; GCN-NOP1-NEXT:    v_add_f32_e32 v0, s0, v6
+; GCN-NOP1-NEXT:    s_nop 0
+; GCN-NOP1-NEXT:    global_store_dwordx4 v4, v[0:3], s[8:9]
+; GCN-NOP1-NEXT:    s_endpgm
+;
+; GCN-NOP20-LABEL: fadd_v4f32:
+; GCN-NOP20:       ; %bb.0:
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 3
+; GCN-NOP20-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x24
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 3
+; GCN-NOP20-NEXT:    v_mov_b32_e32 v4, 0
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 3
+; GCN-NOP20-NEXT:    s_waitcnt lgkmcnt(0)
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 3
+; GCN-NOP20-NEXT:    s_load_dwordx8 s[0:7], s[10:11], 0x0
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 3
+; GCN-NOP20-NEXT:    s_waitcnt lgkmcnt(0)
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 3
+; GCN-NOP20-NEXT:    v_mov_b32_e32 v0, s7
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 3
+; GCN-NOP20-NEXT:    v_mov_b32_e32 v1, s6
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 3
+; GCN-NOP20-NEXT:    v_mov_b32_e32 v5, s5
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 3
+; GCN-NOP20-NEXT:    v_mov_b32_e32 v6, s4
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 3
+; GCN-NOP20-NEXT:    v_add_f32_e32 v3, s3, v0
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 3
+; GCN-NOP20-NEXT:    v_add_f32_e32 v2, s2, v1
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 3
+; GCN-NOP20-NEXT:    v_add_f32_e32 v1, s1, v5
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 3
+; GCN-NOP20-NEXT:    v_add_f32_e32 v0, s0, v6
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 7
+; GCN-NOP20-NEXT:    s_nop 3
+; GCN-NOP20-NEXT:    global_store_dwordx4 v4, v[0:3], s[8:9]
+; GCN-NOP20-NEXT:    s_endpgm
+  %b_ptr = getelementptr <4 x float>, ptr addrspace(1) %in, i32 1
+  %a = load <4 x float>, ptr addrspace(1) %in, align 16
+  %b = load <4 x float>, ptr addrspace(1) %b_ptr, align 16
+  %result = fadd <4 x float> %a, %b
+  store <4 x float> %result, ptr addrspace(1) %out, align 16
+  ret void
+}

Change-Id: Ia83c67144b7c23fdec2912c72dd30956eba17f03
Copy link
Contributor

@jayfoad jayfoad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this put nops inside bundles? See mask_hazard_getpc2 in test/CodeGen/AMDGPU/valu-mask-write-hazard.mir for an example.

Comment on lines +333 to +334
unsigned NopPad =
NopPadding.getNumOccurrences() && !MI->isTerminator() ? NopPadding : 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't you want nop before a terminator?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There can't be that between terminators.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of a rule that all terminators have to be consecutive at the end of the block?

You could still have a nop before the first terminator.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the debug purpose, I'd say that would be fine because if hazard happens, the first instruction in the following BB has a nop before it.

Change-Id: Icdd0c8777ce360a8fcb4dc599f95263a59b8fc4f
Change-Id: I7de1185831662e8e8b8b5612c52f703a05a62d88
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants