[AMDGPU] Add flag to force emit s_nop #117839
base: main
Conversation
Change-Id: I1bddb498d73e0138d14a8cb312082e8794da2e47
@llvm/pr-subscribers-backend-amdgpu
Author: Jeffrey Byrnes (jrbyrnes)
Changes
We have amdgpu-waitcnt-forcezero, but that causes a variable wait based on outstanding memory instructions. This allows specifying a fixed number of stall cycles per instruction, which is useful for debugging. I'm not sure if there is already an equivalent feature.
Full diff: https://github.com/llvm/llvm-project/pull/117839.diff
3 Files Affected:
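For context on how the padding shows up in the generated code: each s_nop immediate N stalls for N+1 cycles, so a request for more than 8 cycles is emitted as a run of s_nop instructions. This is why the GCN-NOP20 checks below show s_nop 7, s_nop 7, s_nop 3 for a 20-cycle padding. A minimal C++ sketch of that chunking, assuming an insertNoops-style helper (the function name and signature here are illustrative, not taken from this patch):

// Sketch only: materialize `Quantity` stall cycles as S_NOP instructions.
// Each S_NOP immediate N stalls for N+1 cycles, i.e. at most 8 cycles per nop,
// which yields the 7/7/3 pattern seen below for a 20-cycle request.
static void emitStallNops(MachineBasicBlock &MBB,
                          MachineBasicBlock::iterator InsertPt,
                          const SIInstrInfo *TII, unsigned Quantity) {
  const DebugLoc &DL = MBB.findDebugLoc(InsertPt);
  while (Quantity > 0) {
    unsigned Chunk = std::min(Quantity, 8u); // 20 -> chunks of 8, 8, 4
    Quantity -= Chunk;
    BuildMI(MBB, InsertPt, DL, TII->get(AMDGPU::S_NOP))
        .addImm(Chunk - 1);                  // immediates 7, 7, 3
  }
}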
diff --git a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
index 44afccb0690d0d..870db75bc65b1b 100644
--- a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
@@ -24,8 +24,8 @@ using namespace llvm;
namespace {
-struct MFMAPaddingRatioParser : public cl::parser<unsigned> {
- MFMAPaddingRatioParser(cl::Option &O) : cl::parser<unsigned>(O) {}
+struct PaddingRatioParser : public cl::parser<unsigned> {
+ PaddingRatioParser(cl::Option &O) : cl::parser<unsigned>(O) {}
bool parse(cl::Option &O, StringRef ArgName, StringRef Arg, unsigned &Value) {
if (Arg.getAsInteger(0, Value))
@@ -40,7 +40,7 @@ struct MFMAPaddingRatioParser : public cl::parser<unsigned> {
} // end anonymous namespace
-static cl::opt<unsigned, false, MFMAPaddingRatioParser>
+static cl::opt<unsigned, false, PaddingRatioParser>
MFMAPaddingRatio("amdgpu-mfma-padding-ratio", cl::init(0), cl::Hidden,
cl::desc("Fill a percentage of the latency between "
"neighboring MFMA with s_nops."));
@@ -49,6 +49,11 @@ static cl::opt<unsigned> MaxExhaustiveHazardSearch(
"amdgpu-max-exhaustive-hazard-search", cl::init(128), cl::Hidden,
cl::desc("Maximum function size for exhausive hazard search"));
+static cl::opt<unsigned, false, PaddingRatioParser>
+ NopPadding("amdgpu-snop-padding", cl::Hidden,
+ cl::desc("Insert a s_nop between every instruction for a given "
+ "number of cycles."));
+
//===----------------------------------------------------------------------===//
// Hazard Recognizer Implementation
//===----------------------------------------------------------------------===//
@@ -325,7 +330,9 @@ unsigned GCNHazardRecognizer::PreEmitNoops(MachineInstr *MI) {
unsigned W = PreEmitNoopsCommon(MI);
fixHazards(MI);
CurrCycleInstr = nullptr;
- return W;
+ unsigned NopPad =
+ NopPadding.getNumOccurrences() && !MI->isTerminator() ? NopPadding : 0;
+ return std::max(W, NopPad);
}
unsigned GCNHazardRecognizer::PreEmitNoopsCommon(MachineInstr *MI) {
diff --git a/llvm/test/CodeGen/AMDGPU/snop-padding-terminator.mir b/llvm/test/CodeGen/AMDGPU/snop-padding-terminator.mir
new file mode 100644
index 00000000000000..6960ac5b0b4bf6
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/snop-padding-terminator.mir
@@ -0,0 +1,71 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 --amdgpu-snop-padding=0 --run-pass=post-RA-hazard-rec %s -o - | FileCheck -check-prefixes=GCN %s
+# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 --amdgpu-snop-padding=1 --run-pass=post-RA-hazard-rec %s -o - | FileCheck -check-prefixes=GCN-NOP1 %s
+# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 --amdgpu-snop-padding=20 --run-pass=post-RA-hazard-rec %s -o - | FileCheck -check-prefixes=GCN-NOP20 %s
+
+---
+name: waitcnt-debug-non-first-terminators
+liveins:
+machineFunctionInfo:
+ isEntryFunction: true
+body: |
+ ; GCN-LABEL: name: waitcnt-debug-non-first-terminators
+ ; GCN: bb.0:
+ ; GCN-NEXT: successors: %bb.1(0x40000000), %bb.2(0x40000000)
+ ; GCN-NEXT: {{ $}}
+ ; GCN-NEXT: S_CBRANCH_SCC1 %bb.1, implicit $scc
+ ; GCN-NEXT: S_BRANCH %bb.2, implicit $scc
+ ; GCN-NEXT: {{ $}}
+ ; GCN-NEXT: bb.1:
+ ; GCN-NEXT: successors: %bb.2(0x80000000)
+ ; GCN-NEXT: {{ $}}
+ ; GCN-NEXT: S_NOP 0
+ ; GCN-NEXT: {{ $}}
+ ; GCN-NEXT: bb.2:
+ ; GCN-NEXT: S_NOP 0
+ ;
+ ; GCN-NOP1-LABEL: name: waitcnt-debug-non-first-terminators
+ ; GCN-NOP1: bb.0:
+ ; GCN-NOP1-NEXT: successors: %bb.1(0x40000000), %bb.2(0x40000000)
+ ; GCN-NOP1-NEXT: {{ $}}
+ ; GCN-NOP1-NEXT: S_CBRANCH_SCC1 %bb.1, implicit $scc
+ ; GCN-NOP1-NEXT: S_BRANCH %bb.2, implicit $scc
+ ; GCN-NOP1-NEXT: {{ $}}
+ ; GCN-NOP1-NEXT: bb.1:
+ ; GCN-NOP1-NEXT: successors: %bb.2(0x80000000)
+ ; GCN-NOP1-NEXT: {{ $}}
+ ; GCN-NOP1-NEXT: S_NOP 0
+ ; GCN-NOP1-NEXT: S_NOP 0
+ ; GCN-NOP1-NEXT: {{ $}}
+ ; GCN-NOP1-NEXT: bb.2:
+ ; GCN-NOP1-NEXT: S_NOP 0
+ ; GCN-NOP1-NEXT: S_NOP 0
+ ;
+ ; GCN-NOP20-LABEL: name: waitcnt-debug-non-first-terminators
+ ; GCN-NOP20: bb.0:
+ ; GCN-NOP20-NEXT: successors: %bb.1(0x40000000), %bb.2(0x40000000)
+ ; GCN-NOP20-NEXT: {{ $}}
+ ; GCN-NOP20-NEXT: S_CBRANCH_SCC1 %bb.1, implicit $scc
+ ; GCN-NOP20-NEXT: S_BRANCH %bb.2, implicit $scc
+ ; GCN-NOP20-NEXT: {{ $}}
+ ; GCN-NOP20-NEXT: bb.1:
+ ; GCN-NOP20-NEXT: successors: %bb.2(0x80000000)
+ ; GCN-NOP20-NEXT: {{ $}}
+ ; GCN-NOP20-NEXT: S_NOP 7
+ ; GCN-NOP20-NEXT: S_NOP 7
+ ; GCN-NOP20-NEXT: S_NOP 3
+ ; GCN-NOP20-NEXT: S_NOP 0
+ ; GCN-NOP20-NEXT: {{ $}}
+ ; GCN-NOP20-NEXT: bb.2:
+ ; GCN-NOP20-NEXT: S_NOP 7
+ ; GCN-NOP20-NEXT: S_NOP 7
+ ; GCN-NOP20-NEXT: S_NOP 3
+ ; GCN-NOP20-NEXT: S_NOP 0
+ bb.0:
+ S_CBRANCH_SCC1 %bb.1, implicit $scc
+ S_BRANCH %bb.2, implicit $scc
+ bb.1:
+ S_NOP 0
+ bb.2:
+ S_NOP 0
+...
diff --git a/llvm/test/CodeGen/AMDGPU/snop-padding.ll b/llvm/test/CodeGen/AMDGPU/snop-padding.ll
new file mode 100644
index 00000000000000..254507a54a2f75
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/snop-padding.ll
@@ -0,0 +1,124 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn -mcpu=gfx908 < %s | FileCheck -enable-var-scope -check-prefixes=GCN %s
+; RUN: llc -mtriple=amdgcn --amdgpu-snop-padding=0 -mcpu=gfx908 < %s | FileCheck -check-prefixes=GCN %s
+; RUN: llc -mtriple=amdgcn --amdgpu-snop-padding=1 -mcpu=gfx908 < %s | FileCheck -check-prefixes=GCN-NOP1 %s
+; RUN: llc -mtriple=amdgcn --amdgpu-snop-padding=20 -mcpu=gfx908 < %s | FileCheck -check-prefixes=GCN-NOP20 %s
+
+
+define amdgpu_kernel void @fadd_v4f32(ptr addrspace(1) %out, ptr addrspace(1) %in) #0 {
+; GCN-LABEL: fadd_v4f32:
+; GCN: ; %bb.0:
+; GCN-NEXT: s_load_dwordx4 s[8:11], s[4:5], 0x24
+; GCN-NEXT: v_mov_b32_e32 v4, 0
+; GCN-NEXT: s_waitcnt lgkmcnt(0)
+; GCN-NEXT: s_load_dwordx8 s[0:7], s[10:11], 0x0
+; GCN-NEXT: s_waitcnt lgkmcnt(0)
+; GCN-NEXT: v_mov_b32_e32 v0, s7
+; GCN-NEXT: v_mov_b32_e32 v1, s6
+; GCN-NEXT: v_mov_b32_e32 v5, s5
+; GCN-NEXT: v_mov_b32_e32 v6, s4
+; GCN-NEXT: v_add_f32_e32 v3, s3, v0
+; GCN-NEXT: v_add_f32_e32 v2, s2, v1
+; GCN-NEXT: v_add_f32_e32 v1, s1, v5
+; GCN-NEXT: v_add_f32_e32 v0, s0, v6
+; GCN-NEXT: global_store_dwordx4 v4, v[0:3], s[8:9]
+; GCN-NEXT: s_endpgm
+;
+; GCN-NOP1-LABEL: fadd_v4f32:
+; GCN-NOP1: ; %bb.0:
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: s_load_dwordx4 s[8:11], s[4:5], 0x24
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: v_mov_b32_e32 v4, 0
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: s_waitcnt lgkmcnt(0)
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: s_load_dwordx8 s[0:7], s[10:11], 0x0
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: s_waitcnt lgkmcnt(0)
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: v_mov_b32_e32 v0, s7
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: v_mov_b32_e32 v1, s6
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: v_mov_b32_e32 v5, s5
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: v_mov_b32_e32 v6, s4
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: v_add_f32_e32 v3, s3, v0
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: v_add_f32_e32 v2, s2, v1
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: v_add_f32_e32 v1, s1, v5
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: v_add_f32_e32 v0, s0, v6
+; GCN-NOP1-NEXT: s_nop 0
+; GCN-NOP1-NEXT: global_store_dwordx4 v4, v[0:3], s[8:9]
+; GCN-NOP1-NEXT: s_endpgm
+;
+; GCN-NOP20-LABEL: fadd_v4f32:
+; GCN-NOP20: ; %bb.0:
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: s_load_dwordx4 s[8:11], s[4:5], 0x24
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: v_mov_b32_e32 v4, 0
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: s_waitcnt lgkmcnt(0)
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: s_load_dwordx8 s[0:7], s[10:11], 0x0
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: s_waitcnt lgkmcnt(0)
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: v_mov_b32_e32 v0, s7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: v_mov_b32_e32 v1, s6
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: v_mov_b32_e32 v5, s5
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: v_mov_b32_e32 v6, s4
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: v_add_f32_e32 v3, s3, v0
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: v_add_f32_e32 v2, s2, v1
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: v_add_f32_e32 v1, s1, v5
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: v_add_f32_e32 v0, s0, v6
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 7
+; GCN-NOP20-NEXT: s_nop 3
+; GCN-NOP20-NEXT: global_store_dwordx4 v4, v[0:3], s[8:9]
+; GCN-NOP20-NEXT: s_endpgm
+ %b_ptr = getelementptr <4 x float>, ptr addrspace(1) %in, i32 1
+ %a = load <4 x float>, ptr addrspace(1) %in, align 16
+ %b = load <4 x float>, ptr addrspace(1) %b_ptr, align 16
+ %result = fadd <4 x float> %a, %b
+ store <4 x float> %result, ptr addrspace(1) %out, align 16
+ ret void
+}
Change-Id: Ia83c67144b7c23fdec2912c72dd30956eba17f03
Will this put nops inside bundles? See mask_hazard_getpc2
in test/CodeGen/AMDGPU/valu-mask-write-hazard.mir
for an example.
unsigned NopPad =
    NopPadding.getNumOccurrences() && !MI->isTerminator() ? NopPadding : 0;
Why don't you want a nop before a terminator?
There can't be one between terminators.
Because of a rule that all terminators have to be consecutive at the end of the block?
You could still have a nop before the first terminator.
For debugging purposes, I'd say that would be fine, because if a hazard happens, the first instruction in the following BB has a nop before it.
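A minimal sketch of the alternative raised above (allow padding before the first terminator, but never between terminators, so terminators stay contiguous at the end of the block); the helper name is hypothetical and not part of this patch:

// Hypothetical guard: permit padding before the first terminator of a block,
// but never between terminators.
static bool canPadBefore(const MachineInstr &MI) {
  if (!MI.isTerminator())
    return true;
  const MachineBasicBlock *MBB = MI.getParent();
  // MI is a terminator, so getFirstTerminator() cannot be end() here.
  return &MI == &*MBB->getFirstTerminator();
}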
Change-Id: Icdd0c8777ce360a8fcb4dc599f95263a59b8fc4f
Change-Id: I7de1185831662e8e8b8b5612c52f703a05a62d88