[AMDGPU] Improve isBasicBlockPrologue to only add necessary instructions #113303
Conversation
@llvm/pr-subscribers-llvm-globalisel @llvm/pr-subscribers-backend-amdgpu
Author: None (alex-t)
Changes: Currently, we consider all SGPR spill and WWM spill instructions to belong to the basic block prologue. We only need those producing operands for the EXEC mask manipulation at the block beginning, or those producing operands for the former.
Patch is 200.18 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/113303.diff
18 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 89a2eb4f18946b..f0ad090ec393e7 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -8901,6 +8901,22 @@ unsigned SIInstrInfo::getLiveRangeSplitOpcode(Register SrcReg,
return AMDGPU::COPY;
}
+bool SIInstrInfo::isPrologueOperandReload(const MachineInstr &MI) const {
+ unsigned Opcode = MI.getOpcode();
+ if ((isSGPRSpill(MI) &&
+ (MI.mayLoad() || Opcode == AMDGPU::SI_RESTORE_S32_FROM_VGPR)) ||
+ (isWWMRegSpillOpcode(Opcode) && MI.mayLoad())) {
+ Register Reg = MI.defs().begin()->getReg();
+ const MachineBasicBlock *MBB = MI.getParent();
+ MachineBasicBlock::const_instr_iterator I(MI), E = MBB->instr_end();
+ while (++I != E) {
+ if (I->readsRegister(Reg, &RI) && isBasicBlockPrologue(*I))
+ return true;
+ }
+ }
+ return false;
+}
+
bool SIInstrInfo::isBasicBlockPrologue(const MachineInstr &MI,
Register Reg) const {
// We need to handle instructions which may be inserted during register
@@ -8917,8 +8933,7 @@ bool SIInstrInfo::isBasicBlockPrologue(const MachineInstr &MI,
uint16_t Opcode = MI.getOpcode();
return IsNullOrVectorRegister &&
- (isSGPRSpill(Opcode) || isWWMRegSpillOpcode(Opcode) ||
- Opcode == AMDGPU::IMPLICIT_DEF ||
+ (isPrologueOperandReload(MI) || Opcode == AMDGPU::IMPLICIT_DEF ||
(!MI.isTerminator() && Opcode != AMDGPU::COPY &&
MI.modifiesRegister(AMDGPU::EXEC, &RI)));
}
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.h b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
index 7041b59964645a..04b0414f9050ff 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
@@ -1342,6 +1342,8 @@ class SIInstrInfo final : public AMDGPUGenInstrInfo {
bool isBasicBlockPrologue(const MachineInstr &MI,
Register Reg = Register()) const override;
+ bool isPrologueOperandReload(const MachineInstr &MI) const;
+
MachineInstr *createPHIDestinationCopy(MachineBasicBlock &MBB,
MachineBasicBlock::iterator InsPt,
const DebugLoc &DL, Register Src,
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/image-waterfall-loop-O0.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/image-waterfall-loop-O0.ll
index 88fd7dcce35f68..eb864a0757c854 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/image-waterfall-loop-O0.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/image-waterfall-loop-O0.ll
@@ -68,9 +68,6 @@ define <4 x float> @waterfall_loop(<8 x i32> %vgpr_srd) {
; CHECK-NEXT: buffer_store_dword v16, off, s[0:3], s32 ; 4-byte Folded Spill
; CHECK-NEXT: s_mov_b32 exec_lo, s21
; CHECK-NEXT: .LBB0_1: ; =>This Inner Loop Header: Depth=1
-; CHECK-NEXT: s_or_saveexec_b32 s21, -1
-; CHECK-NEXT: buffer_load_dword v16, off, s[0:3], s32 ; 4-byte Folded Reload
-; CHECK-NEXT: s_mov_b32 exec_lo, s21
; CHECK-NEXT: buffer_load_dword v8, off, s[0:3], s32 offset:12 ; 4-byte Folded Reload
; CHECK-NEXT: buffer_load_dword v9, off, s[0:3], s32 offset:16 ; 4-byte Folded Reload
; CHECK-NEXT: buffer_load_dword v10, off, s[0:3], s32 offset:20 ; 4-byte Folded Reload
@@ -87,7 +84,10 @@ define <4 x float> @waterfall_loop(<8 x i32> %vgpr_srd) {
; CHECK-NEXT: buffer_load_dword v5, off, s[0:3], s32 offset:64 ; 4-byte Folded Reload
; CHECK-NEXT: buffer_load_dword v6, off, s[0:3], s32 offset:68 ; 4-byte Folded Reload
; CHECK-NEXT: buffer_load_dword v7, off, s[0:3], s32 offset:72 ; 4-byte Folded Reload
-; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: s_or_saveexec_b32 s21, -1
+; CHECK-NEXT: buffer_load_dword v16, off, s[0:3], s32 ; 4-byte Folded Reload
+; CHECK-NEXT: s_mov_b32 exec_lo, s21
+; CHECK-NEXT: s_waitcnt vmcnt(1)
; CHECK-NEXT: v_readfirstlane_b32 s12, v7
; CHECK-NEXT: v_readfirstlane_b32 s10, v6
; CHECK-NEXT: v_readfirstlane_b32 s9, v5
@@ -104,6 +104,7 @@ define <4 x float> @waterfall_loop(<8 x i32> %vgpr_srd) {
; CHECK-NEXT: s_mov_b32 s17, s6
; CHECK-NEXT: s_mov_b32 s18, s5
; CHECK-NEXT: s_mov_b32 s19, s4
+; CHECK-NEXT: s_waitcnt vmcnt(0)
; CHECK-NEXT: v_writelane_b32 v16, s12, 5
; CHECK-NEXT: v_writelane_b32 v16, s13, 6
; CHECK-NEXT: v_writelane_b32 v16, s14, 7
@@ -137,6 +138,8 @@ define <4 x float> @waterfall_loop(<8 x i32> %vgpr_srd) {
; CHECK-NEXT: buffer_store_dword v16, off, s[0:3], s32 ; 4-byte Folded Spill
; CHECK-NEXT: s_mov_b32 exec_lo, s21
; CHECK-NEXT: ; %bb.2: ; in Loop: Header=BB0_1 Depth=1
+; CHECK-NEXT: buffer_load_dword v0, off, s[0:3], s32 offset:4 ; 4-byte Folded Reload
+; CHECK-NEXT: buffer_load_dword v1, off, s[0:3], s32 offset:8 ; 4-byte Folded Reload
; CHECK-NEXT: s_or_saveexec_b32 s21, -1
; CHECK-NEXT: buffer_load_dword v16, off, s[0:3], s32 ; 4-byte Folded Reload
; CHECK-NEXT: s_mov_b32 exec_lo, s21
@@ -154,9 +157,6 @@ define <4 x float> @waterfall_loop(<8 x i32> %vgpr_srd) {
; CHECK-NEXT: v_readlane_b32 s17, v16, 1
; CHECK-NEXT: v_readlane_b32 s18, v16, 2
; CHECK-NEXT: v_readlane_b32 s19, v16, 3
-; CHECK-NEXT: buffer_load_dword v0, off, s[0:3], s32 offset:4 ; 4-byte Folded Reload
-; CHECK-NEXT: buffer_load_dword v1, off, s[0:3], s32 offset:8 ; 4-byte Folded Reload
-; CHECK-NEXT: s_waitcnt vmcnt(0)
; CHECK-NEXT: image_sample v0, v[0:1], s[8:15], s[16:19] dmask:0x1 dim:SQ_RSRC_IMG_2D
; CHECK-NEXT: s_waitcnt vmcnt(0)
; CHECK-NEXT: buffer_store_dword v0, off, s[0:3], s32 offset:76 ; 4-byte Folded Spill
diff --git a/llvm/test/CodeGen/AMDGPU/collapse-endcf.ll b/llvm/test/CodeGen/AMDGPU/collapse-endcf.ll
index fe17ff169cb14b..2e69bcd1ce152c 100644
--- a/llvm/test/CodeGen/AMDGPU/collapse-endcf.ll
+++ b/llvm/test/CodeGen/AMDGPU/collapse-endcf.ll
@@ -67,6 +67,7 @@ define amdgpu_kernel void @simple_nested_if(ptr addrspace(1) nocapture %arg) {
; GCN-O0-NEXT: s_mov_b64 exec, s[0:1]
; GCN-O0-NEXT: s_cbranch_execz .LBB0_4
; GCN-O0-NEXT: ; %bb.1: ; %bb.outer.then
+; GCN-O0-NEXT: buffer_load_dword v0, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_or_saveexec_b64 s[8:9], -1
; GCN-O0-NEXT: s_waitcnt expcnt(0)
; GCN-O0-NEXT: buffer_load_dword v4, off, s[12:15], 0 ; 4-byte Folded Reload
@@ -74,14 +75,12 @@ define amdgpu_kernel void @simple_nested_if(ptr addrspace(1) nocapture %arg) {
; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_readlane_b32 s4, v4, 0
; GCN-O0-NEXT: v_readlane_b32 s5, v4, 1
-; GCN-O0-NEXT: buffer_load_dword v0, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_mov_b32 s2, 0xf000
; GCN-O0-NEXT: s_mov_b32 s0, 0
; GCN-O0-NEXT: ; kill: def $sgpr0 killed $sgpr0 def $sgpr0_sgpr1
; GCN-O0-NEXT: s_mov_b32 s1, s2
; GCN-O0-NEXT: ; kill: def $sgpr4_sgpr5 killed $sgpr4_sgpr5 def $sgpr4_sgpr5_sgpr6_sgpr7
; GCN-O0-NEXT: s_mov_b64 s[6:7], s[0:1]
-; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_ashrrev_i32_e64 v3, 31, v0
; GCN-O0-NEXT: v_mov_b32_e32 v1, v0
; GCN-O0-NEXT: v_mov_b32_e32 v2, v3
@@ -100,6 +99,8 @@ define amdgpu_kernel void @simple_nested_if(ptr addrspace(1) nocapture %arg) {
; GCN-O0-NEXT: s_mov_b64 exec, s[0:1]
; GCN-O0-NEXT: s_cbranch_execz .LBB0_3
; GCN-O0-NEXT: ; %bb.2: ; %bb.inner.then
+; GCN-O0-NEXT: s_waitcnt expcnt(1)
+; GCN-O0-NEXT: buffer_load_dword v1, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_or_saveexec_b64 s[8:9], -1
; GCN-O0-NEXT: s_waitcnt expcnt(0)
; GCN-O0-NEXT: buffer_load_dword v4, off, s[12:15], 0 ; 4-byte Folded Reload
@@ -107,9 +108,7 @@ define amdgpu_kernel void @simple_nested_if(ptr addrspace(1) nocapture %arg) {
; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_readlane_b32 s0, v4, 0
; GCN-O0-NEXT: v_readlane_b32 s1, v4, 1
-; GCN-O0-NEXT: buffer_load_dword v1, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
; GCN-O0-NEXT: v_mov_b32_e32 v0, 1
-; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_add_i32_e64 v1, s[2:3], v1, v0
; GCN-O0-NEXT: v_ashrrev_i32_e64 v3, 31, v1
; GCN-O0-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
@@ -236,6 +235,7 @@ define amdgpu_kernel void @uncollapsable_nested_if(ptr addrspace(1) nocapture %a
; GCN-O0-NEXT: s_mov_b64 exec, s[0:1]
; GCN-O0-NEXT: s_cbranch_execz .LBB1_3
; GCN-O0-NEXT: ; %bb.1: ; %bb.outer.then
+; GCN-O0-NEXT: buffer_load_dword v0, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_or_saveexec_b64 s[8:9], -1
; GCN-O0-NEXT: s_waitcnt expcnt(0)
; GCN-O0-NEXT: buffer_load_dword v4, off, s[12:15], 0 ; 4-byte Folded Reload
@@ -243,14 +243,12 @@ define amdgpu_kernel void @uncollapsable_nested_if(ptr addrspace(1) nocapture %a
; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_readlane_b32 s4, v4, 0
; GCN-O0-NEXT: v_readlane_b32 s5, v4, 1
-; GCN-O0-NEXT: buffer_load_dword v0, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_mov_b32 s2, 0xf000
; GCN-O0-NEXT: s_mov_b32 s0, 0
; GCN-O0-NEXT: ; kill: def $sgpr0 killed $sgpr0 def $sgpr0_sgpr1
; GCN-O0-NEXT: s_mov_b32 s1, s2
; GCN-O0-NEXT: ; kill: def $sgpr4_sgpr5 killed $sgpr4_sgpr5 def $sgpr4_sgpr5_sgpr6_sgpr7
; GCN-O0-NEXT: s_mov_b64 s[6:7], s[0:1]
-; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_ashrrev_i32_e64 v3, 31, v0
; GCN-O0-NEXT: v_mov_b32_e32 v1, v0
; GCN-O0-NEXT: v_mov_b32_e32 v2, v3
@@ -269,6 +267,8 @@ define amdgpu_kernel void @uncollapsable_nested_if(ptr addrspace(1) nocapture %a
; GCN-O0-NEXT: s_mov_b64 exec, s[0:1]
; GCN-O0-NEXT: s_cbranch_execz .LBB1_4
; GCN-O0-NEXT: ; %bb.2: ; %bb.inner.then
+; GCN-O0-NEXT: s_waitcnt expcnt(1)
+; GCN-O0-NEXT: buffer_load_dword v1, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_or_saveexec_b64 s[8:9], -1
; GCN-O0-NEXT: s_waitcnt expcnt(0)
; GCN-O0-NEXT: buffer_load_dword v4, off, s[12:15], 0 ; 4-byte Folded Reload
@@ -276,9 +276,7 @@ define amdgpu_kernel void @uncollapsable_nested_if(ptr addrspace(1) nocapture %a
; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_readlane_b32 s0, v4, 0
; GCN-O0-NEXT: v_readlane_b32 s1, v4, 1
-; GCN-O0-NEXT: buffer_load_dword v1, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
; GCN-O0-NEXT: v_mov_b32_e32 v0, 1
-; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_add_i32_e64 v1, s[2:3], v1, v0
; GCN-O0-NEXT: v_ashrrev_i32_e64 v3, 31, v1
; GCN-O0-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
@@ -312,9 +310,9 @@ define amdgpu_kernel void @uncollapsable_nested_if(ptr addrspace(1) nocapture %a
; GCN-O0-NEXT: v_readlane_b32 s2, v4, 4
; GCN-O0-NEXT: v_readlane_b32 s3, v4, 5
; GCN-O0-NEXT: s_or_b64 exec, exec, s[2:3]
+; GCN-O0-NEXT: buffer_load_dword v1, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
; GCN-O0-NEXT: v_readlane_b32 s0, v4, 0
; GCN-O0-NEXT: v_readlane_b32 s1, v4, 1
-; GCN-O0-NEXT: buffer_load_dword v1, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
; GCN-O0-NEXT: v_mov_b32_e32 v0, 2
; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_add_i32_e64 v1, s[2:3], v1, v0
@@ -456,17 +454,18 @@ define amdgpu_kernel void @nested_if_if_else(ptr addrspace(1) nocapture %arg) {
; GCN-O0-NEXT: s_mov_b64 exec, s[0:1]
; GCN-O0-NEXT: s_cbranch_execz .LBB2_6
; GCN-O0-NEXT: ; %bb.1: ; %bb.outer.then
+; GCN-O0-NEXT: buffer_load_dword v0, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_or_saveexec_b64 s[6:7], -1
; GCN-O0-NEXT: s_waitcnt expcnt(0)
; GCN-O0-NEXT: buffer_load_dword v4, off, s[12:15], 0 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_mov_b64 exec, s[6:7]
-; GCN-O0-NEXT: buffer_load_dword v0, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_mov_b32 s0, 2
-; GCN-O0-NEXT: s_waitcnt vmcnt(0)
+; GCN-O0-NEXT: s_waitcnt vmcnt(1)
; GCN-O0-NEXT: v_cmp_ne_u32_e64 s[0:1], v0, s0
; GCN-O0-NEXT: s_mov_b64 s[2:3], exec
; GCN-O0-NEXT: s_and_b64 s[0:1], s[2:3], s[0:1]
; GCN-O0-NEXT: s_xor_b64 s[2:3], s[0:1], s[2:3]
+; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_writelane_b32 v4, s2, 4
; GCN-O0-NEXT: v_writelane_b32 v4, s3, 5
; GCN-O0-NEXT: s_or_saveexec_b64 s[6:7], -1
@@ -493,6 +492,7 @@ define amdgpu_kernel void @nested_if_if_else(ptr addrspace(1) nocapture %arg) {
; GCN-O0-NEXT: s_xor_b64 exec, exec, s[0:1]
; GCN-O0-NEXT: s_cbranch_execz .LBB2_5
; GCN-O0-NEXT: ; %bb.3: ; %bb.then
+; GCN-O0-NEXT: buffer_load_dword v1, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_or_saveexec_b64 s[6:7], -1
; GCN-O0-NEXT: s_waitcnt expcnt(0)
; GCN-O0-NEXT: buffer_load_dword v4, off, s[12:15], 0 ; 4-byte Folded Reload
@@ -500,9 +500,7 @@ define amdgpu_kernel void @nested_if_if_else(ptr addrspace(1) nocapture %arg) {
; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_readlane_b32 s0, v4, 0
; GCN-O0-NEXT: v_readlane_b32 s1, v4, 1
-; GCN-O0-NEXT: buffer_load_dword v1, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
; GCN-O0-NEXT: v_mov_b32_e32 v0, 1
-; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_add_i32_e64 v1, s[2:3], v1, v0
; GCN-O0-NEXT: v_ashrrev_i32_e64 v3, 31, v1
; GCN-O0-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
@@ -518,6 +516,7 @@ define amdgpu_kernel void @nested_if_if_else(ptr addrspace(1) nocapture %arg) {
; GCN-O0-NEXT: buffer_store_dword v0, v[1:2], s[0:3], 0 addr64
; GCN-O0-NEXT: s_branch .LBB2_5
; GCN-O0-NEXT: .LBB2_4: ; %bb.else
+; GCN-O0-NEXT: buffer_load_dword v1, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_or_saveexec_b64 s[6:7], -1
; GCN-O0-NEXT: s_waitcnt expcnt(0)
; GCN-O0-NEXT: buffer_load_dword v4, off, s[12:15], 0 ; 4-byte Folded Reload
@@ -525,9 +524,7 @@ define amdgpu_kernel void @nested_if_if_else(ptr addrspace(1) nocapture %arg) {
; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_readlane_b32 s0, v4, 0
; GCN-O0-NEXT: v_readlane_b32 s1, v4, 1
-; GCN-O0-NEXT: buffer_load_dword v1, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
; GCN-O0-NEXT: v_mov_b32_e32 v0, 2
-; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_add_i32_e64 v1, s[2:3], v1, v0
; GCN-O0-NEXT: v_ashrrev_i32_e64 v3, 31, v1
; GCN-O0-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
@@ -724,13 +721,13 @@ define amdgpu_kernel void @nested_if_else_if(ptr addrspace(1) nocapture %arg) {
; GCN-O0-NEXT: s_xor_b64 exec, exec, s[0:1]
; GCN-O0-NEXT: s_cbranch_execz .LBB3_8
; GCN-O0-NEXT: ; %bb.2: ; %bb.outer.then
+; GCN-O0-NEXT: buffer_load_dword v0, off, s[12:15], 0 offset:12 ; 4-byte Folded Reload
+; GCN-O0-NEXT: buffer_load_dword v2, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
+; GCN-O0-NEXT: buffer_load_dword v3, off, s[12:15], 0 offset:8 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_or_saveexec_b64 s[8:9], -1
; GCN-O0-NEXT: s_waitcnt expcnt(0)
; GCN-O0-NEXT: buffer_load_dword v6, off, s[12:15], 0 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_mov_b64 exec, s[8:9]
-; GCN-O0-NEXT: buffer_load_dword v0, off, s[12:15], 0 offset:12 ; 4-byte Folded Reload
-; GCN-O0-NEXT: buffer_load_dword v2, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
-; GCN-O0-NEXT: buffer_load_dword v3, off, s[12:15], 0 offset:8 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_mov_b32 s0, 0xf000
; GCN-O0-NEXT: s_mov_b32 s2, 0
; GCN-O0-NEXT: s_mov_b32 s4, s2
@@ -740,11 +737,12 @@ define amdgpu_kernel void @nested_if_else_if(ptr addrspace(1) nocapture %arg) {
; GCN-O0-NEXT: ; kill: def $sgpr0_sgpr1 killed $sgpr0_sgpr1 def $sgpr0_sgpr1_sgpr2_sgpr3
; GCN-O0-NEXT: s_mov_b64 s[2:3], s[4:5]
; GCN-O0-NEXT: v_mov_b32_e32 v1, 1
-; GCN-O0-NEXT: s_waitcnt vmcnt(0)
+; GCN-O0-NEXT: s_waitcnt vmcnt(1)
; GCN-O0-NEXT: buffer_store_dword v1, v[2:3], s[0:3], 0 addr64 offset:4
; GCN-O0-NEXT: s_mov_b32 s0, 2
; GCN-O0-NEXT: v_cmp_eq_u32_e64 s[2:3], v0, s0
; GCN-O0-NEXT: s_mov_b64 s[0:1], exec
+; GCN-O0-NEXT: s_waitcnt vmcnt(1)
; GCN-O0-NEXT: v_writelane_b32 v6, s0, 4
; GCN-O0-NEXT: v_writelane_b32 v6, s1, 5
; GCN-O0-NEXT: s_or_saveexec_b64 s[8:9], -1
@@ -770,13 +768,13 @@ define amdgpu_kernel void @nested_if_else_if(ptr addrspace(1) nocapture %arg) {
; GCN-O0-NEXT: buffer_store_dword v0, v[1:2], s[0:3], 0 addr64 offset:8
; GCN-O0-NEXT: s_branch .LBB3_7
; GCN-O0-NEXT: .LBB3_4: ; %bb.outer.else
+; GCN-O0-NEXT: buffer_load_dword v0, off, s[12:15], 0 offset:12 ; 4-byte Folded Reload
+; GCN-O0-NEXT: buffer_load_dword v2, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
+; GCN-O0-NEXT: buffer_load_dword v3, off, s[12:15], 0 offset:8 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_or_saveexec_b64 s[8:9], -1
; GCN-O0-NEXT: s_waitcnt expcnt(0)
; GCN-O0-NEXT: buffer_load_dword v6, off, s[12:15], 0 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_mov_b64 exec, s[8:9]
-; GCN-O0-NEXT: buffer_load_dword v0, off, s[12:15], 0 offset:12 ; 4-byte Folded Reload
-; GCN-O0-NEXT: buffer_load_dword v2, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
-; GCN-O0-NEXT: buffer_load_dword v3, off, s[12:15], 0 offset:8 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_mov_b32 s1, 0xf000
; GCN-O0-NEXT: s_mov_b32 s0, 0
; GCN-O0-NEXT: s_mov_b32 s2, s0
@@ -786,10 +784,11 @@ define amdgpu_kernel void @nested_if_else_if(ptr addrspace(1) nocapture %arg) {
; GCN-O0-NEXT: ; kill: def $sgpr4_sgpr5 killed $sgpr4_sgpr5 def $sgpr4_sgpr5_sgpr6_sgpr7
; GCN-O0-NEXT: s_mov_b64 s[6:7], s[2:3]
; GCN-O0-NEXT: v_mov_b32_e32 v1, 3
-; GCN-O0-NEXT: s_waitcnt vmcnt(0)
+; GCN-O0-NEXT: s_waitcnt vmcnt(1)
; GCN-O0-NEXT: buffer_store_dword v1, v[2:3], s[4:7], 0 addr64 offset:12
; GCN-O0-NEXT: v_cmp_eq_u32_e64 s[2:3], v0, s0
; GCN-O0-NEXT: s_mov_b64 s[0:1], exec
+; GCN-O0-NEXT: s_waitcnt vmcnt(1)
; GCN-O0-NEXT: v_writelane_b32 v6, s0, 6
; GCN-O0-NEXT: v_writelane_b32 v6, s1, 7
; GCN-O0-NEXT: s_or_saveexec_b64 s[8:9], -1
@@ -927,6 +926,7 @@ define amdgpu_kernel void @s_endpgm_unsafe_barrier(ptr addrspace(1) nocapture %a
; GCN-O0-NEXT: s_mov_b64 exec, s[0:1]
; GCN-O0-NEXT: s_cbranch_execz .LBB4_2
; GCN-O0-NEXT: ; %bb.1: ; %bb.then
+; GCN-O0-NEXT: buffer_load_dword v0, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_or_saveexec_b64 s[6:7], -1
; GCN-O0-NEXT: s_waitcnt expcnt(0)
; GCN-O0-NEXT: buffer_load_dword v3, off, s[12:15], 0 ; 4-byte Folded Reload
@@ -934,14 +934,12 @@ define amdgpu_kernel void @s_endpgm_unsafe_barrier(ptr addrspace(1) nocapture %a
; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_readlane_b32 s0, v3, 0
; GCN-O0-NEXT: v_readlane_b32 s1, v3, 1
-; GCN-O0-NEXT: buffer_load_dword v0, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_mov_b32 s2, 0xf000
; GCN-O0-NEXT: s_mov_b32 s4, 0
; GCN-O0-NEXT: ; kill: def $sgpr4 killed $sgpr4 def $sgpr4_sgpr5
; GCN-O0-NEXT: s_mov_b32 s5, s2
; GCN-O0-NEXT: ; kill: def $sgpr0_sgpr1 killed $sgpr0_sgpr1 def $sgpr0_sgpr1_sgpr2_sgpr3
; GCN-O0-NEXT: s_mov_b64 s[2:3], s[4:5]
-; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_ashrrev_i32_e64 v2, 31, v0
; GCN-O0-NEXT: ; kill: def $vgpr0 killed $vgpr0 def $vgpr0_vgpr1 killed $exec
; GCN-O0-NEXT: v_mov_b32_e32 v1, v2
@@ -1066,6 +1064,8 @@ define void @scc_liveness(i32 %arg) local_unnamed_addr #0 {
; GCN-O0-NEXT: s_mov_b64 exec, s[14:15]
; GCN-O0-NEXT: .LBB5_1: ; %bb1
; GCN-O0-NEXT: ; =>This Inner Loop Header: Depth=1
+; GCN-O0-NEXT: s_waitcnt expcnt(1)
+; GCN-O0-NEXT: buffer_load_dword v0, off, s[0:3], s32 offset:4 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_or_saveexec_b64 s[14:15], -1
; GCN-O0-NEXT: s_waitcnt expcnt(0)
; GCN-O0-NEXT: buffer_load_dword v6, off, s[0:3], s32 ; 4-byte Folded Reload
@@ -1077,9 +1077,7 @@ define void @scc_liveness(i32 %arg) local_unnamed_addr #0 {
; GCN-O0-NEXT: v_readlane_b32 s7, v6, 1
; GCN-O0-NEXT: v_writelane_b32 v6, s6, 4
; GCN-O0-NEXT: v_writelane_b32 v6, s7, 5
-; GCN-O0-NEXT: buffer_load_dword v0, off, s[0:3], s32 offset:4 ; 4-byte Folded Reload
; GCN-O0-NEXT: s_mov_b32 s4, 0x207
-; GCN-O0-NEXT: s_waitcnt vmcnt(0)
; GCN-O0-NEXT: v_cmp_lt_i32_e64 s[4:5], v0, s4
; GCN-O0-NEXT: s...
[truncated]
Ping. Is anybody interested in taking a look?
You can test this locally with the following command:
git-clang-format --diff a2ba438f3e5635e368333213914c7452a6a6a2da 640a436c2c39613e32c76226eef1b618a4d2400f --extensions cpp,h -- llvm/lib/Target/AMDGPU/SIInstrInfo.cpp llvm/lib/Target/AMDGPU/SIInstrInfo.h
View the diff from clang-format here.
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 2da8356328..fae6e8e4f6 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -8938,7 +8938,7 @@ bool SIInstrInfo::isBasicBlockPrologue(const MachineInstr &MI,
(Opcode == AMDGPU::IMPLICIT_DEF ||
(!MI.isTerminator() && Opcode != AMDGPU::COPY &&
MI.modifiesRegister(AMDGPU::EXEC, &RI)) ||
- isPrologueOperandReload(MI));
+ isPrologueOperandReload(MI));
}
MachineInstrBuilder
Thanks for doing this! But frankly speaking, this is making the implementation more confusing. I think most of the weirdness in the implementation is caused by the way the target hook is defined. Querying whether each instruction is block prologue seems not the right way to go. The basic idea behind the block prologue is that we have a special instruction that sets up the exec for the block; all the instructions before this specific instruction are prologue instructions. So the prologue search should work by forward-iterating over the instructions until we see the first instruction that modifies exec, which would be the last prologue instruction. There are ways to speed up the search to avoid visiting all the instructions in the worst case. For example, we can check the instruction types that could possibly be in the prologue. If we find an instruction that is NOT in the list of possible prologue instructions before we see an exec setup instruction, then we should return and report that there is no prologue instruction at all. Following the idea, maybe we should define the hook as
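A minimal sketch of the kind of forward-scanning hook being suggested here, assuming hypothetical names getFirstNonPrologue and mayBeInPrologue (neither exists in the patch or in LLVM):

// Hypothetical alternative hook: scan forward from the block start and stop
// at the first instruction that modifies EXEC (the last prologue instruction),
// bailing out early once something that can never be in a prologue is seen.
MachineBasicBlock::iterator
SIInstrInfo::getFirstNonPrologue(MachineBasicBlock &MBB) const {
  for (auto I = MBB.begin(), E = MBB.end(); I != E; ++I) {
    if (!I->isTerminator() && I->getOpcode() != AMDGPU::COPY &&
        I->modifiesRegister(AMDGPU::EXEC, &RI))
      return std::next(I); // Everything up to and including I is the prologue.
    if (!mayBeInPrologue(*I)) // e.g. not a spill reload or IMPLICIT_DEF
      return MBB.begin();     // No prologue in this block at all.
  }
  return MBB.begin();
}

Callers that today keep querying isBasicBlockPrologue(MI) while walking from the block beginning could then ask for the first non-prologue iterator once per block instead.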
The current implementation aims to check if the instruction in question produces operands for another prologue instruction, which causes the recursion. The possible user of the register defined by the instruction might not only be the last instruction in the prologue (i.e. the EXEC modification) but also, for example, a WWM reload using an SGPR as the offset. Also, please note that the latest version of the algorithm returns false immediately once the next instruction is not a "prologue". I agree that iterating through the whole prologue on each query is inconvenient. To change that we'd have to change all the call sites in LLVM core code. Most are looking for the insertion point in a concrete basic block, walking from the block beginning.
I think TargetInstrInfo is not the kind of stable ABI that needs a high bar to change, right? We are the only target that implements isBasicBlockPrologue() in tree. For an out-of-tree target, it is fairly easy to switch to the new hook.
I would say the "prologue" is not a static property of certain types of instructions. An instruction is part of the prologue because it appears before the exec modification instruction.
I would not like to rely on the instructions' lexical order. An instruction belongs to the prologue if it is part of the def-use chain that ends up at the exec-mask-writing instruction. Just to check if I understand your idea:
Is that what you wanted to convey?
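A rough illustration of the def-use-chain criterion described above, written as a hypothetical standalone helper (feedsExecSetup is not part of the patch): an instruction would count as prologue only if its result is, possibly transitively, consumed by the instruction that writes the exec mask.

// Hypothetical sketch: does MI's result feed, directly or through other
// prologue instructions, the EXEC write that ends the block prologue?
static bool feedsExecSetup(const MachineInstr &MI, const SIRegisterInfo &RI) {
  if (MI.getNumDefs() == 0)
    return false;
  Register Def = MI.defs().begin()->getReg();
  const MachineBasicBlock *MBB = MI.getParent();
  MachineBasicBlock::const_instr_iterator I(MI), E = MBB->instr_end();
  while (++I != E) {
    if (!I->readsRegister(Def, &RI))
      continue;
    // The user is the exec setup itself, or it feeds the exec setup in turn.
    if (I->modifiesRegister(AMDGPU::EXEC, &RI) || feedsExecSetup(*I, RI))
      return true;
  }
  return false;
}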
Given that I understand your point correctly, we introduce a hook returning an iterator pointing to the last prologue or the first non-prologue instruction. Below is a brief list of the call sites and related issues we'd have to solve:
2 and 3 are just a question of interface, i.e. what hooks we'd like to have. The 1st is the real trouble, as it depends on the iterative nature of the isBasicBlockPrologue usage.
Is this correct? Like in the case below, the second sgpr-reload-from-vgpr is not in the prologue?
I don't think the def-use relationship is important when checking whether an instruction belongs to the prologue. The prologue concept is mainly used to find the right insertion point for vector instructions (anything else?). For scalar instructions that produce the input to the exec mask setup, yes, we have to put them inside the prologue for correctness. For other scalar instructions, we can either put them in the prologue or after. Right?
Yes
Thank you for taking a further look into this. This is indeed an issue we need to solve properly. I am not sure whether we have a certain order for these instructions; it seems we cannot assume too much (but we can still assume PHIs always come first)? So, to get a correct implementation of ... But it's unclear to me what the expected behavior is in corner cases. Like in skipPHIsAndLabels(), we would want to find the insertion point before debug instructions but after prologue instructions, so what if the IR has prologue instructions after debug instructions? What would the IR for such a case look like?
Exactly. Only the readlane producing s0 belongs to the prologue, but the other one, producing s1, does not.
Adding unneeded instructions to the prologue is undesirable because any instruction defining a register starts a segment in a live range.
If the second v_readlane is not in the prologue, then ...
That makes sense to me. But we still want some sort of clean and robust prologue checking.
The isBasicBlockPrologue hook was something introduced by AMDGPU and has been in real use only for our backend. I share the same concerns discussed between Alex and Ruiling. I also think we should consolidate ...
I am unsure if the pattern you used as an example could exist.
The only question to decide (and the one I need advice on) is:
Ping. I tried to address your concerns. Any other suggestions?
Currently, we consider all SGPR spill and WWM spill instructions to belong to the basic block prologue.
We only need those producing operands for the EXEC mask manipulation at the block beginning, or those producing operands for the former.
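For reference, the core of the new check from this patch (isPrologueOperandReload, shown in the diff above), reproduced with explanatory comments added; only the comments are new here:

bool SIInstrInfo::isPrologueOperandReload(const MachineInstr &MI) const {
  unsigned Opcode = MI.getOpcode();
  // Only SGPR reloads (including SI_RESTORE_S32_FROM_VGPR) and WWM reloads are
  // candidates; spill stores never produce an operand for the prologue.
  if ((isSGPRSpill(MI) &&
       (MI.mayLoad() || Opcode == AMDGPU::SI_RESTORE_S32_FROM_VGPR)) ||
      (isWWMRegSpillOpcode(Opcode) && MI.mayLoad())) {
    Register Reg = MI.defs().begin()->getReg();
    const MachineBasicBlock *MBB = MI.getParent();
    MachineBasicBlock::const_instr_iterator I(MI), E = MBB->instr_end();
    // The reload is kept in the prologue only if a later prologue instruction
    // (ultimately the EXEC manipulation) reads the reloaded register.
    while (++I != E) {
      if (I->readsRegister(Reg, &RI) && isBasicBlockPrologue(*I))
        return true;
    }
  }
  return false;
}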