
Conversation

alex-t
Contributor

@alex-t alex-t commented Oct 22, 2024

Currently, we consider all SGPR spill and WWM spill instructions to belong to the basic block prologue.
We only need those that produce operands for the EXEC mask manipulation at the block beginning, or those that produce operands for the former.

@llvmbot
Member

llvmbot commented Oct 22, 2024

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: None (alex-t)

Changes

Currently, we consider all SGPR spill and WWM spill instructions to belong to the basic block prologue.
We only need those that produce operands for the EXEC mask manipulation at the block beginning, or those that produce operands for the former.


Patch is 200.18 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/113303.diff

18 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+17-2)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.h (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/image-waterfall-loop-O0.ll (+7-7)
  • (modified) llvm/test/CodeGen/AMDGPU/collapse-endcf.ll (+29-31)
  • (modified) llvm/test/CodeGen/AMDGPU/cross-block-use-is-not-abi-copy.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/div_i128.ll (+73-85)
  • (modified) llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll (+142-202)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll (+53-76)
  • (modified) llvm/test/CodeGen/AMDGPU/mubuf-legalize-operands-non-ptr-intrinsics.ll (+51-51)
  • (modified) llvm/test/CodeGen/AMDGPU/mubuf-legalize-operands.ll (+55-56)
  • (modified) llvm/test/CodeGen/AMDGPU/partial-sgpr-to-vgpr-spills.ll (+14-19)
  • (modified) llvm/test/CodeGen/AMDGPU/postra-sink-update-dependency.mir (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/rem_i128.ll (+36-42)
  • (modified) llvm/test/CodeGen/AMDGPU/sgpr-spill-no-vgprs.ll (+3-4)
  • (modified) llvm/test/CodeGen/AMDGPU/spill-scavenge-offset.ll (+241-241)
  • (modified) llvm/test/CodeGen/AMDGPU/trap-abis.ll (+2-3)
  • (modified) llvm/test/CodeGen/AMDGPU/wwm-reserved-spill.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/wwm-reserved.ll (+4-4)
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 89a2eb4f18946b..f0ad090ec393e7 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -8901,6 +8901,22 @@ unsigned SIInstrInfo::getLiveRangeSplitOpcode(Register SrcReg,
   return AMDGPU::COPY;
 }
 
+bool SIInstrInfo::isPrologueOperandReload(const MachineInstr &MI) const {
+  unsigned Opcode = MI.getOpcode();
+  if ((isSGPRSpill(MI) &&
+       (MI.mayLoad() || Opcode == AMDGPU::SI_RESTORE_S32_FROM_VGPR)) ||
+      (isWWMRegSpillOpcode(Opcode) && MI.mayLoad())) {
+    Register Reg = MI.defs().begin()->getReg();
+    const MachineBasicBlock *MBB = MI.getParent();
+    MachineBasicBlock::const_instr_iterator I(MI), E = MBB->instr_end();
+    while (++I != E) {
+      if (I->readsRegister(Reg, &RI) && isBasicBlockPrologue(*I))
+        return true;
+    }
+  }
+  return false;
+}
+
 bool SIInstrInfo::isBasicBlockPrologue(const MachineInstr &MI,
                                        Register Reg) const {
   // We need to handle instructions which may be inserted during register
@@ -8917,8 +8933,7 @@ bool SIInstrInfo::isBasicBlockPrologue(const MachineInstr &MI,
 
   uint16_t Opcode = MI.getOpcode();
   return IsNullOrVectorRegister &&
-         (isSGPRSpill(Opcode) || isWWMRegSpillOpcode(Opcode) ||
-          Opcode == AMDGPU::IMPLICIT_DEF ||
+         (isPrologueOperandReload(MI) || Opcode == AMDGPU::IMPLICIT_DEF ||
           (!MI.isTerminator() && Opcode != AMDGPU::COPY &&
            MI.modifiesRegister(AMDGPU::EXEC, &RI)));
 }
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.h b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
index 7041b59964645a..04b0414f9050ff 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
@@ -1342,6 +1342,8 @@ class SIInstrInfo final : public AMDGPUGenInstrInfo {
   bool isBasicBlockPrologue(const MachineInstr &MI,
                             Register Reg = Register()) const override;
 
+  bool isPrologueOperandReload(const MachineInstr &MI) const;
+
   MachineInstr *createPHIDestinationCopy(MachineBasicBlock &MBB,
                                          MachineBasicBlock::iterator InsPt,
                                          const DebugLoc &DL, Register Src,
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/image-waterfall-loop-O0.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/image-waterfall-loop-O0.ll
index 88fd7dcce35f68..eb864a0757c854 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/image-waterfall-loop-O0.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/image-waterfall-loop-O0.ll
@@ -68,9 +68,6 @@ define <4 x float> @waterfall_loop(<8 x i32> %vgpr_srd) {
 ; CHECK-NEXT:    buffer_store_dword v16, off, s[0:3], s32 ; 4-byte Folded Spill
 ; CHECK-NEXT:    s_mov_b32 exec_lo, s21
 ; CHECK-NEXT:  .LBB0_1: ; =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    s_or_saveexec_b32 s21, -1
-; CHECK-NEXT:    buffer_load_dword v16, off, s[0:3], s32 ; 4-byte Folded Reload
-; CHECK-NEXT:    s_mov_b32 exec_lo, s21
 ; CHECK-NEXT:    buffer_load_dword v8, off, s[0:3], s32 offset:12 ; 4-byte Folded Reload
 ; CHECK-NEXT:    buffer_load_dword v9, off, s[0:3], s32 offset:16 ; 4-byte Folded Reload
 ; CHECK-NEXT:    buffer_load_dword v10, off, s[0:3], s32 offset:20 ; 4-byte Folded Reload
@@ -87,7 +84,10 @@ define <4 x float> @waterfall_loop(<8 x i32> %vgpr_srd) {
 ; CHECK-NEXT:    buffer_load_dword v5, off, s[0:3], s32 offset:64 ; 4-byte Folded Reload
 ; CHECK-NEXT:    buffer_load_dword v6, off, s[0:3], s32 offset:68 ; 4-byte Folded Reload
 ; CHECK-NEXT:    buffer_load_dword v7, off, s[0:3], s32 offset:72 ; 4-byte Folded Reload
-; CHECK-NEXT:    s_waitcnt vmcnt(0)
+; CHECK-NEXT:    s_or_saveexec_b32 s21, -1
+; CHECK-NEXT:    buffer_load_dword v16, off, s[0:3], s32 ; 4-byte Folded Reload
+; CHECK-NEXT:    s_mov_b32 exec_lo, s21
+; CHECK-NEXT:    s_waitcnt vmcnt(1)
 ; CHECK-NEXT:    v_readfirstlane_b32 s12, v7
 ; CHECK-NEXT:    v_readfirstlane_b32 s10, v6
 ; CHECK-NEXT:    v_readfirstlane_b32 s9, v5
@@ -104,6 +104,7 @@ define <4 x float> @waterfall_loop(<8 x i32> %vgpr_srd) {
 ; CHECK-NEXT:    s_mov_b32 s17, s6
 ; CHECK-NEXT:    s_mov_b32 s18, s5
 ; CHECK-NEXT:    s_mov_b32 s19, s4
+; CHECK-NEXT:    s_waitcnt vmcnt(0)
 ; CHECK-NEXT:    v_writelane_b32 v16, s12, 5
 ; CHECK-NEXT:    v_writelane_b32 v16, s13, 6
 ; CHECK-NEXT:    v_writelane_b32 v16, s14, 7
@@ -137,6 +138,8 @@ define <4 x float> @waterfall_loop(<8 x i32> %vgpr_srd) {
 ; CHECK-NEXT:    buffer_store_dword v16, off, s[0:3], s32 ; 4-byte Folded Spill
 ; CHECK-NEXT:    s_mov_b32 exec_lo, s21
 ; CHECK-NEXT:  ; %bb.2: ; in Loop: Header=BB0_1 Depth=1
+; CHECK-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:4 ; 4-byte Folded Reload
+; CHECK-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:8 ; 4-byte Folded Reload
 ; CHECK-NEXT:    s_or_saveexec_b32 s21, -1
 ; CHECK-NEXT:    buffer_load_dword v16, off, s[0:3], s32 ; 4-byte Folded Reload
 ; CHECK-NEXT:    s_mov_b32 exec_lo, s21
@@ -154,9 +157,6 @@ define <4 x float> @waterfall_loop(<8 x i32> %vgpr_srd) {
 ; CHECK-NEXT:    v_readlane_b32 s17, v16, 1
 ; CHECK-NEXT:    v_readlane_b32 s18, v16, 2
 ; CHECK-NEXT:    v_readlane_b32 s19, v16, 3
-; CHECK-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:4 ; 4-byte Folded Reload
-; CHECK-NEXT:    buffer_load_dword v1, off, s[0:3], s32 offset:8 ; 4-byte Folded Reload
-; CHECK-NEXT:    s_waitcnt vmcnt(0)
 ; CHECK-NEXT:    image_sample v0, v[0:1], s[8:15], s[16:19] dmask:0x1 dim:SQ_RSRC_IMG_2D
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
 ; CHECK-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:76 ; 4-byte Folded Spill
diff --git a/llvm/test/CodeGen/AMDGPU/collapse-endcf.ll b/llvm/test/CodeGen/AMDGPU/collapse-endcf.ll
index fe17ff169cb14b..2e69bcd1ce152c 100644
--- a/llvm/test/CodeGen/AMDGPU/collapse-endcf.ll
+++ b/llvm/test/CodeGen/AMDGPU/collapse-endcf.ll
@@ -67,6 +67,7 @@ define amdgpu_kernel void @simple_nested_if(ptr addrspace(1) nocapture %arg) {
 ; GCN-O0-NEXT:    s_mov_b64 exec, s[0:1]
 ; GCN-O0-NEXT:    s_cbranch_execz .LBB0_4
 ; GCN-O0-NEXT:  ; %bb.1: ; %bb.outer.then
+; GCN-O0-NEXT:    buffer_load_dword v0, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    s_or_saveexec_b64 s[8:9], -1
 ; GCN-O0-NEXT:    s_waitcnt expcnt(0)
 ; GCN-O0-NEXT:    buffer_load_dword v4, off, s[12:15], 0 ; 4-byte Folded Reload
@@ -74,14 +75,12 @@ define amdgpu_kernel void @simple_nested_if(ptr addrspace(1) nocapture %arg) {
 ; GCN-O0-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-O0-NEXT:    v_readlane_b32 s4, v4, 0
 ; GCN-O0-NEXT:    v_readlane_b32 s5, v4, 1
-; GCN-O0-NEXT:    buffer_load_dword v0, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    s_mov_b32 s2, 0xf000
 ; GCN-O0-NEXT:    s_mov_b32 s0, 0
 ; GCN-O0-NEXT:    ; kill: def $sgpr0 killed $sgpr0 def $sgpr0_sgpr1
 ; GCN-O0-NEXT:    s_mov_b32 s1, s2
 ; GCN-O0-NEXT:    ; kill: def $sgpr4_sgpr5 killed $sgpr4_sgpr5 def $sgpr4_sgpr5_sgpr6_sgpr7
 ; GCN-O0-NEXT:    s_mov_b64 s[6:7], s[0:1]
-; GCN-O0-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-O0-NEXT:    v_ashrrev_i32_e64 v3, 31, v0
 ; GCN-O0-NEXT:    v_mov_b32_e32 v1, v0
 ; GCN-O0-NEXT:    v_mov_b32_e32 v2, v3
@@ -100,6 +99,8 @@ define amdgpu_kernel void @simple_nested_if(ptr addrspace(1) nocapture %arg) {
 ; GCN-O0-NEXT:    s_mov_b64 exec, s[0:1]
 ; GCN-O0-NEXT:    s_cbranch_execz .LBB0_3
 ; GCN-O0-NEXT:  ; %bb.2: ; %bb.inner.then
+; GCN-O0-NEXT:    s_waitcnt expcnt(1)
+; GCN-O0-NEXT:    buffer_load_dword v1, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    s_or_saveexec_b64 s[8:9], -1
 ; GCN-O0-NEXT:    s_waitcnt expcnt(0)
 ; GCN-O0-NEXT:    buffer_load_dword v4, off, s[12:15], 0 ; 4-byte Folded Reload
@@ -107,9 +108,7 @@ define amdgpu_kernel void @simple_nested_if(ptr addrspace(1) nocapture %arg) {
 ; GCN-O0-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-O0-NEXT:    v_readlane_b32 s0, v4, 0
 ; GCN-O0-NEXT:    v_readlane_b32 s1, v4, 1
-; GCN-O0-NEXT:    buffer_load_dword v1, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    v_mov_b32_e32 v0, 1
-; GCN-O0-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-O0-NEXT:    v_add_i32_e64 v1, s[2:3], v1, v0
 ; GCN-O0-NEXT:    v_ashrrev_i32_e64 v3, 31, v1
 ; GCN-O0-NEXT:    ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
@@ -236,6 +235,7 @@ define amdgpu_kernel void @uncollapsable_nested_if(ptr addrspace(1) nocapture %a
 ; GCN-O0-NEXT:    s_mov_b64 exec, s[0:1]
 ; GCN-O0-NEXT:    s_cbranch_execz .LBB1_3
 ; GCN-O0-NEXT:  ; %bb.1: ; %bb.outer.then
+; GCN-O0-NEXT:    buffer_load_dword v0, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    s_or_saveexec_b64 s[8:9], -1
 ; GCN-O0-NEXT:    s_waitcnt expcnt(0)
 ; GCN-O0-NEXT:    buffer_load_dword v4, off, s[12:15], 0 ; 4-byte Folded Reload
@@ -243,14 +243,12 @@ define amdgpu_kernel void @uncollapsable_nested_if(ptr addrspace(1) nocapture %a
 ; GCN-O0-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-O0-NEXT:    v_readlane_b32 s4, v4, 0
 ; GCN-O0-NEXT:    v_readlane_b32 s5, v4, 1
-; GCN-O0-NEXT:    buffer_load_dword v0, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    s_mov_b32 s2, 0xf000
 ; GCN-O0-NEXT:    s_mov_b32 s0, 0
 ; GCN-O0-NEXT:    ; kill: def $sgpr0 killed $sgpr0 def $sgpr0_sgpr1
 ; GCN-O0-NEXT:    s_mov_b32 s1, s2
 ; GCN-O0-NEXT:    ; kill: def $sgpr4_sgpr5 killed $sgpr4_sgpr5 def $sgpr4_sgpr5_sgpr6_sgpr7
 ; GCN-O0-NEXT:    s_mov_b64 s[6:7], s[0:1]
-; GCN-O0-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-O0-NEXT:    v_ashrrev_i32_e64 v3, 31, v0
 ; GCN-O0-NEXT:    v_mov_b32_e32 v1, v0
 ; GCN-O0-NEXT:    v_mov_b32_e32 v2, v3
@@ -269,6 +267,8 @@ define amdgpu_kernel void @uncollapsable_nested_if(ptr addrspace(1) nocapture %a
 ; GCN-O0-NEXT:    s_mov_b64 exec, s[0:1]
 ; GCN-O0-NEXT:    s_cbranch_execz .LBB1_4
 ; GCN-O0-NEXT:  ; %bb.2: ; %bb.inner.then
+; GCN-O0-NEXT:    s_waitcnt expcnt(1)
+; GCN-O0-NEXT:    buffer_load_dword v1, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    s_or_saveexec_b64 s[8:9], -1
 ; GCN-O0-NEXT:    s_waitcnt expcnt(0)
 ; GCN-O0-NEXT:    buffer_load_dword v4, off, s[12:15], 0 ; 4-byte Folded Reload
@@ -276,9 +276,7 @@ define amdgpu_kernel void @uncollapsable_nested_if(ptr addrspace(1) nocapture %a
 ; GCN-O0-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-O0-NEXT:    v_readlane_b32 s0, v4, 0
 ; GCN-O0-NEXT:    v_readlane_b32 s1, v4, 1
-; GCN-O0-NEXT:    buffer_load_dword v1, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    v_mov_b32_e32 v0, 1
-; GCN-O0-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-O0-NEXT:    v_add_i32_e64 v1, s[2:3], v1, v0
 ; GCN-O0-NEXT:    v_ashrrev_i32_e64 v3, 31, v1
 ; GCN-O0-NEXT:    ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
@@ -312,9 +310,9 @@ define amdgpu_kernel void @uncollapsable_nested_if(ptr addrspace(1) nocapture %a
 ; GCN-O0-NEXT:    v_readlane_b32 s2, v4, 4
 ; GCN-O0-NEXT:    v_readlane_b32 s3, v4, 5
 ; GCN-O0-NEXT:    s_or_b64 exec, exec, s[2:3]
+; GCN-O0-NEXT:    buffer_load_dword v1, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    v_readlane_b32 s0, v4, 0
 ; GCN-O0-NEXT:    v_readlane_b32 s1, v4, 1
-; GCN-O0-NEXT:    buffer_load_dword v1, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    v_mov_b32_e32 v0, 2
 ; GCN-O0-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-O0-NEXT:    v_add_i32_e64 v1, s[2:3], v1, v0
@@ -456,17 +454,18 @@ define amdgpu_kernel void @nested_if_if_else(ptr addrspace(1) nocapture %arg) {
 ; GCN-O0-NEXT:    s_mov_b64 exec, s[0:1]
 ; GCN-O0-NEXT:    s_cbranch_execz .LBB2_6
 ; GCN-O0-NEXT:  ; %bb.1: ; %bb.outer.then
+; GCN-O0-NEXT:    buffer_load_dword v0, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    s_or_saveexec_b64 s[6:7], -1
 ; GCN-O0-NEXT:    s_waitcnt expcnt(0)
 ; GCN-O0-NEXT:    buffer_load_dword v4, off, s[12:15], 0 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    s_mov_b64 exec, s[6:7]
-; GCN-O0-NEXT:    buffer_load_dword v0, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    s_mov_b32 s0, 2
-; GCN-O0-NEXT:    s_waitcnt vmcnt(0)
+; GCN-O0-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-O0-NEXT:    v_cmp_ne_u32_e64 s[0:1], v0, s0
 ; GCN-O0-NEXT:    s_mov_b64 s[2:3], exec
 ; GCN-O0-NEXT:    s_and_b64 s[0:1], s[2:3], s[0:1]
 ; GCN-O0-NEXT:    s_xor_b64 s[2:3], s[0:1], s[2:3]
+; GCN-O0-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-O0-NEXT:    v_writelane_b32 v4, s2, 4
 ; GCN-O0-NEXT:    v_writelane_b32 v4, s3, 5
 ; GCN-O0-NEXT:    s_or_saveexec_b64 s[6:7], -1
@@ -493,6 +492,7 @@ define amdgpu_kernel void @nested_if_if_else(ptr addrspace(1) nocapture %arg) {
 ; GCN-O0-NEXT:    s_xor_b64 exec, exec, s[0:1]
 ; GCN-O0-NEXT:    s_cbranch_execz .LBB2_5
 ; GCN-O0-NEXT:  ; %bb.3: ; %bb.then
+; GCN-O0-NEXT:    buffer_load_dword v1, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    s_or_saveexec_b64 s[6:7], -1
 ; GCN-O0-NEXT:    s_waitcnt expcnt(0)
 ; GCN-O0-NEXT:    buffer_load_dword v4, off, s[12:15], 0 ; 4-byte Folded Reload
@@ -500,9 +500,7 @@ define amdgpu_kernel void @nested_if_if_else(ptr addrspace(1) nocapture %arg) {
 ; GCN-O0-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-O0-NEXT:    v_readlane_b32 s0, v4, 0
 ; GCN-O0-NEXT:    v_readlane_b32 s1, v4, 1
-; GCN-O0-NEXT:    buffer_load_dword v1, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    v_mov_b32_e32 v0, 1
-; GCN-O0-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-O0-NEXT:    v_add_i32_e64 v1, s[2:3], v1, v0
 ; GCN-O0-NEXT:    v_ashrrev_i32_e64 v3, 31, v1
 ; GCN-O0-NEXT:    ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
@@ -518,6 +516,7 @@ define amdgpu_kernel void @nested_if_if_else(ptr addrspace(1) nocapture %arg) {
 ; GCN-O0-NEXT:    buffer_store_dword v0, v[1:2], s[0:3], 0 addr64
 ; GCN-O0-NEXT:    s_branch .LBB2_5
 ; GCN-O0-NEXT:  .LBB2_4: ; %bb.else
+; GCN-O0-NEXT:    buffer_load_dword v1, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    s_or_saveexec_b64 s[6:7], -1
 ; GCN-O0-NEXT:    s_waitcnt expcnt(0)
 ; GCN-O0-NEXT:    buffer_load_dword v4, off, s[12:15], 0 ; 4-byte Folded Reload
@@ -525,9 +524,7 @@ define amdgpu_kernel void @nested_if_if_else(ptr addrspace(1) nocapture %arg) {
 ; GCN-O0-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-O0-NEXT:    v_readlane_b32 s0, v4, 0
 ; GCN-O0-NEXT:    v_readlane_b32 s1, v4, 1
-; GCN-O0-NEXT:    buffer_load_dword v1, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    v_mov_b32_e32 v0, 2
-; GCN-O0-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-O0-NEXT:    v_add_i32_e64 v1, s[2:3], v1, v0
 ; GCN-O0-NEXT:    v_ashrrev_i32_e64 v3, 31, v1
 ; GCN-O0-NEXT:    ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
@@ -724,13 +721,13 @@ define amdgpu_kernel void @nested_if_else_if(ptr addrspace(1) nocapture %arg) {
 ; GCN-O0-NEXT:    s_xor_b64 exec, exec, s[0:1]
 ; GCN-O0-NEXT:    s_cbranch_execz .LBB3_8
 ; GCN-O0-NEXT:  ; %bb.2: ; %bb.outer.then
+; GCN-O0-NEXT:    buffer_load_dword v0, off, s[12:15], 0 offset:12 ; 4-byte Folded Reload
+; GCN-O0-NEXT:    buffer_load_dword v2, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
+; GCN-O0-NEXT:    buffer_load_dword v3, off, s[12:15], 0 offset:8 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    s_or_saveexec_b64 s[8:9], -1
 ; GCN-O0-NEXT:    s_waitcnt expcnt(0)
 ; GCN-O0-NEXT:    buffer_load_dword v6, off, s[12:15], 0 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    s_mov_b64 exec, s[8:9]
-; GCN-O0-NEXT:    buffer_load_dword v0, off, s[12:15], 0 offset:12 ; 4-byte Folded Reload
-; GCN-O0-NEXT:    buffer_load_dword v2, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
-; GCN-O0-NEXT:    buffer_load_dword v3, off, s[12:15], 0 offset:8 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    s_mov_b32 s0, 0xf000
 ; GCN-O0-NEXT:    s_mov_b32 s2, 0
 ; GCN-O0-NEXT:    s_mov_b32 s4, s2
@@ -740,11 +737,12 @@ define amdgpu_kernel void @nested_if_else_if(ptr addrspace(1) nocapture %arg) {
 ; GCN-O0-NEXT:    ; kill: def $sgpr0_sgpr1 killed $sgpr0_sgpr1 def $sgpr0_sgpr1_sgpr2_sgpr3
 ; GCN-O0-NEXT:    s_mov_b64 s[2:3], s[4:5]
 ; GCN-O0-NEXT:    v_mov_b32_e32 v1, 1
-; GCN-O0-NEXT:    s_waitcnt vmcnt(0)
+; GCN-O0-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-O0-NEXT:    buffer_store_dword v1, v[2:3], s[0:3], 0 addr64 offset:4
 ; GCN-O0-NEXT:    s_mov_b32 s0, 2
 ; GCN-O0-NEXT:    v_cmp_eq_u32_e64 s[2:3], v0, s0
 ; GCN-O0-NEXT:    s_mov_b64 s[0:1], exec
+; GCN-O0-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-O0-NEXT:    v_writelane_b32 v6, s0, 4
 ; GCN-O0-NEXT:    v_writelane_b32 v6, s1, 5
 ; GCN-O0-NEXT:    s_or_saveexec_b64 s[8:9], -1
@@ -770,13 +768,13 @@ define amdgpu_kernel void @nested_if_else_if(ptr addrspace(1) nocapture %arg) {
 ; GCN-O0-NEXT:    buffer_store_dword v0, v[1:2], s[0:3], 0 addr64 offset:8
 ; GCN-O0-NEXT:    s_branch .LBB3_7
 ; GCN-O0-NEXT:  .LBB3_4: ; %bb.outer.else
+; GCN-O0-NEXT:    buffer_load_dword v0, off, s[12:15], 0 offset:12 ; 4-byte Folded Reload
+; GCN-O0-NEXT:    buffer_load_dword v2, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
+; GCN-O0-NEXT:    buffer_load_dword v3, off, s[12:15], 0 offset:8 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    s_or_saveexec_b64 s[8:9], -1
 ; GCN-O0-NEXT:    s_waitcnt expcnt(0)
 ; GCN-O0-NEXT:    buffer_load_dword v6, off, s[12:15], 0 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    s_mov_b64 exec, s[8:9]
-; GCN-O0-NEXT:    buffer_load_dword v0, off, s[12:15], 0 offset:12 ; 4-byte Folded Reload
-; GCN-O0-NEXT:    buffer_load_dword v2, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
-; GCN-O0-NEXT:    buffer_load_dword v3, off, s[12:15], 0 offset:8 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    s_mov_b32 s1, 0xf000
 ; GCN-O0-NEXT:    s_mov_b32 s0, 0
 ; GCN-O0-NEXT:    s_mov_b32 s2, s0
@@ -786,10 +784,11 @@ define amdgpu_kernel void @nested_if_else_if(ptr addrspace(1) nocapture %arg) {
 ; GCN-O0-NEXT:    ; kill: def $sgpr4_sgpr5 killed $sgpr4_sgpr5 def $sgpr4_sgpr5_sgpr6_sgpr7
 ; GCN-O0-NEXT:    s_mov_b64 s[6:7], s[2:3]
 ; GCN-O0-NEXT:    v_mov_b32_e32 v1, 3
-; GCN-O0-NEXT:    s_waitcnt vmcnt(0)
+; GCN-O0-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-O0-NEXT:    buffer_store_dword v1, v[2:3], s[4:7], 0 addr64 offset:12
 ; GCN-O0-NEXT:    v_cmp_eq_u32_e64 s[2:3], v0, s0
 ; GCN-O0-NEXT:    s_mov_b64 s[0:1], exec
+; GCN-O0-NEXT:    s_waitcnt vmcnt(1)
 ; GCN-O0-NEXT:    v_writelane_b32 v6, s0, 6
 ; GCN-O0-NEXT:    v_writelane_b32 v6, s1, 7
 ; GCN-O0-NEXT:    s_or_saveexec_b64 s[8:9], -1
@@ -927,6 +926,7 @@ define amdgpu_kernel void @s_endpgm_unsafe_barrier(ptr addrspace(1) nocapture %a
 ; GCN-O0-NEXT:    s_mov_b64 exec, s[0:1]
 ; GCN-O0-NEXT:    s_cbranch_execz .LBB4_2
 ; GCN-O0-NEXT:  ; %bb.1: ; %bb.then
+; GCN-O0-NEXT:    buffer_load_dword v0, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    s_or_saveexec_b64 s[6:7], -1
 ; GCN-O0-NEXT:    s_waitcnt expcnt(0)
 ; GCN-O0-NEXT:    buffer_load_dword v3, off, s[12:15], 0 ; 4-byte Folded Reload
@@ -934,14 +934,12 @@ define amdgpu_kernel void @s_endpgm_unsafe_barrier(ptr addrspace(1) nocapture %a
 ; GCN-O0-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-O0-NEXT:    v_readlane_b32 s0, v3, 0
 ; GCN-O0-NEXT:    v_readlane_b32 s1, v3, 1
-; GCN-O0-NEXT:    buffer_load_dword v0, off, s[12:15], 0 offset:4 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    s_mov_b32 s2, 0xf000
 ; GCN-O0-NEXT:    s_mov_b32 s4, 0
 ; GCN-O0-NEXT:    ; kill: def $sgpr4 killed $sgpr4 def $sgpr4_sgpr5
 ; GCN-O0-NEXT:    s_mov_b32 s5, s2
 ; GCN-O0-NEXT:    ; kill: def $sgpr0_sgpr1 killed $sgpr0_sgpr1 def $sgpr0_sgpr1_sgpr2_sgpr3
 ; GCN-O0-NEXT:    s_mov_b64 s[2:3], s[4:5]
-; GCN-O0-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-O0-NEXT:    v_ashrrev_i32_e64 v2, 31, v0
 ; GCN-O0-NEXT:    ; kill: def $vgpr0 killed $vgpr0 def $vgpr0_vgpr1 killed $exec
 ; GCN-O0-NEXT:    v_mov_b32_e32 v1, v2
@@ -1066,6 +1064,8 @@ define void @scc_liveness(i32 %arg) local_unnamed_addr #0 {
 ; GCN-O0-NEXT:    s_mov_b64 exec, s[14:15]
 ; GCN-O0-NEXT:  .LBB5_1: ; %bb1
 ; GCN-O0-NEXT:    ; =>This Inner Loop Header: Depth=1
+; GCN-O0-NEXT:    s_waitcnt expcnt(1)
+; GCN-O0-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:4 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    s_or_saveexec_b64 s[14:15], -1
 ; GCN-O0-NEXT:    s_waitcnt expcnt(0)
 ; GCN-O0-NEXT:    buffer_load_dword v6, off, s[0:3], s32 ; 4-byte Folded Reload
@@ -1077,9 +1077,7 @@ define void @scc_liveness(i32 %arg) local_unnamed_addr #0 {
 ; GCN-O0-NEXT:    v_readlane_b32 s7, v6, 1
 ; GCN-O0-NEXT:    v_writelane_b32 v6, s6, 4
 ; GCN-O0-NEXT:    v_writelane_b32 v6, s7, 5
-; GCN-O0-NEXT:    buffer_load_dword v0, off, s[0:3], s32 offset:4 ; 4-byte Folded Reload
 ; GCN-O0-NEXT:    s_mov_b32 s4, 0x207
-; GCN-O0-NEXT:    s_waitcnt vmcnt(0)
 ; GCN-O0-NEXT:    v_cmp_lt_i32_e64 s[4:5], v0, s4
 ; GCN-O0-NEXT:    s...
[truncated]

@alex-t alex-t requested review from jayfoad, ruiling and arsenm October 22, 2024 12:10
@alex-t
Contributor Author

alex-t commented Oct 25, 2024

Ping. Is anybody interested in taking a look?


⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff a2ba438f3e5635e368333213914c7452a6a6a2da 640a436c2c39613e32c76226eef1b618a4d2400f --extensions cpp,h -- llvm/lib/Target/AMDGPU/SIInstrInfo.cpp llvm/lib/Target/AMDGPU/SIInstrInfo.h
View the diff from clang-format here.
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 2da8356328..fae6e8e4f6 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -8938,7 +8938,7 @@ bool SIInstrInfo::isBasicBlockPrologue(const MachineInstr &MI,
          (Opcode == AMDGPU::IMPLICIT_DEF ||
           (!MI.isTerminator() && Opcode != AMDGPU::COPY &&
            MI.modifiesRegister(AMDGPU::EXEC, &RI)) ||
-              isPrologueOperandReload(MI));
+          isPrologueOperandReload(MI));
 }
 
 MachineInstrBuilder

@ruiling
Contributor

ruiling commented Oct 30, 2024

Thanks for doing this! But frankly speaking, this is making the implementation more confusing. I think most of the weirdness in the implementation is caused by the way the target hook is defined. Querying whether each individual instruction is part of the block prologue does not seem like the right way to go.

The basic idea behind the block prologue is that we have a special instruction that sets up the exec mask for the block, and all the instructions before this specific instruction are prologue instructions. So the prologue search should work by forward-iterating over the instructions until we see the first instruction that modifies exec, which would be the last prologue instruction. There are ways to speed up the search to avoid visiting all the instructions in the worst case. For example, we can check the instruction types that could possibly appear in the prologue. If we find an instruction that is NOT in the list of possible prologue instructions before we see an exec setup instruction, then we should return and report that there is no prologue at all.

Following this idea, maybe we should define the hook as skipBlockPrologue(MachineBasicBlock, beginIterator), which would check for prologue instructions starting from beginIterator and return the iterator right after the last prologue instruction (if there is no prologue instruction, just return beginIterator). Then the implementation would be a simple forward iteration over the instructions with no recursion. With the returned iterator, I think the caller should be able to do whatever they want with the prologue instructions. Sounds reasonable?
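
Roughly, I imagine something along these lines (just a sketch, assuming the hook lives in SIInstrInfo; the name skipBlockPrologue and the candidate-instruction filter are illustrative, not a final API):

    // Returns the iterator just past the last prologue instruction, i.e. just
    // past the first exec-modifying non-terminator, or Begin if the block has
    // no prologue at all.
    MachineBasicBlock::iterator
    SIInstrInfo::skipBlockPrologue(MachineBasicBlock &MBB,
                                   MachineBasicBlock::iterator Begin) const {
      for (MachineBasicBlock::iterator I = Begin, E = MBB.end(); I != E; ++I) {
        if (!I->isTerminator() && I->getOpcode() != AMDGPU::COPY &&
            I->modifiesRegister(AMDGPU::EXEC, &RI))
          return std::next(I); // Last prologue instruction found.
        // Only spill/reload pseudos and IMPLICIT_DEF may legitimately precede
        // the exec setup; anything else means there is no prologue here.
        if (!isSGPRSpill(*I) && !isWWMRegSpillOpcode(I->getOpcode()) &&
            I->getOpcode() != AMDGPU::IMPLICIT_DEF)
          return Begin;
      }
      return Begin;
    }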

@alex-t
Contributor Author

alex-t commented Oct 30, 2024

The current implementation aims to check whether the instruction in question produces operands for another prologue instruction, which is what causes the recursion. The user of the register defined by the instruction might be not only the last instruction in the prologue (i.e., the EXEC modification) but also, for example, a WWM reload using an SGPR as its offset.

Also, please note that the latest version of the algorithm returns false as soon as the next instruction is not a "prologue" instruction.
So we never walk over all the instructions in the block; in the worst case, we walk over the prologue sequence, which is usually not very long.

I agree that iterating through the whole prologue on each query is inconvenient. To change that, we'd have to change all the call sites in LLVM core code. Most of them look for an insertion point in a given basic block by walking from the block beginning.
Such a change would affect all targets, but we could create a hook that by default does the same walk calling isBasicBlockPrologue and does something smarter for AMDGPU.
The problem is that TargetInstrInfo is a stateless class, so we cannot store information in it. The best approach would be to lazily compute a kind of MachineInstr property (a flag?) and cache it. If we could, the query for the "prologue" property would be immediate.
I was thinking of adding a bit in TSFlags, as I know there are a few free positions left.

@ruiling
Contributor

ruiling commented Oct 31, 2024

Such a change would affect all targets, but we could create a hook that by default does the same walk calling isBasicBlockPrologue and does something smarter for AMDGPU. The problem is that TargetInstrInfo is a stateless class, so we cannot store information in it.

I think TargetInstrInfo is not the kind of stable ABI that needs a high bar to change, right? We are the only in-tree target that implements isBasicBlockPrologue(). For an out-of-tree target, it is fairly easy to switch to the new hook.

The best approach would be to lazily compute a kind of MachineInstr property (a flag?) and cache it. If we could, the query for the "prologue" property would be immediate. I was thinking of adding a bit in TSFlags, as I know there are a few free positions left.

I would say the "prologue" is not a static property of certain instruction types. An instruction is part of the prologue because it appears before the exec modification instruction.

@alex-t
Contributor Author

alex-t commented Oct 31, 2024

An instruction is part of the prologue because it appears before the exec modification instruction.

I would not like to rely on the instructions' lexical order. An instruction belongs to the prologue if it is part of the def-use chain that ends up on the exec mask writing instruction.

Just to check if I understand your idea:

  1. We consider that the lexical order of the instructions in the prologue is implied by the def-use relations. That allows us to avoid checking def-use relations in the prologue sequence.
  2. The prologue sequence includes all instructions from the basic block beginning to the first instruction modifying EXEC.

Is that what you wanted to convey?

@alex-t
Contributor Author

alex-t commented Oct 31, 2024

I think TargetInstrInfo is not the kind of stable ABI that needs a high bar to change, right? We are the only in-tree target that implements isBasicBlockPrologue(). For an out-of-tree target, it is fairly easy to switch to the new hook.

Assuming I understand your point correctly, we introduce a hook returning an iterator pointing right after the last prologue instruction (i.e., to the first non-prologue instruction).

Below is a brief list of the call sites and related issues we'd have to solve:

  1. MachineBasicBlock: SkipPHIsAndLabels and SkipPHIsLabelsAndDebug use isBasicBlockPrologue while iterating over the instruction list. To change those call sites we'd need to rely on the order of instructions: PHIs first, then the prologue sequence, debug instructions, labels, pseudo-ops, and whatever else. Nevertheless, this order is not guaranteed and might differ, e.g. PHIs, labels, pseudo-ops, debug instructions, prologue, or any permutation of these. This complicates things even more.

  2. MachineSink: the blockPrologueInterferes function really needs a list of all target-defined prologue instructions. We might want to define a helper that returns a range (firstNonPHI(), lastPrologue()); see the sketch below.

  3. RegAllocFastImpl::getMBBBeginInsertionPoint is one of the decision points defining that exact "correct" lexical order on which we are going to rely!

         // However if a prolog instruction reads a register that needs to be
         // reloaded, the reload should be inserted before the prolog.
         for (MachineOperand &MO : I->operands()) {
           if (MO.isReg())
             PrologLiveIns.insert(MO.getReg());
         }

     So, we again need a list of all the prologue instructions here to iterate through and collect the prolog live-ins.

Points 2 and 3 are just a question of interface, i.e. what hooks we'd like to have. The 1st is real trouble, as it depends on the iterative nature of the isBasicBlockPrologue usage.
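
For points 2 and 3, a small helper on top of the new hook might be enough (purely illustrative; it assumes the skipBlockPrologue hook sketched above):

    // A range that the callers in MachineSink / RegAllocFast could iterate
    // over instead of querying isBasicBlockPrologue instruction by instruction.
    iterator_range<MachineBasicBlock::iterator>
    getBlockPrologueRange(MachineBasicBlock &MBB) const {
      MachineBasicBlock::iterator B = MBB.getFirstNonPHI();
      return make_range(B, skipBlockPrologue(MBB, B));
    }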

@ruiling
Contributor

ruiling commented Nov 4, 2024

An instruction belongs to the prologue if it is part of the def-use chain that ends up on the exec mask writing instruction.

Is this correct? Like in the case below, the second sgpr-reload-from-vgpr is not in the prologue?
v_readlane s0, v0, 1
v_readlane s1, v0, 2
s_or_b32 exec, exec, s0

1. We consider that the lexical order of the instructions in the prologue is implied by the def-use relations. That allows us to avoid checking def-use relations in the prologue sequence.

I don't think the def-use relationship is important when checking whether an instruction belongs to the prologue. The prologue concept is mainly used to find the right insertion point for vector instructions (anything else?). For scalar instructions that produce the input to the exec mask setup, yes, we have to put them inside the prologue for correctness. For other scalar instructions, we can put them either in the prologue or after it. Right?

2. The prologue sequence includes all instructions from the basic block beginning to the first instruction modifying EXEC.

Yes

1. MachineBasicBlock: SkipPHIsAndLabels and SkipPHIsLabelsAndDebug use isBasicBlockPrologue while iterating over the instruction list. To change those call sites we'd need to rely on the order of instructions: PHIs first, then the prologue sequence, debug instructions, labels, pseudo-ops, and whatever else. Nevertheless, this order is not guaranteed and might differ, e.g. PHIs, labels, pseudo-ops, debug instructions, prologue, or any permutation of these. This complicates things even more.

Thank you for taking a further look into this. This is indeed an issue we need to solve properly. I am not sure whether we have a guaranteed order for these instructions; it seems we cannot assume too much (but we can still assume PHIs always come first)? So, to get a correct implementation of skipBlockPrologue, it feels reasonable to pass additional boolean arguments like skipLabel and skipDebug to skipBlockPrologue to tell it to skip these instructions as well.

But it's unclear to me what the expected behavior is in corner cases. For example, in SkipPHIsAndLabels(), we would want to find the insertion point before debug instructions but after prologue instructions. So what if the IR has prologue instructions after debug instructions? What would the IR for such a case look like?

@alex-t
Contributor Author

alex-t commented Nov 4, 2024

An instruction belongs to the prologue if it is part of the def-use chain that ends up on the exec mask writing instruction.

Is this correct? Like in the case below, the second sgpr-reload-from-vgpr is not in the prologue?
v_readlane s0, v0, 1
v_readlane s1, v0, 2
s_or_b32 exec, exec, s0

Exactly. Only the readlane producing s0 belongs to the prologue; the other one, producing s1, does not.

@alex-t
Contributor Author

alex-t commented Nov 4, 2024

I don't think the def-use relationship is important when checking whether an instruction belongs to the prologue. The prologue concept is mainly used to find the right insertion point for vector instructions (anything else?). For scalar instructions that produce the input to the exec mask setup, yes, we have to put them inside the prologue for correctness. For other scalar instructions, we can put them either in the prologue or after it. Right?

Adding unrelated instructions to the prologue is undesirable because any instruction defining a register starts a new segment in a live range.
During live-interval splitting we insert the copies right after the prologue, so I want to minimize the number of instructions creating new segments in the prologue to avoid possible interference.
As for the def-use information, I had in mind the scenario where an SGPR reload loads an operand for a WWM buffer load (the offset operand?), and that WWM buffer load, in turn, loads the VGPR from which we further reload an SGPR that is an operand for the EXEC mask update.
I don't know if the scenario above ever occurs, but I wanted the algorithm to be as general as possible.

@ruiling
Contributor

ruiling commented Nov 5, 2024

An instruction belongs to the prologue if it is part of the def-use chain that ends up on the exec mask writing instruction.

Is this correct? Like in the case below, the second sgpr-reload-from-vgpr is not in the prologue?
v_readlane s0, v0, 1
v_readlane s1, v0, 2
s_or_b32 exec, exec, s0

Exactly. Only the readlane producing s0 belongs to the prologue; the other one, producing s1, does not.

If the second v_readlane is not in the prologue, then SkipPHIsAndLabels would just return an insertion point between the two v_readlanes. That is not the right insertion point for vector instructions, right?

I want to minimize the number of instructions creating new segments in the prologue to avoid possible interference.

That makes sense to me. But we still want a clean and robust way of identifying the prologue.

@cdevadas
Collaborator

cdevadas commented Nov 5, 2024

The isBasicBlockPrologue hook was introduced by AMDGPU and has been in real use only for our backend. I share the concerns discussed between Alex and Ruiling.
The flag-based (or label-based) prolog partition was discussed by Matt a few times earlier when we tried to improve this hook.
But he gave up on that idea, stating that a demarcation based on a flag would be highly unstable, as many codegen passes might spoil it through incorrect insertion or scheduling, so that we lose the flag or its purpose.
I agree that we should rename this hook to skipBlockPrologue and use it to skip past the prolog instructions in a single go.

I also think we should consolidate SkipPHIsAndLabels and SkipPHIsLabelsAndDebug into the new hook skipBlockPrologue (with an additional argument to include debug instructions or not), because the former two functions, IIUC, are trying to skip past the generic prolog instructions: PHIs, labels, some pseudos, debug instructions, etc.
There should be a default implementation of skipBlockPrologue that skips the generic instructions and then allows targets to define their own prolog instructions (something that we're trying to fix now).
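
Something along these lines as the generic default (just a sketch; the signature and the extra argument are not final), with AMDGPU overriding it to additionally skip its exec-setup sequence:

    // Possible target-independent default in TargetInstrInfo: skip PHIs,
    // labels and (optionally) debug instructions at the top of the block.
    virtual MachineBasicBlock::iterator
    skipBlockPrologue(MachineBasicBlock &MBB, MachineBasicBlock::iterator I,
                      bool SkipDebugInstr = false) const {
      while (I != MBB.end() &&
             (I->isPHI() || I->isLabel() ||
              (SkipDebugInstr && I->isDebugInstr())))
        ++I;
      return I;
    }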

@alex-t
Contributor Author

alex-t commented Nov 5, 2024

An instruction belongs to the prologue if it is part of the def-use chain that ends up on the exec mask writing instruction.

Is this correct? Like in the case below, the second sgpr-reload-from-vgpr is not in the prologue?
v_readlane s0, v0, 1
v_readlane s1, v0, 2
s_or_b32 exec, exec, s0

Exactly. Only the readlane producing s0 belongs to the prologue; the other one, producing s1, does not.

If the second v_readlane is not in the prologue, then SkipPHIsAndLabels would just return an insertion point between the two v_readlanes. That is not the right insertion point for vector instructions, right?

I am unsure whether the pattern you used as an example can actually exist.
If you look at RegAllocFastImpl::getMBBBeginInsertionPoint, you can see that all the reloads are inserted after the prologue except those producing operands for other prologue instructions.

    // Most reloads should be inserted after prolog instructions.
    if (!TII->isBasicBlockPrologue(*I))
      break;

    // However if a prolog instruction reads a register that needs to be
    // reloaded, the reload should be inserted before the prolog.
    for (MachineOperand &MO : I->operands()) {
      if (MO.isReg())
        PrologLiveIns.insert(MO.getReg());
    }

So I think that the "v_readlane s1, v0, 2" would be inserted after the "s_or_b32 exec, exec, s0".

@alex-t
Contributor Author

alex-t commented Nov 5, 2024

The only question to decide (and the one I need advice on) is:
Do we rely on the lexical order of the prologue instructions? In other words, are we sure that the prologue sequence is always correct at the point where we call skipBlockPrologue?
As I pointed out above, the reloads are themselves inserted at a point determined by calling isBasicBlockPrologue.
The code in RegAllocFastImpl::getMBBBeginInsertionPoint checks whether a given reload produces an operand for other prologue instructions and collects those that do in a special vector for later insertion.
We have more places that insert reloads besides RegAllocFast, I guess. So we need to be sure that we don't have to track the def-use dependency and can just iterate until the very first EXEC update.

@alex-t
Contributor Author

alex-t commented Nov 23, 2024

Ping. I tried to address your concerns. Any other suggestions?
