[AMDGPU] Don't fix the scavenge slot at offset 0 #79136
Conversation
When the scavenge slot is no longer a fixed stack item, frame indices start at 0, so we should be able to handle them.
At the moment, the emergency spill slot is a fixed object for entry functions and chain functions, and a regular stack object otherwise. This patch adopts the latter behaviour for entry/chain functions too. It seems this was always the intention [1], and it will also save us a bit of stack space in cases where the first stack object has a large alignment.

[1] llvm@34c8b83
@llvm/pr-subscribers-backend-amdgpu @llvm/pr-subscribers-debuginfo

Author: Diana Picus (rovka)

Changes

At the moment, the emergency spill slot is a fixed object for entry [1] 34c8b83

Patch is 728.02 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/79136.diff

67 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
index e8142244b7db69..3c362d337b6d38 100644
--- a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
@@ -520,14 +520,10 @@ int SIMachineFunctionInfo::getScavengeFI(MachineFrameInfo &MFI,
const SIRegisterInfo &TRI) {
if (ScavengeFI)
return *ScavengeFI;
- if (isBottomOfStack()) {
- ScavengeFI = MFI.CreateFixedObject(
- TRI.getSpillSize(AMDGPU::SGPR_32RegClass), 0, false);
- } else {
- ScavengeFI = MFI.CreateStackObject(
- TRI.getSpillSize(AMDGPU::SGPR_32RegClass),
- TRI.getSpillAlign(AMDGPU::SGPR_32RegClass), false);
- }
+
+ ScavengeFI =
+ MFI.CreateStackObject(TRI.getSpillSize(AMDGPU::SGPR_32RegClass),
+ TRI.getSpillAlign(AMDGPU::SGPR_32RegClass), false);
return *ScavengeFI;
}
diff --git a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
index a2cacb5cbaa393..0928f1e6c6ce47 100644
--- a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
@@ -2273,9 +2273,6 @@ bool SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI,
if (FrameReg)
FIOp.ChangeToRegister(FrameReg, false);
- if (!Offset)
- return false;
-
MachineOperand *OffsetOp =
TII->getNamedOperand(*MI, AMDGPU::OpName::offset);
int64_t NewOffset = Offset + OffsetOp->getImm();
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/call-outgoing-stack-args.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/call-outgoing-stack-args.ll
index e597ce6f114a6b..24652982c6584f 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/call-outgoing-stack-args.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/call-outgoing-stack-args.ll
@@ -65,6 +65,8 @@ define amdgpu_kernel void @kernel_caller_byval() {
; MUBUF-NEXT: s_add_u32 s0, s0, s7
; MUBUF-NEXT: s_addc_u32 s1, s1, 0
; MUBUF-NEXT: v_mov_b32_e32 v0, 0
+; MUBUF-NEXT: buffer_store_dword v0, off, s[0:3], 0
+; MUBUF-NEXT: buffer_store_dword v0, off, s[0:3], 0 offset:4
; MUBUF-NEXT: buffer_store_dword v0, off, s[0:3], 0 offset:8
; MUBUF-NEXT: buffer_store_dword v0, off, s[0:3], 0 offset:12
; MUBUF-NEXT: buffer_store_dword v0, off, s[0:3], 0 offset:16
@@ -95,25 +97,23 @@ define amdgpu_kernel void @kernel_caller_byval() {
; MUBUF-NEXT: buffer_store_dword v0, off, s[0:3], 0 offset:116
; MUBUF-NEXT: buffer_store_dword v0, off, s[0:3], 0 offset:120
; MUBUF-NEXT: buffer_store_dword v0, off, s[0:3], 0 offset:124
-; MUBUF-NEXT: buffer_store_dword v0, off, s[0:3], 0 offset:128
-; MUBUF-NEXT: buffer_store_dword v0, off, s[0:3], 0 offset:132
-; MUBUF-NEXT: buffer_load_dword v0, off, s[0:3], 0 offset:8
+; MUBUF-NEXT: buffer_load_dword v0, off, s[0:3], 0
; MUBUF-NEXT: s_nop 0
-; MUBUF-NEXT: buffer_load_dword v1, off, s[0:3], 0 offset:12
-; MUBUF-NEXT: buffer_load_dword v2, off, s[0:3], 0 offset:16
-; MUBUF-NEXT: buffer_load_dword v3, off, s[0:3], 0 offset:20
-; MUBUF-NEXT: buffer_load_dword v4, off, s[0:3], 0 offset:24
-; MUBUF-NEXT: buffer_load_dword v5, off, s[0:3], 0 offset:28
-; MUBUF-NEXT: buffer_load_dword v6, off, s[0:3], 0 offset:32
-; MUBUF-NEXT: buffer_load_dword v7, off, s[0:3], 0 offset:36
-; MUBUF-NEXT: buffer_load_dword v8, off, s[0:3], 0 offset:40
-; MUBUF-NEXT: buffer_load_dword v9, off, s[0:3], 0 offset:44
-; MUBUF-NEXT: buffer_load_dword v10, off, s[0:3], 0 offset:48
-; MUBUF-NEXT: buffer_load_dword v11, off, s[0:3], 0 offset:52
-; MUBUF-NEXT: buffer_load_dword v12, off, s[0:3], 0 offset:56
-; MUBUF-NEXT: buffer_load_dword v13, off, s[0:3], 0 offset:60
-; MUBUF-NEXT: buffer_load_dword v14, off, s[0:3], 0 offset:64
-; MUBUF-NEXT: buffer_load_dword v15, off, s[0:3], 0 offset:68
+; MUBUF-NEXT: buffer_load_dword v1, off, s[0:3], 0 offset:4
+; MUBUF-NEXT: buffer_load_dword v2, off, s[0:3], 0 offset:8
+; MUBUF-NEXT: buffer_load_dword v3, off, s[0:3], 0 offset:12
+; MUBUF-NEXT: buffer_load_dword v4, off, s[0:3], 0 offset:16
+; MUBUF-NEXT: buffer_load_dword v5, off, s[0:3], 0 offset:20
+; MUBUF-NEXT: buffer_load_dword v6, off, s[0:3], 0 offset:24
+; MUBUF-NEXT: buffer_load_dword v7, off, s[0:3], 0 offset:28
+; MUBUF-NEXT: buffer_load_dword v8, off, s[0:3], 0 offset:32
+; MUBUF-NEXT: buffer_load_dword v9, off, s[0:3], 0 offset:36
+; MUBUF-NEXT: buffer_load_dword v10, off, s[0:3], 0 offset:40
+; MUBUF-NEXT: buffer_load_dword v11, off, s[0:3], 0 offset:44
+; MUBUF-NEXT: buffer_load_dword v12, off, s[0:3], 0 offset:48
+; MUBUF-NEXT: buffer_load_dword v13, off, s[0:3], 0 offset:52
+; MUBUF-NEXT: buffer_load_dword v14, off, s[0:3], 0 offset:56
+; MUBUF-NEXT: buffer_load_dword v15, off, s[0:3], 0 offset:60
; MUBUF-NEXT: s_movk_i32 s32, 0x1400
; MUBUF-NEXT: s_getpc_b64 s[4:5]
; MUBUF-NEXT: s_add_u32 s4, s4, external_void_func_byval@rel32@lo+4
@@ -160,6 +160,7 @@ define amdgpu_kernel void @kernel_caller_byval() {
; FLATSCR-NEXT: s_addc_u32 flat_scratch_hi, s1, 0
; FLATSCR-NEXT: v_mov_b32_e32 v1, 0
; FLATSCR-NEXT: s_mov_b32 s0, 0
+; FLATSCR-NEXT: scratch_store_dwordx2 off, v[0:1], s0
; FLATSCR-NEXT: scratch_store_dwordx2 off, v[0:1], s0 offset:8
; FLATSCR-NEXT: scratch_store_dwordx2 off, v[0:1], s0 offset:16
; FLATSCR-NEXT: scratch_store_dwordx2 off, v[0:1], s0 offset:24
@@ -175,16 +176,15 @@ define amdgpu_kernel void @kernel_caller_byval() {
; FLATSCR-NEXT: scratch_store_dwordx2 off, v[0:1], s0 offset:104
; FLATSCR-NEXT: scratch_store_dwordx2 off, v[0:1], s0 offset:112
; FLATSCR-NEXT: scratch_store_dwordx2 off, v[0:1], s0 offset:120
-; FLATSCR-NEXT: scratch_store_dwordx2 off, v[0:1], s0 offset:128
-; FLATSCR-NEXT: scratch_load_dwordx2 v[0:1], off, s0 offset:8
+; FLATSCR-NEXT: scratch_load_dwordx2 v[0:1], off, s0
; FLATSCR-NEXT: s_nop 0
-; FLATSCR-NEXT: scratch_load_dwordx2 v[2:3], off, s0 offset:16
-; FLATSCR-NEXT: scratch_load_dwordx2 v[4:5], off, s0 offset:24
-; FLATSCR-NEXT: scratch_load_dwordx2 v[6:7], off, s0 offset:32
-; FLATSCR-NEXT: scratch_load_dwordx2 v[8:9], off, s0 offset:40
-; FLATSCR-NEXT: scratch_load_dwordx2 v[10:11], off, s0 offset:48
-; FLATSCR-NEXT: scratch_load_dwordx2 v[12:13], off, s0 offset:56
-; FLATSCR-NEXT: scratch_load_dwordx2 v[14:15], off, s0 offset:64
+; FLATSCR-NEXT: scratch_load_dwordx2 v[2:3], off, s0 offset:8
+; FLATSCR-NEXT: scratch_load_dwordx2 v[4:5], off, s0 offset:16
+; FLATSCR-NEXT: scratch_load_dwordx2 v[6:7], off, s0 offset:24
+; FLATSCR-NEXT: scratch_load_dwordx2 v[8:9], off, s0 offset:32
+; FLATSCR-NEXT: scratch_load_dwordx2 v[10:11], off, s0 offset:40
+; FLATSCR-NEXT: scratch_load_dwordx2 v[12:13], off, s0 offset:48
+; FLATSCR-NEXT: scratch_load_dwordx2 v[14:15], off, s0 offset:56
; FLATSCR-NEXT: s_movk_i32 s32, 0x50
; FLATSCR-NEXT: s_getpc_b64 s[0:1]
; FLATSCR-NEXT: s_add_u32 s0, s0, external_void_func_byval@rel32@lo+4
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/crash-stack-address-O0.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/crash-stack-address-O0.ll
index 9580326d7b78fa..0d793654f7ea5f 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/crash-stack-address-O0.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/crash-stack-address-O0.ll
@@ -12,10 +12,10 @@ define amdgpu_kernel void @stack_write_fi() {
; CHECK-NEXT: s_mov_b32 s5, 0
; CHECK-NEXT: s_mov_b32 s4, 0
; CHECK-NEXT: v_mov_b32_e32 v0, s5
-; CHECK-NEXT: buffer_store_dword v0, off, s[0:3], 0 offset:4
+; CHECK-NEXT: buffer_store_dword v0, off, s[0:3], 0
; CHECK-NEXT: s_waitcnt vmcnt(0)
; CHECK-NEXT: v_mov_b32_e32 v0, s4
-; CHECK-NEXT: buffer_store_dword v0, off, s[0:3], 0 offset:8
+; CHECK-NEXT: buffer_store_dword v0, off, s[0:3], 0 offset:4
; CHECK-NEXT: s_waitcnt vmcnt(0)
; CHECK-NEXT: s_endpgm
entry:
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch-init.gfx.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch-init.gfx.ll
index dcad707acaf200..b4b95fdab4ab25 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch-init.gfx.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch-init.gfx.ll
@@ -12,7 +12,7 @@ define amdgpu_ps void @amdgpu_ps() {
; MESA-NEXT: s_add_u32 flat_scratch_lo, s2, s4
; MESA-NEXT: s_mov_b64 s[0:1], src_private_base
; MESA-NEXT: s_addc_u32 flat_scratch_hi, s3, 0
-; MESA-NEXT: v_mov_b32_e32 v0, 4
+; MESA-NEXT: v_mov_b32_e32 v0, 0
; MESA-NEXT: v_mov_b32_e32 v1, s1
; MESA-NEXT: v_mov_b32_e32 v2, 0
; MESA-NEXT: flat_store_dword v[0:1], v2
@@ -24,7 +24,7 @@ define amdgpu_ps void @amdgpu_ps() {
; PAL-NEXT: s_getpc_b64 s[2:3]
; PAL-NEXT: s_mov_b32 s2, s0
; PAL-NEXT: s_load_dwordx2 s[2:3], s[2:3], 0x0
-; PAL-NEXT: v_mov_b32_e32 v0, 4
+; PAL-NEXT: v_mov_b32_e32 v0, 0
; PAL-NEXT: v_mov_b32_e32 v2, 0
; PAL-NEXT: s_waitcnt lgkmcnt(0)
; PAL-NEXT: s_and_b32 s3, s3, 0xffff
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
index 75065f677b652e..921bdb5015c79a 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/flat-scratch.ll
@@ -15,11 +15,11 @@ define amdgpu_kernel void @store_load_sindex_kernel(i32 %idx) {
; GFX9-NEXT: s_waitcnt lgkmcnt(0)
; GFX9-NEXT: s_lshl_b32 s1, s0, 2
; GFX9-NEXT: s_and_b32 s0, s0, 15
-; GFX9-NEXT: s_add_i32 s1, s1, 4
+; GFX9-NEXT: s_add_i32 s1, s1, 0
; GFX9-NEXT: s_lshl_b32 s0, s0, 2
; GFX9-NEXT: scratch_store_dword off, v0, s1
; GFX9-NEXT: s_waitcnt vmcnt(0)
-; GFX9-NEXT: s_add_i32 s0, s0, 4
+; GFX9-NEXT: s_add_i32 s0, s0, 0
; GFX9-NEXT: scratch_load_dword v0, off, s0 glc
; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: s_endpgm
@@ -36,8 +36,8 @@ define amdgpu_kernel void @store_load_sindex_kernel(i32 %idx) {
; GFX10-NEXT: s_and_b32 s1, s0, 15
; GFX10-NEXT: s_lshl_b32 s0, s0, 2
; GFX10-NEXT: s_lshl_b32 s1, s1, 2
-; GFX10-NEXT: s_add_i32 s0, s0, 4
-; GFX10-NEXT: s_add_i32 s1, s1, 4
+; GFX10-NEXT: s_add_i32 s0, s0, 0
+; GFX10-NEXT: s_add_i32 s1, s1, 0
; GFX10-NEXT: scratch_store_dword off, v0, s0
; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
; GFX10-NEXT: scratch_load_dword v0, off, s1 glc dlc
@@ -51,12 +51,12 @@ define amdgpu_kernel void @store_load_sindex_kernel(i32 %idx) {
; GFX940-NEXT: s_waitcnt lgkmcnt(0)
; GFX940-NEXT: s_lshl_b32 s1, s0, 2
; GFX940-NEXT: s_and_b32 s0, s0, 15
-; GFX940-NEXT: s_add_i32 s1, s1, 4
+; GFX940-NEXT: s_add_i32 s1, s1, 0
; GFX940-NEXT: s_lshl_b32 s0, s0, 2
; GFX940-NEXT: scratch_store_dword off, v0, s1 sc0 sc1
; GFX940-NEXT: s_waitcnt vmcnt(0)
; GFX940-NEXT: v_mov_b32_e32 v0, s0
-; GFX940-NEXT: scratch_load_dword v0, v0, off offset:4 sc0 sc1
+; GFX940-NEXT: scratch_load_dword v0, v0, off sc0 sc1
; GFX940-NEXT: s_waitcnt vmcnt(0)
; GFX940-NEXT: s_endpgm
;
@@ -69,10 +69,10 @@ define amdgpu_kernel void @store_load_sindex_kernel(i32 %idx) {
; GFX11-NEXT: s_lshl_b32 s1, s1, 2
; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX11-NEXT: v_dual_mov_b32 v0, 15 :: v_dual_mov_b32 v1, s1
-; GFX11-NEXT: s_add_i32 s0, s0, 4
+; GFX11-NEXT: s_add_i32 s0, s0, 0
; GFX11-NEXT: scratch_store_b32 off, v0, s0 dlc
; GFX11-NEXT: s_waitcnt_vscnt null, 0x0
-; GFX11-NEXT: scratch_load_b32 v0, v1, off offset:4 glc dlc
+; GFX11-NEXT: scratch_load_b32 v0, v1, off glc dlc
; GFX11-NEXT: s_waitcnt vmcnt(0)
; GFX11-NEXT: s_endpgm
;
@@ -87,9 +87,9 @@ define amdgpu_kernel void @store_load_sindex_kernel(i32 %idx) {
; GFX12-NEXT: s_lshl_b32 s0, s0, 2
; GFX12-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX12-NEXT: v_mov_b32_e32 v2, s0
-; GFX12-NEXT: scratch_store_b32 v0, v1, off offset:4 scope:SCOPE_SYS
+; GFX12-NEXT: scratch_store_b32 v0, v1, off scope:SCOPE_SYS
; GFX12-NEXT: s_wait_storecnt 0x0
-; GFX12-NEXT: scratch_load_b32 v0, v2, off offset:4 scope:SCOPE_SYS
+; GFX12-NEXT: scratch_load_b32 v0, v2, off scope:SCOPE_SYS
; GFX12-NEXT: s_wait_loadcnt 0x0
; GFX12-NEXT: s_endpgm
bb:
@@ -109,12 +109,12 @@ define amdgpu_kernel void @store_load_vindex_kernel() {
; GFX9-NEXT: v_lshlrev_b32_e32 v1, 2, v0
; GFX9-NEXT: v_sub_u32_e32 v0, 0, v0
; GFX9-NEXT: s_addc_u32 flat_scratch_hi, s1, 0
-; GFX9-NEXT: v_add_u32_e32 v1, 4, v1
+; GFX9-NEXT: v_add_u32_e32 v1, 0, v1
; GFX9-NEXT: v_mov_b32_e32 v2, 15
; GFX9-NEXT: v_lshlrev_b32_e32 v0, 2, v0
; GFX9-NEXT: scratch_store_dword v1, v2, off
; GFX9-NEXT: s_waitcnt vmcnt(0)
-; GFX9-NEXT: v_add_u32_e32 v0, 4, v0
+; GFX9-NEXT: v_add_u32_e32 v0, 0, v0
; GFX9-NEXT: scratch_load_dword v0, v0, off offset:124 glc
; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: s_endpgm
@@ -129,8 +129,8 @@ define amdgpu_kernel void @store_load_vindex_kernel() {
; GFX10-NEXT: v_lshlrev_b32_e32 v0, 2, v0
; GFX10-NEXT: v_mov_b32_e32 v2, 15
; GFX10-NEXT: v_lshlrev_b32_e32 v1, 2, v1
-; GFX10-NEXT: v_add_nc_u32_e32 v0, 4, v0
-; GFX10-NEXT: v_add_nc_u32_e32 v1, 4, v1
+; GFX10-NEXT: v_add_nc_u32_e32 v0, 0, v0
+; GFX10-NEXT: v_add_nc_u32_e32 v1, 0, v1
; GFX10-NEXT: scratch_store_dword v0, v2, off
; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
; GFX10-NEXT: scratch_load_dword v0, v1, off offset:124 glc dlc
@@ -143,9 +143,9 @@ define amdgpu_kernel void @store_load_vindex_kernel() {
; GFX940-NEXT: v_sub_u32_e32 v0, 0, v0
; GFX940-NEXT: v_mov_b32_e32 v2, 15
; GFX940-NEXT: v_lshlrev_b32_e32 v0, 2, v0
-; GFX940-NEXT: scratch_store_dword v1, v2, off offset:4 sc0 sc1
+; GFX940-NEXT: scratch_store_dword v1, v2, off sc0 sc1
; GFX940-NEXT: s_waitcnt vmcnt(0)
-; GFX940-NEXT: v_add_u32_e32 v0, 4, v0
+; GFX940-NEXT: v_add_u32_e32 v0, 0, v0
; GFX940-NEXT: scratch_load_dword v0, v0, off offset:124 sc0 sc1
; GFX940-NEXT: s_waitcnt vmcnt(0)
; GFX940-NEXT: s_endpgm
@@ -156,9 +156,9 @@ define amdgpu_kernel void @store_load_vindex_kernel() {
; GFX11-NEXT: v_lshlrev_b32_e32 v0, 2, v0
; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11-NEXT: v_dual_mov_b32 v2, 15 :: v_dual_lshlrev_b32 v1, 2, v1
-; GFX11-NEXT: scratch_store_b32 v0, v2, off offset:4 dlc
+; GFX11-NEXT: scratch_store_b32 v0, v2, off dlc
; GFX11-NEXT: s_waitcnt_vscnt null, 0x0
-; GFX11-NEXT: v_add_nc_u32_e32 v1, 4, v1
+; GFX11-NEXT: v_add_nc_u32_e32 v1, 0, v1
; GFX11-NEXT: scratch_load_b32 v0, v1, off offset:124 glc dlc
; GFX11-NEXT: s_waitcnt vmcnt(0)
; GFX11-NEXT: s_endpgm
@@ -169,9 +169,9 @@ define amdgpu_kernel void @store_load_vindex_kernel() {
; GFX12-NEXT: v_lshlrev_b32_e32 v0, 2, v0
; GFX12-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX12-NEXT: v_dual_mov_b32 v2, 15 :: v_dual_lshlrev_b32 v1, 2, v1
-; GFX12-NEXT: scratch_store_b32 v0, v2, off offset:4 scope:SCOPE_SYS
+; GFX12-NEXT: scratch_store_b32 v0, v2, off scope:SCOPE_SYS
; GFX12-NEXT: s_wait_storecnt 0x0
-; GFX12-NEXT: scratch_load_b32 v0, v1, off offset:128 scope:SCOPE_SYS
+; GFX12-NEXT: scratch_load_b32 v0, v1, off offset:124 scope:SCOPE_SYS
; GFX12-NEXT: s_wait_loadcnt 0x0
; GFX12-NEXT: s_endpgm
bb:
@@ -324,16 +324,16 @@ define amdgpu_kernel void @store_load_sindex_small_offset_kernel(i32 %idx) {
; GFX9-NEXT: s_add_u32 flat_scratch_lo, s2, s5
; GFX9-NEXT: s_addc_u32 flat_scratch_hi, s3, 0
; GFX9-NEXT: s_mov_b32 s1, 0
-; GFX9-NEXT: scratch_load_dword v0, off, s1 offset:4 glc
+; GFX9-NEXT: scratch_load_dword v0, off, s1 glc
; GFX9-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX9-NEXT: s_lshl_b32 s1, s0, 2
; GFX9-NEXT: s_and_b32 s0, s0, 15
; GFX9-NEXT: v_mov_b32_e32 v0, 15
-; GFX9-NEXT: s_addk_i32 s1, 0x104
+; GFX9-NEXT: s_addk_i32 s1, 0x100
; GFX9-NEXT: s_lshl_b32 s0, s0, 2
; GFX9-NEXT: scratch_store_dword off, v0, s1
; GFX9-NEXT: s_waitcnt vmcnt(0)
-; GFX9-NEXT: s_addk_i32 s0, 0x104
+; GFX9-NEXT: s_addk_i32 s0, 0x100
; GFX9-NEXT: scratch_load_dword v0, off, s0 glc
; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: s_endpgm
@@ -345,15 +345,15 @@ define amdgpu_kernel void @store_load_sindex_small_offset_kernel(i32 %idx) {
; GFX10-NEXT: s_setreg_b32 hwreg(HW_REG_FLAT_SCR_LO), s2
; GFX10-NEXT: s_setreg_b32 hwreg(HW_REG_FLAT_SCR_HI), s3
; GFX10-NEXT: s_load_dword s0, s[0:1], 0x24
-; GFX10-NEXT: scratch_load_dword v0, off, off offset:4 glc dlc
+; GFX10-NEXT: scratch_load_dword v0, off, off glc dlc
; GFX10-NEXT: s_waitcnt vmcnt(0)
; GFX10-NEXT: v_mov_b32_e32 v0, 15
; GFX10-NEXT: s_waitcnt lgkmcnt(0)
; GFX10-NEXT: s_and_b32 s1, s0, 15
; GFX10-NEXT: s_lshl_b32 s0, s0, 2
; GFX10-NEXT: s_lshl_b32 s1, s1, 2
-; GFX10-NEXT: s_addk_i32 s0, 0x104
-; GFX10-NEXT: s_addk_i32 s1, 0x104
+; GFX10-NEXT: s_addk_i32 s0, 0x100
+; GFX10-NEXT: s_addk_i32 s1, 0x100
; GFX10-NEXT: scratch_store_dword off, v0, s0
; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
; GFX10-NEXT: scratch_load_dword v0, off, s1 glc dlc
@@ -363,42 +363,42 @@ define amdgpu_kernel void @store_load_sindex_small_offset_kernel(i32 %idx) {
; GFX940-LABEL: store_load_sindex_small_offset_kernel:
; GFX940: ; %bb.0: ; %bb
; GFX940-NEXT: s_load_dword s0, s[0:1], 0x24
-; GFX940-NEXT: scratch_load_dword v0, off, off offset:4 sc0 sc1
+; GFX940-NEXT: scratch_load_dword v0, off, off sc0 sc1
; GFX940-NEXT: s_waitcnt vmcnt(0)
; GFX940-NEXT: v_mov_b32_e32 v0, 15
; GFX940-NEXT: s_waitcnt lgkmcnt(0)
; GFX940-NEXT: s_lshl_b32 s1, s0, 2
; GFX940-NEXT: s_and_b32 s0, s0, 15
-; GFX940-NEXT: s_addk_i32 s1, 0x104
+; GFX940-NEXT: s_addk_i32 s1, 0x100
; GFX940-NEXT: s_lshl_b32 s0, s0, 2
; GFX940-NEXT: scratch_store_dword off, v0, s1 sc0 sc1
; GFX940-NEXT: s_waitcnt vmcnt(0)
; GFX940-NEXT: v_mov_b32_e32 v0, s0
-; GFX940-NEXT: scratch_load_dword v0, v0, off offset:260 sc0 sc1
+; GFX940-NEXT: scratch_load_dword v0, v0, off offset:256 sc0 sc1
; GFX940-NEXT: s_waitcnt vmcnt(0)
; GFX940-NEXT: s_endpgm
;
; GFX11-LABEL: store_load_sindex_small_offset_kernel:
; GFX11: ; %bb.0: ; %bb
; GFX11-NEXT: s_load_b32 s0, s[0:1], 0x24
-; GFX11-NEXT: scratch_load_b32 v2, off, off offset:4 glc dlc
+; GFX11-NEXT: scratch_load_b32 v2, off, off glc dlc
; GFX11-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-NEXT: s_and_b32 s1, s0, 15
; GFX11-NEXT: s_lshl_b32 s0, s0, 2
; GFX11-NEXT: s_lshl_b32 s1, s1, 2
; GFX11-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX11-NEXT: v_dual_mov_b32 v0, 15 :: v_dual_mov_b32 v1, s1
-; GFX11-NEXT: s_addk_i32 s0, 0x104
+; GFX11-NEXT: s_addk_i32 s0, 0x100
; GFX11-NEXT: scratch_store_b32 off, v0, s0 dlc
; GFX11-NEXT: s_waitcnt_vscnt null, 0x0
-; GFX11-NEXT: scratch_load_b32 v0, v1, off offset:260 glc dlc
+; GFX11-NEXT: scratch_load_b32 v0, v1, off offset:256 glc dlc
; GFX11-NEXT: s_waitcnt vmcnt(0)
; GFX11-NEXT: s_endpgm
;
; GFX12-LABEL: store_load_sindex_small_offset_kernel:
; GFX12: ; %bb.0: ; %bb
; GFX12-NEXT: s_load_b32 s0, s[0:1], 0x24
-; GFX12-NEXT: scratch_load_b32 v3, off, off offset:4 scope:SCOPE_SYS
+; GFX12-NEXT: scratch_load_b32 v3, off, off scope:SCOPE_SYS
; GFX12-NEXT: s_wait_loadcnt 0x0
; GFX12-NEXT: v_mov_b32_e32 v1, 15
; GFX12-NEXT: s_wait_kmcnt 0x0
@@ -408,9 +408,9 @@ define amdgpu_kernel void @store_load_sindex_small_offset_kernel(i32 %idx) {
; GFX12-NEXT: s_lshl_b32 s0, s0, 2
; GFX12-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX12-NEXT: v_mov_b32_e32 v2, s0
-; GFX12-NEXT: scratch_store_b32 v0, v1, off offset:260 scope:SCOPE_SYS
+; GFX12-NEXT: scratch_store_b32 v0, v1, off offset:256 scope:SCOPE_SYS
; GFX12-NEXT: s_wait_storecnt 0x0
-; GFX12-NEXT: scratch_load_b32 v0, v2, off offset:260 scope:SCOPE_SYS
+; GFX12-NEXT: scratch_load_b32 v0, v2, off offset:256 scope:SCOPE_SYS
; GFX12-NEXT: s_wait_loadcnt 0x0
; GFX12-NEXT: s_endpgm
bb:
@@ -432,16 +432,16 @@ define amdgpu_kernel void @store_load_vindex_small_offset_kernel() {
; GFX9-NEXT: s_add_u32 flat_scratch_lo, s0, s3
; GFX9-NEXT: s_addc_u32 flat_scratch_hi, s1, 0
; GFX9-NEXT: s_mov_b32 s0, 0
-; GFX9-NEXT: scratch_load_dword v1, off, s0 offset:4 glc
+; GFX9-NEXT: s...
[truncated]
@@ -166,47 +166,6 @@ done:
 ret void
 }

; This ends up not fitting due to the reserved 4 bytes at offset 0
Since we don't have the reserved bytes anymore, I think we can skip this test (there's no special behavior compared to the other tests).
I would keep the test anyway to show the change in behavior
; GFX6-NEXT: buffer_store_dword v4, off, s[40:43], 0
; GFX6-NEXT: s_mov_b32 s34, 0x84800
I'm not sure why the scheduling changes here, but it seems pretty harmless.
This means that we need to fully handle access by incrementing SP and decrementing after. That was broken at some point. Do we still hit the PEI logic to keep the scavenging slot as close to the incoming SP as possible?
I'm not sure I follow, can you please clarify what you have in mind? Since the change only affects entry and chain functions, there's no incoming SP and we should be accessing the stack (scavenge slot included) directly with offsets, not via SP. If something were wrong with the way PEI is handling the SP, wouldn't that affect non-entry functions instead?
The problem this was solving is offsets that are larger than can be encoded in the immediate offset field of the buffer instructions. Fixing the slot at 0 ensures this cannot happen. If the offset is larger, you either need a free register to materialize the constant in, or you have to inc/dec a reserved frame register around the use.
Ok, thanks for clarifying, I'll get back to this in a week or 2.
allocateScavengingFrameIndexesNearIncomingSP is overridden, so that should take care of it. I'm also not really seeing any case where we actually make use of the emergency stack slot now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect we have some missing emergency scavenging slot test coverage, but in principle this should be OK
Ok, thanks. @mariusz-sikora-at-amd is investigating an issue that might be related to the scavenging slot, so hopefully that will help get some better test coverage.