@ro-i (Contributor) commented Sep 29, 2025

In case there is a dynamic alloca / an alloca which is not in the entry block, cs.chain functions do not set up an FP, but are reported to need one. This results in a failed assertion in `SIFrameLowering::emitPrologue()` (Assertion `(!HasFP || FPSaved) && "Needed to save FP but didn't save it anywhere"` failed.). This commit changes `hasFPImpl` so that the need for an SP in a cs.chain function does not directly imply the need for an FP anymore.

This LLVM defect was identified via the AMD Fuzzing project.


Re-opens #132711

@llvmbot (Member) commented Sep 29, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Robert Imschweiler (ro-i)

Changes



Full diff: https://github.com/llvm/llvm-project/pull/161194.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIFrameLowering.cpp (+3-1)
  • (added) llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-fp-nosave.ll (+360)
diff --git a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
index 7c5d4fc2dacf6..7c2ce2737f7be 100644
--- a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
@@ -2166,7 +2166,9 @@ bool SIFrameLowering::hasFPImpl(const MachineFunction &MF) const {
     return MFI.getStackSize() != 0;
   }
 
-  return frameTriviallyRequiresSP(MFI) || MFI.isFrameAddressTaken() ||
+  return (frameTriviallyRequiresSP(MFI) &&
+          !MF.getInfo<SIMachineFunctionInfo>()->isChainFunction()) ||
+         MFI.isFrameAddressTaken() ||
          MF.getSubtarget<GCNSubtarget>().getRegisterInfo()->hasStackRealignment(
              MF) ||
          mayReserveScratchForCWSR(MF) ||
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-fp-nosave.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-fp-nosave.ll
new file mode 100644
index 0000000000000..a2696fe160067
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-fp-nosave.ll
@@ -0,0 +1,360 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1200 -o - < %s 2>&1 | FileCheck %s
+
+; These situations are "special" in that they have an alloca not in the entry
+; block, which affects prolog/epilog generation.
+
+declare amdgpu_gfx void @foo()
+
+define amdgpu_cs_chain void @test_alloca() {
+; CHECK-LABEL: test_alloca:
+; CHECK:       ; %bb.0: ; %.entry
+; CHECK-NEXT:    s_wait_loadcnt_dscnt 0x0
+; CHECK-NEXT:    s_wait_expcnt 0x0
+; CHECK-NEXT:    s_wait_samplecnt 0x0
+; CHECK-NEXT:    s_wait_bvhcnt 0x0
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    v_mov_b32_e32 v0, 0
+; CHECK-NEXT:    s_mov_b32 s32, 16
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_mov_b32 s0, s32
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_i32 s32, s0, 0x200
+; CHECK-NEXT:    scratch_store_b32 off, v0, s0
+; CHECK-NEXT:    s_endpgm
+.entry:
+  br label %SW_C
+
+SW_C:                                             ; preds = %.entry
+  %v = alloca i32, i32 1, align 4, addrspace(5)
+  store i32 0, ptr addrspace(5) %v, align 4
+  ret void
+}
+
+define amdgpu_cs_chain void @test_alloca_var_uniform(i32 inreg %count) {
+; CHECK-LABEL: test_alloca_var_uniform:
+; CHECK:       ; %bb.0: ; %.entry
+; CHECK-NEXT:    s_wait_loadcnt_dscnt 0x0
+; CHECK-NEXT:    s_wait_expcnt 0x0
+; CHECK-NEXT:    s_wait_samplecnt 0x0
+; CHECK-NEXT:    s_wait_bvhcnt 0x0
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    s_lshl_b32 s0, s0, 2
+; CHECK-NEXT:    v_mov_b32_e32 v0, 0
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_i32 s0, s0, 15
+; CHECK-NEXT:    s_mov_b32 s32, 16
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_and_b32 s0, s0, -16
+; CHECK-NEXT:    s_mov_b32 s1, s32
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_lshl_b32 s0, s0, 5
+; CHECK-NEXT:    scratch_store_b32 off, v0, s1
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_i32 s32, s1, s0
+; CHECK-NEXT:    s_endpgm
+.entry:
+  br label %SW_C
+
+SW_C:                                             ; preds = %.entry
+  %v = alloca i32, i32 %count, align 4, addrspace(5)
+  store i32 0, ptr addrspace(5) %v, align 4
+  ret void
+}
+
+define amdgpu_cs_chain void @test_alloca_var(i32 %count) {
+; CHECK-LABEL: test_alloca_var:
+; CHECK:       ; %bb.0: ; %.entry
+; CHECK-NEXT:    s_wait_loadcnt_dscnt 0x0
+; CHECK-NEXT:    s_wait_expcnt 0x0
+; CHECK-NEXT:    s_wait_samplecnt 0x0
+; CHECK-NEXT:    s_wait_bvhcnt 0x0
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    v_lshl_add_u32 v0, v8, 2, 15
+; CHECK-NEXT:    s_mov_b32 s1, exec_lo
+; CHECK-NEXT:    s_mov_b32 s0, 0
+; CHECK-NEXT:    s_mov_b32 s32, 16
+; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; CHECK-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_and_b32 v1, -16, v0
+; CHECK-NEXT:  .LBB2_1: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_ctz_i32_b32 s2, s1
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; CHECK-NEXT:    v_readlane_b32 s3, v1, s2
+; CHECK-NEXT:    s_bitset0_b32 s1, s2
+; CHECK-NEXT:    s_max_u32 s0, s0, s3
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_cmp_lg_u32 s1, 0
+; CHECK-NEXT:    s_cbranch_scc1 .LBB2_1
+; CHECK-NEXT:  ; %bb.2:
+; CHECK-NEXT:    s_mov_b32 s1, s32
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    v_lshl_add_u32 v1, s0, 5, s1
+; CHECK-NEXT:    scratch_store_b32 off, v0, s1
+; CHECK-NEXT:    v_readfirstlane_b32 s32, v1
+; CHECK-NEXT:    s_endpgm
+.entry:
+  br label %SW_C
+
+SW_C:                                             ; preds = %.entry
+  %v = alloca i32, i32 %count, align 4, addrspace(5)
+  store i32 0, ptr addrspace(5) %v, align 4
+  ret void
+}
+
+define amdgpu_cs_chain void @test_alloca_and_call() {
+; CHECK-LABEL: test_alloca_and_call:
+; CHECK:       ; %bb.0: ; %.entry
+; CHECK-NEXT:    s_wait_loadcnt_dscnt 0x0
+; CHECK-NEXT:    s_wait_expcnt 0x0
+; CHECK-NEXT:    s_wait_samplecnt 0x0
+; CHECK-NEXT:    s_wait_bvhcnt 0x0
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    s_getpc_b64 s[0:1]
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_sext_i32_i16 s1, s1
+; CHECK-NEXT:    s_add_co_u32 s0, s0, foo@gotpcrel32@lo+12
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_ci_u32 s1, s1, foo@gotpcrel32@hi+24
+; CHECK-NEXT:    v_mov_b32_e32 v0, 0
+; CHECK-NEXT:    s_load_b64 s[0:1], s[0:1], 0x0
+; CHECK-NEXT:    s_mov_b32 s32, 16
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_mov_b32 s2, s32
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_i32 s32, s2, 0x200
+; CHECK-NEXT:    scratch_store_b32 off, v0, s2
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_swappc_b64 s[30:31], s[0:1]
+; CHECK-NEXT:    s_endpgm
+.entry:
+  br label %SW_C
+
+SW_C:                                             ; preds = %.entry
+  %v = alloca i32, i32 1, align 4, addrspace(5)
+  store i32 0, ptr addrspace(5) %v, align 4
+  call amdgpu_gfx void @foo()
+  ret void
+}
+
+define amdgpu_cs_chain void @test_alloca_and_call_var_uniform(i32 inreg %count) {
+; CHECK-LABEL: test_alloca_and_call_var_uniform:
+; CHECK:       ; %bb.0: ; %.entry
+; CHECK-NEXT:    s_wait_loadcnt_dscnt 0x0
+; CHECK-NEXT:    s_wait_expcnt 0x0
+; CHECK-NEXT:    s_wait_samplecnt 0x0
+; CHECK-NEXT:    s_wait_bvhcnt 0x0
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    s_getpc_b64 s[2:3]
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_sext_i32_i16 s3, s3
+; CHECK-NEXT:    s_add_co_u32 s2, s2, foo@gotpcrel32@lo+12
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_ci_u32 s3, s3, foo@gotpcrel32@hi+24
+; CHECK-NEXT:    s_lshl_b32 s0, s0, 2
+; CHECK-NEXT:    s_load_b64 s[2:3], s[2:3], 0x0
+; CHECK-NEXT:    s_add_co_i32 s0, s0, 15
+; CHECK-NEXT:    v_mov_b32_e32 v0, 0
+; CHECK-NEXT:    s_mov_b32 s32, 16
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_and_b32 s0, s0, -16
+; CHECK-NEXT:    s_mov_b32 s1, s32
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_lshl_b32 s0, s0, 5
+; CHECK-NEXT:    scratch_store_b32 off, v0, s1
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_i32 s32, s1, s0
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_swappc_b64 s[30:31], s[2:3]
+; CHECK-NEXT:    s_endpgm
+.entry:
+  br label %SW_C
+
+SW_C:                                             ; preds = %.entry
+  %v = alloca i32, i32 %count, align 4, addrspace(5)
+  store i32 0, ptr addrspace(5) %v, align 4
+  call amdgpu_gfx void @foo()
+  ret void
+}
+
+define amdgpu_cs_chain void @test_alloca_and_call_var(i32 %count) {
+; CHECK-LABEL: test_alloca_and_call_var:
+; CHECK:       ; %bb.0: ; %.entry
+; CHECK-NEXT:    s_wait_loadcnt_dscnt 0x0
+; CHECK-NEXT:    s_wait_expcnt 0x0
+; CHECK-NEXT:    s_wait_samplecnt 0x0
+; CHECK-NEXT:    s_wait_bvhcnt 0x0
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    v_lshl_add_u32 v0, v8, 2, 15
+; CHECK-NEXT:    s_mov_b32 s1, exec_lo
+; CHECK-NEXT:    s_mov_b32 s0, 0
+; CHECK-NEXT:    s_mov_b32 s32, 16
+; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; CHECK-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_and_b32 v1, -16, v0
+; CHECK-NEXT:  .LBB5_1: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_ctz_i32_b32 s2, s1
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; CHECK-NEXT:    v_readlane_b32 s3, v1, s2
+; CHECK-NEXT:    s_bitset0_b32 s1, s2
+; CHECK-NEXT:    s_max_u32 s0, s0, s3
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_cmp_lg_u32 s1, 0
+; CHECK-NEXT:    s_cbranch_scc1 .LBB5_1
+; CHECK-NEXT:  ; %bb.2:
+; CHECK-NEXT:    s_getpc_b64 s[2:3]
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_sext_i32_i16 s3, s3
+; CHECK-NEXT:    s_add_co_u32 s2, s2, foo@gotpcrel32@lo+12
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_ci_u32 s3, s3, foo@gotpcrel32@hi+24
+; CHECK-NEXT:    s_mov_b32 s1, s32
+; CHECK-NEXT:    s_load_b64 s[2:3], s[2:3], 0x0
+; CHECK-NEXT:    v_lshl_add_u32 v1, s0, 5, s1
+; CHECK-NEXT:    scratch_store_b32 off, v0, s1
+; CHECK-NEXT:    v_readfirstlane_b32 s32, v1
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    s_wait_alu 0xf1ff
+; CHECK-NEXT:    s_swappc_b64 s[30:31], s[2:3]
+; CHECK-NEXT:    s_endpgm
+.entry:
+  br label %SW_C
+
+SW_C:                                             ; preds = %.entry
+  %v = alloca i32, i32 %count, align 4, addrspace(5)
+  store i32 0, ptr addrspace(5) %v, align 4
+  call amdgpu_gfx void @foo()
+  ret void
+}
+
+define amdgpu_cs_chain void @test_call_and_alloca() {
+; CHECK-LABEL: test_call_and_alloca:
+; CHECK:       ; %bb.0: ; %.entry
+; CHECK-NEXT:    s_wait_loadcnt_dscnt 0x0
+; CHECK-NEXT:    s_wait_expcnt 0x0
+; CHECK-NEXT:    s_wait_samplecnt 0x0
+; CHECK-NEXT:    s_wait_bvhcnt 0x0
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    s_getpc_b64 s[0:1]
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_sext_i32_i16 s1, s1
+; CHECK-NEXT:    s_add_co_u32 s0, s0, foo@gotpcrel32@lo+12
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_ci_u32 s1, s1, foo@gotpcrel32@hi+24
+; CHECK-NEXT:    s_mov_b32 s32, 16
+; CHECK-NEXT:    s_load_b64 s[0:1], s[0:1], 0x0
+; CHECK-NEXT:    s_mov_b32 s4, s32
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_i32 s32, s4, 0x200
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_swappc_b64 s[30:31], s[0:1]
+; CHECK-NEXT:    v_mov_b32_e32 v0, 0
+; CHECK-NEXT:    scratch_store_b32 off, v0, s4
+; CHECK-NEXT:    s_endpgm
+.entry:
+  br label %SW_C
+
+SW_C:                                             ; preds = %.entry
+  %v = alloca i32, i32 1, align 4, addrspace(5)
+  call amdgpu_gfx void @foo()
+  store i32 0, ptr addrspace(5) %v, align 4
+  ret void
+}
+
+define amdgpu_cs_chain void @test_call_and_alloca_var_uniform(i32 inreg %count) {
+; CHECK-LABEL: test_call_and_alloca_var_uniform:
+; CHECK:       ; %bb.0: ; %.entry
+; CHECK-NEXT:    s_wait_loadcnt_dscnt 0x0
+; CHECK-NEXT:    s_wait_expcnt 0x0
+; CHECK-NEXT:    s_wait_samplecnt 0x0
+; CHECK-NEXT:    s_wait_bvhcnt 0x0
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    s_getpc_b64 s[2:3]
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_sext_i32_i16 s3, s3
+; CHECK-NEXT:    s_add_co_u32 s2, s2, foo@gotpcrel32@lo+12
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_ci_u32 s3, s3, foo@gotpcrel32@hi+24
+; CHECK-NEXT:    s_lshl_b32 s0, s0, 2
+; CHECK-NEXT:    s_load_b64 s[2:3], s[2:3], 0x0
+; CHECK-NEXT:    s_add_co_i32 s0, s0, 15
+; CHECK-NEXT:    s_mov_b32 s32, 16
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_and_b32 s0, s0, -16
+; CHECK-NEXT:    s_mov_b32 s4, s32
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_lshl_b32 s0, s0, 5
+; CHECK-NEXT:    v_mov_b32_e32 v40, 0
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_i32 s32, s4, s0
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_swappc_b64 s[30:31], s[2:3]
+; CHECK-NEXT:    scratch_store_b32 off, v40, s4
+; CHECK-NEXT:    s_endpgm
+.entry:
+  br label %SW_C
+
+SW_C:                                             ; preds = %.entry
+  %v = alloca i32, i32 %count, align 4, addrspace(5)
+  call amdgpu_gfx void @foo()
+  store i32 0, ptr addrspace(5) %v, align 4
+  ret void
+}
+
+define amdgpu_cs_chain void @test_call_and_alloca_var(i32 %count) {
+; CHECK-LABEL: test_call_and_alloca_var:
+; CHECK:       ; %bb.0: ; %.entry
+; CHECK-NEXT:    s_wait_loadcnt_dscnt 0x0
+; CHECK-NEXT:    s_wait_expcnt 0x0
+; CHECK-NEXT:    s_wait_samplecnt 0x0
+; CHECK-NEXT:    s_wait_bvhcnt 0x0
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    v_lshl_add_u32 v0, v8, 2, 15
+; CHECK-NEXT:    v_mov_b32_e32 v40, 0
+; CHECK-NEXT:    s_mov_b32 s1, exec_lo
+; CHECK-NEXT:    s_mov_b32 s0, 0
+; CHECK-NEXT:    s_mov_b32 s32, 16
+; CHECK-NEXT:    v_and_b32_e32 v0, -16, v0
+; CHECK-NEXT:  .LBB8_1: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_ctz_i32_b32 s2, s1
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; CHECK-NEXT:    v_readlane_b32 s3, v0, s2
+; CHECK-NEXT:    s_bitset0_b32 s1, s2
+; CHECK-NEXT:    s_max_u32 s0, s0, s3
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_cmp_lg_u32 s1, 0
+; CHECK-NEXT:    s_cbranch_scc1 .LBB8_1
+; CHECK-NEXT:  ; %bb.2:
+; CHECK-NEXT:    s_getpc_b64 s[2:3]
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_sext_i32_i16 s3, s3
+; CHECK-NEXT:    s_add_co_u32 s2, s2, foo@gotpcrel32@lo+12
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_ci_u32 s3, s3, foo@gotpcrel32@hi+24
+; CHECK-NEXT:    s_mov_b32 s4, s32
+; CHECK-NEXT:    s_load_b64 s[2:3], s[2:3], 0x0
+; CHECK-NEXT:    v_lshl_add_u32 v0, s0, 5, s4
+; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; CHECK-NEXT:    v_readfirstlane_b32 s32, v0
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    s_wait_alu 0xf1ff
+; CHECK-NEXT:    s_swappc_b64 s[30:31], s[2:3]
+; CHECK-NEXT:    scratch_store_b32 off, v40, s4
+; CHECK-NEXT:    s_endpgm
+.entry:
+  br label %SW_C
+
+SW_C:                                             ; preds = %.entry
+  %v = alloca i32, i32 %count, align 4, addrspace(5)
+  call amdgpu_gfx void @foo()
+  store i32 0, ptr addrspace(5) %v, align 4
+  ret void
+}

  br label %SW_C

SW_C:                                             ; preds = %.entry
  %v = alloca i32, i32 %count, align 4, addrspace(5)
Contributor:
Does the out-of-blockness actually matter? Will any dynamic alloca do? The static-sized allocas out of the entry block are treated the same.

@ro-i (Contributor, Author) commented Oct 13, 2025

Does the out-of-blockness actually matter?

Yes. Have a look at the definition of `AllocaInst::isStaticAlloca()`:

/// isStaticAlloca - Return true if this alloca is in the entry block of the
/// function and is a constant size. If so, the code generator will fold it
/// into the prolog/epilog code, so it is basically free.
bool AllocaInst::isStaticAlloca() const {
  // Must be constant size.
  if (!isa<ConstantInt>(getArraySize())) return false;
  // Must be in the entry block.
  const BasicBlock *Parent = getParent();
  return Parent->isEntryBlock() && !isUsedWithInAlloca();
}

tl;dr: if the alloca is not in the entry block, it is not static.
That makes `MFI.hasVarSizedObjects()` true, which makes `frameTriviallyRequiresSP` (in SIFrameLowering.cpp) true, which makes `SIFrameLowering::hasFPImpl` true, which makes `HasFP` in `SIFrameLowering::emitPrologue` true.
And then, this becomes an issue:

bool FPSaved = FuncInfo->hasPrologEpilogSGPRSpillEntry(FramePtrReg);
(void)FPSaved;
assert((!HasFP || FPSaved) &&
       "Needed to save FP but didn't save it anywhere");

@ro-i (Contributor, Author):

AFAIU, cs.chain functions are not supposed to return, btw. So, in general, there is no point in doing FP/SP saving at all.

Contributor:

I think the point is that you can keep the variable-sized allocas in the entry block.

@ro-i (Contributor, Author):

Ah, I misunderstood the comment, thanks, done

@rovka (Collaborator) left a comment:

LGTM, but wait a couple days in case @arsenm has anything to add.
