@ro-i (Contributor) commented Sep 29, 2025

In case there is a dynamic alloca / an alloca which is not in the entry block, cs.chain functions do not set up an FP, but are reported to need one. This results in a failed assertion in `SIFrameLowering::emitPrologue()` (Assertion `(!HasFP || FPSaved) && "Needed to save FP but didn't save it anywhere"` failed.). This commit changes `hasFPImpl` so that the need for an SP in a cs.chain function does not directly imply the need for an FP anymore.

This LLVM defect was identified via the AMD Fuzzing project.


Re-opens #132711

@llvmbot (Member) commented Sep 29, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Robert Imschweiler (ro-i)

Changes



Full diff: https://github.com/llvm/llvm-project/pull/161194.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIFrameLowering.cpp (+3-1)
  • (added) llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-fp-nosave.ll (+360)
diff --git a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
index 7c5d4fc2dacf6..7c2ce2737f7be 100644
--- a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
@@ -2166,7 +2166,9 @@ bool SIFrameLowering::hasFPImpl(const MachineFunction &MF) const {
     return MFI.getStackSize() != 0;
   }
 
-  return frameTriviallyRequiresSP(MFI) || MFI.isFrameAddressTaken() ||
+  return (frameTriviallyRequiresSP(MFI) &&
+          !MF.getInfo<SIMachineFunctionInfo>()->isChainFunction()) ||
+         MFI.isFrameAddressTaken() ||
          MF.getSubtarget<GCNSubtarget>().getRegisterInfo()->hasStackRealignment(
              MF) ||
          mayReserveScratchForCWSR(MF) ||
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-fp-nosave.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-fp-nosave.ll
new file mode 100644
index 0000000000000..a2696fe160067
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-fp-nosave.ll
@@ -0,0 +1,360 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1200 -o - < %s 2>&1 | FileCheck %s
+
+; These situations are "special" in that they have an alloca not in the entry
+; block, which affects prolog/epilog generation.
+
+declare amdgpu_gfx void @foo()
+
+define amdgpu_cs_chain void @test_alloca() {
+; CHECK-LABEL: test_alloca:
+; CHECK:       ; %bb.0: ; %.entry
+; CHECK-NEXT:    s_wait_loadcnt_dscnt 0x0
+; CHECK-NEXT:    s_wait_expcnt 0x0
+; CHECK-NEXT:    s_wait_samplecnt 0x0
+; CHECK-NEXT:    s_wait_bvhcnt 0x0
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    v_mov_b32_e32 v0, 0
+; CHECK-NEXT:    s_mov_b32 s32, 16
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_mov_b32 s0, s32
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_i32 s32, s0, 0x200
+; CHECK-NEXT:    scratch_store_b32 off, v0, s0
+; CHECK-NEXT:    s_endpgm
+.entry:
+  br label %SW_C
+
+SW_C:                                             ; preds = %.entry
+  %v = alloca i32, i32 1, align 4, addrspace(5)
+  store i32 0, ptr addrspace(5) %v, align 4
+  ret void
+}
+
+define amdgpu_cs_chain void @test_alloca_var_uniform(i32 inreg %count) {
+; CHECK-LABEL: test_alloca_var_uniform:
+; CHECK:       ; %bb.0: ; %.entry
+; CHECK-NEXT:    s_wait_loadcnt_dscnt 0x0
+; CHECK-NEXT:    s_wait_expcnt 0x0
+; CHECK-NEXT:    s_wait_samplecnt 0x0
+; CHECK-NEXT:    s_wait_bvhcnt 0x0
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    s_lshl_b32 s0, s0, 2
+; CHECK-NEXT:    v_mov_b32_e32 v0, 0
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_i32 s0, s0, 15
+; CHECK-NEXT:    s_mov_b32 s32, 16
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_and_b32 s0, s0, -16
+; CHECK-NEXT:    s_mov_b32 s1, s32
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_lshl_b32 s0, s0, 5
+; CHECK-NEXT:    scratch_store_b32 off, v0, s1
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_i32 s32, s1, s0
+; CHECK-NEXT:    s_endpgm
+.entry:
+  br label %SW_C
+
+SW_C:                                             ; preds = %.entry
+  %v = alloca i32, i32 %count, align 4, addrspace(5)
+  store i32 0, ptr addrspace(5) %v, align 4
+  ret void
+}
+
+define amdgpu_cs_chain void @test_alloca_var(i32 %count) {
+; CHECK-LABEL: test_alloca_var:
+; CHECK:       ; %bb.0: ; %.entry
+; CHECK-NEXT:    s_wait_loadcnt_dscnt 0x0
+; CHECK-NEXT:    s_wait_expcnt 0x0
+; CHECK-NEXT:    s_wait_samplecnt 0x0
+; CHECK-NEXT:    s_wait_bvhcnt 0x0
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    v_lshl_add_u32 v0, v8, 2, 15
+; CHECK-NEXT:    s_mov_b32 s1, exec_lo
+; CHECK-NEXT:    s_mov_b32 s0, 0
+; CHECK-NEXT:    s_mov_b32 s32, 16
+; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; CHECK-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_and_b32 v1, -16, v0
+; CHECK-NEXT:  .LBB2_1: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_ctz_i32_b32 s2, s1
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; CHECK-NEXT:    v_readlane_b32 s3, v1, s2
+; CHECK-NEXT:    s_bitset0_b32 s1, s2
+; CHECK-NEXT:    s_max_u32 s0, s0, s3
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_cmp_lg_u32 s1, 0
+; CHECK-NEXT:    s_cbranch_scc1 .LBB2_1
+; CHECK-NEXT:  ; %bb.2:
+; CHECK-NEXT:    s_mov_b32 s1, s32
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    v_lshl_add_u32 v1, s0, 5, s1
+; CHECK-NEXT:    scratch_store_b32 off, v0, s1
+; CHECK-NEXT:    v_readfirstlane_b32 s32, v1
+; CHECK-NEXT:    s_endpgm
+.entry:
+  br label %SW_C
+
+SW_C:                                             ; preds = %.entry
+  %v = alloca i32, i32 %count, align 4, addrspace(5)
+  store i32 0, ptr addrspace(5) %v, align 4
+  ret void
+}
+
+define amdgpu_cs_chain void @test_alloca_and_call() {
+; CHECK-LABEL: test_alloca_and_call:
+; CHECK:       ; %bb.0: ; %.entry
+; CHECK-NEXT:    s_wait_loadcnt_dscnt 0x0
+; CHECK-NEXT:    s_wait_expcnt 0x0
+; CHECK-NEXT:    s_wait_samplecnt 0x0
+; CHECK-NEXT:    s_wait_bvhcnt 0x0
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    s_getpc_b64 s[0:1]
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_sext_i32_i16 s1, s1
+; CHECK-NEXT:    s_add_co_u32 s0, s0, foo@gotpcrel32@lo+12
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_ci_u32 s1, s1, foo@gotpcrel32@hi+24
+; CHECK-NEXT:    v_mov_b32_e32 v0, 0
+; CHECK-NEXT:    s_load_b64 s[0:1], s[0:1], 0x0
+; CHECK-NEXT:    s_mov_b32 s32, 16
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_mov_b32 s2, s32
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_i32 s32, s2, 0x200
+; CHECK-NEXT:    scratch_store_b32 off, v0, s2
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_swappc_b64 s[30:31], s[0:1]
+; CHECK-NEXT:    s_endpgm
+.entry:
+  br label %SW_C
+
+SW_C:                                             ; preds = %.entry
+  %v = alloca i32, i32 1, align 4, addrspace(5)
+  store i32 0, ptr addrspace(5) %v, align 4
+  call amdgpu_gfx void @foo()
+  ret void
+}
+
+define amdgpu_cs_chain void @test_alloca_and_call_var_uniform(i32 inreg %count) {
+; CHECK-LABEL: test_alloca_and_call_var_uniform:
+; CHECK:       ; %bb.0: ; %.entry
+; CHECK-NEXT:    s_wait_loadcnt_dscnt 0x0
+; CHECK-NEXT:    s_wait_expcnt 0x0
+; CHECK-NEXT:    s_wait_samplecnt 0x0
+; CHECK-NEXT:    s_wait_bvhcnt 0x0
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    s_getpc_b64 s[2:3]
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_sext_i32_i16 s3, s3
+; CHECK-NEXT:    s_add_co_u32 s2, s2, foo@gotpcrel32@lo+12
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_ci_u32 s3, s3, foo@gotpcrel32@hi+24
+; CHECK-NEXT:    s_lshl_b32 s0, s0, 2
+; CHECK-NEXT:    s_load_b64 s[2:3], s[2:3], 0x0
+; CHECK-NEXT:    s_add_co_i32 s0, s0, 15
+; CHECK-NEXT:    v_mov_b32_e32 v0, 0
+; CHECK-NEXT:    s_mov_b32 s32, 16
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_and_b32 s0, s0, -16
+; CHECK-NEXT:    s_mov_b32 s1, s32
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_lshl_b32 s0, s0, 5
+; CHECK-NEXT:    scratch_store_b32 off, v0, s1
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_i32 s32, s1, s0
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_swappc_b64 s[30:31], s[2:3]
+; CHECK-NEXT:    s_endpgm
+.entry:
+  br label %SW_C
+
+SW_C:                                             ; preds = %.entry
+  %v = alloca i32, i32 %count, align 4, addrspace(5)
+  store i32 0, ptr addrspace(5) %v, align 4
+  call amdgpu_gfx void @foo()
+  ret void
+}
+
+define amdgpu_cs_chain void @test_alloca_and_call_var(i32 %count) {
+; CHECK-LABEL: test_alloca_and_call_var:
+; CHECK:       ; %bb.0: ; %.entry
+; CHECK-NEXT:    s_wait_loadcnt_dscnt 0x0
+; CHECK-NEXT:    s_wait_expcnt 0x0
+; CHECK-NEXT:    s_wait_samplecnt 0x0
+; CHECK-NEXT:    s_wait_bvhcnt 0x0
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    v_lshl_add_u32 v0, v8, 2, 15
+; CHECK-NEXT:    s_mov_b32 s1, exec_lo
+; CHECK-NEXT:    s_mov_b32 s0, 0
+; CHECK-NEXT:    s_mov_b32 s32, 16
+; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; CHECK-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_and_b32 v1, -16, v0
+; CHECK-NEXT:  .LBB5_1: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_ctz_i32_b32 s2, s1
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; CHECK-NEXT:    v_readlane_b32 s3, v1, s2
+; CHECK-NEXT:    s_bitset0_b32 s1, s2
+; CHECK-NEXT:    s_max_u32 s0, s0, s3
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_cmp_lg_u32 s1, 0
+; CHECK-NEXT:    s_cbranch_scc1 .LBB5_1
+; CHECK-NEXT:  ; %bb.2:
+; CHECK-NEXT:    s_getpc_b64 s[2:3]
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_sext_i32_i16 s3, s3
+; CHECK-NEXT:    s_add_co_u32 s2, s2, foo@gotpcrel32@lo+12
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_ci_u32 s3, s3, foo@gotpcrel32@hi+24
+; CHECK-NEXT:    s_mov_b32 s1, s32
+; CHECK-NEXT:    s_load_b64 s[2:3], s[2:3], 0x0
+; CHECK-NEXT:    v_lshl_add_u32 v1, s0, 5, s1
+; CHECK-NEXT:    scratch_store_b32 off, v0, s1
+; CHECK-NEXT:    v_readfirstlane_b32 s32, v1
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    s_wait_alu 0xf1ff
+; CHECK-NEXT:    s_swappc_b64 s[30:31], s[2:3]
+; CHECK-NEXT:    s_endpgm
+.entry:
+  br label %SW_C
+
+SW_C:                                             ; preds = %.entry
+  %v = alloca i32, i32 %count, align 4, addrspace(5)
+  store i32 0, ptr addrspace(5) %v, align 4
+  call amdgpu_gfx void @foo()
+  ret void
+}
+
+define amdgpu_cs_chain void @test_call_and_alloca() {
+; CHECK-LABEL: test_call_and_alloca:
+; CHECK:       ; %bb.0: ; %.entry
+; CHECK-NEXT:    s_wait_loadcnt_dscnt 0x0
+; CHECK-NEXT:    s_wait_expcnt 0x0
+; CHECK-NEXT:    s_wait_samplecnt 0x0
+; CHECK-NEXT:    s_wait_bvhcnt 0x0
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    s_getpc_b64 s[0:1]
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_sext_i32_i16 s1, s1
+; CHECK-NEXT:    s_add_co_u32 s0, s0, foo@gotpcrel32@lo+12
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_ci_u32 s1, s1, foo@gotpcrel32@hi+24
+; CHECK-NEXT:    s_mov_b32 s32, 16
+; CHECK-NEXT:    s_load_b64 s[0:1], s[0:1], 0x0
+; CHECK-NEXT:    s_mov_b32 s4, s32
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_i32 s32, s4, 0x200
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_swappc_b64 s[30:31], s[0:1]
+; CHECK-NEXT:    v_mov_b32_e32 v0, 0
+; CHECK-NEXT:    scratch_store_b32 off, v0, s4
+; CHECK-NEXT:    s_endpgm
+.entry:
+  br label %SW_C
+
+SW_C:                                             ; preds = %.entry
+  %v = alloca i32, i32 1, align 4, addrspace(5)
+  call amdgpu_gfx void @foo()
+  store i32 0, ptr addrspace(5) %v, align 4
+  ret void
+}
+
+define amdgpu_cs_chain void @test_call_and_alloca_var_uniform(i32 inreg %count) {
+; CHECK-LABEL: test_call_and_alloca_var_uniform:
+; CHECK:       ; %bb.0: ; %.entry
+; CHECK-NEXT:    s_wait_loadcnt_dscnt 0x0
+; CHECK-NEXT:    s_wait_expcnt 0x0
+; CHECK-NEXT:    s_wait_samplecnt 0x0
+; CHECK-NEXT:    s_wait_bvhcnt 0x0
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    s_getpc_b64 s[2:3]
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_sext_i32_i16 s3, s3
+; CHECK-NEXT:    s_add_co_u32 s2, s2, foo@gotpcrel32@lo+12
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_ci_u32 s3, s3, foo@gotpcrel32@hi+24
+; CHECK-NEXT:    s_lshl_b32 s0, s0, 2
+; CHECK-NEXT:    s_load_b64 s[2:3], s[2:3], 0x0
+; CHECK-NEXT:    s_add_co_i32 s0, s0, 15
+; CHECK-NEXT:    s_mov_b32 s32, 16
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_and_b32 s0, s0, -16
+; CHECK-NEXT:    s_mov_b32 s4, s32
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_lshl_b32 s0, s0, 5
+; CHECK-NEXT:    v_mov_b32_e32 v40, 0
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_i32 s32, s4, s0
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_swappc_b64 s[30:31], s[2:3]
+; CHECK-NEXT:    scratch_store_b32 off, v40, s4
+; CHECK-NEXT:    s_endpgm
+.entry:
+  br label %SW_C
+
+SW_C:                                             ; preds = %.entry
+  %v = alloca i32, i32 %count, align 4, addrspace(5)
+  call amdgpu_gfx void @foo()
+  store i32 0, ptr addrspace(5) %v, align 4
+  ret void
+}
+
+define amdgpu_cs_chain void @test_call_and_alloca_var(i32 %count) {
+; CHECK-LABEL: test_call_and_alloca_var:
+; CHECK:       ; %bb.0: ; %.entry
+; CHECK-NEXT:    s_wait_loadcnt_dscnt 0x0
+; CHECK-NEXT:    s_wait_expcnt 0x0
+; CHECK-NEXT:    s_wait_samplecnt 0x0
+; CHECK-NEXT:    s_wait_bvhcnt 0x0
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    v_lshl_add_u32 v0, v8, 2, 15
+; CHECK-NEXT:    v_mov_b32_e32 v40, 0
+; CHECK-NEXT:    s_mov_b32 s1, exec_lo
+; CHECK-NEXT:    s_mov_b32 s0, 0
+; CHECK-NEXT:    s_mov_b32 s32, 16
+; CHECK-NEXT:    v_and_b32_e32 v0, -16, v0
+; CHECK-NEXT:  .LBB8_1: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_ctz_i32_b32 s2, s1
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; CHECK-NEXT:    v_readlane_b32 s3, v0, s2
+; CHECK-NEXT:    s_bitset0_b32 s1, s2
+; CHECK-NEXT:    s_max_u32 s0, s0, s3
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_cmp_lg_u32 s1, 0
+; CHECK-NEXT:    s_cbranch_scc1 .LBB8_1
+; CHECK-NEXT:  ; %bb.2:
+; CHECK-NEXT:    s_getpc_b64 s[2:3]
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_sext_i32_i16 s3, s3
+; CHECK-NEXT:    s_add_co_u32 s2, s2, foo@gotpcrel32@lo+12
+; CHECK-NEXT:    s_wait_alu 0xfffe
+; CHECK-NEXT:    s_add_co_ci_u32 s3, s3, foo@gotpcrel32@hi+24
+; CHECK-NEXT:    s_mov_b32 s4, s32
+; CHECK-NEXT:    s_load_b64 s[2:3], s[2:3], 0x0
+; CHECK-NEXT:    v_lshl_add_u32 v0, s0, 5, s4
+; CHECK-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; CHECK-NEXT:    v_readfirstlane_b32 s32, v0
+; CHECK-NEXT:    s_wait_kmcnt 0x0
+; CHECK-NEXT:    s_wait_alu 0xf1ff
+; CHECK-NEXT:    s_swappc_b64 s[30:31], s[2:3]
+; CHECK-NEXT:    scratch_store_b32 off, v40, s4
+; CHECK-NEXT:    s_endpgm
+.entry:
+  br label %SW_C
+
+SW_C:                                             ; preds = %.entry
+  %v = alloca i32, i32 %count, align 4, addrspace(5)
+  call amdgpu_gfx void @foo()
+  store i32 0, ptr addrspace(5) %v, align 4
+  ret void
+}

  br label %SW_C

SW_C:                                             ; preds = %.entry
  %v = alloca i32, i32 %count, align 4, addrspace(5)
Contributor:
Does the out-of-blockness actually matter? Will any dynamic alloca do? The static-sized allocas out of the entry block are treated the same.

@ro-i (Contributor, Author) commented Oct 13, 2025

Does the out-of-blockness actually matter?

Yes. Have a look at the definition of `AllocaInst::isStaticAlloca()`:

/// isStaticAlloca - Return true if this alloca is in the entry block of the
/// function and is a constant size. If so, the code generator will fold it
/// into the prolog/epilog code, so it is basically free.
bool AllocaInst::isStaticAlloca() const {
  // Must be constant size.
  if (!isa<ConstantInt>(getArraySize())) return false;
  // Must be in the entry block.
  const BasicBlock *Parent = getParent();
  return Parent->isEntryBlock() && !isUsedWithInAlloca();
}

tl;dr: if the alloca is not in the entry block, it is not static.
That makes `MFI.hasVarSizedObjects()` true, which makes `frameTriviallyRequiresSP` (in SIFrameLowering.cpp) true, which makes `SIFrameLowering::hasFPImpl` true, which makes `HasFP` in `SIFrameLowering::emitPrologue` true.
And then, this becomes an issue:

bool FPSaved = FuncInfo->hasPrologEpilogSGPRSpillEntry(FramePtrReg);
(void)FPSaved;
assert((!HasFP || FPSaved) &&
       "Needed to save FP but didn't save it anywhere");

@ro-i (Contributor, Author):

AFAIU, cs.chain functions are not supposed to return, btw. So, in general, there is no point in doing FP/SP saving at all.

Contributor:

I think the point is that you can keep the variable-sized allocas in the entry block.

@ro-i (Contributor, Author):

Ah, I misunderstood the comment, thanks, done

@rovka (Collaborator) left a comment:

LGTM, but wait a couple days in case @arsenm has anything to add.
