Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 10 additions & 9 deletions llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1408,11 +1408,12 @@ void SIFrameLowering::emitEpilogue(MachineFunction &MF,
LiveUnits.addReg(FramePtrRegScratchCopy);
}

emitCSRSpillRestores(MF, MBB, MBBI, DL, LiveUnits, FramePtrReg,
emitCSRSpillRestores(MF, MBB, MBBI, DL, LiveUnits,
FuncInfo->isChainFunction() ? Register() : FramePtrReg,
FramePtrRegScratchCopy);
}

if (FPSaved) {
if (FPSaved && !FuncInfo->isChainFunction()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this for? If you're in a chain function where the FP needs to be spilled, then you'll call emitCSRSpillRestores once in the branch above, and then again from the else on this branch.

To be fair this whole code looks unnecessarily complicated, can we just merge the 2 ifs? They originally had the same condition and it's probably good to keep it that way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they can be merged. Thank you!

// Insert the copy to restore FP.
Register SrcReg = SGPRForFPSaveRestoreCopy ? SGPRForFPSaveRestoreCopy
: FramePtrRegScratchCopy;
Expand Down Expand Up @@ -2170,13 +2171,13 @@ bool SIFrameLowering::hasFPImpl(const MachineFunction &MF) const {
return MFI.getStackSize() != 0;
}

return (frameTriviallyRequiresSP(MFI) &&
!MF.getInfo<SIMachineFunctionInfo>()->isChainFunction()) ||
MFI.isFrameAddressTaken() ||
MF.getSubtarget<GCNSubtarget>().getRegisterInfo()->hasStackRealignment(
MF) ||
mayReserveScratchForCWSR(MF) ||
MF.getTarget().Options.DisableFramePointerElim(MF);
return !MF.getInfo<SIMachineFunctionInfo>()->isChainFunction() &&
(frameTriviallyRequiresSP(MFI) || MFI.isFrameAddressTaken() ||
MF.getSubtarget<GCNSubtarget>()
.getRegisterInfo()
->hasStackRealignment(MF) ||
mayReserveScratchForCWSR(MF) ||
MF.getTarget().Options.DisableFramePointerElim(MF));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think chain functions should honour DisableFramePointerElim.

As far as I know, we don't have a test with the frameaddress intrinsic, we should probably add one (if it does the wrong thing, add a FIXME and I can have a look at it if you don't have the time).

I have also just realized that CWSR support is probably broken for chain functions. That has a bit of a bigger scope than your current patch, so I can look into that one too (unless you're very interested in learning about it).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, will add a test for frameaddress. I will take CWSR into consideration...

You may work on it if you'd like as well.

}

bool SIFrameLowering::mayReserveScratchForCWSR(
Expand Down
25 changes: 25 additions & 0 deletions llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-fp-nosave.ll
Original file line number Diff line number Diff line change
Expand Up @@ -517,3 +517,28 @@ define amdgpu_cs_chain void @test_call_and_alloca_var(i32 %count) {
store i32 0, ptr addrspace(5) %v, align 4
ret void
}

define amdgpu_cs_chain void @test_fp_all() #0 {
; GFX12-LABEL: test_fp_all:
; GFX12: ; %bb.0:
; GFX12-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-NEXT: s_wait_expcnt 0x0
; GFX12-NEXT: s_wait_samplecnt 0x0
; GFX12-NEXT: s_wait_bvhcnt 0x0
; GFX12-NEXT: s_wait_kmcnt 0x0
; GFX12-NEXT: v_mov_b32_e32 v0, 0
; GFX12-NEXT: scratch_store_b32 off, v0, off
; GFX12-NEXT: s_endpgm
;
; GFX942-LABEL: test_fp_all:
; GFX942: ; %bb.0:
; GFX942-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX942-NEXT: v_mov_b32_e32 v0, 0
; GFX942-NEXT: scratch_store_dword off, v0, off
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it the frame pointer was just ignored? If we require a frame pointer it still needs to be set up

; GFX942-NEXT: s_endpgm
%v = alloca i32, align 4, addrspace(5)
store i32 0, ptr addrspace(5) %v, align 4
ret void
}

attributes #0 = { "frame-pointer"="all" }
69 changes: 69 additions & 0 deletions llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-frame-pointer.ll
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc -global-isel=1 -mtriple=amdgcn -mcpu=gfx942 -O0 -verify-machineinstrs < %s | FileCheck %s

define amdgpu_cs_chain void @recurse(ptr %callee) {
; CHECK-LABEL: recurse:
; CHECK: ; %bb.0:
; CHECK-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; CHECK-NEXT: s_mov_b32 s32, 16
; CHECK-NEXT: s_xor_saveexec_b64 s[0:1], -1
; CHECK-NEXT: scratch_store_dword off, v40, off ; 4-byte Folded Spill
; CHECK-NEXT: scratch_store_dword off, v41, off offset:4 ; 4-byte Folded Spill
; CHECK-NEXT: s_mov_b64 exec, s[0:1]
; CHECK-NEXT: ; implicit-def: $sgpr6_sgpr7
; CHECK-NEXT: ; implicit-def: $sgpr10_sgpr11
; CHECK-NEXT: ; implicit-def: $sgpr0
; CHECK-NEXT: s_mov_b32 s1, 0
; CHECK-NEXT: ; implicit-def: $vgpr40 : SGPR spill to VGPR lane
; CHECK-NEXT: v_writelane_b32 v40, s1, 0
; CHECK-NEXT: v_mov_b32_e32 v0, s1
; CHECK-NEXT: v_mov_b32_e32 v1, s1
; CHECK-NEXT: s_mov_b64 s[4:5], s[6:7]
; CHECK-NEXT: s_mov_b64 s[8:9], 36
; CHECK-NEXT: s_mov_b32 s12, s0
; CHECK-NEXT: s_mov_b32 s13, s0
; CHECK-NEXT: s_mov_b32 s14, s0
; CHECK-NEXT: s_mov_b32 s15, s0
; CHECK-NEXT: v_mov_b32_e32 v31, s0
; CHECK-NEXT: s_getpc_b64 s[0:1]
; CHECK-NEXT: s_add_u32 s0, s0, recurse@gotpcrel32@lo+4
; CHECK-NEXT: s_addc_u32 s1, s1, recurse@gotpcrel32@hi+12
; CHECK-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x0
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
; CHECK-NEXT: s_swappc_b64 s[30:31], s[0:1]
; CHECK-NEXT: v_readlane_b32 s3, v40, 0
; CHECK-NEXT: s_nop 1
; CHECK-NEXT: v_mov_b32_e32 v0, s3
; CHECK-NEXT: s_nop 0
; CHECK-NEXT: v_readfirstlane_b32 s0, v0
; CHECK-NEXT: v_mov_b32_e32 v0, s3
; CHECK-NEXT: s_nop 0
; CHECK-NEXT: v_readfirstlane_b32 s1, v0
; CHECK-NEXT: v_mov_b32_e32 v0, s3
; CHECK-NEXT: s_nop 0
; CHECK-NEXT: v_readfirstlane_b32 s2, v0
; CHECK-NEXT: v_mov_b32_e32 v8, s3
; CHECK-NEXT: s_mov_b32 s4, 0
; CHECK-NEXT: v_mov_b32_e32 v9, s4
; CHECK-NEXT: v_mov_b32_e32 v10, s3
; CHECK-NEXT: v_mov_b32_e32 v11, s3
; CHECK-NEXT: s_mov_b64 s[4:5], 0
; CHECK-NEXT: v_readlane_b32 s3, v41, 0
; CHECK-NEXT: s_xor_saveexec_b64 s[6:7], -1
; CHECK-NEXT: scratch_load_dword v40, off, off ; 4-byte Folded Reload
; CHECK-NEXT: scratch_load_dword v41, off, off offset:4 ; 4-byte Folded Reload
; CHECK-NEXT: s_mov_b64 exec, s[6:7]
; CHECK-NEXT: s_waitcnt vmcnt(0)
; CHECK-NEXT: v_readlane_b32 s3, v41, 0
; CHECK-NEXT: s_xor_saveexec_b64 s[8:9], -1
; CHECK-NEXT: scratch_load_dword v40, off, off ; 4-byte Folded Reload
; CHECK-NEXT: scratch_load_dword v41, off, off offset:4 ; 4-byte Folded Reload
; CHECK-NEXT: s_mov_b64 exec, s[8:9]
; CHECK-NEXT: s_mov_b64 exec, 0
; CHECK-NEXT: s_setpc_b64 s[4:5]
call void @recurse(ptr null)
call void (ptr, i64, <3 x i32>, { i32, ptr addrspace(5), i32, i32 }, i32, ...) @llvm.amdgcn.cs.chain.p0.i64.v3i32.sl_i32p5i32i32s(ptr null, i64 0, <3 x i32> inreg zeroinitializer, { i32, ptr addrspace(5), i32, i32 } zeroinitializer, i32 0)
unreachable
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some other straightforward examples that need an FP, like dynamic alloca and/or "frame-pointer"="all"? Do those fail without -O0?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have tests for some of these cases in https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-fp-nosave.ll

It would be nice to consolidate all FP-related tests in a single file, so they're easy to find.

Copy link
Contributor Author

@jofrn jofrn Nov 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The straightforward fp all example failed with -O0, so I added it to amdgpu-cs-chain-fp-nosave.ll. The recursive example is still in its original file, and the chain intrinsic call prevents compilation of it to gfx1200, so because fp-nosave.ll has gfx1200 within one of the RUN lines, then we may want to keep it in a different file for now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should definitely work on GFX12 and -O0! If you feel like the issues are not related to your patch, let me know and I can look into fixing them. Thanks :)

Copy link
Contributor Author

@jofrn jofrn Nov 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a typo. I meant to say the recurse test only fails with -O0 and -global-isel=1. The fp all test can be moved because it can fail without -O0; it can go back to the new file too since it fails either way. Sorry about that.

You may try as well. Thank you for the help so far.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dynamic alloca would be easier to see

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are dynamic allocas in the other test file.

declare void @llvm.amdgcn.cs.chain.p0.i64.v3i32.sl_i32p5i32i32s(ptr, i64, <3 x i32>, { i32, ptr addrspace(5), i32, i32 }, i32 immarg, ...)