-
Notifications
You must be signed in to change notification settings - Fork 15.3k
[AMDGPU] Initialize FrameOffsetReg for amdgpu_cs_chain functions #165518
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
4c9ae7c
d4eecd0
820b9c3
506b411
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1408,11 +1408,12 @@ void SIFrameLowering::emitEpilogue(MachineFunction &MF, | |
| LiveUnits.addReg(FramePtrRegScratchCopy); | ||
| } | ||
|
|
||
| emitCSRSpillRestores(MF, MBB, MBBI, DL, LiveUnits, FramePtrReg, | ||
| emitCSRSpillRestores(MF, MBB, MBBI, DL, LiveUnits, | ||
| FuncInfo->isChainFunction() ? Register() : FramePtrReg, | ||
| FramePtrRegScratchCopy); | ||
| } | ||
|
|
||
| if (FPSaved) { | ||
| if (FPSaved && !FuncInfo->isChainFunction()) { | ||
| // Insert the copy to restore FP. | ||
| Register SrcReg = SGPRForFPSaveRestoreCopy ? SGPRForFPSaveRestoreCopy | ||
| : FramePtrRegScratchCopy; | ||
|
|
@@ -2170,13 +2171,13 @@ bool SIFrameLowering::hasFPImpl(const MachineFunction &MF) const { | |
| return MFI.getStackSize() != 0; | ||
| } | ||
|
|
||
| return (frameTriviallyRequiresSP(MFI) && | ||
| !MF.getInfo<SIMachineFunctionInfo>()->isChainFunction()) || | ||
| MFI.isFrameAddressTaken() || | ||
| MF.getSubtarget<GCNSubtarget>().getRegisterInfo()->hasStackRealignment( | ||
| MF) || | ||
| mayReserveScratchForCWSR(MF) || | ||
| MF.getTarget().Options.DisableFramePointerElim(MF); | ||
| return !MF.getInfo<SIMachineFunctionInfo>()->isChainFunction() && | ||
| (frameTriviallyRequiresSP(MFI) || MFI.isFrameAddressTaken() || | ||
| MF.getSubtarget<GCNSubtarget>() | ||
| .getRegisterInfo() | ||
| ->hasStackRealignment(MF) || | ||
| mayReserveScratchForCWSR(MF) || | ||
| MF.getTarget().Options.DisableFramePointerElim(MF)); | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think chain functions should honour DisableFramePointerElim. As far as I know, we don't have a test with the frameaddress intrinsic, we should probably add one (if it does the wrong thing, add a FIXME and I can have a look at it if you don't have the time). I have also just realized that CWSR support is probably broken for chain functions. That has a bit of a bigger scope than your current patch, so I can look into that one too (unless you're very interested in learning about it).
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Alright, will add a test for frameaddress. I will take CWSR into consideration... You may work on it if you'd like as well. |
||
| } | ||
|
|
||
| bool SIFrameLowering::mayReserveScratchForCWSR( | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -517,3 +517,28 @@ define amdgpu_cs_chain void @test_call_and_alloca_var(i32 %count) { | |
| store i32 0, ptr addrspace(5) %v, align 4 | ||
| ret void | ||
| } | ||
|
|
||
| define amdgpu_cs_chain void @test_fp_all() #0 { | ||
| ; GFX12-LABEL: test_fp_all: | ||
| ; GFX12: ; %bb.0: | ||
| ; GFX12-NEXT: s_wait_loadcnt_dscnt 0x0 | ||
| ; GFX12-NEXT: s_wait_expcnt 0x0 | ||
| ; GFX12-NEXT: s_wait_samplecnt 0x0 | ||
| ; GFX12-NEXT: s_wait_bvhcnt 0x0 | ||
| ; GFX12-NEXT: s_wait_kmcnt 0x0 | ||
| ; GFX12-NEXT: v_mov_b32_e32 v0, 0 | ||
| ; GFX12-NEXT: scratch_store_b32 off, v0, off | ||
| ; GFX12-NEXT: s_endpgm | ||
| ; | ||
| ; GFX942-LABEL: test_fp_all: | ||
| ; GFX942: ; %bb.0: | ||
| ; GFX942-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) | ||
| ; GFX942-NEXT: v_mov_b32_e32 v0, 0 | ||
| ; GFX942-NEXT: scratch_store_dword off, v0, off | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This looks like it the frame pointer was just ignored? If we require a frame pointer it still needs to be set up |
||
| ; GFX942-NEXT: s_endpgm | ||
| %v = alloca i32, align 4, addrspace(5) | ||
| store i32 0, ptr addrspace(5) %v, align 4 | ||
| ret void | ||
| } | ||
|
|
||
| attributes #0 = { "frame-pointer"="all" } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,69 @@ | ||
| ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5 | ||
| ; RUN: llc -global-isel=1 -mtriple=amdgcn -mcpu=gfx942 -O0 -verify-machineinstrs < %s | FileCheck %s | ||
|
|
||
| define amdgpu_cs_chain void @recurse(ptr %callee) { | ||
| ; CHECK-LABEL: recurse: | ||
| ; CHECK: ; %bb.0: | ||
| ; CHECK-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) | ||
| ; CHECK-NEXT: s_mov_b32 s32, 16 | ||
| ; CHECK-NEXT: s_xor_saveexec_b64 s[0:1], -1 | ||
| ; CHECK-NEXT: scratch_store_dword off, v40, off ; 4-byte Folded Spill | ||
| ; CHECK-NEXT: scratch_store_dword off, v41, off offset:4 ; 4-byte Folded Spill | ||
| ; CHECK-NEXT: s_mov_b64 exec, s[0:1] | ||
| ; CHECK-NEXT: ; implicit-def: $sgpr6_sgpr7 | ||
| ; CHECK-NEXT: ; implicit-def: $sgpr10_sgpr11 | ||
| ; CHECK-NEXT: ; implicit-def: $sgpr0 | ||
| ; CHECK-NEXT: s_mov_b32 s1, 0 | ||
| ; CHECK-NEXT: ; implicit-def: $vgpr40 : SGPR spill to VGPR lane | ||
| ; CHECK-NEXT: v_writelane_b32 v40, s1, 0 | ||
| ; CHECK-NEXT: v_mov_b32_e32 v0, s1 | ||
| ; CHECK-NEXT: v_mov_b32_e32 v1, s1 | ||
| ; CHECK-NEXT: s_mov_b64 s[4:5], s[6:7] | ||
| ; CHECK-NEXT: s_mov_b64 s[8:9], 36 | ||
| ; CHECK-NEXT: s_mov_b32 s12, s0 | ||
| ; CHECK-NEXT: s_mov_b32 s13, s0 | ||
| ; CHECK-NEXT: s_mov_b32 s14, s0 | ||
| ; CHECK-NEXT: s_mov_b32 s15, s0 | ||
| ; CHECK-NEXT: v_mov_b32_e32 v31, s0 | ||
| ; CHECK-NEXT: s_getpc_b64 s[0:1] | ||
| ; CHECK-NEXT: s_add_u32 s0, s0, recurse@gotpcrel32@lo+4 | ||
| ; CHECK-NEXT: s_addc_u32 s1, s1, recurse@gotpcrel32@hi+12 | ||
| ; CHECK-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x0 | ||
| ; CHECK-NEXT: s_waitcnt lgkmcnt(0) | ||
| ; CHECK-NEXT: s_swappc_b64 s[30:31], s[0:1] | ||
| ; CHECK-NEXT: v_readlane_b32 s3, v40, 0 | ||
| ; CHECK-NEXT: s_nop 1 | ||
| ; CHECK-NEXT: v_mov_b32_e32 v0, s3 | ||
| ; CHECK-NEXT: s_nop 0 | ||
| ; CHECK-NEXT: v_readfirstlane_b32 s0, v0 | ||
| ; CHECK-NEXT: v_mov_b32_e32 v0, s3 | ||
| ; CHECK-NEXT: s_nop 0 | ||
| ; CHECK-NEXT: v_readfirstlane_b32 s1, v0 | ||
| ; CHECK-NEXT: v_mov_b32_e32 v0, s3 | ||
| ; CHECK-NEXT: s_nop 0 | ||
| ; CHECK-NEXT: v_readfirstlane_b32 s2, v0 | ||
| ; CHECK-NEXT: v_mov_b32_e32 v8, s3 | ||
| ; CHECK-NEXT: s_mov_b32 s4, 0 | ||
| ; CHECK-NEXT: v_mov_b32_e32 v9, s4 | ||
| ; CHECK-NEXT: v_mov_b32_e32 v10, s3 | ||
| ; CHECK-NEXT: v_mov_b32_e32 v11, s3 | ||
| ; CHECK-NEXT: s_mov_b64 s[4:5], 0 | ||
| ; CHECK-NEXT: v_readlane_b32 s3, v41, 0 | ||
| ; CHECK-NEXT: s_xor_saveexec_b64 s[6:7], -1 | ||
| ; CHECK-NEXT: scratch_load_dword v40, off, off ; 4-byte Folded Reload | ||
| ; CHECK-NEXT: scratch_load_dword v41, off, off offset:4 ; 4-byte Folded Reload | ||
| ; CHECK-NEXT: s_mov_b64 exec, s[6:7] | ||
| ; CHECK-NEXT: s_waitcnt vmcnt(0) | ||
| ; CHECK-NEXT: v_readlane_b32 s3, v41, 0 | ||
| ; CHECK-NEXT: s_xor_saveexec_b64 s[8:9], -1 | ||
| ; CHECK-NEXT: scratch_load_dword v40, off, off ; 4-byte Folded Reload | ||
| ; CHECK-NEXT: scratch_load_dword v41, off, off offset:4 ; 4-byte Folded Reload | ||
| ; CHECK-NEXT: s_mov_b64 exec, s[8:9] | ||
| ; CHECK-NEXT: s_mov_b64 exec, 0 | ||
| ; CHECK-NEXT: s_setpc_b64 s[4:5] | ||
| call void @recurse(ptr null) | ||
| call void (ptr, i64, <3 x i32>, { i32, ptr addrspace(5), i32, i32 }, i32, ...) @llvm.amdgcn.cs.chain.p0.i64.v3i32.sl_i32p5i32i32s(ptr null, i64 0, <3 x i32> inreg zeroinitializer, { i32, ptr addrspace(5), i32, i32 } zeroinitializer, i32 0) | ||
| unreachable | ||
| } | ||
|
|
||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you add some other straightforward examples that need an FP, like dynamic alloca and/or "frame-pointer"="all"? Do those fail without -O0?
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We already have tests for some of these cases in https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-fp-nosave.ll It would be nice to consolidate all FP-related tests in a single file, so they're easy to find.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The straightforward fp all example failed with -O0, so I added it to amdgpu-cs-chain-fp-nosave.ll. The recursive example is still in its original file, and the chain intrinsic call prevents compilation of it to gfx1200, so because fp-nosave.ll has gfx1200 within one of the RUN lines, then we may want to keep it in a different file for now.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This should definitely work on GFX12 and -O0! If you feel like the issues are not related to your patch, let me know and I can look into fixing them. Thanks :)
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I had a typo. I meant to say the recurse test only fails with -O0 and -global-isel=1. The fp all test can be moved because it can fail without -O0; it can go back to the new file too since it fails either way. Sorry about that. You may try as well. Thank you for the help so far.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. dynamic alloca would be easier to see
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There are dynamic allocas in the other test file. |
||
| declare void @llvm.amdgcn.cs.chain.p0.i64.v3i32.sl_i32p5i32i32s(ptr, i64, <3 x i32>, { i32, ptr addrspace(5), i32, i32 }, i32 immarg, ...) | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this for? If you're in a chain function where the FP needs to be spilled, then you'll call emitCSRSpillRestores once in the branch above, and then again from the else on this branch.
To be fair this whole code looks unnecessarily complicated, can we just merge the 2 ifs? They originally had the same condition and it's probably good to keep it that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, they can be merged. Thank you!