-
Notifications
You must be signed in to change notification settings - Fork 15.2k
[AMDGPU] Initialize FrameOffsetReg for amdgpu_cs_chain functions #165518
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@llvm/pr-subscribers-backend-amdgpu Author: None (jofrn) ChangesFunctions with the amdgpu_cs_chain calling convention were not initializing FrameOffsetReg, leaving it as FP_REG. This caused machine code verification failures Full diff: https://github.com/llvm/llvm-project/pull/165518.diff 2 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
index b398db4f7caff..a9f1e9a2ad31d 100644
--- a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
@@ -98,6 +98,7 @@ SIMachineFunctionInfo::SIMachineFunctionInfo(const Function &F,
// set one up. For now, we can use s32 to match what amdgpu_gfx functions
// would use if called, but this can be revisited.
// FIXME: Only reserve this if we actually need it.
+ FrameOffsetReg = AMDGPU::SGPR33;
StackPtrOffsetReg = AMDGPU::SGPR32;
ScratchRSrcReg = AMDGPU::SGPR48_SGPR49_SGPR50_SGPR51;
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-frame-pointer.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-frame-pointer.ll
new file mode 100644
index 0000000000000..3e2ecad67fbec
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-frame-pointer.ll
@@ -0,0 +1,64 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -global-isel=1 -mtriple=amdgcn -mcpu=gfx942 -O0 -verify-machineinstrs < %s | FileCheck %s
+
+define amdgpu_cs_chain void @indirect(ptr %callee) {
+; CHECK-LABEL: indirect:
+; CHECK: ; %bb.0:
+; CHECK-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-NEXT: s_mov_b32 s32, 16
+; CHECK-NEXT: s_xor_saveexec_b64 s[0:1], -1
+; CHECK-NEXT: scratch_store_dword off, v40, off ; 4-byte Folded Spill
+; CHECK-NEXT: scratch_store_dword off, v41, off offset:4 ; 4-byte Folded Spill
+; CHECK-NEXT: s_mov_b64 exec, s[0:1]
+; CHECK-NEXT: ; implicit-def: $sgpr6_sgpr7
+; CHECK-NEXT: ; implicit-def: $sgpr10_sgpr11
+; CHECK-NEXT: ; implicit-def: $sgpr0
+; CHECK-NEXT: s_mov_b32 s1, 0
+; CHECK-NEXT: ; implicit-def: $vgpr40 : SGPR spill to VGPR lane
+; CHECK-NEXT: v_writelane_b32 v40, s1, 0
+; CHECK-NEXT: v_mov_b32_e32 v0, s1
+; CHECK-NEXT: v_mov_b32_e32 v1, s1
+; CHECK-NEXT: s_mov_b64 s[4:5], s[6:7]
+; CHECK-NEXT: s_mov_b64 s[8:9], 36
+; CHECK-NEXT: s_mov_b32 s12, s0
+; CHECK-NEXT: s_mov_b32 s13, s0
+; CHECK-NEXT: s_mov_b32 s14, s0
+; CHECK-NEXT: s_mov_b32 s15, s0
+; CHECK-NEXT: v_mov_b32_e32 v31, s0
+; CHECK-NEXT: s_getpc_b64 s[0:1]
+; CHECK-NEXT: s_add_u32 s0, s0, indirect@gotpcrel32@lo+4
+; CHECK-NEXT: s_addc_u32 s1, s1, indirect@gotpcrel32@hi+12
+; CHECK-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x0
+; CHECK-NEXT: s_waitcnt lgkmcnt(0)
+; CHECK-NEXT: s_swappc_b64 s[30:31], s[0:1]
+; CHECK-NEXT: v_readlane_b32 s3, v40, 0
+; CHECK-NEXT: s_nop 1
+; CHECK-NEXT: v_mov_b32_e32 v0, s3
+; CHECK-NEXT: s_nop 0
+; CHECK-NEXT: v_readfirstlane_b32 s0, v0
+; CHECK-NEXT: v_mov_b32_e32 v0, s3
+; CHECK-NEXT: s_nop 0
+; CHECK-NEXT: v_readfirstlane_b32 s1, v0
+; CHECK-NEXT: v_mov_b32_e32 v0, s3
+; CHECK-NEXT: s_nop 0
+; CHECK-NEXT: v_readfirstlane_b32 s2, v0
+; CHECK-NEXT: v_mov_b32_e32 v8, s3
+; CHECK-NEXT: s_mov_b32 s4, 0
+; CHECK-NEXT: v_mov_b32_e32 v9, s4
+; CHECK-NEXT: v_mov_b32_e32 v10, s3
+; CHECK-NEXT: v_mov_b32_e32 v11, s3
+; CHECK-NEXT: s_mov_b64 s[4:5], 0
+; CHECK-NEXT: v_readlane_b32 s3, v41, 0
+; CHECK-NEXT: s_xor_saveexec_b64 s[6:7], -1
+; CHECK-NEXT: scratch_load_dword v40, off, s33 ; 4-byte Folded Reload
+; CHECK-NEXT: scratch_load_dword v41, off, s33 offset:4 ; 4-byte Folded Reload
+; CHECK-NEXT: s_mov_b64 exec, s[6:7]
+; CHECK-NEXT: s_mov_b32 s33, s3
+; CHECK-NEXT: s_mov_b64 exec, 0
+; CHECK-NEXT: s_setpc_b64 s[4:5]
+ call void @indirect(ptr null)
+ call void (ptr, i64, <3 x i32>, { i32, ptr addrspace(5), i32, i32 }, i32, ...) @llvm.amdgcn.cs.chain.p0.i64.v3i32.sl_i32p5i32i32s(ptr null, i64 0, <3 x i32> inreg zeroinitializer, { i32, ptr addrspace(5), i32, i32 } zeroinitializer, i32 0)
+ unreachable
+}
+
+declare void @llvm.amdgcn.cs.chain.p0.i64.v3i32.sl_i32p5i32i32s(ptr, i64, <3 x i32>, { i32, ptr addrspace(5), i32, i32 }, i32 immarg, ...)
|
18331e0 to
76189d6
Compare
| // set one up. For now, we can use s32 to match what amdgpu_gfx functions | ||
| // would use if called, but this can be revisited. | ||
| // FIXME: Only reserve this if we actually need it. | ||
| FrameOffsetReg = AMDGPU::SGPR33; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It still needs to compile without verifier error if it's UB. You don't have the option of rejecting a frame pointer, you can explicitly request it (and things like dynamic alloca force it too)
| // set one up. For now, we can use s32 to match what amdgpu_gfx functions | ||
| // would use if called, but this can be revisited. | ||
| // FIXME: Only reserve this if we actually need it. | ||
| FrameOffsetReg = AMDGPU::SGPR33; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It still needs to compile without verifier error if it's UB. You don't have the option of rejecting a frame pointer, you can explicitly request it (and things like dynamic alloca force it too)
| call void (ptr, i64, <3 x i32>, { i32, ptr addrspace(5), i32, i32 }, i32, ...) @llvm.amdgcn.cs.chain.p0.i64.v3i32.sl_i32p5i32i32s(ptr null, i64 0, <3 x i32> inreg zeroinitializer, { i32, ptr addrspace(5), i32, i32 } zeroinitializer, i32 0) | ||
| unreachable | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add some other straightforward examples that need an FP, like dynamic alloca and/or "frame-pointer"="all"? Do those fail without -O0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have tests for some of these cases in https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-fp-nosave.ll
It would be nice to consolidate all FP-related tests in a single file, so they're easy to find.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The straightforward fp all example failed with -O0, so I added it to amdgpu-cs-chain-fp-nosave.ll. The recursive example is still in its original file, and the chain intrinsic call prevents compilation of it to gfx1200, so because fp-nosave.ll has gfx1200 within one of the RUN lines, then we may want to keep it in a different file for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should definitely work on GFX12 and -O0! If you feel like the issues are not related to your patch, let me know and I can look into fixing them. Thanks :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a typo. I meant to say the recurse test only fails with -O0 and -global-isel=1. The fp all test can be moved because it can fail without -O0; it can go back to the new file too since it fails either way. Sorry about that.
You may try as well. Thank you for the help so far.
| ; CHECK-NEXT: s_mov_b64 s[4:5], 0 | ||
| ; CHECK-NEXT: v_readlane_b32 s3, v41, 0 | ||
| ; CHECK-NEXT: s_xor_saveexec_b64 s[6:7], -1 | ||
| ; CHECK-NEXT: scratch_load_dword v40, off, s33 ; 4-byte Folded Reload |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't correct, since the stores don't use fp_reg, they just use plain offsets. It looks like this is just an oversight where we used the wrong register for the restores. You should check who's inserting these spills and restores and make sure they're handled the same way, rather than force the FP to s33. (I could've sworn I fixed something like this a while ago but I can't find it right now).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the hardcoding of s33 and added check to avoid restoring FP for chain functions. Thanks.
Functions with the amdgpu_cs_chain calling convention were not initializing FrameOffsetReg, leaving it as FP_REG. This caused machine code verification failures as SCRATCH_STORE_DWORD_SADDR instructions require the saddr operand to be in the SReg_32_XEXEC_HI register class. This LLVM defect was identified via the AMD Fuzzing project.
…andle chain functions in same way as non-FPSaved case, an offset. Remove prior hardcode to s33 and do not restore FP for chain functions.
6e8105c to
820b9c3
Compare
…fx1200. Move fp_all test to fp-nosave since it compiles on both gfx942 and gfx1200.
| call void (ptr, i64, <3 x i32>, { i32, ptr addrspace(5), i32, i32 }, i32, ...) @llvm.amdgcn.cs.chain.p0.i64.v3i32.sl_i32p5i32i32s(ptr null, i64 0, <3 x i32> inreg zeroinitializer, { i32, ptr addrspace(5), i32, i32 } zeroinitializer, i32 0) | ||
| unreachable | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dynamic alloca would be easier to see
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are dynamic allocas in the other test file.
| ; GFX942: ; %bb.0: | ||
| ; GFX942-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) | ||
| ; GFX942-NEXT: v_mov_b32_e32 v0, 0 | ||
| ; GFX942-NEXT: scratch_store_dword off, v0, off |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like it the frame pointer was just ignored? If we require a frame pointer it still needs to be set up
| .getRegisterInfo() | ||
| ->hasStackRealignment(MF) || | ||
| mayReserveScratchForCWSR(MF) || | ||
| MF.getTarget().Options.DisableFramePointerElim(MF)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think chain functions should honour DisableFramePointerElim.
As far as I know, we don't have a test with the frameaddress intrinsic, we should probably add one (if it does the wrong thing, add a FIXME and I can have a look at it if you don't have the time).
I have also just realized that CWSR support is probably broken for chain functions. That has a bit of a bigger scope than your current patch, so I can look into that one too (unless you're very interested in learning about it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, will add a test for frameaddress. I will take CWSR into consideration...
You may work on it if you'd like as well.
| } | ||
|
|
||
| if (FPSaved) { | ||
| if (FPSaved && !FuncInfo->isChainFunction()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this for? If you're in a chain function where the FP needs to be spilled, then you'll call emitCSRSpillRestores once in the branch above, and then again from the else on this branch.
To be fair this whole code looks unnecessarily complicated, can we just merge the 2 ifs? They originally had the same condition and it's probably good to keep it that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, they can be merged. Thank you!
| call void (ptr, i64, <3 x i32>, { i32, ptr addrspace(5), i32, i32 }, i32, ...) @llvm.amdgcn.cs.chain.p0.i64.v3i32.sl_i32p5i32i32s(ptr null, i64 0, <3 x i32> inreg zeroinitializer, { i32, ptr addrspace(5), i32, i32 } zeroinitializer, i32 0) | ||
| unreachable | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should definitely work on GFX12 and -O0! If you feel like the issues are not related to your patch, let me know and I can look into fixing them. Thanks :)
The current issue occurs from the fp_all test case. The original patch for recurse requires only
fp_all requires perhaps the additional modifications to |
Functions with the amdgpu_cs_chain calling convention were not initializing FrameOffsetReg, leaving it as FP_REG. This caused machine code verification failures
as SCRATCH_STORE_DWORD_SADDR instructions require the saddr operand to be in the SReg_32_XEXEC_HI register class.