Skip to content

Conversation

@jofrn
Copy link
Contributor

@jofrn jofrn commented Oct 29, 2025

Functions with the amdgpu_cs_chain calling convention were not initializing FrameOffsetReg, leaving it as FP_REG. This caused machine code verification failures
as SCRATCH_STORE_DWORD_SADDR instructions require the saddr operand to be in the SReg_32_XEXEC_HI register class.

@llvmbot
Copy link
Member

llvmbot commented Oct 29, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: None (jofrn)

Changes

Functions with the amdgpu_cs_chain calling convention were not initializing FrameOffsetReg, leaving it as FP_REG. This caused machine code verification failures
as SCRATCH_STORE_DWORD_SADDR instructions require the saddr operand to be in the SReg_32_XEXEC_HI register class.


Full diff: https://github.com/llvm/llvm-project/pull/165518.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp (+1)
  • (added) llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-frame-pointer.ll (+64)
diff --git a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
index b398db4f7caff..a9f1e9a2ad31d 100644
--- a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
@@ -98,6 +98,7 @@ SIMachineFunctionInfo::SIMachineFunctionInfo(const Function &F,
     // set one up. For now, we can use s32 to match what amdgpu_gfx functions
     // would use if called, but this can be revisited.
     // FIXME: Only reserve this if we actually need it.
+    FrameOffsetReg = AMDGPU::SGPR33;
     StackPtrOffsetReg = AMDGPU::SGPR32;
 
     ScratchRSrcReg = AMDGPU::SGPR48_SGPR49_SGPR50_SGPR51;
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-frame-pointer.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-frame-pointer.ll
new file mode 100644
index 0000000000000..3e2ecad67fbec
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-frame-pointer.ll
@@ -0,0 +1,64 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -global-isel=1 -mtriple=amdgcn -mcpu=gfx942 -O0 -verify-machineinstrs < %s | FileCheck %s
+
+define amdgpu_cs_chain void @indirect(ptr %callee) {
+; CHECK-LABEL: indirect:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    s_mov_b32 s32, 16
+; CHECK-NEXT:    s_xor_saveexec_b64 s[0:1], -1
+; CHECK-NEXT:    scratch_store_dword off, v40, off ; 4-byte Folded Spill
+; CHECK-NEXT:    scratch_store_dword off, v41, off offset:4 ; 4-byte Folded Spill
+; CHECK-NEXT:    s_mov_b64 exec, s[0:1]
+; CHECK-NEXT:    ; implicit-def: $sgpr6_sgpr7
+; CHECK-NEXT:    ; implicit-def: $sgpr10_sgpr11
+; CHECK-NEXT:    ; implicit-def: $sgpr0
+; CHECK-NEXT:    s_mov_b32 s1, 0
+; CHECK-NEXT:    ; implicit-def: $vgpr40 : SGPR spill to VGPR lane
+; CHECK-NEXT:    v_writelane_b32 v40, s1, 0
+; CHECK-NEXT:    v_mov_b32_e32 v0, s1
+; CHECK-NEXT:    v_mov_b32_e32 v1, s1
+; CHECK-NEXT:    s_mov_b64 s[4:5], s[6:7]
+; CHECK-NEXT:    s_mov_b64 s[8:9], 36
+; CHECK-NEXT:    s_mov_b32 s12, s0
+; CHECK-NEXT:    s_mov_b32 s13, s0
+; CHECK-NEXT:    s_mov_b32 s14, s0
+; CHECK-NEXT:    s_mov_b32 s15, s0
+; CHECK-NEXT:    v_mov_b32_e32 v31, s0
+; CHECK-NEXT:    s_getpc_b64 s[0:1]
+; CHECK-NEXT:    s_add_u32 s0, s0, indirect@gotpcrel32@lo+4
+; CHECK-NEXT:    s_addc_u32 s1, s1, indirect@gotpcrel32@hi+12
+; CHECK-NEXT:    s_load_dwordx2 s[0:1], s[0:1], 0x0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    s_swappc_b64 s[30:31], s[0:1]
+; CHECK-NEXT:    v_readlane_b32 s3, v40, 0
+; CHECK-NEXT:    s_nop 1
+; CHECK-NEXT:    v_mov_b32_e32 v0, s3
+; CHECK-NEXT:    s_nop 0
+; CHECK-NEXT:    v_readfirstlane_b32 s0, v0
+; CHECK-NEXT:    v_mov_b32_e32 v0, s3
+; CHECK-NEXT:    s_nop 0
+; CHECK-NEXT:    v_readfirstlane_b32 s1, v0
+; CHECK-NEXT:    v_mov_b32_e32 v0, s3
+; CHECK-NEXT:    s_nop 0
+; CHECK-NEXT:    v_readfirstlane_b32 s2, v0
+; CHECK-NEXT:    v_mov_b32_e32 v8, s3
+; CHECK-NEXT:    s_mov_b32 s4, 0
+; CHECK-NEXT:    v_mov_b32_e32 v9, s4
+; CHECK-NEXT:    v_mov_b32_e32 v10, s3
+; CHECK-NEXT:    v_mov_b32_e32 v11, s3
+; CHECK-NEXT:    s_mov_b64 s[4:5], 0
+; CHECK-NEXT:    v_readlane_b32 s3, v41, 0
+; CHECK-NEXT:    s_xor_saveexec_b64 s[6:7], -1
+; CHECK-NEXT:    scratch_load_dword v40, off, s33 ; 4-byte Folded Reload
+; CHECK-NEXT:    scratch_load_dword v41, off, s33 offset:4 ; 4-byte Folded Reload
+; CHECK-NEXT:    s_mov_b64 exec, s[6:7]
+; CHECK-NEXT:    s_mov_b32 s33, s3
+; CHECK-NEXT:    s_mov_b64 exec, 0
+; CHECK-NEXT:    s_setpc_b64 s[4:5]
+  call void @indirect(ptr null)
+  call void (ptr, i64, <3 x i32>, { i32, ptr addrspace(5), i32, i32 }, i32, ...) @llvm.amdgcn.cs.chain.p0.i64.v3i32.sl_i32p5i32i32s(ptr null, i64 0, <3 x i32> inreg zeroinitializer, { i32, ptr addrspace(5), i32, i32 } zeroinitializer, i32 0)
+  unreachable
+}
+
+declare void @llvm.amdgcn.cs.chain.p0.i64.v3i32.sl_i32p5i32i32s(ptr, i64, <3 x i32>, { i32, ptr addrspace(5), i32, i32 }, i32 immarg, ...)

@jofrn jofrn force-pushed the users/jofrn/fix-amdgpu-cs-chain-frame-pointer branch from 18331e0 to 76189d6 Compare October 29, 2025 07:04
// set one up. For now, we can use s32 to match what amdgpu_gfx functions
// would use if called, but this can be revisited.
// FIXME: Only reserve this if we actually need it.
FrameOffsetReg = AMDGPU::SGPR33;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the comment, it doesn't seem like this is needed. The test case has a UB in the first place, because indirect should not be called directly. I'm not sure if we want to do this. @jayfoad @rovka

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still needs to compile without verifier error if it's UB. You don't have the option of rejecting a frame pointer, you can explicitly request it (and things like dynamic alloca force it too)

// set one up. For now, we can use s32 to match what amdgpu_gfx functions
// would use if called, but this can be revisited.
// FIXME: Only reserve this if we actually need it.
FrameOffsetReg = AMDGPU::SGPR33;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still needs to compile without verifier error if it's UB. You don't have the option of rejecting a frame pointer, you can explicitly request it (and things like dynamic alloca force it too)

call void (ptr, i64, <3 x i32>, { i32, ptr addrspace(5), i32, i32 }, i32, ...) @llvm.amdgcn.cs.chain.p0.i64.v3i32.sl_i32p5i32i32s(ptr null, i64 0, <3 x i32> inreg zeroinitializer, { i32, ptr addrspace(5), i32, i32 } zeroinitializer, i32 0)
unreachable
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some other straightforward examples that need an FP, like dynamic alloca and/or "frame-pointer"="all"? Do those fail without -O0?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have tests for some of these cases in https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/AMDGPU/amdgpu-cs-chain-fp-nosave.ll

It would be nice to consolidate all FP-related tests in a single file, so they're easy to find.

Copy link
Contributor Author

@jofrn jofrn Nov 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The straightforward fp all example failed with -O0, so I added it to amdgpu-cs-chain-fp-nosave.ll. The recursive example is still in its original file, and the chain intrinsic call prevents compilation of it to gfx1200, so because fp-nosave.ll has gfx1200 within one of the RUN lines, then we may want to keep it in a different file for now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should definitely work on GFX12 and -O0! If you feel like the issues are not related to your patch, let me know and I can look into fixing them. Thanks :)

Copy link
Contributor Author

@jofrn jofrn Nov 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a typo. I meant to say the recurse test only fails with -O0 and -global-isel=1. The fp all test can be moved because it can fail without -O0; it can go back to the new file too since it fails either way. Sorry about that.

You may try as well. Thank you for the help so far.

; CHECK-NEXT: s_mov_b64 s[4:5], 0
; CHECK-NEXT: v_readlane_b32 s3, v41, 0
; CHECK-NEXT: s_xor_saveexec_b64 s[6:7], -1
; CHECK-NEXT: scratch_load_dword v40, off, s33 ; 4-byte Folded Reload
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't correct, since the stores don't use fp_reg, they just use plain offsets. It looks like this is just an oversight where we used the wrong register for the restores. You should check who's inserting these spills and restores and make sure they're handled the same way, rather than force the FP to s33. (I could've sworn I fixed something like this a while ago but I can't find it right now).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the hardcoding of s33 and added check to avoid restoring FP for chain functions. Thanks.

Functions with the amdgpu_cs_chain calling convention were not
initializing FrameOffsetReg, leaving it as FP_REG.
This caused machine code verification failures
as SCRATCH_STORE_DWORD_SADDR instructions require the saddr
operand to be in the SReg_32_XEXEC_HI register class.

This LLVM defect was identified via the AMD Fuzzing project.
…andle chain functions in same way as non-FPSaved case, an offset. Remove prior hardcode to s33 and do not restore FP for chain functions.
@jofrn jofrn force-pushed the users/jofrn/fix-amdgpu-cs-chain-frame-pointer branch from 6e8105c to 820b9c3 Compare November 10, 2025 13:09
…fx1200. Move fp_all test to fp-nosave since it compiles on both gfx942 and gfx1200.
call void (ptr, i64, <3 x i32>, { i32, ptr addrspace(5), i32, i32 }, i32, ...) @llvm.amdgcn.cs.chain.p0.i64.v3i32.sl_i32p5i32i32s(ptr null, i64 0, <3 x i32> inreg zeroinitializer, { i32, ptr addrspace(5), i32, i32 } zeroinitializer, i32 0)
unreachable
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dynamic alloca would be easier to see

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are dynamic allocas in the other test file.

; GFX942: ; %bb.0:
; GFX942-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX942-NEXT: v_mov_b32_e32 v0, 0
; GFX942-NEXT: scratch_store_dword off, v0, off
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it the frame pointer was just ignored? If we require a frame pointer it still needs to be set up

.getRegisterInfo()
->hasStackRealignment(MF) ||
mayReserveScratchForCWSR(MF) ||
MF.getTarget().Options.DisableFramePointerElim(MF));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think chain functions should honour DisableFramePointerElim.

As far as I know, we don't have a test with the frameaddress intrinsic, we should probably add one (if it does the wrong thing, add a FIXME and I can have a look at it if you don't have the time).

I have also just realized that CWSR support is probably broken for chain functions. That has a bit of a bigger scope than your current patch, so I can look into that one too (unless you're very interested in learning about it).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, will add a test for frameaddress. I will take CWSR into consideration...

You may work on it if you'd like as well.

}

if (FPSaved) {
if (FPSaved && !FuncInfo->isChainFunction()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this for? If you're in a chain function where the FP needs to be spilled, then you'll call emitCSRSpillRestores once in the branch above, and then again from the else on this branch.

To be fair this whole code looks unnecessarily complicated, can we just merge the 2 ifs? They originally had the same condition and it's probably good to keep it that way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they can be merged. Thank you!

call void (ptr, i64, <3 x i32>, { i32, ptr addrspace(5), i32, i32 }, i32, ...) @llvm.amdgcn.cs.chain.p0.i64.v3i32.sl_i32p5i32i32s(ptr null, i64 0, <3 x i32> inreg zeroinitializer, { i32, ptr addrspace(5), i32, i32 } zeroinitializer, i32 0)
unreachable
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should definitely work on GFX12 and -O0! If you feel like the issues are not related to your patch, let me know and I can look into fixing them. Thanks :)

@jofrn
Copy link
Contributor Author

jofrn commented Nov 17, 2025

This should definitely work on GFX12 and -O0! If you feel like the issues are not related to your patch, let me know and I can look into fixing them. Thanks :)

recurse test can fail with or without -O0; I had a typo, clarified above.

The current issue occurs from the fp_all test case. The original patch for recurse requires only

return !MF.getInfo<SIMachineFunctionInfo>()->isChainFunction() && (frameTriviallyRequiresSP(MFI) || MFI.isFrameAddressTaken() ...

fp_all requires perhaps the additional modifications to SIFrameLowering::emitEpilogue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants