
Conversation

@heiher
Member

@heiher heiher commented Nov 18, 2025

Allow tail-calling functions that return via sret when the caller has an incoming sret pointer that can be forwarded.

Remove the overly strict requirement that tail-call argument values must exactly match the caller's incoming arguments. The real constraint is only that the callee uses no more argument stack space than the caller.

This fixes musttail codegen and enables significantly more tail-call optimizations.

Fixes #168152
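For illustration, here is a minimal IR sketch (hypothetical, not taken from the patch's tests) of the sret case this enables: both caller and callee return through an sret pointer, so the caller can simply forward its own incoming pointer.

%struct.Big = type { [4 x i64] }

declare void @sret_callee(ptr sret(%struct.Big) %out)

; Caller and callee both return through the same sret pointer, so no copy of
; the result is needed and the call can be lowered to a plain jump.
define void @sret_caller(ptr sret(%struct.Big) %out) {
  musttail call void @sret_callee(ptr sret(%struct.Big) %out)
  ret void
}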

@llvmbot
Member

llvmbot commented Nov 18, 2025

@llvm/pr-subscribers-backend-loongarch

Author: hev (heiher)

Patch is 24.66 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/168506.diff

5 Files Affected:

  • (modified) llvm/lib/Target/LoongArch/LoongArchISelLowering.cpp (+65-10)
  • (modified) llvm/lib/Target/LoongArch/LoongArchISelLowering.h (+6)
  • (modified) llvm/lib/Target/LoongArch/LoongArchMachineFunctionInfo.h (+7)
  • (added) llvm/test/CodeGen/LoongArch/musttail.ll (+397)
  • (modified) llvm/test/CodeGen/LoongArch/tail-calls.ll (+4-9)
diff --git a/llvm/lib/Target/LoongArch/LoongArchISelLowering.cpp b/llvm/lib/Target/LoongArch/LoongArchISelLowering.cpp
index cf4ffc82f6009..2a55558e00e78 100644
--- a/llvm/lib/Target/LoongArch/LoongArchISelLowering.cpp
+++ b/llvm/lib/Target/LoongArch/LoongArchISelLowering.cpp
@@ -8069,6 +8069,7 @@ SDValue LoongArchTargetLowering::LowerFormalArguments(
     SelectionDAG &DAG, SmallVectorImpl<SDValue> &InVals) const {
 
   MachineFunction &MF = DAG.getMachineFunction();
+  auto *LoongArchFI = MF.getInfo<LoongArchMachineFunctionInfo>();
 
   switch (CallConv) {
   default:
@@ -8140,7 +8141,6 @@ SDValue LoongArchTargetLowering::LowerFormalArguments(
     const TargetRegisterClass *RC = &LoongArch::GPRRegClass;
     MachineFrameInfo &MFI = MF.getFrameInfo();
     MachineRegisterInfo &RegInfo = MF.getRegInfo();
-    auto *LoongArchFI = MF.getInfo<LoongArchMachineFunctionInfo>();
 
     // Offset of the first variable argument from stack pointer, and size of
     // the vararg save area. For now, the varargs save area is either zero or
@@ -8190,6 +8190,8 @@ SDValue LoongArchTargetLowering::LowerFormalArguments(
     LoongArchFI->setVarArgsSaveSize(VarArgsSaveSize);
   }
 
+  LoongArchFI->setArgumentStackSize(CCInfo.getStackSize());
+
   // All stores are grouped in one node to allow the matching between
   // the size of Ins and InVals. This only happens for vararg functions.
   if (!OutChains.empty()) {
@@ -8246,9 +8248,11 @@ bool LoongArchTargetLowering::isEligibleForTailCallOptimization(
   auto &Outs = CLI.Outs;
   auto &Caller = MF.getFunction();
   auto CallerCC = Caller.getCallingConv();
+  auto *LoongArchFI = MF.getInfo<LoongArchMachineFunctionInfo>();
 
-  // Do not tail call opt if the stack is used to pass parameters.
-  if (CCInfo.getStackSize() != 0)
+  // If the stack arguments for this call do not fit into our own save area then
+  // the call cannot be made tail.
+  if (CCInfo.getStackSize() > LoongArchFI->getArgumentStackSize())
     return false;
 
   // Do not tail call opt if any parameters need to be passed indirectly.
@@ -8260,7 +8264,7 @@ bool LoongArchTargetLowering::isEligibleForTailCallOptimization(
   // semantics.
   auto IsCallerStructRet = Caller.hasStructRetAttr();
   auto IsCalleeStructRet = Outs.empty() ? false : Outs[0].Flags.isSRet();
-  if (IsCallerStructRet || IsCalleeStructRet)
+  if (IsCallerStructRet != IsCalleeStructRet)
     return false;
 
   // Do not tail call opt if either the callee or caller has a byval argument.
@@ -8276,9 +8280,47 @@ bool LoongArchTargetLowering::isEligibleForTailCallOptimization(
     if (!TRI->regmaskSubsetEqual(CallerPreserved, CalleePreserved))
       return false;
   }
+
+  // Check that outgoing argument values placed in callee-saved registers
+  // match the caller's incoming values for those registers.
+  const MachineRegisterInfo &MRI = MF.getRegInfo();
+  const SmallVectorImpl<SDValue> &OutVals = CLI.OutVals;
+  if (!parametersInCSRMatch(MRI, CallerPreserved, ArgLocs, OutVals))
+    return false;
+
   return true;
 }
 
+SDValue LoongArchTargetLowering::addTokenForArgument(SDValue Chain,
+                                                     SelectionDAG &DAG,
+                                                     MachineFrameInfo &MFI,
+                                                     int ClobberedFI) const {
+  SmallVector<SDValue, 8> ArgChains;
+  int64_t FirstByte = MFI.getObjectOffset(ClobberedFI);
+  int64_t LastByte = FirstByte + MFI.getObjectSize(ClobberedFI) - 1;
+
+  // Include the original chain at the beginning of the list. When this is
+  // used by target LowerCall hooks, this helps legalize find the
+  // CALLSEQ_BEGIN node.
+  ArgChains.push_back(Chain);
+
+  // Add a chain value for each stack argument load that corresponds to an
+  // incoming fixed stack object.
+  for (SDNode *U : DAG.getEntryNode().getNode()->users())
+    if (LoadSDNode *L = dyn_cast<LoadSDNode>(U))
+      if (FrameIndexSDNode *FI = dyn_cast<FrameIndexSDNode>(L->getBasePtr()))
+        if (FI->getIndex() < 0) {
+          int64_t InFirstByte = MFI.getObjectOffset(FI->getIndex());
+          int64_t InLastByte = InFirstByte;
+          InLastByte += MFI.getObjectSize(FI->getIndex()) - 1;
+
+          if ((InFirstByte <= FirstByte && FirstByte <= InLastByte) ||
+              (FirstByte <= InFirstByte && InFirstByte <= LastByte))
+            ArgChains.push_back(SDValue(L, 1));
+        }
+
+  // Build a tokenfactor for all the chains.
+  return DAG.getNode(ISD::TokenFactor, SDLoc(Chain), MVT::Other, ArgChains);
+}
+
 static Align getPrefTypeAlign(EVT VT, SelectionDAG &DAG) {
   return DAG.getDataLayout().getPrefTypeAlign(
       VT.getTypeForEVT(*DAG.getContext()));
@@ -8454,19 +8496,32 @@ LoongArchTargetLowering::LowerCall(CallLoweringInfo &CLI,
       RegsToPass.push_back(std::make_pair(VA.getLocReg(), ArgValue));
     } else {
       assert(VA.isMemLoc() && "Argument not register or memory");
-      assert(!IsTailCall && "Tail call not allowed if stack is used "
-                            "for passing parameters");
+      SDValue DstAddr;
+      MachinePointerInfo DstInfo;
+      int32_t Offset = VA.getLocMemOffset();
 
       // Work out the address of the stack slot.
       if (!StackPtr.getNode())
         StackPtr = DAG.getCopyFromReg(Chain, DL, LoongArch::R3, PtrVT);
-      SDValue Address =
-          DAG.getNode(ISD::ADD, DL, PtrVT, StackPtr,
-                      DAG.getIntPtrConstant(VA.getLocMemOffset(), DL));
+
+      if (IsTailCall) {
+        unsigned OpSize = (VA.getValVT().getSizeInBits() + 7) / 8;
+        int FI = MF.getFrameInfo().CreateFixedObject(OpSize, Offset, true);
+        DstAddr = DAG.getFrameIndex(FI, PtrVT);
+        DstInfo = MachinePointerInfo::getFixedStack(MF, FI);
+        // Make sure any stack arguments overlapping with where we're storing
+        // are loaded before this eventual operation. Otherwise they'll be
+        // clobbered.
+        Chain = addTokenForArgument(Chain, DAG, MF.getFrameInfo(), FI);
+      } else {
+        SDValue PtrOff = DAG.getIntPtrConstant(Offset, DL);
+        DstAddr = DAG.getNode(ISD::ADD, DL, PtrVT, StackPtr, PtrOff);
+        DstInfo = MachinePointerInfo::getStack(MF, Offset);
+      }
 
       // Emit the store.
       MemOpChains.push_back(
-          DAG.getStore(Chain, DL, ArgValue, Address, MachinePointerInfo()));
+          DAG.getStore(Chain, DL, ArgValue, DstAddr, DstInfo));
     }
   }
 
diff --git a/llvm/lib/Target/LoongArch/LoongArchISelLowering.h b/llvm/lib/Target/LoongArch/LoongArchISelLowering.h
index 8a4d7748467c7..e95f70f06cc7b 100644
--- a/llvm/lib/Target/LoongArch/LoongArchISelLowering.h
+++ b/llvm/lib/Target/LoongArch/LoongArchISelLowering.h
@@ -438,6 +438,12 @@ class LoongArchTargetLowering : public TargetLowering {
       CCState &CCInfo, CallLoweringInfo &CLI, MachineFunction &MF,
       const SmallVectorImpl<CCValAssign> &ArgLocs) const;
 
+  /// Finds the incoming stack arguments which overlap the given fixed stack
+  /// object and incorporates their load into the current chain. This prevents
+  /// an upcoming store from clobbering the stack argument before it's used.
+  SDValue addTokenForArgument(SDValue Chain, SelectionDAG &DAG,
+                              MachineFrameInfo &MFI, int ClobberedFI) const;
+
   bool softPromoteHalfType() const override { return true; }
 
   bool
diff --git a/llvm/lib/Target/LoongArch/LoongArchMachineFunctionInfo.h b/llvm/lib/Target/LoongArch/LoongArchMachineFunctionInfo.h
index 904985c189dba..cf0837cbf09c7 100644
--- a/llvm/lib/Target/LoongArch/LoongArchMachineFunctionInfo.h
+++ b/llvm/lib/Target/LoongArch/LoongArchMachineFunctionInfo.h
@@ -32,6 +32,10 @@ class LoongArchMachineFunctionInfo : public MachineFunctionInfo {
   /// Size of stack frame to save callee saved registers
   unsigned CalleeSavedStackSize = 0;
 
+  /// ArgumentStackSize - amount of bytes on stack consumed by the arguments
+  /// being passed on the stack
+  unsigned ArgumentStackSize = 0;
+
   /// FrameIndex of the spill slot when there is no scavenged register in
   /// insertIndirectBranch.
   int BranchRelaxationSpillFrameIndex = -1;
@@ -63,6 +67,9 @@ class LoongArchMachineFunctionInfo : public MachineFunctionInfo {
   unsigned getCalleeSavedStackSize() const { return CalleeSavedStackSize; }
   void setCalleeSavedStackSize(unsigned Size) { CalleeSavedStackSize = Size; }
 
+  unsigned getArgumentStackSize() const { return ArgumentStackSize; }
+  void setArgumentStackSize(unsigned size) { ArgumentStackSize = size; }
+
   int getBranchRelaxationSpillFrameIndex() {
     return BranchRelaxationSpillFrameIndex;
   }
diff --git a/llvm/test/CodeGen/LoongArch/musttail.ll b/llvm/test/CodeGen/LoongArch/musttail.ll
new file mode 100644
index 0000000000000..cf436e0505ad4
--- /dev/null
+++ b/llvm/test/CodeGen/LoongArch/musttail.ll
@@ -0,0 +1,397 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=loongarch32 %s -o - | FileCheck %s --check-prefix=LA32
+; RUN: llc -mtriple=loongarch64 %s -o - | FileCheck %s --check-prefix=LA64
+
+declare i32 @many_args_callee(i32 %0, i32 %1, i32 %2, i32 %3, i32 %4, i32 %5, i32 %6, i32 %7, i32 %8, i32 %9)
+
+define i32 @many_args_tail(i32 %0, i32 %1, i32 %2, i32 %3, i32 %4, i32 %5, i32 %6, i32 %7, i32 %8, i32 %9) {
+; LA32-LABEL: many_args_tail:
+; LA32:       # %bb.0:
+; LA32-NEXT:    ori $a0, $zero, 9
+; LA32-NEXT:    st.w $a0, $sp, 4
+; LA32-NEXT:    ori $a0, $zero, 8
+; LA32-NEXT:    ori $a1, $zero, 1
+; LA32-NEXT:    ori $a2, $zero, 2
+; LA32-NEXT:    ori $a3, $zero, 3
+; LA32-NEXT:    ori $a4, $zero, 4
+; LA32-NEXT:    ori $a5, $zero, 5
+; LA32-NEXT:    ori $a6, $zero, 6
+; LA32-NEXT:    ori $a7, $zero, 7
+; LA32-NEXT:    st.w $a0, $sp, 0
+; LA32-NEXT:    move $a0, $zero
+; LA32-NEXT:    b many_args_callee
+;
+; LA64-LABEL: many_args_tail:
+; LA64:       # %bb.0:
+; LA64-NEXT:    ori $a0, $zero, 9
+; LA64-NEXT:    st.d $a0, $sp, 8
+; LA64-NEXT:    ori $a0, $zero, 8
+; LA64-NEXT:    ori $a1, $zero, 1
+; LA64-NEXT:    ori $a2, $zero, 2
+; LA64-NEXT:    ori $a3, $zero, 3
+; LA64-NEXT:    ori $a4, $zero, 4
+; LA64-NEXT:    ori $a5, $zero, 5
+; LA64-NEXT:    ori $a6, $zero, 6
+; LA64-NEXT:    ori $a7, $zero, 7
+; LA64-NEXT:    st.d $a0, $sp, 0
+; LA64-NEXT:    move $a0, $zero
+; LA64-NEXT:    pcaddu18i $t8, %call36(many_args_callee)
+; LA64-NEXT:    jr $t8
+  %ret = tail call i32 @many_args_callee(i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9)
+  ret i32 %ret
+}
+
+define i32 @many_args_musttail(i32 %0, i32 %1, i32 %2, i32 %3, i32 %4, i32 %5, i32 %6, i32 %7, i32 %8, i32 %9) {
+; LA32-LABEL: many_args_musttail:
+; LA32:       # %bb.0:
+; LA32-NEXT:    ori $a0, $zero, 9
+; LA32-NEXT:    st.w $a0, $sp, 4
+; LA32-NEXT:    ori $a0, $zero, 8
+; LA32-NEXT:    ori $a1, $zero, 1
+; LA32-NEXT:    ori $a2, $zero, 2
+; LA32-NEXT:    ori $a3, $zero, 3
+; LA32-NEXT:    ori $a4, $zero, 4
+; LA32-NEXT:    ori $a5, $zero, 5
+; LA32-NEXT:    ori $a6, $zero, 6
+; LA32-NEXT:    ori $a7, $zero, 7
+; LA32-NEXT:    st.w $a0, $sp, 0
+; LA32-NEXT:    move $a0, $zero
+; LA32-NEXT:    b many_args_callee
+;
+; LA64-LABEL: many_args_musttail:
+; LA64:       # %bb.0:
+; LA64-NEXT:    ori $a0, $zero, 9
+; LA64-NEXT:    st.d $a0, $sp, 8
+; LA64-NEXT:    ori $a0, $zero, 8
+; LA64-NEXT:    ori $a1, $zero, 1
+; LA64-NEXT:    ori $a2, $zero, 2
+; LA64-NEXT:    ori $a3, $zero, 3
+; LA64-NEXT:    ori $a4, $zero, 4
+; LA64-NEXT:    ori $a5, $zero, 5
+; LA64-NEXT:    ori $a6, $zero, 6
+; LA64-NEXT:    ori $a7, $zero, 7
+; LA64-NEXT:    st.d $a0, $sp, 0
+; LA64-NEXT:    move $a0, $zero
+; LA64-NEXT:    pcaddu18i $t8, %call36(many_args_callee)
+; LA64-NEXT:    jr $t8
+  %ret = musttail call i32 @many_args_callee(i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9)
+  ret i32 %ret
+}
+
+; This function has more arguments than its tail-callee. This isn't valid for
+; the musttail attribute, but can still be tail-called as a non-guaranteed
+; optimisation, because the outgoing arguments to @many_args_callee fit in the
+; stack space allocated by the caller of @more_args_tail.
+define i32 @more_args_tail(i32 %0, i32 %1, i32 %2, i32 %3, i32 %4, i32 %5, i32 %6, i32 %7, i32 %8, i32 %9) {
+; LA32-LABEL: more_args_tail:
+; LA32:       # %bb.0:
+; LA32-NEXT:    ori $a0, $zero, 9
+; LA32-NEXT:    st.w $a0, $sp, 4
+; LA32-NEXT:    ori $a0, $zero, 8
+; LA32-NEXT:    ori $a1, $zero, 1
+; LA32-NEXT:    ori $a2, $zero, 2
+; LA32-NEXT:    ori $a3, $zero, 3
+; LA32-NEXT:    ori $a4, $zero, 4
+; LA32-NEXT:    ori $a5, $zero, 5
+; LA32-NEXT:    ori $a6, $zero, 6
+; LA32-NEXT:    ori $a7, $zero, 7
+; LA32-NEXT:    st.w $a0, $sp, 0
+; LA32-NEXT:    move $a0, $zero
+; LA32-NEXT:    b many_args_callee
+;
+; LA64-LABEL: more_args_tail:
+; LA64:       # %bb.0:
+; LA64-NEXT:    ori $a0, $zero, 9
+; LA64-NEXT:    st.d $a0, $sp, 8
+; LA64-NEXT:    ori $a0, $zero, 8
+; LA64-NEXT:    ori $a1, $zero, 1
+; LA64-NEXT:    ori $a2, $zero, 2
+; LA64-NEXT:    ori $a3, $zero, 3
+; LA64-NEXT:    ori $a4, $zero, 4
+; LA64-NEXT:    ori $a5, $zero, 5
+; LA64-NEXT:    ori $a6, $zero, 6
+; LA64-NEXT:    ori $a7, $zero, 7
+; LA64-NEXT:    st.d $a0, $sp, 0
+; LA64-NEXT:    move $a0, $zero
+; LA64-NEXT:    pcaddu18i $t8, %call36(many_args_callee)
+; LA64-NEXT:    jr $t8
+  %ret = tail call i32 @many_args_callee(i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9)
+  ret i32 %ret
+}
+
+; Again, this isn't valid for musttail, but can be tail-called in practice
+; because the stack size is the same.
+define i32 @different_args_tail_32bit(i64 %0, i64 %1, i64 %2, i64 %3, i64 %4) {
+; LA32-LABEL: different_args_tail_32bit:
+; LA32:       # %bb.0:
+; LA32-NEXT:    ori $a0, $zero, 9
+; LA32-NEXT:    st.w $a0, $sp, 4
+; LA32-NEXT:    ori $a0, $zero, 8
+; LA32-NEXT:    ori $a1, $zero, 1
+; LA32-NEXT:    ori $a2, $zero, 2
+; LA32-NEXT:    ori $a3, $zero, 3
+; LA32-NEXT:    ori $a4, $zero, 4
+; LA32-NEXT:    ori $a5, $zero, 5
+; LA32-NEXT:    ori $a6, $zero, 6
+; LA32-NEXT:    ori $a7, $zero, 7
+; LA32-NEXT:    st.w $a0, $sp, 0
+; LA32-NEXT:    move $a0, $zero
+; LA32-NEXT:    b many_args_callee
+;
+; LA64-LABEL: different_args_tail_32bit:
+; LA64:       # %bb.0:
+; LA64-NEXT:    addi.d $sp, $sp, -32
+; LA64-NEXT:    .cfi_def_cfa_offset 32
+; LA64-NEXT:    st.d $ra, $sp, 24 # 8-byte Folded Spill
+; LA64-NEXT:    .cfi_offset 1, -8
+; LA64-NEXT:    ori $a0, $zero, 9
+; LA64-NEXT:    st.d $a0, $sp, 8
+; LA64-NEXT:    ori $a0, $zero, 8
+; LA64-NEXT:    ori $a1, $zero, 1
+; LA64-NEXT:    ori $a2, $zero, 2
+; LA64-NEXT:    ori $a3, $zero, 3
+; LA64-NEXT:    ori $a4, $zero, 4
+; LA64-NEXT:    ori $a5, $zero, 5
+; LA64-NEXT:    ori $a6, $zero, 6
+; LA64-NEXT:    ori $a7, $zero, 7
+; LA64-NEXT:    st.d $a0, $sp, 0
+; LA64-NEXT:    move $a0, $zero
+; LA64-NEXT:    pcaddu18i $ra, %call36(many_args_callee)
+; LA64-NEXT:    jirl $ra, $ra, 0
+; LA64-NEXT:    ld.d $ra, $sp, 24 # 8-byte Folded Reload
+; LA64-NEXT:    addi.d $sp, $sp, 32
+; LA64-NEXT:    ret
+  %ret = tail call i32 @many_args_callee(i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9)
+  ret i32 %ret
+}
+
+define i32 @different_args_tail_64bit(i128 %0, i128 %1, i128 %2, i128 %3, i128 %4) {
+; LA32-LABEL: different_args_tail_64bit:
+; LA32:       # %bb.0:
+; LA32-NEXT:    addi.w $sp, $sp, -16
+; LA32-NEXT:    .cfi_def_cfa_offset 16
+; LA32-NEXT:    st.w $ra, $sp, 12 # 4-byte Folded Spill
+; LA32-NEXT:    .cfi_offset 1, -4
+; LA32-NEXT:    ori $a0, $zero, 9
+; LA32-NEXT:    st.w $a0, $sp, 4
+; LA32-NEXT:    ori $a0, $zero, 8
+; LA32-NEXT:    ori $a1, $zero, 1
+; LA32-NEXT:    ori $a2, $zero, 2
+; LA32-NEXT:    ori $a3, $zero, 3
+; LA32-NEXT:    ori $a4, $zero, 4
+; LA32-NEXT:    ori $a5, $zero, 5
+; LA32-NEXT:    ori $a6, $zero, 6
+; LA32-NEXT:    ori $a7, $zero, 7
+; LA32-NEXT:    st.w $a0, $sp, 0
+; LA32-NEXT:    move $a0, $zero
+; LA32-NEXT:    bl many_args_callee
+; LA32-NEXT:    ld.w $ra, $sp, 12 # 4-byte Folded Reload
+; LA32-NEXT:    addi.w $sp, $sp, 16
+; LA32-NEXT:    ret
+;
+; LA64-LABEL: different_args_tail_64bit:
+; LA64:       # %bb.0:
+; LA64-NEXT:    ori $a0, $zero, 9
+; LA64-NEXT:    st.d $a0, $sp, 8
+; LA64-NEXT:    ori $a0, $zero, 8
+; LA64-NEXT:    ori $a1, $zero, 1
+; LA64-NEXT:    ori $a2, $zero, 2
+; LA64-NEXT:    ori $a3, $zero, 3
+; LA64-NEXT:    ori $a4, $zero, 4
+; LA64-NEXT:    ori $a5, $zero, 5
+; LA64-NEXT:    ori $a6, $zero, 6
+; LA64-NEXT:    ori $a7, $zero, 7
+; LA64-NEXT:    st.d $a0, $sp, 0
+; LA64-NEXT:    move $a0, $zero
+; LA64-NEXT:    pcaddu18i $t8, %call36(many_args_callee)
+; LA64-NEXT:    jr $t8
+  %ret = tail call i32 @many_args_callee(i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9)
+  ret i32 %ret
+}
+
+; Here, the caller requires less stack space for its arguments than the
+; callee, so it would not be valid to do a tail-call.
+define i32 @fewer_args_tail(i32 %0, i32 %1, i32 %2, i32 %3, i32 %4) {
+; LA32-LABEL: fewer_args_tail:
+; LA32:       # %bb.0:
+; LA32-NEXT:    addi.w $sp, $sp, -16
+; LA32-NEXT:    .cfi_def_cfa_offset 16
+; LA32-NEXT:    st.w $ra, $sp, 12 # 4-byte Folded Spill
+; LA32-NEXT:    .cfi_offset 1, -4
+; LA32-NEXT:    ori $a0, $zero, 9
+; LA32-NEXT:    st.w $a0, $sp, 4
+; LA32-NEXT:    ori $a0, $zero, 8
+; LA32-NEXT:    ori $a1, $zero, 1
+; LA32-NEXT:    ori $a2, $zero, 2
+; LA32-NEXT:    ori $a3, $zero, 3
+; LA32-NEXT:    ori $a4, $zero, 4
+; LA32-NEXT:    ori $a5, $zero, 5
+; LA32-NEXT:    ori $a6, $zero, 6
+; LA32-NEXT:    ori $a7, $zero, 7
+; LA32-NEXT:    st.w $a0, $sp, 0
+; LA32-NEXT:    move $a0, $zero
+; LA32-NEXT:    bl many_args_callee
+; LA32-NEXT:    ld.w $ra, $sp, 12 # 4-byte Folded Reload
+; LA32-NEXT:    addi.w $sp, $sp, 16
+; LA32-NEXT:    ret
+;
+; LA64-LABEL: fewer_args_tail:
+; LA64:       # %bb.0:
+; LA64-NEXT:    addi.d $sp, $sp, -32
+; LA64-NEXT:    .cfi_def_cfa_offset 32
+; LA64-NEXT:    st.d $ra, $sp, 24 # 8-byte Folded Spill
+; LA64-NEXT:    .cfi_offset 1, -8
+; LA64-NEXT:    ori $a0, $zero, 9
+; LA64-NEXT:    st.d $a0, $sp, 8
+; LA64-NEXT:    ori $a0, $zero, 8
+; LA64-NEXT:    ori $a1, $zero, 1
+; LA64-NEXT:    ori $a2, $zero, 2
+; LA64-NEXT:    ori $a3, $zero, 3
+; LA64-NEXT:    ori $a4, $zero, 4
+; LA64-NEXT:    ori $a5, $zero, 5
+; LA64-NEXT:    ori $a6, $zero, 6
+; LA64-NEXT:    ori $a7, $zero, 7
+; LA64-NEXT:    st.d $a0, $sp, 0
+; LA64-NEXT:    move $a0, $zero
+; LA64-NEXT:    pcaddu18i $ra, %call36(many_args_callee)
+; LA64-NEXT:    jirl $ra, $ra, 0
+; LA64-NEXT:    ld.d $ra, $sp, 24 # 8-byte Folded Reload
+; LA64-NEXT:    addi.d $sp, $sp, 32
+; LA64-NEXT:    ret
+  %ret = tail call i32 @many_args_callee(i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9)
+  ret i32 %ret
+}
+
+declare void @foo(i32, i32, i32, i32, i32, i32, i32, i32, i32)
+
+define void @bar(i32 %0, i32 %1, i32 %2, i32 %3, i32 %4, i32 %5, i32 %6, i32 %7, i32 %8) nounwind {
+; LA32-LABEL: bar:
+; LA32:       # %bb.0: # %entry
+; LA32-NEXT:    addi.w $sp, $sp, -48
+; LA32-NEXT:    st.w $ra, $sp, 44 # 4-byte Folded Spill
+; LA32-NEXT:    st.w $fp, $sp, 40 # 4-byte Folded Spill
+; LA32-NEXT:    st.w $s0, $sp, 36 # 4-byte Folded Spill
+; LA32-NEXT:    st.w $s1, $sp, 32 # 4-byte Folded Spill
+; LA32-NEXT:    st.w $s2, $sp, 28 # 4-byte Folded Spill
+; LA32-NEXT:    st.w $s3, $sp, 24 # 4-byte Folded Spill
+; LA32-NEXT:    st.w $s4, $sp, 20 # 4-byte Folded Spill
+; LA32-NEXT:    st.w $s5, $sp, 16 # 4-byte Folded Spill
+; LA32-NEXT:    st.w $s6, $sp, 12 # 4-byte Folded Spill
+; LA32-NEXT:    move $fp, $a7
+; LA32-NEXT:    move $s0, $a6
+; LA32-NEXT:    move $s1, $a5
+; LA32-NEXT:    move $s2, $a4
+; LA32-NEXT:    move $s3, $a3
+; LA32-NEXT:    move $s4, $a2
+; LA32-NEXT:    move $s5, $a1
+; LA32-NEXT:    move $s6, $a0
+; LA32-NEXT:    ori $a0, $zero, 1
+; LA32-NEXT:    st.w $a0, $sp, 0
+; LA32-NEXT:    move $a0, $s6
+; LA32-NEXT:    bl foo
+; LA32-NEXT:    ori $a0, $zero, 2
+; LA32-NEXT:    st.w $a0, $sp, 48
+; LA32-NEXT:    move $a0, $s6
+; LA32-NEXT:    move $a1, $s5
+; LA32-NEXT:    move $a2, $s4
+; LA32-NEXT:    move $a3, $s3
+; LA32-NEXT:    move $a4, $s2
+; LA32-NEXT:    move $a5, $s1
+; LA32-NEXT:    move $a6, $s0
+; LA32-NEXT:    move $a7, $fp
+; LA32-NEXT:    ld.w $s6, $sp, 12 # 4-byte Folded Reload
+; LA32-NEXT:    ld.w $s5, $sp, 16 # 4-byte Folded Reload
+; LA32-NEXT:    ld.w $s4, $sp, 20 # 4-byte Folded Reload
+; LA32-NEXT:    ld.w $s3, $sp, 24 # 4-byte Folded Reload
+; LA32-NEXT:    ld.w $s2, $sp, 28 # 4-byte Folded Reload
+; LA32-NEXT:    ld.w $s1, $sp, 32 # 4-byte Folded Reload
+; LA32-NEXT:    ...
[truncated]

@github-actions

🐧 Linux x64 Test Results

  • 186274 tests passed
  • 4849 tests skipped

@folkertdev
Contributor

folkertdev commented Nov 20, 2025

Thanks so much for this! It looks like the riscv code is very similar to loongarch, so this approach should also work there, covering all targets that rustc would reasonably care about for an MVP.

It might be worthwhile to also test this case https://godbolt.org/z/5evjsn1zs, which x86_64 miscompiles, causing a segmentation fault. (aarch64 also miscompiled it before, but seems to have fixed this in LLVM 20.)

@heiher
Member Author

heiher commented Nov 21, 2025

It looks like the riscv code is very similar to loongarch, so this approach should also work there

I agree with you. It should also work there.

this case https://godbolt.org/z/5evjsn1zs

It looks like LoongArch already generates correct code for this case. I also noticed that byval argument test cases, such as the one below, are still failing on LoongArch (and AArch64). Does Rust's become strictly require support for byval arguments?

define dso_local i32 @callee_byval(ptr byval(i32) %0) nounwind {
  ret i32 0
}

define dso_local i32 @caller_byval(ptr byval(i32) %0) nounwind {
  %r = musttail call i32 @callee_byval(ptr byval(i32) %0)
  ret i32 %r
}

@folkertdev
Contributor

I also noticed that the byval argument test cases are still failing on LoongArch (and AArch64).

I'm not sure what you mean here.

Rust currently disallows become on a function call that uses any PassMode::Indirect arguments. I want to relax that restriction to accept (what LLVM calls) byval arguments.

However it turns out that many LLVM backends also do not support byval arguments correctly: the arm backends do support it now (I suspect since #109943), and I'll try to add support for x86_64 in #168956 by basically emulating the aarch64 approach. With loongarch (this PR), riscv (where we can basically copy this PR) and maybe powerpc/s390x (who have been quite responsive in fixing issues we run into) that should cover all of the major architectures I think.


What Rust requires is a subset of musttail: we currently only allow sibcalls (i.e. the ABI between caller and callee is a perfect match), but we want to support an arbitrary number of arguments, and we want any Rust type (that is FFI-safe/has a stable layout) to work as an argument.
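For reference, a sibcall in that sense is the trivially-matching case; a minimal (hypothetical) IR example, where caller and callee share an identical signature:

declare i32 @sib_callee(i32 %x, i32 %y)

define i32 @sib_caller(i32 %x, i32 %y) {
  ; Identical signature and calling convention on both sides: the simplest
  ; shape of guaranteed tail call, and the only one Rust accepts today.
  %r = musttail call i32 @sib_callee(i32 %x, i32 %y)
  ret i32 %r
}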

@folkertdev
Contributor

I believe I understand your question better now. The answer is yes: Rust would need, e.g., the following to work correctly:

%struct.5xi32 = type { [5 x i32] }

declare dso_local i32 @FuncFlip(ptr byval(%struct.5xi32) %0, ptr byval(%struct.5xi32) %1)

define dso_local i32 @testFlip(ptr byval(%struct.5xi32) %0, ptr byval(%struct.5xi32) %1) {
  %r = musttail call i32 @FuncFlip(ptr byval(%struct.5xi32) %1, ptr byval(%struct.5xi32) %0)
  ret i32 %r
}
