[GlobalOpt] Add TTI interface useFastCCForInternalCall for FASTCC #164768

phoebewang · 2025-10-23T07:58:27Z

Background: X86 APX feature adds 16 registers within the same 64-bit mode. PR #164638 is trying to extend such registers for FASTCC. However, a blocker issue is calling convention cannot be changeable with or without a feature.

The solution is to disable FASTCC if APX is not ready. This is an NFC change to the final code generation, becasue X86 doesn't define an alternative ABI for FASTCC in 64-bit mode. We can solve the potential compatibility issue of #164638 with this patch.

Background: X86 APX feature adds 16 registers within the same 64-bit mode. PR llvm#164638 is trying to extend such registers for FASTCC. However, a blocker issue is calling convention cannot be changeable with or without a feature. The solution is to disable FASTCC if APX is not ready. This is an NFC change to the final code generation, becasue X86 doesn't define an alternative ABI for FASTCC in 64-bit mode. We can solve the potential compatibility issue of llvm#164638 with this patch.

llvmbot · 2025-10-23T07:59:10Z

@llvm/pr-subscribers-clang
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-x86

Author: Phoebe Wang (phoebewang)

Changes

Background: X86 APX feature adds 16 registers within the same 64-bit mode. PR #164638 is trying to extend such registers for FASTCC. However, a blocker issue is calling convention cannot be changeable with or without a feature.

The solution is to disable FASTCC if APX is not ready. This is an NFC change to the final code generation, becasue X86 doesn't define an alternative ABI for FASTCC in 64-bit mode. We can solve the potential compatibility issue of #164638 with this patch.

Full diff: https://github.com/llvm/llvm-project/pull/164768.diff

9 Files Affected:

(modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+4)
(modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+2)
(modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+4)
(modified) llvm/lib/Target/X86/X86TargetTransformInfo.h (+4)
(modified) llvm/lib/Transforms/IPO/GlobalOpt.cpp (+9-6)
(modified) llvm/test/Transforms/GlobalOpt/null-check-is-use-pr35760.ll (+1-1)
(modified) llvm/test/Transforms/GlobalOpt/null-check-not-use-pr35760.ll (+1-1)
(modified) llvm/test/tools/gold/X86/merge-functions.ll (+2-2)
(modified) llvm/test/tools/gold/X86/unified-lto.ll (+2-2)

diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 5d3b233ed6b6a..f52fb448fc584 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -943,6 +943,10 @@ class TargetTransformInfo {
   ///  should use coldcc calling convention.
   LLVM_ABI bool useColdCCForColdCall(Function &F) const;
 
+  /// Return true if the input function is internal, should use fastcc calling
+  /// convention.
+  LLVM_ABI bool useFastCCForInternalCall(Function &F) const;
+
   LLVM_ABI bool isTargetIntrinsicTriviallyScalarizable(Intrinsic::ID ID) const;
 
   /// Identifies if the vector form of the intrinsic has a scalar operand.
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 4cd607c0d0c8d..064e28c504af4 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -431,6 +431,8 @@ class TargetTransformInfoImplBase {
 
   virtual bool useColdCCForColdCall(Function &F) const { return false; }
 
+  virtual bool useFastCCForInternalCall(Function &F) const { return true; }
+
   virtual bool isTargetIntrinsicTriviallyScalarizable(Intrinsic::ID ID) const {
     return false;
   }
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index bf62623099a97..dd65d8375828c 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -609,6 +609,10 @@ bool TargetTransformInfo::useColdCCForColdCall(Function &F) const {
   return TTIImpl->useColdCCForColdCall(F);
 }
 
+bool TargetTransformInfo::useFastCCForInternalCall(Function &F) const {
+  return TTIImpl->useFastCCForInternalCall(F);
+}
+
 bool TargetTransformInfo::isTargetIntrinsicTriviallyScalarizable(
     Intrinsic::ID ID) const {
   return TTIImpl->isTargetIntrinsicTriviallyScalarizable(ID);
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.h b/llvm/lib/Target/X86/X86TargetTransformInfo.h
index 133b3668a46c8..609861a53a0a0 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.h
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.h
@@ -319,6 +319,10 @@ class X86TTIImpl final : public BasicTTIImplBase<X86TTIImpl> {
   unsigned getStoreMinimumVF(unsigned VF, Type *ScalarMemTy,
                              Type *ScalarValTy) const override;
 
+  bool useFastCCForInternalCall(Function &F) const override {
+    return !ST->is64Bit() || ST->hasEGPR();
+  }
+
 private:
   bool supportsGather() const;
   InstructionCost getGSVectorCost(unsigned Opcode, TTI::TargetCostKind CostKind,
diff --git a/llvm/lib/Transforms/IPO/GlobalOpt.cpp b/llvm/lib/Transforms/IPO/GlobalOpt.cpp
index 99c4982c58b47..1516a5bb7a6c2 100644
--- a/llvm/lib/Transforms/IPO/GlobalOpt.cpp
+++ b/llvm/lib/Transforms/IPO/GlobalOpt.cpp
@@ -2018,12 +2018,15 @@ OptimizeFunctions(Module &M,
 
     if (hasChangeableCC(&F, ChangeableCCCache)) {
       // If this function has a calling convention worth changing, is not a
-      // varargs function, and is only called directly, promote it to use the
-      // Fast calling convention.
-      F.setCallingConv(CallingConv::Fast);
-      ChangeCalleesToFastCall(&F);
-      ++NumFastCallFns;
-      Changed = true;
+      // varargs function, is only called directly, and is supported by the
+      // target, promote it to use the Fast calling convention.
+      TargetTransformInfo &TTI = GetTTI(F);
+      if (TTI.useFastCCForInternalCall(F)) {
+        F.setCallingConv(CallingConv::Fast);
+        ChangeCalleesToFastCall(&F);
+        ++NumFastCallFns;
+        Changed = true;
+      }
     }
 
     if (F.getAttributes().hasAttrSomewhere(Attribute::Nest) &&
diff --git a/llvm/test/Transforms/GlobalOpt/null-check-is-use-pr35760.ll b/llvm/test/Transforms/GlobalOpt/null-check-is-use-pr35760.ll
index 70923c547940c..4a0c93f09c7df 100644
--- a/llvm/test/Transforms/GlobalOpt/null-check-is-use-pr35760.ll
+++ b/llvm/test/Transforms/GlobalOpt/null-check-is-use-pr35760.ll
@@ -12,7 +12,7 @@ define dso_local i32 @main() {
 ; CHECK-LABEL: define {{[^@]+}}@main() local_unnamed_addr {
 ; CHECK-NEXT:  bb:
 ; CHECK-NEXT:    store ptr null, ptr @_ZL3g_i, align 8
-; CHECK-NEXT:    call fastcc void @_ZL13PutsSomethingv()
+; CHECK-NEXT:    call void @_ZL13PutsSomethingv()
 ; CHECK-NEXT:    ret i32 0
 ;
 bb:
diff --git a/llvm/test/Transforms/GlobalOpt/null-check-not-use-pr35760.ll b/llvm/test/Transforms/GlobalOpt/null-check-not-use-pr35760.ll
index a499fe1e4ad92..2b92d856d1848 100644
--- a/llvm/test/Transforms/GlobalOpt/null-check-not-use-pr35760.ll
+++ b/llvm/test/Transforms/GlobalOpt/null-check-not-use-pr35760.ll
@@ -15,7 +15,7 @@ define dso_local i32 @main() {
 ; CHECK-LABEL: define {{[^@]+}}@main() local_unnamed_addr {
 ; CHECK-NEXT:  bb:
 ; CHECK-NEXT:    store ptr null, ptr @_ZL3g_i, align 8
-; CHECK-NEXT:    call fastcc void @_ZL13PutsSomethingv()
+; CHECK-NEXT:    call void @_ZL13PutsSomethingv()
 ; CHECK-NEXT:    ret i32 0
 ;
 bb:
diff --git a/llvm/test/tools/gold/X86/merge-functions.ll b/llvm/test/tools/gold/X86/merge-functions.ll
index d4a49b1c40b47..296e7aa3f76f7 100644
--- a/llvm/test/tools/gold/X86/merge-functions.ll
+++ b/llvm/test/tools/gold/X86/merge-functions.ll
@@ -11,8 +11,8 @@
 
 ; Check that we've merged foo and bar
 ; CHECK:      define dso_local noundef i32 @main()
-; CHECK-NEXT:   tail call fastcc void @bar()
-; CHECK-NEXT:   tail call fastcc void @bar()
+; CHECK-NEXT:   tail call void @bar()
+; CHECK-NEXT:   tail call void @bar()
 
 target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
 target triple = "x86_64-pc-linux-gnu"
diff --git a/llvm/test/tools/gold/X86/unified-lto.ll b/llvm/test/tools/gold/X86/unified-lto.ll
index e5030e863a64a..24eb94a08de39 100644
--- a/llvm/test/tools/gold/X86/unified-lto.ll
+++ b/llvm/test/tools/gold/X86/unified-lto.ll
@@ -25,10 +25,10 @@
 ; Constant propagation is not supported by thin LTO.
 ; With full LTO we fold argument into constant 43
 ; CHECK:       define dso_local noundef i32 @main()
-; CHECK-NEXT:    tail call fastcc void @foo()
+; CHECK-NEXT:    tail call void @foo()
 ; CHECK-NEXT:    ret i32 43
 
-; CHECK:       define internal fastcc void @foo()
+; CHECK:       define internal void @foo()
 ; CHECK-NEXT:    store i32 43, ptr @_g, align 4
 
 ; ThinLTO doesn't import foo, because the latter has noinline attribute

nikic · 2025-10-23T08:05:48Z

llvm/lib/Target/X86/X86TargetTransformInfo.h

@@ -319,6 +319,10 @@ class X86TTIImpl final : public BasicTTIImplBase<X86TTIImpl> {
  unsigned getStoreMinimumVF(unsigned VF, Type *ScalarMemTy,
                             Type *ScalarValTy) const override;

+  bool useFastCCForInternalCall(Function &F) const override {
+    return !ST->is64Bit() || ST->hasEGPR();
+  }


Don't you need to check that both caller and callee have EGPR?

I think single direction is enough. We can call a function without EGPR from EGPR enabled function, but not the opposite direction.

Hm, I don't really follow.

If we have no-EGPR -> EGPR, then the EGPR function may expect arguments to be passed in EGPR registers, while the no-EGPR function will push them to the stack.

If we have EGPR -> no-EGPR, then the EGPR function may pass arguments in EGPR registers, while the no-EGPR function will expect them to be on the stack.

Am I missing something here?

Oh, I get the point. I was confused with the inlining logic.

nikic · 2025-10-23T08:10:15Z

Hm, I'm not sure I like this. Would it make sense to potentially have multiple fastcc calling conventions (e.g. fastcc_egpr) and have the hook select the best one? For this specific case (where fastcc doesn't do anything on the baseline target) this wouldn't make a difference, but this would also accommodate other cases in the future.

phoebewang · 2025-10-23T08:16:31Z

Hm, I'm not sure I like this. Would it make sense to potentially have multiple fastcc calling conventions (e.g. fastcc_egpr) and have the hook select the best one? For this specific case (where fastcc doesn't do anything on the baseline target) this wouldn't make a difference, but this would also accommodate other cases in the future.

I don't think multiple fastcc calling conventions is useful, based on the fact that we cannot bind calling convention with a single feature. Assume we have multiple features in the future, the solution is still to disable fastcc instead of allowing multiple calling conventions coexist.

efriedma-quic · 2025-10-23T19:11:43Z

"fastcc" was originally invented in the context of targets with a single calling convention, but where that calling convention had significant limitations. Like, on 32-bit x86, the default "C" calling convention passes everything on the stack. So it's basically supposed to be a variation on the C calling convention: assuming the same underlying machine constraints, but ignoring bad decisions made when the ABI documents were originally written.

The x86-64 ABI developers learned those lessons, so fastcc is just the C calling convention; we didn't need to modify it.

If we have new architectural features, and we want a different calling convention for machines that have the feature, versus machines which don't have the feature, I think we need more specific names for those calling conventions. Having the C calling convention vary based on per-function target features is a constant source of problems for interprocedural optimizations. We don't want to extend those same problems to fastcc.

I think, if you want to pass arguments using APX registers, you should introduce a new "APX" calling convention. Probably from there, instead of a boolean useFastCCForInternalCall, you want a function CallingConv getOptimalCCForInternalCall(CallInst &CI), or something like that, to pick whatever calling convention the target prefers based on the available caller/callee target features, and maybe other heuristics.

phoebewang · 2025-10-24T02:57:48Z

I don't fully agree with the points. If fastcc is just (and always) the C calling convention, it means its existence is meaningless. I agree with its original intention and a well designed ABI doesn't need it at all. But it is the exactly reason why we retargeting it for APX. We are solving (almost) the same problem: a single calling convention has its design limitations. The difference is just, for 32-bit, it's bad design at the beginning; for APX, the design cannot fit the new feature. So, IMO, using fastcc for APX is the best choice.

The way to use getOptimalCCForInternalCall looks like over engineering to me. I can't say for other targets, but I believe we don't have the need to introduce a second fastcc in 5 years on x86. Noting, we actually has a x86_fastcallcc that used for C's fastcall. We neither use it for fasscc as you suggested, nor invent a second fastcc for so many years.

efriedma-quic · 2025-10-24T19:02:32Z

I sort of understand the logic of "we need a calling convention, fastcc isn't useful on x86-64, let's repurpose it". But there's a significant benefit to giving a calling convention with unusual semantics (like, it only works on targets that have APX enabled) its own name; it's less confusing, and it cleanly allows future extensions for all targets. And introducing a new calling convention really isn't that much more work.

It isn't likely the set of integer registers on x86 will change again, but floating-point/vector registers have been getting constant updates.

phoebewang · 2025-10-25T13:53:15Z

I sort of understand the logic of "we need a calling convention, fastcc isn't useful on x86-64, let's repurpose it". But there's a significant benefit to giving a calling convention with unusual semantics (like, it only works on targets that have APX enabled) its own name; it's less confusing, and it cleanly allows future extensions for all targets. And introducing a new calling convention really isn't that much more work.

I'm afraid I disagree. Notice the minor difference between X86_FastCall and Fast in X86CallingConv.td. Out of cautious, I think mixing up a internal only fastcc with an external named calling convention is not a good idea. A internal only calling convention has more freedom, since it's limited to the single CU. It dones't have backward and cross compilation compatibility constraints the traditional calling convention have.

It isn't likely the set of integer registers on x86 will change again, but floating-point/vector registers have been getting constant updates.

The current floating-point/vector calling convention uses 8 registers for argument passing, and they are less common than GPRs (since we use them to pass PTR and/or large type like i128 etc.). So I don't think it's a problem is short time. AVX512 has extended total register number from 16 to 32 for more than a decade. I don't see a request to change fastcc for it :)

Nevertheless, I have moved the caller check logic to target code, and kept 64-bit generating fastcc as is. Besides, it keeps the ability to be forward compatible to newer calling conventions and a target independent fastcc in the middle end. Please take another look. Thanks!

phoebewang requested review from MatzeB, RKSimon, eleviant, nikic and topperc October 23, 2025 07:58

llvmbot added backend:X86 llvm:analysis Includes value tracking, cost tables and constant folding llvm:transforms labels Oct 23, 2025

nikic requested a review from efriedma-quic October 23, 2025 08:04

nikic reviewed Oct 23, 2025

View reviewed changes

phoebewang added 3 commits October 23, 2025 19:20

Check all call sites' containing function too

03f0ed9

Merge remote-tracking branch 'origin/main' into fastcc

12869de

Fix a lit failure

175315f

llvmbot added the clang Clang issues not falling into any other category label Oct 23, 2025

Move caller check logic to target code

c95843e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[GlobalOpt] Add TTI interface useFastCCForInternalCall for FASTCC #164768

[GlobalOpt] Add TTI interface useFastCCForInternalCall for FASTCC #164768

Uh oh!

phoebewang commented Oct 23, 2025

Uh oh!

llvmbot commented Oct 23, 2025 •

edited

Loading

Uh oh!

nikic Oct 23, 2025

Uh oh!

phoebewang Oct 23, 2025

Uh oh!

nikic Oct 23, 2025

Uh oh!

phoebewang Oct 23, 2025

Uh oh!

nikic commented Oct 23, 2025

Uh oh!

phoebewang commented Oct 23, 2025

Uh oh!

efriedma-quic commented Oct 23, 2025

Uh oh!

phoebewang commented Oct 24, 2025

Uh oh!

efriedma-quic commented Oct 24, 2025

Uh oh!

phoebewang commented Oct 25, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

[GlobalOpt] Add TTI interface useFastCCForInternalCall for FASTCC #164768

Are you sure you want to change the base?

[GlobalOpt] Add TTI interface useFastCCForInternalCall for FASTCC #164768

Uh oh!

Conversation

phoebewang commented Oct 23, 2025

Uh oh!

llvmbot commented Oct 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nikic Oct 23, 2025

Choose a reason for hiding this comment

Uh oh!

phoebewang Oct 23, 2025

Choose a reason for hiding this comment

Uh oh!

nikic Oct 23, 2025

Choose a reason for hiding this comment

Uh oh!

phoebewang Oct 23, 2025

Choose a reason for hiding this comment

Uh oh!

nikic commented Oct 23, 2025

Uh oh!

phoebewang commented Oct 23, 2025

Uh oh!

efriedma-quic commented Oct 23, 2025

Uh oh!

phoebewang commented Oct 24, 2025

Uh oh!

efriedma-quic commented Oct 24, 2025

Uh oh!

phoebewang commented Oct 25, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

llvmbot commented Oct 23, 2025 •

edited

Loading