Conversation

sushgokh
Contributor

With this, we gain significantly on the povray benchmark from SPEC2017 (around 12% with -flto -Ofast). This is attributable to the transformation from this feature and the subsequent shrink wrapping.

We also see some improvement (around 2%) on the xalanc benchmark from SPEC2017.

There are also improvements on some internal benchmarks.

Note: All the performance changes were observed on Grace. Things should be the same for Olympus as well.
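
For illustration, here is the before/after for returning the double constant pi (0x400921FB54442D18), as exercised by the misched-fusion-addadrp.ll test in the diff below; register names are illustrative:

    // before: load the constant from the constant pool
    adrp x8, .LCPI0_0
    ldr  d0, [x8, :lo12:.LCPI0_0]

    // after: build the bit pattern in a GPR, then move it to an FP register
    mov  x8, #11544                  // bits 0-15:  0x2D18
    movk x8, #21572, lsl #16         // bits 16-31: 0x5444
    movk x8, #8699, lsl #32          // bits 32-47: 0x21FB
    movk x8, #16393, lsl #48         // bits 48-63: 0x4009
    fmov d0, x8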

@llvmbot
Member

llvmbot commented Sep 23, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Sushant Gokhale (sushgokh)

Changes

With this, we gain significantly on the povray benchmark from SPEC2017 (around 12% with -flto -Ofast). This is attributable to the transformation from this feature and the subsequent shrink wrapping.

We also see some improvement (around 2%) on the xalanc benchmark from SPEC2017.

There are also improvements on some internal benchmarks.

Note: All the performance changes were observed on Grace. Things should be the same for Olympus as well.


Full diff: https://github.com/llvm/llvm-project/pull/160257.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64Processors.td (+4-2)
  • (modified) llvm/test/CodeGen/AArch64/misched-fusion-addadrp.ll (+8-1)
  • (modified) llvm/test/CodeGen/AArch64/selectopt-const.ll (+2-1)
diff --git a/llvm/lib/Target/AArch64/AArch64Processors.td b/llvm/lib/Target/AArch64/AArch64Processors.td
index 81f5d075729d9..1d07e82acae77 100644
--- a/llvm/lib/Target/AArch64/AArch64Processors.td
+++ b/llvm/lib/Target/AArch64/AArch64Processors.td
@@ -328,7 +328,8 @@ def TuneOlympus : SubtargetFeature<"olympus", "ARMProcFamily", "Olympus",
                                    FeatureFuseAdrpAdd,
                                    FeaturePostRAScheduler,
                                    FeaturePredictableSelectIsExpensive,
-                                   FeatureUseFixedOverScalableIfEqualCost]>;
+                                   FeatureUseFixedOverScalableIfEqualCost,
+                                   FeatureFuseLiterals]>;
 
 // Note that cyclone does not fuse AES instructions, but newer apple chips do
 // perform the fusion and cyclone is used by default when targeting apple OSes.
@@ -641,7 +642,8 @@ def TuneNeoverseV2 : SubtargetFeature<"neoversev2", "ARMProcFamily", "NeoverseV2
                                       FeatureUseFixedOverScalableIfEqualCost,
                                       FeatureAvoidLDAPUR,
                                       FeaturePredictableSelectIsExpensive,
-                                      FeatureDisableLatencySchedHeuristic]>;
+                                      FeatureDisableLatencySchedHeuristic,
+                                      FeatureFuseLiterals]>;
 
 def TuneNeoverseV3 : SubtargetFeature<"neoversev3", "ARMProcFamily", "NeoverseV3",
                                       "Neoverse V3 ARM processors", [
diff --git a/llvm/test/CodeGen/AArch64/misched-fusion-addadrp.ll b/llvm/test/CodeGen/AArch64/misched-fusion-addadrp.ll
index 70b6b91d3cf66..4b77e9eb71faf 100644
--- a/llvm/test/CodeGen/AArch64/misched-fusion-addadrp.ll
+++ b/llvm/test/CodeGen/AArch64/misched-fusion-addadrp.ll
@@ -12,7 +12,8 @@
 ; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=neoverse-n1     | FileCheck %s
 ; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=neoverse-v1     | FileCheck %s
 ; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=neoverse-n2     | FileCheck %s
-; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=neoverse-v2     | FileCheck %s
+; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=neoverse-v2     | FileCheck %s --check-prefix FUSE-LITERALS
+; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=olympus         | FileCheck %s --check-prefix FUSE-LITERALS
 ; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=apple-a16 -mattr=-fuse-literals | FileCheck %s
 ; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=apple-a17 -mattr=-fuse-literals | FileCheck %s
 ; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=ampere1  -mattr=-fuse-literals | FileCheck %s
@@ -38,6 +39,12 @@ define double @litf() {
 ; CHECK-LABEL: litf:
 ; CHECK:      adrp [[ADDR:x[0-9]+]], [[CSTLABEL:.LCP.*]]
 ; CHECK-NEXT: ldr  {{d[0-9]+}}, {{[[]}}[[ADDR]], :lo12:[[CSTLABEL]]{{[]]}}
+;
+; FUSE-LITERALS: mov     [[R:x[0-9]+]], #11544
+; FUSE-LITERALS: movk    [[R]], #21572, lsl #16
+; FUSE-LITERALS: movk    [[R]], #8699, lsl #32
+; FUSE-LITERALS: movk    [[R]], #16393, lsl #48
+; FUSE-LITERALS: fmov    {{d[0-9]+}}, [[R]]
 entry:
   ret double 0x400921FB54442D18
 }
diff --git a/llvm/test/CodeGen/AArch64/selectopt-const.ll b/llvm/test/CodeGen/AArch64/selectopt-const.ll
index fe48dbaf1ab76..62ac297153962 100644
--- a/llvm/test/CodeGen/AArch64/selectopt-const.ll
+++ b/llvm/test/CodeGen/AArch64/selectopt-const.ll
@@ -1,5 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
 ; RUN: llc -mtriple=aarch64-linux-gnu -mcpu=neoverse-v2 -O3 < %s | FileCheck %s
+; RUN: llc -mtriple=aarch64-linux-gnu -mcpu=olympus -O3 < %s | FileCheck %s
 
 define i32 @test_const(ptr %in1, ptr %in2, ptr %out, i32 %n, ptr %tbl) {
 ; CHECK-LABEL: test_const:
@@ -8,10 +9,10 @@ define i32 @test_const(ptr %in1, ptr %in2, ptr %out, i32 %n, ptr %tbl) {
 ; CHECK-NEXT:    b.lt .LBB0_3
 ; CHECK-NEXT:  // %bb.1: // %for.body.preheader
 ; CHECK-NEXT:    mov w9, #1267 // =0x4f3
+; CHECK-NEXT:    movk w9, #16309, lsl #16
 ; CHECK-NEXT:    fmov s1, #1.00000000
 ; CHECK-NEXT:    fmov d2, #5.00000000
 ; CHECK-NEXT:    mov w8, w3
-; CHECK-NEXT:    movk w9, #16309, lsl #16
 ; CHECK-NEXT:    fmov s0, w9
 ; CHECK-NEXT:    mov w9, #16 // =0x10
 ; CHECK-NEXT:    .p2align 5, , 16

@davemgreen
Collaborator

The fused instructions are listed in the software optimization guide, and do not include movk AFAIU. I'm guessing that the real transform you want isn't whether movk is fused for scheduling, but whether a load or movk+fmov is used for materializing an fp constant? fmov has limited bandwidth so should be avoided, but in load-heavy code (or where a load blocks other transforms, as it sounds like is happening here) the movks might be quicker.

Can we fix that directly? Either by changing how constants are materialized, or by making it so there isn't the load in the entry block of whatever function is causing the problem?

@sushgokh
Contributor Author

but whether a load or movk+fmov is used for materializing an fp constant? fmov has limited bandwidth so should be avoided, but in load-heavy code (or where a load blocks other transforms, as it sounds like is happening here) the movks might be quicker.

You are right about the adrp+ldr sequence being replaced by a sequence of 'movk+fmov'. For the code concerned, the movk instructions are a lot cheaper.

Can we fix that directly? Either by changing how constants are materialized, or by making it so there isn't the load in the entry block of whatever function is causing the problem?

  1. "Either by changing how constants are materialized" - this is the feature that enables this. I don't know of any other way to do it.
  2. "making it so there isn't the load in the entry block" - We have a load at the start of a BB followed by a cmp/branch (roughly the shape sketched below). Let me check if we can hoist that load.
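
Roughly the shape in question (registers and labels invented for illustration):

    entry:
        adrp x8, .LCPI0_0                 // constant-pool load at the start
        ldr  d0, [x8, :lo12:.LCPI0_0]     // of the entry block...
        cmp  w0, #1
        b.lt .Lexit                       // ...ahead of the early-exit branch,
                                          // which keeps the callee-saved
                                          // spills in the entry block instead
                                          // of letting shrink wrapping sink
                                          // them past the branch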

@sushgokh
Contributor Author

Can we fix that directly? Either by changing how constants are materialized, or by making it so there isn't the load in the entry block of whatever function is causing the problem?

Now I see what you mean. Let me check the way constants are materialized and see if this can be made core-specific.

Rather than adding the new feature, this changes the way constants are
materialized for Grace and Olympus.
Collaborator

@davemgreen left a comment

Can you provide an example of the function this helps with? You said it was something in povray? I'm wondering if it is something we can fix more directly. Why are we hoisting a constant into the entry block of a function that can otherwise be shrink-wrapped? Why does a load from a constant pool block shrink wrapping?

Comment on lines +12567 to +12573
// If the constant to be materialized is scalar, it may be efficient to use
// a sequence of 'mov + fmov' rather than 'adrp + ldr' on specified CPUs.
// However, when materializing a vector of constants, there are two things
// to note:
// 1. Throughput of the fmov instruction is very low.
// 2. The ldr instruction can load multiple constants in one go. Also, its
// throughput is higher compared to fmov.
Collaborator

Does this say "fmovs limit throughput, loads are great", but then go on to use the fmov version for these CPUs?

Contributor Author

"fmovs limit throughput, loads are great".

We want to be cautious when we are materializing a vector of constants. So I have used "may be efficient" to convey that we are pessimistic here.

Comment on lines +12574 to +12575
if (!VT.isVector() && (Subtarget->getCPU() == "neoverse-v2" ||
                       Subtarget->getCPU() == "olympus"))
Collaborator

We don't like to add checks like this on the cpu name. It is better to add a subtarget feature for it.

Contributor Author

ok sure

// throughput is higher compared to fmov.
if (!VT.isVector() && (Subtarget->getCPU() == "neoverse-v2" ||
                       Subtarget->getCPU() == "olympus"))
  return true;
Collaborator

We would probably want to handle minsize/optsize like below. It would be the Subtarget->hasFuseLiterals that should probably change, being replaced with a new subtarget feature.
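
A rough sketch of that shape, assuming the enclosing function can see an OptForSize flag and using a hypothetical feature name (illustrative only, not the actual patch):

    // Hypothetical: gate on a dedicated subtarget feature instead of the
    // CPU name, and keep the adrp+ldr form when optimizing for size, since
    // mov + 3x movk + fmov is a larger encoding than adrp + ldr.
    if (!VT.isVector() && !OptForSize &&
        Subtarget->useMovImmFPMaterialization())
      return true;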

Contributor Author

ok sure

@sushgokh
Contributor Author

Can you provide an example of the function this helps with? You said it was something in povray? I'm wondering if it is something we can fix more directly. Why are we hoisting a constant into the entry block of a function that can otherwise be shrink-wrapped? Why does a load from a constant pool block shrink wrapping?

It is not the constant load being shrink wrapped. There are two things happening here:

  1. The adrp+ldr sequence being converted to movs+fmov.
  2. Shrink wrapping. This is not of the above ldr in (1). It's enabling shrink wrapping of callee-saved registers and other stuff that was not happening otherwise.

(1) and (2) are unrelated in this sense.

@sushgokh
Contributor Author

sushgokh commented Oct 6, 2025

@davemgreen I was on leave last week and couldn't work on this.

I had a look at the assembly sequence again (before and after) and there is a host of things going on in 'after' (not necessarily in this order):

  1. Conversion of the same ldr, present in 2 immediate successors of the start block, to mov+fmov.
  2. Hoisting this sequence into the common block.
  3. Block merging (i.e. merging one of the blocks previously containing the ldr into the start block).
  4. The control flow optimizer pass marking one of the blocks as an early exit, since shrink wrapping has already happened.

I need to investigate why (2) and the shrink wrapping weren't happening previously. This will take some time.

sushgokh added a commit to sushgokh/llvm-project that referenced this pull request Oct 8, 2025
wrapping

Shrink wrapping treats a load from the constant pool as a stack access. This
is not correct: constants are stored in a read-only section AFAIU. This
prevents shrink wrapping from kicking in.

(Related to PR llvm#160257. PR llvm#160257 will be closed.)
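
A minimal sketch of that idea (illustrative, not the actual follow-up patch; uses llvm::all_of from STLExtras.h):

    // Hypothetical helper: a load whose memory operands all refer to the
    // constant pool reads from a read-only section, not the stack frame,
    // so shrink wrapping should not treat it as a stack access.
    static bool isConstantPoolLoad(const MachineInstr &MI) {
      if (!MI.mayLoad() || MI.memoperands_empty())
        return false;
      return llvm::all_of(MI.memoperands(), [](const MachineMemOperand *MMO) {
        const PseudoSourceValue *PSV = MMO->getPseudoValue();
        return PSV && PSV->isConstantPool();
      });
    }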