Conversation

sushgokh
Contributor

With this, we gain significantly on the povray benchmark from SPEC2017 (around 12% with -flto -Ofast). This is attributable to the transformation from this feature and the subsequent shrink wrapping.

We also see some improvement (around 2%) on the xalanc benchmark from SPEC2017.

There are also improvements on some internal benchmarks.

Note: All the performance changes were observed on Grace. Things should be the same for Olympus as well.
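
For illustration, here is the before/after for returning the double constant pi (0x400921FB54442D18), as exercised by the misched-fusion-addadrp.ll test in the diff below; register names are illustrative:

    // before: load the constant from the constant pool
    adrp x8, .LCPI0_0
    ldr  d0, [x8, :lo12:.LCPI0_0]

    // after: build the bit pattern in a GPR, then move it to an FP register
    mov  x8, #11544                  // bits 0-15:  0x2D18
    movk x8, #21572, lsl #16         // bits 16-31: 0x5444
    movk x8, #8699, lsl #32          // bits 32-47: 0x21FB
    movk x8, #16393, lsl #48         // bits 48-63: 0x4009
    fmov d0, x8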

@llvmbot
Member

llvmbot commented Sep 23, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Sushant Gokhale (sushgokh)

Changes

With this, we gain significantly on the povray benchmark from SPEC2017 (around 12% with -flto -Ofast). This is attributable to the transformation from this feature and the subsequent shrink wrapping.

We also see some improvement (around 2%) on the xalanc benchmark from SPEC2017.

There are also improvements on some internal benchmarks.

Note: All the performance changes were observed on Grace. Things should be the same for Olympus as well.


Full diff: https://github.com/llvm/llvm-project/pull/160257.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64Processors.td (+4-2)
  • (modified) llvm/test/CodeGen/AArch64/misched-fusion-addadrp.ll (+8-1)
  • (modified) llvm/test/CodeGen/AArch64/selectopt-const.ll (+2-1)
diff --git a/llvm/lib/Target/AArch64/AArch64Processors.td b/llvm/lib/Target/AArch64/AArch64Processors.td
index 81f5d075729d9..1d07e82acae77 100644
--- a/llvm/lib/Target/AArch64/AArch64Processors.td
+++ b/llvm/lib/Target/AArch64/AArch64Processors.td
@@ -328,7 +328,8 @@ def TuneOlympus : SubtargetFeature<"olympus", "ARMProcFamily", "Olympus",
                                    FeatureFuseAdrpAdd,
                                    FeaturePostRAScheduler,
                                    FeaturePredictableSelectIsExpensive,
-                                   FeatureUseFixedOverScalableIfEqualCost]>;
+                                   FeatureUseFixedOverScalableIfEqualCost,
+                                   FeatureFuseLiterals]>;
 
 // Note that cyclone does not fuse AES instructions, but newer apple chips do
 // perform the fusion and cyclone is used by default when targeting apple OSes.
@@ -641,7 +642,8 @@ def TuneNeoverseV2 : SubtargetFeature<"neoversev2", "ARMProcFamily", "NeoverseV2
                                       FeatureUseFixedOverScalableIfEqualCost,
                                       FeatureAvoidLDAPUR,
                                       FeaturePredictableSelectIsExpensive,
-                                      FeatureDisableLatencySchedHeuristic]>;
+                                      FeatureDisableLatencySchedHeuristic,
+                                      FeatureFuseLiterals]>;
 
 def TuneNeoverseV3 : SubtargetFeature<"neoversev3", "ARMProcFamily", "NeoverseV3",
                                       "Neoverse V3 ARM processors", [
diff --git a/llvm/test/CodeGen/AArch64/misched-fusion-addadrp.ll b/llvm/test/CodeGen/AArch64/misched-fusion-addadrp.ll
index 70b6b91d3cf66..4b77e9eb71faf 100644
--- a/llvm/test/CodeGen/AArch64/misched-fusion-addadrp.ll
+++ b/llvm/test/CodeGen/AArch64/misched-fusion-addadrp.ll
@@ -12,7 +12,8 @@
 ; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=neoverse-n1     | FileCheck %s
 ; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=neoverse-v1     | FileCheck %s
 ; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=neoverse-n2     | FileCheck %s
-; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=neoverse-v2     | FileCheck %s
+; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=neoverse-v2     | FileCheck %s --check-prefix FUSE-LITERALS
+; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=olympus         | FileCheck %s --check-prefix FUSE-LITERALS
 ; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=apple-a16 -mattr=-fuse-literals | FileCheck %s
 ; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=apple-a17 -mattr=-fuse-literals | FileCheck %s
 ; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=ampere1  -mattr=-fuse-literals | FileCheck %s
@@ -38,6 +39,12 @@ define double @litf() {
 ; CHECK-LABEL: litf:
 ; CHECK:      adrp [[ADDR:x[0-9]+]], [[CSTLABEL:.LCP.*]]
 ; CHECK-NEXT: ldr  {{d[0-9]+}}, {{[[]}}[[ADDR]], :lo12:[[CSTLABEL]]{{[]]}}
+;
+; FUSE-LITERALS: mov     [[R:x[0-9]+]], #11544
+; FUSE-LITERALS: movk    [[R]], #21572, lsl #16
+; FUSE-LITERALS: movk    [[R]], #8699, lsl #32
+; FUSE-LITERALS: movk    [[R]], #16393, lsl #48
+; FUSE-LITERALS: fmov    {{d[0-9]+}}, [[R]]
 entry:
   ret double 0x400921FB54442D18
 }
diff --git a/llvm/test/CodeGen/AArch64/selectopt-const.ll b/llvm/test/CodeGen/AArch64/selectopt-const.ll
index fe48dbaf1ab76..62ac297153962 100644
--- a/llvm/test/CodeGen/AArch64/selectopt-const.ll
+++ b/llvm/test/CodeGen/AArch64/selectopt-const.ll
@@ -1,5 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
 ; RUN: llc -mtriple=aarch64-linux-gnu -mcpu=neoverse-v2 -O3 < %s | FileCheck %s
+; RUN: llc -mtriple=aarch64-linux-gnu -mcpu=olympus -O3 < %s | FileCheck %s
 
 define i32 @test_const(ptr %in1, ptr %in2, ptr %out, i32 %n, ptr %tbl) {
 ; CHECK-LABEL: test_const:
@@ -8,10 +9,10 @@ define i32 @test_const(ptr %in1, ptr %in2, ptr %out, i32 %n, ptr %tbl) {
 ; CHECK-NEXT:    b.lt .LBB0_3
 ; CHECK-NEXT:  // %bb.1: // %for.body.preheader
 ; CHECK-NEXT:    mov w9, #1267 // =0x4f3
+; CHECK-NEXT:    movk w9, #16309, lsl #16
 ; CHECK-NEXT:    fmov s1, #1.00000000
 ; CHECK-NEXT:    fmov d2, #5.00000000
 ; CHECK-NEXT:    mov w8, w3
-; CHECK-NEXT:    movk w9, #16309, lsl #16
 ; CHECK-NEXT:    fmov s0, w9
 ; CHECK-NEXT:    mov w9, #16 // =0x10
 ; CHECK-NEXT:    .p2align 5, , 16

@davemgreen
Collaborator

The fused instructions are listed in the software optimization guide, and do not include movk AFAIU. I'm guessing that the real transform you want isn't whether movk is fused for scheduling, but whether a load or movk+fmov is used for materializing an fp constant? fmov has limited bandwidth so should be avoided, but in load-heavy code (or where a load blocks other transforms, as it sounds like is happening here) the movks might be quicker.

Can we fix that directly? Either by changing how constants are materialized, or by making it so there isn't the load in the entry block of whatever function is causing the problem?

@sushgokh
Contributor Author

but whether a load or movk+fmov is used for materializing an fp constant? fmov has limited bandwidth so should be avoided, but in load-heavy code (or where a load blocks other transforms, as it sounds like is happening here) the movks might be quicker.

You are right about the adrp+ldr sequence being replaced by a sequence of 'movk+fmov'. For the code concerned, the movk instructions are a lot cheaper.

Can we fix that directly? Either by changing how constants are materialized, or by making it so there isn't the load in the entry block of whatever function is causing the problem?

  1. "Either by changing how constants are materialized" - this is the feature that enables this. I don't know of any other way to do it.
  2. "making it so there isn't the load in the entry block" - We have a load at the start of a BB followed by a cmp/branch (roughly the shape sketched below). Let me check if we can hoist that load.
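
Roughly the shape in question (registers and labels invented for illustration):

    entry:
        adrp x8, .LCPI0_0                 // constant-pool load at the start
        ldr  d0, [x8, :lo12:.LCPI0_0]     // of the entry block...
        cmp  w0, #1
        b.lt .Lexit                       // ...ahead of the early-exit branch,
                                          // which keeps the callee-saved
                                          // spills in the entry block instead
                                          // of letting shrink wrapping sink
                                          // them past the branch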

@sushgokh
Contributor Author

Can we fix that directly? Either by changing how constants are materialized, or by making it so there isn't the load in the entry block of whatever function is causing the problem?

Now I see what you mean. Let me check the way constants are materialized and see if this can be made core-specific.

Rather than adding the new feature, this changes the way constants are
materialized for Grace and Olympus.
Collaborator

@davemgreen left a comment

Can you provide an example of the function this helps with? You said it was something in povray? I'm wondering if it is something we can fix more directly. Why are we hoisting a constant into the entry block of a function that can otherwise be shrink-wrapped? Why does a load from a constant pool block shrink wrapping?

Comment on lines +12567 to +12573
// If the constant to be materialized is scalar, it may be efficient to use
// a sequence of 'mov + fmov' rather than 'adrp + ldr' on specified CPUs.
// However, when materializing a vector of constants, there are two things
// to note:
// 1. Throughput of the fmov instruction is very low.
// 2. The ldr instruction can load multiple constants in one go. Also, its
// throughput is higher compared to fmov.
Collaborator

Does this say "fmovs limit throughput, loads are great", but then go on to use the fmov version for these CPUs?

Contributor Author

"fmovs limit throughput, loads are great".

We want to be cautious when we are materializing a vector of constants. So I have used "may be efficient" to convey that we are pessimistic here.

Comment on lines +12574 to +12575
if (!VT.isVector() && (Subtarget->getCPU() == "neoverse-v2" ||
                       Subtarget->getCPU() == "olympus"))
Collaborator

We don't like to add checks like this on the cpu name. It is better to add a subtarget feature for it.

Contributor Author

ok sure

// throughput is higher compared to fmov.
if (!VT.isVector() && (Subtarget->getCPU() == "neoverse-v2" ||
                       Subtarget->getCPU() == "olympus"))
  return true;
Collaborator

We would probably want to handle minsize/optsize like below. It would be the Subtarget->hasFuseLiterals that should probably change, being replaced with a new subtarget feature.
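
A rough sketch of that shape, assuming the enclosing function can see an OptForSize flag and using a hypothetical feature name (illustrative only, not the actual patch):

    // Hypothetical: gate on a dedicated subtarget feature instead of the
    // CPU name, and keep the adrp+ldr form when optimizing for size, since
    // mov + 3x movk + fmov is a larger encoding than adrp + ldr.
    if (!VT.isVector() && !OptForSize &&
        Subtarget->useMovImmFPMaterialization())
      return true;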

Contributor Author

ok sure

@sushgokh
Contributor Author

Can you provide an example of the function this helps with? You said it was something in povray? I'm wondering if it is something we can fix more directly. Why are we hoisting a constant into the entry block of a function that can otherwise be shrink-wrapped? Why does a load from a constant pool block shrink wrapping?

It is not the constant load being shrink wrapped. There are two things happening here:

  1. The adrp+ldr sequence being converted to movs+fmov.
  2. Shrink wrapping. This is not of the above ldr in (1). It's enabling shrink wrapping of callee-saved registers and other stuff that was not happening otherwise.

(1) and (2) are unrelated in this sense.

@sushgokh
Contributor Author

sushgokh commented Oct 6, 2025

@davemgreen I was on leave last week and couldn't work on this.

I had a look at the assembly sequence again (before and after) and there is a host of things going on in 'after' (not necessarily in this order):

  1. Conversion of the same ldr, present in 2 immediate successors of the start block, to mov+fmov.
  2. Hoisting this sequence into the common block.
  3. Block merging (i.e. merging one of the blocks previously containing the ldr into the start block).
  4. The control flow optimizer pass marking one of the blocks as an early exit, since shrink wrapping has already happened.

I need to investigate why (2) and the shrink wrapping weren't happening previously. This will take some time.

sushgokh added a commit to sushgokh/llvm-project that referenced this pull request Oct 8, 2025
wrapping

Shrink wrapping treats a load from the constant pool as a stack access. This
is not correct: constants are stored in a read-only section AFAIU. This
prevents shrink wrapping from kicking in.

(Related to PR llvm#160257. PR llvm#160257 will be closed.)
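
A minimal sketch of that idea (illustrative, not the actual follow-up patch; uses llvm::all_of from STLExtras.h):

    // Hypothetical helper: a load whose memory operands all refer to the
    // constant pool reads from a read-only section, not the stack frame,
    // so shrink wrapping should not treat it as a stack access.
    static bool isConstantPoolLoad(const MachineInstr &MI) {
      if (!MI.mayLoad() || MI.memoperands_empty())
        return false;
      return llvm::all_of(MI.memoperands(), [](const MachineMemOperand *MMO) {
        const PseudoSourceValue *PSV = MMO->getPseudoValue();
        return PSV && PSV->isConstantPool();
      });
    }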