Add FeatureFuseLiterals as SubTargetFeature for Grace and Olympus #160257
Conversation
With this, we gain significantly on the povray benchmark from SPEC17 (around 12% with -flto -Ofast). This is attributable to the transformation from this feature and subsequent shrink wrapping. We also see some improvement (around 2%) on the xalanc benchmark from SPEC17, and there are improvements on some internal benchmarks as well.
@llvm/pr-subscribers-backend-aarch64
Author: Sushant Gokhale (sushgokh)
Changes: (same as the description above)
Note: All the performance changes were observed on Grace; things should be the same for Olympus as well.
Full diff: https://github.com/llvm/llvm-project/pull/160257.diff 3 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64Processors.td b/llvm/lib/Target/AArch64/AArch64Processors.td
index 81f5d075729d9..1d07e82acae77 100644
--- a/llvm/lib/Target/AArch64/AArch64Processors.td
+++ b/llvm/lib/Target/AArch64/AArch64Processors.td
@@ -328,7 +328,8 @@ def TuneOlympus : SubtargetFeature<"olympus", "ARMProcFamily", "Olympus",
FeatureFuseAdrpAdd,
FeaturePostRAScheduler,
FeaturePredictableSelectIsExpensive,
- FeatureUseFixedOverScalableIfEqualCost]>;
+ FeatureUseFixedOverScalableIfEqualCost,
+ FeatureFuseLiterals]>;
// Note that cyclone does not fuse AES instructions, but newer apple chips do
// perform the fusion and cyclone is used by default when targeting apple OSes.
@@ -641,7 +642,8 @@ def TuneNeoverseV2 : SubtargetFeature<"neoversev2", "ARMProcFamily", "NeoverseV2
FeatureUseFixedOverScalableIfEqualCost,
FeatureAvoidLDAPUR,
FeaturePredictableSelectIsExpensive,
- FeatureDisableLatencySchedHeuristic]>;
+ FeatureDisableLatencySchedHeuristic,
+ FeatureFuseLiterals]>;
def TuneNeoverseV3 : SubtargetFeature<"neoversev3", "ARMProcFamily", "NeoverseV3",
"Neoverse V3 ARM processors", [
diff --git a/llvm/test/CodeGen/AArch64/misched-fusion-addadrp.ll b/llvm/test/CodeGen/AArch64/misched-fusion-addadrp.ll
index 70b6b91d3cf66..4b77e9eb71faf 100644
--- a/llvm/test/CodeGen/AArch64/misched-fusion-addadrp.ll
+++ b/llvm/test/CodeGen/AArch64/misched-fusion-addadrp.ll
@@ -12,7 +12,8 @@
; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=neoverse-n1 | FileCheck %s
; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=neoverse-v1 | FileCheck %s
; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=neoverse-n2 | FileCheck %s
-; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=neoverse-v2 | FileCheck %s
+; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=neoverse-v2 | FileCheck %s --check-prefix FUSE-LITERALS
+; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=olympus | FileCheck %s --check-prefix FUSE-LITERALS
; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=apple-a16 -mattr=-fuse-literals | FileCheck %s
; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=apple-a17 -mattr=-fuse-literals | FileCheck %s
; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=ampere1 -mattr=-fuse-literals | FileCheck %s
@@ -38,6 +39,12 @@ define double @litf() {
; CHECK-LABEL: litf:
; CHECK: adrp [[ADDR:x[0-9]+]], [[CSTLABEL:.LCP.*]]
; CHECK-NEXT: ldr {{d[0-9]+}}, {{[[]}}[[ADDR]], :lo12:[[CSTLABEL]]{{[]]}}
+;
+; FUSE-LITERALS: mov [[R:x[0-9]+]], #11544
+; FUSE-LITERALS: movk [[R]], #21572, lsl #16
+; FUSE-LITERALS: movk [[R]], #8699, lsl #32
+; FUSE-LITERALS: movk [[R]], #16393, lsl #48
+; FUSE-LITERALS: fmov {{d[0-9]+}}, [[R]]
entry:
ret double 0x400921FB54442D18
}
diff --git a/llvm/test/CodeGen/AArch64/selectopt-const.ll b/llvm/test/CodeGen/AArch64/selectopt-const.ll
index fe48dbaf1ab76..62ac297153962 100644
--- a/llvm/test/CodeGen/AArch64/selectopt-const.ll
+++ b/llvm/test/CodeGen/AArch64/selectopt-const.ll
@@ -1,5 +1,6 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc -mtriple=aarch64-linux-gnu -mcpu=neoverse-v2 -O3 < %s | FileCheck %s
+; RUN: llc -mtriple=aarch64-linux-gnu -mcpu=olympus -O3 < %s | FileCheck %s
define i32 @test_const(ptr %in1, ptr %in2, ptr %out, i32 %n, ptr %tbl) {
; CHECK-LABEL: test_const:
@@ -8,10 +9,10 @@ define i32 @test_const(ptr %in1, ptr %in2, ptr %out, i32 %n, ptr %tbl) {
; CHECK-NEXT: b.lt .LBB0_3
; CHECK-NEXT: // %bb.1: // %for.body.preheader
; CHECK-NEXT: mov w9, #1267 // =0x4f3
+; CHECK-NEXT: movk w9, #16309, lsl #16
; CHECK-NEXT: fmov s1, #1.00000000
; CHECK-NEXT: fmov d2, #5.00000000
; CHECK-NEXT: mov w8, w3
-; CHECK-NEXT: movk w9, #16309, lsl #16
; CHECK-NEXT: fmov s0, w9
; CHECK-NEXT: mov w9, #16 // =0x10
; CHECK-NEXT: .p2align 5, , 16
The fused instructions are listed in the software optimization guide, and they do not include movk AFAIU. I'm guessing that the real transform you want isn't whether movk is fused for scheduling, but whether a load or movk+fmov is used for materializing an fp constant? fmov has limited bandwidth so it should be avoided, but in load-heavy code (or where a load blocks other transforms, as it sounds like is happening here) the movks might be quicker. Can we fix that directly? Either by changing how constants are materialized, or by making it so there isn't the load in the entry block of whatever function is causing the problem.
You are right about
Now I see what you mean. Let me check the way constants are materialized and see if this can be made core-specific.
Rather than adding the new feature, this changes how constants are materialized for Grace and Olympus.
Can you provide an example of the function this helps with? You said it was something in povray? I'm wondering if it is something we can fix more directly. Why are we hoisting a constant into the entry block of a function that can otherwise be shrink-wrapped? Why does a load from a constant pool block shrink wrapping?
// If the constant to be materialized is scalar, it may be efficient to use
// a sequence of 'mov + fmov' rather than 'adrp + ldr' on the specified CPUs.
// However, when materializing a vector of constants, there are two things to
// note:
// 1. The throughput of the fmov instruction is very low.
// 2. The ldr instruction can load multiple constants in one go. Also, its
//    throughput is higher compared to fmov.
Does this say "fmovs limit throughput, loads are great", but then goes on to use the fmov version for these cpus?
"fmovs limit throughput, loads are great".
We want to be cautious when materializing a vector of constants, so I used "may be efficient" to describe that we are pessimistic here.
if (!VT.isVector() && (Subtarget->getCPU() == "neoverse-v2" ||
                       Subtarget->getCPU() == "olympus"))
We don't like to add checks like this on the cpu name. It is better to add a subtarget feature for it.
ok sure
// throughput is higher as compared to fmov.
if (!VT.isVector() && (Subtarget->getCPU() == "neoverse-v2" ||
                       Subtarget->getCPU() == "olympus"))
  return true;
We would probably want to handle minsize/optsize like below. It would be Subtarget->hasFuseLiterals that should probably change, being replaced with a new subtarget feature.
ok sure
It is not the constant load being shrink wrapped. There are two things happening here:
(1) and (2) are unrelated in this sense.
@davemgreen I was on leave last week and couldn't work on this. I had a look at the assembly sequence again (before and after), and there are a host of things going on in 'after' (not necessarily in the same sequence).
I need to investigate why (2) and shrink wrapping weren't happening previously. This will take some time.
Shrink wrapping treats a load from the constant pool as a stack access. This is not correct: constants are stored in a read-only section AFAIU. This prevents shrink wrapping from kicking in. (Related to PR llvm#160257. PR llvm#160257 will be closed.)