[AArch64] Enable CmpBcc fusion for Neoverse-v2 #90608
Conversation
This adds compare and branch instruction fusion for Neoverse V2. According to the Software Optimization Guide, the specific AArch64 instruction pairs that can be fused are as follows:

CMP/CMN (immediate) + B.cond
CMP/CMN (register) + B.cond

Performance for SPEC2017 is neutral, but another benchmark improves significantly. Results for SPEC2017 on a Neoverse V2:

500.perlbench 0%
502.gcc_r 0%
505.mcf_r -0.15%
523.xalancbmk_r -0.43%
525.x264_r 0%
531.deepsjeng_r 0%
541.leela_r -0.16%
557.xz_r -0.47%
@llvm/pr-subscribers-backend-aarch64 Author: Elvina Yakubova (ElvinaYakubova) Changes: This adds compare and branch instruction fusion for Neoverse V2, per the Software Optimization Guide. Performance for SPEC2017 is neutral, but another benchmark improves significantly. Full diff: https://github.com/llvm/llvm-project/pull/90608.diff 2 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64Processors.td b/llvm/lib/Target/AArch64/AArch64Processors.td
index 8772e51bf0ab42..af0bd664029f49 100644
--- a/llvm/lib/Target/AArch64/AArch64Processors.td
+++ b/llvm/lib/Target/AArch64/AArch64Processors.td
@@ -479,6 +479,7 @@ def TuneNeoverseV1 : SubtargetFeature<"neoversev1", "ARMProcFamily", "NeoverseV1
def TuneNeoverseV2 : SubtargetFeature<"neoversev2", "ARMProcFamily", "NeoverseV2",
"Neoverse V2 ARM processors", [
FeatureFuseAES,
+ FeatureCmpBccFusion,
FeatureFuseAdrpAdd,
FeatureALULSLFast,
FeaturePostRAScheduler,
diff --git a/llvm/test/CodeGen/AArch64/misched-fusion-cmp-bcc.ll b/llvm/test/CodeGen/AArch64/misched-fusion-cmp-bcc.ll
new file mode 100644
index 00000000000000..1034e072922a09
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/misched-fusion-cmp-bcc.ll
@@ -0,0 +1,35 @@
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mattr=cmp-bcc-fusion | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-a77 | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-a78 | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-a78ae | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-a78c | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-a710 | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-a715 | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-a720 | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-a720ae | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-x1 | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-x2 | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=neoverse-v2 | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=ampere1 | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=ampere1a | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=ampere1b | FileCheck %s
+
+
+define void @test_cmp_bcc_fusion(i32 %x, i32 %y, i32* %arr) {
+entry:
+ %cmp = icmp eq i32 %x, %y
+ store i32 %x, i32* %arr, align 4
+ br i1 %cmp, label %if_true, label %if_false
+
+if_true:
+ ret void
+
+if_false:
+ ret void
+}
+
+; CHECK-LABEL: test_cmp_bcc_fusion:
+; CHECK: str {{w[0-9]}}, [{{x[0-9]}}]
+; CHECK-NEXT: subs {{w[0-9]}}, {{w[0-9]}}, {{w[0-9]}}
+; CHECK-NEXT: b.ne .LBB0_2
+; CHECK-NEXT: b .LBB0_1
I actually suggested that we remove this from some cores recently, as when I've tried it in the past (a long time ago now) the performance hasn't looked great. Perhaps things have changed since then. It's certainly true that the cores can do this fusion; my suspicion was that the implementation was a little too aggressive in LLVM and led to worse codegen in places from knock-on effects. Can you give more details on the "benchmark that improves significantly"? Do you happen to know why it improves?
The benchmark is Eigen, and we've seen a 15% improvement. There are plenty of such code sequences there.
I agree with Dave that we need to look into this a little bit more and double-check that this codegen change is actually responsible for the performance uplift (and not, e.g., some secondary effect).
I believe that changes like this would be perfectly fine, and would generally be an improvement:
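For example (an illustrative sketch, not actual codegen from the patch), the scheduler moving an unrelated instruction out from between the pair, at no extra cost:

```asm
// before: an unrelated store separates the pair
cmp  w0, w1
str  w2, [x3]
b.eq .LBB0_2

// after: the pair is adjacent and the core can fuse it
str  w2, [x3]
cmp  w0, w1
b.eq .LBB0_2
```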
It is the cases where we end up spilling extra registers or just end up with extra instructions that worry me:
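For example (again an illustrative sketch, with hypothetical register numbers), sinking the compare down to sit next to the branch can extend a live range and force an extra copy:

```asm
// cmp sunk next to b.ne for fusion; w0 must now survive the
// intervening work, so register allocation inserts an extra mov
mov  w8, w0
// ... intervening instructions that clobber or reuse w0 ...
cmp  w8, #0
b.ne .LBB0_1
```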
I'm not against this change if you have run plenty of benchmarks and the performance overall is improving. It is certainly a sensible thing to do on paper. Maybe the example above isn't great (it's just a mov), but hopefully you get the idea about it potentially causing extra register pressure and extra spills.

When I compile the llvm-test-suite + 3 versions of spec, notably the instruction count overall increases by 0.1%. Everything changes, so there might be some noise going on, but ideally scheduling changes wouldn't increase the instruction count. (The code size is also slightly increasing.)

I was wondering if there was a slightly less aggressive version of macro-fusing that we could come up with that got the benefits without the increase in register pressure. It looks like the scheduler already tries to treat it that way at the moment, though, with this not being the top priority inside GenericScheduler::tryCandidate. Fusing to the ExitSU might be causing difficulties.

So I think this change is OK so long as you have run plenty of benchmarks. The results I have are not better, but not a lot worse, and could be chalked up to being within the noise margin given how many things change. It might be worth investigating whether there is a weaker form we could invent that becomes more aggressive again post-RA.
Everything you wrote is spot on, and these are things we have seen too, e.g. the instruction count, a bit of noise, and register pressure. We have one very good motivating example (Eigen), but the rest is neutral. We need to double-check that this Eigen improvement is real, because there are some minor code changes elsewhere too. E.g., we have also seen some extra MOVs being introduced (probably the extra instruction count), and eliminating those might be a good way to improve things, but I will let Elvina confirm and comment on that.
@davemgreen You’re absolutely right about tryCandidate in the scheduler pass. I tried to find a pattern to ignore such cases, but had no luck with that. I’ve also done more benchmarking (e.g. different Eigen and TSVC kernels), and generally there are more beneficial cases than regressions. Should we merge the changes in this case?
Yeah sure. Let's do it. LGTM
(I may at some point try to invent a weaker form of clustering that is a little further down the priority list, but I will trust your results as the ones I've seen may just be getting unlucky with noise, and like you say a lot of things are going up and down).
Thank you! I'll try to think about this further too as a separate task.