[AArch64] Enable CmpBcc fusion for Neoverse-v2 #90608

Merged: 1 commit into llvm:main, Jun 4, 2024

Conversation

@ElvinaYakubova (Contributor)

This adds fusion of compare and branch instructions for Neoverse V2. According to the Software Optimization Guide, the specific AArch64 instruction pairs that can be fused are as follows:
CMP/CMN (immediate) + B.cond
CMP/CMN (register) + B.cond
Performance for SPEC2017 is neutral, but another benchmark improves significantly.
Results for SPEC2017 on a Neoverse V2:
500.perlbench 0%
502.gcc_r 0%
505.mcf_r -0.15%
523.xalancbmk_r -0.43%
525.x264_r 0%
531.deepsjeng_r 0%
541.leela_r -0.16%
557.xz_r -0.47%
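
For context on how the feature takes effect (an editor's sketch, not part of the patch): FeatureCmpBccFusion sets a subtarget flag that the AArch64 macro-fusion hook consults when deciding whether to add a cluster edge between a flag-setting compare and the following conditional branch. A heavily simplified C++ sketch of that check, modelled on llvm/lib/Target/AArch64/AArch64MacroFusion.cpp with the opcode list abbreviated (illustrative, not the verbatim upstream code):

// Editor's sketch: the shape of the cmp + b.cond check used for macro
// fusion. Meant to live inside the AArch64 target directory, where the
// AArch64:: opcode and register enums are visible.
#include "AArch64InstrInfo.h"           // AArch64::* opcodes and registers
#include "llvm/CodeGen/MachineInstr.h"

using namespace llvm;

static bool isCmpBccPair(const MachineInstr *FirstMI,
                         const MachineInstr &SecondMI) {
  // The second instruction of the candidate pair must be a conditional
  // branch (B.cond).
  if (SecondMI.getOpcode() != AArch64::Bcc)
    return false;

  // A null FirstMI means the predecessor is unknown; conservatively treat
  // it as fusible so the branch stays available for pairing.
  if (!FirstMI)
    return true;

  // CMP/CMN are SUBS/ADDS aliases; the fusible forms are the immediate and
  // register variants whose result is discarded into WZR/XZR.
  switch (FirstMI->getOpcode()) {
  case AArch64::SUBSWri: // cmp wN, #imm
  case AArch64::SUBSXri: // cmp xN, #imm
  case AArch64::SUBSWrr: // cmp wN, wM
  case AArch64::SUBSXrr: // cmp xN, xM
  case AArch64::ADDSWri: // cmn wN, #imm
  case AArch64::ADDSXri: // cmn xN, #imm
  case AArch64::ADDSWrr: // cmn wN, wM
  case AArch64::ADDSXrr: // cmn xN, xM
    return FirstMI->getOperand(0).getReg() == AArch64::WZR ||
           FirstMI->getOperand(0).getReg() == AArch64::XZR;
  }
  return false;
}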

@llvmbot (Collaborator) commented Apr 30, 2024

@llvm/pr-subscribers-backend-aarch64

Author: Elvina Yakubova (ElvinaYakubova)

Full diff: https://github.com/llvm/llvm-project/pull/90608.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64Processors.td (+1)
  • (added) llvm/test/CodeGen/AArch64/misched-fusion-cmp-bcc.ll (+35)
diff --git a/llvm/lib/Target/AArch64/AArch64Processors.td b/llvm/lib/Target/AArch64/AArch64Processors.td
index 8772e51bf0ab42..af0bd664029f49 100644
--- a/llvm/lib/Target/AArch64/AArch64Processors.td
+++ b/llvm/lib/Target/AArch64/AArch64Processors.td
@@ -479,6 +479,7 @@ def TuneNeoverseV1 : SubtargetFeature<"neoversev1", "ARMProcFamily", "NeoverseV1
 def TuneNeoverseV2 : SubtargetFeature<"neoversev2", "ARMProcFamily", "NeoverseV2",
                                       "Neoverse V2 ARM processors", [
                                       FeatureFuseAES,
+                                      FeatureCmpBccFusion,
                                       FeatureFuseAdrpAdd,
                                       FeatureALULSLFast,
                                       FeaturePostRAScheduler,
diff --git a/llvm/test/CodeGen/AArch64/misched-fusion-cmp-bcc.ll b/llvm/test/CodeGen/AArch64/misched-fusion-cmp-bcc.ll
new file mode 100644
index 00000000000000..1034e072922a09
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/misched-fusion-cmp-bcc.ll
@@ -0,0 +1,35 @@
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mattr=cmp-bcc-fusion | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-a77    | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-a78    | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-a78ae  | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-a78c   | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-a710   | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-a715   | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-a720   | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-a720ae | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-x1     | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-x2     | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=neoverse-v2   | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=ampere1       | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=ampere1a      | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=ampere1b      | FileCheck %s
+
+
+define void @test_cmp_bcc_fusion(i32 %x, i32 %y, i32* %arr) {
+entry:
+  %cmp = icmp eq i32 %x, %y
+  store i32 %x, i32* %arr, align 4
+  br i1 %cmp, label %if_true, label %if_false
+
+if_true:
+  ret void
+
+if_false:
+  ret void
+}
+
+; CHECK-LABEL: test_cmp_bcc_fusion:
+; CHECK: str {{w[0-9]}}, [{{x[0-9]}}]
+; CHECK-NEXT: subs {{w[0-9]}}, {{w[0-9]}}, {{w[0-9]}}
+; CHECK-NEXT: b.ne .LBB0_2
+; CHECK-NEXT: b .LBB0_1

@davemgreen (Collaborator)

I actually suggested that we remove this from some cores recently, as when I tried it in the past (a long time ago now) the performance didn't look great. Perhaps things have changed since then. It's certainly true that the cores can do this fusion; my suspicion was that the implementation in LLVM was a little too aggressive and led to worse codegen in places through knock-on effects.

Can you give more details on the "benchmark that improves significantly"? Do you happen to know why it improves?

@ElvinaYakubova (Contributor, Author)

> I actually suggested that we remove this from some cores recently, as when I tried it in the past (a long time ago now) the performance didn't look great. Perhaps things have changed since then. It's certainly true that the cores can do this fusion; my suspicion was that the implementation in LLVM was a little too aggressive and led to worse codegen in places through knock-on effects.
>
> Can you give more details on the "benchmark that improves significantly"? Do you happen to know why it improves?

The benchmark is Eigen, and we've seen a 15% improvement. There are plenty of

cmp
str
b.cc/b.eq/b.lt

code sequences there.
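
For illustration (an editor's example, not code taken from Eigen): the shape in question is a store that is independent of the compare, much like the IR test in this patch. A minimal C++ reproducer with hypothetical names:

// Editor's illustration: the store does not feed the compare, so the
// scheduler may legally place it either between the cmp and the branch
// (blocking fusion) or ahead of the cmp (enabling fusion).
//
//   without cmp+bcc fusion:        with cmp+bcc fusion:
//     cmp  w0, w1                    str  w0, [x2]
//     str  w0, [x2]                  cmp  w0, w1    // kept adjacent
//     b.eq .LBB0_1                   b.eq .LBB0_1
void store_then_branch(int x, int y, int *out) {
  *out = x;       // independent store
  if (x != y)     // compare + conditional branch
    out[1] = y;   // work on the not-taken path of the b.eq
}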

@sjoerdmeijer (Collaborator)

> > I actually suggested that we remove this from some cores recently, as when I tried it in the past (a long time ago now) the performance didn't look great. Perhaps things have changed since then. It's certainly true that the cores can do this fusion; my suspicion was that the implementation in LLVM was a little too aggressive and led to worse codegen in places through knock-on effects.
> >
> > Can you give more details on the "benchmark that improves significantly"? Do you happen to know why it improves?
>
> The benchmark is Eigen, and we've seen a 15% improvement. There are plenty of
>
> cmp
> str
> b.cc/b.eq/b.lt
>
> code sequences there.

I agree with Dave that we need to look into this a little bit more and double-check that this codegen change is actually responsible for the performance uplift (and not, e.g., some secondary effect).

@davemgreen (Collaborator)

I believe that changes like this would be perfectly fine, and generally an improvement:

--- a/llvm/test/CodeGen/AArch64/complex-deinterleaving-reductions-scalable.ll                               
+++ b/llvm/test/CodeGen/AArch64/complex-deinterleaving-reductions-scalable.ll                               
@@ -340,10 +340,10 @@ define dso_local %"class.std::complex" @reduction_mix(ptr %a, ptr %b, ptr noalia      
 ; CHECK-NEXT:    add x0, x0, x11                                                                           
 ; CHECK-NEXT:    ld1w { z5.d }, p0/z, [x3, x8, lsl #2]                                                     
 ; CHECK-NEXT:    add x8, x8, x9                                                                            
-; CHECK-NEXT:    cmp x10, x8                                                                               
 ; CHECK-NEXT:    fadd z0.d, z4.d, z0.d                                                                     
 ; CHECK-NEXT:    fadd z1.d, z3.d, z1.d                                                                     
 ; CHECK-NEXT:    add z2.d, z5.d, z2.d                                                                      
+; CHECK-NEXT:    cmp x10, x8                                                                               
 ; CHECK-NEXT:    b.ne .LBB3_1                                                                              
 ; CHECK-NEXT:  // %bb.2: // %middle.block                                                                  
 ; CHECK-NEXT:    uaddv d2, p0, z2.d                                                                        

It is the cases where we end up spilling extra registers or just end up with extra instructions that worry me:

--- a/llvm/test/CodeGen/AArch64/setcc_knownbits.ll                                                    
+++ b/llvm/test/CodeGen/AArch64/setcc_knownbits.ll                                                    
@@ -27,9 +27,10 @@ define noundef i1 @logger(i32 noundef %logLevel, ptr %ea, ptr %pll) {              
 ;                                                                                                    
 ; CHECK-GI-LABEL: logger:                                                                            
 ; CHECK-GI:       // %bb.0: // %entry                                                                
-; CHECK-GI-NEXT:    ldr w8, [x2]                                                                     
-; CHECK-GI-NEXT:    cmp w8, w0                                                                       
+; CHECK-GI-NEXT:    ldr w9, [x2]                                                                     
+; CHECK-GI-NEXT:    mov w8, w0                                                                       
 ; CHECK-GI-NEXT:    mov w0, wzr                                                                      
+; CHECK-GI-NEXT:    cmp w9, w8                                                                       
 ; CHECK-GI-NEXT:    b.hi .LBB1_2                                                                     
 ; CHECK-GI-NEXT:  // %bb.1: // %land.rhs                                                             
 ; CHECK-GI-NEXT:    ldr x8, [x1]                                                                     

I'm not against this change if you have run plenty of benchmarks and the performance overall is improving. It is certainly a sensible thing to do on paper. Maybe the example above isn't great (it's just a mov), but hopefully you get the idea about it potentially causing extra register pressure and extra spills. When I compile the llvm-test-suite plus three versions of SPEC, the overall instruction count increases by 0.1%. Everything changes, so there might be some noise going on, but ideally scheduling changes wouldn't increase the instruction count. (The code size also increases slightly.)

I was wondering if there is a slightly less aggressive version of macro-fusion that we could come up with that gets the benefits without the increase in register pressure. The scheduler already tries to treat it that way at the moment, with this not being the top priority inside GenericScheduler::tryCandidate (see the sketch below). Fusing to the ExitSU might be causing difficulties.
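
For readers following along, the ordering Dave refers to looks roughly like this (an editor's heavily abridged sketch of GenericScheduler::tryCandidate from llvm/lib/CodeGen/MachineScheduler.cpp, not the verbatim code): the register-pressure heuristics are consulted before the cluster preference that keeps macro-fused pairs together.

// Editor's abridged sketch of GenericScheduler::tryCandidate; only the
// heuristics relevant to this discussion are shown. Member and helper
// names follow upstream.
bool GenericScheduler::tryCandidate(SchedCandidate &Cand,
                                    SchedCandidate &TryCand,
                                    SchedBoundary *Zone) const {
  // Register-pressure heuristics come first: avoid exceeding the register
  // limit and avoid increasing the critical pressure sets.
  if (tryPressure(TryCand.RPDelta.Excess, Cand.RPDelta.Excess, TryCand,
                  Cand, RegExcess, TRI, DAG->MF))
    return TryCand.Reason != NoCand;
  if (tryPressure(TryCand.RPDelta.CriticalMax, Cand.RPDelta.CriticalMax,
                  TryCand, Cand, RegCritical, TRI, DAG->MF))
    return TryCand.Reason != NoCand;

  // Only after that does the scheduler prefer to keep clustered nodes
  // (e.g. a macro-fused cmp + b.cond pair) together.
  const SUnit *CandNextClusterSU =
      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
  const SUnit *TryCandNextClusterSU =
      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
    return TryCand.Reason != NoCand;

  // ... weak edges, max pressure, resource balance, latency and source
  // order follow, and can still override the cluster preference.
  return false;
}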

So I think this change is OK so long as you have run plenty of benchmarks. The results I have are not better, but not a lot worse either, and the difference could be chalked up to noise given how many things change. It might be worth investigating whether there is a weaker form we could invent that becomes more aggressive again post-RA.

@sjoerdmeijer (Collaborator) commented May 3, 2024

Everything you wrote is spot on, and these are things we have seen too, e.g. the instruction count, a bit of noise, and register pressure. We have one very good motivating example (Eigen), but the rest is neutral. We need to double-check that the Eigen improvement is real, because there are some minor code changes elsewhere too. E.g., we have also seen some extra MOVs being introduced (probably the source of the extra instruction count), and eliminating those might be a good way to improve things, but I will let Elvina confirm and comment on that.

@ElvinaYakubova (Contributor, Author)

@davemgreen You're absolutely right about tryCandidate in the scheduler pass. I tried to find a pattern to ignore such cases, but had no luck with that. I've also done more benchmarking (e.g. different Eigen and TSVC kernels), and generally there are more beneficial cases than regressions. Should we merge the changes in this case?

@davemgreen (Collaborator) left a review comment:

Yeah sure. Let's do it. LGTM

(I may at some point try to invent a weaker form of clustering that sits a little further down the priority list, but I will trust your results, as the ones I've seen may just be getting unlucky with noise, and, like you say, a lot of things are going up and down.)

@ElvinaYakubova (Contributor, Author)

Thank you! I'll try to think about this further too as a separate task.

@ElvinaYakubova merged commit fadd1ec into llvm:main on Jun 4, 2024. 6 checks passed.