[AArch64] Enable CmpBcc fusion for Neoverse-v2 #90608

Merged: 1 commit into llvm:main, Jun 4, 2024

Conversation

@ElvinaYakubova (Contributor)

This adds fusion of compare and branch instructions for Neoverse V2. According to the Software Optimization Guide, the specific AArch64 instruction pairs that can be fused are as follows:
CMP/CMN (immediate) + B.cond
CMP/CMN (register) + B.cond
Performance for SPEC2017 is neutral, but another benchmark improves significantly.
Results for SPEC2017 on a Neoverse V2:
500.perlbench 0%
502.gcc_r 0%
505.mcf_r -0.15%
523.xalancbmk_r -0.43%
525.x264_r 0%
531.deepsjeng_r 0%
541.leela_r -0.16%
557.xz_r -0.47%
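
For context on how the feature takes effect (an editor's sketch, not part of the patch): FeatureCmpBccFusion sets a subtarget flag that the AArch64 macro-fusion hook consults when deciding whether to add a cluster edge between a flag-setting compare and the following conditional branch. A heavily simplified C++ sketch of that check, modelled on llvm/lib/Target/AArch64/AArch64MacroFusion.cpp with the opcode list abbreviated (illustrative, not the verbatim upstream code):

// Editor's sketch: the shape of the cmp + b.cond check used for macro
// fusion. Meant to live inside the AArch64 target directory, where the
// AArch64:: opcode and register enums are visible.
#include "AArch64InstrInfo.h"           // AArch64::* opcodes and registers
#include "llvm/CodeGen/MachineInstr.h"

using namespace llvm;

static bool isCmpBccPair(const MachineInstr *FirstMI,
                         const MachineInstr &SecondMI) {
  // The second instruction of the candidate pair must be a conditional
  // branch (B.cond).
  if (SecondMI.getOpcode() != AArch64::Bcc)
    return false;

  // A null FirstMI means the predecessor is unknown; conservatively treat
  // it as fusible so the branch stays available for pairing.
  if (!FirstMI)
    return true;

  // CMP/CMN are SUBS/ADDS aliases; the fusible forms are the immediate and
  // register variants whose result is discarded into WZR/XZR.
  switch (FirstMI->getOpcode()) {
  case AArch64::SUBSWri: // cmp wN, #imm
  case AArch64::SUBSXri: // cmp xN, #imm
  case AArch64::SUBSWrr: // cmp wN, wM
  case AArch64::SUBSXrr: // cmp xN, xM
  case AArch64::ADDSWri: // cmn wN, #imm
  case AArch64::ADDSXri: // cmn xN, #imm
  case AArch64::ADDSWrr: // cmn wN, wM
  case AArch64::ADDSXrr: // cmn xN, xM
    return FirstMI->getOperand(0).getReg() == AArch64::WZR ||
           FirstMI->getOperand(0).getReg() == AArch64::XZR;
  }
  return false;
}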

@llvmbot (Collaborator) commented Apr 30, 2024

@llvm/pr-subscribers-backend-aarch64

Author: Elvina Yakubova (ElvinaYakubova)

Full diff: https://github.com/llvm/llvm-project/pull/90608.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64Processors.td (+1)
  • (added) llvm/test/CodeGen/AArch64/misched-fusion-cmp-bcc.ll (+35)
diff --git a/llvm/lib/Target/AArch64/AArch64Processors.td b/llvm/lib/Target/AArch64/AArch64Processors.td
index 8772e51bf0ab42..af0bd664029f49 100644
--- a/llvm/lib/Target/AArch64/AArch64Processors.td
+++ b/llvm/lib/Target/AArch64/AArch64Processors.td
@@ -479,6 +479,7 @@ def TuneNeoverseV1 : SubtargetFeature<"neoversev1", "ARMProcFamily", "NeoverseV1
 def TuneNeoverseV2 : SubtargetFeature<"neoversev2", "ARMProcFamily", "NeoverseV2",
                                       "Neoverse V2 ARM processors", [
                                       FeatureFuseAES,
+                                      FeatureCmpBccFusion,
                                       FeatureFuseAdrpAdd,
                                       FeatureALULSLFast,
                                       FeaturePostRAScheduler,
diff --git a/llvm/test/CodeGen/AArch64/misched-fusion-cmp-bcc.ll b/llvm/test/CodeGen/AArch64/misched-fusion-cmp-bcc.ll
new file mode 100644
index 00000000000000..1034e072922a09
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/misched-fusion-cmp-bcc.ll
@@ -0,0 +1,35 @@
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mattr=cmp-bcc-fusion | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-a77    | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-a78    | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-a78ae  | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-a78c   | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-a710   | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-a715   | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-a720   | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-a720ae | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-x1     | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=cortex-x2     | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=neoverse-v2   | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=ampere1       | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=ampere1a      | FileCheck %s
+; RUN: llc %s -o - -O0 -mtriple=aarch64-unknown -mcpu=ampere1b      | FileCheck %s
+
+
+define void @test_cmp_bcc_fusion(i32 %x, i32 %y, i32* %arr) {
+entry:
+  %cmp = icmp eq i32 %x, %y
+  store i32 %x, i32* %arr, align 4
+  br i1 %cmp, label %if_true, label %if_false
+
+if_true:
+  ret void
+
+if_false:
+  ret void
+}
+
+; CHECK-LABEL: test_cmp_bcc_fusion:
+; CHECK: str {{w[0-9]}}, [{{x[0-9]}}]
+; CHECK-NEXT: subs {{w[0-9]}}, {{w[0-9]}}, {{w[0-9]}}
+; CHECK-NEXT: b.ne .LBB0_2
+; CHECK-NEXT: b .LBB0_1

@davemgreen (Collaborator)

I actually suggested that we remove this from some cores recently, as when I tried it in the past (a long time ago now) the performance didn't look great. Perhaps things have changed since then. It's certainly true that the cores can do this fusion; my suspicion was that the implementation in LLVM was a little too aggressive and led to worse codegen in places through knock-on effects.

Can you give more details on the "benchmark that improves significantly"? Do you happen to know why it improves?

@ElvinaYakubova (Contributor, Author)

> I actually suggested that we remove this from some cores recently, as when I tried it in the past (a long time ago now) the performance didn't look great. Perhaps things have changed since then. It's certainly true that the cores can do this fusion; my suspicion was that the implementation in LLVM was a little too aggressive and led to worse codegen in places through knock-on effects.
>
> Can you give more details on the "benchmark that improves significantly"? Do you happen to know why it improves?

The benchmark is Eigen, and we've seen a 15% improvement. There are plenty of

cmp
str
b.cc/b.eq/b.lt

code sequences there.
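
For illustration (an editor's example, not code taken from Eigen): the shape in question is a store that is independent of the compare, much like the IR test in this patch. A minimal C++ reproducer with hypothetical names:

// Editor's illustration: the store does not feed the compare, so the
// scheduler may legally place it either between the cmp and the branch
// (blocking fusion) or ahead of the cmp (enabling fusion).
//
//   without cmp+bcc fusion:        with cmp+bcc fusion:
//     cmp  w0, w1                    str  w0, [x2]
//     str  w0, [x2]                  cmp  w0, w1    // kept adjacent
//     b.eq .LBB0_1                   b.eq .LBB0_1
void store_then_branch(int x, int y, int *out) {
  *out = x;       // independent store
  if (x != y)     // compare + conditional branch
    out[1] = y;   // work on the not-taken path of the b.eq
}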

@sjoerdmeijer (Collaborator)

> > I actually suggested that we remove this from some cores recently, as when I tried it in the past (a long time ago now) the performance didn't look great. Perhaps things have changed since then. It's certainly true that the cores can do this fusion; my suspicion was that the implementation in LLVM was a little too aggressive and led to worse codegen in places through knock-on effects.
> >
> > Can you give more details on the "benchmark that improves significantly"? Do you happen to know why it improves?
>
> The benchmark is Eigen, and we've seen a 15% improvement. There are plenty of
>
> cmp
> str
> b.cc/b.eq/b.lt
>
> code sequences there.

I agree with Dave that we need to look into this a little bit more and double-check that this codegen change is actually responsible for the performance uplift (and not, e.g., some secondary effect).

@davemgreen (Collaborator)

I believe that changes like this would be perfectly fine, and generally an improvement:

--- a/llvm/test/CodeGen/AArch64/complex-deinterleaving-reductions-scalable.ll                               
+++ b/llvm/test/CodeGen/AArch64/complex-deinterleaving-reductions-scalable.ll                               
@@ -340,10 +340,10 @@ define dso_local %"class.std::complex" @reduction_mix(ptr %a, ptr %b, ptr noalia      
 ; CHECK-NEXT:    add x0, x0, x11                                                                           
 ; CHECK-NEXT:    ld1w { z5.d }, p0/z, [x3, x8, lsl #2]                                                     
 ; CHECK-NEXT:    add x8, x8, x9                                                                            
-; CHECK-NEXT:    cmp x10, x8                                                                               
 ; CHECK-NEXT:    fadd z0.d, z4.d, z0.d                                                                     
 ; CHECK-NEXT:    fadd z1.d, z3.d, z1.d                                                                     
 ; CHECK-NEXT:    add z2.d, z5.d, z2.d                                                                      
+; CHECK-NEXT:    cmp x10, x8                                                                               
 ; CHECK-NEXT:    b.ne .LBB3_1                                                                              
 ; CHECK-NEXT:  // %bb.2: // %middle.block                                                                  
 ; CHECK-NEXT:    uaddv d2, p0, z2.d                                                                        

It is the cases where we end up spilling extra registers or just end up with extra instructions that worry me:

--- a/llvm/test/CodeGen/AArch64/setcc_knownbits.ll                                                    
+++ b/llvm/test/CodeGen/AArch64/setcc_knownbits.ll                                                    
@@ -27,9 +27,10 @@ define noundef i1 @logger(i32 noundef %logLevel, ptr %ea, ptr %pll) {              
 ;                                                                                                    
 ; CHECK-GI-LABEL: logger:                                                                            
 ; CHECK-GI:       // %bb.0: // %entry                                                                
-; CHECK-GI-NEXT:    ldr w8, [x2]                                                                     
-; CHECK-GI-NEXT:    cmp w8, w0                                                                       
+; CHECK-GI-NEXT:    ldr w9, [x2]                                                                     
+; CHECK-GI-NEXT:    mov w8, w0                                                                       
 ; CHECK-GI-NEXT:    mov w0, wzr                                                                      
+; CHECK-GI-NEXT:    cmp w9, w8                                                                       
 ; CHECK-GI-NEXT:    b.hi .LBB1_2                                                                     
 ; CHECK-GI-NEXT:  // %bb.1: // %land.rhs                                                             
 ; CHECK-GI-NEXT:    ldr x8, [x1]                                                                     

I'm not against this change if you have run plenty of benchmarks and the performance overall is improving. It is certainly a sensible thing to do on paper. Maybe the example above isn't great (it's just a mov), but hopefully you get the idea about it potentially causing extra register pressure and extra spills. When I compile the llvm-test-suite plus three versions of SPEC, the overall instruction count increases by 0.1%. Everything changes, so there might be some noise going on, but ideally scheduling changes wouldn't increase the instruction count. (The code size also increases slightly.)

I was wondering if there is a slightly less aggressive version of macro-fusion that we could come up with that gets the benefits without the increase in register pressure. The scheduler already tries to treat it that way at the moment, with this not being the top priority inside GenericScheduler::tryCandidate (see the sketch below). Fusing to the ExitSU might be causing difficulties.
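
For readers following along, the ordering Dave refers to looks roughly like this (an editor's heavily abridged sketch of GenericScheduler::tryCandidate from llvm/lib/CodeGen/MachineScheduler.cpp, not the verbatim code): the register-pressure heuristics are consulted before the cluster preference that keeps macro-fused pairs together.

// Editor's abridged sketch of GenericScheduler::tryCandidate; only the
// heuristics relevant to this discussion are shown. Member and helper
// names follow upstream.
bool GenericScheduler::tryCandidate(SchedCandidate &Cand,
                                    SchedCandidate &TryCand,
                                    SchedBoundary *Zone) const {
  // Register-pressure heuristics come first: avoid exceeding the register
  // limit and avoid increasing the critical pressure sets.
  if (tryPressure(TryCand.RPDelta.Excess, Cand.RPDelta.Excess, TryCand,
                  Cand, RegExcess, TRI, DAG->MF))
    return TryCand.Reason != NoCand;
  if (tryPressure(TryCand.RPDelta.CriticalMax, Cand.RPDelta.CriticalMax,
                  TryCand, Cand, RegCritical, TRI, DAG->MF))
    return TryCand.Reason != NoCand;

  // Only after that does the scheduler prefer to keep clustered nodes
  // (e.g. a macro-fused cmp + b.cond pair) together.
  const SUnit *CandNextClusterSU =
      Cand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
  const SUnit *TryCandNextClusterSU =
      TryCand.AtTop ? DAG->getNextClusterSucc() : DAG->getNextClusterPred();
  if (tryGreater(TryCand.SU == TryCandNextClusterSU,
                 Cand.SU == CandNextClusterSU, TryCand, Cand, Cluster))
    return TryCand.Reason != NoCand;

  // ... weak edges, max pressure, resource balance, latency and source
  // order follow, and can still override the cluster preference.
  return false;
}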

So I think this change is OK so long as you have run plenty of benchmarks. The results I have are not better, but not a lot worse either, and the difference could be chalked up to noise given how many things change. It might be worth investigating whether there is a weaker form we could invent that becomes more aggressive again post-RA.

@sjoerdmeijer (Collaborator) commented May 3, 2024

Everything you wrote is spot on, and these are things we have seen too, e.g. the instruction count, a bit of noise, and register pressure. We have one very good motivating example (Eigen), but the rest is neutral. We need to double-check that the Eigen improvement is real, because there are some minor code changes elsewhere too. E.g., we have also seen some extra MOVs being introduced (probably the source of the extra instruction count), and eliminating those might be a good way to improve things, but I will let Elvina confirm and comment on that.

@ElvinaYakubova (Contributor, Author)

@davemgreen You're absolutely right about tryCandidate in the scheduler pass. I tried to find a pattern to ignore such cases, but had no luck with that. I've also done more benchmarking (e.g. different Eigen and TSVC kernels), and generally there are more beneficial cases than regressions. Should we merge the changes in this case?

@davemgreen (Collaborator) left a review comment:

Yeah sure. Let's do it. LGTM

(I may at some point try to invent a weaker form of clustering that sits a little further down the priority list, but I will trust your results, as the ones I've seen may just be getting unlucky with noise, and, like you say, a lot of things are going up and down.)

@ElvinaYakubova (Contributor, Author)

Thank you! I'll try to think about this further too as a separate task.

@ElvinaYakubova merged commit fadd1ec into llvm:main on Jun 4, 2024. 6 checks passed.