
[CodeGen][AArch64] Set min jump table entries to 13 for AArch64 targets #71166

Merged
merged 5 commits into llvm:main on Nov 14, 2023

Conversation

david-arm
Contributor

There are some workloads that are negatively impacted by using jump tables when the number of entries is small. The SPEC2017 perlbench benchmark is one example of this, where increasing the threshold to around 13 gives a ~1.5% improvement on neoverse-v1. I chose the minimum threshold based on empirical evidence rather than science, and just manually increased the threshold until I got the best performance without impacting other workloads. For neoverse-v1 I saw a ~0.2% improvement in the SPEC2017 integer geomean, and no overall change for neoverse-n1. If we find issues with this threshold later on, we can always revisit this.

The most significant SPEC2017 score changes on neoverse-v1 were:

500.perlbench_r: +1.6%
520.omnetpp_r: +0.6%

and the rest saw changes < 0.5%.

I updated CodeGen/AArch64/min-jump-table.ll to reflect the new threshold. For most of the affected tests I manually set the min number of entries back to 4 on the RUN line because the tests seem to rely upon this behaviour.

@llvmbot
Collaborator

llvmbot commented Nov 3, 2023

@llvm/pr-subscribers-backend-aarch64

@llvm/pr-subscribers-llvm-globalisel

Author: David Sherwood (david-arm)

Changes

There are some workloads that are negatively impacted by using jump tables when the number of entries is small. The SPEC2017 perlbench benchmark is one example of this, where increasing the threshold to around 13 gives a ~1.5% improvement on neoverse-v1. I chose the minimum threshold based on empirical evidence rather than science, and just manually increased the threshold until I got the best performance without impacting other workloads. For neoverse-v1 I saw a ~0.2% improvement in the SPEC2017 integer geomean, and no overall change for neoverse-n1. If we find issues with this threshold later on, we can always revisit this.

The most significant SPEC2017 score changes on neoverse-v1 were:

500.perlbench_r: +1.6%
520.omnetpp_r: +0.6%

and the rest saw changes < 0.5%.

I updated CodeGen/AArch64/min-jump-table.ll to reflect the new threshold. For most of the affected tests I manually set the min number of entries back to 4 on the RUN line because the tests seem to rely upon this behaviour.


Patch is 23.65 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/71166.diff

17 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+6)
  • (modified) llvm/test/CodeGen/AArch64/GlobalISel/arm64-irtranslator-switch.ll (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/arm64-jumptable.ll (+2-2)
  • (modified) llvm/test/CodeGen/AArch64/bti-branch-relaxation.ll (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/implicit-def-subreg-to-reg-regression.ll (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/jump-table-32.ll (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/jump-table-exynos.ll (+2-2)
  • (modified) llvm/test/CodeGen/AArch64/jump-table.ll (+6-6)
  • (modified) llvm/test/CodeGen/AArch64/max-jump-table.ll (+5-5)
  • (modified) llvm/test/CodeGen/AArch64/min-jump-table.ll (+92-4)
  • (modified) llvm/test/CodeGen/AArch64/patchable-function-entry-bti.ll (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/redundant-mov-from-zero-extend.ll (+23-18)
  • (modified) llvm/test/CodeGen/AArch64/switch-unreachable-default.ll (+1-1)
  • (modified) llvm/test/CodeGen/AArch64/win64-jumptable.ll (+2-2)
  • (modified) llvm/test/CodeGen/AArch64/wineh-bti.ll (+1-1)
  • (modified) llvm/test/CodeGen/Generic/machine-function-splitter.ll (+4-4)
  • (modified) llvm/test/DebugInfo/COFF/jump-table.ll (+1-1)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 291f0c8c5d991c6..002b95c68272f5c 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -144,6 +144,10 @@ static cl::opt<bool> EnableExtToTBL("aarch64-enable-ext-to-tbl", cl::Hidden,
 static cl::opt<unsigned> MaxXors("aarch64-max-xors", cl::init(16), cl::Hidden,
                                  cl::desc("Maximum of xors"));
 
+static cl::opt<unsigned> AArch64MinimumJumpTableEntries(
+    "aarch64-min-jump-table-entries", cl::init(13), cl::Hidden,
+    cl::desc("Set minimum number of entries to use a jump table on AArch64"));
+
 /// Value type used for condition codes.
 static const MVT MVT_CC = MVT::i32;
 
@@ -1653,6 +1657,8 @@ AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,
   PredictableSelectIsExpensive = Subtarget->predictableSelectIsExpensive();
 
   IsStrictFPEnabled = true;
+
+  setMinimumJumpTableEntries(AArch64MinimumJumpTableEntries);
 }
 
 void AArch64TargetLowering::addTypeForNEON(MVT VT) {
diff --git a/llvm/test/CodeGen/AArch64/GlobalISel/arm64-irtranslator-switch.ll b/llvm/test/CodeGen/AArch64/GlobalISel/arm64-irtranslator-switch.ll
index 4e4297b7a5e22f8..9371edd439fe49c 100644
--- a/llvm/test/CodeGen/AArch64/GlobalISel/arm64-irtranslator-switch.ll
+++ b/llvm/test/CodeGen/AArch64/GlobalISel/arm64-irtranslator-switch.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
-; RUN: llc -global-isel -mtriple aarch64 -O0 -aarch64-enable-atomic-cfg-tidy=0 -stop-after=irtranslator -simplify-mir -verify-machineinstrs %s -o - 2>&1 | FileCheck %s
+; RUN: llc -global-isel -mtriple aarch64 -aarch64-min-jump-table-entries=4 -O0 -aarch64-enable-atomic-cfg-tidy=0 -stop-after=irtranslator -simplify-mir -verify-machineinstrs %s -o - 2>&1 | FileCheck %s
 
 define i32 @switch(i32 %argc) {
   ; CHECK-LABEL: name: switch
diff --git a/llvm/test/CodeGen/AArch64/arm64-jumptable.ll b/llvm/test/CodeGen/AArch64/arm64-jumptable.ll
index d4ac9e72b28ff62..7d9adf92a6a95c5 100644
--- a/llvm/test/CodeGen/AArch64/arm64-jumptable.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-jumptable.ll
@@ -1,5 +1,5 @@
-; RUN: llc -mtriple=arm64-apple-ios < %s | FileCheck %s
-; RUN: llc -mtriple=arm64-linux-gnu < %s | FileCheck %s --check-prefix=CHECK-LINUX
+; RUN: llc -mtriple=arm64-apple-ios -aarch64-min-jump-table-entries=4 < %s | FileCheck %s
+; RUN: llc -mtriple=arm64-linux-gnu -aarch64-min-jump-table-entries=4 < %s | FileCheck %s --check-prefix=CHECK-LINUX
 ; <rdar://11417675>
 
 define void @sum(i32 %a, ptr %to, i32 %c) {
diff --git a/llvm/test/CodeGen/AArch64/bti-branch-relaxation.ll b/llvm/test/CodeGen/AArch64/bti-branch-relaxation.ll
index d9ed54fb86bf151..e3e7f42526f785c 100644
--- a/llvm/test/CodeGen/AArch64/bti-branch-relaxation.ll
+++ b/llvm/test/CodeGen/AArch64/bti-branch-relaxation.ll
@@ -1,4 +1,4 @@
-; RUN: llc %s -o - | FileCheck %s
+; RUN: llc %s -aarch64-min-jump-table-entries=4 -o - | FileCheck %s
 target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
 target triple = "aarch64"
 
diff --git a/llvm/test/CodeGen/AArch64/implicit-def-subreg-to-reg-regression.ll b/llvm/test/CodeGen/AArch64/implicit-def-subreg-to-reg-regression.ll
index 73b180dc4ae765f..0f208f8ed905241 100644
--- a/llvm/test/CodeGen/AArch64/implicit-def-subreg-to-reg-regression.ll
+++ b/llvm/test/CodeGen/AArch64/implicit-def-subreg-to-reg-regression.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -mtriple=arm64-apple-ios < %s | FileCheck %s
+; RUN: llc -aarch64-min-jump-table-entries=4 -mtriple=arm64-apple-ios < %s | FileCheck %s
 
 ; Check there's no assert in spilling from implicit-def operands on an
 ; IMPLICIT_DEF.
diff --git a/llvm/test/CodeGen/AArch64/jump-table-32.ll b/llvm/test/CodeGen/AArch64/jump-table-32.ll
index d8572e901af29ec..e65813de8943e3b 100644
--- a/llvm/test/CodeGen/AArch64/jump-table-32.ll
+++ b/llvm/test/CodeGen/AArch64/jump-table-32.ll
@@ -1,4 +1,4 @@
-; RUN: llc -verify-machineinstrs -o - %s -mtriple=arm64_32-apple-ios7.0 -aarch64-enable-atomic-cfg-tidy=0 | FileCheck %s
+; RUN: llc -verify-machineinstrs -o - %s -aarch64-min-jump-table-entries=4 -mtriple=arm64_32-apple-ios7.0 -aarch64-enable-atomic-cfg-tidy=0 | FileCheck %s
 
 define i32 @test_jumptable(i32 %in) {
 ; CHECK: test_jumptable
diff --git a/llvm/test/CodeGen/AArch64/jump-table-exynos.ll b/llvm/test/CodeGen/AArch64/jump-table-exynos.ll
index b5b400ecfbffcac..61b0df5de2af3a4 100644
--- a/llvm/test/CodeGen/AArch64/jump-table-exynos.ll
+++ b/llvm/test/CodeGen/AArch64/jump-table-exynos.ll
@@ -1,5 +1,5 @@
-; RUN: llc -o - %s -mtriple=aarch64-none-linux-gnu -mattr=+force-32bit-jump-tables -aarch64-enable-atomic-cfg-tidy=0 | FileCheck %s
-; RUN: llc -o - %s -mtriple=aarch64-none-linux-gnu -mcpu=exynos-m3 -aarch64-enable-atomic-cfg-tidy=0 | FileCheck %s
+; RUN: llc -o - %s -mtriple=aarch64-none-linux-gnu -mattr=+force-32bit-jump-tables -aarch64-min-jump-table-entries=4 -aarch64-enable-atomic-cfg-tidy=0 | FileCheck %s
+; RUN: llc -o - %s -mtriple=aarch64-none-linux-gnu -mcpu=exynos-m3 -aarch64-min-jump-table-entries=4 -aarch64-enable-atomic-cfg-tidy=0 | FileCheck %s
 
 ; Exynos doesn't want jump tables to be compressed for now.
 
diff --git a/llvm/test/CodeGen/AArch64/jump-table.ll b/llvm/test/CodeGen/AArch64/jump-table.ll
index 6c32444657542db..bb6c74b3b9df6c7 100644
--- a/llvm/test/CodeGen/AArch64/jump-table.ll
+++ b/llvm/test/CodeGen/AArch64/jump-table.ll
@@ -1,9 +1,9 @@
-; RUN: llc -no-integrated-as -verify-machineinstrs -o - %s -mtriple=aarch64-none-linux-gnu -aarch64-enable-atomic-cfg-tidy=0 | FileCheck %s
-; RUN: llc -no-integrated-as -code-model=large -verify-machineinstrs -o - %s -mtriple=aarch64-none-linux-gnu -aarch64-enable-atomic-cfg-tidy=0 | FileCheck --check-prefix=CHECK-LARGE %s
-; RUN: llc -no-integrated-as -code-model=large -relocation-model=pic -o - %s -mtriple=aarch64-none-linux-gnu -aarch64-enable-atomic-cfg-tidy=0 | FileCheck --check-prefix=CHECK-PIC %s
-; RUN: llc -no-integrated-as -mtriple=aarch64-none-linux-gnu -verify-machineinstrs -relocation-model=pic -aarch64-enable-atomic-cfg-tidy=0 -o - %s | FileCheck --check-prefix=CHECK-PIC %s
-; RUN: llc -no-integrated-as -verify-machineinstrs -o - %s -mtriple=arm64-apple-ios -aarch64-enable-atomic-cfg-tidy=0 | FileCheck --check-prefix=CHECK-IOS %s
-; RUN: llc -no-integrated-as -code-model=tiny -verify-machineinstrs -o - %s -mtriple=aarch64-none-linux-gnu -aarch64-enable-atomic-cfg-tidy=0 | FileCheck --check-prefix=CHECK-TINY %s
+; RUN: llc -no-integrated-as -verify-machineinstrs -o - %s -aarch64-min-jump-table-entries=4 -mtriple=aarch64-none-linux-gnu -aarch64-enable-atomic-cfg-tidy=0 | FileCheck %s
+; RUN: llc -no-integrated-as -code-model=large -verify-machineinstrs -o - %s -aarch64-min-jump-table-entries=4 -mtriple=aarch64-none-linux-gnu -aarch64-enable-atomic-cfg-tidy=0 | FileCheck --check-prefix=CHECK-LARGE %s
+; RUN: llc -no-integrated-as -code-model=large -relocation-model=pic -o - %s -aarch64-min-jump-table-entries=4 -mtriple=aarch64-none-linux-gnu -aarch64-enable-atomic-cfg-tidy=0 | FileCheck --check-prefix=CHECK-PIC %s
+; RUN: llc -no-integrated-as -mtriple=aarch64-none-linux-gnu -verify-machineinstrs -relocation-model=pic -aarch64-min-jump-table-entries=4 -aarch64-enable-atomic-cfg-tidy=0 -o - %s | FileCheck --check-prefix=CHECK-PIC %s
+; RUN: llc -no-integrated-as -verify-machineinstrs -o - %s -mtriple=arm64-apple-ios -aarch64-min-jump-table-entries=4 -aarch64-enable-atomic-cfg-tidy=0 | FileCheck --check-prefix=CHECK-IOS %s
+; RUN: llc -no-integrated-as -code-model=tiny -verify-machineinstrs -o - %s -aarch64-min-jump-table-entries=4 -mtriple=aarch64-none-linux-gnu -aarch64-enable-atomic-cfg-tidy=0 | FileCheck --check-prefix=CHECK-TINY %s
 
 define i32 @test_jumptable(i32 %in) {
 ; CHECK: test_jumptable
diff --git a/llvm/test/CodeGen/AArch64/max-jump-table.ll b/llvm/test/CodeGen/AArch64/max-jump-table.ll
index d01924a9a54272e..1268f5e6167ad8d 100644
--- a/llvm/test/CodeGen/AArch64/max-jump-table.ll
+++ b/llvm/test/CodeGen/AArch64/max-jump-table.ll
@@ -1,8 +1,8 @@
-; RUN: llc %s -O2 -print-after-isel -mtriple=aarch64-linux-gnu -jump-table-density=40                         -o /dev/null 2> %t; FileCheck %s --check-prefixes=CHECK,CHECK0  < %t
-; RUN: llc %s -O2 -print-after-isel -mtriple=aarch64-linux-gnu -jump-table-density=40 -max-jump-table-size=4  -o /dev/null 2> %t; FileCheck %s --check-prefixes=CHECK,CHECK4  < %t
-; RUN: llc %s -O2 -print-after-isel -mtriple=aarch64-linux-gnu -jump-table-density=40 -max-jump-table-size=8  -o /dev/null 2> %t; FileCheck %s --check-prefixes=CHECK,CHECK8  < %t
-; RUN: llc %s -O2 -print-after-isel -mtriple=aarch64-linux-gnu -jump-table-density=40 -max-jump-table-size=16 -o /dev/null 2> %t; FileCheck %s --check-prefixes=CHECK,CHECK16 < %t
-; RUN: llc %s -O2 -print-after-isel -mtriple=aarch64-linux-gnu -jump-table-density=40 -mcpu=exynos-m3         -o /dev/null 2> %t; FileCheck %s --check-prefixes=CHECK,CHECKM3 < %t
+; RUN: llc %s -O2 -print-after-isel -mtriple=aarch64-linux-gnu -aarch64-min-jump-table-entries=4 -jump-table-density=40                         -o /dev/null 2> %t; FileCheck %s --check-prefixes=CHECK,CHECK0  < %t
+; RUN: llc %s -O2 -print-after-isel -mtriple=aarch64-linux-gnu -aarch64-min-jump-table-entries=4 -jump-table-density=40 -max-jump-table-size=4  -o /dev/null 2> %t; FileCheck %s --check-prefixes=CHECK,CHECK4  < %t
+; RUN: llc %s -O2 -print-after-isel -mtriple=aarch64-linux-gnu -aarch64-min-jump-table-entries=4 -jump-table-density=40 -max-jump-table-size=8  -o /dev/null 2> %t; FileCheck %s --check-prefixes=CHECK,CHECK8  < %t
+; RUN: llc %s -O2 -print-after-isel -mtriple=aarch64-linux-gnu -aarch64-min-jump-table-entries=4 -jump-table-density=40 -max-jump-table-size=16 -o /dev/null 2> %t; FileCheck %s --check-prefixes=CHECK,CHECK16 < %t
+; RUN: llc %s -O2 -print-after-isel -mtriple=aarch64-linux-gnu -aarch64-min-jump-table-entries=4 -jump-table-density=40 -mcpu=exynos-m3         -o /dev/null 2> %t; FileCheck %s --check-prefixes=CHECK,CHECKM3 < %t
 
 declare void @ext(i32, i32)
 
diff --git a/llvm/test/CodeGen/AArch64/min-jump-table.ll b/llvm/test/CodeGen/AArch64/min-jump-table.ll
index bf6bddf7b9c4f73..3b3f79746de0ab7 100644
--- a/llvm/test/CodeGen/AArch64/min-jump-table.ll
+++ b/llvm/test/CodeGen/AArch64/min-jump-table.ll
@@ -1,7 +1,9 @@
-; RUN: llc %s -O2 -print-after-isel -mtriple=aarch64-linux-gnu -jump-table-density=40 -min-jump-table-entries=0 -o /dev/null 2> %t; FileCheck %s --check-prefixes=CHECK,CHECK0  < %t
-; RUN: llc %s -O2 -print-after-isel -mtriple=aarch64-linux-gnu -jump-table-density=40 -min-jump-table-entries=2 -o /dev/null 2> %t; FileCheck %s --check-prefixes=CHECK,CHECK2  < %t
-; RUN: llc %s -O2 -print-after-isel -mtriple=aarch64-linux-gnu -jump-table-density=40 -min-jump-table-entries=4 -o /dev/null 2> %t; FileCheck %s --check-prefixes=CHECK,CHECK4  < %t
-; RUN: llc %s -O2 -print-after-isel -mtriple=aarch64-linux-gnu -jump-table-density=40 -min-jump-table-entries=8 -o /dev/null 2> %t; FileCheck %s --check-prefixes=CHECK,CHECK8  < %t
+; RUN: llc %s -O2 -print-after-isel -mtriple=aarch64-linux-gnu -jump-table-density=40 -aarch64-min-jump-table-entries=0 -o /dev/null 2> %t; FileCheck %s --check-prefixes=CHECK,CHECK0  < %t
+; RUN: llc %s -O2 -print-after-isel -mtriple=aarch64-linux-gnu -jump-table-density=40 -aarch64-min-jump-table-entries=2 -o /dev/null 2> %t; FileCheck %s --check-prefixes=CHECK,CHECK2  < %t
+; RUN: llc %s -O2 -print-after-isel -mtriple=aarch64-linux-gnu -jump-table-density=40 -aarch64-min-jump-table-entries=4 -o /dev/null 2> %t; FileCheck %s --check-prefixes=CHECK,CHECK4  < %t
+; RUN: llc %s -O2 -print-after-isel -mtriple=aarch64-linux-gnu -jump-table-density=40 -aarch64-min-jump-table-entries=8 -o /dev/null 2> %t; FileCheck %s --check-prefixes=CHECK,CHECK8  < %t
+; RUN: llc %s -O2 -print-after-isel -mtriple=aarch64-linux-gnu -jump-table-density=40 -aarch64-min-jump-table-entries=12 -o /dev/null 2> %t; FileCheck %s --check-prefixes=CHECK,CHECK12  < %t
+; RUN: llc %s -O2 -print-after-isel -mtriple=aarch64-linux-gnu -jump-table-density=40 -o /dev/null 2> %t; FileCheck %s --check-prefixes=CHECK,CHECK-DEFAULT  < %t
 
 declare void @ext(i32, i32)
 
@@ -16,6 +18,8 @@ entry:
 ; CHECK2-NEXT: Jump Tables:
 ; CHECK4-NOT: {{^}}Jump Tables:
 ; CHECK8-NOT: {{^}}Jump Tables:
+; CHECK12-NOT: {{^}}Jump Tables:
+; CHECK-DEFAULT-NOT: {{^}}Jump Tables:
 
 bb1: tail call void @ext(i32 1, i32 0) br label %return
 bb2: tail call void @ext(i32 2, i32 2) br label %return
@@ -36,6 +40,8 @@ entry:
 ; CHECK2-NEXT: Jump Tables:
 ; CHECK4-NEXT: Jump Tables:
 ; CHECK8-NOT: {{^}}Jump Tables:
+; CHECK12-NOT: {{^}}Jump Tables:
+; CHECK-DEFAULT-NOT: {{^}}Jump Tables:
 
 bb1: tail call void @ext(i32 1, i32 0) br label %return
 bb2: tail call void @ext(i32 3, i32 2) br label %return
@@ -58,6 +64,83 @@ entry:
     i32 9, label %bb8
   ]
 ; CHECK-LABEL: function jt8:
+; CHECK0-NEXT: Jump Tables:
+; CHECK2-NEXT: Jump Tables:
+; CHECK4-NEXT: Jump Tables:
+; CHECK8-NEXT: Jump Tables:
+; CHECK12-NOT: Jump Tables:
+; CHECK-DEFAULT-NOT: {{^}}Jump Tables:
+
+bb1: tail call void @ext(i32 1, i32 0) br label %return
+bb2: tail call void @ext(i32 2, i32 2) br label %return
+bb3: tail call void @ext(i32 3, i32 4) br label %return
+bb4: tail call void @ext(i32 4, i32 6) br label %return
+bb5: tail call void @ext(i32 5, i32 8) br label %return
+bb6: tail call void @ext(i32 6, i32 10) br label %return
+bb7: tail call void @ext(i32 7, i32 12) br label %return
+bb8: tail call void @ext(i32 8, i32 14) br label %return
+
+return: ret i32 %b
+}
+
+define i32 @jt12(i32 %a, i32 %b) {
+entry:
+  switch i32 %a, label %return [
+    i32 1, label %bb1
+    i32 2, label %bb2
+    i32 3, label %bb3
+    i32 4, label %bb4
+    i32 5, label %bb5
+    i32 6, label %bb6
+    i32 7, label %bb7
+    i32 8, label %bb8
+    i32 9, label %bb9
+    i32 10, label %bb10
+    i32 11, label %bb11
+    i32 12, label %bb12
+  ]
+; CHECK-LABEL: function jt12:
+; CHECK0-NEXT: Jump Tables:
+; CHECK2-NEXT: Jump Tables:
+; CHECK4-NEXT: Jump Tables:
+; CHECK8-NEXT: Jump Tables:
+; CHECK12-NEXT: Jump Tables:
+; CHECK-DEFAULT-NOT: {{^}}Jump Tables:
+
+bb1: tail call void @ext(i32 1, i32 0) br label %return
+bb2: tail call void @ext(i32 2, i32 2) br label %return
+bb3: tail call void @ext(i32 3, i32 4) br label %return
+bb4: tail call void @ext(i32 4, i32 6) br label %return
+bb5: tail call void @ext(i32 5, i32 8) br label %return
+bb6: tail call void @ext(i32 6, i32 10) br label %return
+bb7: tail call void @ext(i32 7, i32 12) br label %return
+bb8: tail call void @ext(i32 8, i32 14) br label %return
+bb9: tail call void @ext(i32 9, i32 16) br label %return
+bb10: tail call void @ext(i32 10, i32 18) br label %return
+bb11: tail call void @ext(i32 11, i32 20) br label %return
+bb12: tail call void @ext(i32 12, i32 22) br label %return
+
+return: ret i32 %b
+}
+
+define i32 @jt13(i32 %a, i32 %b) {
+entry:
+  switch i32 %a, label %return [
+    i32 1, label %bb1
+    i32 2, label %bb2
+    i32 3, label %bb3
+    i32 4, label %bb4
+    i32 5, label %bb5
+    i32 6, label %bb6
+    i32 7, label %bb7
+    i32 8, label %bb8
+    i32 9, label %bb9
+    i32 10, label %bb10
+    i32 11, label %bb11
+    i32 12, label %bb12
+    i32 13, label %bb13
+  ]
+; CHECK-LABEL: function jt13:
 ; CHECK-NEXT: Jump Tables:
 
 bb1: tail call void @ext(i32 1, i32 0) br label %return
@@ -68,6 +151,11 @@ bb5: tail call void @ext(i32 5, i32 8) br label %return
 bb6: tail call void @ext(i32 6, i32 10) br label %return
 bb7: tail call void @ext(i32 7, i32 12) br label %return
 bb8: tail call void @ext(i32 8, i32 14) br label %return
+bb9: tail call void @ext(i32 9, i32 16) br label %return
+bb10: tail call void @ext(i32 10, i32 18) br label %return
+bb11: tail call void @ext(i32 11, i32 20) br label %return
+bb12: tail call void @ext(i32 12, i32 22) br label %return
+bb13: tail call void @ext(i32 13, i32 24) br label %return
 
 return: ret i32 %b
 }
diff --git a/llvm/test/CodeGen/AArch64/patchable-function-entry-bti.ll b/llvm/test/CodeGen/AArch64/patchable-function-entry-bti.ll
index 15657730c2cdcb6..106f6bb856b6b8a 100644
--- a/llvm/test/CodeGen/AArch64/patchable-function-entry-bti.ll
+++ b/llvm/test/CodeGen/AArch64/patchable-function-entry-bti.ll
@@ -1,4 +1,4 @@
-; RUN: llc -mtriple=aarch64 %s -o - | FileCheck %s
+; RUN: llc -mtriple=aarch64 -aarch64-min-jump-table-entries=4 %s -o - | FileCheck %s
 
 define void @f0() "patchable-function-entry"="0" "branch-target-enforcement"="true" {
 ; CHECK-LABEL: f0:
diff --git a/llvm/test/CodeGen/AArch64/redundant-mov-from-zero-extend.ll b/llvm/test/CodeGen/AArch64/redundant-mov-from-zero-extend.ll
index c150cb889313ac9..caa28b31a6f6539 100644
--- a/llvm/test/CodeGen/AArch64/redundant-mov-from-zero-extend.ll
+++ b/llvm/test/CodeGen/AArch64/redundant-mov-from-zero-extend.ll
@@ -11,33 +11,37 @@ define i32 @test(i32 %input, i32 %n, i32 %a) {
 ; CHECK-NEXT:  .LBB0_2: // %bb.0
 ; CHECK-NEXT:    add w8, w0, w1
 ; CHECK-NEXT:    mov w0, #100 // =0x64
-; CHECK-NEXT:    cmp w8, #4
-; CHECK-NEXT:    b.hi .LBB0_5
+; CHECK-NEXT:    cmp w8, #1
+; CHECK-NEXT:    b.le .LBB0_7
 ; CHECK-NEXT:  // %bb.3: // %bb.0
-; CHECK-NEXT:    adrp x9, .LJTI0_0
-; CHECK-NEXT:    add x9, x9, :lo12:.LJTI0_0
-; CHECK-NEXT:    adr x10, .LBB0_4
-; CHECK-NEXT:    ldrb w11, [x9, x8]
-; CHECK-NEXT:    add x10, x10, x11, lsl #2
-; CHECK-NEXT:    br x10
-; CHECK-NEXT:  .LBB0_4: // %sw.bb
-; CHECK-NEXT:    add w0, w2, #1
-; CHECK-NEXT:    ret
-; CHECK-NEXT:  .LBB0_5: // %bb.0
+; CHECK-NEXT:    cmp w8, #2
+; CHECK-NEXT:    b.eq .LBB0_10
+; CHECK-NEXT:  // %bb.4: // %bb.0
+; CHECK-NEXT:    cmp w8, #4
+; CHECK-NEXT:    b.eq .LBB0_11
+; CHECK-NEXT:  // %bb.5: // %bb.0
 ; CHECK-NEXT:    cmp w8, #200
-; CHECK-NEXT:    b.ne .LBB0_9
+; CHECK-NEXT:    b.ne .LBB0_12
 ; CHECK-NEXT:  // %bb.6: // %sw.bb7
 ; CHECK-NEXT:    add w0, w2, #7
 ; CHECK-NEXT:    ret
-; CHECK-NEXT:  .LBB0_7: // %sw.bb3
+; CHECK-NEXT:  .LBB0_7: // %bb.0
+; CHECK-NEXT:    cbz w8, .LBB0_13
+; CHECK-NEXT:  // %bb.8: // %bb.0
+; CHECK-NEXT:    cmp w8, #1
+; CHECK-NEXT:    b.ne .LBB0_12
+; CHECK-NEXT:  // %bb.9: // %sw.bb1
+; CHECK-NEXT:    add w0, w2, #3
+; CHECK-NEXT:    ret
+; CHECK-NEXT:  .LBB0_10: // %sw.bb3
 ; CHECK-NEXT:    add w0, w2, #4
 ; CHECK-NEXT:    ret
-; CHECK-NEXT:  .LBB0_8: // %sw.bb5
+; CHECK-NEXT:  .LBB0_11: // %sw.bb5
 ; CHECK-NEXT:    add w0, w2, #5
-; CHECK-NEXT:  .LBB0_9: // %return
+; CHECK-NEXT:  .LBB0_12: // %return
 ; CHECK-NEXT:    ret
-; CHECK-NEXT:  .LBB0_10: // %sw.bb1
-; CHECK-NEXT:    add w0, w2, #3
+; CHECK-NEXT:  .LBB0_13: // %sw.bb
+; CHECK-NEXT:    add w0, w2, #1
 ; CHECK-NEXT:    ret
 entry:
   %b = add nsw i32 %input, %n
@@ -77,3 +81,4 @@ return:
   %retval.0 = phi i32 [ %add8, %sw.bb7 ], [ %add6, %sw.bb5 ], [ %add4, %sw.bb3 ], [ %add2, %sw.bb1 ], [ %add, %sw.bb ], [ 100, %bb.0 ], [ 0, %entry ]
   ret i32 %retval.0
 }
+
diff --git a/llvm/test/CodeGen/AArch64/switch-unreachable-default.ll b/llvm/test/CodeGen/AArch64/switch-unreachable-default.ll
index 9acc5a150b630f2..949485bf52f2e8d 100644
--- a/llvm/test/CodeGen/AArch64/switch-unreachable-default.ll
+++ b/llvm/test/CodeGen/AArch64/switch-unreachable-default.ll
@@ -1,4 +1,4 @@
-; RUN: llc -O3 -o - %s | FileCheck %s
+; RUN: llc -O3 -aarch64-min-jump-table-entries=4 -o - %s | FileCheck %s
 
 ; Test that the output in the presence of an unreachable default does not have
 ; a compare and branch at the top of the switch to handle the default case.
diff --git a/llvm/test/CodeGen/AArch64/win64-jumptable.ll b/llvm/test/CodeGen/AArch64/win64-jumptable.ll
index 0b9b7deceae1138..f9f2b0bf0ca5cf5 100644
--- a/llvm/test/CodeGen/AArch64/win64-jumptable.ll
+++ b/llvm/test/CodeGen/AArch64/win64-jumptable.ll
@@ -1,5 +1,5 @@
-; RUN: llc -o - %s -mtriple=aarch64-windows -aarch64-enable-compress-jump-tables=0 | FileCheck %s
-; RUN: llc -o - %s -mtriple=aarch64-windows -aarch64-enable-compress-jump-tables=0 -filetype=obj | llvm-readobj --unwind - | FileCheck %s -check-prefix=UNWIND
+; RUN: llc -o - %s -mtriple=aarch64-windows -aarch64-min-jump-table-entries=4 -aarch64-enable-compress-jump-tables=0 | FileCheck %s
+; RUN: llc -o - %s -mtriple=aarch64-windows -aarch64-min-jump-table-entries=4 -aarch64-enable-compress-jump-tables=0 -filetype=obj | llvm-readobj --...
[truncated]

@davemgreen
Collaborator

As with last time this came up, I have no strong opinion on whether one value is better than another, so this sounds OK.

What happens with minsize? Should that have a lower limit?

I've changed getMinimumJumpTableEntries to now take a Function
parameter so that we can query the minsize attribute and only
increase the threshold if minsize is absent.
@paulwalker-arm left a comment
Collaborator

Now that you're making MinimumJumpTableEntries a per-function property, perhaps it's better stored in AArch64Subtarget; then you don't need to change the existing getMinimumJumpTableEntries interface. This will be more in keeping with how other per-function settings are handled.

Instead of querying the function attributes every time, we can
let the subtarget decide once at construction.
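A minimal sketch of that arrangement, assuming a HasMinSize flag threaded into the subtarget constructor (illustrative names only, not the exact code merged in this PR):

#include "llvm/Support/CommandLine.h"
using namespace llvm;

// Sketch only: the subtarget resolves the threshold once, at construction.
static cl::opt<unsigned> AArch64MinimumJumpTableEntries(
    "aarch64-min-jump-table-entries", cl::init(13), cl::Hidden,
    cl::desc("Set minimum number of entries to use a jump table on AArch64"));

class AArch64SubtargetSketch {
  unsigned MinimumJumpTableEntries;

public:
  // HasMinSize: whether the function being compiled carries the minsize
  // attribute.
  explicit AArch64SubtargetSketch(bool HasMinSize) {
    // Under minsize, keep the old default of 4 unless the user explicitly
    // forced a value on the command line.
    if (HasMinSize && AArch64MinimumJumpTableEntries.getNumOccurrences() == 0)
      MinimumJumpTableEntries = 4;
    else
      MinimumJumpTableEntries = AArch64MinimumJumpTableEntries;
  }

  unsigned getMinimumJumpTableEntries() const {
    return MinimumJumpTableEntries;
  }
};

The lowering code can then query the subtarget's value instead of reading the function attributes on every call.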
@david-arm
Contributor Author

Now that you're making MinimumJumpTableEntries a per-function property, perhaps it's better stored in AArch64Subtarget; then you don't need to change the existing getMinimumJumpTableEntries interface. This will be more in keeping with how other per-function settings are handled.

Done!


github-actions bot commented Nov 13, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@davemgreen
Collaborator

Thanks. It looks like some of the formatting could be redone? The failing test at least looks unrelated.

I see the Arm backend does the same thing with adding MinSize to the subtarget. This sounds OK to me if the formatting is happy.

@david-arm
Contributor Author

I ran make check-all using HEAD of LLVM and tests pass.

@david-arm merged commit bdc0afc into llvm:main on Nov 14, 2023
2 of 3 checks passed
@nikic
Contributor

nikic commented Nov 14, 2023

How does this affect performance in cases where the last switch case is usually taken?

@david-arm
Contributor Author

How does this affect performance in cases where the last switch case is usually taken?

Honestly speaking, I don't know! As mentioned in the commit message, I just changed this number until I got the best performance results, so it's quite possible that this makes things worse in some benchmarks I've not tested. If we discover issues later on, we can revisit this and perhaps refine the algorithm.

@davemgreen
Collaborator

How does this affect performance in cases where the last switch case is usually taken?

If I understand what you are saying, it should generate a tree of branches, so the positions of the cases shouldn't matter a huge amount. It can depend on the differences between the cases and the branch targets, though; they might not match.

Let us know if you have reason to think this would be worse in some way; it's not a very easy thing to measure accurately. A load from a constant pool into an indirect jump can be worse than a series of branches that can each be independently branch predicted.
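For intuition, here are the two lowering strategies written out by hand in C++ (purely illustrative; LLVM makes this choice during instruction selection, not at the source level):

// Illustrative only: the two shapes a small switch can lower to.
static int on0(int a) { return a + 1; }
static int on1(int a) { return a + 3; }
static int on2(int a) { return a + 4; }
static int on3(int a) { return a + 5; }

// Jump-table style: one table load feeding a single indirect branch.
// All cases share one indirect prediction site, and the table load
// itself can miss in the cache.
int dispatch_table(unsigned x, int a) {
  using Handler = int (*)(int);
  static const Handler Table[4] = {on0, on1, on2, on3};
  if (x >= 4)
    return 100;       // default case
  return Table[x](a); // load + indirect branch
}

// Branch-tree style: a balanced tree of compares. Each conditional
// branch is predicted independently, and no table load is needed.
int dispatch_tree(unsigned x, int a) {
  if (x < 2)
    return x == 0 ? on0(a) : on1(a);
  if (x == 2)
    return on2(a);
  return x == 3 ? on3(a) : 100; // default case
}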

@nikic
Contributor

nikic commented Nov 14, 2023

How does this affect performance in cases where the last switch case is usually taken?

If I understand what you are saying, it should generate a tree of branches, so the positions of the cases shouldn't matter a huge amount. It can depend on the differences between the cases and the branch targets, though; they might not match.

Let us know if you have reason to think this would be worse in some way; it's not a very easy thing to measure accurately. A load from a constant pool into an indirect jump can be worse than a series of branches that can each be independently branch predicted.

I was mainly just wondering about what the worst-case impact here would be for an unlucky switch. If I understand correctly, previously the switch (without jump table) would need at most two branches, and now it needs at most four. Does that sound correct? (I was thinking in terms of a linear chain previously -- with a binary tree this doesn't sound so bad.)
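A rough sanity check of those numbers, assuming a dense case range lowered as a balanced compare tree (SelectionDAG's actual case clustering can differ):

\[ \text{worst-case compares} \approx \lceil \log_2 n \rceil, \qquad \lceil \log_2 3 \rceil = 2, \qquad \lceil \log_2 12 \rceil = 4, \]

where n = 3 is the largest switch kept out of a jump table under the old threshold of 4, and n = 12 under the new threshold of 13.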

@sjoerdmeijer
Collaborator

FWIW: this gives < 0.5% on the Neoverse V2 for Perlbench. Might even be noise because I haven't done a lot of runs.

@david-arm
Contributor Author

FWIW: this gives < 0.5% on the Neoverse V2 for Perlbench. Might even be noise because I haven't done a lot of runs.

That's interesting, and an improvement, even if minor, is still good news! It is also possible to tweak the flag per CPU if a higher or lower number would help.

@sjoerdmeijer
Collaborator

I think it's actually noise; I don't think I see improvements on the V2. Perhaps that's still okay because it shows improvements on the V1. But now I am actually wondering if the V1 improvement is an actual improvement, or do we get lucky with secondary effects such as different function/loop/block alignments?

@david-arm
Contributor Author

I think it's actually noise; I don't think I see improvements on the V2. Perhaps that's still okay because it shows improvements on the V1. But now I am actually wondering if the V1 improvement is an actual improvement, or do we get lucky with secondary effects such as different function/loop/block alignments?

I ran this quite a few times on neoverse-v1 hardware and consistently saw an improvement, particularly with #72132 applied as well, although I appreciate there is a lot of noise in perlbench. From observation it seems that GCC also takes a similar approach when deciding to create jump tables for perlbench.

@david-arm
Contributor Author

I think it's actually noise; I don't think I see improvements on the V2. Perhaps that's still okay because it shows improvements on the V1. But now I am actually wondering if the V1 improvement is an actual improvement, or do we get lucky with secondary effects such as different function/loop/block alignments?

What I also saw was a gradual improvement in performance as I increased the threshold for creating jump tables. For example, I'd get consistently better results with a minimum threshold of 6, then even better with 10, then the best around 13. When I saw that trend of improving performance with a higher threshold, it made me think this wasn't just luck or coincidence, and that there is a genuine correlation.

@sjoerdmeijer
Collaborator

I think it's actually noise; I don't think I see improvements on the V2. Perhaps that's still okay because it shows improvements on the V1. But now I am actually wondering if the V1 improvement is an actual improvement, or do we get lucky with secondary effects such as different function/loop/block alignments?

What I also saw was a gradual improvement in performance as I increased the threshold for creating jump tables. For example, I'd get consistently better results with a minimum threshold of 6, then even better with 10, then the best around 13. When I saw that trend of improving performance with a higher threshold, it made me think this wasn't just luck or coincidence, and that there is a genuine correlation.

Ok, that sounds good.

@Wilco1

Wilco1 commented Nov 20, 2023

I think it's actually noise; I don't think I see improvements on the V2. Perhaps that's still okay because it shows improvements on the V1. But now I am actually wondering if the V1 improvement is an actual improvement, or do we get lucky with secondary effects such as different function/loop/block alignments?

Check the instruction count, cache misses and branch misses. Much of the gain comes from a huge reduction in instruction count; the rest comes from avoiding the cache misses incurred by loading the table entries, and from avoiding indirect branch misses. As indirect branch prediction improves, a lower value may be slightly better on future cores; however, I don't believe it ever makes sense to use a table with fewer than 8 cases.
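On Linux, those counters can be compared directly between the two builds with perf (an illustrative invocation; the binary name and workload arguments are placeholders):

perf stat -e instructions,branch-misses,cache-misses ./perlbench_r <workload args>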

zahiraam pushed a commit to zahiraam/llvm-project that referenced this pull request Nov 20, 2023
…ts (llvm#71166)

wangpc-pp added a commit to wangpc-pp/llvm-project that referenced this pull request Nov 23, 2023
This is like what AArch64 has done in llvm#71166 except that we don't
handle `HasMinSize` case now.
wangpc-pp added a commit that referenced this pull request Nov 23, 2023
This is like what AArch64 has done in #71166 except that we don't
handle `HasMinSize` case now.