[SPARC] Mark branches as being expensive in early Niagara CPUs #166489
Early Niagara processors (T1-T3) lack any branch predictor, yet they also have a pipeline long enough that the delay slot cannot cover all of the branch latency. This means that branch instructions stall the processor for a couple of cycles, which makes them an expensive operation. Additionally, the high cost of branching means that it's still profitable to prefer conditional moves even when the condition is predictable, so let LLVM know about both things.
@llvm/pr-subscribers-backend-sparc
Author: Koakuma (koachan)
Changes
Early Niagara processors (T1-T3) lack any branch predictor, yet they also have a pipeline long enough that the delay slot cannot cover all of the branch latency.
Full diff: https://github.com/llvm/llvm-project/pull/166489.diff
3 Files Affected:
diff --git a/llvm/lib/Target/Sparc/Sparc.td b/llvm/lib/Target/Sparc/Sparc.td
index 7137e5fbff4ff..38b0508885069 100644
--- a/llvm/lib/Target/Sparc/Sparc.td
+++ b/llvm/lib/Target/Sparc/Sparc.td
@@ -95,6 +95,9 @@ def FeatureSoftFloat : SubtargetFeature<"soft-float", "UseSoftFloat", "true",
def TuneSlowRDPC : SubtargetFeature<"slow-rdpc", "HasSlowRDPC", "true",
"rd %pc, %XX is slow", [FeatureV9]>;
+def TuneNoPredictor : SubtargetFeature<"no-predictor", "HasNoPredictor", "true",
+ "Processor has no branch predictor, branches stall execution", []>;
+
//==== Features added predmoninantly for LEON subtarget support
include "LeonFeatures.td"
@@ -174,12 +177,15 @@ def : Proc<"ultrasparc3", [FeatureV9, FeatureV8Deprecated, FeatureVIS,
FeatureVIS2],
[TuneSlowRDPC]>;
def : Proc<"niagara", [FeatureV9, FeatureV8Deprecated, FeatureVIS,
- FeatureVIS2, FeatureUA2005]>;
+ FeatureVIS2, FeatureUA2005],
+ [TuneNoPredictor]>;
def : Proc<"niagara2", [FeatureV9, FeatureV8Deprecated, UsePopc,
- FeatureVIS, FeatureVIS2, FeatureUA2005]>;
+ FeatureVIS, FeatureVIS2, FeatureUA2005],
+ [TuneNoPredictor]>;
def : Proc<"niagara3", [FeatureV9, FeatureV8Deprecated, UsePopc,
FeatureVIS, FeatureVIS2, FeatureVIS3,
- FeatureUA2005, FeatureUA2007]>;
+ FeatureUA2005, FeatureUA2007],
+ [TuneNoPredictor]>;
def : Proc<"niagara4", [FeatureV9, FeatureV8Deprecated, UsePopc,
FeatureVIS, FeatureVIS2, FeatureVIS3,
FeatureUA2005, FeatureUA2007, FeatureOSA2011,
diff --git a/llvm/lib/Target/Sparc/SparcISelLowering.cpp b/llvm/lib/Target/Sparc/SparcISelLowering.cpp
index cbb7db68f7e7c..ae3c32687c207 100644
--- a/llvm/lib/Target/Sparc/SparcISelLowering.cpp
+++ b/llvm/lib/Target/Sparc/SparcISelLowering.cpp
@@ -2000,6 +2000,14 @@ SparcTargetLowering::SparcTargetLowering(const TargetMachine &TM,
setOperationAction(ISD::INTRINSIC_WO_CHAIN, MVT::Other, Custom);
+ // Some processors have no branch predictor and have pipelines longer than
+ // what can be covered by the delay slot. This results in a stall, so mark
+ // branches to be expensive on those processors.
+ setJumpIsExpensive(Subtarget->hasNoPredictor());
+ // The high cost of branching means that using conditional moves will
+ // still be profitable even if the condition is predictable.
+ PredictableSelectIsExpensive = !isJumpExpensive();
+
setMinFunctionAlignment(Align(4));
computeRegisterProperties(Subtarget->getRegisterInfo());
diff --git a/llvm/test/CodeGen/SPARC/select-earlyniagara.ll b/llvm/test/CodeGen/SPARC/select-earlyniagara.ll
new file mode 100644
index 0000000000000..2cec10455d205
--- /dev/null
+++ b/llvm/test/CodeGen/SPARC/select-earlyniagara.ll
@@ -0,0 +1,43 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -O3 < %s -relocation-model=pic -mtriple=sparc -mcpu=v9 | FileCheck --check-prefix=SPARC %s
+; RUN: llc -O3 < %s -relocation-model=pic -mtriple=sparcv9 -mcpu=v9 | FileCheck --check-prefix=SPARC64 %s
+
+;; Early Niagara processors should prefer conditional moves over branches
+;; even when it's predictable.
+
+define i32 @cinc(i32 %cond, i32 %num) #0 {
+; SPARC-LABEL: cinc:
+; SPARC: ! %bb.0: ! %entry
+; SPARC-NEXT: cmp %o0, 0
+; SPARC-NEXT: bne %icc, .LBB0_2
+; SPARC-NEXT: mov %o1, %o0
+; SPARC-NEXT: ! %bb.1: ! %inc
+; SPARC-NEXT: add %o0, 1, %o0
+; SPARC-NEXT: .LBB0_2: ! %cont
+; SPARC-NEXT: retl
+; SPARC-NEXT: nop
+;
+; SPARC64-LABEL: cinc:
+; SPARC64: ! %bb.0: ! %entry
+; SPARC64-NEXT: cmp %o0, 0
+; SPARC64-NEXT: bne %icc, .LBB0_2
+; SPARC64-NEXT: mov %o1, %o0
+; SPARC64-NEXT: ! %bb.1: ! %inc
+; SPARC64-NEXT: add %o0, 1, %o0
+; SPARC64-NEXT: .LBB0_2: ! %cont
+; SPARC64-NEXT: retl
+; SPARC64-NEXT: nop
+entry:
+ %cmp = icmp eq i32 %cond, 0
+ %exp = call i1 @llvm.expect.i1(i1 %cmp, i1 0)
+ br i1 %exp, label %inc, label %cont
+inc:
+ %add = add nsw i32 %num, 1
+ br label %cont
+cont:
+ %phi = phi i32 [ %add, %inc ], [ %num, %entry ]
+ ret i32 %phi
+}
+declare i1 @llvm.expect.i1(i1, i1)
+
+attributes #0 = { nounwind "tune-cpu"="niagara" }
; SPARC64-LABEL: cinc:
; SPARC64: ! %bb.0: ! %entry
; SPARC64-NEXT: cmp %o0, 0
; SPARC64-NEXT: bne %icc, .LBB0_2
I'm tuning this for niagara and from debugging dumps, I see that branches have been properly marked as expensive and PredictableSelectIsExpensive is false, yet the codegen still chooses branches over conditional moves.
How do I convince the codegen to emit conditional moves here?
I guess this is because the input IR contains explicit branches; llc doesn't run the CFG optimizer.
Consider rewriting this test to use a select instruction.
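A select-based variant of the test might look like the sketch below. The function body, the `%ret` name, and the `!prof` weights are illustrative assumptions, not the test that was eventually committed; the `!prof` metadata stands in for the `llvm.expect` call as a way to mark the condition as predictable.

```llvm
; Sketch only: same computation as @cinc in the diff above, expressed with a
; select so that the backend decides between a branch and a conditional move.
define i32 @cinc(i32 %cond, i32 %num) #0 {
entry:
  %cmp = icmp eq i32 %cond, 0
  %add = add nsw i32 %num, 1
  ; Assumed weights: the "true" arm (%add) is marked as rarely selected.
  %ret = select i1 %cmp, i32 %add, i32 %num, !prof !0
  ret i32 %ret
}

attributes #0 = { nounwind "tune-cpu"="niagara" }

!0 = !{!"branch_weights", i32 1, i32 2000}
```

With no explicit `br` in the input, the choice between a branch and a conditional move is left to codegen, which is where the two tunables from this patch apply.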
s-barannikov left a comment:
Setting these flags doesn't necessarily result in better codegen (in my experience).
The impact should be assessed using some benchmarks.
Ya, that's why it's still marked as WIP.
I ran pgbench on a 48-thread SPARC T2, and the patch does seem to improve performance (the only exception being the 48-thread SELECT workload).
The speedup is quite modest (around 2-3%), but given that it comes only from setting two codegen tunables, I'd say this is a good result.
2-3% is actually a huge difference (one may say too good to be true)
@@ -0,0 +1,33 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
The test should show the difference in codegen between subtargets with and without a branch predictor.
Done~
s-barannikov left a comment:
LGTM
ret i32 %ret
}

attributes #0 = { nounwind "tune-cpu"="niagara" }
(nit) Can the two tests be combined into one with two RUN lines, one of which passes -mattr=+no-predictor?
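One way to read this suggestion, as a sketch: keep a single file and let the RUN lines differ only in `-mattr=+no-predictor`. The check prefixes `PREDICTOR` and `NOPREDICTOR` are made-up names, the function body is an assumed select-based test case, and the CHECK lines are left out because they would have to be regenerated with utils/update_llc_test_checks.py.

```llvm
; Hypothetical combined test: the same input is compiled twice, once for a
; generic V9 subtarget and once with the no-predictor tuning feature enabled.
; RUN: llc < %s -mtriple=sparcv9 -mcpu=v9 | FileCheck --check-prefix=PREDICTOR %s
; RUN: llc < %s -mtriple=sparcv9 -mcpu=v9 -mattr=+no-predictor | FileCheck --check-prefix=NOPREDICTOR %s

define i32 @cinc(i32 %cond, i32 %num) nounwind {
entry:
  %cmp = icmp eq i32 %cond, 0
  %add = add nsw i32 %num, 1
  %ret = select i1 %cmp, i32 %add, i32 %num
  ret i32 %ret
}
```

Driving the feature from the command line means the function no longer needs a `"tune-cpu"` attribute, so one function body can serve both configurations.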
Early Niagara processors (T1-T3) lack any branch predictor, yet they also have a pipeline long enough that the delay slot cannot cover all of the branch latency.
This means that branch instructions stall the processor for a couple of cycles, which makes them an expensive operation. Additionally, the high cost of branching means that it's still profitable to prefer conditional moves even when the condition is predictable, so let LLVM know about both things.
On SPARC T2, a pgbench test seems to show a modest but fairly consistent speedup (up to around 3%).