-
Notifications
You must be signed in to change notification settings - Fork 10.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AArch64] Fix heuristics for folding "lsl" into load/store ops. #86894
Conversation
The existing heuristics were assuming that every core behaves like an Apple A7, where any extend/shift costs an extra micro-op... but in reality, nothing else behaves like that. On some older Cortex designs, shifts by 1 or 4 cost extra, but all other shifts/extensions are free. On all other cores, as far as I can tell, all shifts/extensions are free (i.e. the same cost as an unshifted load). To reflect this, this patch: - Enables aggressive folding of shifts into loads by default. - Removes the old AddrLSLFast feature, since it applies to everything except A7 (and even if you are explicitly targeting A7, we want to assume extensions are free because the code will almost always run on a newer core). - Adds a new feature AddrLSLSlow14 that applies specifically to the Cortex cores where shifts by 1 or 4 cost extra. I didn't add support for AddrLSLSlow14 on the GlobalISel side because it would require a bunch of refactoring to work correctly. Someone can pick this up as a followup.
@llvm/pr-subscribers-llvm-globalisel @llvm/pr-subscribers-backend-aarch64 Author: Eli Friedman (efriedma-quic) ChangesThe existing heuristics were assuming that every core behaves like an Apple A7, where any extend/shift costs an extra micro-op... but in reality, nothing else behaves like that. On some older Cortex designs, shifts by 1 or 4 cost extra, but all other shifts/extensions are free. On all other cores, as far as I can tell, all shifts/extensions are free (i.e. the same cost as an unshifted load). To reflect this, this patch:
I didn't add support for AddrLSLSlow14 on the GlobalISel side because it would require a bunch of refactoring to work correctly. Someone can pick this up as a followup. Patch is 40.28 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/86894.diff 14 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64.td b/llvm/lib/Target/AArch64/AArch64.td
index 6425aa9b091f7f..1436b6bc0ee388 100644
--- a/llvm/lib/Target/AArch64/AArch64.td
+++ b/llvm/lib/Target/AArch64/AArch64.td
@@ -391,9 +391,18 @@ def FeatureNoNegativeImmediates : SubtargetFeature<"no-neg-immediates",
"equivalent when the immediate does "
"not fit in the encoding.">;
-def FeatureAddrLSLFast : SubtargetFeature<
- "addr-lsl-fast", "HasAddrLSLFast", "true",
- "Address operands with logical shift of up to 3 places are cheap">;
+// Address operands with shift amount 2 or 3 are fast on all Arm chips except
+// some old Apple cores (A7-A10?) which handle all shifts slowly. Cortex-A57
+// and derived designs through Cortex-X1 take an extra micro-op for shifts
+// of 1 or 4. Other Arm chips handle all shifted operands at the same speed
+// as unshifted operands.
+//
+// We don't try to model the behavior of the old Apple cores because new code
+// targeting A7 is very unlikely to actually run on an A7. The Cortex cores
+// are modeled by FeatureAddrLSLSlow14.
+def FeatureAddrLSLSlow14 : SubtargetFeature<
+ "addr-lsl-slow-14", "HasAddrLSLSlow14", "true",
+ "Address operands with shift amount of 1 or 4 are slow">;
def FeatureALULSLFast : SubtargetFeature<
"alu-lsl-fast", "HasALULSLFast", "true",
@@ -885,6 +894,7 @@ def TuneA57 : SubtargetFeature<"a57", "ARMProcFamily", "CortexA57",
FeatureBalanceFPOps,
FeatureFuseAdrpAdd,
FeatureFuseLiterals,
+ FeatureAddrLSLSlow14,
FeaturePostRAScheduler,
FeatureEnableSelectOptimize,
FeaturePredictableSelectIsExpensive]>;
@@ -903,6 +913,7 @@ def TuneA72 : SubtargetFeature<"a72", "ARMProcFamily", "CortexA72",
FeatureFuseAES,
FeatureFuseAdrpAdd,
FeatureFuseLiterals,
+ FeatureAddrLSLSlow14,
FeatureEnableSelectOptimize,
FeaturePredictableSelectIsExpensive]>;
@@ -910,6 +921,7 @@ def TuneA73 : SubtargetFeature<"a73", "ARMProcFamily", "CortexA73",
"Cortex-A73 ARM processors", [
FeatureFuseAES,
FeatureFuseAdrpAdd,
+ FeatureAddrLSLSlow14,
FeatureEnableSelectOptimize,
FeaturePredictableSelectIsExpensive]>;
@@ -917,6 +929,7 @@ def TuneA75 : SubtargetFeature<"a75", "ARMProcFamily", "CortexA75",
"Cortex-A75 ARM processors", [
FeatureFuseAES,
FeatureFuseAdrpAdd,
+ FeatureAddrLSLSlow14,
FeatureEnableSelectOptimize,
FeaturePredictableSelectIsExpensive]>;
@@ -924,7 +937,7 @@ def TuneA76 : SubtargetFeature<"a76", "ARMProcFamily", "CortexA76",
"Cortex-A76 ARM processors", [
FeatureFuseAES,
FeatureFuseAdrpAdd,
- FeatureAddrLSLFast,
+ FeatureAddrLSLSlow14,
FeatureALULSLFast,
FeatureEnableSelectOptimize,
FeaturePredictableSelectIsExpensive]>;
@@ -934,7 +947,7 @@ def TuneA77 : SubtargetFeature<"a77", "ARMProcFamily", "CortexA77",
FeatureCmpBccFusion,
FeatureFuseAES,
FeatureFuseAdrpAdd,
- FeatureAddrLSLFast,
+ FeatureAddrLSLSlow14,
FeatureALULSLFast,
FeatureEnableSelectOptimize,
FeaturePredictableSelectIsExpensive]>;
@@ -944,7 +957,7 @@ def TuneA78 : SubtargetFeature<"a78", "ARMProcFamily", "CortexA78",
FeatureCmpBccFusion,
FeatureFuseAES,
FeatureFuseAdrpAdd,
- FeatureAddrLSLFast,
+ FeatureAddrLSLSlow14,
FeatureALULSLFast,
FeaturePostRAScheduler,
FeatureEnableSelectOptimize,
@@ -956,7 +969,7 @@ def TuneA78AE : SubtargetFeature<"a78ae", "ARMProcFamily",
FeatureCmpBccFusion,
FeatureFuseAES,
FeatureFuseAdrpAdd,
- FeatureAddrLSLFast,
+ FeatureAddrLSLSlow14,
FeatureALULSLFast,
FeaturePostRAScheduler,
FeatureEnableSelectOptimize,
@@ -968,7 +981,7 @@ def TuneA78C : SubtargetFeature<"a78c", "ARMProcFamily",
FeatureCmpBccFusion,
FeatureFuseAES,
FeatureFuseAdrpAdd,
- FeatureAddrLSLFast,
+ FeatureAddrLSLSlow14,
FeatureALULSLFast,
FeaturePostRAScheduler,
FeatureEnableSelectOptimize,
@@ -979,7 +992,6 @@ def TuneA710 : SubtargetFeature<"a710", "ARMProcFamily", "CortexA710",
FeatureCmpBccFusion,
FeatureFuseAES,
FeatureFuseAdrpAdd,
- FeatureAddrLSLFast,
FeatureALULSLFast,
FeaturePostRAScheduler,
FeatureEnableSelectOptimize,
@@ -990,7 +1002,6 @@ def TuneA715 : SubtargetFeature<"a715", "ARMProcFamily", "CortexA715",
FeatureFuseAES,
FeaturePostRAScheduler,
FeatureCmpBccFusion,
- FeatureAddrLSLFast,
FeatureALULSLFast,
FeatureFuseAdrpAdd,
FeatureEnableSelectOptimize,
@@ -1001,7 +1012,6 @@ def TuneA720 : SubtargetFeature<"a720", "ARMProcFamily", "CortexA720",
FeatureFuseAES,
FeaturePostRAScheduler,
FeatureCmpBccFusion,
- FeatureAddrLSLFast,
FeatureALULSLFast,
FeatureFuseAdrpAdd,
FeatureEnableSelectOptimize,
@@ -1012,7 +1022,6 @@ def TuneA720AE : SubtargetFeature<"a720ae", "ARMProcFamily", "CortexA720",
FeatureFuseAES,
FeaturePostRAScheduler,
FeatureCmpBccFusion,
- FeatureAddrLSLFast,
FeatureALULSLFast,
FeatureFuseAdrpAdd,
FeatureEnableSelectOptimize,
@@ -1028,7 +1037,7 @@ def TuneX1 : SubtargetFeature<"cortex-x1", "ARMProcFamily", "CortexX1",
FeatureCmpBccFusion,
FeatureFuseAES,
FeatureFuseAdrpAdd,
- FeatureAddrLSLFast,
+ FeatureAddrLSLSlow14,
FeatureALULSLFast,
FeaturePostRAScheduler,
FeatureEnableSelectOptimize,
@@ -1039,7 +1048,6 @@ def TuneX2 : SubtargetFeature<"cortex-x2", "ARMProcFamily", "CortexX2",
FeatureCmpBccFusion,
FeatureFuseAES,
FeatureFuseAdrpAdd,
- FeatureAddrLSLFast,
FeatureALULSLFast,
FeaturePostRAScheduler,
FeatureEnableSelectOptimize,
@@ -1047,7 +1055,6 @@ def TuneX2 : SubtargetFeature<"cortex-x2", "ARMProcFamily", "CortexX2",
def TuneX3 : SubtargetFeature<"cortex-x3", "ARMProcFamily", "CortexX3",
"Cortex-X3 ARM processors", [
- FeatureAddrLSLFast,
FeatureALULSLFast,
FeatureFuseAdrpAdd,
FeatureFuseAES,
@@ -1057,7 +1064,6 @@ def TuneX3 : SubtargetFeature<"cortex-x3", "ARMProcFamily", "CortexX3",
def TuneX4 : SubtargetFeature<"cortex-x4", "ARMProcFamily", "CortexX4",
"Cortex-X4 ARM processors", [
- FeatureAddrLSLFast,
FeatureALULSLFast,
FeatureFuseAdrpAdd,
FeatureFuseAES,
@@ -1215,7 +1221,6 @@ def TuneExynosM3 : SubtargetFeature<"exynosm3", "ARMProcFamily", "ExynosM3",
FeatureFuseAdrpAdd,
FeatureFuseLiterals,
FeatureStorePairSuppress,
- FeatureAddrLSLFast,
FeatureALULSLFast,
FeaturePostRAScheduler,
FeaturePredictableSelectIsExpensive]>;
@@ -1234,7 +1239,6 @@ def TuneExynosM4 : SubtargetFeature<"exynosm4", "ARMProcFamily", "ExynosM3",
FeatureFuseAdrpAdd,
FeatureFuseLiterals,
FeatureStorePairSuppress,
- FeatureAddrLSLFast,
FeatureALULSLFast,
FeaturePostRAScheduler,
FeatureZCZeroing]>;
@@ -1244,7 +1248,6 @@ def TuneKryo : SubtargetFeature<"kryo", "ARMProcFamily", "Kryo",
FeaturePostRAScheduler,
FeaturePredictableSelectIsExpensive,
FeatureZCZeroing,
- FeatureAddrLSLFast,
FeatureALULSLFast,
FeatureStorePairSuppress]>;
@@ -1254,7 +1257,6 @@ def TuneFalkor : SubtargetFeature<"falkor", "ARMProcFamily", "Falkor",
FeaturePredictableSelectIsExpensive,
FeatureZCZeroing,
FeatureStorePairSuppress,
- FeatureAddrLSLFast,
FeatureALULSLFast,
FeatureSlowSTRQro]>;
@@ -1268,7 +1270,6 @@ def TuneNeoverseN1 : SubtargetFeature<"neoversen1", "ARMProcFamily", "NeoverseN1
"Neoverse N1 ARM processors", [
FeatureFuseAES,
FeatureFuseAdrpAdd,
- FeatureAddrLSLFast,
FeatureALULSLFast,
FeaturePostRAScheduler,
FeatureEnableSelectOptimize,
@@ -1278,7 +1279,6 @@ def TuneNeoverseN2 : SubtargetFeature<"neoversen2", "ARMProcFamily", "NeoverseN2
"Neoverse N2 ARM processors", [
FeatureFuseAES,
FeatureFuseAdrpAdd,
- FeatureAddrLSLFast,
FeatureALULSLFast,
FeaturePostRAScheduler,
FeatureEnableSelectOptimize,
@@ -1288,7 +1288,6 @@ def TuneNeoverse512TVB : SubtargetFeature<"neoverse512tvb", "ARMProcFamily", "Ne
"Neoverse 512-TVB ARM processors", [
FeatureFuseAES,
FeatureFuseAdrpAdd,
- FeatureAddrLSLFast,
FeatureALULSLFast,
FeaturePostRAScheduler,
FeatureEnableSelectOptimize,
@@ -1298,7 +1297,6 @@ def TuneNeoverseV1 : SubtargetFeature<"neoversev1", "ARMProcFamily", "NeoverseV1
"Neoverse V1 ARM processors", [
FeatureFuseAES,
FeatureFuseAdrpAdd,
- FeatureAddrLSLFast,
FeatureALULSLFast,
FeaturePostRAScheduler,
FeatureEnableSelectOptimize,
@@ -1309,7 +1307,6 @@ def TuneNeoverseV2 : SubtargetFeature<"neoversev2", "ARMProcFamily", "NeoverseV2
"Neoverse V2 ARM processors", [
FeatureFuseAES,
FeatureFuseAdrpAdd,
- FeatureAddrLSLFast,
FeatureALULSLFast,
FeaturePostRAScheduler,
FeatureEnableSelectOptimize,
@@ -1321,7 +1318,6 @@ def TuneSaphira : SubtargetFeature<"saphira", "ARMProcFamily", "Saphira",
FeaturePredictableSelectIsExpensive,
FeatureZCZeroing,
FeatureStorePairSuppress,
- FeatureAddrLSLFast,
FeatureALULSLFast]>;
def TuneThunderX2T99 : SubtargetFeature<"thunderx2t99", "ARMProcFamily", "ThunderX2T99",
@@ -1381,7 +1377,6 @@ def TuneAmpere1 : SubtargetFeature<"ampere1", "ARMProcFamily", "Ampere1",
FeaturePostRAScheduler,
FeatureFuseAES,
FeatureFuseAdrpAdd,
- FeatureAddrLSLFast,
FeatureALULSLFast,
FeatureAggressiveFMA,
FeatureArithmeticBccFusion,
@@ -1397,7 +1392,6 @@ def TuneAmpere1A : SubtargetFeature<"ampere1a", "ARMProcFamily", "Ampere1A",
FeaturePostRAScheduler,
FeatureFuseAES,
FeatureFuseAdrpAdd,
- FeatureAddrLSLFast,
FeatureALULSLFast,
FeatureAggressiveFMA,
FeatureArithmeticBccFusion,
@@ -1414,7 +1408,6 @@ def TuneAmpere1B : SubtargetFeature<"ampere1b", "ARMProcFamily", "Ampere1B",
FeaturePostRAScheduler,
FeatureFuseAES,
FeatureFuseAdrpAdd,
- FeatureAddrLSLFast,
FeatureALULSLFast,
FeatureAggressiveFMA,
FeatureArithmeticBccFusion,
diff --git a/llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp b/llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
index 163ed520a8a677..51bec3604026b0 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
@@ -462,7 +462,7 @@ class AArch64DAGToDAGISel : public SelectionDAGISel {
SDValue &Offset, SDValue &SignExtend,
SDValue &DoShift);
bool isWorthFoldingALU(SDValue V, bool LSL = false) const;
- bool isWorthFoldingAddr(SDValue V) const;
+ bool isWorthFoldingAddr(SDValue V, unsigned Size) const;
bool SelectExtendedSHL(SDValue N, unsigned Size, bool WantExtend,
SDValue &Offset, SDValue &SignExtend);
@@ -674,17 +674,22 @@ static bool isWorthFoldingSHL(SDValue V) {
/// Determine whether it is worth to fold V into an extended register addressing
/// mode.
-bool AArch64DAGToDAGISel::isWorthFoldingAddr(SDValue V) const {
+bool AArch64DAGToDAGISel::isWorthFoldingAddr(SDValue V, unsigned Size) const {
// Trivial if we are optimizing for code size or if there is only
// one use of the value.
if (CurDAG->shouldOptForSize() || V.hasOneUse())
return true;
- // If a subtarget has a fastpath LSL we can fold a logical shift into
- // the addressing mode and save a cycle.
- if (Subtarget->hasAddrLSLFast() && V.getOpcode() == ISD::SHL &&
- isWorthFoldingSHL(V))
+
+ // If a subtarget has a slow shift, folding a shift into multiple loads
+ // costs additional micro-ops.
+ if (Subtarget->hasAddrLSLSlow14() && (Size == 2 || Size == 16))
+ return false;
+
+ // Check whether we're going to emit the address arithmetic anyway because
+ // it's used by a non-address operation.
+ if (V.getOpcode() == ISD::SHL && isWorthFoldingSHL(V))
return true;
- if (Subtarget->hasAddrLSLFast() && V.getOpcode() == ISD::ADD) {
+ if (V.getOpcode() == ISD::ADD) {
const SDValue LHS = V.getOperand(0);
const SDValue RHS = V.getOperand(1);
if (LHS.getOpcode() == ISD::SHL && isWorthFoldingSHL(LHS))
@@ -1203,7 +1208,7 @@ bool AArch64DAGToDAGISel::SelectExtendedSHL(SDValue N, unsigned Size,
if (ShiftVal != 0 && ShiftVal != LegalShiftVal)
return false;
- return isWorthFoldingAddr(N);
+ return isWorthFoldingAddr(N, Size);
}
bool AArch64DAGToDAGISel::SelectAddrModeWRO(SDValue N, unsigned Size,
@@ -1231,7 +1236,7 @@ bool AArch64DAGToDAGISel::SelectAddrModeWRO(SDValue N, unsigned Size,
}
// Remember if it is worth folding N when it produces extended register.
- bool IsExtendedRegisterWorthFolding = isWorthFoldingAddr(N);
+ bool IsExtendedRegisterWorthFolding = isWorthFoldingAddr(N, Size);
// Try to match a shifted extend on the RHS.
if (IsExtendedRegisterWorthFolding && RHS.getOpcode() == ISD::SHL &&
@@ -1261,7 +1266,7 @@ bool AArch64DAGToDAGISel::SelectAddrModeWRO(SDValue N, unsigned Size,
Offset = narrowIfNeeded(CurDAG, LHS.getOperand(0));
SignExtend = CurDAG->getTargetConstant(Ext == AArch64_AM::SXTW, dl,
MVT::i32);
- if (isWorthFoldingAddr(LHS))
+ if (isWorthFoldingAddr(LHS, Size))
return true;
}
@@ -1273,7 +1278,7 @@ bool AArch64DAGToDAGISel::SelectAddrModeWRO(SDValue N, unsigned Size,
Offset = narrowIfNeeded(CurDAG, RHS.getOperand(0));
SignExtend = CurDAG->getTargetConstant(Ext == AArch64_AM::SXTW, dl,
MVT::i32);
- if (isWorthFoldingAddr(RHS))
+ if (isWorthFoldingAddr(RHS, Size))
return true;
}
@@ -1343,7 +1348,7 @@ bool AArch64DAGToDAGISel::SelectAddrModeXRO(SDValue N, unsigned Size,
}
// Remember if it is worth folding N when it produces extended register.
- bool IsExtendedRegisterWorthFolding = isWorthFoldingAddr(N);
+ bool IsExtendedRegisterWorthFolding = isWorthFoldingAddr(N, Size);
// Try to match a shifted extend on the RHS.
if (IsExtendedRegisterWorthFolding && RHS.g...
[truncated]
|
I looked at some of the older reviews, and rechecked my research... a few more notes:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into this, it is one thing off my todo list.
It turns out the current commit message isn't precisely right. Current Cortex cores (X2 and later) have free shift by 1 for integer loads... but no free shift by 1/4 for floating-point loads. Not sure if it's worth explicitly modeling int-shift vs. float-shift.
That might be an inaccuracy in the optimization guide more than a difference between int/fp. lsl #4 are still an extra operation, but I do not believe they come up very often.
https://reviews.llvm.org/D155470#4527270 suggests that we should default to AddrLSLSlow14... I'm not sure if that's the right choice. An explicit shift is guaranteed to increase latency, but an extra integer micro-op generated by a folded shift might not matter in a lot of cases.
Yeah I agree - with more new cores having faster #1 shifts I think it should be OK as the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGMT, so long as the tests are OK. Thanks
Tests did not go well, though it seems to be no fault of this PR. Rebase and try again @efriedma-quic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for the Apple cores, thanks!
The existing heuristics were assuming that every core behaves like an Apple A7, where any extend/shift costs an extra micro-op... but in reality, nothing else behaves like that.
On some older Cortex designs, shifts by 1 or 4 cost extra, but all other shifts/extensions are free. On all other cores, as far as I can tell, all shifts/extensions for integer loads are free (i.e. the same cost as an unshifted load).
To reflect this, this patch:
Enables aggressive folding of shifts into loads by default.
Removes the old AddrLSLFast feature, since it applies to everything except A7 (and even if you are explicitly targeting A7, we want to assume extensions are free because the code will almost always run on a newer core).
Adds a new feature AddrLSLSlow14 that applies specifically to the Cortex cores where shifts by 1 or 4 cost extra.
I didn't add support for AddrLSLSlow14 on the GlobalISel side because it would require a bunch of refactoring to work correctly. Someone can pick this up as a followup.