[X86][CostModel] Add per-shape gather/scatter cost tables for AMD znver4+ #199488
amd-subharad wants to merge 4 commits into
Hello @amd-subharad 👋 Thank you for submitting a Pull Request (PR) to the LLVM Project. Since this is your first PR, here are a few useful links covering our main contribution policies and review practices.
Please reply to this message to confirm that you have read these policies, especially the LLVM AI Tool Use Policy, and that any AI tool usage has been noted in the PR description.
Frequently asked questions
How do I add reviewers?
This PR will be automatically labeled, and the relevant teams will be notified. For some parts of the project, reviewers may also be added automatically. You can also add reviewers manually using the Reviewers section on this page. If you cannot use that section, it is probably because you do not have write permissions for the repository. In that case, you can request a review by tagging reviewers in a comment using
What if there are no comments?
If you have not received any comments on your PR after a week, you can request a review by pinging the PR with a comment such as “Ping”. The common courtesy ping rate is once a week. Please remember that you are asking for volunteer time from other developers.
Are any special GitHub settings required to contribute to LLVM?
We only require contributors to have a public email address associated with their GitHub commits, see this section of LLVM Developer Policy for details.
If you have questions, feel free to leave a comment on this PR, or ask on LLVM Discord or LLVM Discourse.
Thank you,
|
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-llvm-analysis
Author: Sumukh J Bharadwaj (amd-subharad)
Changes
Summary
The X86 cost model currently returns a single flat overhead from
This PR adds a subtarget tuning bit,
When the tuning bit is set,
Cost tables
Gather (VF=2..16 over i32 / f32 / f64; i64 falls through to the flat default):
Scatter (VF=4..16 over i32 / f32 / f64; i64 and VF=2 fall through):
Methodology
The numbers in both tables are empirical break-even costs measured on znver4 / znver5 hardware:
The scatter table is keyed independently for 32-bit (i32 / f32) and 64-bit (f64) lanes because the sweep results diverged on Zen hardware: 64-bit scatter break-even is consistently lower than 32-bit scatter break-even at the same VF.
Test
Benchmark validation
Measured on a Ryzen 9 9950X (Zen 5), community LLVM trunk at the parent of this commit vs. the commit. Numbers are SPEC rate medians across K=3 iterations, single-copy ref:
Biggest individual movers:
Non-goals
Test plan
Patch is 44.80 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/199488.diff 5 Files Affected:
diff --git a/llvm/docs/ReleaseNotes.md b/llvm/docs/ReleaseNotes.md
index fffd696e59baf..df5e17a27d26c 100644
--- a/llvm/docs/ReleaseNotes.md
+++ b/llvm/docs/ReleaseNotes.md
@@ -220,6 +220,11 @@ Makes programs 10x faster by doing Special New Thing.
* `.att_syntax` directive is now emitted for assembly files when AT&T syntax is
in use. This matches the behaviour of Intel syntax and aids with
compatibility when changing the default Clang syntax to the Intel syntax.
+* Masked gather and scatter cost overheads are now per-shape on AMD znver4
+ and znver5 targets via a new `TuningPreferAMDZenGSCost` subtarget
+ feature, replacing the single flat overhead inherited from the generic
+ AVX-512 path. The per-shape costs use empirical break-even values
+ measured on Zen 4 / Zen 5 hardware.
### Changes to the OCaml bindings
diff --git a/llvm/lib/Target/X86/X86.td b/llvm/lib/Target/X86/X86.td
index 50fb7204ebfa1..28bbd639649bb 100644
--- a/llvm/lib/Target/X86/X86.td
+++ b/llvm/lib/Target/X86/X86.td
@@ -721,6 +721,17 @@ def TuningFastGather
: SubtargetFeature<"fast-gather", "HasFastGather", "true",
"Indicates if gather is reasonably fast (this is true for Skylake client and all AVX-512 CPUs)">;
+// Use AMD Zen-tuned cost tables for masked gather/scatter intrinsics in the
+// X86 TargetTransformInfo cost model. Refines the flat overhead used by other
+// AVX-512 targets with per-element-type/per-VL costs measured on znver4 and
+// znver5. Inherited automatically by every znver4+ CPU via ZN4Tuning; not
+// applied to pre-AVX-512 Zen parts (znver1..3), which take the scalarise
+// path for masked gather anyway.
+def TuningPreferAMDZenGSCost
+ : SubtargetFeature<"prefer-amd-zen-gs-cost",
+ "HasPreferAMDZenGSCost", "true",
+ "Use AMD Zen-tuned gather/scatter cost tables in the cost model">;
+
// Generate vpdpwssd instead of vpmaddwd+vpaddd sequence.
def TuningFastDPWSSD
: SubtargetFeature<
@@ -1631,7 +1642,8 @@ def ProcessorFeatures {
list<SubtargetFeature> ZN3Features =
!listconcat(ZN2Features, ZN3AdditionalFeatures);
- list<SubtargetFeature> ZN4AdditionalTuning = [TuningFastDPWSSD];
+ list<SubtargetFeature> ZN4AdditionalTuning = [TuningFastDPWSSD,
+ TuningPreferAMDZenGSCost];
list<SubtargetFeature> ZN4Tuning =
!listconcat(ZN3Tuning, ZN4AdditionalTuning);
list<SubtargetFeature> ZN4AdditionalFeatures = [FeatureAVX512,
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
index 698be1615a04b..edc8e78c7f040 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -6253,20 +6253,79 @@ InstructionCost X86TTIImpl::getCFInstrCost(unsigned Opcode,
return TTI::TCC_Free;
}
-int X86TTIImpl::getGatherOverhead() const {
+int X86TTIImpl::getGatherOverhead(Type *SrcVTy) const {
// Some CPUs have more overhead for gather. The specified overhead is relative
// to the Load operation. "2" is the number provided by Intel architects. This
// parameter is used for cost estimation of Gather Op and comparison with
// other alternatives.
// TODO: Remove the explicit hasAVX512()?, That would mean we would only
// enable gather with a -march.
+
+ // AMD znver4+ targets enable per-shape costs measured on the hardware via
+ // TuningPreferAMDZenGSCost (set in ZN4Tuning). Pre-AVX-512 Zen parts
+ // (znver1..3) take the scalarise path for masked gather and never reach
+ // this code, so the table only needs to cover AVX-512 widths.
+ if (ST->hasPreferAMDZenGSCost() && SrcVTy) {
+ // Per-shape gather costs for AMD znver4+ targets.
+ //
+ // The numbers are the empirical "break-even" (lower-bound) costs
+ // measured by sweeping a forced gather cost while compiling a
+ // controlled gather micro-benchmark and observing the point at which
+ // the LoopVectorizer still chose the gather lowering over the scalar
+ // fallback. The sweep was run independently for every (data type,
+ // VF) combination on Genoa / Milan / Turin and re-validated on Zen 5;
+ // the value tabulated below is the cost at which gather emission
+ // was the right call for that shape.
+ //
+ // i64 entries are intentionally absent: the i64 sweep landed within
+ // the noise of the generic flat overhead, so those shapes fall
+ // through to the existing flat cost.
+ static const CostTblEntry ZenGatherCostTable[] = {
+ {ISD::LOAD, MVT::v2i32, 20}, {ISD::LOAD, MVT::v4i32, 7},
+ {ISD::LOAD, MVT::v8i32, 17}, {ISD::LOAD, MVT::v16i32, 14},
+ {ISD::LOAD, MVT::v2f32, 20}, {ISD::LOAD, MVT::v4f32, 7},
+ {ISD::LOAD, MVT::v8f32, 17}, {ISD::LOAD, MVT::v16f32, 14},
+ {ISD::LOAD, MVT::v2f64, 20}, {ISD::LOAD, MVT::v4f64, 7},
+ {ISD::LOAD, MVT::v8f64, 17}, {ISD::LOAD, MVT::v16f64, 14},
+ };
+ EVT VT = TLI->getValueType(DL, SrcVTy);
+ if (VT.isSimple())
+ if (const auto *E = CostTableLookup(ZenGatherCostTable, ISD::LOAD,
+ VT.getSimpleVT()))
+ return E->Cost;
+ }
+
if (ST->hasAVX512() || (ST->hasAVX2() && ST->hasFastGather()))
return 2;
return 1024;
}
-int X86TTIImpl::getScatterOverhead() const {
+int X86TTIImpl::getScatterOverhead(Type *SrcVTy) const {
+ // AMD znver4+ targets use per-shape scatter costs measured on the hardware
+ // via TuningPreferAMDZenGSCost (set in ZN4Tuning). Fall through to the
+ // generic flat overhead for shapes we have not characterised.
+ if (ST->hasPreferAMDZenGSCost() && ST->hasAVX512() && SrcVTy) {
+ // Per-shape scatter costs for AMD znver4+ targets, measured with the
+ // same break-even methodology as the gather table above. i32 / f32
+ // and f64 lanes use independent curves because their sweep results
+ // diverged on Zen hardware. i64 entries and VF=2 entries are
+ // intentionally absent and fall through to the generic flat overhead.
+ static const CostTblEntry ZenScatterCostTable[] = {
+ {ISD::STORE, MVT::v4i32, 12}, {ISD::STORE, MVT::v8i32, 14},
+ {ISD::STORE, MVT::v16i32, 6},
+ {ISD::STORE, MVT::v4f32, 12}, {ISD::STORE, MVT::v8f32, 14},
+ {ISD::STORE, MVT::v16f32, 16},
+ {ISD::STORE, MVT::v4f64, 5}, {ISD::STORE, MVT::v8f64, 15},
+ {ISD::STORE, MVT::v16f64, 3},
+ };
+ EVT VT = TLI->getValueType(DL, SrcVTy);
+ if (VT.isSimple())
+ if (const auto *E = CostTableLookup(ZenScatterCostTable, ISD::STORE,
+ VT.getSimpleVT()))
+ return E->Cost;
+ }
+
if (ST->hasAVX512())
return 2;
@@ -6338,8 +6397,9 @@ InstructionCost X86TTIImpl::getGSVectorCost(unsigned Opcode,
// The gather / scatter cost is given by Intel architects. It is a rough
// number since we are looking at one instruction in a time.
- const int GSOverhead = (Opcode == Instruction::Load) ? getGatherOverhead()
- : getScatterOverhead();
+ const int GSOverhead = (Opcode == Instruction::Load)
+ ? getGatherOverhead(SrcVTy)
+ : getScatterOverhead(SrcVTy);
return GSOverhead + VF * getMemoryOpCost(Opcode, SrcVTy->getScalarType(),
Alignment, AddressSpace, CostKind);
}
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.h b/llvm/lib/Target/X86/X86TargetTransformInfo.h
index ea277bfeab560..ceb6dcc172f94 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.h
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.h
@@ -346,8 +346,8 @@ class X86TTIImpl final : public BasicTTIImplBase<X86TTIImpl> {
Type *DataTy, const Value *Ptr,
Align Alignment, unsigned AddressSpace) const;
- int getGatherOverhead() const;
- int getScatterOverhead() const;
+ int getGatherOverhead(Type *SrcVTy) const;
+ int getScatterOverhead(Type *SrcVTy) const;
/// @}
};
diff --git a/llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll b/llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll
new file mode 100644
index 0000000000000..1565568d8d010
--- /dev/null
+++ b/llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll
@@ -0,0 +1,553 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py
+; Cost-model coverage for AMD Zen-tuned masked gather/scatter overheads.
+;
+; ZNVER4 / ZNVER5 enable the per-shape Zen cost tables via
+; TuningPreferAMDZenGSCost (set in ZN4Tuning and inherited by ZN5Tuning) and
+; have AVX-512, so the new tables are consulted in getGSVectorCost.
+; ZNVER3 does NOT carry TuningPreferAMDZenGSCost and lacks both AVX-512 and
+; TuningFastGather, so isLegalMaskedGather() returns false and the cost model
+; walks the scalarise path (getGSScalarCost). The ZNVER3 numbers below are the
+; unchanged scalar fallback cost, included here only to lock in that this
+; change does not regress pre-AVX-512 Zen targets.
+; SKX is a non-Zen AVX-512 baseline showing the generic flat overhead of 2.
+;
+; RUN: opt < %s -S -mtriple=x86_64-unknown-linux-gnu -passes="print<cost-model>" 2>&1 -disable-output -cost-kind=throughput -mcpu=znver4 | FileCheck %s --check-prefix=ZNVER4
+; RUN: opt < %s -S -mtriple=x86_64-unknown-linux-gnu -passes="print<cost-model>" 2>&1 -disable-output -cost-kind=throughput -mcpu=znver5 | FileCheck %s --check-prefix=ZNVER5
+; RUN: opt < %s -S -mtriple=x86_64-unknown-linux-gnu -passes="print<cost-model>" 2>&1 -disable-output -cost-kind=throughput -mcpu=znver3 | FileCheck %s --check-prefix=ZNVER3
+; RUN: opt < %s -S -mtriple=x86_64-unknown-linux-gnu -passes="print<cost-model>" 2>&1 -disable-output -cost-kind=throughput -mcpu=skx | FileCheck %s --check-prefix=SKX
+
+;------------------------------------------------------------------------------
+; Masked gather - i32 element type
+;------------------------------------------------------------------------------
+
+define <2 x i32> @gather_v2i32(<2 x ptr> %ptrs, <2 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v2i32'
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %v = call <2 x i32> @llvm.masked.gather.v2i32.v2p0(<2 x ptr> align 4 %ptrs, <2 x i1> %mask, <2 x i32> undef)
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %v
+;
+; ZNVER5-LABEL: 'gather_v2i32'
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %v = call <2 x i32> @llvm.masked.gather.v2i32.v2p0(<2 x ptr> align 4 %ptrs, <2 x i1> %mask, <2 x i32> undef)
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %v
+;
+; ZNVER3-LABEL: 'gather_v2i32'
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %v = call <2 x i32> @llvm.masked.gather.v2i32.v2p0(<2 x ptr> align 4 %ptrs, <2 x i1> %mask, <2 x i32> undef)
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %v
+;
+; SKX-LABEL: 'gather_v2i32'
+; SKX-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %v = call <2 x i32> @llvm.masked.gather.v2i32.v2p0(<2 x ptr> align 4 %ptrs, <2 x i1> %mask, <2 x i32> undef)
+; SKX-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %v
+;
+ %v = call <2 x i32> @llvm.masked.gather.v2i32.v2p0(<2 x ptr> %ptrs, i32 4, <2 x i1> %mask, <2 x i32> undef)
+ ret <2 x i32> %v
+}
+
+define <4 x i32> @gather_v4i32(<4 x ptr> %ptrs, <4 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v4i32'
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 11 for instruction: %v = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> align 4 %ptrs, <4 x i1> %mask, <4 x i32> undef)
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i32> %v
+;
+; ZNVER5-LABEL: 'gather_v4i32'
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 11 for instruction: %v = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> align 4 %ptrs, <4 x i1> %mask, <4 x i32> undef)
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i32> %v
+;
+; ZNVER3-LABEL: 'gather_v4i32'
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %v = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> align 4 %ptrs, <4 x i1> %mask, <4 x i32> undef)
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i32> %v
+;
+; SKX-LABEL: 'gather_v4i32'
+; SKX-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %v = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> align 4 %ptrs, <4 x i1> %mask, <4 x i32> undef)
+; SKX-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i32> %v
+;
+ %v = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> %ptrs, i32 4, <4 x i1> %mask, <4 x i32> undef)
+ ret <4 x i32> %v
+}
+
+define <8 x i32> @gather_v8i32(<8 x ptr> %ptrs, <8 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v8i32'
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 25 for instruction: %v = call <8 x i32> @llvm.masked.gather.v8i32.v8p0(<8 x ptr> align 4 %ptrs, <8 x i1> %mask, <8 x i32> undef)
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i32> %v
+;
+; ZNVER5-LABEL: 'gather_v8i32'
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 25 for instruction: %v = call <8 x i32> @llvm.masked.gather.v8i32.v8p0(<8 x ptr> align 4 %ptrs, <8 x i1> %mask, <8 x i32> undef)
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i32> %v
+;
+; ZNVER3-LABEL: 'gather_v8i32'
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 28 for instruction: %v = call <8 x i32> @llvm.masked.gather.v8i32.v8p0(<8 x ptr> align 4 %ptrs, <8 x i1> %mask, <8 x i32> undef)
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i32> %v
+;
+; SKX-LABEL: 'gather_v8i32'
+; SKX-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %v = call <8 x i32> @llvm.masked.gather.v8i32.v8p0(<8 x ptr> align 4 %ptrs, <8 x i1> %mask, <8 x i32> undef)
+; SKX-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i32> %v
+;
+ %v = call <8 x i32> @llvm.masked.gather.v8i32.v8p0(<8 x ptr> %ptrs, i32 4, <8 x i1> %mask, <8 x i32> undef)
+ ret <8 x i32> %v
+}
+
+define <16 x i32> @gather_v16i32(<16 x ptr> %ptrs, <16 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v16i32'
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 50 for instruction: %v = call <16 x i32> @llvm.masked.gather.v16i32.v16p0(<16 x ptr> align 4 %ptrs, <16 x i1> %mask, <16 x i32> undef)
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <16 x i32> %v
+;
+; ZNVER5-LABEL: 'gather_v16i32'
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 50 for instruction: %v = call <16 x i32> @llvm.masked.gather.v16i32.v16p0(<16 x ptr> align 4 %ptrs, <16 x i1> %mask, <16 x i32> undef)
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <16 x i32> %v
+;
+; ZNVER3-LABEL: 'gather_v16i32'
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 55 for instruction: %v = call <16 x i32> @llvm.masked.gather.v16i32.v16p0(<16 x ptr> align 4 %ptrs, <16 x i1> %mask, <16 x i32> undef)
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <16 x i32> %v
+;
+; SKX-LABEL: 'gather_v16i32'
+; SKX-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %v = call <16 x i32> @llvm.masked.gather.v16i32.v16p0(<16 x ptr> align 4 %ptrs, <16 x i1> %mask, <16 x i32> undef)
+; SKX-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <16 x i32> %v
+;
+ %v = call <16 x i32> @llvm.masked.gather.v16i32.v16p0(<16 x ptr> %ptrs, i32 4, <16 x i1> %mask, <16 x i32> undef)
+ ret <16 x i32> %v
+}
+
+;------------------------------------------------------------------------------
+; Masked gather - i64 element type
+;------------------------------------------------------------------------------
+
+define <2 x i64> @gather_v2i64(<2 x ptr> %ptrs, <2 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v2i64'
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %v = call <2 x i64> @llvm.masked.gather.v2i64.v2p0(<2 x ptr> align 8 %ptrs, <2 x i1> %mask, <2 x i64> undef)
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i64> %v
+;
+; ZNVER5-LABEL: 'gather_v2i64'
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %v = call <2 x i64> @llvm.masked.gather.v2i64.v2p0(<2 x ptr> align 8 %ptrs, <2 x i1> %mask, <2 x i64> undef)
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i64> %v
+;
+; ZNVER3-LABEL: 'gather_v2i64'
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 7 for instruction: %v = call <2 x i64> @llvm.masked.gather.v2i64.v2p0(<2 x ptr> align 8 %ptrs, <2 x i1> %mask, <2 x i64> undef)
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i64> %v
+;
+; SKX-LABEL: 'gather_v2i64'
+; SKX-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %v = call <2 x i64> @llvm.masked.gather.v2i64.v2p0(<2 x ptr> align 8 %ptrs, <2 x i1> %mask, <2 x i64> undef)
+; SKX-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i64> %v
+;
+ %v = call <2 x i64> @llvm.masked.gather.v2i64.v2p0(<2 x ptr> %ptrs, i32 8, <2 x i1> %mask, <2 x i64> undef)
+ ret <2 x i64> %v
+}
+
+define <4 x i64> @gather_v4i64(<4 x ptr> %ptrs, <4 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v4i64'
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %v = call <4 x i64> @llvm.masked.gather.v4i64.v4p0(<4 x ptr> align 8 %ptrs, <4 x i1> %mask, <4 x i64> undef)
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i64> %v
+;
+; ZNVER5-LABEL: 'gather_v4i64'
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %v = call <4 x i64> @llvm.masked.gather.v4i64.v4p0(<4 x ptr> align 8 %ptrs, <4 x i1> %mask, <4 x i64> undef)
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i64> %v
+;
+; ZNVER3-LABEL: 'gather_v4i64'
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 15 for instruction: %v = call <4 x i64> @llvm.masked.gather.v4i64.v4p0(<4 x ptr> align 8 %ptrs, <4 x i1> %mask, <4 x i64> undef)
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i64> %v
+;
+; SKX-LABEL: 'gather_v4i64'
+; SKX-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %v = call <4 x i64> @llvm.masked.gather.v4i64.v4p0(<4 x ptr> align 8 %ptrs, <4 x i1> %mask, <4 x i64> undef)
+; SKX-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i64> %v
+;
+ %v = call <4 x i64> @llvm.masked.gather.v4i64.v4p0(<4 x ptr> %ptrs, i32 8, <4 x i1> %mask, <4 x i64> undef)
+ ret <4 x i64> %v
+}
+
+define <8 x i64> @gather_v8i64(<8 x ptr> %ptrs, <8 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v8i64'
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %v = call <8 x i64> @llvm.masked.gather.v8i64.v8p0(<8 x ptr> align 8 %ptrs, <8 x i1> %mask, <8 x i64> undef)
+; ZNVER4-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i64> %v
+;
+; ZNVER5-LABEL: 'gather_v8i64'
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %v = call <8 x i64> @llvm.masked.gather.v8i64.v8p0(<8 x ptr> align 8 %ptrs, <8 x i1> %mask, <8 x i64> undef)
+; ZNVER5-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i64> %v
+;
+; ZNVER3-LABEL: 'gather_v8i64'
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 29 for instruction: %v = call <8 x i64> @llvm.masked.gather.v8i64.v8p0(<8 x ptr> align 8 %ptrs, <8 x i1> %mask, <8 x i64> undef)
+; ZNVER3-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i64> %v
+;
+; SKX-LABEL: 'gather_v8i64'
+; SKX-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %v = call <8 x i64> @llvm.masked.gather.v8i64.v8p0(<8 x ptr> align 8 %ptrs, <8 x i1> %mask, <8 x i64> undef)
+; SKX-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i64> %v
+;
+ %v = call <8 x i64> @llvm.masked.gather.v8i64.v8p0(<8 x ptr> %ptrs, i32 8, <8 x i1> %mask, <8 x i64> undef)
+ ret <8 x i64> %v
+}
+
+;---------...
[truncated]
|
MattPD left a comment
Thanks for the PR! Per-shape cost modeling looks promising for Zen gather/scatter tuning. A few questions, comments, and suggestions, roughly in priority order:
1. i64 shapes fall through to overhead=2: too low for Zen4?
The tables omit i64 with a note "i64 falls through to the flat default". Indeed, the fallthrough path hits if (ST->hasAVX512()) return 2, giving znver4 the same i64 gather/scatter cost as Intel SKX. However:
- VPSCATTERQQ(zmm) on Zen4 is 48 µops vs ≤20 on SKX (uops.info).
- Port pressure: 18 µops restricted to ports {4,5} on Zen4 vs 8 µops on ports {2,3} on SKX.
- Performance impact: in benchmarking on my end, SPECint2006/LIBQUANTUM is 1.5–1.66× slower with v8i64 scatter on Zen4.
With overhead=2, the vectorizer sees v8i64 scatter at total cost ≈ 2 + 8×1 = 10, which looks highly profitable vs the scalar alternative. This re-enables the (harmful for performance) i64 scatter decisions that the conservative approach was trying to prevent (cf. PR #198850).
The commit message says "the i64 sweep landed within the noise of the generic flat overhead". However, the generic flat overhead for AVX-512 targets is 2, which is the Intel-tuned value. If the sweep landed near 2, that would mean i64 gather/scatter is as fast as on SKX, which contradicts the µop data.
Suggestion: Either add explicit i64 entries with overhead derived from measurement (I'd expect ≥20 given the 48-µop count), or fall through to a Zen-appropriate conservative value rather than the Intel-derived 2.
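The arithmetic behind the "≈ 10" figure can be sketched as below. This is only a simplified model of the `GSOverhead + VF * getMemoryOpCost(...)` return shown in the getGSVectorCost hunk, assuming a per-element memory-op cost of 1 (which is what the SKX expectations in the posted test suggest, e.g. 6 for v4i64). `gsVectorCost` is a hypothetical helper name, not an LLVM API, and the overhead of 20 is just the lower bound floated above, not a measurement.

```cpp
// Simplified model of the vector gather/scatter cost: a fixed overhead plus
// one memory operation per lane. gsVectorCost is a hypothetical name used
// for illustration only; it is not part of X86TTIImpl.
constexpr int gsVectorCost(int Overhead, int VF, int MemOpCost) {
  return Overhead + VF * MemOpCost;
}

// Flat AVX-512 overhead of 2 on a v8i64 shape, assumed per-lane cost of 1.
static_assert(gsVectorCost(2, 8, 1) == 10, "v8i64 with the flat overhead");

// The same shape with a hypothetical Zen-specific overhead of 20.
static_assert(gsVectorCost(20, 8, 1) == 28, "v8i64 with overhead 20");
```

Under these assumptions the flat overhead makes the v8i64 shape look nearly three times cheaper than it would with an overhead of 20, which is the gap the suggestion is meant to close.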
2. Dead table entries (v16f64, v2 shapes)
Several table rows appear to be unreachable:
- v16f64 ({ISD::LOAD, MVT::v16f64, 14}, {ISD::STORE, MVT::v16f64, 3}): getGSVectorCost performs type legalization splitting before calling getGatherOverhead/getScatterOverhead. For <16 x double> (1024 bits), the function recurses with <8 x double> after splitting. AFAICT, the v16f64 entries can never be matched, so the effective cost is 2 × (v8f64 entry), not the tabulated value.
- All v2 gather entries ({ISD::LOAD, MVT::v2i32, 20}, etc.): AVX-512 targets force-scalarize VF=2 gathers via forceScalarizeMaskedGather. The test confirms this: gather_v2i32 on ZNVER4 costs 8 (scalarized), not 20 + per-element.
These entries create a false impression of coverage and will mislead future maintainers who attempt to modify them.
Suggestion: Remove unreachable entries; add a brief comment noting that v16f64 is handled by splitting (cost = 2 × v8f64) and VF=2 is force-scalarized.
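To make the unreachable-row argument concrete, here is a toy model of the lookup path. It is a sketch under stated assumptions, not the LLVM implementation: it only models splitting on data width (the real code also considers the index vector), it assumes a per-element memory cost of 1, and `Shape` and `gatherCost` are hypothetical names.

```cpp
#include <map>
#include <utility>

// Toy model only; not X86TTIImpl. A shape is (element bits, VF).
using Shape = std::pair<int, int>;

// Returns overhead + VF * MemOpCost, splitting data that does not fit in one
// register before the table is consulted.
int gatherCost(const std::map<Shape, int> &Table, int ElemBits, int VF,
               int MaxVectorBits = 512, int FlatOverhead = 2,
               int MemOpCost = 1) {
  // Data wider than one register is halved first, so the row for the
  // full-width shape is never looked up.
  if (VF > 1 && ElemBits * VF > MaxVectorBits)
    return 2 * gatherCost(Table, ElemBits, VF / 2, MaxVectorBits,
                          FlatOverhead, MemOpCost);
  auto It = Table.find({ElemBits, VF});
  // Shapes without a row fall back to the flat overhead.
  int Overhead = (It != Table.end()) ? It->second : FlatOverhead;
  return Overhead + VF * MemOpCost;
}
```

With the posted v8f64 overhead of 17, this model gives 2 × (17 + 8) = 50 for <16 x double>, and changing the v16f64 row's value has no effect on the result.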
3. End-to-end loop-vectorize test
The test validates cost model numbers (opt -passes='print<cost-model>') but not vectorization decisions. The stated goal, fixing issue #91370 (vectorizer emitting gather for contiguous loads on znver4), is untested at the pass level.
Suggestion: Add a test under llvm/test/Transforms/LoopVectorize/X86/ that exercises the actual vectorizer decision, e.g.:
- A loop where gather IS chosen on znver4 (a strided/indirect pattern matching the lbm win): CHECK: @llvm.masked.gather
- A loop where gather is NOT chosen (e.g., an i64 indirect load): CHECK-NOT: @llvm.masked.gather
This locks in the behavior (the high-level intent) the cost model is meant to produce, so future cost model refactors that accidentally re-enable harmful gathers get caught.
4. Non-monotonic values deserve a comment
The gather table progression (v2i32=20, v4i32=7, v8i32=17, v16i32=14) is non-intuitive and will look like a typo to future developers. Similarly, scatter: v16i32=6 but v8i32=14; v16i32=6 but v16f32=16.
I believe these arise because the "break-even" methodology measures the overhead threshold at which overhead + VF × memcost > scalar_alternative_cost, and the scalar alternative's cost doesn't scale linearly with VF (due to unrolling, pipelining, address computation differences). If so, a brief comment in the source explaining why values are non-monotonic would help future maintainability significantly.
Also: is the i32-vs-f32 divergence at VF=16 for scatter (6 vs 16) genuine or a measurement artifact? They're the same physical 512-bit operation on 32-bit lanes.
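The hypothesis in item 4 can be illustrated with a small sketch. It assumes the break-even overhead is the largest value for which overhead + VF × memcost does not exceed the scalar alternative's cost, with a per-element memory cost of 1. The scalar-alternative costs used here (11, 25, 30) are made-up inputs chosen so that the results match the gather table's 7, 17, 14; they are not measurements, and `breakEvenOverhead` is a hypothetical name.

```cpp
// Largest overhead at which the vector form is still no more expensive than
// the scalar alternative, under the simplified cost model
//   vector cost = Overhead + VF * MemOpCost.
// breakEvenOverhead is a hypothetical helper for illustration only.
constexpr int breakEvenOverhead(int ScalarAltCost, int VF, int MemOpCost) {
  return ScalarAltCost - VF * MemOpCost;
}

// Invented scalar-alternative costs that grow with VF but not linearly.
static_assert(breakEvenOverhead(11, 4, 1) == 7, "VF=4");
static_assert(breakEvenOverhead(25, 8, 1) == 17, "VF=8");
static_assert(breakEvenOverhead(30, 16, 1) == 14, "VF=16");
```

The scalar costs rise monotonically with VF, yet the derived overheads go 7, 17, 14. If something like this is what happened in the sweep, a one-line source comment saying so would be enough.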
5. Methodology reproducibility (minor)
The PR body references -force-gather-overhead-cost=N as the sweep tool — this doesn't exist in upstream LLVM. The values are likely derived from a local patch that overrides the return value of getGatherOverhead/getScatterOverhead. That's a perfectly reasonable approach (I've done the same in a downstream fork), but since the methodology can't be replicated by other developers, a sentence in the commit message noting "values measured using a local patch that forces the gather/scatter overhead to a specified value" would be helpful for transparency.
Overall: the per-shape approach is promising and the provided benchmarks look good (although wider SPEC benchmark results would be very welcome). The i64 fallthrough and dead entries are the main items I'd want addressed before landing.
|
✅ With the latest revision this PR passed the C/C++ code formatter. |
|
✅ With the latest revision this PR passed the undef deprecator. |
88efe35 to 6bf7ce0
|
Hello @MattPD
|
MattPD left a comment
Thanks for the revisions: No remaining blockers on my end!
Two optional suggestions for a follow-up (which I leave up to you):
- The LV test has CHECK-NOT: @llvm.masked.gather for cases 2/3, which could pass vacuously if the loop fails to vectorize entirely (not just "vectorizes without gather"). Adding a positive anchor like CHECK: vector.body would distinguish "scalarized the gather" from "didn't vectorize at all."
- For symmetry with scatter_v16i32_gep, a gather_v16i32_gep test would exercise the GEP-index reduction path for gather as well. The code path is shared, so this is purely for documentation/completeness.
6bf7ce0 to fac152a
Two test-only nits from review of llvm#199488:
1. The LoopVectorize test had `CHECK-NOT: @llvm.masked.gather` on Case 2 (i64 gather avoided) and Case 3 (unit-stride no gather) without a positive anchor, so the check would pass vacuously if the loop ever failed to vectorize at all (rather than vectorizing without a gather). Adding `CHECK: vector.body` in front of each `CHECK-NOT` distinguishes the two outcomes; under `-force-vector-width=1` both new CHECKs now correctly fail.
2. The cost-model test had `scatter_v16{i32,f32}_gep` to exercise the GEP-index reducibility path for the v16 scatter row but no analogous case for gather. Added `gather_v16i32_gep` (cost = 30 on znver4/5). Both i32 and f32 GEP cases for gather would share the same code path, so one case is sufficient for the v16 gather row; the section comment is updated to make that explicit.
No behavior change. Both tests pass.
Jason-Van-Beusekom left a comment
Overall LGTM from me (after the failing test is fixed); however, I would like to get approval from others before merging.
…er4+
The X86 cost model currently returns a single flat overhead from
getGatherOverhead / getScatterOverhead, applied to every shape of
masked gather or scatter on every X86 subtarget that reaches the
gather/scatter path. On modern AMD parts the actual cost of these
instructions varies substantially with the vector width and element
size, and the single flat number forces the LoopVectorizer to either
under- or over-estimate the profitability of vectorising loops that
need indirect memory access.
This change adds a subtarget tuning bit, TuningPreferAMDZenGSCost,
attached to ZN4Tuning so znver4 and znver5 pick it up automatically.
Pre-AVX-512 Zen parts (znver1..3) take the scalarise path for masked
gather and never reach the new code, so the bit is intentionally NOT
placed in ZNTuning; flagging older Zen parts with the feature would
be misleading.
When the tuning bit is set, getGatherOverhead / getScatterOverhead
look the source vector type up in per-shape cost tables before
falling back to the existing generic flat overhead. The tables only
contain the shapes that are actually reached at the lookup site:
VF=2 is force-scalarised on AVX-512 (forceScalarizeMaskedGather), and
v16f64 is split via type legalisation in getGSVectorCost (1024-bit
data exceeds a single zmm), so neither row would ever be queried and
both are omitted. The live tables cover gather and scatter for
VF=4..16 over i32 / f32 / f64, plus VF=4 and VF=8 for i64.
Methodology
The numbers are empirical break-even costs measured on znver4 and
znver5 hardware. The methodology, summarised:
1. Take a controlled gather/scatter micro-benchmark with one
indirect memory access per inner-loop iteration.
2. Sweep the gather/scatter overhead via a local cl::opt patch on
X86TTI (the upstream tree has no such knob today; this is a
standalone local instrument that returns the forced value from
getGatherOverhead / getScatterOverhead). A reproducible version
of the patch lives on the author's zen-gs-i64-sweep branch.
3. For each (element type, VF) compile the micro-benchmark at a
range of forced overheads and identify the "flip" cost above
which the LoopVectorizer stops emitting the gather / scatter
instruction (it switches to an extract-load-insert lowering or
to a pure scalar loop). The tabulated cost is the highest value
at which gather/scatter emission was the right call: the
vectoriser still selects it AND the resulting binary is at least
as fast as the post-flip alternatives on the test hardware.
4. The sweep is run independently for each (element type, VF) on
Genoa, Milan and Turin and re-validated on Zen 5.
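The selection rule in step 3 can be written down as a small self-contained C++ sketch. It assumes the sweep for one (element type, VF) has already been collected; `SweepPoint` and `breakEvenCost` are hypothetical names, and the numbers a caller feeds in would come from the local sweep instrument, not from anything in the upstream tree.

```cpp
#include <vector>

// One sweep point for a single (element type, VF). All fields are inputs
// gathered by hand or by a driver script (hypothetical layout).
struct SweepPoint {
  unsigned ForcedOverhead;  // value forced through the local sweep option
  bool EmitsGatherScatter;  // did the vectoriser still emit the instruction?
  double Seconds;           // measured runtime of the resulting binary
};

// Finds the highest forced overhead at which gather/scatter is still emitted
// AND the binary is at least as fast as the best post-flip alternative.
// Returns false when no such point exists (including a sweep that never flips).
bool breakEvenCost(const std::vector<SweepPoint> &Sweep, unsigned &Cost) {
  bool HaveAlt = false;
  double BestAlt = 0.0;
  for (const SweepPoint &P : Sweep)
    if (!P.EmitsGatherScatter && (!HaveAlt || P.Seconds < BestAlt)) {
      BestAlt = P.Seconds;
      HaveAlt = true;
    }
  if (!HaveAlt)
    return false;

  bool Found = false;
  for (const SweepPoint &P : Sweep)
    if (P.EmitsGatherScatter && P.Seconds <= BestAlt &&
        (!Found || P.ForcedOverhead > Cost)) {
      Cost = P.ForcedOverhead;
      Found = true;
    }
  return Found;
}
```

A `false` result corresponds to shapes where the instruction never beats the alternatives, which is the situation the i64 notes describe.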
Notes on individual entries
* i64 entries are higher than their f64 counterparts at the same
VF. The scalar alternative for i64 runs on the integer pipeline
(cheaper than f64 on the FP pipeline), so gather has to be
cheaper to win. At the f64-style break-even, i64 gather was
1.7-3.5x slower than the scalarised lowering across stride
patterns on Zen 5, so the i64 break-even sits at the minimum
cost that suppresses vpgatherqq / vpscatterqq emission for the
measured patterns (which include the libquantum-style indirect
scatter cited in PR llvm#198850).
* f32 rows for both tables mirror i32 rows: the original sweep
only characterised i32 and f64 lanes, and the f32 rows were
derived by symmetry because vpgatherdd / vpscatterdd and the
corresponding ps variants share the same physical lane width on
Zen. The runtime equivalence of vpscatterdd and vscatterdps was
verified directly (within 3% across VF and stride patterns).
Tests
The cost-model test
llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll
covers every live shape on znver4 / znver5 and pins the unchanged
behaviour for znver3 (scalarise path) and skx (generic flat
overhead). It also covers the 32-bit-reducible GEP form for v16f32
and v16i32 scatter, which is the only path that actually queries the
v16 row of the scatter table (the <16 x ptr> form recurses to v8
through type legalisation).
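Why the `<16 x ptr>` form ends up on the v8 row while the 32-bit-reducible GEP form stays on v16 can be illustrated with a little arithmetic. The sketch below assumes the split factor is the larger of the register counts needed for the data vector and the index vector at 512 bits per register; `numParts` and `lookupVF` are hypothetical helpers, not functions in getGSVectorCost.

```cpp
#include <algorithm>
#include <cassert>

static const unsigned MaxVectorBits = 512; // one zmm register

// Number of 512-bit registers needed to hold VF elements of EltBits each.
unsigned numParts(unsigned VF, unsigned EltBits) {
  assert(VF != 0 && EltBits != 0);
  return (VF * EltBits + MaxVectorBits - 1) / MaxVectorBits;
}

// VF that would reach the table lookup after splitting on the wider of the
// data vector and the index vector (assumed splitting rule).
unsigned lookupVF(unsigned VF, unsigned DataBits, unsigned IndexBits) {
  unsigned Split = std::max(numParts(VF, DataBits), numParts(VF, IndexBits));
  return VF / Split;
}
```

Under that assumption, 16 lanes of 32-bit data with 64-bit pointer indices split once and look up VF=8, while the same data with 32-bit indices fits in one register each and looks up VF=16. The same rule gives VF=8 for v16f64, consistent with that row being omitted.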
A second test
llvm/test/Transforms/LoopVectorize/X86/amd-zen-gather-scatter-decisions.ll
pins the resulting vectoriser decisions end-to-end: gather IS emitted
for an f64 indirect-load reduction on znver5; gather is NOT emitted
for an i64 indirect-load reduction (where the cost table is meant to
suppress it); and a unit-stride load must not become a gather
regardless of cost-table values (regression guard for issue llvm#91370).
Two test-only nits from review of llvm#199488:
1. The LoopVectorize test had `CHECK-NOT: @llvm.masked.gather` on
Case 2 (i64 gather avoided) and Case 3 (unit-stride no gather)
without a positive anchor, so the check would pass vacuously if the
loop ever failed to vectorize at all (rather than vectorizing
without a gather). Adding `CHECK: vector.body` in front of each
`CHECK-NOT` distinguishes the two outcomes; under
`-force-vector-width=1` both new CHECKs now correctly fail.
2. The cost-model test had `scatter_v16{i32,f32}_gep` to exercise the
GEP-index reducibility path for the v16 scatter row but no
analogous case for gather. Added `gather_v16i32_gep` (cost = 30 on
znver4/5). Both i32 and f32 GEP cases for gather would share the
same code path, so one case is sufficient for the v16 gather row;
the section comment is updated to make that explicit.
No behavior change. Both tests pass.
CI's code_formatter job runs both clang-format AND the undef-deprecator check; the latter rejects new uses of `undef` in tests under the LangRef poison/undef migration. The masked-gather passthru argument was the only `undef` in the file. Replace all 75 occurrences with `poison` (purely an operand-printing change -- gather costs and behaviour are unaffected). No functional change; both tests still pass.
Apply clang-format to the per-shape Zen gather/scatter tables in
X86TargetTransformInfo.cpp:
- Drop the manual double-space numeric alignment in the rows
(clang-format collapses it and re-packs the rows 2-per-line).
- Re-wrap the `if (const auto *E = CostTableLookup(...))` line
in getGatherOverhead to put the call expression on its own
indented line.
Pure formatting; cost tables and lookup behaviour are unchanged.
68ed3f6 to
8cc82c4
Summary
The X86 cost model currently returns a single flat overhead from getGatherOverhead / getScatterOverhead, applied to every shape of masked gather/scatter on every X86 subtarget that reaches the gather/scatter path. On modern AMD parts the actual cost of these instructions varies substantially with the vector width and element size, and the single flat number forces the LoopVectorizer to either under- or over-estimate the profitability of vectorising loops that need indirect memory access.
This PR adds a subtarget tuning bit, TuningPreferAMDZenGSCost, attached to ZN4Tuning so znver4 and znver5 pick it up automatically. Pre-AVX-512 Zen parts (znver1..3) take the scalarise path for masked gather and never reach the new code, so the bit is intentionally NOT placed in ZNTuning; flagging older Zen parts with the feature would be misleading.
When the tuning bit is set, getGatherOverhead / getScatterOverhead look the source vector type up in per-shape cost tables before falling back to the existing generic flat overhead.
Cost tables
The tables only contain shapes that are actually reached at the lookup site. Two omissions are deliberate:
* VF=2 is force-scalarised on AVX-512 by forceScalarizeMaskedGather, so a v2* row would never be queried.
* v16f64 (1024 bits of data) is split via type legalisation in getGSVectorCost before reaching the lookup, so a v16f64 row would never be queried either.
Gather:
Scatter:
The 32-bit gather/scatter rows are the cost overhead values published in AOCC for the same hardware family, used here unchanged so that community LLVM matches the AOCC behaviour on Zen 4 / Zen 5 for indirect-memory vectorisation decisions. The 64-bit f64 and i64 rows are measured on Zen 4 / Zen 5 hardware as described below; the i64 row appears in both tables because vpscatterqq was measured to be just as harmful as vpgatherqq at the f64-style break-even (this is the libquantum / PR #198850 case the reviewer flagged for scatter specifically).
Methodology
The intent behind a row depends on whether the masked intrinsic is faster or slower than its scalarised alternative on Zen for that shape:
i64 shapes on Zen, and v4i32 / v4f32 scatter as it happens), the table value is chosen at or above that same flip point, so the LoopVectorizer falls back to the cheaper scalarised lowering.
The numbers are derived empirically on znver4 and znver5 hardware. The procedure:
cl::opt patch on X86TTI (the upstream tree has no such knob today; this is a standalone local instrument that returns the forced value from getGatherOverhead / getScatterOverhead). A reproducible version of the patch lives on the author's zen-gs-i64-sweep branch.
Notes on individual entries
* v4i64=10, v8i64=22) and are higher than their f64 counterparts at the same VF. The scalar alternative for i64 runs on the integer pipeline (cheaper than f64 on the FP pipeline), so gather / scatter has to be cheaper to win. At the f64-style break-even, i64 gather was 1.5-1.7x slower than the scalarised lowering on Zen 5 across the stride patterns measured (the prior wider range of 1.7-3.5x quoted earlier in this PR's history came from larger working sets); vpscatterqq showed the same pattern at 1.20-1.30x, matching the reviewer's 1.5-1.66x libquantum (PR #198850, "[X86] Change Gather and Scatter cost to 1024 for ZN4 and ZN5") figures. The i64 break-even therefore sits at the minimum cost that suppresses vpgatherqq / vpscatterqq emission.
* vpgatherdd / vpscatterdd and the corresponding ps variants share the same physical lane width on Zen. The runtime equivalence of vpscatterdd and vscatterdps was verified directly (within 3% across VF and stride patterns).
* v4i32 / v4f32 scatter are the two negative-Δ entries. The AOCC values (12 each) were not designed with a "suppress" intent on Zen, but on Zen 5 they happen to sit just above the flip (9 and 8). Measurement showed scalar wins by 1.20x and 1.30x respectively, so leaving these rows at the AOCC values is correct on Zen 5; resetting them to the flip would re-enable a slower vec lowering.
Tests
* llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll covers every live shape on znver4 / znver5 and pins the unchanged behaviour for znver3 (scalarise path) and skx (generic flat overhead). It also covers the 32-bit-reducible GEP form for v16f32 and v16i32 scatter, which is the only path that actually queries the v16 row of the scatter table (the <16 x ptr> form recurses to v8 through type legalisation).
* llvm/test/Transforms/LoopVectorize/X86/amd-zen-gather-scatter-decisions.ll pins the resulting vectoriser decisions end-to-end:
znver4 or skylake #91370).
Benchmark validation
Measured on a Ryzen 9 9950X (Zen 5), community LLVM trunk at the parent of this commit vs. the commit. Single-copy refrate, 3 iterations, median-selected,
-O3 -march=znver5 -ffast-math -flto.
519.lbm_r (SPEC CPU2017)
782.lbm_r (SPEC CPU2026)
Both benchmarks are dominated by an inner loop that performs strided / gather-style memory access (the D3Q19 lattice neighbour update in lbm) that the LoopVectorizer now correctly prices as profitable on Zen.
Non-goals
ZN4Tuning, every other subtarget hits the original code path with byte-identical behaviour.
Test plan
* llvm-lit -v llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll (1/1 PASS)
* llvm-lit -v llvm/test/Transforms/LoopVectorize/X86/amd-zen-gather-scatter-decisions.ll (1/1 PASS)
* llvm-lit llvm/test/Analysis/CostModel/X86/ llvm/test/Transforms/LoopVectorize/X86/ (408/408 PASS)