[X86][CostModel] Add per-shape gather/scatter cost tables for AMD znver4+ by amd-subharad · Pull Request #199488 · llvm/llvm-project

amd-subharad · 2026-05-25T06:46:27Z

Summary

The X86 cost model currently returns a single flat overhead from getGatherOverhead / getScatterOverhead, applied to every shape of masked gather/scatter on every X86 subtarget that reaches the gather/scatter path. On modern AMD parts the actual cost of these instructions varies substantially with the vector width and element size, and the single flat number forces the LoopVectorizer to either under- or over-estimate the profitability of vectorising loops that need indirect memory access.

This PR adds a subtarget tuning bit, TuningPreferAMDZenGSCost, attached to ZN4Tuning so znver4 and znver5 pick it up automatically. Pre-AVX-512 Zen parts (znver1..3) take the scalarise path for masked gather and never reach the new code, so the bit is intentionally NOT placed in ZNTuning; flagging older Zen parts with the feature would be misleading.

When the tuning bit is set, getGatherOverhead / getScatterOverhead look the source vector type up in per-shape cost tables before falling back to the existing generic flat overhead.

Cost tables

The tables only contain shapes that are actually reached at the lookup site. Two omissions are deliberate:

VF=2 is force-scalarised on AVX-512 by forceScalarizeMaskedGather, so a v2* row would never be queried.
v16f64 (1024 bits of data) is split via type legalisation in getGSVectorCost before reaching the lookup, so a v16f64 row would never be queried either.

Gather:

VF	i32	f32	f64	i64
4	7	7	7	10
8	17	17	17	22
16	14	14	-	-

Scatter:

VF	i32	f32	f64	i64
4	12	12	5	10
8	14	14	15	22
16	6	6	-	-

The 32-bit gather/scatter rows are the cost overhead values published in AOCC for the same hardware family, used here unchanged so that community LLVM matches the AOCC behaviour on Zen 4 / Zen 5 for indirect-memory vectorisation decisions. The 64-bit f64 and i64 rows are measured on Zen 4 / Zen 5 hardware as described below; the i64 row appears in both tables because vpscatterqq was measured to be just as harmful as vpgatherqq at the f64-style break-even (this is the libquantum / PR #198850 case the reviewer flagged for scatter specifically).
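
As a rough illustration of the lookup order described in this section, the sketch below models the two tables as plain C++ arrays and falls back to a flat overhead when a shape is missing or the tuning bit is off. It is a standalone model, not the patch: `Elem`, `ShapeCost`, `lookupOverhead` and `GenericFlatOverhead` are hypothetical names. The flat value 2 is the existing AVX-512 overhead visible in the diff.

```cpp
#include <cstddef>

// Standalone model of the per-shape lookup; none of these names are LLVM APIs.
enum class Elem { I32, F32, F64, I64 };

struct ShapeCost {
  Elem Ty;
  unsigned VF;
  int Cost;
};

// Values copied from the gather table in this description.
constexpr ShapeCost ZenGather[] = {
    {Elem::I32, 4, 7},   {Elem::F32, 4, 7},   {Elem::F64, 4, 7},
    {Elem::I64, 4, 10},  {Elem::I32, 8, 17},  {Elem::F32, 8, 17},
    {Elem::F64, 8, 17},  {Elem::I64, 8, 22},  {Elem::I32, 16, 14},
    {Elem::F32, 16, 14},
};

// Values copied from the scatter table in this description.
constexpr ShapeCost ZenScatter[] = {
    {Elem::I32, 4, 12}, {Elem::F32, 4, 12}, {Elem::F64, 4, 5},
    {Elem::I64, 4, 10}, {Elem::I32, 8, 14}, {Elem::F32, 8, 14},
    {Elem::F64, 8, 15}, {Elem::I64, 8, 22}, {Elem::I32, 16, 6},
    {Elem::F32, 16, 6},
};

// Existing flat overhead used on AVX-512 targets (see the diff).
constexpr int GenericFlatOverhead = 2;

// Consult the per-shape table first; fall back to the flat overhead when the
// tuning bit is off or the shape has no entry.
template <std::size_t N>
constexpr int lookupOverhead(const ShapeCost (&Table)[N], bool HasZenTuning,
                             Elem Ty, unsigned VF) {
  if (HasZenTuning)
    for (const ShapeCost &E : Table)
      if (E.Ty == Ty && E.VF == VF)
        return E.Cost;
  return GenericFlatOverhead;
}

static_assert(lookupOverhead(ZenGather, true, Elem::I64, 8) == 22,
              "v8i64 gather uses the table value");
static_assert(lookupOverhead(ZenScatter, true, Elem::F64, 4) == 5,
              "v4f64 scatter uses the table value");
static_assert(lookupOverhead(ZenGather, false, Elem::I32, 4) == 2,
              "without the tuning bit the flat overhead applies");
```

In the patch itself the same idea is expressed with `CostTblEntry` / `CostTableLookup` keyed on the MVT of the source vector type.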

Methodology

The intent behind a row depends on whether the masked intrinsic is faster or slower than its scalarised alternative on Zen for that shape:

When the masked intrinsic IS faster (most i32 / f32 / f64 shapes), the table value is chosen to be at or below the cost at which the LoopVectorizer would stop emitting it, so the right lowering is selected.
When the masked intrinsic is SLOWER (all i64 shapes on Zen, and v4i32 / v4f32 scatter as it happens), the table value is chosen at or above that same flip point, so the LoopVectorizer falls back to the cheaper scalarised lowering.

The numbers are derived empirically on znver4 and znver5 hardware. The procedure:

Take a controlled gather/scatter micro-benchmark with one indirect memory access per inner-loop iteration.
Sweep the gather/scatter overhead via a local cl::opt patch on X86TTI (the upstream tree has no such knob today; this is a standalone local instrument that returns the forced value from getGatherOverhead / getScatterOverhead). A reproducible version of the patch lives on the author's zen-gs-i64-sweep branch.
For each (element type, VF) compile the micro-benchmark at a range of forced overheads and identify the "flip" cost above which the LoopVectorizer stops emitting the masked intrinsic and switches to an extract-load-insert lowering or to a pure scalar loop.
Time the resulting binary at the last forced overhead on each side of the flip, on znver4 and znver5, and confirm which lowering is faster.
Install the table value according to the intent above: at-or-below the flip when vec wins, at-or-above the flip when scalar wins. For some VF=16 entries the table value sits well below the flip; the relationship still produces the desired decision and gives margin against future changes elsewhere in the cost model.
The sweep is run independently for each (element type, VF) on Genoa, Milan and Turin and re-validated on Zen 5.
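
The sweep in the steps above amounts to finding the largest forced overhead at which the masked intrinsic is still emitted. A minimal sketch, assuming the emit / no-emit decision is monotone in the forced overhead; `findFlip`, `FlipResult` and the probe callback are hypothetical stand-ins for compiling the micro-benchmark and inspecting the vectoriser output:

```cpp
// Result of a sweep. Found is false when the intrinsic is not emitted even at
// the lowest forced overhead in the range.
struct FlipResult {
  bool Found;
  int Cost;
};

// Probe(N) should return true when, with the overhead forced to N, the
// vectoriser still emits the masked intrinsic. Hypothetical: in practice this
// would wrap a compile of the micro-benchmark plus a check of the output.
template <typename ProbeFn>
FlipResult findFlip(ProbeFn Probe, int Lo, int Hi) {
  FlipResult R{false, 0};
  for (int N = Lo; N <= Hi; ++N) {
    if (!Probe(N))
      break; // Assumes monotonicity: once suppressed, it stays suppressed.
    R = FlipResult{true, N};
  }
  return R;
}

// Stand-in probe for illustration only: pretends the flip is at 9.
inline bool exampleProbe(int N) { return N <= 9; }
```

Timing the binaries built on each side of the returned cost then decides which lowering wins for that shape, as in the procedure above.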

Notes on individual entries

i64 entries appear in both the gather and scatter tables (v4i64=10, v8i64=22) and are higher than their f64 counterparts at the same VF. The scalar alternative for i64 runs on the integer pipeline (cheaper than f64 on the FP pipeline), so gather / scatter has to be cheaper to win. At the f64-style break-even, i64 gather was 1.5-1.7x slower than the scalarised lowering on Zen 5 across the stride patterns measured (the wider range of 1.7-3.5x quoted earlier in this PR's history came from larger working sets); vpscatterqq showed the same pattern at 1.20-1.30x, the same direction as, though smaller than, the reviewer's 1.5-1.66x libquantum figures (PR #198850, "[X86] Change Gather and Scatter cost to 1024 for ZN4 and ZN5"). The i64 table values therefore sit at the minimum cost that suppresses vpgatherqq / vpscatterqq emission.
f32 rows for both tables mirror i32 rows: the original sweep only characterised i32 and f64 lanes, and the f32 rows were derived by symmetry because vpgatherdd / vpscatterdd and the corresponding ps variants share the same physical lane width on Zen. The runtime equivalence of vpscatterdd and vscatterdps was verified directly (within 3% across VF and stride patterns).
VF=16 rows sit well below the Zen 5 flip (Δ between +23 and +35) because at VF=16 the per-element memory cost dominates the LoopVectorizer's profitability comparison and the natural flip is high. The installed AOCC values still produce the intended decision (vec) and leave headroom against unrelated cost-model changes.
v4i32 / v4f32 scatter are the two negative-Δ entries. The AOCC values (12 each) were not designed with a "suppress" intent on Zen, but on Zen 5 they happen to sit just above the flip (9 and 8). Measurement showed scalar wins by 1.20x and 1.30x respectively, so leaving these rows at the AOCC values is correct on Zen 5; resetting them to the flip would re-enable a slower vec lowering.
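
The relationship between a table value and its flip can be checked mechanically. The sketch below assumes Δ means flip minus table value, which is inferred from the signs quoted in these notes rather than stated outright; `Winner`, `delta` and `matchesIntent` are hypothetical names. The numbers are the v4i32 / v4f32 scatter figures quoted above (flips 9 and 8, table value 12, scalar wins).

```cpp
// Which lowering was faster when timed on hardware.
enum class Winner { Vector, Scalar };

// Assumed definition: positive when the table value sits below the flip.
constexpr int delta(int Flip, int TableValue) { return Flip - TableValue; }

// Intent from the Methodology section: at-or-below the flip when the vector
// lowering wins, at-or-above the flip when the scalar lowering wins.
constexpr bool matchesIntent(Winner W, int Flip, int TableValue) {
  return W == Winner::Vector ? TableValue <= Flip : TableValue >= Flip;
}

// v4i32 scatter: flip 9, table value 12, scalar wins.
static_assert(delta(9, 12) == -3, "v4i32 scatter is a negative-delta entry");
static_assert(matchesIntent(Winner::Scalar, 9, 12),
              "12 suppresses the slower masked scatter");

// v4f32 scatter: flip 8, table value 12, scalar wins.
static_assert(delta(8, 12) == -4, "v4f32 scatter is a negative-delta entry");
static_assert(matchesIntent(Winner::Scalar, 8, 12),
              "12 suppresses the slower masked scatter");

// If the vector lowering had won, the same value would violate the intent.
static_assert(!matchesIntent(Winner::Vector, 9, 12),
              "a vec-wins entry must not sit above its flip");
```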

Tests

llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll covers every live shape on znver4 / znver5 and pins the unchanged behaviour for znver3 (scalarise path) and skx (generic flat overhead). It also covers the 32-bit-reducible GEP form for v16f32 and v16i32 scatter, which is the only path that actually queries the v16 row of the scatter table (the <16 x ptr> form recurses to v8 through type legalisation).
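
To make the split behaviour concrete, here is a simplified standalone model of the cost computed in getGSVectorCost: the overhead for the legalised shape plus one scalar memory operation per lane, multiplied by the number of pieces after splitting. It is a sketch under assumptions, not the LLVM code: the 512-bit limit, the function names and the scalar memory-op cost of 1 are assumptions, the last inferred from the CHECK lines in the truncated diff (for example SKX v4i32 gather = 6 with a flat overhead of 2). The force-scalarised VF=2 case is not modelled.

```cpp
#include <algorithm>

constexpr int MaxVectorBits = 512; // Assumed widest legal vector (AVX-512).

constexpr int ceilDiv(int A, int B) { return (A + B - 1) / B; }

// Pieces needed so that both the data vector and the index vector fit.
constexpr int splitFactor(int VF, int ElemBits, int IndexBits) {
  return std::max(ceilDiv(VF * ElemBits, MaxVectorBits),
                  ceilDiv(VF * IndexBits, MaxVectorBits));
}

// Hypothetical overhead callbacks, taking the VF looked up after splitting.
constexpr int zenGatherI32Overhead(int VF) {
  return VF == 4 ? 7 : VF == 8 ? 17 : VF == 16 ? 14 : 2;
}
constexpr int flatOverhead(int) { return 2; }

// Cost = pieces * (overhead for one piece + one scalar memory op per lane).
constexpr int gsVectorCost(int VF, int ElemBits, int IndexBits,
                           int (*Overhead)(int), int ScalarMemOpCost = 1) {
  const int Split = splitFactor(VF, ElemBits, IndexBits);
  const int PieceVF = VF / Split;
  return Split * (Overhead(PieceVF) + PieceVF * ScalarMemOpCost);
}

// <N x ptr> operands carry 64-bit indices; these match the ZNVER4 CHECK lines.
static_assert(gsVectorCost(4, 32, 64, zenGatherI32Overhead) == 11, "v4i32");
static_assert(gsVectorCost(8, 32, 64, zenGatherI32Overhead) == 25, "v8i32");
// 16 x 64-bit pointers are 1024 bits, so the query becomes 2 x v8i32.
static_assert(gsVectorCost(16, 32, 64, zenGatherI32Overhead) == 50, "v16i32");
// The same shape on SKX with the flat overhead of 2.
static_assert(gsVectorCost(16, 32, 64, flatOverhead) == 20, "SKX v16i32");
// v16f64 holds 1024 bits of data and is always split before the lookup.
static_assert(splitFactor(16, 64, 64) == 2, "v16f64 splits to v8f64");
// With 32-bit indices a v16i32 access is not split and reaches the v16 row.
static_assert(splitFactor(16, 32, 32) == 1, "32-bit-reducible GEP form");
```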

llvm/test/Transforms/LoopVectorize/X86/amd-zen-gather-scatter-decisions.ll pins the resulting vectoriser decisions end-to-end:

gather IS emitted for an f64 indirect-load reduction on znver5;
gather is NOT emitted for an i64 indirect-load reduction (where the cost table is meant to suppress it);
a unit-stride load must not become a gather regardless of cost-table values (regression guard for issue #91370, "[X86] Worse runtime performance on Zen 4 CPU when optimizing for znver4 or skylake").

$ llvm-lit -v llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll \
            llvm/test/Transforms/LoopVectorize/X86/amd-zen-gather-scatter-decisions.ll
PASS: LLVM :: Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll
PASS: LLVM :: Transforms/LoopVectorize/X86/amd-zen-gather-scatter-decisions.ll
Total Discovered Tests: 2
  Passed: 2 (100.00%)

Benchmark validation

Measured on a Ryzen 9 9950X (Zen 5), community LLVM trunk at the parent of this commit vs. the commit. Single-copy refrate, 3 iterations, median-selected, -O3 -march=znver5 -ffast-math -flto.

Benchmark	OFF rate	ON rate	Δ vs OFF
`519.lbm_r` (SPEC CPU2017)	4.192	9.275	+121 %
`782.lbm_r` (SPEC CPU2026)	1.690	3.917	+132 %

Both benchmarks are dominated by an inner loop that performs strided / gather-style memory access (the D3Q19 lattice neighbour update in lbm) that the LoopVectorizer now correctly prices as profitable on Zen.
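
For reference, the Δ column in the table above is the relative change in rate. A small sketch of that arithmetic, with hypothetical helper names, using the OFF / ON rates from the table:

```cpp
#include <cmath>

// Relative change of the ON rate versus the OFF rate, in percent.
inline double percentDelta(double OffRate, double OnRate) {
  return (OnRate / OffRate - 1.0) * 100.0;
}

// Same value rounded to the nearest whole percent, as shown in the table.
inline long roundedPercentDelta(double OffRate, double OnRate) {
  return std::lround(percentDelta(OffRate, OnRate));
}
```

With the rates above, `roundedPercentDelta(4.192, 9.275)` gives 121 and `roundedPercentDelta(1.690, 3.917)` gives 132.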

Non-goals

No change to non-AMD targets: the feature bit is only enabled via ZN4Tuning; every other subtarget hits the original code path with byte-identical behaviour.
No change to the gather-to-shuffle scalarisation pass: this PR only changes the cost estimate the vectorizer sees; the actual lowering path is unchanged.

Test plan

llvm-lit -v llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll (1/1 PASS)
llvm-lit -v llvm/test/Transforms/LoopVectorize/X86/amd-zen-gather-scatter-decisions.ll (1/1 PASS)
llvm-lit llvm/test/Analysis/CostModel/X86/ llvm/test/Transforms/LoopVectorize/X86/ (408/408 PASS)
CI green on supported buildbots

llvmorg-github-actions · 2026-05-25T06:47:21Z

@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-llvm-analysis

Author: Sumukh J Bharadwaj (amd-subharad)

Changes

Cost tables

Gather (VF=2..16 over i32 / f32 / f64; i64 falls through to the flat default):

VF	i32 / f32 / f64
2	20
4	7
8	17
16	14

Scatter (VF=4..16 over i32 / f32 / f64; i64 and VF=2 fall through):

VF	i32	f32	f64
4	12	12	5
8	14	14	15
16	6	16	3

Methodology

The numbers in both tables are empirical break-even costs measured on znver4 / znver5 hardware:

Take a controlled gather (or scatter) micro-benchmark with one indirect memory access per inner-loop iteration and an outer loop chosen so total runtime is in the 60–120 second range for stable timing.
Sweep the gather cost via the existing -force-gather-overhead-cost=N knob.
Find the largest N at which the LoopVectorizer still selects the gather lowering over the scalar fallback. That N is the break-even cost for that (element type, VF) combination on Zen.
Repeat independently for each (element type, VF) combination and for scatter using the analogous setup.

The scatter table is keyed independently for 32-bit (i32 / f32) and 64-bit (f64) lanes because the sweep results diverged on Zen hardware: 64-bit scatter break-even is consistently lower than 32-bit scatter break-even at the same VF.

Test

llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll covers every shape in both tables on znver4 / znver5 and pins the unchanged behaviour for znver3 (scalarise path) and skx (generic flat overhead).

$ llvm-lit -v llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll
PASS: LLVM :: Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll (1 of 1)

Benchmark validation

Measured on a Ryzen 9 9950X (Zen 5), community LLVM trunk at the parent of this commit vs. the commit. Numbers are SPEC rate medians across K=3 iterations, single-copy ref:

Suite	OFF geomean	ON geomean	Δ
SPEC CPU2017 fprate	17.49	18.77	+7.31 %
SPEC CPU2017 intrate	11.56	11.54	−0.19 % (noise)

Biggest individual movers:

Benchmark	Δ rate
`519.lbm_r`	+149.91 % (v8f64 gather break-even is now correctly priced)
`549.fotonik3d_r`	+1.85 %
`503.bwaves_r`	−1.62 %

Non-goals

No change to non-AMD targets: the feature bit is only enabled via ZN4Tuning, every other subtarget hits the original code path with byte-identical behaviour.
No change to the gather-to-shuffle scalarisation pass: this PR only changes the cost estimate the vectorizer sees; the actual lowering path is unchanged.

Test plan

ninja check-llvm-codegen-x86
ninja check-llvm-analysis
CI green on supported buildbots

Patch is 44.80 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/199488.diff

5 Files Affected:

(modified) llvm/docs/ReleaseNotes.md (+5)
(modified) llvm/lib/Target/X86/X86.td (+13-1)
(modified) llvm/lib/Target/X86/X86TargetTransformInfo.cpp (+64-4)
(modified) llvm/lib/Target/X86/X86TargetTransformInfo.h (+2-2)
(added) llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll (+553)

diff --git a/llvm/docs/ReleaseNotes.md b/llvm/docs/ReleaseNotes.md
index fffd696e59baf..df5e17a27d26c 100644
--- a/llvm/docs/ReleaseNotes.md
+++ b/llvm/docs/ReleaseNotes.md
@@ -220,6 +220,11 @@ Makes programs 10x faster by doing Special New Thing.
 * `.att_syntax` directive is now emitted for assembly files when AT&T syntax is
   in use. This matches the behaviour of Intel syntax and aids with
   compatibility when changing the default Clang syntax to the Intel syntax.
+* Masked gather and scatter cost overheads are now per-shape on AMD znver4
+  and znver5 targets via a new `TuningPreferAMDZenGSCost` subtarget
+  feature, replacing the single flat overhead inherited from the generic
+  AVX-512 path. The per-shape costs use empirical break-even values
+  measured on Zen 4 / Zen 5 hardware.
 
 ### Changes to the OCaml bindings
 
diff --git a/llvm/lib/Target/X86/X86.td b/llvm/lib/Target/X86/X86.td
index 50fb7204ebfa1..28bbd639649bb 100644
--- a/llvm/lib/Target/X86/X86.td
+++ b/llvm/lib/Target/X86/X86.td
@@ -721,6 +721,17 @@ def TuningFastGather
     : SubtargetFeature<"fast-gather", "HasFastGather", "true",
                        "Indicates if gather is reasonably fast (this is true for Skylake client and all AVX-512 CPUs)">;
 
+// Use AMD Zen-tuned cost tables for masked gather/scatter intrinsics in the
+// X86 TargetTransformInfo cost model. Refines the flat overhead used by other
+// AVX-512 targets with per-element-type/per-VL costs measured on znver4 and
+// znver5. Inherited automatically by every znver4+ CPU via ZN4Tuning; not
+// applied to pre-AVX-512 Zen parts (znver1..3), which take the scalarise
+// path for masked gather anyway.
+def TuningPreferAMDZenGSCost
+    : SubtargetFeature<"prefer-amd-zen-gs-cost",
+                       "HasPreferAMDZenGSCost", "true",
+                       "Use AMD Zen-tuned gather/scatter cost tables in the cost model">;
+
 // Generate vpdpwssd instead of vpmaddwd+vpaddd sequence.
 def TuningFastDPWSSD
     : SubtargetFeature<
@@ -1631,7 +1642,8 @@ def ProcessorFeatures {
   list<SubtargetFeature> ZN3Features =
     !listconcat(ZN2Features, ZN3AdditionalFeatures);
 
-  list<SubtargetFeature> ZN4AdditionalTuning = [TuningFastDPWSSD];
+  list<SubtargetFeature> ZN4AdditionalTuning = [TuningFastDPWSSD,
+                                                TuningPreferAMDZenGSCost];
   list<SubtargetFeature> ZN4Tuning =
     !listconcat(ZN3Tuning, ZN4AdditionalTuning);
   list<SubtargetFeature> ZN4AdditionalFeatures = [FeatureAVX512,
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
index 698be1615a04b..edc8e78c7f040 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -6253,20 +6253,79 @@ InstructionCost X86TTIImpl::getCFInstrCost(unsigned Opcode,
   return TTI::TCC_Free;
 }
 
-int X86TTIImpl::getGatherOverhead() const {
+int X86TTIImpl::getGatherOverhead(Type *SrcVTy) const {
   // Some CPUs have more overhead for gather. The specified overhead is relative
   // to the Load operation. "2" is the number provided by Intel architects. This
   // parameter is used for cost estimation of Gather Op and comparison with
   // other alternatives.
   // TODO: Remove the explicit hasAVX512()?, That would mean we would only
   // enable gather with a -march.
+
+  // AMD znver4+ targets enable per-shape costs measured on the hardware via
+  // TuningPreferAMDZenGSCost (set in ZN4Tuning). Pre-AVX-512 Zen parts
+  // (znver1..3) take the scalarise path for masked gather and never reach
+  // this code, so the table only needs to cover AVX-512 widths.
+  if (ST->hasPreferAMDZenGSCost() && SrcVTy) {
+    // Per-shape gather costs for AMD znver4+ targets.
+    //
+    // The numbers are the empirical "break-even" (lower-bound) costs
+    // measured by sweeping a forced gather cost while compiling a
+    // controlled gather micro-benchmark and observing the point at which
+    // the LoopVectorizer still chose the gather lowering over the scalar
+    // fallback. The sweep was run independently for every (data type,
+    // VF) combination on Genoa / Milan / Turin and re-validated on Zen 5;
+    // the value tabulated below is the cost at which gather emission
+    // was the right call for that shape.
+    //
+    // i64 entries are intentionally absent: the i64 sweep landed within
+    // the noise of the generic flat overhead, so those shapes fall
+    // through to the existing flat cost.
+    static const CostTblEntry ZenGatherCostTable[] = {
+        {ISD::LOAD, MVT::v2i32, 20}, {ISD::LOAD, MVT::v4i32,  7},
+        {ISD::LOAD, MVT::v8i32, 17}, {ISD::LOAD, MVT::v16i32, 14},
+        {ISD::LOAD, MVT::v2f32, 20}, {ISD::LOAD, MVT::v4f32,  7},
+        {ISD::LOAD, MVT::v8f32, 17}, {ISD::LOAD, MVT::v16f32, 14},
+        {ISD::LOAD, MVT::v2f64, 20}, {ISD::LOAD, MVT::v4f64,  7},
+        {ISD::LOAD, MVT::v8f64, 17}, {ISD::LOAD, MVT::v16f64, 14},
+    };
+    EVT VT = TLI->getValueType(DL, SrcVTy);
+    if (VT.isSimple())
+      if (const auto *E = CostTableLookup(ZenGatherCostTable, ISD::LOAD,
+                                          VT.getSimpleVT()))
+        return E->Cost;
+  }
+
   if (ST->hasAVX512() || (ST->hasAVX2() && ST->hasFastGather()))
     return 2;
 
   return 1024;
 }
 
-int X86TTIImpl::getScatterOverhead() const {
+int X86TTIImpl::getScatterOverhead(Type *SrcVTy) const {
+  // AMD znver4+ targets use per-shape scatter costs measured on the hardware
+  // via TuningPreferAMDZenGSCost (set in ZN4Tuning). Fall through to the
+  // generic flat overhead for shapes we have not characterised.
+  if (ST->hasPreferAMDZenGSCost() && ST->hasAVX512() && SrcVTy) {
+    // Per-shape scatter costs for AMD znver4+ targets, measured with the
+    // same break-even methodology as the gather table above. i32 / f32
+    // and f64 lanes use independent curves because their sweep results
+    // diverged on Zen hardware. i64 entries and VF=2 entries are
+    // intentionally absent and fall through to the generic flat overhead.
+    static const CostTblEntry ZenScatterCostTable[] = {
+        {ISD::STORE, MVT::v4i32, 12}, {ISD::STORE, MVT::v8i32, 14},
+        {ISD::STORE, MVT::v16i32, 6},
+        {ISD::STORE, MVT::v4f32, 12}, {ISD::STORE, MVT::v8f32, 14},
+        {ISD::STORE, MVT::v16f32, 16},
+        {ISD::STORE, MVT::v4f64,  5}, {ISD::STORE, MVT::v8f64, 15},
+        {ISD::STORE, MVT::v16f64, 3},
+    };
+    EVT VT = TLI->getValueType(DL, SrcVTy);
+    if (VT.isSimple())
+      if (const auto *E = CostTableLookup(ZenScatterCostTable, ISD::STORE,
+                                          VT.getSimpleVT()))
+        return E->Cost;
+  }
+
   if (ST->hasAVX512())
     return 2;
 
@@ -6338,8 +6397,9 @@ InstructionCost X86TTIImpl::getGSVectorCost(unsigned Opcode,
 
   // The gather / scatter cost is given by Intel architects. It is a rough
   // number since we are looking at one instruction in a time.
-  const int GSOverhead = (Opcode == Instruction::Load) ? getGatherOverhead()
-                                                       : getScatterOverhead();
+  const int GSOverhead = (Opcode == Instruction::Load)
+                             ? getGatherOverhead(SrcVTy)
+                             : getScatterOverhead(SrcVTy);
   return GSOverhead + VF * getMemoryOpCost(Opcode, SrcVTy->getScalarType(),
                                            Alignment, AddressSpace, CostKind);
 }
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.h b/llvm/lib/Target/X86/X86TargetTransformInfo.h
index ea277bfeab560..ceb6dcc172f94 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.h
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.h
@@ -346,8 +346,8 @@ class X86TTIImpl final : public BasicTTIImplBase<X86TTIImpl> {
                                   Type *DataTy, const Value *Ptr,
                                   Align Alignment, unsigned AddressSpace) const;
 
-  int getGatherOverhead() const;
-  int getScatterOverhead() const;
+  int getGatherOverhead(Type *SrcVTy) const;
+  int getScatterOverhead(Type *SrcVTy) const;
 
   /// @}
 };
diff --git a/llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll b/llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll
new file mode 100644
index 0000000000000..1565568d8d010
--- /dev/null
+++ b/llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll
@@ -0,0 +1,553 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py
+; Cost-model coverage for AMD Zen-tuned masked gather/scatter overheads.
+;
+; ZNVER4 / ZNVER5 enable the per-shape Zen cost tables via
+; TuningPreferAMDZenGSCost (set in ZN4Tuning and inherited by ZN5Tuning) and
+; have AVX-512, so the new tables are consulted in getGSVectorCost.
+; ZNVER3 does NOT carry TuningPreferAMDZenGSCost and lacks both AVX-512 and
+; TuningFastGather, so isLegalMaskedGather() returns false and the cost model
+; walks the scalarise path (getGSScalarCost). The ZNVER3 numbers below are the
+; unchanged scalar fallback cost, included here only to lock in that this
+; change does not regress pre-AVX-512 Zen targets.
+; SKX is a non-Zen AVX-512 baseline showing the generic flat overhead of 2.
+;
+; RUN: opt < %s -S -mtriple=x86_64-unknown-linux-gnu -passes="print<cost-model>" 2>&1 -disable-output -cost-kind=throughput -mcpu=znver4 | FileCheck %s --check-prefix=ZNVER4
+; RUN: opt < %s -S -mtriple=x86_64-unknown-linux-gnu -passes="print<cost-model>" 2>&1 -disable-output -cost-kind=throughput -mcpu=znver5 | FileCheck %s --check-prefix=ZNVER5
+; RUN: opt < %s -S -mtriple=x86_64-unknown-linux-gnu -passes="print<cost-model>" 2>&1 -disable-output -cost-kind=throughput -mcpu=znver3 | FileCheck %s --check-prefix=ZNVER3
+; RUN: opt < %s -S -mtriple=x86_64-unknown-linux-gnu -passes="print<cost-model>" 2>&1 -disable-output -cost-kind=throughput -mcpu=skx    | FileCheck %s --check-prefix=SKX
+
+;------------------------------------------------------------------------------
+; Masked gather - i32 element type
+;------------------------------------------------------------------------------
+
+define <2 x i32> @gather_v2i32(<2 x ptr> %ptrs, <2 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v2i32'
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %v = call <2 x i32> @llvm.masked.gather.v2i32.v2p0(<2 x ptr> align 4 %ptrs, <2 x i1> %mask, <2 x i32> undef)
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %v
+;
+; ZNVER5-LABEL: 'gather_v2i32'
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %v = call <2 x i32> @llvm.masked.gather.v2i32.v2p0(<2 x ptr> align 4 %ptrs, <2 x i1> %mask, <2 x i32> undef)
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %v
+;
+; ZNVER3-LABEL: 'gather_v2i32'
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %v = call <2 x i32> @llvm.masked.gather.v2i32.v2p0(<2 x ptr> align 4 %ptrs, <2 x i1> %mask, <2 x i32> undef)
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %v
+;
+; SKX-LABEL: 'gather_v2i32'
+; SKX-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %v = call <2 x i32> @llvm.masked.gather.v2i32.v2p0(<2 x ptr> align 4 %ptrs, <2 x i1> %mask, <2 x i32> undef)
+; SKX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %v
+;
+  %v = call <2 x i32> @llvm.masked.gather.v2i32.v2p0(<2 x ptr> %ptrs, i32 4, <2 x i1> %mask, <2 x i32> undef)
+  ret <2 x i32> %v
+}
+
+define <4 x i32> @gather_v4i32(<4 x ptr> %ptrs, <4 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v4i32'
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 11 for instruction: %v = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> align 4 %ptrs, <4 x i1> %mask, <4 x i32> undef)
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i32> %v
+;
+; ZNVER5-LABEL: 'gather_v4i32'
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 11 for instruction: %v = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> align 4 %ptrs, <4 x i1> %mask, <4 x i32> undef)
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i32> %v
+;
+; ZNVER3-LABEL: 'gather_v4i32'
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %v = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> align 4 %ptrs, <4 x i1> %mask, <4 x i32> undef)
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i32> %v
+;
+; SKX-LABEL: 'gather_v4i32'
+; SKX-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %v = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> align 4 %ptrs, <4 x i1> %mask, <4 x i32> undef)
+; SKX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i32> %v
+;
+  %v = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> %ptrs, i32 4, <4 x i1> %mask, <4 x i32> undef)
+  ret <4 x i32> %v
+}
+
+define <8 x i32> @gather_v8i32(<8 x ptr> %ptrs, <8 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v8i32'
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 25 for instruction: %v = call <8 x i32> @llvm.masked.gather.v8i32.v8p0(<8 x ptr> align 4 %ptrs, <8 x i1> %mask, <8 x i32> undef)
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i32> %v
+;
+; ZNVER5-LABEL: 'gather_v8i32'
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 25 for instruction: %v = call <8 x i32> @llvm.masked.gather.v8i32.v8p0(<8 x ptr> align 4 %ptrs, <8 x i1> %mask, <8 x i32> undef)
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i32> %v
+;
+; ZNVER3-LABEL: 'gather_v8i32'
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %v = call <8 x i32> @llvm.masked.gather.v8i32.v8p0(<8 x ptr> align 4 %ptrs, <8 x i1> %mask, <8 x i32> undef)
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i32> %v
+;
+; SKX-LABEL: 'gather_v8i32'
+; SKX-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %v = call <8 x i32> @llvm.masked.gather.v8i32.v8p0(<8 x ptr> align 4 %ptrs, <8 x i1> %mask, <8 x i32> undef)
+; SKX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i32> %v
+;
+  %v = call <8 x i32> @llvm.masked.gather.v8i32.v8p0(<8 x ptr> %ptrs, i32 4, <8 x i1> %mask, <8 x i32> undef)
+  ret <8 x i32> %v
+}
+
+define <16 x i32> @gather_v16i32(<16 x ptr> %ptrs, <16 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v16i32'
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 50 for instruction: %v = call <16 x i32> @llvm.masked.gather.v16i32.v16p0(<16 x ptr> align 4 %ptrs, <16 x i1> %mask, <16 x i32> undef)
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <16 x i32> %v
+;
+; ZNVER5-LABEL: 'gather_v16i32'
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 50 for instruction: %v = call <16 x i32> @llvm.masked.gather.v16i32.v16p0(<16 x ptr> align 4 %ptrs, <16 x i1> %mask, <16 x i32> undef)
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <16 x i32> %v
+;
+; ZNVER3-LABEL: 'gather_v16i32'
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 55 for instruction: %v = call <16 x i32> @llvm.masked.gather.v16i32.v16p0(<16 x ptr> align 4 %ptrs, <16 x i1> %mask, <16 x i32> undef)
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <16 x i32> %v
+;
+; SKX-LABEL: 'gather_v16i32'
+; SKX-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %v = call <16 x i32> @llvm.masked.gather.v16i32.v16p0(<16 x ptr> align 4 %ptrs, <16 x i1> %mask, <16 x i32> undef)
+; SKX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <16 x i32> %v
+;
+  %v = call <16 x i32> @llvm.masked.gather.v16i32.v16p0(<16 x ptr> %ptrs, i32 4, <16 x i1> %mask, <16 x i32> undef)
+  ret <16 x i32> %v
+}
+
+;------------------------------------------------------------------------------
+; Masked gather - i64 element type
+;------------------------------------------------------------------------------
+
+define <2 x i64> @gather_v2i64(<2 x ptr> %ptrs, <2 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v2i64'
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %v = call <2 x i64> @llvm.masked.gather.v2i64.v2p0(<2 x ptr> align 8 %ptrs, <2 x i1> %mask, <2 x i64> undef)
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i64> %v
+;
+; ZNVER5-LABEL: 'gather_v2i64'
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %v = call <2 x i64> @llvm.masked.gather.v2i64.v2p0(<2 x ptr> align 8 %ptrs, <2 x i1> %mask, <2 x i64> undef)
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i64> %v
+;
+; ZNVER3-LABEL: 'gather_v2i64'
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %v = call <2 x i64> @llvm.masked.gather.v2i64.v2p0(<2 x ptr> align 8 %ptrs, <2 x i1> %mask, <2 x i64> undef)
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i64> %v
+;
+; SKX-LABEL: 'gather_v2i64'
+; SKX-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %v = call <2 x i64> @llvm.masked.gather.v2i64.v2p0(<2 x ptr> align 8 %ptrs, <2 x i1> %mask, <2 x i64> undef)
+; SKX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i64> %v
+;
+  %v = call <2 x i64> @llvm.masked.gather.v2i64.v2p0(<2 x ptr> %ptrs, i32 8, <2 x i1> %mask, <2 x i64> undef)
+  ret <2 x i64> %v
+}
+
+define <4 x i64> @gather_v4i64(<4 x ptr> %ptrs, <4 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v4i64'
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %v = call <4 x i64> @llvm.masked.gather.v4i64.v4p0(<4 x ptr> align 8 %ptrs, <4 x i1> %mask, <4 x i64> undef)
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i64> %v
+;
+; ZNVER5-LABEL: 'gather_v4i64'
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %v = call <4 x i64> @llvm.masked.gather.v4i64.v4p0(<4 x ptr> align 8 %ptrs, <4 x i1> %mask, <4 x i64> undef)
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i64> %v
+;
+; ZNVER3-LABEL: 'gather_v4i64'
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %v = call <4 x i64> @llvm.masked.gather.v4i64.v4p0(<4 x ptr> align 8 %ptrs, <4 x i1> %mask, <4 x i64> undef)
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i64> %v
+;
+; SKX-LABEL: 'gather_v4i64'
+; SKX-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %v = call <4 x i64> @llvm.masked.gather.v4i64.v4p0(<4 x ptr> align 8 %ptrs, <4 x i1> %mask, <4 x i64> undef)
+; SKX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i64> %v
+;
+  %v = call <4 x i64> @llvm.masked.gather.v4i64.v4p0(<4 x ptr> %ptrs, i32 8, <4 x i1> %mask, <4 x i64> undef)
+  ret <4 x i64> %v
+}
+
+define <8 x i64> @gather_v8i64(<8 x ptr> %ptrs, <8 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v8i64'
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %v = call <8 x i64> @llvm.masked.gather.v8i64.v8p0(<8 x ptr> align 8 %ptrs, <8 x i1> %mask, <8 x i64> undef)
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i64> %v
+;
+; ZNVER5-LABEL: 'gather_v8i64'
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %v = call <8 x i64> @llvm.masked.gather.v8i64.v8p0(<8 x ptr> align 8 %ptrs, <8 x i1> %mask, <8 x i64> undef)
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i64> %v
+;
+; ZNVER3-LABEL: 'gather_v8i64'
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 29 for instruction: %v = call <8 x i64> @llvm.masked.gather.v8i64.v8p0(<8 x ptr> align 8 %ptrs, <8 x i1> %mask, <8 x i64> undef)
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i64> %v
+;
+; SKX-LABEL: 'gather_v8i64'
+; SKX-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %v = call <8 x i64> @llvm.masked.gather.v8i64.v8p0(<8 x ptr> align 8 %ptrs, <8 x i1> %mask, <8 x i64> undef)
+; SKX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i64> %v
+;
+  %v = call <8 x i64> @llvm.masked.gather.v8i64.v8p0(<8 x ptr> %ptrs, i32 8, <8 x i1> %mask, <8 x i64> undef)
+  ret <8 x i64> %v
+}
+
+;---------...
[truncated]

MattPD

Thanks for the PR! Per-shape cost modeling looks promising for Zen gather/scatter tuning. A few questions, comments, and suggestions, roughly in priority order:

1. i64 shapes fall through to overhead=2: too low for Zen4?

The tables omit i64 with a note "i64 falls through to the flat default". Indeed, the fallthrough path hits if (ST->hasAVX512()) return 2, giving znver4 the same i64 gather/scatter cost as Intel SKX. However:

VPSCATTERQQ (zmm) on Zen4 is 48 µops vs ≤20 on SKX (uops.info).
Port pressure: 18 µops restricted to ports {4,5} on Zen4 vs 8 µops on ports {2,3} on SKX.
Performance impact: in benchmarking on my end SPECint2006/LIBQUANTUM is 1.5–1.66× slower with v8i64 scatter on Zen4.

With overhead=2, the vectorizer sees v8i64 scatter at total cost ≈ 2 + 8×1 = 10, which looks highly profitable vs the scalar alternative. This re-enables the (harmful for performance) i64 scatter decisions that the conservative approach was trying to prevent (cf. PR #198850).

The commit message says "the i64 sweep landed within the noise of the generic flat overhead". However, the generic flat overhead for AVX-512 targets is 2, which is the Intel-tuned value. If the sweep landed near 2, that would mean i64 gather/scatter is as fast as on SKX, which contradicts the µop data.

Suggestion: Either add explicit i64 entries with overhead derived from measurement (I'd expect ≥20 given the 48-µop count), or fall through to a Zen-appropriate conservative value rather than the Intel-derived 2.
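
To make the arithmetic in point 1 concrete, here is a minimal sketch. It assumes the simplified model used in the estimate above (total cost = overhead + VF × per-element memory cost, with a memory cost of 1); `gs_vector_cost` is a hypothetical helper name for illustration, not an LLVM API.

```python
def gs_vector_cost(overhead: int, vf: int, mem_cost_per_elt: int = 1) -> int:
    """Simplified model: total cost = overhead + VF * per-element memory cost."""
    return overhead + vf * mem_cost_per_elt


# v8i64 scatter with the Intel-derived fallthrough overhead of 2.
fallthrough_cost = gs_vector_cost(overhead=2, vf=8)

# Same shape with an overhead at the suggested lower bound of 20.
conservative_cost = gs_vector_cost(overhead=20, vf=8)

print(fallthrough_cost, conservative_cost)  # prints: 10 28
```

Under this model the fallthrough value makes the v8i64 shape look almost three times cheaper than an overhead of 20 would, which is the gap the suggestion is trying to close.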

2. Dead table entries (v16f64, v2 shapes)

Several table rows appear to be unreachable:

v16f64 ({ISD::LOAD, MVT::v16f64, 14}, {ISD::STORE, MVT::v16f64, 3}): getGSVectorCost performs type legalization splitting before calling getGatherOverhead/getScatterOverhead. For <16 x double> (1024 bits), the function recurses with <8 x double> after splitting. AFAICT, the v16f64 entries can never be matched, so the effective cost is 2 × (v8f64 entry), not the tabulated value.
All v2 gather entries ({ISD::LOAD, MVT::v2i32, 20}, etc.): AVX-512 targets force-scalarize VF=2 gathers via forceScalarizeMaskedGather. The test confirms this: gather_v2i32 on ZNVER4 costs 8 (scalarized), not 20 + per-element.

These entries create a false impression of coverage and will mislead future maintainers who attempt to modify them.

Suggestion: Remove unreachable entries; add a brief comment noting that v16f64 is handled by splitting (cost = 2 × v8f64) and VF=2 is force-scalarized.
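
The reachability argument can be sketched as a small model. This is a simplification under stated assumptions, not the actual getGSVectorCost logic: it assumes a 512-bit maximum vector width, treats VF=2 as always scalarised, and splits whenever the data vector or the index vector exceeds that width. `lookup_shape` and `MAX_VECTOR_BITS` are hypothetical names.

```python
MAX_VECTOR_BITS = 512  # assumption: one zmm register


def lookup_shape(vf: int, elt_bits: int, index_bits: int = 64):
    """Return (split_factor, vf_queried), or None when the shape is scalarised.

    Simplified model of the behaviour described above:
      * VF=2 is force-scalarised, so no table row is consulted;
      * a data or index vector wider than MAX_VECTOR_BITS is split, and the
        table lookup only ever sees the narrower per-part shape.
    """
    if vf == 2:
        return None
    widest = max(vf * elt_bits, vf * index_bits)
    # Ceiling division: number of legal-width parts needed for the widest vector.
    split_factor = (widest + MAX_VECTOR_BITS - 1) // MAX_VECTOR_BITS
    return split_factor, vf // split_factor


print(lookup_shape(16, 64))  # v16f64: split in two, v8 row queried
print(lookup_shape(2, 32))   # v2i32: scalarised, no row queried
```

In this model `lookup_shape(16, 64)` yields `(2, 8)`, so a v16f64 row is never consulted and the effective cost is twice the v8f64 cost, as noted above.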

3. End-to-end loop-vectorize test

The test validates cost model numbers (opt -passes='print<cost-model>') but not vectorization decisions. The stated goal, fixing issue #91370 (vectorizer emitting gather for contiguous loads on znver4), is untested at the pass level.

Suggestion: Add a test under llvm/test/Transforms/LoopVectorize/X86/ that exercises the actual vectorizer decision, e.g.:

A loop where gather IS chosen on znver4 (a strided/indirect pattern matching the lbm win): CHECK: @llvm.masked.gather
A loop where gather is NOT chosen (e.g., an i64 indirect load): CHECK-NOT: @llvm.masked.gather

This locks in the behavior (the high-level intent) the cost model is meant to produce, so future cost model refactors that accidentally re-enable harmful gathers get caught.

4. Non-monotonic values deserve a comment

The gather table progression (v2i32=20, v4i32=7, v8i32=17, v16i32=14) is non-intuitive and will look like a typo to future developers. Similarly, scatter: v16i32=6 but v8i32=14; v16i32=6 but v16f32=16.

I believe these arise because the "break-even" methodology measures the overhead threshold at which overhead + VF × memcost > scalar_alternative_cost, and the scalar alternative's cost doesn't scale linearly with VF (due to unrolling, pipelining, address computation differences). If so, a brief comment in the source explaining why values are non-monotonic would help future maintainability significantly.

Also: the i32-vs-f32 divergence at VF=16 for scatter (6 vs 16)--is this genuine or a measurement artifact? They're the same physical 512-bit operation on 32-bit lanes.
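
A small sketch of the break-even reading suggested in point 4, to show how non-monotonic thresholds can fall out of it. The scalar-alternative costs below are invented purely for illustration (they are not measurements and do not come from the PR), and `break_even_overhead` is a hypothetical helper.

```python
def break_even_overhead(scalar_alt_cost: int, vf: int, mem_cost_per_elt: int = 1) -> int:
    """Largest overhead for which overhead + VF * memcost <= scalar alternative cost."""
    return scalar_alt_cost - vf * mem_cost_per_elt


# HYPOTHETICAL scalar-alternative costs per VF, invented for illustration only.
# They deliberately do not scale linearly with VF.
hypothetical_scalar_cost = {4: 12, 8: 30, 16: 34}

thresholds = {
    vf: break_even_overhead(cost, vf)
    for vf, cost in hypothetical_scalar_cost.items()
}
print(thresholds)  # prints: {4: 8, 8: 22, 16: 18}
```

Because the scalar alternative grows by 18 from VF=4 to VF=8 but only by 4 from VF=8 to VF=16, the threshold rises and then falls, even though every input is increasing in VF.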

5. Methodology reproducibility (minor)

The PR body references -force-gather-overhead-cost=N as the sweep tool — this doesn't exist in upstream LLVM. The values are likely derived from a local patch that overrides the return value of getGatherOverhead/getScatterOverhead. That's a perfectly reasonable approach (I've done the same in a downstream fork), but since the methodology can't be replicated by other developers, a sentence in the commit message noting "values measured using a local patch that forces the gather/scatter overhead to a specified value" would be helpful for transparency.

Overall: the per-shape approach is promising and the provided benchmarks look good (although wider SPEC benchmark results would be very welcome). The i64 fallthrough and dead entries are the main items I'd want addressed before landing.

github-actions · 2026-05-27T08:23:40Z

✅ With the latest revision this PR passed the C/C++ code formatter.

github-actions · 2026-05-27T08:23:40Z

✅ With the latest revision this PR passed the undef deprecator.

amd-subharad · 2026-05-27T10:36:49Z

Hello @MattPD,
Thank you for your review.
I have tried to address your concerns in the latest amends:

Missing i64: measured according to our methodology and added v4i64=10 / v8i64=22 to both tables. The fall-through to 2 was indeed harmful: the runtime regression was 1.2-3.5x depending on shape and stride.
Dead entries: removed (v2 force-scalarised, v16f64 split by type legalisation), with explanatory comments.
End-to-end test: added in Transforms/LoopVectorize/X86/.
v16i32 vs v16f32 scatter (6 vs 16): transcription typo, fixed. Runtime confirms the two are identical on Zen.
Methodology: the commit message now discloses the local cl::opt instrument explicitly.
Please let me know if there are other shortcomings.

MattPD

Thanks for the revisions: No remaining blockers on my end!

Two optional suggestions for a follow-up (which I leave up to you):

The LV test has CHECK-NOT: @llvm.masked.gather for cases 2/3, which could pass vacuously if the loop fails to vectorize entirely (not just "vectorizes without gather"). Adding a positive anchor like CHECK: vector.body would distinguish "scalarized the gather" from "didn't vectorize at all."
For symmetry with scatter_v16i32_gep, a gather_v16i32_gep test would exercise the GEP-index reduction path for gather as well. The code path is shared so this is purely for documentation/completeness.
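
The vacuous-pass concern in the first suggestion can be illustrated with a toy model. This is not FileCheck itself, only plain substring matching that mimics the two check styles; the IR snippets are simplified placeholders and the function names are hypothetical.

```python
GATHER = "@llvm.masked.gather"


def check_not_only(ir: str) -> bool:
    """Model of a lone `CHECK-NOT: @llvm.masked.gather`."""
    return GATHER not in ir


def anchored_check(ir: str) -> bool:
    """Model of `CHECK: vector.body` followed by `CHECK-NOT: @llvm.masked.gather`."""
    pos = ir.find("vector.body")
    if pos < 0:
        return False  # positive anchor missing: the loop was not vectorised
    return GATHER not in ir[pos:]


# Placeholder outputs: a vectorised loop without a gather, and a scalar loop.
vectorised_no_gather = "vector.body:\n  %wide.load = load <8 x i64>, ptr %p\n"
not_vectorised = "for.body:\n  %x = load i64, ptr %p\n"

print(check_not_only(vectorised_no_gather), check_not_only(not_vectorised))
print(anchored_check(vectorised_no_gather), anchored_check(not_vectorised))
```

The lone negative check accepts both outputs, while the anchored form accepts only the vectorised one, which is the distinction the positive anchor is meant to provide.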

Addressed

Two test-only nits from review of llvm#199488:

1. The LoopVectorize test had `CHECK-NOT: @llvm.masked.gather` on Case 2 (i64 gather avoided) and Case 3 (unit-stride no gather) without a positive anchor, so the check would pass vacuously if the loop ever failed to vectorize at all (rather than vectorizing without a gather). Adding `CHECK: vector.body` in front of each `CHECK-NOT` distinguishes the two outcomes; under `-force-vector-width=1` both new CHECKs now correctly fail.
2. The cost-model test had `scatter_v16{i32,f32}_gep` to exercise the GEP-index reducibility path for the v16 scatter row but no analogous case for gather. Added `gather_v16i32_gep` (cost = 30 on znver4/5). Both i32 and f32 GEP cases for gather would share the same code path, so one case is sufficient for the v16 gather row; the section comment is updated to make that explicit.

No behavior change. Both tests pass.

Jason-Van-Beusekom

Overall LGTM from me (after the failing test is fixed); however, I would like to get approval from others before merging.

…er4+

The X86 cost model currently returns a single flat overhead from getGatherOverhead / getScatterOverhead, applied to every shape of masked gather or scatter on every X86 subtarget that reaches the gather/scatter path. On modern AMD parts the actual cost of these instructions varies substantially with the vector width and element size, and the single flat number forces the LoopVectorizer to either under- or over-estimate the profitability of vectorising loops that need indirect memory access.

This change adds a subtarget tuning bit, TuningPreferAMDZenGSCost, attached to ZN4Tuning so znver4 and znver5 pick it up automatically. Pre-AVX-512 Zen parts (znver1..3) take the scalarise path for masked gather and never reach the new code, so the bit is intentionally NOT placed in ZNTuning; flagging older Zen parts with the feature would be misleading.

When the tuning bit is set, getGatherOverhead / getScatterOverhead look the source vector type up in per-shape cost tables before falling back to the existing generic flat overhead. The tables only contain the shapes that are actually reached at the lookup site: VF=2 is force-scalarised on AVX-512 (forceScalarizeMaskedGather), and v16f64 is split via type legalisation in getGSVectorCost (1024-bit data exceeds a single zmm), so neither row would ever be queried and both are omitted. The live tables cover gather and scatter for VF=4..16 over i32 / f32 / f64, plus VF=4 and VF=8 for i64.

Methodology

The numbers are empirical break-even costs measured on znver4 and znver5 hardware. The methodology, summarised:

1. Take a controlled gather/scatter micro-benchmark with one indirect memory access per inner-loop iteration.
2. Sweep the gather/scatter overhead via a local cl::opt patch on X86TTI (the upstream tree has no such knob today; this is a standalone local instrument that returns the forced value from getGatherOverhead / getScatterOverhead). A reproducible version of the patch lives on the author's zen-gs-i64-sweep branch.
3. For each (element type, VF) compile the micro-benchmark at a range of forced overheads and identify the "flip" cost above which the LoopVectorizer stops emitting the gather / scatter instruction (it switches to an extract-load-insert lowering or to a pure scalar loop). The tabulated cost is the highest value at which gather/scatter emission was the right call: the vectoriser still selects it AND the resulting binary is at least as fast as the post-flip alternatives on the test hardware.
4. The sweep is run independently for each (element type, VF) on Genoa, Milan and Turin and re-validated on Zen 5.

Notes on individual entries

* i64 entries are higher than their f64 counterparts at the same VF. The scalar alternative for i64 runs on the integer pipeline (cheaper than f64 on the FP pipeline), so gather has to be cheaper to win. At the f64-style break-even, i64 gather was 1.7-3.5x slower than the scalarised lowering across stride patterns on Zen 5, so the i64 break-even sits at the minimum cost that suppresses vpgatherqq / vpscatterqq emission for the measured patterns (which include the libquantum-style indirect scatter cited in PR llvm#198850).
* f32 rows for both tables mirror i32 rows: the original sweep only characterised i32 and f64 lanes, and the f32 rows were derived by symmetry because vpgatherdd / vpscatterdd and the corresponding ps variants share the same physical lane width on Zen. The runtime equivalence of vpscatterdd and vscatterdps was verified directly (within 3% across VF and stride patterns).

Tests

The cost-model test llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll covers every live shape on znver4 / znver5 and pins the unchanged behaviour for znver3 (scalarise path) and skx (generic flat overhead). It also covers the 32-bit-reducible GEP form for v16f32 and v16i32 scatter, which is the only path that actually queries the v16 row of the scatter table (the <16 x ptr> form recurses to v8 through type legalisation).

A second test llvm/test/Transforms/LoopVectorize/X86/amd-zen-gather-scatter-decisions.ll pins the resulting vectoriser decisions end-to-end: gather IS emitted for an f64 indirect-load reduction on znver5; gather is NOT emitted for an i64 indirect-load reduction (where the cost table is meant to suppress it); and a unit-stride load must not become a gather regardless of cost-table values (regression guard for issue llvm#91370).
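
The lookup-then-fallback behaviour described in this commit message can be sketched as follows. This is an illustrative model rather than the code in X86TargetTransformInfo.cpp: the overheads are the gather-table values from the PR description, the fallback of 2 is the AVX-512 flat overhead mentioned in the review, and `gather_overhead` and the dictionary layout are hypothetical.

```python
# Gather overheads from the PR description, keyed by (VF, element type).
ZEN_GATHER_OVERHEAD = {
    (4, "i32"): 7, (4, "f32"): 7, (4, "f64"): 7, (4, "i64"): 10,
    (8, "i32"): 17, (8, "f32"): 17, (8, "f64"): 17, (8, "i64"): 22,
    (16, "i32"): 14, (16, "f32"): 14,
}

GENERIC_AVX512_OVERHEAD = 2  # generic flat overhead for AVX-512 targets


def gather_overhead(vf: int, elt: str, has_zen_gs_tuning: bool) -> int:
    """Consult the per-shape table when the tuning bit is set, else fall back."""
    if has_zen_gs_tuning:
        entry = ZEN_GATHER_OVERHEAD.get((vf, elt))
        if entry is not None:
            return entry
    return GENERIC_AVX512_OVERHEAD


print(gather_overhead(8, "i64", True))   # prints: 22
print(gather_overhead(8, "i64", False))  # prints: 2
print(gather_overhead(16, "i32", True) + 16 * 1)  # prints: 30
```

Assuming a per-element memory cost of 1, the last line gives 14 + 16 = 30 for the v16i32 row, which is consistent with the `gather_v16i32_gep` cost of 30 on znver4/5 quoted in the follow-up commit.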

CI's code_formatter job runs both clang-format AND the undef-deprecator check; the latter rejects new uses of `undef` in tests under the LangRef poison/undef migration. The masked-gather passthru argument was the only `undef` in the file. Replace all 75 occurrences with `poison` (purely an operand-printing change -- gather costs and behaviour are unaffected). No functional change; both tests still pass.

Apply clang-format to the per-shape Zen gather/scatter tables in X86TargetTransformInfo.cpp:

- Drop the manual double-space numeric alignment in the rows (clang-format collapses it and re-packs the rows 2-per-line).
- Re-wrap the `if (const auto *E = CostTableLookup(...))` line in getGatherOverhead to put the call expression on its own indented line.

Pure formatting; cost tables and lookup behaviour are unchanged.

llvmorg-github-actions Bot added backend:X86 llvm:analysis Includes value tracking, cost tables and constant folding labels May 25, 2026

RKSimon self-requested a review May 25, 2026 10:16

MattPD previously requested changes May 26, 2026

amd-subharad force-pushed the zen-gather-scatter-costs branch from 88efe35 to 6bf7ce0 Compare May 27, 2026 10:20

llvmorg-github-actions Bot added the llvm:transforms label May 27, 2026

MattPD self-requested a review May 27, 2026 23:36

MattPD reviewed May 27, 2026

amd-subharad force-pushed the zen-gather-scatter-costs branch from 6bf7ce0 to fac152a Compare May 28, 2026 03:51

Jason-Van-Beusekom mentioned this pull request May 28, 2026

[X86] Change Gather and Scatter cost to 1024 for ZN4 and ZN5 #198850

Open

Jason-Van-Beusekom reviewed May 29, 2026

Jason-Van-Beusekom requested review from ganeshgit and phoebewang May 29, 2026 19:05

amd-subharad added 4 commits May 30, 2026 23:49

amd-subharad force-pushed the zen-gather-scatter-costs branch from 68ed3f6 to 8cc82c4 Compare May 30, 2026 18:20

amd-subharad wants to merge 4 commits into llvm:main from amd-subharad:zen-gather-scatter-costs
