Skip to content

[X86][CostModel] Add per-shape gather/scatter cost tables for AMD znver4+#199488

Open
amd-subharad wants to merge 4 commits into
llvm:mainfrom
amd-subharad:zen-gather-scatter-costs
Open

[X86][CostModel] Add per-shape gather/scatter cost tables for AMD znver4+#199488
amd-subharad wants to merge 4 commits into
llvm:mainfrom
amd-subharad:zen-gather-scatter-costs

Conversation

@amd-subharad
Copy link
Copy Markdown

@amd-subharad amd-subharad commented May 25, 2026

Summary

The X86 cost model currently returns a single flat overhead from getGatherOverhead / getScatterOverhead, applied to every shape of masked gather/scatter on every X86 subtarget that reaches the gather/scatter path. On modern AMD parts the actual cost of these instructions varies substantially with the vector width and element size, and the single flat number forces the LoopVectorizer to either under- or over-estimate the profitability of vectorising loops that need indirect memory access.

This PR adds a subtarget tuning bit, TuningPreferAMDZenGSCost, attached to ZN4Tuning so znver4 and znver5 pick it up automatically. Pre-AVX-512 Zen parts (znver1..3) take the scalarise path for masked gather and never reach the new code, so the bit is intentionally NOT placed in ZNTuning; flagging older Zen parts with the feature would be misleading.

When the tuning bit is set, getGatherOverhead / getScatterOverhead look the source vector type up in per-shape cost tables before falling back to the existing generic flat overhead.

Cost tables

The tables only contain shapes that are actually reached at the lookup site. Two omissions are deliberate:

  • VF=2 is force-scalarised on AVX-512 by forceScalarizeMaskedGather, so a v2* row would never be queried.
  • v16f64 (1024 bits of data) is split via type legalisation in getGSVectorCost before reaching the lookup, so a v16f64 row would never be queried either.

Gather:

VF i32 f32 f64 i64
4 7 7 7 10
8 17 17 17 22
16 14 14 - -

Scatter:

VF i32 f32 f64 i64
4 12 12 5 10
8 14 14 15 22
16 6 6 - -

The 32-bit gather/scatter rows are the cost overhead values published in AOCC for the same hardware family, used here unchanged so that community LLVM matches the AOCC behaviour on Zen 4 / Zen 5 for indirect-memory vectorisation decisions. The 64-bit f64 and i64 rows are measured on Zen 4 / Zen 5 hardware as described below; the i64 row appears in both tables because vpscatterqq was measured to be just as harmful as vpgatherqq at the f64-style break-even (this is the libquantum / PR #198850 case the reviewer flagged for scatter specifically).

Methodology

The intent behind a row depends on whether the masked intrinsic is faster or slower than its scalarised alternative on Zen for that shape:

  • When the masked intrinsic IS faster (most i32 / f32 / f64 shapes), the table value is chosen to be at or below the cost at which the LoopVectorizer would stop emitting it, so the right lowering is selected.
  • When the masked intrinsic is SLOWER (all i64 shapes on Zen, and v4i32 / v4f32 scatter as it happens), the table value is chosen at or above that same flip point, so the LoopVectorizer falls back to the cheaper scalarised lowering.

The numbers are derived empirically on znver4 and znver5 hardware. The procedure:

  1. Take a controlled gather/scatter micro-benchmark with one indirect memory access per inner-loop iteration.
  2. Sweep the gather/scatter overhead via a local cl::opt patch on X86TTI (the upstream tree has no such knob today; this is a standalone local instrument that returns the forced value from getGatherOverhead / getScatterOverhead). A reproducible version of the patch lives on the author's zen-gs-i64-sweep branch.
  3. For each (element type, VF) compile the micro-benchmark at a range of forced overheads and identify the "flip" cost above which the LoopVectorizer stops emitting the masked intrinsic and switches to an extract-load-insert lowering or to a pure scalar loop.
  4. Time the resulting binary at the last forced overhead on each side of the flip, on znver4 and znver5, and confirm which lowering is faster.
  5. Install the table value according to the intent above: at-or-below the flip when vec wins, at-or-above the flip when scalar wins. For some VF=16 entries the table value sits well below the flip; the relationship still produces the desired decision and gives margin against future changes elsewhere in the cost model.
  6. The sweep is run independently for each (element type, VF) on Genoa, Milan and Turin and re-validated on Zen 5.

Notes on individual entries

  • i64 entries appear in both the gather and scatter tables (v4i64=10, v8i64=22) and are higher than their f64 counterparts at the same VF. The scalar alternative for i64 runs on the integer pipeline (cheaper than f64 on the FP pipeline), so gather / scatter has to be cheaper to win. At the f64-style break-even, i64 gather was 1.5-1.7x slower than the scalarised lowering on Zen 5 across the stride patterns measured (the prior wider range of 1.7-3.5x quoted earlier in this PR's history came from larger working sets); vpscatterqq showed the same pattern at 1.20-1.30x, matching the reviewer's 1.5-1.66x libquantum (PR [X86] Change Gather and Scatter cost to 1024 for ZN4 and ZN5 #198850) figures. The i64 break-even therefore sits at the minimum cost that suppresses vpgatherqq / vpscatterqq emission.
  • f32 rows for both tables mirror i32 rows: the original sweep only characterised i32 and f64 lanes, and the f32 rows were derived by symmetry because vpgatherdd / vpscatterdd and the corresponding ps variants share the same physical lane width on Zen. The runtime equivalence of vpscatterdd and vscatterdps was verified directly (within 3% across VF and stride patterns).
  • VF=16 rows sit well below the Zen 5 flip (Δ between +23 and +35) because at VF=16 the per-element memory cost dominates the LoopVectorizer's profitability comparison and the natural flip is high. The installed AOCC values still produce the intended decision (vec) and leave headroom against unrelated cost-model changes.
  • v4i32 / v4f32 scatter are the two negative-Δ entries. The AOCC values (12 each) were not designed with a "suppress" intent on Zen, but on Zen 5 they happen to sit just above the flip (9 and 8). Measurement showed scalar wins by 1.20x and 1.30x respectively, so leaving these rows at the AOCC values is correct on Zen 5; resetting them to the flip would re-enable a slower vec lowering.

Tests

llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll covers every live shape on znver4 / znver5 and pins the unchanged behaviour for znver3 (scalarise path) and skx (generic flat overhead). It also covers the 32-bit-reducible GEP form for v16f32 and v16i32 scatter, which is the only path that actually queries the v16 row of the scatter table (the <16 x ptr> form recurses to v8 through type legalisation).

llvm/test/Transforms/LoopVectorize/X86/amd-zen-gather-scatter-decisions.ll pins the resulting vectoriser decisions end-to-end:

$ llvm-lit -v llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll \
            llvm/test/Transforms/LoopVectorize/X86/amd-zen-gather-scatter-decisions.ll
PASS: LLVM :: Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll
PASS: LLVM :: Transforms/LoopVectorize/X86/amd-zen-gather-scatter-decisions.ll
Total Discovered Tests: 2
  Passed: 2 (100.00%)

Benchmark validation

Measured on a Ryzen 9 9950X (Zen 5), community LLVM trunk at the parent of this commit vs. the commit. Single-copy refrate, 3 iterations, median-selected, -O3 -march=znver5 -ffast-math -flto.

Benchmark OFF rate ON rate Δ vs OFF
519.lbm_r (SPEC CPU2017) 4.192 9.275 +121 %
782.lbm_r (SPEC CPU2026) 1.690 3.917 +132 %

Both benchmarks are dominated by an inner loop that performs strided / gather-style memory access (the D3Q19 lattice neighbour update in lbm) that the LoopVectorizer now correctly prices as profitable on Zen.

Non-goals

  • No change to non-AMD targets: the feature bit is only enabled via ZN4Tuning, every other subtarget hits the original code path with byte-identical behaviour.
  • No change to the gather-to-shuffle scalarisation pass: this PR only changes the cost estimate the vectorizer sees, the actual lowering path is unchanged.

Test plan

  • llvm-lit -v llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll (1/1 PASS)
  • llvm-lit -v llvm/test/Transforms/LoopVectorize/X86/amd-zen-gather-scatter-decisions.ll (1/1 PASS)
  • llvm-lit llvm/test/Analysis/CostModel/X86/ llvm/test/Transforms/LoopVectorize/X86/ (408/408 PASS)
  • CI green on supported buildbots

@github-actions
Copy link
Copy Markdown

Hello @amd-subharad 👋

Thank you for submitting a Pull Request (PR) to the LLVM Project. Since this is your first PR, here are a few useful links covering our main contribution policies and review practices.

  • All contributions to LLVM must follow our LLVM AI Tool Use Policy. In particular, if you used AI while working on this PR, remember to add a note to the PR description.
  • The LLVM Code-Review Policy and Practices document contains practical information about the PR process, including how patches are reviewed and accepted, and who can review a PR.
  • Our LLVM Developer Policy describes our expectations for code quality, commit summaries and contains notes on our CI system.

Please reply to this message to confirm that you have read these policies, especially the LLVM AI Tool Use Policy, and that any AI tool usage has been noted in the PR description.


Frequently asked questions

How do I add reviewers?

This PR will be automatically labeled, and the relevant teams will be notified. For some parts of the project, reviewers may also be added automatically.

You can also add reviewers manually using the Reviewers section on this page. If you cannot use that section, it is probably because you do not have write permissions for the repository. In that case, you can request a review by tagging reviewers in a comment using @ followed by their GitHub username.

What if there are no comments?

If you have not received any comments on your PR after a week, you can request a review by pinging the PR with a comment such as “Ping”. The common courtesy ping rate is once a week. Please remember that you are asking for volunteer time from other developers.

Are any special GitHub settings required to contribute to LLVM?

We only require contributors to have a public email address associated with their GitHub commits, see this section of LLVM Developer Policy for details.


If you have questions, feel free to leave a comment on this PR, or ask on LLVM Discord or LLVM Discourse.

Thank you,
The LLVM Community

@llvmorg-github-actions llvmorg-github-actions Bot added backend:X86 llvm:analysis Includes value tracking, cost tables and constant folding labels May 25, 2026
@llvmorg-github-actions
Copy link
Copy Markdown

llvmorg-github-actions Bot commented May 25, 2026

@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-llvm-analysis

Author: Sumukh J Bharadwaj (amd-subharad)

Changes

<!--
PR body for community LLVM. Edit freely before opening the PR.
-->

Summary

The X86 cost model currently returns a single flat overhead from getGatherOverhead / getScatterOverhead, applied to every shape of masked gather/scatter on every X86 subtarget that reaches the gather/scatter path. On modern AMD parts the actual cost of these instructions varies substantially with the vector width and element size, and the single flat number forces the LoopVectorizer to either under- or over-estimate the profitability of vectorising loops that need indirect memory access.

This PR adds a subtarget tuning bit, TuningPreferAMDZenGSCost, attached to ZN4Tuning so znver4 and znver5 pick it up automatically. Pre-AVX-512 Zen parts (znver1..3) take the scalarise path for masked gather and never reach the new code, so the bit is intentionally not placed in ZNTuning; flagging older Zen parts with the feature would be misleading.

When the tuning bit is set, getGatherOverhead / getScatterOverhead look the source vector type up in per-shape cost tables before falling back to the existing generic flat overhead.

Cost tables

Gather (VF=2..16 over i32 / f32 / f64; i64 falls through to the flat default):

VF i32 / f32 / f64
2 20
4 7
8 17
16 14

Scatter (VF=4..16 over i32 / f32 / f64; i64 and VF=2 fall through):

VF i32 f32 f64
4 12 12 5
8 14 14 15
16 6 16 3

Methodology

The numbers in both tables are empirical break-even costs measured on znver4 / znver5 hardware:

  1. Take a controlled gather (or scatter) micro-benchmark with one indirect memory access per inner-loop iteration and an outer loop chosen so total runtime is in the 60–120 second range for stable timing.
  2. Sweep the gather cost via the existing -force-gather-overhead-cost=N knob.
  3. Find the largest N at which the LoopVectorizer still selects the gather lowering over the scalar fallback. That N is the break-even cost for that (element type, VF) combination on Zen.
  4. Repeat independently for each (element type, VF) combination and for scatter using the analogous setup.

The scatter table is keyed independently for 32-bit (i32 / f32) and 64-bit (f64) lanes because the sweep results diverged on Zen hardware: 64-bit scatter break-even is consistently lower than 32-bit scatter break-even at the same VF.

Test

llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll covers every shape in both tables on znver4 / znver5 and pins the unchanged behaviour for znver3 (scalarise path) and skx (generic flat overhead).

$ llvm-lit -v llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll
PASS: LLVM :: Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll (1 of 1)

Benchmark validation

Measured on a Ryzen 9 9950X (Zen 5), community LLVM trunk at the parent of this commit vs. the commit. Numbers are SPEC rate medians across K=3 iterations, single-copy ref:

Suite OFF geomean ON geomean Δ
SPEC CPU2017 fprate 17.49 18.77 +7.31 %
SPEC CPU2017 intrate 11.56 11.54 −0.19 % (noise)

Biggest individual movers:

Benchmark Δ rate
519.lbm_r +149.91 % (v8f64 gather break-even is now correctly priced)
549.fotonik3d_r +1.85 %
503.bwaves_r −1.62 %

Non-goals

  • No change to non-AMD targets: the feature bit is only enabled via ZN4Tuning, every other subtarget hits the original code path with byte-identical behaviour.
  • No change to the gather-to-shuffle scalarisation pass: this PR only changes the cost estimate the vectorizer sees; the actual lowering path is unchanged.

Test plan

  • ninja check-llvm-codegen-x86
  • ninja check-llvm-analysis
  • CI green on supported buildbots

Patch is 44.80 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/199488.diff

5 Files Affected:

  • (modified) llvm/docs/ReleaseNotes.md (+5)
  • (modified) llvm/lib/Target/X86/X86.td (+13-1)
  • (modified) llvm/lib/Target/X86/X86TargetTransformInfo.cpp (+64-4)
  • (modified) llvm/lib/Target/X86/X86TargetTransformInfo.h (+2-2)
  • (added) llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll (+553)
diff --git a/llvm/docs/ReleaseNotes.md b/llvm/docs/ReleaseNotes.md
index fffd696e59baf..df5e17a27d26c 100644
--- a/llvm/docs/ReleaseNotes.md
+++ b/llvm/docs/ReleaseNotes.md
@@ -220,6 +220,11 @@ Makes programs 10x faster by doing Special New Thing.
 * `.att_syntax` directive is now emitted for assembly files when AT&T syntax is
   in use. This matches the behaviour of Intel syntax and aids with
   compatibility when changing the default Clang syntax to the Intel syntax.
+* Masked gather and scatter cost overheads are now per-shape on AMD znver4
+  and znver5 targets via a new `TuningPreferAMDZenGSCost` subtarget
+  feature, replacing the single flat overhead inherited from the generic
+  AVX-512 path. The per-shape costs use empirical break-even values
+  measured on Zen 4 / Zen 5 hardware.
 
 ### Changes to the OCaml bindings
 
diff --git a/llvm/lib/Target/X86/X86.td b/llvm/lib/Target/X86/X86.td
index 50fb7204ebfa1..28bbd639649bb 100644
--- a/llvm/lib/Target/X86/X86.td
+++ b/llvm/lib/Target/X86/X86.td
@@ -721,6 +721,17 @@ def TuningFastGather
     : SubtargetFeature<"fast-gather", "HasFastGather", "true",
                        "Indicates if gather is reasonably fast (this is true for Skylake client and all AVX-512 CPUs)">;
 
+// Use AMD Zen-tuned cost tables for masked gather/scatter intrinsics in the
+// X86 TargetTransformInfo cost model. Refines the flat overhead used by other
+// AVX-512 targets with per-element-type/per-VL costs measured on znver4 and
+// znver5. Inherited automatically by every znver4+ CPU via ZN4Tuning; not
+// applied to pre-AVX-512 Zen parts (znver1..3), which take the scalarise
+// path for masked gather anyway.
+def TuningPreferAMDZenGSCost
+    : SubtargetFeature<"prefer-amd-zen-gs-cost",
+                       "HasPreferAMDZenGSCost", "true",
+                       "Use AMD Zen-tuned gather/scatter cost tables in the cost model">;
+
 // Generate vpdpwssd instead of vpmaddwd+vpaddd sequence.
 def TuningFastDPWSSD
     : SubtargetFeature<
@@ -1631,7 +1642,8 @@ def ProcessorFeatures {
   list<SubtargetFeature> ZN3Features =
     !listconcat(ZN2Features, ZN3AdditionalFeatures);
 
-  list<SubtargetFeature> ZN4AdditionalTuning = [TuningFastDPWSSD];
+  list<SubtargetFeature> ZN4AdditionalTuning = [TuningFastDPWSSD,
+                                                TuningPreferAMDZenGSCost];
   list<SubtargetFeature> ZN4Tuning =
     !listconcat(ZN3Tuning, ZN4AdditionalTuning);
   list<SubtargetFeature> ZN4AdditionalFeatures = [FeatureAVX512,
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
index 698be1615a04b..edc8e78c7f040 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -6253,20 +6253,79 @@ InstructionCost X86TTIImpl::getCFInstrCost(unsigned Opcode,
   return TTI::TCC_Free;
 }
 
-int X86TTIImpl::getGatherOverhead() const {
+int X86TTIImpl::getGatherOverhead(Type *SrcVTy) const {
   // Some CPUs have more overhead for gather. The specified overhead is relative
   // to the Load operation. "2" is the number provided by Intel architects. This
   // parameter is used for cost estimation of Gather Op and comparison with
   // other alternatives.
   // TODO: Remove the explicit hasAVX512()?, That would mean we would only
   // enable gather with a -march.
+
+  // AMD znver4+ targets enable per-shape costs measured on the hardware via
+  // TuningPreferAMDZenGSCost (set in ZN4Tuning). Pre-AVX-512 Zen parts
+  // (znver1..3) take the scalarise path for masked gather and never reach
+  // this code, so the table only needs to cover AVX-512 widths.
+  if (ST->hasPreferAMDZenGSCost() && SrcVTy) {
+    // Per-shape gather costs for AMD znver4+ targets.
+    //
+    // The numbers are the empirical "break-even" (lower-bound) costs
+    // measured by sweeping a forced gather cost while compiling a
+    // controlled gather micro-benchmark and observing the point at which
+    // the LoopVectorizer still chose the gather lowering over the scalar
+    // fallback. The sweep was run independently for every (data type,
+    // VF) combination on Genoa / Milan / Turin and re-validated on Zen 5;
+    // the value tabulated below is the cost at which gather emission
+    // was the right call for that shape.
+    //
+    // i64 entries are intentionally absent: the i64 sweep landed within
+    // the noise of the generic flat overhead, so those shapes fall
+    // through to the existing flat cost.
+    static const CostTblEntry ZenGatherCostTable[] = {
+        {ISD::LOAD, MVT::v2i32, 20}, {ISD::LOAD, MVT::v4i32,  7},
+        {ISD::LOAD, MVT::v8i32, 17}, {ISD::LOAD, MVT::v16i32, 14},
+        {ISD::LOAD, MVT::v2f32, 20}, {ISD::LOAD, MVT::v4f32,  7},
+        {ISD::LOAD, MVT::v8f32, 17}, {ISD::LOAD, MVT::v16f32, 14},
+        {ISD::LOAD, MVT::v2f64, 20}, {ISD::LOAD, MVT::v4f64,  7},
+        {ISD::LOAD, MVT::v8f64, 17}, {ISD::LOAD, MVT::v16f64, 14},
+    };
+    EVT VT = TLI->getValueType(DL, SrcVTy);
+    if (VT.isSimple())
+      if (const auto *E = CostTableLookup(ZenGatherCostTable, ISD::LOAD,
+                                          VT.getSimpleVT()))
+        return E->Cost;
+  }
+
   if (ST->hasAVX512() || (ST->hasAVX2() && ST->hasFastGather()))
     return 2;
 
   return 1024;
 }
 
-int X86TTIImpl::getScatterOverhead() const {
+int X86TTIImpl::getScatterOverhead(Type *SrcVTy) const {
+  // AMD znver4+ targets use per-shape scatter costs measured on the hardware
+  // via TuningPreferAMDZenGSCost (set in ZN4Tuning). Fall through to the
+  // generic flat overhead for shapes we have not characterised.
+  if (ST->hasPreferAMDZenGSCost() && ST->hasAVX512() && SrcVTy) {
+    // Per-shape scatter costs for AMD znver4+ targets, measured with the
+    // same break-even methodology as the gather table above. i32 / f32
+    // and f64 lanes use independent curves because their sweep results
+    // diverged on Zen hardware. i64 entries and VF=2 entries are
+    // intentionally absent and fall through to the generic flat overhead.
+    static const CostTblEntry ZenScatterCostTable[] = {
+        {ISD::STORE, MVT::v4i32, 12}, {ISD::STORE, MVT::v8i32, 14},
+        {ISD::STORE, MVT::v16i32, 6},
+        {ISD::STORE, MVT::v4f32, 12}, {ISD::STORE, MVT::v8f32, 14},
+        {ISD::STORE, MVT::v16f32, 16},
+        {ISD::STORE, MVT::v4f64,  5}, {ISD::STORE, MVT::v8f64, 15},
+        {ISD::STORE, MVT::v16f64, 3},
+    };
+    EVT VT = TLI->getValueType(DL, SrcVTy);
+    if (VT.isSimple())
+      if (const auto *E = CostTableLookup(ZenScatterCostTable, ISD::STORE,
+                                          VT.getSimpleVT()))
+        return E->Cost;
+  }
+
   if (ST->hasAVX512())
     return 2;
 
@@ -6338,8 +6397,9 @@ InstructionCost X86TTIImpl::getGSVectorCost(unsigned Opcode,
 
   // The gather / scatter cost is given by Intel architects. It is a rough
   // number since we are looking at one instruction in a time.
-  const int GSOverhead = (Opcode == Instruction::Load) ? getGatherOverhead()
-                                                       : getScatterOverhead();
+  const int GSOverhead = (Opcode == Instruction::Load)
+                             ? getGatherOverhead(SrcVTy)
+                             : getScatterOverhead(SrcVTy);
   return GSOverhead + VF * getMemoryOpCost(Opcode, SrcVTy->getScalarType(),
                                            Alignment, AddressSpace, CostKind);
 }
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.h b/llvm/lib/Target/X86/X86TargetTransformInfo.h
index ea277bfeab560..ceb6dcc172f94 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.h
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.h
@@ -346,8 +346,8 @@ class X86TTIImpl final : public BasicTTIImplBase<X86TTIImpl> {
                                   Type *DataTy, const Value *Ptr,
                                   Align Alignment, unsigned AddressSpace) const;
 
-  int getGatherOverhead() const;
-  int getScatterOverhead() const;
+  int getGatherOverhead(Type *SrcVTy) const;
+  int getScatterOverhead(Type *SrcVTy) const;
 
   /// @}
 };
diff --git a/llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll b/llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll
new file mode 100644
index 0000000000000..1565568d8d010
--- /dev/null
+++ b/llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll
@@ -0,0 +1,553 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py
+; Cost-model coverage for AMD Zen-tuned masked gather/scatter overheads.
+;
+; ZNVER4 / ZNVER5 enable the per-shape Zen cost tables via
+; TuningPreferAMDZenGSCost (set in ZN4Tuning and inherited by ZN5Tuning) and
+; have AVX-512, so the new tables are consulted in getGSVectorCost.
+; ZNVER3 does NOT carry TuningPreferAMDZenGSCost and lacks both AVX-512 and
+; TuningFastGather, so isLegalMaskedGather() returns false and the cost model
+; walks the scalarise path (getGSScalarCost). The ZNVER3 numbers below are the
+; unchanged scalar fallback cost, included here only to lock in that this
+; change does not regress pre-AVX-512 Zen targets.
+; SKX is a non-Zen AVX-512 baseline showing the generic flat overhead of 2.
+;
+; RUN: opt < %s -S -mtriple=x86_64-unknown-linux-gnu -passes="print<cost-model>" 2>&1 -disable-output -cost-kind=throughput -mcpu=znver4 | FileCheck %s --check-prefix=ZNVER4
+; RUN: opt < %s -S -mtriple=x86_64-unknown-linux-gnu -passes="print<cost-model>" 2>&1 -disable-output -cost-kind=throughput -mcpu=znver5 | FileCheck %s --check-prefix=ZNVER5
+; RUN: opt < %s -S -mtriple=x86_64-unknown-linux-gnu -passes="print<cost-model>" 2>&1 -disable-output -cost-kind=throughput -mcpu=znver3 | FileCheck %s --check-prefix=ZNVER3
+; RUN: opt < %s -S -mtriple=x86_64-unknown-linux-gnu -passes="print<cost-model>" 2>&1 -disable-output -cost-kind=throughput -mcpu=skx    | FileCheck %s --check-prefix=SKX
+
+;------------------------------------------------------------------------------
+; Masked gather - i32 element type
+;------------------------------------------------------------------------------
+
+define <2 x i32> @gather_v2i32(<2 x ptr> %ptrs, <2 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v2i32'
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %v = call <2 x i32> @llvm.masked.gather.v2i32.v2p0(<2 x ptr> align 4 %ptrs, <2 x i1> %mask, <2 x i32> undef)
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %v
+;
+; ZNVER5-LABEL: 'gather_v2i32'
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %v = call <2 x i32> @llvm.masked.gather.v2i32.v2p0(<2 x ptr> align 4 %ptrs, <2 x i1> %mask, <2 x i32> undef)
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %v
+;
+; ZNVER3-LABEL: 'gather_v2i32'
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %v = call <2 x i32> @llvm.masked.gather.v2i32.v2p0(<2 x ptr> align 4 %ptrs, <2 x i1> %mask, <2 x i32> undef)
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %v
+;
+; SKX-LABEL: 'gather_v2i32'
+; SKX-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %v = call <2 x i32> @llvm.masked.gather.v2i32.v2p0(<2 x ptr> align 4 %ptrs, <2 x i1> %mask, <2 x i32> undef)
+; SKX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i32> %v
+;
+  %v = call <2 x i32> @llvm.masked.gather.v2i32.v2p0(<2 x ptr> %ptrs, i32 4, <2 x i1> %mask, <2 x i32> undef)
+  ret <2 x i32> %v
+}
+
+define <4 x i32> @gather_v4i32(<4 x ptr> %ptrs, <4 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v4i32'
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 11 for instruction: %v = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> align 4 %ptrs, <4 x i1> %mask, <4 x i32> undef)
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i32> %v
+;
+; ZNVER5-LABEL: 'gather_v4i32'
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 11 for instruction: %v = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> align 4 %ptrs, <4 x i1> %mask, <4 x i32> undef)
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i32> %v
+;
+; ZNVER3-LABEL: 'gather_v4i32'
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %v = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> align 4 %ptrs, <4 x i1> %mask, <4 x i32> undef)
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i32> %v
+;
+; SKX-LABEL: 'gather_v4i32'
+; SKX-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %v = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> align 4 %ptrs, <4 x i1> %mask, <4 x i32> undef)
+; SKX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i32> %v
+;
+  %v = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> %ptrs, i32 4, <4 x i1> %mask, <4 x i32> undef)
+  ret <4 x i32> %v
+}
+
+define <8 x i32> @gather_v8i32(<8 x ptr> %ptrs, <8 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v8i32'
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 25 for instruction: %v = call <8 x i32> @llvm.masked.gather.v8i32.v8p0(<8 x ptr> align 4 %ptrs, <8 x i1> %mask, <8 x i32> undef)
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i32> %v
+;
+; ZNVER5-LABEL: 'gather_v8i32'
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 25 for instruction: %v = call <8 x i32> @llvm.masked.gather.v8i32.v8p0(<8 x ptr> align 4 %ptrs, <8 x i1> %mask, <8 x i32> undef)
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i32> %v
+;
+; ZNVER3-LABEL: 'gather_v8i32'
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %v = call <8 x i32> @llvm.masked.gather.v8i32.v8p0(<8 x ptr> align 4 %ptrs, <8 x i1> %mask, <8 x i32> undef)
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i32> %v
+;
+; SKX-LABEL: 'gather_v8i32'
+; SKX-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %v = call <8 x i32> @llvm.masked.gather.v8i32.v8p0(<8 x ptr> align 4 %ptrs, <8 x i1> %mask, <8 x i32> undef)
+; SKX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i32> %v
+;
+  %v = call <8 x i32> @llvm.masked.gather.v8i32.v8p0(<8 x ptr> %ptrs, i32 4, <8 x i1> %mask, <8 x i32> undef)
+  ret <8 x i32> %v
+}
+
+define <16 x i32> @gather_v16i32(<16 x ptr> %ptrs, <16 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v16i32'
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 50 for instruction: %v = call <16 x i32> @llvm.masked.gather.v16i32.v16p0(<16 x ptr> align 4 %ptrs, <16 x i1> %mask, <16 x i32> undef)
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <16 x i32> %v
+;
+; ZNVER5-LABEL: 'gather_v16i32'
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 50 for instruction: %v = call <16 x i32> @llvm.masked.gather.v16i32.v16p0(<16 x ptr> align 4 %ptrs, <16 x i1> %mask, <16 x i32> undef)
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <16 x i32> %v
+;
+; ZNVER3-LABEL: 'gather_v16i32'
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 55 for instruction: %v = call <16 x i32> @llvm.masked.gather.v16i32.v16p0(<16 x ptr> align 4 %ptrs, <16 x i1> %mask, <16 x i32> undef)
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <16 x i32> %v
+;
+; SKX-LABEL: 'gather_v16i32'
+; SKX-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %v = call <16 x i32> @llvm.masked.gather.v16i32.v16p0(<16 x ptr> align 4 %ptrs, <16 x i1> %mask, <16 x i32> undef)
+; SKX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <16 x i32> %v
+;
+  %v = call <16 x i32> @llvm.masked.gather.v16i32.v16p0(<16 x ptr> %ptrs, i32 4, <16 x i1> %mask, <16 x i32> undef)
+  ret <16 x i32> %v
+}
+
+;------------------------------------------------------------------------------
+; Masked gather - i64 element type
+;------------------------------------------------------------------------------
+
+define <2 x i64> @gather_v2i64(<2 x ptr> %ptrs, <2 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v2i64'
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %v = call <2 x i64> @llvm.masked.gather.v2i64.v2p0(<2 x ptr> align 8 %ptrs, <2 x i1> %mask, <2 x i64> undef)
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i64> %v
+;
+; ZNVER5-LABEL: 'gather_v2i64'
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %v = call <2 x i64> @llvm.masked.gather.v2i64.v2p0(<2 x ptr> align 8 %ptrs, <2 x i1> %mask, <2 x i64> undef)
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i64> %v
+;
+; ZNVER3-LABEL: 'gather_v2i64'
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %v = call <2 x i64> @llvm.masked.gather.v2i64.v2p0(<2 x ptr> align 8 %ptrs, <2 x i1> %mask, <2 x i64> undef)
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i64> %v
+;
+; SKX-LABEL: 'gather_v2i64'
+; SKX-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %v = call <2 x i64> @llvm.masked.gather.v2i64.v2p0(<2 x ptr> align 8 %ptrs, <2 x i1> %mask, <2 x i64> undef)
+; SKX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i64> %v
+;
+  %v = call <2 x i64> @llvm.masked.gather.v2i64.v2p0(<2 x ptr> %ptrs, i32 8, <2 x i1> %mask, <2 x i64> undef)
+  ret <2 x i64> %v
+}
+
+define <4 x i64> @gather_v4i64(<4 x ptr> %ptrs, <4 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v4i64'
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %v = call <4 x i64> @llvm.masked.gather.v4i64.v4p0(<4 x ptr> align 8 %ptrs, <4 x i1> %mask, <4 x i64> undef)
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i64> %v
+;
+; ZNVER5-LABEL: 'gather_v4i64'
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %v = call <4 x i64> @llvm.masked.gather.v4i64.v4p0(<4 x ptr> align 8 %ptrs, <4 x i1> %mask, <4 x i64> undef)
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i64> %v
+;
+; ZNVER3-LABEL: 'gather_v4i64'
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %v = call <4 x i64> @llvm.masked.gather.v4i64.v4p0(<4 x ptr> align 8 %ptrs, <4 x i1> %mask, <4 x i64> undef)
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i64> %v
+;
+; SKX-LABEL: 'gather_v4i64'
+; SKX-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %v = call <4 x i64> @llvm.masked.gather.v4i64.v4p0(<4 x ptr> align 8 %ptrs, <4 x i1> %mask, <4 x i64> undef)
+; SKX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i64> %v
+;
+  %v = call <4 x i64> @llvm.masked.gather.v4i64.v4p0(<4 x ptr> %ptrs, i32 8, <4 x i1> %mask, <4 x i64> undef)
+  ret <4 x i64> %v
+}
+
+define <8 x i64> @gather_v8i64(<8 x ptr> %ptrs, <8 x i1> %mask) {
+; ZNVER4-LABEL: 'gather_v8i64'
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %v = call <8 x i64> @llvm.masked.gather.v8i64.v8p0(<8 x ptr> align 8 %ptrs, <8 x i1> %mask, <8 x i64> undef)
+; ZNVER4-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i64> %v
+;
+; ZNVER5-LABEL: 'gather_v8i64'
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %v = call <8 x i64> @llvm.masked.gather.v8i64.v8p0(<8 x ptr> align 8 %ptrs, <8 x i1> %mask, <8 x i64> undef)
+; ZNVER5-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i64> %v
+;
+; ZNVER3-LABEL: 'gather_v8i64'
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 29 for instruction: %v = call <8 x i64> @llvm.masked.gather.v8i64.v8p0(<8 x ptr> align 8 %ptrs, <8 x i1> %mask, <8 x i64> undef)
+; ZNVER3-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i64> %v
+;
+; SKX-LABEL: 'gather_v8i64'
+; SKX-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %v = call <8 x i64> @llvm.masked.gather.v8i64.v8p0(<8 x ptr> align 8 %ptrs, <8 x i1> %mask, <8 x i64> undef)
+; SKX-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i64> %v
+;
+  %v = call <8 x i64> @llvm.masked.gather.v8i64.v8p0(<8 x ptr> %ptrs, i32 8, <8 x i1> %mask, <8 x i64> undef)
+  ret <8 x i64> %v
+}
+
+;---------...
[truncated]

@RKSimon RKSimon self-requested a review May 25, 2026 10:16
MattPD
MattPD previously requested changes May 26, 2026
Copy link
Copy Markdown
Member

@MattPD MattPD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Per-shape cost modeling looks promising for Zen gather/scatter tuning. A few questions, comments, and suggestions, roughly in priority order:


1. i64 shapes fall through to overhead=2: too low for Zen4?

The tables omit i64 with a note "i64 falls through to the flat default". Indeed, the fallthrough path hits if (ST->hasAVX512()) return 2, giving znver4 the same i64 gather/scatter cost as Intel SKX. However:

  • VPSCATTERQQ (zmm) on Zen4 is 48 µops vs ≤20 on SKX (uops.info).
  • Port pressure: 18 µops restricted to ports {4,5} on Zen4 vs 8 µops on ports {2,3} on SKX.
  • Performance impact: in benchmarking on my end SPECint2006/LIBQUANTUM is 1.5–1.66× slower with v8i64 scatter on Zen4.

With overhead=2, the vectorizer sees v8i64 scatter at total cost ≈ 2 + 8×1 = 10, which looks highly profitable vs the scalar alternative. This re-enables the (harmful for performance) i64 scatter decisions that the conservative approach was trying to prevent (cf. PR #198850).

The commit message says "the i64 sweep landed within the noise of the generic flat overhead". However, the generic flat overhead for AVX-512 targets is 2, which is the Intel-tuned value. If the sweep landed near 2, that would mean i64 gather/scatter is as fast as on SKX, which contradicts the µop data.

Suggestion: Either add explicit i64 entries with overhead derived from measurement (I'd expect ≥20 given the 48-µop count), or fall through to a Zen-appropriate conservative value rather than the Intel-derived 2.


2. Dead table entries (v16f64, v2 shapes)

Several table rows appear to be unreachable:

  • v16f64 ({ISD::LOAD, MVT::v16f64, 14}, {ISD::STORE, MVT::v16f64, 3}): getGSVectorCost performs type legalization splitting before calling getGatherOverhead/getScatterOverhead. For <16 x double> (1024 bits), the function recurses with <8 x double> after splitting. AFAICT, the v16f64 entries can never be matched, so the effective cost is 2 × (v8f64 entry), not the tabulated value.

  • All v2 gather entries ({ISD::LOAD, MVT::v2i32, 20}, etc.): AVX-512 targets force-scalarize VF=2 gathers via forceScalarizeMaskedGather. The test confirms this: gather_v2i32 on ZNVER4 costs 8 (scalarized), not 20 + per-element.

These entries create a false impression of coverage and will mislead future maintainers who attempt to modify them.

Suggestion: Remove unreachable entries; add a brief comment noting that v16f64 is handled by splitting (cost = 2 × v8f64) and VF=2 is force-scalarized.


3. End-to-end loop-vectorize test

The test validates cost model numbers (opt -passes='print<cost-model>') but not vectorization decisions. The stated goal, fixing issue #91370 (vectorizer emitting gather for contiguous loads on znver4), is untested at the pass level.

Suggestion: Add a test under llvm/test/Transforms/LoopVectorize/X86/ that exercises the actual vectorizer decision, e.g.:

  • A loop where gather IS chosen on znver4 (a strided/indirect pattern matching the lbm win): CHECK: @llvm.masked.gather
  • A loop where gather is NOT chosen (e.g., an i64 indirect load): CHECK-NOT: @llvm.masked.gather

This locks in the behavior (the high-level intent) the cost model is meant to produce, so future cost model refactors that accidentally re-enable harmful gathers get caught.


4. Non-monotonic values deserve a comment

The gather table progression (v2i32=20, v4i32=7, v8i32=17, v16i32=14) is non-intuitive and will look like a typo to future developers. Similarly, scatter: v16i32=6 but v8i32=14; v16i32=6 but v16f32=16.

I believe these arise because the "break-even" methodology measures the overhead threshold at which overhead + VF × memcost > scalar_alternative_cost, and the scalar alternative's cost doesn't scale linearly with VF (due to unrolling, pipelining, address computation differences). If so, a brief comment in the source explaining why values are non-monotonic would help future maintainability significantly.

Also: the i32-vs-f32 divergence at VF=16 for scatter (6 vs 16)--is this genuine or a measurement artifact? They're the same physical 512-bit operation on 32-bit lanes.


5. Methodology reproducibility (minor)

The PR body references -force-gather-overhead-cost=N as the sweep tool — this doesn't exist in upstream LLVM. The values are likely derived from a local patch that overrides the return value of getGatherOverhead/getScatterOverhead. That's a perfectly reasonable approach (I've done the same in a downstream fork), but since the methodology can't be replicated by other developers, a sentence in the commit message noting "values measured using a local patch that forces the gather/scatter overhead to a specified value" would be helpful for transparency.


Overall: the per-shape approach is promising the provided benchmarks look good (although a wider SPEC benchmark results would be very welcome). The i64 fallthrough and dead entries are the main items I'd want addressed before landing.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 27, 2026

✅ With the latest revision this PR passed the C/C++ code formatter.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 27, 2026

✅ With the latest revision this PR passed the undef deprecator.

@amd-subharad
Copy link
Copy Markdown
Author

Hello @MattPD
Thank you for your review
I have tried to address your concerns with the latest amends

  1. the missing i64: measured according to our methodology added v4i64=10 / v8i64=22 to both tables. The fall-through to 2 was indeed harmful and runtime regression was 1.2-3.5x depending on shape and stride
  2. dead entries: removed (v2 force-scalarised, v16f64 split by type legalisation), with explanatory comments
  3. end-to-end test: added in Transforms/LoopVectorize/X86/
  4. v16i32 vs v16f32 scatter (6 vs 16): transcription typo, fixed. runtime confirms the two are identical on Zen.
  5. methodology: commit message now discloses the local cl::opt instrument explicitly.
    Please let me know if there are other shortcomings

@MattPD MattPD self-requested a review May 27, 2026 23:36
Copy link
Copy Markdown
Member

@MattPD MattPD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the revisions: No remaining blockers on my end!

Two optional suggestions for a follow-up (which I leave up to you):

  • The LV test has CHECK-NOT: @llvm.masked.gather for cases 2/3, which could pass vacuously if the loop fails to vectorize entirely (not just "vectorizes without gather"). Adding a positive anchor like CHECK: vector.body would distinguish "scalarized the gather" from "didn't vectorize at all."
  • For symmetry with scatter_v16i32_gep, a gather_v16i32_gep test would exercise the GEP-index reduction path for gather as well. The code path is shared so this is purely for documentation/completeness.

MattPD

This comment was marked as duplicate.

@MattPD MattPD dismissed their stale review May 27, 2026 23:39

Addressed

@amd-subharad amd-subharad force-pushed the zen-gather-scatter-costs branch from 6bf7ce0 to fac152a Compare May 28, 2026 03:51
amd-subharad added a commit to amd-subharad/zen-gather-scatter-costs that referenced this pull request May 28, 2026
Two test-only nits from review of llvm#199488:

1. The LoopVectorize test had `CHECK-NOT: @llvm.masked.gather` on Case 2
   (i64 gather avoided) and Case 3 (unit-stride no gather) without a
   positive anchor, so the check would pass vacuously if the loop ever
   failed to vectorize at all (rather than vectorizing without a gather).
   Adding `CHECK: vector.body` in front of each `CHECK-NOT` distinguishes
   the two outcomes; under `-force-vector-width=1` both new CHECKs now
   correctly fail.

2. The cost-model test had `scatter_v16{i32,f32}_gep` to exercise the
   GEP-index reducibility path for the v16 scatter row but no analogous
   case for gather. Added `gather_v16i32_gep` (cost = 30 on znver4/5).
   Both i32 and f32 GEP cases for gather would share the same code path,
   so one case is sufficient for the v16 gather row; the section comment
   is updated to make that explicit.

No behavior change. Both tests pass.
Copy link
Copy Markdown
Contributor

@Jason-Van-Beusekom Jason-Van-Beusekom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM for me (after the failing test is fixed) however I would like to get approval from others before merging

…er4+

The X86 cost model currently returns a single flat overhead from
getGatherOverhead / getScatterOverhead, applied to every shape of
masked gather or scatter on every X86 subtarget that reaches the
gather/scatter path. On modern AMD parts the actual cost of these
instructions varies substantially with the vector width and element
size, and the single flat number forces the LoopVectorizer to either
under- or over-estimate the profitability of vectorising loops that
need indirect memory access.

This change adds a subtarget tuning bit, TuningPreferAMDZenGSCost,
attached to ZN4Tuning so znver4 and znver5 pick it up automatically.
Pre-AVX-512 Zen parts (znver1..3) take the scalarise path for masked
gather and never reach the new code, so the bit is intentionally NOT
placed in ZNTuning; flagging older Zen parts with the feature would
be misleading.

When the tuning bit is set, getGatherOverhead / getScatterOverhead
look the source vector type up in per-shape cost tables before
falling back to the existing generic flat overhead. The tables only
contain the shapes that are actually reached at the lookup site:
VF=2 is force-scalarised on AVX-512 (forceScalarizeMaskedGather), and
v16f64 is split via type legalisation in getGSVectorCost (1024-bit
data exceeds a single zmm), so neither row would ever be queried and
both are omitted. The live tables cover gather and scatter for
VF=4..16 over i32 / f32 / f64, plus VF=4 and VF=8 for i64.

Methodology

The numbers are empirical break-even costs measured on znver4 and
znver5 hardware. The methodology, summarised:

  1. Take a controlled gather/scatter micro-benchmark with one
     indirect memory access per inner-loop iteration.
  2. Sweep the gather/scatter overhead via a local cl::opt patch on
     X86TTI (the upstream tree has no such knob today; this is a
     standalone local instrument that returns the forced value from
     getGatherOverhead / getScatterOverhead). A reproducible version
     of the patch lives on the author's zen-gs-i64-sweep branch.
  3. For each (element type, VF) compile the micro-benchmark at a
     range of forced overheads and identify the "flip" cost above
     which the LoopVectorizer stops emitting the gather / scatter
     instruction (it switches to an extract-load-insert lowering or
     to a pure scalar loop). The tabulated cost is the highest value
     at which gather/scatter emission was the right call: the
     vectoriser still selects it AND the resulting binary is at least
     as fast as the post-flip alternatives on the test hardware.
  4. The sweep is run independently for each (element type, VF) on
     Genoa, Milan and Turin and re-validated on Zen 5.

Notes on individual entries

  * i64 entries are higher than their f64 counterparts at the same
    VF. The scalar alternative for i64 runs on the integer pipeline
    (cheaper than f64 on the FP pipeline), so gather has to be
    cheaper to win. At the f64-style break-even, i64 gather was
    1.7-3.5x slower than the scalarised lowering across stride
    patterns on Zen 5, so the i64 break-even sits at the minimum
    cost that suppresses vpgatherqq / vpscatterqq emission for the
    measured patterns (which include the libquantum-style indirect
    scatter cited in PR llvm#198850).
  * f32 rows for both tables mirror i32 rows: the original sweep
    only characterised i32 and f64 lanes, and the f32 rows were
    derived by symmetry because vpgatherdd / vpscatterdd and the
    corresponding ps variants share the same physical lane width on
    Zen. The runtime equivalence of vpscatterdd and vscatterdps was
    verified directly (within 3% across VF and stride patterns).

Tests

The cost-model test
  llvm/test/Analysis/CostModel/X86/masked-gather-scatter-amd-zen.ll
covers every live shape on znver4 / znver5 and pins the unchanged
behaviour for znver3 (scalarise path) and skx (generic flat
overhead). It also covers the 32-bit-reducible GEP form for v16f32
and v16i32 scatter, which is the only path that actually queries the
v16 row of the scatter table (the <16 x ptr> form recurses to v8
through type legalisation).

A second test
  llvm/test/Transforms/LoopVectorize/X86/amd-zen-gather-scatter-decisions.ll
pins the resulting vectoriser decisions end-to-end: gather IS emitted
for an f64 indirect-load reduction on znver5; gather is NOT emitted
for an i64 indirect-load reduction (where the cost table is meant to
suppress it); and a unit-stride load must not become a gather
regardless of cost-table values (regression guard for issue llvm#91370).
Two test-only nits from review of llvm#199488:

1. The LoopVectorize test had `CHECK-NOT: @llvm.masked.gather` on Case 2
   (i64 gather avoided) and Case 3 (unit-stride no gather) without a
   positive anchor, so the check would pass vacuously if the loop ever
   failed to vectorize at all (rather than vectorizing without a gather).
   Adding `CHECK: vector.body` in front of each `CHECK-NOT` distinguishes
   the two outcomes; under `-force-vector-width=1` both new CHECKs now
   correctly fail.

2. The cost-model test had `scatter_v16{i32,f32}_gep` to exercise the
   GEP-index reducibility path for the v16 scatter row but no analogous
   case for gather. Added `gather_v16i32_gep` (cost = 30 on znver4/5).
   Both i32 and f32 GEP cases for gather would share the same code path,
   so one case is sufficient for the v16 gather row; the section comment
   is updated to make that explicit.

No behavior change. Both tests pass.
CI's code_formatter job runs both clang-format AND the undef-deprecator
check; the latter rejects new uses of `undef` in tests under the
LangRef poison/undef migration. The masked-gather passthru argument
was the only `undef` in the file. Replace all 75 occurrences with
`poison` (purely an operand-printing change -- gather costs and
behaviour are unaffected).

No functional change; both tests still pass.
Apply clang-format to the per-shape Zen gather/scatter tables in
X86TargetTransformInfo.cpp:

  - Drop the manual double-space numeric alignment in the rows
    (clang-format collapses it and re-packs the rows 2-per-line).
  - Re-wrap the `if (const auto *E = CostTableLookup(...))` line
    in getGatherOverhead to put the call expression on its own
    indented line.

Pure formatting; cost tables and lookup behaviour are unchanged.
@amd-subharad amd-subharad force-pushed the zen-gather-scatter-costs branch from 68ed3f6 to 8cc82c4 Compare May 30, 2026 18:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend:X86 llvm:analysis Includes value tracking, cost tables and constant folding llvm:transforms

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants