[MLIR][Affine] Refuse affine-super-vectorize on non-unit access strides by swjng · Pull Request #196973 · llvm/llvm-project

swjng · 2026-05-11T15:15:48Z

Summary

affine-super-vectorize can currently treat %a[%i * 4] as a contiguous
access because the vectorized IV affects only one memref dimension. It then
emits a contiguous vector.transfer_write, touching cells that the scalar loop
never writes.

This patch requires the IV coefficient in the varying memref dimension to be
1 before treating an access as contiguous.

Reproducer

func.func @non_unit_stride_store_unsupported(%arg0: memref<64xf32>) {
  %cst = arith.constant 0.000000e+00 : f32
  affine.for %i = 0 to 16 {
    affine.store %cst, %arg0[%i * 4] : memref<64xf32>
  }
  return
}

Apply:

mlir-opt repro.mlir -affine-super-vectorize="virtual-vector-size=4"

Before this patch, the pass can emit a vector.transfer_write at %i * 4.
That writes %i * 4 + 1, %i * 4 + 2, and %i * 4 + 3, which are not part of
the scalar loop's memory footprint.

Why

The vectorization legality check treats the access as vectorizable because the
loop IV affects only one memref dimension. It does not require the IV's
coefficient in that dimension to be 1.

For %a[%i * 4], the scalar loop writes only {0, 4, 8, ...}. The vectorized
loop writes a full vector at %i * 4, touching {%i*4, %i*4+1, %i*4+2, %i*4+3}. Those extra lanes are in-bounds but outside the source loop's
iteration footprint.

Fix

Inline affine.apply chains via fullyComposeAffineMapAndOperands so the
access map's operand list refers to leaf SSA values. After the existing
per-dimension invariance scan picks the unique IV-varying dimension, flatten
that dimension's result expression and read off the IV coefficient. If it is
not 1, reject vectorization. Other dimensions never constrain stride, so
the coefficient computation is deferred until the varying dimension is
identified rather than run on every dimension.

Composition is required to see through %a[%apply(%i)]-style operands: the
existing super-vec machinery (including %i mod N strided-broadcast patterns)
already relies on this canonical form, so reusing it keeps the check
consistent and avoids over-rejection.

Test

Added
mlir/test/Dialect/Affine/SuperVectorize/vectorize_non_unit_stride.mlir,
checking that %arg0[%i * 4] remains scalar and no vector.transfer_write is
emitted.

Updated mlir/test/Dialect/Affine/access-analysis.mlir: the test previously
carried Note/FIXME: access stride isn't being checked annotations next to
expected-remark lines that asserted buggy "contiguous along loop N"
diagnostics for stride-8 and stride-(-16) accesses. With the fix in place
those remarks are correctly suppressed; the FIXME comments and corresponding
expected-remarks are dropped.

Ran:

ninja -C build mlir-opt FileCheck count not
build/bin/llvm-lit -sv mlir/test/Dialect/Affine/

Result: 70/70 tests pass (no regressions in vectorize_affine_apply.mlir,
which exercises the affine.apply-composition path).

How it was found

Surfaced by an in-progress SMT-solver-based analysis tool for MLIR affine
passes.

Tool-use disclosure

This change was drafted with OpenAI Codex assistance per LLVM's AI Tool Use
Policy. All code was read, reviewed, and tested locally; the commit should
carry an Assisted-by: OpenAI Codex trailer.

llvmorg-github-actions · 2026-05-11T15:16:39Z

@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-mlir-affine

Author: Soowon Jeong (swjng)

Changes

[MLIR][Affine] Refuse affine-super-vectorize on non-unit access strides

Summary

affine-super-vectorize can currently treat %a[%i * 4] as a contiguous
access because the vectorized IV affects only one memref dimension. It then
emits a contiguous vector.transfer_write, touching cells that the scalar loop
never writes.

This patch requires the IV coefficient in the varying memref dimension to be
1 before treating an access as contiguous.

Reproducer

func.func @<!-- -->non_unit_stride_store_unsupported(%arg0: memref&lt;64xf32&gt;) {
  %cst = arith.constant 0.000000e+00 : f32
  affine.for %i = 0 to 16 {
    affine.store %cst, %arg0[%i * 4] : memref&lt;64xf32&gt;
  }
  return
}

Apply:

mlir-opt repro.mlir -affine-super-vectorize="virtual-vector-size=4"

Before this patch, the pass can emit a vector.transfer_write at %i * 4.
That writes %i * 4 + 1, %i * 4 + 2, and %i * 4 + 3, which are not part of
the scalar loop's memory footprint.

Why

The vectorization legality check treats the access as vectorizable because the
loop IV affects only one memref dimension. It does not require the IV's
coefficient in that dimension to be 1.

For %a[%i * 4], the scalar loop writes only {0, 4, 8, ...}. The vectorized
loop writes a full vector at %i * 4, touching {%i*4, %i*4+1, %i*4+2, %i*4+3}. Those extra lanes are in-bounds but outside the source loop's
iteration footprint.

Fix

Inline affine.apply chains via fullyComposeAffineMapAndOperands so the
access map's operand list refers to leaf SSA values, then flatten each result
expression and compute the IV coefficient. If the IV-varying dimension is the
contiguous one but the coefficient is not 1, reject vectorization.

Composition is required to see through %a[%apply(%i)]-style operands: the
existing super-vec machinery (including %i mod N strided-broadcast patterns)
already relies on this canonical form, so reusing it keeps the check
consistent and avoids over-rejection.

Test

Added
mlir/test/Dialect/Affine/SuperVectorize/vectorize_non_unit_stride.mlir,
checking that %arg0[%i * 4] remains scalar and no vector.transfer_write is
emitted.

Updated mlir/test/Dialect/Affine/access-analysis.mlir: the test previously
carried Note/FIXME: access stride isn't being checked annotations next to
expected-remark lines that asserted buggy "contiguous along loop N"
diagnostics for stride-8 and stride-(-16) accesses. With the fix in place
those remarks are correctly suppressed; the FIXME comments and corresponding
expected-remarks are dropped.

Ran:

ninja -C build mlir-opt FileCheck count not
build/bin/llvm-lit -sv mlir/test/Dialect/Affine/

Result: 70/70 tests pass (no regressions in vectorize_affine_apply.mlir,
which exercises the affine.apply-composition path).

How it was found

Surfaced by an in-progress SMT-solver-based analysis tool for MLIR affine
passes.

Tool-use disclosure

This change was drafted with OpenAI Codex assistance per LLVM's AI Tool Use
Policy. All code was read, reviewed, and tested locally; the commit should
carry an Assisted-by: OpenAI Codex trailer.

Full diff: https://github.com/llvm/llvm-project/pull/196973.diff

3 Files Affected:

(modified) mlir/lib/Dialect/Affine/Analysis/LoopAnalysis.cpp (+20)
(added) mlir/test/Dialect/Affine/SuperVectorize/vectorize_non_unit_stride.mlir (+17)
(modified) mlir/test/Dialect/Affine/access-analysis.mlir (+8-7)

diff --git a/mlir/lib/Dialect/Affine/Analysis/LoopAnalysis.cpp b/mlir/lib/Dialect/Affine/Analysis/LoopAnalysis.cpp
index 166d39e88d41e..caac64d86cc46 100644
--- a/mlir/lib/Dialect/Affine/Analysis/LoopAnalysis.cpp
+++ b/mlir/lib/Dialect/Affine/Analysis/LoopAnalysis.cpp
@@ -12,6 +12,7 @@
 
 #include "mlir/Dialect/Affine/Analysis/LoopAnalysis.h"
 
+#include "mlir/Analysis/FlatLinearValueConstraints.h"
 #include "mlir/Analysis/SliceAnalysis.h"
 #include "mlir/Dialect/Affine/Analysis/AffineAnalysis.h"
 #include "mlir/Dialect/Affine/Analysis/AffineStructures.h"
@@ -331,8 +332,25 @@ bool mlir::affine::isContiguousAccess(Value iv, LoadOrStoreOp memoryOp,
   int uniqueVaryingIndexAlongIv = -1;
   auto accessMap = memoryOp.getAffineMap();
   SmallVector<Value, 4> mapOperands(memoryOp.getMapOperands());
+  // Inline affine.apply chains so the iv-coefficient check below sees iv even
+  // when it arrives through a derived value.
+  fullyComposeAffineMapAndOperands(&accessMap, &mapOperands);
   unsigned numDims = accessMap.getNumDims();
+  unsigned numSymbols = accessMap.getNumSymbols();
   for (unsigned i = 0, e = memRefType.getRank(); i < e; ++i) {
+    SmallVector<int64_t, 4> flattenedExpr;
+    if (failed(getFlattenedAffineExpr(accessMap.getResult(i), numDims,
+                                      numSymbols, &flattenedExpr)))
+      return false;
+
+    int64_t ivCoeff = 0;
+    for (unsigned dim = 0; dim < numDims; ++dim)
+      if (mapOperands[dim] == iv)
+        ivCoeff += flattenedExpr[dim];
+    for (unsigned sym = 0; sym < numSymbols; ++sym)
+      if (mapOperands[numDims + sym] == iv)
+        ivCoeff += flattenedExpr[numDims + sym];
+
     // Gather map operands used in result expr 'i' in 'exprOperands'.
     SmallVector<Value, 4> exprOperands;
     auto resultExpr = accessMap.getResult(i);
@@ -352,6 +370,8 @@ bool mlir::affine::isContiguousAccess(Value iv, LoadOrStoreOp memoryOp,
         uniqueVaryingIndexAlongIv = i;
       }
     }
+    if (uniqueVaryingIndexAlongIv == static_cast<int>(i) && ivCoeff != 1)
+      return false;
   }
 
   if (uniqueVaryingIndexAlongIv == -1)
diff --git a/mlir/test/Dialect/Affine/SuperVectorize/vectorize_non_unit_stride.mlir b/mlir/test/Dialect/Affine/SuperVectorize/vectorize_non_unit_stride.mlir
new file mode 100644
index 0000000000000..d0b69078fad36
--- /dev/null
+++ b/mlir/test/Dialect/Affine/SuperVectorize/vectorize_non_unit_stride.mlir
@@ -0,0 +1,17 @@
+// RUN: mlir-opt %s -affine-super-vectorize="virtual-vector-size=4" | FileCheck %s
+
+// A non-unit coefficient on the vectorized IV is not a contiguous access. A
+// plain vector.transfer_write would touch consecutive cells even though the
+// scalar loop writes only every fourth cell.
+
+// CHECK-LABEL: func.func @non_unit_stride_store_unsupported
+// CHECK:         affine.for
+// CHECK:           affine.store
+// CHECK-NOT:     vector.transfer_write
+func.func @non_unit_stride_store_unsupported(%arg0: memref<64xf32>) {
+  %cst = arith.constant 0.000000e+00 : f32
+  affine.for %i = 0 to 16 {
+    affine.store %cst, %arg0[%i * 4] : memref<64xf32>
+  }
+  return
+}
diff --git a/mlir/test/Dialect/Affine/access-analysis.mlir b/mlir/test/Dialect/Affine/access-analysis.mlir
index 789de646a8f9e..1bb230aebbfe1 100644
--- a/mlir/test/Dialect/Affine/access-analysis.mlir
+++ b/mlir/test/Dialect/Affine/access-analysis.mlir
@@ -11,17 +11,16 @@ func.func @loop_simple(%A : memref<?x?xf32>, %B : memref<?x?x?xf32>) {
        // expected-remark@above {{invariant along loop 1}}
        affine.load %A[%c0, 8 * %i + %j] : memref<?x?xf32>
        // expected-remark@above {{contiguous along loop 1}}
-       // Note/FIXME: access stride isn't being checked.
-       // expected-remark@-3 {{contiguous along loop 0}}
+       // Access stride 8 along loop 0 in the last memref dim is no longer
+       // reported as contiguous.
 
        // These are all non-contiguous along both loops. Nothing is emitted.
        affine.load %A[%i, %c0] : memref<?x?xf32>
        // expected-remark@above {{invariant along loop 1}}
-       // Note/FIXME: access stride isn't being checked.
        affine.load %A[%i, 8 * %j] : memref<?x?xf32>
-       // expected-remark@above {{contiguous along loop 1}}
+       // Access stride 8 along loop 1; no contiguity remark emitted.
        affine.load %A[%j, 4 * %i] : memref<?x?xf32>
-       // expected-remark@above {{contiguous along loop 0}}
+       // Access stride 4 along loop 0 in the last memref dim; no remark.
      }
    }
    return
@@ -70,8 +69,9 @@ func.func @tiled(%arg0: memref<*xf32>) {
             // expected-remark@above {{invariant along loop 4}}
             affine.store %0, %alloc_0[0, %arg1 * -16 + %arg4, 0, %arg3 * -16 + %arg5] : memref<1x16x1x16xf32>
             // expected-remark@above {{contiguous along loop 4}}
-            // expected-remark@above {{contiguous along loop 2}}
             // expected-remark@above {{invariant along loop 1}}
+            // The %arg3 contribution to dim 3 is `%arg3 * -16`, so the
+            // access is not contiguous along loop 2.
           }
         }
         affine.for %arg4 = #map(%arg1) to #map1(%arg1) {
@@ -79,7 +79,8 @@ func.func @tiled(%arg0: memref<*xf32>) {
             affine.for %arg6 = #map(%arg3) to #map1(%arg3) {
               %0 = affine.load %alloc_0[0, %arg1 * -16 + %arg4, -%arg2 + %arg5, %arg3 * -16 + %arg6] : memref<1x16x1x16xf32>
               // expected-remark@above {{contiguous along loop 5}}
-              // expected-remark@above {{contiguous along loop 2}}
+              // The %arg3 contribution to dim 3 is `%arg3 * -16`, so the
+              // access is not contiguous along loop 2.
               affine.store %0, %alloc[0, %arg5, %arg6, %arg4] : memref<1x224x224x64xf32>
               // expected-remark@above {{contiguous along loop 3}}
               // expected-remark@above {{invariant along loop 0}}

`isContiguousAccess` checked only whether the candidate vector IV appears in one memref dimension; it didn't check the IV's *coefficient*. As a result `%a[%i * 4]` was treated as contiguous, and super-vectorize emitted a `vector.transfer_write` that touches `{%i*4, %i*4+1, %i*4+2, %i*4+3}` while the scalar loop only writes `{0, 4, 8, ...}`. After composing intervening `affine.apply` chains with `fullyComposeAffineMapAndOperands`, flatten each result expression and require the IV-varying dimension's coefficient on the candidate IV to be exactly `1`. `mlir/test/Dialect/Affine/access-analysis.mlir` previously documented this as a `Note/FIXME: access stride isn't being checked` next to expected-remarks that asserted spurious "contiguous along loop N" for stride-8 / stride-(-16) accesses. With the fix in place those remarks are correctly suppressed; the FIXME annotations and the remarks they guarded are dropped. Adds a regression in `mlir/test/Dialect/Affine/SuperVectorize/vectorize_non_unit_stride.mlir`.

llvmorg-github-actions Bot added mlir:affine mlir labels May 11, 2026

swjng force-pushed the fix/affine-supervec-nonunit-stride branch from 96134f9 to c68902c Compare May 12, 2026 02:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MLIR][Affine] Refuse affine-super-vectorize on non-unit access strides#196973

[MLIR][Affine] Refuse affine-super-vectorize on non-unit access strides#196973
swjng wants to merge 1 commit into
llvm:mainfrom
swjng:fix/affine-supervec-nonunit-stride

swjng commented May 11, 2026 •

edited

Loading

Uh oh!

llvmorg-github-actions Bot commented May 11, 2026 •

edited

Loading

[MLIR][Affine] Refuse affine-super-vectorize on non-unit access strides

Summary

Reproducer

Why

Fix

Test

How it was found

Tool-use disclosure

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

swjng commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Reproducer

Why

Fix

Test

How it was found

Tool-use disclosure

Uh oh!

llvmorg-github-actions Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

[MLIR][Affine] Refuse affine-super-vectorize on non-unit access strides

Summary

Reproducer

Why

Fix

Test

How it was found

Tool-use disclosure

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

swjng commented May 11, 2026 •

edited

Loading

llvmorg-github-actions Bot commented May 11, 2026 •

edited

Loading