Skip to content

[MLIR][Affine] Refuse affine-super-vectorize on non-unit access strides#196973

Open
swjng wants to merge 1 commit into
llvm:mainfrom
swjng:fix/affine-supervec-nonunit-stride
Open

[MLIR][Affine] Refuse affine-super-vectorize on non-unit access strides#196973
swjng wants to merge 1 commit into
llvm:mainfrom
swjng:fix/affine-supervec-nonunit-stride

Conversation

@swjng
Copy link
Copy Markdown
Contributor

@swjng swjng commented May 11, 2026

Summary

affine-super-vectorize can currently treat %a[%i * 4] as a contiguous
access because the vectorized IV affects only one memref dimension. It then
emits a contiguous vector.transfer_write, touching cells that the scalar loop
never writes.

This patch requires the IV coefficient in the varying memref dimension to be
1 before treating an access as contiguous.

Reproducer

func.func @non_unit_stride_store_unsupported(%arg0: memref<64xf32>) {
  %cst = arith.constant 0.000000e+00 : f32
  affine.for %i = 0 to 16 {
    affine.store %cst, %arg0[%i * 4] : memref<64xf32>
  }
  return
}

Apply:

mlir-opt repro.mlir -affine-super-vectorize="virtual-vector-size=4"

Before this patch, the pass can emit a vector.transfer_write at %i * 4.
That writes %i * 4 + 1, %i * 4 + 2, and %i * 4 + 3, which are not part of
the scalar loop's memory footprint.

Why

The vectorization legality check treats the access as vectorizable because the
loop IV affects only one memref dimension. It does not require the IV's
coefficient in that dimension to be 1.

For %a[%i * 4], the scalar loop writes only {0, 4, 8, ...}. The vectorized
loop writes a full vector at %i * 4, touching {%i*4, %i*4+1, %i*4+2, %i*4+3}. Those extra lanes are in-bounds but outside the source loop's
iteration footprint.

Fix

Inline affine.apply chains via fullyComposeAffineMapAndOperands so the
access map's operand list refers to leaf SSA values. After the existing
per-dimension invariance scan picks the unique IV-varying dimension, flatten
that dimension's result expression and read off the IV coefficient. If it is
not 1, reject vectorization. Other dimensions never constrain stride, so
the coefficient computation is deferred until the varying dimension is
identified rather than run on every dimension.

Composition is required to see through %a[%apply(%i)]-style operands: the
existing super-vec machinery (including %i mod N strided-broadcast patterns)
already relies on this canonical form, so reusing it keeps the check
consistent and avoids over-rejection.

Test

Added
mlir/test/Dialect/Affine/SuperVectorize/vectorize_non_unit_stride.mlir,
checking that %arg0[%i * 4] remains scalar and no vector.transfer_write is
emitted.

Updated mlir/test/Dialect/Affine/access-analysis.mlir: the test previously
carried Note/FIXME: access stride isn't being checked annotations next to
expected-remark lines that asserted buggy "contiguous along loop N"
diagnostics for stride-8 and stride-(-16) accesses. With the fix in place
those remarks are correctly suppressed; the FIXME comments and corresponding
expected-remarks are dropped.

Ran:

ninja -C build mlir-opt FileCheck count not
build/bin/llvm-lit -sv mlir/test/Dialect/Affine/

Result: 70/70 tests pass (no regressions in vectorize_affine_apply.mlir,
which exercises the affine.apply-composition path).

How it was found

Surfaced by an in-progress SMT-solver-based analysis tool for MLIR affine
passes.

Tool-use disclosure

This change was drafted with OpenAI Codex assistance per LLVM's AI Tool Use
Policy. All code was read, reviewed, and tested locally; the commit should
carry an Assisted-by: OpenAI Codex trailer.

@llvmorg-github-actions
Copy link
Copy Markdown

llvmorg-github-actions Bot commented May 11, 2026

@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-mlir-affine

Author: Soowon Jeong (swjng)

Changes

[MLIR][Affine] Refuse affine-super-vectorize on non-unit access strides

Summary

affine-super-vectorize can currently treat %a[%i * 4] as a contiguous
access because the vectorized IV affects only one memref dimension. It then
emits a contiguous vector.transfer_write, touching cells that the scalar loop
never writes.

This patch requires the IV coefficient in the varying memref dimension to be
1 before treating an access as contiguous.

Reproducer

func.func @<!-- -->non_unit_stride_store_unsupported(%arg0: memref&lt;64xf32&gt;) {
  %cst = arith.constant 0.000000e+00 : f32
  affine.for %i = 0 to 16 {
    affine.store %cst, %arg0[%i * 4] : memref&lt;64xf32&gt;
  }
  return
}

Apply:

mlir-opt repro.mlir -affine-super-vectorize="virtual-vector-size=4"

Before this patch, the pass can emit a vector.transfer_write at %i * 4.
That writes %i * 4 + 1, %i * 4 + 2, and %i * 4 + 3, which are not part of
the scalar loop's memory footprint.

Why

The vectorization legality check treats the access as vectorizable because the
loop IV affects only one memref dimension. It does not require the IV's
coefficient in that dimension to be 1.

For %a[%i * 4], the scalar loop writes only {0, 4, 8, ...}. The vectorized
loop writes a full vector at %i * 4, touching {%i*4, %i*4+1, %i*4+2, %i*4+3}. Those extra lanes are in-bounds but outside the source loop's
iteration footprint.

Fix

Inline affine.apply chains via fullyComposeAffineMapAndOperands so the
access map's operand list refers to leaf SSA values, then flatten each result
expression and compute the IV coefficient. If the IV-varying dimension is the
contiguous one but the coefficient is not 1, reject vectorization.

Composition is required to see through %a[%apply(%i)]-style operands: the
existing super-vec machinery (including %i mod N strided-broadcast patterns)
already relies on this canonical form, so reusing it keeps the check
consistent and avoids over-rejection.

Test

Added
mlir/test/Dialect/Affine/SuperVectorize/vectorize_non_unit_stride.mlir,
checking that %arg0[%i * 4] remains scalar and no vector.transfer_write is
emitted.

Updated mlir/test/Dialect/Affine/access-analysis.mlir: the test previously
carried Note/FIXME: access stride isn't being checked annotations next to
expected-remark lines that asserted buggy "contiguous along loop N"
diagnostics for stride-8 and stride-(-16) accesses. With the fix in place
those remarks are correctly suppressed; the FIXME comments and corresponding
expected-remarks are dropped.

Ran:

ninja -C build mlir-opt FileCheck count not
build/bin/llvm-lit -sv mlir/test/Dialect/Affine/

Result: 70/70 tests pass (no regressions in vectorize_affine_apply.mlir,
which exercises the affine.apply-composition path).

How it was found

Surfaced by an in-progress SMT-solver-based analysis tool for MLIR affine
passes.

Tool-use disclosure

This change was drafted with OpenAI Codex assistance per LLVM's AI Tool Use
Policy. All code was read, reviewed, and tested locally; the commit should
carry an Assisted-by: OpenAI Codex trailer.


Full diff: https://github.com/llvm/llvm-project/pull/196973.diff

3 Files Affected:

  • (modified) mlir/lib/Dialect/Affine/Analysis/LoopAnalysis.cpp (+20)
  • (added) mlir/test/Dialect/Affine/SuperVectorize/vectorize_non_unit_stride.mlir (+17)
  • (modified) mlir/test/Dialect/Affine/access-analysis.mlir (+8-7)
diff --git a/mlir/lib/Dialect/Affine/Analysis/LoopAnalysis.cpp b/mlir/lib/Dialect/Affine/Analysis/LoopAnalysis.cpp
index 166d39e88d41e..caac64d86cc46 100644
--- a/mlir/lib/Dialect/Affine/Analysis/LoopAnalysis.cpp
+++ b/mlir/lib/Dialect/Affine/Analysis/LoopAnalysis.cpp
@@ -12,6 +12,7 @@
 
 #include "mlir/Dialect/Affine/Analysis/LoopAnalysis.h"
 
+#include "mlir/Analysis/FlatLinearValueConstraints.h"
 #include "mlir/Analysis/SliceAnalysis.h"
 #include "mlir/Dialect/Affine/Analysis/AffineAnalysis.h"
 #include "mlir/Dialect/Affine/Analysis/AffineStructures.h"
@@ -331,8 +332,25 @@ bool mlir::affine::isContiguousAccess(Value iv, LoadOrStoreOp memoryOp,
   int uniqueVaryingIndexAlongIv = -1;
   auto accessMap = memoryOp.getAffineMap();
   SmallVector<Value, 4> mapOperands(memoryOp.getMapOperands());
+  // Inline affine.apply chains so the iv-coefficient check below sees iv even
+  // when it arrives through a derived value.
+  fullyComposeAffineMapAndOperands(&accessMap, &mapOperands);
   unsigned numDims = accessMap.getNumDims();
+  unsigned numSymbols = accessMap.getNumSymbols();
   for (unsigned i = 0, e = memRefType.getRank(); i < e; ++i) {
+    SmallVector<int64_t, 4> flattenedExpr;
+    if (failed(getFlattenedAffineExpr(accessMap.getResult(i), numDims,
+                                      numSymbols, &flattenedExpr)))
+      return false;
+
+    int64_t ivCoeff = 0;
+    for (unsigned dim = 0; dim < numDims; ++dim)
+      if (mapOperands[dim] == iv)
+        ivCoeff += flattenedExpr[dim];
+    for (unsigned sym = 0; sym < numSymbols; ++sym)
+      if (mapOperands[numDims + sym] == iv)
+        ivCoeff += flattenedExpr[numDims + sym];
+
     // Gather map operands used in result expr 'i' in 'exprOperands'.
     SmallVector<Value, 4> exprOperands;
     auto resultExpr = accessMap.getResult(i);
@@ -352,6 +370,8 @@ bool mlir::affine::isContiguousAccess(Value iv, LoadOrStoreOp memoryOp,
         uniqueVaryingIndexAlongIv = i;
       }
     }
+    if (uniqueVaryingIndexAlongIv == static_cast<int>(i) && ivCoeff != 1)
+      return false;
   }
 
   if (uniqueVaryingIndexAlongIv == -1)
diff --git a/mlir/test/Dialect/Affine/SuperVectorize/vectorize_non_unit_stride.mlir b/mlir/test/Dialect/Affine/SuperVectorize/vectorize_non_unit_stride.mlir
new file mode 100644
index 0000000000000..d0b69078fad36
--- /dev/null
+++ b/mlir/test/Dialect/Affine/SuperVectorize/vectorize_non_unit_stride.mlir
@@ -0,0 +1,17 @@
+// RUN: mlir-opt %s -affine-super-vectorize="virtual-vector-size=4" | FileCheck %s
+
+// A non-unit coefficient on the vectorized IV is not a contiguous access. A
+// plain vector.transfer_write would touch consecutive cells even though the
+// scalar loop writes only every fourth cell.
+
+// CHECK-LABEL: func.func @non_unit_stride_store_unsupported
+// CHECK:         affine.for
+// CHECK:           affine.store
+// CHECK-NOT:     vector.transfer_write
+func.func @non_unit_stride_store_unsupported(%arg0: memref<64xf32>) {
+  %cst = arith.constant 0.000000e+00 : f32
+  affine.for %i = 0 to 16 {
+    affine.store %cst, %arg0[%i * 4] : memref<64xf32>
+  }
+  return
+}
diff --git a/mlir/test/Dialect/Affine/access-analysis.mlir b/mlir/test/Dialect/Affine/access-analysis.mlir
index 789de646a8f9e..1bb230aebbfe1 100644
--- a/mlir/test/Dialect/Affine/access-analysis.mlir
+++ b/mlir/test/Dialect/Affine/access-analysis.mlir
@@ -11,17 +11,16 @@ func.func @loop_simple(%A : memref<?x?xf32>, %B : memref<?x?x?xf32>) {
        // expected-remark@above {{invariant along loop 1}}
        affine.load %A[%c0, 8 * %i + %j] : memref<?x?xf32>
        // expected-remark@above {{contiguous along loop 1}}
-       // Note/FIXME: access stride isn't being checked.
-       // expected-remark@-3 {{contiguous along loop 0}}
+       // Access stride 8 along loop 0 in the last memref dim is no longer
+       // reported as contiguous.
 
        // These are all non-contiguous along both loops. Nothing is emitted.
        affine.load %A[%i, %c0] : memref<?x?xf32>
        // expected-remark@above {{invariant along loop 1}}
-       // Note/FIXME: access stride isn't being checked.
        affine.load %A[%i, 8 * %j] : memref<?x?xf32>
-       // expected-remark@above {{contiguous along loop 1}}
+       // Access stride 8 along loop 1; no contiguity remark emitted.
        affine.load %A[%j, 4 * %i] : memref<?x?xf32>
-       // expected-remark@above {{contiguous along loop 0}}
+       // Access stride 4 along loop 0 in the last memref dim; no remark.
      }
    }
    return
@@ -70,8 +69,9 @@ func.func @tiled(%arg0: memref<*xf32>) {
             // expected-remark@above {{invariant along loop 4}}
             affine.store %0, %alloc_0[0, %arg1 * -16 + %arg4, 0, %arg3 * -16 + %arg5] : memref<1x16x1x16xf32>
             // expected-remark@above {{contiguous along loop 4}}
-            // expected-remark@above {{contiguous along loop 2}}
             // expected-remark@above {{invariant along loop 1}}
+            // The %arg3 contribution to dim 3 is `%arg3 * -16`, so the
+            // access is not contiguous along loop 2.
           }
         }
         affine.for %arg4 = #map(%arg1) to #map1(%arg1) {
@@ -79,7 +79,8 @@ func.func @tiled(%arg0: memref<*xf32>) {
             affine.for %arg6 = #map(%arg3) to #map1(%arg3) {
               %0 = affine.load %alloc_0[0, %arg1 * -16 + %arg4, -%arg2 + %arg5, %arg3 * -16 + %arg6] : memref<1x16x1x16xf32>
               // expected-remark@above {{contiguous along loop 5}}
-              // expected-remark@above {{contiguous along loop 2}}
+              // The %arg3 contribution to dim 3 is `%arg3 * -16`, so the
+              // access is not contiguous along loop 2.
               affine.store %0, %alloc[0, %arg5, %arg6, %arg4] : memref<1x224x224x64xf32>
               // expected-remark@above {{contiguous along loop 3}}
               // expected-remark@above {{invariant along loop 0}}

`isContiguousAccess` checked only whether the candidate vector IV appears
in one memref dimension; it didn't check the IV's *coefficient*. As a
result `%a[%i * 4]` was treated as contiguous, and super-vectorize
emitted a `vector.transfer_write` that touches `{%i*4, %i*4+1, %i*4+2,
%i*4+3}` while the scalar loop only writes `{0, 4, 8, ...}`.

After composing intervening `affine.apply` chains with
`fullyComposeAffineMapAndOperands`, flatten each result expression and
require the IV-varying dimension's coefficient on the candidate IV to be
exactly `1`.

`mlir/test/Dialect/Affine/access-analysis.mlir` previously documented
this as a `Note/FIXME: access stride isn't being checked` next to
expected-remarks that asserted spurious "contiguous along loop N" for
stride-8 / stride-(-16) accesses. With the fix in place those remarks
are correctly suppressed; the FIXME annotations and the remarks they
guarded are dropped.

Adds a regression in
`mlir/test/Dialect/Affine/SuperVectorize/vectorize_non_unit_stride.mlir`.
@swjng swjng force-pushed the fix/affine-supervec-nonunit-stride branch from 96134f9 to c68902c Compare May 12, 2026 02:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant