[MLIR][XeGPU][VectorToXeGPU] Add lowering from vector.gather/scatter to xegpu.load/store #158024

dchigarev · 2025-09-11T10:05:55Z

Lowering for vector.gather/vector.scatter into xegpu.load/xegpu.store. This PR heavily reuses utility functions added in #152429 for vector.transfer_read/write lowering.

High level steps to lower vector.gather/scatter:

%0 = vector.gather %source[%off1, %off2, %off3][%indices], %mask,
       %pass_thru : memref<8x16x32xf32>, vector<8xindex>, vector<8xi1>, vector<8xf32> into vector<8xf32>

1. Compute strides and a memref offset for the %source memref using computeMemrefMeta func from the transfer_read/write lowering
2. Compute a linear offset like %lin_off = %base_offset + %off1 * strides#0 + %off2 * strides#1 + %off3 * strides#2
3. Combine the linear offset with %indices: %off = (broadcast %lin_off : index to vector<8xindex>) + %indices * strides#2
4. Convert memref to an i64: %flat_memref = memref.extract_aligned_pointer_as_index %source + arith.index_cast
5. Perform load/store: %vec = xegpu.load %flat_memref[%off], %mask
6. Apply selection to propagate values from the pass_thru vector: %res = arith.select %mask, %vec, %pass_thru
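
For illustration, the linear-offset and per-lane offset steps can be modeled with plain integers as below. This is a sketch rather than the pattern's code: `laneOffsets` and its parameters are hypothetical names, `int64_t` stands in for `index`, and every offset is assumed to be counted in elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of the offset computation (hypothetical helper, not from the patch).
// baseOffset: offset of the memref within its allocation, in elements.
// offsets:    the scalar %off1, %off2, ... operands, one per memref dimension.
// strides:    memref strides in elements, one per memref dimension.
// indices:    the per-lane %indices values.
std::vector<int64_t> laneOffsets(int64_t baseOffset,
                                 const std::vector<int64_t> &offsets,
                                 const std::vector<int64_t> &strides,
                                 const std::vector<int64_t> &indices) {
  assert(!strides.empty() && offsets.size() == strides.size());
  // %lin_off = %base_offset + sum over i of (%off_i * strides#i)
  int64_t linOff = baseOffset;
  for (size_t i = 0; i < offsets.size(); ++i)
    linOff += offsets[i] * strides[i];
  // %off = broadcast(%lin_off) + %indices * innermost stride
  std::vector<int64_t> result;
  result.reserve(indices.size());
  for (int64_t idx : indices)
    result.push_back(linOff + idx * strides.back());
  return result;
}
```

With strides {512, 32, 1}, scalar offsets {1, 2, 3} and a zero base offset, every lane starts from 512 + 64 + 3 = 579 and then adds its own index.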

Complete lowering for vector.gather

gpu.module @xevm_module {
gpu.func @load_1D_vector(%source: memref<8x16x32xf32>,
     %off1: index, %off2: index, %off3: index,
     %indices: vector<8xindex>, %mask: vector<8xi1>,
     %pass_thru: vector<8xf32>) -> vector<8xf32> {
  %0 = vector.gather %source[%off1, %off2, %off3][%indices], %mask,
       %pass_thru : memref<8x16x32xf32>, vector<8xindex>, vector<8xi1>, vector<8xf32> into vector<8xf32>
  gpu.return %0 : vector<8xf32>
}
}

///////

module {
  gpu.module @xevm_module {
    gpu.func @load_1D_vector(%arg0: memref<8x16x32xf32>, %arg1: index, %arg2: index, %arg3: index, %arg4: vector<8xindex>, %arg5: vector<8xi1>, %arg6: vector<8xf32>) -> vector<8xf32> {
      %c512 = arith.constant 512 : index
      %c32 = arith.constant 32 : index
      %0 = arith.muli %arg1, %c512 : index
      %1 = arith.muli %arg2, %c32 : index
      %2 = arith.addi %0, %1 : index
      %3 = arith.addi %2, %arg3 : index
      %4 = vector.broadcast %3 : index to vector<8xindex>
      %5 = arith.addi %4, %arg4 : vector<8xindex>
      %intptr = memref.extract_aligned_pointer_as_index %arg0 : memref<8x16x32xf32> -> index
      %6 = arith.index_cast %intptr : index to i64
      %7 = xegpu.load %6[%5], %arg5  : i64, vector<8xindex>, vector<8xi1> -> vector<8xf32>
      %8 = arith.select %arg5, %7, %arg6 : vector<8xi1>, vector<8xf32>
      gpu.return %8 : vector<8xf32>
    }
  }
}
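
The %c512 and %c32 constants in the lowered IR match the row-major strides of memref<8x16x32xf32> (16 * 32 = 512, then 32), and %arg3 is added directly, which is consistent with an innermost stride of 1. A minimal sketch of that arithmetic, assuming an identity layout and static dimensions; `rowMajorStrides` is a hypothetical helper, not something from the patch:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major (identity layout) strides, in elements, for a static shape.
// Hypothetical helper for illustration only.
std::vector<int64_t> rowMajorStrides(const std::vector<int64_t> &shape) {
  std::vector<int64_t> strides(shape.size(), 1);
  // Walk from the innermost dimension outwards; each stride is the product
  // of the sizes of all dimensions to its right.
  for (size_t i = shape.size(); i-- > 1;) {
    assert(shape[i] > 0 && "sketch handles static dimensions only");
    strides[i - 1] = strides[i] * shape[i];
  }
  return strides;
}
```

For the shape {8, 16, 32} this yields {512, 32, 1}, the same multipliers that appear in the arith.muli ops above.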

Complete lowering for vector.scatter

gpu.module @xevm_module {
gpu.func @store_1D_vector(%vec: vector<8xf32>, %source: memref<8x16x32xf32>,
     %off1: index, %off2: index, %off3: index,
     %indices: vector<8xindex>, %mask: vector<8xi1>) {
  vector.scatter %source[%off1, %off2, %off3][%indices], %mask, %vec
       : memref<8x16x32xf32>, vector<8xindex>, vector<8xi1>, vector<8xf32>
  gpu.return
}
}

///////

module {
  gpu.module @xevm_module {
    gpu.func @store_1D_vector(%arg0: vector<8xf32>, %arg1: memref<8x16x32xf32>, %arg2: index, %arg3: index, %arg4: index, %arg5: vector<8xindex>, %arg6: vector<8xi1>) {
      %c512 = arith.constant 512 : index
      %c32 = arith.constant 32 : index
      %0 = arith.muli %arg2, %c512 : index
      %1 = arith.muli %arg3, %c32 : index
      %2 = arith.addi %0, %1 : index
      %3 = arith.addi %2, %arg4 : index
      %4 = vector.broadcast %3 : index to vector<8xindex>
      %5 = arith.addi %4, %arg5 : vector<8xindex>
      %intptr = memref.extract_aligned_pointer_as_index %arg1 : memref<8x16x32xf32> -> index
      %6 = arith.index_cast %intptr : index to i64
      xegpu.store %arg0, %6[%5], %arg6  : vector<8xf32>, i64, vector<8xindex>, vector<8xi1>
      gpu.return
    }
  }
}
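
The intended effect of the masked store can be pictured as a scalar loop over lanes, following vector.scatter semantics where only lanes with a set mask bit are written. A rough model, assuming a flat element-indexed buffer; `maskedScatter` is an illustrative name, and the bounds check exists only to keep the sketch safe:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of a masked scatter over a flat buffer (illustrative only).
// Lane i writes values[i] to memory[offsets[i]] when mask[i] is set;
// masked-off lanes leave memory untouched.
void maskedScatter(std::vector<float> &memory,
                   const std::vector<int64_t> &offsets,
                   const std::vector<bool> &mask,
                   const std::vector<float> &values) {
  assert(offsets.size() == mask.size() && mask.size() == values.size());
  for (size_t i = 0; i < offsets.size(); ++i) {
    if (!mask[i])
      continue;
    assert(offsets[i] >= 0 &&
           static_cast<size_t>(offsets[i]) < memory.size());
    memory[static_cast<size_t>(offsets[i])] = values[i];
  }
}
```

Unlike the gather case, no select is needed afterwards because a scatter has no result and no pass_thru operand.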

dchigarev · 2025-09-11T10:09:39Z

mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp

+template <
+    typename OpType,
+    typename = std::enable_if_t<llvm::is_one_of<
+        std::decay_t<OpType>, vector::TransferReadOp, vector::TransferWriteOp,
+        vector::GatherOp, vector::ScatterOp>::value>>
+static SmallVector<Value> computeStrides(OpType xferOp,


Unfortunately, there is no common interface for Transfer and Gather/Scatter ops, so I was forced to use SFINAE here. I don't much like this approach because it makes the definition bulky, but I don't like runtime checks via isa<> either. I'm open to changing it if reviewers prefer something else.
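
To make the trade-off concrete, here is a standalone sketch of the same pattern using only the standard library. The four structs are stand-ins for the real op classes, `is_one_of` is a local equivalent of llvm::is_one_of, and `stridesAdjustment` is a made-up function, so none of this is the patch's code.

```cpp
#include <cassert>
#include <type_traits>

// Stand-ins for the four op classes; hypothetical types for illustration.
struct TransferReadOp  { int getPermutationMap() const { return 1; } };
struct TransferWriteOp { int getPermutationMap() const { return 1; } };
struct GatherOp {};
struct ScatterOp {};

// Standard-library equivalent of llvm::is_one_of.
template <typename T, typename... Ts>
using is_one_of = std::disjunction<std::is_same<T, Ts>...>;

// The template is viable only for the four listed op types (SFINAE via
// enable_if); any other argument type is rejected at compile time.
template <typename OpType,
          typename = std::enable_if_t<is_one_of<
              std::decay_t<OpType>, TransferReadOp, TransferWriteOp, GatherOp,
              ScatterOp>::value>>
int stridesAdjustment(OpType op) {
  // Only the transfer ops have a permutation map, so this branch is
  // instantiated for them alone; gather/scatter take the else branch.
  if constexpr (is_one_of<std::decay_t<OpType>, TransferReadOp,
                          TransferWriteOp>::value) {
    return op.getPermutationMap();
  } else {
    (void)op; // No permutation map to apply.
    return 0;
  }
}
```

The dispatch is resolved entirely at compile time, which is what the isa<> alternative would give up; the cost is the longer template header repeated on each helper.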

dchigarev · 2025-09-11T10:21:04Z

@Jianhui-Li @adam-smnk for review

mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp

Garra1980 · 2025-09-11T21:31:24Z

I guess now it depends on #158126?

Jianhui-Li · 2025-09-12T00:41:02Z

> I guess now it depends on #158126?

Yes. A subview test needs to be added, and extract_aligned_pointer_as_index needs to be used as well.

@dchigarev Please mark the PR as "ready for review" once you are done.

dchigarev · 2025-09-12T11:30:44Z

@Jianhui-Li ready for review

llvmbot · 2025-09-12T11:30:53Z

@llvm/pr-subscribers-mlir-gpu

Author: Dmitry Chigarev (dchigarev)

Changes

Lowering for vector.gather/vector.scatter into xegpu.load/xegpu.store. This PR heavily reuses utility functions added in #152429 for vector.transfer_read/write lowering.

High level steps to lower vector.gather/scatter:

%0 = vector.gather %source[%off1, %off2, %off3][%indices], %mask,
       %pass_thru : memref<8x16x32xf32>, vector<8xindex>, vector<8xi1>, vector<8xf32> into vector<8xf32>

1. Compute strides and a memref offset for the %source memref using computeMemrefMeta func from the transfer_read/write lowering
2. Compute a linear offset like %lin_off = %base_offset + %off1 * strides#0 + %off2 * strides#1 + %off3
3. Combine the linear offset with %indices: %off = (broadcast %lin_off : index to vector<8xindex>) + %indices
4. Convert memref to an i64: %flat_memref = memref.extract_aligned_pointer_as_index %source + arith.index_cast
5. Perform load/store: %vec = xegpu.load %flat_memref[%off], %mask
6. Apply selection to propagate values from the pass_thru vector: %res = arith.select %mask, %vec, %pass_thru

<details><summary>Complete lowering for vector.gather</summary>

gpu.module @xevm_module {
gpu.func @load_1D_vector(%source: memref<8x16x32xf32>,
     %off1: index, %off2: index, %off3: index,
     %indices: vector<8xindex>, %mask: vector<8xi1>,
     %pass_thru: vector<8xf32>) -> vector<8xf32> {
  %0 = vector.gather %source[%off1, %off2, %off3][%indices], %mask,
       %pass_thru : memref<8x16x32xf32>, vector<8xindex>, vector<8xi1>, vector<8xf32> into vector<8xf32>
  gpu.return %0 : vector<8xf32>
}
}

///////

module {
  gpu.module @xevm_module {
    gpu.func @load_1D_vector(%arg0: memref<8x16x32xf32>, %arg1: index, %arg2: index, %arg3: index, %arg4: vector<8xindex>, %arg5: vector<8xi1>, %arg6: vector<8xf32>) -> vector<8xf32> {
      %c512 = arith.constant 512 : index
      %c32 = arith.constant 32 : index
      %0 = arith.muli %arg1, %c512 : index
      %1 = arith.muli %arg2, %c32 : index
      %2 = arith.addi %0, %1 : index
      %3 = arith.addi %2, %arg3 : index
      %4 = vector.broadcast %3 : index to vector<8xindex>
      %5 = arith.addi %4, %arg4 : vector<8xindex>
      %intptr = memref.extract_aligned_pointer_as_index %arg0 : memref<8x16x32xf32> -> index
      %6 = arith.index_cast %intptr : index to i64
      %7 = xegpu.load %6[%5], %arg5  : i64, vector<8xindex>, vector<8xi1> -> vector<8xf32>
      %8 = arith.select %arg5, %7, %arg6 : vector<8xi1>, vector<8xf32>
      gpu.return %8 : vector<8xf32>
    }
  }
}

</details>

<details><summary>Complete lowering for vector.scatter</summary>

gpu.module @xevm_module {
gpu.func @store_1D_vector(%vec: vector<8xf32>, %source: memref<8x16x32xf32>,
     %off1: index, %off2: index, %off3: index,
     %indices: vector<8xindex>, %mask: vector<8xi1>) {
  vector.scatter %source[%off1, %off2, %off3][%indices], %mask, %vec
       : memref<8x16x32xf32>, vector<8xindex>, vector<8xi1>, vector<8xf32>
  gpu.return
}
}

///////

module {
  gpu.module @xevm_module {
    gpu.func @store_1D_vector(%arg0: vector<8xf32>, %arg1: memref<8x16x32xf32>, %arg2: index, %arg3: index, %arg4: index, %arg5: vector<8xindex>, %arg6: vector<8xi1>) {
      %c512 = arith.constant 512 : index
      %c32 = arith.constant 32 : index
      %0 = arith.muli %arg2, %c512 : index
      %1 = arith.muli %arg3, %c32 : index
      %2 = arith.addi %0, %1 : index
      %3 = arith.addi %2, %arg4 : index
      %4 = vector.broadcast %3 : index to vector<8xindex>
      %5 = arith.addi %4, %arg5 : vector<8xindex>
      %intptr = memref.extract_aligned_pointer_as_index %arg1 : memref<8x16x32xf32> -> index
      %6 = arith.index_cast %intptr : index to i64
      xegpu.store %arg0, %6[%5], %arg6  : vector<8xf32>, i64, vector<8xindex>, vector<8xi1>
      gpu.return
    }
  }
}

</details>

Patch is 28.52 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/158024.diff

3 Files Affected:

(modified) mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp (+138-8)
(added) mlir/test/Conversion/VectorToXeGPU/gather-to-xegpu.mlir (+187)
(added) mlir/test/Conversion/VectorToXeGPU/scatter-to-xegpu.mlir (+163)

diff --git a/mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp b/mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp
index 852c322cc6467..eebaceba488b4 100644
--- a/mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp
+++ b/mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp
@@ -97,6 +97,24 @@ static LogicalResult transferPreconditions(PatternRewriter &rewriter,
   return success();
 }
 
+// Common preconditions for the lowering of vector.gather and vector.scatter:
+//  1. Source is a memref.
+//  2. The innermost dimension of the memref is contiguous (stride == 1)
+static LogicalResult gatherScatterPreconditions(PatternRewriter &rewriter,
+                                                Operation *op, Type baseType) {
+  auto srcTy = dyn_cast<MemRefType>(baseType);
+  if (!srcTy)
+    return rewriter.notifyMatchFailure(op, "Expects memref source");
+
+  SmallVector<int64_t> strides;
+  int64_t offset;
+  if (failed(srcTy.getStridesAndOffset(strides, offset)) || strides.back() != 1)
+    return rewriter.notifyMatchFailure(
+        op, "Buffer must be contiguous in the innermost dimension");
+
+  return success();
+}
+
 static xegpu::CreateNdDescOp
 createNdDescriptor(PatternRewriter &rewriter, Location loc,
                    xegpu::TensorDescType descType, TypedValue<MemRefType> src,
@@ -183,11 +201,15 @@ static void adjustStridesForPermutation(AffineMap permMap,
 // Computes memory strides and a memref offset for vector transfer operations,
 // handling both static and dynamic memrefs while applying permutation
 // transformations for XeGPU lowering.
+template <
+    typename OpType,
+    typename = std::enable_if_t<llvm::is_one_of<
+        std::decay_t<OpType>, vector::TransferReadOp, vector::TransferWriteOp,
+        vector::GatherOp, vector::ScatterOp>::value>>
 static std::pair<SmallVector<Value>, Value>
-computeMemrefMeta(VectorTransferOpInterface xferOp, PatternRewriter &rewriter) {
+computeMemrefMeta(OpType xferOp, PatternRewriter &rewriter) {
   SmallVector<Value> strides;
   Value baseMemref = xferOp.getBase();
-  AffineMap permMap = xferOp.getPermutationMap();
   MemRefType memrefType = dyn_cast<MemRefType>(baseMemref.getType());
 
   Location loc = xferOp.getLoc();
@@ -232,8 +254,14 @@ computeMemrefMeta(VectorTransferOpInterface xferOp, PatternRewriter &rewriter) {
     if (!offsetVal)
       offsetVal = meta.getOffset();
   }
-  // Adjust strides according to the permutation map (e.g., for transpose)
-  adjustStridesForPermutation(permMap, strides);
+
+  if constexpr (llvm::is_one_of<std::decay_t<OpType>, vector::TransferReadOp,
+                                vector::TransferWriteOp>::value) {
+    AffineMap permMap = xferOp.getPermutationMap();
+    // Adjust strides according to the permutation map (e.g., for transpose)
+    adjustStridesForPermutation(permMap, strides);
+  }
+
   return {strides, offsetVal};
 }
 
@@ -339,9 +367,44 @@ static Value computeOffsets(VectorTransferOpInterface xferOp,
   return localOffsets;
 }
 
+// Compute the element-wise offsets for vector.gather or vector.scatter ops.
+//
+// This function linearizes the base offsets of the gather/scatter operation
+// and combines them with the per-element indices to produce a final vector of
+// memory offsets.
+template <
+    typename OpType,
+    typename = std::enable_if_t<llvm::is_one_of<
+        std::decay_t<OpType>, vector::GatherOp, vector::ScatterOp>::value>>
+static Value computeOffsets(PatternRewriter &rewriter, OpType gatScatOp,
+                            ArrayRef<Value> strides, Value baseOffset) {
+  Location loc = gatScatOp.getLoc();
+  SmallVector<Value> offsets = gatScatOp.getOffsets();
+  for (size_t i = 0; i < offsets.size(); ++i) {
+    Value offsetContrib =
+        arith::MulIOp::create(rewriter, loc, offsets[i], strides[i]);
+    baseOffset =
+        arith::AddIOp::create(rewriter, loc, baseOffset, offsetContrib);
+  }
+  Value indices = gatScatOp.getIndices();
+  VectorType vecType = cast<VectorType>(indices.getType());
+
+  Value baseVector =
+      vector::BroadcastOp::create(
+          rewriter, loc,
+          VectorType::get(vecType.getShape(), rewriter.getIndexType()),
+          baseOffset)
+          .getResult();
+  return arith::AddIOp::create(rewriter, loc, baseVector, indices).getResult();
+}
+
+template <
+    typename OpType,
+    typename = std::enable_if_t<llvm::is_one_of<
+        std::decay_t<OpType>, vector::TransferReadOp, vector::TransferWriteOp,
+        vector::GatherOp, vector::ScatterOp>::value>>
 // Convert memref to i64 base pointer
-static Value memrefToIndexPtr(VectorTransferOpInterface xferOp,
-                              PatternRewriter &rewriter) {
+static Value memrefToIndexPtr(OpType xferOp, PatternRewriter &rewriter) {
   Location loc = xferOp.getLoc();
   auto indexPtr = memref::ExtractAlignedPointerAsIndexOp::create(
                       rewriter, loc, xferOp.getBase())
@@ -539,6 +602,71 @@ struct TransferWriteLowering
   }
 };
 
+struct GatherLowering : public OpRewritePattern<vector::GatherOp> {
+  using OpRewritePattern<vector::GatherOp>::OpRewritePattern;
+
+  LogicalResult matchAndRewrite(vector::GatherOp gatherOp,
+                                PatternRewriter &rewriter) const override {
+    if (failed(gatherScatterPreconditions(rewriter, gatherOp,
+                                          gatherOp.getBase().getType())))
+      return failure();
+
+    Location loc = gatherOp.getLoc();
+    VectorType vectorType = gatherOp.getVectorType();
+
+    auto meta = computeMemrefMeta(gatherOp, rewriter);
+    if (meta.first.empty())
+      return rewriter.notifyMatchFailure(gatherOp, "Failed to compute strides");
+
+    Value localOffsets =
+        computeOffsets(rewriter, gatherOp, meta.first, meta.second);
+    Value flatMemref = memrefToIndexPtr(gatherOp, rewriter);
+
+    auto xeGatherOp = xegpu::LoadGatherOp::create(
+        rewriter, loc, vectorType, flatMemref, localOffsets, gatherOp.getMask(),
+        /*chunk_size=*/IntegerAttr{},
+        /*l1_hint=*/xegpu::CachePolicyAttr{},
+        /*l2_hint=*/xegpu::CachePolicyAttr{},
+        /*l3_hint=*/xegpu::CachePolicyAttr{});
+
+    auto selectOp =
+        arith::SelectOp::create(rewriter, loc, gatherOp.getMask(),
+                                xeGatherOp.getResult(), gatherOp.getPassThru());
+    rewriter.replaceOp(gatherOp, selectOp.getResult());
+    return success();
+  }
+};
+
+struct ScatterLowering : public OpRewritePattern<vector::ScatterOp> {
+  using OpRewritePattern<vector::ScatterOp>::OpRewritePattern;
+
+  LogicalResult matchAndRewrite(vector::ScatterOp scatterOp,
+                                PatternRewriter &rewriter) const override {
+    if (failed(gatherScatterPreconditions(rewriter, scatterOp,
+                                          scatterOp.getBase().getType())))
+      return failure();
+
+    Location loc = scatterOp.getLoc();
+    auto meta = computeMemrefMeta(scatterOp, rewriter);
+    if (meta.first.empty())
+      return rewriter.notifyMatchFailure(scatterOp,
+                                         "Failed to compute strides");
+
+    Value localOffsets =
+        computeOffsets(rewriter, scatterOp, meta.first, meta.second);
+    Value flatMemref = memrefToIndexPtr(scatterOp, rewriter);
+
+    xegpu::StoreScatterOp::create(rewriter, loc, scatterOp.getValueToStore(),
+                                  flatMemref, localOffsets, scatterOp.getMask(),
+                                  /*chunk_size=*/IntegerAttr{},
+                                  /*l1_hint=*/xegpu::CachePolicyAttr{},
+                                  /*l2_hint=*/xegpu::CachePolicyAttr{},
+                                  /*l3_hint=*/xegpu::CachePolicyAttr{});
+    rewriter.eraseOp(scatterOp);
+    return success();
+  }
+};
+
 struct LoadLowering : public OpRewritePattern<vector::LoadOp> {
   using OpRewritePattern<vector::LoadOp>::OpRewritePattern;
 
@@ -654,6 +782,8 @@ struct ConvertVectorToXeGPUPass
 
 void mlir::populateVectorToXeGPUConversionPatterns(
     RewritePatternSet &patterns) {
-  patterns.add<TransferReadLowering, TransferWriteLowering, LoadLowering,
-               StoreLowering, ContractionLowering>(patterns.getContext());
+  patterns
+      .add<TransferReadLowering, TransferWriteLowering, LoadLowering,
+           ScatterLowering, GatherLowering, StoreLowering, ContractionLowering>(
+          patterns.getContext());
 }
diff --git a/mlir/test/Conversion/VectorToXeGPU/gather-to-xegpu.mlir b/mlir/test/Conversion/VectorToXeGPU/gather-to-xegpu.mlir
new file mode 100644
index 0000000000000..8eb9a40f5ae53
--- /dev/null
+++ b/mlir/test/Conversion/VectorToXeGPU/gather-to-xegpu.mlir
@@ -0,0 +1,187 @@
+// RUN: mlir-opt %s -convert-vector-to-xegpu -split-input-file | FileCheck %s
+
+gpu.module @xevm_module {
+gpu.func @load_1D_vector(%source: memref<8x16x32xf32>,
+     %off1: index, %off2: index, %off3: index,
+     %indices: vector<8xindex>, %mask: vector<8xi1>,
+     %pass_thru: vector<8xf32>) -> vector<8xf32> {
+  %0 = vector.gather %source[%off1, %off2, %off3][%indices], %mask,
+       %pass_thru : memref<8x16x32xf32>, vector<8xindex>, vector<8xi1>, vector<8xf32> into vector<8xf32>
+  gpu.return %0 : vector<8xf32>
+}
+// CHECK-LABEL:  @load_1D_vector(
+// CHECK-SAME:   %[[SRC:.+]]: memref<8x16x32xf32>,
+// CHECK-SAME:   %[[OFF1:.+]]: index, %[[OFF2:.+]]: index, %[[OFF3:.+]]: index,
+// CHECK-SAME:   %[[INDICES:.+]]: vector<8xindex>
+// CHECK-SAME:   %[[MASK:.+]]: vector<8xi1>
+// CHECK-SAME:   %[[PASS_THRU:.+]]: vector<8xf32>) -> vector<8xf32> {
+// CHECK-COUNT2: arith.muli {{.*}} : index
+// CHECK-COUNT2: arith.addi {{.*}} : index
+// CHECK:        %[[SPLAT:.+]] = vector.broadcast {{.*}}:  index to vector<8xindex>
+// CHECK:        %[[LIN_IDX:.+]] = arith.addi %[[SPLAT]], %[[INDICES]] : vector<8xindex>
+// CHECK:        %[[COLLAPSE:.+]] = memref.extract_aligned_pointer_as_index %[[SRC]] : memref<8x16x32xf32> -> index
+// CHECK:        %[[COLLAPSE_I:.+]] = arith.index_cast %[[COLLAPSE]] : index to i64
+// CHECK:        %[[VEC:.+]] = xegpu.load %[[COLLAPSE_I]]{{\[}}%[[LIN_IDX]]{{\]}}, %[[MASK]] : i64, vector<8xindex>, vector<8xi1> -> vector<8xf32>
+// CHECK:        %[[RES:.+]] = arith.select %[[MASK]], %[[VEC]], %[[PASS_THRU]] : vector<8xi1>, vector<8xf32>
+// CHECK:        gpu.return %[[RES]] : vector<8xf32>
+}
+
+// -----
+gpu.module @xevm_module {
+gpu.func @load_2D_memref(%source: memref<8x32xf32>,
+     %off1: index, %off2: index,
+     %indices: vector<8xindex>, %mask: vector<8xi1>,
+     %pass_thru: vector<8xf32>) -> vector<8xf32> {
+  %0 = vector.gather %source[%off1, %off2][%indices], %mask,
+       %pass_thru : memref<8x32xf32>, vector<8xindex>, vector<8xi1>, vector<8xf32> into vector<8xf32>
+  gpu.return %0 : vector<8xf32>
+}
+// CHECK-LABEL:  @load_2D_memref(
+// CHECK-SAME:   %[[SRC:.+]]: memref<8x32xf32>,
+// CHECK-SAME:   %[[OFF1:.+]]: index, %[[OFF2:.+]]: index
+// CHECK-SAME:   %[[INDICES:.+]]: vector<8xindex>
+// CHECK-SAME:   %[[MASK:.+]]: vector<8xi1>
+// CHECK-SAME:   %[[PASS_THRU:.+]]: vector<8xf32>) -> vector<8xf32> {
+// CHECK-COUNT1: arith.muli {{.*}} : index
+// CHECK-COUNT1: arith.addi {{.*}} : index
+// CHECK:        %[[SPLAT:.+]] = vector.broadcast {{.*}}:  index to vector<8xindex>
+// CHECK:        %[[LIN_IDX:.+]] = arith.addi %[[SPLAT]], %[[INDICES]] : vector<8xindex>
+// CHECK:        %[[COLLAPSE:.+]] = memref.extract_aligned_pointer_as_index %[[SRC]] : memref<8x32xf32> -> index
+// CHECK:        %[[COLLAPSE_I:.+]] = arith.index_cast %[[COLLAPSE]] : index to i64
+// CHECK:        %[[VEC:.+]] = xegpu.load %[[COLLAPSE_I]]{{\[}}%[[LIN_IDX]]{{\]}}, %[[MASK]] : i64, vector<8xindex>, vector<8xi1> -> vector<8xf32>
+// CHECK:        %[[RES:.+]] = arith.select %[[MASK]], %[[VEC]], %[[PASS_THRU]] : vector<8xi1>, vector<8xf32>
+// CHECK:        gpu.return %[[RES]] : vector<8xf32>
+}
+
+// -----
+gpu.module @xevm_module {
+gpu.func @load_2D_vector(%source: memref<8x16x32xf32>,
+    %off0: index, %off1: index, %off2: index,
+    %indices: vector<8x16xindex>, %mask: vector<8x16xi1>,
+    %pass_thru: vector<8x16xf32>) -> vector<8x16xf32> {
+  %0 = vector.gather %source[%off0, %off1, %off2][%indices], %mask,
+       %pass_thru : memref<8x16x32xf32>, vector<8x16xindex>, vector<8x16xi1>, vector<8x16xf32> into vector<8x16xf32>
+  gpu.return %0 : vector<8x16xf32>
+}
+// CHECK-LABEL:  @load_2D_vector(
+// CHECK-SAME:   %[[SRC:.+]]: memref<8x16x32xf32>,
+// CHECK-SAME:   %[[OFF1:.+]]: index, %[[OFF2:.+]]: index, %[[OFF3:.+]]: index,
+// CHECK-SAME:   %[[INDICES:.+]]: vector<8x16xindex>
+// CHECK-SAME:   %[[MASK:.+]]: vector<8x16xi1>
+// CHECK-SAME:   %[[PASS_THRU:.+]]: vector<8x16xf32>) -> vector<8x16xf32> {
+// CHECK-COUNT2: arith.muli {{.*}} : index
+// CHECK-COUNT2: arith.addi {{.*}} : index
+// CHECK:        %[[SPLAT:.+]] = vector.broadcast {{.*}}:  index to vector<8x16xindex>
+// CHECK:        %[[LIN_IDX:.+]] = arith.addi %[[SPLAT]], %[[INDICES]] : vector<8x16xindex>
+// CHECK:        %[[COLLAPSE:.+]] = memref.extract_aligned_pointer_as_index %[[SRC]] : memref<8x16x32xf32> -> index
+// CHECK:        %[[COLLAPSE_I:.+]] = arith.index_cast %[[COLLAPSE]] : index to i64
+// CHECK:        %[[VEC:.+]] = xegpu.load %[[COLLAPSE_I]]{{\[}}%[[LIN_IDX]]{{\]}}, %[[MASK]] : i64, vector<8x16xindex>, vector<8x16xi1> -> vector<8x16xf32>
+// CHECK:        %[[RES:.+]] = arith.select %[[MASK]], %[[VEC]], %[[PASS_THRU]] : vector<8x16xi1>, vector<8x16xf32>
+// CHECK:        gpu.return %[[RES]] : vector<8x16xf32>
+}
+
+// -----
+gpu.module @xevm_module {
+gpu.func @load_dynamic_source(%source: memref<?x?x?xf32>,
+    %off0: index, %off1: index, %off2: index,
+    %indices: vector<8x16xindex>, %mask: vector<8x16xi1>,
+    %pass_thru: vector<8x16xf32>) -> vector<8x16xf32> {
+  %0 = vector.gather %source[%off0, %off1, %off2][%indices], %mask,
+       %pass_thru : memref<?x?x?xf32>, vector<8x16xindex>, vector<8x16xi1>, vector<8x16xf32> into vector<8x16xf32>
+  gpu.return %0 : vector<8x16xf32>
+}
+// CHECK-LABEL:  @load_dynamic_source(
+// CHECK-SAME:   %[[SRC:.+]]: memref<?x?x?xf32>,
+// CHECK-SAME:   %[[OFF1:.+]]: index, %[[OFF2:.+]]: index, %[[OFF3:.+]]: index,
+// CHECK-SAME:   %[[INDICES:.+]]: vector<8x16xindex>
+// CHECK-SAME:   %[[MASK:.+]]: vector<8x16xi1>
+// CHECK-SAME:   %[[PASS_THRU:.+]]: vector<8x16xf32>) -> vector<8x16xf32> {
+// CHECK:        memref.extract_strided_metadata %[[SRC]]
+// CHECK-COUNT2: arith.muli {{.*}} : index
+// CHECK-COUNT2: arith.addi {{.*}} : index
+// CHECK:        %[[SPLAT:.+]] = vector.broadcast {{.*}}:  index to vector<8x16xindex>
+// CHECK:        %[[LIN_IDX:.+]] = arith.addi %[[SPLAT]], %[[INDICES]] : vector<8x16xindex>
+// CHECK:        %[[COLLAPSE:.+]] = memref.extract_aligned_pointer_as_index %[[SRC]] : memref<?x?x?xf32> -> index
+// CHECK:        %[[COLLAPSE_I:.+]] = arith.index_cast %[[COLLAPSE]] : index to i64
+// CHECK:        %[[VEC:.+]] = xegpu.load %[[COLLAPSE_I]]{{\[}}%[[LIN_IDX]]{{\]}}, %[[MASK]] : i64, vector<8x16xindex>, vector<8x16xi1> -> vector<8x16xf32>
+// CHECK:        %[[RES:.+]] = arith.select %[[MASK]], %[[VEC]], %[[PASS_THRU]] : vector<8x16xi1>, vector<8x16xf32>
+// CHECK:        gpu.return %[[RES]] : vector<8x16xf32>
+}
+
+// -----
+gpu.module @xevm_module {
+gpu.func @load_dynamic_source2(%source: memref<?x8x16xf32>,
+    %off0: index, %off1: index, %off2: index,
+    %indices: vector<8x16xindex>, %mask: vector<8x16xi1>,
+    %pass_thru: vector<8x16xf32>) -> vector<8x16xf32> {
+  %0 = vector.gather %source[%off0, %off1, %off2][%indices], %mask,
+       %pass_thru : memref<?x8x16xf32>, vector<8x16xindex>, vector<8x16xi1>, vector<8x16xf32> into vector<8x16xf32>
+  gpu.return %0 : vector<8x16xf32>
+}
+// CHECK-LABEL:  @load_dynamic_source2(
+// CHECK-SAME:   %[[SRC:.+]]: memref<?x8x16xf32>,
+// CHECK-SAME:   %[[OFF1:.+]]: index, %[[OFF2:.+]]: index, %[[OFF3:.+]]: index,
+// CHECK-SAME:   %[[INDICES:.+]]: vector<8x16xindex>
+// CHECK-SAME:   %[[MASK:.+]]: vector<8x16xi1>
+// CHECK-SAME:   %[[PASS_THRU:.+]]: vector<8x16xf32>) -> vector<8x16xf32> {
+// CHECK-NOT:    memref.extract_strided_metadata %[[SRC]]
+// CHECK-COUNT2: arith.muli {{.*}} : index
+// CHECK-COUNT2: arith.addi {{.*}} : index
+// CHECK:        %[[SPLAT:.+]] = vector.broadcast {{.*}}:  index to vector<8x16xindex>
+// CHECK:        %[[LIN_IDX:.+]] = arith.addi %[[SPLAT]], %[[INDICES]] : vector<8x16xindex>
+// CHECK:        %[[COLLAPSE:.+]] = memref.extract_aligned_pointer_as_index %[[SRC]] : memref<?x8x16xf32> -> index
+// CHECK:        %[[COLLAPSE_I:.+]] = arith.index_cast %[[COLLAPSE]] : index to i64
+// CHECK:        %[[VEC:.+]] = xegpu.load %[[COLLAPSE_I]]{{\[}}%[[LIN_IDX]]{{\]}}, %[[MASK]] : i64, vector<8x16xindex>, vector<8x16xi1> -> vector<8x16xf32>
+// CHECK:        %[[RES:.+]] = arith.select %[[MASK]], %[[VEC]], %[[PASS_THRU]] : vector<8x16xi1>, vector<8x16xf32>
+// CHECK:        gpu.return %[[RES]] : vector<8x16xf32>
+}
+
+// -----
+gpu.module @xevm_module {
+gpu.func @no_load_tensor(%source: tensor<32x64xf32>,
+    %off: index, %indices: vector<8x16xindex>,
+    %mask: vector<8x16xi1>, %pass_thru: vector<8x16xf32>) -> vector<8x16xf32> {
+  %0 = vector.gather %source[%off, %off][%indices], %mask,
+       %pass_thru : tensor<32x64xf32>, vector<8x16xindex>, vector<8x16xi1>, vector<8x16xf32> into vector<8x16xf32>
+  gpu.return %0 : vector<8x16xf32>
+}
+// CHECK-LABEL:  @no_load_tensor(
+// CHECK:        vector.gather
+}
+
+// -----
+gpu.module @xevm_module {
+gpu.func @gather_from_subview(%source: memref<4096x4096xf16>,
+                              %off1: index, %off2: index,
+                              %indices: vector<8xindex>,
+                              %mask: vector<8xi1>,
+                              %pass_thru: vector<8xf16>) -> vector<8xf16> {
+  %subview = memref.subview %source[%off1, %off2] [256, 256] [1, 1]
+      : memref<4096x4096xf16>
+        to memref<256x256xf16, strided<[4096, 1], offset: ?>>
+  %0 = vector.gather %subview[%off1, %off2][%indices], %mask, %pass_thru
+       : memref<256x256xf16, strided<[4096, 1], offset: ?>>,
+         vector<8xindex>, vector<8xi1>, vector<8xf16>
+         into vector<8xf16>
+  gpu.return %0 : vector<8xf16>
+}
+// CHECK-LABEL:  @gather_from_subview(
+// CHECK-SAME:   %[[SRC:.+]]: memref<4096x4096xf16>,
+// CHECK-SAME:   %[[OFF1:.+]]: index, %[[OFF2:.+]]: index,
+// CHECK-SAME:   %[[INDICES:.+]]: vector<8xindex>,
+// CHECK-SAME:   %[[MASK:.+]]: vector<8xi1>,
+// CHECK-SAME:   %[[PASS:.+]]: vector<8xf16>) -> vector<8xf16> {
+// CHECK:        %[[SUBVIEW:.+]] = memref.subview %[[SRC]][%[[OFF1]], %[[OFF2]]] [256, 256] [1, 1]
+// CHECK:        %[[BB:.+]], %[[OFFSET:.+]],{{.*}},{{.*}} = memref.extract_strided_metadata %[[SUBVIEW]] : memref<256x256xf16, strided<[4096, 1], offset: ?>> -> memref<f16>, index, index, index, index, index
+// CHECK:        arith.muli {{.*}} : index
+// CHECK:        arith.addi %[[OFFSET]]{{.*}} : index
+// CHECK:        %[[BASE_OFF:.+]] = arith.addi {{.*}} : index
+// CHECK:        %[[SPLAT:.+]] = vector.broadcast %[[BASE_OFF]] : index to vector<8xindex>
+// CHECK:        %[[LIN:.+]] = arith.addi %[[SPLAT]], %[[INDICES]] : vector<8xindex>
+// CHECK:        %[[BASE_IDX:.+]] = memref.extract_aligned_pointer_as_index %[[SUBVIEW]] : memref<256x256xf16, strided<[4096, 1], offset: ?>> -> index
+// CHECK:        %[[BASE_I64:.+]] = arith.index_cast %[[BASE_IDX]] : index to i64
+// CHECK:        %[[VEC:.+]] = xegpu.load %[[BASE_I64]]{{\[}}%[[LIN]]{{\]}}, %[[MASK]]
+// CHECK-SAME:     : i64, vector<8xindex>, vector<8xi1> -> vector<8xf16>
+// CHECK:        %[[RES:.+]] = arith.select %[[MASK]], %[[VEC]], %[[PASS]] : vector<8xi1>, vector<8xf16>
+// CHECK:        gpu.return %[[RES]] : vector<8xf16>
+}
diff --git a/mlir/test/Conversion/VectorToXeGPU/scatter-to-xegpu.mlir b/mlir/test/Conversion/VectorToXeGPU/scatter-to-xegpu.mlir
new file mode 100644
index 0000000000000..ea6a34a437962
--- /dev/null
+++ b/mlir/test/Conversion/VectorToXeGPU/scatter-to-xegpu.mlir
@@ -0,0 +1,163 @@
+// RUN: mlir-opt %s -convert-vector-to-xegpu -split-input-file | FileCheck %s
+
+gpu.module @xevm_module {
+gpu.func @store_1D_vector(%vec: vector<8xf32>, %source: memref<8x16x32xf32>,
+     %off1: index, %off2: index, %off3: index,
+     %indices: vector<8xindex>, %mask: vector<8xi1>) {
+  vector.scatter %source[%off1, %off2, %off3][%indices], %mask, %vec
+       : memref<8x16x32xf32>, vector<8xindex>, vector<8xi1>, vector<8xf32>
+  gpu.return
+}
+// CHECK-LABEL:  @store_1D_vector(
+// CHECK-SAM...
[truncated]
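
The gather_from_subview test in the diff above is the case where the base offset is not known statically and has to come from memref.extract_strided_metadata. As a scalar sketch of what the linearized offset should amount to for a layout like strided<[4096, 1], offset: ?> (the function name and parameters are hypothetical, and the subview offset is assumed to equal the row-major position of the subview origin in the parent):

```cpp
#include <cassert>
#include <cstdint>

// Element offset, relative to the aligned base pointer, of one gathered lane
// on a subview with layout strided<[rowStride, 1], offset: subviewOffset>.
// Hypothetical scalar model; names are not taken from the patch.
int64_t subviewLaneOffset(int64_t subviewOffset, int64_t rowStride,
                          int64_t off1, int64_t off2, int64_t idx) {
  assert(rowStride > 0);
  // base offset + row offset scaled by the outer stride + column offset
  // (innermost stride is 1) + the per-lane index.
  return subviewOffset + off1 * rowStride + off2 + idx;
}
```

For a subview taken at [2, 3] of a 4096-wide parent, the subview offset is 2 * 4096 + 3 = 8195, and gathering at [2, 3] with lane index 5 lands on the same element as parent position [4, 11].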

llvmbot · 2025-09-12T11:30:54Z

@llvm/pr-subscribers-mlir

Author: Dmitry Chigarev (dchigarev)

Changes

Lowering for vector.gather/vector.scatter into xegpu.load/xegpu.store. This PR heavily reuses utility functions added in #152429 for vector.transfer_read/write lowering.

High level steps to lower vector.gather/scatter:

%0 = vector.gather %source[%off1, %off2, %off3][%indices], %mask,
       %pass_thru : memref&lt;8x16x32xf32&gt;, vector&lt;8xindex&gt;, vector&lt;8xi1&gt;, vector&lt;8xf32&gt; into vector&lt;8xf32&gt;

Compute strides and a memref offset for the %source memref using computeMemrefMeta func from the transfer_read/write lowering
Compute a linear offset like %lin_off = %base_offset + %off1 * strides#0 + %off2 * strides#1 + %off3
Combine the linear offset with %indices: %off = (broadcast %lin_off : index to vector<8xindex>) + %indices
Convert memref to an i64: %flat_memref = memref.extract_aligned_pointer_as_index %source + arith.index_cast
Perform load/store: %vec = xegpu.load %flat_memref[%off], %mask
Apply selection to propagate values from the pass_thru vector: %res = arith.select %mask, %vec, %pass_thru
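
As a rough illustration of the steps above, the following sketch emulates the address arithmetic on the host in plain C++. It assumes an identity (row-major) layout with a unit innermost stride, and element (not byte) offsets; `rowMajorStrides`, `laneOffsets` and `gatherSelect` are hypothetical helper names, not functions from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Step 1 (static case): row-major strides, e.g. {8, 16, 32} -> {512, 32, 1}.
std::vector<int64_t> rowMajorStrides(const std::vector<int64_t> &shape) {
  std::vector<int64_t> strides(shape.size(), 1);
  for (int i = static_cast<int>(shape.size()) - 2; i >= 0; --i)
    strides[i] = strides[i + 1] * shape[i + 1];
  return strides;
}

// Steps 2-3: linearize the scalar offsets, "broadcast" the result to every
// lane, and add the per-lane indices (unit innermost stride assumed).
std::vector<int64_t> laneOffsets(int64_t baseOffset,
                                 const std::vector<int64_t> &offsets,
                                 const std::vector<int64_t> &strides,
                                 const std::vector<int64_t> &indices) {
  int64_t linear = baseOffset;
  for (size_t i = 0; i < offsets.size(); ++i)
    linear += offsets[i] * strides[i];
  std::vector<int64_t> result;
  result.reserve(indices.size());
  for (int64_t idx : indices)
    result.push_back(linear + idx);
  return result;
}

// Steps 5-6: masked load from the flat buffer, then select against pass_thru.
// Masked-off lanes never touch memory and keep the pass_thru value.
std::vector<float> gatherSelect(const std::vector<float> &flat,
                                const std::vector<int64_t> &offs,
                                const std::vector<bool> &mask,
                                const std::vector<float> &passThru) {
  std::vector<float> result(passThru);
  for (size_t lane = 0; lane < offs.size(); ++lane)
    if (mask[lane])
      result[lane] = flat[static_cast<size_t>(offs[lane])];
  return result;
}
```

For the `memref<8x16x32xf32>` example, scalar offsets `{1, 2, 3}` linearize to `1*512 + 2*32 + 3 = 579`, which matches the `%c512` and `%c32` constants in the lowered IR.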

Complete lowering for vector.gather

gpu.module @xevm_module {
gpu.func @load_1D_vector(%source: memref<8x16x32xf32>,
     %off1: index, %off2: index, %off3: index,
     %indices: vector<8xindex>, %mask: vector<8xi1>,
     %pass_thru: vector<8xf32>) -> vector<8xf32> {
  %0 = vector.gather %source[%off1, %off2, %off3][%indices], %mask,
       %pass_thru : memref<8x16x32xf32>, vector<8xindex>, vector<8xi1>, vector<8xf32> into vector<8xf32>
  gpu.return %0 : vector<8xf32>
}
}

///////

module {
  gpu.module @xevm_module {
    gpu.func @load_1D_vector(%arg0: memref<8x16x32xf32>, %arg1: index, %arg2: index, %arg3: index, %arg4: vector<8xindex>, %arg5: vector<8xi1>, %arg6: vector<8xf32>) -> vector<8xf32> {
      %c512 = arith.constant 512 : index
      %c32 = arith.constant 32 : index
      %0 = arith.muli %arg1, %c512 : index
      %1 = arith.muli %arg2, %c32 : index
      %2 = arith.addi %0, %1 : index
      %3 = arith.addi %2, %arg3 : index
      %4 = vector.broadcast %3 : index to vector<8xindex>
      %5 = arith.addi %4, %arg4 : vector<8xindex>
      %intptr = memref.extract_aligned_pointer_as_index %arg0 : memref<8x16x32xf32> -> index
      %6 = arith.index_cast %intptr : index to i64
      %7 = xegpu.load %6[%5], %arg5  : i64, vector<8xindex>, vector<8xi1> -> vector<8xf32>
      %8 = arith.select %arg5, %7, %arg6 : vector<8xi1>, vector<8xf32>
      gpu.return %8 : vector<8xf32>
    }
  }
}

Complete lowering for vector.scatter

gpu.module @xevm_module {
gpu.func @store_1D_vector(%vec: vector<8xf32>, %source: memref<8x16x32xf32>,
     %off1: index, %off2: index, %off3: index,
     %indices: vector<8xindex>, %mask: vector<8xi1>) {
  vector.scatter %source[%off1, %off2, %off3][%indices], %mask, %vec
       : memref<8x16x32xf32>, vector<8xindex>, vector<8xi1>, vector<8xf32>
  gpu.return
}
}

///////

module {
  gpu.module @xevm_module {
    gpu.func @store_1D_vector(%arg0: vector<8xf32>, %arg1: memref<8x16x32xf32>, %arg2: index, %arg3: index, %arg4: index, %arg5: vector<8xindex>, %arg6: vector<8xi1>) {
      %c512 = arith.constant 512 : index
      %c32 = arith.constant 32 : index
      %0 = arith.muli %arg2, %c512 : index
      %1 = arith.muli %arg3, %c32 : index
      %2 = arith.addi %0, %1 : index
      %3 = arith.addi %2, %arg4 : index
      %4 = vector.broadcast %3 : index to vector<8xindex>
      %5 = arith.addi %4, %arg5 : vector<8xindex>
      %intptr = memref.extract_aligned_pointer_as_index %arg1 : memref<8x16x32xf32> -> index
      %6 = arith.index_cast %intptr : index to i64
      xegpu.store %arg0, %6[%5], %arg6  : vector<8xf32>, i64, vector<8xindex>, vector<8xi1>
      gpu.return
    }
  }
}

Patch is 28.52 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/158024.diff

3 Files Affected:

(modified) mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp (+138-8)
(added) mlir/test/Conversion/VectorToXeGPU/gather-to-xegpu.mlir (+187)
(added) mlir/test/Conversion/VectorToXeGPU/scatter-to-xegpu.mlir (+163)

diff --git a/mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp b/mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp
index 852c322cc6467..eebaceba488b4 100644
--- a/mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp
+++ b/mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp
@@ -97,6 +97,24 @@ static LogicalResult transferPreconditions(PatternRewriter &rewriter,
   return success();
 }
 
+// Common preconditions for the lowering of vector.gather and vector.scatter:
+//  1. Source is a memref.
+//  2. The innermost dimension of the memref is contiguous (stride == 1)
+static LogicalResult gatherScatterPreconditions(PatternRewriter &rewriter,
+                                                Operation *op, Type baseType) {
+  auto srcTy = dyn_cast<MemRefType>(baseType);
+  if (!srcTy)
+    return rewriter.notifyMatchFailure(op, "Expects memref source");
+
+  SmallVector<int64_t> strides;
+  int64_t offset;
+  if (failed(srcTy.getStridesAndOffset(strides, offset)) || strides.back() != 1)
+    return rewriter.notifyMatchFailure(
+        op, "Buffer must be contiguous in the innermost dimension");
+
+  return success();
+}
+
 static xegpu::CreateNdDescOp
 createNdDescriptor(PatternRewriter &rewriter, Location loc,
                    xegpu::TensorDescType descType, TypedValue<MemRefType> src,
@@ -183,11 +201,15 @@ static void adjustStridesForPermutation(AffineMap permMap,
 // Computes memory strides and a memref offset for vector transfer operations,
 // handling both static and dynamic memrefs while applying permutation
 // transformations for XeGPU lowering.
+template <
+    typename OpType,
+    typename = std::enable_if_t<llvm::is_one_of<
+        std::decay_t<OpType>, vector::TransferReadOp, vector::TransferWriteOp,
+        vector::GatherOp, vector::ScatterOp>::value>>
 static std::pair<SmallVector<Value>, Value>
-computeMemrefMeta(VectorTransferOpInterface xferOp, PatternRewriter &rewriter) {
+computeMemrefMeta(OpType xferOp, PatternRewriter &rewriter) {
   SmallVector<Value> strides;
   Value baseMemref = xferOp.getBase();
-  AffineMap permMap = xferOp.getPermutationMap();
   MemRefType memrefType = dyn_cast<MemRefType>(baseMemref.getType());
 
   Location loc = xferOp.getLoc();
@@ -232,8 +254,14 @@ computeMemrefMeta(VectorTransferOpInterface xferOp, PatternRewriter &rewriter) {
     if (!offsetVal)
       offsetVal = meta.getOffset();
   }
-  // Adjust strides according to the permutation map (e.g., for transpose)
-  adjustStridesForPermutation(permMap, strides);
+
+  if constexpr (llvm::is_one_of<std::decay_t<OpType>, vector::TransferReadOp,
+                                vector::TransferWriteOp>::value) {
+    AffineMap permMap = xferOp.getPermutationMap();
+    // Adjust strides according to the permutation map (e.g., for transpose)
+    adjustStridesForPermutation(permMap, strides);
+  }
+
   return {strides, offsetVal};
 }
 
@@ -339,9 +367,44 @@ static Value computeOffsets(VectorTransferOpInterface xferOp,
   return localOffsets;
 }
 
+// Compute the element-wise offsets for vector.gather or vector.scatter ops.
+//
+// This function linearizes the base offsets of the gather/scatter operation
+// and combines them with the per-element indices to produce a final vector of
+// memory offsets.
+template <
+    typename OpType,
+    typename = std::enable_if_t<llvm::is_one_of<
+        std::decay_t<OpType>, vector::GatherOp, vector::ScatterOp>::value>>
+static Value computeOffsets(PatternRewriter &rewriter, OpType gatScatOp,
+                            ArrayRef<Value> strides, Value baseOffset) {
+  Location loc = gatScatOp.getLoc();
+  SmallVector<Value> offsets = gatScatOp.getOffsets();
+  for (size_t i = 0; i < offsets.size(); ++i) {
+    Value offsetContrib =
+        arith::MulIOp::create(rewriter, loc, offsets[i], strides[i]);
+    baseOffset =
+        arith::AddIOp::create(rewriter, loc, baseOffset, offsetContrib);
+  }
+  Value indices = gatScatOp.getIndices();
+  VectorType vecType = cast<VectorType>(indices.getType());
+
+  Value baseVector =
+      vector::BroadcastOp::create(
+          rewriter, loc,
+          VectorType::get(vecType.getShape(), rewriter.getIndexType()),
+          baseOffset)
+          .getResult();
+  return arith::AddIOp::create(rewriter, loc, baseVector, indices).getResult();
+}
+
+template <
+    typename OpType,
+    typename = std::enable_if_t<llvm::is_one_of<
+        std::decay_t<OpType>, vector::TransferReadOp, vector::TransferWriteOp,
+        vector::GatherOp, vector::ScatterOp>::value>>
 // Convert memref to i64 base pointer
-static Value memrefToIndexPtr(VectorTransferOpInterface xferOp,
-                              PatternRewriter &rewriter) {
+static Value memrefToIndexPtr(OpType xferOp, PatternRewriter &rewriter) {
   Location loc = xferOp.getLoc();
   auto indexPtr = memref::ExtractAlignedPointerAsIndexOp::create(
                       rewriter, loc, xferOp.getBase())
@@ -539,6 +602,71 @@ struct TransferWriteLowering
   }
 };
 
+struct GatherLowering : public OpRewritePattern<vector::GatherOp> {
+  using OpRewritePattern<vector::GatherOp>::OpRewritePattern;
+
+  LogicalResult matchAndRewrite(vector::GatherOp gatherOp,
+                                PatternRewriter &rewriter) const override {
+    if (failed(gatherScatterPreconditions(rewriter, gatherOp,
+                                          gatherOp.getBase().getType())))
+      return failure();
+
+    Location loc = gatherOp.getLoc();
+    VectorType vectorType = gatherOp.getVectorType();
+
+    auto meta = computeMemrefMeta(gatherOp, rewriter);
+    if (meta.first.empty())
+      return rewriter.notifyMatchFailure(gatherOp, "Failed to compute strides");
+
+    Value localOffsets =
+        computeOffsets(rewriter, gatherOp, meta.first, meta.second);
+    Value flatMemref = memrefToIndexPtr(gatherOp, rewriter);
+
+    auto xeGatherOp = xegpu::LoadGatherOp::create(
+        rewriter, loc, vectorType, flatMemref, localOffsets, gatherOp.getMask(),
+        /*chunk_size=*/IntegerAttr{},
+        /*l1_hint=*/xegpu::CachePolicyAttr{},
+        /*l2_hint=*/xegpu::CachePolicyAttr{},
+        /*l3_hint=*/xegpu::CachePolicyAttr{});
+
+    auto selectOp =
+        arith::SelectOp::create(rewriter, loc, gatherOp.getMask(),
+                                xeGatherOp.getResult(), gatherOp.getPassThru());
+    rewriter.replaceOp(gatherOp, selectOp.getResult());
+    return success();
+  }
+};
+
+struct ScatterLowering : public OpRewritePattern<vector::ScatterOp> {
+  using OpRewritePattern<vector::ScatterOp>::OpRewritePattern;
+
+  LogicalResult matchAndRewrite(vector::ScatterOp scatterOp,
+                                PatternRewriter &rewriter) const override {
+    if (failed(gatherScatterPreconditions(rewriter, scatterOp,
+                                          scatterOp.getBase().getType())))
+      return failure();
+
+    Location loc = scatterOp.getLoc();
+    auto meta = computeMemrefMeta(scatterOp, rewriter);
+    if (meta.first.empty())
+      return rewriter.notifyMatchFailure(scatterOp,
+                                         "Failed to compute strides");
+
+    Value localOffsets =
+        computeOffsets(rewriter, scatterOp, meta.first, meta.second);
+    Value flatMemref = memrefToIndexPtr(scatterOp, rewriter);
+
+    xegpu::StoreScatterOp::create(rewriter, loc, scatterOp.getValueToStore(),
+                                  flatMemref, localOffsets, scatterOp.getMask(),
+                                  /*chunk_size=*/IntegerAttr{},
+                                  /*l1_hint=*/xegpu::CachePolicyAttr{},
+                                  /*l2_hint=*/xegpu::CachePolicyAttr{},
+                                  /*l3_hint=*/xegpu::CachePolicyAttr{});
+    rewriter.eraseOp(scatterOp);
+    return success();
+  }
+};
+
 struct LoadLowering : public OpRewritePattern<vector::LoadOp> {
   using OpRewritePattern<vector::LoadOp>::OpRewritePattern;
 
@@ -654,6 +782,8 @@ struct ConvertVectorToXeGPUPass
 
 void mlir::populateVectorToXeGPUConversionPatterns(
     RewritePatternSet &patterns) {
-  patterns.add<TransferReadLowering, TransferWriteLowering, LoadLowering,
-               StoreLowering, ContractionLowering>(patterns.getContext());
+  patterns
+      .add<TransferReadLowering, TransferWriteLowering, LoadLowering,
+           ScatterLowering, GatherLowering, StoreLowering, ContractionLowering>(
+          patterns.getContext());
 }
diff --git a/mlir/test/Conversion/VectorToXeGPU/gather-to-xegpu.mlir b/mlir/test/Conversion/VectorToXeGPU/gather-to-xegpu.mlir
new file mode 100644
index 0000000000000..8eb9a40f5ae53
--- /dev/null
+++ b/mlir/test/Conversion/VectorToXeGPU/gather-to-xegpu.mlir
@@ -0,0 +1,187 @@
+// RUN: mlir-opt %s -convert-vector-to-xegpu -split-input-file | FileCheck %s
+
+gpu.module @xevm_module {
+gpu.func @load_1D_vector(%source: memref<8x16x32xf32>,
+     %off1: index, %off2: index, %off3: index,
+     %indices: vector<8xindex>, %mask: vector<8xi1>,
+     %pass_thru: vector<8xf32>) -> vector<8xf32> {
+  %0 = vector.gather %source[%off1, %off2, %off3][%indices], %mask,
+       %pass_thru : memref<8x16x32xf32>, vector<8xindex>, vector<8xi1>, vector<8xf32> into vector<8xf32>
+  gpu.return %0 : vector<8xf32>
+}
+// CHECK-LABEL:  @load_1D_vector(
+// CHECK-SAME:   %[[SRC:.+]]: memref<8x16x32xf32>,
+// CHECK-SAME:   %[[OFF1:.+]]: index, %[[OFF2:.+]]: index, %[[OFF3:.+]]: index,
+// CHECK-SAME:   %[[INDICES:.+]]: vector<8xindex>
+// CHECK-SAME:   %[[MASK:.+]]: vector<8xi1>
+// CHECK-SAME:   %[[PASS_THRU:.+]]: vector<8xf32>) -> vector<8xf32> {
+// CHECK-COUNT2: arith.muli {{.*}} : index
+// CHECK-COUNT2: arith.addi {{.*}} : index
+// CHECK:        %[[SPLAT:.+]] = vector.broadcast {{.*}}:  index to vector<8xindex>
+// CHECK:        %[[LIN_IDX:.+]] = arith.addi %[[SPLAT]], %[[INDICES]] : vector<8xindex>
+// CHECK:        %[[COLLAPSE:.+]] = memref.extract_aligned_pointer_as_index %[[SRC]] : memref<8x16x32xf32> -> index
+// CHECK:        %[[COLLAPSE_I:.+]] = arith.index_cast %[[COLLAPSE]] : index to i64
+// CHECK:        %[[VEC:.+]] = xegpu.load %[[COLLAPSE_I]]{{\[}}%[[LIN_IDX]]{{\]}}, %[[MASK]] : i64, vector<8xindex>, vector<8xi1> -> vector<8xf32>
+// CHECK:        %[[RES:.+]] = arith.select %[[MASK]], %[[VEC]], %[[PASS_THRU]] : vector<8xi1>, vector<8xf32>
+// CHECK:        gpu.return %[[RES]] : vector<8xf32>
+}
+
+// -----
+gpu.module @xevm_module {
+gpu.func @load_2D_memref(%source: memref<8x32xf32>,
+     %off1: index, %off2: index,
+     %indices: vector<8xindex>, %mask: vector<8xi1>,
+     %pass_thru: vector<8xf32>) -> vector<8xf32> {
+  %0 = vector.gather %source[%off1, %off2][%indices], %mask,
+       %pass_thru : memref<8x32xf32>, vector<8xindex>, vector<8xi1>, vector<8xf32> into vector<8xf32>
+  gpu.return %0 : vector<8xf32>
+}
+// CHECK-LABEL:  @load_2D_memref(
+// CHECK-SAME:   %[[SRC:.+]]: memref<8x32xf32>,
+// CHECK-SAME:   %[[OFF1:.+]]: index, %[[OFF2:.+]]: index
+// CHECK-SAME:   %[[INDICES:.+]]: vector<8xindex>
+// CHECK-SAME:   %[[MASK:.+]]: vector<8xi1>
+// CHECK-SAME:   %[[PASS_THRU:.+]]: vector<8xf32>) -> vector<8xf32> {
+// CHECK-COUNT1: arith.muli {{.*}} : index
+// CHECK-COUNT1: arith.addi {{.*}} : index
+// CHECK:        %[[SPLAT:.+]] = vector.broadcast {{.*}}:  index to vector<8xindex>
+// CHECK:        %[[LIN_IDX:.+]] = arith.addi %[[SPLAT]], %[[INDICES]] : vector<8xindex>
+// CHECK:        %[[COLLAPSE:.+]] = memref.extract_aligned_pointer_as_index %[[SRC]] : memref<8x32xf32> -> index
+// CHECK:        %[[COLLAPSE_I:.+]] = arith.index_cast %[[COLLAPSE]] : index to i64
+// CHECK:        %[[VEC:.+]] = xegpu.load %[[COLLAPSE_I]]{{\[}}%[[LIN_IDX]]{{\]}}, %[[MASK]] : i64, vector<8xindex>, vector<8xi1> -> vector<8xf32>
+// CHECK:        %[[RES:.+]] = arith.select %[[MASK]], %[[VEC]], %[[PASS_THRU]] : vector<8xi1>, vector<8xf32>
+// CHECK:        gpu.return %[[RES]] : vector<8xf32>
+}
+
+// -----
+gpu.module @xevm_module {
+gpu.func @load_2D_vector(%source: memref<8x16x32xf32>,
+    %off0: index, %off1: index, %off2: index,
+    %indices: vector<8x16xindex>, %mask: vector<8x16xi1>,
+    %pass_thru: vector<8x16xf32>) -> vector<8x16xf32> {
+  %0 = vector.gather %source[%off0, %off1, %off2][%indices], %mask,
+       %pass_thru : memref<8x16x32xf32>, vector<8x16xindex>, vector<8x16xi1>, vector<8x16xf32> into vector<8x16xf32>
+  gpu.return %0 : vector<8x16xf32>
+}
+// CHECK-LABEL:  @load_2D_vector(
+// CHECK-SAME:   %[[SRC:.+]]: memref<8x16x32xf32>,
+// CHECK-SAME:   %[[OFF1:.+]]: index, %[[OFF2:.+]]: index, %[[OFF3:.+]]: index,
+// CHECK-SAME:   %[[INDICES:.+]]: vector<8x16xindex>
+// CHECK-SAME:   %[[MASK:.+]]: vector<8x16xi1>
+// CHECK-SAME:   %[[PASS_THRU:.+]]: vector<8x16xf32>) -> vector<8x16xf32> {
+// CHECK-COUNT2: arith.muli {{.*}} : index
+// CHECK-COUNT2: arith.addi {{.*}} : index
+// CHECK:        %[[SPLAT:.+]] = vector.broadcast {{.*}}:  index to vector<8x16xindex>
+// CHECK:        %[[LIN_IDX:.+]] = arith.addi %[[SPLAT]], %[[INDICES]] : vector<8x16xindex>
+// CHECK:        %[[COLLAPSE:.+]] = memref.extract_aligned_pointer_as_index %[[SRC]] : memref<8x16x32xf32> -> index
+// CHECK:        %[[COLLAPSE_I:.+]] = arith.index_cast %[[COLLAPSE]] : index to i64
+// CHECK:        %[[VEC:.+]] = xegpu.load %[[COLLAPSE_I]]{{\[}}%[[LIN_IDX]]{{\]}}, %[[MASK]] : i64, vector<8x16xindex>, vector<8x16xi1> -> vector<8x16xf32>
+// CHECK:        %[[RES:.+]] = arith.select %[[MASK]], %[[VEC]], %[[PASS_THRU]] : vector<8x16xi1>, vector<8x16xf32>
+// CHECK:        gpu.return %[[RES]] : vector<8x16xf32>
+}
+
+// -----
+gpu.module @xevm_module {
+gpu.func @load_dynamic_source(%source: memref<?x?x?xf32>,
+    %off0: index, %off1: index, %off2: index,
+    %indices: vector<8x16xindex>, %mask: vector<8x16xi1>,
+    %pass_thru: vector<8x16xf32>) -> vector<8x16xf32> {
+  %0 = vector.gather %source[%off0, %off1, %off2][%indices], %mask,
+       %pass_thru : memref<?x?x?xf32>, vector<8x16xindex>, vector<8x16xi1>, vector<8x16xf32> into vector<8x16xf32>
+  gpu.return %0 : vector<8x16xf32>
+}
+// CHECK-LABEL:  @load_dynamic_source(
+// CHECK-SAME:   %[[SRC:.+]]: memref<?x?x?xf32>,
+// CHECK-SAME:   %[[OFF1:.+]]: index, %[[OFF2:.+]]: index, %[[OFF3:.+]]: index,
+// CHECK-SAME:   %[[INDICES:.+]]: vector<8x16xindex>
+// CHECK-SAME:   %[[MASK:.+]]: vector<8x16xi1>
+// CHECK-SAME:   %[[PASS_THRU:.+]]: vector<8x16xf32>) -> vector<8x16xf32> {
+// CHECK:        memref.extract_strided_metadata %[[SRC]]
+// CHECK-COUNT2: arith.muli {{.*}} : index
+// CHECK-COUNT2: arith.addi {{.*}} : index
+// CHECK:        %[[SPLAT:.+]] = vector.broadcast {{.*}}:  index to vector<8x16xindex>
+// CHECK:        %[[LIN_IDX:.+]] = arith.addi %[[SPLAT]], %[[INDICES]] : vector<8x16xindex>
+// CHECK:        %[[COLLAPSE:.+]] = memref.extract_aligned_pointer_as_index %[[SRC]] : memref<?x?x?xf32> -> index
+// CHECK:        %[[COLLAPSE_I:.+]] = arith.index_cast %[[COLLAPSE]] : index to i64
+// CHECK:        %[[VEC:.+]] = xegpu.load %[[COLLAPSE_I]]{{\[}}%[[LIN_IDX]]{{\]}}, %[[MASK]] : i64, vector<8x16xindex>, vector<8x16xi1> -> vector<8x16xf32>
+// CHECK:        %[[RES:.+]] = arith.select %[[MASK]], %[[VEC]], %[[PASS_THRU]] : vector<8x16xi1>, vector<8x16xf32>
+// CHECK:        gpu.return %[[RES]] : vector<8x16xf32>
+}
+
+// -----
+gpu.module @xevm_module {
+gpu.func @load_dynamic_source2(%source: memref<?x8x16xf32>,
+    %off0: index, %off1: index, %off2: index,
+    %indices: vector<8x16xindex>, %mask: vector<8x16xi1>,
+    %pass_thru: vector<8x16xf32>) -> vector<8x16xf32> {
+  %0 = vector.gather %source[%off0, %off1, %off2][%indices], %mask,
+       %pass_thru : memref<?x8x16xf32>, vector<8x16xindex>, vector<8x16xi1>, vector<8x16xf32> into vector<8x16xf32>
+  gpu.return %0 : vector<8x16xf32>
+}
+// CHECK-LABEL:  @load_dynamic_source2(
+// CHECK-SAME:   %[[SRC:.+]]: memref<?x8x16xf32>,
+// CHECK-SAME:   %[[OFF1:.+]]: index, %[[OFF2:.+]]: index, %[[OFF3:.+]]: index,
+// CHECK-SAME:   %[[INDICES:.+]]: vector<8x16xindex>
+// CHECK-SAME:   %[[MASK:.+]]: vector<8x16xi1>
+// CHECK-SAME:   %[[PASS_THRU:.+]]: vector<8x16xf32>) -> vector<8x16xf32> {
+// CHECK-NOT:    memref.extract_strided_metadata %[[SRC]]
+// CHECK-COUNT2: arith.muli {{.*}} : index
+// CHECK-COUNT2: arith.addi {{.*}} : index
+// CHECK:        %[[SPLAT:.+]] = vector.broadcast {{.*}}:  index to vector<8x16xindex>
+// CHECK:        %[[LIN_IDX:.+]] = arith.addi %[[SPLAT]], %[[INDICES]] : vector<8x16xindex>
+// CHECK:        %[[COLLAPSE:.+]] = memref.extract_aligned_pointer_as_index %[[SRC]] : memref<?x8x16xf32> -> index
+// CHECK:        %[[COLLAPSE_I:.+]] = arith.index_cast %[[COLLAPSE]] : index to i64
+// CHECK:        %[[VEC:.+]] = xegpu.load %[[COLLAPSE_I]]{{\[}}%[[LIN_IDX]]{{\]}}, %[[MASK]] : i64, vector<8x16xindex>, vector<8x16xi1> -> vector<8x16xf32>
+// CHECK:        %[[RES:.+]] = arith.select %[[MASK]], %[[VEC]], %[[PASS_THRU]] : vector<8x16xi1>, vector<8x16xf32>
+// CHECK:        gpu.return %[[RES]] : vector<8x16xf32>
+}
+
+// -----
+gpu.module @xevm_module {
+gpu.func @no_load_tensor(%source: tensor<32x64xf32>,
+    %off: index, %indices: vector<8x16xindex>,
+    %mask: vector<8x16xi1>, %pass_thru: vector<8x16xf32>) -> vector<8x16xf32> {
+  %0 = vector.gather %source[%off, %off][%indices], %mask,
+       %pass_thru : tensor<32x64xf32>, vector<8x16xindex>, vector<8x16xi1>, vector<8x16xf32> into vector<8x16xf32>
+  gpu.return %0 : vector<8x16xf32>
+}
+// CHECK-LABEL:  @no_load_tensor(
+// CHECK:        vector.gather
+}
+
+// -----
+gpu.module @xevm_module {
+gpu.func @gather_from_subview(%source: memref<4096x4096xf16>,
+                              %off1: index, %off2: index,
+                              %indices: vector<8xindex>,
+                              %mask: vector<8xi1>,
+                              %pass_thru: vector<8xf16>) -> vector<8xf16> {
+  %subview = memref.subview %source[%off1, %off2] [256, 256] [1, 1]
+      : memref<4096x4096xf16>
+        to memref<256x256xf16, strided<[4096, 1], offset: ?>>
+  %0 = vector.gather %subview[%off1, %off2][%indices], %mask, %pass_thru
+       : memref<256x256xf16, strided<[4096, 1], offset: ?>>,
+         vector<8xindex>, vector<8xi1>, vector<8xf16>
+         into vector<8xf16>
+  gpu.return %0 : vector<8xf16>
+}
+// CHECK-LABEL:  @gather_from_subview(
+// CHECK-SAME:   %[[SRC:.+]]: memref<4096x4096xf16>,
+// CHECK-SAME:   %[[OFF1:.+]]: index, %[[OFF2:.+]]: index,
+// CHECK-SAME:   %[[INDICES:.+]]: vector<8xindex>,
+// CHECK-SAME:   %[[MASK:.+]]: vector<8xi1>,
+// CHECK-SAME:   %[[PASS:.+]]: vector<8xf16>) -> vector<8xf16> {
+// CHECK:        %[[SUBVIEW:.+]] = memref.subview %[[SRC]][%[[OFF1]], %[[OFF2]]] [256, 256] [1, 1]
+// CHECK:        %[[BB:.+]], %[[OFFSET:.+]],{{.*}},{{.*}} = memref.extract_strided_metadata %[[SUBVIEW]] : memref<256x256xf16, strided<[4096, 1], offset: ?>> -> memref<f16>, index, index, index, index, index
+// CHECK:        arith.muli {{.*}} : index
+// CHECK:        arith.addi %[[OFFSET]]{{.*}} : index
+// CHECK:        %[[BASE_OFF:.+]] = arith.addi {{.*}} : index
+// CHECK:        %[[SPLAT:.+]] = vector.broadcast %[[BASE_OFF]] : index to vector<8xindex>
+// CHECK:        %[[LIN:.+]] = arith.addi %[[SPLAT]], %[[INDICES]] : vector<8xindex>
+// CHECK:        %[[BASE_IDX:.+]] = memref.extract_aligned_pointer_as_index %[[SUBVIEW]] : memref<256x256xf16, strided<[4096, 1], offset: ?>> -> index
+// CHECK:        %[[BASE_I64:.+]] = arith.index_cast %[[BASE_IDX]] : index to i64
+// CHECK:        %[[VEC:.+]] = xegpu.load %[[BASE_I64]]{{\[}}%[[LIN]]{{\]}}, %[[MASK]]
+// CHECK-SAME:     : i64, vector<8xindex>, vector<8xi1> -> vector<8xf16>
+// CHECK:        %[[RES:.+]] = arith.select %[[MASK]], %[[VEC]], %[[PASS]] : vector<8xi1>, vector<8xf16>
+// CHECK:        gpu.return %[[RES]] : vector<8xf16>
+}
diff --git a/mlir/test/Conversion/VectorToXeGPU/scatter-to-xegpu.mlir b/mlir/test/Conversion/VectorToXeGPU/scatter-to-xegpu.mlir
new file mode 100644
index 0000000000000..ea6a34a437962
--- /dev/null
+++ b/mlir/test/Conversion/VectorToXeGPU/scatter-to-xegpu.mlir
@@ -0,0 +1,163 @@
+// RUN: mlir-opt %s -convert-vector-to-xegpu -split-input-file | FileCheck %s
+
+gpu.module @xevm_module {
+gpu.func @store_1D_vector(%vec: vector<8xf32>, %source: memref<8x16x32xf32>,
+     %off1: index, %off2: index, %off3: index,
+     %indices: vector<8xindex>, %mask: vector<8xi1>) {
+  vector.scatter %source[%off1, %off2, %off3][%indices], %mask, %vec
+       : memref<8x16x32xf32>, vector<8xindex>, vector<8xi1>, vector<8xf32>
+  gpu.return
+}
+// CHECK-LABEL:  @store_1D_vector(
+// CHECK-SAM...
[truncated]

Jianhui-Li · 2025-09-15T22:02:01Z

mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp


+// Common preconditions for the lowering of vector.gather and vector.scatter:
+//  1. Source is a memref.
+//  2. The innermost dimension of the memref is contiguous (stride == 1)


What is the reason the memref must have stride == 1? The HW should support non-unit-stride memrefs since the offset is per lane.

From the vector.gather definition:
result[i,j] := if mask[i,j] then base[i0, i1, i2 + indices[i,j]]
else pass_thru[i,j]

My understanding is that we first compute the base_offset of base[i0, i1, i2], then compute the offset described by the indices by computing base[0, 0, indices[i, j]], and combine the two to get the memory address. The computation uses strides[], and the strides could be permuted or equal to 1.

What is the reason the memref must have stride == 1? The HW should support non-unit-stride memrefs since the offset is per lane.

My bad, I genuinely thought that vector.gather/scatter only supported inner strides == 1. Added support for non-unit inner strides.

Jianhui-Li · 2025-09-15T22:16:21Z

mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp

+
+  if constexpr (llvm::is_one_of<std::decay_t<OpType>, vector::TransferReadOp,
+                                vector::TransferWriteOp>::value) {
+    AffineMap permMap = xferOp.getPermutationMap();


The permutation map could exist for the gather/store's memref, and I don't understand why we should treat them differently here.

Here we're accessing the permutation map of the operation (transfer_read/write). vector.gather/scatter don't have a permutation map, which is why we skip it in that case.

If a memref has its own permutation, this should be handled automatically by memref.extract_strided_metadata.
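
The compile-time branch discussed in this thread can be sketched outside of MLIR as follows. This is only an illustration of the `if constexpr` dispatch idea: `TransferLikeOp`, `GatherLikeOp`, `isOneOf` and `computeStrides` are hypothetical stand-ins, and the permutation step is a simplified placeholder rather than the real `adjustStridesForPermutation`.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical stand-ins for the real op classes.
struct TransferLikeOp {
  std::vector<int64_t> memrefStrides;
  std::vector<unsigned> permutation; // plays the role of the op's permutation map
};
struct GatherLikeOp {
  std::vector<int64_t> memrefStrides; // this op carries no permutation map
};

// Minimal analogue of llvm::is_one_of built on a C++17 fold expression.
template <typename T, typename... Ts>
inline constexpr bool isOneOf = (std::is_same_v<T, Ts> || ...);

// Accepts only the two op kinds above; any other type fails to compile.
template <typename OpType,
          typename = std::enable_if_t<
              isOneOf<std::decay_t<OpType>, TransferLikeOp, GatherLikeOp>>>
std::vector<int64_t> computeStrides(const OpType &op) {
  std::vector<int64_t> strides = op.memrefStrides;
  // Only instantiated for ops that actually have a permutation; for
  // GatherLikeOp this branch is discarded, so the missing member is fine.
  if constexpr (std::is_same_v<std::decay_t<OpType>, TransferLikeOp>) {
    std::vector<int64_t> permuted;
    for (unsigned dim : op.permutation)
      permuted.push_back(strides[dim]);
    strides = permuted;
  }
  return strides;
}
```

With this shape, a layout-level permutation on the memref itself shows up in `memrefStrides` for both op kinds, while the op-level permutation is applied only where the op defines one.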

Jianhui-Li · 2025-09-15T22:23:25Z

mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp


+// Common preconditions for the lowering of vector.gather and vector.scatter:
+//  1. Source is a memref.
+//  2. The innermost dimension of the memref is contiguous (stride == 1)


From the vector.gather definition:
result[i,j] := if mask[i,j] then base[i0, i1, i2 + indices[i,j]]
else pass_thru[i,j]

My understanding is that we first compute the base_offset of base[i0, i1, i2], then compute the offset described by the indices by computing base[0, 0, indices[i, j]], and combine the two to get the memory address. The computation uses strides[], and the strides could be permuted or equal to 1.

Jianhui-Li · 2025-09-15T22:28:18Z

mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp

+    baseOffset =
+        arith::AddIOp::create(rewriter, loc, baseOffset, offsetContrib);
+  }
+  Value indices = gatScatOp.getIndices();


I think these indices need to be multiplied by the stride of the innermost dim if we allow a non-unit innermost stride. I believe this is the only change we need to support it.
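
A minimal host-side sketch of that suggestion, assuming element offsets and a non-empty stride list (`laneOffsetsStrided` is a hypothetical name, not the function in the patch):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-lane offsets where each index is scaled by the innermost stride,
// i.e. lane = base + sum(offsets[i] * strides[i]) + index * strides.back().
std::vector<int64_t> laneOffsetsStrided(int64_t baseOffset,
                                        const std::vector<int64_t> &offsets,
                                        const std::vector<int64_t> &strides,
                                        const std::vector<int64_t> &indices) {
  int64_t linear = baseOffset;
  for (size_t i = 0; i < offsets.size(); ++i)
    linear += offsets[i] * strides[i];
  // Falls back to 1 for an empty stride list so the sketch stays well defined.
  const int64_t inner = strides.empty() ? 1 : strides.back();
  std::vector<int64_t> result;
  result.reserve(indices.size());
  for (int64_t idx : indices)
    result.push_back(linear + idx * inner);
  return result;
}
```

With strides `{4096, 2}` and scalar offsets `{1, 3}`, the linear part is `4096 + 6 = 4102`, and consecutive indices land two elements apart. With a unit innermost stride the scaling is a no-op.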

Jianhui-Li · 2025-09-15T22:36:10Z

mlir/test/Conversion/VectorToXeGPU/gather-to-xegpu.mlir

+  %subview = memref.subview %source[%off1, %off2] [256, 256] [1, 1]
+      : memref<4096x4096xf16>
+        to memref<256x256xf16, strided<[4096, 1], offset: ?>>
+  %0 = vector.gather %subview[%off1, %off2][%indices], %mask, %pass_thru


Can we use different values than off1 and off2? off1 and off2 are supposed to be multiples of 256, so wouldn't %subview[%off1, %off2] be out of bounds? It also makes the check sequence hard to read; I believe it doesn't contain the offset computation for vector.gather using off1/off2.

Fixed. Now the subview and the gather op use different values.
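
To make the bounds concern concrete, here is a small sketch of how the subview offset and the gather offsets compose for the `memref<4096x4096xf16>` test case. It assumes element offsets and the `strided<[4096, 1], offset: ?>` layout from the test; all names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr int64_t kParentCols = 4096; // row stride of memref<4096x4096xf16>
constexpr int64_t kTile = 256;        // subview extent in both dimensions

// Dynamic offset of the subview inside the parent buffer, i.e. the value
// memref.extract_strided_metadata would report as the offset.
int64_t subviewOffset(int64_t row0, int64_t col0) {
  return row0 * kParentCols + col0;
}

// Flat element offset of one gather lane addressed relative to the subview.
int64_t laneOffset(int64_t row0, int64_t col0, int64_t i, int64_t j,
                   int64_t index) {
  return subviewOffset(row0, col0) + i * kParentCols + (j + index);
}

// True when the lane stays inside the 256x256 subview.
bool inSubview(int64_t i, int64_t j, int64_t index) {
  return i >= 0 && i < kTile && j + index >= 0 && j + index < kTile;
}
```

Reusing the same value for the subview origin and the gather offset means any origin of 256 or more already fails `inSubview`, which is the situation the comment above points out.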

Signed-off-by: dchigarev <dmitry.chigarev@intel.com>

mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp

Signed-off-by: dchigarev <dmitry.chigarev@intel.com>

adam-smnk

LGTM, thanks 👍

mlir/test/Conversion/VectorToXeGPU/gather-to-xegpu.mlir

Jianhui-Li

LGTM with minor comment

mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp

Signed-off-by: dchigarev <dmitry.chigarev@intel.com>

dchigarev · 2025-09-19T08:31:02Z

@Jianhui-Li @adam-smnk I think the PR is ready to be merged

github-actions · 2025-09-19T09:12:37Z

@dchigarev Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!

dchigarev commented Sep 11, 2025

Garra1980 reviewed Sep 11, 2025

mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp (outdated, resolved)

dchigarev added 3 commits September 12, 2025 09:46

[MLIR][XeGPU] Add lowering from vector.gather/scatter to xegpu.load/store

d97a3c2

Signed-off-by: dchigarev <dmitry.chigarev@intel.com>

Add alignment handling

4d2c284

Signed-off-by: dchigarev <dmitry.chigarev@intel.com>

Align lowering with new utils behavior

62c5c38

Signed-off-by: dchigarev <dmitry.chigarev@intel.com>

dchigarev force-pushed the dchigarev/vec-gather-xegpu branch from a750113 to 62c5c38 Compare September 12, 2025 11:02

dchigarev marked this pull request as ready for review September 12, 2025 11:30

llvmbot added mlir:gpu mlir labels Sep 12, 2025

Jianhui-Li reviewed Sep 15, 2025

dchigarev added 2 commits September 16, 2025 09:43

Handle non-unit inner stride

78b057d

Signed-off-by: dchigarev <dmitry.chigarev@intel.com>

Correct memref.subview test

4ce2415

Signed-off-by: dchigarev <dmitry.chigarev@intel.com>

dchigarev requested a review from Jianhui-Li September 16, 2025 10:21

adam-smnk reviewed Sep 16, 2025

mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp (outdated, resolved)

mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp (outdated, resolved)

Remove non-necessary comment

eb5fbc7

Signed-off-by: dchigarev <dmitry.chigarev@intel.com>

dchigarev requested a review from adam-smnk September 16, 2025 16:16

adam-smnk approved these changes Sep 17, 2025

mlir/test/Conversion/VectorToXeGPU/gather-to-xegpu.mlir (resolved)

Jianhui-Li approved these changes Sep 18, 2025

mlir/lib/Conversion/VectorToXeGPU/VectorToXeGPU.cpp (outdated, resolved)

Remove 'gatherScatterPreconditions' function

54a8299

Signed-off-by: dchigarev <dmitry.chigarev@intel.com>

adam-smnk merged commit c4617bc into llvm:main Sep 19, 2025
9 checks passed
