[MLIR][Linalg] Add aggregate ops decomposition pass and softmax decomposition implementation #97582
Conversation
@llvm/pr-subscribers-mlir @llvm/pr-subscribers-mlir-linalg
Author: Petr Kurapov (kurapov-peter)
Changes: Add aggregate ops decomposition pass and softmax decomposition implementation.
Patch is 35.95 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/97582.diff 11 Files Affected:
diff --git a/mlir/include/mlir/Dialect/Linalg/IR/LinalgInterfaces.h b/mlir/include/mlir/Dialect/Linalg/IR/LinalgInterfaces.h
index 08afdf373f014..3858075fae137 100644
--- a/mlir/include/mlir/Dialect/Linalg/IR/LinalgInterfaces.h
+++ b/mlir/include/mlir/Dialect/Linalg/IR/LinalgInterfaces.h
@@ -30,6 +30,16 @@ class IteratorTypeAttr;
class LinalgOp;
class GenericOp;
+/// Container for result values of decomposition.
+/// - `decomposedOps` contains operations created by the decomposition that are
+/// returned to the caller for further transformations.
+/// - `decomposedValues` contains the values corresponding to the result of the
+/// aggregate operation.
+struct DecompositionResult {
+ SmallVector<Operation *> decomposedOps;
+ SmallVector<Value> decomposedValues;
+};
+
namespace detail {
/// Implementation of the method that check if given operands
/// can be dropped, i.e. the remaining operands can compute the loop
diff --git a/mlir/include/mlir/Dialect/Linalg/IR/LinalgInterfaces.td b/mlir/include/mlir/Dialect/Linalg/IR/LinalgInterfaces.td
index fbf3f19cde0e9..9b1ab20552628 100644
--- a/mlir/include/mlir/Dialect/Linalg/IR/LinalgInterfaces.td
+++ b/mlir/include/mlir/Dialect/Linalg/IR/LinalgInterfaces.td
@@ -862,7 +862,7 @@ def AggregatedOpInterface : OpInterface<"AggregatedOpInterface"> {
In other words, the returned vector can be used directly with
`RewriterBase::replaceOp(this, returnedValues)`.
}],
- /*retType=*/"FailureOr<SmallVector<Value>>",
+ /*retType=*/"FailureOr<DecompositionResult>",
/*methodName=*/"decomposeOperation",
/*args=*/(ins
"OpBuilder &":$b),
diff --git a/mlir/include/mlir/Dialect/Linalg/Passes.td b/mlir/include/mlir/Dialect/Linalg/Passes.td
index 0621a9f33ba1e..3031126e582f7 100644
--- a/mlir/include/mlir/Dialect/Linalg/Passes.td
+++ b/mlir/include/mlir/Dialect/Linalg/Passes.td
@@ -94,6 +94,11 @@ def LinalgGeneralizeNamedOpsPass : Pass<"linalg-generalize-named-ops"> {
let dependentDialects = ["linalg::LinalgDialect"];
}
+def LinalgDecomposeAggregateNamedOpsPass : Pass<"linalg-decompose-named-ops"> {
+ let summary = "Decompose complex named ops (e.g., Softmax) into a sequence of linalg named ops";
+ let dependentDialects = ["linalg::LinalgDialect"];
+}
+
def LinalgDetensorizePass : InterfacePass<"linalg-detensorize", "FunctionOpInterface"> {
let summary = "Detensorize linalg ops";
let dependentDialects = [];
diff --git a/mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td b/mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td
index 93e2c2db729da..2e8e294aa2e4c 100644
--- a/mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td
+++ b/mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td
@@ -1317,25 +1317,21 @@ def ConvertToLoopsOp : Op<Transform_Dialect, "structured.convert_to_loops",
def DecomposeInterfaceOp : Op<Transform_Dialect, "structured.decompose_interface",
[FunctionalStyleTransformOpTrait,
MemoryEffectsOpInterface,
- TransformOpInterface,
- TransformEachOpTrait,
+ DeclareOpInterfaceMethods<TransformOpInterface>,
ReportTrackingListenerFailuresOpTrait]> {
let description = [{
- TODO
+ Decomposes high-level named ops into a sequence of non-aggregate named ops
+ via `AggregatedOpInterface`.
+
+ The operation ignores non-decomposable ops. The return handles point to
+ a sequence of named ops produced by the decomposition.
}];
let arguments = (ins TransformHandleTypeInterface:$target);
- let results = (outs TransformHandleTypeInterface:$transformed);
+ let results = (outs Variadic<TransformHandleTypeInterface>:$transformed);
let assemblyFormat =
"$target attr-dict `:` functional-type(operands, results)";
- let extraClassDeclaration = [{
- ::mlir::DiagnosedSilenceableFailure applyToOne(
- ::mlir::transform::TransformRewriter &rewriter,
- ::mlir::Operation *target,
- ::mlir::transform::ApplyToEachResultList &results,
- ::mlir::transform::TransformState &state);
- }];
}
//===----------------------------------------------------------------------===//
// RewriteInDestinationPassingStyleOp.
diff --git a/mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h b/mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h
index 05e97befdec1f..b0eeb274f71bb 100644
--- a/mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h
+++ b/mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h
@@ -1546,6 +1546,11 @@ void populateLinalgTilingCanonicalizationPatterns(RewritePatternSet &patterns);
/// linalg.generic ops.
void populateLinalgNamedOpsGeneralizationPatterns(RewritePatternSet &patterns);
+/// Populates `patterns` with patterns to decompose high-level aggregate named
+/// ops (e.g., softmax) into a sequence of simpler linalg named ops, defining
+/// the operation semantics.
+void populateDecomposeAggregateNamedOpsPatterns(RewritePatternSet &patterns);
+
/// Linalg decompose convolutions patterns
/// Populates patterns to decompose high-D convolution ops into low-D ones.
diff --git a/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp b/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp
index 57d126603ebd7..383f285969ad7 100644
--- a/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp
+++ b/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp
@@ -2564,116 +2564,41 @@ void SoftmaxOp::getEffects(
// Helper functions for softmax decomposition.
// @{
-
-// Helper function to produce the iterator types (reduction or parallel) and
-// affine maps for the iterators used in the decomposition of softmax.
-// This method creates:
-// If allParallel == true:
-// - iterator type: {parallel, ..., parallel}
-// - affine maps:
-// -- identity with inputRank dimensions.
-// -- (d0, ..., dN) -> (d0, ..., d_dim-1, d_dim+1, ..., dN),
-// where N == inputRank.
-//
-// If allParallel == false:
-// - iterator type at dim(i) == parallel for i != \p dim and
-// dim(dim) == reduction.
-// - affine map:
-// -- identity with inputRank dimensions.
-// -- (d0, ..., dN) -> (d0, ..., d_dim-1, d_dim+1, ..., dN),
-// where N == inputRank.
-static std::tuple<SmallVector<utils::IteratorType>, SmallVector<AffineMap>>
-computeIteratorTypesAndIndexingMaps(OpBuilder &builder, int64_t inputRank,
- int64_t dim, bool allParallel = false) {
- SmallVector<utils::IteratorType> iteratorTypes(inputRank,
- utils::IteratorType::parallel);
- if (!allParallel)
- iteratorTypes[dim] = utils::IteratorType::reduction;
- MLIRContext *ctxt = builder.getContext();
- auto identityMap = AffineMap::getMultiDimIdentityMap(inputRank, ctxt);
- SmallVector<AffineExpr, 2> affineExprs;
- for (int i = 0; i < inputRank; i++) {
- if (i != dim)
- affineExprs.push_back(mlir::getAffineDimExpr(i, ctxt));
- }
- auto reductionMap =
- AffineMap::get(inputRank, /*symbols=*/0, affineExprs, ctxt);
- SmallVector<AffineMap> indexingMaps{identityMap, reductionMap};
- return std::make_tuple(iteratorTypes, indexingMaps);
-}
-
-// Helper function to produce a linalg.generic that computes a reduction on
-// dimension \p dim with the operation type \p T.
-template <typename T>
-static Value reduce(OpBuilder &builder, Location loc, Value input, Value output,
- int64_t dim) {
- auto inputType = cast<ShapedType>(input.getType());
- ArrayRef<int64_t> inputShape = inputType.getShape();
- int64_t inputRank = inputShape.size();
- auto [iteratorTypes, indexingMaps] =
- computeIteratorTypesAndIndexingMaps(builder, inputRank, dim);
- assert(indexingMaps.size() == 2 &&
- "We should have two maps: 1 for the input, 1 for the output");
- assert(indexingMaps[0].isIdentity() && "input map should be identity");
-
- auto genericOp = builder.create<linalg::GenericOp>(
- loc, output.getType(), input, output, indexingMaps, iteratorTypes,
- [&](OpBuilder &b, Location loc, ValueRange args) {
- Value result = b.create<T>(loc, args[0], args[1]);
- b.create<linalg::YieldOp>(loc, result);
- });
- return genericOp.getResult(0);
-}
-
-/// Produce a linalg generic that computes the second step of the softmax
-/// decomposition: res = exp(input - max), where \p max is the max of \p input
-/// on dimension \p dim.
-static Value buildSubAndExpOp(OpBuilder &builder, Location loc, Value input,
- Value max, Value output, int64_t dim) {
- auto inputType = cast<ShapedType>(input.getType());
- ArrayRef<int64_t> inputShape = inputType.getShape();
- int64_t inputRank = inputShape.size();
- auto [iteratorTypes, indexingMaps] = computeIteratorTypesAndIndexingMaps(
- builder, inputRank, dim, /*allParallel=*/true);
- assert(indexingMaps.size() == 2 && "We should have one map for each input");
- assert(indexingMaps[0].isIdentity() && "input map should be identity");
- // Add the affine map for the output argument.
- indexingMaps.push_back(indexingMaps[0]);
- auto genericOp = builder.create<linalg::GenericOp>(
- loc, input.getType(), ValueRange{input, max}, output, indexingMaps,
- iteratorTypes, [&](OpBuilder &b, Location loc, ValueRange args) {
- Value diff = b.create<arith::SubFOp>(loc, args[0], args[1]);
- Value result = b.create<math::ExpOp>(loc, diff);
- b.create<linalg::YieldOp>(loc, result);
- });
- return genericOp.getResult(0);
-}
-
-/// Produce a linalg generic that computes the final step of the softmax
-/// decomposition.
-/// \returns linalg.generic ins(\p numerator, \p denominator) outs(\p output) {
-/// yield n / d
-/// }
-static Value buildDivOp(OpBuilder &builder, Location loc, Value numerator,
- Value denominator, Value output, int64_t dim) {
- auto inputType = cast<ShapedType>(numerator.getType());
- ArrayRef<int64_t> inputShape = inputType.getShape();
- int64_t inputRank = inputShape.size();
- auto [iteratorTypes, indexingMaps] = computeIteratorTypesAndIndexingMaps(
- builder, inputRank, dim, /*allParallel=*/true);
- assert(indexingMaps.size() == 2 &&
- "We should have one map for each input (2)");
- assert(indexingMaps[0].isIdentity() && "Numerator map should be identity");
- // Add the affine map for the output tensor.
- indexingMaps.push_back(indexingMaps[0]);
- auto genericOp = builder.create<linalg::GenericOp>(
- loc, numerator.getType(), ValueRange{numerator, denominator}, output,
- indexingMaps, iteratorTypes,
- [&](OpBuilder &b, Location loc, ValueRange args) {
- Value result = b.create<arith::DivFOp>(loc, args[0], args[1]);
- b.create<linalg::YieldOp>(loc, result);
- });
- return genericOp.getResult(0);
+TypedAttr createInitValueForReduceMaxOp(Type type, OpBuilder &b) {
+ if (isa<FloatType>(type))
+ return b.getFloatAttr(
+ type, APFloat::getSmallest(cast<FloatType>(type).getFloatSemantics()));
+ if (isa<IntegerType>(type))
+ return b.getIntegerAttr(
+ type, APInt::getSignedMinValue(type.getIntOrFloatBitWidth()));
+ return {};
+}
+
+TypedAttr createInitValueForReduceSumOp(Type type, OpBuilder &b) {
+ if (isa<FloatType>(type))
+ return b.getFloatAttr(
+ type, APFloat::getZero(cast<FloatType>(type).getFloatSemantics()));
+ if (isa<IntegerType>(type))
+ return b.getIntegerAttr(type, APInt::getZero(type.getIntOrFloatBitWidth()));
+ return {};
+}
+
+Value createLinalgReduceMaxBody(OpBuilder b, Location loc, ValueRange args,
+ Type elementTy) {
+ if (isa<FloatType>(elementTy))
+ return b.create<arith::MaxNumFOp>(loc, args[0], args[1]);
+ if (isa<IntegerType>(elementTy))
+ return b.create<arith::MaxSIOp>(loc, args[0], args[1]);
+ return {};
+}
+
+Value createLinalgReduceSumBody(OpBuilder &b, Location loc, ValueRange args,
+ Type elementTy) {
+ if (isa<FloatType>(elementTy))
+ return b.create<arith::AddFOp>(loc, args[0], args[1]);
+ if (isa<IntegerType>(elementTy))
+ return b.create<arith::AddIOp>(loc, args[0], args[1]);
+ return {};
}
// @} End helper functions for softmax decomposition.
@@ -2695,7 +2620,7 @@ static Value buildDivOp(OpBuilder &builder, Location loc, Value numerator,
/// 4. Divide z and l. This gives the N-dimensional softmax.
/// softmax = z / l
///
-FailureOr<SmallVector<Value>> SoftmaxOp::decomposeOperation(OpBuilder &b) {
+FailureOr<DecompositionResult> SoftmaxOp::decomposeOperation(OpBuilder &b) {
OpBuilder::InsertionGuard guard(b);
b.setInsertionPoint(*this);
Location loc = getLoc();
@@ -2706,32 +2631,60 @@ FailureOr<SmallVector<Value>> SoftmaxOp::decomposeOperation(OpBuilder &b) {
SmallVector<OpFoldResult> dims = tensor::getMixedSizes(b, loc, input);
Value output = getOutput();
dims.erase(dims.begin() + reductionDim);
+
// Step 1: Compute max along dim.
Value outputReduce = b.create<tensor::EmptyOp>(loc, dims, elementType);
- Value neutralForMaxF = arith::getIdentityValue(arith::AtomicRMWKind::maximumf,
- elementType, b, loc,
- /*useOnlyFiniteValue=*/true);
- Value neutralForMaxFInit =
- b.create<linalg::FillOp>(loc, Value{neutralForMaxF}, outputReduce)
- .result();
- Value max =
- reduce<arith::MaxNumFOp>(b, loc, input, neutralForMaxFInit, reductionDim);
+ auto maxFillValAttr = createInitValueForReduceMaxOp(elementType, b);
+ auto maxFillValue = b.create<arith::ConstantOp>(loc, maxFillValAttr);
+ auto neutralMaxInitOp = b.create<linalg::FillOp>(
+ loc, ValueRange{maxFillValue}, ValueRange{outputReduce});
+ Value neutralForMaxFInit = neutralMaxInitOp.result();
+
+ auto reduceMaxOp = b.create<linalg::ReduceOp>(
+ loc, input, neutralForMaxFInit, reductionDim,
+ [&](OpBuilder &nestedBuilder, Location nestedLoc, ValueRange args) {
+ auto result =
+ createLinalgReduceMaxBody(b, nestedLoc, args, elementType);
+ nestedBuilder.create<linalg::YieldOp>(nestedLoc, result);
+ });
// Step 2: Subtract max from input and exponentiate.
- Value numerator = buildSubAndExpOp(b, loc, input, max, output, reductionDim);
+ auto maxBroadcastOp = b.create<linalg::BroadcastOp>(
+ loc, reduceMaxOp.getResult(0), output, reduceMaxOp.getDimensionsAttr());
+
+ auto subOp = b.create<linalg::SubOp>(
+ loc, ValueRange{input, maxBroadcastOp.getResults().front()},
+ ValueRange{output});
+ auto expOp = b.create<linalg::ExpOp>(loc, ValueRange{subOp.getResult(0)},
+ ValueRange{output});
// Step 3: Compute sum along dim.
- Value zero = arith::getIdentityValue(arith::AtomicRMWKind::addf, elementType,
- b, loc, /*useOnlyFiniteValue=*/true);
- Value zeroInit =
- b.create<linalg::FillOp>(loc, Value{zero}, outputReduce).result();
- Value denominator =
- reduce<arith::AddFOp>(b, loc, numerator, zeroInit, reductionDim);
+ auto sumFillValAttr = createInitValueForReduceSumOp(elementType, b);
+ auto sumFillValue = b.create<arith::ConstantOp>(loc, sumFillValAttr);
+ auto neutralSumInitOp = b.create<linalg::FillOp>(
+ loc, ValueRange{sumFillValue}, ValueRange{outputReduce});
+ auto sumFilledTensor = neutralSumInitOp.result();
+ auto reduceSumOp = b.create<linalg::ReduceOp>(
+ loc, expOp.getResults(), sumFilledTensor, reductionDim,
+ [&](OpBuilder &nestedBuilder, Location nestedLoc, ValueRange args) {
+ auto result =
+ createLinalgReduceSumBody(b, nestedLoc, args, elementType);
+ nestedBuilder.create<linalg::YieldOp>(nestedLoc, result);
+ });
// Step 4: Compute softmax.
- Value result =
- buildDivOp(b, loc, numerator, denominator, output, reductionDim);
- return SmallVector<Value>{result};
+ auto sumBcastOutput = b.create<tensor::EmptyOp>(
+ loc, getOutputOperandType().getShape(), elementType);
+ auto sumBroadcastOp = b.create<linalg::BroadcastOp>(
+ loc, reduceSumOp.getResult(0), sumBcastOutput,
+ reduceSumOp.getDimensionsAttr());
+ auto divOp = b.create<linalg::DivOp>(
+ loc, ValueRange{expOp.getResult(0), sumBroadcastOp.getResults().front()},
+ ValueRange{output});
+ return DecompositionResult{{neutralMaxInitOp, reduceMaxOp, maxBroadcastOp,
+ subOp, expOp, neutralSumInitOp, reduceSumOp,
+ sumBroadcastOp, divOp},
+ {divOp.getResults().front()}};
}
//===----------------------------------------------------------------------===//
diff --git a/mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp b/mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp
index bc02788f9c441..e3f0a18a5ec2c 100644
--- a/mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp
+++ b/mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp
@@ -431,27 +431,23 @@ transform::DecomposeOp::applyToOne(transform::TransformRewriter &rewriter,
// Decompose the target operation if it implements the AggregatedOpInterface.
// Push the decomposed operations (the ones that replaces the values produced by
// \p target) in the `results`.
-DiagnosedSilenceableFailure transform::DecomposeInterfaceOp::applyToOne(
- transform::TransformRewriter &rewriter, Operation *target,
- transform::ApplyToEachResultList &results,
- transform::TransformState &state) {
- auto decomposableOp = dyn_cast<AggregatedOpInterface>(target);
- if (!decomposableOp) {
- failed(rewriter.notifyMatchFailure(target,
- "payload is not a decomposable op"));
- return emitDefaultSilenceableFailure(target);
- }
+DiagnosedSilenceableFailure
+transform::DecomposeInterfaceOp::apply(transform::TransformRewriter &rewriter,
+ TransformResults &transformResults,
+ TransformState &state) {
+ for (auto [i, target] : llvm::enumerate(state.getPayloadOps(getTarget()))) {
+ auto decomposableOp = dyn_cast<AggregatedOpInterface>(target);
+ if (!decomposableOp)
+ continue;
- FailureOr<SmallVector<Value>> maybeNewResults =
- decomposableOp.decomposeOperation(rewriter);
- if (failed(maybeNewResults))
- return emitDefaultSilenceableFailure(target);
+ FailureOr<DecompositionResult> maybeNewResults =
+ decomposableOp.decomposeOperation(rewriter);
+ if (failed(maybeNewResults))
+ return emitDefaultSilenceableFailure(target);
- rewriter.replaceOp(decomposableOp, *maybeNewResults);
- for (Value val : *maybeNewResults) {
- Operation *definition = val.getDefiningOp();
- if (definition)
- results.push_back(definition);
+ rewriter.replaceOp(decomposableOp, maybeNewResults->decomposedValues);
+ transformResults.set(cast<OpResult>(getResult(i)),
+ maybeNewResults->decomposedOps);
}
return DiagnosedSilenceableFailure::success();
}
diff --git a/mlir/lib/Dialect/Linalg/Transforms/CMakeLists.txt b/mlir/lib/Dialect/Linalg/Transforms/CMakeLists.txt
index 7e3dc56e0acdc..68582fe6cbad2 100644
--- a/mlir/lib/Dialect/Linalg/Transforms/CMakeLists.txt
+++ b/mlir/lib/Dialect/Linalg/Transforms/CMakeLists.txt
@@ -7,6 +7,7 @@ add_mlir_dialect_library(MLIRLinalgTransforms
ConvertConv2DToImg2Col.cpp
DataLayoutPropagation.cpp
DecomposeLinalgOps.cpp
+ DecomposeAggregateNamedLinalgOps.cpp
Detensorize.cpp
DropUnitDims.cpp
ElementwiseOpFusion.cpp
diff --git a/mlir/lib/Dialect/Linalg/Transforms/DecomposeAggregateNamedLinalgOps.cpp b/mlir/lib/Dialect/Linalg/Transforms/DecomposeAggregateNamedLinalgOps.cpp
new file mode 100644
index 0000000000000..e8a5b96d54d34
--- /dev/null
+++ b/mlir/lib/Dialect/Linalg/Transforms/DecomposeAggregateNamedLinalgOps.cpp
@@ -0,0 +1,62 @@
+//===- DecomposeNamedLinalgOps.cpp - Patterns to break up complex ops -----===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "mlir/Dialect/Linalg/Passes.h"
+
+#include "mlir/Dialect/Linalg/IR/Linalg.h"
+#include "mlir/Dialect/Linalg/Transforms/Transforms.h"
+#include "mlir/Transforms/Gre...
[truncated]
Force-pushed "…position implementation" from 35e8600 to 005330b (…softmax decomposition implementation).
Thanks Petr, this looks good to me (but I'll allow time for others to review, so won't approve).
The main differences I see after generalizing the named ops:
- `linalg.fill`s become a generic, but that's a minor thing and I think it's the right thing to have.
- `linalg.subf` and `linalg.exp` are not fused as before, but again, this is the correct lowering. One can fuse element-wise ops later if so inclined.
- The `linalg.reduce { maxnum }`'s init constant is wrong; it will yield a wrong max if all numbers are `<= 0`.

Oh, nice catch! Misused the smallest API :) Will fix in a bit.
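For reference, the mix-up in question, illustrated as f32 constants (a sketch built from the f32 limits, not output of this patch):

```mlir
// APFloat::getSmallest() yields the smallest *positive* subnormal, so the
// reduction computes max(x, ~1.4e-45), which is wrong whenever all x <= 0.
%bad_init  = arith.constant 1.401298e-45 : f32
// The neutral element for maxnumf is the most negative finite value (or -inf):
%good_init = arith.constant -3.40282347E+38 : f32
```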
I don't think the interface of the `decompose_interface` TransformOp is right.
See my comment on it for why, and a suggestion for a fix.
As issues around TransformOps being "vectorized" or not crop up more often, maybe @ftynse can chime in with other suggestions for the op's interface.
    let arguments = (ins TransformHandleTypeInterface:$target);
-   let results = (outs TransformHandleTypeInterface:$transformed);
+   let results = (outs Variadic<TransformHandleTypeInterface>:$transformed);
I don't think this is right. `Variadic<..>` allows instances of an op to have different numbers of return values, but for each instance the number of result values still needs to be statically known, i.e. there is always a fixed number of results in the IR.
What you are doing in the implementation of `apply()` in `LinalgTransformOps.cpp` is pushing a result for each op in the payload that was associated to `target`. Note that the number of ops in this payload is only known at runtime. Hence it had better be that the number of ops associated to `target` equals the number of results that the instance of `transform.structured.decompose_interface` expects, otherwise there will be a runtime error.
This issue also crops up in another situation: you optionally do not push a result if an op associated to `target` does not implement the required interface, i.e. you "ignore" it. Because a user is statically encoding the number of expected results, a user would always need to know upfront how many of the ops passed via `target` do or do not implement the interface. Otherwise the number of results would be wrong and there would be a runtime error.
Hence this scheme does not work in general. It would mean always needing to know statically how many decomposable ops are going to be associated to an instance of `decompose_interface`'s `target` argument at runtime.
Maybe @ftynse has a suggestion for still allowing the `decompose_interface` TransformOp to be "vectorized," but at the moment I don't see an unambiguous interface for it.
I think the interface that does work is to insist that `target`'s payload size is at most one (there are a number of TransformOps that insist on payload sizes of one/at most one). In that case you just have one return handle (so no `Variadic<..>`) which either carries all the decomposed ops or is empty in case 1) `target` was empty or 2) the associated op didn't implement the decompose interface.
Thanks for the suggestion, Rolf! Yes, this is awkward. Here's an example I wrote while testing the thing; it highlights the usage problem:
func.func @softmax(%arg0: tensor<2x16x32xf32>, %dst: tensor<2x16x32xf32>) -> tensor<2x16x32xf32> {
%out = tensor.empty() : tensor<2x16x32xf32>
%1 = linalg.softmax dimension(2) ins(%arg0 : tensor<2x16x32xf32>) outs(%out: tensor<2x16x32xf32>) -> tensor<2x16x32xf32>
%2 = linalg.softmax dimension(1) ins(%1: tensor<2x16x32xf32>) outs(%dst: tensor<2x16x32xf32>) -> tensor<2x16x32xf32>
return %2 : tensor<2x16x32xf32>
}
module attributes {transform.with_named_sequence} {
transform.named_sequence @__transform_main(%arg1: !transform.any_op {transform.readonly}) {
%2 = transform.structured.match ops{["linalg.softmax"]} in %arg1 : (!transform.any_op) -> !transform.any_op
%3, %4 = transform.structured.decompose_interface %2 : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
%5 = transform.structured.generalize %3: (!transform.any_op) -> !transform.any_op
%6 = transform.structured.generalize %4: (!transform.any_op) -> !transform.any_op
transform.yield
}
}
The other option I considered was to assign all the resulting ops to the single value the transform op returns. That had the downside of not knowing which new payload corresponds to which target, so I went this route.
Anyway, the purpose of the change was to introduce the pass. The transform op change is just a byproduct (and it seems the op was created for demonstration in the first place? I found no usage scenarios of it anywhere). I don't have an opinion on how the interface should look. I'll just list some properties to consider:
- Multiple ops in the resulting decomposition (i.e., can't go through `applyToOne`)
- Should skip ops that don't implement the interface
- Would be nice to apply to a sequence (?)

If there are no other strong opinions I'll follow the suggestion.
> The other option I considered was to assign all the resulting ops to the single value the transform op returns. That had the downside of not knowing which new payload corresponds to which target

I think this is also a valid option, namely keep the op vectorized but it's up to the user to deal with the information loss of which decomposed ops belonged to which op in `target`'s payload. For example, if you know you want to do the same thing to all decomposed ops anyway, e.g. generalize them, this would be more compact than needing to use a `transform.foreach` and stuffing the `decompose_interface` op in its region. In case the user ensures there's only one op associated to `target`, e.g. by using `transform.foreach`, then they regain the unambiguous semantics of my suggestion.
You could even make the "accumulation" behaviour opt-in by requiring the user to use a (unit-)attribute or keyword. Whether this is a pattern that is desirable for the transform dialect is, I think, best addressed by @ftynse.
Similarly, you could make the ignoring behaviour opt-in or opt-out through a keyword or attribute. Though, IMO, for this PR, it's fine to pick one behaviour and document it. Somebody can make the behaviour subject to a flag in a separate PR later.
In any case, the above is basically just bikeshedding. If you want something that can be merged quickly, you could go with my non-vectorized suggestion, which keeps to an established pattern in the Transform Dialect.
The vectorized behaviour could always be added in a separate PR.
I went with binding all the payload to a single output value (reverting the op change). This was easiest, it doesn't modify the original interface, and it has less verbose usage:
func.func @softmax(%arg0: tensor<2x16x32xf32>, %dst: tensor<2x16x32xf32>) -> tensor<2x16x32xf32> {
%out = tensor.empty() : tensor<2x16x32xf32>
%1 = linalg.softmax dimension(2) ins(%arg0 : tensor<2x16x32xf32>) outs(%out: tensor<2x16x32xf32>) -> tensor<2x16x32xf32>
%2 = linalg.softmax dimension(1) ins(%1: tensor<2x16x32xf32>) outs(%dst: tensor<2x16x32xf32>) -> tensor<2x16x32xf32>
return %2 : tensor<2x16x32xf32>
}
module attributes {transform.with_named_sequence} {
transform.named_sequence @__transform_main(%arg1: !transform.any_op {transform.readonly}) {
%2 = transform.structured.match ops{["linalg.softmax"]} in %arg1 : (!transform.any_op) -> !transform.any_op
%3 = transform.structured.decompose_interface %2 : (!transform.any_op) -> (!transform.any_op)
%5 = transform.structured.generalize %3: (!transform.any_op) -> !transform.any_op
transform.yield
}
}
Fine by me. 👍
%2 = transform.structured.match ops{["linalg.softmax"]} in %arg1 : (!transform.any_op) -> !transform.any_op
%3 = transform.structured.decompose_interface %2 : (!transform.any_op) -> !transform.any_op
%4 = transform.structured.generalize %3: (!transform.any_op) -> !transform.any_op
Why add this to the existing test?
I don't think generalization is needed for anything and it makes it harder to see what, if anything, has changed after the logic rework.
This is to show that the generic lowering is still valid. The lowering to named ops is in another test.
> generic lowering is still valid

It used to be decomposed to generic, but now the decomposition is always to named ops, and you have to rely on a separate pass to get that generic representation. So I still think it's unrelated to this test.
I think we need both. It doesn't matter to me if we create a new one for generic or rely on the new one Petr added.
To me it's a separate concern. AFAIK, decompose has never promised a specific lowering format.
If we worry about the generalization pass, then that belongs in its own unit tests.
I think it's a nice demonstration of the resulting code which is useful for this review. I agree it's not related to the test. We can remove generalization from both decomposition and transform tests before the merge.
I agree with @rengolin that we need both. MLIR could definitely use more of "integration" tests, but without them replacing unit tests.
I'm confused. Does "both" mean we want to leave all the existing tests in the PR and add one more test for the transform op without generalization? Like in:
module attributes {transform.with_named_sequence} {
transform.named_sequence @__transform_main(%arg1: !transform.any_op {transform.readonly}) {
%2 = transform.structured.match ops{["linalg.softmax"]} in %arg1 : (!transform.any_op) -> !transform.any_op
%3 = transform.structured.decompose_interface %2 : (!transform.any_op) -> (!transform.any_op)
transform.yield
}
}
To me you already have both: the old test, which you generalize and compare the delta, and the new one you added, which is just the named ops. I'm happy with the tests the way they are.
I agree with what's being proposed here (in terms of test coverage), but please make sure that we don't duplicate tests and that test files reflect what's being tested (I consider test file names to be part of the documentation).
ATM:
- This file is called "transform-op-decompose.mlir", but it's being updated to test both decomposition and generalisation.
- Another file added in this PR, "decompose-named-ops.mlir", also tests both decomposition and generalisation.

IMO, there should be a separate test file for:
- decomposition
- generalisation
- e2e/integration (decomposition -> generalisation)

This way it will be clear what is being tested and where.
EDIT: "Unresolving" to make sure that this is addressed (commented after this was resolved).
/// - `decomposedValues` contains the values corresponding to the result of the
///   aggregate operation.
struct DecompositionResult {
  SmallVector<Operation *> decomposedOps;
Are all decomposed ops Linalg ops? If so, can we use `linalg::LinalgOp` (interface) instead of the generic `Operation *`?
In the softmax case, yes. I was thinking this could eventually become a non-linalg-specific interface, so went with `Operation*`. Do you have a specific reason in mind why it should be `LinalgOp`?
I think it makes sense to be a bit more restrictive to begin with, then expand as we have more use cases. The main problem is when someone else tries to use it for a wildly different case and complains "it's not working" when in truth it was never intended to be supported.
I'm asking because this seems to be not a trivial change. `LinalgOp` is an incomplete type in this context, and a forward declaration for `DecompositionResult` makes the struct an incomplete type for the `FailureOr` template instantiation (in `AggregatedOpInterface`).
Hm, I see. I'm ok with this being `Operation *` for now, with a TODO/FIXME to make sure people are aware if they try to use it for other needs. I think the probability of anyone trying to use it outside of this scope is very small.
Added a todo.
I prefer it to be `Operation *`. That's the more general form, and changing the ABI causes breakages. There is no reason for this to always be Linalg ops.
if (definition)
  results.push_back(definition);
rewriter.replaceOp(decomposableOp, maybeNewResults->decomposedValues);
allDecomposedOps.append(maybeNewResults->decomposedOps);
I'm not a fan of mixing all results together. It basically makes the result handle useless for further composition outside of the single-payload-operand case. The most recent attempt at "fixing" this is to introduce a `UnitAttr:$flatten` whose presence explicitly requests the results to be flattened into a single list. Its absence enables a check for the operand being associated with at most one payload. This makes the behavior visible to the user and thus less surprising. When needed, they can wrap the logic in `transform.foreach`.
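A sketch of what that could look like on this op; the `flatten` spelling below is hypothetical, not an attribute this PR defines:

```mlir
// Default: require that %2 is associated with at most one payload op.
%3 = transform.structured.decompose_interface %2
    : (!transform.any_op) -> !transform.any_op
// Hypothetical opt-in: decomposed ops from all payload ops are flattened
// into the single result handle.
%4 = transform.structured.decompose_interface %2 {flatten}
    : (!transform.any_op) -> !transform.any_op
```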
@ftynse, could you please point me to the code/issue? I'm not familiar.
https://mlir.llvm.org/docs/Dialects/Transform/#transformforeach_match-transformforeachmatchop has `flatten_results`. You can `git blame` your way back to the PR that added it. The implementation is rather long; in this case it's just a matter of having an attribute and a conditional as described above.
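For illustration, the `transform.foreach` wrapping mentioned above could look roughly like this (a sketch assuming the single-result form of `decompose_interface`; `%targets` and the exact `foreach` result syntax are assumptions):

```mlir
%decomposed = transform.foreach %targets : !transform.any_op -> !transform.any_op {
^bb0(%op: !transform.any_op):
  // Each iteration sees exactly one payload op, so which decomposed ops
  // came from which target remains unambiguous.
  %d = transform.structured.decompose_interface %op
      : (!transform.any_op) -> !transform.any_op
  transform.yield %d : !transform.any_op
}
```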
Thanks! Does this PR change what Ops …
Also, what's the difference between "transform-op-decompose.mlir" and "decompose-named-ops.mlir"? Is it just "how" the decomposition is "driven" (TD vs Pass)? Couldn't that be one test instead?
Thanks for checking and for the extra context. I am just wondering: …
I am a bit confused, there's …
-        b.create<linalg::YieldOp>(loc, result);
-      });
-  return genericOp.getResult(0);
+TypedAttr createInitValueForReduceMaxOp(Type type, OpBuilder &b) {
> createInitValueForReduceMaxOp

This might be something obvious, but where does the reduction happen?
The function is used to create an init value for the reduction. The reduction op itself is created right after that. I might not understand the question though.
> The function is used to create an init value for the reduction.

What's stopping anyone from using these methods for things other than reductions? It's worth clarifying with a comment. Better still (given how small these are), add them as lambdas inside `decomposeOperation`.
Btw, shouldn't these methods be static?
I had the same thought, actually, and there's similar stuff in TOSA, so these could be some generic utilities. The values are specific to reductions, though, and can be specific to an operator (this is why I didn't pull them out into utils right away). Is there an example you have in mind where those would be useful besides reductions?
// RUN: mlir-opt %s -split-input-file -linalg-decompose-named-ops | FileCheck %s
// RUN: mlir-opt %s -split-input-file -linalg-decompose-named-ops -linalg-generalize-named-ops | FileCheck %s --check-prefix=GENERALIZECHECK
[nit] With more than one CHECK prefix, you can drop `CHECK` (which is just noise IMHO) and leverage the prefixes to better document what's tested:

// RUN: mlir-opt %s -split-input-file -linalg-decompose-named-ops | FileCheck %s -check-prefix=DECOMPOSED
// RUN: mlir-opt %s -split-input-file -linalg-decompose-named-ops -linalg-generalize-named-ops | FileCheck %s --check-prefix=GENERALIZED

I am also making the suggestion to make sure that my understanding is correct.
// CHECK-DAG: %[[D1:.+]] = tensor.empty() : tensor<2x16xf32>
// CHECK-DAG: %[[CST:.+]] = arith.constant -3.40282347E+38 : f32
-// CHECK: %[[D2:.+]] = linalg.fill ins(%[[CST]] : f32) outs(%[[D1]] : tensor<2x16xf32>) -> tensor<2x16xf32>
+// CHECK: %[[D2:.+]] = linalg.generic {indexing_maps = [#[[$MAP2]], #[[$MAP3]]],
I'm wondering what `D` stands for in these tests 🤔 In any case, this would be a bit more descriptive:

-// CHECK: %[[D2:.+]] = linalg.generic {indexing_maps = [#[[$MAP2]], #[[$MAP3]]],
+// CHECK: %[[FILL_1:.+]] = linalg.generic {indexing_maps = [#[[$MAP2]], #[[$MAP3]]],

[nit]
I preserved the original naming for the diff to be readable. I also don't find the renaming very helpful or valuable.
> I preserved the original naming for the diff to be readable.

We should avoid optimising our patches to work around the limitations of GitHub. If GitHub rendering is hard to parse, I will happily just clone a PR and review locally. Let's optimise for maintainability instead.

> I also don't find the renaming very helpful or valuable.

I disagree - IMO it's both helpful and valuable. With `linalg.fill`, it doesn't matter that much whether it's `%[[D2:.+]]` or `%[[FILL_1:.+]]` - it's clear what the Op is. But your patch replaces `linalg.fill` with `linalg.generic`, which makes reading this rather dense test even trickier. To me, more descriptive LIT variables are helpful - they make parsing tests much easier.
I am not sure why this becomes a generic op. Can we preserve the original behavior of this being a fill op?
> I am not sure why this becomes a generic op. Can we preserve the original behavior of this being a fill op?

The reason is that the decomposition now acts at the linalg level and lives up to the promise of splitting a complex aggregate op into simpler ops at the same level of abstraction (btw, this is why it does make sense to have `LinalgOp`s in the decomposition result). After the decomposition, all the ops are non-generic linalg ops. At this stage, you have all the semantics available along with the benefit of not needing to analyze affine maps (say, for fusion). After generalize you have a nice homogeneous set of generics as the last step in the test.
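For concreteness, the named-op sequence the decomposition produces for a static 2x16x32 softmax along dimension 2 looks roughly like this (a sketch with assumed SSA names, using the shorthand `linalg.reduce` body form):

```mlir
%empty = tensor.empty() : tensor<2x16xf32>
// Step 1: fill with the neutral element for max, then reduce along dim 2.
%min = arith.constant -3.40282347E+38 : f32
%fill_max = linalg.fill ins(%min : f32) outs(%empty : tensor<2x16xf32>) -> tensor<2x16xf32>
%max = linalg.reduce { arith.maxnumf } ins(%input : tensor<2x16x32xf32>)
       outs(%fill_max : tensor<2x16xf32>) dimensions = [2]
// Step 2: broadcast the max back to the input rank, subtract, exponentiate.
%max_bc = linalg.broadcast ins(%max : tensor<2x16xf32>)
          outs(%out : tensor<2x16x32xf32>) dimensions = [2]
%sub = linalg.sub ins(%input, %max_bc : tensor<2x16x32xf32>, tensor<2x16x32xf32>)
       outs(%out : tensor<2x16x32xf32>) -> tensor<2x16x32xf32>
%exp = linalg.exp ins(%sub : tensor<2x16x32xf32>)
       outs(%out : tensor<2x16x32xf32>) -> tensor<2x16x32xf32>
// Step 3: fill with zero, then sum-reduce along dim 2.
%zero = arith.constant 0.0 : f32
%fill_sum = linalg.fill ins(%zero : f32) outs(%empty : tensor<2x16xf32>) -> tensor<2x16xf32>
%sum = linalg.reduce { arith.addf } ins(%exp : tensor<2x16x32xf32>)
       outs(%fill_sum : tensor<2x16xf32>) dimensions = [2]
// Step 4: broadcast the sum and divide.
%sum_bc = linalg.broadcast ins(%sum : tensor<2x16xf32>)
          outs(%out : tensor<2x16x32xf32>) dimensions = [2]
%softmax = linalg.div ins(%exp, %sum_bc : tensor<2x16x32xf32>, tensor<2x16x32xf32>)
           outs(%out : tensor<2x16x32xf32>) -> tensor<2x16x32xf32>
```

Generalization then rewrites each of these named ops, fills included, into `linalg.generic`.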
> To me, more descriptive LIT variables are helpful - they make parsing tests much easier.

Renamed.
// -----

// COM: decomposition assumes tensors as inputs, this is just to make sure nothing breaks
COM?
The decomposition follows the op description (see LinalgOps.td) and specifies its semantics via the implementation. The implementation ends up generating semantically the same code as the previous decomposition implementation (with a minor deviation, as you noted). The transform test demonstrates the change in the generalized code.
The first one tests the new pass. The second one uses the transform interpreter and the decompose_interface op (which happen to partially rely on the same code now).
The IR presented is the IR you get by lowering PyTorch to torch-mlir.
I'd say it sets it. The implementation follows the op description, so there's no real 'change'.
Decomposition is performed by the newly introduced …
Correct.
I'll take a closer look, but the changes to the softmax op decomposition need a look. Can that be split out into a separate change, if it is indeed needed, to keep the pass addition and the return-handle change isolated from these changes?
// CHECK: } -> tensor<2x16xf32>
// CHECK: %[[D4:.+]] = linalg.generic {indexing_maps = [#[[$MAP]], #[[$MAP1]], #[[$MAP]]], iterator_types =
// CHECK-SAME: ["parallel", "parallel", "parallel"]} ins(%[[ARG0]], %[[D3]] : tensor<2x16x32xf32>, tensor<2x16xf32>)
// CHECK: %[[BCST:.+]] = linalg.generic {indexing_maps = [#[[$MAP1]], #[[$MAP]]],
Why are the broadcast ops separate?
Named ops require the same shapes, no implicit casting.
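For example (shapes assumed), the reduced max has dropped the reduction dimension, so a broadcast restores the rank before `linalg.sub` accepts it:

```mlir
// %max : tensor<2x16xf32>, %input : tensor<2x16x32xf32>; linalg.sub wants
// both inputs (and the init) at the same shape, so re-expand the max first.
%max_bc = linalg.broadcast ins(%max : tensor<2x16xf32>)
          outs(%init : tensor<2x16x32xf32>) dimensions = [2]
%sub = linalg.sub ins(%input, %max_bc : tensor<2x16x32xf32>, tensor<2x16x32xf32>)
       outs(%init : tensor<2x16x32xf32>) -> tensor<2x16x32xf32>
```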
struct DecompositionResult {
  /// TODO: can be further constrained to LinalgOp.
  SmallVector<Operation *> decomposedOps;
  SmallVector<Value> decomposedValues;
Are these effectively replacements for the original results of the operation? `decomposedValues` is a misleading name. Maybe call it `replacements`?
done
Value output = getOutput();
dims.erase(dims.begin() + reductionDim);

SmallVector<int64_t> reduceShape;
Not sure why you are splitting this out to use `int64_t` and `Value` instead of keeping it as `OpFoldResult`. This seems to be equivalent to what was there, so not sure why change it?
These are just for the sake of pleasing tensor's empty op builder API. It won't accept a vector of `OpFoldResult` together with dynamic dims.
}
auto sumBcastOutput = b.create<tensor::EmptyOp>(
    loc, getOutputOperandType().getShape(), elementType, dynDims);
auto sumBroadcastOp = b.create<linalg::BroadcastOp>(
The attempt to use named ops here is making this more verbose and confusing. I would rather stick to this not using named ops (or at least provide a way to not use named ops).
What do you mean by that? What is confusing?
I think lowering to named ops this way is not great for softmax. If anything, it is showing issues with named ops as currently defined and developed. For example, linalg allows a more succinct representation of broadcasting behavior than having to add an explicit broadcast operation. I don't think this is a great way forward. It requires everything that relies on this decomposition to do additional work to get back to the state the decomposition previously produced. That doesn't seem like a good idea to me. Maybe you can create an option that will allow you to lower softmax to named ops like you want here (and we can decide which to make the default), but changing the decomposition this way is a red flag for me.
Also, the decomposition change needs to be a separate PR and not rolled into the PR that is adding a pass for decomposition.
First, regardless of the merits of lowering `softmax` to generics or not, I disagree that a more succinct representation is always preferred. We discussed this at length on the canonicalization threads in the forum.
Second, what this PR is adding is just the semantics that:
- Decomposition breaks named ops into further named ops.
- Generalization lowers named ops into generics.

Above you mention you prefer `fill` to its generic form, even though the step that the test is doing is generalization, while here you're advocating for a generic form, even though the step is decomposition. This seems totally arbitrary to me.
The linalg dialect and its transforms should be consistent and predictable. If you want to generalize only some ops and not others, that seems to me like a local pass.
All I am saying is that today it is lowered to what I consider a more succinct representation, and with this change it does not seem straightforward to get back to the previous state. So, to not break downstream usage, it would be better to keep the current behavior as the default and introduce some optionality that allows you to lower to named ops. Going from named ops to the previous state does not seem that straightforward to me (you essentially need to know this is a softmax and handle it accordingly).
W.r.t. `fill`, ack your concern, but `fill` has always felt like a "special" op to me. To preserve the perfectly nested loop nature of linalg ops, what is

C = A * B

is converted to

C = zeros(...);
D = C + A*B;

but any sane backend needs to "fuse" the fill with the matmul operation to generate efficient code.
Anyway, that is a bit of a minor digression. My bigger concern is the change of the default that goes to named ops, which in turn enforce (what I consider outdated) explicit-broadcasting semantics (when Linalg has a perfectly succinct and unambiguous way to represent broadcasts). If named ops allowed representing broadcasting on par with what Linalg inherently allows, then there would be no issue here.
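In IR terms, the fill example reads (shapes and SSA names assumed):

```mlir
// "C = zeros(...)" followed by the accumulating "D = C + A*B"; a backend
// has to fuse the fill into the matmul to generate efficient code.
%cst0 = arith.constant 0.0 : f32
%C = linalg.fill ins(%cst0 : f32) outs(%init : tensor<4x8xf32>) -> tensor<4x8xf32>
%D = linalg.matmul ins(%A, %B : tensor<4x16xf32>, tensor<16x8xf32>)
     outs(%C : tensor<4x8xf32>) -> tensor<4x8xf32>
```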
IMO this is key - please add that in the summary.
From what I can tell, both tests verify the decomposition of:

func.func @softmax(%arg0: tensor<2x16x32xf32>, %dst: tensor<2x16x32xf32>) -> tensor<2x16x32xf32> {
  %1 = linalg.softmax dimension(2) ins(%arg0 : tensor<2x16x32xf32>) outs(%dst: tensor<2x16x32xf32>) -> tensor<2x16x32xf32>
  return %1 : tensor<2x16x32xf32>
}

Couldn't we re-use the input and the …
I know where the IR is coming from. FWIW (without links to documentation), that IR is just an implementation detail of torch-mlir. In this PR we are discussing an implementation detail of Linalg. Are you saying that the implementation in Linalg should match torch-mlir? Why? What if the implementation in torch-mlir changes? I'm trying to understand the motivation here and the overall design that we are converging towards.
Key info, please highlight in the summary.
I know what the options are, thanks. To me, your comment implies that …
Done.
Do I understand correctly that you suggest having a single lit test with the body of …
So the goal was to look into what frameworks actually do in their implementations. If it so happens that all of them lower softmax to the same sequence (and this is what we happen to have), we can set it as the default decomposition to avoid re-implementing the thing. The general idea and direction is to have an intermediate decomposition stage that deals with complex ops (such as softmax) to aid other transformations and analyses (this is also an easy route to adding more named ops, upstream and downstream: rather than implementing all the interfaces like tiling, convert the op to simpler ops and enjoy all the existing goodness). Note: I'm leaving the question of accumulation out for now; this should be addressed separately.
Done.
This just describes the transform op decomposition change (in other words, it used to produce generics + fills; now it internally produces a sequence of named ops and then runs generalization). There's no strict requirement to run generalization after the decomposition, of course.
Yes, something along those lines. Basically, IMO, we should identify a canonical way to test transformations with both "passes" and TD so that we maximise "test case" re-use.
PyTorch & torch-mlir are red herrings. This PR is about softmax. The current lowering of softmax into generics is, unsurprisingly, the same as both PyTorch and TensorFlow expect it to be. This is related to, but distinct from, lowering softmax to named ops. So, let's close this tangent, as it is irrelevant.
Absolutely not. The test where we lower to named ops was created in this PR. We verify that it does the "right thing". The old test checks the generic lowering (because it was the only thing we had), so @kurapov-peter added the generalization pass to match the old expectation. This provides us with a clear "apples to apples" comparison, and shows that the old output can no longer be attained. Most importantly, you don't get a mix of named + generics: you either get all named or all generic. This seems to be a problem for @MaheshRavishankar and I want to understand it better. My guess is that there are pattern matchers that won't work with the generic version of …
An option is to make some "special" ops never generalize, for example …
I would like to understand the next steps here. @MaheshRavishankar, could you please elaborate ^^^? Regarding the … Regarding …
There are broadly two things we need to make progress here: …
Just to map back to what I said above: we can "recognize" that it's a fill, but that seems like an unnecessary burden added to downstream users because it has been generalized too early without any control. I can go into details about why I think "fill" is special, but that's a separate issue IMO.
I want to clarify this. This is NOT implicit broadcasting. This is very much an unambiguous broadcast representation. For example, …
There is nothing ambiguous or implicit in this broadcasting. The problem with named ops is that they force all operands to be of the same rank, which is an unnecessary requirement at the Linalg level. The fix is to allow named ops to make use of the broadcast representation that Linalg inherently allows. In the name of "explicit broadcasting" we have an artificial requirement of getting all operands to the same rank, which is unnecessary IMO. Also, it is strictly easier to go from this representation to a representation that requires all operands to be of the same rank (it's essentially a lowering: you break up the operation into multiple ops). Going from a representation where all ops are "broadcasted" to the same rank to the above representation is IMO a lifting. Actually, that brings me to a potential solution: you can take the existing lowering for softmax and then add a pass to explicitly split out the broadcast and then generalize. That will get you to the state you want here?
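A sketch of that representation, with assumed shapes: the lower-rank operand is broadcast purely through its indexing map, with no separate broadcast op in the IR:

```mlir
#id    = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
#bcast = affine_map<(d0, d1, d2) -> (d0, d1)>
// %max is read at rank 2 inside a rank-3 loop nest; the missing d2 in
// #bcast is what expresses the broadcast, explicitly and unambiguously.
%0 = linalg.generic
    {indexing_maps = [#id, #bcast, #id],
     iterator_types = ["parallel", "parallel", "parallel"]}
    ins(%input, %max : tensor<2x16x32xf32>, tensor<2x16xf32>)
    outs(%out : tensor<2x16x32xf32>) {
^bb0(%in: f32, %m: f32, %o: f32):
  %diff = arith.subf %in, %m : f32
  linalg.yield %diff : f32
} -> tensor<2x16x32xf32>
```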
Ok, I see. So this position is the opposite of what I'm proposing: changing the default decomposition to target named ops (note that this has nothing to do with generalization). Here I'm summarizing the arguments for preserving the status quo and against it. Cons of changing the default:
1. Downstreams relying on the current mixed output (generics + fills) would need additional steps to reach the same state.
Pros of changing the default:
Mitigating con no. 1: even though reaching the same result as the mixed decomposition requires additional steps, those are existing upstream transformations. Hence, the downstream changes won't be a significant burden. I suggest we make a decision on the direction here @ftynse, @nicolasvasilache, @dcaballe, @rengolin.
I think you missed the point. The proposed decomposition only converts an aggregate operation into a sequence of non-aggregate ones. This has nothing to do with generalization. Downstreams don't need to recognize a `linalg.fill`.
Same here. Generalized IR with broadcasts is not the target state. The target is a sequence of named ops.
I have been trying to find a way to help land this PR without asking for too much "as a precursor" work, but given that there hasn't been much change in the approach, the real issues IMO are two:
1. A way to control the (de)composition.
2. The role of named ops and broadcasting.
I don't agree with this characterization. If you want to lower to named ops, which you then generalize anyway, representing broadcasts more succinctly is not a fusion IMO. This seems like it is transplanting ideas from tosa/torch etc. into Linalg. There you need to have broadcast as a separate operation; you don't need that in Linalg. I wouldn't characterize it as a fusion (rather, tosa/torch are artificially forcing front-ends/lowerings/programmers to introduce a broadcast, since they don't have mechanisms to represent broadcasts effectively).
Again, I don't agree that this is implicit casting. This is very much an explicit representation of broadcasting behavior. And if you think the downstream changes to get back to the current state are not a significant burden, please add that to your PR and then let's discuss how to package it from a user perspective. I could very well come and change the behavior back because it "suits my need", and we would never reach a stable state.
Again, I disagree. Representing broadcasting semantics more succinctly is not about it being an "aggregate" op; rather, there should never have been a need for a separate broadcast op in the first place.
Flyby review of a conversation that appears to be looping.

For number 1 (controllable (de)composition) -- big +1. This is how PyTorch does it, and it is a super power (and a big part of what is making things attractive there). Basically, you end up with a way to say what your backend prefers, and the framework gets you as close to that as it can. It is a practical way to get out of the issue that really, no one agrees on every detail of this stuff (and never will / there is no best).

For number 2 -- I'm watching this same basic discussion loop on multiple threads and PRs (basically the role of named ops and broadcasting). We're using sloppy language (calling it cast, explicit, implicit, fusion, etc.), so I'm not going to add to that. But it is quite clear to me that there are a couple of opposing viewpoints on this being ground out patch by patch (with the usual amount of friction that entails). My tendency is to side with Mahesh's viewpoint on this -- not because of upstream/downstream/whatever -- but because that viewpoint is more compatible with all of the transformations that we should be able to do on this opset (and I've lived too many lives with the really substandard backdrop of trying to use the fixed-function op libraries of the frameworks for transformations and optimizations). But if I squint, I can see the value in the "named op everything" viewpoint iff it is part of a holistic design supporting configurable specialization and robust generalization.

I don't want to litigate any of this on a PR like this, but I do think there are a couple of broader discussions here that we'd be better off having.
I actually don't mind it, as long as we agree on the direction. After revisiting our discussion, @MaheshRavishankar, I think we are talking about a similar end state using different language. I'll try to confirm that below.
Agree. This is a more generic description of what I called partial generalization to reach the same end state as the current decomposition. How about we make this the first step? I can start with an RFC to collect the requirements, and we can team up on the design/implementation.
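For that RFC discussion, a rough sketch of what driving the decomposition under user control could look like with the transform dialect (everything beyond the existing `transform.structured.decompose_interface` op name, including the exact result arity, is an assumption):

```mlir
// Hypothetical driver: match softmax ops and decompose only those, leaving
// everything else (including any later generalization) up to the user.
module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(
      %root: !transform.any_op {transform.readonly}) {
    %softmax = transform.structured.match ops{["linalg.softmax"]}
                 in %root : (!transform.any_op) -> !transform.any_op
    %decomposed = transform.structured.decompose_interface %softmax
                    : (!transform.any_op) -> !transform.any_op
    transform.yield
  }
}
```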
Right, this is what I referred to as implicit casting. It is less clear to me whether it is a good thing, but, again, I am happy to work on it if there's broad agreement. Here I might be missing something though: I see you suggesting this would be a solution, and at the same time you disagree and call it an "explicit representation of broadcasting behavior". This looks contradictory to me. Still, I assume we both mean "named ops can accept tensors of different ranks, and the decomposition does not produce an actual broadcast op".
@stellaraccident, right, the PR is a naive attempt to go there. I assumed that this was an agreed-upon direction.
This is the key here: define the semantics of the named ops. Landing the PR is secondary. Can we agree that, IFF we add broadcast/transpose semantics to the named ops, we should decompose softmax into those instead of generics? How we get there is a matter for another RFC, but I want to make sure our efforts there will lead to this decision being agreed on by consensus.
Let's leave the named ops discussion aside (there is an ongoing discussion here: https://discourse.llvm.org/t/rfc-transpose-attribute-for-linalg-matmul-operations/80092/36?u=maheshravishankar). But for this PR, let's maybe take a break; an RFC to allow controlling the decomposition would be great. I confess I have no great ideas to suggest, just some vague ones, so I am at a loss to really suggest how to do that.
FYI, we're working on a simplification of named ops with affine maps to avoid this problem, and I believe this is the solution for the current problem.
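As a sketch of that direction (syntax assumed from the linked RFC discussion, not a committed design): a named matmul carrying its own indexing maps can express a transposed operand without a separate op.

```mlir
// Hypothetical named matmul with explicit indexing maps: the LHS is
// consumed in transposed (k, m) order, with no separate linalg.transpose.
%0 = linalg.matmul
       indexing_maps = [affine_map<(m, n, k) -> (k, m)>,
                        affine_map<(m, n, k) -> (k, n)>,
                        affine_map<(m, n, k) -> (m, n)>]
       ins(%lhs, %rhs : tensor<64x32xf32>, tensor<64x128xf32>)
       outs(%init : tensor<32x128xf32>) -> tensor<32x128xf32>
```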
After some thought and discussion, my view on this has changed. I tried to write a proposal for how a mechanism for generic decompositions should look. The more detail I add, the more it resembles the regular rewrite patterns. At this point, it makes no sense to me to introduce yet another, very similar mechanism (that is restricted to a specific interface). If we are not changing the default decomposition, there's not much value in having additional ones upstream; those can exist as rewrites downstream. I'm closing this. Please let me know if there's anything I'm missing or if there's still interest in additional decompositions.
…position implementation

- Make `AggregatedOpInterface` return a `DecompositionResult`, similar to the tiling interface. This is to communicate the decomposition sequence nicely (e.g., useful for transform dialect, see below).
- Update the `DecomposeInterfaceOp` implementation. This removes code duplication between the generalization pass and decomposition implementation - now aggregate ops are decomposed first and then generalized.