[MLIR][Linalg] Add aggregate ops decomposition pass and softmax decomposition implementation #97582
Conversation
@llvm/pr-subscribers-mlir @llvm/pr-subscribers-mlir-linalg
Author: Petr Kurapov (kurapov-peter)
Changes: Add aggregate ops decomposition pass and softmax decomposition implementation.
Patch is 35.95 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/97582.diff 11 Files Affected:
diff --git a/mlir/include/mlir/Dialect/Linalg/IR/LinalgInterfaces.h b/mlir/include/mlir/Dialect/Linalg/IR/LinalgInterfaces.h
index 08afdf373f014..3858075fae137 100644
--- a/mlir/include/mlir/Dialect/Linalg/IR/LinalgInterfaces.h
+++ b/mlir/include/mlir/Dialect/Linalg/IR/LinalgInterfaces.h
@@ -30,6 +30,16 @@ class IteratorTypeAttr;
class LinalgOp;
class GenericOp;
+/// Container for result values of decomposition.
+/// - `decomposedOps` contains operations created by the decomposition that are
+/// returned to the caller for further transformations.
+/// - `decomposedValues` contains the values corresponding to the result of the
+/// aggregate operation.
+struct DecompositionResult {
+ SmallVector<Operation *> decomposedOps;
+ SmallVector<Value> decomposedValues;
+};
+
namespace detail {
/// Implementation of the method that check if given operands
/// can be dropped, i.e. the remaining operands can compute the loop
diff --git a/mlir/include/mlir/Dialect/Linalg/IR/LinalgInterfaces.td b/mlir/include/mlir/Dialect/Linalg/IR/LinalgInterfaces.td
index fbf3f19cde0e9..9b1ab20552628 100644
--- a/mlir/include/mlir/Dialect/Linalg/IR/LinalgInterfaces.td
+++ b/mlir/include/mlir/Dialect/Linalg/IR/LinalgInterfaces.td
@@ -862,7 +862,7 @@ def AggregatedOpInterface : OpInterface<"AggregatedOpInterface"> {
In other words, the returned vector can be used directly with
`RewriterBase::replaceOp(this, returnedValues)`.
}],
- /*retType=*/"FailureOr<SmallVector<Value>>",
+ /*retType=*/"FailureOr<DecompositionResult>",
/*methodName=*/"decomposeOperation",
/*args=*/(ins
"OpBuilder &":$b),
diff --git a/mlir/include/mlir/Dialect/Linalg/Passes.td b/mlir/include/mlir/Dialect/Linalg/Passes.td
index 0621a9f33ba1e..3031126e582f7 100644
--- a/mlir/include/mlir/Dialect/Linalg/Passes.td
+++ b/mlir/include/mlir/Dialect/Linalg/Passes.td
@@ -94,6 +94,11 @@ def LinalgGeneralizeNamedOpsPass : Pass<"linalg-generalize-named-ops"> {
let dependentDialects = ["linalg::LinalgDialect"];
}
+def LinalgDecomposeAggregateNamedOpsPass : Pass<"linalg-decompose-named-ops"> {
+ let summary = "Decompose complex named ops (e.g., Softmax) into a sequence of linalg named ops";
+ let dependentDialects = ["linalg::LinalgDialect"];
+}
+
def LinalgDetensorizePass : InterfacePass<"linalg-detensorize", "FunctionOpInterface"> {
let summary = "Detensorize linalg ops";
let dependentDialects = [];
diff --git a/mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td b/mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td
index 93e2c2db729da..2e8e294aa2e4c 100644
--- a/mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td
+++ b/mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td
@@ -1317,25 +1317,21 @@ def ConvertToLoopsOp : Op<Transform_Dialect, "structured.convert_to_loops",
def DecomposeInterfaceOp : Op<Transform_Dialect, "structured.decompose_interface",
[FunctionalStyleTransformOpTrait,
MemoryEffectsOpInterface,
- TransformOpInterface,
- TransformEachOpTrait,
+ DeclareOpInterfaceMethods<TransformOpInterface>,
ReportTrackingListenerFailuresOpTrait]> {
let description = [{
- TODO
+ Decomposes high-level named ops into a sequence of non-aggregate named ops
+ via `AggregatedOpInterface`.
+
+ The operation ignores non-decomposable ops. The return handles point to
+ a sequence of named ops produced by the decomposition.
}];
let arguments = (ins TransformHandleTypeInterface:$target);
- let results = (outs TransformHandleTypeInterface:$transformed);
+ let results = (outs Variadic<TransformHandleTypeInterface>:$transformed);
let assemblyFormat =
"$target attr-dict `:` functional-type(operands, results)";
- let extraClassDeclaration = [{
- ::mlir::DiagnosedSilenceableFailure applyToOne(
- ::mlir::transform::TransformRewriter &rewriter,
- ::mlir::Operation *target,
- ::mlir::transform::ApplyToEachResultList &results,
- ::mlir::transform::TransformState &state);
- }];
}
//===----------------------------------------------------------------------===//
// RewriteInDestinationPassingStyleOp.
diff --git a/mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h b/mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h
index 05e97befdec1f..b0eeb274f71bb 100644
--- a/mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h
+++ b/mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h
@@ -1546,6 +1546,11 @@ void populateLinalgTilingCanonicalizationPatterns(RewritePatternSet &patterns);
/// linalg.generic ops.
void populateLinalgNamedOpsGeneralizationPatterns(RewritePatternSet &patterns);
+/// Populates `patterns` with patterns to decompose high-level aggregate named
+/// ops (e.g., softmax) into a sequence of simpler linalg named ops, defining
+/// the operation semantics.
+void populateDecomposeAggregateNamedOpsPatterns(RewritePatternSet &patterns);
+
/// Linalg decompose convolutions patterns
/// Populates patterns to decompose high-D convolution ops into low-D ones.
diff --git a/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp b/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp
index 57d126603ebd7..383f285969ad7 100644
--- a/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp
+++ b/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp
@@ -2564,116 +2564,41 @@ void SoftmaxOp::getEffects(
// Helper functions for softmax decomposition.
// @{
-
-// Helper function to produce the iterator types (reduction or parallel) and
-// affine maps for the iterators used in the decomposition of softmax.
-// This method creates:
-// If allParallel == true:
-// - iterator type: {parallel, ..., parallel}
-// - affine maps:
-// -- identity with inputRank dimensions.
-// -- (d0, ..., dN) -> (d0, ..., d_dim-1, d_dim+1, ..., dN),
-// where N == inputRank.
-//
-// If allParallel == false:
-// - iterator type at dim(i) == parallel for i != \p dim and
-// dim(dim) == reduction.
-// - affine map:
-// -- identity with inputRank dimensions.
-// -- (d0, ..., dN) -> (d0, ..., d_dim-1, d_dim+1, ..., dN),
-// where N == inputRank.
-static std::tuple<SmallVector<utils::IteratorType>, SmallVector<AffineMap>>
-computeIteratorTypesAndIndexingMaps(OpBuilder &builder, int64_t inputRank,
- int64_t dim, bool allParallel = false) {
- SmallVector<utils::IteratorType> iteratorTypes(inputRank,
- utils::IteratorType::parallel);
- if (!allParallel)
- iteratorTypes[dim] = utils::IteratorType::reduction;
- MLIRContext *ctxt = builder.getContext();
- auto identityMap = AffineMap::getMultiDimIdentityMap(inputRank, ctxt);
- SmallVector<AffineExpr, 2> affineExprs;
- for (int i = 0; i < inputRank; i++) {
- if (i != dim)
- affineExprs.push_back(mlir::getAffineDimExpr(i, ctxt));
- }
- auto reductionMap =
- AffineMap::get(inputRank, /*symbols=*/0, affineExprs, ctxt);
- SmallVector<AffineMap> indexingMaps{identityMap, reductionMap};
- return std::make_tuple(iteratorTypes, indexingMaps);
-}
-
-// Helper function to produce a linalg.generic that computes a reduction on
-// dimension \p dim with the operation type \p T.
-template <typename T>
-static Value reduce(OpBuilder &builder, Location loc, Value input, Value output,
- int64_t dim) {
- auto inputType = cast<ShapedType>(input.getType());
- ArrayRef<int64_t> inputShape = inputType.getShape();
- int64_t inputRank = inputShape.size();
- auto [iteratorTypes, indexingMaps] =
- computeIteratorTypesAndIndexingMaps(builder, inputRank, dim);
- assert(indexingMaps.size() == 2 &&
- "We should have two maps: 1 for the input, 1 for the output");
- assert(indexingMaps[0].isIdentity() && "input map should be identity");
-
- auto genericOp = builder.create<linalg::GenericOp>(
- loc, output.getType(), input, output, indexingMaps, iteratorTypes,
- [&](OpBuilder &b, Location loc, ValueRange args) {
- Value result = b.create<T>(loc, args[0], args[1]);
- b.create<linalg::YieldOp>(loc, result);
- });
- return genericOp.getResult(0);
-}
-
-/// Produce a linalg generic that computes the second step of the softmax
-/// decomposition: res = exp(input - max), where \p max is the max of \p input
-/// on dimension \p dim.
-static Value buildSubAndExpOp(OpBuilder &builder, Location loc, Value input,
- Value max, Value output, int64_t dim) {
- auto inputType = cast<ShapedType>(input.getType());
- ArrayRef<int64_t> inputShape = inputType.getShape();
- int64_t inputRank = inputShape.size();
- auto [iteratorTypes, indexingMaps] = computeIteratorTypesAndIndexingMaps(
- builder, inputRank, dim, /*allParallel=*/true);
- assert(indexingMaps.size() == 2 && "We should have one map for each input");
- assert(indexingMaps[0].isIdentity() && "input map should be identity");
- // Add the affine map for the output argument.
- indexingMaps.push_back(indexingMaps[0]);
- auto genericOp = builder.create<linalg::GenericOp>(
- loc, input.getType(), ValueRange{input, max}, output, indexingMaps,
- iteratorTypes, [&](OpBuilder &b, Location loc, ValueRange args) {
- Value diff = b.create<arith::SubFOp>(loc, args[0], args[1]);
- Value result = b.create<math::ExpOp>(loc, diff);
- b.create<linalg::YieldOp>(loc, result);
- });
- return genericOp.getResult(0);
-}
-
-/// Produce a linalg generic that computes the final step of the softmax
-/// decomposition.
-/// \returns linalg.generic ins(\p numerator, \p denominator) outs(\p output) {
-/// yield n / d
-/// }
-static Value buildDivOp(OpBuilder &builder, Location loc, Value numerator,
- Value denominator, Value output, int64_t dim) {
- auto inputType = cast<ShapedType>(numerator.getType());
- ArrayRef<int64_t> inputShape = inputType.getShape();
- int64_t inputRank = inputShape.size();
- auto [iteratorTypes, indexingMaps] = computeIteratorTypesAndIndexingMaps(
- builder, inputRank, dim, /*allParallel=*/true);
- assert(indexingMaps.size() == 2 &&
- "We should have one map for each input (2)");
- assert(indexingMaps[0].isIdentity() && "Numerator map should be identity");
- // Add the affine map for the output tensor.
- indexingMaps.push_back(indexingMaps[0]);
- auto genericOp = builder.create<linalg::GenericOp>(
- loc, numerator.getType(), ValueRange{numerator, denominator}, output,
- indexingMaps, iteratorTypes,
- [&](OpBuilder &b, Location loc, ValueRange args) {
- Value result = b.create<arith::DivFOp>(loc, args[0], args[1]);
- b.create<linalg::YieldOp>(loc, result);
- });
- return genericOp.getResult(0);
+TypedAttr createInitValueForReduceMaxOp(Type type, OpBuilder &b) {
+ if (isa<FloatType>(type))
+ return b.getFloatAttr(
+ type, APFloat::getSmallest(cast<FloatType>(type).getFloatSemantics()));
+ if (isa<IntegerType>(type))
+ return b.getIntegerAttr(
+ type, APInt::getSignedMinValue(type.getIntOrFloatBitWidth()));
+ return {};
+}
+
+TypedAttr createInitValueForReduceSumOp(Type type, OpBuilder &b) {
+ if (isa<FloatType>(type))
+ return b.getFloatAttr(
+ type, APFloat::getZero(cast<FloatType>(type).getFloatSemantics()));
+ if (isa<IntegerType>(type))
+ return b.getIntegerAttr(type, APInt::getZero(type.getIntOrFloatBitWidth()));
+ return {};
+}
+
+Value createLinalgReduceMaxBody(OpBuilder b, Location loc, ValueRange args,
+ Type elementTy) {
+ if (isa<FloatType>(elementTy))
+ return b.create<arith::MaxNumFOp>(loc, args[0], args[1]);
+ if (isa<IntegerType>(elementTy))
+ return b.create<arith::MaxSIOp>(loc, args[0], args[1]);
+ return {};
+}
+
+Value createLinalgReduceSumBody(OpBuilder &b, Location loc, ValueRange args,
+ Type elementTy) {
+ if (isa<FloatType>(elementTy))
+ return b.create<arith::AddFOp>(loc, args[0], args[1]);
+ if (isa<IntegerType>(elementTy))
+ return b.create<arith::AddIOp>(loc, args[0], args[1]);
+ return {};
}
// @} End helper functions for softmax decomposition.
@@ -2695,7 +2620,7 @@ static Value buildDivOp(OpBuilder &builder, Location loc, Value numerator,
/// 4. Divide z and l. This gives the N-dimensional softmax.
/// softmax = z / l
///
-FailureOr<SmallVector<Value>> SoftmaxOp::decomposeOperation(OpBuilder &b) {
+FailureOr<DecompositionResult> SoftmaxOp::decomposeOperation(OpBuilder &b) {
OpBuilder::InsertionGuard guard(b);
b.setInsertionPoint(*this);
Location loc = getLoc();
@@ -2706,32 +2631,60 @@ FailureOr<SmallVector<Value>> SoftmaxOp::decomposeOperation(OpBuilder &b) {
SmallVector<OpFoldResult> dims = tensor::getMixedSizes(b, loc, input);
Value output = getOutput();
dims.erase(dims.begin() + reductionDim);
+
// Step 1: Compute max along dim.
Value outputReduce = b.create<tensor::EmptyOp>(loc, dims, elementType);
- Value neutralForMaxF = arith::getIdentityValue(arith::AtomicRMWKind::maximumf,
- elementType, b, loc,
- /*useOnlyFiniteValue=*/true);
- Value neutralForMaxFInit =
- b.create<linalg::FillOp>(loc, Value{neutralForMaxF}, outputReduce)
- .result();
- Value max =
- reduce<arith::MaxNumFOp>(b, loc, input, neutralForMaxFInit, reductionDim);
+ auto maxFillValAttr = createInitValueForReduceMaxOp(elementType, b);
+ auto maxFillValue = b.create<arith::ConstantOp>(loc, maxFillValAttr);
+ auto neutralMaxInitOp = b.create<linalg::FillOp>(
+ loc, ValueRange{maxFillValue}, ValueRange{outputReduce});
+ Value neutralForMaxFInit = neutralMaxInitOp.result();
+
+ auto reduceMaxOp = b.create<linalg::ReduceOp>(
+ loc, input, neutralForMaxFInit, reductionDim,
+ [&](OpBuilder &nestedBuilder, Location nestedLoc, ValueRange args) {
+ auto result =
+ createLinalgReduceMaxBody(b, nestedLoc, args, elementType);
+ nestedBuilder.create<linalg::YieldOp>(nestedLoc, result);
+ });
// Step 2: Subtract max from input and exponentiate.
- Value numerator = buildSubAndExpOp(b, loc, input, max, output, reductionDim);
+ auto maxBroadcastOp = b.create<linalg::BroadcastOp>(
+ loc, reduceMaxOp.getResult(0), output, reduceMaxOp.getDimensionsAttr());
+
+ auto subOp = b.create<linalg::SubOp>(
+ loc, ValueRange{input, maxBroadcastOp.getResults().front()},
+ ValueRange{output});
+ auto expOp = b.create<linalg::ExpOp>(loc, ValueRange{subOp.getResult(0)},
+ ValueRange{output});
// Step 3: Compute sum along dim.
- Value zero = arith::getIdentityValue(arith::AtomicRMWKind::addf, elementType,
- b, loc, /*useOnlyFiniteValue=*/true);
- Value zeroInit =
- b.create<linalg::FillOp>(loc, Value{zero}, outputReduce).result();
- Value denominator =
- reduce<arith::AddFOp>(b, loc, numerator, zeroInit, reductionDim);
+ auto sumFillValAttr = createInitValueForReduceSumOp(elementType, b);
+ auto sumFillValue = b.create<arith::ConstantOp>(loc, sumFillValAttr);
+ auto neutralSumInitOp = b.create<linalg::FillOp>(
+ loc, ValueRange{sumFillValue}, ValueRange{outputReduce});
+ auto sumFilledTensor = neutralSumInitOp.result();
+ auto reduceSumOp = b.create<linalg::ReduceOp>(
+ loc, expOp.getResults(), sumFilledTensor, reductionDim,
+ [&](OpBuilder &nestedBuilder, Location nestedLoc, ValueRange args) {
+ auto result =
+ createLinalgReduceSumBody(b, nestedLoc, args, elementType);
+ nestedBuilder.create<linalg::YieldOp>(nestedLoc, result);
+ });
// Step 4: Compute softmax.
- Value result =
- buildDivOp(b, loc, numerator, denominator, output, reductionDim);
- return SmallVector<Value>{result};
+ auto sumBcastOutput = b.create<tensor::EmptyOp>(
+ loc, getOutputOperandType().getShape(), elementType);
+ auto sumBroadcastOp = b.create<linalg::BroadcastOp>(
+ loc, reduceSumOp.getResult(0), sumBcastOutput,
+ reduceSumOp.getDimensionsAttr());
+ auto divOp = b.create<linalg::DivOp>(
+ loc, ValueRange{expOp.getResult(0), sumBroadcastOp.getResults().front()},
+ ValueRange{output});
+ return DecompositionResult{{neutralMaxInitOp, reduceMaxOp, maxBroadcastOp,
+ subOp, expOp, neutralSumInitOp, reduceSumOp,
+ sumBroadcastOp, divOp},
+ {divOp.getResults().front()}};
}
//===----------------------------------------------------------------------===//
diff --git a/mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp b/mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp
index bc02788f9c441..e3f0a18a5ec2c 100644
--- a/mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp
+++ b/mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp
@@ -431,27 +431,23 @@ transform::DecomposeOp::applyToOne(transform::TransformRewriter &rewriter,
// Decompose the target operation if it implements the AggregatedOpInterface.
// Push the decomposed operations (the ones that replaces the values produced by
// \p target) in the `results`.
-DiagnosedSilenceableFailure transform::DecomposeInterfaceOp::applyToOne(
- transform::TransformRewriter &rewriter, Operation *target,
- transform::ApplyToEachResultList &results,
- transform::TransformState &state) {
- auto decomposableOp = dyn_cast<AggregatedOpInterface>(target);
- if (!decomposableOp) {
- failed(rewriter.notifyMatchFailure(target,
- "payload is not a decomposable op"));
- return emitDefaultSilenceableFailure(target);
- }
+DiagnosedSilenceableFailure
+transform::DecomposeInterfaceOp::apply(transform::TransformRewriter &rewriter,
+ TransformResults &transformResults,
+ TransformState &state) {
+ for (auto [i, target] : llvm::enumerate(state.getPayloadOps(getTarget()))) {
+ auto decomposableOp = dyn_cast<AggregatedOpInterface>(target);
+ if (!decomposableOp)
+ continue;
- FailureOr<SmallVector<Value>> maybeNewResults =
- decomposableOp.decomposeOperation(rewriter);
- if (failed(maybeNewResults))
- return emitDefaultSilenceableFailure(target);
+ FailureOr<DecompositionResult> maybeNewResults =
+ decomposableOp.decomposeOperation(rewriter);
+ if (failed(maybeNewResults))
+ return emitDefaultSilenceableFailure(target);
- rewriter.replaceOp(decomposableOp, *maybeNewResults);
- for (Value val : *maybeNewResults) {
- Operation *definition = val.getDefiningOp();
- if (definition)
- results.push_back(definition);
+ rewriter.replaceOp(decomposableOp, maybeNewResults->decomposedValues);
+ transformResults.set(cast<OpResult>(getResult(i)),
+ maybeNewResults->decomposedOps);
}
return DiagnosedSilenceableFailure::success();
}
diff --git a/mlir/lib/Dialect/Linalg/Transforms/CMakeLists.txt b/mlir/lib/Dialect/Linalg/Transforms/CMakeLists.txt
index 7e3dc56e0acdc..68582fe6cbad2 100644
--- a/mlir/lib/Dialect/Linalg/Transforms/CMakeLists.txt
+++ b/mlir/lib/Dialect/Linalg/Transforms/CMakeLists.txt
@@ -7,6 +7,7 @@ add_mlir_dialect_library(MLIRLinalgTransforms
ConvertConv2DToImg2Col.cpp
DataLayoutPropagation.cpp
DecomposeLinalgOps.cpp
+ DecomposeAggregateNamedLinalgOps.cpp
Detensorize.cpp
DropUnitDims.cpp
ElementwiseOpFusion.cpp
diff --git a/mlir/lib/Dialect/Linalg/Transforms/DecomposeAggregateNamedLinalgOps.cpp b/mlir/lib/Dialect/Linalg/Transforms/DecomposeAggregateNamedLinalgOps.cpp
new file mode 100644
index 0000000000000..e8a5b96d54d34
--- /dev/null
+++ b/mlir/lib/Dialect/Linalg/Transforms/DecomposeAggregateNamedLinalgOps.cpp
@@ -0,0 +1,62 @@
+//===- DecomposeNamedLinalgOps.cpp - Patterns to break up complex ops -----===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "mlir/Dialect/Linalg/Passes.h"
+
+#include "mlir/Dialect/Linalg/IR/Linalg.h"
+#include "mlir/Dialect/Linalg/Transforms/Transforms.h"
+#include "mlir/Transforms/Gre...
[truncated]
Force-pushed "…position implementation" from 35e8600 to 005330b (…softmax decomposition implementation).
Thanks Petr, this looks good to me (but I'll allow time for others to review, so won't approve).
The main differences I see after generalizing the named ops:
- `linalg.fill`s become a generic, but that's a minor thing and I think it's the right thing to have.
- `linalg.subf` and `linalg.exp` are not fused as before, but again, this is the correct lowering. One can fuse element-wise ops later if so inclined.
- The `linalg.reduce { maxnum }`'s init constant is wrong; it will yield a wrong max if all numbers are `<= 0`.

Oh, nice catch! Misused the smallest API :) Will fix in a bit.
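For reference, the mix-up in question, illustrated as f32 constants (a sketch built from the f32 limits, not output of this patch):

```mlir
// APFloat::getSmallest() yields the smallest *positive* subnormal, so the
// reduction computes max(x, ~1.4e-45), which is wrong whenever all x <= 0.
%bad_init  = arith.constant 1.401298e-45 : f32
// The neutral element for maxnumf is the most negative finite value (or -inf):
%good_init = arith.constant -3.40282347E+38 : f32
```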
I don't think the interface of the `decompose_interface` TransformOp is right.
See my comment on it for why, and a suggestion for a fix.
As issues around TransformOps being "vectorized" or not crop up more often, maybe @ftynse can chime in with other suggestions for the op's interface.
    let arguments = (ins TransformHandleTypeInterface:$target);
-   let results = (outs TransformHandleTypeInterface:$transformed);
+   let results = (outs Variadic<TransformHandleTypeInterface>:$transformed);
I don't think this is right. `Variadic<..>` allows instances of an op to have different numbers of return values, but for each instance the number of result values still needs to be statically known, i.e. there is always a fixed number of results in the IR.
What you are doing in the implementation of `apply()` in `LinalgTransformOps.cpp` is pushing a result for each op in the payload that was associated to `target`. Note that the number of ops in this payload is only known at runtime. Hence it had better be that the number of ops associated to `target` equals the number of results that the instance of `transform.structured.decompose_interface` expects, otherwise there will be a runtime error.
This issue also crops up in another situation: you optionally do not push a result if an op associated to `target` does not implement the required interface, i.e. you "ignore" it. Because a user is statically encoding the number of expected results, a user would always need to know upfront how many of the ops passed via `target` do or do not implement the interface. Otherwise the number of results would be wrong and there would be a runtime error.
Hence this scheme does not work in general. It would mean always needing to know statically how many decomposable ops are going to be associated to an instance of `decompose_interface`'s `target` argument at runtime.
Maybe @ftynse has a suggestion for still allowing the `decompose_interface` TransformOp to be "vectorized," but at the moment I don't see an unambiguous interface for it.
I think the interface that does work is to insist that `target`'s payload size is at most one (there are a number of TransformOps that insist on payload sizes of one/at most one). In that case you just have one return handle (so no `Variadic<..>`) which either carries all the decomposed ops or is empty in case 1) `target` was empty or 2) the associated op didn't implement the decompose interface.
Thanks for the suggestion, Rolf! Yes, this is awkward. Here's an example I wrote while testing the thing; it highlights the usage problem:
func.func @softmax(%arg0: tensor<2x16x32xf32>, %dst: tensor<2x16x32xf32>) -> tensor<2x16x32xf32> {
%out = tensor.empty() : tensor<2x16x32xf32>
%1 = linalg.softmax dimension(2) ins(%arg0 : tensor<2x16x32xf32>) outs(%out: tensor<2x16x32xf32>) -> tensor<2x16x32xf32>
%2 = linalg.softmax dimension(1) ins(%1: tensor<2x16x32xf32>) outs(%dst: tensor<2x16x32xf32>) -> tensor<2x16x32xf32>
return %2 : tensor<2x16x32xf32>
}
module attributes {transform.with_named_sequence} {
transform.named_sequence @__transform_main(%arg1: !transform.any_op {transform.readonly}) {
%2 = transform.structured.match ops{["linalg.softmax"]} in %arg1 : (!transform.any_op) -> !transform.any_op
%3, %4 = transform.structured.decompose_interface %2 : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
%5 = transform.structured.generalize %3: (!transform.any_op) -> !transform.any_op
%6 = transform.structured.generalize %4: (!transform.any_op) -> !transform.any_op
transform.yield
}
}
The other option I considered was to assign all the resulting ops to the single value the transform op returns. That had the downside of not knowing which new payload corresponds to which target, so I went this route.
Anyway, the purpose of the change was to introduce the pass. The transform op change is just a byproduct (and it seems the op was created for demonstration in the first place? I found no usage scenarios of it anywhere). I don't have an opinion on how the interface should look. I'll just list some properties to consider:
- Multiple ops in the resulting decomposition (i.e., can't go through `applyToOne`)
- Should skip ops that don't implement the interface
- Would be nice to apply to a sequence (?)

If there are no other strong opinions I'll follow the suggestion.
> The other option I considered was to assign all the resulting ops to the single value the transform op returns. That had the downside of not knowing which new payload corresponds to which target

I think this is also a valid option, namely keep the op vectorized but it's up to the user to deal with the information loss of which decomposed ops belonged to which op in `target`'s payload. For example, if you know you want to do the same thing to all decomposed ops anyway, e.g. generalize them, this would be more compact than needing to use a `transform.foreach` and stuffing the `decompose_interface` op in its region. In case the user ensures there's only one op associated to `target`, e.g. by using `transform.foreach`, then they regain the unambiguous semantics of my suggestion.
You could even make the "accumulation" behaviour opt-in by requiring the user to use a (unit-)attribute or keyword. Whether this is a pattern that is desirable for the transform dialect is, I think, best addressed by @ftynse.
Similarly, you could make the ignoring behaviour opt-in or opt-out through a keyword or attribute. Though, IMO, for this PR, it's fine to pick one behaviour and document it. Somebody can make the behaviour subject to a flag in a separate PR later.
In any case, the above is basically just bikeshedding. If you want something that can be merged quickly, you could go with my non-vectorized suggestion, which keeps to an established pattern in the Transform Dialect.
The vectorized behaviour could always be added in a separate PR.
I went with binding all the payload to a single output value (reverting the op change). This was easiest, it doesn't modify the original interface, and it has less verbose usage:
func.func @softmax(%arg0: tensor<2x16x32xf32>, %dst: tensor<2x16x32xf32>) -> tensor<2x16x32xf32> {
%out = tensor.empty() : tensor<2x16x32xf32>
%1 = linalg.softmax dimension(2) ins(%arg0 : tensor<2x16x32xf32>) outs(%out: tensor<2x16x32xf32>) -> tensor<2x16x32xf32>
%2 = linalg.softmax dimension(1) ins(%1: tensor<2x16x32xf32>) outs(%dst: tensor<2x16x32xf32>) -> tensor<2x16x32xf32>
return %2 : tensor<2x16x32xf32>
}
module attributes {transform.with_named_sequence} {
transform.named_sequence @__transform_main(%arg1: !transform.any_op {transform.readonly}) {
%2 = transform.structured.match ops{["linalg.softmax"]} in %arg1 : (!transform.any_op) -> !transform.any_op
%3 = transform.structured.decompose_interface %2 : (!transform.any_op) -> (!transform.any_op)
%5 = transform.structured.generalize %3: (!transform.any_op) -> !transform.any_op
transform.yield
}
}
Fine by me. 👍
%2 = transform.structured.match ops{["linalg.softmax"]} in %arg1 : (!transform.any_op) -> !transform.any_op
%3 = transform.structured.decompose_interface %2 : (!transform.any_op) -> !transform.any_op
%4 = transform.structured.generalize %3: (!transform.any_op) -> !transform.any_op
Why add this to the existing test?
I don't think generalization is needed for anything and it makes it harder to see what, if anything, has changed after the logic rework.
This is to show that the generic lowering is still valid. The lowering to named ops is in another test.
> generic lowering is still valid

It used to be decomposed to generic, but now the decomposition is always to named ops, and you have to rely on a separate pass to get that generic representation. So I still think it's unrelated to this test.
I think we need both. It doesn't matter to me if we create a new one for generic or rely on the new one Petr added.
To me it's a separate concern. AFAIK, decompose has never promised a specific lowering format.
If we worry about the generalization pass, then that belongs in its own unit tests.
I think it's a nice demonstration of the resulting code which is useful for this review. I agree it's not related to the test. We can remove generalization from both decomposition and transform tests before the merge.
I agree with @rengolin that we need both. MLIR could definitely use more of "integration" tests, but without them replacing unit tests.
I'm confused. Does "both" mean we want to leave all the existing tests in the PR and add one more test for the transform op without generalization? Like in:
module attributes {transform.with_named_sequence} {
transform.named_sequence @__transform_main(%arg1: !transform.any_op {transform.readonly}) {
%2 = transform.structured.match ops{["linalg.softmax"]} in %arg1 : (!transform.any_op) -> !transform.any_op
%3 = transform.structured.decompose_interface %2 : (!transform.any_op) -> (!transform.any_op)
transform.yield
}
}
To me you already have both: the old test, which you generalize and compare the delta, and the new one you added, which is just the named ops. I'm happy with the tests the way they are.
I agree with what's being proposed here (in terms of test coverage), but please make sure that we don't duplicate tests and that test files reflect what's being tested (I consider test file names to be part of the documentation).
ATM:
- This file is called "transform-op-decompose.mlir", but it's being updated to test both decomposition and generalisation.
- Another file added in this PR, "decompose-named-ops.mlir", also tests both decomposition and generalisation.

IMO, there should be a separate test file for:
- decomposition
- generalisation
- e2e/integration (decomposition -> generalisation)

This way it will be clear what is being tested and where.
EDIT: "Unresolving" to make sure that this is addressed (commented after this was resolved).
/// - `decomposedValues` contains the values corresponding to the result of the
///   aggregate operation.
struct DecompositionResult {
  SmallVector<Operation *> decomposedOps;
Are all decomposed ops Linalg ops? If so, can we use `linalg::LinalgOp` (interface) instead of the generic `Operation *`?
In the softmax case, yes. I was thinking this could eventually become a non-linalg-specific interface, so went with `Operation*`. Do you have a specific reason in mind why it should be `LinalgOp`?
I think it makes sense to be a bit more restrictive to begin with, then expand as we have more use cases. The main problem is when someone else tries to use it for a wildly different case and complains "it's not working" when in truth it was never intended to be supported.
I'm asking because this seems to be not a trivial change. `LinalgOp` is an incomplete type in this context, and a forward declaration for `DecompositionResult` makes the struct an incomplete type for the `FailureOr` template instantiation (in `AggregatedOpInterface`).
Hm, I see. I'm ok with this being `Operation *` for now, with a TODO/FIXME to make sure people are aware if they try to use it for other needs. I think the probability of anyone trying to use it outside of this scope is very small.
Added a todo.
I prefer it to be `Operation *`. That's the more general form, and changing the ABI causes breakages. There is no reason for this to always be Linalg ops.
if (definition)
  results.push_back(definition);
rewriter.replaceOp(decomposableOp, maybeNewResults->decomposedValues);
allDecomposedOps.append(maybeNewResults->decomposedOps);
I'm not a fan of mixing all results together. It basically makes the result handle useless for further composition outside of the single-payload-operand case. The most recent attempt at "fixing" this is to introduce a `UnitAttr:$flatten` whose presence explicitly requests the results to be flattened into a single list. Its absence enables a check for the operand being associated with at most one payload. This makes the behavior visible to the user and thus less surprising. When needed, they can wrap the logic in `transform.foreach`.
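A sketch of what that could look like on this op; the `flatten` spelling below is hypothetical, not an attribute this PR defines:

```mlir
// Default: require that %2 is associated with at most one payload op.
%3 = transform.structured.decompose_interface %2
    : (!transform.any_op) -> !transform.any_op
// Hypothetical opt-in: decomposed ops from all payload ops are flattened
// into the single result handle.
%4 = transform.structured.decompose_interface %2 {flatten}
    : (!transform.any_op) -> !transform.any_op
```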
@ftynse, could you please point me to the code/issue? I'm not familiar.
https://mlir.llvm.org/docs/Dialects/Transform/#transformforeach_match-transformforeachmatchop has `flatten_results`. You can `git blame` your way back to the PR that added it. The implementation is rather long; in this case it's just a matter of having an attribute and a conditional as described above.
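For illustration, the `transform.foreach` wrapping mentioned above could look roughly like this (a sketch assuming the single-result form of `decompose_interface`; `%targets` and the exact `foreach` result syntax are assumptions):

```mlir
%decomposed = transform.foreach %targets : !transform.any_op -> !transform.any_op {
^bb0(%op: !transform.any_op):
  // Each iteration sees exactly one payload op, so which decomposed ops
  // came from which target remains unambiguous.
  %d = transform.structured.decompose_interface %op
      : (!transform.any_op) -> !transform.any_op
  transform.yield %d : !transform.any_op
}
```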
Thanks! Does this PR change what Ops …
Also, what's the difference between "transform-op-decompose.mlir" and "decompose-named-ops.mlir"? Is it just "how" the decomposition is "driven" (TD vs Pass)? Couldn't that be one test instead?
Thanks for checking and for the extra context. I am just wondering: …
I am a bit confused, there's …
-        b.create<linalg::YieldOp>(loc, result);
-      });
-  return genericOp.getResult(0);
+TypedAttr createInitValueForReduceMaxOp(Type type, OpBuilder &b) {
> createInitValueForReduceMaxOp

This might be something obvious, but where does the reduction happen?
The function is used to create an init value for the reduction. The reduction op itself is created right after that. I might not understand the question though.
> The function is used to create an init value for the reduction.

What's stopping anyone from using these methods for things other than reductions? It's worth clarifying with a comment. Better still (given how small these are), add them as lambdas inside `decomposeOperation`.
Btw, shouldn't these methods be static?
I had the same thought, actually, and there's similar stuff in TOSA, so these could be some generic utilities. The values are specific to reductions, though, and can be specific to an operator (this is why I didn't pull them out into utils right away). Is there an example you have in mind where those would be useful besides reductions?
// RUN: mlir-opt %s -split-input-file -linalg-decompose-named-ops | FileCheck %s
// RUN: mlir-opt %s -split-input-file -linalg-decompose-named-ops -linalg-generalize-named-ops | FileCheck %s --check-prefix=GENERALIZECHECK
[nit] With more than one CHECK prefix, you can drop `CHECK` (which is just noise IMHO) and leverage the prefixes to better document what's tested:

// RUN: mlir-opt %s -split-input-file -linalg-decompose-named-ops | FileCheck %s -check-prefix=DECOMPOSED
// RUN: mlir-opt %s -split-input-file -linalg-decompose-named-ops -linalg-generalize-named-ops | FileCheck %s --check-prefix=GENERALIZED

I am also making the suggestion to make sure that my understanding is correct.
// CHECK-DAG: %[[D1:.+]] = tensor.empty() : tensor<2x16xf32>
// CHECK-DAG: %[[CST:.+]] = arith.constant -3.40282347E+38 : f32
-// CHECK: %[[D2:.+]] = linalg.fill ins(%[[CST]] : f32) outs(%[[D1]] : tensor<2x16xf32>) -> tensor<2x16xf32>
+// CHECK: %[[D2:.+]] = linalg.generic {indexing_maps = [#[[$MAP2]], #[[$MAP3]]],
I'm wondering what `D` stands for in these tests 🤔 In any case, this would be a bit more descriptive:

-// CHECK: %[[D2:.+]] = linalg.generic {indexing_maps = [#[[$MAP2]], #[[$MAP3]]],
+// CHECK: %[[FILL_1:.+]] = linalg.generic {indexing_maps = [#[[$MAP2]], #[[$MAP3]]],

[nit]
I preserved the original naming for the diff to be readable. I also don't find the renaming very helpful or valuable.
> I preserved the original naming for the diff to be readable.

We should avoid optimising our patches to work around the limitations of GitHub. If GitHub rendering is hard to parse, I will happily just clone a PR and review locally. Let's optimise for maintainability instead.

> I also don't find the renaming very helpful or valuable.

I disagree - IMO it's both helpful and valuable. With `linalg.fill`, it doesn't matter that much whether it's `%[[D2:.+]]` or `%[[FILL_1:.+]]` - it's clear what the Op is. But your patch replaces `linalg.fill` with `linalg.generic`, which makes reading this rather dense test even trickier. To me, more descriptive LIT variables are helpful - they make parsing tests much easier.
I am not sure why this becomes a generic op. Can we preserve the original behavior of this being a fill op?
> I am not sure why this becomes a generic op. Can we preserve the original behavior of this being a fill op?

The reason is that the decomposition now acts at the linalg level and lives up to the promise of splitting a complex aggregate op into simpler ops at the same level of abstraction (btw, this is why it does make sense to have `LinalgOp`s in the decomposition result). After the decomposition, all the ops are non-generic linalg ops. At this stage, you have all the semantics available along with the benefit of not needing to analyze affine maps (say, for fusion). After generalize you have a nice homogeneous set of generics as the last step in the test.
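For concreteness, the named-op sequence the decomposition produces for a static 2x16x32 softmax along dimension 2 looks roughly like this (a sketch with assumed SSA names, using the shorthand `linalg.reduce` body form):

```mlir
%empty = tensor.empty() : tensor<2x16xf32>
// Step 1: fill with the neutral element for max, then reduce along dim 2.
%min = arith.constant -3.40282347E+38 : f32
%fill_max = linalg.fill ins(%min : f32) outs(%empty : tensor<2x16xf32>) -> tensor<2x16xf32>
%max = linalg.reduce { arith.maxnumf } ins(%input : tensor<2x16x32xf32>)
       outs(%fill_max : tensor<2x16xf32>) dimensions = [2]
// Step 2: broadcast the max back to the input rank, subtract, exponentiate.
%max_bc = linalg.broadcast ins(%max : tensor<2x16xf32>)
          outs(%out : tensor<2x16x32xf32>) dimensions = [2]
%sub = linalg.sub ins(%input, %max_bc : tensor<2x16x32xf32>, tensor<2x16x32xf32>)
       outs(%out : tensor<2x16x32xf32>) -> tensor<2x16x32xf32>
%exp = linalg.exp ins(%sub : tensor<2x16x32xf32>)
       outs(%out : tensor<2x16x32xf32>) -> tensor<2x16x32xf32>
// Step 3: fill with zero, then sum-reduce along dim 2.
%zero = arith.constant 0.0 : f32
%fill_sum = linalg.fill ins(%zero : f32) outs(%empty : tensor<2x16xf32>) -> tensor<2x16xf32>
%sum = linalg.reduce { arith.addf } ins(%exp : tensor<2x16x32xf32>)
       outs(%fill_sum : tensor<2x16xf32>) dimensions = [2]
// Step 4: broadcast the sum and divide.
%sum_bc = linalg.broadcast ins(%sum : tensor<2x16xf32>)
          outs(%out : tensor<2x16x32xf32>) dimensions = [2]
%softmax = linalg.div ins(%exp, %sum_bc : tensor<2x16x32xf32>, tensor<2x16x32xf32>)
           outs(%out : tensor<2x16x32xf32>) -> tensor<2x16x32xf32>
```

Generalization then rewrites each of these named ops, fills included, into `linalg.generic`.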
> To me, more descriptive LIT variables are helpful - they make parsing tests much easier.

Renamed.
// -----

// COM: decomposition assumes tensors as inputs, this is just to make sure nothing breaks
COM?
The decomposition follows the op description (see LinalgOps.td) and specifies its semantics via the implementation. The implementation ends up generating semantically the same code as the previous decomposition implementation (with a minor deviation, as you noted). The transform test demonstrates the change in the generalized code.
The first one tests the new pass. The second one uses the transform interpreter and the decompose_interface op (which happen to partially rely on the same code now).
The IR presented is the IR you get by lowering PyTorch to torch-mlir.
I'd say it sets it. The implementation follows the op description, so there's no real 'change'.
Decomposition is performed by the newly introduced …
Correct.
I'll take a closer look, but the changes to the softmax op decomposition need a look. Can that be split out into a separate change, if it is indeed needed, to keep the pass addition and the return-handle change isolated from these changes?
// CHECK: } -> tensor<2x16xf32>
// CHECK: %[[D4:.+]] = linalg.generic {indexing_maps = [#[[$MAP]], #[[$MAP1]], #[[$MAP]]], iterator_types =
// CHECK-SAME: ["parallel", "parallel", "parallel"]} ins(%[[ARG0]], %[[D3]] : tensor<2x16x32xf32>, tensor<2x16xf32>)
// CHECK: %[[BCST:.+]] = linalg.generic {indexing_maps = [#[[$MAP1]], #[[$MAP]]],
Why are the broadcast ops separate?
Named ops require the same shapes, no implicit casting.
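For example (shapes assumed), the reduced max has dropped the reduction dimension, so a broadcast restores the rank before `linalg.sub` accepts it:

```mlir
// %max : tensor<2x16xf32>, %input : tensor<2x16x32xf32>; linalg.sub wants
// both inputs (and the init) at the same shape, so re-expand the max first.
%max_bc = linalg.broadcast ins(%max : tensor<2x16xf32>)
          outs(%init : tensor<2x16x32xf32>) dimensions = [2]
%sub = linalg.sub ins(%input, %max_bc : tensor<2x16x32xf32>, tensor<2x16x32xf32>)
       outs(%init : tensor<2x16x32xf32>) -> tensor<2x16x32xf32>
```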
struct DecompositionResult {
  /// TODO: can be further constrained to LinalgOp.
  SmallVector<Operation *> decomposedOps;
  SmallVector<Value> decomposedValues;
Are these effectively replacements for the original results of the operation? `decomposedValues` is a misleading name. Maybe call it `replacements`?
done
Value output = getOutput();
dims.erase(dims.begin() + reductionDim);

SmallVector<int64_t> reduceShape;
Not sure why you are splitting this out to use `int64_t` and `Value` instead of keeping it as `OpFoldResult`. This seems to be equivalent to what was there, so not sure why change it?
These are just for the sake of pleasing tensor's empty op builder API. It won't accept a vector of `OpFoldResult` together with dynamic dims.
}
auto sumBcastOutput = b.create<tensor::EmptyOp>(
    loc, getOutputOperandType().getShape(), elementType, dynDims);
auto sumBroadcastOp = b.create<linalg::BroadcastOp>(
The attempt to use named ops here is making this more verbose and confusing. I would rather stick to this not using named ops (or at least provide a way to not use named ops).
What do you mean by that? What is confusing?
I think lowering to named ops this way is not great for softmax. If anything, it is showing issues with named ops as currently defined and developed. For example, linalg allows a more succinct representation of broadcasting behavior than having to add an explicit broadcast operation. I don't think this is a great way forward. It requires everything that relies on this decomposition to do additional work to get back to the state the decomposition previously produced. That doesn't seem like a good idea to me. Maybe you can create an option that will allow you to lower softmax to named ops like you want here (and we can decide which to make the default), but changing the decomposition this way is a red flag for me.
Also, the decomposition change needs to be a separate PR and not rolled into the PR that is adding a pass for decomposition.
First, regardless of the merits of lowering `softmax` to generics or not, I disagree that a more succinct representation is always preferred. We discussed this at length on the canonicalization threads in the forum.
Second, what this PR is adding is just the semantics that:
- Decomposition breaks named ops into further named ops.
- Generalization lowers named ops into generics.

Above you mention you prefer `fill` to its generic form, even though the step that the test is doing is generalization, while here you're advocating for a generic form, even though the step is decomposition. This seems totally arbitrary to me.
The linalg dialect and its transforms should be consistent and predictable. If you want to generalize only some ops and not others, that seems to me like a local pass.
All I am saying is that today it is lowered to what I consider a more succinct representation, and with this change it does not seem straightforward to get back to the previous state. So, to not break downstream usage, it would be better to keep the current behavior as the default and introduce some optionality that allows you to lower to named ops. Going from named ops to the previous state does not seem that straightforward to me (you essentially need to know this is a softmax and handle it accordingly).
W.r.t. `fill`, ack your concern, but `fill` has always felt like a "special" op to me. To preserve the perfectly nested loop nature of linalg ops, what is

C = A * B

is converted to

C = zeros(...);
D = C + A*B;

but any sane backend needs to "fuse" the fill with the matmul operation to generate efficient code.
Anyway, that is a bit of a minor digression. My bigger concern is the change of the default that goes to named ops, which in turn enforce (what I consider outdated) explicit-broadcasting semantics (when Linalg has a perfectly succinct and unambiguous way to represent broadcasts). If named ops allowed representing broadcasting on par with what Linalg inherently allows, then there would be no issue here.
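In IR terms, the fill example reads (shapes and SSA names assumed):

```mlir
// "C = zeros(...)" followed by the accumulating "D = C + A*B"; a backend
// has to fuse the fill into the matmul to generate efficient code.
%cst0 = arith.constant 0.0 : f32
%C = linalg.fill ins(%cst0 : f32) outs(%init : tensor<4x8xf32>) -> tensor<4x8xf32>
%D = linalg.matmul ins(%A, %B : tensor<4x16xf32>, tensor<16x8xf32>)
     outs(%C : tensor<4x8xf32>) -> tensor<4x8xf32>
```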
IMO this is key - please add that in the summary.
From what I can tell, both tests verify the decomposition of:

func.func @softmax(%arg0: tensor<2x16x32xf32>, %dst: tensor<2x16x32xf32>) -> tensor<2x16x32xf32> {
  %1 = linalg.softmax dimension(2) ins(%arg0 : tensor<2x16x32xf32>) outs(%dst: tensor<2x16x32xf32>) -> tensor<2x16x32xf32>
  return %1 : tensor<2x16x32xf32>
}

Couldn't we re-use the input and the …
I know where the IR is coming from. FWIW (without links to documentation), that IR is just an implementation detail of torch-mlir. In this PR we are discussing an implementation detail of Linalg. Are you saying that the implementation in Linalg should match torch-mlir? Why? What if the implementation in torch-mlir changes? I'm trying to understand the motivation here and the overall design that we are converging towards.
Key info, please highlight in the summary.
I know what the options are, thanks. To me, your comment implies that …
Done.
Do I understand correctly that you suggest having a single lit test with the body of …
So the goal was to look into what frameworks actually do in their implementations. If it so happens that all of them lower softmax to the same sequence (and this is what we happen to have), we can set it as the default decomposition to avoid re-implementing the thing. The general idea and direction is to have an intermediate decomposition stage that deals with complex ops (such as softmax) to aid other transformations and analyses (this is also an easy route to adding more named ops, upstream and downstream: rather than implementing all the interfaces like tiling, convert the op to simpler ops and enjoy all the existing goodness). Note: I'm leaving the question of accumulation out for now; this should be addressed separately.
Done.
This just describes the transform op decomposition change (in other words, it used to produce generics + fills; now it internally produces a sequence of named ops and then runs generalization). There's no strict requirement to run generalization after the decomposition, of course.
Yes, something along those lines. Basically, IMO, we should identify a canonical way to test transformations with both "passes" and TD so that we maximise "test case" re-use.
PyTorch & torch-mlir are red herrings. This PR is about softmax. The current lowering of softmax into generics is, unsurprisingly, the same as both PyTorch and TensorFlow expect it to be. This is related to, but distinct from, lowering softmax to named ops. So, let's close this tangent, as it is irrelevant.
Absolutely not. The test where we lower to named ops was created in this PR. We verify that it does the "right thing". The old test checks the generic lowering (because it was the only thing we had), so @kurapov-peter added the generalization pass to match the old expectation. This provides us with a clear "apples to apples" comparison, and shows that the old output can no longer be attained. Most importantly, you don't get a mix of named + generics: you either get all named or all generic. This seems to be a problem for @MaheshRavishankar and I want to understand it better. My guess is that there are pattern matchers that won't work with the generic version of …
An option is to make some "special" ops never generalize, for example …
I would like to understand the next steps here. @MaheshRavishankar, could you please elaborate ^^^? Regarding the … Regarding …
There are broadly two things we need to make progress here: …
Just to map back to what I said above: we can "recognize" that it's a fill, but that seems like an unnecessary burden added to downstream users because it has been generalized too early without any control. I can go into details about why I think "fill" is special, but that's a separate issue IMO.
I want to clarify this. This is NOT implicit broadcasting. This is very much an unambiguous broadcast representation. For example, …
There is nothing ambiguous or implicit in this broadcasting. The problem with named ops is that they force all operands to be of the same rank, which is an unnecessary requirement at the Linalg level. The fix is to allow named ops to make use of the broadcast representation that Linalg inherently allows. In the name of "explicit broadcasting" we have an artificial requirement of getting all operands to the same rank, which is unnecessary IMO. Also, it is strictly easier to go from this representation to a representation that requires all operands to be of the same rank (it's essentially a lowering: you break up the operation into multiple ops). Going from a representation where all ops are "broadcasted" to the same rank to the above representation is IMO a lifting. Actually, that brings me to a potential solution: you can take the existing lowering for softmax and then add a pass to explicitly split out the broadcast and then generalize. That will get you to the state you want here?
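A sketch of that representation, with assumed shapes: the lower-rank operand is broadcast purely through its indexing map, with no separate broadcast op in the IR:

```mlir
#id    = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
#bcast = affine_map<(d0, d1, d2) -> (d0, d1)>
// %max is read at rank 2 inside a rank-3 loop nest; the missing d2 in
// #bcast is what expresses the broadcast, explicitly and unambiguously.
%0 = linalg.generic
    {indexing_maps = [#id, #bcast, #id],
     iterator_types = ["parallel", "parallel", "parallel"]}
    ins(%input, %max : tensor<2x16x32xf32>, tensor<2x16xf32>)
    outs(%out : tensor<2x16x32xf32>) {
^bb0(%in: f32, %m: f32, %o: f32):
  %diff = arith.subf %in, %m : f32
  linalg.yield %diff : f32
} -> tensor<2x16x32xf32>
```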
Ok, I see. So this position is the opposite of what I'm proposing: changing the default decomposition to target named ops (note that this has nothing to do with generalization). Here I'm summarizing the arguments for preserving the status quo and against it. Cons of changing the default:
1. Downstreams relying on the current mixed output (generics + fills) would need additional steps to reach the same state.
Pros of changing the default:
Mitigating con no. 1: even though reaching the same result as the mixed decomposition requires additional steps, those are existing upstream transformations. Hence, the downstream changes won't be a significant burden. I suggest we make a decision on the direction here @ftynse, @nicolasvasilache, @dcaballe, @rengolin.
I think you missed the point. The proposed decomposition only converts an aggregate operation into a sequence of non-aggregate ones. This has nothing to do with generalization. Downstreams don't need to recognize a `linalg.fill`.
Same here. Generalized IR with broadcasts is not the target state. The target is a sequence of named ops.
I have been trying to find a way to help land this PR without asking for too much "as a precursor" work, but given that there hasn't been much change in the approach, the real issues IMO are two:
1. A way to control the (de)composition.
2. The role of named ops and broadcasting.
I don't agree with this characterization. If you want to lower to named ops, which you then generalize anyway, representing broadcasts more succinctly is not a fusion IMO. This seems like it is transplanting ideas from tosa/torch etc. into Linalg. There you need to have broadcast as a separate operation; you don't need that in Linalg. I wouldn't characterize it as a fusion (rather, tosa/torch are artificially forcing front-ends/lowerings/programmers to introduce a broadcast, since they don't have mechanisms to represent broadcasts effectively).
Again, I don't agree that this is implicit casting. This is very much an explicit representation of broadcasting behavior. And if you think the downstream changes to get back to the current state are not a significant burden, please add that to your PR and then let's discuss how to package it from a user perspective. I could very well come and change the behavior back because it "suits my need", and we would never reach a stable state.
Again, I disagree. Representing broadcasting semantics more succinctly is not about it being an "aggregate" op; rather, there should never have been a need for a separate broadcast op in the first place.
Flyby review of a conversation that appears to be looping.

For number 1 (controllable (de)composition) -- big +1. This is how PyTorch does it, and it is a super power (and a big part of what is making things attractive there). Basically, you end up with a way to say what your backend prefers, and the framework gets you as close to that as it can. It is a practical way to get out of the issue that really, no one agrees on every detail of this stuff (and never will / there is no best).

For number 2 -- I'm watching this same basic discussion loop on multiple threads and PRs (basically the role of named ops and broadcasting). We're using sloppy language (calling it cast, explicit, implicit, fusion, etc.), so I'm not going to add to that. But it is quite clear to me that there are a couple of opposing viewpoints on this being ground out patch by patch (with the usual amount of friction that entails). My tendency is to side with Mahesh's viewpoint on this -- not because of upstream/downstream/whatever -- but because that viewpoint is more compatible with all of the transformations that we should be able to do on this opset (and I've lived too many lives with the really substandard backdrop of trying to use the fixed-function op libraries of the frameworks for transformations and optimizations). But if I squint, I can see the value in the "named op everything" viewpoint iff it is part of a holistic design supporting configurable specialization and robust generalization.

I don't want to litigate any of this on a PR like this, but I do think there are a couple of broader discussions here that we'd be better off having.
I actually don't mind it, as long as we agree on the direction. After revisiting our discussion, @MaheshRavishankar, I think we are talking about a similar end state using different language. I'll try to confirm that below.
Agree. This is a more generic description of what I called partial generalization to reach the same end state as the current decomposition. How about we make this the first step? I can start with an RFC to collect the requirements, and we can team up on the design/implementation.
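For that RFC discussion, a rough sketch of what driving the decomposition under user control could look like with the transform dialect (everything beyond the existing `transform.structured.decompose_interface` op name, including the exact result arity, is an assumption):

```mlir
// Hypothetical driver: match softmax ops and decompose only those, leaving
// everything else (including any later generalization) up to the user.
module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(
      %root: !transform.any_op {transform.readonly}) {
    %softmax = transform.structured.match ops{["linalg.softmax"]}
                 in %root : (!transform.any_op) -> !transform.any_op
    %decomposed = transform.structured.decompose_interface %softmax
                    : (!transform.any_op) -> !transform.any_op
    transform.yield
  }
}
```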
Right, this is what I referred to as implicit casting. It is less clear to me whether it is a good thing, but, again, I am happy to work on it if there's broad agreement. Here I might be missing something though: I see you suggesting this would be a solution, and at the same time you disagree and call it an "explicit representation of broadcasting behavior". This looks contradictory to me. Still, I assume we both mean "named ops can accept tensors of different ranks, and the decomposition does not produce an actual broadcast op".
@stellaraccident, right, the PR is a naive attempt to go there. I assumed that this was an agreed-upon direction.
This is the key here: define the semantics of the named ops. Landing the PR is secondary. Can we agree that, IFF we add broadcast/transpose semantics to the named ops, we should decompose softmax into those instead of generics? How we get there is a matter for another RFC, but I want to make sure our efforts there will lead to this decision being agreed on by consensus.
Let's leave the named ops discussion aside (there is an ongoing discussion here: https://discourse.llvm.org/t/rfc-transpose-attribute-for-linalg-matmul-operations/80092/36?u=maheshravishankar). But for this PR, let's maybe take a break; an RFC to allow controlling the decomposition would be great. I confess I have no great ideas to suggest, just some vague ones, so I am at a loss to really suggest how to do that.
FYI, we're working on a simplification of named ops with affine maps to avoid this problem, and I believe this is the solution for the current problem.
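As a sketch of that direction (syntax assumed from the linked RFC discussion, not a committed design): a named matmul carrying its own indexing maps can express a transposed operand without a separate op.

```mlir
// Hypothetical named matmul with explicit indexing maps: the LHS is
// consumed in transposed (k, m) order, with no separate linalg.transpose.
%0 = linalg.matmul
       indexing_maps = [affine_map<(m, n, k) -> (k, m)>,
                        affine_map<(m, n, k) -> (k, n)>,
                        affine_map<(m, n, k) -> (m, n)>]
       ins(%lhs, %rhs : tensor<64x32xf32>, tensor<64x128xf32>)
       outs(%init : tensor<32x128xf32>) -> tensor<32x128xf32>
```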
After some thought and discussion, my view on this has changed. I tried to write a proposal for how a mechanism for generic decompositions should look. The more detail I add, the more it resembles the regular rewrite patterns. At this point, it makes no sense to me to introduce yet another, very similar mechanism (that is restricted to a specific interface). If we are not changing the default decomposition, there's not much value in having additional ones upstream; those can exist as rewrites downstream. I'm closing this. Please let me know if there's anything I'm missing or if there's still interest in additional decompositions.
…position implementation

- Make `AggregatedOpInterface` return a `DecompositionResult`, similar to the tiling interface. This is to communicate the decomposition sequence nicely (e.g., useful for transform dialect, see below).
- Update the `DecomposeInterfaceOp` implementation. This removes code duplication between the generalization pass and decomposition implementation - now aggregate ops are decomposed first and then generalized.