
Conversation

sdalvi-quic commented Sep 10, 2025

The pattern MoveInitOperandsToInput in LinalgFoldUnitExtentDimsPass moves init operands to the inputs to ensure that element-wise ops are canonicalized. I need to run the split-reduction optimization before LinalgFoldUnitExtentDimsPass. Split reduction optimizes the Linalg op by reading from the destination operand, but the MoveInitOperandsToInput pattern reverts that and canonicalizes the element-wise op instead. This leads to a 10x performance regression. I propose guarding the MoveInitOperandsToInput pattern to allow more control over enabling/disabling it.

IR after Split-Reduction and before LinalgFoldUnitExtentDimsPass:
[image]

IR after LinalgFoldUnitExtentDimsPass:
[image]

…entDimsPass, as it causes regression in performance.

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

sdalvi-quic (Author) commented Sep 10, 2025

Hi @MaheshRavishankar, @matthias-springer, @hanhanW,
Can you help me understand why the pattern MoveInitOperandsToInput was needed?
Why was it included in the LinalgFoldUnitExtentDimsPass rather than in a separate transformation pass?

CC: @jverma-quic, @javedabsar1, @aankit-quic, @muthubaskaran

llvmbot (Member) commented Sep 10, 2025

@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-mlir-linalg

Author: None (sdalvi-quic)

Changes



Full diff: https://github.com/llvm/llvm-project/pull/157941.diff

2 Files Affected:

  • (modified) mlir/include/mlir/Dialect/Linalg/Passes.td (+4-1)
  • (modified) mlir/lib/Dialect/Linalg/Transforms/DropUnitDims.cpp (+3-1)
diff --git a/mlir/include/mlir/Dialect/Linalg/Passes.td b/mlir/include/mlir/Dialect/Linalg/Passes.td
index 44da2965e6892..7cd8da6643400 100644
--- a/mlir/include/mlir/Dialect/Linalg/Passes.td
+++ b/mlir/include/mlir/Dialect/Linalg/Passes.td
@@ -145,7 +145,10 @@ def LinalgFoldUnitExtentDimsPass : Pass<"linalg-fold-unit-extent-dims", ""> {
   let options = [
     Option<"useRankReducingSlices", "use-rank-reducing-slices", "bool",
            /*default=*/"false",
-           "Generate rank-reducing slices instead of reassociative reshapes">
+           "Generate rank-reducing slices instead of reassociative reshapes">,
+    Option<"enableMoveInitOperandsToInput", "enable-move-init-operands-to-input", "bool",
+           /*default=*/"true",
+           "Enable MoveInitOperandsToInputPattern transformation">
   ];
   let dependentDialects = [
     "linalg::LinalgDialect", "affine::AffineDialect", "memref::MemRefDialect"
diff --git a/mlir/lib/Dialect/Linalg/Transforms/DropUnitDims.cpp b/mlir/lib/Dialect/Linalg/Transforms/DropUnitDims.cpp
index 22690daa4f9e1..ba466ed3df5cd 100644
--- a/mlir/lib/Dialect/Linalg/Transforms/DropUnitDims.cpp
+++ b/mlir/lib/Dialect/Linalg/Transforms/DropUnitDims.cpp
@@ -868,7 +868,9 @@ struct LinalgFoldUnitExtentDimsPass
           RankReductionStrategy::ExtractInsertSlice;
     }
     linalg::populateFoldUnitExtentDimsPatterns(patterns, options);
-    populateMoveInitOperandsToInputPattern(patterns);
+    if (enableMoveInitOperandsToInput) {
+      populateMoveInitOperandsToInputPattern(patterns);
+    }
     (void)applyPatternsGreedily(op, std::move(patterns));
   }
 };
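
For reference, a lit-style usage sketch of the new option (a hypothetical test line, not part of the patch; the flag names follow the pass and Option declarations in the diff above):

// RUN: mlir-opt %s --linalg-fold-unit-extent-dims="enable-move-init-operands-to-input=false"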

MaheshRavishankar (Contributor) commented

I'll take a look in more detail soon, but the general advice is: don't use the passes as-is. From my perspective, passes in MLIR are mostly for testing, especially this pass. It is better to write your own pass using the patterns you want.

MaheshRavishankar (Contributor) left a comment

Need time to review

sdalvi-quic (Author) commented

“Ping”: just a gentle follow-up on the PR. Could you please take a look when you get a chance? I'd also appreciate your help in clarifying the query I've raised in the comments.

Hi @MaheshRavishankar, @matthias-springer, @hanhanW, can you help me understand why the pattern MoveInitOperandsToInput was needed? Why was it included in the LinalgFoldUnitExtentDimsPass rather than in a separate transformation pass?

CC: @jverma-quic, @javedabsar1, @aankit-quic, @muthubaskaran

MaheshRavishankar (Contributor) commented

What happened here? Now I just see the changes to add a pass option...

sdalvi-quic (Author) commented

Hi @MaheshRavishankar, I might have misunderstood your comment.

This PR adds an option to guard the MoveInitOperandsToInput pattern in the LinalgFoldUnitExtentDimsPass. The reason is that the pattern attempts to canonicalize element-wise ops, thereby reverting prior optimizations, which has led to performance regressions in some cases.
By introducing this option, we gain the flexibility to enable or disable the pattern depending on the use case. I've added more context in the PR description.


javedabsar1 (Contributor) left a comment

LGTM. Please give a couple of days for others to agree/disagree before landing.

Option<"enableMoveInitOperandsToInput", "enable-move-init-operands-to-input", "bool",
/*default=*/"true",
"Enable MoveInitOperandsToInputPattern transformation">
];
Comment on this diff (Contributor):

As you have made the default 'true', this is NFC for if (enableMoveInitOperandsToInput) {.
With that context, LGTM.

sdalvi-quic (Author) replied:

Yes, thank you!

@javedabsar1 javedabsar1 changed the title Flag to guard MoveInitOperandsToInput in LinalgFoldUnitExtendDimsPass [mlir][linalg] Flag to guard MoveInitOperandsToInput in LinalgFoldUnitExtendDimsPass Sep 19, 2025
MaheshRavishankar (Contributor) commented

Ok, actually this is also related to another PR I was reviewing. First of all, this input is wrong:

%0 = linalg.generic {
    indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>],
    iterator_types = ["parallel"]}
    ins(%input : tensor<?xf32>) outs(%output : tensor<?xf32>) {
  ^bb0(%b0 : f32, %b1 : f32):
    %1 = arith.maximumf %b0, %b1 : f32
    linalg.yield %1 : f32
} -> tensor<?xf32>

You are reading the outs value for a parallel operation. In my book, that is undefined behavior. If anything, the pattern you are masking off is actually trying to be too smart by rewriting this into the correct representation:

%empty = tensor.empty(...) : tensor<?xf32>
%0 = linalg.generic {
    indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>,
                     affine_map<(d0) -> (d0)>],
    iterator_types = ["parallel"]}
    ins(%input, %output : tensor<?xf32>, tensor<?xf32>) outs(%empty : tensor<?xf32>) {
  ^bb0(%b0 : f32, %b1 : f32, %b2 : f32):
    %1 = arith.maximumf %b0, %b1 : f32
    linalg.yield %1 : f32
} -> tensor<?xf32>

So before we go ahead with this change... I'd strongly suggest looking at your input. It seems off to me.

sdalvi-quic (Author) commented

Thank you for raising the concern.
For the input IR, we're initializing the outs operand (%output) outside the linalg.generic op before using it, to ensure value correctness is maintained; see the sketch below.
I hope this aligns with what you were referring to in your comment.
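
For illustration, a minimal hypothetical sketch of that setup (the zero fill value, %input, and the 1-D shapes are assumptions, mirroring the snippets above): %output is materialized and initialized before being passed as the outs operand.

%c0 = arith.constant 0 : index
%d0 = tensor.dim %input, %c0 : tensor<?xf32>  // size of the dynamic dim
%cst = arith.constant 0.0 : f32               // assumed init value
%init = tensor.empty(%d0) : tensor<?xf32>
%output = linalg.fill ins(%cst : f32) outs(%init : tensor<?xf32>) -> tensor<?xf32>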


MaheshRavishankar (Contributor) commented

Of course, I understand that's how it works for your full program, but from the operation's perspective there is no guarantee that it will maintain/use the value of outs. It basically conflates the "meaning" of outs: it makes the outs always read the output, while it might be useful for analysis to say that, for an operation with all parallel iterator types, only the ins are read. At the very least, it's not best practice.

sdalvi-quic (Author) commented

I see.

I checked a few MLIR tests and noticed that we do read from 'outs'. For example, in mlir/test/Dialect/Linalg/transform-tile-reduction.mlir, the generated IR reads from the 'outs' operand inside a linalg.generic with parallel iterators.

I see another test that does the same: mlir/test/Dialect/Linalg/transform-op-specialize.mlir.

Please let me know if I am missing something; otherwise, can we proceed with the merge, as it is NFC?

sdalvi-quic (Author) commented Oct 1, 2025

After discussion with the team, we are planning to change the IR so that, instead of reading from the 'outs', we add the variable to both ins and outs and read it from the ins. This doesn't break the semantics, doesn't add redundant new tensors (memory allocation and copy) in the loop, and doesn't hurt the performance numbers. Adding an example for reference; see the sketch below.

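A hypothetical reconstruction of that example, based on the description above and mirroring the earlier snippets (the op and shapes are assumptions): %output appears in both ins and outs, and the body reads the ins block argument %b1, never the outs argument %b2.

%0 = linalg.generic {
    indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>,
                     affine_map<(d0) -> (d0)>],
    iterator_types = ["parallel"]}
    ins(%input, %output : tensor<?xf32>, tensor<?xf32>) outs(%output : tensor<?xf32>) {
  ^bb0(%b0 : f32, %b1 : f32, %b2 : f32):
    // %b2 (the outs argument) is intentionally unused; the value is read via %b1.
    %1 = arith.maximumf %b0, %b1 : f32
    linalg.yield %1 : f32
} -> tensor<?xf32>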

I will close this ticket. Thank you for the valuable input.

sdalvi-quic closed this Oct 1, 2025