Fold standalone linalg.fill ops into flow.tensor.splat ops #5614
Conversation
This allows us to use DMA instead of kernels for pure data fills. This is another step towards performance: it further decreases the number of dispatches for MobileNetV2 from 94 to 76, and reduces the latency by 2ms on Galaxy S20 (Mali G77).
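For illustration, the rewrite this patch performs looks roughly like the following (a hypothetical sketch; op syntax follows the MLIR/IREE of the time and may differ in detail):

```mlir
// Before: a standalone linalg.fill writing a single value into a whole
// tensor. Without this patch it would end up as its own dispatch region,
// i.e. a kernel launch just to memset a buffer.
%zero = constant 0.0 : f32
%init = linalg.init_tensor [1, 225, 225, 3] : tensor<1x225x225x3xf32>
%fill = linalg.fill(%zero, %init)
    : f32, tensor<1x225x225x3xf32> -> tensor<1x225x225x3xf32>

// After: a flow.tensor.splat, which the runtime can service with a
// DMA-style buffer fill instead of launching a kernel.
%splat = flow.tensor.splat %zero : tensor<1x225x225x3xf32>
```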
Awesome! And this doesn't even have #5410 - when that's done it should help even more!
iree/compiler/Dialect/Flow/Transforms/DispatchLinalgOnTensors.cpp
// region for it; just use flow.tensor.splat so we can leverage DMA
// functionalities.
Location loc = op->getLoc();
if (auto fillOp = dyn_cast<linalg::FillOp>(op)) {
This doesn't seem to be the right place to do this. It's not actually creating a dispatch region :) .
Maybe move this to the ConvertToFlowTensorOps pass. Make that pass operate in two modes, before dispatch region creation and after (using a flag). The same pass can be run before and after dispatch region creation. This fill conversion will be added to the "after" path.
Yup, I was in the middle of addressing this comment. (That's why I haven't re-requested review. ;-P)
It's fine for me to move this to ConvertToFlowTensorOps.cpp, and it actually results in slightly better code structure, I think. But one concern I have is that it assumes all linalg.fill ops with linalg users will be fused down the pipeline, while the original placement was more generic as a catch-all sink. The assumption is probably true, though, as I cannot immediately come up with a case where it isn't. So moved to ConvertToFlowTensorOps.cpp. :)
Ah, I think GitHub sent me a notification and I thought this was ready for review. But I still have some comments below.
///
/// It assumes linalg.fill ops that have linalg op users can be fused with
/// their users, so those cases aren't supported here.
struct LinalgFillToFlowTensorSplat final
Sorry for the repeated nag on this one, but adding it by default here (when run before DispatchLinalgOnTensors) will avoid any fusion of fill with other ops, and fusing is better than always doing the fill on the whole buffer. My comment from last time was that the same pass should be controlled via a flag to run "before" or "after" dispatch region formation. All current patterns are to be run before; the fill conversion can be made to run after.
Having the pattern run after the DispatchLinalgOnTensors pass would mean we need to first identify and exclude the standalone linalg.fill ops in DispatchLinalgOnTensors (because otherwise the linalg.fill op would be put into its own dispatch region, as shown by the existing test) to allow the ConvertToFlowTensorOps-after-dispatch invocation to kick in later. So that's coupling two passes and introducing quite a lot of boilerplate code for just a canonicalization-like direct 1:1 op rewrite. Frankly, I'm not so sure that's better.
No, you just need to add a flag to the pass: isRunBeforeDispatchRegionFormation.
If isRunBefore* is true, use the patterns that exist in that pass.
If isRunBefore* is false, use the pattern to convert fill to flow.*.
It's essentially two passes in one, but it will collect all the conversions to flow.* ops in one place.
Then the pass pipeline in Flow is:
convertToFlowTensorOps(/*isRunBefore=*/true);
..
createDispatchRegions
..
convertToFlowTensorOps(/*isRunBefore=*/false);
Yup, I understand the part that we can put them into the same pass and have flags controlling it (though I think it's better to have two passes, as we don't share anything here, and I think we generally don't want .. flags ;-P). That's a minor issue to me. My main point is that we still need to identify and exclude the standalone linalg.fill ops in DispatchLinalgOnTensors to make sure they can fall through to be handled by the second invocation of the ConvertToFlowTensorOps pass; so we are gonna have the check in DispatchLinalgOnTensors anyway..
> Yup, I understand the part that we can put them into the same pass and have flags controlling it (though I think it's better to have two passes, as we don't share anything here, and I think we generally don't want .. flags ;-P). That's a minor issue to me.

Acknowledged that it's a minor issue, but I don't see a point in having all the boilerplate code and two places in the code where conversion happens. It's a flag that is controlled from the pass pipeline in C++ code, not a flag added to a benchmarking pipeline to conditionally use some feature for performance on a particular benchmark. So all I am saying is that flags come in different flavors.

> My main point is that we still need to identify and exclude the standalone linalg.fill ops in DispatchLinalgOnTensors to make sure they can fall through to be handled by the second invocation of the ConvertToFlowTensorOps pass; so we are gonna have the check in DispatchLinalgOnTensors anyway..

Good point; that is easy. Just need to add linalg::FillOp here: https://github.com/google/iree/blob/main/iree/compiler/Dialect/Flow/Transforms/DispatchLinalgOnTensors.cpp#L234. It will be treated as an op that is always inlined into the dispatch region. If anything is left out, it will be picked up by the second invocation of ConvertToFlow.
> Good point; that is easy. Just need to add linalg::FillOp here: https://github.com/google/iree/blob/main/iree/compiler/Dialect/Flow/Transforms/DispatchLinalgOnTensors.cpp#L234. It will be treated as an op that is always inlined into the dispatch region. If anything is left out, it will be picked up by the second invocation of ConvertToFlow.

I'm not sure that would work. IIUC, it will force fusing the linalg.fill with its consumers, even for the cases where we shouldn't; e.g., it will generate the following:
flow.dispatch.workgroups {
%0 = linalg.fill ... : tensor<1x225x225x3xf32>
%1 = subtensor_insert %input into %0[0, 0, 0, 0] [1, 224, 224, 3] [1, 1, 1, 1]
: tensor<1x224x224x3xf32> into tensor<1x225x225x3xf32>
}
(A linalg.fill and then a subtensor_insert into a subrange of it is actually the exact motivating pattern for this change.)
The above fusion is problematic because at the moment we use linalg.copy for subtensor_insert, and the above dispatch region would then contain multiple linalg ops with different problem sizes, which will mess up distribution. Also, after fusing them, there won't be a standalone linalg.fill left that the ConvertToFlowTensorOps invocation after DispatchLinalgOnTensors can pick up. So to make it work, more special checks in the DispatchLinalgOnTensors pass, or even tweaks across the whole pipeline, will be needed. That seems to me even more complicated than just 0cc653f.
Actually, coming to your original point here:

> adding it by default here (when run before DispatchLinalgOnTensors) will avoid any fusion of fill with other ops.

I'm not sure I follow exactly. The implementation in 91d46f4 checks that we don't have linalg users. If that's true, there won't be any fusion opportunities left for the fill (as fusion relies on linalg ops). Am I missing something here?
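To make the distinction being discussed concrete, here is a hypothetical sketch of the two cases (illustrative only, using the op syntax of the time): in the first the fill should be left alone for fusion; in the second there is no linalg user, so it can safely become a splat.

```mlir
// Case 1: the fill's only user is a linalg op (it initializes a matmul
// accumulator). Leave it alone so dispatch region formation can fuse it.
%acc = linalg.fill(%zero, %init0) : f32, tensor<4x4xf32> -> tensor<4x4xf32>
%res = linalg.matmul ins(%a, %b : tensor<4x8xf32>, tensor<8x4xf32>)
                     outs(%acc : tensor<4x4xf32>) -> tensor<4x4xf32>

// Case 2: the fill feeds only a non-linalg op (the padding pattern that
// motivated this change). No fusion opportunity exists, so it can be
// converted to flow.tensor.splat.
%pad = linalg.fill(%zero, %init1)
    : f32, tensor<1x225x225x3xf32> -> tensor<1x225x225x3xf32>
%out = subtensor_insert %input into %pad[0, 0, 0, 0] [1, 224, 224, 3] [1, 1, 1, 1]
    : tensor<1x224x224x3xf32> into tensor<1x225x225x3xf32>
```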
LogicalResult matchAndRewrite(linalg::FillOp fillOp,
                              PatternRewriter &rewriter) const override {
  for (Operation *userOp : fillOp->getUsers()) {
Sorry, a follow-up from the previous comment: we can drop this check if we run the pass both before and after. This check is inherently hard to maintain.
Abbreviated Benchmark Summary
@ commit 9b0b3ba68d384b92d8d1a49fa9195f6bfe0684c5 (vs. base 268a30561472401c2fb83ba5e6c5884939b375d0)
Regressed Benchmarks 🚩
[Top 3 out of 4 benchmark results shown]
Improved Benchmarks 🎉
[Top 3 out of 5 benchmark results shown]
I'm trying to land this before being out of office, but it looks like I'm not gonna make it, because I'm now hitting other failures in VMVX/Dylib: https://source.cloud.google.com/results/invocations/bbd2da7c-0003-43ae-85c9-5b7356226012/targets/iree%2Fgcp_ubuntu%2Fbazel%2Flinux%2Fx86-swiftshader%2Fcore%2Fpresubmit/log
It only happens for VMVX/Dylib.
@benvanik: do you know what might be wrong here?
This one is on the LLVM path.
Thanks Lei, I'll take this one and land it.
Awesome, thanks Mahesh! :D I can go and be OOO peacefully now. :)
A flow.tensor.clone op by definition should have its operand and result of the same shape. However, when mapping to buffers, we could have the case where the operand is a constant subspan. Then the source buffer will be the constant pool buffer, which can be larger than the result buffer.
Actually, figured it out. It's because the buffer size when converting
All blocking issues have been addressed; this patch is ready to go. :) As shown by #5614 (comment), this is a universally good performance improvement: 6 tracked benchmarks see a > 5% latency decrease. @MaheshRavishankar, how about we land this now, and then you can feel free to improve it in follow-up patches? (Enabled auto-merge for it.)
DenseElementsAttr requires a static shape; we will hit assertions when trying to fold a dynamically shaped flow.tensor.splat.
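For example (a hypothetical IR sketch): a splat with a fully static shape can fold into a constant backed by a splat DenseElementsAttr, but a dynamically shaped one has no static tensor type for the attribute to carry, so the folder must bail out on it.

```mlir
// Static shape: foldable into a splat DenseElementsAttr-backed constant.
%0 = flow.tensor.splat %value : tensor<4x4xf32>

// Dynamic shape: the dimension %d is only known at runtime, so there is no
// static type for a DenseElementsAttr; folding must be skipped here.
%1 = flow.tensor.splat %value : tensor<?x4xf32>{%d}
```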