Fold standalone linalg.fill ops into flow.tensor.splat ops #5614

Merged · 13 commits · Jul 22, 2021

Conversation

antiagainst
Contributor

This allows us to use DMA instead of kernels for pure data fills. This is
another step towards performance: it further decreases the number of
dispatches for MobileNetV2 from 94 to 76 and reduces the latency by 2ms on
Galaxy S20 (Mali G77).

@benvanik
Collaborator

Awesome! And this doesn't even have #5410 - when that's done it should help even more!

// region for it; just use flow.tensor.splat so we can leverage DMA
// functionalities.
Location loc = op->getLoc();
if (auto fillOp = dyn_cast<linalg::FillOp>(op)) {
Contributor

This doesn't seem to be the right place to do this. It's not actually creating a dispatch region. :)

Maybe move this to the ConvertToFlowTensorOps pass. Make that pass operate in two modes, before and after dispatch region creation (using a flag). The same pass can then be run before and after dispatch region creation, and this fill conversion would be added to the "after" path.

Contributor Author

Yup. I was in the middle of addressing this comment. (That's why I haven't re-requested review. ;-P)

It's fine for me to move this to ConvertToFlowTensorOps.cpp, and I think it actually results in slightly better code structure. One concern I have is that it assumes all linalg.fill ops with linalg users will be fused later in the pipeline, whereas the original placement was a more generic catch-all sink. But the assumption is probably true, as I cannot immediately come up with a case where it isn't. So I moved it to ConvertToFlowTensorOps.cpp. :)

Contributor

Ah, I think GitHub sent me a notification and I thought this was ready for review. But I still have some comments below.

@antiagainst added the buildkite:benchmark-android label Jun 11, 2021
///
/// It assumes linalg.fill ops that have linalg op users can be fused with
/// their users, so those cases aren't supported here.
struct LinalgFillToFlowTensorSplat final
Contributor

Sorry for the repeated nag on this one, but adding it by default here (when run before DispatchLinalgOnTensors) will avoid any fusion of fill with other ops, and fusing is better than always performing the fill over the whole buffer. My comment from last time was that the same pass should be controlled via a flag to run "before" or "after" dispatch region formation. All current patterns would run before; the fill conversion can run after.

Contributor Author

Having the pattern run after the DispatchLinalgOnTensors pass would mean we need to first identify and exclude the standalone linalg.fill ops in DispatchLinalgOnTensors (because otherwise the linalg.fill op would be put into its own dispatch region, as shown by the existing test) to allow the ConvertToFlowTensorOpsAfterDispatch pass to kick in later. So that's coupling two passes and introducing quite a lot of boilerplate code for what is just a canonicalization-like direct 1:1 op rewrite. Frankly, I'm not so sure that's better.

Contributor

No, you just need to add a flag to the pass: isRunBeforeDispatchRegionFormation.

If isRunBefore* is true, use the patterns that exist in that pass.
If isRunBefore* is false, use the pattern to convert fill to flow.*.

It's essentially two passes in one, but it will collect all the conversions to flow.* ops in one place.

Then the pass pipeline in Flow is:

convertToFlowTensorOps(/*isRunBefore=*/true);
..
createDispatchRegions
..
convertToFlowTensorOps(/*isRunBefore=*/false);
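
For illustration, a minimal sketch of what such a flag-controlled pass could look like (names such as runBeforeDispatchRegionFormation and populateBeforeDispatchPatterns are placeholders, not the actual IREE API):

// Sketch only: one pass, two modes, selected from the pass pipeline in C++.
struct ConvertToFlowTensorOpsPass
    : public PassWrapper<ConvertToFlowTensorOpsPass, OperationPass<FuncOp>> {
  explicit ConvertToFlowTensorOpsPass(bool runBeforeDispatchRegionFormation)
      : runBeforeDispatchRegionFormation(runBeforeDispatchRegionFormation) {}

  void runOnOperation() override {
    MLIRContext *context = &getContext();
    RewritePatternSet patterns(context);
    if (runBeforeDispatchRegionFormation) {
      // All existing tensor-op conversion patterns run in this mode.
      populateBeforeDispatchPatterns(patterns, context);  // placeholder helper
    } else {
      // Only leftover standalone ops, e.g. linalg.fill -> flow.tensor.splat.
      patterns.insert<LinalgFillToFlowTensorSplat>(context);
    }
    (void)applyPatternsAndFoldGreedily(getOperation(), std::move(patterns));
  }

  bool runBeforeDispatchRegionFormation;
};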

Contributor Author

Yup, I understand that we can put them into the same pass with a flag controlling the mode (though I think it's better to have two passes, as we don't share anything here, and I think we generally don't want... flags ;-P). That's a minor issue to me. My main point is that we still need to identify and exclude the standalone linalg.fill ops in DispatchLinalgOnTensors to make sure they fall through to be handled by the second invocation of the ConvertToFlowTensorOps pass; so we are gonna have the check in DispatchLinalgOnTensors anyway.

Contributor

Yup, I understand that we can put them into the same pass with a flag controlling the mode (though I think it's better to have two passes, as we don't share anything here, and I think we generally don't want... flags ;-P). That's a minor issue to me.

I acknowledge that it's a minor issue, but I don't see the point in having all the boilerplate code and two places in the code where the conversion happens. It's a flag that is controlled from the pass pipeline in C++ code, not a flag added to the benchmarking pipeline to conditionally use some feature for performance on a particular benchmark. So all I am saying is that flags come in different flavors.

My main point is that we still need to identify and exclude the standalone linalg.fill ops in DispatchLinalgOnTensors to make sure they fall through to be handled by the second invocation of the ConvertToFlowTensorOps pass; so we are gonna have the check in DispatchLinalgOnTensors anyway.

Good point, that is easy. Just need to add linalg::FillOp here: https://github.com/google/iree/blob/main/iree/compiler/Dialect/Flow/Transforms/DispatchLinalgOnTensors.cpp#L234. It will be treated as an op that is always inlined into the dispatch region. If anything is left out, it will be picked up by the second invocation of ConvertToFlow.
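
For illustration only, the suggestion amounts to something like the following sketch (the predicate name and the other ops in the list are assumptions, not the actual code at the linked location):

// Hypothetical sketch: treat linalg.fill as an op that is always cloned into
// the dispatch regions of its consumers instead of forming its own dispatch.
static bool isAlwaysClonedIntoDispatchOp(Operation *op) {
  return isa<linalg::FillOp, linalg::InitTensorOp, tensor::ExtractOp>(op);
}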

Contributor Author

Good point, that is easy. Just need to add linalg::FillOp here: https://github.com/google/iree/blob/main/iree/compiler/Dialect/Flow/Transforms/DispatchLinalgOnTensors.cpp#L234. It will be treated as an op that is always inlined into the dispatch region. If anything is left out, it will be picked up by the second invocation of ConvertToFlow.

I'm not sure that would work. IIUC, it will force fusing the linalg.fill with its consumers even in cases where we shouldn't; for example, it will generate the following:

flow.dispatch.workgroups {
  %0 = linalg.fill ... : tensor<1x225x225x3xf32>
  %1 = subtensor_insert %input into %0[0, 0, 0, 0] [1, 224, 224, 3] [1, 1, 1, 1]
        : tensor<1x224x224x3xf32> into tensor<1x225x225x3xf32>
}

(A linalg.fill and then a subtensor_insert into a subrange of it is actually the exact motivating pattern for this change.)

The above fusion is problematic because at the moment we use linalg.copy for subtensor_insert, so the above dispatch region will then contain multiple linalg ops with different problem sizes, which will mess up distribution. Also, after fusing them there won't be a standalone linalg.fill left for the ConvertToFlowTensorOps invocation after DispatchLinalgOnTensors to pick up. So to make it work, more special checks in the DispatchLinalgOnTensors pass, or even tweaks across the whole pipeline, would be needed. That seems to me even more complicated than just 0cc653f.

Actually, coming back to your original point here:

adding it by default here (when run before DispatchLinalgOnTensors) will avoid any fusion of fill with other ops.

I'm not sure I follow exactly. The implementation in 91d46f4 checks that we don't have linalg users. If that's true, there won't be any fusion opportunities left for it (as fusion relies on Linalg ops). Am I missing something here?


LogicalResult matchAndRewrite(linalg::FillOp fillOp,
                              PatternRewriter &rewriter) const override {
  for (Operation *userOp : fillOp->getUsers()) {
Contributor

Sorry, following up from the previous comment: we can drop this check if we run the pass both before and after. This check is inherently hard to maintain.
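
For context, a self-contained sketch of what the full pattern around the snippet above could look like (the flow.tensor.splat builder call and the handling of dynamic dimensions are assumptions, not the code under review):

// Sketch: rewrite a standalone linalg.fill into flow.tensor.splat, skipping
// fills that feed other linalg ops (those are expected to be fused during
// dispatch region formation).
struct LinalgFillToFlowTensorSplat final
    : public OpRewritePattern<linalg::FillOp> {
  using OpRewritePattern::OpRewritePattern;

  LogicalResult matchAndRewrite(linalg::FillOp fillOp,
                                PatternRewriter &rewriter) const override {
    for (Operation *userOp : fillOp->getUsers()) {
      if (isa<linalg::LinalgOp>(userOp)) return failure();
    }
    // Assumed builder signature: result type, splat value, dynamic result
    // dims (omitted here; a real implementation must thread them through).
    rewriter.replaceOpWithNewOp<IREE::Flow::TensorSplatOp>(
        fillOp, fillOp->getResult(0).getType(), fillOp.value(),
        /*result_dims=*/ValueRange{});
    return success();
  }
};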

@iree-github-actions-bot
Contributor

iree-github-actions-bot commented Jun 11, 2021

Abbreviated Benchmark Summary

@ commit 9b0b3ba68d384b92d8d1a49fa9195f6bfe0684c5 (vs. base 268a30561472401c2fb83ba5e6c5884939b375d0)

Regressed Benchmarks 🚩

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| MobileBertSquad [fp32] (TensorFlow) big-core,full-inference with IREE-Dylib-Sync @ Pixel-4 (CPU-ARMv8.2-A) | 894 (vs. 726, 23.14%↑) | 893 | 3 |
| MobileNetV3Small [fp32,imagenet] (TensorFlow) 3-thread,little-core,full-inference with IREE-Dylib @ Pixel-4 (CPU-ARMv8.2-A) | 356 (vs. 318, 11.95%↑) | 352 | 29 |
| MobileNetV2 [fp32,imagenet] (TensorFlow) 3-thread,little-core,full-inference with IREE-Dylib @ Pixel-4 (CPU-ARMv8.2-A) | 998 (vs. 946, 5.50%↑) | 995 | 50 |

[Top 3 out of 4 benchmark results shown]

Improved Benchmarks 🎉

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| MobileNetV2 [fp32,imagenet] (TensorFlow) kernel-execution with IREE-Vulkan @ SM-G980F (GPU-Mali-G77) | 14 (vs. 18, 22.22%↓) | 14 | 0 |
| MobileNetV2 [fp32,imagenet] (TensorFlow) full-inference with IREE-Vulkan @ SM-G980F (GPU-Mali-G77) | 77 (vs. 87, 11.49%↓) | 83 | 10 |
| MobileNetV2 [fp32,imagenet] (TensorFlow) kernel-execution with IREE-Vulkan @ Pixel-4 (GPU-Adreno-640) | 65 (vs. 70, 7.14%↓) | 65 | 1 |

[Top 3 out of 5 benchmark results shown]

For more information:

@antiagainst
Contributor Author

I'm trying to land this before being out of office, but it looks like I'm not gonna make it because I'm now hitting other failures in VMVX/Dylib: https://source.cloud.google.com/results/invocations/bbd2da7c-0003-43ae-85c9-5b7356226012/targets/iree%2Fgcp_ubuntu%2Fbazel%2Flinux%2Fx86-swiftshader%2Fcore%2Fpresubmit/log:

check_dylib-llvm-aot_dylib_pad.mlir: iree/task/worker.c:261: _Bool iree_task_worker_pump_once(iree_task_worker_t *, iree_task_submission_t *): Assertion `!!((__builtin_expect(!!((uintptr_t)(status) == IREE_STATUS_OK), 1)))' failed.

check_vmvx_vmvx_convolution.mlir: iree/task/worker.c:261: _Bool iree_task_worker_pump_once(iree_task_worker_t *, iree_task_submission_t *): Assertion `!!((__builtin_expect(!!((uintptr_t)(status) == IREE_STATUS_OK), 1)))' failed.

Only happens for VMVX/Dylib:

//iree/test/e2e/tosa_ops:check_dylib-llvm-aot_dylib_pad.mlir             FAILED in 4.0s
  /home/kbuilder/.cache/bazel/_bazel_kbuilder/c32aa9ac646722210ccee9c722c31e29/execroot/iree_core/bazel-out/k8-opt/testlogs/iree/test/e2e/tosa_ops/check_dylib-llvm-aot_dylib_pad.mlir/test.log
//iree/test/e2e/tosa_ops:check_vmvx_vmvx_pad.mlir                        FAILED in 3.8s
  /home/kbuilder/.cache/bazel/_bazel_kbuilder/c32aa9ac646722210ccee9c722c31e29/execroot/iree_core/bazel-out/k8-opt/testlogs/iree/test/e2e/tosa_ops/check_vmvx_vmvx_pad.mlir/test.log
//iree/test/e2e/xla_ops:check_dylib-llvm-aot_dylib_concatenate.mlir      FAILED in 4.2s
  /home/kbuilder/.cache/bazel/_bazel_kbuilder/c32aa9ac646722210ccee9c722c31e29/execroot/iree_core/bazel-out/k8-opt/testlogs/iree/test/e2e/xla_ops/check_dylib-llvm-aot_dylib_concatenate.mlir/test.log
//iree/test/e2e/xla_ops:check_dylib-llvm-aot_dylib_convolution.mlir      FAILED in 3.9s
  /home/kbuilder/.cache/bazel/_bazel_kbuilder/c32aa9ac646722210ccee9c722c31e29/execroot/iree_core/bazel-out/k8-opt/testlogs/iree/test/e2e/xla_ops/check_dylib-llvm-aot_dylib_convolution.mlir/test.log
//iree/test/e2e/xla_ops:check_dylib-llvm-aot_dylib_pad.mlir              FAILED in 4.2s
  /home/kbuilder/.cache/bazel/_bazel_kbuilder/c32aa9ac646722210ccee9c722c31e29/execroot/iree_core/bazel-out/k8-opt/testlogs/iree/test/e2e/xla_ops/check_dylib-llvm-aot_dylib_pad.mlir/test.log
//iree/test/e2e/xla_ops:check_vmvx_vmvx_concatenate.mlir                 FAILED in 4.2s
  /home/kbuilder/.cache/bazel/_bazel_kbuilder/c32aa9ac646722210ccee9c722c31e29/execroot/iree_core/bazel-out/k8-opt/testlogs/iree/test/e2e/xla_ops/check_vmvx_vmvx_concatenate.mlir/test.log
//iree/test/e2e/xla_ops:check_vmvx_vmvx_convolution.mlir                 FAILED in 3.7s
  /home/kbuilder/.cache/bazel/_bazel_kbuilder/c32aa9ac646722210ccee9c722c31e29/execroot/iree_core/bazel-out/k8-opt/testlogs/iree/test/e2e/xla_ops/check_vmvx_vmvx_convolution.mlir/test.log
//iree/test/e2e/xla_ops:check_vmvx_vmvx_pad.mlir                         FAILED in 3.9s
  /home/kbuilder/.cache/bazel/_bazel_kbuilder/c32aa9ac646722210ccee9c722c31e29/execroot/iree_core/bazel-out/k8-opt/testlogs/iree/test/e2e/xla_ops/check_vmvx_vmvx_pad.mlir/test.log

@benvanik: do you know what might be wrong here?

@MaheshRavishankar
Contributor

/iree/test/e2e/tosa_ops:check_dylib-llvm-aot_dylib_pad.mlir FAILED in 4.0s
/home/kbuilder/.cache/bazel/_bazel_kbuilder/c32aa9ac646722210ccee9c722c31e29/execroot/iree_core/bazel-out/k8-opt/testlogs/iree/test/e2e/tosa_ops/check_dylib-llvm-aot_dylib_pad.mlir/test.log

This one is on the LLVM path.

@MaheshRavishankar
Contributor

Thanks Lei, I'll take this one and land it.

@antiagainst
Contributor Author

Thanks Lei, I'll take this one and land it.

Awesome, thanks Mahesh! :D I can go and be OOO peacefully now. :)

A flow.tensor.clone op's operand and result should by definition have the
same shape. However, when mapping to buffers, we could have the case where
the operand is a constant subspan. Then the source buffer will be the
constant pool buffer, which can be larger than the result buffer.
@antiagainst
Contributor Author

(Quoting my earlier comment above about the VMVX/Dylib test failures.)

Actually, I figured it out. It's because of the buffer size when converting flow.tensor.clone. A fix is here: #6201. :)

@antiagainst enabled auto-merge (squash) June 12, 2021 15:32
@antiagainst
Contributor Author

All blocking issues have been addressed; this patch is ready to go. :) As shown by #5614 (comment), this is a universally good performance improvement: 6 tracked benchmarks see a > 5% latency decrease. @MaheshRavishankar, how about we land this now, and then feel free to improve it in follow-up patches? (Enabled auto-merge for it.)

DenseElementsAttr requires a static shape; we will hit assertions
when trying to fold dynamically shaped flow.tensor.splat ops.
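
Roughly, the guard this commit implies might look like the following sketch (an assumption about how the fold hook is structured, not the actual IREE code):

// Sketch: only fold a constant splat into a DenseElementsAttr when the result
// shape is fully static, since DenseElementsAttr asserts on dynamic shapes.
OpFoldResult TensorSplatOp::fold(ArrayRef<Attribute> operands) {
  auto resultType = getResult().getType().dyn_cast<RankedTensorType>();
  if (!resultType || !resultType.hasStaticShape()) return {};
  if (!operands[0]) return {};  // The splat value is not a constant.
  return DenseElementsAttr::get(resultType, {operands[0]});
}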
@antiagainst enabled auto-merge (squash) July 22, 2021 11:42
@antiagainst enabled auto-merge (squash) July 22, 2021 11:51
@antiagainst merged commit 323108e into iree-org:main Jul 22, 2021
@KoolJBlack mentioned this pull request Jul 23, 2021
@antiagainst deleted the fill2splat2 branch September 5, 2021 20:14