Fold standalone linalg.fill ops into flow.tensor.splat ops #5614

Merged · 13 commits · Jul 22, 2021

Conversation

antiagainst
Contributor

This allows us to use DMA instead of kernels for pure data fills. This is
another step towards performance: it further decreases the number of
dispatches for MobileNetV2 from 94 to 76 and reduces the latency by 2ms on
Galaxy S20 (Mali G77).

@benvanik
Collaborator

Awesome! And this doesn't even have #5410 - when that's done it should help even more!

// region for it; just use flow.tensor.splat so we can leverage DMA
// functionalities.
Location loc = op->getLoc();
if (auto fillOp = dyn_cast<linalg::FillOp>(op)) {
Contributor

This doesn't seem to be the right place to do this. It's not actually creating a dispatch region. :)

Maybe move this to the ConvertToFlowTensorOps pass. Make that pass operate in two modes, before and after dispatch region creation (using a flag). The same pass can then be run before and after dispatch region creation, and this fill conversion would be added to the "after" path.

Contributor Author

Yup. I was in the middle of addressing this comment. (That's why I haven't re-requested review. ;-P)

It's fine for me to move this to ConvertToFlowTensorOps.cpp, and I think it actually results in slightly better code structure. One concern I have is that it assumes all linalg.fill ops with linalg users will be fused later in the pipeline, whereas the original placement was a more generic catch-all sink. But the assumption is probably true, as I cannot immediately come up with a case where it isn't. So I moved it to ConvertToFlowTensorOps.cpp. :)

Contributor

Ah, I think GitHub sent me a notification and I thought this was ready for review. But I still have some comments below.

@antiagainst added the buildkite:benchmark-android label Jun 11, 2021
///
/// It assumes linalg.fill ops that have linalg op users can be fused with
/// their users, so those cases aren't supported here.
struct LinalgFillToFlowTensorSplat final
Contributor

Sorry for the repeated nag on this one, but adding it by default here (when run before DispatchLinalgOnTensors) will avoid any fusion of fill with other ops, and fusing is better than always performing the fill over the whole buffer. My comment from last time was that the same pass should be controlled via a flag to run "before" or "after" dispatch region formation. All current patterns would run before; the fill conversion can run after.

Contributor Author

Having the pattern run after the DispatchLinalgOnTensors pass would mean we need to first identify and exclude the standalone linalg.fill ops in DispatchLinalgOnTensors (because otherwise the linalg.fill op would be put into its own dispatch region, as shown by the existing test) to allow the ConvertToFlowTensorOpsAfterDispatch pass to kick in later. So that's coupling two passes and introducing quite a lot of boilerplate code for what is just a canonicalization-like direct 1:1 op rewrite. Frankly, I'm not so sure that's better.

Contributor

No, you just need to add a flag to the pass: isRunBeforeDispatchRegionFormation.

If isRunBefore* is true, use the patterns that exist in that pass.
If isRunBefore* is false, use the pattern to convert fill to flow.*.

It's essentially two passes in one, but it will collect all the conversions to flow.* ops in one place.

Then the pass pipeline in Flow is:

convertToFlowTensorOps(/*isRunBefore=*/true);
..
createDispatchRegions
..
convertToFlowTensorOps(/*isRunBefore=*/false);
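
For illustration, a minimal sketch of what such a flag-controlled pass could look like (names such as runBeforeDispatchRegionFormation and populateBeforeDispatchPatterns are placeholders, not the actual IREE API):

// Sketch only: one pass, two modes, selected from the pass pipeline in C++.
struct ConvertToFlowTensorOpsPass
    : public PassWrapper<ConvertToFlowTensorOpsPass, OperationPass<FuncOp>> {
  explicit ConvertToFlowTensorOpsPass(bool runBeforeDispatchRegionFormation)
      : runBeforeDispatchRegionFormation(runBeforeDispatchRegionFormation) {}

  void runOnOperation() override {
    MLIRContext *context = &getContext();
    RewritePatternSet patterns(context);
    if (runBeforeDispatchRegionFormation) {
      // All existing tensor-op conversion patterns run in this mode.
      populateBeforeDispatchPatterns(patterns, context);  // placeholder helper
    } else {
      // Only leftover standalone ops, e.g. linalg.fill -> flow.tensor.splat.
      patterns.insert<LinalgFillToFlowTensorSplat>(context);
    }
    (void)applyPatternsAndFoldGreedily(getOperation(), std::move(patterns));
  }

  bool runBeforeDispatchRegionFormation;
};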

Contributor Author

Yup, I understand that we can put them into the same pass with a flag controlling the mode (though I think it's better to have two passes, as we don't share anything here, and I think we generally don't want... flags ;-P). That's a minor issue to me. My main point is that we still need to identify and exclude the standalone linalg.fill ops in DispatchLinalgOnTensors to make sure they fall through to be handled by the second invocation of the ConvertToFlowTensorOps pass; so we are gonna have the check in DispatchLinalgOnTensors anyway.

Contributor

Yup, I understand that we can put them into the same pass with a flag controlling the mode (though I think it's better to have two passes, as we don't share anything here, and I think we generally don't want... flags ;-P). That's a minor issue to me.

I acknowledge that it's a minor issue, but I don't see the point in having all the boilerplate code and two places in the code where the conversion happens. It's a flag that is controlled from the pass pipeline in C++ code, not a flag added to the benchmarking pipeline to conditionally use some feature for performance on a particular benchmark. So all I am saying is that flags come in different flavors.

My main point is that we still need to identify and exclude the standalone linalg.fill ops in DispatchLinalgOnTensors to make sure they fall through to be handled by the second invocation of the ConvertToFlowTensorOps pass; so we are gonna have the check in DispatchLinalgOnTensors anyway.

Good point, that is easy. Just need to add linalg::FillOp here: https://github.com/google/iree/blob/main/iree/compiler/Dialect/Flow/Transforms/DispatchLinalgOnTensors.cpp#L234. It will be treated as an op that is always inlined into the dispatch region. If anything is left out, it will be picked up by the second invocation of ConvertToFlow.
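
For illustration only, the suggestion amounts to something like the following sketch (the predicate name and the other ops in the list are assumptions, not the actual code at the linked location):

// Hypothetical sketch: treat linalg.fill as an op that is always cloned into
// the dispatch regions of its consumers instead of forming its own dispatch.
static bool isAlwaysClonedIntoDispatchOp(Operation *op) {
  return isa<linalg::FillOp, linalg::InitTensorOp, tensor::ExtractOp>(op);
}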

Contributor Author

Good point, that is easy. Just need to add linalg::FillOp here: https://github.com/google/iree/blob/main/iree/compiler/Dialect/Flow/Transforms/DispatchLinalgOnTensors.cpp#L234. It will be treated as an op that is always inlined into the dispatch region. If anything is left out, it will be picked up by the second invocation of ConvertToFlow.

I'm not sure that would work. IIUC, it will force fusing the linalg.fill with its consumers even in cases where we shouldn't; for example, it will generate the following:

flow.dispatch.workgroups {
  %0 = linalg.fill ... : tensor<1x225x225x3xf32>
  %1 = subtensor_insert %input into %0[0, 0, 0, 0] [1, 224, 224, 3] [1, 1, 1, 1]
        : tensor<1x224x224x3xf32> into tensor<1x225x225x3xf32>
}

(A linalg.fill and then a subtensor_insert into a subrange of it is actually the exact motivating pattern for this change.)

The above fusion is problematic because at the moment we use linalg.copy for subtensor_insert, so the above dispatch region will then contain multiple linalg ops with different problem sizes, which will mess up distribution. Also, after fusing them there won't be a standalone linalg.fill left for the ConvertToFlowTensorOps invocation after DispatchLinalgOnTensors to pick up. So to make it work, more special checks in the DispatchLinalgOnTensors pass, or even tweaks across the whole pipeline, would be needed. That seems to me even more complicated than just 0cc653f.

Actually, coming back to your original point here:

adding it by default here (when run before DispatchLinalgOnTensors) will avoid any fusion of fill with other ops.

I'm not sure I follow exactly. The implementation in 91d46f4 checks that we don't have linalg users. If that's true, there won't be any fusion opportunities left for it (as fusion relies on Linalg ops). Am I missing something here?


LogicalResult matchAndRewrite(linalg::FillOp fillOp,
                              PatternRewriter &rewriter) const override {
  for (Operation *userOp : fillOp->getUsers()) {
Contributor

Sorry, following up from the previous comment: we can drop this check if we run the pass both before and after. This check is inherently hard to maintain.
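
For context, a self-contained sketch of what the full pattern around the snippet above could look like (the flow.tensor.splat builder call and the handling of dynamic dimensions are assumptions, not the code under review):

// Sketch: rewrite a standalone linalg.fill into flow.tensor.splat, skipping
// fills that feed other linalg ops (those are expected to be fused during
// dispatch region formation).
struct LinalgFillToFlowTensorSplat final
    : public OpRewritePattern<linalg::FillOp> {
  using OpRewritePattern::OpRewritePattern;

  LogicalResult matchAndRewrite(linalg::FillOp fillOp,
                                PatternRewriter &rewriter) const override {
    for (Operation *userOp : fillOp->getUsers()) {
      if (isa<linalg::LinalgOp>(userOp)) return failure();
    }
    // Assumed builder signature: result type, splat value, dynamic result
    // dims (omitted here; a real implementation must thread them through).
    rewriter.replaceOpWithNewOp<IREE::Flow::TensorSplatOp>(
        fillOp, fillOp->getResult(0).getType(), fillOp.value(),
        /*result_dims=*/ValueRange{});
    return success();
  }
};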

@iree-github-actions-bot
Contributor

iree-github-actions-bot commented Jun 11, 2021

Abbreviated Benchmark Summary

@ commit 9b0b3ba68d384b92d8d1a49fa9195f6bfe0684c5 (vs. base 268a30561472401c2fb83ba5e6c5884939b375d0)

Regressed Benchmarks 🚩

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| MobileBertSquad [fp32] (TensorFlow) big-core,full-inference with IREE-Dylib-Sync @ Pixel-4 (CPU-ARMv8.2-A) | 894 (vs. 726, 23.14%↑) | 893 | 3 |
| MobileNetV3Small [fp32,imagenet] (TensorFlow) 3-thread,little-core,full-inference with IREE-Dylib @ Pixel-4 (CPU-ARMv8.2-A) | 356 (vs. 318, 11.95%↑) | 352 | 29 |
| MobileNetV2 [fp32,imagenet] (TensorFlow) 3-thread,little-core,full-inference with IREE-Dylib @ Pixel-4 (CPU-ARMv8.2-A) | 998 (vs. 946, 5.50%↑) | 995 | 50 |

[Top 3 out of 4 benchmark results shown]

Improved Benchmarks 🎉

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| MobileNetV2 [fp32,imagenet] (TensorFlow) kernel-execution with IREE-Vulkan @ SM-G980F (GPU-Mali-G77) | 14 (vs. 18, 22.22%↓) | 14 | 0 |
| MobileNetV2 [fp32,imagenet] (TensorFlow) full-inference with IREE-Vulkan @ SM-G980F (GPU-Mali-G77) | 77 (vs. 87, 11.49%↓) | 83 | 10 |
| MobileNetV2 [fp32,imagenet] (TensorFlow) kernel-execution with IREE-Vulkan @ Pixel-4 (GPU-Adreno-640) | 65 (vs. 70, 7.14%↓) | 65 | 1 |

[Top 3 out of 5 benchmark results shown]

For more information:

@antiagainst
Contributor Author

I'm trying to land this before being out of office, but it looks like I'm not gonna make it because I'm now hitting other failures in VMVX/Dylib: https://source.cloud.google.com/results/invocations/bbd2da7c-0003-43ae-85c9-5b7356226012/targets/iree%2Fgcp_ubuntu%2Fbazel%2Flinux%2Fx86-swiftshader%2Fcore%2Fpresubmit/log:

check_dylib-llvm-aot_dylib_pad.mlir: iree/task/worker.c:261: _Bool iree_task_worker_pump_once(iree_task_worker_t *, iree_task_submission_t *): Assertion `!!((__builtin_expect(!!((uintptr_t)(status) == IREE_STATUS_OK), 1)))' failed.

check_vmvx_vmvx_convolution.mlir: iree/task/worker.c:261: _Bool iree_task_worker_pump_once(iree_task_worker_t *, iree_task_submission_t *): Assertion `!!((__builtin_expect(!!((uintptr_t)(status) == IREE_STATUS_OK), 1)))' failed.

Only happens for VMVX/Dylib:

//iree/test/e2e/tosa_ops:check_dylib-llvm-aot_dylib_pad.mlir             FAILED in 4.0s
  /home/kbuilder/.cache/bazel/_bazel_kbuilder/c32aa9ac646722210ccee9c722c31e29/execroot/iree_core/bazel-out/k8-opt/testlogs/iree/test/e2e/tosa_ops/check_dylib-llvm-aot_dylib_pad.mlir/test.log
//iree/test/e2e/tosa_ops:check_vmvx_vmvx_pad.mlir                        FAILED in 3.8s
  /home/kbuilder/.cache/bazel/_bazel_kbuilder/c32aa9ac646722210ccee9c722c31e29/execroot/iree_core/bazel-out/k8-opt/testlogs/iree/test/e2e/tosa_ops/check_vmvx_vmvx_pad.mlir/test.log
//iree/test/e2e/xla_ops:check_dylib-llvm-aot_dylib_concatenate.mlir      FAILED in 4.2s
  /home/kbuilder/.cache/bazel/_bazel_kbuilder/c32aa9ac646722210ccee9c722c31e29/execroot/iree_core/bazel-out/k8-opt/testlogs/iree/test/e2e/xla_ops/check_dylib-llvm-aot_dylib_concatenate.mlir/test.log
//iree/test/e2e/xla_ops:check_dylib-llvm-aot_dylib_convolution.mlir      FAILED in 3.9s
  /home/kbuilder/.cache/bazel/_bazel_kbuilder/c32aa9ac646722210ccee9c722c31e29/execroot/iree_core/bazel-out/k8-opt/testlogs/iree/test/e2e/xla_ops/check_dylib-llvm-aot_dylib_convolution.mlir/test.log
//iree/test/e2e/xla_ops:check_dylib-llvm-aot_dylib_pad.mlir              FAILED in 4.2s
  /home/kbuilder/.cache/bazel/_bazel_kbuilder/c32aa9ac646722210ccee9c722c31e29/execroot/iree_core/bazel-out/k8-opt/testlogs/iree/test/e2e/xla_ops/check_dylib-llvm-aot_dylib_pad.mlir/test.log
//iree/test/e2e/xla_ops:check_vmvx_vmvx_concatenate.mlir                 FAILED in 4.2s
  /home/kbuilder/.cache/bazel/_bazel_kbuilder/c32aa9ac646722210ccee9c722c31e29/execroot/iree_core/bazel-out/k8-opt/testlogs/iree/test/e2e/xla_ops/check_vmvx_vmvx_concatenate.mlir/test.log
//iree/test/e2e/xla_ops:check_vmvx_vmvx_convolution.mlir                 FAILED in 3.7s
  /home/kbuilder/.cache/bazel/_bazel_kbuilder/c32aa9ac646722210ccee9c722c31e29/execroot/iree_core/bazel-out/k8-opt/testlogs/iree/test/e2e/xla_ops/check_vmvx_vmvx_convolution.mlir/test.log
//iree/test/e2e/xla_ops:check_vmvx_vmvx_pad.mlir                         FAILED in 3.9s
  /home/kbuilder/.cache/bazel/_bazel_kbuilder/c32aa9ac646722210ccee9c722c31e29/execroot/iree_core/bazel-out/k8-opt/testlogs/iree/test/e2e/xla_ops/check_vmvx_vmvx_pad.mlir/test.log

@benvanik: do you know what might be wrong here?

@MaheshRavishankar
Contributor

/iree/test/e2e/tosa_ops:check_dylib-llvm-aot_dylib_pad.mlir FAILED in 4.0s
/home/kbuilder/.cache/bazel/_bazel_kbuilder/c32aa9ac646722210ccee9c722c31e29/execroot/iree_core/bazel-out/k8-opt/testlogs/iree/test/e2e/tosa_ops/check_dylib-llvm-aot_dylib_pad.mlir/test.log

This one is on the LLVM path.

@MaheshRavishankar
Contributor

Thanks Lei, I'll take this one and land it.

@antiagainst
Contributor Author

Thanks Lei, I'll take this one and land it.

Awesome, thanks Mahesh! :D I can go and be OOO peacefully now. :)

A flow.tensor.clone op's operand and result should by definition have the
same shape. However, when mapping to buffers, we could have the case where
the operand is a constant subspan. Then the source buffer will be the
constant pool buffer, which can be larger than the result buffer.
@antiagainst
Contributor Author

(Quoting my earlier comment above about the VMVX/Dylib test failures.)

Actually, I figured it out. It's because of the buffer size when converting flow.tensor.clone. A fix is here: #6201. :)

@antiagainst enabled auto-merge (squash) June 12, 2021 15:32
@antiagainst
Contributor Author

All blocking issues have been addressed; this patch is ready to go. :) As shown by #5614 (comment), this is a universally good performance improvement: 6 tracked benchmarks see a > 5% latency decrease. @MaheshRavishankar, how about we land this now, and then feel free to improve it in follow-up patches? (Enabled auto-merge for it.)

DenseElementsAttr requires a static shape; we will hit assertions
when trying to fold dynamically shaped flow.tensor.splat ops.
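
Roughly, the guard this commit implies might look like the following sketch (an assumption about how the fold hook is structured, not the actual IREE code):

// Sketch: only fold a constant splat into a DenseElementsAttr when the result
// shape is fully static, since DenseElementsAttr asserts on dynamic shapes.
OpFoldResult TensorSplatOp::fold(ArrayRef<Attribute> operands) {
  auto resultType = getResult().getType().dyn_cast<RankedTensorType>();
  if (!resultType || !resultType.hasStaticShape()) return {};
  if (!operands[0]) return {};  // The splat value is not a constant.
  return DenseElementsAttr::get(resultType, {operands[0]});
}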
@antiagainst enabled auto-merge (squash) July 22, 2021 11:42
@antiagainst enabled auto-merge (squash) July 22, 2021 11:51
@antiagainst merged commit 323108e into iree-org:main Jul 22, 2021
@KoolJBlack mentioned this pull request Jul 23, 2021
@antiagainst deleted the fill2splat2 branch September 5, 2021 20:14