
Collapse dims when producer is unpack op #17725

Merged: 3 commits merged into iree-org:main on Jul 21, 2024
Conversation

IanWood1 (Contributor) commented Jun 21, 2024

IanWood1 self-assigned this on Jun 21, 2024.
IanWood1 added the following benchmark labels on Jun 21, 2024: benchmarks:cuda, benchmarks:x86_64, benchmarks:comp-stats, benchmarks:android-cpu, benchmarks:android-gpu, benchmarks:vulkan-nvidia.
github-actions bot commented Jun 21, 2024

Abbreviated Benchmark Summary

@ commit 793e4095b4558c71409842768805334c6cf56f1d (no previous benchmark results to compare)

Data-Tiling Comparison Table

| Name | No-DT (baseline) | DT-Only | DT-UK |
| --- | --- | --- | --- |
| BertLargeTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 766.566 (1.0X) | N/A | 220.280 (3.5X) |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 6.955 (1.0X) | N/A | 8.492 (0.8X) |
| EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 35.389 (1.0X) | N/A | 34.636 (1.0X) |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 5.814 (1.0X) | N/A | 4.990 (1.2X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 9.132 (1.0X) | N/A | 8.611 (1.1X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 11.016 (1.0X) | N/A | 9.055 (1.2X) |
| MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 12.019 (1.0X) | N/A | 13.712 (0.9X) |
| MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 32.837 (1.0X) | N/A | 61.415 (0.5X) |
| MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 33.511 (1.0X) | N/A | 61.844 (0.5X) |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 68.698 (1.0X) | N/A | 64.790 (1.1X) |
| MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 4.582 (1.0X) | N/A | 4.544 (1.0X) |
| MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 3.711 (1.0X) | N/A | 4.853 (0.8X) |
| MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 5.864 (1.0X) | N/A | 5.383 (1.1X) |
| MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 2.868 (1.0X) | N/A | 2.826 (1.0X) |
| MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 8.403 (1.0X) | N/A | 9.895 (0.8X) |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.777 (1.0X) | N/A | 0.611 (1.3X) |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 4.125 (1.0X) | N/A | 5.174 (0.8X) |
| matmul_256x256x2048_i8_i4_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 7.562 (1.0X) | N/A | 7.626 (1.0X) |
| matmul_256x256x2048_i8_i8_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 6.655 (1.0X) | N/A | 1.807 (3.7X) |
| BertForMaskedLMTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 221.150 (1.0X) | N/A | 107.167 (2.1X) |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 32.105 (1.0X) | N/A | 29.751 (1.1X) |
| EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 275.373 (1.0X) | N/A | 229.009 (1.2X) |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 26.730 (1.0X) | N/A | 13.058 (2.0X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 70.442 (1.0X) | N/A | 39.638 (1.8X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 88.008 (1.0X) | N/A | 39.483 (2.2X) |
| MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 81.227 (1.0X) | N/A | 56.025 (1.4X) |
| MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 181.651 (1.0X) | N/A | 185.213 (1.0X) |
| MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 182.638 (1.0X) | N/A | 190.226 (1.0X) |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 516.176 (1.0X) | N/A | 240.885 (2.1X) |
| MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 24.091 (1.0X) | N/A | 17.741 (1.4X) |
| MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 11.742 (1.0X) | N/A | 11.472 (1.0X) |
| MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 21.515 (1.0X) | N/A | 11.765 (1.8X) |
| MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 2.773 (1.0X) | N/A | 2.755 (1.0X) |
| MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 34.040 (1.0X) | N/A | 31.585 (1.1X) |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.707 (1.0X) | N/A | 0.551 (1.3X) |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 17.320 (1.0X) | N/A | 18.967 (0.9X) |
| matmul_1x256x2048_i8_i4_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.054 (1.0X) | N/A | 0.054 (1.0X) |
| matmul_1x256x2048_i8_i8_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.042 (1.0X) | N/A | 0.021 (2.0X) |
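The speedup multipliers in parentheses are simply the No-DT baseline latency divided by the variant's latency, rounded to one decimal place. A minimal sketch (the `speedup` helper is my own name for illustration, not part of the benchmark tooling):

```python
def speedup(baseline_ms: float, variant_ms: float) -> float:
    """Speedup of a variant over the No-DT baseline, rounded as in the table."""
    return round(baseline_ms / variant_ms, 1)

# BertLargeTF: 766.566 ms (No-DT) vs. 220.280 ms (DT-UK)
print(f"{speedup(766.566, 220.280)}X")  # -> 3.5X
```

Note that a multiplier below 1.0X (e.g. DeepLabV3_fp32's 0.8X) indicates a regression relative to the baseline.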

Raw Latencies

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| BertLargeTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,dt-uk] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 220.280 | 220.026 | 1.702 |
| BertLargeTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,no-dt] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 766.566 | 757.433 | 48.039 |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,dt-uk] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 8.492 | 8.495 | 0.033 |

[Top 3 of 92 results shown]
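The average, median, and standard-deviation columns can be reproduced from a run's raw samples with the standard library. A sketch using made-up sample values (not the actual benchmark data):

```python
import statistics

# Hypothetical per-iteration latencies in ms, for illustration only.
samples = [220.1, 219.8, 221.5, 218.9, 220.0]

avg = sum(samples) / len(samples)
median = statistics.median(samples)
stdev = statistics.stdev(samples)  # sample standard deviation, as reported per benchmark

print(f"avg={avg:.3f} median={median:.3f} stdev={stdev:.3f}")
```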

No improved or regressed compilation metrics 🏖️

For more information:

Source Workflow Run

IanWood1 force-pushed the collapse_dims branch 2 times, most recently from b0942ff to 307e134 on June 25, 2024 05:36
hanhanW (Contributor) commented Jun 25, 2024

I haven't looked at the code yet, but it does address the compilation issue for #17530. Thanks for pushing on this!

IanWood1 (Contributor, Author) commented:

@hanhanW do you mean the large dispatch sizes? I'm not sure at the moment, since the benchmarks are having issues.

If you can point me to something reproducible locally, I can check it out.

hanhanW (Contributor) commented Jun 25, 2024

> @hanhanW do you mean the large dispatch sizes? I'm not sure currently since the benchmarks are having issues
>
> If you can point me to something reproducible locally I can check it out

There are no issues. This PR looks good to me because it fixes my issues. :)

I still need to check the benchmark results, but at least we can compile all of the models in my prototype.

hanhanW (Contributor) commented Jun 25, 2024

I think your PR breaks some models targeting gfx90a: https://github.com/iree-org/iree/actions/runs/9665068317/job/26661780089?pr=17725

The model and input files can be found at https://github.com/nod-ai/SHARK-TestSuite/tree/main/iree_tests/pytorch/models/sdxl-vae-decode-tank

Signed-off-by: Ian Wood <ianwood2024@u.northwestern.edu>
@ScottTodd ScottTodd removed benchmarks:android-cpu Run default Android CPU benchmarks benchmarks:android-gpu Run default Android GPU benchmarks labels Jul 19, 2024
@IanWood1 IanWood1 marked this pull request as ready for review July 19, 2024 16:37
IanWood1 (Contributor, Author) commented:

@hanhanW I just disabled collapsing for dispatches with generic -> generic producers and opened an issue, #17948, for the codegen issue.

hanhanW (Contributor) left a review comment:

LGTM, just one question

@@ -507,51 +507,6 @@ util.func public @input_broadcast(%arg0: tensor<4x8xf32>, %arg1: tensor<4xf32>)

// -----

// Do nothing if the dispatch is not a single elementwise op (with tensor.empty/linalg.fill producers)
hanhanW (Contributor) commented:

What's going on in this test? Why do we delete it?

IanWood1 (Contributor, Author) replied:

It was there to make sure that collapsing only happened when there was exactly one operation in the dispatch region. I removed it because I was planning to be able to collapse this case, but had to disable collapsing when there is a producer generic.
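As a conceptual aside (this is an illustrative NumPy sketch of why "collapsing dims" is legal for elementwise ops, not IREE's actual MLIR rewrite): an elementwise computation is independent of the shape of its iteration space, so an N-D op can run over a collapsed 1-D view and be expanded back afterwards.

```python
import numpy as np

def collapsed_elementwise_add(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Run an elementwise add over a fully collapsed (1-D) iteration space,
    then expand the result back to the original N-D shape."""
    assert a.shape == b.shape
    flat = a.reshape(-1) + b.reshape(-1)  # collapsed 1-D computation
    return flat.reshape(a.shape)          # expand back to the original shape

a = np.arange(24.0).reshape(2, 3, 4)
b = np.ones((2, 3, 4))
assert np.array_equal(collapsed_elementwise_add(a, b), a + b)
```

The subtlety this PR deals with is deciding when that collapse is still safe and profitable given the producers feeding the dispatch (e.g. unpack ops), which is why collapsing was disabled for generic producers here.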

IanWood1 (Contributor, Author) replied:

Okay, added this test back. I will remove it again in a later PR that actually performs the collapse here.

@@ -200,15 +227,6 @@ findRootGenericOp(DispatchRegionOp regionOp) {
}
}

// Check that the operands of the generic op are defined outside the dispatch.
hanhanW (Contributor) commented:

[no action required]: I don't understand why we had the check, but the changes that you made look reasonable to me.

Signed-off-by: Ian Wood <ianwood2024@u.northwestern.edu>
hanhanW (Contributor) left a review:

LGTM, thanks a lot!

@IanWood1 IanWood1 merged commit 0d0b989 into iree-org:main Jul 21, 2024
51 checks passed
benvanik pushed a commit that referenced this pull request Jul 22, 2024
#17594

---------

Signed-off-by: Ian Wood <ianwood2024@u.northwestern.edu>
LLITCHEV pushed a commit to LLITCHEV/iree that referenced this pull request Jul 30, 2024
iree-org#17594

---------

Signed-off-by: Ian Wood <ianwood2024@u.northwestern.edu>
Signed-off-by: Lubo Litchev <lubol@google.com>