Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NFC][Flow] Remove use of fusion preprocessing when it isnt a preprocessing #17899

Conversation

MaheshRavishankar
Copy link
Contributor

The Fusion preprocessing pass was used in multiple places, which is
not the intent of the pass. Remove the subsequent usage. The only
reason for this double usage was for the pattern that moved reduction
dimensions to the innermost. Consolidate that pattern with the pattern
in InterchangeTransposeGenericPass (whose name is very convoluted
and does not represent what it actually does).

This commit also includes the following changes:

  • Rename InterchangeTransposeGenericPass to TransposeGenericOpsPass.
  • Reoder the passes in Passes.td to be alphabetical within each of the following
    portions
    • Dispatch region preprocessing passes
    • Dispatch region formation passes
    • General flow passes.

…essing.

The Fusion preprocessing pass was used in multiple places, which is
not the intent of the pass. Remove the subsequent usage.  The only
reason for this double usage was for the pattern that moved reduction
dimensions to the innermost. Consolidate that pattern with the pattern
in `InterchangeTransposeGenericPass` (whose name is very convoluted
and does not represent what it actually does).

Rename `InterchangeTransposeGenericPass` to `TransposeGenericOpsPass`.

Signed-off-by: MaheshRavishankar <mahesh.ravishankar@gmail.com>
@MaheshRavishankar
Copy link
Contributor Author

Could help to review one commit at a time.

@MaheshRavishankar MaheshRavishankar added benchmarks:cuda Run default CUDA benchmarks benchmarks:x86_64 Run default x86_64 benchmarks benchmarks:comp-stats Run default compilation statistics benchmarks benchmarks:android-cpu Run default Android CPU benchmarks benchmarks:android-gpu Run default Android GPU benchmarks benchmarks:vulkan-nvidia Run default Vulkan benchmarks on NVIDIA GPU labels Jul 14, 2024
@MaheshRavishankar MaheshRavishankar removed benchmarks:android-cpu Run default Android CPU benchmarks benchmarks:android-gpu Run default Android GPU benchmarks labels Jul 15, 2024
Also reoder the passes in `Passes.td` to be alphabetical within
each of the following portions
- Dispatch region preprocessing passes
- Dispatch region formation passes
- General flow passes.

Signed-off-by: MaheshRavishankar <mahesh.ravishankar@gmail.com>
Copy link

Abbreviated Benchmark Summary

@ commit 8a92b42503a3e4ecd793ff97ac417ce2d908d201 (vs. base a56975ddba226748b3e59efbc879a250b02147fa)

Data-Tiling Comparison Table

Click to show
Name No-DT (baseline) DT-Only DT-UK
BertLargeTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 766.474 (1.0X) N/A 223.301 (3.4X)
DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 6.949 (1.0X) N/A 8.516 (0.8X)
EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 36.155 (1.0X) N/A 34.725 (1.0X)
EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 5.814 (1.0X) N/A 5.022 (1.2X)
GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 9.197 (1.0X) N/A 8.525 (1.1X)
GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 11.280 (1.0X) N/A 8.987 (1.3X)
MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 11.899 (1.0X) N/A 14.060 (0.8X)
MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 34.062 (1.0X) N/A 62.055 (0.5X)
MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 35.196 (1.0X) N/A 62.363 (0.6X)
MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 68.886 (1.0X) N/A 65.080 (1.1X)
MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 4.824 (1.0X) N/A 4.582 (1.1X)
MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 3.704 (1.0X) N/A 4.915 (0.8X)
MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 5.868 (1.0X) N/A 5.422 (1.1X)
MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 2.862 (1.0X) N/A 2.808 (1.0X)
MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 8.445 (1.0X) N/A 9.862 (0.9X)
PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 0.784 (1.0X) N/A 0.613 (1.3X)
PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 4.165 (1.0X) N/A 5.250 (0.8X)
matmul_256x256x2048_i8_i4_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] 7.554 (1.0X) N/A 7.594 (1.0X)
matmul_256x256x2048_i8_i8_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] 6.673 (1.0X) N/A 1.807 (3.7X)
BertForMaskedLMTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 217.121 (1.0X) N/A 107.436 (2.0X)
DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 32.258 (1.0X) N/A 30.136 (1.1X)
EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 275.360 (1.0X) N/A 230.152 (1.2X)
EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 26.868 (1.0X) N/A 13.076 (2.1X)
GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 69.827 (1.0X) N/A 39.252 (1.8X)
GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 89.052 (1.0X) N/A 41.691 (2.1X)
MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 79.140 (1.0X) N/A 56.417 (1.4X)
MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 180.546 (1.0X) N/A 185.958 (1.0X)
MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 182.087 (1.0X) N/A 190.603 (1.0X)
MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 516.390 (1.0X) N/A 240.873 (2.1X)
MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 24.858 (1.0X) N/A 17.906 (1.4X)
MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 11.744 (1.0X) N/A 11.559 (1.0X)
MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 21.632 (1.0X) N/A 11.824 (1.8X)
MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] 2.796 (1.0X) N/A 2.709 (1.0X)
MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 34.490 (1.0X) N/A 31.654 (1.1X)
PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] 0.711 (1.0X) N/A 0.550 (1.3X)
PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 17.639 (1.0X) N/A 19.327 (0.9X)
matmul_1x256x2048_i8_i4_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] 0.054 (1.0X) N/A 0.054 (1.0X)
matmul_1x256x2048_i8_i8_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] 0.043 (1.0X) N/A 0.021 (2.0X)

No improved or regressed benchmarks 🏖️

No improved or regressed compilation metrics 🏖️

For more information:

Source Workflow Run

@MaheshRavishankar MaheshRavishankar merged commit 2912a2a into iree-org:main Jul 15, 2024
62 checks passed
@MaheshRavishankar MaheshRavishankar deleted the nfc_fix_fusion_preprocessing_use branch July 16, 2024 14:27
LLITCHEV pushed a commit to LLITCHEV/iree that referenced this pull request Jul 30, 2024
…essing (iree-org#17899)

The Fusion preprocessing pass was used in multiple places, which is
not the intent of the pass. Remove the subsequent usage.  The only
reason for this double usage was for the pattern that moved reduction
dimensions to the innermost. Consolidate that pattern with the pattern
in `InterchangeTransposeGenericPass` (whose name is very convoluted
and does not represent what it actually does).

This commit also includes the following changes:
- Rename `InterchangeTransposeGenericPass` to `TransposeGenericOpsPass`.
- Reoder the passes in `Passes.td` to be alphabetical within each of the
following
  portions
  - Dispatch region preprocessing passes
  - Dispatch region formation passes
  - General flow passes.

---------

Signed-off-by: MaheshRavishankar <mahesh.ravishankar@gmail.com>
Signed-off-by: Lubo Litchev <lubol@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
benchmarks:comp-stats Run default compilation statistics benchmarks benchmarks:cuda Run default CUDA benchmarks benchmarks:vulkan-nvidia Run default Vulkan benchmarks on NVIDIA GPU benchmarks:x86_64 Run default x86_64 benchmarks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants