Bump LLVM to llvm/llvm-project@f5145f4dc819 #16073

Merged 7 commits from integrate-llvm-20240108 into main on Jan 11, 2024.
Conversation

@hanhanW (Contributor) commented Jan 8, 2024

This includes a revert for llvm/llvm-project@11ac97c.
Other fixes:

  • Implement 77b777c, which avoids inlining big constants into dispatches.
  • Add fixes for llvm/llvm-project@b3037ae, see 7ed5f29.
  • Add fixes for llvm/llvm-project@bae1fde, see c83cefb. It updates HALDispatchABI::buildScopeAttr to take an LLVM::LLVMFuncOp as input, so it can decide whether a DistinctAttr is needed for the DISubprogramAttr.
  • Add fixes for llvm/llvm-project@bb6d5c2:
    • Add --split-input-file to lower_to_ukernel_ops.mlir.
    • Move a few CHECK-DAG directives to the beginning of their tests to fix lit failures.
    • Replace some CHECK[-NEXT] directives with CHECK-DAG (see the sketch after this list).
    • Add --canonicalize to affinemin_canonicalization.mlir, which hoists constants out of regions.
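
For context, a hypothetical lit test (made up for illustration, not taken from the IREE test suite) showing why CHECK-DAG is the more robust choice here: when constants are hoisted or materialized in a different order, strict CHECK/CHECK-NEXT sequences break, while order-insensitive CHECK-DAG directives still match.

    // RUN: iree-opt --canonicalize %s | FileCheck %s
    // The two constants below may be printed in either order after
    // canonicalization; CHECK-DAG matches them regardless of position,
    // whereas CHECK-NEXT would pin an exact order.
    // CHECK-LABEL: func.func @example
    //   CHECK-DAG: %[[C0:.+]] = arith.constant 0 : index
    //   CHECK-DAG: %[[C1:.+]] = arith.constant 1 : index
    //       CHECK: return
    func.func @example() -> (index, index) {
      %c0 = arith.constant 0 : index
      %c1 = arith.constant 1 : index
      return %c0, %c1 : index, index
    }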

hanhanW marked this pull request as a draft on January 9, 2024 00:04.
hanhanW marked this pull request as ready for review on January 9, 2024 00:54.

github-actions bot commented Jan 9, 2024

Abbreviated Benchmark Summary

@ commit fe22035d50c47a5f3a1e3f3a41bfcca1733e1591 (vs. base e2e126ce061454ad71bc2f5c1c08b1efc982f4cd)

Data-Tiling Comparison Table

| Name | No-DT (baseline) | DT-Only | DT-UK |
| --- | --- | --- | --- |
| BertLargeTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 1608.489 (1.0X) | 448.362 (3.6X) | 355.499 (4.5X) |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 6.363 (1.0X) | 9.902 (0.6X) | 9.542 (0.7X) |
| EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 51.123 (1.0X) | 58.272 (0.9X) | 52.492 (1.0X) |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 6.214 (1.0X) | 11.551 (0.5X) | 6.108 (1.0X) |
| Falcon7bInt4GptqPT(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 36088.773 (1.0X) | 17983.579 (2.0X) | 7605.654 (4.7X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 14.241 (1.0X) | 9.565 (1.5X) | 10.052 (1.4X) |
| MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 18.580 (1.0X) | 16.629 (1.1X) | 14.406 (1.3X) |
| MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 44.764 (1.0X) | 64.975 (0.7X) | 59.643 (0.8X) |
| MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 46.646 (1.0X) | 66.499 (0.7X) | 61.176 (0.8X) |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 111.326 (1.0X) | 198.397 (0.6X) | 60.082 (1.9X) |
| MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 6.712 (1.0X) | 5.517 (1.2X) | 5.030 (1.3X) |
| MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 3.736 (1.0X) | 5.623 (0.7X) | 5.412 (0.7X) |
| MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 5.900 (1.0X) | 9.449 (0.6X) | 5.606 (1.1X) |
| MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 3.283 (1.0X) | 3.700 (0.9X) | 3.403 (1.0X) |
| MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 8.555 (1.0X) | 11.054 (0.8X) | 10.491 (0.8X) |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.809 (1.0X) | 1.425 (0.6X) | 0.753 (1.1X) |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 4.505 (1.0X) | 6.413 (0.7X) | 5.804 (0.8X) |
| BertForMaskedLMTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 375.376 (1.0X) | 225.002 (1.7X) | 175.549 (2.1X) |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 26.212 (1.0X) | 38.011 (0.7X) | 34.984 (0.7X) |
| EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 262.757 (1.0X) | 265.232 (1.0X) | 242.260 (1.1X) |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 29.782 (1.0X) | 51.578 (0.6X) | 16.368 (1.8X) |
| Falcon7bGptqPT(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 35559.239 (1.0X) | 17698.086 (2.0X) | 7491.853 (4.7X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 16.965 (1.0X) | 10.531 (1.6X) | 10.654 (1.6X) |
| MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 80.641 (1.0X) | 74.807 (1.1X) | 59.115 (1.4X) |
| MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 180.633 (1.0X) | 228.942 (0.8X) | 189.875 (1.0X) |
| MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 185.870 (1.0X) | 231.897 (0.8X) | 195.658 (0.9X) |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 485.730 (1.0X) | 1018.350 (0.5X) | 215.770 (2.3X) |
| MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 28.234 (1.0X) | 22.806 (1.2X) | 19.611 (1.4X) |
| MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 11.829 (1.0X) | 15.284 (0.8X) | 13.467 (0.9X) |
| MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 22.122 (1.0X) | 41.166 (0.5X) | 14.075 (1.6X) |
| MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 3.267 (1.0X) | 3.762 (0.9X) | 3.317 (1.0X) |
| MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 33.378 (1.0X) | 38.495 (0.9X) | 35.018 (1.0X) |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.741 (1.0X) | 1.347 (0.6X) | 0.672 (1.1X) |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 18.444 (1.0X) | 25.552 (0.7X) | 21.611 (0.9X) |
| DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 58.713 (1.0X) | 44.247 (1.3X) | 43.566 (1.3X) |
| DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 60.520 (1.0X) | 45.843 (1.3X) | 45.201 (1.3X) |
| DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 36.192 (1.0X) | 28.947 (1.3X) | 28.527 (1.3X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 92.821 (1.0X) | 22.143 (4.2X) | 22.351 (4.2X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 93.275 (1.0X) | 23.259 (4.0X) | 23.227 (4.0X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 51.938 (1.0X) | 23.113 (2.2X) | 23.053 (2.3X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 137.847 (1.0X) | 30.725 (4.5X) | 30.720 (4.5X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 122.275 (1.0X) | 33.173 (3.7X) | 32.404 (3.8X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 68.819 (1.0X) | 29.690 (2.3X) | 29.427 (2.3X) |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 739.857 (1.0X) | 413.553 (1.8X) | 405.445 (1.8X) |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 748.615 (1.0X) | 424.224 (1.8X) | 408.221 (1.8X) |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 435.083 (1.0X) | 252.617 (1.7X) | 245.096 (1.8X) |
| MobileBertSquad_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 920.585 (1.0X) | 362.704 (2.5X) | 272.710 (3.4X) |
| MobileBertSquad_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 928.329 (1.0X) | 357.197 (2.6X) | 271.341 (3.4X) |
| MobileBertSquad_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 512.796 (1.0X) | 206.459 (2.5X) | 159.505 (3.2X) |
| Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 2251.107 (1.0X) | 993.359 (2.3X) | 823.591 (2.7X) |
| Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 2276.463 (1.0X) | 1009.674 (2.3X) | 837.009 (2.7X) |
| Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 1258.604 (1.0X) | 566.392 (2.2X) | 476.633 (2.6X) |

Regressed Latencies 🚩

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| MobileBertSquad_fp16(tflite) [arm-valhall-vulkan_android31-vulkan_spirv][experimental-flags,fuse-padding,max-concurrency,demote-f32-to-f16] vulkan(none)[full-inference,default-flags] with default @ pixel-6-pro[gpu] | 111.549 (vs. 93.066, 19.86%↑) | 111.419 | 1.159 |
| MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,no-dt] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 28.234 (vs. 26.515, 6.49%↑) | 28.168 | 0.375 |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,dt-only] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 38.011 (vs. 36.616, 3.81%↑) | 37.993 | 0.226 |

[Top 3 out of 4 results shown]

Improved Latencies 🎉

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu][experimental-flags,no-dt] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 122.275 (vs. 136.112, 10.17%↓) | 122.316 | 0.273 |
| matmul_123x2561x2561_f32t_f32t_f32t_tile_config_default(linalg) [cuda-sm_80-linux_gnu-cuda][ukernel,matmul] cuda(none)[full-inference,default-flags] with default @ a2-highgpu-1g[gpu] | 0.199 (vs. 0.219, 9.12%↓) | 0.199 | 0.000 |
| DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu][experimental-flags,dt-only] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 28.947 (vs. 31.677, 8.62%↓) | 29.170 | 0.912 |

[Top 3 out of 21 results shown]

Improved Total Dispatch Sizes 🎉

| Benchmark Name | Total Dispatch Size (bytes) |
| --- | --- |
| matmul_2562x2561x2561_f32t_f32t_f32t_tile_config_default(linalg) [cuda-sm_80-linux_gnu-cuda][ukernel,matmul,compile-stats] | 84400 (vs. 93272, 9.51%↓) |
| matmul_123x2561x2561_f32t_f32t_f32t_tile_config_default(linalg) [cuda-sm_80-linux_gnu-cuda][ukernel,matmul,compile-stats] | 50580 (vs. 54812, 7.72%↓) |

Improved Total Artifact Sizes 🎉

| Benchmark Name | Total Artifact Size (bytes) |
| --- | --- |
| matmul_2562x2561x2561_f32t_f32t_f32t_tile_config_default(linalg) [cuda-sm_80-linux_gnu-cuda][ukernel,matmul,compile-stats] | 95187 (vs. 104059, 8.53%↓) |
| matmul_123x2561x2561_f32t_f32t_f32t_tile_config_default(linalg) [cuda-sm_80-linux_gnu-cuda][ukernel,matmul,compile-stats] | 61366 (vs. 65598, 6.45%↓) |

For more information:

Source Workflow Run

@hanhanW (Contributor, Author) commented Jan 9, 2024

I took a look at the total dispatch size regression for GPT2. It looks like it comes from llvm/llvm-project@bb6d5c2 and llvm/llvm-project@eb42868. There are more elementwise add ops at the flow level. The commits have dependencies between them, so I can only try reverting both: git revert eb42868f2 bb6d5c2.

IR dump without revert: https://gist.githubusercontent.com/hanhanW/42dda3e9994c3b9454d8d4e5b9b0be01/raw/e7f70daac54df584d4a0fff7e98093cdfb90c443/log10

IR dump with revert: https://gist.githubusercontent.com/hanhanW/1e695dfcc4beb894c2423ae3139d2a54/raw/56941d0cb62ee8aac266ccea85bec482523f8e01/log11

If you vimdiff the two dumps, you will notice that there are more generic ops without the revert and more constants with the revert. It looks like some constant folding does not kick in after the integrate.

(It would be good if someone could check whether my statements are correct.)

@matthias-springer (Contributor) commented:

The first difference in the diff is interesting.

Before:

    builtin.module {
      func.func @forward_dispatch_6_generic_2304_f32(%arg0: !flow.dispatch.tensor<readonly:tensor<2304xf32>>, %arg1: !flow.dispatch.tensor<readonly:tensor<2304xf32>>, %arg2: !flow.dispatch.tensor<writeonly:tensor<2304xf32>>) {
        %0 = flow.dispatch.tensor.load %arg0, offsets = [0], sizes = [2304], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2304xf32>> -> tensor<2304xf32>
        %1 = flow.dispatch.tensor.load %arg1, offsets = [0], sizes = [2304], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2304xf32>> -> tensor<2304xf32>
        %2 = tensor.empty() : tensor<2304xf32>
        %3 = linalg.generic {indexing_maps = [#map1, #map1, #map1], iterator_types = ["parallel"]} ins(%0, %1 : tensor<2304xf32>, tensor<2304xf32>) outs(%2 : tensor<2304xf32>) {
        ^bb0(%in: f32, %in_0: f32, %out: f32):
          %4 = arith.addf %in, %in_0 : f32
          linalg.yield %4 : f32
        } -> tensor<2304xf32>
        flow.dispatch.tensor.store %3, %arg2, offsets = [0], sizes = [2304], strides = [1] : tensor<2304xf32> -> !flow.dispatch.tensor<writeonly:tensor<2304xf32>>
        return
      }
    }

After:

    builtin.module {
      func.func @forward_dispatch_6_generic_2304_f32(%arg0: !flow.dispatch.tensor<readonly:tensor<2304xf32>>, %arg1: !flow.dispatch.tensor<writeonly:tensor<2304xf32>>) {
        %cst = arith.constant dense_resource<__elided__> : tensor<2304xf32>
        %0 = flow.dispatch.tensor.load %arg0, offsets = [0], sizes = [2304], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2304xf32>> -> tensor<2304xf32>
        %1 = tensor.empty() : tensor<2304xf32>
        %2 = linalg.generic {indexing_maps = [#map1, #map1, #map1], iterator_types = ["parallel"]} ins(%0, %cst : tensor<2304xf32>, tensor<2304xf32>) outs(%1 : tensor<2304xf32>) {
        ^bb0(%in: f32, %in_0: f32, %out: f32):
          %3 = arith.addf %in, %in_0 : f32
          linalg.yield %3 : f32
        } -> tensor<2304xf32>
        flow.dispatch.tensor.store %2, %arg1, offsets = [0], sizes = [2304], strides = [1] : tensor<2304xf32> -> !flow.dispatch.tensor<writeonly:tensor<2304xf32>>
        return
      }
    }

A flow.dispatch.tensor.load is folded to a constant.

@hanhanW (Contributor, Author) commented Jan 10, 2024

I probably need to dump the IR without eliding large constants. IIUC, constants are only pulled into dispatches if they are small. We need to re-dump the IRs and see what the actual values are.

@hanhanW (Contributor, Author) commented Jan 11, 2024

The first dispatch difference appears after CollapseDimensions. The pass pulls the big constants into forward_dispatch_6_generic_2304_f32. I'm studying the logic of the pass now.

@hanhanW (Contributor, Author) commented Jan 11, 2024

> The first dispatch difference appears after CollapseDimensions. The pass pulls the big constants into forward_dispatch_6_generic_2304_f32. I'm studying the logic of the pass now.

I have a much smaller repro now:

Download https://gist.githubusercontent.com/hanhanW/0935ccebd438116f386406104bb15368/raw/4e2ce8411a5294ed3acbcd5bf467cac9ec3ad5aa/z.mlir and run:

    iree-opt --pass-pipeline="builtin.module(func.func(iree-flow-collapse-dimensions))" ~/z.mlir
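
For reference, a hypothetical input of roughly the shape that triggers the issue (the actual z.mlir from the gist may differ; the function name, values, and result types here are made up):

    // Hypothetical repro: a constant defined outside the dispatch region is
    // collapsed inside it, mirroring the snippet in the next comment.
    func.func @repro() -> tensor<2304xf32> {
      %cst = arith.constant dense<1.000000e+00> : tensor<1x1x2304xf32>
      %0 = flow.dispatch.region -> (tensor<2304xf32>) {
        %collapsed = tensor.collapse_shape %cst [[0, 1, 2]]
            : tensor<1x1x2304xf32> into tensor<2304xf32>
        flow.return %collapsed : tensor<2304xf32>
      }
      return %0 : tensor<2304xf32>
    }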

@hanhanW (Contributor, Author) commented Jan 11, 2024

What's happening is:

    %cst = arith.constant ...
    %1 = flow.dispatch.region -> (tensor<1x1x2304xf32>) {
      %collapse = tensor.collapse_shape %cst [[0, 1, 2]] : tensor<1x1x2304xf32> into tensor<2304xf32>
      ...
    }

applyOpPatternsAndFold replaces the tensor.collapse_shape op with a new constant, because the op's folder folds a reshape of a constant into a constant. This is why the big constant is "pulled" into the dispatch.
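
Concretely, after the fold the IR looks roughly like this (a sketch in the same style as the snippet above; the constant value is elided):

    %cst = arith.constant ...
    %1 = flow.dispatch.region -> (tensor<1x1x2304xf32>) {
      // The folder materialized a new 1-D constant inside the region,
      // replacing the tensor.collapse_shape:
      %cst_0 = arith.constant dense_resource<__elided__> : tensor<2304xf32>
      ...
    }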

hanhanW force-pushed the integrate-llvm-20240108 branch 2 times, most recently from 5c0a51e to 75337be on January 11, 2024 09:17.
@matthias-springer (Contributor) commented:

> What's happening is:
>
>     %cst = arith.constant ...
>     %1 = flow.dispatch.region -> (tensor<1x1x2304xf32>) {
>       %collapse = tensor.collapse_shape %cst [[0, 1, 2]] : tensor<1x1x2304xf32> into tensor<2304xf32>
>       ...
>     }
>
> applyOpPatternsAndFold replaces the tensor.collapse_shape op with a new constant, because the op's folder folds a reshape of a constant into a constant. This is why the big constant is "pulled" into the dispatch.

I'm wondering why this was not happening before. Or was it happening, but the constant was hoisted out of the region by the greedy pattern rewrite driver (and now it is no longer)?

@hanhanW (Contributor, Author) commented Jan 11, 2024

> > What's happening is:
> >
> >     %cst = arith.constant ...
> >     %1 = flow.dispatch.region -> (tensor<1x1x2304xf32>) {
> >       %collapse = tensor.collapse_shape %cst [[0, 1, 2]] : tensor<1x1x2304xf32> into tensor<2304xf32>
> >       ...
> >     }
> >
> > applyOpPatternsAndFold replaces the tensor.collapse_shape op with a new constant, because the op's folder folds a reshape of a constant into a constant. This is why the big constant is "pulled" into the dispatch.
>
> I'm wondering why this was not happening before. Or was it happening, but the constant was hoisted out of the region by the greedy pattern rewrite driver (and now it is no longer)?

Yes, you are right. They were hoisted out of the region (maybe by the greedy pattern rewrite driver).

@hanhanW (Contributor, Author) commented Jan 11, 2024

Actually, I don't really know what happened without the integrate; I need more time to understand it. My take matches what you're saying.

I also wonder whether reshape(constant) -> constant should be a folder. It looks dangerous to me now: it could put a constant into a region whose parent op has the IsolatedFromAbove trait (or some trait that disallows constant hoisting; I don't have much context on those), which seems weird to me.

@matthias-springer (Contributor) commented Jan 11, 2024

> I also wonder whether reshape(constant) -> constant should be a folder. It looks dangerous to me now: it could put a constant into a region whose parent op has the IsolatedFromAbove trait (or some trait that disallows constant hoisting; I don't have much context on those), which seems weird to me.

I don't really see how that could happen. If we have a reshape(%cst), we can be sure that %cst is in the same IsolatedFromAbove region as the reshape. Otherwise, the input IR would have been invalid.

Folders never move ops around. The reshape folder just causes an additional constant op to be materialized, right before the reshape op.
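
To illustrate with a hand-written sketch (not actual pass output): the fold only rewrites the reshape's result to a new constant at the same point in the IR; nothing crosses a region boundary.

    // Before folding:
    %cst = arith.constant dense<1.000000e+00> : tensor<1x1x2304xf32>
    %r = tensor.collapse_shape %cst [[0, 1, 2]] : tensor<1x1x2304xf32> into tensor<2304xf32>

    // After folding: a new constant is materialized where the reshape was;
    // the original constant remains and is cleaned up later if unused.
    %cst = arith.constant dense<1.000000e+00> : tensor<1x1x2304xf32>
    %r = arith.constant dense<1.000000e+00> : tensor<2304xf32>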

Do you have any rewrite patterns in IREE that move ops (in particular constant ops) from one region to another? Before llvm/llvm-project#75897, moving a constant op to a different region brought internal data structures in the greedy pattern rewrite driver into an inconsistent state. This could have caused an op to be hoisted by accident by OperationFolder (which maintains a mapping between Operation* and Region*). (Just speculating...)

@hanhanW (Contributor, Author) commented Jan 11, 2024

> I don't really see how that could happen. If we have a reshape(%cst), we can be sure that %cst is in the same IsolatedFromAbove region as the reshape. Otherwise, the input IR would have been invalid.

Very good point!

> Do you have any rewrite patterns in IREE that move ops (in particular constant ops) from one region to another? Before llvm/llvm-project#75897, moving a constant op to a different region brought internal data structures in the greedy pattern rewrite driver into an inconsistent state. This could have caused an op to be hoisted by accident by OperationFolder (which maintains a mapping between Operation* and Region*). (Just speculating...)

Well, I don't know. Maybe @MaheshRavishankar can answer.

@hanhanW (Contributor, Author) commented Jan 11, 2024

@MaheshRavishankar: 77b777c fixes the regression. Can you help review?

@hanhanW (Contributor, Author) commented Jan 11, 2024

@matthias-springer, are you able to help fix test-with-listener.mlir? The REPLACED and REMOVED strings are not generated anymore. I think this is impacted by the upstream commit as well; it uses applyOpPatternsAndFold here.

To repro:

    cmake --build . --target check-iree-dialects

    # or

    build/llvm-project/bin/iree-dialects-opt --test-listener-canonicalize='listener=1' llvm-external-projects/iree-dialects/test/Transforms/test-with-listener.mlir

@matthias-springer (Contributor) commented Jan 11, 2024

That is a problem with the applyOpPatternsAndFold version of the greedy pattern rewrite driver. The regular entry point (applyPatternsAndFoldGreedily) runs multiple iterations of pattern application until a fixpoint is reached; we don't do that at the moment for applyOpPatternsAndFold. Combined with the fact that we don't put operands/results back on the worklist when an op is modified, this means that, depending on the order in which patterns are processed, we may see fewer or additional pattern application opportunities.

This should be fixed in the greedy pattern rewrite driver; we should do something similar for the applyOpPatternsAndFold entry point.

A simple fix for now could be to invert the order of the operands in the test case; can you give that a try?

    func.func @test_canonicalize(%arg0: i32) -> (i32, i32) {
      // CANON: REPLACED arith.addi
      // CANON: REMOVED arith.addi
      %c5 = arith.constant -5 : i32
      %0 = arith.addi %arg0, %c5 : i32
      %1 = arith.addi %0, %c5 : i32
      return %0, %1 : i32, i32
    }

I think this was already "broken" before my CSE change; it was just not triggered because more notifications were sent around constants.

@hanhanW (Contributor, Author) commented Jan 11, 2024

> That is a problem with the applyOpPatternsAndFold version of the greedy pattern rewrite driver. The regular entry point (applyPatternsAndFoldGreedily) runs multiple iterations of pattern application until a fixpoint is reached; we don't do that at the moment for applyOpPatternsAndFold. Combined with the fact that we don't put operands/results back on the worklist when an op is modified, this means that, depending on the order in which patterns are processed, we may see fewer or additional pattern application opportunities.
>
> This should be fixed in the greedy pattern rewrite driver; we should do something similar for the applyOpPatternsAndFold entry point.
>
> A simple fix for now could be to invert the order of the operands in the test case; can you give that a try?
>
>     func.func @test_canonicalize(%arg0: i32) -> (i32, i32) {
>       // CANON: REPLACED arith.addi
>       // CANON: REMOVED arith.addi
>       %c5 = arith.constant -5 : i32
>       %0 = arith.addi %arg0, %c5 : i32
>       %1 = arith.addi %0, %c5 : i32
>       return %0, %1 : i32, i32
>     }
>
> I think this was already "broken" before my CSE change; it was just not triggered because more notifications were sent around constants.

That seems to work, thanks!

@hanhanW (Contributor, Author) commented Jan 11, 2024

The MobileBertSquad_fp16 regression on the Pixel GPU looks like noise, because the total dispatch size is unchanged.

@hanhanW (Contributor, Author) commented Jan 11, 2024

I thought the review comments would show up in the PR, but it seems they don't. Mahesh already reviewed 77b777c and it looks good to him.

hanhanW merged commit e32a502 into main on Jan 11, 2024 (65 checks passed).
hanhanW deleted the integrate-llvm-20240108 branch on January 11, 2024 18:07.