Bump LLVM to llvm/llvm-project@f5145f4dc819 #16073

Merged 7 commits from integrate-llvm-20240108 into main on Jan 11, 2024.
Conversation

@hanhanW (Contributor) commented Jan 8, 2024

This includes a revert for llvm/llvm-project@11ac97c.
Other fixes:

  • Implement 77b777c, which avoids inlining big constants into dispatches.
  • Add fixes for llvm/llvm-project@b3037ae, see 7ed5f29.
  • Add fixes for llvm/llvm-project@bae1fde, see c83cefb. It updates HALDispatchABI::buildScopeAttr to take an LLVM::LLVMFuncOp as input, so it can decide whether a DistinctAttr is needed for the DISubprogramAttr.
  • Add fixes for llvm/llvm-project@bb6d5c2:
    • Add --split-input-file to lower_to_ukernel_ops.mlir.
    • Move a few CHECK-DAG directives to the beginning of their tests to fix lit failures.
    • Replace some CHECK[-NEXT] directives with CHECK-DAG (see the sketch after this list).
    • Add --canonicalize to affinemin_canonicalization.mlir, which hoists constants out of regions.
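
For context, a hypothetical lit test (made up for illustration, not taken from the IREE test suite) showing why CHECK-DAG is the more robust choice here: when constants are hoisted or materialized in a different order, strict CHECK/CHECK-NEXT sequences break, while order-insensitive CHECK-DAG directives still match.

    // RUN: iree-opt --canonicalize %s | FileCheck %s
    // The two constants below may be printed in either order after
    // canonicalization; CHECK-DAG matches them regardless of position,
    // whereas CHECK-NEXT would pin an exact order.
    // CHECK-LABEL: func.func @example
    //   CHECK-DAG: %[[C0:.+]] = arith.constant 0 : index
    //   CHECK-DAG: %[[C1:.+]] = arith.constant 1 : index
    //       CHECK: return
    func.func @example() -> (index, index) {
      %c0 = arith.constant 0 : index
      %c1 = arith.constant 1 : index
      return %c0, %c1 : index, index
    }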

hanhanW marked this pull request as a draft on January 9, 2024 00:04.
hanhanW marked this pull request as ready for review on January 9, 2024 00:54.

github-actions bot commented Jan 9, 2024

Abbreviated Benchmark Summary

@ commit fe22035d50c47a5f3a1e3f3a41bfcca1733e1591 (vs. base e2e126ce061454ad71bc2f5c1c08b1efc982f4cd)

Data-Tiling Comparison Table

| Name | No-DT (baseline) | DT-Only | DT-UK |
| --- | --- | --- | --- |
| BertLargeTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 1608.489 (1.0X) | 448.362 (3.6X) | 355.499 (4.5X) |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 6.363 (1.0X) | 9.902 (0.6X) | 9.542 (0.7X) |
| EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 51.123 (1.0X) | 58.272 (0.9X) | 52.492 (1.0X) |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 6.214 (1.0X) | 11.551 (0.5X) | 6.108 (1.0X) |
| Falcon7bInt4GptqPT(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 36088.773 (1.0X) | 17983.579 (2.0X) | 7605.654 (4.7X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 14.241 (1.0X) | 9.565 (1.5X) | 10.052 (1.4X) |
| MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 18.580 (1.0X) | 16.629 (1.1X) | 14.406 (1.3X) |
| MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 44.764 (1.0X) | 64.975 (0.7X) | 59.643 (0.8X) |
| MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 46.646 (1.0X) | 66.499 (0.7X) | 61.176 (0.8X) |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 111.326 (1.0X) | 198.397 (0.6X) | 60.082 (1.9X) |
| MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 6.712 (1.0X) | 5.517 (1.2X) | 5.030 (1.3X) |
| MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 3.736 (1.0X) | 5.623 (0.7X) | 5.412 (0.7X) |
| MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 5.900 (1.0X) | 9.449 (0.6X) | 5.606 (1.1X) |
| MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 3.283 (1.0X) | 3.700 (0.9X) | 3.403 (1.0X) |
| MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 8.555 (1.0X) | 11.054 (0.8X) | 10.491 (0.8X) |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.809 (1.0X) | 1.425 (0.6X) | 0.753 (1.1X) |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 4.505 (1.0X) | 6.413 (0.7X) | 5.804 (0.8X) |
| BertForMaskedLMTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 375.376 (1.0X) | 225.002 (1.7X) | 175.549 (2.1X) |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 26.212 (1.0X) | 38.011 (0.7X) | 34.984 (0.7X) |
| EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 262.757 (1.0X) | 265.232 (1.0X) | 242.260 (1.1X) |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 29.782 (1.0X) | 51.578 (0.6X) | 16.368 (1.8X) |
| Falcon7bGptqPT(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 35559.239 (1.0X) | 17698.086 (2.0X) | 7491.853 (4.7X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 16.965 (1.0X) | 10.531 (1.6X) | 10.654 (1.6X) |
| MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 80.641 (1.0X) | 74.807 (1.1X) | 59.115 (1.4X) |
| MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 180.633 (1.0X) | 228.942 (0.8X) | 189.875 (1.0X) |
| MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 185.870 (1.0X) | 231.897 (0.8X) | 195.658 (0.9X) |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 485.730 (1.0X) | 1018.350 (0.5X) | 215.770 (2.3X) |
| MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 28.234 (1.0X) | 22.806 (1.2X) | 19.611 (1.4X) |
| MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 11.829 (1.0X) | 15.284 (0.8X) | 13.467 (0.9X) |
| MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 22.122 (1.0X) | 41.166 (0.5X) | 14.075 (1.6X) |
| MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 3.267 (1.0X) | 3.762 (0.9X) | 3.317 (1.0X) |
| MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 33.378 (1.0X) | 38.495 (0.9X) | 35.018 (1.0X) |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.741 (1.0X) | 1.347 (0.6X) | 0.672 (1.1X) |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 18.444 (1.0X) | 25.552 (0.7X) | 21.611 (0.9X) |
| DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 58.713 (1.0X) | 44.247 (1.3X) | 43.566 (1.3X) |
| DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 60.520 (1.0X) | 45.843 (1.3X) | 45.201 (1.3X) |
| DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 36.192 (1.0X) | 28.947 (1.3X) | 28.527 (1.3X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 92.821 (1.0X) | 22.143 (4.2X) | 22.351 (4.2X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 93.275 (1.0X) | 23.259 (4.0X) | 23.227 (4.0X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 51.938 (1.0X) | 23.113 (2.2X) | 23.053 (2.3X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 137.847 (1.0X) | 30.725 (4.5X) | 30.720 (4.5X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 122.275 (1.0X) | 33.173 (3.7X) | 32.404 (3.8X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 68.819 (1.0X) | 29.690 (2.3X) | 29.427 (2.3X) |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 739.857 (1.0X) | 413.553 (1.8X) | 405.445 (1.8X) |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 748.615 (1.0X) | 424.224 (1.8X) | 408.221 (1.8X) |
| MobileBertSquad_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 435.083 (1.0X) | 252.617 (1.7X) | 245.096 (1.8X) |
| MobileBertSquad_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 920.585 (1.0X) | 362.704 (2.5X) | 272.710 (3.4X) |
| MobileBertSquad_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 928.329 (1.0X) | 357.197 (2.6X) | 271.341 (3.4X) |
| MobileBertSquad_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 512.796 (1.0X) | 206.459 (2.5X) | 159.505 (3.2X) |
| Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ pixel-6-pro[big-cores] | 2251.107 (1.0X) | 993.359 (2.3X) | 823.591 (2.7X) |
| Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 2276.463 (1.0X) | 1009.674 (2.3X) | 837.009 (2.7X) |
| Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 1258.604 (1.0X) | 566.392 (2.2X) | 476.633 (2.6X) |

Regressed Latencies 🚩

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| MobileBertSquad_fp16(tflite) [arm-valhall-vulkan_android31-vulkan_spirv][experimental-flags,fuse-padding,max-concurrency,demote-f32-to-f16] vulkan(none)[full-inference,default-flags] with default @ pixel-6-pro[gpu] | 111.549 (vs. 93.066, 19.86%↑) | 111.419 | 1.159 |
| MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,no-dt] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 28.234 (vs. 26.515, 6.49%↑) | 28.168 | 0.375 |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,dt-only] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 38.011 (vs. 36.616, 3.81%↑) | 37.993 | 0.226 |

[Top 3 out of 4 results shown]

Improved Latencies 🎉

| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
| --- | --- | --- | --- |
| GPT2_117M_TF_1X4XI32(stablehlo) [armv8.2-a-generic-linux_android29-llvm_cpu][experimental-flags,no-dt] local_task(embedded_elf)[1-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 122.275 (vs. 136.112, 10.17%↓) | 122.316 | 0.273 |
| matmul_123x2561x2561_f32t_f32t_f32t_tile_config_default(linalg) [cuda-sm_80-linux_gnu-cuda][ukernel,matmul] cuda(none)[full-inference,default-flags] with default @ a2-highgpu-1g[gpu] | 0.199 (vs. 0.219, 9.12%↓) | 0.199 | 0.000 |
| DeepLabV3_fp32(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu][experimental-flags,dt-only] local_task(embedded_elf)[2-thread,full-inference,system-scheduling] with default @ pixel-6-pro[big-cores] | 28.947 (vs. 31.677, 8.62%↓) | 29.170 | 0.912 |

[Top 3 out of 21 results shown]

Improved Total Dispatch Sizes 🎉

| Benchmark Name | Total Dispatch Size (bytes) |
| --- | --- |
| matmul_2562x2561x2561_f32t_f32t_f32t_tile_config_default(linalg) [cuda-sm_80-linux_gnu-cuda][ukernel,matmul,compile-stats] | 84400 (vs. 93272, 9.51%↓) |
| matmul_123x2561x2561_f32t_f32t_f32t_tile_config_default(linalg) [cuda-sm_80-linux_gnu-cuda][ukernel,matmul,compile-stats] | 50580 (vs. 54812, 7.72%↓) |

Improved Total Artifact Sizes 🎉

| Benchmark Name | Total Artifact Size (bytes) |
| --- | --- |
| matmul_2562x2561x2561_f32t_f32t_f32t_tile_config_default(linalg) [cuda-sm_80-linux_gnu-cuda][ukernel,matmul,compile-stats] | 95187 (vs. 104059, 8.53%↓) |
| matmul_123x2561x2561_f32t_f32t_f32t_tile_config_default(linalg) [cuda-sm_80-linux_gnu-cuda][ukernel,matmul,compile-stats] | 61366 (vs. 65598, 6.45%↓) |

For more information:

Source Workflow Run

@hanhanW (Contributor, Author) commented Jan 9, 2024

I took a look at the total dispatch size regression for GPT2. It looks like it comes from llvm/llvm-project@bb6d5c2 and llvm/llvm-project@eb42868. There are more elementwise add ops at the flow level. The commits have dependencies between them, so I can only try reverting both: git revert eb42868f2 bb6d5c2.

IR dump without revert: https://gist.githubusercontent.com/hanhanW/42dda3e9994c3b9454d8d4e5b9b0be01/raw/e7f70daac54df584d4a0fff7e98093cdfb90c443/log10

IR dump with revert: https://gist.githubusercontent.com/hanhanW/1e695dfcc4beb894c2423ae3139d2a54/raw/56941d0cb62ee8aac266ccea85bec482523f8e01/log11

If you vimdiff the two dumps, you will notice that there are more generic ops without the revert and more constants with the revert. It looks like some constant folding does not kick in after the integrate.

(It would be good if someone could check whether my statements are correct.)

@matthias-springer (Contributor) commented:

The first difference in the diff is interesting.

Before:

    builtin.module {
      func.func @forward_dispatch_6_generic_2304_f32(%arg0: !flow.dispatch.tensor<readonly:tensor<2304xf32>>, %arg1: !flow.dispatch.tensor<readonly:tensor<2304xf32>>, %arg2: !flow.dispatch.tensor<writeonly:tensor<2304xf32>>) {
        %0 = flow.dispatch.tensor.load %arg0, offsets = [0], sizes = [2304], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2304xf32>> -> tensor<2304xf32>
        %1 = flow.dispatch.tensor.load %arg1, offsets = [0], sizes = [2304], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2304xf32>> -> tensor<2304xf32>
        %2 = tensor.empty() : tensor<2304xf32>
        %3 = linalg.generic {indexing_maps = [#map1, #map1, #map1], iterator_types = ["parallel"]} ins(%0, %1 : tensor<2304xf32>, tensor<2304xf32>) outs(%2 : tensor<2304xf32>) {
        ^bb0(%in: f32, %in_0: f32, %out: f32):
          %4 = arith.addf %in, %in_0 : f32
          linalg.yield %4 : f32
        } -> tensor<2304xf32>
        flow.dispatch.tensor.store %3, %arg2, offsets = [0], sizes = [2304], strides = [1] : tensor<2304xf32> -> !flow.dispatch.tensor<writeonly:tensor<2304xf32>>
        return
      }
    }

After:

    builtin.module {
      func.func @forward_dispatch_6_generic_2304_f32(%arg0: !flow.dispatch.tensor<readonly:tensor<2304xf32>>, %arg1: !flow.dispatch.tensor<writeonly:tensor<2304xf32>>) {
        %cst = arith.constant dense_resource<__elided__> : tensor<2304xf32>
        %0 = flow.dispatch.tensor.load %arg0, offsets = [0], sizes = [2304], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2304xf32>> -> tensor<2304xf32>
        %1 = tensor.empty() : tensor<2304xf32>
        %2 = linalg.generic {indexing_maps = [#map1, #map1, #map1], iterator_types = ["parallel"]} ins(%0, %cst : tensor<2304xf32>, tensor<2304xf32>) outs(%1 : tensor<2304xf32>) {
        ^bb0(%in: f32, %in_0: f32, %out: f32):
          %3 = arith.addf %in, %in_0 : f32
          linalg.yield %3 : f32
        } -> tensor<2304xf32>
        flow.dispatch.tensor.store %2, %arg1, offsets = [0], sizes = [2304], strides = [1] : tensor<2304xf32> -> !flow.dispatch.tensor<writeonly:tensor<2304xf32>>
        return
      }
    }

A flow.dispatch.tensor.load is folded to a constant.

@hanhanW (Contributor, Author) commented Jan 10, 2024

I probably need to dump the IR without eliding large constants. IIUC, constants are only pulled into dispatches if they are small. We need to re-dump the IRs and see what the actual values are.

@hanhanW (Contributor, Author) commented Jan 11, 2024

The first dispatch difference appears after CollapseDimensions. The pass pulls the big constants into forward_dispatch_6_generic_2304_f32. I'm studying the logic of the pass now.

@hanhanW (Contributor, Author) commented Jan 11, 2024

> The first dispatch difference appears after CollapseDimensions. The pass pulls the big constants into forward_dispatch_6_generic_2304_f32. I'm studying the logic of the pass now.

I have a much smaller repro now:

Download https://gist.githubusercontent.com/hanhanW/0935ccebd438116f386406104bb15368/raw/4e2ce8411a5294ed3acbcd5bf467cac9ec3ad5aa/z.mlir and run:

    iree-opt --pass-pipeline="builtin.module(func.func(iree-flow-collapse-dimensions))" ~/z.mlir
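
For reference, a hypothetical input of roughly the shape that triggers the issue (the actual z.mlir from the gist may differ; the function name, values, and result types here are made up):

    // Hypothetical repro: a constant defined outside the dispatch region is
    // collapsed inside it, mirroring the snippet in the next comment.
    func.func @repro() -> tensor<2304xf32> {
      %cst = arith.constant dense<1.000000e+00> : tensor<1x1x2304xf32>
      %0 = flow.dispatch.region -> (tensor<2304xf32>) {
        %collapsed = tensor.collapse_shape %cst [[0, 1, 2]]
            : tensor<1x1x2304xf32> into tensor<2304xf32>
        flow.return %collapsed : tensor<2304xf32>
      }
      return %0 : tensor<2304xf32>
    }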

@hanhanW (Contributor, Author) commented Jan 11, 2024

What's happening is:

    %cst = arith.constant ...
    %1 = flow.dispatch.region -> (tensor<1x1x2304xf32>) {
      %collapse = tensor.collapse_shape %cst [[0, 1, 2]] : tensor<1x1x2304xf32> into tensor<2304xf32>
      ...
    }

applyOpPatternsAndFold replaces the tensor.collapse_shape op with a new constant, because the op's folder folds a reshape of a constant into a constant. This is why the big constant is "pulled" into the dispatch.
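
Concretely, after the fold the IR looks roughly like this (a sketch in the same style as the snippet above; the constant value is elided):

    %cst = arith.constant ...
    %1 = flow.dispatch.region -> (tensor<1x1x2304xf32>) {
      // The folder materialized a new 1-D constant inside the region,
      // replacing the tensor.collapse_shape:
      %cst_0 = arith.constant dense_resource<__elided__> : tensor<2304xf32>
      ...
    }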

hanhanW force-pushed the integrate-llvm-20240108 branch 2 times, most recently from 5c0a51e to 75337be on January 11, 2024 09:17.
@matthias-springer (Contributor) commented:

> What's happening is:
>
>     %cst = arith.constant ...
>     %1 = flow.dispatch.region -> (tensor<1x1x2304xf32>) {
>       %collapse = tensor.collapse_shape %cst [[0, 1, 2]] : tensor<1x1x2304xf32> into tensor<2304xf32>
>       ...
>     }
>
> applyOpPatternsAndFold replaces the tensor.collapse_shape op with a new constant, because the op's folder folds a reshape of a constant into a constant. This is why the big constant is "pulled" into the dispatch.

I'm wondering why this was not happening before. Or was it happening, but the constant was hoisted out of the region by the greedy pattern rewrite driver (and now it is no longer)?

@hanhanW (Contributor, Author) commented Jan 11, 2024

> > What's happening is:
> >
> >     %cst = arith.constant ...
> >     %1 = flow.dispatch.region -> (tensor<1x1x2304xf32>) {
> >       %collapse = tensor.collapse_shape %cst [[0, 1, 2]] : tensor<1x1x2304xf32> into tensor<2304xf32>
> >       ...
> >     }
> >
> > applyOpPatternsAndFold replaces the tensor.collapse_shape op with a new constant, because the op's folder folds a reshape of a constant into a constant. This is why the big constant is "pulled" into the dispatch.
>
> I'm wondering why this was not happening before. Or was it happening, but the constant was hoisted out of the region by the greedy pattern rewrite driver (and now it is no longer)?

Yes, you are right. They were hoisted out of the region (maybe by the greedy pattern rewrite driver).

@hanhanW (Contributor, Author) commented Jan 11, 2024

Actually, I don't really know what happened without the integrate; I need more time to understand it. My take matches what you're saying.

I also wonder whether reshape(constant) -> constant should be a folder. It looks dangerous to me now: it could put a constant into a region whose parent op has the IsolatedFromAbove trait (or some trait that disallows constant hoisting; I don't have much context on those), which seems weird to me.

@matthias-springer (Contributor) commented Jan 11, 2024

> I also wonder whether reshape(constant) -> constant should be a folder. It looks dangerous to me now: it could put a constant into a region whose parent op has the IsolatedFromAbove trait (or some trait that disallows constant hoisting; I don't have much context on those), which seems weird to me.

I don't really see how that could happen. If we have a reshape(%cst), we can be sure that %cst is in the same IsolatedFromAbove region as the reshape. Otherwise, the input IR would have been invalid.

Folders never move ops around. The reshape folder just causes an additional constant op to be materialized, right before the reshape op.
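
To illustrate with a hand-written sketch (not actual pass output): the fold only rewrites the reshape's result to a new constant at the same point in the IR; nothing crosses a region boundary.

    // Before folding:
    %cst = arith.constant dense<1.000000e+00> : tensor<1x1x2304xf32>
    %r = tensor.collapse_shape %cst [[0, 1, 2]] : tensor<1x1x2304xf32> into tensor<2304xf32>

    // After folding: a new constant is materialized where the reshape was;
    // the original constant remains and is cleaned up later if unused.
    %cst = arith.constant dense<1.000000e+00> : tensor<1x1x2304xf32>
    %r = arith.constant dense<1.000000e+00> : tensor<2304xf32>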

Do you have any rewrite patterns in IREE that move ops (in particular constant ops) from one region to another? Before llvm/llvm-project#75897, moving a constant op to a different region brought internal data structures in the greedy pattern rewrite driver into an inconsistent state. This could have caused an op to be hoisted by accident by OperationFolder (which maintains a mapping between Operation* and Region*). (Just speculating...)

@hanhanW (Contributor, Author) commented Jan 11, 2024

> I don't really see how that could happen. If we have a reshape(%cst), we can be sure that %cst is in the same IsolatedFromAbove region as the reshape. Otherwise, the input IR would have been invalid.

Very good point!

> Do you have any rewrite patterns in IREE that move ops (in particular constant ops) from one region to another? Before llvm/llvm-project#75897, moving a constant op to a different region brought internal data structures in the greedy pattern rewrite driver into an inconsistent state. This could have caused an op to be hoisted by accident by OperationFolder (which maintains a mapping between Operation* and Region*). (Just speculating...)

Well, I don't know. Maybe @MaheshRavishankar can answer.

@hanhanW (Contributor, Author) commented Jan 11, 2024

@MaheshRavishankar: 77b777c fixes the regression. Can you help review?

@hanhanW (Contributor, Author) commented Jan 11, 2024

@matthias-springer, are you able to help fix test-with-listener.mlir? The REPLACED and REMOVED strings are not generated anymore. I think this is impacted by the upstream commit as well; it uses applyOpPatternsAndFold here.

To repro:

    cmake --build . --target check-iree-dialects

    # or

    build/llvm-project/bin/iree-dialects-opt --test-listener-canonicalize='listener=1' llvm-external-projects/iree-dialects/test/Transforms/test-with-listener.mlir

@matthias-springer (Contributor) commented Jan 11, 2024

That is a problem with the applyOpPatternsAndFold version of the greedy pattern rewrite driver. The regular entry point (applyPatternsAndFoldGreedily) runs multiple iterations of pattern application until a fixpoint is reached; we don't do that at the moment for applyOpPatternsAndFold. Combined with the fact that we don't put operands/results back on the worklist when an op is modified, this means that, depending on the order in which patterns are processed, we may see fewer or additional pattern application opportunities.

This should be fixed in the greedy pattern rewrite driver; we should do something similar for the applyOpPatternsAndFold entry point.

A simple fix for now could be to invert the order of the operands in the test case; can you give that a try?

    func.func @test_canonicalize(%arg0: i32) -> (i32, i32) {
      // CANON: REPLACED arith.addi
      // CANON: REMOVED arith.addi
      %c5 = arith.constant -5 : i32
      %0 = arith.addi %arg0, %c5 : i32
      %1 = arith.addi %0, %c5 : i32
      return %0, %1 : i32, i32
    }

I think this was already "broken" before my CSE change; it was just not triggered because more notifications were sent around constants.

@hanhanW (Contributor, Author) commented Jan 11, 2024

> That is a problem with the applyOpPatternsAndFold version of the greedy pattern rewrite driver. The regular entry point (applyPatternsAndFoldGreedily) runs multiple iterations of pattern application until a fixpoint is reached; we don't do that at the moment for applyOpPatternsAndFold. Combined with the fact that we don't put operands/results back on the worklist when an op is modified, this means that, depending on the order in which patterns are processed, we may see fewer or additional pattern application opportunities.
>
> This should be fixed in the greedy pattern rewrite driver; we should do something similar for the applyOpPatternsAndFold entry point.
>
> A simple fix for now could be to invert the order of the operands in the test case; can you give that a try?
>
>     func.func @test_canonicalize(%arg0: i32) -> (i32, i32) {
>       // CANON: REPLACED arith.addi
>       // CANON: REMOVED arith.addi
>       %c5 = arith.constant -5 : i32
>       %0 = arith.addi %arg0, %c5 : i32
>       %1 = arith.addi %0, %c5 : i32
>       return %0, %1 : i32, i32
>     }
>
> I think this was already "broken" before my CSE change; it was just not triggered because more notifications were sent around constants.

That seems to work, thanks!

@hanhanW (Contributor, Author) commented Jan 11, 2024

The MobileBertSquad_fp16 regression on the Pixel GPU looks like noise, because the total dispatch size is unchanged.

@hanhanW (Contributor, Author) commented Jan 11, 2024

I thought the review comments would show up in the PR, but it seems they don't. Mahesh already reviewed 77b777c and it looks good to him.

hanhanW merged commit e32a502 into main on Jan 11, 2024 (65 checks passed).
hanhanW deleted the integrate-llvm-20240108 branch on January 11, 2024 18:07.