Integrate LLVM at llvm/llvm-project@f7b2c2e4 #18143

Merged Aug 7, 2024 (2 commits)

Conversation

hanhanW (Contributor) commented Aug 7, 2024

No description provided.

Signed-off-by: hanhanW <hanhan0912@gmail.com>

github-actions bot commented Aug 7, 2024

Abbreviated Benchmark Summary

@ commit 473cbe7ec997040d4419af91d44835c757b65e67 (vs. base 4716f685cd74b09d85ec8b3ee515d6dcb72b8e2c)

Data-Tiling Comparison Table

| Name | No-DT (baseline) | DT-Only | DT-UK |
| --- | --- | --- | --- |
| BertLargeTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 738.422 (1.0X) | 275.534 (2.7X) | 227.718 (3.2X) |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 6.997 (1.0X) | 9.329 (0.8X) | 8.529 (0.8X) |
| EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 36.054 (1.0X) | 36.194 (1.0X) | 34.569 (1.0X) |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 5.848 (1.0X) | 11.040 (0.5X) | 5.064 (1.2X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 9.227 (1.0X) | 8.585 (1.1X) | 8.541 (1.1X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 11.074 (1.0X) | 9.098 (1.2X) | 8.991 (1.2X) |
| MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 12.223 (1.0X) | 15.519 (0.8X) | 13.860 (0.9X) |
| MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 33.758 (1.0X) | 65.199 (0.5X) | 61.267 (0.6X) |
| MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 33.925 (1.0X) | 65.579 (0.5X) | 61.726 (0.5X) |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 69.237 (1.0X) | 133.958 (0.5X) | 64.428 (1.1X) |
| MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 4.911 (1.0X) | 5.351 (0.9X) | 4.627 (1.1X) |
| MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 3.764 (1.0X) | 5.373 (0.7X) | 4.957 (0.8X) |
| MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 5.910 (1.0X) | 9.590 (0.6X) | 5.465 (1.1X) |
| MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 2.867 (1.0X) | 3.409 (0.8X) | 2.841 (1.0X) |
| MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 8.531 (1.0X) | 11.038 (0.8X) | 9.950 (0.9X) |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.786 (1.0X) | 1.394 (0.6X) | 0.659 (1.2X) |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 4.198 (1.0X) | 5.889 (0.7X) | 5.308 (0.8X) |
| matmul_256x256x2048_i8_i4_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 7.567 (1.0X) | 7.583 (1.0X) | 7.580 (1.0X) |
| matmul_256x256x2048_i8_i8_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 6.689 (1.0X) | 13.295 (0.5X) | 1.806 (3.7X) |
| BertForMaskedLMTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 217.299 (1.0X) | 135.535 (1.6X) | 107.212 (2.0X) |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 32.324 (1.0X) | 36.465 (0.9X) | 29.970 (1.1X) |
| EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 279.673 (1.0X) | 257.720 (1.1X) | 230.050 (1.2X) |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 26.945 (1.0X) | 51.522 (0.5X) | 13.136 (2.1X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 70.023 (1.0X) | 38.756 (1.8X) | 37.833 (1.9X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 89.303 (1.0X) | 41.790 (2.1X) | 39.753 (2.2X) |
| MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 80.984 (1.0X) | 76.935 (1.1X) | 57.293 (1.4X) |
| MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 179.308 (1.0X) | 248.123 (0.7X) | 186.410 (1.0X) |
| MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 181.628 (1.0X) | 254.995 (0.7X) | 191.739 (0.9X) |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 520.353 (1.0X) | 1088.598 (0.5X) | 244.113 (2.1X) |
| MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 25.381 (1.0X) | 22.651 (1.1X) | 17.914 (1.4X) |
| MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 11.943 (1.0X) | 14.818 (0.8X) | 11.604 (1.0X) |
| MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 21.688 (1.0X) | 42.397 (0.5X) | 11.928 (1.8X) |
| MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 2.773 (1.0X) | 3.304 (0.8X) | 2.660 (1.0X) |
| MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 34.495 (1.0X) | 39.706 (0.9X) | 31.457 (1.1X) |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.719 (1.0X) | 1.301 (0.6X) | 0.581 (1.2X) |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 18.062 (1.0X) | 23.752 (0.8X) | 19.486 (0.9X) |
| matmul_1x256x2048_i8_i4_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.055 (1.0X) | 0.055 (1.0X) | 0.055 (1.0X) |
| matmul_1x256x2048_i8_i8_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.043 (1.0X) | 0.226 (0.2X) | 0.022 (2.0X) |
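
The parenthesized multipliers in the table appear to be the No-DT baseline value divided by each configuration's value, rounded to one decimal place. That reading is an assumption inferred from the numbers; the benchmark tooling itself is not shown here, and it may compute the ratio from unrounded measurements. A minimal Python sketch, with `speedup_label` as a hypothetical helper name:

```python
def speedup_label(baseline: float, measured: float) -> str:
    """Format baseline/measured the way the table's "(N.NX)" labels look.

    Assumption: values above 1.0X mean the configuration beat the baseline.
    """
    return f"{baseline / measured:.1f}X"


# BertLargeTF row: No-DT 738.422, DT-Only 275.534, DT-UK 227.718
print(speedup_label(738.422, 275.534))  # 2.7X
print(speedup_label(738.422, 227.718))  # 3.2X
# matmul_256x256x2048_i8_i8_i32 row: DT-Only regresses, DT-UK improves
print(speedup_label(6.689, 13.295))     # 0.5X
print(speedup_label(6.689, 1.806))      # 3.7X
```

Read this way, a label below 1.0X (for example the DT-Only column of the int8 models) marks a configuration that is slower than the baseline.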

No improved or regressed benchmarks 🏖️

No improved or regressed compilation metrics 🏖️

For more information:

Source Workflow Run

Signed-off-by: hanhanW <hanhan0912@gmail.com>
hanhanW requested a review from benvanik as a code owner on August 7, 2024 19:01
Comment on lines +88 to +92
  // CHECK: ^bb2
  ^bb2(%bb2_0: !stream.resource<*>, %bb2_1: !stream.resource<*>):
  // CHECK-NOT: stream.async.transfer
  %external_transfer = stream.async.transfer %bb2_1 : !stream.resource<*>{%size} -> !stream.resource<external>{%size}
- // CHECK: util.return %[[BB2_ARG_FILL0]], %[[BB2_ARG_SELECT]] : !stream.resource<transient>, !stream.resource<external>
+ // CHECK: util.return %[[FILL0]], %[[SELECT]] : !stream.resource<transient>, !stream.resource<external>
Contributor Author

I confirmed that llvm/llvm-project@441b672 breaks the test. The new output looks okay to me.

Input IR:

util.func private @propagateBlocks(%cond: i1, %size: index) -> (!stream.resource<*>, !stream.resource<external>) {
  %c0 = arith.constant 0 : index
  %c128 = arith.constant 128 : index
  %c123_i32 = arith.constant 123 : i32
  %c456_i32 = arith.constant 456 : i32
  %splat0 = stream.async.splat %c123_i32 : i32 -> !stream.resource<*>{%size}
  %splat1 = stream.async.splat %c456_i32 : i32 -> !stream.resource<*>{%size}
  cf.br ^bb1(%splat0, %splat1 : !stream.resource<*>, !stream.resource<*>)
^bb1(%bb1_0: !stream.resource<*>, %bb1_1: !stream.resource<*>):
  %clone0 = stream.async.clone %bb1_0 : !stream.resource<*>{%size} -> !stream.resource<*>{%size}
  %fill0 = stream.async.fill %c123_i32, %clone0[%c0 to %c128 for %c128] : i32 -> !stream.resource<*>{%size}
  %clone1 = stream.async.clone %bb1_1 : !stream.resource<*>{%size} -> !stream.resource<*>{%size}
  %fill1 = stream.async.fill %c456_i32, %clone1[%c0 to %c128 for %c128] : i32 -> !stream.resource<*>{%size}
  %bb1_1_new = arith.select %cond, %splat1, %fill1 : !stream.resource<*>
  cf.cond_br %cond, ^bb1(%fill0, %bb1_1_new : !stream.resource<*>, !stream.resource<*>),
                 ^bb2(%fill0, %bb1_1_new : !stream.resource<*>, !stream.resource<*>)
^bb2(%bb2_0: !stream.resource<*>, %bb2_1: !stream.resource<*>):
  %external_transfer = stream.async.transfer %bb2_1 : !stream.resource<*>{%size} -> !stream.resource<external>{%size}
  util.return %bb2_0, %external_transfer : !stream.resource<*>, !stream.resource<external>
}

The original output:

  util.func private @propagateBlocks(%arg0: i1, %arg1: index) -> (!stream.resource<transient>, !stream.resource<external>) {
    %c0 = arith.constant 0 : index
    %c128 = arith.constant 128 : index
    %c123_i32 = arith.constant 123 : i32
    %c456_i32 = arith.constant 456 : i32
    %0 = stream.async.splat %c123_i32 : i32 -> !stream.resource<transient>{%arg1}
    %1 = stream.async.splat %c456_i32 : i32 -> !stream.resource<external>{%arg1}
    cf.br ^bb1(%0, %1 : !stream.resource<transient>, !stream.resource<external>)
  ^bb1(%2: !stream.resource<transient>, %3: !stream.resource<external>):  // 2 preds: ^bb0, ^bb1
    %4 = stream.async.clone %2 : !stream.resource<transient>{%arg1} -> !stream.resource<transient>{%arg1}
    %5 = stream.async.fill %c123_i32, %4[%c0 to %c128 for %c128] : i32 -> %4 as !stream.resource<transient>{%arg1}
    %6 = stream.async.clone %3 : !stream.resource<external>{%arg1} -> !stream.resource<external>{%arg1}
    %7 = stream.async.fill %c456_i32, %6[%c0 to %c128 for %c128] : i32 -> %6 as !stream.resource<external>{%arg1}
    %8 = arith.select %arg0, %1, %7 : !stream.resource<external>
    cf.cond_br %arg0, ^bb1(%5, %8 : !stream.resource<transient>, !stream.resource<external>), ^bb2(%5, %8 : !stream.resource<transient>, !stream.resource<external>)
  ^bb2(%9: !stream.resource<transient>, %10: !stream.resource<external>):  // pred: ^bb1
    util.return %9, %10 : !stream.resource<transient>, !stream.resource<external>
  }

New output:

  util.func private @propagateBlocks(%arg0: i1, %arg1: index) -> (!stream.resource<transient>, !stream.resource<external>) {
    %c0 = arith.constant 0 : index
    %c128 = arith.constant 128 : index
    %c123_i32 = arith.constant 123 : i32
    %c456_i32 = arith.constant 456 : i32
    %0 = stream.async.splat %c123_i32 : i32 -> !stream.resource<transient>{%arg1}
    %1 = stream.async.splat %c456_i32 : i32 -> !stream.resource<external>{%arg1}
    cf.br ^bb1(%0, %1 : !stream.resource<transient>, !stream.resource<external>)
  ^bb1(%2: !stream.resource<transient>, %3: !stream.resource<external>):  // 2 preds: ^bb0, ^bb1
    %4 = stream.async.clone %2 : !stream.resource<transient>{%arg1} -> !stream.resource<transient>{%arg1}
    %5 = stream.async.fill %c123_i32, %4[%c0 to %c128 for %c128] : i32 -> %4 as !stream.resource<transient>{%arg1}
    %6 = stream.async.clone %3 : !stream.resource<external>{%arg1} -> !stream.resource<external>{%arg1}
    %7 = stream.async.fill %c456_i32, %6[%c0 to %c128 for %c128] : i32 -> %6 as !stream.resource<external>{%arg1}
    %8 = arith.select %arg0, %1, %7 : !stream.resource<external>
    cf.cond_br %arg0, ^bb1(%5, %8 : !stream.resource<transient>, !stream.resource<external>), ^bb2
  ^bb2:  // pred: ^bb1
    util.return %5, %8 : !stream.resource<transient>, !stream.resource<external>
  }

I think we can drop the ^bb2 arguments in this case: ^bb2 has a single predecessor (^bb1), so it can use %5 and %8 directly. @benvanik does the fix make sense?
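
The difference between the two outputs can be sketched as a small CFG rewrite: a block with exactly one incoming edge does not need block arguments, because it can use the forwarded values directly. The Python below is an illustrative model only. `Block` and `drop_single_pred_args` are hypothetical names, not MLIR APIs, and this is not a claim about how the upstream LLVM change is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Toy basic block: parameters, ops as token lists, and outgoing edges."""
    params: list[str]
    ops: list[list[str]]
    # Each successor is (target block name, values forwarded to its params).
    succs: list[tuple[str, list[str]]] = field(default_factory=list)


def drop_single_pred_args(blocks: dict[str, Block]) -> None:
    """Inline forwarded values into blocks that have exactly one incoming edge."""
    incoming: dict[str, list[tuple[str, int]]] = {name: [] for name in blocks}
    for name, block in blocks.items():
        for i, (target, _) in enumerate(block.succs):
            incoming[target].append((name, i))

    for name, block in blocks.items():
        edges = incoming[name]
        # Skip merge points, blocks without params, and self-loops.
        if len(edges) != 1 or not block.params or edges[0][0] == name:
            continue
        pred, idx = edges[0]
        rename = dict(zip(block.params, blocks[pred].succs[idx][1]))
        # SSA names are unique, so renaming across all blocks is safe.
        for other in blocks.values():
            other.ops = [[rename.get(t, t) for t in op] for op in other.ops]
            other.succs = [(tgt, [rename.get(v, v) for v in vals])
                           for tgt, vals in other.succs]
        block.params = []
        blocks[pred].succs[idx] = (name, [])


# Shape of @propagateBlocks from the original output (op bodies elided).
cfg = {
    "bb0": Block([], [], [("bb1", ["%0", "%1"])]),
    "bb1": Block(["%2", "%3"], [],
                 [("bb1", ["%5", "%8"]), ("bb2", ["%5", "%8"])]),
    "bb2": Block(["%9", "%10"], [["util.return", "%9", "%10"]]),
}
drop_single_pred_args(cfg)
print(cfg["bb2"].ops)       # [['util.return', '%5', '%8']]
print(cfg["bb1"].succs[1])  # ('bb2', [])
```

In this model ^bb1 keeps its arguments because it has two predecessors (^bb0 and itself), which matches the new output above.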

benvanik (Collaborator) commented Aug 7, 2024

SGTM!

hanhanW (Contributor, Author) commented Aug 7, 2024

The W7900 job failed because of #18140. The integrate PR itself is good, so I'm landing it.

hanhanW merged commit 352e05f into main on Aug 7, 2024
52 of 53 checks passed
hanhanW deleted the integrates/llvm-20240807 branch on August 7, 2024 20:38