Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop tile sizes specific to the ukernels-disabled case. #17631

Merged
merged 1 commit into from
Jun 18, 2024

Conversation

lialan
Copy link
Contributor

@lialan lialan commented Jun 10, 2024

This patch is to make sure the matmul tiles selected for ARM64 to have same dimensions for arm64 ukernels.

Copy link
Contributor

@bjacob bjacob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - I have checked that this change aligns with the ground truth in https://github.com/iree-org/iree/blob/main/runtime/src/iree/builtins/ukernel/arch/arm_64/mmt4d_arm_64_tiles.inl

@jpienaar @mariecwhite , the existing discrepancy (which this PR fixes) meant that Marie's microkernels were not being used (unless you patched this file like the current PR does). I don't know if that was intentional, but now that the mmt4d ukernel is enabled by default on arm64, this discrepancy started causing this 4-bit path to fall off of a very steep performance cliff: from the ukernel fast path (which Marie wrote) to the very slow generic fallback within the ukernel. For these reasons, I don't really think there can be any other thing to do here but merged this PR, but FYI.

@jpienaar
Copy link
Member

Good catch! Marie knows these much better, but I agree with conclusion

Copy link
Contributor

@hanhanW hanhanW left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we able to delete either line 117 to line 159 or line 161 to line 198? I think we don't want the hasUkernel() statement in the if condition anymore.

@bjacob
Copy link
Contributor

bjacob commented Jun 10, 2024

Ooh sorry I had missed that part! So my above comment was incorrect: this code was not causing falling off of the ukernel fast paths, since it was not taken when ukernels are enabled.

But good point that we don't need separate ukernel vs non-ukernel paths here anymore. Since ukernels are default, we don't care much anymore about the non-ukernel fine-tuning (and if we did, the best thing to do would be to work on improving codegen on the same tile shapes as are being selected.

So: resolve this by inlining all instances of hasUkernel in this function as true, i.e. always select tile sizes as if having ukernels enabled. (So when propagating that, that will result in dropping one of the two code paths as suggested by Hanhan).

@bjacob bjacob self-requested a review June 11, 2024 00:58
@lialan lialan force-pushed the lialan/update_tiles branch 2 times, most recently from 0933238 to c4849a3 Compare June 11, 2024 14:00
@bjacob bjacob changed the title Make CPU tile selection same as ukernel configurations Drop Non-ukernel-specific tile sizes. Jun 11, 2024
@bjacob bjacob changed the title Drop Non-ukernel-specific tile sizes. Drop tile sizes specific to the ukernels-disabled case. Jun 11, 2024
Signed-off-by: Alan Li <me@alanli.org>
Copy link
Contributor

@hanhanW hanhanW left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@bjacob bjacob merged commit 3461314 into iree-org:main Jun 18, 2024
51 checks passed
@ScottTodd
Copy link
Collaborator

FYI this appears to have regressed performance on at least one continuous benchmark configuration: https://perf.iree.dev/serie?IREE?a57eb6fb2cf6fbf61f9141e63784cc356153766042d0890e6caee565eadcbf21 (~1200ms -> ~2000ms)

The compile flags for that benchmark are here:

iree_bytecode_module(
NAME "iree-module-Vit_int8_tflite___armv8.2-a-generic-linux_android29-llvm_cpu__experimental-flags_dt-only_"
SRC "${ROOT_ARTIFACTS_DIR}/iree_Vit_int8_tflite_.mlir"
MODULE_FILE_NAME "${ROOT_ARTIFACTS_DIR}/iree_module_Vit_int8_tflite___armv8.2-a-generic-linux_android29-llvm_cpu__experimental-flags_dt-only_/module.vmfb"
FLAGS
"--iree-hal-target-backends=llvm-cpu"
"--iree-input-type=tosa"
"--iree-llvmcpu-target-triple=aarch64-none-linux-android29"
"--iree-opt-data-tiling=true"
"--iree-llvmcpu-enable-ukernels=none"
"--iree-llvmcpu-target-cpu-features=+dotprod"
FRIENDLY_NAME "Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu][experimental-flags,dt-only]"
PUBLIC
)

What is the latest status for this flag combination? Is it useful to keep the experimental-flags benchmarks at this point?

     "--iree-opt-data-tiling=true" 
     "--iree-llvmcpu-enable-ukernels=none" 

This similar benchmark https://perf.iree.dev/serie?IREE?3263426782173c417a4205ee460ccf4acb939c53397da8ae06f8ebf3f7228f87 using these flags is stable at 317ms:

iree_bytecode_module(
NAME "iree-module-Vit_int8_tflite___armv8.2-a-generic-linux_android29-llvm_cpu__default-flags_dt-uk_"
SRC "${ROOT_ARTIFACTS_DIR}/iree_Vit_int8_tflite_.mlir"
MODULE_FILE_NAME "${ROOT_ARTIFACTS_DIR}/iree_module_Vit_int8_tflite___armv8.2-a-generic-linux_android29-llvm_cpu__default-flags_dt-uk_/module.vmfb"
FLAGS
"--iree-hal-target-backends=llvm-cpu"
"--iree-input-type=tosa"
"--iree-llvmcpu-target-triple=aarch64-none-linux-android29"
"--iree-opt-data-tiling=true"
"--iree-llvmcpu-enable-ukernels=all"
"--iree-llvmcpu-target-cpu-features=+dotprod"
FRIENDLY_NAME "Vit_int8(tflite) [armv8.2-a-generic-linux_android29-llvm_cpu][default-flags,dt-uk]"
PUBLIC
)

@bjacob
Copy link
Contributor

bjacob commented Jun 24, 2024

Thanks @ScottTodd for flagging this. In this instance, we want to live with this regression because (1) it only affects a non-default path and (2) it only exposes an existing flaw in that non-default path (which was being papered over by the thing that this PR removed).

DT-only is now a non-default configuration on x86-64 and arm-64 CPU, since the mmt4d ukernel was enabled by default there, making DT_UK the new default.

The fact that DT_UK does 317 ms where DT_only used to do 1200 ms with the tweaked tiled size and now does 2000 ms without the tweaked tile size, just means that codegen is doing a very poor job. There is no fundamental reason why codegen would need a different tile size than the ukernel case, so the currently 2000 ms codegen path could be optimized to become as fast as the 317 ms ukernel path, if anyone cares --- so there'll never be a need for separate tile sizes for codegen.

I don't know though that anyone still cares about pure codegen for mmt4d ops. It would be neat to not need ukernels, but having ukernels on by default makes it a non-priority for some of us.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants