-
Notifications
You must be signed in to change notification settings - Fork 556
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Drop tile sizes specific to the ukernels-disabled case. #17631
Conversation
485cabc
to
d205f16
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - I have checked that this change aligns with the ground truth in https://github.com/iree-org/iree/blob/main/runtime/src/iree/builtins/ukernel/arch/arm_64/mmt4d_arm_64_tiles.inl
@jpienaar @mariecwhite , the existing discrepancy (which this PR fixes) meant that Marie's microkernels were not being used (unless you patched this file like the current PR does). I don't know if that was intentional, but now that the mmt4d ukernel is enabled by default on arm64, this discrepancy started causing this 4-bit path to fall off of a very steep performance cliff: from the ukernel fast path (which Marie wrote) to the very slow generic fallback within the ukernel. For these reasons, I don't really think there can be any other thing to do here but merged this PR, but FYI.
Good catch! Marie knows these much better, but I agree with conclusion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we able to delete either line 117 to line 159 or line 161 to line 198? I think we don't want the hasUkernel()
statement in the if condition anymore.
Ooh sorry I had missed that part! So my above comment was incorrect: this code was not causing falling off of the ukernel fast paths, since it was not taken when ukernels are enabled. But good point that we don't need separate ukernel vs non-ukernel paths here anymore. Since ukernels are default, we don't care much anymore about the non-ukernel fine-tuning (and if we did, the best thing to do would be to work on improving codegen on the same tile shapes as are being selected. So: resolve this by inlining all instances of |
compiler/src/iree/compiler/Codegen/Common/CPU/CPUMaterializeEncodingPass.cpp
Show resolved
Hide resolved
5d2e553
to
3dbe2e4
Compare
compiler/src/iree/compiler/Codegen/Common/CPU/CPUMaterializeEncodingPass.cpp
Show resolved
Hide resolved
0933238
to
c4849a3
Compare
compiler/src/iree/compiler/Codegen/Common/CPU/CPUMaterializeEncodingPass.cpp
Show resolved
Hide resolved
Signed-off-by: Alan Li <me@alanli.org>
c4849a3
to
0439758
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
FYI this appears to have regressed performance on at least one continuous benchmark configuration: https://perf.iree.dev/serie?IREE?a57eb6fb2cf6fbf61f9141e63784cc356153766042d0890e6caee565eadcbf21 (~1200ms -> ~2000ms) The compile flags for that benchmark are here: iree/tests/e2e/test_artifacts/generated_e2e_test_iree_artifacts.cmake Lines 1673 to 1686 in 024c48b
What is the latest status for this flag combination? Is it useful to keep the
This similar benchmark https://perf.iree.dev/serie?IREE?3263426782173c417a4205ee460ccf4acb939c53397da8ae06f8ebf3f7228f87 using these flags is stable at 317ms: iree/tests/e2e/test_artifacts/generated_e2e_test_iree_artifacts.cmake Lines 1583 to 1596 in 024c48b
|
Thanks @ScottTodd for flagging this. In this instance, we want to live with this regression because (1) it only affects a non-default path and (2) it only exposes an existing flaw in that non-default path (which was being papered over by the thing that this PR removed). DT-only is now a non-default configuration on x86-64 and arm-64 CPU, since the mmt4d ukernel was enabled by default there, making DT_UK the new default. The fact that DT_UK does 317 ms where DT_only used to do 1200 ms with the tweaked tiled size and now does 2000 ms without the tweaked tile size, just means that codegen is doing a very poor job. There is no fundamental reason why codegen would need a different tile size than the ukernel case, so the currently 2000 ms codegen path could be optimized to become as fast as the 317 ms ukernel path, if anyone cares --- so there'll never be a need for separate tile sizes for codegen. I don't know though that anyone still cares about pure codegen for mmt4d ops. It would be neat to not need ukernels, but having ukernels on by default makes it a non-priority for some of us. |
This patch is to make sure the matmul tiles selected for ARM64 to have same dimensions for arm64 ukernels.