LowerToUKernels should create a codegen fallback when the ukernel would be slow. #15784
This is in line with a project that @MaheshRavishankar and I had discussed. We want to provide a feature similar to this: being able to generate multiple executables for a single dispatch and pick the proper kernel with simple checks. We have made little progress on it so far - we split the lowering-strategy selection logic out from the whole dynamic pipeline. There are still ambiguities about what it would look like on the codegen and runtime sides. If this becomes a priority, I can study it and start a design, maybe in early 2024.
This is more about doing it at the lower level - which is required for some scenarios and really what we should be focusing on.
and instead want:
That is to say: trying to do anything that requires different host behavior should raise a red flag and be very carefully evaluated. Unless you can rationalize making an entire copy of the .so we generate (spending the time to compile it, embedding it in vmfbs, and deploying it to all users), host-selection changes are generally out. Doing things within the dispatches is far easier to build/maintain, far more flexible, and the default direction to take. Host-level changes are for big things where all (or a large number of) dispatches in the program change - for example, different CPU architectures - and is accomplished via [...]. Before you build anything we should all sync up and ensure it's either using the infra we have or moving in this direction.
@hanhanW I think what @benvanik suggested is what I was trying to describe. Sorry if that didn't come across clearly (thanks @benvanik for the nice summary!). So this would be purely codegen-side changes, but it allows us to generalize to target a wider range of computation within the same dispatch...
Thanks Ben! You already gave me a rough answer about what it should look like on the host side and device side; it matches what I prefer! I'm not going to build anything before we discuss with broader bandwidth and have some agreement. I just wanted to point out that we should escalate it from [...]
I had a wrong understanding about the slow fallback. I thought that we had a ukernel whose implementation is scalar. IIUC, what's happening is that we fall back to the scalar codegen pipeline when ukernels are not available. Tiling and vectorization do not happen because they are excluded; then we bufferize and convert linalg ops to loops in [...]

One short fix is that we can "improve" the current fallback: apply the tiling and vectorization passes unconditionally after the LowerToUKernels pass. It does not hurt anything. If mmt4d ops are converted to ukernels, these passes are no-ops; if no ukernels are picked up, we can still tile and vectorize the ops.

```cpp
if (enableMicrokernels) {
  nestedModulePM.addNestedPass<func::FuncOp>(
      createDecomposeBatchMmt4DOpsPass());
  nestedModulePM.addPass(
      createCPULowerToUKernelsPass(clSkipIntermediateRoundings));
}
nestedModulePM.addNestedPass<func::FuncOp>(createLLVMCPUTileAndFusePass(
    static_cast<int64_t>(tilingConfig.getVectorCommonParallelLevel())));
nestedModulePM.addNestedPass<func::FuncOp>(createLLVMCPUTilePass(
    static_cast<int64_t>(tilingConfig.getVectorReductionLevel())));
nestedModulePM.addNestedPass<func::FuncOp>(
    createGenericVectorizationPass());
nestedModulePM.addNestedPass<func::FuncOp>(
    createHoistRedundantVectorTransfersPass());
```
Has this been fixed by #15883?
Circling back now that I have the context switched back in. #15883 addresses part of this, but isn't a complete solution by itself. In cases where the lowering-to-ukernel fails, #15883 is the full solution. The remaining problem is what this issue's original description was about: the case where the lowering-to-ukernel succeeds, and then at runtime the ukernel happens to use slow generic code.

There are a few reasons why that would happen. The availability of architecture-specific optimized code depends on a number of factors: not just element types, but also tile shapes, target architecture, and target CPU features. So far, the only one of these factors that is fully accounted for in the compiler-side lowering-to-ukernel logic is the element types. Reflecting all of these factors in the compiler logic would be a very fragile duplication of logic, bound to diverge, i.e. regress. That's why this issue's description originally called for an additional ukernel entry point returning a ground-truth answer to "would this actually be fast?".

Having swapped all this context back in, I think that we should really do this now. The change in #15883 will be really helpful with that. I hadn't originally thought of that aspect, but if I had just followed what I originally described above, the codegen fallback would have been slow. #15883 ensures that the codegen fallback will be fast, when we have it.
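To make the "would this actually be fast" idea concrete, here is a minimal C++ sketch of what such a query entry point could consult. All names (`iree_uk_mmt4d_have_fast_path`, `Mmt4dParams`, the feature bit, and the specific tile shapes) are hypothetical illustrations, not the real IREE ukernel API; the point is that the answer depends on element types, tile shape, and CPU features together, which only the ukernel library itself can answer authoritatively.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical element types for the mmt4d operands.
enum class ElemType { F32, I8, BF16 };

// Hypothetical parameter bundle for the query.
struct Mmt4dParams {
  ElemType lhs_type;
  ElemType rhs_type;
  int64_t m0, n0, k0;     // tile shape
  uint64_t cpu_features;  // bitmask of detected CPU features
};

constexpr uint64_t kFeatureDotprod = 1ull << 0;  // placeholder feature bit

// Returns true only when an architecture-specific optimized tile function
// would be used for these parameters; otherwise the ukernel would fall back
// on the slow generic scalar path and codegen should be preferred.
bool iree_uk_mmt4d_have_fast_path(const Mmt4dParams& p) {
  // Illustrative entries mirroring the kind of per-architecture dispatch
  // table the real ukernels consult internally (values are made up).
  if (p.lhs_type == ElemType::F32 && p.rhs_type == ElemType::F32 &&
      p.m0 == 8 && p.n0 == 8 && p.k0 == 1) {
    return true;  // assume an optimized f32 8x8x1 kernel exists
  }
  if (p.lhs_type == ElemType::I8 && p.rhs_type == ElemType::I8 &&
      p.m0 == 8 && p.n0 == 8 && p.k0 == 4) {
    // assume the i8 kernel additionally requires a dotprod CPU feature
    return (p.cpu_features & kFeatureDotprod) != 0;
  }
  return false;  // anything else: only the generic fallback exists
}
```

Duplicating this table in the compiler is exactly the fragile mirroring the comment above warns about; exposing it as a ukernel entry point keeps a single source of truth.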
I made a prototype implementing the idea in the original issue description here. It doesn't work as-is. I discussed it with @MaheshRavishankar, who kindly brought me up to speed on the reasons why it just won't work. It's all in https://gist.github.com/bjacob/80ab8936c2416cdd41516a59f001f8b6
So it sounds like a no-side-effects query function called at the top of the function (outside all the loops), switching the entire loop nest, is probably the way to go? Worst case, you do:

```mlir
func @entry(...) {
  %have_ukernel = iree_codegen.ukernel.is_supported "iree_uk_mmt4d" <whatever metadata you need>
  scf.if %have_ukernel {
    call @using_ukernel(...)
  } else {
    call @not_using_ukernel(...)
  }
}
func private @using_ukernel(...) ...
func private @not_using_ukernel(...) ...
```

(That is_supported becomes an extern call just like a ukernel.generic does, but without needing to pass buffers/etc. - just dims/element types/etc.)
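As a rough C++ analogue of the MLIR sketch above (all names hypothetical, and the query here is a local stub rather than the real extern symbol): the query takes only dims/element types - no buffers - and is evaluated once before any loop nest, selecting the whole path.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Stub query: pretend only the 8x8x1 tile shape has an optimized ukernel.
// In the real design this would be the extern is_supported call.
bool iree_uk_mmt4d_is_supported(int64_t m0, int64_t n0, int64_t k0) {
  return m0 == 8 && n0 == 8 && k0 == 1;
}

// Dispatch entry: branch once, outside all loops, between the two bodies.
// Returns which path was taken so the behavior is observable in this sketch.
std::string entry(int64_t m0, int64_t n0, int64_t k0) {
  if (iree_uk_mmt4d_is_supported(m0, n0, k0)) {
    return "using_ukernel";  // would call the ukernel.generic path
  }
  return "not_using_ukernel";  // would call the codegen fallback
}
```

Because the query is side-effect free and depends only on values known at the top of the function, the unused branch is trivially dead code for later LLVM optimization once the query folds to a constant.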
🤯 and more discussion here: https://discord.com/channels/689900678990135345/689906000043573354/1208149147220316210
This will be useful to introduce the mmt4d-query ukernel for #15784, as the default calling convention allows for scalar return values. This is also a cleanup in its own right: at this point, ukernels only use ParameterStruct calling convention for historical reasons.
Ukernels do not expose whether they have fast code to handle a given case. They always succeed, potentially falling back on a slow generic tile function that can be 100x slower than optimized code: on top of being scalar code, it is dynamic-tile-size code, since it is a single function that has to adapt to any tile size.
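The cost of that generic fallback can be illustrated with a small sketch (not actual IREE ukernel code): the generic version carries runtime loop bounds, which blocks unrolling and vectorization, while a tile-size-specialized version has compile-time-constant bounds that the compiler can fully unroll and vectorize (real ukernels use hand-written SIMD there).

```cpp
#include <cassert>
#include <cstdint>

// Generic fallback: one function adapting to any m0 x n0 x k0 tile size.
// Bounds are runtime values, so the loops stay scalar and un-unrolled.
void mmt4d_tile_generic(float* out, const float* lhs, const float* rhs,
                        int64_t m0, int64_t n0, int64_t k0) {
  for (int64_t i = 0; i < m0; ++i)
    for (int64_t j = 0; j < n0; ++j)
      for (int64_t k = 0; k < k0; ++k)
        out[i * n0 + j] += lhs[i * k0 + k] * rhs[j * k0 + k];
}

// Specialized tile: bounds are template constants, so the compiler can
// fully unroll and vectorize this loop nest.
template <int M0, int N0, int K0>
void mmt4d_tile_fixed(float* out, const float* lhs, const float* rhs) {
  for (int i = 0; i < M0; ++i)
    for (int j = 0; j < N0; ++j)
      for (int k = 0; k < K0; ++k)
        out[i * N0 + j] += lhs[i * K0 + k] * rhs[j * K0 + k];
}
```

Both compute the same result; the 100x figure quoted above comes from the combination of scalar arithmetic and dynamic bounds in the generic path.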
This has been run into multiple times recently, e.g. #15504, and is the topic of point 2 in @dcaballe's comment in #15760 (comment).
I have considered a few options in e.g. #15504 (comment) but after an offline discussion with @benvanik last week, this is what we are planning to do:
(plus something to avoid this pattern recursing infinitely).
Pro: we'll never run into slow ukernel fallback code again. The fallback will be codegen.
Con: we'll always compile both code paths down to LLVM IR (though one will be DCE'd in late, post-linking LLVM IR optimization passes). This has two downsides: [...]