LowerToUKernels should create a codegen fallback when the ukernel would be slow. #15784

Open
bjacob opened this issue Dec 4, 2023 · 10 comments
Assignees: bjacob
Labels: codegen/llvm · codegen · performance ⚡

bjacob (Contributor) commented Dec 4, 2023

Ukernels do not expose whether they have fast code for a given case. They always succeed, potentially falling back on a slow generic tile function that can be 100x slower than optimized code: on top of being scalar code, it is dynamic-tile-size code, since it is a single function that has to adapt to any tile size.

This has come up multiple times recently (e.g. #15504), and is the topic of point 2 in @dcaballe's comment in #15760 (comment).

I considered a few options, e.g. in #15504 (comment), but after an offline discussion with @benvanik last week, this is what we are planning to do:

  1. Let ukernels expose whether they have fast code for a given case.
  2. Let LowerToUKernels lower ops to something of the form (pseudocode)
  scf.if (have_fast_ukernel_code_for_this_case()) {
    ukernel op...
  } else {
    non-ukernel op...
  }

(plus something to avoid this pattern recursing infinitely; see the sketch below).
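
A minimal sketch of what that lowered form might look like, assuming the iree_codegen.ukernel.is_supported query op that @benvanik sketches later in this thread and a hypothetical attribute on the fallback op so that LowerToUKernels does not match it again; all names here are illustrative, not existing IREE ops:

  // Pseudocode: op and attribute names are placeholders.
  %ok = iree_codegen.ukernel.is_supported "iree_uk_mmt4d" (%m0, %n0, %k0, %flags)
  scf.if %ok {
    // ukernel path, e.g. a ukernel.generic call into iree_uk_mmt4d
  } else {
    // the original op, tagged so the lowering pattern skips it and plain
    // codegen (tiling + vectorization) handles it instead
    linalg.mmt4d {skip_ukernel_lowering} ins(...) outs(...)
  }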

Pro: we'll never run into slow ukernel fallback code again. The fallback will be codegen.
Con: we'll always compile both code paths down to LLVM IR (though one will be DCE'd in late, post-linking LLVM IR optimization passes). This has two downsides:

  1. More compilation time. However, this is in the part of codegen that is well parallelized, not the current serial bottleneck on object-code generation.
  2. Ukernels will no longer be a way to avoid compilation issues in the normal codegen path. But since we want the normal codegen path to work anyway, that's not necessarily a bad thing.
hanhanW (Contributor) commented Dec 4, 2023

This is in line with a project that @MaheshRavishankar and I have discussed. We want to provide a feature similar to this: the ability to generate multiple executables for a single dispatch and pick the proper kernel with simple checks. We have made some progress on it already -- we split the lowering-strategy selection logic out of the whole dynamic pipeline. There are still ambiguities about what it would look like on the codegen and runtime sides. If this becomes a priority, I can study it and start a design, maybe in early 2024.

benvanik (Collaborator) commented Dec 4, 2023

This is more about doing it at the lower level - which is required for some scenarios and really what we should be focusing on.
Wherever possible we want to avoid:

host:
  if (some device feature)
     dispatch(variant_a)
        do thing specialized for device feature
  else if (some other device feature)
     dispatch(variant_b)
        do thing specialized for other device feature

and instead want:

host:
  dispatch(variant_ab)
     if (some device feature)
        do thing specialized for device feature
     else if (some other device feature)
        do thing specialized for other device feature

That is to say: anything that requires different host behavior should raise a red flag and be very carefully evaluated. Unless you can justify making an entire copy of the .so we generate (spending the time to compile it, embedding it in vmfbs, and deploying it to all users), host-selection changes are generally out. Doing things within the dispatches is far easier to build and maintain, far more flexible, and the default direction to take.

Host-level changes are for big things where all (or a large fraction of) dispatches in the program change - for example, different CPU architectures - and are accomplished via hal.executable.condition. They should be extremely rare, especially on CPU. Device-level changes are done via specialization constants (hal.executable.constant.block) and the various feature queries we have within CPU dispatches (DispatchABI::loadProcessorID, DispatchABI::loadProcessorData, etc.).
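
To make the device-level option concrete, here is a rough pseudocode sketch in the same spirit as the host/device comparison above; it assumes the feature bit is read through something like DispatchABI::loadProcessorData, and the query and feature names are purely illustrative:

  // inside a single CPU dispatch (device side):
  cpu_data = load_processor_data()     // e.g. via DispatchABI::loadProcessorData
  if (cpu_data has FEATURE_X)
     do thing specialized for FEATURE_X
  else
     do generic / baseline thing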

Before you build anything we should all sync up and ensure it's either using the infra we have or in this direction.

MaheshRavishankar (Contributor) commented:
@hanhanW I think what @benvanik suggested is what I was trying to describe. Sorry if that didn't come across clearly (thanks @benvanik for the nice summary!). So this would be purely codegen-side changes, but it allows us to generalize to target a wider range of computation within the same dispatch...

hanhanW (Contributor) commented Dec 5, 2023

Thanks Ben! You already gave me a rough answer about how it should look on the host side and the device side, and it matches what I prefer! I'm not going to build anything before we discuss it more broadly and reach some agreement. I just wanted to point out that we should escalate this from LowerToUKernels to the level of generating executables; I can help study it if it becomes a priority and no one else has cycles for it. My plate is full right now, so I can only offer to help kick it off in 2024.

hanhanW (Contributor) commented Dec 11, 2023

I had the wrong understanding of the slow fallback. I thought we had a ukernel whose implementation is scalar code. IIUC, what is actually happening is that we fall back to a scalar codegen pipeline when ukernels are not available.

https://github.com/openxla/iree/blob/0b0476624d554bda0e7837c516de837a4f457b92/compiler/src/iree/compiler/Codegen/LLVMCPU/Passes.cpp#L529-L543

Tiling and vectorization do not happen because they are excluded. Then we bufferize and convert the linalg ops to loops in addLowerToLLVMPasses.

https://github.com/openxla/iree/blob/0b0476624d554bda0e7837c516de837a4f457b92/compiler/src/iree/compiler/Codegen/LLVMCPU/Passes.cpp#L620-L637

One short-term fix is to "improve" the current fallback: apply the tiling and vectorization passes unconditionally after the LowerToUKernels pass. It does not hurt anything. If mmt4d ops are converted to ukernels, those passes are a no-op; if no ukernels are picked up, we can still tile and vectorize the ops, as in the excerpt below.

 // Ukernel lowering remains guarded by the flag...
 if (enableMicrokernels) {
   nestedModulePM.addNestedPass<func::FuncOp>(
       createDecomposeBatchMmt4DOpsPass());
   nestedModulePM.addPass(
       createCPULowerToUKernelsPass(clSkipIntermediateRoundings));
 }
 // ...but tiling and vectorization run unconditionally afterwards, so any op
 // not converted to a ukernel still gets a tiled and vectorized fallback.
 nestedModulePM.addNestedPass<func::FuncOp>(createLLVMCPUTileAndFusePass(
     static_cast<int64_t>(tilingConfig.getVectorCommonParallelLevel())));
 nestedModulePM.addNestedPass<func::FuncOp>(createLLVMCPUTilePass(
     static_cast<int64_t>(tilingConfig.getVectorReductionLevel())));
 nestedModulePM.addNestedPass<func::FuncOp>(
     createGenericVectorizationPass());
 nestedModulePM.addNestedPass<func::FuncOp>(
     createHoistRedundantVectorTransfersPass());

bjacob (Contributor, Author) commented Feb 14, 2024

Has this been fixed by #15883?

bjacob (Contributor, Author) commented Feb 14, 2024

Circling back now that I have the context switched back in.

#15883 addresses part of this, but isn't a complete solution by itself.

In cases where the lowering-to-ukernel fails, #15883 is the full solution.

The remaining problem is what this issue's original description was about: the case where the lowering-to-ukernel succeeds, and then at runtime the ukernel happens to use slow generic code. There are a few reasons why that can happen. The availability of architecture-specific optimized code depends on a number of factors: not just element types, but also tile shapes, the target architecture, and target CPU features. So far, of these factors, the only one that is fully accounted for in the compiler-side lowering-to-ukernel logic is the element types. Reflecting all of these factors in that compiler logic would be a very fragile duplication of logic, bound to diverge, i.e. regress. That's why this issue's description originally called for an additional ukernel entry point returning a ground-truth answer to "would this actually be fast". Having swapped all this context back in, I think we should really do this now.

The change in #15883 will be really helpful with that now. I hadn't originally thought of that aspect, but if I had just followed what I originally described above, the codegen fallback would have been slow. #15883 ensures that the codegen fallback will be fast, when we have it.

bjacob (Contributor, Author) commented Feb 16, 2024

I made a prototype implementing the idea in the original issue description here. It doesn't work as-is. I discussed it with @MaheshRavishankar, who kindly brought me up to speed on the reasons why it just won't work. It's all in https://gist.github.com/bjacob/80ab8936c2416cdd41516a59f001f8b6

benvanik (Collaborator) commented Feb 16, 2024

So it sounds like a no-side-effects query function, called at the top of the function (outside all the loops) and switching the entire loop nest, is probably the way to go? Worst case, you do:

func @entry(...) {
  %have_ukernel = iree_codegen.ukernel.is_supported "iree_uk_mmt4d" <whatever metadata you need>
  scf.if %have_ukernel {
    call @using_ukernel(...)
  } else {
    call @not_using_ukernel(...)
  }
}
func private @using_ukernel(...) ...
func private @not_using_ukernel(...) ...

(that is_supported becomes an extern call just like a ukernel.generic does, but without needing to pass buffers/etc - just dims/element types/etc)
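
A sketch of what that extern call could look like after lowering, in the same pseudocode style as above; the import name and argument list are hypothetical, standing in for whatever dims/element-type/flags parameters the real query ends up taking:

  // Hypothetical lowering of the query: a plain external call with scalar
  // arguments only (no buffers), returning a boolean.
  func.func private @iree_uk_mmt4d_is_supported(i32, index, index, index) -> i1
  ...
  %ok = call @iree_uk_mmt4d_is_supported(%flags, %m0, %n0, %k0)
      : (i32, index, index, index) -> i1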

bjacob (Contributor, Author) commented Feb 16, 2024
bjacob added a commit that referenced this issue Mar 1, 2024
This will be useful to introduce the mmt4d-query ukernel for #15784, as
the default calling convention allows for scalar return values. This is
also a cleanup in its own right: at this point, ukernels only use
ParameterStruct calling convention for historical reasons.
bjacob added a commit that referenced this issue Mar 1, 2024
… the mmt4d implementation. (#16631)

This is a key step towards implementing the ukernels-to-codegen fallback
when ukernels would not have fast code (#15784).

Not automatically falling back anymore revealed that one testcase was actually testing the wrong tile size all along :-)