LowerToUKernels should create a codegen fallback when the ukernel would be slow. #15784

Open
bjacob opened this issue Dec 4, 2023 · 10 comments
Assignees: bjacob
Labels: codegen/llvm · codegen · performance ⚡

bjacob (Contributor) commented Dec 4, 2023

Ukernels do not expose whether they have fast code for a given case. They always succeed, potentially falling back on a slow generic tile function that can be 100x slower than optimized code: on top of being scalar code, it is dynamic-tile-size code, since it is a single function that has to adapt to any tile size.

This has come up multiple times recently (e.g. #15504), and is the topic of point 2 in @dcaballe's comment in #15760 (comment).

I considered a few options, e.g. in #15504 (comment), but after an offline discussion with @benvanik last week, this is what we are planning to do:

  1. Let ukernels expose whether they have fast code for a given case.
  2. Let LowerToUKernels lower ops to something of the form (pseudocode)
  scf.if (have_fast_ukernel_code_for_this_case()) {
    ukernel op...
  } else {
    non-ukernel op...
  }

(plus something to avoid this pattern recursing infinitely; see the sketch below).
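
A minimal sketch of what that lowered form might look like, assuming the iree_codegen.ukernel.is_supported query op that @benvanik sketches later in this thread and a hypothetical attribute on the fallback op so that LowerToUKernels does not match it again; all names here are illustrative, not existing IREE ops:

  // Pseudocode: op and attribute names are placeholders.
  %ok = iree_codegen.ukernel.is_supported "iree_uk_mmt4d" (%m0, %n0, %k0, %flags)
  scf.if %ok {
    // ukernel path, e.g. a ukernel.generic call into iree_uk_mmt4d
  } else {
    // the original op, tagged so the lowering pattern skips it and plain
    // codegen (tiling + vectorization) handles it instead
    linalg.mmt4d {skip_ukernel_lowering} ins(...) outs(...)
  }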

Pro: we'll never run into slow ukernel fallback code again. The fallback will be codegen.
Con: we'll always compile both code paths down to LLVM IR (though one will be DCE'd in late, post-linking LLVM IR optimization passes). This has two downsides:

  1. More compilation time. However, this is in the part of codegen that is well parallelized, not the current serial bottleneck on object-code generation.
  2. Ukernels will no longer be a way to avoid compilation issues in the normal codegen path. But since we want the normal codegen path to work anyway, that's not necessarily a bad thing.
hanhanW (Contributor) commented Dec 4, 2023

This is in line with a project that @MaheshRavishankar and I have discussed. We want to provide a feature similar to this: the ability to generate multiple executables for a single dispatch and pick the proper kernel with simple checks. We have made some progress on it already -- we split the lowering-strategy selection logic out of the whole dynamic pipeline. There are still ambiguities about what it would look like on the codegen and runtime sides. If this becomes a priority, I can study it and start a design, maybe in early 2024.

benvanik (Collaborator) commented Dec 4, 2023

This is more about doing it at the lower level - which is required for some scenarios and really what we should be focusing on.
Wherever possible we want to avoid:

host:
  if (some device feature)
     dispatch(variant_a)
        do thing specialized for device feature
  else if (some other device feature)
     dispatch(variant_b)
        do thing specialized for other device feature

and instead want:

host:
  dispatch(variant_ab)
     if (some device feature)
        do thing specialized for device feature
     else if (some other device feature)
        do thing specialized for other device feature

That is to say: anything that requires different host behavior should raise a red flag and be very carefully evaluated. Unless you can justify making an entire copy of the .so we generate (spending the time to compile it, embedding it in vmfbs, and deploying it to all users), host-selection changes are generally out. Doing things within the dispatches is far easier to build and maintain, far more flexible, and the default direction to take.

Host-level changes are for big things where all (or a large fraction of) dispatches in the program change - for example, different CPU architectures - and are accomplished via hal.executable.condition. They should be extremely rare, especially on CPU. Device-level changes are done via specialization constants (hal.executable.constant.block) and the various feature queries we have within CPU dispatches (DispatchABI::loadProcessorID, DispatchABI::loadProcessorData, etc.).
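
To make the device-level option concrete, here is a rough pseudocode sketch in the same spirit as the host/device comparison above; it assumes the feature bit is read through something like DispatchABI::loadProcessorData, and the query and feature names are purely illustrative:

  // inside a single CPU dispatch (device side):
  cpu_data = load_processor_data()     // e.g. via DispatchABI::loadProcessorData
  if (cpu_data has FEATURE_X)
     do thing specialized for FEATURE_X
  else
     do generic / baseline thing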

Before you build anything we should all sync up and ensure it's either using the infra we have or in this direction.

MaheshRavishankar (Contributor) commented:
@hanhanW I think what @benvanik suggested is what I was trying to describe. Sorry if that didn't come across clearly (thanks @benvanik for the nice summary!). So this would be purely codegen-side changes, but it allows us to generalize to target a wider range of computation within the same dispatch...

hanhanW (Contributor) commented Dec 5, 2023

Thanks Ben! You already gave me a rough answer about how it should look on the host side and the device side, and it matches what I prefer! I'm not going to build anything before we discuss it more broadly and reach some agreement. I just wanted to point out that we should escalate this from LowerToUKernels to the level of generating executables; I can help study it if it becomes a priority and no one else has cycles for it. My plate is full right now, so I can only offer to help kick it off in 2024.

hanhanW (Contributor) commented Dec 11, 2023

I had the wrong understanding of the slow fallback. I thought we had a ukernel whose implementation is scalar code. IIUC, what is actually happening is that we fall back to a scalar codegen pipeline when ukernels are not available.

https://github.com/openxla/iree/blob/0b0476624d554bda0e7837c516de837a4f457b92/compiler/src/iree/compiler/Codegen/LLVMCPU/Passes.cpp#L529-L543

Tiling and vectorization do not happen because they are excluded. Then we bufferize and convert the linalg ops to loops in addLowerToLLVMPasses.

https://github.com/openxla/iree/blob/0b0476624d554bda0e7837c516de837a4f457b92/compiler/src/iree/compiler/Codegen/LLVMCPU/Passes.cpp#L620-L637

One short-term fix is to "improve" the current fallback: apply the tiling and vectorization passes unconditionally after the LowerToUKernels pass. It does not hurt anything. If mmt4d ops are converted to ukernels, those passes are a no-op; if no ukernels are picked up, we can still tile and vectorize the ops, as in the excerpt below.

 // Ukernel lowering remains guarded by the flag...
 if (enableMicrokernels) {
   nestedModulePM.addNestedPass<func::FuncOp>(
       createDecomposeBatchMmt4DOpsPass());
   nestedModulePM.addPass(
       createCPULowerToUKernelsPass(clSkipIntermediateRoundings));
 }
 // ...but tiling and vectorization run unconditionally afterwards, so any op
 // not converted to a ukernel still gets a tiled and vectorized fallback.
 nestedModulePM.addNestedPass<func::FuncOp>(createLLVMCPUTileAndFusePass(
     static_cast<int64_t>(tilingConfig.getVectorCommonParallelLevel())));
 nestedModulePM.addNestedPass<func::FuncOp>(createLLVMCPUTilePass(
     static_cast<int64_t>(tilingConfig.getVectorReductionLevel())));
 nestedModulePM.addNestedPass<func::FuncOp>(
     createGenericVectorizationPass());
 nestedModulePM.addNestedPass<func::FuncOp>(
     createHoistRedundantVectorTransfersPass());

bjacob (Contributor, Author) commented Feb 14, 2024

Has this been fixed by #15883?

bjacob (Contributor, Author) commented Feb 14, 2024

Circling back now that I have the context switched back in.

#15883 addresses part of this, but isn't a complete solution by itself.

In cases where the lowering-to-ukernel fails, #15883 is the full solution.

The remaining problem is what this issue's original description was about: the case where the lowering-to-ukernel succeeds, and then at runtime the ukernel happens to use slow generic code. There are a few reasons why that can happen. The availability of architecture-specific optimized code depends on a number of factors: not just element types, but also tile shapes, the target architecture, and target CPU features. So far, of these factors, the only one that is fully accounted for in the compiler-side lowering-to-ukernel logic is the element types. Reflecting all of these factors in that compiler logic would be a very fragile duplication of logic, bound to diverge, i.e. regress. That's why this issue's description originally called for an additional ukernel entry point returning a ground-truth answer to "would this actually be fast". Having swapped all this context back in, I think we should really do this now.

The change in #15883 will be really helpful with that now. I hadn't originally thought of that aspect, but if I had just followed what I originally described above, the codegen fallback would have been slow. #15883 ensures that the codegen fallback will be fast, when we have it.

bjacob (Contributor, Author) commented Feb 16, 2024

I made a prototype implementing the idea in the original issue description here. It doesn't work as-is. I discussed it with @MaheshRavishankar, who kindly brought me up to speed on the reasons why it just won't work. It's all in https://gist.github.com/bjacob/80ab8936c2416cdd41516a59f001f8b6

benvanik (Collaborator) commented Feb 16, 2024

So it sounds like a no-side-effects query function, called at the top of the function (outside all the loops) and switching the entire loop nest, is probably the way to go? Worst case, you do:

func @entry(...) {
  %have_ukernel = iree_codegen.ukernel.is_supported "iree_uk_mmt4d" <whatever metadata you need>
  scf.if %have_ukernel {
    call @using_ukernel(...)
  } else {
    call @not_using_ukernel(...)
  }
}
func private @using_ukernel(...) ...
func private @not_using_ukernel(...) ...

(that is_supported becomes an extern call just like a ukernel.generic does, but without needing to pass buffers/etc - just dims/element types/etc)
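
A sketch of what that extern call could look like after lowering, in the same pseudocode style as above; the import name and argument list are hypothetical, standing in for whatever dims/element-type/flags parameters the real query ends up taking:

  // Hypothetical lowering of the query: a plain external call with scalar
  // arguments only (no buffers), returning a boolean.
  func.func private @iree_uk_mmt4d_is_supported(i32, index, index, index) -> i1
  ...
  %ok = call @iree_uk_mmt4d_is_supported(%flags, %m0, %n0, %k0)
      : (i32, index, index, index) -> i1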

bjacob (Contributor, Author) commented Feb 16, 2024
bjacob added a commit that referenced this issue Mar 1, 2024
This will be useful to introduce the mmt4d-query ukernel for #15784, as
the default calling convention allows for scalar return values. This is
also a cleanup in its own right: at this point, ukernels only use
ParameterStruct calling convention for historical reasons.
bjacob added a commit that referenced this issue Mar 1, 2024
… the mmt4d implementation. (#16631)

This is a key step towards implementing the ukernels-to-codegen fallback
when ukernels would not have fast code (#15784).

Not automatically falling back anymore revealed that one testcase was actually testing the wrong tile size all along :-)