Crash in LLVMGPUTileAndDistribute / Tiling interface #12824
With the latest integrate, this is hiding the error I used to have, which was:
I expect fixing the crash will then trigger the error I am expecting.
I'll wait for #12822 to land before looking more.
This is a simpler repro:
Looking at the CPU side, I think there is a folder missing here that will help: this could just be folded into a simpler form, and then a lot of the backend code generation works out better, at least on the CPU side (I haven't looked at the GPU side yet, but it is probably something similar). Some points of interest: after dispatch region formation, this is the IR:
That looks good. The issue seems to be with the unpack dispatch; something very strange is happening with bufferization. This is the IR for the unpack dispatch before bufferization:
and this is the IR after:
So, something went off the rails in bufferization.
The scf.for op should not return tensor types. The dest of the transfer_write, the extract_slice ops, and the insert_slice ops all chain through the iter_arg:

```mlir
%9 = affine.apply affine_map<(d0) -> (d0 floordiv 128)>(%arg0)
%10 = affine.apply affine_map<(d0) -> (d0 floordiv 256)>(%arg1)
%11 = flow.dispatch.tensor.load %0, offsets = [%9, %10, 0, 0], sizes = [1, 1, 128, 256], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<8x16x128x256xf32>> -> tensor<1x1x128x256xf32>
%12 = flow.dispatch.tensor.load %1, offsets = [%arg0, %arg1], sizes = [%5, %8], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<999x3999xf32>> -> tensor<?x?xf32>
%13 = scf.for %arg2 = %c0 to %5 step %c128 iter_args(%arg3 = %12) -> (tensor<?x?xf32>) {
%14 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 128)>(%arg2)[%5]
%15 = scf.for %arg4 = %c0 to %8 step %c256 iter_args(%arg5 = %arg3) -> (tensor<?x?xf32>) {
%16 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 256)>(%arg4)[%8]
%17 = affine.apply affine_map<(d0) -> (d0 floordiv 128)>(%arg2)
%18 = affine.apply affine_map<(d0) -> (d0 floordiv 256)>(%arg4)
%extracted_slice = tensor.extract_slice %arg5[%arg2, %arg4] [%14, %16] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
%19 = vector.transfer_read %11[%17, %18, %c0, %c0], %cst {in_bounds = [true, true]} : tensor<1x1x128x256xf32>, vector<128x256xf32>
%extracted_slice_0 = tensor.extract_slice %extracted_slice[0, 0] [%14, %16] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
%20 = vector.transfer_write %19, %extracted_slice_0[%c0, %c0] : vector<128x256xf32>, tensor<?x?xf32>
%inserted_slice = tensor.insert_slice %20 into %extracted_slice[0, 0] [%14, %16] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
%inserted_slice_1 = tensor.insert_slice %inserted_slice into %arg5[%arg2, %arg4] [%14, %16] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
scf.yield %inserted_slice_1 : tensor<?x?xf32>
}
scf.yield %15 : tensor<?x?xf32>
}
```
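Reading the loop body above, the inner extract_slice/insert_slice pair on %extracted_slice uses offset 0 and the same %14 x %16 sizes, so it is an identity round-trip. If the missing folder mentioned earlier targets that redundancy, the folded body might look like this minimal sketch (my reconstruction from the IR above, not IR taken from the thread):

```mlir
// Sketch (assumed): the identity extract_slice/insert_slice pair folds
// away, so the transfer_write targets the outer slice directly and a
// single insert_slice threads the result back into the iter_arg.
%extracted_slice = tensor.extract_slice %arg5[%arg2, %arg4] [%14, %16] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
%19 = vector.transfer_read %11[%17, %18, %c0, %c0], %cst {in_bounds = [true, true]} : tensor<1x1x128x256xf32>, vector<128x256xf32>
%20 = vector.transfer_write %19, %extracted_slice[%c0, %c0] : vector<128x256xf32>, tensor<?x?xf32>
%inserted_slice = tensor.insert_slice %20 into %arg5[%arg2, %arg4] [%14, %16] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
scf.yield %inserted_slice : tensor<?x?xf32>
```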
Ping on this P0 bug?
Trying to narrow down the scope of this bug and boil it down to some concrete action items. There are broadly two options for converting a matmul to a shape that is better for codegen: one is to use pack/unpack ops, the other is to use pad/slice ops.
With both of these options, the end state we should be looking for is four dispatches.
To achieve this for the pack/unpack path on the small repro that was added here (#12824 (comment)), we need to add a folding pattern.
Task 1: Add a folding pattern that folds a tensor.extract_slice of a tensor.unpack result into the unpack itself (see the sketch below).
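The code blocks for this pattern did not survive in the thread; based on the shapes in the repro (a 999x3999 result unpacked from an 8x16x128x256 packed layout), the fold is presumably along these lines. Value names are illustrative:

```mlir
// Before (assumed shapes from the repro): unpack to the full padded
// 1024x4096 tensor, then slice out the valid 999x3999 region.
%init = tensor.empty() : tensor<1024x4096xf32>
%full = tensor.unpack %packed inner_dims_pos = [0, 1] inner_tiles = [128, 256]
    into %init : tensor<8x16x128x256xf32> -> tensor<1024x4096xf32>
%result = tensor.extract_slice %full[0, 0] [999, 3999] [1, 1]
    : tensor<1024x4096xf32> to tensor<999x3999xf32>

// After: unpack straight into the smaller destination; tensor.unpack
// semantics allow the dest to be smaller than the padded footprint.
%dest = tensor.empty() : tensor<999x3999xf32>
%result2 = tensor.unpack %packed inner_dims_pos = [0, 1] inner_tiles = [128, 256]
    into %dest : tensor<8x16x128x256xf32> -> tensor<999x3999xf32>
```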
Even without this fold we will get four dispatches, but with a leftover slice op in the unpack dispatch. To achieve four dispatches on the pad/slice path, we need to make sure that the pad and slice ops land in the dispatches around them (a sketch of this path follows below). The next steps then are to look at having "good" code generation for these ops on the backends.
Task 2: Interaction between the pad/pack ops and tiling/distribution in the backend code generation.
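For contrast, a minimal sketch of the pad/slice form of the same preparation, again assuming the repro's 999x3999 shape padded up to the tile-aligned 1024x4096; all names and pad amounts are assumptions:

```mlir
// Pad the operand up to the tile-aligned shape (999+25 = 1024, 3999+97 = 4096).
%cst = arith.constant 0.0 : f32
%padded = tensor.pad %operand low[0, 0] high[25, 97] {
^bb0(%i: index, %j: index):
  tensor.yield %cst : f32
} : tensor<999x3999xf32> to tensor<1024x4096xf32>
// ... tile-aligned computation producing %result : tensor<1024x4096xf32> ...
// Slice the valid region back out.
%valid = tensor.extract_slice %result[0, 0] [999, 3999] [1, 1]
    : tensor<1024x4096xf32> to tensor<999x3999xf32>
```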
Thanks @MaheshRavishankar, let's break this down further, as I seem to see some conflation here.
Sure, but this is not the problem that this issue refers to. The issue is: IREE receives the IR above and crashes in LLVMGPUTileAndDistribute (see the issue title). So it seems the bug reduces to fixing that crash on this input.
Now for the specific points looking ahead:
Yes, this is a good follow-up once the crash is fixed; it will remove one dispatch once we can run end-to-end and measure.
Yes, a reasonable codegen strategy for pack and pad is lacking on the GPU side; there are many low-hanging-fruit opportunities. One of the avenues we have indeed been looking at is folding trivial packs into pads, which captures one of the use cases we care about on GPU (see the sketch below).
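As an illustration of that fold (my sketch, not code from the thread): a pack whose outer dimensions are all 1 only pads and adds unit dims, so it can be rewritten as a pad plus a reshape. The shapes here are made up:

```mlir
%cst = arith.constant 0.0 : f32
%dest = tensor.empty() : tensor<1x1x128x256xf32>

// Trivial pack: both outer dims are 1, so it only pads to 128x256 and
// prepends unit dimensions.
%packed = tensor.pack %src padding_value(%cst : f32)
    inner_dims_pos = [0, 1] inner_tiles = [128, 256]
    into %dest : tensor<100x200xf32> -> tensor<1x1x128x256xf32>

// Equivalent pad + expand_shape (no transpose is needed here because
// inner_dims_pos is the identity permutation).
%padded = tensor.pad %src low[0, 0] high[28, 56] {
^bb0(%i: index, %j: index):
  tensor.yield %cst : f32
} : tensor<100x200xf32> to tensor<128x256xf32>
%packed2 = tensor.expand_shape %padded [[0, 1, 2], [3]]
    : tensor<128x256xf32> into tensor<1x1x128x256xf32>
```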
This deserves another full thread.
This part is confusing to me. I don't see any uses of IREE::LinalgExtTilingInterfaceBaseTilingPattern on the CPU backend. That is really legacy and should not be used. AFAIK the CPU side doesn't use it, and that might be the difference.
From GPU sync: Nicolas is out for two weeks; discuss priority offline.
@MaheshRavishankar @mattwalsh We should discuss priority and resourcing for this issue during the GPU sync.
From the meeting today: holding this open for Nicolas when he returns; not a blocking item.
What happened?
I have been iterating on graph-level transformations and added this repro to iree-samples.
This is dependent on #12822
Running

```
iree-compile --iree-hal-target-backends=cuda transform_dialect/graph/cuda/bugs/1-ir.mlir
```

crashes with:
Steps to reproduce your issue
What component(s) does this issue relate to?
No response
Version information
No response
Additional context
No response