Fused transposed elementwise ops in dispatch region causing extra shared memory allocation #12523
This used to work at some point and stopped working in the recent past. We have pinned to an earlier version as a workaround. |
This is because we have a transposed elementwise op fused after the matmul:

    %10 = linalg.matmul
            ins(%4, %5 : tensor<4096x512xf16>, tensor<512x512xf16>)
            outs(%9 : tensor<4096x512xf16>) -> tensor<4096x512xf16>
    %11 = linalg.generic {
        indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1)>,
                         affine_map<(d0, d1) -> (d1, d0)>],
        iterator_types = ["parallel", "parallel"]
      } ins(%10, %6 : tensor<4096x512xf16>, tensor<512xf16>) outs(%7 : tensor<512x4096xf16>) {
      ^bb0(%in: f16, %in_0: f16, %out: f16):
        %12 = arith.addf %in, %in_0 : f16
        linalg.yield %12 : f16
      } -> tensor<512x4096xf16>

So bufferization allocated a buffer from workgroup memory to hold the intermediate result of the matmul (%10). The issue is likely at a higher level. I'd be interested to know how we generate such dispatches in flow. Did you perform layout transformations and handle transposes somehow, causing this? |
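For contrast, here is a minimal sketch (hypothetical, not taken from the model) of the same bias add with the output kept in the matmul's layout; %init stands for a hypothetical 4096x512 f16 init tensor. With matching producer/consumer layouts, the elementwise op can consume the matmul result in place and should not need a separate intermediate buffer:

    %11 = linalg.generic {
        indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1)>,
                         affine_map<(d0, d1) -> (d0, d1)>],
        iterator_types = ["parallel", "parallel"]
      } ins(%10, %6 : tensor<4096x512xf16>, tensor<512xf16>) outs(%init : tensor<4096x512xf16>) {
      ^bb0(%in: f16, %in_0: f16, %out: f16):
        %12 = arith.addf %in, %in_0 : f16
        linalg.yield %12 : f16
      } -> tensor<4096x512xf16>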
ew, using a temporary buffer is really unfortunate - one shouldn't be needed here for a simple bias add. |
So it was working until about a week ago, and then we hit the issue. We have pinned torch-mlir to an older version to avoid it. Is the issue at a higher level in torch-mlir, or in the flow dialect? |
So I used the linalg IR I get from our pipeline. There seem to be no special layout transformations or handling of transposes that we're using. |
Can we confirm the IR difference between the two versions? And post both too? |
The op count difference with OLD vs LATEST:
At OLD:
At LATEST:
Elided Linalg IR which has the issue.
The main difference I see is the 3 "extra" matmuls we have in the newer IR. I tried experimenting with different decompositions. There are just a handful of decompositions we use in our pipeline, as can be seen here. So I tried inspecting the decompositions which we're mainly using in our pipeline, and the only "relevant" delta I see between |
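As a sketch of one way to gather such an op-count comparison (old.mlir and latest.mlir are placeholder file names, not the original artifacts):

    # Count lines mentioning each linalg op in the OLD and LATEST IR dumps.
    for op in linalg.matmul linalg.generic linalg.conv_2d_nhwc_hwcf; do
      echo "$op: old=$(grep -c "$op" old.mlir) latest=$(grep -c "$op" latest.mlir)"
    done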
Thanks for the full input IR. However, I cannot reproduce the issue. With e2151d3 and |
I think this bug is magically fixed. I don't know what changed. We will revert the pin and keep an eye out and report back. Thank you for investigating. |
Yeah, there is no magic :( I was testing via the WMMA pipeline, which worked. This failure only happens on the RDNA2/SIMT pipeline. Re-opening. |
adding @MaheshRavishankar too for any guidance since we think it may be something at the flow level. |
@yzhang93 / @qedawkins could this be because the tunings changed or are now invalid since @antiagainst tried on top of master? |
Yup, confirmed that the tunings we apply cause this crash at runtime. Maybe we can have runtime checks for them? I don't know how the verifier let this go. Maybe we need to enhance the verifier to capture this failure too. |
The verifier might not be considering fused ops when determining shared memory requirements. Also I think it only verifies named matmuls (everything else just passes straight through) but I'd have to double check the verifier to be sure (am away from desk) |
I'm sure previously both tuned and untuned models had the failure. But maybe something has changed and the untuned model works fine now. I think it's the VAE model, and we don't apply lowering configs on it. I'll check if the Winograd transform caused the problem. |
Not sure tuning is the problem; tuning just adjusts the tile sizes and such after seeing the dispatch region. The problem is having a fused transposed elementwise op in the dispatch region from the beginning, which causes bufferization to insert extra allocations in shared memory. It would be beneficial to understand why we are forming such dispatch regions, like doing |
I get an UNAVAILABLE; VK_ERROR_INITIALIZATION_FAILED error. Or actually, maybe I'm not using the proper command-line options to reproduce the issue from the full model. I was using |
Here are our typical flags: |
@antiagainst I just tested with the latest nightly IREE python package (https://github.com/openxla/iree/releases/tag/candidate-20230311.455) and the above dispatch still has the compilation error. And I confirmed it has nothing to do with tuning. I tested on my navi3 system with the following commands: The above dispatch is from the VAE model, and the whole model failed with the same error. We use the preprocessing flags when compiling the whole model. |
@yzhang93 can we please run with |
Here is the output of --mlir-print-ir-after-all |
@antiagainst if this is an AMD driver issue, is there a temporary way to override the creation of dispatch 28 until it is fixed? |
Also, should dispatch_28 not be formed for those cards that expose less shared memory? |
How much is the shared memory usage? |
It seems like the tile sizes chosen (which determine the shared memory usage) are not accounting for the shared memory limit... So this is a backend issue. |
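As a purely illustrative calculation (the actual tile sizes of the failing dispatch are not shown in this thread): with f16 data and tile sizes M = N = 128, K = 32, promoting the two matmul operands to workgroup memory takes (128 x 32 + 32 x 128) x 2 bytes = 16 KB, and an extra 128 x 128 f16 buffer for the transposed intermediate adds another 32 KB, for 48 KB total. A limit such as Vulkan's maxComputeSharedMemorySize (commonly in the 32-64 KB range on desktop GPUs) can then be exceeded once the extra allocation lands on top of tile sizes that were chosen assuming no intermediate buffer.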
@yzhang93 can we tune dispatch_28 for rdna2 so it doesn't crash? |
Is the backend here the AMD Vulkan driver? |
No, this should be the SPIR-V backend in IREE. |
ah ok. Our generated spv |
Thanks for the repro steps in #12523 (comment). I've seen the issue and figured out what went wrong and put up a fix in #12627. |
With the above, we won't fuse such cases. @powderluv or somebody else, if you can help verify this works, that'd be nice. |
Thank you. I can confirm it works ok. |
@antiagainst When the right fix lands, would we be back to the original memory usage? We are carrying this locally on SHARK-Runtime, and some end users on 8GB cards are running out of VRAM. |
Silly Q, but this issue is marked "Done" in status, but open in the IREE project (https://github.com/orgs/openxla/projects/13?pane=issue&itemId=22177652). Any idea why? (I can click "Close with Comment", but I guess I don't get the difference here.) Thanks! |
I'm not sure this one would address memory usage issues. We were never able to handle such fusion cases before; we never hit such cases previously. So it's not like we are not fusing some previously fused cases. The memory usage issue is likely different. |
Ha, interesting. This is not done, so I moved it back to "In Progress". The fix in #12627 is not the long-term way to go. I'll spend some time to do it more properly later. |
Hi - double checking on this P0 issue. More to say? Ok to close or lower priority? Thanks! |
Moving this to a P1 since we have a workaround (for SHARK at least) |
P1 is ok, but this is a release blocker, I think. I foolishly started talking about things in discord, but cross-posting here. Looking at #12627 it seems like we have the option to drop a feature in order to fix the bug. I would advocate for that or hiding it behind a flag so that we don't have to wait for a big rewrite to fix this issue. |
It works on other backends, and it's just a can we kicked down the road for a while... |
If this is indeed a release blocker then I think we need to revert the offending feature. This looks like we are miscompiling and we have head and unstable releases in a known-broken state. |
I'm coming to fix this in the proper way next. The issue is triggered by some new IR patterns from torch-mlir which we didn't see before. So it's not that we have a regression; previous releases won't support it either. I don't want to block the release on my implementation, so I'm fine rolling forward the release. |
Got it, then I think this is not a release blocker. Thanks for clarifying (and for fixing 🙂 ) |
…#13823) This just needs to optimize vector transfer ops after vectorization and before folding memref aliases. At that time we still have memref subviews using the same indices. Fixes iree-org#12523 Closes iree-org#12627
What happened?
On trying to pass the IR through iree-run-module, I get the following error:
This takes place for --iree-vulkan-target-triple=rdna2-unknown-windows.
Steps to reproduce your issue
Download module_forward_dispatch_28_vulkan_spirv_fb.mlir.
Step 1.
Step 2.
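A rough sketch of the kind of commands involved, assuming the standalone dispatch file above and the RDNA2 Vulkan target from this issue (flag spellings vary across IREE versions, and the entry-point name below is a placeholder, not taken from the report):

    # Compile the dispatch-level IR for the Vulkan/SPIR-V backend.
    iree-compile module_forward_dispatch_28_vulkan_spirv_fb.mlir \
      --iree-hal-target-backends=vulkan-spirv \
      --iree-vulkan-target-triple=rdna2-unknown-windows \
      -o dispatch_28.vmfb

    # Run the compiled module on the Vulkan device.
    iree-run-module --device=vulkan --module=dispatch_28.vmfb \
      --function=forward_dispatch_28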
What component(s) does this issue relate to?
Compiler, Runtime
Version information
No response
Additional context
No response