stablehlo.sort pathologically slow #13865
Yeah, it would be great to get some eyes on this - it's been ignored for ages because most models were only sorting 5-10 elements, but it gets really bad as it scales larger. IIRC this is related to #13745 (I believe that is coming from sort, or at least adjacent to it).
Yep, the surrounding IR in the model looks like: [IR snippet omitted]
Assigning to Güray for an assessment of what needs to happen here.
This occurs in a topk pattern.
There was work done on making […]
Adding @kuhar as an assignee for interfacing with MHLO.
Don't we have that already? https://github.com/openxla/iree/blob/0d81062a8ffabf3561e9d6cc758b8250b2f607d3/compiler/src/iree/compiler/InputConversion/StableHLO/StableHLOToLinalgExt.cpp#L431
These models don't seem to be using chlo.topk (or it's getting lowered to StableHLO/MHLO prior to that pass). Ideally we'd still be able to do OK on something like sorting without requiring chlo - there are a lot of ways sorts can be used that aren't a topk.
Agreed, but this is more about priorities. I think we should invest in making all these operations better (including split-k), and I am happy to help with that. Instead of another special-casing of sort, I'd rather just reuse the one-off things that were done for topk. If someone wants to generalize and build a generic way of handling this by pushing on the interfaces above, I'd be glad to help.
Oh yeah, more saying that pattern-matching this to the LinalgExt op may be the quick workaround vs. needing to change the original model to use chlo (if it's not already - someone needs to look). iota + broadcast + sort seems reasonable to match.
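For illustration only (this is a sketch, not IREE code), the rewrite being discussed relies on the iota + sort + slice sequence being semantically a top-k, so a pattern matcher can replace it with a dedicated topk op. The equivalence looks like:

```python
# Hypothetical sketch of the "iota + sort + slice" lowering that a
# topk pattern-matcher would recognize. Sorting (value, index) pairs
# by value descending and keeping the first k is exactly top-k.

def topk_via_sort(values, k):
    indices = list(range(len(values)))                         # iota
    pairs = sorted(zip(values, indices), key=lambda p: -p[0])  # sort
    top = pairs[:k]                                            # slice
    return [v for v, _ in top], [i for _, i in top]

vals, idxs = topk_via_sort([3.0, 1.0, 4.0, 1.0, 5.0], k=2)
# vals → [5.0, 4.0], idxs → [4, 2]
```

The pathology the issue describes is visible here: the full array is sorted even though only k results are kept, so the work is O(n log n) regardless of how small k is.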
I studied the current top-k op and its performance on GPUs, and it is not optimal. As far as I can see, topk executes sequentially, or with very little parallelism, by default. Manually setting a value for the --iree-flow-topk-split-reduction flag leverages more parallelism; I can implement a heuristic to pick that value, which would be the quickest way to get at least a 7x speedup. But even then, the generated code isn't cutting-edge: XLA:GPU uses a custom kernel when k < 16, and my benchmarks show it is faster than IREE, though still not the most optimal solution. We might want to use a custom kernel, or a microkernel if we want to fuse the producer.
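The split-reduction idea behind that flag can be sketched as follows (assumed semantics for illustration, not IREE's actual implementation): split the reduction into independent partial top-k ops that can run in parallel, then run a small final top-k over the surviving candidates.

```python
# Illustrative sketch of topk split-reduction: stage 1 computes a
# partial top-k per chunk (each chunk is independent, hence
# parallelizable); stage 2 reduces the splits*k candidates.
import heapq

def split_topk(values, k, splits):
    n = len(values)
    chunk = (n + splits - 1) // splits
    candidates = []
    for s in range(splits):                      # parallelizable stage
        part = values[s * chunk:(s + 1) * chunk]
        candidates.extend(heapq.nlargest(k, part))
    return heapq.nlargest(k, candidates)         # small final reduction

result = split_topk(list(range(100)), k=3, splits=4)
# result → [99, 98, 97]
```

The final stage only sees splits*k elements instead of n, which is where the parallel speedup comes from when splits is chosen well relative to n and the GPU's occupancy.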
I think fusing with producers here is not a high priority. Using a custom PTX blob to dispatch for sort will probably remove it from the blocker list. If we want to handle sort properly, there is a bit of design/exploration needed and better usage of the two interfaces I mentioned here; if we want to go down that route I can help with the design, but I can't take on the implementation. If we just want to handle sort, I'd just write a CUDA kernel, compile it to PTX, and ship the PTX for now.
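The thread doesn't name a specific algorithm for such a kernel; bitonic sort is one common choice for GPU sorting because every compare-exchange within a stage is independent and maps directly onto threads. A Python sketch of the network (illustrative only, not proposed IREE code):

```python
# Bitonic sorting network sketch. For each (size, stride) stage, all
# compare-exchanges are independent, so on a GPU each index i would be
# handled by its own thread; here we just loop over them serially.

def bitonic_sort(a):
    n = len(a)  # assumed to be a power of two
    size = 2
    while size <= n:
        stride = size // 2
        while stride > 0:
            for i in range(n):          # independent on a GPU
                j = i ^ stride
                if j > i:
                    ascending = (i & size) == 0
                    if (a[i] > a[j]) == ascending:
                        a[i], a[j] = a[j], a[i]
            stride //= 2
        size *= 2
    return a
```

The fixed, data-independent comparison schedule is what makes this shippable as a precompiled PTX blob: the same network works for any input of a given size.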
Hrrmmm, we can't close this issue if the solution is "write some PTX" - that may unblock the user here, but it's definitely still a major issue for the platform, as Vulkan/Metal/CPU will have extremely slow sorts, and we don't want to be making one-off workarounds in the core platform in lieu of actually solving the problem. If the heuristic makes things better we should do that, and then if the user needs even better performance they can add custom PTX in the nvgpu plugin.
Yes, I am not talking about doing this in-tree; the custom PTX in the nvgpu plugin is what I am suggesting above. As I mentioned internally, the topk implementation wasn't done using these interfaces and was a one-off (I did push for using the interfaces then, but got pushback). Having a sort one-off would be sad. So if we want to solve it and be done with it, I'd happily help, but that will require some heavy lifting. If we aren't prepared to do that now, we might as well go with the easiest solution to avoid building another heavyweight one-off.
@grypp Could you go ahead with the PTX implementation in the nvgpu plugin?
How bad is IREE's linalg_ext.top_k with no special compiler flags? As long as it is within 10x of XLA:GPU, I would consider that acceptable for the immediate-term goals. cc @kuhar
+1. If the focus here is only sort-used-for-topk, then this issue should be renamed or a new issue filed for topk.
The models here not using chlo.topk is due to the IR being examined after expansion to MHLO, lowering to XLA HLO protos, and importing back - so we're looking at the model after many passes have run and destroyed the original structure.
In the medium-term, we are going to propose and implement a dedicated topk op for stablehlo: openxla/stablehlo#1514 |
I am working on improving the performance of […]
@kuhar Adding openxla/stablehlo#1593 to show the status of adding topk to StableHLO. Please add any other relevant updates here!
@kuhar Do you have any update on this issue? |
@allieculp I posted an RFC for adding a topk op on the stablehlo github and presented it during the last openxla community meeting. I don't have an ETA for the new op being available in IREE, but this is not on the critical path given the workarounds from Natasha and Rob. |
What happened?
Compile and run the IR below. The result takes about 10 seconds to run on my A100, which is over 1000x off in performance.
Steps to reproduce your issue
See above
What component(s) does this issue relate to?
Compiler
Version information
No response
Additional context
No response