[SROA] the limit on maximum number of alloca slices is too low for NVPTX (and, likely, AMDGPU). #69785
We could use
Or we can grow a new
I don't think SROA should have target-dependent limits. If this limit is too low for some use cases, we should find a more precise heuristic. In particular, I'm thinking that we don't want to limit the number of slices by itself, but rather limit breaking up a single memcpy operation into a large number of slices. Normal promotion of slices to registers should be unproblematic for any number of slices; it's the case where one operation gets turned into thousands that's a problem.
The issue is that SROA appears to have super-linear scaling. From https://reviews.llvm.org/D159354#4634407
The current value is close to what looks like a reasonable level for non-GPU targets. However, for GPUs we regularly need much higher thresholds. Setting the default to 2K would un-fix the issue for Rust.
If the SROA scalability could be improved for everyone, that would be great. I have no clue how long it would take, though. Until that happens, I need some sort of work-around, as it affects a fair number of GPU users.
As I understand it, the issue isn't in SROA itself; it's the knock-on effects it has on MemCpyOpt and DSE, among others. When we generate a huge number of slices, those passes then have to process all of them, which leads to this compile-time explosion.
@nikic What do you suggest we should do then? Bumping the default does not seem to be a good way out of this situation either. |
What I said directly after that quote:
But for now, I think we should just revert this change.
This change causes an 85% performance drop in certain rocBLAS benchmarks for AMDGPU.
Reverted in ed86e74 for now. |
…t allocas" This reverts commit e13e808. This causes performance regressions on GPU targets, see llvm/llvm-project#69785. Revert the change for now.
e13e808 added a limit on the number of alloca slices processed by SROA and set it to 1024.
The limit appears to be a bit too low for NVPTX, where we heavily rely on SROA to eliminate very expensive local loads/stores. The regression cases I'm aware of need it to be in the ballpark of 2K.
Based on the comments in the code review, bumping the default to that value may be a bit too much for Rust. We may need to figure out how to make the threshold's default target-specific.
For CUDA I could change the default by making the clang driver pass the new threshold via additional cc1 options, but that would not work for JIT users using the NVPTX back-end.
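For ahead-of-time CUDA compiles, the driver-side workaround mentioned above could look roughly like the sketch below. The flag name `-sroa-max-alloca-slices` is an assumption based on the reverted patch and may not exist under that spelling in a given LLVM build:

```shell
# Hypothetical sketch: forward a backend option from the clang driver
# to cc1 via -mllvm, raising the assumed SROA slice limit to 2048.
clang++ -O2 --cuda-gpu-arch=sm_80 \
    -mllvm -sroa-max-alloca-slices=2048 \
    kernel.cu -c -o kernel.o
```

JIT users drive the NVPTX back-end through the LLVM C++ API rather than the clang driver, so there is no command line on which to forward such a flag; that is why a target-specific default, rather than a driver knob, is what's being asked for.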