
[SROA] the limit on maximum number of alloca slices is too low for NVPTX (and, likely, AMDGPU). #69785

Open
Artem-B opened this issue Oct 20, 2023 · 8 comments
Artem-B commented Oct 20, 2023

Commit e13e808 added a limit on the number of alloca slices processed by SROA and set the default to 1024.

The limit appears to be a bit too low for NVPTX, where we rely heavily on SROA to eliminate very expensive local loads/stores. The regression cases I'm aware of need it to be in the ballpark of 2K.

Based on the comments in the code review, bumping the default to that value may be a bit too much for Rust. We may need to figure out how to make the default threshold target-specific.

For CUDA I could change the default by making the clang driver pass the new threshold via additional cc1 options, but that would not work for JIT users of the NVPTX back-end.
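
For reference, the driver-side workaround amounts to forwarding an `-mllvm` flag, and an embedder would have to do the equivalent by hand. A minimal sketch, assuming the cl::opt added in e13e808 is spelled `sroa-max-alloca-slices` (the option name is an assumption here, so verify it against your LLVM revision):

```cpp
// Sketch only: setting an LLVM cl::opt programmatically from a host
// application before building the optimization pipeline. Assumes the
// option added in e13e808 is named "sroa-max-alloca-slices".
// The driver-side equivalent would be something like:
//   clang++ ... -mllvm -sroa-max-alloca-slices=2048
#include "llvm/Support/CommandLine.h"

void raiseSROASliceLimit() {
  const char *Args[] = {"myjit", "-sroa-max-alloca-slices=2048"};
  llvm::cl::ParseCommandLineOptions(2, Args);
}
```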

Artem-B commented Oct 20, 2023

We could use `TargetTransformInfo::getInliningThresholdMultiplier` as a proxy:

```cpp
unsigned getInliningThresholdMultiplier() const;
```

Or we could grow a new `getSROAMaxAllocaSlicesMultiplier`.
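
For illustration, a standalone sketch of how such a multiplier hook could compose with the existing limit; the method name, the class shapes, and the multiplier values are all hypothetical, not an existing API:

```cpp
// Toy model of the proposed hook (all names hypothetical): each target
// reports a multiplier, and SROA scales its slice cap by it.
#include <cstdio>

struct TTISketch {
  // CPU targets keep the current default; GPU targets override.
  virtual unsigned getSROAMaxAllocaSlicesMultiplier() const { return 1; }
  virtual ~TTISketch() = default;
};

struct NVPTXTTISketch : TTISketch {
  // Returning 2 would put the NVPTX limit in the ~2K ballpark.
  unsigned getSROAMaxAllocaSlicesMultiplier() const override { return 2; }
};

int main() {
  const unsigned SROAMaxAllocaSlices = 1024; // current default from e13e808
  NVPTXTTISketch TTI;
  unsigned Limit =
      SROAMaxAllocaSlices * TTI.getSROAMaxAllocaSlicesMultiplier();
  std::printf("effective slice limit: %u\n", Limit); // prints 2048
  return 0;
}
```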

nikic commented Oct 21, 2023

I don't think SROA should have target-dependent limits. If this limit is too low for some use cases, we should find a more precise heuristic. In particular, I'm thinking that we don't want to limit the number of slices by itself, but rather limit breaking up a single memcpy operation into a large number of slices. Normal promotion of slices to registers should be unproblematic for any number of slices; it's the case where one operation gets turned into thousands that is a problem.
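
A toy sketch of what that more precise check could look like, counting only the slices that a single memory transfer would be split into rather than the alloca's total slice count; all names and the per-op cap below are hypothetical, not proposed code:

```cpp
// Hypothetical heuristic: instead of capping an alloca's total slice
// count, cap how many pieces any single memcpy/memset over it would be
// split into. Plain slice-to-register promotion stays unlimited.
#include <cstddef>
#include <vector>

struct Slice { size_t Begin, End; };       // one SROA slice, [Begin, End)
struct MemTransfer { size_t Begin, End; }; // one memcpy over the alloca

bool splitsOneOpTooFinely(const MemTransfer &MT,
                          const std::vector<Slice> &Slices,
                          size_t MaxPiecesPerOp = 128) {
  size_t Pieces = 0;
  for (const Slice &S : Slices)
    if (S.End > MT.Begin && S.Begin < MT.End) // slice overlaps the transfer
      ++Pieces;
  return Pieces > MaxPiecesPerOp;
}
```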

Artem-B commented Oct 23, 2023

> If this limit is too low for some use cases, we should find a more precise heuristic.

The issue is that SROA appears to have super-linear scaling. From https://reviews.llvm.org/D159354#4634407:

> here are some statistics for different values (running `opt -O3` on unoptimized IR produced by rustc):

```
2    : 0.05s user
4    : 0.04s user
8    : 0.06s user
16   : 0.06s user
32   : 0.08s user
64   : 0.05s user
128  : 0.12s user
256  : 0.56s user
512  : 0.59s user
1024 : 6.62s user
2048 : 53.27s user
```

Judging by these numbers (past 512, time grows roughly 8-11x with each doubling of the limit), the current value is close to a reasonable level for non-GPU targets. However, for GPUs we regularly need much higher thresholds, and setting the default to 2K would un-fix the issue for Rust.

> we don't want to limit the number of slices by itself, but rather limit breaking up a single memcpy operation into a large number of slices. Normal promotion of slices to registers should be unproblematic for any number of slices; it's the case where one operation gets turned into thousands that is a problem.

If SROA's scalability could be improved for everyone, that would be great. I have no idea how long that would take, though. Until it happens, I need some sort of workaround, as this affects a fair number of GPU users.

dc03 commented Oct 25, 2023

> The issue is that SROA appears to have super-linear scaling.

As I understand it, the issue isn't in SROA itself; it's the knock-on effects it has on MemCpyOpt and DSE, among others. When we generate a huge number of slices, those passes then have to process all of them, which leads to the compile-time explosion.

Artem-B commented Nov 7, 2023

> I don't think SROA should have target-dependent limits.

@nikic What do you suggest we do, then? Bumping the default does not seem to be a good way out of this situation either.

nikic commented Nov 7, 2023

> > I don't think SROA should have target-dependent limits.
>
> @nikic What do you suggest we do, then? Bumping the default does not seem to be a good way out of this situation either.

What I said directly after that quote:

> In particular, I'm thinking that we don't want to limit the number of slices by itself, but rather limit breaking up a single memcpy operation into a large number of slices.

But for now, I think we should just revert this change.

alex-t commented Nov 7, 2023

This change causes an 85% performance drop in certain rocBLAS benchmarks on AMDGPU.
My first idea was to add a hook in TargetTransformInfo, but SROA is target-unaware by nature.

nikic added a commit that referenced this issue on Nov 9, 2023:

> …t allocas
>
> This reverts commit e13e808.
>
> This causes performance regressions on GPU targets, see #69785. Revert the change for now.
nikic commented Nov 9, 2023

Reverted in ed86e74 for now.
