
[SROA] the limit on maximum number of alloca slices is too low for NVPTX (and, likely, AMDGPU). #69785

Open
Artem-B opened this issue Oct 20, 2023 · 8 comments
Artem-B commented Oct 20, 2023

Commit e13e808 added a limit on the number of alloca slices processed by SROA and set the default to 1024.

The limit appears to be a bit too low for NVPTX, where we rely heavily on SROA to eliminate very expensive local loads/stores. The regression cases I'm aware of need it to be in the ballpark of 2K.

Based on the comments in the code review, bumping the default to that value may be a bit too much for Rust. We may need to figure out how to make the default threshold target-specific.

For CUDA I could change the default by making the clang driver pass the new threshold via additional cc1 options, but that would not work for JIT users of the NVPTX back-end.
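
For reference, the driver-side workaround amounts to forwarding an `-mllvm` flag, and an embedder would have to do the equivalent by hand. A minimal sketch, assuming the cl::opt added in e13e808 is spelled `sroa-max-alloca-slices` (the option name is an assumption here, so verify it against your LLVM revision):

```cpp
// Sketch only: setting an LLVM cl::opt programmatically from a host
// application before building the optimization pipeline. Assumes the
// option added in e13e808 is named "sroa-max-alloca-slices".
// The driver-side equivalent would be something like:
//   clang++ ... -mllvm -sroa-max-alloca-slices=2048
#include "llvm/Support/CommandLine.h"

void raiseSROASliceLimit() {
  const char *Args[] = {"myjit", "-sroa-max-alloca-slices=2048"};
  llvm::cl::ParseCommandLineOptions(2, Args);
}
```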

Artem-B commented Oct 20, 2023

We could use `TargetTransformInfo::getInliningThresholdMultiplier` as a proxy:

```cpp
unsigned getInliningThresholdMultiplier() const;
```

Or we could grow a new `getSROAMaxAllocaSlicesMultiplier`.
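
For illustration, a standalone sketch of how such a multiplier hook could compose with the existing limit; the method name, the class shapes, and the multiplier values are all hypothetical, not an existing API:

```cpp
// Toy model of the proposed hook (all names hypothetical): each target
// reports a multiplier, and SROA scales its slice cap by it.
#include <cstdio>

struct TTISketch {
  // CPU targets keep the current default; GPU targets override.
  virtual unsigned getSROAMaxAllocaSlicesMultiplier() const { return 1; }
  virtual ~TTISketch() = default;
};

struct NVPTXTTISketch : TTISketch {
  // Returning 2 would put the NVPTX limit in the ~2K ballpark.
  unsigned getSROAMaxAllocaSlicesMultiplier() const override { return 2; }
};

int main() {
  const unsigned SROAMaxAllocaSlices = 1024; // current default from e13e808
  NVPTXTTISketch TTI;
  unsigned Limit =
      SROAMaxAllocaSlices * TTI.getSROAMaxAllocaSlicesMultiplier();
  std::printf("effective slice limit: %u\n", Limit); // prints 2048
  return 0;
}
```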

nikic commented Oct 21, 2023

I don't think SROA should have target-dependent limits. If this limit is too low for some use cases, we should find a more precise heuristic. In particular, I'm thinking that we don't want to limit the number of slices by itself, but rather limit breaking up a single memcpy operation into a large number of slices. Normal promotion of slices to registers should be unproblematic for any number of slices; it's the case where one operation gets turned into thousands that is a problem.
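
A toy sketch of what that more precise check could look like, counting only the slices that a single memory transfer would be split into rather than the alloca's total slice count; all names and the per-op cap below are hypothetical, not proposed code:

```cpp
// Hypothetical heuristic: instead of capping an alloca's total slice
// count, cap how many pieces any single memcpy/memset over it would be
// split into. Plain slice-to-register promotion stays unlimited.
#include <cstddef>
#include <vector>

struct Slice { size_t Begin, End; };       // one SROA slice, [Begin, End)
struct MemTransfer { size_t Begin, End; }; // one memcpy over the alloca

bool splitsOneOpTooFinely(const MemTransfer &MT,
                          const std::vector<Slice> &Slices,
                          size_t MaxPiecesPerOp = 128) {
  size_t Pieces = 0;
  for (const Slice &S : Slices)
    if (S.End > MT.Begin && S.Begin < MT.End) // slice overlaps the transfer
      ++Pieces;
  return Pieces > MaxPiecesPerOp;
}
```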

Artem-B commented Oct 23, 2023

> If this limit is too low for some use cases, we should find a more precise heuristic.

The issue is that SROA appears to have super-linear scaling. From https://reviews.llvm.org/D159354#4634407:

> here are some statistics for different values (running `opt -O3` on unoptimized IR produced by rustc):

```
2    : 0.05s user
4    : 0.04s user
8    : 0.06s user
16   : 0.06s user
32   : 0.08s user
64   : 0.05s user
128  : 0.12s user
256  : 0.56s user
512  : 0.59s user
1024 : 6.62s user
2048 : 53.27s user
```

Judging by these numbers (past 512, time grows roughly 8-11x with each doubling of the limit), the current value is close to a reasonable level for non-GPU targets. However, for GPUs we regularly need much higher thresholds, and setting the default to 2K would un-fix the issue for Rust.

> we don't want to limit the number of slices by itself, but rather limit breaking up a single memcpy operation into a large number of slices. Normal promotion of slices to registers should be unproblematic for any number of slices; it's the case where one operation gets turned into thousands that is a problem.

If SROA's scalability could be improved for everyone, that would be great. I have no idea how long that would take, though. Until it happens, I need some sort of workaround, as this affects a fair number of GPU users.

dc03 commented Oct 25, 2023

> The issue is that SROA appears to have super-linear scaling.

As I understand it, the issue isn't in SROA itself; it's the knock-on effects it has on MemCpyOpt and DSE, among others. When we generate a huge number of slices, those passes then have to process all of them, which leads to the compile-time explosion.

Artem-B commented Nov 7, 2023

> I don't think SROA should have target-dependent limits.

@nikic What do you suggest we do, then? Bumping the default does not seem to be a good way out of this situation either.

nikic commented Nov 7, 2023

> > I don't think SROA should have target-dependent limits.
>
> @nikic What do you suggest we do, then? Bumping the default does not seem to be a good way out of this situation either.

What I said directly after that quote:

> In particular, I'm thinking that we don't want to limit the number of slices by itself, but rather limit breaking up a single memcpy operation into a large number of slices.

But for now, I think we should just revert this change.

alex-t commented Nov 7, 2023

This change causes an 85% performance drop in certain rocBLAS benchmarks on AMDGPU.
My first idea was to add a hook in TargetTransformInfo, but SROA is target-unaware by nature.

nikic added a commit that referenced this issue on Nov 9, 2023:

> …t allocas
>
> This reverts commit e13e808.
>
> This causes performance regressions on GPU targets, see #69785. Revert the change for now.
nikic commented Nov 9, 2023

Reverted in ed86e74 for now.
