
[gpu] Use clustered gpu.subgroup_reduce for nested layout distribution #18515

Merged

Conversation

@andfau-amd (Contributor) commented Sep 12, 2024

There is now support in MLIR for expressing a subgroup reduction operation that operates on several "clusters" in parallel, so it is no longer necessary to build a series of shuffles.

It has been verified that, at least when the upstream expansion patterns are used, the resulting sequence of shuffles matches what the old code produced.

This commit also adds a new pass, ExpandGPUOps, which uses the upstream patterns to expand these ops, and adds it to the LLVMGPU pass list.

Resolves #18142.
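For readers unfamiliar with the op: MLIR's `gpu.subgroup_reduce` can now reduce over independent, equally sized "clusters" of lanes within a subgroup, with every lane receiving its cluster's result. A minimal Python model of the intended semantics follows; the function name and the list-of-lane-values encoding are illustrative stand-ins, not MLIR API.

```python
# Illustrative model of a clustered subgroup reduction: the subgroup is
# a list of per-lane values, and each contiguous cluster is reduced
# independently, broadcasting the result back to every lane in it.

def clustered_subgroup_reduce(lanes, cluster_size, op=lambda a, b: a + b):
    """Every lane receives the reduction of the lanes in its cluster."""
    assert len(lanes) % cluster_size == 0
    out = []
    for base in range(0, len(lanes), cluster_size):
        cluster = lanes[base:base + cluster_size]
        acc = cluster[0]
        for v in cluster[1:]:
            acc = op(acc, v)
        out.extend([acc] * cluster_size)
    return out

# A subgroup of 8 lanes with clusters of 4: two independent sums.
print(clustered_subgroup_reduce([1, 2, 3, 4, 5, 6, 7, 8], 4))
# [10, 10, 10, 10, 26, 26, 26, 26]
```

The previous approach had to emit the equivalent shuffle sequence for each cluster by hand; expressing the cluster directly on the op lets upstream patterns own that expansion.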

@Groverkss (Contributor) left a comment

LGTM! There seems to be a failure in some tests because of this patch. Looks like you forgot to add a lowering for this op to gpu.shuffle ops?

failed to legalize operation 'gpu.subgroup_reduce' that was explicitly marked illegal

@andfau-amd andfau-amd force-pushed the 18142-clustered-subgroup-reduce-integration branch from efa3bcd to 21c8d62 Compare September 13, 2024 13:59
@andfau-amd (Contributor, Author) commented Sep 13, 2024

There seems to be a failure in some tests because of this patch. Looks like you forgot to add a lowering for this op to gpu.shuffle ops?

When I saw that failure, my gut feeling was that the ROCm pipelines just don't have the subgroup-to-shuffle lowering patterns, because previously they didn't need them: they could always use the lowering to a dedicated op. If so, the problem is that the dedicated-op lowering is now partial (it can't be used when there are clusters). I might be wrong about this and will need to look closer at it soon. If you have any hints related to this I'd be grateful. I'm assuming I need to add the upstream shuffle lowering patterns somewhere in some pipeline.

@kuhar (Member) commented Sep 13, 2024

my gut feeling was that the ROCm pipelines just don't have the subgroup->shuffle lowering patterns, because previously they didn't need them

Right, I think this is the case. We should use patterns that expand these into primitive shuffles etc. and add to the rocm lowering pipeline.
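For context on what such expansion patterns produce: a clustered reduction over a contiguous power-of-two cluster expands into log2(cluster_size) XOR "butterfly" shuffle-and-combine steps. A rough Python model follows; `shuffle_xor` is a stand-in for `gpu.shuffle` in xor mode, and this is a sketch of the technique, not the actual upstream pattern code.

```python
# Sketch of expanding a clustered reduction into XOR butterfly
# shuffles. Each step, every lane combines its value with the value
# held by lane (i ^ mask); doubling the mask until it reaches the
# cluster size reduces each aligned power-of-two cluster in place.

def shuffle_xor(lanes, mask):
    # Each lane reads the value held by lane (i ^ mask).
    return [lanes[i ^ mask] for i in range(len(lanes))]

def expand_clustered_reduce(lanes, cluster_size, op=lambda a, b: a + b):
    vals = list(lanes)
    mask = 1
    while mask < cluster_size:          # log2(cluster_size) shuffle steps
        shuffled = shuffle_xor(vals, mask)
        vals = [op(a, b) for a, b in zip(vals, shuffled)]
        mask *= 2
    return vals

print(expand_clustered_reduce([1, 2, 3, 4, 5, 6, 7, 8], 4))
# [10, 10, 10, 10, 26, 26, 26, 26]
```

Because the butterfly never crosses a cluster boundary (the mask stays below the cluster size), the same loop handles both full-subgroup and clustered reductions.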

@andfau-amd andfau-amd force-pushed the 18142-clustered-subgroup-reduce-integration branch from 36973a5 to 2093968 Compare September 18, 2024 19:00
@andfau-amd andfau-amd force-pushed the 18142-clustered-subgroup-reduce-integration branch 2 times, most recently from e9c8ccd to 6f86e2c Compare September 18, 2024 19:05
@qedawkins (Contributor) left a comment

Looks good to land once integrate pulls in the needed commits.

.gitmodules — review comment (outdated, resolved)
@kuhar (Member) left a comment

Should we have a basic LIT test to check that the subgroup size gets correctly passed on to the subgroup expansion patterns?

@andfau-amd andfau-amd force-pushed the 18142-clustered-subgroup-reduce-integration branch from 6f86e2c to 73c6de6 Compare September 18, 2024 19:47
@andfau-amd (Contributor, Author)

Should we have a basic LIT test to check that the subgroup size gets correctly passed on to the subgroup expansion patterns?

What should it look like? I think some kind of integration test might work, but if it's just testing the pass in isolation I'm not sure it'll be useful; it's the presence and accuracy of particular attributes that I think would cause problems, not whether it manages to read them.

@kuhar (Member) commented Sep 18, 2024

Should we have a basic LIT test to check that the subgroup size gets correctly passed on to the subgroup expansion patterns?

What should it look like? I think some kind of integration test might work, but if it's just testing the pass in isolation I'm not sure it'll be useful; it's the presence and accuracy of particular attributes that I think would cause problems, not whether it manages to read them.

I'd dump the IR just before the pass, strip away everything unrelated and run it in isolation, once with subgroup size 32 and once with 64, expecting a different number of shuffles

@andfau-amd (Contributor, Author)

The number of shuffles would be the same in both cases, because only subgroup reductions with a specified cluster size are being lowered. The subgroup size is basically just for the sake of error-checking in this case.
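The point can be made concrete with a little arithmetic: a power-of-two cluster of size c needs log2(c) butterfly shuffle steps, regardless of whether the subgroup holds 32 or 64 lanes. This is a sketch of that counting argument, not tied to the actual pattern implementation.

```python
import math

# Shuffle steps needed for a power-of-two cluster. The subgroup size
# only constrains which cluster sizes are legal; it never changes the
# step count, so a 32-vs-64 test would emit the same shuffles.
def shuffle_steps(cluster_size):
    return int(math.log2(cluster_size))

for subgroup_size in (32, 64):
    print(subgroup_size, shuffle_steps(8))  # 3 steps either way
```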

@kuhar (Member) commented Sep 18, 2024

Ah, OK then

@kuhar (Member) left a comment

looks good, let's wait for the mlir commit to propagate to iree

@andfau-amd andfau-amd force-pushed the 18142-clustered-subgroup-reduce-integration branch from 73c6de6 to 25c1a41 Compare September 18, 2024 20:16
@andfau-amd andfau-amd force-pushed the 18142-clustered-subgroup-reduce-integration branch from 25c1a41 to 4b984f1 Compare September 23, 2024 16:23
@andfau-amd (Contributor, Author)

New LLVM integrate has landed, I've rebased this on main now and removed the LLVM cherry-picks commit, so let's see if CI is happy.

@kuhar (Member) left a comment

LGTM

@kuhar kuhar merged commit c0909a4 into iree-org:main Sep 23, 2024
35 checks passed
Successfully merging this pull request may close these issues.

[GPU] Clustered Subgroup Reduction