add subgroups, and make them portable if possible #4306

dneto0 · 2023-09-26T21:28:55Z

Add subgroup (a.k.a. simd_group, wave, wavefront) operations. Favour portability.

There have been a number of earlier issues and PRs. None seemed quite right to restart the conversation, so I'm opening this new one.

Previous work:

Introduce Subgroup Operations Extension #954 (closed): Oguz’s original PR.
- Includes discussion of nonportability: theoretical concerns, a demo, and Myles’s broader replication of the demo (including an M1 bug and an Intel hang)
Introduce Subgroup Operations Extension #1459 (closed): Oguz’s second PR.
Add capability to query gpu "warp" size #4290: request to query the subgroup (“warp”) size.
Investigation: Querying Subgroup Support #78
Considerations for subgroups #3950: Raph's “Considerations for subgroups”.
Request for compute: anyInvocation() and allInvocation() #2137

Implementations
gfx-rs/wgpu#4428: Naga request.

Interactions with uniformity:

Annotation for the uniformity analysis #2323
Uniformity annotations for global variables #1791

Benefits: Subgroup operations offer compelling performance benefits.

Drawbacks: There are theoretical reasons to doubt their portability. Earlier discussion included tiny demonstrations of nonportability.

Subgroups were postponed out of "v1" until we could devote more energy to investigating them in more detail. Now is the time.

@alan-baker has been leading an effort at Google to:

Implement an experimental subgroups extension supporting the "ballot" and "broadcast" operations only.
Write prototype conformance tests to check the portability behaviours we were concerned about. You only need "ballot" to do this because it tells you the effective active mask.
- This work is in this draft PR: WIP: Experimental reconvergence tests cts#2916
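To illustrate why "ballot" alone is enough to observe the effective active mask, here is a minimal CPU-side sketch. It is not taken from the CTS PR: `simulateBallot` and the 8-wide subgroup are hypothetical, and the model assumes at most 32 invocations so the mask fits in one word.

```typescript
// CPU-side model of a ballot, for illustration only (not the CTS code).
// Assumes at most 32 invocations so the mask fits in one 32-bit word.
function simulateBallot(active: boolean[], predicate: boolean[]): number {
  if (active.length !== predicate.length || active.length > 32) {
    throw new Error("expected equal-length arrays of at most 32 invocations");
  }
  let mask = 0;
  for (let i = 0; i < active.length; i++) {
    // Only invocations that are executing the ballot can contribute a bit.
    if (active[i] && predicate[i]) {
      mask |= 1 << i;
    }
  }
  return mask >>> 0; // reinterpret as unsigned
}

// Hypothetical 8-wide subgroup that diverges on `id % 2 == 0`.
const subgroupSize = 8;
const allTrue: boolean[] = [];
const evenLanes: boolean[] = [];
const oddLanes: boolean[] = [];
for (let id = 0; id < subgroupSize; id++) {
  allTrue.push(true);
  evenLanes.push(id % 2 === 0);
  oddLanes.push(id % 2 !== 0);
}

// ballot(true) inside each branch reports exactly the lanes that took it.
const maskInIf = simulateBallot(evenLanes, allTrue);
const maskInElse = simulateBallot(oddLanes, allTrue);
// After the branches merge, a reconverged subgroup would report every lane.
const maskAfter = simulateBallot(allTrue, allTrue);
```

A reconvergence test can then compare masks like these, computed by a simulator, against what the GPU actually returns at the same program points.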

Let's use this issue to show the data, and then discuss how to shape the feature.

magcius · 2023-10-11T02:55:57Z

While I don't want to push too much R&D work into WebGPU, I do think that a good number of people who would want subgroups might be happier with operations that work across the full workgroup.

Related: microsoft/hlsl-specs#105

JMS55 · 2023-10-11T05:19:53Z

I would personally love workgroup level reduction primitives, but yeah it would need more development than subgroups and I would be happy just to get subgroups.

Also I want to note, on some gpu architectures, subgroup size depends on the final compiled shader and is up to the driver. I'm unsure if polyfilling workgroup level operations on top of subgroup ops would be possible on these platforms without cooperation with the driver. I lack enough experience to say either way, I just want to bring it up as a point.

alan-baker · 2023-10-11T14:42:21Z

Workgroup operations are out of scope for this issue, but feel free to file a new issue with that request so we can track it. I expect the answer will be not until it is exposed by underlying APIs.

raphlinus · 2023-10-11T19:09:41Z

Regarding the comment by @JMS55: yes, having unpredictable subgroup size is one of the major portability challenges, and that is the primary point of #3950. That issue contains a concrete proposal which I hope will be considered carefully. The core of it is that minimum and maximum subgroup size are constants that can be used at compile time, for example to size arrays, and the actual subgroup size is available at runtime. That may sound pretty basic, but underlying APIs (with the exception of Vulkan 1.3) are hostile to reliably providing that information.
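As a rough sketch of what "constants usable at compile time to size arrays" could buy, the following shows the arithmetic only. The `SubgroupLimits` shape, the function names, and the numbers are hypothetical, and the sketch assumes every subgroup in the workgroup is full.

```typescript
// Hypothetical limits; the names are illustrative, not a settled API.
interface SubgroupLimits {
  minSubgroupSize: number;
  maxSubgroupSize: number;
}

// Worst case: the smallest subgroup size yields the most subgroups, so a
// per-subgroup scratch array must be sized from minSubgroupSize.
function scratchSlotsNeeded(workgroupSize: number, limits: SubgroupLimits): number {
  return Math.ceil(workgroupSize / limits.minSubgroupSize);
}

// At runtime the actual size is known, so only some slots are used.
// Assumes full subgroups; partial subgroups would need extra care.
function subgroupsInUse(workgroupSize: number, actualSubgroupSize: number): number {
  return Math.ceil(workgroupSize / actualSubgroupSize);
}

const limits: SubgroupLimits = { minSubgroupSize: 8, maxSubgroupSize: 32 };
const slots = scratchSlotsNeeded(256, limits); // array length fixed at compile time
const used = subgroupsInUse(256, 16);          // depends on the runtime subgroup size
```

With these example numbers the array is sized for 32 subgroups while a run with 16-wide subgroups uses 16 of them. The point is that the allocation never has to be revisited once the actual size is known.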

I also recommend prioritizing subgroup operations in uniform control flow. All of the use cases I care deeply about are effectively in uniform control flow. My intuition is that specifying the semantics of those use cases might be easier than the fully general case, as well as getting shader compilers to emit code consistent with that spec.

Lichtso · 2023-10-20T22:06:36Z

gfx-rs/wgpu#4190 proposes a set of built-in functions and built-in values and implements them for DirectX, Metal, Vulkan and OpenGL. Feedback and ideas are welcome.

alan-baker · 2023-11-06T17:00:52Z

Chrome implemented minimal experimental extensions to measure how portable subgroup operations are in practice.

Extension

WGSL

The following additions were made to WGSL:

subgroupBallot - an unpredicated version of ballot
subgroup_size - built-in value for subgroup size access
subgroup_invocation_id - built-in value for invocation id within a subgroup

API

The following additions were made to the API:

Two new features:
- chromium-experimental-subgroups: core subgroup functionality
- chromium-experimental-subgroup-uniform-control-flow: mirrors SPV_KHR_subgroup_uniform_control_flow; this is not implementable on most platforms
New limits: minSubgroupSize and maxSubgroupSize
A pipeline creation flag to require full subgroups
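A page wanting to try these features would presumably probe the adapter before requesting a device. The sketch below assumes the standard WebGPU `adapter.features.has(...)` check; `AdapterLike` and `pickSubgroupFeatures` are hypothetical helpers, and only the two feature names come from the list above.

```typescript
// Minimal structural stand-in for GPUAdapter so the sketch type-checks
// without WebGPU type definitions (an assumption made for illustration).
interface AdapterLike {
  features: { has(name: string): boolean };
}

const SUBGROUPS = "chromium-experimental-subgroups";
const SUBGROUP_UCF = "chromium-experimental-subgroup-uniform-control-flow";

// Returns the experimental subgroup features the adapter reports, in a
// form that could be passed as `requiredFeatures` to requestDevice().
function pickSubgroupFeatures(adapter: AdapterLike): string[] {
  const wanted = [SUBGROUPS, SUBGROUP_UCF];
  const available: string[] = [];
  for (const name of wanted) {
    if (adapter.features.has(name)) {
      available.push(name);
    }
  }
  return available;
}

// Mock adapter that only exposes the core feature, which the note above
// suggests is the common case.
const mockAdapter: AdapterLike = {
  features: { has: (name: string) => name === SUBGROUPS },
};
const picked = pickSubgroupFeatures(mockAdapter);
```

Code built this way should treat the uniform-control-flow feature as optional and fall back when it is absent.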

Requirements

Vulkan: key requirement for the experiments is subgroup size control
Metal: simd scoped permute operations supported
D3D: SM 6.0

Experiments

Divergence/Reconvergence

gpuweb/cts#2916 implements reconvergence tests for subgroup operations. The tests are based on Vulkan CTS's experimental reconvergence tests (available here).

There are 4 styles of reconvergence tested:

wgsl_v1 - this is intended to match the WGSL V1 spec
workgroup - a slight extension to wgsl_v1 that requires loop iterations to reconverge
subgroup - extends workgroup to subgroup scope (a la SPV_KHR_subgroup_uniform_control_flow)
maximal - set of rules that are expected to mostly match developer intuition for the HLL. No spec implements these rules

The tests are a combination of predefined and pseudo-randomly generated cases that are swept across the various reconvergence styles. Each program is simulated, and the simulated results are compared against the actual GPU results. Ideally, all implementations should pass at least wgsl_v1, and hopefully workgroup and subgroup too. Maximal is more for investigation.

There is an additional set of tests (uniform_maximal) that check the behaviour when all branches are uniform (i.e. no divergence occurs in the workgroup). The expectation is that all implementations should pass these tests.

The tests can be run using dawn.node or chrome canary.

Results

We collected results from a variety of platforms and devices:

Predefined tests

GPU	Driver	Platform	Impl	Num Tests	wgsl_v1	workgroup	subgroup	maximal
Apple M1 Pro	13.5.1 (22G90)	Metal	Chrome	15	12	12	12	4
Apple M1 Pro	14.1 (23B74)	Metal	Dawn	16	13	13	13	5
Intel HD 630		Metal	Dawn	15	15	15	15	12
Intel HD TGL GT1	Mesa 22.3.6	Vulkan	Dawn	16	16	16	16	16
Intel HD TGL GT2	Mesa 22.3.6	Vulkan	Dawn	16	16	16	16	16
AMD Radeon Pro 560		Metal	Dawn	15	14	14	14	12
AMD Radeon Pro WX 3200	Windows 10 (19045.3324)	D3D12		15	14	14	14	14
Pixel 6 Pro (Mali G78)	UPB5.230623.005	Vulkan	Chrome	15	15	15	15	12
Pixel 3 (Adreno 630)	SP1A.210812.016.C2	Vulkan	Chrome	15	10	10	10	7
Nvidia Quadro P1000¹	Linux 525.125.6.384	Vulkan	Dawn	15	15	9	12	7

¹ The Nvidia device seemed to give non-deterministic results across multiple runs.

Random tests

GPU	Driver	Platform	Impl	Num Tests	wgsl_v1	workgroup	subgroup	maximal
Apple M1 Pro	13.5.1 (22G90)	Metal	Chrome	100	87	87	87	28
Apple M1 Pro	14.1 (23B74)	Metal	Dawn	100	87	87	87	28
Intel HD 630		Metal	Dawn	100	100	100	100	79
Intel HD TGL GT1¹	Mesa 22.3.6	Vulkan	Dawn	100	98	98	98	96
Intel HD TGL GT2¹	Mesa 22.3.6	Vulkan	Dawn	100	100			99
AMD Radeon Pro 560²		Metal	Dawn	100
AMD Radeon Pro WX 3200	Windows 10 (19045.3324)	D3D12		100	36	36	34	62
Pixel 6 Pro (Mali G78)	UPB5.230623.005	Vulkan	Chrome	100	100	100	100	81
Pixel 3 (Adreno 630)³	SP1A.210812.016.C2	Vulkan	Chrome	100
Nvidia Quadro P1000	Linux 525.125.6.384	Vulkan	Dawn	100	100	100	100	86

¹ All failures were timeouts.
² Compiler bug prevented testing. Many cases would memout around 80GB.
³ Driver crashes prevented testing.

Uniform tests

The PR also contains a set of pseudo-randomly generated tests that always select uniform branches (uniform_maximal set). These were added later so we haven't collected as much information from them. All devices should pass these tests.

GPU	Driver	Platform	Impl	Num Tests	maximal
Apple M1 Pro¹	13.5.1 (22G90)	Metal	Chrome	500	498
Apple M1 Pro	14.1 (23B74)	Metal	Dawn	500	500
Intel HD 630		Metal	Dawn	500	500
Intel HD TGL GT1¹	Mesa 22.3.6	Vulkan	Dawn	500	500
Intel HD TGL GT2¹	Mesa 22.3.6	Vulkan	Dawn	500	500
Pixel 6 Pro (Mali G78)	UPB5.230623.005	Vulkan	Chrome	500	500
Nvidia Quadro P1000	Linux 525.125.6.384	Vulkan	Dawn	500	500

¹ Functional incorrectness in the two failures. Bugs were opened with Apple.

Subgroup Size

More testing is required to confirm that subgroup sizes are reliable, but early indications are that the requirements placed on Vulkan are sufficient. Metal also appears to be ok in this regard. D3D12 requires more testing.

The tests check that a ballot bit count matches the value from the subgroup size built-in value, but currently the PR does not check the newly added limits. I have an experimental patch (that requires IDL and Dawn changes) that verifies the Vulkan behaviour.
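The bit-count check described above can be sketched on the CPU as follows. This is not the CTS code: it assumes the ballot is read back as four 32-bit words (a 128-bit mask) and that every invocation in the subgroup was active when the ballot ran. The function names are hypothetical.

```typescript
// Counts set bits in one 32-bit word using Kernighan's trick:
// each `x &= x - 1` clears the lowest set bit.
function popcount32(word: number): number {
  let x = word >>> 0;
  let count = 0;
  while (x !== 0) {
    x &= x - 1;
    count++;
  }
  return count;
}

// True when the number of set ballot bits equals the reported subgroup
// size. Only meaningful for a ballot taken with all invocations active.
function ballotMatchesSubgroupSize(ballotWords: number[], subgroupSize: number): boolean {
  let bits = 0;
  for (const word of ballotWords) {
    bits += popcount32(word);
  }
  return bits === subgroupSize;
}

const full32 = ballotMatchesSubgroupSize([0xffffffff, 0, 0, 0], 32);
const full64 = ballotMatchesSubgroupSize([0xffffffff, 0xffffffff, 0, 0], 64);
const mismatch = ballotMatchesSubgroupSize([0x0000ffff, 0, 0, 0], 32);
```

A mismatch like the last case would indicate either inactive invocations or a built-in value that disagrees with the hardware's real subgroup size. Checking the value against the new limits would be a separate comparison.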

Further testing

I haven't been able to test the "require full subgroups" pipeline flag yet. It has obvious implementations for Metal and Vulkan, but not D3D12. Most implementations seem to do the right thing here anyway, though.

Discussion

Behaviour is not portable. The failures for even wgsl_v1 reconvergence are problematic: they mean we cannot produce portable behaviour even by requiring that the built-in functions only be used in uniform control flow. We could state that behaviour is portable if you do not diverge the workgroup/subgroup. That is an important use case in terms of pure acceleration, but it leaves large gaps in the overall portability of the feature.

alan-baker · 2023-11-07T21:58:37Z

Made a proposal in #4368 based on the previous work.

kdashg · 2023-12-06T22:39:50Z

WGSL 2023-12-05 Minutes

AB: The M1 part we want is setting a basic direction.
JB: Let’s talk about that in the second half of the meeting.

dneto0 added the wgsl WebGPU Shading Language Issues label Sep 26, 2023

dneto0 added this to the Milestone 2 milestone Sep 26, 2023

dneto0 mentioned this issue Sep 27, 2023

Support for explicit matmul instructions / tensor core instructions #4137

Open

This was referenced Sep 30, 2023

Subgroup Operations gfx-rs/naga#2523

Closed

Subgroup Operations gfx-rs/wgpu#4190

Closed

alan-baker mentioned this issue Jan 25, 2024

[SPIR-V] Expose maximal reconvergence microsoft/hlsl-specs#164

Closed
