Add subgroups proposal #4368

Merged Dec 6, 2023 (11 commits)

File: proposals/subgroups.md (248 additions, 0 deletions)

# Subgroups

Status: **Draft**

Last modified: 2023-11-07

Issue: #4306

# Requirements

**Vulkan**:
* SPIR-V 1.3, and
* Vulkan 1.1, and
* subgroupSupportedStages includes the compute and fragment bits, and
* subgroupSupportedOperations includes the following bits: basic, vote, ballot, shuffle, shuffle relative, arithmetic, quad, and
* Vulkan 1.3 or VK_EXT_subgroup_size_control

According to [this query](https://vulkan.gpuinfo.org/displaycoreproperty.php?core=1.1&name=subgroupSupportedOperations&platform=all),
~84% of devices are captured.

**Comment (Contributor):**

It looks like these percentages are fractions of unique reports.

It would be more meaningful to weight by shipped (and active) units. But this data is hard to get. (As a committee we should decide on a set of representative popular devices. But that goes way beyond this issue.)

I think generally the guidance works out ok though.

Dropping quad would only grab another ~1%.

**Metal**:
* Quad-scoped permute, and
* Simd-scoped permute, and
* Simd-scoped reduction, and
* Metal 2.1 for macOS or Metal 2.3 for iOS

According to the Metal
[feature table](https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf), this
includes the following families: Metal3, Apple7+, Mac2.


**D3D12**:
* SM6.0 support, and
* `D3D12_FEATURE_DATA_D3D12_OPTIONS1` permits `WaveOps`

# WGSL

## Enable Extension

Add two new enable extensions.
| Enable | Description |
| --- | --- |
| **subgroups** | Adds built-in values and functions for subgroups |
| **subgroups-f16** | Allows f16 to be used in subgroup operations |

Note: Metal can always provide subgroups-f16, Vulkan requires
VK_KHR_shader_subgroup_extended_types
([~61%](https://vulkan.gpuinfo.org/listdevicescoverage.php?extension=VK_KHR_shader_subgroup_extended_types&platform=all)
of devices), and D3D12 requires SM6.2.
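
As a sketch of what opting in could look like in WGSL source, assuming the enables surface as the tokens `subgroups` and `subgroups_f16` (the exact spellings are illustrative, not fixed by this proposal):

```wgsl
// Hypothetical module header; the enable token names are assumptions.
enable f16;            // existing extension, needed alongside subgroups-f16
enable subgroups;      // proposed: subgroup built-in values and functions
enable subgroups_f16;  // proposed: allows f16 in subgroup operations
```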

**Comment (Member):**

Do we need an additional subgroups-f16 feature? The existing f16 feature already requires SM6.2 and almost all devices that support the proposed subgroups functionality + the f16 feature also support shaderSubgroupExtendedTypes (see reports analysis).

**Reply (Member):**

Some numbers for Vulkan (pulled from the analysis):

* 78% of devices support the proposed subgroup feature
* 46% of devices support f16
* 42% of devices support both f16 and subgroups

(Percentages are relative to devices that support core WebGPU functionality.)

**Reply (Contributor Author):**

That's interesting data. I'd prefer not to cleave the feature. Can you clarify what the f16 feature is? Does it represent the set of features required to enable f16 in WGSL? The other features seem to mirror Vulkan features.

**Reply (Contributor Author):**

I've added this as a todo for now (and linked to your report).

**Reply (Member):**

The code for the f16 features is here.

This is the whole commit: teoxoy/gpuinfo-vulkan-query@8681e00 (includes the src and the analysis).

**Reply (Contributor Author):**

Ok so it is all the features needed to implement WGSL f16. Thanks!


**TODO**: Can we drop **subgroups-f16**?
According to this [analysis](https://github.com/teoxoy/gpuinfo-vulkan-query/blob/8681e0074ece1b251177865203d18b018e05d67a/subgroups.txt#L1071-L1466),
only 4% of devices that support both f16 and subgroups could not support
subgroup extended types.

**TODO**: Should this feature be broken down further?
According to [gpuinfo.org](https://vulkan.gpuinfo.org/displaycoreproperty.php?core=1.1&name=subgroupSupportedOperations&platform=all),
this feature set captures ~84% of devices.
Splitting the feature does not grab a significant portion of devices.

**Comment (Member):**

According to this report analysis, splitting off the arithmetic ops looks the most promising, but considering that the devices we get back are only Gen7 (Ivy Bridge, Haswell) Intel, it might not be worth it.

Splitting out simd-scoped reduction adds the Apple6 family which includes
iPhone11, iPhone SE, and iPad (9th gen).

**TODO**: Should there be additional enables for extra functionality?
Some possibilities:
* MSL and HLSL (SM6.7) support `any` and `all` operations at quad scope
* SPIR-V and HLSL (SM6.5) could support more exclusive/inclusive, clustered, and partitioned operations
* Inclusive sum and product could be done with multi-prefix SM6.5 operations in HLSL or through emulation

## Built-in Values

| Built-in | Type | Direction | Description |
| --- | --- | --- | --- |
| `subgroup_size` | u32 | Input | The size of the current subgroup |
| `subgroup_invocation_id` | u32 | Input | The index of the invocation in the current subgroup |

Note: HLSL does not expose a subgroup_id or num_subgroups equivalent.

**TODO**: Can subgroup_id and/or num_subgroups be emulated efficiently and portably?
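
As a sketch of how the two built-in values might be consumed (the buffer and the workgroup size are illustrative assumptions, not part of the proposal):

```wgsl
enable subgroups;

@group(0) @binding(0) var<storage, read_write> lane_info : array<u32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>,
        @builtin(subgroup_size) sg_size : u32,
        @builtin(subgroup_invocation_id) sg_id : u32) {
  // Pack the subgroup size and this invocation's lane index into one word
  // so the host can inspect how the workgroup was split into subgroups.
  lane_info[gid.x] = (sg_size << 16u) | sg_id;
}
```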

## Built-in Functions

All built-in functions can only be used in `compute` or `fragment` shader stages.
Using f16 as a parameter in any of these functions requires `subgroups-f16` to be enabled.

| Function | Preconditions | Description |
| --- | --- | --- |
| `fn subgroupElect() -> bool` | | Returns true if this invocation has the lowest subgroup_invocation_id among active invocations in the subgroup |
| `fn subgroupAll(e : bool) -> bool` | | Returns true if `e` is true for all active invocations in the subgroup |
| `fn subgroupAny(e : bool) -> bool` | | Returns true if `e` is true for any active invocation in the subgroup |
| `fn subgroupBroadcast(e : T, id : I) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types<br>`I` must be i32 or u32 | Broadcasts `e` from subgroup_invocation_id `id` to all active invocations. `id` must be dynamically uniform<sup>1</sup> |
| `fn subgroupBroadcastFirst(e : T) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Broadcasts `e` from the active invocation with the lowest subgroup_invocation_id in the subgroup to all other active invocations |
| `fn subgroupBallot(pred : bool) -> vec4<u32>` | | Returns a set of bitfields where the bit corresponding to subgroup_invocation_id is 1 if `pred` is true for that active invocation and 0 otherwise. |
| `fn subgroupShuffle(v : T, id : I) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types<br>`I` must be u32 or i32 | Returns `v` from the active invocation whose subgroup_invocation_id matches `id` |
| `fn subgroupShuffleXor(v : T, mask : u32) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Returns `v` from the active invocation whose subgroup_invocation_id matches `subgroup_invocation_id ^ mask`.<br>`mask` must be dynamically uniform. |
| `fn subgroupShuffleUp(v : T, delta : u32) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Returns `v` from the active invocation whose subgroup_invocation_id matches `subgroup_invocation_id - delta` |

**Comment (Contributor):**

What happens when subgroup_invocation_id +- delta goes OOB?

**Reply (Contributor Author):**

An indeterminate value is returned. I was trying to avoid all the fiddly cases in the proposal that will have to be addressed in the final spec.

MSL says lower invocations are unmodified, but SPIR-V says undefined value.

**Reply:**

Actually, MSL has a variant of the shuffles which fills in a defined value in the OOB cases, but leaving it as UB is fine as well IMO.

**Reply (Collaborator):**

Maybe I'm misunderstanding this, but given that at least one invocation would end up with an indeterminate value, wouldn't the developer always have to end up doing some comparison against subgroup_invocation_id to ensure the indeterminate value doesn't cause trouble?

| `fn subgroupShuffleDown(v : T, delta : u32) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Returns `v` from the active invocation whose subgroup_invocation_id matches `subgroup_invocation_id + delta` |
| `fn subgroupSum(e : T) -> T` | `T` must be u32, i32, f32, or a vector of those types | Reduction<br>Adds `e` among all active invocations and returns that result |
| `fn subgroupExclusiveSum(e : T) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Exclusive scan<br>Returns the sum of `e` for all active invocations with subgroup_invocation_id less than this invocation |
| `fn subgroupProduct(e : T) -> T` | `T` must be u32, i32, f32, or a vector of those types | Reduction<br>Multiplies `e` among all active invocations and returns that result |
| `fn subgroupExclusiveProduct(e : T) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Exclusive scan<br>Returns the product of `e` for all active invocations with subgroup_invocation_id less than this invocation |
| `fn subgroupAnd(e : T) -> T` | `T` must be u32, i32, or a vector of those types | Reduction<br>Performs a bitwise and of `e` among all active invocations and returns that result |
| `fn subgroupOr(e : T) -> T` | `T` must be u32, i32, or a vector of those types | Reduction<br>Performs a bitwise or of `e` among all active invocations and returns that result |
| `fn subgroupXor(e : T) -> T` | `T` must be u32, i32, or a vector of those types | Reduction<br>Performs a bitwise xor of `e` among all active invocations and returns that result |
| `fn subgroupMin(e : T) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Reduction<br>Performs a min of `e` among all active invocations and returns that result |
| `fn subgroupMax(e : T) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Reduction<br>Performs a max of `e` among all active invocations and returns that result |
| `fn quadBroadcast(e : T, id : I) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types<br>`I` must be u32 or i32 | Broadcasts `e` from the quad invocation with id equal to `id`<br>`id` must be a constant-expression<sup>2</sup> |
| `fn quadSwapX(e : T) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Swaps `e` between invocations in the quad in the X direction |

**Comment (Contributor):**

Is there a definition of how invocations are grouped into quads (how it maps to subgroup_invocation_id)?

**Reply (Contributor Author):**

Vulkan explicitly requires that quads are consecutive subgroup invocations (see https://registry.khronos.org/vulkan/specs/1.3-extensions/html/vkspec.html#shaders-scope-quad). MSL provides a quad id. D3D12 says they are evenly divided in the subgroup (so I assume just like Vulkan).

| `fn quadSwapY(e : T) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Swaps `e` between invocations in the quad in the Y direction |
| `fn quadSwapDiagonal(e : T) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Swaps `e` between invocations in the quad diagonally |
1. This is the first instance of dynamic uniformity. See the portability and uniformity section for more details.
2. Unlike `subgroupBroadcast`, SPIR-V does not have a shuffle operation to fall back on, so this requirement must be surfaced.
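
To make the table concrete, here is a sketch combining a reduction with a guarded relative shuffle, reflecting the OOB discussion above; the bindings and workgroup size are illustrative assumptions:

```wgsl
enable subgroups;

@group(0) @binding(0) var<storage, read> input : array<f32>;
@group(0) @binding(1) var<storage, read_write> output : array<f32>;

@compute @workgroup_size(64)
fn reduce(@builtin(global_invocation_id) gid : vec3<u32>,
          @builtin(subgroup_invocation_id) sg_id : u32) {
  let v = input[gid.x];

  // Reduction: every active invocation receives the subgroup-wide sum.
  let sum = subgroupSum(v);

  // Relative shuffle: lane 0 has no lane below it and would receive an
  // indeterminate value, so it falls back to its own value.
  var prev = subgroupShuffleUp(v, 1u);
  if (sg_id == 0u) {
    prev = v;
  }

  output[gid.x] = sum + prev;
}
```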

**TODO**: Are quad operations worth it?

**Comment (Contributor):**

I don't think we are adding subgroup operations to fragment shaders yet, right? So are there helper invocations?

**Reply (Contributor):**

Ah, it seems we are, but I thought these were even newer functionality than compute subgroups. Are we looking to add them anyway? It seems it could lose some reach.

**Reply (Contributor Author):**

According to gpuinfo.org, very few devices don't support subgroup operations in fragment shaders.

Because general portability seems to be off the table based on empirical testing, it seemed beneficial to allow them in fragment shaders.

**Reply (Member):**

Requiring subgroupSupportedStages to include the fragment stage seems to drop support for the Adreno 600 series (see reports analysis).

This is probably fine given that D3D12 and Metal support subgroup ops in the fragment stage unconditionally.

Quad operations present even less portability than subgroup operations due to
factors like helper invocations and multiple draws being packed into a
subgroup.
SM6.7 adds an attribute to require that helper invocations be active.

**TODO**: Can we spec the builtins to improve portability without hurting performance?
E.g. shuffle up or down when delta is clearly out of range.
Need to consider the effect of active vs. inactive invocations.

## Portability and Uniformity

Unfortunately,
[testing](https://github.com/gpuweb/gpuweb/issues/4306#issuecomment-1795498468)
indicates that behavior is not widely portable across devices.
Even requiring that the subgroup operations only be used in uniform control
flow (at workgroup scope) is insufficient to produce portable behavior.
For example, compilers make aggressive optimizations that do not preserve the
correct active invocations.
This leaves us in an awkward situation where portability cannot be guaranteed,
but these operations provide significant performance improvements in many
circumstances.

The suggestion is to allow these operations anywhere and provide guidance on how to achieve portable behavior.

**Comment (Collaborator):**

This does feel somewhat counter to all the efforts we've put in for portability. This is certainly an implementor's decision, but I would feel more comfortable making the extension unavailable for known drivers that perform the aggressive optimizations that do not preserve the correct active invocations. At least then there's an incentive for these things to be fixed, instead of being forever worked around.

**Reply (Contributor Author):**

Added a todo to see how much more portable we can make individual functions.

From testing, it seems all implementations are able to produce portable results
when the workgroup never diverges.
While this may seem obvious, it still provides significant benefit in many
cases (for example, by reducing overall memory bandwidth).

Not requiring any particular uniformity also makes providing these operations
in fragment shaders more palatable.
Normally, there would be extra portability hazards in fragment shaders (e.g.
due to helper invocations).
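
A sketch of the "never diverge" guidance: rather than early-returning for the array tail (which would deactivate some invocations before the subgroup call), every invocation stays active and out-of-range lanes contribute an identity value. The buffer names and the `n` override are assumptions for illustration:

```wgsl
enable subgroups;

@group(0) @binding(0) var<storage, read> data : array<f32>;
@group(0) @binding(1) var<storage, read_write> result : array<f32>;

override n : u32;  // element count, set at pipeline creation (assumed > 0)

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  // Keep control flow uniform up to the subgroup call: clamp the index and
  // feed an identity value for out-of-range invocations instead of branching.
  let in_range = gid.x < n;
  let v = select(0.0, data[min(gid.x, n - 1u)], in_range);

  // Every invocation in the workgroup reaches this call together.
  let sum = subgroupSum(v);

  // Diverging after the subgroup operation is harmless.
  if (in_range) {
    result[gid.x] = sum;
  }
}
```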

## Diagnostics

Add new diagnostic controls:

| Filterable Triggering Rule | Default Severity | Triggering Location | Description |
| --- | --- | --- | --- |
| **subgroup_uniformity** | Error | Call site of a subgroup builtin function | A call to a subgroup builtin that the uniformity analysis cannot prove occurs in uniform control flow (or with uniform parameter values in some cases) |
| **subgroup_branching** | Error | Call site of a subgroup builtin function | A call to a subgroup builtin that uniformity analysis cannot prove is preceded only by uniform branches |

**TODO**: Are these defaults appropriate?
They attempt to default to the most portable behavior, but that means it would
be an error to have a subgroup operation preceded by divergent control flow.
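
As a sketch, assuming the proposed rule names plug into the existing WGSL diagnostic directive, an author who accepts non-portable behavior could lower the severity at module scope:

```wgsl
enable subgroups;

// Assumed usage of the filterable rule name from the table above; with the
// default severity of Error, the call below would be rejected.
diagnostic(warning, subgroup_uniformity);

@compute @workgroup_size(64)
fn main(@builtin(local_invocation_index) lid : u32) {
  var bits = vec4<u32>();
  if (lid < 32u) {
    // Possibly non-uniform control flow: reported as a warning, not an error.
    bits = subgroupBallot(true);
  }
}
```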

# API

## GPU Feature

New GPU features:
| Feature | Description |
| --- | --- |
| **subgroups** | Allows the WGSL feature and adds new limits |
| **subgroups-f16** | Allows the WGSL feature. Requires **subgroups** and **shader-f16** |

**TODO**: Can we expose a feature to require a specific subgroup size?
No facility exists in Metal so it would have to be a separate feature.
In SM6.6 HLSL adds a required wave size attribute for shaders.
In Vulkan, pipelines can specify a required size between min and max using
subgroup size control.
This is a requested feature (see #3950).

## Limits

Two new limits:
| Limit | Description | Vulkan | Metal | D3D12 |
| --- | --- | --- | --- | --- |
| subgroupMinSize | Minimum subgroup size | minSubgroupSize from VkPhysicalDeviceSubgroupSizeProperties[EXT] | 4 | WaveLaneCountMin from D3D12_FEATURE_DATA_D3D12_OPTIONS1 |
| subgroupMaxSize | Maximum subgroup size | maxSubgroupSize from VkPhysicalDeviceSubgroupSizeProperties[EXT] | 64 | 128 |

The major requirement is that no shader will be launched where the `subgroup_size`
built-in value is less than `subgroupMinSize` or greater than
`subgroupMaxSize`.

**TODO**: I have been unable to find a more accurate query for Metal subgroup
sizes before pipeline compilation.

**TODO**: More testing is required to verify the reliability of D3D12 WaveLaneCountMin.

**TODO**: We could consider adding a limit for which stages support subgroup
operations for future expansion, but it is not necessary now.

# Pipelines

Note: Vulkan backends should either pass
VkPipelineShaderStageRequiredSubgroupSizeCreateInfoEXT or the ALLOW_VARYING_SUBGROUP_SIZE flag at pipeline
creation to ensure the subgroup size built-in value works correctly.

**TODO**: Can we add a pipeline parameter to require full subgroups in compute shaders?
Validate that workgroup size x dimension is a multiple of max subgroup size.
For Vulkan, this would set the FULL_SUBGROUPS pipeline creation bit.
For Metal, this would use threadExecutionWidth.
D3D12 would have to be proven empirically.

# Appendix A: WGSL Built-in Value Mappings

| Built-in | SPIR-V | MSL | HLSL |
| --- | --- | --- | --- |
| `subgroup_size` | SubgroupSize | threads_per_simdgroup | WaveGetLaneCount |
| `subgroup_invocation_id` | SubgroupLocalInvocationId | thread_index_in_simdgroup | WaveGetLaneIndex |

# Appendix B: WGSL Built-in Function Mappings

| Built-in | SPIR-V<sup>1</sup> | MSL | HLSL |
| --- | --- | --- | --- |
| `subgroupElect` | OpGroupNonUniformElect | simd_is_first | WaveIsFirstLane |
| `subgroupAll` | OpGroupNonUniformAll | simd_all | WaveActiveAllTrue |
| `subgroupAny` | OpGroupNonUniformAny | simd_any | WaveActiveAnyTrue |
| `subgroupBroadcast` | OpGroupNonUniformBroadcast<sup>2</sup> | simd_broadcast | WaveReadLaneAt |
| `subgroupBroadcastFirst` | OpGroupNonUniformBroadcastFirst | simd_broadcast_first | WaveReadLaneFirst |
| `subgroupBallot` | OpGroupNonUniformBallot | simd_ballot | WaveActiveBallot |
| `subgroupShuffle` | OpGroupNonUniformShuffle | simd_shuffle | WaveReadLaneAt with non-uniform index |
| `subgroupShuffleXor` | OpGroupNonUniformShuffleXor | simd_shuffle_xor | WaveReadLaneAt with index equal `subgroup_invocation_id ^ mask` |
| `subgroupShuffleUp` | OpGroupNonUniformShuffleUp | simd_shuffle_up | WaveReadLaneAt with index equal `subgroup_invocation_id - delta` |
| `subgroupShuffleDown` | OpGroupNonUniformShuffleDown | simd_shuffle_down | WaveReadLaneAt with index equal `subgroup_invocation_id + delta` |
| `subgroupSum` | OpGroupNonUniform[IF]Add with Reduce operation | simd_sum | WaveActiveSum |
| `subgroupExclusiveSum` | OpGroupNonUniform[IF]Add with ExclusiveScan operation | simd_prefix_exclusive_sum | WavePrefixSum |
| `subgroupProduct` | OpGroupNonUniform[IF]Mul with Reduce operation | simd_product | WaveActiveProduct |
| `subgroupExclusiveProduct` | OpGroupNonUniform[IF]Mul with ExclusiveScan operation | simd_prefix_exclusive_product | WavePrefixProduct |
| `subgroupAnd` | OpGroupNonUniformBitwiseAnd with Reduce operation | simd_and | WaveActiveBitAnd |
| `subgroupOr` | OpGroupNonUniformBitwiseOr with Reduce operation | simd_or | WaveActiveBitOr |
| `subgroupXor` | OpGroupNonUniformBitwiseXor with Reduce operation | simd_xor | WaveActiveBitXor |
| `subgroupMin` | OpGroupNonUniform[SUF]Min with Reduce operation | simd_min | WaveActiveMin |
| `subgroupMax` | OpGroupNonUniform[SUF]Max with Reduce operation | simd_max | WaveActiveMax |
| `quadBroadcast` | OpGroupNonUniformQuadBroadcast | quad_broadcast | QuadReadLaneAt |
| `quadSwapX` | OpGroupNonUniformQuadSwap with Direction=0 | quad_shuffle with `quad_lane_id=thread_index_in_quad_group ^ 0x1` | QuadReadAcrossX |
| `quadSwapY` | OpGroupNonUniformQuadSwap with Direction=1 | quad_shuffle with `quad_lane_id=thread_index_in_quad_group ^ 0x2` | QuadReadAcrossY |
| `quadSwapDiagonal` | OpGroupNonUniformQuadSwap with Direction=2 | quad_shuffle with `quad_lane_id=thread_index_in_quad_group ^ 0x3` | QuadReadAcrossDiagonal |


1. All group non-uniform instructions use the `Subgroup` scope.
2. To avoid the constant-expression requirement, use SPIR-V 1.5 or OpGroupNonUniformShuffle.