Add subgroups proposal #4368
Conversation
Previews, as seen when this build job started (161220e):
Is there a rationale for including
proposals/subgroups.md
Outdated
| `fn subgroupShuffleUp(v : T, delta : u32) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Returns `v` from the active invocation whose subgroup_invocation_id matches `subgroup_invocation_id - delta` | |
| `fn subgroupShuffleDown(v : T, delta : u32) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Returns `v` from the active invocation whose subgroup_invocation_id matches `subgroup_invocation_id + delta` | |
| `fn subgroupSum(e : T) -> T` | `T` must be u32, i32, f32, or a vector of those types | Reduction<br>Adds `e` among all active invocations and returns that result | |
| `fn subgroupInclusiveSum(e : T) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Inclusive scan<br>Returns the sum of `e` for all active invocations with subgroup_invocation_id less than or equal to this invocation | |
subgroupInclusiveSum naming bikeshedding
It's not obvious from the name that this behaves differently from `subgroupSum` in that it returns a unique value per invocation rather than a subgroup-uniform value. Not having prior knowledge I'd think a standard sum operation was already "inclusive".
I think `subgroupPrefixInclusiveSum` or `subgroupInclusivePrefixSum` would be much more obvious at the cost of verbosity. Alternatively, `subgroupPrefixSum` still hints that something more complex than an ordinary sum is going on to an unfamiliar reader, but for experienced users leaves room for confusion as to the in/exclusivity. It doesn't help that DX12 `WavePrefixSum()` is exclusive.
It is meant to be exclusive.
Bikeshedding thoughts.
Comparing prior art:
- "Prefix sum" is what I've known the general algorithm as, and likely the first entry for most folks. https://en.wikipedia.org/wiki/Prefix_sum
- HLSL and MSL use "prefix" in the names. GLSL does not.
- HLSL and MSL use "sum" and "product". GLSL uses "Add" and "Mul". The GLSL way generalizes to logical operations like And, Or, Max, Min (or those are simply ambiguous! Hmmm.)
Once I see many operations in context like in this Vulkan subgroups tutorial, lacking "prefix" doesn't seem so bad because it's implied by `Inclusive` or `Exclusive`.
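To make the distinction behind the naming question concrete, here is a plain-Python sketch (illustrative helper names, not WGSL) of the three semantics under discussion: a reduction yields one subgroup-uniform value, while inclusive and exclusive scans yield a distinct prefix per invocation.

```python
def subgroup_sum(values):
    # Reduction: every invocation receives the same subgroup-uniform total.
    total = sum(values)
    return [total] * len(values)

def subgroup_inclusive_sum(values):
    # Inclusive scan: invocation i receives the sum of values[0..i].
    out, acc = [], 0
    for v in values:
        acc += v
        out.append(acc)
    return out

def subgroup_exclusive_sum(values):
    # Exclusive scan: invocation i receives the sum of values[0..i-1];
    # invocation 0 receives the additive identity, 0.
    out, acc = [], 0
    for v in values:
        out.append(acc)
        acc += v
    return out
```

For values `[1, 2, 3, 4]` these return `[10, 10, 10, 10]`, `[1, 3, 6, 10]`, and `[0, 1, 3, 6]` respectively, which is exactly the difference the "prefix" / "inclusive" / "exclusive" spellings are trying to signal.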
Do we have any concrete examples of what a developer would have to do to write performant code that worked on all platforms? I get in the abstract what they'd have to do, but it would be nice to see some more concrete example JS/TS code of what a developer would have to do to run across devices.
I would also strongly recommend having the exclusive prefix scans or both, not the inclusive versions only. The reason is that it is easy to build the inclusive version from the exclusive one, but not the other way around. E.g. take multiplication: one can just multiply the current invocation's value and the exclusive product to get the inclusive product. However, undoing the inclusive product into an exclusive one would require a division, which is expensive and has ugly edge cases such as division by zero and the smallest negative integer overflow. Also, I did not see any barriers in the proposal, neither for the control flow nor the memory.
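The asymmetry argued for here can be sketched in plain Python (hypothetical helper names): deriving the inclusive product from the exclusive one costs a single multiply per invocation, while the reverse direction needs a division with exactly the edge cases mentioned.

```python
def exclusive_product(values):
    # Exclusive prefix product: invocation i gets the product of
    # values[0..i-1]; invocation 0 gets the multiplicative identity, 1.
    out, acc = [], 1
    for v in values:
        out.append(acc)
        acc *= v
    return out

def inclusive_from_exclusive(values):
    # Cheap direction: inclusive[i] = exclusive[i] * values[i].
    return [e * v for e, v in zip(exclusive_product(values), values)]

def exclusive_from_inclusive(inclusive, values):
    # Expensive direction: exclusive[i] = inclusive[i] / values[i], which
    # fails outright when an invocation's value is 0 and overflows for
    # INT_MIN in fixed-width integer types.
    return [i // v for i, v in zip(inclusive, values)]
```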
The inclusive scan was a typo. I realize HLSL is missing this intrinsic, but I flipped the name when I was writing.
The problem with barriers is that HLSL doesn't have a wave barrier, MSL says the barrier is uniform, but SPIR-V says the barrier is non-uniform. I certainly wouldn't want to have HLSL use a workgroup barrier for this.
We can consider those, but I think it requires more testing to ensure D3D12 behavior is always consistent. I don't have access to enough Windows devices to test this fully. My worry would be that HLSL has a similar problem to Vulkan 1.1 where the subgroup size builtin could just return a static number. There is a possibility that D3D12 backends will have to emulate subgroup_size.
I do think we'd want to formalize that guidance. Internally, we have such code, but I can't share it. Coincidental with this feature getting implemented it would make sense to also have a sample or two to demonstrate this. I could see an additional diagnostic control being added that shader authors could use to warn/error if they went outside those bounds. At a high level, those bounds are: make sure all branches are uniform.
proposals/subgroups.md
Outdated
| `fn subgroupBroadcastFirst(e : T) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Broadcasts `e` from the active invocation with the lowest subgroup_invocation_id in the subgroup to all other active invocations | |
| `fn subgroupBallot(pred : bool) -> vec4<u32>` | | Returns a set of bitfields where the bit corresponding to subgroup_invocation_id is 1 if `pred` is true for that active invocation and 0 otherwise. | |
| `fn subgroupShuffle(v : T, id : I) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types<br>`I` must be u32 or i32 | Returns `v` from the active invocation whose subgroup_invocation_id matches `id` | |
| `fn subgroupShuffleXor(v : T, mask : u32) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Returns `v` from the active invocation whose subgroup_invocation_id matches `subgroup_invocation_id ^ mask` | |
Does the mask need to be dynamically uniform? What's the use of this builtin?
Re-checking things, MSL does require it to be dynamically uniform. SPIR-V does not. I'll update this.
Yes, the mask should be dynamically uniform. This built-in can be used for permutations (where there is a bijection between sources and destinations) which can potentially be implemented more efficiently than the general gather operation which might need to distribute some sources to multiple destinations.
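One classic use of the XOR shuffle is the butterfly reduction; a hypothetical Python simulation (not WGSL) shows how shuffling with a dynamically uniform mask that doubles each round leaves every lane holding the subgroup-wide sum.

```python
def shuffle_xor(values, mask):
    # Simulated subgroupShuffleXor: lane i reads the value held by
    # lane i ^ mask, a bijection between sources and destinations.
    return [values[i ^ mask] for i in range(len(values))]

def butterfly_sum(values):
    # Butterfly reduction over a power-of-two-sized subgroup: after
    # log2(n) rounds, every lane holds the sum of all lanes.
    mask = 1
    while mask < len(values):
        values = [v + s for v, s in zip(values, shuffle_xor(values, mask))]
        mask *= 2
    return values
```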
| `fn subgroupBallot(pred : bool) -> vec4<u32>` | | Returns a set of bitfields where the bit corresponding to subgroup_invocation_id is 1 if `pred` is true for that active invocation and 0 otherwise. | |
| `fn subgroupShuffle(v : T, id : I) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types<br>`I` must be u32 or i32 | Returns `v` from the active invocation whose subgroup_invocation_id matches `id` | |
| `fn subgroupShuffleXor(v : T, mask : u32) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Returns `v` from the active invocation whose subgroup_invocation_id matches `subgroup_invocation_id ^ mask` | |
| `fn subgroupShuffleUp(v : T, delta : u32) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Returns `v` from the active invocation whose subgroup_invocation_id matches `subgroup_invocation_id - delta` | |
What happens when subgroup_invocation_id +- delta goes OOB?
An indeterminate value is returned. I was trying to avoid all the fiddly cases in the proposal that will have to be addressed in the final spec.
MSL says lower invocations are unmodified, but SPIR-V says undefined value.
Actually MSL has a variant of shuffles which fill-in a defined value in the OOB cases. But leaving it as UB is fine as well IMO.
Maybe I'm misunderstanding this, but given that at least one invocation would end up with an indeterminate value, wouldn't the developer always have to end up doing some comparison against `subgroup_invocation_id` to ensure the indeterminate value doesn't cause trouble?
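A Python sketch of such a guard (simulated semantics, illustrative names): the out-of-bounds lanes of a shuffle-up get an indeterminate value, and a comparison against subgroup_invocation_id substitutes a defined fill.

```python
INDETERMINATE = object()  # stand-in for an indeterminate value

def shuffle_up(values, delta):
    # Simulated subgroupShuffleUp: lane i reads lane i - delta; lanes
    # with i < delta read out of bounds.
    return [values[i - delta] if i >= delta else INDETERMINATE
            for i in range(len(values))]

def guarded_shuffle_up(values, delta, fill=0):
    # The guard described above: lanes whose subgroup_invocation_id is
    # below delta take a defined fill value instead.
    shuffled = shuffle_up(values, delta)
    return [fill if i < delta else shuffled[i]
            for i in range(len(values))]
```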
| `fn subgroupMin(e : T) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Reduction<br>Performs a min of `e` among all active invocations and returns that result | |
| `fn subgroupMax(e : T) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Reduction<br>Performs a max of `e` among all active invocations and returns that result | |
| `fn quadBroadcast(e : T, id : I)` | `T` must be u32, i32, f32, f16 or a vector of those types<br>`I` must be u32 or i32 | Broadcasts `e` from the quad invocation with id equal to `id`<br>`e` must be a constant-expression<sup>2</sup> | |
| `fn quadSwapX(e : T)` | `T` must be u32, i32, f32, f16 or a vector of those types | Swaps `e` between invocations in the quad in the X direction | |
Is there a definition of how invocations are grouped into quads (how it maps to subgroup_invocation_id)?
Vulkan explicitly requires that quads are consecutive subgroup invocations. See https://registry.khronos.org/vulkan/specs/1.3-extensions/html/vkspec.html#shaders-scope-quad. MSL provides a quad id. D3D12 says they are evenly divided in the subgroup (so I assume just like Vulkan).
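Under the consecutive-invocation rule, the mapping can be sketched in Python (illustrative names, assuming the usual 2x2 quad layout with invocations 0 and 1 on the top row):

```python
def quad_of(subgroup_invocation_id):
    # Quads are consecutive subgroup invocations: the quad index and the
    # position within the quad fall out of a divmod by 4.
    return divmod(subgroup_invocation_id, 4)

def quad_swap_x(quad_values):
    # quadSwapX on one quad [a, b, c, d]: swap horizontal neighbours,
    # i.e. positions 0<->1 and 2<->3.
    a, b, c, d = quad_values
    return [b, a, d, c]
```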
1. This is the first instance of dynamic uniformity. See the portability and uniformity section for more details.
2. Unlike `subgroupBroadcast`, SPIR-V does not have a shuffle operation to fall back on, so this requirement must be surfaced.

**TODO**: Are quad operations worth it?
I don't think we are adding subgroup operations to fragment shaders yet right? So are there helper invocations?
Ah, it seems we are, but I thought these were even newer functionality than compute subgroups; are we looking to add them anyway? It seems it could lose some reach.
According to gpuinfo.org, very few devices don't support subgroup operations in fragment shaders.
Because general portability seems to be off the table based on empirical testing, it seemed beneficial to allow them in fragment shaders.
Requiring subgroupSupportedStages
to include the fragment stage seems to drop support for the Adreno 600 series (see reports analysis).
This is probably fine given that D3D12 and Metal support subgroup ops in the fragment stage unconditionally.
proposals/subgroups.md
Outdated
Normally, there would be extra portability hazards in fragment shaders (e.g.
due to helper invocations).

**TODO**: Is it worth adding a diagnostic control to require subgroup
It would be really nice. Even if we are building on shaky foundations semantics-wise, I think it is useful to help developers that want to build on stronger formal foundations. It might be worth defining subgroup uniformity as well so we can require that broadcast's `I` is subgroup uniform, even if for now the only way to have that is by having `I` be workgroup uniform.
As I replied to Greg's comment we could also consider a diagnostic that warns if all the branches are uniform (as opposed to just the subgroup operation being in uniform control flow).
I think adding either or both diagnostics defaulting to off is reasonable
I've added a section on diagnostics that would default to errors as a starting point.
* Vulkan 1.3 or VK_EXT_subgroup_size_control

According to [this query](https://vulkan.gpuinfo.org/displaycoreproperty.php?core=1.1&name=subgroupSupportedOperations&platform=all),
~84% of devices are captured.
It looks like these percentages are fractions of unique reports.
It would be more meaningful to weight by shipped (and active) units. But this data is hard to get. (As a committee we should decide on a set of representative popular devices. But that goes way beyond this issue.)
I think generally the guidance works out ok though.
Approving this to land as a proposal in flight, for discussion.
Several open issues are marked with TODOs, and more can be raised during discussion.
WGSL 2023-11-28 Minutes
but these operations provide significant performance improvements in many
circumstances.

Suggest allowing these operations anywhere and provide guidance on how to
This does feel somewhat counter to all the efforts we've put in for portability. This is certainly an implementor's decision, but I would feel more comfortable making the extension unavailable for known drivers that perform the aggressive optimizations that do not preserve the correct active invocations. At least then there's incentive for these things to be fixed, instead of forever worked-around.
Added a todo to see how much more portable we can make individual functions.
| **subgroups-f16** | Allows f16 to be used in subgroups operations |

Note: Metal can always provide subgroups-f16, Vulkan requires
VK_KHR_shader_subgroup_extended_types
([~61%](https://vulkan.gpuinfo.org/listdevicescoverage.php?extension=VK_KHR_shader_subgroup_extended_types&platform=all)
of devices), and D3D12 requires SM6.2.
Do we need an additional `subgroups-f16` feature? The existing `f16` feature already requires SM6.2, and almost all devices that support the proposed `subgroups` functionality + the `f16` feature also support shaderSubgroupExtendedTypes (see reports analysis).
Some numbers for Vulkan (pulled from the analysis), 100% being core WebGPU functionality:
- 78% of devices support the proposed `subgroup` feature
- 46% of devices support `f16`
- 42% of devices support both `f16` and `subgroup`
That's interesting data. I'd prefer not to cleave the feature. Can you clarify what the f16 feature is? Does it represent the set of features required to enable f16 in WGSL? The other features seem to mirror Vulkan features.
I've added this as a todo for now (and linked to your report).
The code for the `f16` features is here. This is the whole commit: teoxoy/gpuinfo-vulkan-query@8681e00 (includes the src and the analysis).
Ok so it is all the features needed to implement WGSL f16. Thanks!
According to [gpuinfo.org](https://vulkan.gpuinfo.org/displaycoreproperty.php?core=1.1&name=subgroupSupportedOperations&platform=all),
this feature set captures ~84% of devices.
Splitting the feature does not grab a significant portion of devices.
According to this report analysis, splitting off the arithmetic ops looks the most promising, but considering the devices we get back are only Gen7 (Ivy Bridge, Haswell) Intel, it might not be worth it.
Approving as the document looks promising and makes reasonable choices to expose a wider set of operations with more stage compatibility without sacrificing device segments with meaningful support for subgroups. Thank you very much, and looking forward to further dialogue to realize subgroups spec in WGSL and WebGPU based on having the proposal in flight.
I've added a todo to see if we can provide portable versions of these builtin values.
I double-checked the builtins across shading languages and the proposal includes the ones that are available on our 3 main targets.
SPIR-V and MSL have quite a few more builtins that HLSL SM6 doesn't.
SPIR-V and HLSL SM6 have 3 additional builtins (`subgroupAllEqual`, `subgroupBallotBitCount`, `subgroupExclusiveBallotBitCount`) but MSL lacks those.
Great stuff!
I actually used subgroupBallotBitCount and it's handy to have, and trivially polyfillable:

```wgsl
fn ballotBitCount(ballot: vec4u) -> u32 {
  return dot(vec4u(1), countOneBits(ballot));
}
```
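The identity behind that polyfill is easy to sanity-check outside WGSL; a Python sketch (illustrative names) of what it computes — the sum of the per-word popcounts of the four 32-bit ballot words:

```python
def count_one_bits(word):
    # Per-component popcount, as WGSL's countOneBits does for each u32.
    return bin(word & 0xFFFFFFFF).count("1")

def ballot_bit_count(ballot):
    # dot(vec4u(1), countOneBits(ballot)) is just the sum of the four
    # per-word popcounts of the ballot.
    return sum(count_one_bits(w) for w in ballot)
```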
LGTM. This is a great foundation for experimentation and further discussion.
Good to land now.
* Two new features
  * `subgroups` for the base functionality
  * `subgroups-f16` to use f16 in the new operations
Co-authored-by: exrook <exrook.j@gmail.com>
Co-authored-by: Teodor Tanasoaia <28601907+teoxoy@users.noreply.github.com>
* Added a diagnostics section
  * Two new diagnostics to make the feature as portable as possible
  * Default to errors
* Added some extra requirements
* Added some todos:
  * Potentially dropping subgroups-f16
  * Potentially emulating other builtin values
  * Increasing portability of individual builtin calls
Co-authored-by: Teodor Tanasoaia <28601907+teoxoy@users.noreply.github.com>
Force-pushed from a6b9063 to 161220e (Compare)
WGSL 2023-12-05 Minutes
Sorry for the late feedback, and I'm not sure if this is the right place for it (if not, let me know where is best). Overall I'm thrilled to see this go forward. I do have remaining concerns about subgroup size. First, my testing has shown that WaveLaneCountMin is not a reliable minimum value; in particular, on some Intel devices it can be reported as 16 but a workgroup can be launched with 8. I know there's a TODO to validate this, and I'm happy to share details of my testing environment. Second, I'm unhappy that subgroup size bounds are available only as limits through the API and not accessible from WGSL, for reasons I've explained in #3950. There is a nontrivial burden on programmers to reap the API values, translate them into array sizes, and instantiate those as override constants (or textual substitution). More concerning, it precludes a good compiler from doing analysis of the program, inferring tighter bounds on the subgroup sizes, and instantiating an individual program with those tighter bounds. I'm curious what the reasoning is for not including this.
Some Intel GPUs in Vulkan report a SubgroupSize of 32 both on the host and GPU, but as @raphlinus mentioned have a range of 8-32, and will generally use 16. It's my understanding that Vulkan drivers simply report the same SubgroupSize they have on the host, regardless of the workgroup size, which often may not accurately reflect the number of active threads within a particular subgroup. subgroup_size can be computed correctly via ballot or reduction operations, if it is not a fixed value (i.e. min == max). In cases where subgroup_invocation_id is not available, it can be computed based on the relation

**Required Subgroup Size**

With subgroup_size accurately computed, I would propose not enforcing subgroup_size uniformity or setting it to a specific value, since vendors like Intel and Apple have chosen to make this dynamic (potentially based on workgroup size or register usage). It's reasonable to assume that there would be a performance cost, potentially catastrophic in some instances, which it doesn't make sense to pay unless it is needed. That is, something that the user opts in to, requesting this behavior. This avoids having to use required_subgroup_size, or requiring full subgroups, etc. Apple doesn't support required_subgroup_size, and may require speculatively recompiling or some other magic to make this work. While it's understandable to make this simpler to ease dev burden, it may not actually be necessary, and not doing this would be easier and more portable.

Requiring local_size (or just the x dim) be a multiple of the max subgroup size, while it seems like common practice, might be unnecessarily restrictive, as this isn't otherwise a requirement. When learning or debugging, it can be helpful to reduce workgroup size to 1 or a small value. Or larger subgroups could simply be neglected when prototyping, to avoid additional complexity, or maybe the algorithm doesn't require more threads. It isn't a necessary requirement in order to use subgroups correctly.
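The ballot-based fix-up described above can be simulated in Python (hypothetical names): each active lane contributes one bit at its subgroup_invocation_id, and the popcount of the ballot gives the true active count regardless of what the driver reports.

```python
def effective_subgroup_size(active_lane_ids):
    # Each active lane votes true in the ballot; the popcount is then the
    # number of active invocations, independent of the reported
    # SubgroupSize (e.g. 16 active lanes under a reported size of 32).
    ballot = 0
    for lane in active_lane_ids:
        ballot |= 1 << lane
    return bin(ballot).count("1")
```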
@raphlinus makes a good point in including subgroupMinSize and subgroupMaxSize within a shader. This allows specializing in cases where subgroup_size is fixed, while also handling cases where it varies, for instance by discarding extra subgroups, altering the algorithm, inserting barriers, etc. Potentially required_subgroup_size, uniform / varying subgroup size, or full subgroups could eventually be exposed in some fashion, if necessary. That is, if the user really needs a fixed subgroup_size, they could opt in to various means, whether when creating the pipeline, or as a property of the pipeline (which aligns better with Apple). Making this optional allows implementations to do so as it becomes possible, if possible, rather than blocking subgroups entirely.
Hi, can we add a note that Currently only
The Vulkan spec, in
I don't find an explicit requirement in the MSL/HLSL specs, but I think they should work in the same way for compute kernels.
* `subgroups` for the base functionality
* `subgroups-f16` to use f16 in the new operations

Rendered proposal