Add subgroups proposal #4368
Conversation
Previews, as seen when this build job started (161220e):
Is there a rationale for including
proposals/subgroups.md
Outdated
| `fn subgroupShuffleUp(v : T, delta : u32) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Returns `v` from the active invocation whose subgroup_invocation_id matches `subgroup_invocation_id - delta` | |
| `fn subgroupShuffleDown(v : T, delta : u32) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Returns `v` from the active invocation whose subgroup_invocation_id matches `subgroup_invocation_id + delta` | |
| `fn subgroupSum(e : T) -> T` | `T` must be u32, i32, f32, or a vector of those types | Reduction<br>Adds `e` among all active invocations and returns that result | |
| `fn subgroupInclusiveSum(e : T) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Inclusive scan<br>Returns the sum of `e` for all active invocations with subgroup_invocation_id less than or equal to this invocation | |
subgroupInclusiveSum naming bikeshedding
It's not obvious from the name that this behaves differently from `subgroupSum` in that it returns a unique value per invocation rather than a subgroup-uniform value. Not having prior knowledge I'd think a standard sum operation was already "inclusive".
I think `subgroupPrefixInclusiveSum` or `subgroupInclusivePrefixSum` would be much more obvious at the cost of verbosity. Alternatively, `subgroupPrefixSum` still hints that something more complex than an ordinary sum is going on to an unfamiliar reader, but for experienced users leaves room for confusion as to the in/exclusivity. It doesn't help that DX12 `WavePrefixSum()` is exclusive.
It is meant to be exclusive.
Bikeshedding thoughts.
Comparing prior art:
- "Prefix sum" is what I've known the general algorithm as, and likely the first entry for most folks. https://en.wikipedia.org/wiki/Prefix_sum
- HLSL and MSL use "prefix" in the names. GLSL does not.
- HLSL and MSL use "sum" and "product". GLSL uses "Add" and "Mul". The GLSL way generalizes to logical operations like And, Or, Max, Min (or those are simply ambiguous! Hmmm.)
Once I see many operations in context like in this Vulkan subgroups tutorial, lacking "prefix" doesn't seem so bad because it's implied by `Inclusive` or `Exclusive`.
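To make the distinction behind the naming question concrete, here is a plain-Python sketch (illustrative helper names, not WGSL) of the three semantics under discussion: a reduction yields one subgroup-uniform value, while inclusive and exclusive scans yield a distinct prefix per invocation.

```python
def subgroup_sum(values):
    # Reduction: every invocation receives the same subgroup-uniform total.
    total = sum(values)
    return [total] * len(values)

def subgroup_inclusive_sum(values):
    # Inclusive scan: invocation i receives the sum of values[0..i].
    out, acc = [], 0
    for v in values:
        acc += v
        out.append(acc)
    return out

def subgroup_exclusive_sum(values):
    # Exclusive scan: invocation i receives the sum of values[0..i-1];
    # invocation 0 receives the additive identity, 0.
    out, acc = [], 0
    for v in values:
        out.append(acc)
        acc += v
    return out
```

For values `[1, 2, 3, 4]` these return `[10, 10, 10, 10]`, `[1, 3, 6, 10]`, and `[0, 1, 3, 6]` respectively, which is exactly the difference the "prefix" / "inclusive" / "exclusive" spellings are trying to signal.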
Do we have any concrete examples of what a developer would have to do to write performant code that worked on all platforms? I get in the abstract what they'd have to do, but it would be nice to see some more concrete example JS/TS code of what a developer would have to do to run across devices.
I would also strongly recommend having the exclusive prefix scans or both, not the inclusive versions only. The reason is that it is easy to build the inclusive version from the exclusive one, but not the other way around. E.g. take multiplication: one can just multiply the current invocation's value and the exclusive product to get the inclusive product. However, undoing the inclusive product into an exclusive one would require a division, which is expensive and has ugly edge cases such as division by zero and the smallest negative integer overflow. Also, I did not see any barriers in the proposal, neither for the control flow nor the memory.
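The asymmetry argued for here can be sketched in plain Python (hypothetical helper names): deriving the inclusive product from the exclusive one costs a single multiply per invocation, while the reverse direction needs a division with exactly the edge cases mentioned.

```python
def exclusive_product(values):
    # Exclusive prefix product: invocation i gets the product of
    # values[0..i-1]; invocation 0 gets the multiplicative identity, 1.
    out, acc = [], 1
    for v in values:
        out.append(acc)
        acc *= v
    return out

def inclusive_from_exclusive(values):
    # Cheap direction: inclusive[i] = exclusive[i] * values[i].
    return [e * v for e, v in zip(exclusive_product(values), values)]

def exclusive_from_inclusive(inclusive, values):
    # Expensive direction: exclusive[i] = inclusive[i] / values[i], which
    # fails outright when an invocation's value is 0 and overflows for
    # INT_MIN in fixed-width integer types.
    return [i // v for i, v in zip(inclusive, values)]
```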
The inclusive scan was a typo. I realize HLSL is missing this intrinsic, but I flipped the name when I was writing.
The problem with barriers is that HLSL doesn't have a wave barrier, MSL says the barrier is uniform, but SPIR-V says the barrier is non-uniform. I certainly wouldn't want to have HLSL use a workgroup barrier for this.
We can consider those, but I think it requires more testing to ensure D3D12 behavior is always consistent. I don't have access to enough Windows devices to test this fully. My worry would be that HLSL has a similar problem to Vulkan 1.1 where the subgroup size builtin could just return a static number. There is a possibility that D3D12 backends will have to emulate subgroup_size.
I do think we'd want to formalize that guidance. Internally, we have such code, but I can't share it. Coincidental with this feature getting implemented it would make sense to also have a sample or two to demonstrate this. I could see an additional diagnostic control being added that shader authors could use to warn/error if they went outside those bounds. At a high level, those bounds are: make sure all branches are uniform.
proposals/subgroups.md
Outdated
| `fn subgroupBroadcastFirst(e : T) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Broadcasts `e` from the active invocation with the lowest subgroup_invocation_id in the subgroup to all other active invocations | |
| `fn subgroupBallot(pred : bool) -> vec4<u32>` | | Returns a set of bitfields where the bit corresponding to subgroup_invocation_id is 1 if `pred` is true for that active invocation and 0 otherwise. | |
| `fn subgroupShuffle(v : T, id : I) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types<br>`I` must be u32 or i32 | Returns `v` from the active invocation whose subgroup_invocation_id matches `id` | |
| `fn subgroupShuffleXor(v : T, mask : u32) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Returns `v` from the active invocation whose subgroup_invocation_id matches `subgroup_invocation_id ^ mask` | |
Does the mask need to be dynamically uniform? What's the use of this builtin?
Re-checking things, MSL does require it to be dynamically uniform. SPIR-V does not. I'll update this.
Yes, the mask should be dynamically uniform. This built-in can be used for permutations (where there is a bijection between sources and destinations) which can potentially be implemented more efficiently than the general gather operation which might need to distribute some sources to multiple destinations.
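One classic use of the XOR shuffle is the butterfly reduction; a hypothetical Python simulation (not WGSL) shows how shuffling with a dynamically uniform mask that doubles each round leaves every lane holding the subgroup-wide sum.

```python
def shuffle_xor(values, mask):
    # Simulated subgroupShuffleXor: lane i reads the value held by
    # lane i ^ mask, a bijection between sources and destinations.
    return [values[i ^ mask] for i in range(len(values))]

def butterfly_sum(values):
    # Butterfly reduction over a power-of-two-sized subgroup: after
    # log2(n) rounds, every lane holds the sum of all lanes.
    mask = 1
    while mask < len(values):
        values = [v + s for v, s in zip(values, shuffle_xor(values, mask))]
        mask *= 2
    return values
```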
| `fn subgroupBallot(pred : bool) -> vec4<u32>` | | Returns a set of bitfields where the bit corresponding to subgroup_invocation_id is 1 if `pred` is true for that active invocation and 0 otherwise. | |
| `fn subgroupShuffle(v : T, id : I) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types<br>`I` must be u32 or i32 | Returns `v` from the active invocation whose subgroup_invocation_id matches `id` | |
| `fn subgroupShuffleXor(v : T, mask : u32) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Returns `v` from the active invocation whose subgroup_invocation_id matches `subgroup_invocation_id ^ mask` | |
| `fn subgroupShuffleUp(v : T, delta : u32) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Returns `v` from the active invocation whose subgroup_invocation_id matches `subgroup_invocation_id - delta` | |
What happens when subgroup_invocation_id +- delta goes OOB?
An indeterminate value is returned. I was trying to avoid all the fiddly cases in the proposal that will have to be addressed in the final spec.
MSL says lower invocations are unmodified, but SPIR-V says undefined value.
Actually MSL has a variant of shuffles which fill-in a defined value in the OOB cases. But leaving it as UB is fine as well IMO.
Maybe I'm misunderstanding this, but given that at least one invocation would end up with an indeterminate value, wouldn't the developer always have to end up doing some comparison against `subgroup_invocation_id` to ensure the indeterminate value doesn't cause trouble?
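A Python sketch of such a guard (simulated semantics, illustrative names): the out-of-bounds lanes of a shuffle-up get an indeterminate value, and a comparison against subgroup_invocation_id substitutes a defined fill.

```python
INDETERMINATE = object()  # stand-in for an indeterminate value

def shuffle_up(values, delta):
    # Simulated subgroupShuffleUp: lane i reads lane i - delta; lanes
    # with i < delta read out of bounds.
    return [values[i - delta] if i >= delta else INDETERMINATE
            for i in range(len(values))]

def guarded_shuffle_up(values, delta, fill=0):
    # The guard described above: lanes whose subgroup_invocation_id is
    # below delta take a defined fill value instead.
    shuffled = shuffle_up(values, delta)
    return [fill if i < delta else shuffled[i]
            for i in range(len(values))]
```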
| `fn subgroupMin(e : T) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Reduction<br>Performs a min of `e` among all active invocations and returns that result | |
| `fn subgroupMax(e : T) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Reduction<br>Performs a max of `e` among all active invocations and returns that result | |
| `fn quadBroadcast(e : T, id : I)` | `T` must be u32, i32, f32, f16 or a vector of those types<br>`I` must be u32 or i32 | Broadcasts `e` from the quad invocation with id equal to `id`<br>`e` must be a constant-expression<sup>2</sup> | |
| `fn quadSwapX(e : T)` | `T` must be u32, i32, f32, f16 or a vector of those types | Swaps `e` between invocations in the quad in the X direction | |
Is there a definition of how invocations are grouped into quads (how it maps to subgroup_invocation_id)?
Vulkan explicitly requires that quads are consecutive subgroup invocations. See https://registry.khronos.org/vulkan/specs/1.3-extensions/html/vkspec.html#shaders-scope-quad. MSL provides a quad id. D3D12 says they are evenly divided in the subgroup (so I assume just like Vulkan).
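Under the consecutive-invocation rule, the mapping can be sketched in Python (illustrative names, assuming the usual 2x2 quad layout with invocations 0 and 1 on the top row):

```python
def quad_of(subgroup_invocation_id):
    # Quads are consecutive subgroup invocations: the quad index and the
    # position within the quad fall out of a divmod by 4.
    return divmod(subgroup_invocation_id, 4)

def quad_swap_x(quad_values):
    # quadSwapX on one quad [a, b, c, d]: swap horizontal neighbours,
    # i.e. positions 0<->1 and 2<->3.
    a, b, c, d = quad_values
    return [b, a, d, c]
```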
1. This is the first instance of dynamic uniformity. See the portability and uniformity section for more details.
2. Unlike `subgroupBroadcast`, SPIR-V does not have a shuffle operation to fall back on, so this requirement must be surfaced.

**TODO**: Are quad operations worth it?
I don't think we are adding subgroup operations to fragment shaders yet right? So are there helper invocations?
Ah, it seems we are, but I thought these were even newer functionality than compute subgroups; are we looking to add them anyway? It seems it could lose some reach.
According to gpuinfo.org, very few devices don't support subgroup operations in fragment shaders.
Because general portability seems to be off the table based on empirical testing, it seemed beneficial to allow them in fragment shaders.
Requiring subgroupSupportedStages
to include the fragment stage seems to drop support for the Adreno 600 series (see reports analysis).
This is probably fine given that D3D12 and Metal support subgroup ops in the fragment stage unconditionally.
proposals/subgroups.md
Outdated
Normally, there would be extra portability hazards in fragment shaders (e.g.
due to helper invocations).

**TODO**: Is it worth adding a diagnostic control to require subgroup
It would be really nice. Even if we are building on shaky foundations semantics-wise, I think it is useful to help developers that want to build on stronger formal foundations. It might be worth defining subgroup uniformity as well so we can require that broadcast's `I` is subgroup uniform, even if for now the only way to have that is by having `I` be workgroup uniform.
As I replied to Greg's comment we could also consider a diagnostic that warns if all the branches are uniform (as opposed to just the subgroup operation being in uniform control flow).
I think adding either or both diagnostics defaulting to off is reasonable
I've added a section on diagnostics that would default to errors as a starting point.
* Vulkan 1.3 or VK_EXT_subgroup_size_control

According to [this query](https://vulkan.gpuinfo.org/displaycoreproperty.php?core=1.1&name=subgroupSupportedOperations&platform=all),
~84% of devices are captured.
It looks like these percentages are fractions of unique reports.
It would be more meaningful to weight by shipped (and active) units. But this data is hard to get. (As a committee we should decide on a set of representative popular devices. But that goes way beyond this issue.)
I think generally the guidance works out ok though.
Approving this to land as a proposal in flight, for discussion.
Several open issues are marked with TODOs, and more can be raised during discussion.
WGSL 2023-11-28 Minutes
but these operations provide significant performance improvements in many
circumstances.

Suggest allowing these operations anywhere and provide guidance on how to
This does feel somewhat counter to all the efforts we've put in for portability. This is certainly an implementor's decision, but I would feel more comfortable making the extension unavailable for known drivers that perform the aggressive optimizations that do not preserve the correct active invocations. At least then there's incentive for these things to be fixed, instead of forever worked-around.
Added a todo to see how much more portable we can make individual functions.
| **subgroups-f16** | Allows f16 to be used in subgroups operations |

Note: Metal can always provide subgroups-f16, Vulkan requires
VK_KHR_shader_subgroup_extended_types
([~61%](https://vulkan.gpuinfo.org/listdevicescoverage.php?extension=VK_KHR_shader_subgroup_extended_types&platform=all)
of devices), and D3D12 requires SM6.2.
Do we need an additional `subgroups-f16` feature? The existing `f16` feature already requires SM6.2, and almost all devices that support the proposed `subgroups` functionality + the `f16` feature also support shaderSubgroupExtendedTypes (see reports analysis).
Some numbers for Vulkan (pulled from the analysis), 100% being core WebGPU functionality:
- 78% of devices support the proposed `subgroup` feature
- 46% of devices support `f16`
- 42% of devices support both `f16` and `subgroup`
That's interesting data. I'd prefer not to cleave the feature. Can you clarify what the f16 feature is? Does it represent the set of features required to enable f16 in WGSL? The other features seem to mirror Vulkan features.
I've added this as a todo for now (and linked to your report).
The code for the `f16` features is here. This is the whole commit: teoxoy/gpuinfo-vulkan-query@8681e00 (includes the src and the analysis).
Ok so it is all the features needed to implement WGSL f16. Thanks!
According to [gpuinfo.org](https://vulkan.gpuinfo.org/displaycoreproperty.php?core=1.1&name=subgroupSupportedOperations&platform=all),
this feature set captures ~84% of devices.
Splitting the feature does not grab a significant portion of devices.
According to this report analysis, splitting off the arithmetic ops looks the most promising, but considering the devices we get back are only Gen7 (Ivy Bridge, Haswell) Intel, it might not be worth it.
Approving as the document looks promising and makes reasonable choices to expose a wider set of operations with more stage compatibility without sacrificing device segments with meaningful support for subgroups. Thank you very much, and looking forward to further dialogue to realize subgroups spec in WGSL and WebGPU based on having the proposal in flight.
I've added a todo to see if we can provide portable versions of these builtin values.
I double-checked the builtins across shading languages and the proposal includes the ones that are available on our 3 main targets.
SPIR-V and MSL have quite a few more builtins that HLSL SM6 doesn't.
SPIR-V and HLSL SM6 have 3 additional builtins (`subgroupAllEqual`, `subgroupBallotBitCount`, `subgroupExclusiveBallotBitCount`) but MSL lacks those.
Great stuff!
I actually used subgroupBallotBitCount and it's handy to have, and trivially polyfillable:

```wgsl
fn ballotBitCount(ballot: vec4u) -> u32 {
  return dot(vec4u(1), countOneBits(ballot));
}
```
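The identity behind that polyfill is easy to sanity-check outside WGSL; a Python sketch (illustrative names) of what it computes — the sum of the per-word popcounts of the four 32-bit ballot words:

```python
def count_one_bits(word):
    # Per-component popcount, as WGSL's countOneBits does for each u32.
    return bin(word & 0xFFFFFFFF).count("1")

def ballot_bit_count(ballot):
    # dot(vec4u(1), countOneBits(ballot)) is just the sum of the four
    # per-word popcounts of the ballot.
    return sum(count_one_bits(w) for w in ballot)
```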
LGTM. This is a great foundation for experimentation and further discussion.
Good to land now.
* Two new features
  * `subgroups` for the base functionality
  * `subgroups-f16` to use f16 in the new operations
Co-authored-by: exrook <exrook.j@gmail.com>
Co-authored-by: Teodor Tanasoaia <28601907+teoxoy@users.noreply.github.com>
* Added a diagnostics section
  * Two new diagnostics to make the feature as portable as possible
  * Default to errors
* Added some extra requirements
* Added some todos:
  * Potentially dropping subgroups-f16
  * Potentially emulating other builtin values
  * Increasing portability of individual builtin calls
Co-authored-by: Teodor Tanasoaia <28601907+teoxoy@users.noreply.github.com>
Force-pushed from a6b9063 to 161220e (Compare)
WGSL 2023-12-05 Minutes
Sorry for the late feedback, and I'm not sure if this is the right place for it (if not, let me know where is best). Overall I'm thrilled to see this go forward. I do have remaining concerns about subgroup size. First, my testing has shown that WaveLaneCountMin is not a reliable minimum value; in particular, on some Intel devices it can be reported as 16 but a workgroup can be launched with 8. I know there's a TODO to validate this, and I'm happy to share details of my testing environment. Second, I'm unhappy that subgroup size bounds are available only as limits through the API and not accessible from WGSL, for reasons I've explained in #3950. There is a nontrivial burden on programmers to reap the API values, translate them into array sizes, and instantiate those as override constants (or textual substitution). More concerning, it precludes a good compiler from doing analysis of the program, inferring tighter bounds on the subgroup sizes, and instantiating an individual program with those tighter bounds. I'm curious what the reasoning is for not including this.
Some Intel GPUs in Vulkan report a SubgroupSize of 32 both on the host and GPU, but as @raphlinus mentioned have a range of 8-32, and will generally use 16. It's my understanding that Vulkan drivers simply report the same SubgroupSize they have on the host, regardless of the workgroup size, which often may not accurately reflect the number of active threads within a particular subgroup. subgroup_size can be computed correctly via ballot or reduction operations, if it is not a fixed value (i.e. min == max). In cases where subgroup_invocation_id is not available, it can be computed based on the relation

**Required Subgroup Size**

With subgroup_size accurately computed, I would propose not enforcing subgroup_size uniformity or setting it to a specific value, since vendors like Intel and Apple have chosen to make this dynamic (potentially based on workgroup size or register usage). It's reasonable to assume that there would be a performance cost, potentially catastrophic in some instances, which it doesn't make sense to pay unless it is needed. That is, something that the user opts in to, requesting this behavior. This avoids having to use required_subgroup_size, or requiring full subgroups, etc. Apple doesn't support required_subgroup_size, and may require speculatively recompiling or some other magic to make this work. While it's understandable to make this simpler to ease dev burden, it may not actually be necessary, and not doing this would be easier and more portable.

Requiring local_size (or just the x dim) be a multiple of the max subgroup size, while it seems like common practice, might be unnecessarily restrictive, as this isn't otherwise a requirement. When learning or debugging, it can be helpful to reduce workgroup size to 1 or a small value. Or larger subgroups could simply be neglected when prototyping, to avoid additional complexity, or maybe the algorithm doesn't require more threads. It isn't a necessary requirement in order to use subgroups correctly.
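The ballot-based fix-up described above can be simulated in Python (hypothetical names): each active lane contributes one bit at its subgroup_invocation_id, and the popcount of the ballot gives the true active count regardless of what the driver reports.

```python
def effective_subgroup_size(active_lane_ids):
    # Each active lane votes true in the ballot; the popcount is then the
    # number of active invocations, independent of the reported
    # SubgroupSize (e.g. 16 active lanes under a reported size of 32).
    ballot = 0
    for lane in active_lane_ids:
        ballot |= 1 << lane
    return bin(ballot).count("1")
```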
@raphlinus makes a good point in including subgroupMinSize and subgroupMaxSize within a shader. This allows specializing in cases where subgroup_size is fixed, while also handling cases where it varies, for instance by discarding extra subgroups, altering the algorithm, inserting barriers, etc. Potentially required_subgroup_size, uniform / varying subgroup size, or full subgroups could eventually be exposed in some fashion, if necessary. That is, if the user really needs a fixed subgroup_size, they could opt in to various means, whether when creating the pipeline, or as a property of the pipeline (which aligns better with Apple). Making this optional allows implementations to do so as it becomes possible, if possible, rather than blocking subgroups entirely.
Hi, can we add a note that Currently only
The Vulkan spec, in
I don't find an explicit requirement in the MSL/HLSL specs, but I think they should work in the same way for compute kernels.
* `subgroups` for the base functionality
* `subgroups-f16` to use f16 in the new operations

Rendered proposal