
Add subgroups proposal #4368

Merged
merged 11 commits into from Dec 6, 2023

Conversation

@alan-baker (Contributor) commented Nov 7, 2023

  • Two new features
    • subgroups for the base functionality
    • subgroups-f16 to use f16 in the new operations

Rendered proposal

github-actions bot commented Nov 7, 2023

Previews, as seen when this build job started (161220e):
WebGPU webgpu.idl | Explainer | Correspondence Reference
WGSL grammar.js | wgsl.lalr.txt

@exrook (Contributor) commented Nov 7, 2023

Is there a rationale for including subgroupInclusiveSum rather than (or without) a subgroupExclusiveSum version?

| `fn subgroupShuffleUp(v : T, delta : u32) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Returns `v` from the active invocation whose subgroup_invocation_id matches `subgroup_invocation_id - delta` |
| `fn subgroupShuffleDown(v : T, delta : u32) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Returns `v` from the active invocation whose subgroup_invocation_id matches `subgroup_invocation_id + delta` |
| `fn subgroupSum(e : T) -> T` | `T` must be u32, i32, f32, or a vector of those types | Reduction<br>Adds `e` among all active invocations and returns that result |
| `fn subgroupInclusiveSum(e : T) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Inclusive scan<br>Returns the sum of `e` for all active invocations with subgroup_invocation_id less than or equal to this invocation |
Contributor

subgroupInclusiveSum naming bikeshedding

It's not obvious from the name this behaves differently from subgroupSum in that it returns a unique value per invocation rather than a subgroup-uniform value. Not having prior knowledge I'd think a standard sum operation was already "inclusive".

I think subgroupPrefixInclusiveSum or subgroupInclusivePrefixSum would be much more obvious at the cost of verbosity. Alternatively, subgroupPrefixSum still hints that something more complex than an ordinary sum is going on to an unfamiliar reader, but for experienced users leaves room for confusion as to the in/exclusivity. It doesn't help that DX12 WavePrefixSum() is exclusive

Contributor Author

It is meant to be exclusive.

Contributor

Bikeshedding thoughts.

Comparing prior art:

  • "Prefix sum" is what I've known the general algorithm as, and likely the first entry for most folks. https://en.wikipedia.org/wiki/Prefix_sum
  • HLSL and MSL use "prefix" in the names. GLSL does not.
  • HLSL and MSL use "sum" and "product". GLSL uses "Add" and "Mul". The GLSL way generalizes to logical operations like And, Or, Max, Min (or those are simply ambiguous! Hmmm.)

Once I see many operations in context like in this Vulkan subgroups tutorial, lacking "prefix" doesn't seem so bad because it's implied by Inclusive or Exclusive.

@exrook mentioned this pull request Nov 8, 2023
@greggman (Contributor) commented Nov 8, 2023

Do we have any concrete examples of what a developer would have to do to write performant code that works on all platforms? I understand in the abstract what they'd have to do, but it would be nice to see more concrete JS/TS example code for running across devices.

@Lichtso commented Nov 8, 2023

I would also strongly recommend having the exclusive prefix scans, or both, not the inclusive versions only. The reason is that it is easy to build the inclusive version from the exclusive one, but not the other way around. E.g. take multiplication: one can just multiply the current invocation's value by the exclusive product to get the inclusive product. However, undoing the inclusive product into an exclusive one would require a division, which is expensive and has ugly edge cases such as division by zero and the smallest-negative-integer overflow.
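This point can be illustrated with a small Python simulation (modeling one subgroup as a list of per-invocation values; the function names are illustrative, not proposed WGSL). The inclusive product is simply the exclusive product combined with each lane's own value, while the zero at lane 1 shows why inverting an inclusive product with a division would fail:

```python
from operator import mul

def exclusive_scan(values, op, identity):
    """Exclusive prefix scan: result[i] combines values[0..i-1]."""
    out, acc = [], identity
    for v in values:
        out.append(acc)
        acc = op(acc, v)
    return out

def inclusive_from_exclusive(values, op, identity):
    """Derive the inclusive scan cheaply: combine each lane's own value
    with its exclusive-scan result. No division (inverse op) needed."""
    excl = exclusive_scan(values, op, identity)
    return [op(e, v) for e, v in zip(excl, values)]

lanes = [3, 0, 5, 2]  # one value per subgroup invocation
print(exclusive_scan(lanes, mul, 1))            # [1, 3, 0, 0]
print(inclusive_from_exclusive(lanes, mul, 1))  # [3, 0, 0, 0]
```

Going the other way, recovering `[1, 3, 0, 0]` from `[3, 0, 0, 0]` by division, is impossible once a zero appears.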

In addition to subgroup_size and subgroup_invocation_id, our prototype also has num_subgroups and subgroup_id, for the compute stage only. These are missing in HLSL but can be emulated.

Also, I did not see any barriers in the proposal, neither for control flow nor for memory.

@alan-baker (Contributor Author) commented Nov 8, 2023

> I would also strongly recommend having the exclusive prefix scans, or both, not the inclusive versions only. The reason is that it is easy to build the inclusive version from the exclusive one, but not the other way around. E.g. take multiplication: one can just multiply the current invocation's value by the exclusive product to get the inclusive product. However, undoing the inclusive product into an exclusive one would require a division, which is expensive and has ugly edge cases such as division by zero and the smallest-negative-integer overflow.

The inclusive scan was a typo. I realize HLSL is missing this intrinsic, but I flipped the name when I was writing.

@alan-baker (Contributor Author)

> Also, I did not see any barriers in the proposal, neither for control flow nor for memory.

The problem with barriers is that HLSL doesn't have a wave barrier, MSL says the barrier is uniform, but SPIR-V says the barrier is non-uniform. I certainly wouldn't want to have HLSL use a workgroup barrier for this.

@alan-baker (Contributor Author)

> In addition to subgroup_size and subgroup_invocation_id, gfx-rs/wgpu#4190 also has num_subgroups and subgroup_id, for the compute stage only. These are missing in HLSL but can be emulated.

We can consider those, but I think it requires more testing to ensure D3D12 behavior is always consistent. I don't have access to enough Windows devices to test this fully. My worry would be that HLSL has a similar problem to Vulkan 1.1 where the subgroup size builtin could just return a static number. There is a possibility that D3D12 backends will have to emulate subgroup_size.

@alan-baker (Contributor Author)

> Do we have any concrete examples of what a developer would have to do to write performant code that works on all platforms? I understand in the abstract what they'd have to do, but it would be nice to see more concrete JS/TS example code for running across devices.

I do think we'd want to formalize that guidance. Internally, we have such code, but I can't share it. Coincident with this feature getting implemented, it would make sense to also have a sample or two to demonstrate this. I could see an additional diagnostic control being added that shader authors could use to warn/error if they went outside those bounds. At a high level, those bounds are: make sure all branches are uniform.

| `fn subgroupBroadcastFirst(e : T) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Broadcasts `e` from the active invocation with the lowest subgroup_invocation_id in the subgroup to all other active invocations |
| `fn subgroupBallot(pred : bool) -> vec4<u32>` | | Returns a set of bitfields where the bit corresponding to subgroup_invocation_id is 1 if `pred` is true for that active invocation and 0 otherwise. |
| `fn subgroupShuffle(v : T, id : I) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types<br>`I` must be u32 or i32 | Returns `v` from the active invocation whose subgroup_invocation_id matches `id` |
| `fn subgroupShuffleXor(v : T, mask : u32) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Returns `v` from the active invocation whose subgroup_invocation_id matches `subgroup_invocation_id ^ mask` |
Contributor

Does the mask need to be dynamically uniform? What's the use of this builtin?

Contributor Author

Re-checking things, MSL does require it to be dynamically uniform. SPIR-V does not. I'll update this.


Yes, the mask should be dynamically uniform. This built-in can be used for permutations (where there is a bijection between sources and destinations), which can potentially be implemented more efficiently than the general gather operation, which might need to distribute some sources to multiple destinations.
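For intuition on why a bijective xor-shuffle is useful: one classic use is a butterfly reduction, where log2(n) rounds of xor-shuffles leave every lane holding the full sum. A minimal Python simulation (modeling one subgroup as a list of lane values with a power-of-two lane count; function names are illustrative, not part of the proposal):

```python
def subgroup_shuffle_xor(values, mask):
    """Each lane i reads the value from lane i ^ mask: a bijection, so
    every source feeds exactly one destination."""
    return [values[i ^ mask] for i in range(len(values))]

def butterfly_sum(values):
    """log2(n) rounds of xor-shuffles leave every lane holding the total."""
    mask = len(values) // 2
    while mask:
        shuffled = subgroup_shuffle_xor(values, mask)
        values = [a + b for a, b in zip(values, shuffled)]
        mask //= 2
    return values

print(butterfly_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # every lane ends up with 36
```

Note that each round's mask is uniform across lanes, which is exactly the dynamic-uniformity requirement discussed above.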

| `fn subgroupBallot(pred : bool) -> vec4<u32>` | | Returns a set of bitfields where the bit corresponding to subgroup_invocation_id is 1 if `pred` is true for that active invocation and 0 otherwise. |
| `fn subgroupShuffle(v : T, id : I) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types<br>`I` must be u32 or i32 | Returns `v` from the active invocation whose subgroup_invocation_id matches `id` |
| `fn subgroupShuffleXor(v : T, mask : u32) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Returns `v` from the active invocation whose subgroup_invocation_id matches `subgroup_invocation_id ^ mask` |
| `fn subgroupShuffleUp(v : T, delta : u32) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Returns `v` from the active invocation whose subgroup_invocation_id matches `subgroup_invocation_id - delta` |
Contributor

What happens when subgroup_invocation_id +- delta goes OOB?

Contributor Author

An indeterminate value is returned. I was trying to avoid all the fiddly cases in the proposal that will have to be addressed in the final spec.

MSL says lower invocations are unmodified, but SPIR-V says undefined value.


Actually, MSL has a variant of the shuffles which fills in a defined value in the OOB cases. But leaving it as UB is fine as well IMO.

Collaborator

Maybe I'm misunderstanding this, but given that at least one invocation would end up with an indeterminate value, wouldn't the developer always have to end up doing some comparison against subgroup_invocation_id to ensure the indeterminate value doesn't cause trouble?
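Yes, in practice such code guards the shuffle. A Python sketch of one guarded step (illustrative names, not proposed WGSL; in a real shader the guard compares subgroup_invocation_id against delta, since the shuffle result itself can't be inspected; None models the indeterminate value here):

```python
def subgroup_shuffle_up(values, delta):
    """Lane i reads from lane i - delta. Lanes with i < delta have no
    defined source; None stands in for the indeterminate value."""
    return [values[i - delta] if i >= delta else None
            for i in range(len(values))]

def guarded_sum_step(values, delta):
    """The usual guard: fall back to the lane's own value whenever the
    shuffle would have read out of bounds (one Hillis-Steele scan step)."""
    up = subgroup_shuffle_up(values, delta)
    return [v if u is None else u + v for u, v in zip(up, values)]

print(guarded_sum_step([1, 1, 1, 1], 1))  # [1, 2, 2, 2]
```

So the indeterminate values are harmless as long as the guard discards them before use.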

| `fn subgroupMin(e : T) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Reduction<br>Performs a min of `e` among all active invocations and returns that result |
| `fn subgroupMax(e : T) -> T` | `T` must be u32, i32, f32, f16 or a vector of those types | Reduction<br>Performs a max of `e` among all active invocations and returns that result |
| `fn quadBroadcast(e : T, id : I)` | `T` must be u32, i32, f32, f16 or a vector of those types<br>`I` must be u32 or i32 | Broadcasts `e` from the quad invocation with id equal to `id`<br>`e` must be a constant-expression<sup>2</sup> |
| `fn quadSwapX(e : T)` | `T` must be u32, i32, f32, f16 or a vector of those types | Swaps `e` between invocations in the quad in the X direction |
Contributor

Is there a definition of how invocations are grouped into quads (how it maps to subgroup_invocation_id)?

Contributor Author

Vulkan explicitly requires that quads are consecutive subgroup invocations; see https://registry.khronos.org/vulkan/specs/1.3-extensions/html/vkspec.html#shaders-scope-quad. MSL provides a quad id. D3D12 says they are evenly divided in the subgroup (so I assume just like Vulkan).
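Under that Vulkan rule, the mapping can be sketched in Python (illustrative helper, not proposed WGSL; the 2x2 row-major layout follows the Vulkan quad-scope definition, where quad indices 0..3 correspond to (0,0), (1,0), (0,1), (1,1)):

```python
def quad_of(subgroup_invocation_id):
    """Map a subgroup invocation to (quad_id, x, y), assuming a quad is
    4 consecutive invocations indexed 2x2 row-major."""
    quad_id, index_in_quad = divmod(subgroup_invocation_id, 4)
    x = index_in_quad & 1   # column within the quad
    y = index_in_quad >> 1  # row within the quad
    return quad_id, x, y

print(quad_of(6))  # (1, 0, 1): second quad, left column, bottom row
```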

1. This is the first instance of dynamic uniformity. See the portability and uniformity section for more details.
2. Unlike `subgroupBroadcast`, SPIR-V does not have a shuffle operation to fall back on, so this requirement must be surfaced.

**TODO**: Are quad operations worth it?
Contributor

I don't think we are adding subgroup operations to fragment shaders yet right? So are there helper invocations?

Contributor

Ah, it seems we are. But I thought these were even newer functionality than compute subgroups; are we looking to add them anyway? It seems it could lose some reach.

Contributor Author

According to gpuinfo.org, very few devices lack subgroup support in fragment shaders.

Because general portability seems to be off the table based on empirical testing, it seemed beneficial to allow them in fragment shaders.

Member

Requiring subgroupSupportedStages to include the fragment stage seems to drop support for the Adreno 600 series (see reports analysis).

This is probably fine given that D3D12 and Metal support subgroup ops in the fragment stage unconditionally.

Normally, there would be extra portability hazards in fragment shaders (e.g.
due to helper invocations).

**TODO**: Is it worth adding a diagnostic control to require subgroup
Contributor

It would be really nice. Even if we are building on shaky foundations semantics-wise, I think it is useful to help developers that want to build on stronger formal foundations. It might be worth defining subgroup uniformity as well, so we can require that broadcast's I is subgroup uniform, even if for now the only way to have that is by having I be workgroup uniform.

Contributor Author

As I replied to Greg's comment, we could also consider a diagnostic that warns unless all the branches are uniform (as opposed to just the subgroup operation being in uniform control flow).

I think adding either or both diagnostics, defaulting to off, is reasonable.

Contributor Author

I've added a section on diagnostics that would default to errors as a starting point.

* Vulkan 1.3 or VK_EXT_subgroup_size_control

According to [this query](https://vulkan.gpuinfo.org/displaycoreproperty.php?core=1.1&name=subgroupSupportedOperations&platform=all),
~84% of devices are captured.
Contributor

It looks like these percentages are fractions of unique reports.

It would be more meaningful to weight by shipped (and active) units. But this data is hard to get. (As a committee we should decide on a set of representative popular devices. But that goes way beyond this issue.)

I think generally the guidance works out ok though.

@dneto0 (Contributor) left a comment

Approving this to land as a proposal in flight, for discussion.
Several open issues are marked with TODOs, and more can be raised during discussion.

@kdashg kdashg added this to the Milestone 1 milestone Nov 28, 2023
@kdashg (Contributor) commented Nov 28, 2023

WGSL 2023-11-28 Minutes
  • AB: We wrote an experimental extension with very limited functionality to determine portability. (Only need size, id, and ballot). The short of it: There’s not much portability across platforms and devices.
  • AB: There’s a draft PR in CTS to run the experiments.
  • AB: We have internal use cases that get a lot of perf out of this, and really want this. We want to do an origin trial in order to prove out the performance benefits.
  • AB: We’re looking for the group to agree on this direction, then hammer out the details. That will let us experiment.
  • JB: People get attached to experimental code.
  • DN: The request is to land the proposal doc in the repo, not to change the main spec.
  • JB: Expect to need this in the future. Not prepared to offer opinions on naming, or tech details. Think the threshold for landing proposal docs is low.
  • KG: Difference between landing as proposal vs. having this in your own repo.
  • AB: Signals agreement on direction. Fine points will need discussion.
  • JB: Inclined toward this, is taking a very prevalent arch feature and fronting it to applications, because it has massive performance implications. Don’t know of competing designs we’d consider as alternatives. WebGPU will have to do something here eventually.
  • KG: Positive on direction. Details TBD. Do worry about accidentally ensconcing details of the implementation because customers write code and end up relying on it. Do want to have a stronger way to deprecate, e.g. by a prefix on each builtin name, or a long extension name.
  • AB: More prefixing seems fine. We’ll discuss internally.
  • JB: Positive signal. Let’s discuss next week.
  • KG: If you want an official signal, we have a standards-position process: https://github.com/mozilla/standards-positions/
  • AB: Call to action. Please look at it. Tried to lay out where portability falls down. Sketched some potential guardrails.
  • JB: Want to discuss some details in the office hours.

but these operations provide significant performance improvements in many
circumstances.

Suggest allowing these operations anywhere and provide guidance on how to
Collaborator

This does feel somewhat counter to all the efforts we've put in for portability. This is certainly an implementor's decision, but I would feel more comfortable making the extension unavailable for known drivers that perform the aggressive optimizations that do not preserve the correct active invocations. At least then there's an incentive for these things to be fixed, instead of forever worked around.

Contributor Author

Added a todo to see how much more portable we can make individual functions.

Comment on lines +43 to +50
| **subgroups-f16** | Allows f16 to be used in subgroups operations |

Note: Metal can always provide subgroups-f16, Vulkan requires
VK_KHR_shader_subgroup_extended_types
([~61%](https://vulkan.gpuinfo.org/listdevicescoverage.php?extension=VK_KHR_shader_subgroup_extended_types&platform=all)
of devices), and D3D12 requires SM6.2.
Member

Do we need an additional subgroups-f16 feature? The existing f16 feature already requires SM6.2 and almost all devices that support the proposed subgroups functionality + the f16 feature also support shaderSubgroupExtendedTypes (see reports analysis).

Member

Some numbers for Vulkan (pulled from the analysis):

  • 78% devices support the proposed subgroup feature
  • 46% devices support f16
  • 42% devices support both f16 and subgroup

100% being core WebGPU functionality.

Contributor Author

That's interesting data. I'd prefer not to cleave the feature. Can you clarify what the f16 feature is? Does it represent the set of features required to enable f16 in WGSL? The other features seem to mirror Vulkan features.

Contributor Author

I've added this as a todo for now (and linked to your report).

Member

The code for the f16 features is here.

This is the whole commit: teoxoy/gpuinfo-vulkan-query@8681e00 (includes the src and the analysis).

Contributor Author

Ok so it is all the features needed to implement WGSL f16. Thanks!

Comment on lines +51 to +60
According to [gpuinfo.org](https://vulkan.gpuinfo.org/displaycoreproperty.php?core=1.1&name=subgroupSupportedOperations&platform=all),
this feature set captures ~84% of devices.
Splitting the feature does not grab a significant portion of devices.
Member

According to this report analysis, splitting off the arithmetic ops looks the most promising, but considering the devices we get back are only Intel Gen7 (Ivy Bridge, Haswell), it might not be worth it.


@mehmetoguzderin (Member) left a comment

Approving, as the document looks promising and makes reasonable choices to expose a wider set of operations with more stage compatibility without sacrificing device segments that have meaningful support for subgroups. Thank you very much, and I look forward to further dialogue to realize the subgroups spec in WGSL and WebGPU once the proposal is in flight.

@alan-baker (Contributor Author)

> In addition to subgroup_size and subgroup_invocation_id, gfx-rs/wgpu#4190 also has num_subgroups and subgroup_id, for the compute stage only. These are missing in HLSL but can be emulated.

We can consider those, but I think it requires more testing to ensure D3D12 behavior is always consistent. I don't have access to enough Windows devices to test this fully. My worry would be that HLSL has a similar problem to Vulkan 1.1 where the subgroup size builtin could just return a static number. There is a possibility that D3D12 backends will have to emulate subgroup_size.

I've added a todo to see if we can provide portable versions of these builtin values.

@teoxoy (Member) left a comment

I double-checked the builtins across shading languages and the proposal includes the ones that are available on our 3 main targets.

SPIR-V and MSL have quite a few more builtins that HLSL SM6 doesn't.
SPIR-V and HLSL SM6 have 3 additional builtins (subgroupAllEqual, subgroupBallotBitCount, subgroupExclusiveBallotBitCount) but MSL lacks those.

Great stuff!

@dneto0 (Contributor) commented Dec 6, 2023

I actually used subgroupBallotBitCount and it's handy to have, and trivially polyfillable:

```wgsl
fn ballotBitCount(ballot: vec4u) -> u32 {
  return dot(vec4u(1), countOneBits(ballot));
}
```
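The same computation can be checked in Python (an illustrative analogue, not part of the proposal; a ballot is four 32-bit words, one bit per lane):

```python
def ballot_bit_count(ballot):
    """Python analogue of the WGSL polyfill: count set bits across the
    four 32-bit words of a vec4<u32> ballot."""
    return sum(bin(word & 0xFFFFFFFF).count("1") for word in ballot)

# Lanes 0, 5, and 37 voted true.
ballot = [(1 << 0) | (1 << 5), 1 << (37 - 32), 0, 0]
print(ballot_bit_count(ballot))  # 3
```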

@dneto0 (Contributor) left a comment

LGTM. This is a great foundation for experimentation and further discussion.
Good to land now.

alan-baker and others added 11 commits December 6, 2023 13:43
* Two new features
  * `subgroups` for the base functionality
  * `subgroups-f16` to use f16 in the new operations
Co-authored-by: exrook <exrook.j@gmail.com>
Co-authored-by: Teodor Tanasoaia <28601907+teoxoy@users.noreply.github.com>
Co-authored-by: Teodor Tanasoaia <28601907+teoxoy@users.noreply.github.com>
Co-authored-by: Teodor Tanasoaia <28601907+teoxoy@users.noreply.github.com>
Co-authored-by: Teodor Tanasoaia <28601907+teoxoy@users.noreply.github.com>
* Added a diagnostics section
  * Two new diagnostics to make the feature as portable as possible
  * Default to errors
* Added some extra requirements
* Added some todos:
  * Potentially dropping subgroups-f16
  * Potentially emulating other builtin values
  * Increasing portability of individual builtin calls
Co-authored-by: Teodor Tanasoaia <28601907+teoxoy@users.noreply.github.com>
@dneto0 dneto0 enabled auto-merge (squash) December 6, 2023 18:43
@dneto0 dneto0 disabled auto-merge December 6, 2023 18:43
@dneto0 dneto0 enabled auto-merge (squash) December 6, 2023 18:43
@dneto0 dneto0 merged commit e1ded4a into gpuweb:main Dec 6, 2023
4 checks passed
@kdashg (Contributor) commented Dec 6, 2023

WGSL 2023-12-05 Minutes
  • JB: Teo went through the proposal. Mozilla’s feeling is there’s no plausible alternative to this functionality. This lets users take advantage of near-universal support of this in hardware. We think we should land this as a proposal doc.
  • AB: I’ve landed some of the smaller of Teo’s changes, and will look up the device loss reports from the gpuinfo.org queries.
  • AB: Very interested in the cascade of devices that get lost as you add features. Had a question about the f16 one; doesn’t correspond to a Vulkan feature-bit. Will need to followup. Some question about the report showing 4% loss if you couple subgroups-f16 with subgroups. Want to follow up.
  • AB: Overall would be happy to have fewer splits of the feature if the data bears it out.
  • AB: Internally we discussed how to handle portability. E.g. possibly avoiding indeterminate value when you get data from an inactive invocation (e.g. for shuffles). Could look at polyfilling as the default, and then opt out if you want the speed but not the portability.
  • JB: Definitely like the portable unless you ask for something else.
  • MOD: Will some of these non-portable behaviors be security hazards? Or is it unlikely.
  • AB: Think it is not a security hazard. All the invocations will be from the same end-user application (graphics context at the API level).
  • AB: Vulkan is weird in that it can pack multiple draws into a single subgroup, but don’t know how many devices actually do that.
  • (JB: What will hardware people think of next?? :P )
  • (KG: Mesh shaders, yeah? :P )
  • JB: I was surprised that Mesh shaders didn’t come up from [Google] partner requests. Is there no asking here?
  • AB: It’s large to spec. There’s a concern of whether the mesh features in underlying APIs are actually good across HW classes in the long term.
  • JB: Yeah, I guess it’s a lot to spec. Lots of iteration vs earlier geom/tess approaches
  • AB: for subgroups, I’ll add the diagnostic change. Would love to merge the proposal, and iterate from there, even without a firm ETA for merging into main spec.
  • KG: Sounds like consensus to merge proposal after some (but not perfectionist!) revisioning.

@raphlinus
Sorry for the late feedback, and I'm not sure if this is the right place for it (if not, let me know where is best). Overall I'm thrilled to see this go forward. I do have remaining concerns about subgroup size.

First, my testing has shown that WaveLaneCountMin is not a reliable minimum value; in particular, on some Intel devices it can be reported as 16 while a workgroup can be launched with 8. I know there's a TODO to validate this, and I'm happy to share details of my testing environment.

Second, I'm unhappy that subgroup size bounds are available only as limits through the API and not accessible from WGSL, for reasons I've explained in #3950. There is a nontrivial burden on programmers to reap the API values, translate them into array sizes, and instantiate those as override constants (or textual substitution). Of more concern, it precludes a good compiler from doing analysis of the program, inferring tighter bounds on the subgroup sizes, and instantiating an individual program with those tighter bounds. I'm curious what the reasoning is for not including this.

@charles-r-earp
Some Intel GPUs in Vulkan report a SubgroupSize of 32 both on the host and the GPU but, as @raphlinus mentioned, have a range of 8-32 and will generally use 16. It's my understanding that Vulkan drivers simply report the same SubgroupSize they have on the host, regardless of the workgroup size, which often may not accurately reflect the number of active threads within a particular subgroup.

subgroup_size can be computed correctly via ballot or reduction operations if it is not a fixed value (i.e. min == max). In cases where subgroup_invocation_id is not available, it can be computed based on the relation local_invocation_index = subgroup_id * subgroup_size + subgroup_invocation_id, which should be trivial in the typical power-of-2 cases.
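The two derivations above can be sketched in Python (illustrative helper names; this assumes all lanes in the subgroup are active for the ballot, and that the stated packing relation actually holds on the device):

```python
def subgroup_size_from_ballot(ballot):
    """Popcount of a ballot in which every lane voted true. Note this
    yields the number of *active* lanes, which equals the subgroup size
    only when all lanes are active."""
    return sum(bin(word & 0xFFFFFFFF).count("1") for word in ballot)

def split_local_index(local_invocation_index, subgroup_size):
    """Invert local_invocation_index = subgroup_id * subgroup_size +
    subgroup_invocation_id, assuming that packing relation holds."""
    return divmod(local_invocation_index, subgroup_size)

print(subgroup_size_from_ballot([0xFFFF, 0, 0, 0]))  # 16 active lanes
print(split_local_index(23, 16))                     # (1, 7)
```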

Required Subgroup Size

With subgroup_size accurately computed, I would propose not enforcing subgroup_size uniformity or setting it to a specific value, since vendors like Intel and Apple have chosen to make this dynamic (potentially based on workgroup size or register usage).

It's reasonable to assume that there would be a performance cost, potentially catastrophic in some instances, which it doesn't make sense to pay unless it is needed. That is, something that the user opts in to, requesting this behavior.

This avoids having to use required_subgroup_size, or requiring full subgroups, etc. Apple doesn't support required_subgroup_size, and may require speculative recompiling or some other magic to make this work. While it's understandable to make this simpler to ease dev burden, it may not actually be necessary, and not doing it would be easier and more portable.

Requiring local_size (or just the x dim) be a multiple of the max subgroup size, while it seems like common practice, might be unnecessarily restrictive, as this isn't otherwise a requirement. When learning or debugging, it can be helpful to reduce workgroup size to 1 or a small value. Or larger subgroups could simply be neglected when prototyping, to avoid additional complexity, or maybe the algorithm doesn't require more threads. It isn't a necessary requirement in order to use subgroups correctly.

@raphlinus makes a good point about including subgroupMinSize and subgroupMaxSize within a shader. This allows specializing in cases where subgroup_size is fixed, while also handling cases where it varies, for instance by discarding extra subgroups, altering the algorithm, inserting barriers, etc.

Potentially required_subgroup_size, uniform/varying subgroup size, or full subgroups could eventually be exposed in some fashion, if necessary. That is, if the user really needs a fixed subgroup_size, they could opt in via various means, whether when creating the pipeline or as a property of the pipeline (which aligns better with Apple). Making this optional allows implementations to do so as it becomes possible, if possible, rather than blocking subgroups entirely.

@jzm-intel (Contributor)

Hi, can we add a note that the subgroup_size built-in value, when used in a compute shader, should be considered a uniform built-in value?

Currently only the workgroup_id and num_workgroups built-in values are considered uniform according to the spec. Treating subgroup_size as uniform may help in writing flexible kernels that fit different actual subgroup sizes, and would allow using functions like workgroupBarrier:

```wgsl
if (subgroupSize == 16) {
    ...
    workgroupBarrier();
    ...
}
```

The Vulkan spec, in the SubgroupSize entry of section 15.9 "Built-In Variables", requires that

> The value may vary across a single draw call, and for fragment shaders may vary across a single primitive. In compute dispatches, SubgroupSize must be uniform with command scope.

I don't find an explicit requirement in the MSL/HLSL specs, but I think they should work the same way for compute kernels.
