Add subgroup barrier to subgroups proposal #4437

Open
raphlinus opened this issue Jan 3, 2024 · 5 comments
Labels
wgsl WebGPU Shading Language Issues

@raphlinus

The current subgroup proposal is lacking a subgroup barrier, though it was present in the naga prototype implementation. In this issue I will try to motivate why we need it, suggest semantics including some test cases, and finally some notes about how it could be implemented.

I believe there is a strong analogy to acquire/release semantics in the C++ memory model. In that analogy, subgroup barriers map to acquire/release, lockstep behavior is equivalent to TSO, and promoting the subgroup barrier to a workgroup barrier is equivalent to sequential consistency. It is not necessary to fully appreciate this analogy, but I think it's helpful in understanding where arguments questioning the need for the finer grained distinctions may come from.

A motivating use case

I didn't have a golden example of a use case for subgroup barriers until recently, but now I do. I'm not alone: there is a (currently still unanswered) question on the Apple developer forums asking this in the context of Metal, with a reference to Apple-written code for which the question is unclear.

The algorithm is warp-local multi-split (WLMS), which is at the core of state-of-the-art least-significant-digit radix sorting algorithms such as Onesweep. It has a readily available implementation in CUDA. I did a prototype port to vanilla WebGPU, using workgroup shared arrays instead of subgroups, but found that the excessive number of workgroup barriers needed tanked performance (I don't have hard numbers yet, but our investigation into sorting will likely produce some). On infrastructure with subgroups (after which the algorithm is named), WLMS performs key ranking entirely using subgroups, with a single, reasonably cheap prefix sum to stitch together the results after n chunks, where n ("elements per thread") is a tunable parameter that gains work efficiency, possibly at the expense of exploitable parallelism.

A key operation in WLMS updates an array in workgroup shared memory so that the locations addressed are disjoint at subgroup granularity, but different threads within a subgroup may read and write the same location. In CUDA, in the Vulkan memory model, and in my interpretation of Metal's adaptation of the C++ memory model, such a thing is a data race unless there is an intervening barrier. And indeed, in the CUB source code, there is a barrier that macro-expands to __syncwarp().

A reduced test case for this pattern is as follows:

```wgsl
enable subgroups; // the extension from the subgroups proposal

const WG = 256;
var<workgroup> a: array<u32, WG>;

@compute @workgroup_size(WG)
fn main(
    @builtin(local_invocation_index) index: u32,
    @builtin(subgroup_size) sg_size: u32,
) {
    // Each invocation writes a slot owned by a different invocation
    // in the same subgroup (assuming subgroups are packed by
    // local_invocation_index).
    a[index ^ (sg_size - 1)] = 1;
    subgroupBarrier();
    let x = a[index];
    if x != 1 {
        fail(); // placeholder for reporting a litmus-test failure
    }
}
```

A few quick observations. If the mask were (WG - 1) instead of (sg_size - 1), then this would be expected to fail, though correctness could be restored by upgrading the barrier to workgroup scope.
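For contrast, here is a sketch of that upgraded variant (reusing the placeholder fail() from above): with the wider mask, writes cross subgroup boundaries, so only a workgroup-scope barrier makes the reads well-defined.

```wgsl
const WG = 256;
var<workgroup> a: array<u32, WG>;

@compute @workgroup_size(WG)
fn main(@builtin(local_invocation_index) index: u32) {
    // Writes now pair up invocations across the whole workgroup,
    // so a subgroup-scope barrier would no longer be sufficient.
    a[index ^ (WG - 1)] = 1;
    workgroupBarrier();
    let x = a[index];
    if x != 1 {
        fail(); // placeholder, as above
    }
}
```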

Reconvergence and uniformity

As stated in this informative thread questioning the need for __syncwarp in CUDA, the Volta execution model is allowed to move threads forward in any order, even in the absence of conditional control flow. While this is not the execution model we expect in graphics implementations, I don't think we should preclude that; it's possible we may want to run WGSL programs in the future on infrastructure that shares this property with CUDA.

The spec for simdgroup_barrier() in MSL requires subgroup uniformity, though analogous barriers in other APIs do not. For WLMS, the barrier would run in workgroup-uniform control flow, so I personally would be happy with that. The rules for reconvergence are messy (as amply documented in the subgroup proposal), and I expect they will continue to be refined.

Proposed semantics

I propose that subgroupBarrier() is required to be in uniform control flow, just like the other barriers. This could be weakened later, though I think that would require uniformity analysis to run at subgroup granularity as well.

Its semantics are the same as workgroupBarrier(), except that it says "invocations in the same subgroup" rather than "invocations in the same workgroup" (referencing 14.5.1 Barriers in the current spec). The memory semantics are the same as for workgroup barriers: AcquireRelease and WorkgroupMemory.

Implementation guide

In general, implementations should do the least expensive thing required to pass the above litmus test. In some cases, that may be nothing. Even so, the goal is to capture programmer intent in case the distinction is important. I doubt anyone would be comfortable adding the above litmus test sans barrier to the CTS.

Metal

Translation to MSL is straightforwardly simdgroup_barrier(mem_flags::mem_threadgroup). As mentioned above, subgroup uniformity is required.
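As a rough sketch, the litmus test above would translate to something like the following MSL (kernel and array names are illustrative, not prescribed):

```metal
#include <metal_stdlib>
using namespace metal;

kernel void litmus(uint index [[thread_index_in_threadgroup]],
                   uint sg_size [[threads_per_simdgroup]]) {
    threadgroup uint a[256];
    a[index ^ (sg_size - 1)] = 1;
    // Subgroup-scope execution barrier with threadgroup memory ordering.
    simdgroup_barrier(mem_flags::mem_threadgroup);
    uint x = a[index];
    // ... check x == 1 and report failure
}
```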

Vulkan

Translation to Vulkan is controlBarrier(gl_ScopeSubgroup, gl_ScopeWorkgroup, gl_StorageSemanticsShared, gl_SemanticsAcquireRelease) in GLSL syntax, or OpControlBarrier(3, 2, 0x108) in SPIR-V. This follows straightforwardly from the memory semantics as written in the spec, with the addition that memory scope (not a concept defined by WGSL) is set to workgroup, matching the scope implied by the memory semantics.
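In fuller GLSL form, the translated litmus test might look like this sketch (extension names per GL_KHR_memory_scope_semantics and GL_KHR_shader_subgroup_basic):

```glsl
#version 450
#extension GL_KHR_memory_scope_semantics : require
#extension GL_KHR_shader_subgroup_basic : require

layout(local_size_x = 256) in;
shared uint a[256];

void main() {
    uint index = gl_LocalInvocationIndex;
    a[index ^ (gl_SubgroupSize - 1u)] = 1u;
    // Execution scope: subgroup; memory scope: workgroup;
    // shared-memory storage class with acquire/release ordering.
    controlBarrier(gl_ScopeSubgroup, gl_ScopeWorkgroup,
                   gl_StorageSemanticsShared, gl_SemanticsAcquireRelease);
    uint x = a[index];
    // ... check x == 1 and report failure
}
```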

HLSL

Translation to HLSL is // TODO: add WaveBarrier() if a future shader model incorporates it. Just as on x86 we can rely on total store ordering and can emit vanilla load and store operations, here we rely on the folk knowledge that GPUs execute in lockstep at the wave level, and thus no synchronization is necessary. If an implementation fails the above litmus test, is it a bug in the implementation, or an insufficiently strong spec? Is an electron a wave or a particle? In either case, the solution is to emit GroupMemoryBarrierWithGroupSync() when we are not confident the implementation admits the fast implementation, possibly using an allowlist or denylist as with other bug workarounds.
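A sketch of the conservative fallback path in HLSL (names illustrative; WaveGetLaneCount() requires SM6.0+):

```hlsl
groupshared uint a[256];

[numthreads(256, 1, 1)]
void litmus(uint index : SV_GroupIndex) {
    uint sg_size = WaveGetLaneCount();
    a[index ^ (sg_size - 1)] = 1;
    // Conservative: full group barrier. On hardware known to execute
    // in lockstep at wave granularity, this could be elided entirely.
    GroupMemoryBarrierWithGroupSync();
    uint x = a[index];
    // ... check x == 1 and report failure
}
```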

@raphlinus raphlinus added the wgsl WebGPU Shading Language Issues label Jan 3, 2024
@alan-baker
Contributor

Vulkan specifically goes out of its way to mention that the mapping between local invocation ID and subgroup ID is not reliable. I have not seen it fail for a one-dimensional workgroup, but the caveat is that subgroups aren't required to be packed the way your litmus test assumes. It should use subgroup_id and subgroup_invocation_id to build a linear layout, but then you run into the problem of needing a reliable emulation of subgroup_id in HLSL, plus guarantees that only full subgroups are launched (which seems doable in Vulkan and Metal, but not D3D). SM6.6 does have the WaveSize attribute, but none of these features fit together in a nice package (mostly on Metal and D3D).

Upgrading to a workgroup barrier when the uniformity analysis is done at the subgroup level seems destined to be unreliable as well. I'm not sure this is something we could hand-wave about. I could see subgroupBarrier being an optional additional feature that is simply not present on D3D devices, but I wouldn't feel comfortable spec'ing it to "just work".

@kdashg
Contributor

kdashg commented Jan 9, 2024

WGSL 2024-01-09 Minutes
  • KG: Maybe an M3+ thing. Add later.
  • JB: Raph's example says a workgroup barrier makes it not worth it.
  • AB: Would have to be its own feature flag, distinct from the main subgroups feature. Raph's example also relies on assumptions about the mapping between subgroup IDs and local IDs that are not enforced by Vulkan.

@munrocket
Contributor

According to 3DGS feedback, subgroups are not useful without barriers.
What if we leave HLSL broken for 2D/3D workgroups? At least it would work for VR backends 🤔

@alienself

I am also very interested in having a fast radix sorting implementation in WGSL. Unfortunately, this is quite hard to achieve at the moment without subgroup barriers.

I currently implement radix sorting using workgroup barriers, but performance is not good enough for production, as too many dispatches are necessary, which doesn't scale with my use case.

Raph's thread here is very informative on the topic. @raphlinus, were you able to find a way to create an optimized radix sort using the latest subgroup implementation (just released in Chrome 125), or is it just not viable without barriers?

Any progress towards adding subgroup barriers to the proposal?

@raphlinus
Author

I have not yet, but it's on my radar; it looks like the Chrome 125 feature set is barely enough, as the main thing you need (aside from the barrier) is ballot. I do have a prototype, which was run by translating the WGSL into Metal, then hand-editing the Metal to change the workgroup implementation of WLMS to proper subgroups.

Another challenge of course is knowing the subgroup size. For experimentation, it's possible to know that Apple Silicon has a subgroup size of 32, but for portable deployment it's a much harder problem. The GPUSorting repo has some interesting work which promises to be portable across subgroup sizes, but I haven't studied it carefully yet.
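One common workaround, sketched here under the assumption that subgroup sizes are powers of two no smaller than 4 (hardware commonly ranges from 4 to 128), is to size workgroup scratch for the smallest possible subgroup size and index only the portion needed at runtime:

```wgsl
enable subgroups;

const WG = 256u;
// Worst case: subgroup size 4 gives WG / 4 subgroups per workgroup.
const MAX_SUBGROUPS = WG / 4u;
var<workgroup> partials: array<u32, MAX_SUBGROUPS>;

@compute @workgroup_size(WG)
fn main(
    @builtin(local_invocation_index) index: u32,
    @builtin(subgroup_size) sg_size: u32,
) {
    let num_subgroups = WG / sg_size;
    // ... use partials[0 .. num_subgroups) for the cross-subgroup
    // prefix sum, regardless of the actual subgroup size.
}
```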
