Add subgroup barrier to subgroups proposal #4437
Comments
Vulkan specifically goes out of its way to mention that the mapping between local id and subgroup id is not reliable. I have not seen it fail for a one-dimensional workgroup, but there is a caveat that subgroups aren't required to be packed the way your litmus test assumes. It should use subgroup_id and subgroup_invocation_id to build a linear layout, but then you run into the problem of needing a reliable emulation of subgroup_id in HLSL, and guarantees that only full subgroups are launched (seems doable in Vulkan and Metal, but not D3D). SM6.6 does have the WaveSize attribute, but none of these features fit together in a nice package (mostly Metal and D3D). Upgrading to a workgroup barrier if the uniformity analysis is done at subgroup level seems destined to be unreliable as well. I'm not sure this is something we could hand-wave about. I could see subgroupBarrier being an optional additional feature that is simply not present on D3D devices, but I wouldn't feel comfortable spec'ing it to "just work".
WGSL 2024-01-09 Minutes
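A sketch of the layout suggested above, assuming a `subgroup_id` builtin in the shape of Vulkan's `gl_SubgroupID` (as the comment notes, this would need emulation in HLSL, and the layout is dense only when every subgroup is launched full):

```wgsl
enable subgroups;

// Illustrative only: derive a linear index from subgroup builtins rather
// than assuming local_invocation_index is packed by subgroup.
// `subgroup_id` is an assumed builtin; it is not in the Chrome 125 set.
@compute @workgroup_size(256)
fn main(
    @builtin(subgroup_id) sg_id: u32,
    @builtin(subgroup_invocation_id) sg_inv_id: u32,
    @builtin(subgroup_size) sg_size: u32,
) {
    // Valid as a dense layout only if all subgroups are fully populated.
    let linear = sg_id * sg_size + sg_inv_id;
    // ... index workgroup memory with `linear` instead of
    // local_invocation_index ...
}
```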
According to 3DGS feedback, subgroups are not useful without barriers.
I am also very interested in having a fast radix sorting implementation in WGSL. I currently implement radix sorting using workgroup barriers, but performance is not good enough for production, as too many dispatches are necessary, which doesn't scale with my use case. Ralph's thread here is very informative on the topic. @raphlinus, were you able to find a way to create an optimized radix sort using the latest subgroup implementation (just released in Chrome 125), or is it just not viable without barriers? Any progress towards adding subgroup barriers to the proposal?
I have not yet, but it's on my radar; it looks like the Chrome 125 feature set is barely enough, as the main thing you need is ballot (aside from the barrier). I do have a prototype, which was produced by translating the WGSL into Metal, then hand-editing the Metal to change the workgroup implementation of WLMS to proper subgroups. Another challenge, of course, is knowing the subgroup size. For experimentation, it's possible to rely on Apple Silicon having a subgroup size of 32, but for portable deployment it's a much harder problem. The GPUSorting repo has some interesting work which promises to be portable across subgroup sizes, but I haven't studied it carefully yet.
The current subgroup proposal is lacking a subgroup barrier, though one was present in the naga prototype implementation. In this issue I will try to motivate why we need it, suggest semantics including some test cases, and finally give some notes about how it could be implemented.
I believe there is a strong analogy to acquire/release semantics in the C++ memory model. In that analogy, subgroup barriers map to acquire/release, lockstep behavior is equivalent to TSO, and promoting the subgroup barrier to a workgroup barrier is equivalent to sequential consistency. It is not necessary to fully appreciate this analogy, but I think it's helpful in understanding where arguments questioning the need for the finer grained distinctions may come from.
A motivating use case
I didn't have a golden example of a use case for subgroup barriers until recently, but now I do. I'm not alone, there is a (currently still unanswered) question on Apple developer forums asking this in the context of Metal, with a reference to Apple-written code in which the question was unclear.
The algorithm is warp-local multi-split, which is at the core of state of the art least-significant-digit radix sorting algorithms such as Onesweep. This has a readily available implementation in CUDA. I did a prototype port to vanilla WebGPU, using workgroup shared arrays instead of subgroups, but found that the excessive number of workgroup barriers needed tanked performance (I don't have hard numbers yet but it is likely that our investigation into sorting will produce some). On infrastructure with subgroups (after which the algorithm is named), the WLMS algorithm performs key ranking entirely using subgroups, with a single reasonably cheap prefix sum to stitch together the results after n chunks, where n ("elements per thread") is a tunable parameter that gains work efficiency possibly at the expense of exploitable parallelism.
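For intuition, the subgroup ranking step of WLMS can be sketched in WGSL roughly as follows. This is an illustrative reconstruction, not code from any of the implementations mentioned; it assumes the Chrome 125 `subgroupBallot` builtin, and all names are made up for the sketch.

```wgsl
enable subgroups;

const RADIX_BITS: u32 = 4u;

// Count set bits in `ballot` at lane positions strictly below `lane`.
fn count_bits_below(ballot: vec4<u32>, lane: u32) -> u32 {
    var count = 0u;
    for (var w = 0u; w < 4u; w++) {
        let base = w * 32u;
        if (lane >= base + 32u) {
            count += countOneBits(ballot[w]);
        } else if (lane > base) {
            count += countOneBits(ballot[w] & ((1u << (lane - base)) - 1u));
        }
    }
    return count;
}

// Returns this invocation's rank among the subgroup peers that hold the
// same digit, using one ballot per digit bit.
fn rank_digit(digit: u32, sg_inv_id: u32) -> u32 {
    var peers = vec4<u32>(0xffffffffu);
    for (var bit = 0u; bit < RADIX_BITS; bit++) {
        let pred = ((digit >> bit) & 1u) == 1u;
        let ballot = subgroupBallot(pred);
        // Keep only lanes whose bit agrees with ours.
        peers &= select(~ballot, ballot, pred);
    }
    return count_bits_below(peers, sg_inv_id);
}
```

The ranks computed this way are then written to disjoint per-subgroup regions of workgroup memory, which is the step that needs the subgroup barrier discussed below.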
A key operation in WLMS updates an array in workgroup shared memory so that the locations addressed are disjoint at subgroup granularity, but different threads within a subgroup may read and write the same value. In CUDA, the Vulkan memory model, and my interpretation of Metal's adaptation of the C++ memory model, such a thing is a data race unless there is an intervening barrier. And indeed, in the CUB source code, there is a barrier that macro-expands to `__syncwarp()`.

A reduced test case for this pattern is as follows:
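A sketch of such a test (illustrative names; note that it assumes subgroups occupy contiguous ranges of `local_invocation_index`, which as discussed in the comments is not guaranteed):

```wgsl
enable subgroups;

const WG: u32 = 256u;
var<workgroup> vals: array<u32, WG>;

@compute @workgroup_size(256)
fn main(
    @builtin(local_invocation_index) lid: u32,
    @builtin(subgroup_size) sg_size: u32,
) {
    vals[lid] = lid;
    subgroupBarrier();
    // Read a slot written by a different invocation in the same subgroup:
    // rotate by one within the subgroup. With (WG - 1u) as the mask
    // instead of (sg_size - 1u), the read could cross subgroups and a
    // full workgroupBarrier() would be needed.
    let base = lid & ~(sg_size - 1u);
    let expect = base + ((lid + 1u) & (sg_size - 1u));
    let got = vals[expect];
    // The test passes iff got == expect for every invocation.
}
```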
A few quick observations. If the mask were `(WG - 1)` instead of `(sg_size - 1)`, then this would be expected to fail, though correctness could be restored by upgrading the barrier to workgroup scope.

Reconvergence and uniformity
As stated in this informative thread questioning the need for `__syncwarp` in CUDA, the Volta execution model is allowed to move threads forward in any order, even in the absence of conditional control flow. While this is not the execution model we expect in graphics implementations, I don't think we should preclude it; it's possible we may want to run WGSL programs in the future on infrastructure that shares this property with CUDA.

The spec for `simdgroup_barrier()` in MSL requires subgroup uniformity, though analogous barriers in other APIs do not. For WLMS, the barrier would be run in workgroup uniform control flow, so I personally would be happy with that. The rules for reconvergence are messy (as amply documented in the subgroup proposal), and I expect they will probably continue to be refined.

Proposed semantics
I propose that `subgroupBarrier()` be required to be in uniform control flow, just as the other barriers are. This could be weakened later, though I think that would require uniformity analysis to run at subgroup granularity as well.

Its semantics are the same as `workgroupBarrier()` except that it says "invocations in the same subgroup" rather than "invocations in the same workgroup" (referencing 14.5.1 Barriers in the current spec). The memory semantics are the same as for workgroup barriers: `AcquireRelease` and `WorkgroupMemory`.

Implementation guide
In general, implementations should do the least expensive thing required to pass the above litmus test. In some cases, that may be nothing. Even so, the goal is to capture programmer intent in case the distinction is important. I doubt anyone would be comfortable adding the above litmus test sans barrier to the CTS.
Metal
Translation to MSL is straightforwardly `simdgroup_barrier(mem_flags::mem_threadgroup)`. As mentioned above, subgroup uniformity is required.

Vulkan
Translation to Vulkan is `controlBarrier(gl_ScopeSubgroup, gl_ScopeWorkgroup, gl_StorageSemanticsShared, gl_SemanticsAcquireRelease)` in GLSL syntax, or `OpControlBarrier(3, 2, 0x108)` in SPIR-V. This follows straightforwardly from the memory semantics as written in the spec, with the addition that memory scope (not a concept defined by WGSL) is set to workgroup, matching the scope implied by the memory semantics.

HLSL
Translation to HLSL is `// TODO: add WaveBarrier() if a future shader model incorporates it`. Just as on x86 we can rely on total store ordering and emit vanilla load and store operations, here we rely on the folk knowledge that GPUs execute in lockstep at the wave level, and thus no synchronization is necessary. If an implementation fails the above litmus test, is it a bug in the implementation, or an insufficiently strong spec? Is an electron a wave or a particle? In either case, the solution is to emit `GroupMemoryBarrierWithGroupSync()` when we are not confident the implementation admits the fast implementation, possibly using an allowlist or denylist as with other bug workarounds.