
perf: remove workgroup-level early-out atomics from rasterize pass#8596

Merged
mvaligursky merged 1 commit into main from mv-rasterize-remove-sync
Apr 14, 2026


@mvaligursky
Contributor

Remove the per-batch workgroup-level early-out (atomicStore/atomicAdd/atomicLoad + barriers) from the tile rasterizer. WGSL lacks a fused barrier+vote intrinsic like CUDA's __syncthreads_count, so emulating it with atomics costs 3 synchronization points per batch of 64 splats.
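For context, the removed emulation has roughly the following shape. This is a hypothetical reconstruction from the step sequence listed under Changes, not the deleted code itself: `allThreadsDone` and `localIdx` are illustrative names, and the value of `WORKGROUP_SIZE` is assumed. Only `doneCount`, `doneCountShared`, `threadDone` and `WORKGROUP_SIZE` are names taken from this PR.

```wgsl
// Hypothetical reconstruction of the removed workgroup-level early-out.
const WORKGROUP_SIZE: u32 = 64u; // assumed value

var<workgroup> doneCount: atomic<u32>;
var<workgroup> doneCountShared: u32;

// Returns true, uniformly across the workgroup, once every thread is done.
// Must be called from uniform control flow because it contains barriers.
fn allThreadsDone(localIdx: u32, threadDone: bool) -> bool {
    if (localIdx == 0u) {
        atomicStore(&doneCount, 0u);              // reset the vote counter
    }
    workgroupBarrier();                           // sync point 1

    if (threadDone) {
        atomicAdd(&doneCount, 1u);                // cast this thread's vote
    }
    workgroupBarrier();                           // sync point 2

    if (localIdx == 0u) {
        doneCountShared = atomicLoad(&doneCount); // publish as a plain u32
    }
    // Sync point 3: workgroupUniformLoad performs its own barrier and
    // returns a value that uniformity analysis treats as uniform.
    return workgroupUniformLoad(&doneCountShared) == WORKGROUP_SIZE;
}
```

In CUDA the same vote is a single call, `__syncthreads_count(threadDone) == blockDim.x`, which is why the WGSL version pays for three synchronization points where CUDA pays for one.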

Changes:

  • Remove doneCount atomic and doneCountShared workgroup variables
  • Remove the per-batch atomic early-out dance (atomicStore → barrier → atomicAdd → barrier → atomicLoad → workgroupUniformLoad)
  • Keep the per-thread threadDone flag, which suppresses evaluation via branchless ALU once all 4 pixels saturate
  • Simplify batch loop to: load → barrier → eval → barrier
  • Remove unused WORKGROUP_SIZE constant
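
The simplified loop (load → barrier → eval → barrier) can be sketched as below. This is a minimal illustration under stated assumptions, not the engine's shader: `Splat`, `Params`, the bindings, the blend math and the saturation threshold are all hypothetical, and each thread shades one pixel here rather than 4. Only `threadDone` and the 64-splat batch size come from the PR text.

```wgsl
// Hypothetical sketch of the simplified batch loop; see assumptions above.
const BATCH_SIZE: u32 = 64u;
const MIN_TRANSMITTANCE: f32 = 1.0 / 255.0; // assumed saturation threshold

struct Splat {
    center: vec2f,
    radius: f32,
    opacity: f32,
    color: vec4f,
}

struct Params {
    splatCount: u32,
    width: u32,
    height: u32,
}

@group(0) @binding(0) var<uniform> params: Params;
@group(0) @binding(1) var<storage, read> splats: array<Splat>;
@group(0) @binding(2) var<storage, read_write> pixels: array<vec4f>;

var<workgroup> batch: array<Splat, BATCH_SIZE>;

@compute @workgroup_size(8, 8)
fn main(
    @builtin(global_invocation_id) gid: vec3u,
    @builtin(local_invocation_index) lid: u32
) {
    let pixel = vec2f(gid.xy) + 0.5;
    var color = vec3f(0.0);
    var transmittance = 1.0;
    var threadDone = false;

    // Loop bounds depend only on uniform data, so both barriers below sit
    // in uniform control flow even though threadDone differs per thread.
    let numBatches = (params.splatCount + BATCH_SIZE - 1u) / BATCH_SIZE;
    for (var b = 0u; b < numBatches; b++) {
        // load: each thread copies one splat into workgroup memory
        let src = min(b * BATCH_SIZE + lid, params.splatCount - 1u);
        batch[lid] = splats[src];
        workgroupBarrier();

        // eval: every thread walks the whole batch; a finished thread
        // contributes zero through select() instead of branching out
        let count = min(BATCH_SIZE, params.splatCount - b * BATCH_SIZE);
        for (var i = 0u; i < count; i++) {
            let s = batch[i];
            let d = (pixel - s.center) / s.radius;
            let alpha = select(s.opacity * exp(-0.5 * dot(d, d)), 0.0, threadDone);
            color += alpha * transmittance * s.color.rgb;
            transmittance *= 1.0 - alpha;
            threadDone = threadDone || transmittance < MIN_TRANSMITTANCE;
        }
        // keep fast threads from overwriting the batch while others read it
        workgroupBarrier();
    }

    if (gid.x < params.width && gid.y < params.height) {
        pixels[gid.y * params.width + gid.x] = vec4f(color, 1.0 - transmittance);
    }
}
```

The trade-off in this shape is that a fully saturated workgroup still iterates over the remaining batches, but each batch costs two barriers and no atomics.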

Performance:

  • Neutral on Apple M4 (2.95ms → 2.99ms) where barriers are nearly free
  • Expected improvement on NVIDIA/discrete GPUs where each barrier stall is significant
  • Simpler code, less shared memory usage (no atomic variables)

WGSL lacks a fused barrier+vote intrinsic like CUDA's __syncthreads_count,
so emulating workgroup-level early-out with atomics+barriers costs 3
synchronization points per batch. Remove the atomic dance and rely on
per-thread branchless early-out instead.
@mvaligursky mvaligursky self-assigned this Apr 14, 2026
@mvaligursky mvaligursky merged commit 96a4b98 into main Apr 14, 2026
8 checks passed
@mvaligursky mvaligursky deleted the mv-rasterize-remove-sync branch April 14, 2026 08:58