perf: replace atomic scatter with pair-buffer tile binning and cooperative large-splat processing #8586
Conversation
Replace the expensive per-splat atomic scatter pass in the compute GSplat renderer with a fused count+pair-write approach and a lightweight PlaceEntries pass, eliminating redundant projCache reads and reducing global atomic contention. Large splats (>64 tiles) are deferred to cooperative passes where 256 threads process each splat in parallel, eliminating the wavefront divergence that caused long GPU tails.
Pull request overview
Refactors the WebGPU compute GSplat renderer’s tile binning pipeline to remove the per-splat atomic scatter pass, replacing it with a pair-buffer approach and adding cooperative processing for large splats to reduce GPU tail latency.
Changes:
- Fuses tile counting with pair-buffer allocation/writes, eliminating the separate atomic scatter pass and redundant projCache re-reads.
- Adds a scatter-free PlaceEntries pass plus cooperative LargeTileCount/LargePlaceEntries passes for splats spanning many tiles.
- Updates renderer wiring/buffers and refreshes high-level documentation/comments to reflect the new pipeline.
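The fused count + pair-write scheme described above is essentially a counting sort split across GPU passes. A sequential JavaScript sketch of the idea (buffer names mirror the WGSL buffers in this PR; the splat/tile data is made up for illustration):

```javascript
// Each splat lists the tiles it overlaps (normally derived from its screen-space ellipse).
const splatTiles = [[0, 1], [1], [0, 2], [1, 2]];
const numTiles = 3;

// Pass 1: fused count + pair write. The atomic increment doubles as the
// within-tile slot assignment, so no separate scatter pass is needed.
const tileSplatCounts = new Uint32Array(numTiles);
const pairs = []; // [splatIdx, tileIdx, localOff]
splatTiles.forEach((tiles, splatIdx) => {
    for (const tileIdx of tiles) {
        const localOff = tileSplatCounts[tileIdx]++; // emulates atomicAdd(..., 1u)
        pairs.push([splatIdx, tileIdx, localOff]);
    }
});

// Pass 2: exclusive prefix sum turns per-tile counts into start offsets.
const tileStart = new Uint32Array(numTiles);
for (let t = 1; t < numTiles; t++) {
    tileStart[t] = tileStart[t - 1] + tileSplatCounts[t - 1];
}

// Pass 3: scatter-free placement — every pair already knows its final slot,
// so this pass needs no atomics at all.
const tileEntries = new Uint32Array(pairs.length);
for (const [splatIdx, tileIdx, localOff] of pairs) {
    tileEntries[tileStart[tileIdx] + localOff] = splatIdx;
}
```

Each tile's entries end up contiguous (tile 0: splats 0 and 2; tile 1: splats 0, 1, 3; tile 2: splats 2 and 3), which is what the place-entries passes rely on.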
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/scene/shader-lib/wgsl/chunks/gsplat/compute-gsplat-write-indirect-args.js | Comment clarification for indirect args writer. |
| src/scene/shader-lib/wgsl/chunks/gsplat/compute-gsplat-local-tile-count.js | Main fused tile count + pair-buffer write implementation (replaces old scatter pipeline). |
| src/scene/shader-lib/wgsl/chunks/gsplat/compute-gsplat-local-tile-count-large.js | New cooperative tile-count path for large splats. |
| src/scene/shader-lib/wgsl/chunks/gsplat/compute-gsplat-local-scatter.js | Removes the old atomic scatter pass. |
| src/scene/shader-lib/wgsl/chunks/gsplat/compute-gsplat-local-place-entries.js | New scatter-free placement pass consuming the pair buffer. |
| src/scene/shader-lib/wgsl/chunks/gsplat/compute-gsplat-local-place-entries-large.js | New cooperative placement pass for large splats. |
| src/scene/gsplat-unified/gsplat-manager.js | Updates high-level pipeline documentation/comments. |
| src/scene/gsplat-unified/gsplat-local-dispatch-set.js | Removes tile write cursor buffer; updates dispatch set fields for new passes. |
| src/scene/gsplat-unified/gsplat-compute-local-renderer.js | Wires new passes, allocates new buffers, adds indirect-dispatch prep passes, updates dispatch sequence. |
```diff
  let tileIdx = u32(ty) * uniforms.numTilesX + u32(tx);
- atomicAdd(&tileSplatCounts[tileIdx], 1u);
+ let localOff = atomicAdd(&tileSplatCounts[tileIdx], 1u);
+ if (localOff < MAX_TILE_ENTRIES) {
+     pairBuffer[myBase + j] = (tileIdx << 16u) | (localOff & 0xFFFFu);
+     j++;
```
The pair packing (tileIdx << 16) | localOffset assumes tileIdx fits in 16 bits. For high resolutions where numTilesX * numTilesY > 65535 (e.g. 8K), tileIdx << 16 will overflow and collide, causing pairs to resolve to the wrong tile in the place-entries passes. Consider storing tile coordinates or a full 32-bit tileIdx (e.g. use two u32s per pair, or pack tileX/tileY separately) so tile indexing remains valid for all supported render target sizes.
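To make the failure mode concrete, here is the packing emulated with JavaScript's `>>> 0` to get WGSL u32 semantics; the 8K tile arithmetic assumes hypothetical 16×16 pixel tiles, since the actual tile size is not shown in this diff:

```javascript
// Emulate the WGSL pair packing: high 16 bits = tileIdx, low 16 bits = localOff.
const pack = (tileIdx, localOff) => ((tileIdx << 16) | (localOff & 0xFFFF)) >>> 0;

// Safe: tileIdx fits in 16 bits, so it survives the round trip.
const ok = pack(65535, 3);        // ok >>> 16 recovers 65535

// Broken: an 8K target (7680x4320) with 16x16 tiles has 480 * 270 = 129600 tiles.
// For tileIdx 65536, (65536 << 16) wraps to 0 in u32 arithmetic, so the pair
// silently resolves to tile 0 in the place-entries passes.
const collided = pack(65536, 3);  // collided >>> 16 yields 0, not 65536
```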
```wgsl
if (tileIntersectsEllipse(tMin, tMax, screen, cx, cy, cz, radiusFactor)) {
    let tileIdx = u32(ty) * uniforms.numTilesX + u32(tx);
    let localOff = atomicAdd(&tileSplatCounts[tileIdx], 1u);
    pairBuffer[myBase + j] = (tileIdx << 16u) | (localOff & 0xFFFFu);
    j++;
}
```
localOff is masked to 16 bits when packing pairs, but this pass never checks localOff < 65535 before writing. If a tile’s atomic counter exceeds 65535, localOff & 0xFFFF will wrap and the subsequent place-entries pass will write to the wrong position within the tile’s range. Add the same MAX_TILE_ENTRIES cap as the main tile-count pass (and reduce the stored splatPairCount accordingly), or switch to an unpacked pair representation that can hold larger offsets safely.
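A small sketch of the offset-wrap hazard and the suggested cap; `MAX_TILE_ENTRIES` mirrors the constant used by the main tile-count pass, with an illustrative value:

```javascript
const MAX_TILE_ENTRIES = 0xFFFF;

// Without a cap, the 65536th entry in a tile masks back to offset 0 and
// overwrites the tile's first entry:
const wrapped = 65536 & 0xFFFF; // 0 — collides with the tile's first slot

// With the cap, overflowed pairs are dropped instead of wrapping (the stored
// splatPairCount would be reduced accordingly):
function tryPack(tileIdx, localOff) {
    if (localOff >= MAX_TILE_ENTRIES) {
        return null; // skip the pair rather than corrupt another slot
    }
    return ((tileIdx << 16) | localOff) >>> 0;
}
```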
```wgsl
for (var j: u32 = 0u; j < count; j++) {
    let packed = pairBuffer[start + j];
    let tileIdx = packed >> 16u;
    let localOff = packed & 0xFFFFu;

    // tileSplatCounts has been prefix-summed, so it holds the start offset for each tile.
    // localOff is the within-tile position assigned by atomicAdd during the count pass.
    tileEntries[tileSplatCounts[tileIdx] + localOff] = threadIdx;
```
This pass assumes start + count stays within pairBuffer and that the resulting tileEntries[tileOffset + localOff] index is in-bounds. If globalPairCounter ever allocates past pairBuffer capacity (or if tile prefix-sum total exceeds tileEntries capacity), WebGPU’s robust buffer access can turn out-of-bounds pairBuffer reads into zeros, which then corrupts tile 0’s entries instead of just dropping work. Add explicit bounds checks using arrayLength(&pairBuffer) / arrayLength(&tileEntries) (or pass a maxEntries uniform like the old scatter path) so overflow degrades by skipping writes rather than writing incorrect entries.
Suggested change:

```wgsl
let pairBufferLen = arrayLength(&pairBuffer);
let tileCountsLen = arrayLength(&tileSplatCounts);
let tileEntriesLen = arrayLength(&tileEntries);
for (var j: u32 = 0u; j < count; j++) {
    let pairIndex = start + j;
    if (pairIndex >= pairBufferLen) {
        continue;
    }
    let packed = pairBuffer[pairIndex];
    let tileIdx = packed >> 16u;
    let localOff = packed & 0xFFFFu;
    if (tileIdx >= tileCountsLen) {
        continue;
    }
    // tileSplatCounts has been prefix-summed, so it holds the start offset for each tile.
    // localOff is the within-tile position assigned by atomicAdd during the count pass.
    let entryIndex = tileSplatCounts[tileIdx] + localOff;
    if (entryIndex >= tileEntriesLen) {
        continue;
    }
    tileEntries[entryIndex] = threadIdx;
```
```wgsl
for (var j = lid; j < pairCount; j += WG_SIZE) {
    let packed = pairBuffer[start + j];
    let tileIdx = packed >> 16u;
    let localOff = packed & 0xFFFFu;
    tileEntries[tileSplatCounts[tileIdx] + localOff] = threadIdx;
```
Same overflow risk as the non-cooperative PlaceEntries pass: if start + j is out of bounds for pairBuffer (or the computed tileEntries index exceeds its capacity), robust buffer access can yield zeros and then write incorrect entries into tile 0. Add bounds checks using arrayLength(&pairBuffer) / arrayLength(&tileEntries) (or a maxEntries uniform) so large-splat processing skips overflowed pairs instead of corrupting output.
Suggested change:

```wgsl
let pairBufferLen = arrayLength(&pairBuffer);
let tileSplatCountsLen = arrayLength(&tileSplatCounts);
let tileEntriesLen = arrayLength(&tileEntries);
for (var j = lid; j < pairCount; j += WG_SIZE) {
    let pairIdx = start + j;
    if (pairIdx >= pairBufferLen) {
        continue;
    }
    let packed = pairBuffer[pairIdx];
    let tileIdx = packed >> 16u;
    let localOff = packed & 0xFFFFu;
    if (tileIdx >= tileSplatCountsLen) {
        continue;
    }
    let entryIdx = tileSplatCounts[tileIdx] + localOff;
    if (entryIdx >= tileEntriesLen) {
        continue;
    }
    tileEntries[entryIdx] = threadIdx;
```
```js
 * Indirect dispatch slot index for key gen (first of 3 consecutive slots).
 * Slot +0 = key gen, slot +1 = sort, slot +2 = place-entries.
```
The doc/comment says the indirect dispatch slot is the “first of 3 consecutive slots” (key gen, sort, place-entries), but compute-gsplat-write-indirect-args still writes only two dispatch arg triplets (key gen + sort). The compute local renderer also builds its own indirect args buffers for count/place-entries rather than using the shared slot. Please update this comment to match the actual indirect-dispatch usage to avoid misleading future changes.
Suggested change:

```js
 * Indirect dispatch slot index for GPU-sort indirect dispatch args.
 * Slot +0 = key gen, slot +1 = sort.
 * Place-entries/count indirect args are built separately by the compute
 * local renderer and do not use this shared slot.
```
```js
// writeIndirectArgs is the only path that does this. The local renderer uses
// dispatch slot +2 (place-entries) for indirect dispatch.
```
This comment states the local compute renderer uses “dispatch slot +2 (place-entries) for indirect dispatch”, but the compute local renderer now generates indirect dispatch args in its own private buffers (PlaceEntryPrep/LargeSplatPrep) and does not rely on a third slot in the shared indirect-dispatch buffer. Please adjust the comment so it matches the current implementation.
Suggested change:

```js
// writeIndirectArgs is the only path that does this. The local compute renderer
// prepares its own indirect dispatch args in private buffers and does not use
// a third slot in the shared indirect-dispatch buffer.
```
…le comments
- Add MAX_TILE_ENTRIES cap to large tile count pass to prevent localOffset wraparound
- Add bounds checks in PlaceEntries and LargePlaceEntries to prevent WebGPU robust buffer access from corrupting tile 0 entries on overflow
- Clamp tile count to 65535 with a warning when render target exceeds the 16-bit tileIdx packing limit (~5K resolution)
- Fix stale dispatch slot comments in gsplat-manager.js
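A back-of-envelope check of the "~5K resolution" packing limit mentioned above, assuming hypothetical 16×16 pixel tiles (the actual tile size is not shown in this excerpt):

```javascript
// Count screen tiles for a render target, rounding partial edge tiles up.
const TILE = 16; // assumed tile size in pixels
const tileCount = (w, h) => Math.ceil(w / TILE) * Math.ceil(h / TILE);

const at5k = tileCount(5120, 2880); // 320 * 180 = 57600  — fits in 16 bits
const at8k = tileCount(7680, 4320); // 480 * 270 = 129600 — exceeds 65535
```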
Replaces the expensive per-splat atomic scatter pass in the compute GSplat renderer with a scatter-free pair-buffer approach and adds cooperative processing for large splats.