
perf: replace atomic scatter with pair-buffer tile binning and cooperative large-splat processing#8586

Merged
mvaligursky merged 2 commits into main from mv-compute-splat-scatter-free on Apr 10, 2026
Conversation

@mvaligursky
Contributor

@mvaligursky mvaligursky commented Apr 10, 2026

Replaces the expensive per-splat atomic scatter pass in the compute GSplat renderer with a scatter-free pair-buffer approach and adds cooperative processing for large splats.

Changes:

  • Replace the atomic scatter pass with a fused count+pair-write design in the tile count pass. The old scatter pass re-read projCache and recomputed tile intersections just to place splat indices via global atomics, which was redundant and expensive. The new approach iterates tiles twice within the same dispatch: first to count intersections and build a bitmask, then, after a workgroup prefix sum and a single global atomicAdd, to write (tileIdx, localOffset) pairs into a contiguous pair buffer.
  • Add a lightweight PlaceEntries pass that reads pairs and writes tileEntries at deterministic positions using prefix-summed offsets, with zero atomics and zero projCache reads.
  • Defer large splats (AABB > 64 tiles) to cooperative passes where one workgroup of 256 threads processes each splat in parallel, eliminating the wavefront divergence that caused long GPU tails (occupancy dropping to ~2%)
  • Large splats are flagged via the high bit of splatPairCount so the regular PlaceEntries pass skips them; a separate cooperative LargePlaceEntries pass picks them up
  • The largeSplatIds buffer is grow-only, sized via async GPU readback
  • Guard against degenerate AABBs (where maxTile < minTile due to capScale radius shrinkage) that could cause u32 wraparound and GPU hangs
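The count / prefix-sum / place flow described above can be sketched as a small CPU model in plain JavaScript (the actual implementation is WGSL compute; the function names and the single shared pair array here are illustrative only, and per-tile atomics become plain increments on the CPU):

```javascript
// Pass 1 (fused count + pair write): each splat bumps a per-tile counter to
// claim a within-tile slot, and records a packed (tileIdx, localOff) pair.
function countAndWritePairs(splatTiles, numTiles) {
    const tileCounts = new Uint32Array(numTiles);
    const pairs = [];
    for (const tiles of splatTiles) {                 // tiles hit by one splat
        for (const tileIdx of tiles) {
            const localOff = tileCounts[tileIdx]++;   // atomicAdd in WGSL
            pairs.push(((tileIdx << 16) | (localOff & 0xFFFF)) >>> 0);
        }
    }
    return { tileCounts, pairs };
}

// Exclusive prefix sum turns per-tile counts into per-tile start offsets.
function prefixSum(counts) {
    const offsets = new Uint32Array(counts.length);
    let sum = 0;
    for (let i = 0; i < counts.length; i++) {
        offsets[i] = sum;
        sum += counts[i];
    }
    return offsets;
}

// Pass 2 (PlaceEntries): every pair resolves to a unique, deterministic
// position in tileEntries, so no atomics are needed at all.
function placeEntries(pairs, tileOffsets, splatTiles) {
    const tileEntries = new Uint32Array(pairs.length);
    let p = 0;
    splatTiles.forEach((tiles, splatIdx) => {
        for (let j = 0; j < tiles.length; j++, p++) {
            const tileIdx = pairs[p] >>> 16;
            const localOff = pairs[p] & 0xFFFF;
            tileEntries[tileOffsets[tileIdx] + localOff] = splatIdx;
        }
    });
    return tileEntries;
}
```

With three splats hitting tiles [[0, 1], [1], [0]], tile 0's entries come out contiguous at offset 0 and tile 1's at offset 2, without any write-time contention.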

Performance:

  • Reduces global atomic operations from ~44M (old scatter pass) to ~60K (one per workgroup)
  • Removes redundant projCache reads and tile intersection recomputation from the scatter pass
  • Large splat cooperative processing eliminates the long GPU tail on tile count and place entries passes
  • A test scene with 17M splats on an M4 shows a 20% performance improvement

…ative large-splat processing

Replace the expensive per-splat atomic scatter pass in the compute GSplat renderer
with a fused count+pair-write approach and a lightweight PlaceEntries pass, eliminating
redundant projCache reads and reducing global atomic contention. Large splats (>64 tiles)
are deferred to cooperative passes where 256 threads process each splat in parallel,
eliminating wavefront divergence that caused long GPU tails.
Contributor

Copilot AI left a comment


Pull request overview

Refactors the WebGPU compute GSplat renderer’s tile binning pipeline to remove the per-splat atomic scatter pass, replacing it with a pair-buffer approach and adding cooperative processing for large splats to reduce GPU tail latency.

Changes:

  • Fuses tile counting with pair-buffer allocation/writes, eliminating the separate atomic scatter pass and redundant projCache re-reads.
  • Adds scatter-free PlaceEntries plus cooperative LargeTileCount / LargePlaceEntries passes for splats spanning many tiles.
  • Updates renderer wiring/buffers and refreshes high-level documentation/comments to reflect the new pipeline.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

File Description
src/scene/shader-lib/wgsl/chunks/gsplat/compute-gsplat-write-indirect-args.js Comment clarification for indirect args writer.
src/scene/shader-lib/wgsl/chunks/gsplat/compute-gsplat-local-tile-count.js Main fused tile count + pair-buffer write implementation (replaces old scatter pipeline).
src/scene/shader-lib/wgsl/chunks/gsplat/compute-gsplat-local-tile-count-large.js New cooperative tile-count path for large splats.
src/scene/shader-lib/wgsl/chunks/gsplat/compute-gsplat-local-scatter.js Removes the old atomic scatter pass.
src/scene/shader-lib/wgsl/chunks/gsplat/compute-gsplat-local-place-entries.js New scatter-free placement pass consuming the pair buffer.
src/scene/shader-lib/wgsl/chunks/gsplat/compute-gsplat-local-place-entries-large.js New cooperative placement pass for large splats.
src/scene/gsplat-unified/gsplat-manager.js Updates high-level pipeline documentation/comments.
src/scene/gsplat-unified/gsplat-local-dispatch-set.js Removes tile write cursor buffer; updates dispatch set fields for new passes.
src/scene/gsplat-unified/gsplat-compute-local-renderer.js Wires new passes, allocates new buffers, adds indirect-dispatch prep passes, updates dispatch sequence.


Comment on lines 307 to +311
      let tileIdx = u32(ty) * uniforms.numTilesX + u32(tx);
-     atomicAdd(&tileSplatCounts[tileIdx], 1u);
+     let localOff = atomicAdd(&tileSplatCounts[tileIdx], 1u);
+     if (localOff < MAX_TILE_ENTRIES) {
+         pairBuffer[myBase + j] = (tileIdx << 16u) | (localOff & 0xFFFFu);
+         j++;

Copilot AI Apr 10, 2026


The pair packing (tileIdx << 16) | localOffset assumes tileIdx fits in 16 bits. For high resolutions where numTilesX * numTilesY > 65535 (e.g. 8K), tileIdx << 16 will overflow and collide, causing pairs to resolve to the wrong tile in the place-entries passes. Consider storing tile coordinates or a full 32-bit tileIdx (e.g. use two u32s per pair, or pack tileX/tileY separately) so tile indexing remains valid for all supported render target sizes.
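The collision the review describes can be shown with a standalone JavaScript sketch (not PR code; JS bitwise operators work on 32 bits, matching WGSL u32 shift behavior here):

```javascript
// Pack a tile index and within-tile offset into one u32, as the WGSL does:
// (tileIdx << 16u) | (localOff & 0xFFFFu). The >>> 0 keeps the result unsigned.
function packPair(tileIdx, localOff) {
    return ((tileIdx << 16) | (localOff & 0xFFFF)) >>> 0;
}

function unpackTileIdx(packed) {
    return packed >>> 16;
}

// An 8K-class render target can exceed 65535 tiles. Tile 70000 then aliases
// tile 4464 (70000 mod 65536) because the high bits are shifted out of the u32.
```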

Comment on lines +135 to +140
if (tileIntersectsEllipse(tMin, tMax, screen, cx, cy, cz, radiusFactor)) {
    let tileIdx = u32(ty) * uniforms.numTilesX + u32(tx);
    let localOff = atomicAdd(&tileSplatCounts[tileIdx], 1u);
    pairBuffer[myBase + j] = (tileIdx << 16u) | (localOff & 0xFFFFu);
    j++;
}

Copilot AI Apr 10, 2026


localOff is masked to 16 bits when packing pairs, but this pass never checks localOff < 65535 before writing. If a tile’s atomic counter exceeds 65535, localOff & 0xFFFF will wrap and the subsequent place-entries pass will write to the wrong position within the tile’s range. Add the same MAX_TILE_ENTRIES cap as the main tile-count pass (and reduce the stored splatPairCount accordingly), or switch to an unpacked pair representation that can hold larger offsets safely.
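The wraparound is easy to demonstrate in isolation (standalone JavaScript, not PR code; the cap name mirrors the MAX_TILE_ENTRIES constant the review references from the main tile-count pass, and its value here is an assumption):

```javascript
// Assumed value: the largest offset representable in the 16-bit pack.
const MAX_TILE_ENTRIES = 65535;

// What the 16-bit pack actually stores: localOff & 0xFFFFu in the WGSL.
function storedLocalOff(localOff) {
    return localOff & 0xFFFF;
}

// With the suggested cap, overflowing entries are dropped instead of wrapping
// back to offset 0 and overwriting the tile's earliest entries.
function cappedLocalOff(localOff) {
    return localOff < MAX_TILE_ENTRIES ? storedLocalOff(localOff) : null;
}
```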

Comment on lines +32 to +40

for (var j: u32 = 0u; j < count; j++) {
    let packed = pairBuffer[start + j];
    let tileIdx = packed >> 16u;
    let localOff = packed & 0xFFFFu;

    // tileSplatCounts has been prefix-summed, so it holds the start offset for each tile.
    // localOff is the within-tile position assigned by atomicAdd during the count pass.
    tileEntries[tileSplatCounts[tileIdx] + localOff] = threadIdx;

Copilot AI Apr 10, 2026


This pass assumes start + count stays within pairBuffer and that the resulting tileEntries[tileOffset + localOff] index is in-bounds. If globalPairCounter ever allocates past pairBuffer capacity (or if tile prefix-sum total exceeds tileEntries capacity), WebGPU’s robust buffer access can turn out-of-bounds pairBuffer reads into zeros, which then corrupts tile 0’s entries instead of just dropping work. Add explicit bounds checks using arrayLength(&pairBuffer) / arrayLength(&tileEntries) (or pass a maxEntries uniform like the old scatter path) so overflow degrades by skipping writes rather than writing incorrect entries.

Suggested change
-for (var j: u32 = 0u; j < count; j++) {
-    let packed = pairBuffer[start + j];
-    let tileIdx = packed >> 16u;
-    let localOff = packed & 0xFFFFu;
-    // tileSplatCounts has been prefix-summed, so it holds the start offset for each tile.
-    // localOff is the within-tile position assigned by atomicAdd during the count pass.
-    tileEntries[tileSplatCounts[tileIdx] + localOff] = threadIdx;
+let pairBufferLen = arrayLength(&pairBuffer);
+let tileCountsLen = arrayLength(&tileSplatCounts);
+let tileEntriesLen = arrayLength(&tileEntries);
+for (var j: u32 = 0u; j < count; j++) {
+    let pairIndex = start + j;
+    if (pairIndex >= pairBufferLen) {
+        continue;
+    }
+    let packed = pairBuffer[pairIndex];
+    let tileIdx = packed >> 16u;
+    let localOff = packed & 0xFFFFu;
+    if (tileIdx >= tileCountsLen) {
+        continue;
+    }
+    // tileSplatCounts has been prefix-summed, so it holds the start offset for each tile.
+    // localOff is the within-tile position assigned by atomicAdd during the count pass.
+    let entryIndex = tileSplatCounts[tileIdx] + localOff;
+    if (entryIndex >= tileEntriesLen) {
+        continue;
+    }
+    tileEntries[entryIndex] = threadIdx;

Comment on lines +41 to +46

for (var j = lid; j < pairCount; j += WG_SIZE) {
    let packed = pairBuffer[start + j];
    let tileIdx = packed >> 16u;
    let localOff = packed & 0xFFFFu;
    tileEntries[tileSplatCounts[tileIdx] + localOff] = threadIdx;

Copilot AI Apr 10, 2026


Same overflow risk as the non-cooperative PlaceEntries pass: if start + j is out of bounds for pairBuffer (or the computed tileEntries index exceeds its capacity), robust buffer access can yield zeros and then write incorrect entries into tile 0. Add bounds checks using arrayLength(&pairBuffer) / arrayLength(&tileEntries) (or a maxEntries uniform) so large-splat processing skips overflowed pairs instead of corrupting output.

Suggested change
-for (var j = lid; j < pairCount; j += WG_SIZE) {
-    let packed = pairBuffer[start + j];
-    let tileIdx = packed >> 16u;
-    let localOff = packed & 0xFFFFu;
-    tileEntries[tileSplatCounts[tileIdx] + localOff] = threadIdx;
+let pairBufferLen = arrayLength(&pairBuffer);
+let tileSplatCountsLen = arrayLength(&tileSplatCounts);
+let tileEntriesLen = arrayLength(&tileEntries);
+for (var j = lid; j < pairCount; j += WG_SIZE) {
+    let pairIdx = start + j;
+    if (pairIdx >= pairBufferLen) {
+        continue;
+    }
+    let packed = pairBuffer[pairIdx];
+    let tileIdx = packed >> 16u;
+    let localOff = packed & 0xFFFFu;
+    if (tileIdx >= tileSplatCountsLen) {
+        continue;
+    }
+    let entryIdx = tileSplatCounts[tileIdx] + localOff;
+    if (entryIdx >= tileEntriesLen) {
+        continue;
+    }
+    tileEntries[entryIdx] = threadIdx;

Comment on lines +182 to +183
* Indirect dispatch slot index for key gen (first of 3 consecutive slots).
* Slot +0 = key gen, slot +1 = sort, slot +2 = place-entries.

Copilot AI Apr 10, 2026


The doc/comment says the indirect dispatch slot is the “first of 3 consecutive slots” (key gen, sort, place-entries), but compute-gsplat-write-indirect-args still writes only two dispatch arg triplets (key gen + sort). The compute local renderer also builds its own indirect args buffers for count/place-entries rather than using the shared slot. Please update this comment to match the actual indirect-dispatch usage to avoid misleading future changes.

Suggested change
-* Indirect dispatch slot index for key gen (first of 3 consecutive slots).
-* Slot +0 = key gen, slot +1 = sort, slot +2 = place-entries.
+* Indirect dispatch slot index for GPU-sort indirect dispatch args.
+* Slot +0 = key gen, slot +1 = sort.
+* Place-entries/count indirect args are built separately by the compute
+* local renderer and do not use this shared slot.

Comment on lines +1653 to +1654
// writeIndirectArgs is the only path that does this. The local renderer uses
// dispatch slot +2 (place-entries) for indirect dispatch.

Copilot AI Apr 10, 2026


This comment states the local compute renderer uses “dispatch slot +2 (place-entries) for indirect dispatch”, but the compute local renderer now generates indirect dispatch args in its own private buffers (PlaceEntryPrep/LargeSplatPrep) and does not rely on a third slot in the shared indirect-dispatch buffer. Please adjust the comment so it matches the current implementation.

Suggested change
-// writeIndirectArgs is the only path that does this. The local renderer uses
-// dispatch slot +2 (place-entries) for indirect dispatch.
+// writeIndirectArgs is the only path that does this. The local compute renderer
+// prepares its own indirect dispatch args in private buffers and does not use
+// a third slot in the shared indirect-dispatch buffer.

…le comments

- Add MAX_TILE_ENTRIES cap to large tile count pass to prevent localOffset wraparound
- Add bounds checks in PlaceEntries and LargePlaceEntries to prevent WebGPU robust
  buffer access from corrupting tile 0 entries on overflow
- Clamp tile count to 65535 with a warning when render target exceeds the 16-bit
  tileIdx packing limit (~5K resolution)
- Fix stale dispatch slot comments in gsplat-manager.js
@mvaligursky mvaligursky merged commit a90ac72 into main Apr 10, 2026
8 checks passed
@mvaligursky mvaligursky deleted the mv-compute-splat-scatter-free branch April 10, 2026 14:50

Labels

area: graphics Graphics related issue performance Relating to load times or frame rate
