
perf: replace atomic scatter with pair-buffer tile binning and cooperative large-splat processing#8586

Merged
mvaligursky merged 2 commits into main from mv-compute-splat-scatter-free on Apr 10, 2026
Conversation

@mvaligursky
Contributor

@mvaligursky mvaligursky commented Apr 10, 2026

Replaces the expensive per-splat atomic scatter pass in the compute GSplat renderer with a scatter-free pair-buffer approach and adds cooperative processing for large splats.

Changes:

  • Replace the atomic scatter pass with a fused count+pair-write design in the tile count pass. The old scatter pass re-read projCache and recomputed tile intersections just to place splat indices via global atomics, which was redundant and expensive. The new approach iterates tiles twice within the same dispatch: first to count intersections and build a bitmask, then, after a workgroup prefix sum and a single global atomicAdd, to write (tileIdx, localOffset) pairs into a contiguous pair buffer.
  • Add a lightweight PlaceEntries pass that reads pairs and writes tileEntries at deterministic positions using prefix-summed offsets, with zero atomics and zero projCache reads.
  • Defer large splats (AABB > 64 tiles) to cooperative passes where one workgroup of 256 threads processes each splat in parallel, eliminating the wavefront divergence that caused long GPU tails (occupancy dropping to ~2%)
  • Large splats are flagged via the high bit of splatPairCount so the regular PlaceEntries pass skips them; a separate cooperative LargePlaceEntries pass picks them up
  • The largeSplatIds buffer is grow-only, sized via async GPU readback
  • Guard against degenerate AABBs (where maxTile < minTile due to capScale radius shrinkage) that could cause u32 wraparound and GPU hangs
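The count / prefix-sum / place flow described above can be sketched as a small CPU model in plain JavaScript (the actual implementation is WGSL compute; the function names and the single shared pair array here are illustrative only, and per-tile atomics become plain increments on the CPU):

```javascript
// Pass 1 (fused count + pair write): each splat bumps a per-tile counter to
// claim a within-tile slot, and records a packed (tileIdx, localOff) pair.
function countAndWritePairs(splatTiles, numTiles) {
    const tileCounts = new Uint32Array(numTiles);
    const pairs = [];
    for (const tiles of splatTiles) {                 // tiles hit by one splat
        for (const tileIdx of tiles) {
            const localOff = tileCounts[tileIdx]++;   // atomicAdd in WGSL
            pairs.push(((tileIdx << 16) | (localOff & 0xFFFF)) >>> 0);
        }
    }
    return { tileCounts, pairs };
}

// Exclusive prefix sum turns per-tile counts into per-tile start offsets.
function prefixSum(counts) {
    const offsets = new Uint32Array(counts.length);
    let sum = 0;
    for (let i = 0; i < counts.length; i++) {
        offsets[i] = sum;
        sum += counts[i];
    }
    return offsets;
}

// Pass 2 (PlaceEntries): every pair resolves to a unique, deterministic
// position in tileEntries, so no atomics are needed at all.
function placeEntries(pairs, tileOffsets, splatTiles) {
    const tileEntries = new Uint32Array(pairs.length);
    let p = 0;
    splatTiles.forEach((tiles, splatIdx) => {
        for (let j = 0; j < tiles.length; j++, p++) {
            const tileIdx = pairs[p] >>> 16;
            const localOff = pairs[p] & 0xFFFF;
            tileEntries[tileOffsets[tileIdx] + localOff] = splatIdx;
        }
    });
    return tileEntries;
}
```

With three splats hitting tiles [[0, 1], [1], [0]], tile 0's entries come out contiguous at offset 0 and tile 1's at offset 2, without any write-time contention.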

Performance:

  • Reduces global atomic operations from ~44M (old scatter pass) to ~60K (one per workgroup)
  • Removes redundant projCache reads and tile intersection recomputation from the scatter pass
  • Large splat cooperative processing eliminates the long GPU tail on tile count and place entries passes
  • A test scene with 17M splats on an M4 shows a 20% performance improvement

…ative large-splat processing

Replace the expensive per-splat atomic scatter pass in the compute GSplat renderer
with a fused count+pair-write approach and a lightweight PlaceEntries pass, eliminating
redundant projCache reads and reducing global atomic contention. Large splats (>64 tiles)
are deferred to cooperative passes where 256 threads process each splat in parallel,
eliminating wavefront divergence that caused long GPU tails.
Contributor

Copilot AI left a comment


Pull request overview

Refactors the WebGPU compute GSplat renderer’s tile binning pipeline to remove the per-splat atomic scatter pass, replacing it with a pair-buffer approach and adding cooperative processing for large splats to reduce GPU tail latency.

Changes:

  • Fuses tile counting with pair-buffer allocation/writes, eliminating the separate atomic scatter pass and redundant projCache re-reads.
  • Adds scatter-free PlaceEntries plus cooperative LargeTileCount / LargePlaceEntries passes for splats spanning many tiles.
  • Updates renderer wiring/buffers and refreshes high-level documentation/comments to reflect the new pipeline.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

File Description
src/scene/shader-lib/wgsl/chunks/gsplat/compute-gsplat-write-indirect-args.js Comment clarification for indirect args writer.
src/scene/shader-lib/wgsl/chunks/gsplat/compute-gsplat-local-tile-count.js Main fused tile count + pair-buffer write implementation (replaces old scatter pipeline).
src/scene/shader-lib/wgsl/chunks/gsplat/compute-gsplat-local-tile-count-large.js New cooperative tile-count path for large splats.
src/scene/shader-lib/wgsl/chunks/gsplat/compute-gsplat-local-scatter.js Removes the old atomic scatter pass.
src/scene/shader-lib/wgsl/chunks/gsplat/compute-gsplat-local-place-entries.js New scatter-free placement pass consuming the pair buffer.
src/scene/shader-lib/wgsl/chunks/gsplat/compute-gsplat-local-place-entries-large.js New cooperative placement pass for large splats.
src/scene/gsplat-unified/gsplat-manager.js Updates high-level pipeline documentation/comments.
src/scene/gsplat-unified/gsplat-local-dispatch-set.js Removes tile write cursor buffer; updates dispatch set fields for new passes.
src/scene/gsplat-unified/gsplat-compute-local-renderer.js Wires new passes, allocates new buffers, adds indirect-dispatch prep passes, updates dispatch sequence.


Comment on lines 307 to +311
      let tileIdx = u32(ty) * uniforms.numTilesX + u32(tx);
-     atomicAdd(&tileSplatCounts[tileIdx], 1u);
+     let localOff = atomicAdd(&tileSplatCounts[tileIdx], 1u);
+     if (localOff < MAX_TILE_ENTRIES) {
+         pairBuffer[myBase + j] = (tileIdx << 16u) | (localOff & 0xFFFFu);
+         j++;

Copilot AI Apr 10, 2026


The pair packing (tileIdx << 16) | localOffset assumes tileIdx fits in 16 bits. For high resolutions where numTilesX * numTilesY > 65535 (e.g. 8K), tileIdx << 16 will overflow and collide, causing pairs to resolve to the wrong tile in the place-entries passes. Consider storing tile coordinates or a full 32-bit tileIdx (e.g. use two u32s per pair, or pack tileX/tileY separately) so tile indexing remains valid for all supported render target sizes.
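The collision the review describes can be shown with a standalone JavaScript sketch (not PR code; JS bitwise operators work on 32 bits, matching WGSL u32 shift behavior here):

```javascript
// Pack a tile index and within-tile offset into one u32, as the WGSL does:
// (tileIdx << 16u) | (localOff & 0xFFFFu). The >>> 0 keeps the result unsigned.
function packPair(tileIdx, localOff) {
    return ((tileIdx << 16) | (localOff & 0xFFFF)) >>> 0;
}

function unpackTileIdx(packed) {
    return packed >>> 16;
}

// An 8K-class render target can exceed 65535 tiles. Tile 70000 then aliases
// tile 4464 (70000 mod 65536) because the high bits are shifted out of the u32.
```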

Comment on lines +135 to +140
if (tileIntersectsEllipse(tMin, tMax, screen, cx, cy, cz, radiusFactor)) {
    let tileIdx = u32(ty) * uniforms.numTilesX + u32(tx);
    let localOff = atomicAdd(&tileSplatCounts[tileIdx], 1u);
    pairBuffer[myBase + j] = (tileIdx << 16u) | (localOff & 0xFFFFu);
    j++;
}

Copilot AI Apr 10, 2026


localOff is masked to 16 bits when packing pairs, but this pass never checks localOff < 65535 before writing. If a tile’s atomic counter exceeds 65535, localOff & 0xFFFF will wrap and the subsequent place-entries pass will write to the wrong position within the tile’s range. Add the same MAX_TILE_ENTRIES cap as the main tile-count pass (and reduce the stored splatPairCount accordingly), or switch to an unpacked pair representation that can hold larger offsets safely.
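The wraparound is easy to demonstrate in isolation (standalone JavaScript, not PR code; the cap name mirrors the MAX_TILE_ENTRIES constant the review references from the main tile-count pass, and its value here is an assumption):

```javascript
// Assumed value: the largest offset representable in the 16-bit pack.
const MAX_TILE_ENTRIES = 65535;

// What the 16-bit pack actually stores: localOff & 0xFFFFu in the WGSL.
function storedLocalOff(localOff) {
    return localOff & 0xFFFF;
}

// With the suggested cap, overflowing entries are dropped instead of wrapping
// back to offset 0 and overwriting the tile's earliest entries.
function cappedLocalOff(localOff) {
    return localOff < MAX_TILE_ENTRIES ? storedLocalOff(localOff) : null;
}
```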

Comment on lines +32 to +40

for (var j: u32 = 0u; j < count; j++) {
    let packed = pairBuffer[start + j];
    let tileIdx = packed >> 16u;
    let localOff = packed & 0xFFFFu;

    // tileSplatCounts has been prefix-summed, so it holds the start offset for each tile.
    // localOff is the within-tile position assigned by atomicAdd during the count pass.
    tileEntries[tileSplatCounts[tileIdx] + localOff] = threadIdx;

Copilot AI Apr 10, 2026


This pass assumes start + count stays within pairBuffer and that the resulting tileEntries[tileOffset + localOff] index is in-bounds. If globalPairCounter ever allocates past pairBuffer capacity (or if tile prefix-sum total exceeds tileEntries capacity), WebGPU’s robust buffer access can turn out-of-bounds pairBuffer reads into zeros, which then corrupts tile 0’s entries instead of just dropping work. Add explicit bounds checks using arrayLength(&pairBuffer) / arrayLength(&tileEntries) (or pass a maxEntries uniform like the old scatter path) so overflow degrades by skipping writes rather than writing incorrect entries.

Suggested change
-for (var j: u32 = 0u; j < count; j++) {
-    let packed = pairBuffer[start + j];
-    let tileIdx = packed >> 16u;
-    let localOff = packed & 0xFFFFu;
-    // tileSplatCounts has been prefix-summed, so it holds the start offset for each tile.
-    // localOff is the within-tile position assigned by atomicAdd during the count pass.
-    tileEntries[tileSplatCounts[tileIdx] + localOff] = threadIdx;
+let pairBufferLen = arrayLength(&pairBuffer);
+let tileCountsLen = arrayLength(&tileSplatCounts);
+let tileEntriesLen = arrayLength(&tileEntries);
+for (var j: u32 = 0u; j < count; j++) {
+    let pairIndex = start + j;
+    if (pairIndex >= pairBufferLen) {
+        continue;
+    }
+    let packed = pairBuffer[pairIndex];
+    let tileIdx = packed >> 16u;
+    let localOff = packed & 0xFFFFu;
+    if (tileIdx >= tileCountsLen) {
+        continue;
+    }
+    // tileSplatCounts has been prefix-summed, so it holds the start offset for each tile.
+    // localOff is the within-tile position assigned by atomicAdd during the count pass.
+    let entryIndex = tileSplatCounts[tileIdx] + localOff;
+    if (entryIndex >= tileEntriesLen) {
+        continue;
+    }
+    tileEntries[entryIndex] = threadIdx;

Comment on lines +41 to +46

for (var j = lid; j < pairCount; j += WG_SIZE) {
    let packed = pairBuffer[start + j];
    let tileIdx = packed >> 16u;
    let localOff = packed & 0xFFFFu;
    tileEntries[tileSplatCounts[tileIdx] + localOff] = threadIdx;

Copilot AI Apr 10, 2026


Same overflow risk as the non-cooperative PlaceEntries pass: if start + j is out of bounds for pairBuffer (or the computed tileEntries index exceeds its capacity), robust buffer access can yield zeros and then write incorrect entries into tile 0. Add bounds checks using arrayLength(&pairBuffer) / arrayLength(&tileEntries) (or a maxEntries uniform) so large-splat processing skips overflowed pairs instead of corrupting output.

Suggested change
-for (var j = lid; j < pairCount; j += WG_SIZE) {
-    let packed = pairBuffer[start + j];
-    let tileIdx = packed >> 16u;
-    let localOff = packed & 0xFFFFu;
-    tileEntries[tileSplatCounts[tileIdx] + localOff] = threadIdx;
+let pairBufferLen = arrayLength(&pairBuffer);
+let tileSplatCountsLen = arrayLength(&tileSplatCounts);
+let tileEntriesLen = arrayLength(&tileEntries);
+for (var j = lid; j < pairCount; j += WG_SIZE) {
+    let pairIdx = start + j;
+    if (pairIdx >= pairBufferLen) {
+        continue;
+    }
+    let packed = pairBuffer[pairIdx];
+    let tileIdx = packed >> 16u;
+    let localOff = packed & 0xFFFFu;
+    if (tileIdx >= tileSplatCountsLen) {
+        continue;
+    }
+    let entryIdx = tileSplatCounts[tileIdx] + localOff;
+    if (entryIdx >= tileEntriesLen) {
+        continue;
+    }
+    tileEntries[entryIdx] = threadIdx;

Comment on lines +182 to +183
* Indirect dispatch slot index for key gen (first of 3 consecutive slots).
* Slot +0 = key gen, slot +1 = sort, slot +2 = place-entries.

Copilot AI Apr 10, 2026


The doc/comment says the indirect dispatch slot is the “first of 3 consecutive slots” (key gen, sort, place-entries), but compute-gsplat-write-indirect-args still writes only two dispatch arg triplets (key gen + sort). The compute local renderer also builds its own indirect args buffers for count/place-entries rather than using the shared slot. Please update this comment to match the actual indirect-dispatch usage to avoid misleading future changes.

Suggested change
-* Indirect dispatch slot index for key gen (first of 3 consecutive slots).
-* Slot +0 = key gen, slot +1 = sort, slot +2 = place-entries.
+* Indirect dispatch slot index for GPU-sort indirect dispatch args.
+* Slot +0 = key gen, slot +1 = sort.
+* Place-entries/count indirect args are built separately by the compute
+* local renderer and do not use this shared slot.

Comment on lines +1653 to +1654
// writeIndirectArgs is the only path that does this. The local renderer uses
// dispatch slot +2 (place-entries) for indirect dispatch.

Copilot AI Apr 10, 2026


This comment states the local compute renderer uses “dispatch slot +2 (place-entries) for indirect dispatch”, but the compute local renderer now generates indirect dispatch args in its own private buffers (PlaceEntryPrep/LargeSplatPrep) and does not rely on a third slot in the shared indirect-dispatch buffer. Please adjust the comment so it matches the current implementation.

Suggested change
-// writeIndirectArgs is the only path that does this. The local renderer uses
-// dispatch slot +2 (place-entries) for indirect dispatch.
+// writeIndirectArgs is the only path that does this. The local compute renderer
+// prepares its own indirect dispatch args in private buffers and does not use
+// a third slot in the shared indirect-dispatch buffer.

…le comments

- Add MAX_TILE_ENTRIES cap to large tile count pass to prevent localOffset wraparound
- Add bounds checks in PlaceEntries and LargePlaceEntries to prevent WebGPU robust
  buffer access from corrupting tile 0 entries on overflow
- Clamp tile count to 65535 with a warning when render target exceeds the 16-bit
  tileIdx packing limit (~5K resolution)
- Fix stale dispatch slot comments in gsplat-manager.js
@mvaligursky mvaligursky merged commit a90ac72 into main Apr 10, 2026
8 checks passed
@mvaligursky mvaligursky deleted the mv-compute-splat-scatter-free branch April 10, 2026 14:50

Labels

area: graphics Graphics related issue performance Relating to load times or frame rate
