Handle duplicate pixels in sparse pixel gaussian rendering#488
Conversation
Deduplicate pixel coordinates at the binding layer before passing them to computeSparseInfo and the rasterization kernels, then scatter results back via index_select. This avoids CUDA kernel changes while correctly handling duplicate (batchIdx, row, col) entries that cause incorrect tile bitmasks. - Add deduplicatePixels() that encodes pixels as int64 keys, sorts to find unique groups, and builds inverse indices for scatter-back - Handle single-list JaggedTensors where jidx() is empty - Pass deduplicated pixels through all sparse render paths (images, depths, num_contributing, contributing_ids) and scatter results back - Reduce numContributingGaussians to unique-pixel space before passing to the contributing IDs kernel - Add C++ unit tests for deduplicatePixels (20 tests) - Add Python end-to-end tests for all sparse render APIs with duplicate pixels (6 tests) Fixes openvdb#106 Signed-off-by: Mark Harris <mharris@nvidia.com> Made-with: Cursor
Signed-off-by: Mark Harris <mharris@nvidia.com> Made-with: Cursor
blackencino
left a comment
There was a problem hiding this comment.
LGTM, Ship it!
Looking specifically at the actual de-duplication, every part of what exists in the C++ there could have been done as a torch python operation. Is there a specific reason this needed to be compiled?
No reason other than that the current implementation is all C++. The original proposal was a much more complicated approach that used custom kernels. Performance may be lower in Python, naturally, but I don't think this operation will be a bottleneck. |
swahtz
left a comment
There was a problem hiding this comment.
Looks great I just had a few potential optimization suggestions since these will get called on every rendering iteration.
I was also wondering for the unit tests if it would be worth having a test where all pixels in each batch are duplicates or perhaps that's redundant with the tests were just some pixels are duplicates.
Use torch::bincount instead of zeros+scatter_add_ for counting unique pixels per batch, and reuse uniqueBatchIdx directly as the new jidx instead of round-tripping through jidx_from_joffsets. Co-authored-by: swahtz <2375296+swahtz@users.noreply.github.com> Signed-off-by: Mark Harris <mharris@nvidia.com> Made-with: Cursor
cumsum on a bool tensor already returns int64, so the explicit cast allocated an unnecessary temporary. Also update the docstring to describe the actual sort-based dedup rather than torch::unique. Signed-off-by: Mark Harris <mharris@nvidia.com> Made-with: Cursor
Test the maximum-compression case where every batch collapses to a single unique pixel, verifying offsets and inverse indices are correct. Co-authored-by: swahtz <2375296+swahtz@users.noreply.github.com> Signed-off-by: Mark Harris <mharris@nvidia.com> Made-with: Cursor
|
Great suggestions @swahtz . Implemented your two optimizations and found one other and a stale comment. Also added your suggested maximum compression test case. |
…icate-pixels-in-sparse-pixel-
…plicate-pixels-in-sparse-pixel-
…plicate-pixels-in-sparse-pixel- Signed-off-by: Mark Harris <mharris@nvidia.com> Made-with: Cursor # Conflicts: # tests/unit/test_gaussian_splat_3d.py
fwilliams
left a comment
There was a problem hiding this comment.
A few small notes on the algorithm. I think this could be tightened up to use fewer allocations. This could also be implemented with fixed up front memory using a couple of cuda kernels if you wanted to do it that way. I'll defer to you as to which you think is better.
|
I decided against the custom kernel approach because by definition these are low-cost operations in terms of memory and computation. For sparse rendering, N (number of pixels) should be small -- say 5-50K. I estimate memory savings of a custom kernel to be at most ~1MB at that scale. Even if we render at the scale of 500K pixels (25% of a 1080p image) we're talking only 10MB savings. On the scale of 3DGS memory usage, this is peanuts. The main reason for custom kernels here would be launch overhead and possibly synchronization reduction. But I don't think that should be addressed until it is determined to be a bottleneck. |
- Skip batch-index zeros tensor for single-list JaggedTensors by branching the key computation (avoids N*8 byte allocation) - Use in-place cumsum_(0).sub_(1) for group ID assignment, computing firstInSorted before the mutation (avoids one N*8 byte temporary) Co-authored-by: Francis Williams <francisw@nvidia.com> Signed-off-by: Mark Harris <mharris@nvidia.com> Made-with: Cursor
|
@fwilliams implemented the small optimizations you suggested. Torch cumsum_ doesn't work on bool tensors in CUDA, so had to cast it, so the memory savings is reduced a bit but there's still about 8N bytes savings (1 fewer allocation) for that one (rather than 16N). |
…plicate-pixels-in-sparse-pixel-
Summary
deduplicatePixels) before passing tocomputeSparseInfoand rasterization kernels, then scatter results back viaindex_select. This avoids any CUDA kernel modifications while correctly handling duplicate(batchIdx, row, col)entries that would otherwise cause incorrect tile bitmasks.jidx, correctjlidxpassthrough forldim()consistency, and reducingnumContributingGaussiansto unique-pixel space for the contributing IDs kernel.deduplicatePixels(20 tests covering empty, single, all-unique, duplicates, multi-batch, round-trip reconstruction, and per-batch offset validation) and Python end-to-end tests for all sparse render APIs with duplicate pixels (6 tests covering depths, images, backward gradients, num_contributing, contributing_ids, and multi-camera).Fixes #106
Test plan
DeduplicatePixelsTest: 20/20 passGaussianComputeSparseInfoTest: 14/14 passTestGaussianRenderSparseDuplicatePixels: 6/6 pass