Improve mGPU Gaussian tile intersection #664

Merged
matthewdcong merged 5 commits into openvdb:main from matthewdcong:decoupled_mgpu_sort
Jun 3, 2026

@matthewdcong matthewdcong commented May 30, 2026

Contributor

Previous iterations of mGPU Gaussian tile intersection:

  1. Distribute every step of the single-GPU tile sort across devices. The tile intersections for all Gaussians across all tiles are computed into a single tensor, which is then sorted with a parallel mGPU radix sort. This requires significantly more communication and synchronization during the parallel radix sort, as well as temporary output key and value arrays for cross-device merging. (previously in main)
  2. Observe that the radix sort is independent for each camera, and run the radix sort for each camera entirely on a single device. This works well when batch size = num GPUs, but performs poorly when batch size = 1 and num GPUs > 1, because only one GPU is used and all data must be gathered onto that GPU. (currently in main)
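
To make the batch size = 1 problem in approach 2 concrete, here is a minimal sketch of a per-camera device assignment. The round-robin policy and the names `assign_cameras_to_devices` and `busy_devices` are assumptions for illustration only, not the actual fVDB scheduling code.

```python
def assign_cameras_to_devices(num_cameras, num_devices):
    """Hypothetical round-robin: camera c is sorted entirely on device c % num_devices."""
    work = [[] for _ in range(num_devices)]
    for cam in range(num_cameras):
        work[cam % num_devices].append(cam)
    return work


def busy_devices(work):
    """Count devices that received at least one camera to sort."""
    return sum(1 for cams in work if cams)


# batch size = num GPUs: every device has one camera to sort.
full_batch = assign_cameras_to_devices(num_cameras=8, num_devices=8)

# batch size = 1 on 8 GPUs: one device does all the sorting, the rest idle.
single_camera = assign_cameras_to_devices(num_cameras=1, num_devices=8)
```

Under this assumed policy, `busy_devices(full_batch)` is 8 while `busy_devices(single_camera)` is 1, which is the underutilization described in item 2.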

This PR introduces a more performant strategy. First, we assign each GPU a subset of the tiles (a tile range) that will be rendered by that GPU. Then, on each GPU, we compute the intersections of all Gaussians with only that tile range. Since the tile keys are monotonically increasing, the subsequent sort is decoupled: we can sort the per-GPU Gaussian tile intersection lists independently, and the resulting flattened array is guaranteed to be sorted. This significantly reduces the communication and data transfer required during sorting. Moreover, switching from radix sort to merge sort lets us remove the temporary output buffers, which further reduces stalls due to prefetching and lowers peak memory utilization.
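
The decoupling argument can be sketched on the CPU in plain Python. This is an illustration under stated assumptions, not the CUDA implementation in this PR: the key layout (tile id in the high 32 bits, depth bits in the low 32 bits), the contiguous near-equal split in `tile_ranges`, and all function names are hypothetical, and Python's built-in `sort` stands in for the device merge sort.

```python
def tile_ranges(num_tiles, num_devices):
    """Split [0, num_tiles) into contiguous, near-equal ranges, one per device."""
    base, extra = divmod(num_tiles, num_devices)
    ranges, start = [], 0
    for d in range(num_devices):
        stop = start + base + (1 if d < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def make_key(tile_id, depth_bits):
    """Assumed key layout: tile id in the high bits, 32-bit depth in the low bits."""
    return (tile_id << 32) | depth_bits


def intersect_and_sort(gaussians, tile_range):
    """Work of one device: keep only intersections inside its tile range, then sort."""
    start, stop = tile_range
    pairs = []
    for gid, (tiles, depth_bits) in enumerate(gaussians):
        for t in tiles:
            if start <= t < stop:
                pairs.append((make_key(t, depth_bits), gid))
    pairs.sort()  # independent of every other device
    return pairs


def decoupled_sort(gaussians, num_tiles, num_devices):
    """Concatenate the per-device sorted lists in device order; no cross-device merge."""
    out = []
    for r in tile_ranges(num_tiles, num_devices):
        out.extend(intersect_and_sort(gaussians, r))
    return out
```

Here `gaussians` is a list of `(tiles, depth_bits)` pairs, where `tiles` lists the tile ids a Gaussian overlaps. Because every key produced for one device's range is smaller than every key produced for the next device's range, the concatenation matches a single global sort of all intersections.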

This change improves performance across the board, and the gain grows with the number of GPUs. On 8x A100s, it improves end-to-end reconstruction performance by about 15% with a batch size of 1 (on a relatively small problem).

Signed-off-by: Matthew Cong <mcong@nvidia.com>
@matthewdcong matthewdcong requested a review from a team as a code owner May 30, 2026 15:20
@swahtz swahtz added the optimization (Performance or memory optimization) and Gaussian Splatting (Issues related to Gaussian splatting in the core library) labels Jun 3, 2026
@swahtz swahtz added this to the v0.5 milestone Jun 3, 2026
Comment thread src/fvdb/detail/ops/gsplat/IntersectGaussianTiles.cu Outdated

@swahtz swahtz left a comment

Contributor

One super minor comment, not worth blocking on, but other than that it looks great to me! Thanks

@matthewdcong matthewdcong enabled auto-merge (squash) June 3, 2026 15:21
Co-authored-by: Jonathan Swartz <jonathan@jswartz.info>
Signed-off-by: Matthew Cong <mcong@nvidia.com>
@matthewdcong matthewdcong force-pushed the decoupled_mgpu_sort branch from d166135 to 3458183 Compare June 3, 2026 15:25
@matthewdcong matthewdcong merged commit 8a26163 into openvdb:main Jun 3, 2026
39 checks passed