Improve/optimize mGPU scaling via batched prefetching and sorting changes by matthewdcong · Pull Request #499 · openvdb/fvdb-core

matthewdcong · 2026-03-04T22:05:37Z

Coalesce consecutive cudaMemPrefetchAsync calls into a single cudaMemPrefetchBatchAsync call to amortize OS overhead.
Decouple the mGPU radix sort into trivially parallel per-camera radix sorts. In effect, the data for each batch is sorted independently on each GPU instead of in one large mGPU radix sort.
Switch from DeviceRadixSort to DeviceMergeSort for mGPU. In the mGPU case, the performance advantage of radix sort is outweighed by the additional cost of allocating separate input and output buffers.
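
To illustrate why the sort can be decoupled, here is a minimal host-side C++ sketch. It is not the fVDB implementation: the key layout (camera index in the high 32 bits), the `offsets` array, and the function names `makeKey`, `sortPerCamera`, and `perCameraSortMatchesGlobalSort` are all assumptions made for this example, and `std::sort` stands in for the device-side sort. Assuming each camera's keys already occupy one contiguous segment, ordered by camera, sorting every segment on its own gives the same result as one global sort.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical key layout: camera index in the high 32 bits,
// tile/depth payload in the low 32 bits.
inline std::uint64_t makeKey(std::uint32_t camera, std::uint32_t payload) {
    return (static_cast<std::uint64_t>(camera) << 32) | payload;
}

// Sort each camera's contiguous segment independently.
// offsets has (number of cameras + 1) entries; segment c is
// [offsets[c], offsets[c + 1]). Segments do not overlap, so each
// iteration could run on a different device or stream.
inline void sortPerCamera(std::vector<std::uint64_t>& keys,
                          const std::vector<std::size_t>& offsets) {
    for (std::size_t c = 0; c + 1 < offsets.size(); ++c) {
        auto first = keys.begin() + static_cast<std::ptrdiff_t>(offsets[c]);
        auto last  = keys.begin() + static_cast<std::ptrdiff_t>(offsets[c + 1]);
        std::sort(first, last);
    }
}

// Checks the equivalence on a small example with two cameras.
inline bool perCameraSortMatchesGlobalSort() {
    std::vector<std::uint64_t> keys = {
        makeKey(0, 7), makeKey(0, 3), makeKey(0, 5),  // camera 0
        makeKey(1, 2), makeKey(1, 9)};                // camera 1
    const std::vector<std::size_t> offsets = {0, 3, 5};

    std::vector<std::uint64_t> global = keys;
    std::sort(global.begin(), global.end());  // one large sort

    sortPerCamera(keys, offsets);              // independent sorts
    return keys == global;
}
```

Calling `perCameraSortMatchesGlobalSort()` compares the two orderings on the sample data. The equivalence relies on the stated assumption that keys are grouped by camera before sorting; if segments interleave cameras, the per-segment result would differ from a global sort.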

Signed-off-by: Matthew Cong <mcong@nvidia.com>

harrism

Looks good. I'd like you to document the two new utility functions before merging.

Signed-off-by: Matthew Cong <mcong@nvidia.com>

matthewdcong requested a review from a team as a code owner March 4, 2026 22:05

matthewdcong requested review from harrism and sifakis March 4, 2026 22:05

matthewdcong added 16 commits March 4, 2026 14:06

Small const correctness improvement

f0f6a1d

Signed-off-by: Matthew Cong <mcong@nvidia.com>

Optimize prefetching for multibatch training

c56a421

Signed-off-by: Matthew Cong <mcong@nvidia.com>

Try independent radix sort

89c1f28

Signed-off-by: Matthew Cong <mcong@nvidia.com>

Trial fix?

4a7d818

Signed-off-by: Matthew Cong <mcong@nvidia.com>

Batched prefetching

72c456f

Signed-off-by: Matthew Cong <mcong@nvidia.com>

Sleep to avoid blocking

cf6f982

Signed-off-by: Matthew Cong <mcong@nvidia.com>

Partial batch fix

47b8929

Signed-off-by: Matthew Cong <mcong@nvidia.com>

Better partial batch fix

e4f16e3

Signed-off-by: Matthew Cong <mcong@nvidia.com>

Fix stream synchronization improve load balancing

a630897

Signed-off-by: Matthew Cong <mcong@nvidia.com>

Fix stream sync for prefetching

88b48e0

Signed-off-by: Matthew Cong <mcong@nvidia.com>

Switch to merge sort

9e0a13a

Signed-off-by: Matthew Cong <mcong@nvidia.com>

Fix sync in prefetch

6bab3fd

Signed-off-by: Matthew Cong <mcong@nvidia.com>

Prefetch tuning

2d30ad8

Signed-off-by: Matthew Cong <mcong@nvidia.com>

Remove unneeded

c3a2866

Signed-off-by: Matthew Cong <mcong@nvidia.com>

Clean up

7facb06

Signed-off-by: Matthew Cong <mcong@nvidia.com>

Upstream changes fix

ae1ed0e

Signed-off-by: Matthew Cong <mcong@nvidia.com>

matthewdcong force-pushed the mgpu_multibatch branch from ac47265 to ae1ed0e Compare March 4, 2026 22:06

matthewdcong added 2 commits March 4, 2026 14:09

Format

5770cfd

Signed-off-by: Matthew Cong <mcong@nvidia.com>

Add CUDA 12 fallbacks

d450f88

Signed-off-by: Matthew Cong <mcong@nvidia.com>

matthewdcong force-pushed the mgpu_multibatch branch from 983f532 to d450f88 Compare March 4, 2026 23:45

Add sleep

2a7c5af

Signed-off-by: Matthew Cong <mcong@nvidia.com>

harrism added the enhancement, optimization, and Gaussian Splatting labels Mar 4, 2026

harrism added this to fvdb-realitycapture Mar 4, 2026

harrism approved these changes Mar 5, 2026

Comment thread src/fvdb/detail/ops/gsplat/FusedSSIM.cu

Comment thread src/fvdb/detail/ops/gsplat/GaussianRasterizeBackward.cu Outdated

Comment thread src/fvdb/detail/ops/gsplat/GaussianUtils.h

Add docs

7e99190

Signed-off-by: Matthew Cong <mcong@nvidia.com>

matthewdcong added 2 commits March 4, 2026 21:58

Merge remote-tracking branch 'upstream/main' into mgpu_multibatch

d692676

Add comment

d53783d

Signed-off-by: Matthew Cong <mcong@nvidia.com>

matthewdcong force-pushed the mgpu_multibatch branch from d09a74c to d53783d Compare March 5, 2026 06:01

matthewdcong enabled auto-merge (squash) March 5, 2026 06:12

matthewdcong merged commit 366a74a into openvdb:main Mar 5, 2026
35 checks passed

github-project-automation Bot moved this to Done in fvdb-realitycapture Mar 5, 2026

matthewdcong merged 22 commits into openvdb:main from matthewdcong:mgpu_multibatch
