SampleGridTrilinear optimization: stencil + sample #474
Merged
- Introduced `resolveTrilinearStencil` and `resolveTrilinearStencilWithGrad` functions to streamline the resolution of corner indices and weights for trilinear interpolation, enhancing cache efficiency.
- Updated `sampleTrilinearCallback` and `sampleTrilinearWithGradCallback` to utilize the new stencil functions, improving clarity and performance.
- Modified `splatIntoGridTrilinearCallback` to adopt the new stencil approach, ensuring consistent handling of weights and indices across all channels.
- Enhanced vectorized callbacks for both sampling and gradient operations to support efficient processing of multiple channels.

These changes improve the overall performance and maintainability of the trilinear interpolation operations in the CUDA backend.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…approach

- Introduced `TrilinearStencil.h` to encapsulate the logic for resolving trilinear corner indices and weights, enhancing code clarity and performance.
- Updated `SampleGridTrilinear.cu`, `SampleGridTrilinearWithGrad.cu`, `SampleGridTrilinearWithGradBackward.cu`, and `SplatIntoGridTrilinear.cu` to leverage the new stencil functions, ensuring consistent handling of weights and indices across all operations.
- Removed redundant implementations of trilinear resolution logic from multiple files, streamlining the codebase and reducing maintenance overhead.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Pull request overview
This PR optimizes the trilinear sampling/splat CUDA kernels by switching from per-(point, channel) work to a per-point stencil resolution that caches the 8 corner indices/weights (and gradient weights where needed), then iterates channels using the cached stencil.
Changes:
- Added `TrilinearStencil.h` to resolve trilinear corner indices/weights (and dWeight/d{u,v,w}) in one pass with cache-friendly traversal.
- Updated sample/splat forward + grad + backward kernels to dispatch one thread per point (`numChannels=1`) and loop channels inside the callback, reusing the resolved stencil.
- Removed the old trilinear iterator helper headers and updated includes accordingly.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/fvdb/detail/utils/TrilinearStencil.h` | New shared stencil resolver utilities used by all trilinear kernels. |
| `src/fvdb/detail/utils/TrilinearInterpolationIterator.h` | Deleted (superseded by stencil resolver). |
| `src/fvdb/detail/utils/TrilinearInterpolationWithGradIterator.h` | Deleted (superseded by stencil resolver with grad). |
| `src/fvdb/detail/ops/SampleGridTrilinear.cu` | Forward sampling kernel updated to stencil+channel-loop (plus vec4 path). |
| `src/fvdb/detail/ops/SampleGridTrilinearWithGrad.cu` | Forward sampling-with-grad updated to stencil+channel-loop (plus vec4 path). |
| `src/fvdb/detail/ops/SampleGridTrilinearWithGradBackward.cu` | Backward kernel updated to stencil+channel-loop (plus vec4 path). |
| `src/fvdb/detail/ops/SplatIntoGridTrilinear.cu` | Splat kernel updated to stencil+channel-loop (plus vec4 path). |
| `src/fvdb/detail/ops/VoxelNeighborhood.cu` | Removed unused include of deleted iterator header. |
blackencino approved these changes on Mar 3, 2026 and left a comment:
Ship it!
I'd love to see this ported to the dispatch framework, it would make a lot of code go away, I think.
blackencino added a commit to blackencino/fvdb-core that referenced this pull request on Mar 3, 2026:

Resolve merge conflicts from PR openvdb#474 (SampleGridTrilinear stencil optimization) by combining main's stencil-based implementation with the feature branch's self-containerization pattern (validation wrapper + non-template public API + FVDB_DISPATCH_KERNEL device dispatch).

Resolved files:
- SampleGridTrilinear.cu
- SampleGridTrilinearWithGrad.cu
- SampleGridTrilinearWithGradBackward.cu
- SplatIntoGridTrilinear.cu
- VoxelNeighborhood.cu

Signed-off-by: Christopher Horvath <chorvath@nvidia.com>
Made-with: Cursor
What changed
The trilinear sampling CUDA kernels (`sample_trilinear` forward, forward-with-grad, backward, and splat) were optimized by taking inspiration from NanoVDB's `TrilinearSampler`.
Files changed
- `src/fvdb/detail/utils/TrilinearStencil.h` -- shared header with `resolveTrilinearStencil` and `resolveTrilinearStencilWithGrad`
- `src/fvdb/detail/ops/SampleGridTrilinear.cu` -- forward kernel (no gradients)
- `src/fvdb/detail/ops/SampleGridTrilinearWithGrad.cu` -- forward kernel (with gradients)
- `src/fvdb/detail/ops/SampleGridTrilinearWithGradBackward.cu` -- backward kernel
- `src/fvdb/detail/ops/SplatIntoGridTrilinear.cu` -- splat kernel
Optimizations
1. Stencil/sample decomposition (all 4 kernels)
Previously, each callback dispatched one thread per (point, channel) pair. Every thread independently performed 8 NanoVDB tree traversals to resolve the trilinear interpolation corners -- meaning the tree was traversed C x 8 times per point (where C = number of feature channels).
The new approach dispatches one thread per point. Each thread resolves the 8-corner stencil once (8 tree traversals total), caching the resolved voxel indices, interpolation weights, and an active bitmask. It then iterates over all channels using the cached data. This reduces tree traversals from C x 8 to 8 per point.
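The resolve-once, loop-over-channels idea can be sketched on the host in plain C++. This is an illustration only, not the code in `TrilinearStencil.h`: the `Stencil` struct, the `Lookup` callback (a stand-in for a tree traversal that returns a voxel's linear index, or -1 if the voxel is inactive), and the `resolveStencil` and `samplePoint` names are all hypothetical. The sketch also assumes that inactive corners can be given index 0 with weight 0, so the channel loop needs no branch, and it visits the corners in simple bit order rather than the NanoVDB traversal order.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical cached stencil; the real layout in TrilinearStencil.h may differ.
struct Stencil {
    std::array<int64_t, 8> index{};   // linear voxel index per corner (0 if inactive)
    std::array<float, 8>   weight{};  // trilinear weight per corner (0 if inactive)
    uint8_t                activeMask = 0;
};

// Stand-in for one tree traversal: returns the voxel's linear index, or -1 if inactive.
using Lookup = std::function<int64_t(int, int, int)>;

// Resolve all 8 corners once per point: exactly 8 lookups, independent of channel count.
Stencil resolveStencil(float x, float y, float z, const Lookup& lookup) {
    const int i0 = static_cast<int>(std::floor(x));
    const int j0 = static_cast<int>(std::floor(y));
    const int k0 = static_cast<int>(std::floor(z));
    const float u = x - i0, v = y - j0, w = z - k0;  // fractional offsets in [0, 1)
    Stencil s;
    for (int c = 0; c < 8; ++c) {
        const int di = (c >> 2) & 1, dj = (c >> 1) & 1, dk = c & 1;
        const int64_t idx = lookup(i0 + di, j0 + dj, k0 + dk);
        if (idx < 0) continue;  // inactive corner keeps index 0 and weight 0
        s.index[c]  = idx;
        s.weight[c] = (di ? u : 1.0f - u) * (dj ? v : 1.0f - v) * (dk ? w : 1.0f - w);
        s.activeMask |= static_cast<uint8_t>(1u << c);
    }
    return s;
}

// Reuse the cached stencil for every channel.
// features is row-major [numVoxels][numChannels] and must hold at least one voxel.
void samplePoint(const Stencil& s, const std::vector<float>& features,
                 int numChannels, float* out) {
    assert(numChannels > 0);
    for (int ch = 0; ch < numChannels; ++ch) {
        float acc = 0.0f;
        for (int c = 0; c < 8; ++c) {
            const std::size_t offset =
                static_cast<std::size_t>(s.index[c]) * numChannels + ch;
            acc += s.weight[c] * features[offset];
        }
        out[ch] = acc;
    }
}
```

In this sketch the lookup count per point stays at 8 however many channels `samplePoint` iterates over, which is the saving described above.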
The coordinate traversal order follows NanoVDB's `TrilinearSampler` pattern (incrementing one component at a time) to maximize `ReadAccessor` node-cache hits across the 8 lookups.
2. float4 vectorization (forward kernels, GPU path)
For `float` scalar type with channels divisible by 4 and 16-byte-aligned data, the channel loop uses explicit `float4` loads and stores for 128-bit coalesced memory access. This was an optimization carried over from the previous version.
3. Branchless inner loop (forward kernels)
Inactive corners receive `MathType(0)` weights during stencil resolution. This eliminates `if (activeMask & (1 << corner))` branches from the inner channel loop, reducing warp divergence on GPU. The backward/splat kernels retain the branch to avoid wasted atomic writes to inactive corners.
Benchmark results
RTX 6000 Ada, batch of 24 grids, ~2M voxels/grid, 10K sample points/grid, 32 feature channels:
The forward pass improvement comes from eliminating redundant tree traversals and reducing warp divergence. The backward pass is dominated by atomic write contention when scattering gradients into shared voxels, so the stencil optimization has minimal impact there.
What was tried and reverted
`probeValue` instead of `isActive` + `getValue`: Replacing the two separate NanoVDB accessor calls with a single `probeValue` call caused a 19% forward regression (0.110ms -> 0.131ms). NanoVDB's `probeValue` unconditionally computes the value via prefix sums before reporting active status, and the extra reference parameter through the template dispatch chain hurts GPU code generation. The existing pattern -- `isActive` first (cheap bit test), then `getValue` only for active corners (guaranteed cache hit) -- is already optimal.
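The retained pattern (test activity first, fetch the value only for active corners) can be sketched with a mock accessor. `MockAccessor` and `resolveCorner` are hypothetical stand-ins written for this illustration. Only the method names `isActive` and `getValue` come from the description above, and the mock's hash-map storage says nothing about how NanoVDB implements them.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Mock stand-in for a read accessor; counts calls so the access pattern is visible.
struct MockAccessor {
    std::unordered_map<uint64_t, int64_t> active;  // packed coord -> voxel index
    mutable int activeTests  = 0;
    mutable int valueFetches = 0;

    // Pack three coordinates (21 bits each) into a single key.
    static uint64_t key(int i, int j, int k) {
        auto pack = [](int c) {
            return static_cast<uint64_t>(static_cast<uint32_t>(c)) & 0x1FFFFFu;
        };
        return (pack(i) << 42) | (pack(j) << 21) | pack(k);
    }

    bool isActive(int i, int j, int k) const {
        ++activeTests;
        return active.count(key(i, j, k)) > 0;
    }

    int64_t getValue(int i, int j, int k) const {
        ++valueFetches;
        return active.at(key(i, j, k));
    }
};

// Activity test first; the value fetch only happens for active corners.
bool resolveCorner(const MockAccessor& acc, int i, int j, int k, int64_t& index) {
    if (!acc.isActive(i, j, k)) {
        return false;  // inactive corner: no value fetch at all
    }
    index = acc.getValue(i, j, k);
    return true;
}
```

With this shape, an inactive corner costs one activity test and no value fetch, which is the property the paragraph above attributes to the existing pattern.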