`SampleGridTrilinear` optimization: stencil + sample by swahtz · Pull Request #474 · openvdb/fvdb-core

swahtz · 2026-02-19T04:42:35Z

What changed

The trilinear sampling CUDA kernels (sample_trilinear forward, forward-with-grad, backward, and splat) were optimized by taking inspiration from NanoVDB's TrilinearSampler.

Files changed

NEW src/fvdb/detail/utils/TrilinearStencil.h -- shared header with resolveTrilinearStencil and resolveTrilinearStencilWithGrad
Modified src/fvdb/detail/ops/SampleGridTrilinear.cu -- forward kernel (no gradients)
Modified src/fvdb/detail/ops/SampleGridTrilinearWithGrad.cu -- forward kernel (with gradients)
Modified src/fvdb/detail/ops/SampleGridTrilinearWithGradBackward.cu -- backward kernel
Modified src/fvdb/detail/ops/SplatIntoGridTrilinear.cu -- splat kernel

Optimizations

1. Stencil/sample decomposition (all 4 kernels)

Previously, each callback dispatched one thread per (point, channel) pair. Every thread independently performed 8 NanoVDB tree traversals to resolve the trilinear interpolation corners -- meaning the tree was traversed C x 8 times per point (where C = number of feature channels).

The new approach dispatches one thread per point. Each thread resolves the 8-corner stencil once (8 tree traversals total), caching the resolved voxel indices, interpolation weights, and an active bitmask. It then iterates over all channels using the cached data. This reduces tree traversals from C x 8 to 8 per point.

The coordinate traversal order follows NanoVDB's TrilinearSampler pattern (incrementing one component at a time) to maximize ReadAccessor node-cache hits across the 8 lookups.
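The decomposition can be sketched on the host in plain C++. This is a minimal illustration under stated assumptions, not the code in `TrilinearStencil.h`: `Stencil`, `Lookup`, `resolveStencil`, and `samplePoint` are hypothetical names, and the `Lookup` callback stands in for a NanoVDB accessor traversal (it returns a feature row, or -1 for an inactive voxel). The corner order shown is one ordering in which consecutive lookups differ in a single coordinate component.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical cached stencil: 8 corner rows, 8 weights, and an active bitmask.
struct Stencil {
    std::array<std::int64_t, 8> index{};
    std::array<float, 8> weight{};
    std::uint8_t activeMask = 0;
};

// Stand-in for one tree traversal: returns the voxel's feature row, or -1 if inactive.
using Lookup = std::function<std::int64_t(int, int, int)>;

// Resolve the stencil once per point: exactly 8 lookups, whatever the channel count.
inline Stencil resolveStencil(float x, float y, float z, const Lookup &lookup) {
    const int i0 = static_cast<int>(std::floor(x));
    const int j0 = static_cast<int>(std::floor(y));
    const int k0 = static_cast<int>(std::floor(z));
    const float u = x - static_cast<float>(i0);
    const float v = y - static_cast<float>(j0);
    const float w = z - static_cast<float>(k0);
    // Corner order in which consecutive entries differ in exactly one component.
    static constexpr int order[8][3] = {{0, 0, 0}, {0, 0, 1}, {0, 1, 1}, {0, 1, 0},
                                        {1, 1, 0}, {1, 1, 1}, {1, 0, 1}, {1, 0, 0}};
    Stencil s;
    for (int c = 0; c < 8; ++c) {
        const int di = order[c][0], dj = order[c][1], dk = order[c][2];
        s.index[c] = lookup(i0 + di, j0 + dj, k0 + dk);
        if (s.index[c] >= 0) {
            s.weight[c] = (di ? u : 1.0f - u) * (dj ? v : 1.0f - v) * (dk ? w : 1.0f - w);
            s.activeMask = static_cast<std::uint8_t>(s.activeMask | (1u << c));
        }
    }
    return s;
}

// Interpolate every channel from the cached stencil; no lookups happen here.
inline std::vector<float> samplePoint(const Stencil &s, const std::vector<float> &features,
                                      std::size_t numChannels) {
    assert(numChannels > 0);
    std::vector<float> out(numChannels, 0.0f);
    for (int c = 0; c < 8; ++c) {
        if (!(s.activeMask & (1u << c))) continue;
        const std::size_t row = static_cast<std::size_t>(s.index[c]) * numChannels;
        for (std::size_t ch = 0; ch < numChannels; ++ch) {
            out[ch] += s.weight[c] * features[row + ch];
        }
    }
    return out;
}
```

With C = 32 channels, the per-(point, channel) layout would call the lookup 32 x 8 = 256 times per point, while this layout calls it 8 times and reuses the cached indices and weights for every channel.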

2. float4 vectorization (forward kernels, GPU path)

For float scalar type with channels divisible by 4 and 16-byte-aligned data, the channel loop uses explicit float4 loads and stores for 128-bit coalesced memory access. This was an optimization carried over from the previous version.
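The eligibility conditions for the vectorized path can be written down as a small check. The sketch below is host-side C++ with hypothetical helper names (`canUseFloat4`, `accumulateVec4`); the inner four-lane loop is a scalar stand-in for what would be a single float4 load and store in the CUDA kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when a float buffer can be accessed as packed groups of 4.
inline bool canUseFloat4(const float *data, std::size_t numChannels) {
    const bool divisible = numChannels % 4 == 0;
    const bool aligned = reinterpret_cast<std::uintptr_t>(data) % 16 == 0;
    return divisible && aligned;
}

// Scalar stand-in for the vectorized channel loop: channels are handled 4 at a time.
inline void accumulateVec4(float *out, const float *row, float weight,
                           std::size_t numChannels) {
    assert(canUseFloat4(out, numChannels) && canUseFloat4(row, numChannels));
    for (std::size_t ch = 0; ch < numChannels; ch += 4) {
        // On the GPU, this group of 4 would be one 128-bit load and one 128-bit store.
        for (std::size_t lane = 0; lane < 4; ++lane) {
            out[ch + lane] += weight * row[ch + lane];
        }
    }
}
```

When the base pointer is 16-byte aligned and the channel count is a multiple of 4, every row of a densely packed feature matrix starts on a 16-byte boundary, so checking the base pointer is enough in that layout.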

3. Branchless inner loop (forward kernels)

Inactive corners receive MathType(0) weights during stencil resolution. This eliminates if (activeMask & (1 << corner)) branches from the inner channel loop, reducing warp divergence on GPU. The backward/splat kernels retain the branch to avoid wasted atomic writes to inactive corners.
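The two forms of the inner loop can be compared side by side. This is a single-channel host-side sketch with hypothetical names (`sampleBranching`, `zeroInactiveWeights`, `sampleBranchless`), not the kernel code itself.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Corners = std::array<float, 8>;

// Branching form: the active bit is tested for every corner inside the loop.
inline float sampleBranching(const Corners &value, const Corners &weight,
                             std::uint8_t activeMask) {
    float acc = 0.0f;
    for (int c = 0; c < 8; ++c) {
        if (activeMask & (1u << c)) acc += weight[c] * value[c];
    }
    return acc;
}

// Done once at stencil resolution: inactive corners get a zero weight.
inline Corners zeroInactiveWeights(Corners weight, std::uint8_t activeMask) {
    for (int c = 0; c < 8; ++c) {
        if (!(activeMask & (1u << c))) weight[c] = 0.0f;
    }
    return weight;
}

// Branchless form: every corner is accumulated; zero weights contribute nothing.
inline float sampleBranchless(const Corners &value, const Corners &weight) {
    float acc = 0.0f;
    for (int c = 0; c < 8; ++c) acc += weight[c] * value[c];
    return acc;
}

// Usage check: both forms agree once the inactive weights are zeroed.
inline void checkFormsAgree() {
    const Corners value = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
    Corners weight;
    weight.fill(0.125f);
    const std::uint8_t mask = 0x0F; // corners 0..3 active
    const float a = sampleBranching(value, weight, mask);
    const float b = sampleBranchless(value, zeroInactiveWeights(weight, mask));
    assert(a == b);
}
```

One caveat of the branchless form is that it still reads a value for inactive corners, so those reads must land on a valid, finite location (a zero weight times NaN is still NaN). How the kernels arrange that is not stated here.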

Benchmark results

RTX 6000 Ada, batch of 24 grids, ~2M voxels/grid, 10K sample points/grid, 32 feature channels:

Pass	Baseline	Optimized	Speedup
Forward	0.214 ms	0.110 ms	1.95x
Backward	7.138 ms	7.059 ms	~1.01x

The forward pass improvement comes from eliminating redundant tree traversals and reducing warp divergence. The backward pass is dominated by atomic write contention when scattering gradients into shared voxels, so the stencil optimization has minimal impact there.

What was tried and reverted

probeValue instead of isActive + getValue: Replacing the two separate NanoVDB accessor calls with a single probeValue call caused a 19% forward regression (0.110ms -> 0.131ms). NanoVDB's probeValue unconditionally computes the value via prefix sums before reporting active status, and the extra reference parameter through the template dispatch chain hurts GPU code generation. The existing pattern -- isActive first (cheap bit test), then getValue only for active corners (guaranteed cache hit) -- is already optimal.
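The difference in work between the two call patterns can be illustrated with a mock. `MockAccessor` below is hypothetical: it borrows the method names `isActive`, `getValue`, and `probeValue`, but only models the behaviour described above (the value is computed before the active state is reported) by counting value computations. It does not reproduce NanoVDB internals or the GPU code-generation effect.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical mock: counts how often a value has to be computed.
struct MockAccessor {
    std::set<int> activeKeys;
    mutable int valueComputations = 0;

    // Stand-in for the cheap bit test.
    bool isActive(int key) const { return activeKeys.count(key) != 0; }

    // Stand-in for the more expensive value computation.
    std::int64_t getValue(int key) const {
        assert(key >= 0);
        ++valueComputations;
        return isActive(key) ? static_cast<std::int64_t>(key) * 10 : 0;
    }

    // Modelled as described above: compute the value first, then report activity.
    bool probeValue(int key, std::int64_t &value) const {
        value = getValue(key);
        return isActive(key);
    }
};

struct Resolved {
    std::array<std::int64_t, 8> value{};
    std::uint8_t activeMask = 0;
};

// Retained pattern: isActive first, getValue only for active corners.
inline Resolved resolveTwoCall(const MockAccessor &acc) {
    Resolved r;
    for (int c = 0; c < 8; ++c) {
        if (acc.isActive(c)) {
            r.value[c] = acc.getValue(c);
            r.activeMask = static_cast<std::uint8_t>(r.activeMask | (1u << c));
        }
    }
    return r;
}

// Reverted pattern: a single probeValue call per corner.
inline Resolved resolveProbe(const MockAccessor &acc) {
    Resolved r;
    for (int c = 0; c < 8; ++c) {
        std::int64_t v = 0;
        if (acc.probeValue(c, v)) {
            r.value[c] = v;
            r.activeMask = static_cast<std::uint8_t>(r.activeMask | (1u << c));
        }
    }
    return r;
}
```

With 3 of 8 corners active, the two-call pattern computes 3 values and the probe pattern computes 8, while both resolve the same mask and values. The saving therefore grows with the fraction of inactive corners.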

- Introduced `resolveTrilinearStencil` and `resolveTrilinearStencilWithGrad` functions to streamline the resolution of corner indices and weights for trilinear interpolation, enhancing cache efficiency.
- Updated `sampleTrilinearCallback` and `sampleTrilinearWithGradCallback` to utilize the new stencil functions, improving clarity and performance.
- Modified `splatIntoGridTrilinearCallback` to adopt the new stencil approach, ensuring consistent handling of weights and indices across all channels.
- Enhanced vectorized callbacks for both sampling and gradient operations to support efficient processing of multiple channels.

These changes improve the overall performance and maintainability of the trilinear interpolation operations in the CUDA backend.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

…approach

- Introduced `TrilinearStencil.h` to encapsulate the logic for resolving trilinear corner indices and weights, enhancing code clarity and performance.
- Updated `SampleGridTrilinear.cu`, `SampleGridTrilinearWithGrad.cu`, `SampleGridTrilinearWithGradBackward.cu`, and `SplatIntoGridTrilinear.cu` to leverage the new stencil functions, ensuring consistent handling of weights and indices across all operations.
- Removed redundant implementations of trilinear resolution logic from multiple files, streamlining the codebase and reducing maintenance overhead.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Copilot

Pull request overview

This PR optimizes the trilinear sampling/splat CUDA kernels by switching from per-(point, channel) work to a per-point stencil resolution that caches the 8 corner indices/weights (and gradient weights where needed), then iterates channels using the cached stencil.

Changes:

Added TrilinearStencil.h to resolve trilinear corner indices/weights (and dWeight/d{u,v,w}) in one pass with cache-friendly traversal.
Updated sample/splat forward + grad + backward kernels to dispatch one thread per point (numChannels=1) and loop channels inside the callback, reusing the resolved stencil.
Removed the old trilinear iterator helper headers and updated includes accordingly.
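The gradient weights mentioned in the list above (dWeight/d{u,v,w}) follow from the product form of the trilinear weight. The sketch below is a minimal illustration with hypothetical names (`CornerWeight`, `cornerWeight`); whether `resolveTrilinearStencilWithGrad` stores exactly these quantities is an assumption. For corner (a, b, c) with a, b, c in {0, 1}, the weight is f_a(u) * f_b(v) * f_c(w), where f_0(t) = 1 - t and f_1(t) = t, so each partial derivative replaces one factor with -1 or +1.

```cpp
#include <cassert>

// Hypothetical per-corner result: the weight and its partial derivatives.
struct CornerWeight {
    float w;  // trilinear weight
    float dU; // d(weight)/du
    float dV; // d(weight)/dv
    float dW; // d(weight)/dw
};

// (a, b, c) selects the corner; (u, v, w) are the fractional offsets in [0, 1).
inline CornerWeight cornerWeight(int a, int b, int c, float u, float v, float w) {
    assert((a == 0 || a == 1) && (b == 0 || b == 1) && (c == 0 || c == 1));
    const float fu = a ? u : 1.0f - u;
    const float fv = b ? v : 1.0f - v;
    const float fw = c ? w : 1.0f - w;
    // Derivative of t is +1; derivative of (1 - t) is -1.
    const float su = a ? 1.0f : -1.0f;
    const float sv = b ? 1.0f : -1.0f;
    const float sw = c ? 1.0f : -1.0f;
    return CornerWeight{fu * fv * fw, su * fv * fw, fu * sv * fw, fu * fv * sw};
}
```

Over the 8 corners the weights sum to 1 and each set of derivative weights sums to 0, which makes a convenient sanity check for a stencil resolver.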

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.


File	Description
`src/fvdb/detail/utils/TrilinearStencil.h`	New shared stencil resolver utilities used by all trilinear kernels.
`src/fvdb/detail/utils/TrilinearInterpolationIterator.h`	Deleted (superseded by stencil resolver).
`src/fvdb/detail/utils/TrilinearInterpolationWithGradIterator.h`	Deleted (superseded by stencil resolver with grad).
`src/fvdb/detail/ops/SampleGridTrilinear.cu`	Forward sampling kernel updated to stencil+channel-loop (plus vec4 path).
`src/fvdb/detail/ops/SampleGridTrilinearWithGrad.cu`	Forward sampling-with-grad updated to stencil+channel-loop (plus vec4 path).
`src/fvdb/detail/ops/SampleGridTrilinearWithGradBackward.cu`	Backward kernel updated to stencil+channel-loop (plus vec4 path).
`src/fvdb/detail/ops/SplatIntoGridTrilinear.cu`	Splat kernel updated to stencil+channel-loop (plus vec4 path).
`src/fvdb/detail/ops/VoxelNeighborhood.cu`	Removed unused include of deleted iterator header.


Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

blackencino

Ship it!

I'd love to see this ported to the dispatch framework, it would make a lot of code go away, I think.

Resolve merge conflicts from PR openvdb#474 (SampleGridTrilinear stencil optimization) by combining main's stencil-based implementation with the feature branch's self-containerization pattern (validation wrapper + non-template public API + FVDB_DISPATCH_KERNEL device dispatch).

Resolved files:
- SampleGridTrilinear.cu
- SampleGridTrilinearWithGrad.cu
- SampleGridTrilinearWithGradBackward.cu
- SplatIntoGridTrilinear.cu
- VoxelNeighborhood.cu

Signed-off-by: Christopher Horvath <chorvath@nvidia.com>
Made-with: Cursor

swahtz added 2 commits February 19, 2026 16:45

swahtz requested a review from a team as a code owner February 19, 2026 04:42

swahtz added the optimization (Performance or memory optimization), core library (Core fVDB library, i.e. anything in the _Cpp module (C++) or fvdb python module), and ReCap/Segmentation labels Feb 19, 2026

swahtz requested review from harrism and matthewdcong February 19, 2026 04:42

swahtz added 2 commits February 19, 2026 17:45

format fix

1ebd746

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

formatting

bccb59e

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

swahtz requested review from Copilot February 25, 2026 23:27

Copilot started reviewing on behalf of swahtz February 25, 2026 23:28

Copilot AI reviewed Feb 25, 2026


Comment thread src/fvdb/detail/utils/TrilinearStencil.h

Comment thread src/fvdb/detail/utils/TrilinearStencil.h

Copilot AI reviewed Feb 25, 2026

blackencino approved these changes Mar 3, 2026


swahtz merged commit 6d76302 into openvdb:main Mar 3, 2026
39 checks passed

swahtz deleted the js/trilinear_opt branch March 3, 2026 20:25

swahtz merged 4 commits into openvdb:main from swahtz:js/trilinear_opt

3 participants
