
SampleGridTrilinear optimization: stencil + sample #474

Merged
swahtz merged 4 commits into openvdb:main from swahtz:js/trilinear_opt on Mar 3, 2026

@swahtz swahtz commented Feb 19, 2026


What changed

The trilinear sampling CUDA kernels (sample_trilinear forward, forward-with-grad, backward, and splat) were optimized by taking inspiration from NanoVDB's TrilinearSampler.

Files changed

  • NEW src/fvdb/detail/utils/TrilinearStencil.h -- shared header with resolveTrilinearStencil and resolveTrilinearStencilWithGrad
  • Modified src/fvdb/detail/ops/SampleGridTrilinear.cu -- forward kernel (no gradients)
  • Modified src/fvdb/detail/ops/SampleGridTrilinearWithGrad.cu -- forward kernel (with gradients)
  • Modified src/fvdb/detail/ops/SampleGridTrilinearWithGradBackward.cu -- backward kernel
  • Modified src/fvdb/detail/ops/SplatIntoGridTrilinear.cu -- splat kernel

Optimizations

1. Stencil/sample decomposition (all 4 kernels)

Previously, each callback dispatched one thread per (point, channel) pair. Every thread independently performed 8 NanoVDB tree traversals to resolve the trilinear interpolation corners -- meaning the tree was traversed C x 8 times per point (where C = number of feature channels).

The new approach dispatches one thread per point. Each thread resolves the 8-corner stencil once (8 tree traversals total), caching the resolved voxel indices, interpolation weights, and an active bitmask. It then iterates over all channels using the cached data. This reduces tree traversals from C x 8 to 8 per point.

The coordinate traversal order follows NanoVDB's TrilinearSampler pattern (incrementing one component at a time) to maximize ReadAccessor node-cache hits across the 8 lookups.
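
The stencil/sample split described above can be sketched on the host in plain C++. This is an illustration under stated assumptions, not the code in TrilinearStencil.h: `ToyGrid`, `ToyStencil`, `resolveToyStencil`, `sampleToyPoint`, and `kCornerOffsets` are hypothetical names, a hash map stands in for the NanoVDB tree, and the corner order is only one possible order in which consecutive corners differ in a single component.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the sparse grid: maps a packed (i, j, k) key to a
// row of the feature tensor. Missing keys are inactive voxels.
struct ToyGrid {
    std::unordered_map<std::uint64_t, std::int64_t> voxelToRow;
    int lookups = 0;  // counts lookups, standing in for tree traversals

    // Toy packing; assumes 0 <= i, j, k < 2^20.
    static std::uint64_t key(int i, int j, int k) {
        return (std::uint64_t(i) << 40) | (std::uint64_t(j) << 20) | std::uint64_t(k);
    }
    std::int64_t find(int i, int j, int k) {
        assert(i >= 0 && j >= 0 && k >= 0);
        ++lookups;
        const auto it = voxelToRow.find(key(i, j, k));
        return it == voxelToRow.end() ? -1 : it->second;
    }
};

// Cached per-point result: feature rows, weights, and an active bitmask.
struct ToyStencil {
    std::array<std::int64_t, 8> row{};
    std::array<float, 8> weight{};
    std::uint8_t activeMask = 0;
};

// Corner offsets ordered so that consecutive corners differ in one component.
constexpr int kCornerOffsets[8][3] = {{0, 0, 0}, {0, 0, 1}, {0, 1, 1}, {0, 1, 0},
                                      {1, 1, 0}, {1, 1, 1}, {1, 0, 1}, {1, 0, 0}};

// Resolve all 8 corners once for a point given in voxel coordinates.
ToyStencil resolveToyStencil(ToyGrid &grid, float x, float y, float z) {
    const int i0 = int(std::floor(x)), j0 = int(std::floor(y)), k0 = int(std::floor(z));
    const float u = x - float(i0), v = y - float(j0), w = z - float(k0);
    ToyStencil s;
    for (int c = 0; c < 8; ++c) {
        const int di = kCornerOffsets[c][0];
        const int dj = kCornerOffsets[c][1];
        const int dk = kCornerOffsets[c][2];
        const std::int64_t row = grid.find(i0 + di, j0 + dj, k0 + dk);
        if (row < 0) continue;  // inactive corner: mask bit stays clear
        s.row[c] = row;
        s.weight[c] = (di ? u : 1.0f - u) * (dj ? v : 1.0f - v) * (dk ? w : 1.0f - w);
        s.activeMask |= std::uint8_t(1u << c);
    }
    return s;
}

// One "thread" per point: resolve the stencil once, then loop over channels.
std::vector<float> sampleToyPoint(ToyGrid &grid, const std::vector<float> &features,
                                  std::size_t numChannels, float x, float y, float z) {
    const ToyStencil s = resolveToyStencil(grid, x, y, z);  // 8 lookups, independent of C
    std::vector<float> out(numChannels, 0.0f);
    for (std::size_t ch = 0; ch < numChannels; ++ch) {
        for (int c = 0; c < 8; ++c) {
            if (s.activeMask & (1u << c)) {
                out[ch] += s.weight[c] * features[std::size_t(s.row[c]) * numChannels + ch];
            }
        }
    }
    return out;
}
```

In this sketch `ToyGrid::lookups` ends at 8 for a point regardless of `numChannels`. With the 32 channels used in the benchmark, the per-(point, channel) scheme would instead perform 32 x 8 = 256 lookups per point.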

2. float4 vectorization (forward kernels, GPU path)

For float scalar type with channels divisible by 4 and 16-byte-aligned data, the channel loop uses explicit float4 loads and stores for 128-bit coalesced memory access. This was an optimization carried over from the previous version.
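
The eligibility condition for the vectorized path (channel count divisible by 4, 16-byte-aligned data) might be checked roughly as follows. `canUseVec4` is a hypothetical helper written for this note; the actual check in the kernels is not shown in this PR text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns true when rows of `numChannels` floats starting
// at `data` can be read as 16-byte (4 x float) vectors.
inline bool canUseVec4(const float *data, std::size_t numChannels) {
    assert(data != nullptr);
    // Each row then holds a whole number of 4-float groups.
    const bool channelsOk = (numChannels % 4) == 0;
    // The base address must sit on a 16-byte boundary.
    const bool alignedOk = (reinterpret_cast<std::uintptr_t>(data) % 16) == 0;
    return channelsOk && alignedOk;
}
```

When both conditions hold, every row starts at a byte offset of `row * numChannels * sizeof(float)`, which is a multiple of 16, so each 4-float group in every row is itself 16-byte aligned.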

3. Branchless inner loop (forward kernels)

Inactive corners receive MathType(0) weights during stencil resolution. This eliminates if (activeMask & (1 << corner)) branches from the inner channel loop, reducing warp divergence on GPU. The backward/splat kernels retain the branch to avoid wasted atomic writes to inactive corners.
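
A small host-side sketch of why zeroed weights can replace the mask test in the inner loop. The function names are hypothetical, and the sketch assumes that the value read for an inactive corner is finite (a zero weight times NaN or infinity would not give zero); how the kernels choose the index for an inactive corner is not stated in this PR text.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Corner8 = std::array<float, 8>;

// Branched form: the mask is tested for every corner inside the inner loop.
inline float accumulateBranched(const Corner8 &weight, const Corner8 &value,
                                std::uint8_t activeMask) {
    float acc = 0.0f;
    for (int c = 0; c < 8; ++c) {
        if (activeMask & (1u << c)) acc += weight[c] * value[c];
    }
    return acc;
}

// Done once per point during stencil resolution: inactive corners get weight 0.
inline Corner8 zeroInactiveWeights(Corner8 weight, std::uint8_t activeMask) {
    for (int c = 0; c < 8; ++c) {
        if (!(activeMask & (1u << c))) weight[c] = 0.0f;
    }
    return weight;
}

// Branchless inner loop: every corner contributes, inactive ones contribute 0.
// Assumes value[c] is finite for inactive corners.
inline float accumulateBranchless(const Corner8 &maskedWeight, const Corner8 &value) {
    float acc = 0.0f;
    for (int c = 0; c < 8; ++c) acc += maskedWeight[c] * value[c];
    return acc;
}
```

The mask test moves out of the per-channel loop and into the per-point setup, so it runs once per point rather than once per (channel, corner) pair.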

Benchmark results

RTX 6000 Ada, batch of 24 grids, ~2M voxels/grid, 10K sample points/grid, 32 feature channels:

| Pass | Baseline | Optimized | Speedup |
|---|---|---|---|
| Forward | 0.214 ms | 0.110 ms | 1.95x |
| Backward | 7.138 ms | 7.059 ms | ~1.01x |

The forward pass improvement comes from eliminating redundant tree traversals and reducing warp divergence. The backward pass is dominated by atomic write contention when scattering gradients into shared voxels, so the stencil optimization has minimal impact there.

What was tried and reverted

  • probeValue instead of isActive + getValue: Replacing the two separate NanoVDB accessor calls with a single probeValue call caused a 19% forward regression (0.110 ms -> 0.131 ms). NanoVDB's probeValue unconditionally computes the value via prefix sums before reporting active status, and the extra reference parameter through the template dispatch chain hurts GPU code generation. The existing pattern -- isActive first (cheap bit test), then getValue only for active corners (guaranteed cache hit) -- is already optimal.
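
The difference between the two lookup patterns can be illustrated with a mock accessor. `MockAccessor` is a hypothetical stand-in that mirrors only the three call names mentioned above (isActive, getValue, probeValue) and counts value computations; it is not NanoVDB code, and its integer key replaces a real coordinate type.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical mock accessor; counts how often a value is computed.
struct MockAccessor {
    std::unordered_map<std::int64_t, std::int64_t> active;  // coord key -> voxel index
    mutable int valueComputations = 0;

    // Cheap test, no value computed.
    bool isActive(std::int64_t key) const { return active.count(key) != 0; }

    std::int64_t getValue(std::int64_t key) const {
        ++valueComputations;
        const auto it = active.find(key);
        return it == active.end() ? 0 : it->second;
    }

    // Modeled after the description above: the value is computed
    // unconditionally, then the active status is reported.
    bool probeValue(std::int64_t key, std::int64_t &value) const {
        value = getValue(key);
        return isActive(key);
    }
};

// Pattern kept in the PR: test first, fetch the value only for active corners.
inline bool lookupTwoStep(const MockAccessor &acc, std::int64_t key, std::int64_t &index) {
    if (!acc.isActive(key)) return false;
    index = acc.getValue(key);
    return true;
}

// Pattern that was tried and reverted: a single combined call.
inline bool lookupProbe(const MockAccessor &acc, std::int64_t key, std::int64_t &index) {
    return acc.probeValue(key, index);
}
```

For an inactive corner, the two-step pattern in this mock never computes a value, while the combined call always does. The mock does not model the code-generation effect of the extra reference parameter.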

- Introduced `resolveTrilinearStencil` and `resolveTrilinearStencilWithGrad` functions to streamline the resolution of corner indices and weights for trilinear interpolation, enhancing cache efficiency.
- Updated `sampleTrilinearCallback` and `sampleTrilinearWithGradCallback` to utilize the new stencil functions, improving clarity and performance.
- Modified `splatIntoGridTrilinearCallback` to adopt the new stencil approach, ensuring consistent handling of weights and indices across all channels.
- Enhanced vectorized callbacks for both sampling and gradient operations to support efficient processing of multiple channels.

These changes improve the overall performance and maintainability of the trilinear interpolation operations in the CUDA backend.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
…approach

- Introduced `TrilinearStencil.h` to encapsulate the logic for resolving trilinear corner indices and weights, enhancing code clarity and performance.
- Updated `SampleGridTrilinear.cu`, `SampleGridTrilinearWithGrad.cu`, `SampleGridTrilinearWithGradBackward.cu`, and `SplatIntoGridTrilinear.cu` to leverage the new stencil functions, ensuring consistent handling of weights and indices across all operations.
- Removed redundant implementations of trilinear resolution logic from multiple files, streamlining the codebase and reducing maintenance overhead.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
@swahtz swahtz requested a review from a team as a code owner February 19, 2026 04:42
@swahtz swahtz added the labels optimization (Performance or memory optimization), core library (Core fVDB library, i.e. anything in the _Cpp module (C++) or fvdb python module), and ReCap/Segmentation Feb 19, 2026
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Copilot AI left a comment


Pull request overview

This PR optimizes the trilinear sampling/splat CUDA kernels by switching from per-(point, channel) work to a per-point stencil resolution that caches the 8 corner indices/weights (and gradient weights where needed), then iterates channels using the cached stencil.

Changes:

  • Added TrilinearStencil.h to resolve trilinear corner indices/weights (and dWeight/d{u,v,w}) in one pass with cache-friendly traversal.
  • Updated sample/splat forward + grad + backward kernels to dispatch one thread per point (numChannels=1) and loop channels inside the callback, reusing the resolved stencil.
  • Removed the old trilinear iterator helper headers and updated includes accordingly.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| src/fvdb/detail/utils/TrilinearStencil.h | New shared stencil resolver utilities used by all trilinear kernels. |
| src/fvdb/detail/utils/TrilinearInterpolationIterator.h | Deleted (superseded by stencil resolver). |
| src/fvdb/detail/utils/TrilinearInterpolationWithGradIterator.h | Deleted (superseded by stencil resolver with grad). |
| src/fvdb/detail/ops/SampleGridTrilinear.cu | Forward sampling kernel updated to stencil+channel-loop (plus vec4 path). |
| src/fvdb/detail/ops/SampleGridTrilinearWithGrad.cu | Forward sampling-with-grad updated to stencil+channel-loop (plus vec4 path). |
| src/fvdb/detail/ops/SampleGridTrilinearWithGradBackward.cu | Backward kernel updated to stencil+channel-loop (plus vec4 path). |
| src/fvdb/detail/ops/SplatIntoGridTrilinear.cu | Splat kernel updated to stencil+channel-loop (plus vec4 path). |
| src/fvdb/detail/ops/VoxelNeighborhood.cu | Removed unused include of deleted iterator header. |


Comment thread src/fvdb/detail/utils/TrilinearStencil.h
Comment thread src/fvdb/detail/utils/TrilinearStencil.h

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@blackencino blackencino left a comment


Ship it!

I'd love to see this ported to the dispatch framework, it would make a lot of code go away, I think.

@swahtz swahtz merged commit 6d76302 into openvdb:main Mar 3, 2026
39 checks passed
@swahtz swahtz deleted the js/trilinear_opt branch March 3, 2026 20:25
blackencino added a commit to blackencino/fvdb-core that referenced this pull request Mar 3, 2026
Resolve merge conflicts from PR openvdb#474 (SampleGridTrilinear stencil
optimization) by combining main's stencil-based implementation with
the feature branch's self-containerization pattern (validation wrapper
+ non-template public API + FVDB_DISPATCH_KERNEL device dispatch).

Resolved files:
- SampleGridTrilinear.cu
- SampleGridTrilinearWithGrad.cu
- SampleGridTrilinearWithGradBackward.cu
- SplatIntoGridTrilinear.cu
- VoxelNeighborhood.cu

Signed-off-by: Christopher Horvath <chorvath@nvidia.com>
Made-with: Cursor
