SampleGridTrilinear: Vectorized float4 loads by swahtz · Pull Request #430 · openvdb/fvdb-core

swahtz · 2026-01-26T04:24:05Z

This PR optimizes SampleGridTrilinear (and the related trilinear sampling methods) to perform vectorized float4 loads/stores on the GPU when the features are float32 and the number of channels is a multiple of 4 (and at least 4).
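
The dispatch condition described above can be sketched as a small host-side predicate. This is only an illustration of the stated conditions, not the code in this PR; `canUseVec4` and `numVec4PerPoint` are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical helper (not taken from the PR): returns true when the
// float4 path described above would apply, i.e. float32 features on the
// GPU with a channel count that is a multiple of 4 and at least 4.
inline bool canUseVec4(bool isFloat32, std::int64_t numChannels, bool onGpu) {
    return onGpu && isFloat32 && numChannels >= 4 && numChannels % 4 == 0;
}

// Hypothetical helper: number of float4 loads that cover one feature vector
// when the vectorized path is taken.
inline std::int64_t numVec4PerPoint(std::int64_t numChannels) {
    return numChannels / 4;
}
```

Under these assumptions, a 192-channel feature vector is covered by 48 float4 loads instead of 192 scalar loads, while channel counts such as 1, 3, or 5 fall back to the scalar path.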

This greatly improves memory bandwidth utilization; on my Ada RTX 6000 I see a 320% speedup for the forward pass and a more modest 15% speedup for the backward pass. This is on a test of 8.5 million voxels with a feature channel size of 192, sampling 1.6 million points (totals across a batch size of 16). This channel configuration is one we commonly use in segmentation, and I think this speedup would be beneficial in other use cases as well.

I've also added test combos for a number of channel configurations and fixed test issues that came up as part of that change.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Copilot

Pull request overview

This PR optimizes trilinear sampling operations by implementing vectorized float4 load/store operations for float32 data types with channel counts that are multiples of 4 (and >= 4) on GPU devices. The optimization applies to forward pass, backward pass with gradients, and splat operations, reportedly achieving a 320% speedup for the forward pass and 15% speedup for the backward pass.

Changes:

Implements vectorized float4 callbacks for SampleGridTrilinear, SampleGridTrilinearWithGrad, SampleGridTrilinearWithGradBackward, and SplatIntoGridTrilinear operations
Adds comprehensive test coverage for various channel configurations (1, 3, 4, 5, 16) to ensure correctness across edge cases
Fixes test issues by removing problematic .squeeze() calls that could fail for single-channel cases

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
tests/unit/test_sample.py	Adds test combinations for different channel counts (1, 3, 4, 5, 16) and fixes test issues by properly handling tensor shapes without .squeeze()
src/fvdb/detail/ops/SampleGridTrilinear.cu	Implements vectorized float4 forward pass for trilinear sampling, removes unused iostream include
src/fvdb/detail/ops/SampleGridTrilinearWithGrad.cu	Implements vectorized float4 forward pass with gradient computation
src/fvdb/detail/ops/SampleGridTrilinearWithGradBackward.cu	Implements vectorized float4 backward pass for gradients
src/fvdb/detail/ops/SplatIntoGridTrilinear.cu	Implements vectorized float4 reads for splat operations with scalar atomic writes

matthewdcong

Overall, I think this is a step in the right direction but I think the alignment assumptions could be more explicit and/or strictly enforced.

1. TensorAccessor could be non-contiguous in this case, which means the reinterpret_cast would lead to a float4 that is ill-formed. One option here would be to change the kernel from using a TensorAccessor argument to using a raw pointer which is then wrapped with __builtin_assume_aligned. This would allow the compiler to automatically vectorize the loads/stores and other instructions, which might save on the manual unrolling as well. However, this is a bit of a departure from a readability standpoint, so I'm open to discussion. Alternatively, we could call contiguous() on the vectorized inputs/outputs beforehand, pass the resulting accessor, and explicitly use __builtin_assume_aligned prior to the reinterpret_cast for readability and compiler hinting.
2. TensorAccessor could have a nonzero storage offset (from the page-aligned pointer returned by CUDA malloc) that isn't 16-byte aligned, which would result in a pointer that is not 16-byte aligned even if the number of channels is a multiple of 4. In this case, I think we need to assert on the base pointer rather than solely on the number of channels.
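
To make the two concerns concrete, here is a minimal host-side sketch of the kind of layout check they imply. It is not the check implemented in this PR; `FloatView2D`, `isAligned16`, and `isVec4Safe` are hypothetical names, and the view is assumed to be a 2-D [rows, channels] float32 tensor with strides measured in elements.

```cpp
#include <cstdint>

// Hypothetical helper: true when the address is a multiple of 16 bytes.
inline bool isAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical description of a 2-D [rows, channels] float32 view.
struct FloatView2D {
    const float* data;        // base pointer, storage offset already applied
    std::int64_t rowStride;   // elements between consecutive rows
    std::int64_t chanStride;  // elements between consecutive channels
    std::int64_t numChannels; // channels per row
};

// Every group of 4 channels in every row can be read as one 16-byte vector
// only if the channels are densely packed (point 1) and both the base
// pointer and each row start are 16-byte aligned (point 2).
inline bool isVec4Safe(const FloatView2D& v) {
    return v.numChannels >= 4 && v.numChannels % 4 == 0 &&
           v.chanStride == 1 &&    // innermost dimension is contiguous
           v.rowStride % 4 == 0 && // each row starts on a 16-byte boundary
           isAligned16(v.data);    // base pointer itself is aligned
}
```

With this sketch, a view whose base pointer is shifted by a single float fails the check even when the channel count is a multiple of 4, which is the situation point 2 describes.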

swahtz · 2026-01-26T21:49:41Z

Thanks @matthewdcong for having a look and catching that. I've implemented the second option from point 1 (that seemed the most consistent) and the 16-byte alignment check on the base pointer from point 2 in the SampleGridTrilinear.cu file as a set of proposed changes before I go and implement this everywhere. Let me know what you think and I can roll out the same approach everywhere.
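
For readers unfamiliar with the pattern, the alignment hint followed by a 16-byte load can be sketched in plain host C++ as follows. This is an illustration only and not the kernel code from this PR: `Float4` stands in for CUDA's float4, `loadFloat4` is a hypothetical helper, and `__builtin_assume_aligned` is a GCC/Clang builtin, so the sketch assumes one of those compilers. The caller is assumed to have already verified that the row pointer is 16-byte aligned.

```cpp
#include <cstring>

// Stand-in for CUDA's float4: four floats with 16-byte alignment.
struct alignas(16) Float4 {
    float x, y, z, w;
};

// Hypothetical helper: reads channels [4 * vecIndex, 4 * vecIndex + 3] of a
// row as one 16-byte unit. rowPtr MUST be 16-byte aligned; the builtin only
// tells the compiler so, it does not check anything at run time.
inline Float4 loadFloat4(const float* rowPtr, int vecIndex) {
    const float* aligned =
        static_cast<const float*>(__builtin_assume_aligned(rowPtr, 16));
    Float4 out;
    // memcpy keeps the host sketch free of aliasing issues; a CUDA kernel
    // would typically reinterpret_cast to const float4* instead.
    std::memcpy(&out, aligned + 4 * vecIndex, sizeof(Float4));
    return out;
}
```

Because the hint is unchecked, it is only valid when paired with a run-time check on the base pointer like the one discussed in this thread.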

matthewdcong

Looks good, thanks for the changes!

swahtz added 2 commits January 26, 2026 16:39

SampleGridTrilinear: Vectorized float4 loads for float32 type

d474e3f

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Add tests

9c08914

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

swahtz requested a review from a team as a code owner January 26, 2026 04:24

swahtz requested review from blackencino and phapalova January 26, 2026 04:24

swahtz added the core library, ReCap/Segmentation, and optimization labels Jan 26, 2026

swahtz requested a review from Copilot January 26, 2026 04:31

Copilot started reviewing on behalf of swahtz January 26, 2026 04:31

Copilot AI reviewed Jan 26, 2026

Comment thread src/fvdb/detail/ops/SampleGridTrilinearWithGradBackward.cu Outdated

Comment thread src/fvdb/detail/ops/SampleGridTrilinearWithGradBackward.cu

finish sampleTrilinearWithGradBackwardCallbackVec4 implementation

9565cfb

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

matthewdcong reviewed Jan 26, 2026

addressing review notes proposed changes

10b7250

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

apply the same pattern from SampleGridTrilinear to other kernels

91aa483

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

matthewdcong reviewed Jan 26, 2026

Comment thread src/fvdb/detail/ops/SampleGridTrilinear.cu Outdated

matthewdcong reviewed Jan 26, 2026

Comment thread src/fvdb/detail/ops/SampleGridTrilinear.cu Outdated

swahtz added 2 commits January 27, 2026 11:55

static_cast

36afea3

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

collapsing some of the assignments into unrolled loops where we can

26ac11c

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

matthewdcong approved these changes Jan 27, 2026

swahtz enabled auto-merge (squash) January 27, 2026 00:27

swahtz merged commit e37c0d1 into openvdb:main Jan 27, 2026
32 checks passed

swahtz deleted the js/trilinear_sample_float4 branch January 27, 2026 00:36

swahtz merged 7 commits into openvdb:main from swahtz:js/trilinear_sample_float4
