Skip to content

SampleGridTrilinear: Vectorized float4 loads#430

Merged
swahtz merged 7 commits into
openvdb:mainfrom
swahtz:js/trilinear_sample_float4
Jan 27, 2026
Merged

SampleGridTrilinear: Vectorized float4 loads#430
swahtz merged 7 commits into
openvdb:mainfrom
swahtz:js/trilinear_sample_float4

Conversation

@swahtz

@swahtz swahtz commented Jan 26, 2026

Copy link
Copy Markdown
Contributor

This PR optimizes SampleGridTrilinear (and other related trilinear sampling methods) to perform vectorized float4 loads/stores for cases where we're operating on float feature channels that are a multiple of 4 (for a number of channels >=4) and when on the GPU.

This greatly improves memory bandwidth utilization; on my Ada RTX 6000 I see a 320% speedup for the forward pass and more modest 15% speedup for the backward pass. This is on a test of 8.5 million voxels with a feature channel size of 192, sampling 1.6 million points (totals across a batch size of 16). This channel configuration is a common number we use in segmentation and I think this speedup would be beneficial in other use cases as well.

I've also added test combos for a number of channel configurations and fixed test issues that came up as part of that change.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
@swahtz swahtz requested a review from a team as a code owner January 26, 2026 04:24
@swahtz swahtz added core library Core fVDB library. i.e. anything in the _Cpp module (C++) or fvdb python module ReCap/Segmentation optimization Performance or memory optimization labels Jan 26, 2026
@swahtz swahtz requested a review from Copilot January 26, 2026 04:31

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes trilinear sampling operations by implementing vectorized float4 load/store operations for float32 data types with channel counts that are multiples of 4 (and >= 4) on GPU devices. The optimization applies to forward pass, backward pass with gradients, and splat operations, reportedly achieving a 320% speedup for the forward pass and 15% speedup for the backward pass.

Changes:

  • Implements vectorized float4 callbacks for SampleGridTrilinear, SampleGridTrilinearWithGrad, SampleGridTrilinearWithGradBackward, and SplatIntoGridTrilinear operations
  • Adds comprehensive test coverage for various channel configurations (1, 3, 4, 5, 16) to ensure correctness across edge cases
  • Fixes test issues by removing problematic .squeeze() calls that could fail for single-channel cases

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/unit/test_sample.py Adds test combinations for different channel counts (1, 3, 4, 5, 16) and fixes test issues by properly handling tensor shapes without .squeeze()
src/fvdb/detail/ops/SampleGridTrilinear.cu Implements vectorized float4 forward pass for trilinear sampling, removes unused iostream include
src/fvdb/detail/ops/SampleGridTrilinearWithGrad.cu Implements vectorized float4 forward pass with gradient computation
src/fvdb/detail/ops/SampleGridTrilinearWithGradBackward.cu Implements vectorized float4 backward pass for gradients
src/fvdb/detail/ops/SplatIntoGridTrilinear.cu Implements vectorized float4 reads for splat operations with scalar atomic writes

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/fvdb/detail/ops/SampleGridTrilinearWithGradBackward.cu Outdated
Comment thread src/fvdb/detail/ops/SampleGridTrilinearWithGradBackward.cu
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

@matthewdcong matthewdcong left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I think this is a step in the right direction but I think the alignment assumptions could be more explicit and/or strictly enforced.

  1. TensorAccessor could be non-contiguous in this case which means the reinterpret_cast would lead to a float4 that is ill-formed. One option here would be to change the kernel from using a TensorAccessor argument to use a raw pointer which is then wrapped with __builtin_assume_aligned. This would allow the compiler to automatically vectorize the load/stores and other instructions which might save on the manual unrolling as well. However, this is a bit of a departure from a readability standpoint, so open to discussion. Alternatively, we could call contiguous() on the vectorized inputs/outputs beforehand, pass the resulting accessor, and explicitly use __buildin_assume_aligned prior to the reinterpret_cast for readability and compiler hinting.
  2. TensorAccessor could have a nonzero storage offset (from the page aligned pointer returned by CUDA malloc) that isn't 16-byte aligned which would result in a pointer that is not 16-byte aligned even if the number of channels is a multiple of 4. In this case, I think we need to assert on the base pointer rather than solely the number of channels themselves.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
@swahtz

swahtz commented Jan 26, 2026

Copy link
Copy Markdown
Contributor Author

Overall, I think this is a step in the right direction but I think the alignment assumptions could be more explicit and/or strictly enforced.

  1. TensorAccessor could be non-contiguous in this case which means the reinterpret_cast would lead to a float4 that is ill-formed. One option here would be to change the kernel from using a TensorAccessor argument to use a raw pointer which is then wrapped with __builtin_assume_aligned. This would allow the compiler to automatically vectorize the load/stores and other instructions which might save on the manual unrolling as well. However, this is a bit of a departure from a readability standpoint, so open to discussion. Alternatively, we could call contiguous() on the vectorized inputs/outputs beforehand, pass the resulting accessor, and explicitly use __buildin_assume_aligned prior to the reinterpret_cast for readability and compiler hinting.
  2. TensorAccessor could have a nonzero storage offset (from the page aligned pointer returned by CUDA malloc) that isn't 16-byte aligned which would result in a pointer that is not 16-byte aligned even if the number of channels is a multiple of 4. In this case, I think we need to assert on the base pointer rather than solely the number of channels themselves.

Thanks @matthewdcong for having a look and catching that. I've implemented the second option to point 1 (that seemed the most consistent) and the 16-byte alignment check on the base pointer of point 2 in the SampleGridTrilinear.cu file as a set of proposed changes before I go and implement this everywhere. Let me know what you think and can just roll out the same approach everywhere.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Comment thread src/fvdb/detail/ops/SampleGridTrilinear.cu Outdated
Comment thread src/fvdb/detail/ops/SampleGridTrilinear.cu Outdated
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

@matthewdcong matthewdcong left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for the changes!

@swahtz swahtz enabled auto-merge (squash) January 27, 2026 00:27
@swahtz swahtz merged commit e37c0d1 into openvdb:main Jan 27, 2026
32 checks passed
@swahtz swahtz deleted the js/trilinear_sample_float4 branch January 27, 2026 00:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core library Core fVDB library. i.e. anything in the _Cpp module (C++) or fvdb python module optimization Performance or memory optimization ReCap/Segmentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants