SampleGridTrilinear: Vectorized float4 loads#430
Conversation
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
There was a problem hiding this comment.
Pull request overview
This PR optimizes trilinear sampling operations by implementing vectorized float4 load/store operations for float32 data types with channel counts that are multiples of 4 (and >= 4) on GPU devices. The optimization applies to forward pass, backward pass with gradients, and splat operations, reportedly achieving a 320% speedup for the forward pass and 15% speedup for the backward pass.
Changes:
- Implements vectorized float4 callbacks for
SampleGridTrilinear,SampleGridTrilinearWithGrad,SampleGridTrilinearWithGradBackward, andSplatIntoGridTrilinearoperations - Adds comprehensive test coverage for various channel configurations (1, 3, 4, 5, 16) to ensure correctness across edge cases
- Fixes test issues by removing problematic
.squeeze()calls that could fail for single-channel cases
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/test_sample.py | Adds test combinations for different channel counts (1, 3, 4, 5, 16) and fixes test issues by properly handling tensor shapes without .squeeze() |
| src/fvdb/detail/ops/SampleGridTrilinear.cu | Implements vectorized float4 forward pass for trilinear sampling, removes unused iostream include |
| src/fvdb/detail/ops/SampleGridTrilinearWithGrad.cu | Implements vectorized float4 forward pass with gradient computation |
| src/fvdb/detail/ops/SampleGridTrilinearWithGradBackward.cu | Implements vectorized float4 backward pass for gradients |
| src/fvdb/detail/ops/SplatIntoGridTrilinear.cu | Implements vectorized float4 reads for splat operations with scalar atomic writes |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
There was a problem hiding this comment.
Overall, I think this is a step in the right direction but I think the alignment assumptions could be more explicit and/or strictly enforced.
TensorAccessorcould be non-contiguous in this case which means thereinterpret_castwould lead to afloat4that is ill-formed. One option here would be to change the kernel from using aTensorAccessorargument to use a raw pointer which is then wrapped with__builtin_assume_aligned. This would allow the compiler to automatically vectorize the load/stores and other instructions which might save on the manual unrolling as well. However, this is a bit of a departure from a readability standpoint, so open to discussion. Alternatively, we could callcontiguous()on the vectorized inputs/outputs beforehand, pass the resulting accessor, and explicitly use__buildin_assume_alignedprior to thereinterpret_castfor readability and compiler hinting.TensorAccessorcould have a nonzero storage offset (from the page aligned pointer returned by CUDA malloc) that isn't 16-byte aligned which would result in a pointer that is not 16-byte aligned even if the number of channels is a multiple of 4. In this case, I think we need to assert on the base pointer rather than solely the number of channels themselves.
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Thanks @matthewdcong for having a look and catching that. I've implemented the second option to point 1 (that seemed the most consistent) and the 16-byte alignment check on the base pointer of point 2 in the |
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
matthewdcong
left a comment
There was a problem hiding this comment.
Looks good, thanks for the changes!
This PR optimizes
SampleGridTrilinear(and other related trilinear sampling methods) to perform vectorized float4 loads/stores for cases where we're operating on float feature channels that are a multiple of 4 (for a number of channels >=4) and when on the GPU.This greatly improves memory bandwidth utilization; on my Ada RTX 6000 I see a 320% speedup for the forward pass and more modest 15% speedup for the backward pass. This is on a test of 8.5 million voxels with a feature channel size of 192, sampling 1.6 million points (totals across a batch size of 16). This channel configuration is a common number we use in segmentation and I think this speedup would be beneficial in other use cases as well.
I've also added test combos for a number of channel configurations and fixed test issues that came up as part of that change.