Optimizations for volume_render and move its autograd layer to Python #640
Rewrite the volume-render forward pass around per-ray register
accumulation plus a dedicated inference fast path, and move the
backward plumbing out of C++ autograd into a Python
torch.autograd.Function wrapper to match our other functional ops.
- volumeRenderFwdCallback now accumulates `rgb`, `depth`, and `opacity`
into per-thread registers (a stack-allocated
`accRgb[MAX_VOLUME_RENDER_CHANNELS]` with `#pragma unroll` plus
scalars for depth/opacity) and writes each per-ray output exactly
once, eliminating the per-sample global-memory read-modify-write on
`outRGB` / `outDepth` / `outOpacity`. Output tensors are now
allocated with `torch::empty` (plus `torch::zeros` for `outWs` on
the training path, since backward's `thrust::inclusive_scan` reads
into the early-termination tail).
- Add a compile-time `NeedsBackward` template parameter on
`volumeRenderFwdCallback`, `volumeRenderCPU`, and the
`volumeRender` CUDA kernel. When false the kernel skips the
per-sample `outWs` store and the per-ray `outDepth` /
`outTotalSamples` stores, and the host returns size-0 placeholder
tensors with matching dtype/device. The dispatch layer picks the
specialization via a tiny `std::true_type` / `std::false_type`
tag-dispatch lambda inside `AT_DISPATCH_V2`.
- Replace the public `fvdb::volumeRender` with
`fvdb::volumeRenderForward` / `fvdb::volumeRenderBackward` in
`FVDB.h` / `FVDB.cpp`, and bind them to Python as
`_fvdb_cpp.volume_render_fwd` / `volume_render_bwd` in
`src/python/Bindings.cpp` (with updated `_fvdb_cpp.pyi` stubs).
- Delete `src/fvdb/detail/autograd/VolumeRender.{cpp,h}` (and its
`CMakeLists.txt` entry) and implement the autograd layer in Python
(`fvdb/_volume_render.py`), following the same
`torch.autograd.Function` pattern as `_gaussian_autograd.py`. The
new `volume_render` wrapper routes through `_VolumeRenderFn.apply`
only when grad is enabled and at least one differentiable input
has `requires_grad=True`; otherwise it calls
`_fvdb_cpp.volume_render_fwd(..., needs_backward=False)` directly
and returns size-0 placeholders for `depth` / `ws` /
`total_samples`. `fvdb/__init__.py` / `__init__.pyi` now re-export
`volume_render` from this new module instead of `_fvdb_cpp`.
- Add full Doxygen docstrings for both `volumeRenderForward` /
`volumeRenderBackward` in the public `FVDB.h` and their
`detail::ops` counterparts in `detail/ops/VolumeRender.h`,
documenting all parameters, shapes, the early-termination
condition (`T <= transmittanceThresh`), the channel-count limit
(`1 <= C <= MAX_VOLUME_RENDER_CHANNELS`), and the inference fast
path's size-0 output behavior.
- Update the one `ws` / `total_samples`-asserting test
(`test_volume_render_total_samples_counts_terminating_sample`) to
call `sigmas.requires_grad_(True)` so it continues to exercise
the backward-aware path now that the no-grad path returns size-0
placeholders for those tensors.
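To make the register-accumulation and `NeedsBackward` ideas above concrete, here is a minimal pure-Python sketch of a single ray. It is not the fVDB kernel: the function names, the dict-based outputs, and the use of `None` / `[]` as stand-ins for size-0 placeholder tensors are all hypothetical, and the compositing rule (`alpha = 1 - exp(-sigma * dt)`, `w = alpha * T`) is the standard emission-absorption quadrature, assumed here rather than taken from the PR. Only the early-termination test `T <= transmittanceThresh` and the "write once per ray" structure come from the description.

```python
import math


def render_ray_rmw(sigmas, rgbs, delta_ts, ts, thresh):
    """Baseline pattern: update the per-ray output slots once per sample."""
    out = {"rgb": [0.0] * len(rgbs[0]), "depth": 0.0, "opacity": 0.0}
    output_updates = 0
    transmittance = 1.0
    for sigma, rgb, dt, t in zip(sigmas, rgbs, delta_ts, ts):
        alpha = 1.0 - math.exp(-sigma * dt)
        w = alpha * transmittance
        for c, value in enumerate(rgb):
            out["rgb"][c] += w * value  # stands in for a global-memory read-modify-write
        out["depth"] += w * t
        out["opacity"] += w
        output_updates += 1
        transmittance *= 1.0 - alpha
        if transmittance <= thresh:  # early termination
            break
    return out, output_updates


def render_ray_registers(sigmas, rgbs, delta_ts, ts, thresh, needs_backward):
    """Accumulate in locals ("registers") and publish each output once."""
    acc_rgb = [0.0] * len(rgbs[0])
    acc_depth = 0.0
    acc_opacity = 0.0
    # Zero-initialised so the tail past early termination stays 0 (training path only).
    ws = [0.0] * len(sigmas) if needs_backward else []
    total_samples = 0
    transmittance = 1.0
    for i, (sigma, rgb, dt, t) in enumerate(zip(sigmas, rgbs, delta_ts, ts)):
        alpha = 1.0 - math.exp(-sigma * dt)
        w = alpha * transmittance
        for c, value in enumerate(rgb):
            acc_rgb[c] += w * value
        acc_opacity += w
        if needs_backward:  # backward-only work is skipped on the fast path
            acc_depth += w * t
            ws[i] = w
        total_samples += 1  # the terminating sample is counted
        transmittance *= 1.0 - alpha
        if transmittance <= thresh:
            break
    out = {"rgb": acc_rgb, "opacity": acc_opacity, "ws": ws}
    out["depth"] = acc_depth if needs_backward else None  # None ~ size-0 placeholder
    out["total_samples"] = total_samples if needs_backward else None
    return out, 1  # one output write per ray


# Hypothetical 4-sample ray; the third sample is dense enough to terminate the ray.
sigmas = [0.5, 2.0, 50.0, 1.0]
rgbs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
delta_ts = [1.0, 1.0, 1.0, 1.0]
ts = [0.5, 1.5, 2.5, 3.5]

baseline, baseline_updates = render_ray_rmw(sigmas, rgbs, delta_ts, ts, 1e-4)
train, train_updates = render_ray_registers(sigmas, rgbs, delta_ts, ts, 1e-4, True)
fast, fast_updates = render_ray_registers(sigmas, rgbs, delta_ts, ts, 1e-4, False)
```

In this toy model the three variants agree on `rgb` and `opacity`, while the number of output updates per ray drops from one per visited sample to one in total. Whether the real fast path still accumulates depth internally is not stated in the description; the sketch simply skips it.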
Performance (fire renderer, 256 spp):
| stage | baseline | + register accum | + inference fast path |
|------------------------|-------------:|-----------------:|----------------------:|
| volume_render, total | 8772.5 ms | 6171.4 ms | 2127.0 ms |
| volume_render, avg/spp | 34.27 ms | 24.11 ms | 8.31 ms |
| share of total runtime | 60.7% | 51.9% | 27.1% |
| total render walltime | 14461.7 ms | 11887.0 ms | 7862.8 ms |
Stage-on-stage:
- Register accumulation (Stage 1): -29.7% on volume_render and
-17.8% on end-to-end walltime (2.57 s saved).
- Inference fast path (Stage 2): a further ~2.9x on volume_render
(6171 ms -> 2127 ms) -- the per-sample outWs store and the
accompanying host-side zero-init of the N-sized buffer were the
single largest remaining source of global-memory traffic in the
stage.
- Combined: ~4.1x on the volume_render stage and ~1.84x end-to-end
(14.5 s -> 7.86 s) vs. the pre-optimization baseline.
volume_render's share of total runtime dropped from 60.7% to
27.1%, handing the "top bottleneck" title to sample_trilinear
(which was then addressed in the preceding commit on this branch).
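The stage-on-stage figures can be recomputed directly from the table. The short script below does only that arithmetic; the dictionary keys and helper names are made up for illustration, and the timings are copied verbatim from the table.

```python
# Timings copied from the performance table (milliseconds).
volume_render_ms = {"baseline": 8772.5, "register_accum": 6171.4, "fast_path": 2127.0}
total_ms = {"baseline": 14461.7, "register_accum": 11887.0, "fast_path": 7862.8}
SPP = 256


def reduction_pct(before, after):
    """Percentage of time removed going from `before` to `after`."""
    return 100.0 * (before - after) / before


def speedup(before, after):
    """Ratio of old time to new time."""
    return before / after


# Stage 1: register accumulation vs. baseline.
stage1_volume_render = reduction_pct(volume_render_ms["baseline"], volume_render_ms["register_accum"])
stage1_end_to_end = reduction_pct(total_ms["baseline"], total_ms["register_accum"])

# Stage 2: inference fast path vs. register accumulation.
stage2_volume_render = speedup(volume_render_ms["register_accum"], volume_render_ms["fast_path"])

# Combined: inference fast path vs. baseline.
combined_volume_render = speedup(volume_render_ms["baseline"], volume_render_ms["fast_path"])
combined_end_to_end = speedup(total_ms["baseline"], total_ms["fast_path"])

# Per-spp averages and share of total runtime, per column.
avg_per_spp = {name: ms / SPP for name, ms in volume_render_ms.items()}
share_pct = {name: 100.0 * volume_render_ms[name] / total_ms[name] for name in total_ms}

for label, value in [
    ("stage 1 volume_render reduction (%)", stage1_volume_render),
    ("stage 1 end-to-end reduction (%)", stage1_end_to_end),
    ("stage 2 volume_render speedup (x)", stage2_volume_render),
    ("combined volume_render speedup (x)", combined_volume_render),
    ("combined end-to-end speedup (x)", combined_end_to_end),
]:
    print(f"{label}: {value:.2f}")
```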
Correctness is unchanged on the backward-aware path: the Python
wrapper routes through the CUDA forward with needsBackward=true
whenever any input requires grad, so saved tensors and the backward
kernel are identical to before. The inference fast path is
observationally equivalent for callers that consume only rgb /
opacity.
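The routing rule can be modelled without PyTorch. The sketch below is a hypothetical stand-in, not the code in `fvdb/_volume_render.py`: `TensorStub`, `needs_backward`, and `choose_path` are invented names, and `grad_enabled` is passed in explicitly (in a real wrapper it would presumably come from `torch.is_grad_enabled()`). It follows the refinement made later in this PR, where only `sigmas` and `rgbs` count as differentiable inputs and `requires_grad` on `delta_ts` or `ts` does not affect path selection.

```python
from dataclasses import dataclass


@dataclass
class TensorStub:
    """Hypothetical stand-in for a tensor; only `requires_grad` is modelled."""

    requires_grad: bool = False


def needs_backward(grad_enabled, sigmas, rgbs, delta_ts, ts):
    """True when the backward-aware forward should be used."""
    # delta_ts and ts are accepted but deliberately ignored: they are
    # non-differentiable inputs in this model.
    return grad_enabled and (sigmas.requires_grad or rgbs.requires_grad)


def choose_path(grad_enabled, sigmas, rgbs, delta_ts, ts):
    """Name the branch the wrapper would take."""
    if needs_backward(grad_enabled, sigmas, rgbs, delta_ts, ts):
        return "autograd"  # saved tensors + backward kernel
    return "inference_fast_path"  # size-0 depth / ws / total_samples


training = choose_path(True, TensorStub(True), TensorStub(), TensorStub(), TensorStub())
no_grad = choose_path(False, TensorStub(True), TensorStub(True), TensorStub(), TensorStub())
only_ts = choose_path(True, TensorStub(), TensorStub(), TensorStub(True), TensorStub(True))
```

Callers that need `depth`, `ws`, or `total_samples` therefore have to land on the `"autograd"` branch, which is why the updated unit test calls `sigmas.requires_grad_(True)`.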
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Pull request overview
This PR optimizes volume_render by reducing per-sample global-memory traffic in the forward pass and moves the autograd layer from C++ into a Python torch.autograd.Function, adding an inference fast path that skips backward-only outputs.
Changes:
- Reworked the CUDA/CPU forward to accumulate `rgb`/`depth`/`opacity` per ray in registers and write outputs once per ray, plus a `needsBackward` specialization to skip `ws`/`depth`/`total_samples` on inference.
- Replaced the single `volume_render` entrypoint in the public C++ API/bindings with explicit forward/backward bindings (`volume_render_fwd`/`volume_render_bwd`) and implemented autograd in `fvdb/_volume_render.py`.
- Updated Python exports/stubs and adjusted the unit test to force the backward-aware path where `ws`/`total_samples` are asserted.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/unit/test_ray_marching.py | Forces backward-aware execution so ws/total_samples remain materialized for the assertion. |
| src/python/Bindings.cpp | Replaces single binding with explicit volume_render_fwd / volume_render_bwd. |
| src/fvdb/detail/ops/VolumeRender.h | Documents and extends forward API with needsBackward; adds backward docs. |
| src/fvdb/detail/ops/VolumeRender.cu | Implements per-ray register accumulation and NeedsBackward specialization with size-0 placeholders on inference. |
| src/fvdb/detail/autograd/VolumeRender.h | Deleted C++ autograd layer header. |
| src/fvdb/detail/autograd/VolumeRender.cpp | Deleted C++ autograd layer implementation. |
| src/fvdb/FVDB.h | Replaces public API with volumeRenderForward / volumeRenderBackward and adds Doxygen docs. |
| src/fvdb/FVDB.cpp | Routes forward/backward to raw ops; removes C++ autograd plumbing. |
| src/CMakeLists.txt | Removes the deleted C++ autograd source from the build. |
| fvdb/_volume_render.py | Adds Python autograd wrapper and inference fast path routing logic. |
| fvdb/_fvdb_cpp.pyi | Updates stubs for the new forward/backward bindings. |
| `fvdb/__init__.pyi` | Re-exports volume_render from the new Python wrapper module. |
| `fvdb/__init__.py` | Re-exports volume_render from the new Python wrapper module. |
- Updated the backward method to accept None for any of the gradient outputs, ensuring compatibility with autograd's behavior when gradients are not required for certain outputs.
- Introduced a helper function to coerce None gradients into zero tensors with matching shape, dtype, and device.
- Added a regression test to verify the correct handling of None gradients, ensuring that the output gradients match expected values when some inputs are not used in the loss.
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
harrism
left a comment
A few minor comments. Also Claude noticed:
Tutorial doc is now silently inference-incompatible
`docs/tutorials/volume_rendering.md` invokes `volume_render` and uses `depth` downstream. Worth either:
- A one-line comment near line 403 noting that depth is only populated when at least one of ray_density/ray_color has requires_grad=True, or
- Updating the tutorial's volume_render_func to assert/raise if it's invoked without grad and depth is needed.
Co-authored-by: Mark Harris <mharris@nvidia.com>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
- Updated the documentation in `_volume_render.py` to clarify the conditions under which `sigmas` and `rgbs` require gradients, and specified that `delta_ts` and `ts` are non-differentiable inputs.
- Adjusted the logic to ensure that the backward path is selected correctly based on the differentiability of `sigmas` and `rgbs`, while explicitly noting that setting `requires_grad` on `delta_ts` or `ts` does not influence this selection.
- Modified the `VolumeRender.h` header to remove the default value for `needsBackward`, aligning with the updated logic.
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
- Introduced a new test method `test_volume_render_inference_fast_path` in `TestVolumeRender` to validate the behavior of the volume_render function when bypassing the autograd graph.
- The test checks output shapes and values for different scenarios where gradients are not required, ensuring that the fast path returns size-0 placeholders for certain outputs while maintaining correct rgb and opacity values.
- This addition enhances coverage for the volume_render functionality, particularly focusing on performance optimizations and correctness in inference scenarios.
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated no new comments.