Optimizations for `volume_render` and move its autograd layer to Python by swahtz · Pull Request #640 · openvdb/fvdb-core

swahtz · 2026-04-24T08:04:49Z

Rewrite the volume-render forward pass around per-ray register accumulation plus a dedicated inference fast path, and move the backward plumbing out of C++ autograd into a Python torch.autograd.Function wrapper to match our other functional ops.

- `volumeRenderFwdCallback` now accumulates `rgb`, `depth`, and `opacity` into per-thread registers (a stack-allocated `accRgb[MAX_VOLUME_RENDER_CHANNELS]` with `#pragma unroll` plus scalars for depth/opacity) and writes each per-ray output exactly once, eliminating the per-sample global-memory read-modify-write on `outRGB` / `outDepth` / `outOpacity`. Output tensors are now allocated with `torch::empty` (plus `torch::zeros` for `outWs` on the training path, since backward's `thrust::inclusive_scan` reads into the early-termination tail).
- Add a compile-time `NeedsBackward` template parameter on `volumeRenderFwdCallback`, `volumeRenderCPU`, and the `volumeRender` CUDA kernel. When false the kernel skips the per-sample `outWs` store and the per-ray `outDepth` / `outTotalSamples` stores, and the host returns size-0 placeholder tensors with matching dtype/device. The dispatch layer picks the specialization via a tiny `std::true_type` / `std::false_type` tag-dispatch lambda inside `AT_DISPATCH_V2`.
- Replace the public `fvdb::volumeRender` with `fvdb::volumeRenderForward` / `fvdb::volumeRenderBackward` in `FVDB.h` / `FVDB.cpp`, and bind them to Python as `_fvdb_cpp.volume_render_fwd` / `volume_render_bwd` in `src/python/Bindings.cpp` (with updated `_fvdb_cpp.pyi` stubs).
- Delete `src/fvdb/detail/autograd/VolumeRender.{cpp,h}` (and its `CMakeLists.txt` entry) and implement the autograd layer in Python (`fvdb/_volume_render.py`), following the same `torch.autograd.Function` pattern as `_gaussian_autograd.py`. The new `volume_render` wrapper routes through `_VolumeRenderFn.apply` only when grad is enabled and at least one differentiable input has `requires_grad=True`; otherwise it calls `_fvdb_cpp.volume_render_fwd(..., needs_backward=False)` directly and returns size-0 placeholders for `depth` / `ws` / `total_samples`. `fvdb/__init__.py` / `__init__.pyi` now re-export `volume_render` from this new module instead of `_fvdb_cpp`.
- Add full Doxygen docstrings for both `volumeRenderForward` / `volumeRenderBackward` in the public `FVDB.h` and their `detail::ops` counterparts in `detail/ops/VolumeRender.h`, documenting all parameters, shapes, the early-termination condition (`T <= transmittanceThresh`), the channel-count limit (`1 <= C <= MAX_VOLUME_RENDER_CHANNELS`), and the inference fast path's size-0 output behavior.
- Update the one `ws` / `total_samples`-asserting test (`test_volume_render_total_samples_counts_terminating_sample`) to call `sigmas.requires_grad_(True)` so it continues to exercise the backward-aware path now that the no-grad path returns size-0 placeholders for those tensors.
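
For readers who want the accumulation pattern in isolation, here is a rough pure-Python sketch of what the first two bullets describe. It is not the fVDB kernel: `volume_render_sketch` and its dict keys are hypothetical names, plain lists stand in for tensors, and the compositing formula is the usual emission-absorption quadrature, which is assumed rather than taken from `VolumeRender.cu`. It only aims to show local accumulators written out once per ray, the zero-initialised `ws` tail after early termination, and the size-0 placeholders when `needs_backward` is false.

```python
import math


def volume_render_sketch(rays, transmittance_thresh=1e-4, needs_backward=True):
    """Composite each ray with local accumulators and one write per output.

    `rays` is a list of (sigmas, rgbs, delta_ts, ts) tuples, one per ray.
    Hypothetical helper for illustration; not part of fVDB.
    """
    out_rgb, out_opacity = [], []
    out_depth, out_ws, out_total_samples = [], [], []

    for sigmas, rgbs, delta_ts, ts in rays:
        num_channels = len(rgbs[0]) if rgbs else 0
        # Locals play the role of the per-thread registers.
        acc_rgb = [0.0] * num_channels
        acc_depth = 0.0
        acc_opacity = 0.0
        transmittance = 1.0
        samples_used = 0
        # Zero-initialised so the tail after early termination stays 0.
        ws = [0.0] * len(sigmas) if needs_backward else []

        for i, (sigma, rgb, dt, t) in enumerate(zip(sigmas, rgbs, delta_ts, ts)):
            alpha = 1.0 - math.exp(-sigma * dt)  # assumed absorption model
            weight = alpha * transmittance
            for c in range(num_channels):
                acc_rgb[c] += weight * rgb[c]
            acc_depth += weight * t
            acc_opacity += weight
            transmittance *= 1.0 - alpha
            samples_used += 1  # the terminating sample is counted
            if needs_backward:
                ws[i] = weight
            if transmittance <= transmittance_thresh:
                break

        # Single write per ray for every output.
        out_rgb.append(acc_rgb)
        out_opacity.append(acc_opacity)
        if needs_backward:
            out_depth.append(acc_depth)
            out_ws.extend(ws)
            out_total_samples.append(samples_used)

    # With needs_backward=False the last three entries stay empty,
    # mirroring the size-0 placeholders.
    return {
        "rgb": out_rgb,
        "opacity": out_opacity,
        "depth": out_depth,
        "ws": out_ws,
        "total_samples": out_total_samples,
    }
```

Calling it twice on the same rays, once with `needs_backward=True` and once with `False`, should give identical `rgb` and `opacity` and differ only in whether `depth`, `ws`, and `total_samples` are populated.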

Performance (fire renderer, 256 spp):

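| stage                  | baseline   | + register accum | + inference fast path |
|------------------------|-----------:|-----------------:|----------------------:|
| volume_render, total   | 8772.5 ms  | 6171.4 ms        | 2127.0 ms             |
| volume_render, avg/spp | 34.27 ms   | 24.11 ms         | 8.31 ms               |
| share of total runtime | 60.7%      | 51.9%            | 27.1%                 |
| total render walltime  | 14461.7 ms | 11887.0 ms       | 7862.8 ms             |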

Stage-on-stage:

- Register accumulation (Stage 1): -29.7% on volume_render and -17.8% on end-to-end walltime (2.57 s saved).
- Inference fast path (Stage 2): a further ~2.9x on volume_render (6171 ms -> 2127 ms) -- the per-sample outWs store and the accompanying host-side zero-init of the N-sized buffer were the single largest remaining source of global-memory traffic in the stage.
- Combined: ~4.1x on the volume_render stage and ~1.84x end-to-end (14.5 s -> 7.86 s) vs. the pre-optimization baseline. volume_render's share of total runtime dropped from 60.7% to 27.1%, handing the "top bottleneck" title to sample_trilinear (which is then addressed in Add Vec2 and double fast paths to SampleGridTrilinear #639).
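
The derived figures can be recomputed from the eight measured totals in the table. The short script below does that; the dictionary and function names are made up for this illustration, and it assumes "avg/spp" means the stage total divided by the 256 samples per pixel.

```python
# Measured totals copied from the performance table, in milliseconds.
VOLUME_RENDER_MS = {
    "baseline": 8772.5,
    "register accum": 6171.4,
    "inference fast path": 2127.0,
}
WALLTIME_MS = {
    "baseline": 14461.7,
    "register accum": 11887.0,
    "inference fast path": 7862.8,
}
SPP = 256  # samples per pixel used for the fire-renderer run


def percent_change(before, after):
    """Signed change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


def speedup(before, after):
    """How many times faster `after` is than `before`."""
    return before / after


def summarize():
    """Per-stage avg/spp and share of total runtime, derived from the totals."""
    return {
        stage: {
            "avg_per_spp_ms": vr_ms / SPP,
            "share_pct": 100.0 * vr_ms / WALLTIME_MS[stage],
        }
        for stage, vr_ms in VOLUME_RENDER_MS.items()
    }


if __name__ == "__main__":
    for stage, row in summarize().items():
        print(f"{stage}: {row['avg_per_spp_ms']:.2f} ms/spp, "
              f"{row['share_pct']:.1f}% of walltime")
    vr, wall = VOLUME_RENDER_MS, WALLTIME_MS
    print(f"stage 1, volume_render: "
          f"{percent_change(vr['baseline'], vr['register accum']):.1f}%")
    print(f"stage 1, walltime: "
          f"{percent_change(wall['baseline'], wall['register accum']):.1f}%")
    print(f"stage 2, volume_render: "
          f"{speedup(vr['register accum'], vr['inference fast path']):.2f}x")
    print(f"combined, volume_render: "
          f"{speedup(vr['baseline'], vr['inference fast path']):.2f}x")
    print(f"combined, walltime: "
          f"{speedup(wall['baseline'], wall['inference fast path']):.2f}x")
```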

Correctness is unchanged on the backward-aware path: the Python wrapper routes through the CUDA forward with needsBackward=true whenever any input requires grad, so saved tensors and the backward kernel are identical to before. The inference fast path is observationally equivalent for callers that consume only rgb / opacity.


Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Copilot

Pull request overview

This PR optimizes volume_render by reducing per-sample global-memory traffic in the forward pass and moves the autograd layer from C++ into a Python torch.autograd.Function, adding an inference fast path that skips backward-only outputs.

Changes:

- Reworked the CUDA/CPU forward pass to accumulate rgb/depth/opacity per ray in registers and write outputs once per ray, plus a needsBackward specialization that skips ws/depth/total_samples on inference.
- Replaced the single public volume_render entry point in the C++ API and bindings with explicit forward/backward bindings (volume_render_fwd / volume_render_bwd) and implemented autograd in fvdb/_volume_render.py.
- Updated Python exports/stubs and adjusted the unit test to force the backward-aware path where ws/total_samples are asserted.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file

| File | Description |
|------|-------------|
| tests/unit/test_ray_marching.py | Forces backward-aware execution so `ws`/`total_samples` remain materialized for the assertion. |
| src/python/Bindings.cpp | Replaces single binding with explicit `volume_render_fwd` / `volume_render_bwd`. |
| src/fvdb/detail/ops/VolumeRender.h | Documents and extends forward API with `needsBackward`; adds backward docs. |
| src/fvdb/detail/ops/VolumeRender.cu | Implements per-ray register accumulation and `NeedsBackward` specialization with size-0 placeholders on inference. |
| src/fvdb/detail/autograd/VolumeRender.h | Deleted C++ autograd layer header. |
| src/fvdb/detail/autograd/VolumeRender.cpp | Deleted C++ autograd layer implementation. |
| src/fvdb/FVDB.h | Replaces public API with `volumeRenderForward` / `volumeRenderBackward` and adds Doxygen docs. |
| src/fvdb/FVDB.cpp | Routes forward/backward to raw ops; removes C++ autograd plumbing. |
| src/CMakeLists.txt | Removes the deleted C++ autograd source from the build. |
| fvdb/_volume_render.py | Adds Python autograd wrapper and inference fast path routing logic. |
| fvdb/_fvdb_cpp.pyi | Updates stubs for the new forward/backward bindings. |
| fvdb/__init__.pyi | Re-exports `volume_render` from the new Python wrapper module. |
| fvdb/__init__.py | Re-exports `volume_render` from the new Python wrapper module. |

- Updated the backward method to accept None for any of the gradient outputs, ensuring compatibility with autograd's behavior when gradients are not required for certain outputs.
- Introduced a helper function to coerce None gradients into zero tensors with matching shape, dtype, and device.
- Added a regression test to verify the correct handling of None gradients, ensuring that the output gradients match expected values when some inputs are not used in the loss.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
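
The None-gradient coercion described in this commit message could look roughly like the sketch below. It is only an illustration under loud assumptions: `zeros_like_nested` and `coerce_grad` are hypothetical names, nested lists stand in for tensors, and only shape is modelled (dtype and device are not).

```python
def zeros_like_nested(reference):
    """Build zeros with the same nested-list shape as `reference`."""
    if isinstance(reference, list):
        return [zeros_like_nested(item) for item in reference]
    return 0.0


def coerce_grad(grad, reference):
    """Return `grad` unchanged, or zeros shaped like `reference` if it is None.

    Hypothetical stand-in for the helper mentioned above; a tensor version
    would also have to match the reference's dtype and device.
    """
    return zeros_like_nested(reference) if grad is None else grad


# A loss that only uses rgb leaves the other output gradients as None.
rgb_out = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
opacity_out = [0.9, 0.8]
grad_rgb = coerce_grad([[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]], rgb_out)
grad_opacity = coerce_grad(None, opacity_out)
```

After coercion every gradient has a concrete shape, so the backward call can take them all without special-casing missing entries.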

harrism

A few minor comments. Also Claude noticed:

Tutorial doc is now silently inference-incompatible: docs/tutorials/volume_rendering.md invokes volume_render and uses depth downstream. Worth either:

- A one-line comment near line 403 noting that depth is only populated when at least one of ray_density/ray_color has requires_grad=True, or
- Updating the tutorial's volume_render_func to assert/raise if it's invoked without grad and depth is needed.
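
A guard along the lines of the second option might be as small as the sketch below. `require_depth` is a hypothetical name and a plain list stands in for the returned tensor; with a real tensor the emptiness check would be something like `depth.numel() == 0`.

```python
def require_depth(depth):
    """Raise if `depth` is the size-0 placeholder from the inference fast path.

    Hypothetical helper for illustration; a list stands in for the tensor.
    """
    if len(depth) == 0:
        raise RuntimeError(
            "depth is empty: volume_render ran on the inference fast path. "
            "Enable grad on the density or color input if depth is needed."
        )
    return depth
```

Raising early keeps the failure next to the call site instead of surfacing later as a shape mismatch in code that consumes depth.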

Co-authored-by: Mark Harris <mharris@nvidia.com> Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

- Updated the documentation in `_volume_render.py` to clarify the conditions under which `sigmas` and `rgbs` require gradients, and specified that `delta_ts` and `ts` are non-differentiable inputs.
- Adjusted the logic to ensure that the backward path is selected correctly based on the differentiability of `sigmas` and `rgbs`, while explicitly noting that setting `requires_grad` on `delta_ts` or `ts` does not influence this selection.
- Modified the `VolumeRender.h` header to remove the default value for `needsBackward`, aligning with the updated logic.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
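
The selection rule stated in this commit message reduces to a small predicate. The sketch below is an assumption about its shape, not the code in `_volume_render.py`: `TensorInfo` and `select_backward_path` are hypothetical names, and in the real wrapper the `grad_enabled` flag would presumably come from `torch.is_grad_enabled()`.

```python
from dataclasses import dataclass


@dataclass
class TensorInfo:
    """Hypothetical stand-in carrying the only attribute the rule reads."""
    requires_grad: bool = False


def select_backward_path(grad_enabled, sigmas, rgbs, delta_ts, ts):
    """Return True when the backward-aware forward should be used.

    Only `sigmas` and `rgbs` are differentiable, so `delta_ts` and `ts`
    are accepted but deliberately ignored.
    """
    return grad_enabled and (sigmas.requires_grad or rgbs.requires_grad)


# requires_grad on ts alone does not select the backward-aware path.
fast_path = not select_backward_path(
    True, TensorInfo(), TensorInfo(), TensorInfo(), TensorInfo(requires_grad=True)
)
```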

- Introduced a new test method `test_volume_render_inference_fast_path` in `TestVolumeRender` to validate the behavior of the volume_render function when bypassing the autograd graph.
- The test checks output shapes and values for different scenarios where gradients are not required, ensuring that the fast path returns size-0 placeholders for certain outputs while maintaining correct rgb and opacity values.
- This addition enhances coverage for the volume_render functionality, particularly focusing on performance optimizations and correctness in inference scenarios.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Copilot

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Copilot

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated no new comments.

swahtz added this to the v0.5 milestone Apr 24, 2026

swahtz requested a review from a team as a code owner April 24, 2026 08:04

swahtz requested review from blackencino and sifakis April 24, 2026 08:04

swahtz added the optimization and core library labels Apr 24, 2026

swahtz requested a review from Copilot April 24, 2026 08:06

Copilot started reviewing on behalf of swahtz April 24, 2026 08:07

format fix

9e32320

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Copilot AI reviewed Apr 24, 2026

Comment thread fvdb/_volume_render.py

Comment thread src/fvdb/detail/ops/VolumeRender.cu Outdated

harrism reviewed Apr 26, 2026

Comment thread tests/unit/test_ray_marching.py

Comment thread fvdb/_volume_render.py Outdated

Comment thread src/fvdb/detail/ops/VolumeRender.cu Outdated

Comment thread src/fvdb/FVDB.h

swahtz and others added 3 commits April 27, 2026 00:55

Update src/fvdb/detail/ops/VolumeRender.cu

e2716f1

Co-authored-by: Mark Harris <mharris@nvidia.com> Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

swahtz requested a review from Copilot April 26, 2026 13:12

Copilot started reviewing on behalf of swahtz April 26, 2026 13:12

Copilot AI reviewed Apr 26, 2026

Comment thread fvdb/_volume_render.py

Comment thread tests/unit/test_ray_marching.py Outdated

Update tests/unit/test_ray_marching.py

43c06d8

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

swahtz requested a review from Copilot April 27, 2026 01:07

Copilot started reviewing on behalf of swahtz April 27, 2026 01:07

Copilot AI reviewed Apr 27, 2026

harrism approved these changes Apr 27, 2026

swahtz merged commit a044411 into openvdb:main Apr 27, 2026
43 checks passed

swahtz deleted the js/volume_render_py_autograd_and_optimize branch April 27, 2026 23:37

swahtz merged 7 commits into openvdb:main from swahtz:js/volume_render_py_autograd_and_optimize

3 participants
