Optimizations for volume_render and move its autograd layer to Python #640

Merged
swahtz merged 7 commits into openvdb:main from swahtz:js/volume_render_py_autograd_and_optimize
Apr 27, 2026
@swahtz swahtz commented Apr 24, 2026

Contributor

Rewrite the volume-render forward pass around per-ray register accumulation plus a dedicated inference fast path, and move the backward plumbing out of C++ autograd into a Python torch.autograd.Function wrapper to match our other functional ops.

  • volumeRenderFwdCallback now accumulates rgb, depth, and opacity into per-thread registers (a stack-allocated accRgb[MAX_VOLUME_RENDER_CHANNELS] with #pragma unroll plus scalars for depth/opacity) and writes each per-ray output exactly once, eliminating the per-sample global-memory read-modify-write on outRGB / outDepth / outOpacity. Output tensors are now allocated with torch::empty (plus torch::zeros for outWs on the training path, since backward's thrust::inclusive_scan reads into the early-termination tail).

  • Add a compile-time NeedsBackward template parameter on volumeRenderFwdCallback, volumeRenderCPU, and the volumeRender CUDA kernel. When false the kernel skips the per-sample outWs store and the per-ray outDepth / outTotalSamples stores, and the host returns size-0 placeholder tensors with matching dtype/device. The dispatch layer picks the specialization via a tiny std::true_type / std::false_type tag-dispatch lambda inside AT_DISPATCH_V2.

  • Replace the public fvdb::volumeRender with fvdb::volumeRenderForward / fvdb::volumeRenderBackward in FVDB.h / FVDB.cpp, and bind them to Python as _fvdb_cpp.volume_render_fwd / volume_render_bwd in src/python/Bindings.cpp (with updated _fvdb_cpp.pyi stubs).

  • Delete src/fvdb/detail/autograd/VolumeRender.{cpp,h} (and its CMakeLists.txt entry) and implement the autograd layer in Python (fvdb/_volume_render.py), following the same torch.autograd.Function pattern as _gaussian_autograd.py. The new volume_render wrapper routes through _VolumeRenderFn.apply only when grad is enabled and at least one differentiable input has requires_grad=True; otherwise it calls _fvdb_cpp.volume_render_fwd(..., needs_backward=False) directly and returns size-0 placeholders for depth / ws / total_samples. fvdb/__init__.py / __init__.pyi now re-export volume_render from this new module instead of _fvdb_cpp.

  • Add full Doxygen docstrings for both volumeRenderForward / volumeRenderBackward in the public FVDB.h and their detail::ops counterparts in detail/ops/VolumeRender.h, documenting all parameters, shapes, the early-termination condition (T <= transmittanceThresh), the channel-count limit (1 <= C <= MAX_VOLUME_RENDER_CHANNELS), and the inference fast path's size-0 output behavior.

  • Update the one ws / total_samples-asserting test (test_volume_render_total_samples_counts_terminating_sample) to call sigmas.requires_grad_(True) so it continues to exercise the backward-aware path now that the no-grad path returns size-0 placeholders for those tensors.
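
The first two bullets can be made concrete with a pure-Python reference of the per-ray loop. This is a sketch under stated assumptions, not the fVDB kernel: the function name `volume_render_reference`, the list-of-rays input layout, and the dict of outputs are hypothetical, and the per-sample formula (alpha = 1 - exp(-sigma * delta_t), weight = alpha * T) is the standard emission-absorption quadrature, which the PR text does not spell out. Local variables stand in for registers, each per-ray output is written once, and `needs_backward=False` skips `ws`, `depth`, and `total_samples`, leaving empty lists as the size-0 placeholders.

```python
import math


def volume_render_reference(rays, num_channels=3, transmittance_thresh=1e-4,
                            needs_backward=True):
    """Hypothetical reference loop. `rays` is a list of rays; each ray is a
    list of (sigma, rgb, delta_t, t) samples, with rgb a list of floats."""
    out = {"rgb": [], "opacity": [], "depth": [], "ws": [], "total_samples": []}
    for samples in rays:
        # Per-ray accumulators: plain locals play the role of registers.
        acc_rgb = [0.0] * num_channels
        acc_depth = 0.0
        acc_opacity = 0.0
        transmittance = 1.0
        # Zero-filled so the tail after early termination is well defined.
        ray_ws = [0.0] * len(samples) if needs_backward else None
        num_used = 0
        for i, (sigma, rgb, delta_t, t) in enumerate(samples):
            alpha = 1.0 - math.exp(-sigma * delta_t)
            weight = alpha * transmittance
            for c in range(num_channels):
                acc_rgb[c] += weight * rgb[c]
            acc_depth += weight * t
            acc_opacity += weight
            transmittance *= 1.0 - alpha
            num_used += 1  # the terminating sample is counted
            if needs_backward:
                ray_ws[i] = weight
            if transmittance <= transmittance_thresh:
                break
        # Each per-ray output is written exactly once.
        out["rgb"].append(acc_rgb)
        out["opacity"].append(acc_opacity)
        if needs_backward:
            out["depth"].append(acc_depth)
            out["ws"].extend(ray_ws)
            out["total_samples"].append(num_used)
    return out


# One ray whose second sample is dense enough to trigger early termination.
ray = [
    (1.0, [1.0, 0.0, 0.0], 1.0, 0.5),
    (1000.0, [0.0, 1.0, 0.0], 1.0, 1.5),
    (1.0, [0.0, 0.0, 1.0], 1.0, 2.5),
]
train = volume_render_reference([ray], needs_backward=True)
infer = volume_render_reference([ray], needs_backward=False)
```

In this sketch `infer["rgb"]` and `infer["opacity"]` match the training-path values, while `infer["depth"]`, `infer["ws"]`, and `infer["total_samples"]` stay empty, which mirrors the behavior the bullets describe.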

Performance (fire renderer, 256 spp):

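| stage                  |     baseline | + register accum | + inference fast path |
|------------------------|-------------:|-----------------:|----------------------:|
| volume_render, total   |    8772.5 ms |        6171.4 ms |             2127.0 ms |
| volume_render, avg/spp |     34.27 ms |         24.11 ms |               8.31 ms |
| share of total runtime |        60.7% |            51.9% |                 27.1% |
| total render walltime  |   14461.7 ms |       11887.0 ms |             7862.8 ms |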

Stage-on-stage:

  • Register accumulation (Stage 1): -29.7% on volume_render and -17.8% on end-to-end walltime (2.57 s saved).
  • Inference fast path (Stage 2): a further ~2.9x on volume_render (6171 ms -> 2127 ms) -- the per-sample outWs store and the accompanying host-side zero-init of the N-sized buffer were the single largest remaining source of global-memory traffic in the stage.
  • Combined: ~4.1x on the volume_render stage and ~1.84x end-to-end (14.5 s -> 7.86 s) vs. the pre-optimization baseline. volume_render's share of total runtime dropped from 60.7% to 27.1%, handing the "top bottleneck" title to sample_trilinear (which is then addressed in Add Vec2 and double fast paths to SampleGridTrilinear #639).
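
The stage-on-stage figures can be re-derived from the timings in the table above. The short script below is only a worked-arithmetic sketch; the variable and function names are illustrative, and the inputs are the millisecond values copied from the table.

```python
# Timings in milliseconds, copied from the performance table (fire renderer, 256 spp).
volume_render_ms = {"baseline": 8772.5, "register_accum": 6171.4, "fast_path": 2127.0}
walltime_ms = {"baseline": 14461.7, "register_accum": 11887.0, "fast_path": 7862.8}


def reduction_pct(before, after):
    """Percentage of time removed going from `before` to `after`."""
    return 100.0 * (before - after) / before


def speedup(before, after):
    """Multiplicative speedup going from `before` to `after`."""
    return before / after


# Stage 1: register accumulation versus baseline.
stage1_volume_render_pct = reduction_pct(volume_render_ms["baseline"],
                                         volume_render_ms["register_accum"])
stage1_walltime_pct = reduction_pct(walltime_ms["baseline"],
                                    walltime_ms["register_accum"])

# Stage 2: inference fast path versus register accumulation.
stage2_speedup = speedup(volume_render_ms["register_accum"],
                         volume_render_ms["fast_path"])

# Combined: inference fast path versus baseline.
combined_speedup = speedup(volume_render_ms["baseline"], volume_render_ms["fast_path"])
end_to_end_speedup = speedup(walltime_ms["baseline"], walltime_ms["fast_path"])

# Share of total runtime spent in volume_render at each stage.
share = {k: 100.0 * volume_render_ms[k] / walltime_ms[k] for k in volume_render_ms}

print(f"stage 1: -{stage1_volume_render_pct:.1f}% volume_render, "
      f"-{stage1_walltime_pct:.1f}% walltime")
print(f"stage 2: {stage2_speedup:.2f}x volume_render")
print(f"combined: {combined_speedup:.2f}x volume_render, "
      f"{end_to_end_speedup:.2f}x end-to-end")
print({k: round(v, 1) for k, v in share.items()})
```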

Correctness is unchanged on the backward-aware path: the Python wrapper routes through the CUDA forward with needsBackward=true whenever any input requires grad, so saved tensors and the backward kernel are identical to before. The inference fast path is observationally equivalent for callers that consume only rgb / opacity.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
@swahtz swahtz added this to the v0.5 milestone Apr 24, 2026
@swahtz swahtz requested a review from a team as a code owner April 24, 2026 08:04
@swahtz swahtz requested review from blackencino and sifakis April 24, 2026 08:04
@swahtz swahtz added the optimization (performance or memory optimization) and core library (core fVDB library, i.e. anything in the _Cpp module (C++) or the fvdb Python module) labels Apr 24, 2026
@swahtz swahtz requested a review from Copilot April 24, 2026 08:06
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Copilot AI left a comment

Contributor

Pull request overview

This PR optimizes volume_render by reducing per-sample global-memory traffic in the forward pass and moves the autograd layer from C++ into a Python torch.autograd.Function, adding an inference fast path that skips backward-only outputs.

Changes:

  • Reworked CUDA/CPU forward to accumulate rgb/depth/opacity per-ray in registers and write outputs once per ray, plus a needsBackward specialization to skip ws/depth/total_samples on inference.
  • Replaced the public C++ API/bindings from a single volume_render entrypoint to explicit forward/backward bindings (volume_render_fwd / volume_render_bwd) and implemented autograd in fvdb/_volume_render.py.
  • Updated Python exports/stubs and adjusted the unit test to force the backward-aware path where ws/total_samples are asserted.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

| File | Description |
|------|-------------|
| `tests/unit/test_ray_marching.py` | Forces backward-aware execution so ws/total_samples remain materialized for the assertion. |
| `src/python/Bindings.cpp` | Replaces single binding with explicit volume_render_fwd / volume_render_bwd. |
| `src/fvdb/detail/ops/VolumeRender.h` | Documents and extends forward API with needsBackward; adds backward docs. |
| `src/fvdb/detail/ops/VolumeRender.cu` | Implements per-ray register accumulation and NeedsBackward specialization with size-0 placeholders on inference. |
| `src/fvdb/detail/autograd/VolumeRender.h` | Deleted C++ autograd layer header. |
| `src/fvdb/detail/autograd/VolumeRender.cpp` | Deleted C++ autograd layer implementation. |
| `src/fvdb/FVDB.h` | Replaces public API with volumeRenderForward / volumeRenderBackward and adds Doxygen docs. |
| `src/fvdb/FVDB.cpp` | Routes forward/backward to raw ops; removes C++ autograd plumbing. |
| `src/CMakeLists.txt` | Removes the deleted C++ autograd source from the build. |
| `fvdb/_volume_render.py` | Adds Python autograd wrapper and inference fast path routing logic. |
| `fvdb/_fvdb_cpp.pyi` | Updates stubs for the new forward/backward bindings. |
| `fvdb/__init__.pyi` | Re-exports volume_render from the new Python wrapper module. |
| `fvdb/__init__.py` | Re-exports volume_render from the new Python wrapper module. |

Comment thread fvdb/_volume_render.py
Comment thread src/fvdb/detail/ops/VolumeRender.cu Outdated
- Updated the backward method to accept None for any of the gradient outputs, ensuring compatibility with autograd's behavior when gradients are not required for certain outputs.
- Introduced a helper function to coerce None gradients into zero tensors with matching shape, dtype, and device.
- Added a regression test to verify the correct handling of None gradients, ensuring that the output gradients match expected values when some inputs are not used in the loss.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
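
The None-coercion helper described in this commit can be sketched without PyTorch. The version below is a stand-in under loud assumptions: nested Python lists play the role of tensors, and the names `zeros_like_nested` and `coerce_grad` are hypothetical. The actual helper also matches dtype and device, which plain lists cannot express.

```python
def zeros_like_nested(ref):
    """Return a zero-filled copy of a (possibly nested) list of floats."""
    if isinstance(ref, list):
        return [zeros_like_nested(x) for x in ref]
    return 0.0


def coerce_grad(grad, ref):
    """Replace a missing (None) upstream gradient with zeros shaped like `ref`."""
    return zeros_like_nested(ref) if grad is None else grad


# Forward outputs for two rays: rgb is 2 x 3, opacity has one value per ray.
rgb_out = [[0.2, 0.4, 0.6], [0.1, 0.1, 0.1]]
opacity_out = [0.9, 0.3]

# Suppose the loss only used rgb, so no gradient arrives for opacity.
grad_rgb = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
grad_opacity = None

# The backward call can now receive dense inputs for every output.
dense_grad_rgb = coerce_grad(grad_rgb, rgb_out)
dense_grad_opacity = coerce_grad(grad_opacity, opacity_out)
```

Gradients that are already present pass through untouched, so the coercion only affects outputs that did not contribute to the loss.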

@harrism harrism left a comment

Contributor

A few minor comments. Also Claude noticed:

Tutorial doc is now silently inference-incompatible: docs/tutorials/volume_rendering.md invokes volume_render and uses depth downstream. Worth either:

  • A one-line comment near line 403 noting that depth is only populated when at least one of ray_density/ray_color has requires_grad=True, or
  • Updating the tutorial's volume_render_func to assert/raise if it's invoked without grad and depth is needed.
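
The second option could look roughly like the guard below. It is a sketch only: `require_depth` and its `num_rays` parameter are hypothetical names, and it assumes the size-0 placeholder reports a length of zero (true for a Python list, and presumably for a one-dimensional tensor).

```python
def require_depth(depth, num_rays):
    """Return `depth` unchanged, or raise if it is an empty placeholder.

    `depth` may be any sized container: a list in this sketch, presumably a
    1-D tensor in the tutorial. Both names are hypothetical.
    """
    if num_rays > 0 and len(depth) == 0:
        raise RuntimeError(
            "volume_render returned a size-0 depth placeholder; set "
            "requires_grad=True on ray_density or ray_color to populate depth."
        )
    return depth


# A populated depth buffer passes straight through.
depth = require_depth([1.5, 2.0], num_rays=2)
```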

Comment thread tests/unit/test_ray_marching.py
Comment thread fvdb/_volume_render.py Outdated
Comment thread src/fvdb/detail/ops/VolumeRender.cu Outdated
Comment thread src/fvdb/FVDB.h
swahtz and others added 3 commits April 27, 2026 00:55
Co-authored-by: Mark Harris <mharris@nvidia.com>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
- Updated the documentation in `_volume_render.py` to clarify the conditions under which `sigmas` and `rgbs` require gradients, and specified that `delta_ts` and `ts` are non-differentiable inputs.
- Adjusted the logic to ensure that the backward path is selected correctly based on the differentiability of `sigmas` and `rgbs`, while explicitly noting that setting `requires_grad` on `delta_ts` or `ts` does not influence this selection.
- Modified the `VolumeRender.h` header to remove the default value for `needsBackward`, aligning with the updated logic.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
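
The path-selection rule described in this commit reduces to a small predicate. The sketch below uses `SimpleNamespace` objects as stand-ins for tensors, and the function name `needs_backward_path` is hypothetical; in the real wrapper the `grad_enabled` flag would presumably come from `torch.is_grad_enabled()`.

```python
from types import SimpleNamespace


def needs_backward_path(grad_enabled, sigmas, rgbs):
    """True when the autograd (backward-aware) path should be taken.

    Only `sigmas` and `rgbs` are consulted; `delta_ts` and `ts` are treated as
    non-differentiable, so their `requires_grad` flags are deliberately ignored.
    """
    return bool(grad_enabled and (sigmas.requires_grad or rgbs.requires_grad))


# Stand-ins for tensors: only the `requires_grad` attribute matters here.
with_grad = SimpleNamespace(requires_grad=True)
no_grad = SimpleNamespace(requires_grad=False)

training = needs_backward_path(True, with_grad, no_grad)          # backward-aware path
inference = needs_backward_path(True, no_grad, no_grad)           # inference fast path
under_no_grad = needs_backward_path(False, with_grad, with_grad)  # inference fast path
```
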
- Introduced a new test method `test_volume_render_inference_fast_path` in `TestVolumeRender` to validate the behavior of the volume_render function when bypassing the autograd graph.
- The test checks output shapes and values for different scenarios where gradients are not required, ensuring that the fast path returns size-0 placeholders for certain outputs while maintaining correct rgb and opacity values.
- This addition enhances coverage for the volume_render functionality, particularly focusing on performance optimizations and correctness in inference scenarios.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.


Comment thread fvdb/_volume_render.py
Comment thread tests/unit/test_ray_marching.py Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated no new comments.


@swahtz swahtz merged commit a044411 into openvdb:main Apr 27, 2026
43 checks passed
@swahtz swahtz deleted the js/volume_render_py_autograd_and_optimize branch April 27, 2026 23:37