feat: Full Fabric acceleration stack — local poses, stage cache, fused compose#5728
feat: Full Fabric acceleration stack — local poses, stage cache, fused compose#5728pv-nvidia wants to merge 23 commits into
Conversation
…pagation Add Warp kernels that operate on wp.indexedfabricarray for direct local↔world matrix propagation without round-tripping through USD: - decompose_indexed_fabric_transforms: extract pos/quat/scale from ifa - compose_indexed_fabric_transforms: write pos/quat/scale into ifa - update_indexed_local_matrix_from_world: local = inv(parent) * world - update_indexed_world_matrix_from_local: world = parent * local Also refactor existing kernels to use wp.where (branchless) instead of if/else for broadcast index selection. These kernels are the foundation for Fabric-accelerated get/set_local_poses in FabricFrameView.
…factory
FabricFrameView had an internal _use_fabric flag that fell back to
UsdFrameView when Fabric was disabled or the device was unsupported.
This violated single-responsibility: FabricFrameView pretended to be
one class but sometimes behaved as another.
Now the FrameView factory handles all dispatch:
- PhysX + Fabric enabled + supported device → FabricFrameView
- PhysX without Fabric (or unsupported device) → UsdFrameView
- Newton → NewtonSiteFrameView
FabricFrameView no longer checks _use_fabric or _fabric_supported_devices.
It assumes Fabric is available (the factory guarantees this).
UsdFrameView is eagerly registered on the factory since it lives in
isaaclab (not a backend package), so FactoryBase's dynamic import
(isaaclab_{backend}.sim.views) can't discover it.
Extract _compose_fabric_transform() to deduplicate the kernel-launch logic shared by set_world_poses and set_scales. The initial USD->Fabric sync now composes position, orientation, and scale in one call, so PrepareForReuse is invoked exactly once per logical update. Also replace assert with RuntimeError in _rebuild_fabric_arrays so the topology-change guard survives python -O.
Introduces FabricStageCache — a lightweight cache for the usdrt stage attachment and IFabricHierarchy handles, registered as a service on SimulationContext via the service locator (set_service/get_service). Multiple FabricFrameView instances now share a single hierarchy handle instead of each creating its own. The cache is automatically closed on SimulationContext.clear_instance(). Also replaces the assert device check with a proper RuntimeError and removes the now-unused isaaclab.sim import from fabric_frame_view.py.
# Conflicts: # source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py
- get_local_poses / set_local_poses via indexed fabric kernels - get_scales / set_scales with full Fabric acceleration - Per-view dirty tracking with automatic re-propagation - Topology-change recovery via _rebuild_fabric_arrays - 13 new integration tests (49 total pass)
- Allow FabricFrameView to run on cuda:N for any N; USDRT SelectPrims no longer needs cuda:0. - Refactor the Fabric write path into a single _compose_fabric_transform helper shared by set_world_poses, set_scales, and the initial USD->Fabric sync, collapsing the sync to one kernel launch with one PrepareForReuse. - Replace the topology-invariant assert with RuntimeError so it survives python -O. - Add multi_gpu pytest marker plus cuda:1 unit-test coverage for both Fabric write paths, and run them in the existing test-multi-gpu CI job (one extra step, no new job).
The standard pytest invocation in CI runs the fabric test file without filtering on the ``multi_gpu`` marker, so the ``cuda:1`` tests get scheduled on every runner including the single-GPU ones. Previously ``_skip_if_unavailable`` hard-failed via ``pytest.fail`` whenever ``GITHUB_ACTIONS=true`` and the requested device was missing, on the theory that this would catch a misconfigured multi-GPU runner. In practice it just broke the standard CI: the dedicated ``test-fabric-multi-gpu`` workflow already pre-flights ``torch.cuda.device_count() >= 2`` before invoking pytest, so a genuinely misconfigured multi-GPU runner is already caught there. Always skip rather than fail when the requested ``cuda:N`` index isn't available. Drop the now-unused ``import os``.
Kit's CLI parser reads sys.argv directly at startup and segfaults on
pytest flags that collide with its own short options. Running
pytest -m multi_gpu source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py
crashes during collection because Kit sees ``-m multi_gpu`` and exits
with ``Ill formed parameter: -m`` followed by SIGSEGV (exit code 245)
inside ``simulation_app._start_app``.
Strip sys.argv to argv[0] before instantiating AppLauncher. The test
file takes no CLI arguments of its own, mirroring the broader pattern
used by ``test_tiled_camera_env.py`` which assigns
``sys.argv[1:] = args_cli.unittest_args`` after argparse.
wp.to_torch on a ProxyArray is deprecated in favor of the .torch accessor. Switch the three call sites that consume the ProxyArray returned by get_world_poses; leave get_scales call sites alone since that method still returns a raw wp.array (no .torch accessor).
- Add a GPU-count pre-flight step to the test-fabric-multi-gpu CI job so a runner regression to a single GPU fails the workflow instead of silently skipping every cuda:1 test. This is what the comment in _skip_if_unavailable already promised existed. - Note that the sys.argv strip in test_views_xform_prim_fabric.py must stay between the AppLauncher import and its instantiation; any CLI parser or reordering re-exposes Kit to pytest argv and segfaults at startup. - Document the _fabric_usd_sync_done side effect on _compose_fabric_transform so callers can see why subsequent getters stop pulling from USD.
The class docstring and __init__ device-param doc still claimed ``cuda:0`` only. Refresh both to note that Fabric acceleration runs on any CUDA index, so the autodoc API page reflects the actual contract.
Move the test-fabric-multi-gpu job out of test-multi-gpu.yaml and into a dedicated test-fabric-multi-gpu.yaml. The two workflows share the same runner label, install step, and GPU pre-flight, but trigger on disjoint path sets so changes to FabricFrameView no longer gate the distributed-training validation and vice versa. test-multi-gpu.yaml is now byte-identical to upstream/develop.
No self-hosted runner with the 'multi-gpu' label is registered. All runs queue indefinitely. Kept as workflow_dispatch only so it can be manually triggered once a runner is provisioned. See also .github/workflows/test-multi-gpu.yaml (same issue).
New scripts/benchmarks/benchmark_dexsuite_wrist_camera.py runs the dexsuite-reorient task twice with identical settings except the camera mount (static base vs wrist-mounted under Robot/ee_link/palm_link), records per-env observation-camera mp4s, and reports per-mode FPS for objective comparison of moving-camera quality and performance. Supports single-GPU (--mode both spawns matched base/wrist subprocesses) and multi-GPU (--distributed under torchrun writes per-rank fps CSVs; --mode summarize aggregates totals as a final rank-0-only step). Uses imageio_ffmpeg for video encoding because cv2.VideoWriter mp4v silently drops frames inside Kit's loaded libavcodec context.
The previous _resolve_out_dir reused base_out_dir only when an args.json already lived inside it; otherwise it created a fresh timestamped subdir. That worked for the --mode both orchestrator (which writes args.json up front) but broke multi-GPU local runs: each torchrun --mode base call made its own timestamped subdir, --mode wrist made another, --mode summarize made a third — so summarize never found the per-rank fps CSVs. Now --mode base | wrist | summarize use --out_dir as-is. The orchestrator still wraps a timestamped subdir around the user's path before spawning subprocesses, so single-process --mode both behaves identically. Callers running torchrun directly own picking a unique path, which matches what the experiment shell script already passes.
…flicts) Combined local-poses stack with multi-GPU support. Uses torch.as_tensor() as canonical conversion. Includes multi-GPU tests gated by ISAACLAB_TEST_MULTI_GPU.
The factory gate hardcoded cuda:0 as the only supported CUDA device. Multi-GPU setups need cuda:1, cuda:2, etc. Replace the static tuple with a function that accepts any valid cuda:<N> index.
There was a problem hiding this comment.
Code Review Summary
This is an impressive, well-architected PR that delivers significant performance improvements through Fabric GPU acceleration. The ~8× scene creation speedup and 150-260× faster pose operations (at 1024 prims) are substantial gains for large-scale simulations.
✅ Strengths
-
Clean architecture: The
FabricStageCacheservice pattern elegantly shares hierarchy handles across views without coupling. -
Correctness under non-trivial transforms: The transpose-convention handling in
update_indexed_local_matrix_from_worldandupdate_indexed_world_matrix_from_localcorrectly leverages(A·B)ᵀ = Bᵀ·Aᵀto avoid explicit transposes on Fabric's column-transposed storage. -
Robust dirty-flag isolation: Per-view
_world_dirtytracking prevents multi-view race conditions that would occur with a per-stage flag. -
Comprehensive test coverage: 13+ new integration tests cover rotated parents, scaled parents, multi-view isolation, and topology recovery scenarios.
-
Factory dispatch refinement: Moving Fabric/USD dispatch from
FabricFrameViewinternals toFrameView._get_backend()eliminates the misleading abstraction.
🔍 Findings & Suggestions
1. Potential race condition in concurrent PrepareForReuse calls
📍 fabric_frame_view.py — _get_world_rw_array() / _get_local_rw_array() / _get_world_ro_array()
Each accessor independently calls PrepareForReuse() on its selection. In a scenario where set_world_poses() and set_local_poses() are called in rapid succession (e.g., from different threads or coroutines), there's a potential for:
_world_sel_rw.PrepareForReuse()returnsTrue(topology change)- Rebuild starts but before completing,
_local_sel_rw.PrepareForReuse()is called - Both rebuilds race on shared state
Suggestion: Consider adding a simple lock or consolidating rebuild logic to ensure atomicity during topology changes, especially if this class may be used concurrently.
2. Missing validation for _fabric_empty_2d_array_sentinel shape
📍 fabric_frame_view.py:126 & fabric.py kernels
The sentinel is wp.zeros((0, 0), ...) with shape (0, 0). The kernels check shape[0] > 0 which works, but the docstring states the inner dim "is never indexed" while the sentinel has inner dim 0. This works but is semantically confusing.
Suggestion: Consider wp.zeros((0, 3), ...) or wp.zeros((0, 4), ...) to match the expected shapes for positions/quaternions respectively, making intent clearer.
3. Device validation could be more robust
📍 frame_view.py:25-36 — _is_fabric_supported_device()
if device.startswith("cuda:"):
try:
int(device.split(":", 1)[1])
return True
except (ValueError, IndexError):
passThis accepts "cuda:999" even if only 2 GPUs exist. Consider validating against torch.cuda.device_count() or wp.cuda_device_count().
Severity: Low — USDRT/Warp will fail downstream with a clearer error, but early validation would improve UX.
4. Silent skip for invalid parent prims in _initialize_fabric()
📍 fabric_frame_view.py:521-528
rt_prim = self._stage.GetPrimAtPath(path)
if not rt_prim.IsValid():
continue # Silent skipIf a parent prim is invalid, initialization silently skips it, but later _compute_parent_fabric_indices() will raise a RuntimeError. The skip-then-fail pattern could be confusing during debugging.
Suggestion: Either raise early with context, or log a warning when skipping invalid prims.
5. Unused _FABRIC_SUPPORTED_DEVICES module variable
📍 frame_view.py:19
_FABRIC_SUPPORTED_DEVICES: tuple[str, ...] | None = None # None = compute dynamicallyThis variable is declared but never used — _is_fabric_supported_device() hardcodes the logic instead.
Suggestion: Remove the unused variable or implement the promised "compute dynamically" behavior.
6. Test test_clear_instance_closes_service may have timing assumptions
📍 test_fabric_stage_cache.py:77-83
The test assumes SimulationContext.clear_instance() synchronously calls close() on services. If this happens asynchronously or is order-dependent, the assertion could flake.
Severity: Low — likely fine given IsaacLab's synchronous teardown pattern.
📊 Summary
| Category | Assessment |
|---|---|
| Correctness | ✅ Excellent — transform math is correct, edge cases well-handled |
| Performance | ✅ Excellent — 8× scene creation, 150-260× pose ops |
| Test Coverage | ✅ Comprehensive — 13+ new tests including edge cases |
| Architecture | ✅ Clean — service pattern, factory dispatch, per-view state |
| Documentation | ✅ Good — changelog entries, docstrings, inline comments |
| Maintainability |
Verdict
Approve — This is high-quality work that delivers substantial performance improvements with careful attention to correctness. The findings above are minor suggestions; none are blocking.
Update (336386b): Reviewed incremental changes since 2b033461. This commit adds significant new functionality:
New additions:
- Multi-GPU CI workflow (
.github/workflows/test-fabric-multi-gpu.yaml) — correctly disabled pending runner provisioning - Wrist-camera benchmark script with clean multi-rank aggregation
- Comprehensive unit tests for Warp fabric kernels (
test_fabric_kernels.py) FabricStageCacheservice for shared hierarchy handles- Extensive test coverage for interleaved world/local pose operations
Previous inline comment status:
- ✅ SettingsManager dead import — Fixed (import removed)
⚠️ _get_parent_world_ro_arrayunused — Still present but never called⚠️ get_scalesinconsistency — Partially fixed (sync added for cached path, but non-cached still returns without sync; still returnswp.arraynotProxyArray)⚠️ _FABRIC_SUPPORTED_DEVICESunused — Still declared but unused at module level
New observations:
- The
app_launcher.pychanges are a solid improvement — rebuildingsys.argvfrom known-good Kit args is more robust than the previous mark-and-sweep approach. - The interleaved setter tests (
test_interleaved_set_*) provide excellent coverage for the dirty-flag logic. - The indexed fabric kernel additions (
compose_indexed_fabric_transforms, etc.) are well-documented and match the claimed storage convention.
Overall: The PR continues to be high-quality work. The remaining inline issues are minor (P2) and don't block merge.
Greptile SummaryThis PR consolidates five individual PRs into a full Fabric acceleration stack for
Confidence Score: 4/5The core Fabric kernel math and the world ↔ local consistency protocol are both correct. The main risks are cosmetic: a dead import, a dead helper method, and an API inconsistency in get_scales. None of these affect correctness at runtime. The transpose-convention proof in the new kernels is correct, the per-view dirty-flag isolation is properly tested with a multi-view scenario, and the FabricStageCache lifecycle is wired correctly through ServiceLocator. The three style-level findings (unused import, dead method, get_scales return type) are real cleanup work but do not change observable behavior. The unused _FABRIC_SUPPORTED_DEVICES variable in frame_view.py is misleading but inert. fabric_frame_view.py warrants a second look to remove the unused SettingsManager import and the dead _get_parent_world_ro_array method before merging. Important Files Changed
Sequence DiagramsequenceDiagram
participant Caller
participant FrameView as FrameView (factory)
participant FFV as FabricFrameView
participant Cache as FabricStageCache
participant Fabric as Fabric (GPU)
participant USD as UsdFrameView
Caller->>FrameView: FrameView(path, device)
FrameView->>FrameView: _get_backend() checks fabric_enabled + device
FrameView-->>FFV: instantiate FabricFrameView
Caller->>FFV: get_world_poses()
FFV->>FFV: _initialize_fabric() [lazy, first call only]
FFV->>Cache: services[FabricStageCache] (get or create)
Cache-->>FFV: stage + hierarchy handle
FFV->>USD: get_world_poses() / get_local_poses() / get_scales()
USD-->>FFV: USD-seeded pos/ori/scale
FFV->>Fabric: compose_indexed_fabric_transforms (seed worldMatrix + localMatrix)
FFV->>FFV: "_world_dirty = True"
FFV->>FFV: _sync_world_from_local_if_dirty()
FFV->>Fabric: update_indexed_world_matrix_from_local
Fabric-->>FFV: child world matrices updated
FFV->>Fabric: decompose_indexed_fabric_transforms
FFV-->>Caller: ProxyArray(positions), ProxyArray(orientations)
Caller->>FFV: set_local_poses(translations, orientations)
FFV->>Fabric: compose_indexed_fabric_transforms to localMatrix
FFV->>FFV: "_world_dirty = True"
Caller->>FFV: get_world_poses()
FFV->>FFV: _sync_world_from_local_if_dirty()
FFV->>Fabric: update_indexed_world_matrix_from_local
FFV->>Fabric: decompose_indexed_fabric_transforms
FFV-->>Caller: consistent world poses
Reviews (1): Last reviewed commit: "fix: allow FabricFrameView on any cuda:<..." | Re-trigger Greptile |
| from pxr import Gf, Usd, UsdGeom | ||
|
|
||
| import isaaclab.sim as sim_utils | ||
| from isaaclab.app.settings_manager import SettingsManager |
There was a problem hiding this comment.
SettingsManager is imported at the module level but is no longer referenced anywhere in this file after the refactor moved the fabricEnabled check to FrameView._get_backend. The dead import adds a spurious load-time dependency.
| from isaaclab.app.settings_manager import SettingsManager | |
| # SettingsManager check moved to FrameView._get_backend; not needed here. |
| def _get_parent_world_ro_array(self): | ||
| # Built and refreshed alongside the trans_ro selection (parents share that selection). | ||
| if self._parent_world_ifa_ro is None or self._trans_sel_ro.PrepareForReuse(): | ||
| self._rebuild_trans_ro_arrays() | ||
| return self._parent_world_ifa_ro |
There was a problem hiding this comment.
_get_parent_world_ro_array is defined but never called anywhere in the class. Every internal consumer of _parent_world_ifa_ro accesses it directly (after calling _rebuild_trans_ro_arrays as needed). Because the method calls _trans_sel_ro.PrepareForReuse(), leaving it around risks a stale caller inadvertently triggering a topology rebuild out-of-sequence. If it is intended for future use, a # noqa comment or a TODO note would clarify intent; otherwise it should be removed.
| @@ -290,168 +365,421 @@ def get_scales(self, indices=None): | |||
| scales_wp = wp.zeros((count, 3), dtype=wp.float32, device=self._device) | |||
|
|
|||
| wp.launch( | |||
| kernel=fabric_utils.decompose_fabric_transformation_matrix_to_warp_arrays, | |||
| kernel=fabric_utils.decompose_indexed_fabric_transforms, | |||
| dim=count, | |||
| inputs=[ | |||
| self._fabric_world_matrices, | |||
| self._fabric_dummy_buffer, | |||
| self._fabric_dummy_buffer, | |||
| self._get_world_ro_array(), | |||
| self._fabric_empty_2d_array_sentinel, | |||
| self._fabric_empty_2d_array_sentinel, | |||
| scales_wp, | |||
| indices_wp, | |||
| self._view_to_fabric, | |||
| ], | |||
| device=self._fabric_device, | |||
| device=self._device, | |||
| ) | |||
|
|
|||
| if use_cached: | |||
| wp.synchronize() | |||
| return scales_wp | |||
There was a problem hiding this comment.
get_scales is inconsistent with get_world_poses and get_local_poses in two ways. First, when indices is supplied (non-cached path), the method returns without calling wp.synchronize(), while still returning a raw wp.array — callers that immediately call .numpy() on it race against the running kernel. Second, the class docstring states "Pose getters return ProxyArray", but get_scales returns a bare wp.array. Wrapping the return in ProxyArray and adding a wp.synchronize() for the non-cached branch would align the API with the other getters.
| # Devices supported by Fabric GPU acceleration (USDRT SelectPrims + Warp). | ||
| # Any CUDA device is supported — multi-GPU setups use cuda:1, cuda:2, etc. | ||
| _FABRIC_SUPPORTED_DEVICES: tuple[str, ...] | None = None # None = compute dynamically |
There was a problem hiding this comment.
The module-level variable
_FABRIC_SUPPORTED_DEVICES is declared with a comment that implies it will be used as a lazy cache, but _is_fabric_supported_device never reads or writes it. The dead variable is misleading — it suggests a caching strategy that is not implemented.
| # Devices supported by Fabric GPU acceleration (USDRT SelectPrims + Warp). | |
| # Any CUDA device is supported — multi-GPU setups use cuda:1, cuda:2, etc. | |
| _FABRIC_SUPPORTED_DEVICES: tuple[str, ...] | None = None # None = compute dynamically | |
| # Devices supported by Fabric GPU acceleration (USDRT SelectPrims + Warp). | |
| # Any CUDA device is supported — multi-GPU setups use cuda:1, cuda:2, etc. |
Previously, set_world_poses eagerly called _sync_local_from_world after every write, launching a warp kernel + wp.synchronize() to recompute local matrices even when nobody reads them. This added ~23% overhead. Now we set a _local_dirty flag and only recompute in get_local_poses (via _sync_local_from_world_if_dirty). Same treatment for set_scales. Benchmark (1024 prims, Fabric, L40): set_world_poses: 0.1258 ms -> 0.0689 ms (-45%, now 425x vs USD) interleaved set+get: 0.1673 ms -> 0.1114 ms (-33%) All 49 fabric xform tests pass.
Ensure at most one of _world_dirty / _local_dirty is True at any time. set_local_poses now flushes stale locals (from a prior set_world_poses) before overwriting, preventing stale local data on other indices. Symmetrically, set_world_poses already flushed stale worlds. A one-time warning is logged when interleaved setters are detected, guiding users toward the fast path (use one setter exclusively per frame). Also documents the leaf-prim assumption and dirty-flag invariant in the class docstring. All 49 fabric xform tests pass.
2b03346 to
7f18b1c
Compare
…ng tests Four new tests: - test_interleaved_set_world_then_set_local_partial_indices: set_world on idx 0, then set_local on idx 1 → both produce correct poses - test_interleaved_set_local_then_set_world_partial_indices: set_local on idx 0, then set_world on idx 1 → both produce correct poses - test_interleaved_set_emits_warning: confirms the one-time warning fires on first interleave, does not repeat - test_no_warning_when_using_single_setter: confirms no warning when only one setter is used repeatedly These tests would have caught the bug fixed in bad524c / 7f18b1c.
Summary
Consolidated PR bringing the full Fabric acceleration stack to
FabricFrameView. This builds on the merged mGPU support (#5514) and adds:IndexedFabricArrayFabricFrameViewfallback to theFrameViewfactory (eliminates misleading abstraction)set_world_poses+set_scalesinstead of separate callsget_local_poses()/set_local_poses()operate directly on FabriclocalMatrix, with automatic world→local recompositionFabricFrameViewon anycuda:<N>device (was incorrectly restricted tocuda:0only)Motivation
Piotr's wrist camera shimmering on multi-GPU (#5717 context): parent prim local transforms change on
cuda:1but camera world pose doesn't get recomposed → stale poses. The local→world composition pipeline in this PR fixes camera pose propagation on non-primary GPUs.Performance
On 2x L40 (4 envs, wrist camera benchmark):
cuda:1now gets full Fabric acceleration (was silently falling back to USD)Testing
test_views_xform_prim_fabric.pyextended with local pose tests, topology rebuild tests, and multi-GPU roundtrip teststest_fabric_stage_cache.pyfor the stage cache servicetest_fabric_kernels.pyfor indexed transform kernelsIndividual PRs (now superseded by this consolidated PR)
pv/indexed-fabric-kernelspv/fabric-view-no-fallbackpv/fabric-fused-composepv/fabric-stage-cachepv/fabric-local-poses