feat: Full Fabric acceleration stack — local poses, stage cache, fused compose by pv-nvidia · Pull Request #5728 · isaac-sim/IsaacLab

pv-nvidia · 2026-05-21T14:18:36Z

Summary

Consolidated PR bringing the full Fabric acceleration stack to FabricFrameView. This builds on the merged mGPU support (#5514) and adds:

Indexed Fabric transform kernels — Warp kernels for local↔world matrix propagation using IndexedFabricArray
Factory-based dispatch — Move Fabric/USD selection from internal FabricFrameView fallback to the FrameView factory (eliminates misleading abstraction)
Fused compose kernel — Single kernel launch for set_world_poses + set_scales instead of separate calls
FabricStageCache — Shared hierarchy handles across views for faster scene creation (~8× improvement)
Fabric-accelerated local poses — get_local_poses() / set_local_poses() operate directly on Fabric localMatrix, with automatic world→local recomposition
Device gate fix — Allow FabricFrameView on any cuda:<N> device (was incorrectly restricted to cuda:0 only)

Motivation

Piotr's wrist camera shimmering on multi-GPU (#5717 context): parent prim local transforms change on cuda:1 but camera world pose doesn't get recomposed → stale poses. The local→world composition pipeline in this PR fixes camera pose propagation on non-primary GPUs.

Performance

On 2x L40 (4 envs, wrist camera benchmark):

Scene creation: 12.2s → 1.5s (8× faster from FabricStageCache)
FPS parity at low env counts (camera-render-dominated); gains at 64+ envs
cuda:1 now gets full Fabric acceleration (was silently falling back to USD)

Testing

Existing test_views_xform_prim_fabric.py extended with local pose tests, topology rebuild tests, and multi-GPU roundtrip tests
New test_fabric_stage_cache.py for the stage cache service
New test_fabric_kernels.py for indexed transform kernels

Individual PRs (now superseded by this consolidated PR)

feat: add indexed fabric transform kernels for local/world matrix propagation #5675 pv/indexed-fabric-kernels
refactor: move Fabric/USD dispatch from FabricFrameView to FrameView factory #5673 pv/fabric-view-no-fallback
refactor: fuse set_world_poses/set_scales into single _compose_fabric_transform #5674 pv/fabric-fused-compose
feat: add FabricStageCache service for shared hierarchy handles #5676 pv/fabric-stage-cache
feat: Fabric-accelerated get/set_local_poses via indexedfabricarray #5677 pv/fabric-local-poses

…pagation Add Warp kernels that operate on wp.indexedfabricarray for direct local↔world matrix propagation without round-tripping through USD: - decompose_indexed_fabric_transforms: extract pos/quat/scale from ifa - compose_indexed_fabric_transforms: write pos/quat/scale into ifa - update_indexed_local_matrix_from_world: local = inv(parent) * world - update_indexed_world_matrix_from_local: world = parent * local Also refactor existing kernels to use wp.where (branchless) instead of if/else for broadcast index selection. These kernels are the foundation for Fabric-accelerated get/set_local_poses in FabricFrameView.

…factory FabricFrameView had an internal _use_fabric flag that fell back to UsdFrameView when Fabric was disabled or the device was unsupported. This violated single-responsibility: FabricFrameView pretended to be one class but sometimes behaved as another. Now the FrameView factory handles all dispatch: - PhysX + Fabric enabled + supported device → FabricFrameView - PhysX without Fabric (or unsupported device) → UsdFrameView - Newton → NewtonSiteFrameView FabricFrameView no longer checks _use_fabric or _fabric_supported_devices. It assumes Fabric is available (the factory guarantees this). UsdFrameView is eagerly registered on the factory since it lives in isaaclab (not a backend package), so FactoryBase's dynamic import (isaaclab_{backend}.sim.views) can't discover it.

Extract _compose_fabric_transform() to deduplicate the kernel-launch logic shared by set_world_poses and set_scales. The initial USD->Fabric sync now composes position, orientation, and scale in one call, so PrepareForReuse is invoked exactly once per logical update. Also replace assert with RuntimeError in _rebuild_fabric_arrays so the topology-change guard survives python -O.

Introduces FabricStageCache — a lightweight cache for the usdrt stage attachment and IFabricHierarchy handles, registered as a service on SimulationContext via the service locator (set_service/get_service). Multiple FabricFrameView instances now share a single hierarchy handle instead of each creating its own. The cache is automatically closed on SimulationContext.clear_instance(). Also replaces the assert device check with a proper RuntimeError and removes the now-unused isaaclab.sim import from fabric_frame_view.py.

# Conflicts: # source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py

- get_local_poses / set_local_poses via indexed fabric kernels - get_scales / set_scales with full Fabric acceleration - Per-view dirty tracking with automatic re-propagation - Topology-change recovery via _rebuild_fabric_arrays - 13 new integration tests (49 total pass)

- Allow FabricFrameView to run on cuda:N for any N; USDRT SelectPrims no longer needs cuda:0. - Refactor the Fabric write path into a single _compose_fabric_transform helper shared by set_world_poses, set_scales, and the initial USD->Fabric sync, collapsing the sync to one kernel launch with one PrepareForReuse. - Replace the topology-invariant assert with RuntimeError so it survives python -O. - Add multi_gpu pytest marker plus cuda:1 unit-test coverage for both Fabric write paths, and run them in the existing test-multi-gpu CI job (one extra step, no new job).

The standard pytest invocation in CI runs the fabric test file without filtering on the ``multi_gpu`` marker, so the ``cuda:1`` tests get scheduled on every runner including the single-GPU ones. Previously ``_skip_if_unavailable`` hard-failed via ``pytest.fail`` whenever ``GITHUB_ACTIONS=true`` and the requested device was missing, on the theory that this would catch a misconfigured multi-GPU runner. In practice it just broke the standard CI: the dedicated ``test-fabric-multi-gpu`` workflow already pre-flights ``torch.cuda.device_count() >= 2`` before invoking pytest, so a genuinely misconfigured multi-GPU runner is already caught there. Always skip rather than fail when the requested ``cuda:N`` index isn't available. Drop the now-unused ``import os``.

Kit's CLI parser reads sys.argv directly at startup and segfaults on pytest flags that collide with its own short options. Running pytest -m multi_gpu source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py crashes during collection because Kit sees ``-m multi_gpu`` and exits with ``Ill formed parameter: -m`` followed by SIGSEGV (exit code 245) inside ``simulation_app._start_app``. Strip sys.argv to argv[0] before instantiating AppLauncher. The test file takes no CLI arguments of its own, mirroring the broader pattern used by ``test_tiled_camera_env.py`` which assigns ``sys.argv[1:] = args_cli.unittest_args`` after argparse.

wp.to_torch on a ProxyArray is deprecated in favor of the .torch accessor. Switch the three call sites that consume the ProxyArray returned by get_world_poses; leave get_scales call sites alone since that method still returns a raw wp.array (no .torch accessor).

- Add a GPU-count pre-flight step to the test-fabric-multi-gpu CI job so a runner regression to a single GPU fails the workflow instead of silently skipping every cuda:1 test. This is what the comment in _skip_if_unavailable already promised existed. - Note that the sys.argv strip in test_views_xform_prim_fabric.py must stay between the AppLauncher import and its instantiation; any CLI parser or reordering re-exposes Kit to pytest argv and segfaults at startup. - Document the _fabric_usd_sync_done side effect on _compose_fabric_transform so callers can see why subsequent getters stop pulling from USD.

The class docstring and __init__ device-param doc still claimed ``cuda:0`` only. Refresh both to note that Fabric acceleration runs on any CUDA index, so the autodoc API page reflects the actual contract.

Move the test-fabric-multi-gpu job out of test-multi-gpu.yaml and into a dedicated test-fabric-multi-gpu.yaml. The two workflows share the same runner label, install step, and GPU pre-flight, but trigger on disjoint path sets so changes to FabricFrameView no longer gate the distributed-training validation and vice versa. test-multi-gpu.yaml is now byte-identical to upstream/develop.

No self-hosted runner with the 'multi-gpu' label is registered. All runs queue indefinitely. Kept as workflow_dispatch only so it can be manually triggered once a runner is provisioned. See also .github/workflows/test-multi-gpu.yaml (same issue).

New scripts/benchmarks/benchmark_dexsuite_wrist_camera.py runs the dexsuite-reorient task twice with identical settings except the camera mount (static base vs wrist-mounted under Robot/ee_link/palm_link), records per-env observation-camera mp4s, and reports per-mode FPS for objective comparison of moving-camera quality and performance. Supports single-GPU (--mode both spawns matched base/wrist subprocesses) and multi-GPU (--distributed under torchrun writes per-rank fps CSVs; --mode summarize aggregates totals as a final rank-0-only step). Uses imageio_ffmpeg for video encoding because cv2.VideoWriter mp4v silently drops frames inside Kit's loaded libavcodec context.

The previous _resolve_out_dir reused base_out_dir only when an args.json already lived inside it; otherwise it created a fresh timestamped subdir. That worked for the --mode both orchestrator (which writes args.json up front) but broke multi-GPU local runs: each torchrun --mode base call made its own timestamped subdir, --mode wrist made another, --mode summarize made a third — so summarize never found the per-rank fps CSVs. Now --mode base | wrist | summarize use --out_dir as-is. The orchestrator still wraps a timestamped subdir around the user's path before spawning subprocesses, so single-process --mode both behaves identically. Callers running torchrun directly own picking a unique path, which matches what the experiment shell script already passes.

…flicts) Combined local-poses stack with multi-GPU support. Uses torch.as_tensor() as canonical conversion. Includes multi-GPU tests gated by ISAACLAB_TEST_MULTI_GPU.

…c-stack

The factory gate hardcoded cuda:0 as the only supported CUDA device. Multi-GPU setups need cuda:1, cuda:2, etc. Replace the static tuple with a function that accepts any valid cuda:<N> index.

isaaclab-review-bot

Code Review Summary

This is an impressive, well-architected PR that delivers significant performance improvements through Fabric GPU acceleration. The ~8× scene creation speedup and 150-260× faster pose operations (at 1024 prims) are substantial gains for large-scale simulations.

✅ Strengths

Clean architecture: The FabricStageCache service pattern elegantly shares hierarchy handles across views without coupling.
Correctness under non-trivial transforms: The transpose-convention handling in update_indexed_local_matrix_from_world and update_indexed_world_matrix_from_local correctly leverages (A·B)ᵀ = Bᵀ·Aᵀ to avoid explicit transposes on Fabric's column-transposed storage.
Robust dirty-flag isolation: Per-view _world_dirty tracking prevents multi-view race conditions that would occur with a per-stage flag.
Comprehensive test coverage: 13+ new integration tests cover rotated parents, scaled parents, multi-view isolation, and topology recovery scenarios.
Factory dispatch refinement: Moving Fabric/USD dispatch from FabricFrameView internals to FrameView._get_backend() eliminates the misleading abstraction.

🔍 Findings & Suggestions

1. Potential race condition in concurrent PrepareForReuse calls
📍 fabric_frame_view.py — _get_world_rw_array() / _get_local_rw_array() / _get_world_ro_array()

Each accessor independently calls PrepareForReuse() on its selection. In a scenario where set_world_poses() and set_local_poses() are called in rapid succession (e.g., from different threads or coroutines), there's a potential for:

_world_sel_rw.PrepareForReuse() returns True (topology change)
Rebuild starts but before completing, _local_sel_rw.PrepareForReuse() is called
Both rebuilds race on shared state

Suggestion: Consider adding a simple lock or consolidating rebuild logic to ensure atomicity during topology changes, especially if this class may be used concurrently.

2. Missing validation for _fabric_empty_2d_array_sentinel shape
📍 fabric_frame_view.py:126 & fabric.py kernels

The sentinel is wp.zeros((0, 0), ...) with shape (0, 0). The kernels check shape[0] > 0 which works, but the docstring states the inner dim "is never indexed" while the sentinel has inner dim 0. This works but is semantically confusing.

Suggestion: Consider wp.zeros((0, 3), ...) or wp.zeros((0, 4), ...) to match the expected shapes for positions/quaternions respectively, making intent clearer.

3. Device validation could be more robust
📍 frame_view.py:25-36 — _is_fabric_supported_device()

if device.startswith("cuda:"):
    try:
        int(device.split(":", 1)[1])
        return True
    except (ValueError, IndexError):
        pass

This accepts "cuda:999" even if only 2 GPUs exist. Consider validating against torch.cuda.device_count() or wp.cuda_device_count().

Severity: Low — USDRT/Warp will fail downstream with a clearer error, but early validation would improve UX.

4. Silent skip for invalid parent prims in _initialize_fabric()
📍 fabric_frame_view.py:521-528

rt_prim = self._stage.GetPrimAtPath(path)
if not rt_prim.IsValid():
    continue  # Silent skip

If a parent prim is invalid, initialization silently skips it, but later _compute_parent_fabric_indices() will raise a RuntimeError. The skip-then-fail pattern could be confusing during debugging.

Suggestion: Either raise early with context, or log a warning when skipping invalid prims.

5. Unused _FABRIC_SUPPORTED_DEVICES module variable
📍 frame_view.py:19

_FABRIC_SUPPORTED_DEVICES: tuple[str, ...] | None = None  # None = compute dynamically

This variable is declared but never used — _is_fabric_supported_device() hardcodes the logic instead.

Suggestion: Remove the unused variable or implement the promised "compute dynamically" behavior.

6. Test test_clear_instance_closes_service may have timing assumptions
📍 test_fabric_stage_cache.py:77-83

The test assumes SimulationContext.clear_instance() synchronously calls close() on services. If this happens asynchronously or is order-dependent, the assertion could flake.

Severity: Low — likely fine given IsaacLab's synchronous teardown pattern.

📊 Summary

Category	Assessment
Correctness	✅ Excellent — transform math is correct, edge cases well-handled
Performance	✅ Excellent — 8× scene creation, 150-260× pose ops
Test Coverage	✅ Comprehensive — 13+ new tests including edge cases
Architecture	✅ Clean — service pattern, factory dispatch, per-view state
Documentation	✅ Good — changelog entries, docstrings, inline comments
Maintainability	⚠️ Good — some complexity in multi-selection management

Verdict

Approve — This is high-quality work that delivers substantial performance improvements with careful attention to correctness. The findings above are minor suggestions; none are blocking.

Update (336386b): Reviewed incremental changes since 2b033461. This commit adds significant new functionality:

New additions:

Multi-GPU CI workflow (.github/workflows/test-fabric-multi-gpu.yaml) — correctly disabled pending runner provisioning
Wrist-camera benchmark script with clean multi-rank aggregation
Comprehensive unit tests for Warp fabric kernels (test_fabric_kernels.py)
FabricStageCache service for shared hierarchy handles
Extensive test coverage for interleaved world/local pose operations

Previous inline comment status:

✅ SettingsManager dead import — Fixed (import removed)
⚠️ _get_parent_world_ro_array unused — Still present but never called
⚠️ get_scales inconsistency — Partially fixed (sync added for cached path, but non-cached still returns without sync; still returns wp.array not ProxyArray)
⚠️ _FABRIC_SUPPORTED_DEVICES unused — Still declared but unused at module level

New observations:

The app_launcher.py changes are a solid improvement — rebuilding sys.argv from known-good Kit args is more robust than the previous mark-and-sweep approach.
The interleaved setter tests (test_interleaved_set_*) provide excellent coverage for the dirty-flag logic.
The indexed fabric kernel additions (compose_indexed_fabric_transforms, etc.) are well-documented and match the claimed storage convention.

Overall: The PR continues to be high-quality work. The remaining inline issues are minor (P2) and don't block merge.

greptile-apps · 2026-05-21T14:33:45Z

Greptile Summary

This PR consolidates five individual PRs into a full Fabric acceleration stack for FabricFrameView, adding indexed transform kernels, a shared FabricStageCache, and Fabric-native get_local_poses/set_local_poses. The factory-level dispatch in FrameView is refactored so that Fabric vs. USD selection happens at creation time instead of inside the view.

Fabric-native local poses are implemented by writing/reading omni:fabric:localMatrix directly via IndexedFabricArray, with a per-view _world_dirty flag to trigger lazy world-from-local recomposition on the next world read.
FabricStageCache is added as a ServiceLocator-managed singleton that shares one IFabricHierarchy handle across all views on the same stage, reducing scene-creation time ~8×.
Fused compose kernel (compose_indexed_fabric_transforms) replaces the separate set_world_poses + set_scales kernel launches, and the assert prim-count guard is promoted to RuntimeError so it fires at every optimization level.

Confidence Score: 4/5

The core Fabric kernel math and the world ↔ local consistency protocol are both correct. The main risks are cosmetic: a dead import, a dead helper method, and an API inconsistency in get_scales. None of these affect correctness at runtime.

The transpose-convention proof in the new kernels is correct, the per-view dirty-flag isolation is properly tested with a multi-view scenario, and the FabricStageCache lifecycle is wired correctly through ServiceLocator. The three style-level findings (unused import, dead method, get_scales return type) are real cleanup work but do not change observable behavior. The unused _FABRIC_SUPPORTED_DEVICES variable in frame_view.py is misleading but inert.

fabric_frame_view.py warrants a second look to remove the unused SettingsManager import and the dead _get_parent_world_ro_array method before merging.

Important Files Changed

Filename	Overview
source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py	Major refactor: removes internal USD fallback, adds Fabric-native local poses, three persistent selections, indexed fabric arrays, and per-view dirty tracking. Contains an unused import (SettingsManager) and a dead helper method (_get_parent_world_ro_array) that are leftovers from the refactor.
source/isaaclab/isaaclab/utils/warp/fabric.py	Adds four new Warp kernels (decompose/compose indexed, update_indexed_local/world_from_world/local) and replaces if/else branching with wp.where. Transpose convention math is correct; kernel signatures are consistent.
source/isaaclab/isaaclab/sim/views/frame_view.py	Factory dispatch moved from FabricFrameView internal fallback to this factory. Adds _is_fabric_supported_device, but declares _FABRIC_SUPPORTED_DEVICES module-level variable that is never read. No functional bugs found.
source/isaaclab_physx/isaaclab_physx/sim/fabric_stage_cache.py	New lightweight singleton cache for usdrt Stage attachment and IFabricHierarchy handle. Lifecycle is correctly managed by ServiceLocator (close() called on SimulationContext teardown). No issues found.
source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py	Removes xfail override for set_world_updates_local (now fixed), adds tests for Fabric local poses, rotated-parent transpose convention, scaled-parent seed, and per-view dirty-flag isolation. Topology-change test no longer exercises real PrepareForReuse=True path.
source/isaaclab_physx/test/sim/test_fabric_stage_cache.py	New test suite for FabricStageCache; covers register/retrieve, hierarchy caching, close, and clear_instance lifecycle. Complete and correct.
source/isaaclab/test/utils/warp/test_fabric_kernels.py	New unit tests for decompose/compose math using plain wp.array (no Fabric runtime required). Covers identity, translation-only, scale, and round-trip cases. Solid coverage.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant FrameView as FrameView (factory)
    participant FFV as FabricFrameView
    participant Cache as FabricStageCache
    participant Fabric as Fabric (GPU)
    participant USD as UsdFrameView

    Caller->>FrameView: FrameView(path, device)
    FrameView->>FrameView: _get_backend() checks fabric_enabled + device
    FrameView-->>FFV: instantiate FabricFrameView

    Caller->>FFV: get_world_poses()
    FFV->>FFV: _initialize_fabric() [lazy, first call only]
    FFV->>Cache: services[FabricStageCache] (get or create)
    Cache-->>FFV: stage + hierarchy handle
    FFV->>USD: get_world_poses() / get_local_poses() / get_scales()
    USD-->>FFV: USD-seeded pos/ori/scale
    FFV->>Fabric: compose_indexed_fabric_transforms (seed worldMatrix + localMatrix)
    FFV->>FFV: "_world_dirty = True"
    FFV->>FFV: _sync_world_from_local_if_dirty()
    FFV->>Fabric: update_indexed_world_matrix_from_local
    Fabric-->>FFV: child world matrices updated
    FFV->>Fabric: decompose_indexed_fabric_transforms
    FFV-->>Caller: ProxyArray(positions), ProxyArray(orientations)

    Caller->>FFV: set_local_poses(translations, orientations)
    FFV->>Fabric: compose_indexed_fabric_transforms to localMatrix
    FFV->>FFV: "_world_dirty = True"

    Caller->>FFV: get_world_poses()
    FFV->>FFV: _sync_world_from_local_if_dirty()
    FFV->>Fabric: update_indexed_world_matrix_from_local
    FFV->>Fabric: decompose_indexed_fabric_transforms
    FFV-->>Caller: consistent world poses

_{Reviews (1): Last reviewed commit: "fix: allow FabricFrameView on any cuda:<..." | Re-trigger Greptile}

greptile-apps · 2026-05-21T14:33:49Z

+from pxr import Gf, Usd, UsdGeom

 import isaaclab.sim as sim_utils
 from isaaclab.app.settings_manager import SettingsManager


SettingsManager is imported at the module level but is no longer referenced anywhere in this file after the refactor moved the fabricEnabled check to FrameView._get_backend. The dead import adds a spurious load-time dependency.

Suggested change

from isaaclab.app.settings_manager import SettingsManager

# SettingsManager check moved to FrameView._get_backend; not needed here.

greptile-apps · 2026-05-21T14:33:50Z

+    def _get_parent_world_ro_array(self):
+        # Built and refreshed alongside the trans_ro selection (parents share that selection).
+        if self._parent_world_ifa_ro is None or self._trans_sel_ro.PrepareForReuse():
+            self._rebuild_trans_ro_arrays()
+        return self._parent_world_ifa_ro


_get_parent_world_ro_array is defined but never called anywhere in the class. Every internal consumer of _parent_world_ifa_ro accesses it directly (after calling _rebuild_trans_ro_arrays as needed). Because the method calls _trans_sel_ro.PrepareForReuse(), leaving it around risks a stale caller inadvertently triggering a topology rebuild out-of-sequence. If it is intended for future use, a # noqa comment or a TODO note would clarify intent; otherwise it should be removed.

greptile-apps · 2026-05-21T14:33:52Z

@@ -290,168 +365,421 @@ def get_scales(self, indices=None):
            scales_wp = wp.zeros((count, 3), dtype=wp.float32, device=self._device)

        wp.launch(
-            kernel=fabric_utils.decompose_fabric_transformation_matrix_to_warp_arrays,
+            kernel=fabric_utils.decompose_indexed_fabric_transforms,
            dim=count,
            inputs=[
-                self._fabric_world_matrices,
-                self._fabric_dummy_buffer,
-                self._fabric_dummy_buffer,
+                self._get_world_ro_array(),
+                self._fabric_empty_2d_array_sentinel,
+                self._fabric_empty_2d_array_sentinel,
                scales_wp,
                indices_wp,
-                self._view_to_fabric,
            ],
-            device=self._fabric_device,
+            device=self._device,
        )

        if use_cached:
            wp.synchronize()
        return scales_wp


get_scales is inconsistent with get_world_poses and get_local_poses in two ways. First, when indices is supplied (non-cached path), the method returns without calling wp.synchronize(), while still returning a raw wp.array — callers that immediately call .numpy() on it race against the running kernel. Second, the class docstring states "Pose getters return ProxyArray", but get_scales returns a bare wp.array. Wrapping the return in ProxyArray and adding a wp.synchronize() for the non-cached branch would align the API with the other getters.

greptile-apps · 2026-05-21T14:33:53Z

+# Devices supported by Fabric GPU acceleration (USDRT SelectPrims + Warp).
+# Any CUDA device is supported — multi-GPU setups use cuda:1, cuda:2, etc.
+_FABRIC_SUPPORTED_DEVICES: tuple[str, ...] | None = None  # None = compute dynamically


The module-level variable _FABRIC_SUPPORTED_DEVICES is declared with a comment that implies it will be used as a lazy cache, but _is_fabric_supported_device never reads or writes it. The dead variable is misleading — it suggests a caching strategy that is not implemented.

Suggested change

# Devices supported by Fabric GPU acceleration (USDRT SelectPrims + Warp).

# Any CUDA device is supported — multi-GPU setups use cuda:1, cuda:2, etc.

_FABRIC_SUPPORTED_DEVICES: tuple[str, ...] | None = None # None = compute dynamically

# Devices supported by Fabric GPU acceleration (USDRT SelectPrims + Warp).

# Any CUDA device is supported — multi-GPU setups use cuda:1, cuda:2, etc.

Previously, set_world_poses eagerly called _sync_local_from_world after every write, launching a warp kernel + wp.synchronize() to recompute local matrices even when nobody reads them. This added ~23% overhead. Now we set a _local_dirty flag and only recompute in get_local_poses (via _sync_local_from_world_if_dirty). Same treatment for set_scales. Benchmark (1024 prims, Fabric, L40): set_world_poses: 0.1258 ms -> 0.0689 ms (-45%, now 425x vs USD) interleaved set+get: 0.1673 ms -> 0.1114 ms (-33%) All 49 fabric xform tests pass.

Ensure at most one of _world_dirty / _local_dirty is True at any time. set_local_poses now flushes stale locals (from a prior set_world_poses) before overwriting, preventing stale local data on other indices. Symmetrically, set_world_poses already flushed stale worlds. A one-time warning is logged when interleaved setters are detected, guiding users toward the fast path (use one setter exclusively per frame). Also documents the leaf-prim assumption and dirty-flag invariant in the class docstring. All 49 fabric xform tests pass.

…ng tests Four new tests: - test_interleaved_set_world_then_set_local_partial_indices: set_world on idx 0, then set_local on idx 1 → both produce correct poses - test_interleaved_set_local_then_set_world_partial_indices: set_local on idx 0, then set_world on idx 1 → both produce correct poses - test_interleaved_set_emits_warning: confirms the one-time warning fires on first interleave, does not repeat - test_no_warning_when_using_single_setter: confirms no warning when only one setter is used repeatedly These tests would have caught the bug fixed in bad524c / 7f18b1c.

pv-nvidia and others added 20 commits May 20, 2026 13:52

Merge branch 'pv/indexed-fabric-kernels' into pv/fabric-local-poses-base

7fb8e5a

Merge branch 'pv/fabric-stage-cache' into pv/fabric-local-poses-base

13a5a18

# Conflicts: # source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py

Update FabricFrameView docstrings for multi-GPU support

711264b

The class docstring and __init__ device-param doc still claimed ``cuda:0`` only. Refresh both to note that Fabric acceleration runs on any CUDA index, so the autodoc API page reflects the actual contract.

Merge feat/frame-view-enable-mgpu into full-fabric-stack (resolve con…

1bad91d

…flicts) Combined local-poses stack with multi-GPU support. Uses torch.as_tensor() as canonical conversion. Includes multi-GPU tests gated by ISAACLAB_TEST_MULTI_GPU.

Merge branch 'nblauch/dexsuite-wrist-camera-bench' into pv/full-fabri…

37f8e39

…c-stack

fix: allow FabricFrameView on any cuda:<N> device

3b3d418

The factory gate hardcoded cuda:0 as the only supported CUDA device. Multi-GPU setups need cuda:1, cuda:2, etc. Replace the static tuple with a function that accepts any valid cuda:<N> index.

pv-nvidia requested review from Mayankm96, jtigue-bdai, kellyguo11 and pascal-roth as code owners May 21, 2026 14:18

github-actions Bot added documentation Improvements or additions to documentation isaac-lab Related to Isaac Lab team labels May 21, 2026

pv-nvidia marked this pull request as draft May 21, 2026 14:19

isaaclab-review-bot Bot reviewed May 21, 2026

View reviewed changes

greptile-apps Bot reviewed May 21, 2026

View reviewed changes

pv-nvidia added 2 commits May 22, 2026 11:23

pv-nvidia force-pushed the pv/fabric-full-stack branch from 2b03346 to 7f18b1c Compare May 22, 2026 13:11

pv-nvidia mentioned this pull request May 22, 2026

feat: add fabric_read/fabric_write context managers to FabricFrameView #5382

Closed

github-actions Bot added the infrastructure label May 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Full Fabric acceleration stack — local poses, stage cache, fused compose#5728

feat: Full Fabric acceleration stack — local poses, stage cache, fused compose#5728
pv-nvidia wants to merge 23 commits into
isaac-sim:developfrom
pv-nvidia:pv/fabric-full-stack

pv-nvidia commented May 21, 2026

Uh oh!

isaaclab-review-bot Bot left a comment •

edited

Loading

Uh oh!

greptile-apps Bot commented May 21, 2026

Uh oh!

greptile-apps Bot May 21, 2026

Uh oh!

greptile-apps Bot May 21, 2026

Uh oh!

greptile-apps Bot May 21, 2026

Uh oh!

greptile-apps Bot May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	from isaaclab.app.settings_manager import SettingsManager
	# SettingsManager check moved to FrameView._get_backend; not needed here.

Conversation

pv-nvidia commented May 21, 2026

Summary

Motivation

Performance

Testing

Individual PRs (now superseded by this consolidated PR)

Uh oh!

isaaclab-review-bot Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Code Review Summary

✅ Strengths

🔍 Findings & Suggestions

📊 Summary

Verdict

Uh oh!

greptile-apps Bot commented May 21, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

isaaclab-review-bot Bot left a comment •

edited

Loading