Turns off fabric for non-cuda0 multi-GPU runs to avoid mGPU errors in USDRT #4959

Merged
kellyguo11 merged 3 commits into isaac-sim:develop from kellyguo11:fix-fabric-mgpu
Mar 12, 2026

Conversation

@kellyguo11
Contributor

@kellyguo11 kellyguo11 commented Mar 12, 2026

Description

USDRT select prim currently requires cuda:0. The fix for this will be available in the next Kit version.

For now, we turn off Fabric for non-cuda:0 devices to avoid the USDRT error, which would otherwise cause a hang in multi-GPU runs.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@github-actions github-actions Bot added the bug, isaac-lab, and infrastructure labels Mar 12, 2026
@greptile-apps
Contributor

greptile-apps Bot commented Mar 12, 2026

Greptile Summary

This PR fixes multi-GPU hangs in Isaac Lab by disabling Fabric mode for any device that is not cuda/cuda:0, working around a known USDRT limitation where SelectPrims only supports the primary CUDA context. The core logic changes in xform_prim_view.py are correct and well-documented. However, the docker/.env.base changes accidentally include what appear to be temporary development settings pointing to NVIDIA's internal container registry and a floating image tag, which would break Docker builds for all external users and make them non-reproducible.

Key changes:

  • Added an early-exit guard in XFormPrimView.__init__ that turns off _use_fabric when self._device is cuda:1 or higher, preventing the downstream USDRT C++ crash.
  • Fixed four wp.launch calls that incorrectly passed device=self._device when launching kernels that operate on fabricarrays pinned to cuda:0; now correctly use device=self._fabric_device.
  • Corrected _view_to_fabric allocation to target fabric_device instead of self._device, eliminating device-mismatch issues in the bidirectional prim-index mapping.
  • docker/.env.base: Switches to nvcr.io/nvidian (internal registry) with a floating latest-release-6-0 tag — these should be reverted before merging.
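
The guard described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code from xform_prim_view.py: the function name `resolve_use_fabric`, the constant `_FABRIC_SAFE_DEVICES`, and the warning texts are assumptions made for the example. Only the decision rule (Fabric stays on for `cuda` and `cuda:0`, and is turned off for CPU and for any other CUDA device) comes from the summary above.

```python
import logging

logger = logging.getLogger(__name__)

# Devices on which Fabric stays enabled, per the PR summary.
# (Hypothetical constant name, introduced for this sketch only.)
_FABRIC_SAFE_DEVICES = ("cuda", "cuda:0")


def resolve_use_fabric(fabric_enabled: bool, device: str) -> bool:
    """Return True only if Fabric was requested and the device can use it."""
    if not fabric_enabled:
        # Fabric was not requested in the settings: use the USD path.
        return False
    if device == "cpu":
        logger.warning("Fabric is disabled on CPU; falling back to USD.")
        return False
    if device not in _FABRIC_SAFE_DEVICES:
        # cuda:1, cuda:2, ...: USDRT SelectPrims only supports cuda:0 for now.
        logger.warning("Fabric is disabled on %s; falling back to USD.", device)
        return False
    return True
```

With this rule, a rank running on `cuda:1` in a multi-GPU job falls back to the USD path instead of reaching the USDRT call that fails.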

Confidence Score: 2/5

  • The core Python fix is safe but the Docker env change must be reverted before this PR can be merged.
  • The xform_prim_view.py changes are logically sound and correctly address the multi-GPU crash. However, docker/.env.base has been changed to use NVIDIA's internal container registry (nvcr.io/nvidian) with a non-pinned floating tag (latest-release-6-0). These changes would silently break Docker-based workflows for all external contributors and CI systems that don't have internal registry access, and would make image provenance non-deterministic. This needs to be reverted before the PR is safe to merge.
  • docker/.env.base — must be reverted to the public registry (nvcr.io/nvidia/isaac-sim) and pinned version (6.0.0) before merge.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docker/.env.base | Switches the Isaac Sim base image from the public NGC registry (nvcr.io/nvidia) to NVIDIA's internal registry (nvcr.io/nvidian) and replaces the pinned version 6.0.0 with a floating tag latest-release-6-0. This will break Docker builds for all external users and makes builds non-reproducible. |
| source/isaaclab/isaaclab/sim/views/xform_prim_view.py | Adds a guard in __init__ that disables Fabric mode for any device that is not cuda or cuda:0, preventing the USDRT SelectPrims C++ crash on multi-GPU runs. Also fixes kernel launch device from self._device to self._fabric_device in all four Fabric kernel invocations, and ensures _view_to_fabric is allocated on fabric_device rather than self._device. |
| source/isaaclab/config/extension.toml | Version bump from 4.5.18 to 4.5.19 to accompany the bug fix. |
| source/isaaclab/docs/CHANGELOG.rst | Adds a 4.5.19 changelog entry accurately describing the multi-GPU Fabric disable and the kernel device-mismatch fixes. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[XFormPrimView.__init__] --> B{fabricEnabled in settings?}
    B -- No --> C[_use_fabric = False\nUSD path]
    B -- Yes --> D{device == 'cpu'?}
    D -- Yes --> E[_use_fabric = False\nWarning logged]
    D -- No --> F{device NOT in\n'cuda' or 'cuda:0'?}
    F -- Yes\ncuda:1, cuda:2, etc. --> G[_use_fabric = False\nWarning logged\nFalls back to USD]
    F -- No\ncuda or cuda:0 --> H[_use_fabric = True\nFabric path enabled]

    H --> I[_initialize_fabric called lazily]
    I --> J[Normalize fabric_device\nto 'cuda:0']
    J --> K[SelectPrims on cuda:0]
    K --> L[_view_to_fabric on fabric_device\n_fabric_to_view fabricarray]
    L --> M[wp.launch kernels\ndevice=fabric_device]
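
The lower half of the flowchart (normalizing the Fabric device, then launching kernels on it) can be illustrated with a small device-bookkeeping sketch. The names `DevicePlan`, `plan_devices`, and `normalize_fabric_device` are hypothetical and exist only for this example. According to the summary, the actual change passes `device=self._fabric_device` to `wp.launch`; the sketch only mirrors which device string would be chosen.

```python
from dataclasses import dataclass


def normalize_fabric_device(device: str) -> str:
    """Map the bare 'cuda' alias to the explicit 'cuda:0' device string."""
    return "cuda:0" if device == "cuda" else device


@dataclass(frozen=True)
class DevicePlan:
    view_device: str    # device requested by the caller (self._device in the PR)
    fabric_device: str  # device the Fabric arrays live on (self._fabric_device)

    @property
    def kernel_launch_device(self) -> str:
        # Kernels that read or write Fabric arrays launch where those arrays live.
        return self.fabric_device


def plan_devices(device: str) -> DevicePlan:
    """Build the device plan for a view created on `device`."""
    return DevicePlan(view_device=device,
                      fabric_device=normalize_fabric_device(device))
```

Keeping the two device strings separate makes it harder to launch a Fabric kernel on the view device by accident, which is the mismatch the summary says was fixed in four places.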

Comments Outside Diff (1)

  1. docker/.env.base, lines 8-10

    Internal registry and non-pinned floating tag

    nvcr.io/nvidian is NVIDIA's internal container registry, not the public NGC registry (nvcr.io/nvidia). External users and CI systems that don't have credentials for the internal registry will be unable to pull the image and build the Docker container.

    Additionally, changing the version from the pinned 6.0.0 to the floating tag latest-release-6-0 makes Docker builds non-reproducible — future builds may silently pull a different image.

    These look like temporary development/testing changes (pointing to an internal pre-release that contains the upstream USDRT fix) that should not be merged to develop. They should be reverted to the public registry (nvcr.io/nvidia/isaac-sim) and the pinned version (6.0.0).
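
    A check along the following lines could catch this kind of change before merge. It is a sketch under stated assumptions: the key names `ISAACSIM_BASE_IMAGE` and `ISAACSIM_VERSION` are assumed here and are not quoted in this review, and the helpers `parse_env` and `check_base_image` are hypothetical. The public registry prefix and the idea of a pinned `X.Y.Z` version come from the comment above.

```python
import re

# Public NGC prefix; the trailing slash keeps "nvcr.io/nvidian/..." from matching.
PUBLIC_REGISTRY_PREFIX = "nvcr.io/nvidia/"
# A pinned version is assumed to look like "6.0.0" (three numeric fields).
PINNED_VERSION = re.compile(r"^\d+\.\d+\.\d+$")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def check_base_image(env: dict[str, str]) -> list[str]:
    """Return a list of problems found in the base-image settings."""
    problems = []
    image = env.get("ISAACSIM_BASE_IMAGE", "")  # assumed key name
    version = env.get("ISAACSIM_VERSION", "")   # assumed key name
    if not image.startswith(PUBLIC_REGISTRY_PREFIX):
        problems.append(f"base image is not on the public registry: {image!r}")
    if not PINNED_VERSION.match(version):
        problems.append(f"version is not pinned: {version!r}")
    return problems
```

    Run against the contents of docker/.env.base, an empty result would mean both settings look publicly pullable and reproducible; any returned string names the setting to revert.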

Last reviewed commit: f122f3e

@kellyguo11 kellyguo11 merged commit 8977324 into isaac-sim:develop Mar 12, 2026
12 of 14 checks passed
daniela-hase pushed a commit to daniela-hase/IsaacLab that referenced this pull request Mar 30, 2026
… USDRT (isaac-sim#4959)

## Description

USDRT select prim currently requires cuda:0. the fix for this will be
available in the next Kit version.

For now, we will turn off fabric for non-cuda:0 devices to avoid the
error in USDRT, which in turn will cause a hang in multi-GPU runs.

## Type of change

<!-- As you go through the list, delete the ones that are not
applicable. -->

- Bug fix (non-breaking change which fixes an issue)


## Checklist

- [x] I have read and understood the [contribution
guidelines](https://isaac-sim.github.io/IsaacLab/main/source/refs/contributing.html)
- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with
`./isaaclab.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [ ] I have updated the changelog and the corresponding version in the
extension's `config/extension.toml` file
- [ ] I have added my name to the `CONTRIBUTORS.md` or my name already
exists there

<!--
As you go through the checklist above, you can mark something as done by
putting an x character in it

For example,
- [x] I have done this task
- [ ] I have not done this task
-->
david-tingdahl-nvidia pushed a commit to david-tingdahl-nvidia/IsaacLab that referenced this pull request Mar 31, 2026
… USDRT (isaac-sim#4959)
