[CI] Cross-platform — Part 4: Multi-GPU pytest workflow with cuda:1 coverage#5823
[CI] Cross-platform — Part 4: Multi-GPU pytest workflow with cuda:1 coverage#5823hujc7 wants to merge 41 commits into
Conversation
Re-enables the pull_request trigger on test-fabric-multi-gpu.yaml and wires it to run the FabricFrameView contract tests with ISAACLAB_TEST_MULTI_GPU=1, which activates the three cuda:1 -parameterised tests added in isaac-sim#5514. The cuda:1 tests target FabricFrameView's SelectPrims path on non-zero CUDA device indices. They currently hang indefinitely on real multi-GPU hardware (reproduced locally on 3x RTX 6000 Pro Blackwell and on the multi-GPU runner pool); the 60-min workflow timeout will cancel the job and surface the regression in CI for the FabricFrameView maintainers. Install pipeline matches isaac-sim#5738's proven-working layout: - Pin Python 3.12 via SHA-pinned actions/setup-python. - Pre-install cmake via pip to skip install.py's sudo apt-get branch. - ./isaaclab.sh --install none (core only, avoids egl_probe libEGL). - pip install isaacsim[all,extscache]==${vars.ISAACSIM_BASE_VERSION || '6.0.0'} --extra-index-url https://pypi.nvidia.com. - Bypass Kit's interactive EULA via OMNI_KIT_ACCEPT_EULA / ACCEPT_EULA / ISAAC_SIM_HEADLESS. Status: this PR is expected to fail with the 60-min workflow timeout. Land once the underlying hang in fabric_frame_view.py is fixed.
There was a problem hiding this comment.
Overall
This is a well-designed PR that introduces a clean, env-driven device parametrization helper for multi-GPU testing. The mask grammar is elegant and the documentation in the module docstring is excellent. The workflow setup and test migrations look correct.
However, there's a critical issue that will prevent the tests from running.
Critical: Missing __init__.py
The new isaaclab.testing package is missing its __init__.py file. Without it, Python won't recognize source/isaaclab/isaaclab/testing/ as a package, and all the imports like:
from isaaclab.testing import cuda_test_deviceswill fail with ModuleNotFoundError.
Fix: Add source/isaaclab/isaaclab/testing/__init__.py with at minimum:
# Copyright (c) 2022-2026, The Isaac Lab Project Developers (https://github.com/isaac-sim/IsaacLab/blob/main/CONTRIBUTORS.md).
# All rights reserved.
#
# SPDX-License-Identifier: BSD-3-Clause
"""Testing utilities for Isaac Lab."""
from .devices import cuda_test_devices
__all__ = ["cuda_test_devices"]Minor Suggestions
1. Workflow: Consider using matrix strategy for future extensibility
The pytest target list is explicit (good for control), but as you add more files in P1/P2, consider whether a matrix or glob pattern might be cleaner. Current approach is fine for now.
2. devices.py line 114: Consider caching available devices
Calling torch.cuda.device_count() on every test parametrization is cheap but redundant. A module-level cache (computed once at import) would be cleaner:
@functools.lru_cache(maxsize=1)
def _list_available_devices() -> tuple[str, ...]:
...(Return tuple for hashability, or just compute at module level.)
3. Workflow line 57: Pin Isaac Sim version handling
The fallback '6.0.0' is hardcoded. If vars.ISAACSIM_BASE_VERSION is the source of truth, consider failing explicitly when unset rather than silently falling back, to avoid version drift.
4. Consider adding a simple unit test for cuda_test_devices itself
A few tests for the helper (mask parsing, strict vs non-strict, edge cases like empty mask, invalid chars) would catch regressions in the helper logic itself.
Summary
- Blocking: Add the missing
__init__.pyforisaaclab.testing - Non-blocking: Minor polish suggestions above
Once the __init__.py is added, this looks good to merge. Nice work on the mask grammar design! 🎯
Adds a single helper, cuda_test_devices(), that converts a 3-position
device mask (env-var ISAACLAB_TEST_DEVICES, default '110') into the
list of device strings tests parametrize over. Single-GPU CI sees no
change (default mask '110' resolves to [cpu, cuda:0], identical to the
hardcoded lists tests carry today). The new multi-GPU-pytest workflow
sets ISAACLAB_TEST_DEVICES=001 so migrated tests run on cuda:1 only.
Mask grammar: each position is 0 or 1, optional trailing X expands to
all remaining positions. Position 0 -> cpu; position k>=1 -> cuda:{k-1}.
Strict mode raises on missing devices; non-strict returns empty for
opt-in tests that should skip on hosts that can't satisfy them.
P0 migration (pure-Python utility tests, no Kit):
* source/isaaclab/test/utils/test_math.py: 45 parametrize sites +
2 inline for-loops migrated.
* source/isaaclab/test/utils/test_wrench_composer.py: 37 sites.
* source/isaaclab/test/utils/test_episode_data.py: 5 sites.
Each migrated site replaces a hardcoded [cpu, cuda:0] (or the reversed
or tuple form) with cuda_test_devices(). Migration is additive - one
import line per file plus the inline edits. No test logic changes.
Workflow: .github/workflows/test-multi-gpu-pytest.yaml runs on the
[self-hosted, ..., multi-gpu] pool with ISAACLAB_TEST_DEVICES=001.
Triggered on changes to the helper, the P0 test files, or the
workflow itself.
Excluded scope (to follow up after CI validates this MVP):
* P1 light-Kit tests (test_simulation_context, test_views_xform_prim,
test_newton_model_utils, test_views_xform_prim_newton).
* P2 asset tests (test_articulation / test_rigid_object on physx and
newton backends).
* FabricFrameView cuda:1 tests (PR isaac-sim#5514) - separate path, the
SelectPrims deadlock there is tracked independently.
Reverts the fabric-specific .github/workflows/test-fabric-multi-gpu.yaml
edits that were carried on this branch from the earlier PR scope; that
demo is independent of this framework work.
c0b18ca to
30667b8
Compare
Migrate 16 additional test files (P0 extras + P1 + P2 + P3) to call cuda_test_devices() in their device parametrize, covering ~280 sites across articulation/rigid-object/rigid-object-collection/sim/sensors suites for physx, newton, and ovphysx backends. Rewrite the workflow's run step to auto-discover any test file calling cuda_test_devices() via grep, so new opt-ins land without workflow edits. Files are split into a pure-Python pytest session and per-file Kit-bound invocations (Kit is a process-wide singleton). A hardcoded SKIP list parks the known-broken FabricFrameView cuda:1 path. Per-Kit-file timeout 600 bounds any single hang at 10 minutes so the job surfaces all failing files rather than blocking on the first.
There was a problem hiding this comment.
Overall Assessment
This is a well-designed PR that introduces an elegant, env-driven device parametrization system for multi-GPU testing. The mask grammar is intuitive (e.g., 110 → [cpu, cuda:0], 001 → [cuda:1]), the documentation is excellent, and the backward-compatible default ensures zero impact on existing single-GPU CI runs.
The PR successfully addresses:
- ✅ New
isaaclab.testingmodule withcuda_test_devices()helper - ✅ Auto-discovery workflow that picks up opt-in test files automatically
- ✅ Migration of 19 test files (~280 parametrize sites)
- ✅ Clear P1/P2 scope documentation for follow-ups
Findings
1. Missing Unit Tests for the Helper Function (Medium)
Location: source/isaaclab/isaaclab/testing/devices.py
The cuda_test_devices() function has non-trivial mask parsing logic with multiple edge cases (trailing X, strict vs non-strict, empty masks, invalid characters), but there are no dedicated unit tests for the helper itself.
Recommendation: Add a test file like source/isaaclab/test/testing/test_cuda_test_devices.py with cases covering:
- Default mask resolution
- Explicit mask parsing (
"110","001","00X","X") - Strict mode raising on unavailable devices
- Non-strict mode returning empty list gracefully
- Invalid mask characters raising
ValueError
This protects against regressions in the helper logic itself.
2. Workflow SKIP List Pattern Could Be More Robust (Low) ✅ RESOLVED
Resolved in f66b101 — The new include-files approach is cleaner and avoids the heredoc pattern entirely.
3. Consider Caching Device List (Nit)
Location: source/isaaclab/isaaclab/testing/devices.py line 113
_list_available_devices() is called on every cuda_test_devices() invocation. While torch.cuda.device_count() is cheap, caching the result with @functools.lru_cache would be cleaner and more efficient for test collection.
Summary
This is a solid foundation for multi-GPU testing infrastructure. The design choices are well-reasoned and the migration is clean.
Actionable items:
- (Recommended) Add unit tests for
cuda_test_devices()to catch edge-case regressions (Optional) Minor robustness improvements noted above✅ Resolved
Nice work on the mask grammar design! 🎯
Update (d6b2934): New commit adds explicit pip install pytest step to the workflow — a good defensive fix ensuring runner independence. No new issues. Previous suggestions remain optional improvements for future iterations.
Update (f1fa016): Good improvements in this commit:
- Extended pip install to include
pytest-mockandflakywith clear explanation - Added
is_ok()helper treating pytest exit code 5 (no tests collected) as success — correct for device-gated tests - Extended SKIP list with 4 newton tests and improved documentation
No new issues. Previous optional suggestions (unit tests for helper, robustness improvements) remain relevant but non-blocking.
Update (042c409): This commit expands the skip list with well-documented categories explaining why each test is skipped (FabricFrameView deadlocks, SIGSEGV failures, cuda:1 hangs, etc.). Good practice to document the reasoning. Also adds test_simulation_context.py and physx rigid-object tests to the skip list. No new issues — this is maintenance housekeeping for multi-GPU stability.
Update (1546520): 🎉 Excellent fix! This commit addresses the root cause of the cuda:1 test failures:
app_launcher.py: AddedISAACLAB_SIM_DEVICEenv var support to override the default device when not explicitly passed. Clean implementation with good documentation.- Workflow: Sets
ISAACLAB_SIM_DEVICE=cuda:1to boot Kit withactive_gpu=1from process start. This correctly targets PhysX/Warp to cuda:1 before SimulationApp locks it. - Result: 7 tests removed from the skip list and now run successfully on cuda:1!
This is a proper fix rather than just skipping problematic tests. No new issues introduced.
Update (feb8e21): Skip list maintenance — expanded documentation with categorized skip reasons:
- Upstream Kit limits (FabricFrameView)
- Pre-existing test/API breakage (newton contact_sensor)
- Newton cuda:1 — Warp allocator failure in mujoco_warp collision driver
- PhysX cuda:1 — hangs specific to AWS L40 runner (passes on other hardware)
Added 7 files back to skip list (4 Newton, 3 PhysX asset tests) with clear root-cause explanations. Good documentation practice for future debugging. No new issues.
Update (6eeda64): Added changelog entry (jichuanh-multi-gpu-ci.minor.rst) documenting the new cuda_test_devices() helper and ISAACLAB_SIM_DEVICE env var. Good housekeeping. No new issues.
Update (e3d328b): Major infrastructure improvement — switched from bare-runner installation to Docker-based execution:
- Added
configjob that loads isaacsim image config fromconfig.yamlvia sparse checkout - Added
ecr-build-push-pullaction to pull the same per-commit CI image used by regular CI (dep matrices now match exactly) - Tests run inside container with workspace volume-mounted, removing need for
pip installsteps on runner - Cleaned up SKIP list comments (removed obsolete contact_sensor reference)
- Proper symlink setup (
rm -f _isaac_sim && ln -s /isaac-sim _isaac_sim) inside container
This is a solid approach — reusing the regular CI's Docker image ensures environment consistency and removes runner-specific pip quirks. Good fallback documentation if ECR cache misses. No new issues.
Update (3c5248d): Good documentation and consistency improvements:
- Added clear comments explaining
--entrypoint bash(overrides isaac-sim base image's default ENTRYPOINT that would swallow the script) - Added explanation for
--ignore=tools/conftest.py(matches regular CI's behavior) - Consolidated ignore paths into
ignore_argsvariable and applied to both pure-python and per-file pytest runs - Cleaner shell structure:
--entrypoint bash+-cinstead of inlinebash -c
No new issues. Ready for merge. 🚀
Update (9ff909d): Fixes local image availability issue — when ecr-build-push-pull takes the deps-cache-hit path, it creates an ECR registry-side alias but leaves no image in local Docker. This commit adds an "Ensure image is local" step that:
- Creates isolated DOCKER_CONFIG with
credsStoredisabled (workaround for broken runner credential helper) - Re-authenticates to ECR via EC2 IAM role
- Explicitly pulls and tags the image for subsequent container steps
Good defensive fix with clear comments explaining the edge case. No new issues.
Update (6706690): Good fixes for container permission issues:
- Added
--user "$(id -u):$(id -g)"to run container as host uid:gid — fixes write permission failures on volume-mounted workspace - Created writable $HOME (
/tmp/isaaclab-ci-home) withXDG_CACHE_HOMEfor Warp/numpy/pip caches - Well-commented: explains the issue (image's default
isaaclabuid 1000 ≠ runner'sgithub-runneruser)
Proper solution for the permission mismatch. No new issues.
Update (328878a): Skip list expanded with ovphysx tests. Good documentation explaining the issue: module-level pytest.mark.skipif collects 0 items + 1 skip, which isaaclab.sh CLI wrapper interprets as exit 1 (doesn't distinguish from real failures). Added 3 ovphysx asset tests to skip list. Clean maintenance commit. No new issues.
Update (665f0c3): Added temporary run_docker_tests: 'false' override in build.yaml to free GPU runners during PR iteration. The TEMP comment is well-documented ("revert before final review").
Update (f66b101): 🎉 Excellent refactor! This commit makes major architectural improvements:
Changes:
- New
extra-env-varsinput added to bothrun-testsandrun-package-testsactions — clean mechanism to forward env vars into the container with proper parsing (handles comments, whitespace,KEY=valueformat) - Workflow refactored to use
run-package-testsaction instead of ~150 lines of inline bash. Much cleaner and more maintainable. - Separate
buildjob added to pre-populate the ECR image before test job runs (mirrors the build→test split inbuild.yaml) - Include-list approach replaces the SKIP list pattern — explicitly listing 9 known-working tests is safer than auto-discovery with exclusions
- Path triggers expanded to include action files, so workflow changes properly trigger CI
My previous Finding #2 (SKIP list robustness) is now obsolete — the new include-files approach is cleaner and avoids the heredoc pattern entirely. ✅
No new issues. This is a significant quality improvement that makes the multi-GPU workflow consistent with the rest of the CI infrastructure.
run_docker_tests: 'false' in build.yaml (from 665f0c3).
Update (3c985eb): Minor update for FabricFrameView retest iteration:
Changes in build.yaml:
- Updated TEMP comment: now focused on "FabricFrameView cuda:1 retest" (previously was "Multi-GPU pytest docker conversion")
- Streamlined comment (removed redundant "freeing the single-GPU runner pool" explanation)
run_docker_tests: 'false'remains set — still iterating
Changes in test-multi-gpu-pytest.yaml:
- Added
test_views_xform_prim_fabric.pyto the include-files list — attempting to enable the FabricFrameView test on cuda:1
This is a WIP commit testing whether the FabricFrameView deadlock issue (documented in earlier skip lists) has been resolved. run_docker_tests: 'false' still needs to be reverted before merge.
No new architectural issues. Just experimental test expansion.
Update (bdcb527): ✅ Final Cleanup Complete
The latest commit removes the temporary run_docker_tests: 'false' override that was used during iteration, restoring the proper change-detection logic:
- run_docker_tests: 'false'
+ run_docker_tests: ${{ steps.detect.outputs.run_docker_tests }}This was the expected final step — the temporary bypass was necessary to test the FabricFrameView cuda:1 retest path, and it's now correctly reverted.
Status: All outstanding items addressed. PR is ready for merge. 🚀
pytest is not pulled in by --install none or by isaacsim[all,extscache]. Runner state was masking this; pin it explicitly.
flaky and pytest-mock are declared in source/isaaclab/setup.py install_requires but pip's resolver was silently skipping them when combined with the pytorch/nvidia extra-index urls in the install step. Pin them explicitly so the multi-GPU runner is runner-state independent. SKIP four newton test files that the cuda:1 cold-runner surfaces as broken (test_contact_sensor hits a pre-existing measure_total kwarg bug; test_articulation segfaults; test_rigid_object_collection and test_views_xform_prim_newton have cuda:1 specific failures). They're still parametrized via cuda_test_devices() so single-GPU CI continues to cover cpu+cuda:0. Accept pytest exit code 5 (no tests collected) so module-level pytestmark skips (e.g. backend-availability gates in ovphysx) and device-only parametrize that resolves to [] on incompatible hosts both count as success.
Four additional test files surfaced cuda:1-specific failures or hangs on the multi-GPU runner: * test_simulation_context: passes test_init[cuda:1] then hangs the next parametrize variant, 10-min per-file timeout fires. * newton/test_rigid_object: 41 cuda:1 failures (out of 54). * physx/test_rigid_object: passes test_initialization[cuda:1-1] then hangs at [cuda:1-2] (env-count 2 on cuda:1). * physx/test_rigid_object_collection: same hang signature. They keep their cuda_test_devices() parametrize so single-GPU CI continues to exercise cpu+cuda:0; only multi-GPU CI skips them pending separate investigation.
When the caller doesn't pass device= explicitly, AppLauncher now falls back to the ISAACLAB_SIM_DEVICE env var (if set) instead of the hardcoded cuda:0 default. Kit's active_gpu / physics_gpu are process-global and locked after SimulationApp init, so per-test parametrize alone cannot retarget GPU selection once the app is up. Boot-time alignment is the only path that works. The multi-GPU pytest workflow now sets ISAACLAB_SIM_DEVICE=cuda:1 alongside ISAACLAB_TEST_DEVICES=001, so PhysX and Warp pin to cuda:1 from process start. Drops 7 entries from the SKIP list (5 cuda:1 hangs around active_gpu/cuda mismatch + 2 newton suites likely sharing the same root cause). Remaining SKIPs: * FabricFrameView (usdrt SelectPrims cuda:0-only, upstream Kit) * newton/contact_sensor (Newton PR isaac-sim#2135 measure_total rename, needs caller update in newton_manager.py — tracked separately).
Round-5 CI confirmed the AppLauncher fix unblocks test_simulation_context (was hanging at second parametrize, now 42 passed in 62 s). Other files surfaced separate, non-AppLauncher root causes that need independent fixes: * Newton suites (4 files): Warp allocator failure inside mujoco_warp.collision_driver on cuda:1. Reproduces locally on a 3-device MIG host; root cause is the Warp/mujoco_warp interaction, not AppLauncher routing. * PhysX suites (3 files): hang at test_initialization[cuda:1-2] only on the AWS multi-GPU runner. Passes in 11 s locally with the same code, so the hang is runner-specific (L40 driver / peer access / PCIe topology), not an IsaacLab bug. test_simulation_context stays in scope (the AppLauncher fix made it pass deterministically). FabricFrameView usdrt and contact_sensor Newton API rename remain in SKIP for their pre-existing root causes.
Replaces the pip-install path with ECR pull of the same isaac-lab CI image used by build.yaml. ECR auths via the runner EC2 instance's IAM role (no nvcr.io credentials required at PR-time), so fork PRs work without exposing NGC_API_KEY. Benefits: * Newton 1.2+ pre-installed in the image, fixing the contact_sensor measure_total kwarg mismatch without a manual pip pin. * Eliminates the 9 min cold pip-install step (image pull from ECR is tens of seconds when cached). * Dep matrix matches single-GPU CI exactly, so both gates surface the same kind of dep skew. If ECR cache misses (e.g. build.yaml hasn't completed first), the action falls back to local build; that path is slow and requires NGC_API_KEY. Validating ECR auth on the multi-gpu runner pool is the primary goal of this commit — drops contact_sensor from SKIP to test that the version skew is resolved.
The previous attempt missed: 1. The isaac-sim base image has an ENTRYPOINT that launches Kit, so bash -c '...' was passed as Kit's argv (Kit booted, my script never ran). Mirror run-tests action: --entrypoint bash + -c '...'. 2. tools/conftest.py's pytest_ignore_collect returns True for every file (subprocess-per-test runner), so pytest collects 0 items and exits. Pass --ignore=tools/conftest.py --ignore=source/isaaclab/test/install_ci, same as run-tests does.
ecr-build-push-pull only pulls locally on the exact-cache-hit path. On deps-cache-hit (registry-side alias) the image isn't local, so docker run fails with 'Unable to find image isaac-lab-ci:... locally' followed by an unauthed Docker Hub pull attempt. Explicit pull via the ECR URL covers all paths uniformly.
The ecr-build-push-pull action cleans up its temp DOCKER_CONFIG after running, so the docker login from inside the action does not persist. Re-authenticate via aws ecr get-login-password (works via the runner EC2's IAM role, no AWS creds in the workflow).
Runner's default ~/.docker/config.json declares a credential helper that fails with 'not implemented'. Mirror the same workaround the ecr-build-push-pull action uses: drop a fresh config with credsStore set to empty string, then docker login + pull work.
Image's default USER is isaaclab (uid 1000), which doesn't own the volume-mounted host workspace, so it can't ln -s _isaac_sim (perm denied) — falling back to PATH python3 which doesn't exist in the image, hence pytest exit 127.
Running container as host uid:gid means the image's default /root home is not writable, so Warp/numpy/pip cache writes hit PermissionError [Errno 13] '/root/.cache'. Mount a fresh tmp dir and point HOME + XDG_CACHE_HOME at it.
The docker image properly installs ovphysx, so module-level pytestmark.skipif now triggers (no backend init at multi-gpu) collecting 0 items / 1 skipped. isaaclab.sh's CLI wrapper translates that exit-5 to exit-1, breaking the workflow's is_ok() check. Skip the 3 ovphysx files here.
Per ~/.claude/skills/pr/ci-iteration-shortcut.md. All gated Docker + Tests jobs (single-GPU build/test matrix) skip via their existing if-gate. Revert before final review. PR 5823 iterates the multi-GPU pytest docker conversion; the heavy single-GPU matrix adds no signal to that work and costs 30+ runner minutes per push.
Replaces the 237-line custom workflow with the ~100-line shape used by single-GPU test jobs: pull image via ecr-build-push-pull, run pytest in container via run-package-tests + run-tests, let tools/conftest.py handle the Kit-singleton subprocess-per-test pattern. The 8 docker-runtime quirks I worked around in the previous attempt (ENTRYPOINT, conftest ignore, deps-cache pull, DOCKER_CONFIG cleanup, credsStore, uid mismatch, HOME, exit-5 propagation) are all already handled inside the run-tests action. No reinvention. Adds one input to run-tests + run-package-tests: extra-env-vars (multiline KEY=value), used here to inject ISAACLAB_TEST_DEVICES and ISAACLAB_SIM_DEVICE so the container's pytest parametrize and Kit boot align on cuda:1. Test scope: 9 opt-in basenames covering ~512 cuda:1 tests, same discovery scope as the previous attempt minus the 11 SKIPped files.
ecr-build-push-pull's deps-cache-hit path only creates a registry-side alias (no local pull). Without a prior build job that establishes the exact-commit tag in ECR, the test job's internal ecr-build-push-pull hits exact-cache-miss + deps-cache-hit and leaves no local image, so docker run fails with 'pull access denied'. Mirrors the build → test split that build.yaml already uses for the single-GPU matrix. Build job pre-populates the exact tag (via buildx imagetools create on deps-cache-hit, or full build on miss); test job's inner ecr-build-push-pull then hits exact-cache-hit and pulls locally via the action's existing 'Pull exact image' step.
This reverts commit 665f0c3.
There was a problem hiding this comment.
Update (5d29bb0): ✅ Ready for Merge
The final commit reverts the temporary `run_docker_tests: 'false' override — the exact action my previous review flagged as blocking. All outstanding items are now addressed:
✅ Missing __init__.py — Added in earlier commit
✅ **run_docker_tests: 'false' reverted** — This commit ✅ **Workflow refactored** to use run-package-tests` action (f66b101)
✅ Include-list approach replaces SKIP heredoc pattern
✅ Changelog fragment added
Summary
This PR is now complete and ready for merge. The multi-GPU testing infrastructure is solid:
- Clean
cuda_test_devices()helper with elegant mask grammar ISAACLAB_SIM_DEVICEenv var for AppLauncher device override- Proper integration with existing CI via
run-package-testsaction - 9 test files migrated with known-working cuda:1 paths
No new issues introduced. Nice work! 🚀
Update (d6d69c4): ⚠️ Iteration Checkpoint — Not Merge-Ready
This commit re-introduces the temporary run_docker_tests: 'false' override in .github/workflows/build.yaml`. The PR author is iterating and has added a note:
"TEMP (revert before final review / before landing): force run_docker_tests=false while iterating PR #5823"
Status: The PR was previously ready for merge, but is now in an iteration state. This temporary flag must be reverted before merging — the author has clearly marked it as such.
📋 Blocking for merge:
- Revert `run_docker_tests: 'false' override (same fix as commit 5d29bb0)
No functional regressions otherwise — this is expected iteration behavior.
Update (ea919b7): Module Relocation
This commit relocates the testing utilities from isaaclab.testing to isaaclab.test.utils:
Changes:
- Renamed
source/isaaclab/isaaclab/testing/→source/isaaclab/isaaclab/test/utils/ - Updated imports across 16 test files in
isaaclab,isaaclab_newton,isaaclab_ovphysx, andisaaclab_physxpackages - Workflow path trigger updated from
source/isaaclab/isaaclab/testing/**→source/isaaclab/isaaclab/test/utils/** - Changelog fragment updated to reflect new import path
✅ This is a clean refactor that consolidates test utilities under the isaaclab.test namespace. All import statements are consistently updated.
📋 Blocking for merge:
- Revert `run_docker_tests: 'false' override (status unchanged from previous update)
Otherwise, no new issues introduced.
Update (0118ea7): 🎯 Nice Refinements
This commit adds several quality-of-life improvements:
Changes:
1. run-tests/action.yml — Smarter env var parsing
- Now preserves mid-line
#characters (e.g.,IMAGE_TAG=v1.0#nightlysurvives) - Only treats full-line comments (where first non-whitespace is
#) as comments - Strips YAML block indent whitespace properly
2. test-multi-gpu-pytest.yaml — Auto-discovery of opt-in tests
- Replaces hardcoded include-list with
grep -rl 'cuda_test_devices'discovery - Adding a new test to multi-GPU CI now requires no workflow edit — just use the helper
- Clean design: opt-in via code, not via workflow maintenance
3. cuda_test_devices() helper — New skip parameter
- Added
skip: dict[str, str]for per-device skips with reasons - Wraps skipped devices in
pytest.param(..., marks=pytest.mark.skip(reason=...)) - CI shows
SKIPPEDwith reason instead of silent omission - Changed
strictdefault fromTruetoFalse(friendlier for CPU-only dev machines)
4. test_views_xform_prim_fabric.py — Migrated to helper
- Replaced
@pytest.mark.parametrize("device", ["cpu", "cuda:0"])withcuda_test_devices() - Removed manual
ISAACLAB_TEST_MULTI_GPUskipif decorators - Uses
mask="001", strict=Falsefor cuda:1-only tests
✅ All changes are clean refinements that improve maintainability and developer experience.
📋 Blocking for merge:
- Revert
run_docker_tests: 'false'override (status unchanged)
Nice incremental polish! 🔧
Update (101a47a): 🔧 Newton Device-Pinning & Test Tolerance Fixes
This commit addresses cuda:1 context issues in Newton manager and fixes device-specific test tolerances:
Changes:
1. newton_manager.py — Device context pinning (issue #5132)
Added explicit torch.cuda.set_device() + wp.set_device() calls in two critical locations:
start_simulation()— Before Warp/Newton allocations to ensure mujoco_warp collision pipeline allocates on the correct CUDA deviceinitialize_solver()— Mirrors the guard fromstart_simulation()for solver build and collision-pipeline init
This fixes null pointer returns when mujoco_warp allocates against a primary CUDA context that was never made current. The guard is idempotent if already pinned.
2. wp.ScopedCapture(device=device) — Explicit device for CUDA graph capture
Previously, CUDA graph capture didn't specify a device; now explicitly passes device to avoid capturing on wrong GPU.
3. Test tolerance fixes (test_rigid_object.py in both newton & physx packages)
Changed if device == "cuda:0" to if device.startswith("cuda") for tolerance selection. This ensures cuda:1 (and any other CUDA device) uses the relaxed GPU tolerance (1e-2) instead of the strict CPU tolerance (1e-5).
✅ Solid fixes:
- Device-pinning guards reference the related issue (#5132)
- Comments explain why the guards are needed (mujoco_warp context behavior)
- Test tolerance fix is the correct pattern for any CUDA device
- CUDA graph capture now device-aware
📋 Blocking for merge:
- Revert
run_docker_tests: 'false'override (status unchanged)
Otherwise, these are important correctness fixes for multi-GPU setups. 👍
Greptile SummaryThis PR adds
Confidence Score: 4/5Safe to merge; the multi-GPU workflow and mask-resolution logic are well-designed and single-GPU CI behaviour is provably unchanged. The core feature is correct and thoroughly tested. The strict=True default raises at collection time on hosts without cuda:0, changing the failure mode for migrated tests and potentially tripping up contributors on CPU-only machines. The env-var forwarding loop silently strips # from values, which is benign today but could cause data loss for future users of extra-env-vars. Neither issue affects the CI green path described in the PR. source/isaaclab/isaaclab/testing/devices.py (strict-mode collection failure on CPU-only hosts) and .github/actions/run-tests/action.yml (env-var value stripping). Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["pytest collection\n(parametrize decorator)"] --> B["cuda_test_devices()"]
B --> C{"mask= arg\nprovided?"}
C -- "No" --> D["read ISAACLAB_TEST_DEVICES\nenv var"]
D --> E{"env var set?"}
E -- "No" --> F["default mask = '110'"]
E -- "Yes" --> G["use env var mask"]
C -- "Yes" --> G
F & G --> H["_list_available_devices()\n[cpu, cuda:0, cuda:1, ...]"]
H --> I["_expand_mask(mask, len)"]
I --> J{"strict=True and\nsurplus '1' bits?"}
J -- "Yes" --> K["raise ValueError\nat collection time"]
J -- "No" --> L["return filtered device list"]
L --> M["pytest parametrizes\nover device list"]
subgraph AppLauncher["AppLauncher._resolve_device_settings"]
N{"device_explicit\npassed?"}
N -- "No" --> O{"ISAACLAB_SIM_DEVICE\nset?"}
O -- "Yes" --> P["device = env var value"]
O -- "No" --> Q["keep default device"]
N -- "Yes" --> Q
P & Q --> R{"XR mode\nenabled?"}
R -- "Yes" --> S["override to cpu"]
R -- "No" --> T["proceed with device"]
end
Reviews (1): Last reviewed commit: "Revert "TEMP: force run_docker_tests=fal..." | Re-trigger Greptile |
|
|
||
| Args: | ||
| mask: Optional explicit mask string following the grammar in the | ||
| module docstring. When ``None``, the helper reads the mask from | ||
| the ``ISAACLAB_TEST_DEVICES`` environment variable, defaulting | ||
| to ``"110"`` if the variable is unset. | ||
| strict: When ``True`` (the default), raise ``ValueError`` if the | ||
| mask requests devices the host does not have or if the resolved | ||
| list would be empty. When ``False``, silently truncate to what | ||
| the host can satisfy - callers using this in | ||
| ``pytest.mark.parametrize`` get a clean "no tests collected for |
There was a problem hiding this comment.
strict=True default raises at collection time on hosts without cuda:0
With the default mask "110" and strict=True, calling cuda_test_devices() on a host that reports only ["cpu"] (no CUDA) immediately raises ValueError during pytest collection — before any test runs. Before this change, the same tests (parametrized with ["cuda:0", "cpu"]) would collect normally; only the cuda:0 variant would fail at runtime. Tests like test_episode_data.py and test_math.py have no AppLauncher dependency, so they are plausible targets for collection on CPU-only dev machines. Passing strict=False to the default call-sites, or documenting the constraint, would preserve the old graceful-degrade behaviour for contributors without full GPU setups.
| # Caller-supplied extra env vars (one KEY=value per line). | ||
| if [ -n "$extra_env_vars" ]; then | ||
| while IFS= read -r line; do | ||
| line="${line%%#*}" | ||
| line="$(echo "$line" | xargs)" | ||
| [ -z "$line" ] && continue | ||
| key="${line%%=*}" |
There was a problem hiding this comment.
Comment-stripping and
xargs trimming silently corrupt values containing # or single quotes
line="${line%%#*}" strips everything from the first # in a line, so a future caller passing a value like IMAGE_TAG=v1.0#nightly would silently become IMAGE_TAG=v1.0. Similarly, $(echo "$line" | xargs) collapses internal whitespace and fails on values containing unbalanced single quotes. For the two values used today (001 and cuda:1) this is harmless, but it is a latent trap for any future extra-env-vars user. Documenting the # restriction in the input description, or switching to a safer stripping approach, would prevent silent data loss.
* Re-applies the run_docker_tests='false' guard in build.yaml's changes job (per pr/ci-iteration-shortcut) so the single-GPU Docker + Tests matrix skips during this iteration. * Adds test_views_xform_prim_fabric.py to the multi-GPU include-files list. Previously SKIPped because pip-install rounds hung on a usdrt SelectPrims cuda:1 deadlock; the docker image carries a newer Kit, so the cuda:1 path is worth re-validating here. Must be reverted before final review.
The docker image's newer Kit resolves the usdrt SelectPrims cuda:1 deadlock that previously kept this file in the SKIP list (pip-install rounds hit it). Run 26587461494 passed: 36 passed, 3 skipped, 2 xfailed for test_views_xform_prim_fabric.py. This also restores build.yaml's changes detection (drops the temp TEMP-iteration skip).
Per ~/.claude/skills/pr/ci-iteration-shortcut.md. Keep the single-GPU Docker + Tests matrix disabled until iteration is over and the PR is ready to land. Revert as the last commit before merge.
Folds the helper into the existing isaaclab.test subpackage shape (sibling of isaaclab.test.benchmark, isaaclab.test.mock_interfaces) under a new isaaclab.test.utils subpackage. Drops the standalone isaaclab.testing folder, which was a new top-level namespace with no precedent. Import path: from isaaclab.test.utils import cuda_test_devices.
Implements Greptile P2.1 + P2.2 and consolidates the device-skip
mechanism inside test files so the workflow needs no opt-in or
opt-out edits.
API:
* cuda_test_devices() default strict=False — CPU-only dev hosts now
collect the cpu variant cleanly instead of failing at pytest
collection (Greptile P2.1).
* cuda_test_devices(skip={device: reason}) — wraps unsupported
variants in pytest.param(..., marks=pytest.mark.skip(reason=...))
so pytest still collects them and shows SKIPPED with the reason in
CI output. Per-call granularity; reason co-located with the test.
Workflow:
* Auto-discovery via grep for cuda_test_devices callers; no SKIP list
in the workflow. Adding/removing a test from multi-GPU scope is a
test-file-only edit.
run-tests action:
* extra-env-vars parser now skips only full-line comments (no mid-line
# stripping) and doesn't xargs-collapse whitespace (Greptile P2.2).
Test file migrations:
* 7 previously-SKIPped files (4 newton + 3 physx) now declare a
module-level _CUDA_1_BROKEN dict with a tracking-issue URL and apply
cuda_test_devices(skip=_CUDA_1_BROKEN) per parametrize site.
* test_views_xform_prim_fabric.py migrated to the helper too (was
using the legacy ISAACLAB_TEST_MULTI_GPU env var pattern), so
auto-discovery picks it up.
Earlier rounds SKIPped 7 newton+physx files based on failures observed on the pip-install path or pre-docker rounds. The docker image carries newer Kit + Newton + Warp that already resolved 2 other categories (measure_total kwarg, FabricFrameView usdrt deadlock). Re-running the cuda:1 variants of these 7 files to see which actually still fail on the docker path.
Enable-all run (0118ea7) confirmed the docker image resolves the PhysX hangs and FabricFrameView/contact_sensor failures that earlier rounds suspected. Two narrow categories remain: * 4 newton files — Warp/mujoco_warp init-order on cuda:1 (issue isaac-sim#5132). Same root cause across all four; gated via module-level _NEWTON_5132 dict. * 1 PhysX test — test_rigid_body_no_friction[cuda:1-*] precision drift (1.8e-3 vs 1e-5 tolerance); gated via per-test _PHYSX_NO_FRICTION_CUDA1 dict. Everything else previously SKIPped now runs and passes on cuda:1 (test_articulation, test_rigid_object except no_friction, test_rigid_object_collection, test_views_xform_prim_fabric).
Add torch.cuda.set_device(device) + wp.set_device(device) at the
top of NewtonManager.start_simulation and initialize_solver
so mujoco_warp's collision pipeline allocates against an initialized
cuda:N primary context. Also pass device=device to the standard-
path wp.ScopedCapture (the relaxed-graph sibling already did this).
Local repro confirms: isaaclab_newton/test/assets/test_rigid_object.py
on cuda:1 was 41 failed; now 45 passed / 9 skipped / 0 failed.
Also fixes the test_rigid_body_no_friction tolerance branch in
both isaaclab_physx and isaaclab_newton test_rigid_object.py
files. The author already documented GPU non-determinism and set
tolerance = 1e-2 for cuda:0; the else branch fell through to
the CPU-tight 1e-5 on cuda:1, where PhysX's GPU integrator drift is
the same 1.8e-3 envelope. Gate on device.startswith('cuda') so
all cuda devices share the same loose tolerance.
Drops the temporary _NEWTON_5132 and _PHYSX_NO_FRICTION_CUDA1
skip dicts from the 5 test files now that the underlying bugs are
fixed.
Tracks: isaac-sim#5132.
There was a problem hiding this comment.
Update (52da4a7): 🔧 Container Isolation & Race Condition Fixes
This commit significantly hardens the parallel container execution by addressing race conditions and improving filesystem isolation.
Key Changes:
1. Pre-created _isaac_sim symlink on host
rm -f _isaac_sim && ln -s /isaac-sim _isaac_sim- Critical fix: Prevents parallel containers from racing on
rm/lnoperations on the shared workspace mount - The symlink is dangling on the host (no
/isaac-simpath) but resolves correctly inside each container
2. Comprehensive per-shard runtime directories
New volume mounts for complete isolation:
$shard_runtime/isaac-sim/kit/cache → /isaac-sim/kit/cache
$shard_runtime/isaac-sim/kit/data → /isaac-sim/kit/data
$shard_runtime/isaac-sim/kit/logs → /isaac-sim/kit/logs
$shard_runtime/isaac-sim/cache → /isaac-sim/.cache
$shard_runtime/isaac-sim/computecache → /isaac-sim/.nv/ComputeCache
$shard_runtime/isaac-sim/config → /isaac-sim/.nvidia-omniverse/config
$shard_runtime/isaac-sim/data → /isaac-sim/.local/share/ov/data
$shard_runtime/isaac-sim/logs → /isaac-sim/.nvidia-omniverse/logs
$shard_runtime/isaac-sim/pkg → /isaac-sim/.local/share/ov/pkg
- Each container now has fully isolated writable areas for Kit and Isaac-Sim state
- Prevents cross-contamination of caches, logs, and configs between parallel shards
3. Additional environment variables
XDG_DATA_HOME=/tmp/isaaclab-ci-home/.local/share
USER="${host_user}"
LOGNAME="${host_user}"
OMNI_KIT_DISABLE_CUP=1
ISAAC_SIM_LOW_MEMORY=1
PYTHONUNBUFFERED=1
PYTHONIOENCODING=utf-8OMNI_KIT_DISABLE_CUP=1: Disables Omniverse Connector Upload (reduces network I/O)ISAAC_SIM_LOW_MEMORY=1: Memory-conscious mode for parallel execution- Python buffering settings ensure clean log capture
4. Minor cleanup
- Removed hardcoded "4 GPUs" from comment (now generic)
unset HUB__ARGS__DETECT_ONLYadded inside container- Explicit
:rwsuffixes on volume mounts for clarity
✅ Excellent improvements:
- Race condition fixed: The symlink pre-creation is a crucial fix for parallel stability
- Full isolation: Each shard is now a proper sandbox with no shared writable state
- Memory-conscious:
ISAAC_SIM_LOW_MEMORY=1is smart for 3+ containers on one runner - Clean logging: Python unbuffered mode ensures real-time log capture
📋 Blocking for merge:
- Revert
run_docker_tests: 'false'override inbuild.yaml(still pending)
This commit addresses the practical realities of running parallel GPU containers. The filesystem isolation is now robust enough for production CI. 👍
📜 Previous review updates
Update (9dc722d): 🚀 Single-Runner Parallel Shard Architecture
This commit refactored the multi-GPU workflow from a matrix-based strategy (N runners) to parallel Docker containers on a single runner.
Key improvements:
- Resource efficiency: 1 runner instance vs 3 — doesn't burn the GPU pool
- Shared build cache: All shards share the same pulled image
- Clean log grouping:
::group::shard cuda:Nper container - Wait-all-then-aggregate pattern for clean failure handling
Update (9cb9b32): 🎯 Dynamic GPU Detection & Matrix Sharding
Transformed workflow to dynamic, matrix-based strategy that auto-adapts to runner's GPU count. (Now superseded by single-runner parallel approach.)
Update (101a47a): 🔧 Newton Device-Pinning & Test Tolerance Fixes
Device-pinning guards for Newton manager and test tolerance fixes for cuda:N devices.
Update (5d29bb0): 🎉 Ready for Merge
Reverted temporary run_docker_tests: 'false' override. (Note: has since been re-added for iteration.)
Update (3941b42): ✅ Test Dependency Install Fix
This commit addresses a runtime failure where pytest dependencies were missing inside the parallel shard containers.
Change:
./isaaclab.sh -p -m pip install pytest pytest-mock junitparser flatdict flaky "coverage>=7.6.1"Why needed:
- The inline
docker runfor parallel shards bypassed therun-testsaction - That action normally installs these test deps as a hardcoded step
- Without them,
tools/conftest.pyfails at collection time (importsjunitparser) - Now each shard installs deps in parallel (~2s overhead, acceptable)
✅ Good fix: Mirrors the run-tests action's install line. Comment documents the rationale.
📋 Blocking for merge:
- Revert
run_docker_tests: 'false'override inbuild.yaml(still pending)
Previous inline comments status:
- 🔄 P2:
strict=Truedefault indevices.py— not addressed in this commit (documentation/behavior concern) - 🔄 P2: Comment-stripping
#inextra-env-vars— not addressed in this commit (latent bug concern)
These are P2 (non-blocking) suggestions that can be addressed separately.
Update (612a019): 📺 Live Streaming Output & Timeout Adjustment
Changes:
1. Timeout raised 30 → 45 minutes (diagnostic)
timeout-minutes: 45 # was 30- First buffered parallel run hit the 30-min wall without printing anything useful
- Raised temporarily to capture real per-shard durations
- Comment documents intent to tighten back to 30 once split is balanced
2. Live streaming per-shard output with tagging
(
docker run ... 2>&1 | tee "${logfiles[$cuda]}" | stdbuf -oL sed "s/^/[cuda:$cuda] /"
exit "${PIPESTATUS[0]}"
) &Why this matters:
- Previously output was fully buffered to logfiles — silent until job completion
- Now: live
[cuda:0],[cuda:1],[cuda:2]prefixed lines interleave in the workflow log teekeeps clean per-shard logfiles for the grouped re-print at job endstdbuf -oLensures line-buffered output (no delayed batching)exit "${PIPESTATUS[0]}"propagates docker's exit code, not sed's — a failing shard isn't masked
✅ Good improvements:
- Debugging win: Interleaved tagged output makes parallel runs followable in real-time
- Correct exit handling:
PIPESTATUS[0]is the right pattern for preserving the producer's exit code through a pipe - Temporary nature documented: The 45-min timeout is explicitly flagged as diagnostic, not permanent
📋 Blocking for merge:
- Revert
run_docker_tests: 'false'override inbuild.yaml(still pending)
P2 inline comments (non-blocking, from earlier review):
strict=Truedefault indevices.py— still open- Comment-stripping
#inextra-env-vars— still open
Update (ae7ddad): ✅ Round-Robin File Partitioning
Changes:
1. Deterministic file distribution across shards
IFS=',' read -ra ALL_FILES <<< "$INCLUDE_FILES"
for i in "${!ALL_FILES[@]}"; do
if [ $((i % SHARD_COUNT)) -eq "$idx" ]; then
shard_files+=("${ALL_FILES[$i]}")
fi
done
shard_include=$(IFS=,; echo "${shard_files[*]}")Why this matters:
- Before: Every shard received the full
TEST_INCLUDE_FILESlist and relied onTEST_SHARD_INDEX/TEST_SHARD_COUNTto filter at pytest-collection time - After: Each shard receives only its pre-partitioned slice via
TEST_INCLUDE_FILES - Round-robin by position: shard 0 gets files 0,3,6,...; shard 1 gets 1,4,7,...; etc.
- Eliminates duplicate test runs across shards
2. Simplified container env vars
- Removed
TEST_SHARD_INDEXandTEST_SHARD_COUNT— no longer needed - Container only sees
TEST_INCLUDE_FILESwith its pre-assigned subset
✅ Good fix:
- Fixes redundant test execution (each file now runs on exactly one GPU)
- Comment documents the intentional trade-off: correct but not duration-balanced
- Duration-aware balancing acknowledged as future work
📋 Blocking for merge:
- Revert
run_docker_tests: 'false'override inbuild.yaml(still pending)
P2 inline comments (non-blocking):
strict=Truedefault indevices.py— still open- Comment-stripping
#inextra-env-vars— still open
Update (1547a9c): 🎯 test_devices() API Redesign — Scope ∩ Runtime Intersection
This commit introduces a significant API redesign for device parametrization, replacing cuda_test_devices() with a cleaner test_devices() function based on scope ∩ runtime_devices intersection.
Core Concept: Two Masks, Two Roles
# New API:
@pytest.mark.parametrize("device", test_devices("11X"))
def test_foo(device): ...Scope (call-site, author-owned): The devices the test is valid on
Runtime Devices (env var, operator-owned): The devices the run may use
The intersection determines what actually runs:
- Single-GPU CI:
scope="11X"∩runtime="110"→[cpu, cuda:0] - Multi-GPU shard:
scope="11X"∩runtime="00X"→[cuda:N](one non-default)
Mask Grammar Changes
| Mask | Meaning |
|---|---|
"110" |
cpu + cuda:0 (default scope & runtime) |
"11X" |
cpu + cuda:0 + any one non-default GPU |
"00X" |
A non-default GPU only |
"100" |
cpu only |
"001" |
cuda:1 specifically (rare) |
Key innovation: Trailing X means any one non-default GPU (not all). It resolves to ISAACLAB_SIM_DEVICE when set, or the lowest-index available non-default GPU.
Workflow Integration
# Each shard now uses uniform runtime mask
runtime_devices="00X" # Not per-shard "001", "002", etc.
-e ISAACLAB_TEST_DEVICES="$runtime_devices" \
-e ISAACLAB_SIM_DEVICE="cuda:$cuda" \The X resolves to whichever GPU the shard was assigned via ISAACLAB_SIM_DEVICE.
API Simplification
Before (removed):
cuda_test_devices(mask=None, strict=False, skip=None)After:
test_devices(scope="110", require_available=False)- Removed
skip=parameter (use pytest.mark.skip instead) - Renamed
strict→require_available(clearer intent) scopereplacesmaskwith clearer semantics
Mass Migration
All test files updated from cuda_test_devices() to test_devices("11X"):
test_newton_model_utils.pytest_simulation_context.pytest_views_xform_prim.pytest_math.py,test_noise.py,test_modifiers.pytest_wrench_composer.pytest_articulation.py(newton, physx, ovphysx)test_rigid_object.py,test_rigid_object_collection.pytest_contact_sensor.pytest_views_xform_prim_fabric.py- And more...
Test Infrastructure Improvements
tools/conftest.py now includes device information in the per-test timing table:
per_test_time_table = PrettyTable(field_names=["Test", "Device", "Time (s)"])Extracts device from test params (e.g., [size0-cuda:1]) for visibility.
Changelog
* Added :func:`isaaclab.test.utils.test_devices` to parametrize unit tests over
a device set resolved as ``scope ∩ runtime_devices``✅ Excellent design:
- Clean separation: Test authors specify validity, operators specify availability
- No per-shard customization: Uniform
"00X"mask works across any GPU count - Backwards compatible for single-GPU: Default
"110"scope matches old behavior - Self-documenting:
"11X"clearly reads as "cpu, cuda:0, and one non-default GPU"
📋 Blocking for merge:
- Revert
run_docker_tests: 'false'override inbuild.yaml(still pending)
P2 inline comments (non-blocking, from earlier review):
- Comment-stripping
#inextra-env-vars— still open
Update (a1fa279): 🔄 test_devices() API v2 — All Non-Default GPUs & Argless Default
This commit evolves the test_devices() helper with cleaner semantics and per-shard device pinning.
Key Semantic Change: Trailing X Now Means "All Remaining Devices"
Before (v1): X = "any one non-default GPU" (resolved via ISAACLAB_SIM_DEVICE)
After (v2): X = "include all remaining devices"
# v1: scope="11X" → cpu + cuda:0 + any-one-of(cuda:1, cuda:2, ...)
# v2: scope="11X" → cpu + cuda:0 + cuda:1 + cuda:2 + ... (all of them)The "run-once" property is now achieved through per-shard runtime masks in CI, not via the scope itself.
New Default: Argless test_devices()
# New idiomatic usage (default scope="11X"):
@pytest.mark.parametrize("device", test_devices()) # cpu + cuda:0 + all non-default GPUs
def test_foo(device): ...
# Pure math tests that don't need multi-GPU coverage:
@pytest.mark.parametrize("device", test_devices("110")) # cpu + cuda:0 only
def test_math_helper(device): ...Workflow: Per-Shard Precise Masks
# Old v1 approach:
runtime_devices="00X" # uniform across shards, X resolved via ISAACLAB_SIM_DEVICE
# New v2 approach:
runtime_devices=$(python3 -c "print('0' * ($cuda + 1) + '1')")
# cuda:1 → "001", cuda:2 → "0001", etc.Each shard now receives a position-specific mask that names exactly one device. The scope ∩ runtime intersection then yields that single GPU.
Implementation Simplification
Removed:
_target_nondefault()— no longer resolving "any one"_select()— complex mask-to-set conversion
Added:
_expand(mask, count)— simple mask expansion to boolean listskip={device: reason}parameter — gate known-broken device variants withpytest.mark.skip
Error Handling: Skip vs Raise
# Scope that doesn't intersect runtime → silent skip (green)
# e.g., "110" test on a non-default-GPU shard
# Runtime explicitly names devices host doesn't have → raise (red)
# e.g., ISAACLAB_TEST_DEVICES="0001" on a 2-GPU hostThis prevents vacuous greens from misconfigured runs while allowing legitimate skips.
Test Migrations
Changed to argless test_devices():
test_newton_model_utils.pytest_simulation_context.pytest_views_xform_prim*.pytest_wrench_composer.pytest_articulation.py,test_rigid_object*.pytest_contact_sensor.py
Changed to explicit test_devices("110"):
test_math.py,test_noise.py,test_modifiers.py(pure-math tests)test_episode_data.py
New: Comprehensive Unit Tests
source/isaaclab/test/utils/test_device_selection.py — 148 lines of tests covering:
- Scope ∩ runtime resolution
- Argless equals default scope
- Skip vs raise semantics
skip=parameter wrapping_expand()mask grammar
✅ Excellent evolution:
- Cleaner mental model:
X= "the rest" is more intuitive than "pick one" - Simpler implementation: No more
ISAACLAB_SIM_DEVICEresolution logic in the helper - Per-shard precision: Workflow owns the "which GPU" decision, not the test helper
- Comprehensive tests: The new test file is well-structured and covers edge cases
📋 Blocking for merge:
- Revert
run_docker_tests: 'false'override inbuild.yaml(still pending)
P2 inline comments (non-blocking):\n- Comment-stripping # in extra-env-vars — still open
EXPERIMENTAL — may be reverted if observed runtime or balance is worse than the serial baseline. Replaces the single-job 39-min serial test with a dynamic matrix of parallel jobs, one per non-default cuda:N device on the runner. Mechanism: * build job runs nvidia-smi on a multi-gpu runner, emits cuda_shards (JSON array, e.g. [1, 2, 3]) and shard_count (int) as outputs. * test job matrix consumes cuda_shards via fromJSON, so the partition follows whatever the pool exposes (4 GPUs today; auto-adapts to 2 or 8 later without code edits). * Each shard pins ISAACLAB_SIM_DEVICE=cuda:N + ISAACLAB_TEST_DEVICES to the matching mask, and requests shard-index=N-1 / shard-count from tools/conftest.py's file-shard logic. Expected wall on 4-GPU pool: ~13 min (max single-shard duration with naive paths[idx::3] split). Was ~39 min serial. Risks for monitoring: * Naive shard split is by sorted file name, not duration. Worst-case shard could pack the two longest files (530 s + 360 s) together. * Per-test parametrize is now ONE cuda:N per file rather than all of cuda:1/2/3 — distributes coverage instead of duplicating. Catches the same non-default-cuda defect class but loses 3x coverage on individual files. Revert path: drop this commit (drops the build outputs, the matrix strategy, and the per-shard env computation). Workflow reverts to a single-job test on cuda:1.
Matrix-of-shards allocated N separate runners from the multi-gpu pool even though each shard only uses 1 GPU and the runner exposes 4. Replaces the matrix with a single job that backgrounds one `docker run` per non-default cuda:N and waits for all. Same shard-index/shard-count semantics into tools/conftest.py, same ISAACLAB_TEST_DEVICES + ISAACLAB_SIM_DEVICE env per shard, just no extra runner. Per-shard logs grouped at the end; exit code aggregates across all shards. Trade: more bash in the workflow (the run-tests action can't be backgrounded as a composite step), but uses 1/N runners and pulls the image once instead of N times.
Prev attempt hit two issues that the run-tests action handles: 1. Symlink race — all 3 shards did rm/ln on the shared workspace's _isaac_sim symlink simultaneously, leaving cuda:1 without the symlink and forcing isaaclab.sh to fall back to python3 (not in path, exit 127). Now pre-create the symlink on the host once, before launching containers. 2. ModuleNotFoundError: junitparser — missing USER, LOGNAME, XDG_DATA_HOME env vars and the per-shard isaac-sim writable overlays that run-tests action sets. Without them the bundled python's import resolution went sideways. Mirror the full run-tests volume + env setup per shard.
The custom inline docker run bypassed run-tests action's hardcoded pip install of pytest pytest-mock junitparser flatdict flaky coverage, so the bundled python in the image (which does not include these) failed to import junitparser at tools/conftest.py load time. Mirror the run-tests action's pip install line inside the shard's in-container script. Each shard pays the install cost once (~2 s) in parallel.
The prior buffered design wrote each shard to a logfile and only cat'd them after wait, so when the job hit the 30-min timeout it had printed nothing but the launch notices — zero diagnostic signal. Each shard now pipes through 'tee clean.log | stdbuf -oL sed "s/^/[cuda:N] /"': the three concurrent streams interleave but every line is tagged with its GPU, so progress is followable live in the workflow log. tee keeps the unprefixed per-shard file for the end-of-job grouped re-print. The subshell exits with PIPESTATUS[0] (docker's code), so a passing sed tail can't mask a failing shard. timeout 30 -> 45 (DIAGNOSTIC): the first parallel run ran ~24.5 min of shard execution without finishing. Raised so a slow-but-completing run yields real per-shard durations to judge contention vs imbalance; tighten back to 30 once the picture is clear.
The shards passed the full include-files list plus TEST_SHARD_INDEX/ COUNT, but tools/conftest.py disables its shard split when include-files is set — so every shard ran the entire migrated suite (3x total work, ~40 min) instead of partitioning. Partition the discovered file list in the workflow instead: shard idx takes basenames at positions idx, idx+N, ... and passes only that slice as TEST_INCLUDE_FILES (dropping the now-ignored shard-index/ count env). Each file runs on exactly one GPU. Verified the split is deterministic with no duplicate/missing files. The round-robin split is correct but not load-balanced; duration-aware / work-stealing balancing is tracked as a follow-up.
| Ordered list of device strings as torch would address them. | ||
| """ | ||
| devices = ["cpu"] | ||
| if torch.cuda.is_available(): |
There was a problem hiding this comment.
Why are we using torch api here? This smells like tech debt. Shouldn't we switch to use warp? i.e.:
>>> import warp
>>> warp.init()
Warp 1.12.0 initialized:
CUDA Toolkit 12.9, Driver 13.0
Devices:
"cpu" : "x86_64"
"cuda:0" : "NVIDIA RTX PRO 6000 Blackwell Workstation Edition" (95 GiB, sm_120, mempool enabled)
Kernel cache:
/home/pbarejko/.cache/warp/1.12.0
>>> warp.get_cuda_devices()
['cuda:0']There was a problem hiding this comment.
Checked. using warp requires initializing runtime which might have side-effect to kit. torch does not have that.
| ], | ||
| ) | ||
| @pytest.mark.parametrize("device", ["cuda:0", "cpu"]) | ||
| @pytest.mark.parametrize("device", cuda_test_devices()) |
There was a problem hiding this comment.
Based on function name cuda_test_devices have we started to exclude cpu device tests. If so - why?
There was a problem hiding this comment.
renamed to test_devices
| so adopting this helper has zero impact on single-GPU runs. | ||
| """ | ||
|
|
||
| _ENV_VAR = "ISAACLAB_TEST_DEVICES" |
There was a problem hiding this comment.
Name of this global variable makes no sense, besides it's used in one place only. This unnecessary indirection makes code harder to read.
| reason="Multi-GPU tests disabled (set ISAACLAB_TEST_MULTI_GPU=1 to enable)", | ||
| ) | ||
| @pytest.mark.parametrize("device", ["cuda:1"]) | ||
| @pytest.mark.parametrize("device", cuda_test_devices(mask="001", strict=False)) |
There was a problem hiding this comment.
I don't understand the premise behind strictness in this function. I read it as maybe run on GPU 1, if unavailable then run on what's available. This introduces testing ambiguity, Just by looking at this line I don't know what you are trying to test.
In my opinion we would be better without cuda_test_devices function and keep list explicit, instead of resolved at runtime. In testing you want to keep constraints explicit.
The isaaclab_newton non-default-device fix and the isaaclab_physx / isaaclab_ovphysx test migrations had no changelog fragments, failing the changelog-fragments check. Add a Fixed entry for newton and .skip markers for the test-only packages.
test_devices() (renamed from cuda_test_devices) resolves the parametrized device list as scope intersect budget: the call-site mask declares the devices a test is valid on; the ISAACLAB_TEST_DEVICES env var declares the devices a run may use (default 110 = cpu + cuda:0). A trailing X means 'any one non-default GPU', resolved to ISAACLAB_SIM_DEVICE, so every shard shares one uniform budget instead of a per-shard positional mask. Replaces the strict flag with require_available. Migrate the call sites to scope masks (11X / 00X) and simplify the multi-GPU workflow auto-discovery and per-shard env accordingly.
Address review: the single-use _ENV_VAR constant named the ISAACLAB_TEST_DEVICES env var but its name conveyed nothing. Rename it _RUNTIME_DEVICES_ENV_VAR (and _DEFAULT_BUDGET -> _DEFAULT_RUNTIME_DEVICES), and use 'runtime devices' for the env-controlled device set throughout the docstrings, changelog, and workflow, so the scope (call-site) vs runtime-devices (env) split reads consistently.
The end-of-run summary showed only per-file timing. Add a GPU column to the per-file table (the run's boot device) and a new per-test run-time table (slowest first, with each test's device parsed from its id), and rename the mislabeled 'Per Test Result Summary' to 'Per File Result Summary'.
Rework the device-selection helper from scope/budget masks with an "any one non-default GPU" X wildcard (resolved via ISAACLAB_SIM_DEVICE) to a plain scope ∩ runtime intersection: - scope (call-site mask, default "11X") and runtime (ISAACLAB_TEST_DEVICES, default "110") are each expanded to per-device include flags and ANDed. - A trailing X includes all remaining devices; the helper no longer reads ISAACLAB_SIM_DEVICE. Running once on a non-default GPU is a property of the one-device-per-shard runtime the workflow sets, not of the helper. - Raise only when the runtime names a device the host lacks (a misconfigured run); a scope that simply does not intersect the run's devices skips. Migrate call sites to the convention: argless test_devices() for the device-dependent common case, "110" for torch-only math tests (which then skip the non-default shards), "00X" for non-default-only regressions. The multi-GPU workflow now sets a per-shard one-device runtime mask and its file discovery matches argless calls. Add unit tests covering the intersection, the skip-versus-raise boundary, the skip= gating, and the mask grammar.
1. Summary
isaaclab.test.utils.test_devices()— resolves a test's parametrized device list asscope ∩ runtime: the call-site mask declares the devices a test is valid on (default"11X"); theISAACLAB_TEST_DEVICESenv var declares the devices a run may use (default"110"⇒ cpu + cuda:0).ISAACLAB_SIM_DEVICE, honored byisaaclab.app.AppLauncheras the implicit-default boot device, so the workflow can boot Kit on a non-default GPU without editing everyAppLauncher()call site.mujoco_warp's collision pipeline no longer null-pointers oncuda:N.test-multi-gpu-pytest.yamlruns the non-default-GPU subset across all non-default GPUs in parallel on the[self-hosted, …, multi-gpu]pool, one container per GPU, each a disjoint round-robin file slice. Same ECR-pulledisaac-labimage andrun-testsactions as the single-GPU matrix.test_devices()(device-dependent), 57"110"(torch-only math, which then skip the non-default shards), 3"00X"(non-default-only regressions) — and adds 23 unit tests for the helper. Single-GPU CI behavior is unchanged.2. Device selection:
scope ∩ runtimeA test declares scope; the operator/CI sets the runtime; the executed set is the intersection. The author never names the shard's GPU; the operator never inspects a test's device support.
2.1 Mask grammar
Positions left to right:
0= cpu,1= cuda:0 (default GPU),2= cuda:1,3= cuda:2, … Each is0/1. A trailingXmeans "include all the remaining devices". cpu, cuda:0, and a non-default GPU stay distinct by position — passing on cuda:0 says nothing about cuda:1+, which is the point.11011X00X100test_devices()defaults to"11X", so the device-agnostic common case needs no argument. The helper never readsISAACLAB_SIM_DEVICE(that isAppLauncher's boot device); it assumes the run'sISAACLAB_TEST_DEVICESalready matches it, which the workflow guarantees.2.2 Runtime and the two CI lanes
ISAACLAB_TEST_DEVICEScarries the runtime. Ascope="11X"(or argless) test resolves to:"110") ⇒[cpu, cuda:0]."0001", one device) ⇒[cuda:2].An empty result means the test is cleanly skipped for that run — e.g. a
"00X"test on a single-GPU host, or a"110"math test on a non-default-GPU shard. The helper raises only when the run explicitly setISAACLAB_TEST_DEVICESto a device the host does not have (a misconfigured shard), so a runner with fewer GPUs than expected fails loudly instead of reporting a vacuous green.Running each test once on a single non-default GPU is a property of the workflow (a one-device-per-shard runtime mask plus round-robin file assignment), not of the helper — there is no device-picking logic inside
test_devices.3. Workflow shape
config→build(ECR cache + GPU detect, emits the non-default shard list) →test(onedocker runpercuda:Nin parallel on the multi-GPU runner). Each shard sets a per-shard runtime mask naming exactly its GPU (cuda:2⇒"0001") andISAACLAB_SIM_DEVICE=cuda:N, soscope ∩ runtimeresolves to that one GPU and Kit boots on the same one. Auto-discovery includes anytest_*.pywith an arglesstest_devices()or atest_devices("..X")scope. Inside each container,run-testsruns./isaaclab.sh -p -m pytestviatools/conftest.py's subprocess-per-file pattern.4. Single-GPU CI invariant
Migrated tests behave identically on the single-GPU pool: with
ISAACLAB_TEST_DEVICESunset the runtime is"110", so an argless /"11X"scope resolves to[cpu, cuda:0]— the same set tests carried before.ISAACLAB_SIM_DEVICEunset ⇒AppLauncherdefaults tocuda:0. Both env vars only take effect on the multi-GPU workflow.5. Known non-default-device gaps
test_get_jacobians_link_origin_contract,test_body_q_consistent_after_root_write) expose a genuine non-default-device defect beyond [Bug Report] Multi-GPU distributed training fails with Newton physics backend #5132 (stalebody_q/ link poses oncuda:N). They stay pinned tocuda:0for now; broadening is gated on the upstream fix.devicefixture and the remaining literalparametrize("device", [...])sites (~150) are not on the helper yet, so those tests still run cpu/cuda:0 on shards. Migrating them is a separate pass.