Skip to content

[CI] Cross-platform — Part 5: Dynamic work-stealing across multi-GPU shards#5875

Draft
hujc7 wants to merge 68 commits into
isaac-sim:developfrom
hujc7:jichuanh/multi-gpu-dynamic-sharding
Draft

[CI] Cross-platform — Part 5: Dynamic work-stealing across multi-GPU shards#5875
hujc7 wants to merge 68 commits into
isaac-sim:developfrom
hujc7:jichuanh/multi-gpu-dynamic-sharding

Conversation

@hujc7
Copy link
Copy Markdown
Collaborator

@hujc7 hujc7 commented May 30, 2026

1. Summary

Stacked on #5823 — the multi-GPU workflow file and the conftest summary code both originate there. Review only the two files above; the rest of the diff is #5823 and disappears once it merges.

2. Why work-stealing

The static round-robin split (#5823) assigns file i to shard i mod N. File durations vary widely (Kit-boot tests ~60–90 s, pure-math tests ~ms), so one shard routinely becomes the long pole while others idle. Work-stealing removes the imbalance with no duration model: whichever shard is free claims the next file.

3. Mechanism

3.1 Shared queue (conftest)

tools/conftest.py gains _claim_queued_file(queue_path) — an flock-guarded pop of the first line of a shared queue file — and _queued_files(), a generator that drains it. When ISAACLAB_TEST_QUEUE is set, run_individual_tests iterates the queue instead of the static include list; each file still runs once, on this container's pinned GPU. The end-of-run summary reports the files this container actually claimed.

3.2 Workflow

The discover step also emits the full relative paths; the run step seeds them into queue.txt in a dir mounted into every shard container at /mgpu, sets ISAACLAB_TEST_QUEUE=/mgpu/queue.txt, and drops the round-robin partition. flock on the shared mount serializes claims across containers (same host inode via the bind mount).

4. Design note — why not one container with N GPU worker threads

An earlier prototype ran a single container with one thread per GPU. The blocker: Kit's shader/compute cache lives at the absolute path /isaac-sim/.nv/ComputeCache, which is shared across concurrent Kit boots in one container — and per-worker HOME cannot relocate an absolute path. Concurrent shader compiles would race there. The existing N-container design exists precisely to isolate that cache, so this PR keeps N containers and only changes how files are distributed to them.

5. Validation

Queue logic validated locally: interleaved claimers each get every file exactly once, no double-claim, a missing queue file returns cleanly. conftest compiles, workflow YAML parses, pre-commit clean. The single-GPU CI path is untouched (ISAACLAB_TEST_QUEUE unset ⇒ the static iteration as before).


Note (2026-05-30): Cherry-picked #5881 for hypothesis test

This branch currently contains a cherry-pick of #5881 (Honor device kwarg over sim_cfg.device in build_simulation_context) to validate whether the Kit lifecycle hang in dynamic 3-shard mode is exacerbated by the device-drift bug. CI on this push exercises both fixes together. When #5881 lands on develop, this branch will rebase and the duplicate commit will drop automatically.

hujc7 added 30 commits May 26, 2026 23:10
Re-enables the pull_request trigger on test-fabric-multi-gpu.yaml and
wires it to run the FabricFrameView contract tests with
ISAACLAB_TEST_MULTI_GPU=1, which activates the three cuda:1
-parameterised tests added in isaac-sim#5514.

The cuda:1 tests target FabricFrameView's SelectPrims path on non-zero
CUDA device indices.  They currently hang indefinitely on real
multi-GPU hardware (reproduced locally on 3x RTX 6000 Pro Blackwell
and on the multi-GPU runner pool); the 60-min workflow timeout will
cancel the job and surface the regression in CI for the
FabricFrameView maintainers.

Install pipeline matches isaac-sim#5738's proven-working layout:
- Pin Python 3.12 via SHA-pinned actions/setup-python.
- Pre-install cmake via pip to skip install.py's sudo apt-get branch.
- ./isaaclab.sh --install none (core only, avoids egl_probe libEGL).
- pip install isaacsim[all,extscache]==${vars.ISAACSIM_BASE_VERSION
  || '6.0.0'} --extra-index-url https://pypi.nvidia.com.
- Bypass Kit's interactive EULA via OMNI_KIT_ACCEPT_EULA / ACCEPT_EULA
  / ISAAC_SIM_HEADLESS.

Status: this PR is expected to fail with the 60-min workflow timeout.
Land once the underlying hang in fabric_frame_view.py is fixed.
Adds a single helper, cuda_test_devices(), that converts a 3-position
device mask (env-var ISAACLAB_TEST_DEVICES, default '110') into the
list of device strings tests parametrize over.  Single-GPU CI sees no
change (default mask '110' resolves to [cpu, cuda:0], identical to the
hardcoded lists tests carry today).  The new multi-GPU-pytest workflow
sets ISAACLAB_TEST_DEVICES=001 so migrated tests run on cuda:1 only.

Mask grammar: each position is 0 or 1, optional trailing X expands to
all remaining positions. Position 0 -> cpu; position k>=1 -> cuda:{k-1}.
Strict mode raises on missing devices; non-strict returns empty for
opt-in tests that should skip on hosts that can't satisfy them.

P0 migration (pure-Python utility tests, no Kit):

* source/isaaclab/test/utils/test_math.py: 45 parametrize sites +
  2 inline for-loops migrated.
* source/isaaclab/test/utils/test_wrench_composer.py: 37 sites.
* source/isaaclab/test/utils/test_episode_data.py: 5 sites.

Each migrated site replaces a hardcoded [cpu, cuda:0] (or the reversed
or tuple form) with cuda_test_devices().  Migration is additive - one
import line per file plus the inline edits.  No test logic changes.

Workflow: .github/workflows/test-multi-gpu-pytest.yaml runs on the
[self-hosted, ..., multi-gpu] pool with ISAACLAB_TEST_DEVICES=001.
Triggered on changes to the helper, the P0 test files, or the
workflow itself.

Excluded scope (to follow up after CI validates this MVP):

* P1 light-Kit tests (test_simulation_context, test_views_xform_prim,
  test_newton_model_utils, test_views_xform_prim_newton).
* P2 asset tests (test_articulation / test_rigid_object on physx and
  newton backends).
* FabricFrameView cuda:1 tests (PR isaac-sim#5514) - separate path, the
  SelectPrims deadlock there is tracked independently.

Reverts the fabric-specific .github/workflows/test-fabric-multi-gpu.yaml
edits that were carried on this branch from the earlier PR scope; that
demo is independent of this framework work.
Migrate 16 additional test files (P0 extras + P1 + P2 + P3) to call
cuda_test_devices() in their device parametrize, covering ~280 sites
across articulation/rigid-object/rigid-object-collection/sim/sensors
suites for physx, newton, and ovphysx backends.

Rewrite the workflow's run step to auto-discover any test file calling
cuda_test_devices() via grep, so new opt-ins land without workflow
edits. Files are split into a pure-Python pytest session and per-file
Kit-bound invocations (Kit is a process-wide singleton). A hardcoded
SKIP list parks the known-broken FabricFrameView cuda:1 path.

Per-Kit-file timeout 600 bounds any single hang at 10 minutes so the
job surfaces all failing files rather than blocking on the first.
pytest is not pulled in by --install none or by isaacsim[all,extscache].
Runner state was masking this; pin it explicitly.
flaky and pytest-mock are declared in source/isaaclab/setup.py
install_requires but pip's resolver was silently skipping them when
combined with the pytorch/nvidia extra-index urls in the install step.
Pin them explicitly so the multi-GPU runner is runner-state independent.

SKIP four newton test files that the cuda:1 cold-runner surfaces as
broken (test_contact_sensor hits a pre-existing measure_total kwarg
bug; test_articulation segfaults; test_rigid_object_collection and
test_views_xform_prim_newton have cuda:1 specific failures).  They're
still parametrized via cuda_test_devices() so single-GPU CI continues
to cover cpu+cuda:0.

Accept pytest exit code 5 (no tests collected) so module-level
pytestmark skips (e.g. backend-availability gates in ovphysx) and
device-only parametrize that resolves to [] on incompatible hosts both
count as success.
Four additional test files surfaced cuda:1-specific failures or hangs
on the multi-GPU runner:

* test_simulation_context: passes test_init[cuda:1] then hangs the
  next parametrize variant, 10-min per-file timeout fires.
* newton/test_rigid_object: 41 cuda:1 failures (out of 54).
* physx/test_rigid_object: passes test_initialization[cuda:1-1] then
  hangs at [cuda:1-2] (env-count 2 on cuda:1).
* physx/test_rigid_object_collection: same hang signature.

They keep their cuda_test_devices() parametrize so single-GPU CI
continues to exercise cpu+cuda:0; only multi-GPU CI skips them
pending separate investigation.
When the caller doesn't pass device= explicitly, AppLauncher now
falls back to the ISAACLAB_SIM_DEVICE env var (if set) instead of the
hardcoded cuda:0 default. Kit's active_gpu / physics_gpu are
process-global and locked after SimulationApp init, so per-test
parametrize alone cannot retarget GPU selection once the app is up.
Boot-time alignment is the only path that works.

The multi-GPU pytest workflow now sets ISAACLAB_SIM_DEVICE=cuda:1
alongside ISAACLAB_TEST_DEVICES=001, so PhysX and Warp pin to cuda:1
from process start.

Drops 7 entries from the SKIP list (5 cuda:1 hangs around
active_gpu/cuda mismatch + 2 newton suites likely sharing the same
root cause). Remaining SKIPs:
* FabricFrameView (usdrt SelectPrims cuda:0-only, upstream Kit)
* newton/contact_sensor (Newton PR isaac-sim#2135 measure_total rename, needs
  caller update in newton_manager.py — tracked separately).
Round-5 CI confirmed the AppLauncher fix unblocks test_simulation_context
(was hanging at second parametrize, now 42 passed in 62 s). Other
files surfaced separate, non-AppLauncher root causes that need
independent fixes:

* Newton suites (4 files): Warp allocator failure inside
  mujoco_warp.collision_driver on cuda:1. Reproduces locally on a
  3-device MIG host; root cause is the Warp/mujoco_warp interaction,
  not AppLauncher routing.
* PhysX suites (3 files): hang at test_initialization[cuda:1-2] only
  on the AWS multi-GPU runner. Passes in 11 s locally with the same
  code, so the hang is runner-specific (L40 driver / peer access /
  PCIe topology), not an IsaacLab bug.

test_simulation_context stays in scope (the AppLauncher fix made it
pass deterministically). FabricFrameView usdrt and contact_sensor
Newton API rename remain in SKIP for their pre-existing root causes.
Replaces the pip-install path with ECR pull of the same isaac-lab CI
image used by build.yaml.  ECR auths via the runner EC2 instance's
IAM role (no nvcr.io credentials required at PR-time), so fork PRs
work without exposing NGC_API_KEY.

Benefits:
* Newton 1.2+ pre-installed in the image, fixing the contact_sensor
  measure_total kwarg mismatch without a manual pip pin.
* Eliminates the 9 min cold pip-install step (image pull from ECR
  is tens of seconds when cached).
* Dep matrix matches single-GPU CI exactly, so both gates surface
  the same kind of dep skew.

If ECR cache misses (e.g. build.yaml hasn't completed first), the
action falls back to local build; that path is slow and requires
NGC_API_KEY.  Validating ECR auth on the multi-gpu runner pool is
the primary goal of this commit — drops contact_sensor from SKIP to
test that the version skew is resolved.
The previous attempt missed:
1. The isaac-sim base image has an ENTRYPOINT that launches Kit, so
   bash -c '...' was passed as Kit's argv (Kit booted, my script
   never ran). Mirror run-tests action: --entrypoint bash +
   -c '...'.
2. tools/conftest.py's pytest_ignore_collect returns True for
   every file (subprocess-per-test runner), so pytest collects 0
   items and exits. Pass --ignore=tools/conftest.py
   --ignore=source/isaaclab/test/install_ci, same as run-tests does.
ecr-build-push-pull only pulls locally on the exact-cache-hit path.
On deps-cache-hit (registry-side alias) the image isn't local, so
docker run fails with 'Unable to find image isaac-lab-ci:... locally'
followed by an unauthed Docker Hub pull attempt.  Explicit pull via
the ECR URL covers all paths uniformly.
The ecr-build-push-pull action cleans up its temp DOCKER_CONFIG after
running, so the docker login from inside the action does not persist.
Re-authenticate via aws ecr get-login-password (works via the runner
EC2's IAM role, no AWS creds in the workflow).
Runner's default ~/.docker/config.json declares a credential helper
that fails with 'not implemented'.  Mirror the same workaround the
ecr-build-push-pull action uses: drop a fresh config with credsStore
set to empty string, then docker login + pull work.
Image's default USER is isaaclab (uid 1000), which doesn't own the
volume-mounted host workspace, so it can't ln -s _isaac_sim (perm
denied) — falling back to PATH python3 which doesn't exist in the
image, hence pytest exit 127.
Running container as host uid:gid means the image's default /root
home is not writable, so Warp/numpy/pip cache writes hit
PermissionError [Errno 13] '/root/.cache'. Mount a fresh tmp dir
and point HOME + XDG_CACHE_HOME at it.
The docker image properly installs ovphysx, so module-level
pytestmark.skipif now triggers (no backend init at multi-gpu)
collecting 0 items / 1 skipped. isaaclab.sh's CLI wrapper translates
that exit-5 to exit-1, breaking the workflow's is_ok() check. Skip
the 3 ovphysx files here.
Per ~/.claude/skills/pr/ci-iteration-shortcut.md.  All gated Docker +
Tests jobs (single-GPU build/test matrix) skip via their existing
if-gate. Revert before final review.

PR 5823 iterates the multi-GPU pytest docker conversion; the heavy
single-GPU matrix adds no signal to that work and costs 30+ runner
minutes per push.
Replaces the 237-line custom workflow with the ~100-line shape used by
single-GPU test jobs: pull image via ecr-build-push-pull, run pytest
in container via run-package-tests + run-tests, let tools/conftest.py
handle the Kit-singleton subprocess-per-test pattern.

The 8 docker-runtime quirks I worked around in the previous attempt
(ENTRYPOINT, conftest ignore, deps-cache pull, DOCKER_CONFIG cleanup,
credsStore, uid mismatch, HOME, exit-5 propagation) are all already
handled inside the run-tests action. No reinvention.

Adds one input to run-tests + run-package-tests: extra-env-vars
(multiline KEY=value), used here to inject ISAACLAB_TEST_DEVICES and
ISAACLAB_SIM_DEVICE so the container's pytest parametrize and Kit
boot align on cuda:1.

Test scope: 9 opt-in basenames covering ~512 cuda:1 tests, same
discovery scope as the previous attempt minus the 11 SKIPped files.
ecr-build-push-pull's deps-cache-hit path only creates a registry-side
alias (no local pull). Without a prior build job that establishes the
exact-commit tag in ECR, the test job's internal ecr-build-push-pull
hits exact-cache-miss + deps-cache-hit and leaves no local image, so
docker run fails with 'pull access denied'.

Mirrors the build → test split that build.yaml already uses for the
single-GPU matrix. Build job pre-populates the exact tag (via
buildx imagetools create on deps-cache-hit, or full build on miss);
test job's inner ecr-build-push-pull then hits exact-cache-hit and
pulls locally via the action's existing 'Pull exact image' step.
* Re-applies the run_docker_tests='false' guard in build.yaml's
  changes job (per pr/ci-iteration-shortcut) so the single-GPU
  Docker + Tests matrix skips during this iteration.
* Adds test_views_xform_prim_fabric.py to the multi-GPU include-files
  list. Previously SKIPped because pip-install rounds hung on a
  usdrt SelectPrims cuda:1 deadlock; the docker image carries a
  newer Kit, so the cuda:1 path is worth re-validating here.

Must be reverted before final review.
The docker image's newer Kit resolves the usdrt SelectPrims cuda:1
deadlock that previously kept this file in the SKIP list (pip-install
rounds hit it). Run 26587461494 passed: 36 passed, 3 skipped, 2
xfailed for test_views_xform_prim_fabric.py.

This also restores build.yaml's changes detection (drops the temp
TEMP-iteration skip).
Per ~/.claude/skills/pr/ci-iteration-shortcut.md. Keep the single-GPU
Docker + Tests matrix disabled until iteration is over and the PR is
ready to land. Revert as the last commit before merge.
Folds the helper into the existing isaaclab.test subpackage shape
(sibling of isaaclab.test.benchmark, isaaclab.test.mock_interfaces)
under a new isaaclab.test.utils subpackage. Drops the standalone
isaaclab.testing folder, which was a new top-level namespace with no
precedent.

Import path: from isaaclab.test.utils import cuda_test_devices.
Implements Greptile P2.1 + P2.2 and consolidates the device-skip
mechanism inside test files so the workflow needs no opt-in or
opt-out edits.

API:
* cuda_test_devices() default strict=False — CPU-only dev hosts now
  collect the cpu variant cleanly instead of failing at pytest
  collection (Greptile P2.1).
* cuda_test_devices(skip={device: reason}) — wraps unsupported
  variants in pytest.param(..., marks=pytest.mark.skip(reason=...))
  so pytest still collects them and shows SKIPPED with the reason in
  CI output. Per-call granularity; reason co-located with the test.

Workflow:
* Auto-discovery via grep for cuda_test_devices callers; no SKIP list
  in the workflow. Adding/removing a test from multi-GPU scope is a
  test-file-only edit.

run-tests action:
* extra-env-vars parser now skips only full-line comments (no mid-line
  # stripping) and doesn't xargs-collapse whitespace (Greptile P2.2).

Test file migrations:
* 7 previously-SKIPped files (4 newton + 3 physx) now declare a
  module-level _CUDA_1_BROKEN dict with a tracking-issue URL and apply
  cuda_test_devices(skip=_CUDA_1_BROKEN) per parametrize site.
* test_views_xform_prim_fabric.py migrated to the helper too (was
  using the legacy ISAACLAB_TEST_MULTI_GPU env var pattern), so
  auto-discovery picks it up.
Earlier rounds SKIPped 7 newton+physx files based on failures observed
on the pip-install path or pre-docker rounds. The docker image carries
newer Kit + Newton + Warp that already resolved 2 other categories
(measure_total kwarg, FabricFrameView usdrt deadlock). Re-running the
cuda:1 variants of these 7 files to see which actually still fail on
the docker path.
Enable-all run (0118ea7) confirmed the docker image resolves the
PhysX hangs and FabricFrameView/contact_sensor failures that earlier
rounds suspected. Two narrow categories remain:

* 4 newton files — Warp/mujoco_warp init-order on cuda:1 (issue isaac-sim#5132).
  Same root cause across all four; gated via module-level
  _NEWTON_5132 dict.
* 1 PhysX test — test_rigid_body_no_friction[cuda:1-*] precision
  drift (1.8e-3 vs 1e-5 tolerance); gated via per-test
  _PHYSX_NO_FRICTION_CUDA1 dict.

Everything else previously SKIPped now runs and passes on cuda:1
(test_articulation, test_rigid_object except no_friction,
test_rigid_object_collection, test_views_xform_prim_fabric).
Add torch.cuda.set_device(device) + wp.set_device(device) at the
top of NewtonManager.start_simulation and initialize_solver
so mujoco_warp's collision pipeline allocates against an initialized
cuda:N primary context. Also pass device=device to the standard-
path wp.ScopedCapture (the relaxed-graph sibling already did this).

Local repro confirms: isaaclab_newton/test/assets/test_rigid_object.py
on cuda:1 was 41 failed; now 45 passed / 9 skipped / 0 failed.

Also fixes the test_rigid_body_no_friction tolerance branch in
both isaaclab_physx and isaaclab_newton test_rigid_object.py
files. The author already documented GPU non-determinism and set
tolerance = 1e-2 for cuda:0; the else branch fell through to
the CPU-tight 1e-5 on cuda:1, where PhysX's GPU integrator drift is
the same 1.8e-3 envelope. Gate on device.startswith('cuda') so
all cuda devices share the same loose tolerance.

Drops the temporary _NEWTON_5132 and _PHYSX_NO_FRICTION_CUDA1
skip dicts from the 5 test files now that the underlying bugs are
fixed.

Tracks: isaac-sim#5132.
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Update (commit 74d1ad1): New commit replaces the workflow-level grep -vE exclusion list with a file-level marker convention.

Changes in This Update

1. .github/workflows/test-multi-gpu-pytest.yaml

The discovery step now checks each candidate file for a MULTI_GPU_SKIP_REASON module-level constant instead of maintaining a hardcoded exclusion list:

for f in "${candidates[@]}"; do
  if grep -q '^MULTI_GPU_SKIP_REASON' "$f"; then
    skipped+=("$f")
  else
    discovered+=("$f")
  fi
done

Skipped files emit a ::notice:: annotation with the reason for CI visibility.

2. Test File Annotations

source/isaaclab_newton/test/assets/test_articulation.py and source/isaaclab_physx/test/assets/test_articulation.py both add:

MULTI_GPU_SKIP_REASON = "Kit lifecycle bug: SIGHUP ~80% into the file under concurrent multi-GPU"

Assessment

This is a clean improvement over the previous grep -vE approach:

Decoupled — Test owners add/remove skip annotations without touching workflow files

Self-documenting — The reason is co-located with the test file

Visible — CI ::notice:: annotations surface skipped files and reasons

Maintainable — Adding a new skip is one line in the test file, not editing YAML

Minor Observations

  1. Regex anchoring: grep -q '^MULTI_GPU_SKIP_REASON' requires the constant at line start. If someone indents it accidentally (e.g., inside a conditional), it won't be detected. This is actually a feature — it forces module-level declaration.

  2. Reason extraction: The sed pipeline handles both "quoted" and unquoted values, but might truncate if the reason contains embedded quotes. Low risk since reasons are simple strings.

Outstanding Items (Unchanged)

The previous review's open items still apply:

  • run_docker_tests: 'false' hardcoding in build.yaml (marked as "TEMP")
  • 45-minute timeout (documented as diagnostic, target is 30)

Verdict

LGTM ✅ — Nice incremental improvement. The file-level marker convention is cleaner and more maintainable than workflow-level exclusion lists.


Update (commit e2bfa1b): New commit adds --ulimit nofile=65535:65535 to the workflow's docker run and comments out the MULTI_GPU_SKIP_REASON markers in both test files as a CI experiment to validate the fix.

Assessment: Clean diagnostic change. The ulimit bump targets the root cause (Kit/Vulkan needing >1024 fds for multi-GPU shared resources), and the commented-out markers preserve easy rollback. No new issues introduced. Previous outstanding items (build.yaml run_docker_tests: 'false' hardcoding, 45-min timeout) still apply but are pre-existing.


Update (commit 5464c1e): Experiment rolled back — --ulimit nofile=65535:65535 removed from docker run and MULTI_GPU_SKIP_REASON markers restored in both test files. The ulimit bump did not resolve the Kit lifecycle failures. Comments upgraded from "TEMP CI EXPERIMENT" to proper documentation explaining the skip convention. No new issues introduced.

hujc7 added 13 commits May 30, 2026 05:06
The articulation jacobian / body_q tests are unreliable under concurrent
multi-GPU execution: a 3-way concurrent repro fails
test_get_jacobians_link_origin_contract on cuda:0/1/2 alike (J*q_dot wrong,
~110/120 elements), and multi-GPU CI hangs the whole file. They are not
device-specific (they fail on the default GPU too), so the prior ["cuda:0"]
pin did not avoid the problem on the shards.

Gate them with test_devices("01X", skip_non_default=...): they run on cuda:0
in single-GPU CI as before and collect as SKIPPED (with a reason) on the
non-default shards. Adds skip_non_default to the test_devices helper, a unit
test for it, and -v to the per-file pytest command so a future hang names its
test.
Under concurrent multi-GPU execution a long-running but otherwise-passing
test file (newton test_articulation) was killed mid-run by SIGHUP
(signal 1) at ~481s, far under its timeout, from orphaned-process cleanup
during shutdown of a sibling test's Kit children. A CI test subprocess
runs with no controlling terminal, so SIGHUP is always spurious; ignore it
in the child. The conftest uses SIGKILL for its own timeout and teardown,
so legitimate cleanup is unaffected.
SIG_IGN set before exec did not survive Kit/carbonite startup, which
resets the SIGHUP disposition, so a spurious orphan-cleanup SIGHUP still
killed a long-running, otherwise-passing test file mid-run under
concurrent multi-GPU load. Block SIGHUP via the signal mask instead: a
disposition reset does not clear the mask, so the signal stays pending
and is never delivered. SIGKILL remains for the conftest's own timeout
and teardown.
Hardcoded ["cuda:0"] device parametrizations always ran on cuda:0 even
on a non-default-GPU shard, because the simulation context retargets
physics to the parametrized device. They never exercised a non-default
GPU. Route them through test_devices("01X") so they follow the run's
device (cuda:0 in single-GPU, the shard GPU in multi-GPU), giving real
non-default coverage and removing per-shard physics device churn.
The spurious SIGHUP that kills newton test_articulation on the non-default
shards is external (the conftest only sends SIGKILL) and does not reproduce
on a single host or under local concurrency, so its sender is unknown.

Wrap the per-file test command (gated by ISAACLAB_SIGHUP_PROBE, set in the
multi-GPU workflow) with a probe that blocks SIGHUP and uses sigwaitinfo to
log the sending PID / comm / cmdline. This only records the sender; it does
not change whether the file fails. Reverted once the source is identified.
newton and physx test_articulation trigger a Kit/Isaac-Sim process-lifecycle
failure under concurrent multi-GPU execution on non-default GPUs (newton:
SIGHUP ~80% into the file on cuda:2/3; physx: shutdown hang). It is not
reproducible single-host -- single, 3-way concurrent, and multi-file local
repros all pass -- and is not caused by the sharding (workflow, conftest, and
work-queue verified clean). It is the same heavy file the Kit-integration
pipeline disables.

Exclude it from the multi-GPU work queue in the discover step (it still runs
in single-GPU CI), and revert the in-process SIGHUP-suppression attempts
(SIG_IGN + signal-mask block) and the sender probe: the signal lands on the
Kit process deep in the tree, which a launch-point wrapper cannot reach.
Tracked upstream as a Kit issue; re-include once fixed.
The skip_non_default gate and the workflow grep -v exclusion of
test_articulation were defending against a Newton COM-velocity-contract
bug that the pinned Newton v1.2.0 already fixes (upstream PR 2206).
Local 3-way concurrent repro on the pinned env confirms: J*qdot drops
from ~21 to ~4e-6 on panda under the same contention. The SIGHUP/hang
in test_articulation is hypothesized to be downstream of the same bug
(garbage joint velocities cascading into solver runaway); removing the
exclusion exposes both for CI validation. Drops the now-unused
skip_non_default parameter and its unit tests.
Multi-GPU CI run 26691294444 on 162ac6a (un-gate experiment) failed with
the same newton SIGHUP + physx shutdown-hang in test_articulation that
the prior f1762a5 gate addressed. The newton COM-velocity contract bug
that 162ac6a's jacobian-gate-removal targeted IS fixed by the pinned
Newton v1.2.0 (locally verified, A_newton_jac drops ~7 orders of
magnitude), but the SIGHUP/shutdown-hang is a separate Kit/Isaac-Sim
process-lifecycle bug, also reproduced locally with 4-MIG concurrency on
pinned newton v1.2.0 + isaac-sim:6.0.0-dev2.

Restore the workflow discover-step ``grep -v`` filter only; keep the
skip_non_default infrastructure removal from 162ac6a since the jacobian
fix made that helper dead code.
Move the multi-GPU CI per-file exclusion from a hardcoded grep -v in the
workflow to a generic module-level ``MULTI_GPU_SKIP_REASON`` constant in
the test file itself. The workflow discover step now filters out any
file declaring this marker and emits a ::notice with the reason. Adding
or removing a fragile file no longer requires a workflow edit.

Apply the marker to newton and physx ``test_articulation.py`` (Kit
lifecycle bug: SIGHUP / shutdown-hang under concurrency, separate from
the now-fixed newton COM-velocity contract bug).
Experiment to test whether the Kit Linux documented fix
(--ulimit nofile=65535:65535 for shared-Vulkan-resource creation when
>2 GPUs are active) resolves the multi-GPU Kit lifecycle SIGHUP /
shutdown-hang in test_articulation. Disables the MULTI_GPU_SKIP_REASON
markers temporarily so the file runs on shards; if CI still fails, the
markers (commented in place) get restored.
Local 2/3/4-MIG cross-GPU threshold sweep confirms the Kit SIGHUP /
shutdown-hang bug in test_articulation fires at >=3 concurrent Kit
instances (v8 3-MIG repros it identically to CI's 3-shard config;
v7 2-MIG passes cleanly). Cap multi-GPU CI's shard count at 2 to
stay below the threshold; un-gate test_articulation so it runs on
the multi-GPU lane again. Restores ~5min wall-time cost (2 shards
instead of 3) in exchange for re-enabling the file.
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Update (commit e7d1af7): New commit implements a 2-shard cap as a concurrency-threshold workaround for the Kit lifecycle bug.

Changes in This Commit

1. .github/workflows/test-multi-gpu-pytest.yaml

Shard count is now capped at 2, regardless of available GPUs:

# TEMP EXPERIMENT: cap shards at 2 to test concurrency-threshold workaround for
# the Kit lifecycle bug (SIGHUP/shutdown-hang fires at >=3 concurrent Kit
# instances; 2 is below the threshold per local 2-MIG cross-GPU repro).
shards=$(python3 -c "import json; cap=2; print(json.dumps(list(range(1, min($N, cap+1)))))")
count=$(python3 -c "cap=2; print(min($N-1, cap))")

2. Test Files Re-enabled

Both test_articulation.py files (isaaclab_newton and isaaclab_physx) now have their MULTI_GPU_SKIP_REASON markers commented out to test whether the 2-shard cap avoids the Kit lifecycle bug.

Assessment

Hypothesis-driven debugging — The approach systematically tests the theory that ≥3 concurrent Kit instances trigger the lifecycle bug, while ≤2 are stable.

Conservative cap — 2 shards still provides multi-GPU coverage (cuda:1, cuda:2) while staying below the suspected threshold.

⚠️ GPU Utilization — On 4+ GPU systems, only 2 of the non-default GPUs will be utilized. This is acceptable as a diagnostic step but should be reverted once the root cause is understood.

Comments on Implementation

  1. Inline Python for cap logic: The cap calculation uses inline Python:

    count=$(python3 -c "cap=2; print(min($N-1, cap))")

    This works but is slightly fragile. Consider extracting cap=2 to a workflow-level variable for easier adjustment:

    env:
      SHARD_CAP: 2  # TEMP: workaround for Kit lifecycle bug
  2. Experiment lifecycle: This is now the third commit in the diagnostic cycle (enable → test → observe). The clear comments document the experiment purpose, which is good practice.

Outstanding Items (Cumulative)

Item Status Location
run_docker_tests: 'false' hardcoding Pre-existing build.yaml
45-minute timeout (target 30) Pre-existing test-multi-gpu-pytest.yaml
2-shard cap New (diagnostic) test-multi-gpu-pytest.yaml
Commented MULTI_GPU_SKIP_REASON markers New (experiment) Both test_articulation.py

Verdict

LGTM for experiment ✅ — Clean diagnostic change. The 2-shard cap is a reasonable approach to isolate the concurrency threshold. Results from this CI run will inform whether the cap resolves the issue or if the problem lies elsewhere.

Pre-merge checklist (once experiments conclude):

  • Revert or formalize the shard cap based on findings
  • Restore or remove MULTI_GPU_SKIP_REASON markers accordingly
  • Revert run_docker_tests: 'false' in build.yaml
  • Adjust 45-minute timeout if validated

Update (commit 4bc8b33): The 2-shard cap experiment has concluded. This commit:

  1. Removes the shard cap — all available GPUs are now used (shards=$(python3 -c "...range(1, $N)"), count=$((N - 1))). The inline Python is simplified and the bash arithmetic is cleaner.
  2. Re-enables MULTI_GPU_SKIP_REASON in both test_articulation.py files — with improved documentation explaining the opt-out mechanism. These files are excluded from concurrent multi-GPU CI while the upstream Kit lifecycle bug persists.

Previous Concerns Status

  • ⚠️ Inline Python for cap logic → ✅ Moot — the cap is removed entirely. The remaining inline Python is minimal and the count uses native bash arithmetic.
  • ⚠️ GPU Utilization → ✅ Fixed — all non-default GPUs are now utilized.
  • 📋 2-shard cap → ✅ Reverted (experiment concluded).
  • 📋 Commented MULTI_GPU_SKIP_REASON markers → ✅ Markers restored and properly documented.

Assessment

Clean resolution of the diagnostic cycle. The approach is sound: use all GPUs but exclude known-bad test files via a well-documented constant. The improved comments on MULTI_GPU_SKIP_REASON clearly explain the mechanism and removal criteria. No new issues introduced.


Update (commit 3d3f136): New commit adds a fix for build_simulation_context silently ignoring the device kwarg when sim_cfg is also provided.

Changes

  1. simulation_context.pydevice parameter default changed from "cuda:0" to None. When device is explicitly passed alongside sim_cfg, it now overrides sim_cfg.device. This fixes warp kernel device-mismatch errors on non-default GPUs.
  2. Changelog entry added documenting the fix.
  3. Test updates — both headless and non-headless test files split the test case to separately verify: (a) sim_cfg values are preserved when no device kwarg is given, and (b) explicit device kwarg wins when both are passed.

Assessment

Correct fix — The None default + explicit override pattern is clean and backward-compatible. Callers not passing device get SimulationCfg's default; callers passing it explicitly get what they asked for.

Well-tested — Both branches (sim_cfg-only and device-override) are covered.

Good documentation — The inline comment explains the rationale and the failure mode it prevents.

⚠️ Scope note — This commit fixes a real bug but is unrelated to the CI multi-GPU sharding work in the PR title. Consider whether this should be a separate PR for cleaner git history, though it is understandable if the bug was discovered during multi-GPU testing.

No blocking issues. LGTM.


Update (commit 3bd01c4): Removes MULTI_GPU_SKIP_REASON from both test_articulation.py files (isaaclab_newton and isaaclab_physx), fully re-enabling them in concurrent multi-GPU CI.

This is the logical conclusion of the diagnostic cycle: the build_simulation_context device-override fix from the previous commit likely resolved the warp kernel failures that were triggering Kit lifecycle issues under concurrent execution. With the root cause addressed, the opt-out markers are no longer needed.

Changes

Both files simply remove the 6-line MULTI_GPU_SKIP_REASON constant block:

# File-level opt-out from concurrent multi-GPU CI. The multi-GPU workflow's
# discover step skips any test file declaring this module-level constant.
# ...
MULTI_GPU_SKIP_REASON = "Kit lifecycle bug: ..."

Assessment

Clean removal — No code changes beyond deleting the opt-out marker.

Logical progression — The diagnostic cycle is complete: 2-shard cap (experiment) → device fix (root cause) → re-enable tests (validation).

⚠️ Risk note — If the Kit lifecycle issues resurface in CI, these markers can be quickly re-added. The workflow's MULTI_GPU_SKIP_REASON discovery mechanism remains in place.

Outstanding Pre-merge Items

Item Status Notes
run_docker_tests: 'false' ⚠️ Still hardcoded build.yaml — revert before final merge
45-minute timeout ℹ️ Diagnostic Can tighten to 30 once validated
Cherry-picked #5881 ℹ️ Noted in PR description Will drop on rebase after #5881 merges

No new issues introduced. LGTM. ✅

Update (commit 7d04e41): Adds SIGHUP signal handling and fixes _abort_signal_handle_callback to properly exit after cleanup.

Changes

  1. app_launcher.py — Registers SIGHUP handler so child processes shut down cleanly when the parent shell (supervising sibling shards) exits. Rewrites the callback to call sys.exit(128 + signum) after app.close(), preventing Python from resuming with half-torn-down Kit state. Uses contextlib.suppress(Exception) for robustness in signal context.
  2. Changelog entry — Documents both fixes clearly.

Assessment

Correct fix — SIGHUP cascading to child shards was the missing piece causing "Stage X already attached" failures. The explicit sys.exit() after cleanup is the right pattern for replaced signal dispositions.

Clean implementation — Both contextlib and sys were already imported. Parameter rename from signal to signum avoids shadowing the module. Comments are thorough.

No new issues introduced.

Outstanding Pre-merge Items

Item Status Notes
run_docker_tests: 'false' ⚠️ Still hardcoded build.yaml — revert before final merge
45-minute timeout ℹ️ Diagnostic Can tighten to 30 once validated

Update (commit 2b4530d): Two new diagnostic commits (7251b74 + 2b4530d). The latest restores 3 shards and replaces --gpus all with --gpus "device=$cuda" to give each container exclusive access to a single physical GPU. This eliminates the cross-GPU visibility that was triggering the SIGHUP cascade / "Stage already attached" pattern.

Assessment: Clean diagnostic change. The per-container GPU isolation via --gpus device=N is the correct approach — it mirrors MIG-level hardware isolation and prevents Kit processes from interfering with each other's GPU contexts. The explanatory comment is clear and well-reasoned.

No new issues introduced. Previous outstanding items unchanged:

Item Status Notes
run_docker_tests: 'false' ⚠️ Still hardcoded build.yaml — revert before final merge
45-minute timeout ℹ️ Diagnostic Can tighten to 30 once validated

Update (commit 6137b1b): Significant pivot — instead of per-container GPU isolation (--gpus device=N), the new approach eliminates Kit entirely from newton tests (kitless mode) and re-adds MULTI_GPU_SKIP_REASON for the physx test that still requires Kit.

Changes

  1. Workflow — Reverts --gpus "device=$cuda"--gpus all. The per-container isolation is no longer needed since newton tests no longer boot Kit.

  2. schemas.py — New _create_fixed_joint_to_world() helper replaces the from omni.physx.scripts import utils import with pure pxr.UsdPhysics calls. Well-implemented: handles instance proxy / prototype prim climbing, unique joint naming, and correctly sets body0 (world) / body1 (articulation) with proper transforms via UsdGeom.XformCache.

  3. Newton test filestest_articulation.py and test_rigid_object_collection.py remove AppLauncher boot and test_devices() import, replacing with hardcoded ["cuda:0", "cpu"] or ["cuda:0"]. This makes them fully kitless, avoiding the Kit lifecycle bug entirely.

  4. PhysX test_articulation.py — Re-adds MULTI_GPU_SKIP_REASON to opt out of concurrent multi-GPU CI (this file still requires Kit).

  5. Changelog entries — Properly added for the kitless fix.

Assessment

Better root-cause solution — Rather than working around Kit concurrency bugs with container isolation, this removes the Kit dependency where it isn't needed. Kitless newton tests run faster (~30s saved per file) and are immune to Kit lifecycle issues.

_create_fixed_joint_to_world() is well-written — Correctly mirrors the omni.physx.scripts.utils.createJoint single-selection Fixed branch. The prototype/instance climb, unique naming, and transform extraction are all correct.

Clean test migration — Removing AppLauncher is safe for newton-only tests since SimulationContext gates Kit-specific paths on has_kit().

Outstanding Pre-merge Items

Item Status Notes
run_docker_tests: 'false' ⚠️ Still hardcoded build.yaml — revert before final merge
45-minute timeout ℹ️ Diagnostic Can tighten to 30 once validated
MULTI_GPU_SKIP_REASON in physx test ℹ️ Intentional Stays until upstream Kit fix (IsaacLab #3475)

No new issues introduced. LGTM. ✅

Most test callers pass both ``sim_cfg=`` and ``device=`` to
:func:`isaaclab.sim.build_simulation_context`, implicitly expecting the
``device`` kwarg to win. The helper previously dropped the kwarg silently
when ``sim_cfg`` was provided, causing warp kernel-launch device
mismatches on non-default GPUs: the test fixture allocated ``env_ids``
on the requested device while the articulation's ``self.device``
resolved from the untouched ``sim_cfg`` default (``cuda:0``), and
``wp.launch(..., device=self.device)`` failed with::

    RuntimeError: Error launching kernel 'set_root_link_pose_to_sim_index',
    trying to launch on device='cuda:0',
    but input array for argument 'env_ids' is on device=cuda:2.

Change ``device``'s default to ``None`` (sentinel) and apply it as an
override after sim_cfg construction in both branches. The one test that
asserted the old "sim_cfg overrides everything" contract is updated to
cover the new override semantics.
Drop the MULTI_GPU_SKIP_REASON marker from both the newton and physx
test_articulation variants so they participate in dynamic 3-shard
multi-GPU pytest again. Pairs with the cherry-picked device-kwarg fix
to validate whether the Kit lifecycle hang is exacerbated by the
device-drift bug. If the multi-GPU pytest workflow now holds up
consistently across re-runs, the upstream Kit issue may not require
the file-level skip.
hujc7 added 9 commits May 31, 2026 00:10
Two coupled bugs in :class:`isaaclab.app.AppLauncher`:

1. SIGHUP was unhandled. Kit launches with
   ``--/app/installSignalHandlers=0``, so when a controlling session
   leader exits (e.g. the parent shell that supervises sibling shards in
   multi-GPU CI), child Kit processes receive SIGHUP with default
   disposition: terminate. ``_atexit_close`` does not run, so
   ``SimulationApp.close`` is skipped and USD/PhysX state is left
   attached. The next sibling shard then trips
   ``[Error] [omni.physx.plugin] Stage X already attached`` and Kit
   shutdown subsequently hangs on the orphan's state.

   Register the same handler used for SIGTERM/SIGABRT/SIGSEGV.

2. ``_abort_signal_handle_callback`` swallowed the signal's terminate
   semantics. After calling ``self._app.close()`` it returned, so
   Python resumed execution past the signal as if nothing happened.
   The replaced OS-default disposition would have killed the process;
   the Python handler did not.

   Wrap ``_app.close()`` in ``contextlib.suppress(Exception)`` and call
   ``sys.exit(128 + signum)`` to preserve the conventional signal-exit
   encoding and actually terminate.
Pin shard_count to min(available, 2) to test whether the Kit lifecycle
hang (SIGHUP cascade + "Stage already attached" + 52s shutdown hang on
test_articulation) only manifests at 3+ concurrent Kit processes.

Local 3-MIG repro on Horde passes cleanly (hardware-isolated MIG slices);
CI 3-shard on shared-GPU runners fails consistently. This commit narrows
the failure window so the data tells us:

* 2-shard CI green and consistent -> 3+ is the concurrency threshold;
  isolation layer or CUDA_VISIBLE_DEVICES per-shard is the fix.
* 2-shard CI still flaky -> something other than process count is the
  trigger; deeper investigation needed.

Revert after the data is collected.
Two changes in one commit (paired diagnostic):

1. Restore the dynamic shard_count = N-1 computation; the 2-shard cap
   diagnostic is being superseded by this run.

2. Replace ``--gpus all`` with ``--gpus device=$cuda`` so each shard
   container sees only one physical GPU. Mirrors the hardware-level
   isolation that MIG provides on the Horde 3-shard local repro
   (which passes cleanly), and removes the cross-process GPU visibility
   that the multi-GPU CI runner currently allows.

Hypothesis: the SIGHUP cascade + "Stage already attached" pattern only
fires when sibling Kit processes can see each other's GPUs and share
host driver state. If this commit's CI is green, isolation is the
fix and we make this permanent.

Revert after the data is collected.
Paired with the previous ``--gpus device=$cuda`` isolation diagnostic.
With per-shard GPU isolation, each container sees exactly one physical
GPU and it appears as ``cuda:0`` inside the container. The previous
``ISAACLAB_TEST_DEVICES=$runtime_devices`` (e.g. ``"0001"`` for cuda:2)
and ``ISAACLAB_SIM_DEVICE=cuda:$cuda`` (e.g. ``cuda:2``) tried to use
indices the container can no longer see, so collection failed:

  ValueError: ISAACLAB_TEST_DEVICES='0001' names no device available
  on this host (available: ['cpu', 'cuda:0'])

Set both to ``cuda:0``/``01`` unconditionally. The work queue still
distributes files across the 3 shards so each physical GPU exercises
a different slice.
Restore the MULTI_GPU_SKIP_REASON marker on the physx variant only.
Newton test_articulation drops AppLauncher entirely via PR isaac-sim#5883, so it
runs cleanly under concurrent multi-GPU. The physx variant must still
boot Kit for omni.physics; under 3-shard concurrent CI runners (shared
GPU visibility) Kit's shutdown hangs >52s, causing SIGHUP cascade across
sibling shards and "Stage already attached" errors.

Cross-linked upstream at IsaacLab isaac-sim#3475 / OMPE-43816 (deferred past
Isaac Sim 5.0 per the engineering thread).
Bundles the kitless conversion of newton test_articulation +
test_rigid_object_collection into the dynamic-sharding branch so the
multi-GPU CI workflow actually exercises a non-Kit-booted newton
test_articulation alongside the physx skip. Will rebase away when isaac-sim#5883
lands.

Includes the universal schemas.py fix
(``_create_fixed_joint_to_world`` replaces unguarded
``omni.physx.scripts.utils.createJoint``) and the .skip changelog
fragments for the test-only packages.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

infrastructure isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant