[CI] Cross-platform — Part 3: Windows workflow#5700
Conversation
There was a problem hiding this comment.
Review Summary
This PR adds experimental Windows CI jobs for perception smoke testing, building on the uv-based installation path. The approach is well-structured, with incremental jobs that progressively test more functionality.
✅ Strengths
- Incremental approach: The workflow builds from simple torch/scipy tests → IsaacLab core install → full perception smoke
- Cross-platform fix in
AssetConverterBase: Moving from hardcoded/tmp/IsaacLabtotempfile.gettempdir()is the correct POSIX/Windows-agnostic solution - Proper marker recognition in AppLauncher: The
windows_ciandarm_cimarkers are now correctly filtered fromsys.argv - Good use of
continue-on-error: true: Appropriate for experimental jobs that should not gate merges
🔍 Findings
1. Path handling in GITHUB_PATH append (Low Priority)
echo "$HOME\\.local\\bin" | Out-File -FilePath $env:GITHUB_PATH -Append -Encoding utf8The $HOME variable in PowerShell uses forward slashes internally, but .local\\bin uses backslashes. While PowerShell is generally tolerant, consider using Join-Path for consistency:
Join-Path $HOME ".local\\bin" | Out-File -FilePath $env:GITHUB_PATH -Append -Encoding utf82. Potential silent failure in perception smoke (Medium Priority)
The inline Python smoke script does not capture or report specific failure modes:
python perception_smoke.pyIf the script fails during Kit initialization (e.g., GPU detection issues), the error may be buried in the output. Consider adding explicit error handling:
import sys
try:
# ... existing code ...
print("perception smoke: PASS")
except Exception as e:
print(f"perception smoke: FAIL - {type(e).__name__}: {e}", file=sys.stderr)
sys.exit(1)This would make CI failure reasons more immediately visible.
3. Artifact retention note
7-day retention for artifacts is appropriate for experimental CI. Once this stabilizes, consider if longer retention is needed for debugging flaky Windows-specific failures.
📝 Documentation
The PR description and commit messages are excellent - each iteration is well-documented with clear rationale for changes. The .skip changelog file appropriately notes this is CI-only.
✅ Verdict
LGTM - The changes are well-structured for an experimental CI addition. The continue-on-error: true at both job and step level appropriately prevents this from blocking the PR workflow while still providing visibility into Windows perception test stability.
The AssetConverterBase fix for cross-platform temp directories is a clean, correct change that should be merged regardless of the CI experiment outcome.
Update (6811028): PR rebased and squashed to two clean commits:
9242498— Adds Windows + ARM pytest markers,AppLauncherargv filtering, andAssetConverterBasecross-platform temp dir fix6811028— Full Windows CI workflow with the 5-job pipeline
Content is unchanged from prior review. All previous findings remain valid optional recommendations. Approval stands.
Update (6fdab92): New push adds the full Windows CI workflow (was previously staged). Reviewed incremental diff from f2b2445...6fdab92:
- ✅ Finding #1 addressed: Now using
Add-Content -Path $env:GITHUB_PATHwhich is cleaner - ✅ Finding #2 addressed: Perception smoke script now has 3-minute watchdog thread + explicit asserts throughout
⚠️ Note:build.yamlhas temporaryrun_docker_tests: 'false'hardcoding — marked for revert before merge
No new issues in the incremental changes. Approval stands.
Update (3c3ba63): Reviewed incremental diff from 6fdab92...3c3ba63:
- Changed
isaacsim[all]>=6.0.0→isaacsim[all,extscache]==6.0.0.* - ✅ Good refinement: adds
extscacheextra for pre-cached extensions and pins to 6.0.0.x patch versions for more deterministic CI
No new issues. Approval stands.
Update (7345095): Reviewed incremental diff from 3c3ba63...7345095:
- ✅ Added
uv pip install -e source/isaaclab_physxbeforeisaaclab_tasksin both Windows CI jobs - This fixes a runtime dependency:
isaaclab_taskseagerly imports fromisaaclab_physxduring package walk (GroundPlaneCfg physics_material forwarding shim) - Without this, import dies with "RigidBodyMaterialCfg has moved to isaaclab_physx.sim.spawners.materials"
Good fix — correctly identifies and addresses the import-time dependency issue. Approval stands.
Update (683c110): Reviewed incremental diff from 7345095...683c110:
- Changed
uv pip install -e source/isaaclab_physx→uv pip install --no-deps -e source/isaaclab_physx - Added inline comment explaining:
isaaclab-physx==1.1.0has a hard dep onisaaclab-ppispwhich isn't on any index nor insource/. The dep is referenced lazily at runtime, so--no-depsunblocks editable install.
✅ Reasonable workaround for a missing transitive dependency. Since isaaclab-ppisp is lazily loaded and not exercised in these CI tests, skipping dep resolution is pragmatic. Approval stands.
Update (e9d4152): Reviewed incremental diff from 683c110...e9d4152:
- Added
h5pyto test deps fortest_hdf5_dataset_file_handler.py - Added CUDA torch wheel install (
--index-url https://download.pytorch.org/whl/cu128 torch torchvision) to fixTorch not compiled with CUDA enablederror in test_episode_data's[cuda:0]parametrize cases - Switched from marker-based discovery (
-m windows_ciontest/utils) to explicit file list (4 specific test files) - Good rationale in comments: avoids importing neighbors with module-level
AppLauncher/parser.parse_args()that fail without Sim installed
✅ Pragmatic fix. Explicit file listing is more robust than marker filtering when neighbor imports cause collection errors. Approval stands.
Update (ce5d56c): Reviewed incremental diff from e9d4152...ce5d56c:
perception-windowsjob condition changed fromrun_windows_ci == 'true'→if: false(gated off)- Timeout reduced from 30 → 10 minutes
- ✅ Excellent inline comment explaining rationale: Windows runner lacks functional Vulkan/GPU stack (
vkEnumeratePhysicalDevices failed), Kit hangs and starves workflow
Sensible decision — disabling the GPU-dependent perception job until the runner is confirmed GPU-capable prevents resource waste. Clear documentation makes it easy to re-enable later. Approval stands.
Update (319eb8f): Reviewed incremental diff from ce5d56c...319eb8f:
- ✅ Tighter job timeouts: Reduced timeouts with clear inline comments explaining rationale (venv-sanity 30→15, editable-install 45→30, kit-launch 30→15, path-io 30→15 min)
- ✅
build.shcross-platform fix: AddedPYTHON="${PYTHON:-python3}"override so Windows git-bash (which only haspython) can call the script - ✅ Explicit Git Bash invocation: Windows CI now invokes
C:\Program Files\Git\bin\bash.exedirectly with$env:PYTHON = "python"to run wheel build - ✅ GIL-immune watchdogs: Replaced Python thread-based watchdogs with PowerShell
Start-Process+WaitForExit()— correctly addresses the failure mode where Kit init hangs in C code holding the GIL, making Python daemon threads useless
Excellent hardening. The GIL insight is particularly good — OS-level process watchdogs are the correct solution for Kit/Vulkan hangs. Approval stands.
Update (bf4c614): Reviewed incremental diff from 319eb8f...bf4c614:
- Added
--seedflag touv venv --python 3.12 --seed env_isaaclab_uv - ✅ Correct fix:
uv venvby default creates a minimal venv without pip/setuptools/wheel. The--seedflag installs these seed packages so the downstream wheel-builder step (python -m pip install build wheel) works - Good inline comment documenting the rationale
Minor but necessary fix for the wheel build step. Approval stands.
Update (79c6d2e): Reviewed incremental diff from bf4c614...79c6d2e:
- ✅ Re-enabled
perception-windowsjob: Changedif: false→if: needs.changes.outputs.run_windows_ci == 'true' - Updated inline comment: clarifies PowerShell watchdog (3-min cap) and 10-min job timeout as defense layers against hung Kit/Vulkan init
Runner GPU/Vulkan stack appears operational now. The existing defenses (OS-level watchdog + job timeout) remain in place for safety. Approval stands.
Update (81f5b9f): Reviewed incremental diff from 79c6d2e...81f5b9f:
- ✅ PowerShell 5.1 compatibility fix: Replaced
$proc.Kill($true)withStop-Process -Id $proc.Id -Force -ErrorAction SilentlyContinuein both kit-launch and perception watchdogs - The
.Kill($true)overload (kill entire process tree) only exists in .NET 5+ — GitHub Windows runners use PowerShell 5.1 on .NET Framework where this overload does not exist Stop-Process -Forceis the standard PowerShell-native approach for forceful termination
Good catch on the .NET version incompatibility. Approval stands.
Update (7efc11a): Reviewed incremental diff from 81f5b9f...7efc11a:
- ✅ New reusable action:
.github/actions/windows-instance-state/action.ymlreports disk usage and key directory sizes (uv cache, pip cache, Kit state, etc.) before/after each job - ✅ Targeted cleanup in post phase: Removes Kit user state (
%APPDATA%\NVIDIA Corporation\Omniverse Kit), Kit docs (%USERPROFILE%\Documents\Kit), temp scratch dirs, and escaped user site-packages — leaves content-addressed caches (uv, pip) intact for speedup - ✅ All 5 Windows jobs instrumented: venv-sanity, install-windows, kit-launch, path-io, and perception-windows now have pre/post state hooks
- ✅ Good separation: caches that are safe across runs (content-addressed) are preserved; user state that can cause cross-run issues is cleaned
Excellent operational hygiene. Observability into runner state will help debug flaky failures, and the targeted cleanup prevents state accumulation without sacrificing cache benefits. Approval stands.
Update (94f37c6): Reviewed incremental diff from 7efc11a...94f37c6:
- ✅ GPU diagnostics:
windows-instance-statenow runsnvidia-smiearly to surface "no GPU on runner" vs "GPU present but Vulkan/driver issue" — good for triaging Kit init failures - ✅ PowerShell robustness: Fixed
Get-ChildItemto use-Fileflag and handle null sum (previously could fail on empty/missing dirs) - ✅ New reusable action:
.github/actions/windows-sim-paths/action.ymldiscovers Isaac Sim install root viapip show isaacsim-kernel, exportsISAAC_PATH/CARB_APP_PATH/EXP_PATH, and prepends Kit DLL dirs toPATHfor Vulkan ICD resolution. Mirrors PR #4018's known-working pattern. - ✅ Comprehensive headless env vars: Added
HEADLESS,ISAAC_SIM_HEADLESS,OMNI_KIT_NO_WINDOW,OMNI_KIT_DISABLE_WATCHDOG,OMNI_KIT_TELEMETRY,CARB_LOGGING_SEVERITY,PYTHONUNBUFFERED,PYTHONIOENCODING— properly gates Kit to headless mode when no display server is available - ✅ Cleaner step separation: Split combined "Install + Boot Kit" steps into separate install, path-resolution, and boot steps — improves failure isolation and step timing visibility
Excellent structural improvements. The DLL search path fix is likely to resolve Vulkan ICD discovery issues that previously caused Kit hangs. Approval stands.
Update (560a176): Reviewed incremental diff from 94f37c6...560a176:
- ✅ Switched from
python -m pip showtouv pip show: Aligns with the rest of the CI using uv for package management - ✅ Fixed stderr handling: Removed
2>&1merge; uv writes its banner ("Using Python ...") to stderr, which trips PowerShell's$ErrorActionPreference=StopviaNativeCommandErrorhandling on stderr lines - ✅ Added
$LASTEXITCODEcheck: Gracefully handles failures fromuv pip showby setting$pipShow = ""
Good refinement — consistent use of uv and proper handling of PowerShell's strict stderr behavior. Approval stands.
Update (fe4d04a): Reviewed incremental diff from 560a176...fe4d04a:
- ✅ New Tier 3 training smoke job:
training-smoke-windowsruns cartpole training tests via pytest - ✅ New cross-platform test file:
test_cartpole_training_smoke.pywith dual markers (arm_ci,windows_ci) and OS-aware launcher selection (isaaclab.batvsisaaclab.sh) - ✅ Two training cases: State-only (rsl_rl, no Vulkan required) and perception (rl_games, needs Vulkan) — clear inline comments explain expected pass/fail behavior per GPU mode
- ✅ Consistent job structure: Follows established patterns (instance state hooks, sim path resolution, artifact upload,
continue-on-error: true) - ✅ Good error reporting: Test helper captures stdout/stderr tail on failure for debugging
- ✅ Dual-PR coordination: Commit message clearly explains the dedup strategy with PR #5698
No new issues. Clean addition following established CI patterns. Approval stands.
Update (bd5fe5a): Reviewed incremental diff from fe4d04a...bd5fe5a:
- ✅ Added
uv pip install -e 'source/isaaclab_rl[rsl_rl,rl_games]'with inline comment - This installs the
rsl-rl-libandrl-gamespackages (declared as extras inisaaclab_rl/setup.py) needed by the training smoke tests - Without these,
rsl_rl/train.pyandrl_games/train.pyfail withModuleNotFoundErrorbefore Kit boots
Clean dependency fix. Approval stands.
Update (b72ddb8): Reviewed incremental diff from bd5fe5a...b72ddb8:
Major architectural change: Consolidated all separate Windows CI jobs (general, install, kit-launch, path-io, perception, training-smoke) into a single windows-ci job.
✅ Good rationale: PR comment explains the runner-pool autoscaler idle-down issue — multiple smaller jobs caused the runner to tear down between them, starving queued siblings. Single job keeps runner allocation continuous.
Key structural changes:
- ✅ Single shared setup step (no
continue-on-error) that must succeed for any tests to run - ✅ Individual test steps use
continue-on-error: trueso failures don't skip subsequent tests - ✅ New Aggregate step: Parses per-step outcomes, fails job only if blocking tests fail
- ✅ Perception explicitly non-blocking: Marked as expected failure under TCC mode, logs warning instead of failing job
- ✅ Step dependencies: Tests gate on
steps.setup.outcome == 'success'orsteps.sim-paths.outcome == 'success' - ✅ 90-minute outer timeout: Accommodates full sequential pipeline (~35-45 min typical)
No new issues introduced. This is a well-reasoned operational fix that addresses a real CI infrastructure problem while maintaining the same test coverage. Approval stands.
Update (b72ddb8): Reviewed incremental diff from bd5fe5a...b72ddb8:
Major architectural change: Consolidated all separate Windows CI jobs (general, install, kit-launch, path-io, perception, training-smoke) into a single windows-ci job.
✅ Good rationale: PR comment explains the runner-pool autoscaler idle-down issue — multiple smaller jobs caused the runner to tear down between them, starving queued siblings. Single job keeps runner allocation continuous.
Key structural changes:
- ✅ Single shared setup step (no
continue-on-error) that must succeed for any tests to run - ✅ Individual test steps use
continue-on-error: trueso failures don't skip subsequent tests - ✅ New Aggregate step: Parses per-step outcomes, fails job only if blocking tests fail
- ✅ Perception explicitly non-blocking: Marked as expected failure under TCC mode, logs warning instead of failing job
- ✅ Step dependencies: Tests gate on
steps.setup.outcome == 'success'orsteps.sim-paths.outcome == 'success' - ✅ 90-minute outer timeout: Accommodates full sequential pipeline (~35-45 min typical)
No new issues introduced. This is a well-reasoned operational fix that addresses a real CI infrastructure problem while maintaining the same test coverage. Approval stands.
Update (cd1e739): Reviewed incremental diff from b72ddb8...cd1e739 (rebase + new commits):
Structural simplifications:
- ✅ Trimmed verbose comments: Header block and inline comments cleaned up — maintains clarity without redundancy
- ✅ Simplified setup step: Replaced manual
uv pip install -echain with./isaaclab.bat -i 'isaacsim,rl[rsl_rl,rl_games]'(the canonical install path) - ✅ Added
isaaclab_newton+isaaclab_ppisp: Import smoke now verifies these packages (required by rebase-mergeddevelopchanges)
New diagnostic probe:
- ✅ Vulkan probe step: Runs
nvidia-smi, searches forvulkan-1.dll, checks ICD registry entries, and attempts a minimalvkCreateInstance/vkEnumeratePhysicalDevicesvia ctypes - Good for isolating "no GPU on runner" vs "GPU present but TCC mode" vs "Vulkan loader broken"
continue-on-error: true— never gates the job, purely diagnostic- Output captured to
reports/vulkan-probe.txtfor artifact upload
Perception test disabled (expected):
- ✅ Perception smoke step is now fully commented out with a detailed rationale block explaining the L40S TCC mode limitation
- Training smoke now uses
-k 'not perception'filter to skip the perception subcase - Clear unblock criteria documented: vGPU config, bare-metal Vulkan SKU, or Linux runner
- Aggregate step no longer references
perception— no false failure noise
Minor cleanups:
- Removed
perception_smoke.pyfrom artifact upload paths - Simplified em-dash usage in error strings (
-instead of—) for ASCII safety in PowerShell
No new issues. The changes are well-reasoned operational fixes that improve diagnostics while acknowledging known hardware limitations. Approval stands.
Update (cd1e739): Reviewed incremental diff from b72ddb8...cd1e739 (48 new commits including develop merge):
Windows CI refinements (jichuanh/windows-spark-ci-perception branch):
- ✅
8538627b: Disabled perception smoke job until vGPU is configured — sensible gating - ✅
6a40fe32: Fixed install order, added isaaclab_newton, improved Aggregate step with ASCII-safe output - ✅
9711ab80: Fixed Windows file-lock race in HDF5 test fixture cleanup with explicittearDownclose - ✅
7fc6750d: Delegated install to./isaaclab.bat -i— cleaner than inline PowerShell - ✅
ee5dde22: Cleaned up inline comments in PowerShell blocks - ✅
e0df27b3: Skipped cartpole perception subcase in training smoke (no Vulkan/vGPU) - ✅
cd1e7398: Added Vulkan diagnostic probe (nvidia-smi + vulkaninfo) for triage
Develop merge brings unrelated changes (deformables, PPISP, OvPhysX FrameView, etc.) — no conflicts with Windows CI.
No new issues in the incremental Windows CI changes. Approval stands.
Update (966e6d3): Reviewed incremental diff from cd1e739...966e6d3:
Major refactoring: Probe-only iteration mode
This commit restructures the Windows CI workflow into a minimal "probe-only" mode while Vulkan diagnosis is ongoing:
✅ New standalone tools/vulkan_probe.py:
- Extracted from inline PowerShell heredoc into a proper Python module
- Cross-platform support: handles
vulkan-1.dll(Windows),libvulkan.so.1(Linux), andlibvulkan.1.dylib(macOS) - Clean ctypes implementation of
vkCreateInstance+vkEnumeratePhysicalDevices - Proper SPDX license header and docstring
- Reusable outside CI (direct invocation:
python tools/vulkan_probe.py)
✅ Vulkan probe step moved earlier:
- Now runs as the first substantive step after instance state check
- No longer depends on
steps.setup.outcome— can run without Isaac Sim installed - Enhanced diagnostics: host info (date, machine name), Vulkan ICD registry for both 64-bit and WOW6432Node paths
- Falls back to
tools/vulkan_probe.pyifvulkaninfois not on PATH
✅ All test steps disabled (if: false):
- Install uv, Setup venv, Resolve Isaac Sim paths, all test branches (deps, path-io, kit-launch, training-smoke, wheel-build) are now gated off
- This is intentional for fast iteration on the Vulkan probe without ~25+ minutes of setup time
- Comments clearly mark this as temporary: "disabled: probe-only iteration"
✅ Simplified aggregate gate:
- Only checks
vulkan-probeoutcome - Clear comment explains this is probe-only mode, restore full gating when test steps are re-enabled
Assessment: This is a reasonable debugging iteration. The PR author is methodically diagnosing the L40S Vulkan/TCC issue by isolating the probe step. The refactoring of the probe code into a standalone module is a quality improvement that will benefit future maintenance.
No new issues. Approval stands.
Update (bae6977): Reviewed incremental diff from 966e6d3...bae6977:
Major change: Exiting probe-only mode, re-enabling all test steps
This commit removes the temporary probe-only iteration mode and re-enables the full Windows CI test pipeline:
✅ Removed Vulkan probe step and tools/vulkan_probe.py:
- The standalone Vulkan diagnostic script has been deleted
- The inline Vulkan probe workflow step has been removed entirely
- This signals the Vulkan/L40S diagnosis phase is complete or being handled differently
✅ Re-enabled all test steps:
Install uv: Now runs (removedif: false)Setup venv + install isaaclab + isaacsim + test deps: Re-enabled with--seedflag comment retainedResolve Isaac Sim paths: Re-enabledDeps smoke (torch + scipy): Now conditional onalways() && steps.setup.outcome == 'success'Path-IO tests (utils): Same conditionalKit headless boot smoke: Conditional onsteps.sim-paths.outcome == 'success'Cartpole training smoke (state rsl_rl): Same conditionalWheel build + reinstall + smoke import: Conditional onsteps.setup.outcome == 'success'
✅ Full aggregate gating restored:
- Aggregate step now checks all 7 gating steps: setup, sim-paths, deps, path-io, kit-launch, training-smoke, wheel-build
- Perception is explicitly excluded from blocking (commented rationale retained)
- Job fails only if any of the blocking steps fail
✅ Clean workflow structure:
- Setup step has no
continue-on-error— hard fails abort job early (correct behavior) - Test steps retain
continue-on-error: true— allows collecting results from all tests even if some fail - Conditions use
always() &&pattern to ensure steps run even after priorcontinue-on-errorfailures
Assessment: This is the expected follow-up to the probe-only iteration. The CI is now ready for production use on the L40S Windows runner. All original test coverage is restored with proper conditional execution and failure aggregation.
No new issues. Approval stands.
Update (bae6977): Reviewed incremental diff from 966e6d3...bae6977:
Major change: Exiting probe-only mode, re-enabling all test steps
This commit removes the temporary probe-only iteration mode and re-enables the full Windows CI test pipeline:
✅ Removed Vulkan probe step and tools/vulkan_probe.py:
- The standalone Vulkan diagnostic script has been deleted
- The inline Vulkan probe workflow step has been removed entirely
- This signals the Vulkan/L40S diagnosis phase is complete or being handled differently
✅ Re-enabled all test steps:
Install uv: Now runs (removedif: false)Setup venv + install isaaclab + isaacsim + test deps: Re-enabled with--seedflag comment retainedResolve Isaac Sim paths: Re-enabledDeps smoke (torch + scipy): Now conditional onalways() && steps.setup.outcome == 'success'Path-IO tests (utils): Same conditionalKit headless boot smoke: Conditional onsteps.sim-paths.outcome == 'success'Cartpole training smoke (state rsl_rl): Same conditionalWheel build + reinstall + smoke import: Conditional onsteps.setup.outcome == 'success'
✅ Full aggregate gating restored:
- Aggregate step now checks all 7 gating steps: setup, sim-paths, deps, path-io, kit-launch, training-smoke, wheel-build
- Perception is explicitly excluded from blocking (commented rationale retained)
- Job fails only if any of the blocking steps fail
✅ Clean workflow structure:
- Setup step has no
continue-on-error— hard fails abort job early (correct behavior) - Test steps retain
continue-on-error: true— allows collecting results from all tests even if some fail - Conditions use
always() &&pattern to ensure steps run even after priorcontinue-on-errorfailures
Assessment: This is the expected follow-up to the probe-only iteration. The CI is now ready for production use on the L40S Windows runner. All original test coverage is restored with proper conditional execution and failure aggregation.
No new issues. Approval stands.
Update (27def39): Reviewed incremental diff from bae6977...27def39:
Major change: Perception tests fully re-enabled (WDDM mode confirmed)
The runner GPU is now in WDDM mode, which means Vulkan is available. This commit re-enables all perception coverage:
✅ Training smoke now runs both subcases:
- Step renamed: "Cartpole training smoke (state rsl_rl)" → "Cartpole training smoke (state + perception)"
- Removed
-k "not perception"pytest filter — both state (rsl_rl) and perception (rl_games,--enable_cameras) subcases now execute - Timeout increased from 15 → 25 minutes to accommodate the additional perception training workload
✅ Perception smoke step re-enabled:
- The previously commented-out
test-perceptionstep is now live code - Exercises Kit's RTX/Vulkan path directly with
Isaac-Cartpole-RGB-Camera-Direct-v0 - 3-minute PowerShell
WaitForExitwatchdog remains in place for safety - Step-level
continue-on-error: true+ 8-minute job timeout unchanged
✅ Aggregate gating updated:
perceptionadded to$resultsmap and$blockingarray- Perception failures now block the job (appropriate since WDDM confirms Vulkan availability)
- Previous comment about perception being excluded from blocking removed
✅ Artifact upload updated:
perception_smoke.pyadded back to uploaded artifact paths
✅ Removed stale documentation:
- Deleted the large commented-out perception step block and its associated rationale comment about L40S TCC mode limitations
- Cleaned up aggregate step comment (was "Perception is disabled"; now just "Every active test step gates the job")
Assessment: This is the natural conclusion of the Vulkan/WDDM investigation. With the runner confirmed in WDDM mode, all perception coverage is correctly restored. The decision to make perception a blocking step is appropriate — if Vulkan is known-working, failures should be surfaced.
No new issues. Approval stands.
Foundation for cross-platform CI. Registers four pytest markers (windows, windows_ci, arm, arm_ci), teaches AppLauncher to recognize them in argv so they do not leak into Isaac Sim's argparse, and moves the AssetConverterBase USD scratch directory from a hardcoded /tmp/IsaacLab to tempfile.gettempdir() for cross-platform compatibility. Tags source/isaaclab/test/deps/test_torch.py and test_scipy.py with the new markers so they are selectable by future cross-platform jobs. Workflow files (arm-ci.yaml, windows-ci.yaml) ship in follow-up PRs.
d9d7dac to
d25f9bb
Compare
Same shape as arm-ci.yaml but the install path is native pip + uv on the Windows host (no Docker for Linux-based Isaac Sim wheels). Jobs (all continue-on-error: true): Tier 1 — general-windows, install-windows, kit-launch-windows Tier 2 — path-io-windows, perception-windows Every pytest invocation passes --timeout=N + --timeout-method=thread (signal is unavailable on Windows) plus --continue-on-collection-errors so a hung test cannot consume the full job slot and a broken neighbor file does not poison the marker-driven discovery. perception-windows wraps the cartpole-camera smoke in an inline Python script with explicit assertions and an inner watchdog thread that aborts the process after 180s. This replaces the previous pattern where Vulkan init failures hung the job instead of erroring. Tags four path-IO test files (test_configclass, test_dict, test_episode_data, test_hdf5_dataset_file_handler) with the windows_ci marker so path-io-windows picks them up via marker-driven discovery.
d25f9bb to
6811028
Compare
Forces run_docker_tests=false in build.yaml's changes job so all gated test jobs skip via their existing if-gate. Must be reverted before final review.
Kit bootstrap aborts on the Windows runner with 'Unable to bootstrap inner kit kernel: EOF when reading a line' when stdin is not a tty and no EULA env vars are set. Set OMNI_KIT_ACCEPT_EULA / ACCEPT_EULA / PRIVACY_CONSENT at the workflow level so every job inherits them.
Bare 'isaacsim' on Windows pulls only isaacsim + isaacsim-kernel; Kit bootstrap then warns 'PYTHONPATH path doesn't exist (...site-packages/isaacsim/exts/isaacsim.simulation_app)' / 'Unable to expose isaacsim.simulation_app API: Extension not found', and 'from isaacsim import SimulationApp' resolves to None, so AppLauncher dies with 'TypeError: NoneType object is not callable'. Match install.py / wheel_builder canonical spec: isaacsim[all]>=6.0.0.
Pytest collection over source/isaaclab/test imports sensors/test_tiled_camera_env.py whose module-level argparse.parse_args consumes pytest's --ignore=... / -m windows_ci flags and INTERNALERRORs collection (collected 595 items / 48 errors). The windows_ci-tagged path-IO tests on this branch all live in test/utils, so narrow the pytest scope to that subdir — keeps the marker filter intact without forcing every test file in the tree to be importable bare.
…lder) Bare 'isaacsim[all]' on Windows fails Kit startup with 'ImportError: cannot import name get_metrics_assembler_interface from omni.metrics.assembler.core (unknown location)' — the extension is registered but its implementation isn't on disk because the extscache extra wasn't requested. wheel_builder/res/python_packages.toml pins 'isaacsim[all,extscache]==6.0.0.*' for exactly this reason; mirror it.
import isaaclab_tasks walks all task packages, which transitively touches GroundPlaneCfg.physics_material -> isaaclab.sim.spawners.materials forwarding shim, which raises 'RigidBodyMaterialCfg has moved to isaaclab_physx.sim.spawners.materials. Install the isaaclab_physx extension or update your import.' Install it editable before isaaclab_assets / isaaclab_tasks so the shim resolves.
isaaclab-physx==1.1.0 declares a hard dep on isaaclab-ppisp which is not in source/ and not on any package index, so uv refuses the install with 'isaaclab-ppisp was not found in the package registry'. The ppisp import in isaaclab_physx is lazy (runtime, not at import), so --no-deps gets us a working editable install. Mirrors the same workaround used by the ARM-side install path (see install.py).
Three distinct gaps surfaced in path-io-windows on commit 683c110: 1. test_episode_data[cuda:0] parametrize: 'Torch not compiled with CUDA enabled' — default torch wheel on Windows pypi is CPU-only. Install torch + torchvision from download.pytorch.org/whl/cu128. 2. test_hdf5_dataset_file_handler: 'No module named h5py' — h5py was never declared by the isaaclab core dep set on Windows. Install it. 3. test_version.py / test_wrench_composer_*.py: KeyError 'EXP_PATH' at collection. Those files instantiate AppLauncher at module load and need an Isaac Sim install path-IO does not provide. Replace the '-m windows_ci' marker filter (which still imports every file in test/utils for collection) with explicit windows_ci-tagged file paths. Also drop --ignore=tools/conftest.py since no conftest sits under utils/.
The Windows runner reports 'vkEnumeratePhysicalDevices failed. No physical device is found.' / 'Failed to create any GPU devices' when Kit boots with --enable_cameras=True. Kit then hangs (the in-script 3-min watchdog can't reliably preempt a C-level GIL-held call), the job consumes its full timeout-minutes, and every other queued job on the same runner gets cancelled. Set the perception job's 'if' to false so it never claims the runner. Also tighten timeout-minutes from 30 to 10 so even when re-enabled it fails fast rather than starving siblings. Flip 'if' back to needs.changes.outputs.run_windows_ci == 'true' once the runner is confirmed GPU-capable.
Python thread watchdogs cannot preempt a Kit/Vulkan init that hangs in a C call holding the GIL — observed on this runner where the 3-min in-script time.sleep + os._exit never fired and perception_smoke held the Windows runner for the full 40-min job timeout, starving every other job. Replace the thread watchdog inside perception_smoke.py with a PowerShell Start-Process + WaitForExit at the shell layer (OS-level process kill, immune to GIL). Apply the same pattern to kit-launch-windows's inline python invocation. Tighten per-job timeout-minutes: general-windows 30 -> 15 install-windows 45 -> 30 kit-launch 30 -> 15 path-io 30 -> 15 The hard upper bound is now the second line of defence; the PowerShell watchdog catches runaway python first.
PowerShell on the Windows runner doesn't have bash on PATH: bash : The term 'bash' is not recognized as the name of a cmdlet ... Git for Windows installs bash.exe at C:\Program Files\Git\bin\bash.exe; invoke it directly with a Test-Path guard and exit-code check so failures fast-fail.
build.sh hardcoded python3. Linux installs expose python3 (and that
remains the default), but Windows git-bash only has python (no python3
symlink), so the build was dying with 'python3: command not found'
the moment install-windows tried to run the canonical wheel build.
Make build.sh use ${PYTHON:-python3} for every interpreter call and
pass PYTHON=python from the Windows workflow before invoking it. Linux
behavior unchanged; one variable lets Windows reuse the same script.
PowerShell on the Windows runner reads the yaml as a non-UTF-8 code page; em-dashes (U+2014) inside the Write-Host string literals got mojibake'd to 'â€"' and tripped the parser: ParserError: TerminatorExpectedAtEndOfString Replace the two affected em-dashes with ASCII '-'. Comment-line em-dashes elsewhere in the file are harmless (tokenizer skips them) and stay as-is to avoid touching unrelated lines.
build.sh runs 'python -m pip install build wheel' inside the venv. uv venv ships without pip by default, so this failed with C:\...\env_isaaclab_uv\Scripts\python.exe: No module named pip right after gen_pyproject.py emitted the generated pyproject.toml. Add --seed to the install-windows venv create so pip / setuptools / wheel land inside the venv; the other 3 jobs don't call build.sh and keep the lighter seedless venvs.
Flips perception-windows from 'if: false' back to the standard needs.changes.outputs.run_windows_ci gate. The PowerShell process-level watchdog around the inline Kit boot stays as the inner guard; the tightened 10-min job timeout-minutes is the outer guard so a Vulkan init regression cannot starve other queued jobs again.
The watchdog used $proc.Kill($true), which compiles on .NET 5+ but not on PowerShell 5.1's .NET Framework (Process.Kill has no (bool) overload there). It still surfaced 'MethodCountCouldNotFindBest' on the runner after the kill ::error was emitted. Switch to Stop-Process -Id $proc.Id -Force -ErrorAction SilentlyContinue which is PS5-native and idempotent.
Adds .github/actions/windows-instance-state composite action with a
single 'phase' input:
pre : print disk free + sizes of cache and user-state dirs
post : print state, wipe non-cache user state, print state again
Each of the 5 Windows-runner jobs now reports state right after
checkout (BEFORE) and at the end with if: always() (AFTER), so any
poisoned state shows up immediately and the runner is left net-zero
outside intentional content caches.
Cleaned in 'post' (state, chain-risk):
%APPDATA%\NVIDIA Corporation\Omniverse Kit
%USERPROFILE%\Documents\Kit
%TEMP%\Kit* / hub-* / omniverse-* crash scratch dirs
%APPDATA%\Python\Python312\site-packages\{build,wheel} (escaped
from build.sh's pip install --user fallback)
Kept across runs (content-addressed, no chain):
%LOCALAPPDATA%\uv\cache
%LOCALAPPDATA%\pip\Cache
%LOCALAPPDATA%\NVIDIA\Omniverse (Kit shader cache; invalidated
by Kit itself on version mismatch)
Extend the workflow-level env block with the headless/no-window/EULA flags that PR isaac-sim#4018's known-working build.yml proved out: ISAACSIM_ACCEPT_EULA=YES # different layer from ACCEPT_EULA HEADLESS=1, ISAAC_SIM_HEADLESS=1, ISAAC_SIM_LOW_MEMORY=1 WINDOWS_PLATFORM=true OMNI_KIT_NO_WINDOW=1 # critical: blocks Kit from trying to # open a display when no desktop session OMNI_KIT_DISABLE_WATCHDOG=1, OMNI_KIT_TELEMETRY=0 CARB_LOGGING_SEVERITY=error PYTHONUNBUFFERED=1, PYTHONIOENCODING=utf-8 Add .github/actions/windows-sim-paths/ composite action that re-activates the caller's venv, resolves the Isaac Sim install root via pip show isaacsim-kernel, and exports: ISAAC_PATH, CARB_APP_PATH (sim/kit), EXP_PATH (workspace/apps), RESOURCE_NAME It also prepends <sim>/kit/plugins and <sim>/bin to PATH so the Vulkan loader can find NVIDIA's ICD DLLs (likely root cause of 'vkEnumeratePhysicalDevices failed. No physical device is found.' on this runner — DLL search defaults do not include the Sim install). Wire into kit-launch-windows and perception-windows by splitting their 'install + launch' steps into three: install isaacsim, resolve Sim paths (this action), boot Kit. Install-windows and path-io-windows don't boot Kit so don't need this. Extend the windows-instance-state action's report with nvidia-smi output so 'no GPU' vs 'GPU present, Vulkan can't load' is visible in every job's pre-state dump. Also harden the size measurement against junctions/reparse points that have no Length property (suppresses the GenericMeasurePropertyNotFound noise observed in the previous run).
'python -m pip show isaacsim-kernel' inside the uv venv failed with 'No module named pip' because uv venvs are created without seeding pip / setuptools / wheel by default. uv itself can introspect the venv (it tracks its own install metadata) so 'uv pip show' is the correct lookup here.
PowerShell treats 'Using Python 3.12.13 environment at: env_isaaclab_uv' (uv banner on stderr) as a NativeCommandError record when captured via '2>&1' under $ErrorActionPreference='Stop', failing the step before parsing the Location: line. Drop the 2>&1 so stderr just streams to the host log; rely on $LASTEXITCODE for failure detection. Also surfaces an important data point this run captured for free: nvidia-smi: NVIDIA L40S, 582.53, 46068 MiB The runner DOES have a real GPU. The earlier 'vkEnumeratePhysicalDevices failed' was DLL-discovery, not GPU absence — which is exactly what this PATH prepend (Sim bin + kit/plugins) is supposed to fix once the path resolution runs cleanly.
Duplicate test_cartpole_training_smoke.py from PR isaac-sim#5698's branch so PR isaac-sim#5700 doesn't chain on it. Cross-platform tweaks vs ARM's copy: - pytestmark = [arm_ci, windows_ci] # dual marker - _LAUNCHER picks isaaclab.bat on Windows, isaaclab.sh elsewhere Add training-smoke-windows job that pytests this file in the same install + Sim-paths context as perception-windows. continue-on-error true and timeout-minutes 30 mirror the other Windows jobs. State case (Isaac-Cartpole-Direct-v0 / rsl_rl) should pass on TCC — no RTX, no Vulkan touch. Perception case (Isaac-Cartpole-RGB-Camera-Direct-v0 / rl_games) needs Vulkan and will fail on this runner until WDDM is enabled. Whichever of isaac-sim#5698 / isaac-sim#5700 merges first wins the test file; the other PR will drop the duplicate on rebase.
test_cartpole_training_smoke.py invokes scripts/reinforcement_learning/rsl_rl/train.py (state case) scripts/reinforcement_learning/rl_games/train.py (perception case) Both train scripts import rsl_rl / rl_games as their first non-stdlib imports — and the previous Windows training-smoke install didn't pull either, so both cases hit: ModuleNotFoundError: No module named 'rsl_rl' ModuleNotFoundError: No module named 'rl_games' isaaclab_rl/setup.py declares these as extras [rsl_rl] / [rl_games]; install the editable package with both extras so the framework packages (rsl-rl-lib + rl-games) end up in the venv.
Same coverage as before — deps smoke + path-IO + kit-launch + cartpole training smoke + perception + wheel build — but as sequential steps inside a single runs-on: [self-hosted, gpu-windows] job. Why: 1. Single venv create + single isaacsim install shared across all test steps. Saves ~5 venv setups (~3 min each = ~15 min wall). 2. The runner gets ONE allocation, stays continuously busy, never sees an inter-job idle gap. Autoscaler can't tear it down and strand queued siblings (the cancellation cascade we kept hitting). 3. Same affinity guarantee as Linux/ARM single-job model — every test step touches the same runner's filesystem and Sim install. Each test step has continue-on-error: true and writes its own JUnit XML. A final aggregate step parses outcomes and fails the job iff any non-perception step failed. perception is gated as 'warning, not failure' until the runner pool fixes TCC->WDDM, so the workflow doesn't lie about overall status while still surfacing the failure clearly.
The self-hosted Windows runner uses an NVIDIA L40S, a Data Center GPU. On bare-metal Windows, NVIDIA's data-center driver does not expose graphics APIs (OpenGL/Vulkan/DirectX) for these SKUs regardless of TCC vs WDDM driver mode; per the Data Center GPU driver release notes, vGPU is required to expose them. Kit's boot path reflects this exactly: vkEnumeratePhysicalDevices returns no devices, gpu.foundation logs "TCC is not supported. GPU(s) should be in WDDM mode.", and Kit then hangs in omni.gpu_foundation_factory until the OS-level watchdog fires. Comment out the perception step (preserve verbatim for restoration), drop the now-dangling perception_smoke.py artifact path and the steps.test-perception.outcome reference in the Aggregate step, and note in the file header that perception is disabled. The disabled-step context block lists the three independent unblock criteria (vGPU on L40S, swap runner SKU, or move perception coverage to Linux) so the next maintainer can pick whichever lands first.
The cross-platform CI series adds source/isaaclab_tasks/test/ test_cartpole_training_smoke.py without a paired fragment, so the nightly Check changelog fragments gate currently rejects the PR. Add a .skip entry under source/isaaclab_tasks/changelog.d/ matching the existing source/isaaclab/changelog.d/jichuanh-windows-ci.skip convention (CI/test-only, no user-facing API change).
Three related changes that together unblock the consolidated windows-ci job from the latent failures uncovered once the perception step stopped masking everything else: * Install `isaacsim[all,extscache]==6.0.0.*` BEFORE the cu128 torch upgrade. `isaacsim` pulls CPU torch transitively and was silently overwriting the cu128 wheel installed earlier; `[cuda:0]`-parametrized cases in Deps smoke and Path-IO then fail with "Torch not compiled with CUDA enabled". The new order mirrors install.py (_install_isaacsim() then _ensure_cuda_torch()). * Install `source/isaaclab_newton` with `--no-deps`. cartpole_env_cfg.py imports `isaaclab_newton.physics` at module load, so every cartpole task fails with `ModuleNotFoundError: No module named 'isaaclab_newton'` without it. Same `--no-deps` reason as isaaclab_physx (both declare a bare-name dep on isaaclab_ppisp that's not yet on this branch nor on any index; the ppisp import is lazy at runtime). The smoke-import line is extended so this regression fails fast in setup, not in a later test step. * Replace the em-dash in the Aggregate step's `Write-Host "::error::"` with an ASCII hyphen. PowerShell 5.1 reads the temp .ps1 as cp1252, so the 3-byte UTF-8 em-dash mis-decodes inside the string and the closing quote is mis-detected, raising "The string is missing the terminator". The path was never executed before because `$failed` was always empty (only perception had failed, and it was excluded from the gating set).
The temp_dir fixture used `tempfile.mkdtemp()` + `shutil.rmtree()` for cleanup. On Windows, h5py's libhdf5 keeps an internal handle to the file briefly after `.close()`, so `rmtree` races with the handle release and raises `PermissionError [WinError 32]` on teardown of `test_write_and_load_episode[cuda:0]`. The assertions had already passed; only the cleanup was failing. Switch to `tempfile.TemporaryDirectory(ignore_cleanup_errors=True)` (Python 3.10+). On Linux/macOS this flag is a no-op since no cleanup error is raised; on Windows it absorbs the libhdf5 handle-release race without masking real failures (the test body still asserts via the explicit `dataset_file_handler.close()` calls). Drop the now-unused `shutil` import.
Pull in source/isaaclab_ppisp (the ppisp package missing from this branch) and the updated install.py that includes isaaclab_ppisp and isaaclab_newton in CORE_ISAACLAB_SUBMODULES. With ppisp present, the workflow no longer needs --no-deps workarounds for isaaclab_physx / isaaclab_newton; the subsequent commit collapses the hand-rolled pip sequence into a single ./isaaclab.bat -i call.
Replace the hand-rolled `uv pip install ...` sequence in the setup step with a single `.\isaaclab.bat -i 'isaacsim,rl[rsl_rl,rl_games]'` call, now that the develop merge brings in `source/isaaclab_ppisp/` and the updated install.py that includes `isaaclab_ppisp` and `isaaclab_newton` in CORE_ISAACLAB_SUBMODULES. The hand-rolled sequence had grown three latent issues, all of which the canonical install.py path avoids: * Install order — `_install_isaacsim()` runs before `_ensure_cuda_torch()` inside install.py, so isaacsim's transitive CPU torch can't shadow the cu128 wheel. The previous hand-rolled order had the cu128 upgrade first and broke `[cuda:0]`-parametrized tests. * Missing isaaclab_newton — install.py walks CORE_ISAACLAB_SUBMODULES, so isaaclab_newton is installed automatically. cartpole_env_cfg.py's import of `isaaclab_newton.physics` no longer fails. * No more --no-deps workarounds — with `source/isaaclab_ppisp/` present the renderer-backend bare-name dep resolves through the local editable install. The workflow keeps the test-only `pytest pytest-timeout h5py` install (install.py doesn't carry pytest plumbing) and the post-install smoke import. Setup-step body shrinks from ~25 lines to ~3 substantive lines. Matches the "Mirror Linux CI setup for new platforms" rule: same entry point as Linux CI (`./isaaclab.sh -i`), so install-order bugs and new core submodules are picked up automatically when install.py changes.
PowerShell / pytest commands inside YAML run: blocks render as plain text in editors without an embedded-language highlighter, so heavy inline commentary inside those blocks becomes visual noise rather than documentation. Strip it. Inter-step comments (section headers, pre-step rationale, the disabled-perception context block) are kept — those sit at the YAML level and read fine without syntax-highlighting help. Net: -80 lines, mostly redundant restatement of what surrounding identifiers and commit history already make clear.
`test_train_cartpole_perception` builds Isaac-Cartpole-RGB-Camera-Direct-v0 which boots Kit with `enable_cameras=True`, hits the L40S TCC / no-vGPU Vulkan path, and hangs until the pytest 600s timeout fires (logs show `Stack of MainThread` thread dumps). Same blocker as the disabled standalone perception smoke. Filter the training-smoke pytest invocation with `-k 'not perception'` so the state subcase (Isaac-Cartpole-Direct-v0 + rsl_rl) is the only case exercised on the current Windows runner pool. Latest CI run shows the state subcase passes in ~30s. Drop the filter when the L40S vGPU unblock criterion lands (same condition tracked in the disabled perception step's context block).
Independent probe of the Vulkan loader on the runner, separate from Kit. Captures nvidia-smi driver+display info, lists vulkan-1.dll and ICD registry entries, and runs vulkaninfo --summary if available (falls back to a ctypes-based vkCreateInstance + vkEnumeratePhysicalDevices probe via the existing uv venv when the SDK isn't installed). Output goes to reports/vulkan-probe.txt and is included in the windows-ci-reports artifact. continue-on-error: true so the probe is informational only and does not gate the job. Added to the Aggregate $results listing for visibility. Background: PR 5700 perception step fails on the runner with "vkEnumeratePhysicalDevices failed. No physical device is found." + "TCC is not supported. GPU(s) should be in WDDM mode." Adding the direct vulkaninfo / loader probe answers the question of what the Vulkan ICD stack itself sees, independent of Kit's bootstrap path.
Last CI run's probe step parse-failed because PowerShell doesn't support bash heredoc (<<'PYEOF') and the YAML block scalar couldn't host an unindented PowerShell here-string for the embedded Python. Move the ctypes Vulkan loader probe out of the workflow into a standalone tools/vulkan_probe.py: * Loads vulkan-1.dll / libvulkan.so.1 via ctypes. * Calls vkCreateInstance + vkEnumeratePhysicalDevices. * Reports loader-load, instance-create, and physical-device count. * No dependencies beyond the OS Vulkan loader; cross-platform. The workflow now invokes it with system Python on the runner. Probe moves to the first runnable step (right after instance-state report) so diagnostic data appears in ~30 seconds instead of after the 15-min isaaclab.bat -i install. All other test steps gated off (`if: false`) for now while we iterate; aggregate gates the job purely on the probe's outcome. Disabled-perception context block left intact for the next maintainer.
…ools/" This reverts commit 966e6d3.
…nfo)" This reverts commit cd1e739.
PR 5695 lays foundation only. The two tagged tests (test_scipy.py, test_torch.py) had pytestmark applied here but no in-PR consumer — the arm-ci and windows-ci workflows that select on these markers ship in isaac-sim#5698 and isaac-sim#5700 respectively. Each downstream PR should land the marker annotations on the tests it actually wants in its canary, alongside the workflow that runs them. This keeps 5695 self-contained as scaffolding and prevents unused annotations from accumulating here.
The Windows runner GPUs are now in WDDM mode, so Kit's RTX/Vulkan path can enumerate a device. Re-enable the camera perception smoke and the perception subcase of the cartpole training smoke that were gated off under the data-center (TCC) driver, and add perception to the aggregate gating and report artifacts.
There was a problem hiding this comment.
Updated Review: Windows CI Pipeline (Part 3)
This is a comprehensive and well-structured PR that adds Windows CI support to IsaacLab. The implementation shows strong attention to platform-specific challenges and includes robust error handling. Below is my detailed analysis:
✅ Strengths
1. Solid Architecture
- Clean separation of reusable GitHub Actions (
windows-instance-state,windows-sim-paths) from the main workflow - Marker-driven test discovery (
windows_ci) allows adding tests without YAML edits - Consolidated single-job design prevents runner teardown between steps — smart approach for autoscaling runners
2. Robust Process Management
- OS-level watchdogs via
Start-Process + WaitForExitinstead of relying on GIL-vulnerable Python thread watchdogs - Per-job timeouts with
--timeout-method=thread(correctly avoiding Unix-only SIGALRM) - Hard timeouts with explicit kill on hung Kit/Vulkan processes
3. Cross-Platform Improvements
AssetConverterBasenow usestempfile.gettempdir()instead of hardcoded/tmp— correct fixwheel_builder/build.shPYTHONoverride handles Windows git-bash (which only haspython, notpython3)- HDF5 fixture uses
ignore_cleanup_errors=Trueto handle Windows file-lock races — good catch
4. Comprehensive Environment Setup
- EULA acceptance variables properly configured
- DLL search paths correctly prepended for Vulkan ICD discovery
- Instance state reporting helps diagnose runner issues (GPU presence, disk space, cache sizes)
⚠️ Suggestions & Minor Issues
1. Temporary Docker/Tests Disable in build.yaml
run_docker_tests: 'false' # TEMP: disable heavy Linux Docker + TestsThe comment says "revert before final review" — this should be reverted before merge. Consider adding a # TODO: tag that linters/CI can catch.
2. Hardcoded Git Bash Path
$gitBash = "C:\Program Files\Git\bin\bash.exe"
if (-not (Test-Path $gitBash)) { throw "Git Bash not found at $gitBash" }This assumes a standard Git for Windows installation. Consider:
- Falling back to
where.exe bashor checking common alternate paths (C:\Program Files (x86)\Git\bin\bash.exe) - Or documenting this as a runner requirement
3. Perception Test Gating
The perception test depends on WDDM mode for Vulkan. If runners could be in TCC mode, consider:
- Auto-detecting driver mode via
nvidia-smioutput - Adding a conditional skip rather than a CI failure
4. pytestmark Placement in test_configclass.py
from isaaclab.utils.configclass import _field_module_dir, configclass
pytestmark = pytest.mark.windows_ci
from isaaclab.utils.dict import class_to_dict, ...The pytestmark is sandwiched between imports — while functional, it's unconventional. Consider grouping it with other module-level declarations after all imports.
5. Magic Numbers
Several timeout values are hardcoded (300000ms, 180000ms, 600s, etc.). Consider:
- Documenting why these specific values were chosen
- Using named constants or environment variables for runner-specific tuning
📝 Questions
-
Checkpoint Cleanup: The cartpole training smoke tests run two iterations and create checkpoints. Are these cleaned up, or will they accumulate on runners?
-
Change Detection Logic: The
changesjob patterns don't include.github/actions/windows-*— is this intentional? Changes to these actions wouldn't trigger the workflow for validation. -
Python 3.12 Assumption: The
user site-packagespath assumes Python 3.12:"user site-pkgs" = "$env:APPDATA\Python\Python312\site-packages"
Should this be dynamic based on the installed Python version?
🔧 Final Notes
Overall, this is high-quality CI infrastructure work. The attention to Windows-specific quirks (DLL search order, file locking, PowerShell encoding, process watchdogs) demonstrates deep platform expertise.
Before merge:
- Revert the
run_docker_tests: 'false'workaround inbuild.yaml - Consider addressing the Git Bash path hardcoding
Other items are suggestions for future improvement rather than blockers.
Summary
Part 3 of the cross-platform CI series. Adds
.github/workflows/windows-ci.yaml— CI pipeline for Windows GPU self-hosted runners. Same shape asarm-ci.yaml(Part 2, #5698) but native install path instead of Docker.general-windows,install-windows(uv install + wheel build + reinstall),kit-launch-windows.path-io-windows,perception-windows, ...continue-on-error: truewhile runners stabilize.--timeout=N+--timeout-method=signalso hung tests fail fast (fixes the previous perception-windows pattern where Vulkan failures hung the job instead of erroring).pytest <root> -m windows_ci --continue-on-collection-errors. Adding a new Windows-safe test requires only tagging itwindows_ci, no yaml edit.Series
PRs prefixed with
[CI] Cross-platform —. Current siblings:Depends on Part 1 (#5695). Independent of Part 2 (#5698).
Test plan
./isaaclab.sh -f(pre-commit) passes.