[Fix] Pin numpy!=2.3.5 to dodge OpenBLAS atfork SIGSEGV at Kit fork() #5642
hujc7 wants to merge 1 commit into
NumPy 2.3.5 ships a vendored OpenBLAS (libscipy_openblas64_-fdde5778.so) whose pthread_atfork handler calls blas_thread_shutdown_ -> pthread_join on workers that don't exist in the child of fork(). Kit's libomni.platforminfo calls fork() during SimulationApp startup, which triggers the handler and SIGSEGVs the test process.

IsaacLab's CI Docker layer (docker/Dockerfile.base:117) runs 'isaaclab.sh --install' which pip-installs the source packages and resolves numpy>=2 against the IsaacLab dep tree. pin-pink -> pinocchio (pin) -> cmeel-boost transitively caps numpy <2.4, so pip picks the highest 2.3.x: 2.3.5. That landed the broken OpenBLAS into site-packages, shadowing Isaac Sim's prebundled numpy 2.3.1.

Tightening to 'numpy>=2,!=2.3.5' across the four packages that declare a numpy dep (isaaclab, isaaclab_tasks, isaaclab_rl, isaaclab_visualizers) keeps the loose lower bound but excludes the single known-broken release. Pip resolves to 2.3.4, which ships a different bundled OpenBLAS hash (libscipy_openblas64_-8fb3d286.so) that was the resolved version for IsaacLab CI prior to numpy 2.3.5 without these crashes.

Refs:
- numpy/numpy#30092
- scipy/scipy#23686
- OpenMathLib/OpenBLAS#5520
- JIRA OMPE-92261

Verified locally:
- numpy 2.3.0 / 2.3.1 ship libscipy_openblas64_-56d6093b.so (safe)
- numpy 2.3.2 / 2.3.3 / 2.3.4 ship libscipy_openblas64_-8fb3d286.so
- numpy 2.3.5 ships libscipy_openblas64_-fdde5778.so (broken)
- Isaac Sim base image (both 5/11 and 5/15 candidates) prebundles numpy 2.3.1 with -56d6093b at omni.kit.pip_archive/pip_prebundle

A longer-term fix is bumping cmeel-boost upstream so numpy 2.4.1+ becomes resolvable; coordination with Isaac Sim base image is also in flight separately.
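The resolution behaviour described in this message can be illustrated with a toy model. This is only a sketch under stated assumptions: `parse`, `pick_highest`, and the `CANDIDATES` list are hypothetical names for illustration, the candidate list is not a real index listing, and the logic is a deliberately simplified stand-in for pip's resolver (no PEP 440 parsing, no backtracking).

```python
from typing import Iterable, Sequence, Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Turn '2.3.5' into (2, 3, 5) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# Illustrative candidates only; not a full listing of published releases.
CANDIDATES = ["1.26.4", "2.3.0", "2.3.1", "2.3.2", "2.3.3", "2.3.4", "2.3.5", "2.4.1"]


def pick_highest(
    candidates: Iterable[str],
    lower: str,
    upper: str,
    excluded: Sequence[str] = (),
) -> str:
    """Return the highest candidate with lower <= v < upper and v not excluded."""
    allowed = [
        v
        for v in candidates
        if parse(lower) <= parse(v) < parse(upper) and v not in excluded
    ]
    if not allowed:
        raise LookupError("no candidate satisfies the constraints")
    return max(allowed, key=parse)


# numpy>=2 combined with the transitive <2.4 cap.
before_pin = pick_highest(CANDIDATES, lower="2", upper="2.4")
# numpy>=2,!=2.3.5 combined with the same cap.
after_pin = pick_highest(CANDIDATES, lower="2", upper="2.4", excluded=("2.3.5",))

print(before_pin, after_pin)  # prints: 2.3.5 2.3.4
```

The point of the toy is only that a single `!=` exclusion moves the selection down by one patch release while leaving the lower bound and the transitive cap untouched.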
Code Review Summary
This PR correctly addresses a critical CI stability issue by excluding NumPy 2.3.5, which ships a broken OpenBLAS bundle that causes SIGSEGV during SimulationApp startup.
✅ Strengths
- Well-researched root cause analysis: The PR description thoroughly documents the OpenBLAS bisection across numpy 2.3.x versions and correctly identifies the problematic -fdde5778 bundle in 2.3.5.
- Consistent application: All 4 packages declaring a numpy dependency are updated:
  - isaaclab/setup.py: "numpy>=2" → "numpy>=2,!=2.3.5"
  - isaaclab_tasks/setup.py: "numpy>=2" → "numpy>=2,!=2.3.5"
  - isaaclab_rl/setup.py: "numpy" → "numpy>=2,!=2.3.5"
  - isaaclab_visualizers/setup.py: "numpy" → "numpy>=2,!=2.3.5"
- Proper documentation: Changelog fragments are added for all 4 packages with clear explanations and issue references.
- Inline comments: The comments in setup.py files reference the upstream issue (numpy/numpy#30092) and JIRA ticket, which helps future maintainers understand why the exclusion exists.
- Minimal, targeted fix: Rather than over-constraining (e.g., numpy<2.3.5), the PR surgically excludes only the known-broken version.
📝 Minor Observations
- Changelog fragment duplication: isaaclab_rl and isaaclab_visualizers share identical changelog text that says "(was unconstrained)", which is accurate, but the text could be slightly differentiated. This is a very minor nit and not blocking.
- Future cleanup: Once cmeel-boost lifts its <2.4 cap and numpy 2.4.1+ becomes resolvable, the !=2.3.5 exclusion will become redundant (since pip would skip 2.3.5 anyway). Consider adding a tracking comment or TODO. However, this is also very minor since the exclusion causes no harm if left in place.
⚙️ Technical Validation
- The numpy upstream issue numpy#30092 is closed/fixed, confirming this is a known upstream regression
- The PEP 440 version specifier >=2,!=2.3.5 is syntactically correct
- The cmeel-boost transitive cap <2.4 means pip will resolve to 2.3.4 (highest non-excluded 2.3.x), which ships the -8fb3d286 OpenBLAS bundle that was working in CI prior to 2.3.5
Verdict
This is a clean, well-documented fix for a CI-breaking regression. The approach is sound — excluding a single broken release is preferable to pinning to a specific "known good" version, as it allows for maximum flexibility while avoiding the crash.
LGTM once CI passes. 👍
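The bundle hashes referenced in this review (-56d6093b, -8fb3d286, -fdde5778) could be checked on an installed environment with something like the sketch below. It assumes the wheel places its vendored shared libraries in a `numpy.libs` directory next to the `numpy` package inside site-packages; that layout, and the helper names `bundled_openblas_hashes` and `has_broken_bundle`, are assumptions for illustration and are not taken from the PR. The demo runs against a fabricated temporary directory rather than a real install.

```python
import re
import tempfile
from pathlib import Path
from typing import List

# File name pattern taken from the library names quoted in the PR description.
OPENBLAS_NAME = re.compile(r"^libscipy_openblas64_-([0-9a-f]+)\.so$")

# Hash the PR identifies as broken (numpy 2.3.5).
KNOWN_BROKEN = {"fdde5778"}


def bundled_openblas_hashes(site_packages: Path) -> List[str]:
    """List the hash suffixes of vendored OpenBLAS libraries, if any.

    Assumes the libraries live in <site_packages>/numpy.libs.
    """
    libs_dir = site_packages / "numpy.libs"
    if not libs_dir.is_dir():
        return []
    found = []
    for entry in sorted(libs_dir.iterdir()):
        match = OPENBLAS_NAME.match(entry.name)
        if match:
            found.append(match.group(1))
    return found


def has_broken_bundle(site_packages: Path) -> bool:
    """True if any vendored OpenBLAS matches a known-broken hash."""
    return any(h in KNOWN_BROKEN for h in bundled_openblas_hashes(site_packages))


# Demo on a fabricated directory tree, not a real numpy install.
with tempfile.TemporaryDirectory() as tmp:
    fake_site_packages = Path(tmp)
    (fake_site_packages / "numpy.libs").mkdir()
    (fake_site_packages / "numpy.libs" / "libscipy_openblas64_-fdde5778.so").touch()
    demo_hashes = bundled_openblas_hashes(fake_site_packages)
    demo_broken = has_broken_bundle(fake_site_packages)

print(demo_hashes, demo_broken)  # prints: ['fdde5778'] True
```

Scanning file names avoids importing numpy at all, which matters here because the crash described in the PR is tied to what happens after the library has been loaded.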
Can we push this fix upstream?
My understanding is evolving in this direction. I originally thought it was the Docker image, but the pin/pin-pink dependency is declared by Lab, and the cmeel-boost required by pin-pink/pin caps numpy < 2.4, which always got resolved to this bad version.
Importing numpy before pytest registers the broken OpenBLAS atfork handler in this shell, then Kit's libomni.platforminfo fork() trips it and SIGSEGVs - exactly the bug isaac-sim#5642 targets. Surface the diagnostic AFTER pytest instead, with pytest's exit code preserved so the job still passes/fails based on real test outcomes.
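The ordering described above (run the tests first, run the diagnostic afterwards, report the tests' exit code) can be sketched as follows. The commit applies this in a CI shell step; this Python version is only an analogue, and `run_then_diagnose` and both demo commands are hypothetical.

```python
import subprocess
import sys
from typing import List


def run_then_diagnose(test_cmd: List[str], diagnostic_cmd: List[str]) -> int:
    """Run the tests, then a diagnostic, and return the tests' exit code.

    The diagnostic runs in its own process after the tests have finished,
    so anything it imports cannot affect the test process. Its own exit
    code is deliberately ignored.
    """
    test_result = subprocess.run(test_cmd, check=False)
    subprocess.run(diagnostic_cmd, check=False)
    return test_result.returncode


# Stand-ins for the real commands: a "test run" that fails with code 3
# and a diagnostic that just prints a line.
fake_tests = [sys.executable, "-c", "import sys; sys.exit(3)"]
fake_diagnostic = [sys.executable, "-c", "print('diagnostic ran after tests')"]

exit_code = run_then_diagnose(fake_tests, fake_diagnostic)
print(exit_code)  # prints: 3
```

A CI wrapper would typically finish with `sys.exit(exit_code)` so the job status reflects the test outcome rather than the diagnostic.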
The setup.py constraint "numpy>=2,!=2.3.5" landed in isaac-sim#5642 is silently overridden during isaaclab.sh --install because pip resolves each submodule install independently:
- isaaclab -> numpy stays at 2.3.1 (already satisfied)
- isaaclab_mimic[h5py] -> numpy 1.26.4 (h5py wheel ABI)
- isaaclab_rl -> numpy 2.4.5
- isaaclab_teleop[dex-retargeting] -> numpy 2.3.5 (cmeel-boost <2.4 cap)
- isaaclab_visualizers -> numpy 2.3.4
- isaaclab_mimic[robomimic] -> numpy 1.26.4
- _ensure_pink_ik_dependencies_installed force-reinstall -> numpy 2.3.5

The final pin-pink force-reinstall sees only pin-pink's numpy>=1.19 plus cmeel-boost's numpy<2.4 cap and lands on numpy 2.3.5 - the exact release whose vendored OpenBLAS (libscipy_openblas64_-fdde5778.so) registers a buggy pthread_atfork handler that SIGSEGVs Kit's libomni.platforminfo fork() during SimulationApp startup.

After the pin-pink force-reinstall, append one more pip invocation that explicitly upgrades numpy to >= 2.4.1. pip prints a resolver warning about cmeel-boost's cap but installs numpy 2.4.5 anyway; numpy's stable C ABI (numpy >= 2.0) keeps cmeel's compiled extensions (libpinocchio, libcoal, ...) working at runtime. The atfork fix landed upstream in numpy 2.4.1, so the entire 2.3.x risk class is bypassed.

Validated locally on env_isaaclab_test (numpy 2.4.5 + pinocchio 3.9.0 + pin 3.9.0 + daqp + qpsolvers):
- import numpy, pinocchio, pink, daqp: OK
- Bundled OpenBLAS hash: -32a4b2a6 (not the broken -fdde5778)
- IsaacLab Pink IK unit tests: 54/54 pass (test_pink_ik_components.py 21/21, test_local_frame_task.py 24/24, test_null_space_posture_task.py 9/9)

Related: numpy/numpy#30092, OpenMathLib/OpenBLAS#5520
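Why the setup.py exclusion does not survive the install sequence can be shown with a small simulation. This is a sketch, not pip's algorithm: `install_step`, `satisfies`, and the `CANDIDATES` list are hypothetical, and only three of the steps from the message are modelled, using the constraints the message itself states.

```python
from typing import Optional, Sequence, Tuple


def parse(version: str) -> Tuple[int, ...]:
    return tuple(int(part) for part in version.split("."))


# Illustrative candidates only; not a real index listing.
CANDIDATES = ["1.26.4", "2.3.1", "2.3.4", "2.3.5", "2.4.1", "2.4.5"]


def satisfies(version: str, lower: str, upper: Optional[str], excluded: Sequence[str]) -> bool:
    v = parse(version)
    if v < parse(lower):
        return False
    if upper is not None and v >= parse(upper):
        return False
    return version not in excluded


def install_step(
    installed: str,
    *,
    lower: str,
    upper: Optional[str] = None,
    excluded: Sequence[str] = (),
    force: bool = False,
) -> str:
    """One independent resolve: it only knows the constraints passed to it."""
    if not force and satisfies(installed, lower, upper, excluded):
        return installed  # "already satisfied", nothing is reinstalled
    allowed = [c for c in CANDIDATES if satisfies(c, lower, upper, excluded)]
    return max(allowed, key=parse)


numpy_version = "2.3.1"  # prebundled by the base image

# isaaclab: numpy>=2,!=2.3.5 is already satisfied by 2.3.1.
numpy_version = install_step(numpy_version, lower="2", excluded=("2.3.5",))
after_isaaclab = numpy_version

# pin-pink force-reinstall: sees only numpy>=1.19 and the <2.4 cap.
numpy_version = install_step(numpy_version, lower="1.19", upper="2.4", force=True)
after_pink = numpy_version

# Appended last step: explicit upgrade to numpy>=2.4.1.
numpy_version = install_step(numpy_version, lower="2.4.1")
final_version = numpy_version

print(after_isaaclab, after_pink, final_version)  # prints: 2.3.1 2.3.5 2.4.5
```

The middle step never sees the `!=2.3.5` exclusion because that constraint belongs to a different package's metadata, which is why the fix has to be the last invocation in the sequence.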
The setup.py constraint "numpy>=2,!=2.3.5" landed in isaac-sim#5642 is silently overridden during isaaclab.sh --install: each pip install -e <submodule> runs an independent resolve, and the final pin-pink force-reinstall in _ensure_pink_ik_dependencies_installed lands on numpy 2.3.5 because pip sees only pin-pink's own deps (numpy>=1.19) plus cmeel-boost's numpy<2.4 cap. numpy 2.3.5 ships a vendored OpenBLAS (libscipy_openblas64_-fdde5778.so) whose pthread_atfork handler crashes Kit's libomni.platforminfo fork() during SimulationApp startup.

Two changes, both restating an explicit "pip install --upgrade numpy>=2.4.1" as the *last* pip invocation in each install path:

1. _ensure_numpy_above_openblas_atfork_bug() in install.py: runs unconditionally at the end of --install (not gated by the pink-ik probe outcome), so upgrades on an already-functioning env also pull numpy forward.
2. Dockerfile.curobo: apply the same upgrade after its post-install steps (nvidia-curobo + isaaclab_teleop editable install), which otherwise drag numpy back to 2.3.5 via dex-retargeting -> pin -> cmeel-boost.

pip prints a resolver warning about cmeel-boost's cap then installs numpy 2.4.5 anyway. numpy 2.4.1+ ships the upstream OpenBLAS atfork fix, so the entire 2.3.x risk class is bypassed. numpy's stable C ABI keeps cmeel's compiled extensions (libpinocchio, libcoal, ...) working at runtime.

Validated:
- env_isaaclab_test smoke test (numpy 2.4.5 + cmeel pinocchio + pink + daqp + qpsolvers all import; toy IK solve OK).
- IsaacLab Pink IK unit tests: 54/54 pass against numpy 2.4.5 (test_pink_ik_components 21/21, test_local_frame_task 24/24, test_null_space_posture_task 9/9).
- PR isaac-sim#5655 (validation): every base-image test job reports numpy 2.4.5 + openblas -32a4b2a6 (clean, not the broken -fdde5778). Worst-case import order (numpy imported before pytest spawns Kit) also passes, confirming the upstream atfork fix is real, not just dodge-by-order.
Related: numpy/numpy#30092, OpenMathLib/OpenBLAS#5520
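A final-step upgrade of the kind this message describes might look roughly like the sketch below. The body of `_ensure_numpy_above_openblas_atfork_bug()` is not shown in this conversation, so nothing here should be read as its actual implementation: `numpy_upgrade_command`, `ensure_numpy_floor`, and the injectable `run` parameter are hypothetical. Only the pip arguments come from the message.

```python
import subprocess
import sys
from typing import Callable, List, Optional, Sequence

# Requirement string quoted in the commit message.
NUMPY_FLOOR = "numpy>=2.4.1"


def numpy_upgrade_command(python_exe: str = sys.executable) -> List[str]:
    """Build the pip invocation; passing a list avoids shell quoting of '>='."""
    return [python_exe, "-m", "pip", "install", "--upgrade", NUMPY_FLOOR]


def ensure_numpy_floor(run: Optional[Callable[[Sequence[str]], int]] = None) -> int:
    """Run the upgrade as the last install step and return its exit code.

    `run` is injectable so the step can be exercised without touching the
    environment; by default the command is executed with subprocess.
    """
    cmd = numpy_upgrade_command()
    if run is None:
        return subprocess.run(cmd, check=False).returncode
    return run(cmd)


# Show the command without executing pip.
print(" ".join(numpy_upgrade_command("python")))
# prints: python -m pip install --upgrade numpy>=2.4.1
```

Calling the step unconditionally, rather than only when an earlier probe fails, matches the intent stated in the message: an environment that already works should still be moved off the affected release.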
TL;DR
NumPy 2.3.5 ships a vendored OpenBLAS (libscipy_openblas64_-fdde5778.so) whose pthread_atfork handler crashes inside Kit's libomni.platforminfo fork() during SimulationApp startup. IsaacLab's setup.py declares numpy>=2, and with the pin-pink → pinocchio → cmeel-boost transitive cap of <2.4, pip resolves to 2.3.5, exactly the broken release. This PR adds !=2.3.5 to that constraint across the four packages that declare a numpy dep. Pip then resolves to 2.3.4, which ships a different OpenBLAS bundle. Targets the dependency layer of the OpenBLAS-class CI SIGSEGV in IsaacLab's own setup.py.
Why this fix lives here, not in Isaac Sim's base image
Verified by docker run on both candidate Isaac Sim images (the current pin sha256:0dd49a11… and the rolling sha256:06197a67…): both prebundle numpy 2.3.1 with the safe OpenBLAS hash -56d6093b. The broken numpy 2.3.5 enters the running container at IsaacLab's docker/Dockerfile.base:117-118, when isaaclab.sh --install runs and pip resolves IsaacLab's numpy>=2 constraint to 2.3.5, installing into _isaac_sim/kit/python/lib/python3.12/site-packages/ and shadowing the base image's prebundle.
So the dependency-resolution layer is in IsaacLab's setup.py.
NumPy 2.3.x → bundled OpenBLAS hash bisection (verified)
- numpy 2.3.0: libscipy_openblas64_-56d6093b.so
- numpy 2.3.1: libscipy_openblas64_-56d6093b.so
- numpy 2.3.2: libscipy_openblas64_-8fb3d286.so
- numpy 2.3.3: libscipy_openblas64_-8fb3d286.so
- numpy 2.3.4: libscipy_openblas64_-8fb3d286.so
- numpy 2.3.5: libscipy_openblas64_-fdde5778.so
With this PR's numpy>=2,!=2.3.5, pip resolves to 2.3.4 (highest non-broken). The -8fb3d286 bundle was IsaacLab CI's resolved version for ~4 months before 2.3.5 was released, but it wasn't tested against the new CUDA 13.2 driver / runner environment that started showing the SIGSEGV pattern on 2026-05-12. If a reviewer prefers the maximally conservative option, the constraint can be tightened to numpy>=2,<2.3.2, which forces 2.3.1, bit-identical to the base image.
Why not bump to numpy ≥ 2.4.1 (which has the upstream OpenBLAS fix)?
pin-pink (Pink IK library) depends on pin (Pinocchio) → libpinocchio 3.9.0 → cmeel-boost ~=1.89.0. The latest cmeel-boost 1.89.0 declares a numpy<2.4 cap.
Forcing numpy>=2.4 + pin>=2.6.3 produces ResolutionImpossible. Until cmeel-boost upstream lifts its cap (or IsaacLab moves Pinocchio to a non-PyPI install path), the highest numpy IsaacLab can resolve to is 2.3.x. Excluding 2.3.5 is therefore the short-term fix at the dependency-resolution layer.
Related in-flight work
- … AppLauncher runs Kit, and adjusts a couple of tests that don't need AppLauncher. Operates at the import-order layer.
- … PYTHONFAULTHANDLER=1 + -vv to surface crash backtraces in CI. Diagnostic instrumentation.
- … OMP_/OPENBLAS_/MKL_NUM_THREADS=1). Operates at the OpenBLAS runtime layer.
This PR operates at the dependency-resolution layer. The four approaches address different layers of the same problem; reviewers can choose which set to land.
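The diagnostic and runtime-layer settings named above are plain environment variables, so they can be prepared for a child test process with a sketch like the following. `diagnostic_env` is a hypothetical helper; the variable names are the ones listed above, and the sketch assumes they need to be in place before the process loads numpy for the thread caps to take effect.

```python
import os
from typing import Dict, Mapping, Optional

# Thread-cap variables named in the related work.
THREAD_CAP_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def diagnostic_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy an environment and add single-thread caps plus faulthandler.

    The input mapping is not modified; the result is meant to be passed as
    `env=` when spawning the test process.
    """
    env = dict(os.environ if base is None else base)
    for name in THREAD_CAP_VARS:
        env[name] = "1"
    # Makes CPython dump a traceback on fatal signals such as SIGSEGV.
    env["PYTHONFAULTHANDLER"] = "1"
    return env


demo_env = diagnostic_env({"PATH": "/usr/bin"})
print(sorted(demo_env))
```

Passing the result as `subprocess.run(cmd, env=diagnostic_env())` keeps the settings scoped to the spawned test run instead of the whole CI shell.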
Files touched
All four sites must agree per IsaacLab convention (pip first-declaration-wins on transitive resolution).
References
Type of change
Checklist
pre-commit checks with ./isaaclab.sh --format