fix: AlphaGenome JAX Metal guard + macOS fresh-install audit v4#11
Closed
lucapinello wants to merge 1 commit into chorus-applications
The macOS CPU-forcing guard in AlphaGenome's load and predict templates only fired when device was None or started with "cpu". Callers passing device='cuda:0' (common in scripts) bypassed it, letting jax-metal initialize and crash with "UNIMPLEMENTED: default_memory_space".

Fix: on Darwin, always force JAX_PLATFORMS=cpu unless the caller explicitly requests Metal. Applied to all three code paths:
- alphagenome.py:_load_direct (also moved env var before import jax)
- load_template.py
- predict_template.py

Audit report at audits/2026-04-16_macos_fresh_install_audit_v4.md covers:
- clean-slate install of all 7 envs
- GPU verification (6 oracles)
- 12 example regenerations with diffs
- 3 notebook executions (0 errors)
- 13 HTML Selenium checks
- 7-check normalization CDF audit (all pass)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
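The guard described above can be sketched roughly as follows. This is a minimal illustration, not the code in alphagenome.py or the templates: the helper name `force_cpu_on_macos`, its `system` and `env` parameters, and the rule that a device string starting with "metal" counts as an explicit Metal request are all assumptions made for the example. Only the `JAX_PLATFORMS` variable and the Darwin check come from the description.

```python
import os
import platform


def force_cpu_on_macos(device, system=None, env=None):
    """Hypothetical helper: pin JAX to CPU on macOS unless Metal is requested.

    Returns True when JAX_PLATFORMS was set to "cpu", False otherwise.
    `system` and `env` are injectable only so the sketch can be tested
    without running on a Mac or touching the real environment.
    """
    system = platform.system() if system is None else system
    env = os.environ if env is None else env

    if system != "Darwin":
        return False  # Linux/Windows: leave the backend choice alone.

    # Assumption: "explicitly requests Metal" means a device string
    # that starts with "metal" (case-insensitive).
    wants_metal = isinstance(device, str) and device.lower().startswith("metal")
    if wants_metal:
        return False

    # Covers None, "cpu", and also "cuda:0", the case the old guard missed.
    env["JAX_PLATFORMS"] = "cpu"
    return True
```

A caller would invoke the helper before the first `import jax`, since JAX reads `JAX_PLATFORMS` when it initializes its backends; that ordering is why the description notes that the env var was moved ahead of the import in `_load_direct`.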
Superseded: branch pre-dates current main-line evolution; its diff would destroy 43,380 lines of work already on chorus-applications. Closing without merge.
This was referenced Apr 24, 2026
lucapinello added a commit that referenced this pull request on Apr 24, 2026:
Closes the last three v26 follow-ups before calling v26 done:

- P1 #10 (runner.py): set TF_CPP_MIN_LOG_LEVEL=3 via _prepare_env() for enformer + chrombpnet subprocesses, silencing the 3-5 lines of TF/absl boot spam per prediction.
- P1 #11 (exceptions.py, base.py): _setup_environment() now records an actionable message on every failure path (missing env / validation failure / import error / unexpected) into self._env_setup_error and still flips use_environment=False. A new _check_env_ready() fires at the top of predict() / predict_region_replacement() / predict_region_insertion_at() / predict_variant_effect(): if the user originally asked for use_environment=True and setup failed, it raises the new EnvironmentNotReadyError with the recorded hint (`chorus setup` / `chorus health`) instead of letting the caller hit a confusing ModuleNotFoundError inside the base env. The legitimate use_environment=False test/library path still passes through.
- P2 #14 (result.py): OraclePredictionTrack.interpolate() and .aggregate() now raise ValueError with the sibling-method pointer on a bad target resolution, not bare assert (which gets stripped by -O).
- P2 #21 (cli/main.py): --verbose on health / list / genome now sets the root logger to DEBUG so EnvironmentManager / GenomeManager / oracle debug output actually surfaces. Previously the flag only gated a few extra print lines in the command handler.
- P2 #22 (runner.py): subprocess.TimeoutExpired in run_code_in_environment() and run_in_environment() is caught and re-raised as RuntimeError with a pointer to CHORUS_NO_TIMEOUT=1, replacing the bare TimeoutExpired with truncated stderr.

Tests: 340 passed, 1 skipped on fast suite (no smoke).

Co-authored-by: lp698 <lp698@dimm2fv07n65x.partners.org>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
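The gating behaviour described under P1 #11 can be illustrated with a small stand-alone sketch. The attribute and method names `_user_asked_for_env`, `_env_setup_error`, `_setup_environment`, `_check_env_ready`, and `EnvironmentNotReadyError` appear in the commit messages on this page; the `OracleSketch` class, its `setup_fails_with` parameter, the message wording, and the toy `predict()` body are assumptions for illustration and do not reproduce chorus/core/base.py.

```python
class EnvironmentNotReadyError(RuntimeError):
    """Stand-in for the exception named in the commit message."""


class OracleSketch:
    """Hypothetical oracle showing the record-then-raise pattern."""

    def __init__(self, use_environment=True, setup_fails_with=None):
        # Remember what the user asked for, separately from the
        # (possibly downgraded) effective setting.
        self._user_asked_for_env = use_environment
        self.use_environment = use_environment
        self._env_setup_error = None
        if use_environment:
            self._setup_environment(setup_fails_with)

    def _setup_environment(self, failure):
        # `failure` simulates any setup failure path (assumption).
        if failure is not None:
            self._env_setup_error = (
                f"{failure}. Run `chorus setup` or `chorus health`."
            )
            self.use_environment = False  # downgrade, as before

    def _check_env_ready(self):
        # No-op when the user never asked for an isolated environment.
        if self._user_asked_for_env and self._env_setup_error is not None:
            raise EnvironmentNotReadyError(self._env_setup_error)

    def predict(self, sequence):
        self._check_env_ready()  # gate fires before any real work
        return len(sequence)  # placeholder for an actual prediction
```

The design point is that the failure is recorded at setup time but surfaced at call time, so the user sees an actionable hint rather than a ModuleNotFoundError from the wrong environment, while callers that pass `use_environment=False` are unaffected.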
lucapinello added a commit that referenced this pull request on Apr 29, 2026:
…loses #63, #64, #65) (#66)

* fix(env-setup): timeout-soft policy + plug load_pretrained_model gate (closes #64)

Two related fixes for issue #64:

1. chorus/core/base.py:_setup_environment now distinguishes transient validation timeouts (slow NFS / cold-cache `import jax` probes) from genuine missing-dep failures. validate_environment returns issues prefixed with "Timeout while checking dependency" for the timeout case and "Missing dependency" / "Error checking" for real failures. Previously, BOTH paths set use_environment=False and recorded an _env_setup_error, leading to oracle.load_pretrained_model() falling through to _load_direct in the wrong env on cold-NFS lab boxes. New policy:
   - Timeout-only failures: log a warning, KEEP use_environment=True (don't set _env_setup_error). The actual subprocess invocation has its own per-call timeout and will surface a real error if the env is genuinely broken.
   - Genuine missing-dep failures: keep the v26 P1 #11 invariant: downgrade + raise EnvironmentNotReadyError on next user call.

   The conjunctive 'all timeouts' check means a mixed timeout+missing-dep issue list is treated as a genuine failure (the missing-dep is the real signal).

2. Each oracle's load_pretrained_model now calls self._check_env_ready() as its first action. predict() already does this (base.py:215); load_pretrained_model didn't, so a silent downgrade still surfaced as ModuleNotFoundError from _load_direct instead of the intended EnvironmentNotReadyError. 7 oracles, 1-line edit each:
   - alphagenome.py
   - alphagenome_pt.py
   - borzoi.py
   - chrombpnet.py
   - sei.py
   - legnet.py
   - enformer.py

   _check_env_ready is a no-op when _user_asked_for_env=False (base.py:180), so this is safe for tests that pass use_environment=False.

3. tests/test_environment_setup_gating.py adds 4 fast-suite tests pinning:
   - Timeout-only: use_environment stays True, _env_setup_error stays None
   - Missing-dep: downgrade + EnvironmentNotReadyError on load_pretrained_model
   - Mixed timeouts + missing-dep: treated as genuine failure
   - use_environment=False: validation path never runs, _check_env_ready never raises

   Tests monkeypatch EnvironmentManager.validate_environment so they don't need a real conda env or GPU. Run in <2 s.

Audit-script workaround at audits/2026-04-29_alphagenome_pt_stress_test/ (o.use_environment = True; o._env_setup_error = None) is no longer needed after this fix. Linux/CUDA spot-check should confirm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(alphagenome): rewrite equivalence test via chorus API (closes #63, #65)

The pre-rewrite test compared JAX and PyTorch outputs at the *raw model head* and asserted shape parity. JAX's DNase head exposes 305 tracks and the PyTorch port's exposes 384: different upstream filtering choices, not different weights. The shape assertion was bound to fail on any platform; it just didn't surface until the Linux/CUDA spot-check on `audit/linux-cuda-pr62`.

Rewrite uses oracle.predict(), which goes through chorus's local_index slicing in chorus/oracles/alphagenome.py:_predict. That slicing selects the user-requested tracks by identifier from the shared 5,731-track metadata cache (alphagenome_tracks.json). Post-slicing arrays are shape-compatible across backends regardless of raw head shape.

Concrete changes:
- Drop subprocess-driven JAX_RUNNER + PT_RUNNER heredocs (~70 lines) and the bare subprocess.run(["mamba", ...]) calls in _env_exists() and _run() (closes #65). The test now uses chorus's own oracle API, which spawns into per-oracle envs via EnvironmentRunner.run_code_in_environment, the canonical path that resolves mamba/conda binaries via EnvironmentManager's MAMBA_EXE / shutil.which / fallback chain.
- Use EnvironmentManager.environment_exists(oracle) for skip detection instead of rolling our own `mamba env list --json` shellout.
- Hardcode three canonical DNase identifiers (K562 EFO:0002067, HepG2 EFO:0001187, hepatocyte CL:0000182), verified against alphagenome_metadata.get_metadata().search_tracks() at write time. Same set as v30 macOS Tier 1 audit so equivalence numbers stay comparable across audits.
- Tolerances unchanged: max abs diff < 0.1, mean rel diff < 5%. Both M3 Ultra and A100 audits reported 0.74–1.85% per-track rel error, well inside this bound.

Test stays @pytest.mark.integration; skips cleanly when either env or hg38.fa is missing. Run on Linux/CUDA via:

    mamba run -n chorus pytest tests/test_alphagenome_backends_equivalence.py -m integration -v

This no longer fails with FileNotFoundError on bare `mamba` (the inner subprocess goes through chorus's resolver, not the test's own subprocess.run).

#64 fix in the previous commit removes the need for the audit-script workaround (o.use_environment = True; o._env_setup_error = None).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: lp698 <lp698@dimm2fv07n65x.partners.org>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
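The timeout-soft policy from the first commit above reduces to a small classification over the issue strings returned by validation. The sketch below assumes only what the commit message states: timeout issues start with "Timeout while checking dependency", and a list counts as timeout-only when every entry has that prefix. The function name `classify_validation_issues` and its three return labels are hypothetical and are not taken from chorus/core/base.py.

```python
TIMEOUT_PREFIX = "Timeout while checking dependency"


def classify_validation_issues(issues):
    """Hypothetical classifier for environment-validation issue strings.

    Returns:
        "ok"            no issues; keep use_environment=True
        "timeout_only"  every issue is a timeout; warn, keep the env
        "genuine"       at least one real failure; downgrade and record
    """
    if not issues:
        return "ok"
    # Conjunctive check: a single non-timeout issue makes the whole
    # list a genuine failure, because that issue is the real signal.
    if all(issue.startswith(TIMEOUT_PREFIX) for issue in issues):
        return "timeout_only"
    return "genuine"
```

Under this reading, a caller would log a warning and leave `_env_setup_error` unset for "timeout_only", and would record the error and downgrade only for "genuine".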
Summary
Callers passing `device='cuda:0'` bypassed the macOS CPU-forcing guard, causing `UNIMPLEMENTED: default_memory_space` crashes with jax-metal. Fixed in all 3 code paths (alphagenome.py, load_template.py, predict_template.py).
Test plan
🤖 Generated with Claude Code