fix: AlphaGenome JAX Metal guard + macOS fresh-install audit v4 by lucapinello · Pull Request #11 · pinellolab/chorus

lucapinello · 2026-04-16T16:45:24Z

Summary

Bug fix: AlphaGenome's macOS CPU-forcing guard was incomplete — device='cuda:0' bypassed it, causing UNIMPLEMENTED: default_memory_space crashes with jax-metal. Fixed in all 3 code paths (alphagenome.py, load_template.py, predict_template.py).
Audit report: Clean-slate macOS fresh-install audit covering all 7 envs, 6 oracle GPU checks, 12 example regenerations, 3 notebook executions, 13 HTML Selenium checks, and 7-check normalization CDF audit.

Test plan

Delete all chorus conda envs + caches, reinstall from zero
Verify GPU: Borzoi/SEI/LegNet on MPS, Enformer/ChromBPNet on Metal, AlphaGenome on CPU
Regenerate all 12 application examples, diff against committed (all within tolerance)
Execute all 3 notebooks (0 errors, 0 stale messages)
Selenium check 13 HTML reports (all structural checks pass)
Normalization CDF audit: 7 checks × 6 oracles = all pass

🤖 Generated with Claude Code

The macOS CPU-forcing guard in AlphaGenome's load and predict templates only fired when device was None or started with "cpu". Callers passing device='cuda:0' (common in scripts) bypassed it, letting jax-metal initialize and crash with "UNIMPLEMENTED: default_memory_space".

Fix: on Darwin, always force JAX_PLATFORMS=cpu unless the caller explicitly requests Metal. Applied to all three code paths:
- alphagenome.py:_load_direct (also moved env var before import jax)
- load_template.py
- predict_template.py

Audit report at audits/2026-04-16_macos_fresh_install_audit_v4.md covers: clean-slate install of all 7 envs, GPU verification (6 oracles), 12 example regenerations with diffs, 3 notebook executions (0 errors), 13 HTML Selenium checks, 7-check normalization CDF audit (all pass).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
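The guard described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in alphagenome.py: the function names `should_force_cpu` and `configure_jax_platform` are hypothetical, and treating "explicitly requests Metal" as a device string starting with "metal" is an assumption. The only detail taken from the PR is that JAX_PLATFORMS=cpu must be set on Darwin before `import jax` runs.

```python
import os
import platform
from typing import Optional


def should_force_cpu(device: Optional[str], system: str) -> bool:
    """Decide whether JAX should be pinned to CPU.

    Assumption: a caller "explicitly requests Metal" by passing a device
    string that starts with "metal". Anything else on Darwin (None, "cpu",
    "cuda:0", ...) is forced to CPU.
    """
    if system != "Darwin":
        return False
    requested = (device or "").lower()
    return not requested.startswith("metal")


def configure_jax_platform(device: Optional[str] = None,
                           system: Optional[str] = None) -> bool:
    """Set JAX_PLATFORMS=cpu when the guard applies; return whether it did.

    Must be called before `import jax`, because JAX reads the variable
    when it initializes its backends.
    """
    system = system or platform.system()
    force = should_force_cpu(device, system)
    if force:
        os.environ["JAX_PLATFORMS"] = "cpu"
    return force


# Hypothetical call site: configure first, import jax afterwards.
# configure_jax_platform(device="cuda:0")
# import jax
```

The point of the sketch is the inverted condition: instead of listing the device strings that trigger CPU mode (which is how 'cuda:0' slipped through), it lists the single case that opts out.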

lucapinello · 2026-04-22T18:10:13Z

Superseded — branch pre-dates current main-line evolution; its diff would destroy 43,380 lines of work already on chorus-applications. Closing without merge.

Closes the last three v26 follow-ups before calling v26 done:

P1 #10 (runner.py): set TF_CPP_MIN_LOG_LEVEL=3 via _prepare_env() for enformer + chrombpnet subprocesses, silencing the 3-5 lines of TF/absl boot spam per prediction.

P1 #11 (exceptions.py, base.py): _setup_environment() now records an actionable message on every failure path (missing env / validation failure / import error / unexpected) into self._env_setup_error and still flips use_environment=False. A new _check_env_ready() fires at the top of predict() / predict_region_replacement() / predict_region_insertion_at() / predict_variant_effect(): if the user originally asked for use_environment=True and setup failed, it raises the new EnvironmentNotReadyError with the recorded hint (`chorus setup` / `chorus health`) instead of letting the caller hit a confusing ModuleNotFoundError inside the base env. The legitimate use_environment=False test/library path still passes through.

P2 #14 (result.py): OraclePredictionTrack.interpolate() and .aggregate() now raise ValueError with the sibling-method pointer on a bad target resolution, not bare assert (which gets stripped by -O).

P2 #21 (cli/main.py): --verbose on health / list / genome now sets the root logger to DEBUG so EnvironmentManager / GenomeManager / oracle debug output actually surfaces. Previously the flag only gated a few extra print lines in the command handler.

P2 #22 (runner.py): subprocess.TimeoutExpired in run_code_in_environment() and run_in_environment() is caught and re-raised as RuntimeError with a pointer to CHORUS_NO_TIMEOUT=1, replacing the bare TimeoutExpired with truncated stderr.

Tests: 340 passed, 1 skipped on fast suite (no smoke).

Co-authored-by: lp698 <lp698@dimm2fv07n65x.partners.org>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
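The P1 #11 gate can be pictured with a small stand-in class. This is a sketch under stated assumptions, not chorus's base.py: `OracleSketch`, its `setup_ok` constructor flag, the toy `predict()` body, and the choice of RuntimeError as the base class of EnvironmentNotReadyError are all hypothetical. The attribute and method names (`use_environment`, `_user_asked_for_env`, `_env_setup_error`, `_setup_environment`, `_check_env_ready`) are the ones named in the commit messages on this page.

```python
from typing import Optional


class EnvironmentNotReadyError(RuntimeError):
    """Raised when an oracle env was requested but setup failed.

    The RuntimeError base class is an assumption for this sketch.
    """


class OracleSketch:
    """Hypothetical stand-in for an oracle base class."""

    def __init__(self, use_environment: bool = True, setup_ok: bool = True):
        # Remember what the user originally asked for, separately from the
        # flag that setup may later downgrade.
        self._user_asked_for_env = use_environment
        self.use_environment = use_environment
        self._env_setup_error: Optional[str] = None
        if use_environment:
            self._setup_environment(setup_ok)

    def _setup_environment(self, setup_ok: bool) -> None:
        # `setup_ok` simulates the outcome of real environment validation.
        if not setup_ok:
            self._env_setup_error = (
                "Oracle environment is not ready; "
                "run `chorus setup` or `chorus health`."
            )
            self.use_environment = False

    def _check_env_ready(self) -> None:
        # No-op for the use_environment=False test/library path.
        if self._user_asked_for_env and self._env_setup_error is not None:
            raise EnvironmentNotReadyError(self._env_setup_error)

    def predict(self, sequence: str) -> int:
        self._check_env_ready()
        return len(sequence)  # placeholder for a real prediction
```

Keeping the original request in its own attribute is what lets the gate tell a failed setup apart from a caller who deliberately passed use_environment=False.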

…loses #63, #64, #65) (#66)

* fix(env-setup): timeout-soft policy + plug load_pretrained_model gate (closes #64)

Two related fixes for issue #64:

1. chorus/core/base.py:_setup_environment now distinguishes transient validation timeouts (slow NFS / cold-cache `import jax` probes) from genuine missing-dep failures. validate_environment returns issues prefixed with "Timeout while checking dependency" for the timeout case and "Missing dependency" / "Error checking" for real failures. Previously, BOTH paths set use_environment=False and recorded an _env_setup_error, leading to oracle.load_pretrained_model() falling through to _load_direct in the wrong env on cold-NFS lab boxes.

   New policy:
   - Timeout-only failures: log a warning, KEEP use_environment=True (don't set _env_setup_error). The actual subprocess invocation has its own per-call timeout and will surface a real error if the env is genuinely broken.
   - Genuine missing-dep failures: keep the v26 P1 #11 invariant, i.e. downgrade + raise EnvironmentNotReadyError on next user call.

   The conjunctive 'all timeouts' check means a mixed timeout+missing-dep issue list is treated as a genuine failure (the missing-dep is the real signal).

2. Each oracle's load_pretrained_model now calls self._check_env_ready() as its first action. predict() already does this (base.py:215); load_pretrained_model didn't, so a silent downgrade still surfaced as ModuleNotFoundError from _load_direct instead of the intended EnvironmentNotReadyError. 7 oracles, 1-line edit each:
   - alphagenome.py
   - alphagenome_pt.py
   - borzoi.py
   - chrombpnet.py
   - sei.py
   - legnet.py
   - enformer.py

   _check_env_ready is a no-op when _user_asked_for_env=False (base.py:180), so this is safe for tests that pass use_environment=False.

3. tests/test_environment_setup_gating.py: 4 fast-suite tests pinning:
   - Timeout-only: use_environment stays True, _env_setup_error stays None
   - Missing-dep: downgrade + EnvironmentNotReadyError on load_pretrained_model
   - Mixed timeouts + missing-dep: treated as genuine failure
   - use_environment=False: validation path never runs, _check_env_ready never raises

   Tests monkeypatch EnvironmentManager.validate_environment so they don't need a real conda env or GPU. Run in <2 s.

Audit-script workaround at audits/2026-04-29_alphagenome_pt_stress_test/ (o.use_environment = True; o._env_setup_error = None) is no longer needed after this fix. Linux/CUDA spot-check should confirm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(alphagenome): rewrite equivalence test via chorus API (closes #63, #65)

The pre-rewrite test compared JAX and PyTorch outputs at the *raw model head* and asserted shape parity. JAX's DNase head exposes 305 tracks and the PyTorch port's exposes 384: different upstream filtering choices, not different weights. The shape assertion was bound to fail on any platform; it just didn't surface until the Linux/CUDA spot-check on `audit/linux-cuda-pr62`.

Rewrite uses oracle.predict() which goes through chorus's local_index slicing in chorus/oracles/alphagenome.py:_predict, which selects the user-requested tracks by identifier from the shared 5,731-track metadata cache (alphagenome_tracks.json). Post-slicing arrays are shape-compatible across backends regardless of raw head shape.

Concrete changes:
- Drop subprocess-driven JAX_RUNNER + PT_RUNNER heredocs (~70 lines) and the bare subprocess.run(["mamba", ...]) calls in _env_exists() and _run() (closes #65). The test now uses chorus's own oracle API which spawns into per-oracle envs via EnvironmentRunner.run_code_in_environment, the canonical path that resolves mamba/conda binaries via EnvironmentManager's MAMBA_EXE / shutil.which / fallback chain.
- Use EnvironmentManager.environment_exists(oracle) for skip detection instead of rolling our own `mamba env list --json` shellout.
- Hardcode three canonical DNase identifiers (K562 EFO:0002067, HepG2 EFO:0001187, hepatocyte CL:0000182), verified against alphagenome_metadata.get_metadata().search_tracks() at write time. Same set as v30 macOS Tier 1 audit so equivalence numbers stay comparable across audits.
- Tolerances unchanged: max abs diff < 0.1, mean rel diff < 5%. Both M3 Ultra and A100 audits reported 0.74–1.85% per-track rel error, well inside this bound.

Test stays @pytest.mark.integration; skips cleanly when either env or hg38.fa is missing. Run on Linux/CUDA via:

    mamba run -n chorus pytest tests/test_alphagenome_backends_equivalence.py -m integration -v

This no longer fails with FileNotFoundError on bare `mamba` (the inner subprocess goes through chorus's resolver, not the test's own subprocess.run).

#64 fix in the previous commit removes the need for the audit-script workaround (o.use_environment = True; o._env_setup_error = None).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: lp698 <lp698@dimm2fv07n65x.partners.org>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
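The timeout-soft policy from the env-setup commit above comes down to one conjunctive check over the issue strings. The sketch below is illustrative only: `classify_validation_issues` and its return shape are hypothetical, and the real logic lives in chorus/core/base.py:_setup_environment. The prefix strings are the ones quoted in the commit message.

```python
from typing import List, Optional, Tuple

# Prefix quoted in the commit message for the transient (timeout) case.
TIMEOUT_PREFIX = "Timeout while checking dependency"


def classify_validation_issues(
    issues: List[str],
) -> Tuple[bool, Optional[str]]:
    """Return (keep_use_environment, env_setup_error).

    - No issues: keep the environment, no error.
    - Timeout-only issues: keep the environment, no error recorded
      (the caller would log a warning instead).
    - Anything else, including a mixed list: downgrade and record the
      genuine failures as the error message.
    """
    if not issues:
        return True, None
    if all(issue.startswith(TIMEOUT_PREFIX) for issue in issues):
        return True, None
    genuine = [i for i in issues if not i.startswith(TIMEOUT_PREFIX)]
    return False, "; ".join(genuine)
```

Because the check is `all(...)` rather than `any(...)`, a single "Missing dependency" entry is enough to treat the whole list as a genuine failure, matching the mixed-case behaviour the commit describes.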

lucapinello closed this Apr 22, 2026

lucapinello deleted the audit/2026-04-16-fresh-install-v4 branch April 22, 2026 18:10

This was referenced Apr 24, 2026

v26 usability: fix P0 (invalid track) + 4 P1s (CLI noise, bare exceptions, MCP debug) #43

Closed

v26 final cleanup: TF noise, env-setup swallow, P2 polish #45

Merged

This was referenced Apr 29, 2026

chorus.core.base: env-validation timeout silently flips use_environment=True → False, then crashes on import in wrong env #64

Closed

Post-PR-#62 follow-ups: env-setup gating + equivalence-test rewrite (closes #63, #64, #65) #66

Merged

lucapinello wants to merge 1 commit into chorus-applications from audit/2026-04-16-fresh-install-v4
