fix: AlphaGenome JAX Metal guard + macOS fresh-install audit v4#11

Closed
lucapinello wants to merge 1 commit into chorus-applications from
audit/2026-04-16-fresh-install-v4

@lucapinello
Contributor

Summary

  • Bug fix: AlphaGenome's macOS CPU-forcing guard was incomplete — device='cuda:0' bypassed it, causing UNIMPLEMENTED: default_memory_space crashes with jax-metal. Fixed in all 3 code paths (alphagenome.py, load_template.py, predict_template.py).
  • Audit report: Clean-slate macOS fresh-install audit covering all 7 envs, 6 oracle GPU checks, 12 example regenerations, 3 notebook executions, 13 HTML Selenium checks, and a 7-check normalization CDF audit.

Test plan

  • Delete all chorus conda envs + caches, reinstall from zero
  • Verify GPU: Borzoi/SEI/LegNet on MPS, Enformer/ChromBPNet on Metal, AlphaGenome on CPU
  • Regenerate all 12 application examples, diff against committed (all within tolerance)
  • Execute all 3 notebooks (0 errors, 0 stale messages)
  • Selenium check 13 HTML reports (all structural checks pass)
  • Normalization CDF audit: 7 checks × 6 oracles, all pass

🤖 Generated with Claude Code

The macOS CPU-forcing guard in AlphaGenome's load and predict templates
only fired when device was None or started with "cpu". Callers passing
device='cuda:0' (common in scripts) bypassed it, letting jax-metal
initialize and crash with "UNIMPLEMENTED: default_memory_space".

Fix: on Darwin, always force JAX_PLATFORMS=cpu unless the caller
explicitly requests Metal. Applied to all three code paths:
- alphagenome.py:_load_direct (also moved env var before import jax)
- load_template.py
- predict_template.py
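
A minimal sketch of the guard described above, not the code from this PR. The helper name `force_cpu_for_jax_on_macos` is hypothetical. How a caller "explicitly requests Metal" is not stated here, so the sketch assumes a device string starting with `metal`. The environment variable has to be set before `import jax` runs, which is why the helper touches nothing but the environment mapping.

```python
import os
import platform


def force_cpu_for_jax_on_macos(device, system=None, environ=None):
    """Hypothetical helper: pin JAX to CPU on Darwin unless Metal is requested.

    Returns True when JAX_PLATFORMS was forced to "cpu".
    """
    system = platform.system() if system is None else system
    environ = os.environ if environ is None else environ

    # Assumption: an explicit Metal request looks like device="metal" or "metal:0".
    wants_metal = device is not None and str(device).lower().startswith("metal")

    if system == "Darwin" and not wants_metal:
        # Covers device=None, "cpu", and "cuda:0" alike; the old guard missed "cuda:0".
        environ["JAX_PLATFORMS"] = "cpu"
        return True
    return False


# Call this before the first `import jax` in the process.
force_cpu_for_jax_on_macos("cuda:0")
```

Passing `system` and `environ` explicitly makes the guard checkable on any platform without importing JAX.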

Audit report at audits/2026-04-16_macos_fresh_install_audit_v4.md
covers: clean-slate install of all 7 envs, GPU verification (6 oracles),
12 example regenerations with diffs, 3 notebook executions (0 errors),
13 HTML Selenium checks, 7-check normalization CDF audit (all pass).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@lucapinello
Contributor Author

Superseded — branch pre-dates current main-line evolution; its diff would destroy 43,380 lines of work already on chorus-applications. Closing without merge.

@lucapinello deleted the audit/2026-04-16-fresh-install-v4 branch April 22, 2026 18:10
lucapinello added a commit that referenced this pull request Apr 24, 2026
Closes the last three v26 follow-ups before calling v26 done:

P1 #10 (runner.py): set TF_CPP_MIN_LOG_LEVEL=3 via _prepare_env() for
enformer + chrombpnet subprocesses, silencing the 3-5 lines of TF/absl
boot spam per prediction.
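
As a rough illustration of the P1 #10 change, a subprocess environment could be prepared like this. The function below is a hypothetical stand-in for `_prepare_env()`, whose real signature is not shown here. `TF_CPP_MIN_LOG_LEVEL=3` is the standard TensorFlow setting that suppresses info, warning, and error messages from the C++ backend.

```python
import os

# Assumption: these two oracle names identify the TensorFlow-based subprocesses.
TF_ORACLES = {"enformer", "chrombpnet"}


def prepare_env(oracle_name, base_env=None):
    """Return a copy of the environment with TF log spam silenced for TF oracles."""
    env = dict(os.environ if base_env is None else base_env)
    if oracle_name in TF_ORACLES:
        env["TF_CPP_MIN_LOG_LEVEL"] = "3"
    return env


child_env = prepare_env("enformer")
```

The returned mapping would be passed as `env=` to `subprocess.run`, leaving the parent process environment untouched.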

P1 #11 (exceptions.py, base.py): _setup_environment() now records an
actionable message on every failure path (missing env / validation
failure / import error / unexpected) into self._env_setup_error and
still flips use_environment=False. A new _check_env_ready() fires at
the top of predict() / predict_region_replacement() /
predict_region_insertion_at() / predict_variant_effect(): if the user
originally asked for use_environment=True and setup failed, it raises
the new EnvironmentNotReadyError with the recorded hint
(`chorus setup` / `chorus health`) instead of letting the caller hit
a confusing ModuleNotFoundError inside the base env. The legitimate
use_environment=False test/library path still passes through.
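
The gating behaviour can be sketched as follows. `OracleSketch` and its `setup_ok` flag are hypothetical simplifications; only the attribute and method names mentioned in the commit messages (`_env_setup_error`, `_user_asked_for_env`, `_check_env_ready`, `EnvironmentNotReadyError`) come from the source, and the real implementations may differ.

```python
class EnvironmentNotReadyError(RuntimeError):
    """Stand-in for the exception class named in the commit message."""


class OracleSketch:
    def __init__(self, use_environment=True, setup_ok=True):
        # Remember what the user asked for, separately from the effective setting.
        self._user_asked_for_env = use_environment
        self.use_environment = use_environment
        self._env_setup_error = None
        if use_environment and not setup_ok:
            # Record an actionable hint and downgrade, as described above.
            self._env_setup_error = (
                "Oracle environment is not ready; "
                "run `chorus setup` or `chorus health`."
            )
            self.use_environment = False

    def _check_env_ready(self):
        # No-op when the user never asked for an isolated environment.
        if self._user_asked_for_env and self._env_setup_error is not None:
            raise EnvironmentNotReadyError(self._env_setup_error)

    def predict(self, sequence):
        self._check_env_ready()
        return len(sequence)  # placeholder for a real prediction
```

With this shape, `OracleSketch(use_environment=False).predict("ACGT")` passes through, while a failed setup raises the hint-bearing error on the first call instead of a `ModuleNotFoundError` later.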

P2 #14 (result.py): OraclePredictionTrack.interpolate() and
.aggregate() now raise ValueError with the sibling-method pointer on
a bad target resolution, not bare assert (which gets stripped by -O).
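
To illustrate why an explicit `ValueError` is preferable to `assert` (which `python -O` removes), here is a standalone sketch. The validity rule is an assumption: it treats aggregation as valid only when the target resolution is a coarser whole multiple of the current one. The function is not the actual `OraclePredictionTrack.aggregate()`.

```python
def aggregate(values, resolution, target_resolution):
    """Average consecutive bins to reach a coarser resolution (illustrative only)."""
    # An `assert` here would vanish under `python -O`; a raise always runs.
    if target_resolution < resolution or target_resolution % resolution != 0:
        raise ValueError(
            f"aggregate() needs a target resolution that is a multiple of "
            f"{resolution}, got {target_resolution}; "
            f"use interpolate() for a finer resolution."
        )
    factor = target_resolution // resolution
    chunks = [values[i:i + factor] for i in range(0, len(values), factor)]
    return [sum(chunk) / len(chunk) for chunk in chunks]
```

The error message names the sibling method so that the caller knows which call to switch to.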

P2 #21 (cli/main.py): --verbose on health / list / genome now sets
the root logger to DEBUG so EnvironmentManager / GenomeManager /
oracle debug output actually surfaces. Previously the flag only gated
a few extra print lines in the command handler.
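
A minimal sketch of wiring a `--verbose` flag to the root logger, using only the standard library. The parser layout and the `configure_logging` helper are hypothetical; the actual `cli/main.py` structure is not shown here.

```python
import argparse
import logging


def configure_logging(verbose):
    """Set the root logger level so module-level debug output can surface."""
    level = logging.DEBUG if verbose else logging.INFO
    logging.getLogger().setLevel(level)
    return level


parser = argparse.ArgumentParser(prog="chorus-sketch")
parser.add_argument("--verbose", action="store_true")
args = parser.parse_args(["--verbose"])  # explicit argv for the example

configure_logging(args.verbose)
```

Child loggers that have no level of their own inherit the root level, which is why setting it once at the CLI entry point is enough for manager and oracle loggers.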

P2 #22 (runner.py): subprocess.TimeoutExpired in
run_code_in_environment() and run_in_environment() is caught and
re-raised as RuntimeError with a pointer to CHORUS_NO_TIMEOUT=1,
replacing the bare TimeoutExpired with truncated stderr.
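
The timeout handling might look roughly like this. `run_with_timeout_hint` is a hypothetical helper, not `run_code_in_environment()` itself, and the wording of the hint is an assumption about what `CHORUS_NO_TIMEOUT=1` does.

```python
import subprocess
import sys


def run_with_timeout_hint(cmd, timeout):
    """Run a command and turn a timeout into a RuntimeError with a remedy."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired as exc:
        # Chain the original exception so the traceback keeps the details.
        raise RuntimeError(
            f"Command timed out after {timeout} s. "
            "Set CHORUS_NO_TIMEOUT=1 to run without a timeout."
        ) from exc


result = run_with_timeout_hint([sys.executable, "-c", "print('ok')"], timeout=60)
```

Using `raise ... from exc` keeps the original `TimeoutExpired` available as `__cause__` for debugging.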

Tests: 340 passed, 1 skipped on fast suite (no smoke).

Co-authored-by: lp698 <lp698@dimm2fv07n65x.partners.org>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lucapinello added a commit that referenced this pull request Apr 29, 2026
…loses #63, #64, #65) (#66)

* fix(env-setup): timeout-soft policy + plug load_pretrained_model gate (closes #64)

Two related fixes for issue #64:

1. chorus/core/base.py:_setup_environment now distinguishes transient
   validation timeouts (slow NFS / cold-cache `import jax` probes) from
   genuine missing-dep failures. validate_environment returns issues
   prefixed with "Timeout while checking dependency" for the timeout
   case and "Missing dependency" / "Error checking" for real failures.

   Previously, BOTH paths set use_environment=False and recorded an
   _env_setup_error, leading to oracle.load_pretrained_model() falling
   through to _load_direct in the wrong env on cold-NFS lab boxes.

   New policy:
   - Timeout-only failures: log a warning, KEEP use_environment=True
     (don't set _env_setup_error). The actual subprocess invocation has
     its own per-call timeout and will surface a real error if the env
     is genuinely broken.
   - Genuine missing-dep failures: keep the v26 P1 #11 invariant —
     downgrade + raise EnvironmentNotReadyError on next user call.

   The conjunctive 'all timeouts' check means a mixed timeout+missing-dep
   issue list is treated as a genuine failure (the missing-dep is the
   real signal).

2. Each oracle's load_pretrained_model now calls self._check_env_ready()
   as its first action. predict() already does this (base.py:215);
   load_pretrained_model didn't, so a silent downgrade still surfaced
   as ModuleNotFoundError from _load_direct instead of the intended
   EnvironmentNotReadyError. 7 oracles, 1-line edit each:

   - alphagenome.py
   - alphagenome_pt.py
   - borzoi.py
   - chrombpnet.py
   - sei.py
   - legnet.py
   - enformer.py

   _check_env_ready is a no-op when _user_asked_for_env=False
   (base.py:180), so this is safe for tests that pass
   use_environment=False.

3. tests/test_environment_setup_gating.py — 4 fast-suite tests pinning:
   - Timeout-only: use_environment stays True, _env_setup_error stays None
   - Missing-dep: downgrade + EnvironmentNotReadyError on load_pretrained_model
   - Mixed timeouts + missing-dep: treated as genuine failure
   - use_environment=False: validation path never runs, _check_env_ready
     never raises

   Tests monkeypatch EnvironmentManager.validate_environment so they
   don't need a real conda env or GPU. Run in <2 s.
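
The timeout-soft policy above reduces to a small classification over the issue strings. This is a sketch under the assumption that issues arrive as a list of strings carrying the prefixes quoted in item 1; the function name and return labels are hypothetical.

```python
TIMEOUT_PREFIX = "Timeout while checking dependency"


def classify_validation_issues(issues):
    """Classify validate_environment-style issue strings.

    Returns "ok", "timeout_only" (keep use_environment=True, just warn),
    or "genuine_failure" (downgrade and record an error).
    """
    if not issues:
        return "ok"
    # Conjunctive check: every issue must be a timeout to qualify as transient.
    if all(issue.startswith(TIMEOUT_PREFIX) for issue in issues):
        return "timeout_only"
    return "genuine_failure"


mixed = [
    "Timeout while checking dependency: jax",
    "Missing dependency: torch",
]
print(classify_validation_issues(mixed))
```

A mixed list falls into the genuine-failure branch because a single missing dependency is enough to make the environment unusable.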

Audit-script workaround at audits/2026-04-29_alphagenome_pt_stress_test/
(o.use_environment = True; o._env_setup_error = None) is no longer
needed after this fix. Linux/CUDA spot-check should confirm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(alphagenome): rewrite equivalence test via chorus API (closes #63, #65)

The pre-rewrite test compared JAX and PyTorch outputs at the *raw model
head* and asserted shape parity. JAX's DNase head exposes 305 tracks
and the PyTorch port's exposes 384 — different upstream filtering
choices, not different weights. The shape assertion was bound to fail
on any platform; it just didn't surface until the Linux/CUDA spot-check
on `audit/linux-cuda-pr62`.

Rewrite uses oracle.predict() which goes through chorus's local_index
slicing in chorus/oracles/alphagenome.py:_predict — that selects the
user-requested tracks by identifier from the shared 5,731-track
metadata cache (alphagenome_tracks.json). Post-slicing arrays are
shape-compatible across backends regardless of raw head shape.
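
The reason post-slicing shapes match can be shown with a toy example. This is an illustrative sketch in plain Python, not the `local_index` code in `_predict`; the helper name and the tiny track lists are made up for the demonstration.

```python
def select_tracks(head_output, head_track_ids, requested_ids):
    """Pick columns by track identifier from a (positions x tracks) table."""
    index = {track_id: i for i, track_id in enumerate(head_track_ids)}
    columns = [index[track_id] for track_id in requested_ids]
    return [[row[c] for c in columns] for row in head_output]


# Two backends exposing different numbers of tracks in different orders.
backend_a_ids = ["t1", "t2", "t3"]
backend_b_ids = ["t3", "tX", "t1", "tY", "t2"]
backend_a_out = [[0.1, 0.2, 0.3], [1.1, 1.2, 1.3]]
backend_b_out = [[0.3, 9.0, 0.1, 9.0, 0.2], [1.3, 9.0, 1.1, 9.0, 1.2]]

requested = ["t1", "t3"]
a = select_tracks(backend_a_out, backend_a_ids, requested)
b = select_tracks(backend_b_out, backend_b_ids, requested)
```

Both `a` and `b` end up as 2 x 2 tables in the requested order, even though the raw heads are 3 and 5 tracks wide.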

Concrete changes:

- Drop subprocess-driven JAX_RUNNER + PT_RUNNER heredocs (~70 lines)
  and the bare subprocess.run(["mamba", ...]) calls in _env_exists()
  and _run() (closes #65). The test now uses chorus's own oracle API
  which spawns into per-oracle envs via EnvironmentRunner.run_code_in_environment
  — the canonical path that resolves mamba/conda binaries via
  EnvironmentManager's MAMBA_EXE / shutil.which / fallback chain.
- Use EnvironmentManager.environment_exists(oracle) for skip detection
  instead of rolling our own `mamba env list --json` shellout.
- Hardcode three canonical DNase identifiers (K562 EFO:0002067,
  HepG2 EFO:0001187, hepatocyte CL:0000182), verified against
  alphagenome_metadata.get_metadata().search_tracks() at write time.
  Same set as v30 macOS Tier 1 audit so equivalence numbers stay
  comparable across audits.
- Tolerances unchanged: max abs diff < 0.1, mean rel diff < 5%.
  Both M3 Ultra and A100 audits reported 0.74–1.85% per-track rel
  error, well inside this bound.
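
For reference, the two tolerances could be checked as below. The exact definition of relative difference used by the test is not given here, so this sketch assumes an element-wise `|a - b| / (|a| + eps)` averaged over all elements; the helper name is hypothetical.

```python
def within_tolerance(reference, candidate, max_abs=0.1, mean_rel=0.05, eps=1e-8):
    """Check max absolute difference and mean relative difference bounds."""
    abs_diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    # Assumption: relative error is measured against the reference values.
    rel_diffs = [d / (abs(r) + eps) for d, r in zip(abs_diffs, reference)]
    return max(abs_diffs) < max_abs and sum(rel_diffs) / len(rel_diffs) < mean_rel


jax_like = [1.0, 2.0, 4.0]
pt_like = [1.01, 2.02, 3.96]
print(within_tolerance(jax_like, pt_like))
```

Here every element differs by about 1%, so both bounds hold; a single element off by 0.2 would fail the absolute bound regardless of the mean.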

Test stays @pytest.mark.integration; skips cleanly when either env
or hg38.fa is missing. Run on Linux/CUDA via:

    mamba run -n chorus pytest tests/test_alphagenome_backends_equivalence.py -m integration -v

— no longer fails with FileNotFoundError on bare `mamba` (the
inner subprocess goes through chorus's resolver, not the test's
own subprocess.run). #64 fix in the previous commit removes the
need for the audit-script workaround
(o.use_environment = True; o._env_setup_error = None).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: lp698 <lp698@dimm2fv07n65x.partners.org>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>