borzoi_v1 with basic prediction and examples #1
Closed
cosmoss8274 wants to merge 1 commit into pinellolab:main
Conversation
lucapinello pushed a commit that referenced this pull request on Apr 15, 2026:
Addresses every actionable item in audits/2026-04-14_macos_arm64.md.
All changes are platform-conditional — Linux CUDA paths are unchanged.
PyTorch oracles (borzoi, sei, legnet) — auto-detect MPS on Apple Silicon
- Both the in-process loader (chorus/oracles/{borzoi,sei,legnet}.py) and
the subprocess templates ({borzoi,sei,legnet}_source/templates/{load,
predict}_template.py) now resolve `device is None` (or the new 'auto'
sentinel) as: cuda > mps > cpu. Linux + CUDA box hits the cuda branch
first, no behavior change there.
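The cuda > mps > cpu resolution order described above can be sketched as follows. This is an illustrative sketch, not the actual loader code; `resolve_device` is a hypothetical name:

```python
import torch

def resolve_device(device=None):
    """Resolve a user-supplied device spec to a concrete torch device.

    None (or an 'auto' sentinel) falls through cuda > mps > cpu, so a
    Linux + CUDA box still takes the cuda branch first and sees no change.
    """
    if device not in (None, "auto"):
        return torch.device(device)  # an explicit override always wins
    if torch.cuda.is_available():
        return torch.device("cuda")
    # guard with getattr: older torch builds lack the mps backend entirely
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```

An explicit `device="cpu"` still bypasses auto-detection, which is what the `device=` examples in the README rely on.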
- SEI: removed the hard-coded device pin while keeping `map_location='cpu'`
for the initial load (weights land in host memory before `.to(device)`,
the standard pattern across torch versions, and it works for MPS too).
- Sei BSplineTransformation lazily moved its spline matrix only when
`input.is_cuda`. Generalized to any non-CPU device so the matmul works
on MPS as well. Verified: 286/286 pytest still pass.
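The bug class fixed here generalizes beyond Sei: a module that lazily moves a cached tensor only `if input.is_cuda` never fires on MPS. A minimal sketch of the generalized pattern (not Sei's actual class):

```python
import torch

class LazyDeviceBuffer(torch.nn.Module):
    """Illustrative sketch: a cached matrix that follows the input's device.

    Old pattern: `if input.is_cuda: self.m = self.m.cuda()` -- never true on MPS.
    Generalized: move whenever the input lives on any non-CPU device.
    """
    def __init__(self, matrix):
        super().__init__()
        self.m = matrix

    def forward(self, x):
        if x.device.type != "cpu" and self.m.device != x.device:
            self.m = self.m.to(x.device)  # one-time lazy move to mps/cuda/etc.
        return x @ self.m
```

On CPU the move never triggers, so CPU-only behavior is unchanged.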
TensorFlow oracles (chrombpnet, enformer) — Metal backend on Apple Silicon
- chorus/core/platform.py macos_arm64 adapter now adds
`tensorflow-metal>=1.1.0` to pip_add. Once installed, Apple's plugin
registers a 'GPU' physical device, so the oracles' existing
tf.config.list_physical_devices('GPU') auto-detect picks it up with no
code change. Linux paths don't see the macos_arm64 adapter so CUDA stays
intact.
JAX oracle (alphagenome) — unchanged
- Already explicitly skips Metal in auto-detect (jax-metal still missing
`default_memory_space` for AlphaGenome). README updated to document
this trade-off.
MCP fix — fine_map_causal_variant rsID-only crash
- Calling `fine_map_causal_variant(lead_variant="rs12740374")` previously
raised KeyError: 'chrom' at chorus/analysis/causal.py:355 because
`_parse_lead_variant("rs12740374")` returns {"id": ...} only.
- Backfill chrom/pos/ref/alt onto the sentinel from the LDlink response
(which always carries them) before invoking prioritize_causal_variants.
- Verified end-to-end: rs12740374 ranked #1 with composite=1.000 of 12 LD
variants on AlphaGenome (matches the published Musunuru-2010 finding).
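The backfill step can be sketched as below. This is a hypothetical illustration of the described fix (function name and record fields are assumptions, with dummy coordinates), not the code in causal.py:

```python
def backfill_lead_variant(lead, ld_records):
    """If the lead variant was given as a bare rsID ({"id": ...} only),
    copy chrom/pos/ref/alt from the matching LD record -- the LD response
    always carries coordinates -- before downstream code indexes lead['chrom'].
    """
    if "chrom" in lead:
        return lead  # already fully specified, nothing to do
    match = next(r for r in ld_records if r["id"] == lead["id"])
    for key in ("chrom", "pos", "ref", "alt"):
        lead.setdefault(key, match[key])
    return lead
```

With this in place, an rsID-only call no longer raises KeyError downstream.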
SEI Zenodo download — chunked + resume + single-flight lock
- Replaced urllib.request.urlretrieve with a stdlib chunked urlopen loop
that supports HTTP Range resume and an fcntl exclusive lock so two
concurrent SeiOracle inits don't race the same partial file. Original
observed throughput on macOS was ~80 KB/s (would take ~11 hours for the
3.2 GB tar); the new path resumes interrupted downloads and progress-
logs every 100 MB.
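The approach can be sketched with the stdlib alone. A hedged sketch of the described technique (chunked loop + Range resume + fcntl lock), not the shipped helper:

```python
import fcntl
import os
import urllib.request

def download_with_resume(url, dest, chunk=1 << 20):
    """Chunked download with HTTP Range resume and an fcntl exclusive lock
    so two concurrent callers don't race the same partial file."""
    lock_path = dest + ".lock"
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # single-flight: second caller blocks here
        offset = os.path.getsize(dest) if os.path.exists(dest) else 0
        req = urllib.request.Request(url)
        if offset:
            req.add_header("Range", f"bytes={offset}-")  # resume the partial file
        with urllib.request.urlopen(req) as resp, open(dest, "ab") as out:
            while True:
                block = resp.read(chunk)
                if not block:
                    break
                out.write(block)
                offset += len(block)
    return offset
```

A real implementation would also verify the server honored the Range request (HTTP 206) before appending, and log progress periodically.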
README — macOS troubleshooting + Apple GPU policy table + kernel install
- Documented the two-mamba-installs MAMBA_ROOT_PREFIX gotcha that breaks
`chorus health` when the new chorus env lands in a different mamba root
than the per-oracle envs.
- Added the per-oracle macOS GPU support matrix (MPS / Metal / CPU) with
explicit `device=` examples.
- Added the missing `python -m ipykernel install --user --name chorus`
step to Fresh Install so examples/*.ipynb find the chorus kernel.
Validation on macOS 15.7.4 / Apple Silicon (CPU + MPS + Metal):
- 286/286 pytest pass (incl. all 6 oracle smoke-predict tests)
- chorus.create_oracle('borzoi') auto-detects mps:0
- chorus.create_oracle('sei') auto-detects mps:0 + smoke-predict ok
- chrombpnet env now reports tf.config.list_physical_devices('GPU') = [GPU:0]
- fine_map_causal_variant(lead_variant='rs12740374') ranks rs12740374
composite=1.000 of 12 LD variants
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
lucapinello pushed a commit that referenced this pull request on Apr 15, 2026:
…ixes

Addresses the findings in audits/2026-04-16_application_and_normalization_audit.md (PR #9). Four categories of change:

1. Delete two example applications the audit recommends removing:
   - examples/applications/variant_analysis/TERT_promoter/
     C228T is a melanoma-specific gain-of-function mutation; the example runs it in K562 (erythroleukemia) and shows all-negative effects. The biology is correct for the model but inverts the published direction. Rather than add a "wrong cell type" caveat, drop the example — SORT1 / FTO / BCL11A cover variant_analysis without teaching the reader a misleading result.
   - examples/applications/validation/HBG2_HPFH/
     Already self-documented as "Not reproduced" in validation/README.md: BCL11A / ZBTB7A aren't in AlphaGenome's track catalog, so the repressor-loss mechanism isn't visible. Keeping a "validation failed" example alongside the working SORT1_rs12740374_with_CEBP confuses readers. Drop it.
   Also updated: root README.md (replaces the HBG2_HPFH link with SORT1_rs12740374_with_CEBP), examples/applications/variant_analysis/README.md (drops the TERT prompt + section), examples/applications/validation/README.md (drops the HBG2 row + section + reproduce snippet), and scripts/regenerate_examples.py + scripts/internal/inject_analysis_request.py (both lose their TERT_promoter/HBG2_HPFH entries).

2. Normalizer: guard against zero-count CDF rows (chorus/analysis/normalization.py).
   Audit finding #1 (HIGH): the committed chrombpnet_pertrack.npz has DNASE:hindbrain with effect_counts[idx] == 0 and a zero-filled CDF row. effect_percentile() / activity_percentile() silently returned 1.0 for every raw_score (including 0.0) because np.searchsorted on a zeros row returns len(row) for any non-negative probe and the denominator falls through to cdf_width. Same bug class as the v2 chrombpnet concurrent-download race that landed in PR #8 — the hindbrain ENCODE tar must have failed to extract cleanly during the original background build.
   A new private helper _has_samples() returns False when counts[idx] == 0, which makes _lookup / _lookup_batch return None. Callers already render None as "—" in MD/HTML tables, so users now see "no background" instead of a silent false "100th percentile". Counts-less NPZs (older format, no counts field) are treated as valid — no regression.

3. Report: suppress quantile_score when raw_score is in the noise floor (chorus/analysis/variant_report.py).
   Audit finding #6 (LOW): when |raw_score| < 1e-3 the effect CDF is so densely clustered around 0 that a 1–2% raw-score drift can swing the quantile by 0.5+ (observed in the Phase A rerun: committed quantile=1.0 vs rerun=0.21 for a CEBPB track with raw_score ~1e-4). Set quantile_score = None in that regime so the HTML/MD tables render "—" and readers don't misread noise as signal. The threshold was chosen conservatively to cover both log2fc (pc=1.0) and logfc RNA (pc=0.001) without hiding real effects.

4. IGV.js: lazy-download the bundle into ~/.chorus/lib on first use (chorus/analysis/_igv_report.py + chorus/analysis/causal.py).
   Audit finding #2 (MEDIUM): reports embed a <script src="..."> to cdn.jsdelivr.net that gets evaluated every time the HTML is opened in a browser. Any viewer on an airgapped network, behind a corporate proxy that MITMs TLS, or during a jsdelivr outage sees IGV silently fail (2/19 audit reports hit ERR_CERT_AUTHORITY_INVALID). The local-cache code path already existed but was opt-in (the user had to drop a file in ~/.chorus/lib/igv.min.js manually). A new _ensure_igv_local() helper runs on the first report generation and populates the cache via chorus.utils.http.download_with_resume (the helper that landed in v2 PR #8). Reports written after the first successful download inline the JS directly — self-contained HTML that opens anywhere without network. Download failure is logged at WARNING and the CDN <script> tag is used as a fallback, preserving the current behaviour for anyone who can't reach jsdelivr at generation time.

All changes are platform-agnostic; 287/287 pytest continue to pass; the fixes were verified behaviourally:

>>> norm.effect_percentile('chrombpnet', 'DNASE:hindbrain', 0.0)
None   # was: 1.0
>>> norm.effect_percentile('chrombpnet', 'DNASE:HepG2', 0.0)
0.0    # unchanged
>>> ts = TrackScore(raw_score=0.0005, ...)
>>> _apply_normalization(ts, ...); ts.quantile_score
None   # noise floor

See audits/2026-04-16_application_and_normalization_audit.md (PR #9) for full context, per-app screenshots, and the Phase A / B / C methodology behind each finding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
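The searchsorted failure mode on a zero-filled CDF row can be demonstrated directly in plain numpy, along with a sketch of the guard. Helper and function names here mirror the commit message but are illustrative, not the normalization.py internals:

```python
import numpy as np

# A zero-filled CDF row: searchsorted returns len(row) for any non-negative
# probe, which downstream code turned into a false 100th percentile.
zero_cdf = np.zeros(100)
assert np.searchsorted(zero_cdf, 0.0, side="right") == 100

def _has_samples(counts, idx):
    """Guard sketch: a track with zero background samples has no valid CDF."""
    return counts[idx] > 0

def effect_percentile(cdf_rows, counts, idx, raw_score):
    """Simplified lookup: return None (rendered as an em-dash in tables)
    instead of a silently wrong percentile when the background is empty."""
    if not _has_samples(counts, idx):
        return None
    row = cdf_rows[idx]
    return np.searchsorted(row, raw_score, side="right") / len(row)
```

The key property is that the empty-background case is now distinguishable from a genuine top-percentile score.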
lucapinello added a commit that referenced this pull request on Apr 17, 2026:
Two scary-looking warnings surfaced while reading notebook cell outputs in the v7 audit. Neither is a real problem, but both alarm users:

1. chorus/core/base.py:323 — case-sensitive compare of reference allele vs genome. pyfaidx returns lowercase for softmasked (repetitive) regions; users always pass uppercase. The previous code fired 'Provided reference allele is not the same as the genome reference' on every variant in a softmasked locus (e.g. the GATA1 TSS in quickstart notebook cell 39, comprehensive notebook cells 35 and 51). Now uses .upper() on both sides; the warning message also includes the actual allele pair so users can confirm.

2. chorus/core/result.py:104 — the 'Unknown implementation' warning fired for every Sei track (Stem cell / Multi-tissue / H3K4me3 etc.) that isn't in the hardcoded assay_type registry. The generic fallback works correctly; the warning was just noise. Downgraded to logger.debug.

Scientific review of outputs:
- SORT1 rs12740374: predictions match the Musunuru 2010 mechanism (CEBPA/B binding gain, DNASE opening, H3K27ac gain, CAGE TSS increase) ✓
- BCL11A rs1427407: TAL1 binding loss + DNASE closing in K562 ✓
- FTO rs1421085: minimal effects in HepG2 (expected — adipose tissue) ✓
- TERT chr5:1295046 T>G: E2F1 binding gain + TERT TSS CAGE increase ✓
- SORT1 causal: rs12740374 ranks #1, composite=0.964 ✓

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
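The case-insensitive allele check in warning 1 can be sketched as follows. An illustrative sketch, not the exact base.py code:

```python
import logging

logger = logging.getLogger("chorus.sketch")

def check_reference_allele(provided, genome_seq):
    """pyfaidx returns lowercase bases for softmasked (repetitive) regions,
    so compare case-insensitively; include both alleles in the warning so
    users can confirm a genuine mismatch at a glance."""
    if provided.upper() != genome_seq.upper():
        logger.warning(
            "Provided reference allele %r does not match genome reference %r",
            provided, genome_seq,
        )
        return False
    return True
```

With this, an uppercase user allele against a softmasked lowercase genome base no longer fires a spurious warning.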
lucapinello pushed a commit that referenced this pull request on Apr 17, 2026:
Fresh-install audit at e99fd66 verifying all 4 v10 fixes on a truly clean slate. Teardown: 14.2 GB, including tfhub_modules/ this time.

All 4 v10 fixes verified live:
- Fix #1 (tfhub recovery): code path exists + first-install smoke passes on a wiped tfhub cache.
- Fix #2 (IGV HF fallback): 0/16 HTMLs fell back to CDN on the same SSL-MITM network that had 6/16 fallbacks in v10.
- Fix #3 (FTO README): accurate HepG2 framing + adipose assay_ids block for the ideal run.
- Fix #4 (bgzip PATH): 0 'bgzip is not installed' lines across 235 notebook cells (v10 had 20/34/60 per notebook).

One minor regression exposed: Fix #4 makes tabix findable, which reveals a pre-existing bug where download_gencode leaves a stale .tbi file that coolbox's `tabix -p gff` rejects with "index file exists". Workaround = delete the .tbi; the NB1 retry succeeded. A proposed 3-line follow-up fix to annotations.py is documented in the report.

Also verified:
- 308/308 pytest on a fresh env (17.3 s)
- 6/6 oracle smoke (7 min 2 s) — first Enformer fresh install with a wiped tfhub cache
- 12/12 regen within AlphaGenome CPU non-determinism tolerance
- 0 orphan HTMLs after parallel regen
- 3 notebooks: 0 errors, 0 warnings, 0 bgzip spam
- 16/16 HTMLs clean in Selenium
- FTO README spot-check confirms Fix #3 committed correctly

After 11 audit passes, the last two have surfaced no actual chorus bugs — only environmental quirks (tfhub cache, SSL MITM, PATH inheritance, stale .tbi).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Borzoi Oracle Implementation (v1)
Key Features:
- Basic Borzoi model prediction functionality, demonstrated in the example notebook
- Compatible with the Enformer-style API and interface
- Supports use_environment=True (creates a new environment) and use_environment=False (runs directly in the existing one)
Implementation notes:
Shares core functionality with the base Oracle class
Comprehensive example notebook included