borzoi_v1 with basic prediction and examples by cosmoss8274 · Pull Request #1 · pinellolab/chorus

cosmoss8274 · 2025-06-06T17:27:20Z

Borzoi Oracle Implementation (v1)

Key Features:
- Basic Borzoi model prediction functionality, demonstrated in the example notebook
- Compatible with the Enformer-style API and interface
- Supports use_environment=True (runs in a new environment) and use_environment=False (uses the existing environment directly)

Implementation notes:
- Shares core functionality with the base Oracle class
- Includes a comprehensive example notebook

Addresses every actionable item in audits/2026-04-14_macos_arm64.md. All changes are platform-conditional; Linux CUDA paths are unchanged.

PyTorch oracles (borzoi, sei, legnet): auto-detect MPS on Apple Silicon
- Both the in-process loader (chorus/oracles/{borzoi,sei,legnet}.py) and the subprocess templates ({borzoi,sei,legnet}_source/templates/{load, predict}_template.py) now resolve `device is None` (or the new 'auto' sentinel) as: cuda > mps > cpu. A Linux + CUDA box hits the cuda branch first, so there is no behavior change there.
- SEI: replaced the hard `map_location='cpu'` device pin (the value is still used to load weights to host memory before .to(device), which is the standard pattern across torch versions and works for mps too).
- Sei BSplineTransformation lazily moved its spline matrix only when `input.is_cuda`. Generalized to any non-CPU device so the matmul works on MPS as well.
Verified: 286/286 pytest still pass.

TensorFlow oracles (chrombpnet, enformer): Metal backend on Apple Silicon
- chorus/core/platform.py macos_arm64 adapter now adds `tensorflow-metal>=1.1.0` to pip_add. Once installed, Apple's plugin registers a 'GPU' physical device, so the oracles' existing tf.config.list_physical_devices('GPU') auto-detect picks it up with no code change. Linux paths don't see the macos_arm64 adapter, so CUDA stays intact.

JAX oracle (alphagenome): unchanged
- Already explicitly skips Metal in auto-detect (jax-metal still missing `default_memory_space` for AlphaGenome). README updated to document this trade-off.

MCP fix: fine_map_causal_variant rsID-only crash
- Calling `fine_map_causal_variant(lead_variant="rs12740374")` previously raised KeyError: 'chrom' at chorus/analysis/causal.py:355 because `_parse_lead_variant("rs12740374")` returns {"id": ...} only.
- Backfill chrom/pos/ref/alt onto the sentinel from the LDlink response (which always carries them) before invoking prioritize_causal_variants.
- Verified end-to-end: rs12740374 ranked #1 with composite=1.000 of 12 LD variants on AlphaGenome (matches the published Musunuru-2010 finding).

SEI Zenodo download: chunked + resume + single-flight lock
- Replaced urllib.request.urlretrieve with a stdlib chunked urlopen loop that supports HTTP Range resume and an fcntl exclusive lock so two concurrent SeiOracle inits don't race the same partial file. Original observed throughput on macOS was ~80 KB/s (would take ~11 hours for the 3.2 GB tar); the new path resumes interrupted downloads and logs progress every 100 MB.

README: macOS troubleshooting + Apple GPU policy table + kernel install
- Documented the two-mamba-installs MAMBA_ROOT_PREFIX gotcha that breaks `chorus health` when the new chorus env lands in a different mamba root than the per-oracle envs.
- Added the per-oracle macOS GPU support matrix (MPS / Metal / CPU) with explicit `device=` examples.
- Added the missing `python -m ipykernel install --user --name chorus` step to Fresh Install so examples/*.ipynb find the chorus kernel.

Validation on macOS 15.7.4 / Apple Silicon (CPU + MPS + Metal):
- 286/286 pytest pass (incl. all 6 oracle smoke-predict tests)
- chorus.create_oracle('borzoi') auto-detects mps:0
- chorus.create_oracle('sei') auto-detects mps:0 + smoke-predict ok
- chrombpnet env now reports tf.config.list_physical_devices('GPU') = [GPU:0]
- fine_map_causal_variant(lead_variant='rs12740374') ranks rs12740374 composite=1.000 of 12 LD variants

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
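
A minimal sketch of the cuda > mps > cpu resolution order described in the PyTorch section of this commit message. The function names `resolve_device` and `detect_and_resolve` are hypothetical and not taken from the chorus code base; only the priority order and the None / 'auto' sentinels come from the text. The availability flags are passed in explicitly so the ordering can be checked without a GPU.

```python
from typing import Optional


def resolve_device(requested: Optional[str], cuda_available: bool, mps_available: bool) -> str:
    """Return a device string using the cuda > mps > cpu priority."""
    # An explicit request (anything other than None or 'auto') is passed through.
    if requested not in (None, "auto"):
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_and_resolve(requested: Optional[str] = None) -> str:
    """Query PyTorch for available backends; fall back to CPU if torch is absent."""
    try:
        import torch
    except ImportError:
        return resolve_device(requested, False, False)
    # Older torch builds may not expose torch.backends.mps at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return resolve_device(requested, torch.cuda.is_available(), mps_ok)
```

Because cuda is checked first, a machine with CUDA resolves exactly as before, which is consistent with the "no behavior change" claim for Linux.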

…ixes

Addresses the findings in audits/2026-04-16_application_and_normalization_audit.md (PR #9). Four categories of change:

1. Delete two example applications the audit recommends removing:
   - examples/applications/variant_analysis/TERT_promoter/
     C228T is a melanoma-specific gain-of-function mutation; the example runs it in K562 (erythroleukemia) and shows all-negative effects. The biology is correct for the model but inverts the published direction. Rather than add a "wrong cell type" caveat, drop the example: SORT1 / FTO / BCL11A cover variant_analysis without teaching the reader a misleading result.
   - examples/applications/validation/HBG2_HPFH/
     Already self-documented as "Not reproduced" in validation/README.md: BCL11A / ZBTB7A aren't in AlphaGenome's track catalog, so the repressor-loss mechanism isn't visible. Keeping a "validation failed" example alongside the working SORT1_rs12740374_with_CEBP confuses readers. Drop it.
   Also updated: root README.md (replaces HBG2_HPFH link with SORT1_rs12740374_with_CEBP), examples/applications/variant_analysis/README.md (drops TERT prompt + section), examples/applications/validation/README.md (drops HBG2 row + section + reproduce snippet), scripts/regenerate_examples.py + scripts/internal/inject_analysis_request.py (both lose their TERT_promoter/HBG2_HPFH entries).

2. Normalizer: guard against zero-count CDF rows (chorus/analysis/normalization.py).
   Audit finding #1 (HIGH): the committed chrombpnet_pertrack.npz has DNASE:hindbrain with effect_counts[idx] == 0 and a zero-filled CDF row. effect_percentile() / activity_percentile() silently returned 1.0 for every raw_score (including 0.0) because np.searchsorted on a zeros row returns len(row) for any non-negative probe and the denominator falls through to cdf_width. Same bug class as the v2 chrombpnet concurrent-download race that landed in PR #8: the hindbrain ENCODE tar must have failed to extract cleanly during the original background build.
   New private helper _has_samples() returns False when counts[idx] == 0, which makes _lookup / _lookup_batch return None. Callers already render None as "—" in MD/HTML tables, so users now see "no background" instead of a silent false "100th percentile". Counts-less NPZs (older format, no counts field) are treated as valid, so there is no regression.

3. Report: suppress quantile_score when raw_score is in the noise floor (chorus/analysis/variant_report.py).
   Audit finding #6 (LOW): when |raw_score| < 1e-3 the effect CDF is so densely clustered around 0 that a 1-2% raw-score drift can swing the quantile by 0.5+ (observed in the Phase A rerun: committed quantile=1.0 vs rerun=0.21 for a CEBPB track with raw_score ~1e-4). Set quantile_score = None in that regime so the HTML/MD tables render "—" and readers don't misread noise as signal. Threshold chosen conservatively to cover both log2fc (pc=1.0) and logfc RNA (pc=0.001) without hiding real effects.

4. IGV.js: lazy-download the bundle into ~/.chorus/lib on first use (chorus/analysis/_igv_report.py + chorus/analysis/causal.py).
   Audit finding #2 (MEDIUM): reports embed a <script src="..."> to cdn.jsdelivr.net that gets evaluated every time the HTML is opened in a browser. Any viewer on an airgapped network, behind a corporate proxy that MITMs TLS, or hitting a jsdelivr outage sees IGV silently fail (2/19 audit reports hit ERR_CERT_AUTHORITY_INVALID). The local-cache code path already existed but was opt-in (user had to drop a file in ~/.chorus/lib/igv.min.js manually).
   New _ensure_igv_local() helper runs on the first report generation and populates the cache via chorus.utils.http.download_with_resume (the helper that landed in v2 PR #8). Reports written after the first successful download inline the JS directly, giving self-contained HTML that opens anywhere without network. Download failure is logged at WARNING and the CDN <script> tag is used as fallback, preserving the current behaviour for anyone who can't reach jsdelivr at generation time.
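
The inline-or-CDN fallback described for IGV.js can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in chorus/analysis/_igv_report.py: the function name `igv_script_tag`, the injected `download` callable (standing in for a resumable downloader), and the placeholder URL in the usage note are all hypothetical.

```python
import logging
from pathlib import Path
from typing import Callable

logger = logging.getLogger(__name__)


def igv_script_tag(cache_file: Path, cdn_url: str,
                   download: Callable[[str, Path], None]) -> str:
    """Return an inline <script> when a local bundle exists, else a CDN tag."""
    if not cache_file.exists():
        try:
            # First use: populate the local cache.
            cache_file.parent.mkdir(parents=True, exist_ok=True)
            download(cdn_url, cache_file)
        except Exception as exc:  # network, TLS, or filesystem failure
            logger.warning("IGV bundle download failed (%s); using CDN tag", exc)
            return f'<script src="{cdn_url}"></script>'
    # Cache hit (or fresh download): embed the JS so the HTML is self-contained.
    bundle = cache_file.read_text(encoding="utf-8")
    return f"<script>{bundle}</script>"
```

A caller would pass something like `Path.home() / ".chorus" / "lib" / "igv.min.js"` as `cache_file`. Once the cache is populated, later calls never touch the network, which matches the behaviour described for reports written after the first successful download.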
All changes are platform-agnostic; 287/287 pytest continue to pass; fix verified behaviourally:

    >>> norm.effect_percentile('chrombpnet', 'DNASE:hindbrain', 0.0)
    None   # was: 1.0
    >>> norm.effect_percentile('chrombpnet', 'DNASE:HepG2', 0.0)
    0.0    # unchanged
    >>> ts = TrackScore(raw_score=0.0005, ...)
    >>> _apply_normalization(ts, ...); ts.quantile_score
    None   # noise floor

See audits/2026-04-16_application_and_normalization_audit.md (PR #9) for full context, per-app screenshots, and the Phase A / B / C methodology behind each finding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
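
The two normalizer guards in this commit (zero-count CDF rows and the noise floor) can be illustrated with a small stand-alone sketch. It uses the standard library's `bisect_right` as a stand-in for the array lookup, so it does not reproduce the actual percentile arithmetic in chorus/analysis/normalization.py. The names `percentile` and `quantile_or_none` are hypothetical; the 1e-3 threshold is the one quoted in the commit message.

```python
from bisect import bisect_right
from typing import Optional, Sequence

NOISE_FLOOR = 1e-3  # threshold quoted in the commit message


def percentile(cdf_row: Sequence[float], sample_count: Optional[int],
               raw_score: float) -> Optional[float]:
    """Fraction of background values <= raw_score, or None with no background."""
    # Guard: zero samples means the row is zero-filled filler, not a real CDF.
    # sample_count=None models an older file without a counts field (kept valid).
    if sample_count == 0:
        return None
    return bisect_right(cdf_row, raw_score) / len(cdf_row)


def quantile_or_none(raw_score: float, quantile: Optional[float]) -> Optional[float]:
    """Suppress the quantile when the raw score sits inside the noise floor."""
    if abs(raw_score) < NOISE_FLOOR:
        return None
    return quantile
```

Without the guard, a zero-filled row sends every non-negative probe to the end of the row in this sketch, so the lookup reports 1.0 regardless of the score. That is the silent "100th percentile" failure mode the commit describes.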

Two scary-looking warnings surfaced while reading notebook cell outputs in the v7 audit. Neither is a real problem but both alarm users:

1. chorus/core/base.py:323: case-sensitive compare of reference allele vs genome. pyfaidx returns lowercase for softmasked (repetitive) regions; users always pass uppercase. The previous code fired 'Provided reference allele is not the same as the genome reference' on every variant in a softmasked locus (e.g. GATA1 TSS in quickstart notebook cell 39, comprehensive notebook cells 35 and 51). Now uses .upper() on both sides; also includes the actual allele pair in the warning message so users can confirm.
2. chorus/core/result.py:104: 'Unknown implementation' warning fired for every Sei track (Stem cell / Multi-tissue / H3K4me3 etc.) that isn't in the hardcoded assay_type registry. The generic fallback works correctly; the warning was just noise. Downgraded to logger.debug.

Scientific review of outputs:
- SORT1 rs12740374: predictions match Musunuru 2010 mechanism (CEBPA/B binding gain, DNASE opening, H3K27ac gain, CAGE TSS increase) ✓
- BCL11A rs1427407: TAL1 binding loss + DNASE closing in K562 ✓
- FTO rs1421085: minimal effects in HepG2 (expected: adipose tissue) ✓
- TERT chr5:1295046 T>G: E2F1 binding gain + TERT TSS CAGE increase ✓
- SORT1 causal: rs12740374 ranks #1 composite=0.964 ✓

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
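
The case-insensitive allele check from item 1 of this commit can be sketched as below. The function name `reference_matches` is hypothetical and the warning wording is only modelled on the message quoted in the commit; the sketch assumes nothing about the surrounding code in chorus/core/base.py beyond the `.upper()` comparison on both sides.

```python
import logging

logger = logging.getLogger(__name__)


def reference_matches(provided_ref: str, genome_ref: str) -> bool:
    """Compare alleles ignoring case, since softmasked genome bases are lowercase."""
    matches = provided_ref.upper() == genome_ref.upper()
    if not matches:
        # Include both alleles so the user can confirm the mismatch is real.
        logger.warning(
            "Provided reference allele %r is not the same as the genome reference %r",
            provided_ref, genome_ref,
        )
    return matches
```

With this comparison a user-supplied "G" matches a softmasked "g" without a warning, while a genuine mismatch such as "G" vs "T" is still reported.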

Fresh-install audit at e99fd66 verifying all 4 v10 fixes on a truly clean slate. Teardown: 14.2 GB including tfhub_modules/ this time.

All 4 v10 fixes verified live:
- Fix #1 (tfhub recovery): code path exists + first-install smoke passes on wiped tfhub cache.
- Fix #2 (IGV HF fallback): 0/16 HTMLs fell back to CDN on the same SSL-MITM network that had 6/16 fallbacks in v10.
- Fix #3 (FTO README): accurate HepG2 framing + adipose assay_ids block for the ideal run.
- Fix #4 (bgzip PATH): 0 'bgzip is not installed' lines across 235 notebook cells (v10 had 20/34/60 per notebook).

One minor regression exposed: Fix #4 makes tabix findable, which reveals a pre-existing bug where download_gencode leaves a stale .tbi file that coolbox's `tabix -p gff` rejects with "index file exists". Workaround = delete .tbi; NB1 retry succeeded. Proposed 3-line follow-up fix to annotations.py documented in the report.

Also verified:
- 308/308 pytest on fresh env (17.3 s)
- 6/6 oracle smoke (7 min 2 s), the first Enformer fresh-install with wiped tfhub cache
- 12/12 regen within AlphaGenome CPU non-determinism tolerance
- 0 orphan HTMLs after parallel regen
- 3 notebooks: 0 errors, 0 warnings, 0 bgzip spam
- 16/16 HTMLs clean in Selenium
- FTO README spot-check confirms Fix #3 committed correctly

After 11 audit passes, the last two have surfaced no actual chorus bugs, only environmental quirks (tfhub cache, SSL MITM, PATH inheritance, stale .tbi).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
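
The "delete .tbi" workaround mentioned above could be scripted along these lines. This is only a sketch of the manual workaround, not the 3-line follow-up fix proposed in the audit report (which is not shown here). The helper name `delete_tabix_index` is hypothetical, and the sketch assumes the index sits next to the annotation file with `.tbi` appended to its name.

```python
from pathlib import Path


def delete_tabix_index(annotation_path: Path) -> bool:
    """Remove '<file>.tbi' next to the annotation file; return True if one was removed."""
    # Assumption: the index is named by appending '.tbi' to the full file name.
    index_path = annotation_path.with_name(annotation_path.name + ".tbi")
    if index_path.exists():
        index_path.unlink()
        return True
    return False
```

Running this before re-indexing clears the leftover index so the indexing step no longer stops on an existing file.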

borzoi_v1 with basic prediction and examples

69ab1a0

dmitrypenzar1996 requested a review from lucapinello July 16, 2025 20:30

lorenzoruggerii closed this Jan 29, 2026

This was referenced Apr 15, 2026

macOS Apple Silicon parity: MPS + Metal auto-detect, audit report, small bug fixes #7

Merged

macOS v2 audit + shared resumable+locked downloader (genome + chrombpnet) #8

Merged

audits: deep app + normalization audit (2026-04-16) #9

Merged

This was referenced Apr 17, 2026

audit: 2026-04-16 v6 full fresh-install audit #14

Open

Fix v6 audit finding: orphan HTMLs from discovery regen #15

Open

audit: 2026-04-17 v8 full fresh-install audit (zero findings) #18

Open

This was referenced Apr 17, 2026

audit: 2026-04-17 v10 fresh-install + content-review audit #20

Open

Fix v10 audit findings: tfhub cache + IGV HF fallback + FTO doc + bgzip PATH #21

Open

This was referenced Apr 17, 2026

audit: 2026-04-17 v11 post-v10 verification audit #22

Open

audit: 2026-04-20 v12 full UX consistency audit (cross-modality) #25

Open

Fix v12: bundle igv.min.js + drop regen entries for deleted files #26

Open

cosmoss8274 wants to merge 1 commit into pinellolab:main from cosmoss8274:borzoi_yc

2 participants
