borzoi_v1 with basic prediction and examples #1

Closed
cosmoss8274 wants to merge 1 commit into pinellolab:main from cosmoss8274:borzoi_yc

@cosmoss8274
Collaborator

Borzoi Oracle Implementation (v1)
Key features:
- Basic Borzoi model prediction functionality, demonstrated in the example notebook
- Compatible with the Enformer-style API and interface
- Supports both use_environment=True (creates a new environment) and use_environment=False (uses the existing environment directly)

Implementation notes:
- Shares core functionality with the base Oracle class
- Comprehensive example notebook included

lucapinello pushed a commit that referenced this pull request Apr 15, 2026
Addresses every actionable item in audits/2026-04-14_macos_arm64.md.
All changes are platform-conditional — Linux CUDA paths are unchanged.

PyTorch oracles (borzoi, sei, legnet) — auto-detect MPS on Apple Silicon
  - Both the in-process loader (chorus/oracles/{borzoi,sei,legnet}.py) and
    the subprocess templates ({borzoi,sei,legnet}_source/templates/{load,
    predict}_template.py) now resolve `device is None` (or the new 'auto'
    sentinel) as: cuda > mps > cpu. Linux + CUDA box hits the cuda branch
    first, no behavior change there.
  - SEI: removed the hard CPU device pin. `map_location='cpu'` is still
    used to load weights into host memory before .to(device), which is
    the standard pattern across torch versions and works for mps too.
  - Sei BSplineTransformation lazily moved its spline matrix only when
    `input.is_cuda`. Generalized to any non-CPU device so the matmul works
    on MPS as well. Verified: 286/286 pytest still pass.
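
The cuda > mps > cpu resolution order described above can be sketched as follows. This is a minimal illustration, not the code in chorus/oracles: the function name `resolve_device` and its boolean parameters are hypothetical, and in a real loader the two flags would presumably come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
from typing import Optional


def resolve_device(requested: Optional[str], cuda_available: bool, mps_available: bool) -> str:
    """Pick a device string, preferring cuda, then mps, then cpu.

    Hypothetical helper: both None and the 'auto' sentinel mean "auto-detect".
    """
    if requested not in (None, "auto"):
        return requested  # an explicit choice such as 'cpu' or 'cuda:1' wins
    if cuda_available:
        return "cuda"  # a Linux + CUDA box stops here, so nothing changes for it
    if mps_available:
        return "mps"  # Apple Silicon GPU
    return "cpu"
```

Keeping the availability checks as parameters makes the ordering easy to unit-test on a machine that has neither backend.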

TensorFlow oracles (chrombpnet, enformer) — Metal backend on Apple Silicon
  - chorus/core/platform.py macos_arm64 adapter now adds
    `tensorflow-metal>=1.1.0` to pip_add. Once installed, Apple's plugin
    registers a 'GPU' physical device, so the oracles' existing
    tf.config.list_physical_devices('GPU') auto-detect picks it up with no
    code change. Linux paths don't see the macos_arm64 adapter so CUDA stays
    intact.
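
A platform-conditional dependency of this kind might look like the sketch below. The function name `pip_additions` and its return shape are assumptions made for illustration; only the `tensorflow-metal>=1.1.0` requirement and the macos_arm64 condition come from the commit message.

```python
import platform
from typing import List, Optional


def pip_additions(system: Optional[str] = None, machine: Optional[str] = None) -> List[str]:
    """Return extra pip requirements for the current platform (hypothetical helper)."""
    system = system or platform.system()
    machine = machine or platform.machine()
    # Apple Silicon reports system 'Darwin' and machine 'arm64'.
    if system == "Darwin" and machine == "arm64":
        return ["tensorflow-metal>=1.1.0"]
    # Every other platform, including Linux + CUDA, gets no additions.
    return []
```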

JAX oracle (alphagenome) — unchanged
  - Already explicitly skips Metal in auto-detect (jax-metal still missing
    `default_memory_space` for AlphaGenome). README updated to document
    this trade-off.

MCP fix — fine_map_causal_variant rsID-only crash
  - Calling `fine_map_causal_variant(lead_variant="rs12740374")` previously
    raised KeyError: 'chrom' at chorus/analysis/causal.py:355 because
    `_parse_lead_variant("rs12740374")` returns {"id": ...} only.
  - Backfill chrom/pos/ref/alt onto the sentinel from the LDlink response
    (which always carries them) before invoking prioritize_causal_variants.
  - Verified end-to-end: rs12740374 ranked #1 with composite=1.000 of 12 LD
    variants on AlphaGenome (matches the published Musunuru-2010 finding).

SEI Zenodo download — chunked + resume + single-flight lock
  - Replaced urllib.request.urlretrieve with a stdlib chunked urlopen loop
    that supports HTTP Range resume and an fcntl exclusive lock so two
    concurrent SeiOracle inits don't race the same partial file. Original
    observed throughput on macOS was ~80 KB/s (would take ~11 hours for the
    3.2 GB tar); the new path resumes interrupted downloads and progress-
    logs every 100 MB.
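
A stdlib-only download loop with Range resume and an exclusive lock could be sketched as below. It is not the chorus implementation: `fetch_with_resume`, the `.part` / `.lock` suffixes and the injectable `opener` parameter are assumptions for illustration, progress logging is omitted, and `fcntl` limits the sketch to Unix-like systems.

```python
import fcntl
import os
import urllib.request


def fetch_with_resume(url, dest, chunk_size=1 << 20, opener=urllib.request.urlopen):
    """Download url to dest, resuming from dest + '.part' when possible."""
    part = dest + ".part"
    with open(dest + ".lock", "w") as lock_file:
        # Blocks until any other process holding the lock has finished.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if os.path.exists(dest):
                return dest  # another process completed the download
            offset = os.path.getsize(part) if os.path.exists(part) else 0
            request = urllib.request.Request(url)
            if offset:
                request.add_header("Range", f"bytes={offset}-")
            with opener(request) as response:
                # 206 means the server honoured the Range header; any other
                # status means a full body, so start the partial file over.
                resumed = offset > 0 and getattr(response, "status", 200) == 206
                with open(part, "ab" if resumed else "wb") as out:
                    while True:
                        chunk = response.read(chunk_size)
                        if not chunk:
                            break
                        out.write(chunk)
            os.replace(part, dest)  # atomic rename once the body is complete
            return dest
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

The second caller blocks on the lock, then finds the finished file and returns without downloading again. One case the sketch does not handle: a server that answers a Range request on an already complete partial file with 416, which `urlopen` surfaces as an HTTPError.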

README — macOS troubleshooting + Apple GPU policy table + kernel install
  - Documented the two-mamba-installs MAMBA_ROOT_PREFIX gotcha that breaks
    `chorus health` when the new chorus env lands in a different mamba root
    than the per-oracle envs.
  - Added the per-oracle macOS GPU support matrix (MPS / Metal / CPU) with
    explicit `device=` examples.
  - Added the missing `python -m ipykernel install --user --name chorus`
    step to Fresh Install so examples/*.ipynb find the chorus kernel.

Validation on macOS 15.7.4 / Apple Silicon (CPU + MPS + Metal):
  - 286/286 pytest pass (incl. all 6 oracle smoke-predict tests)
  - chorus.create_oracle('borzoi') auto-detects mps:0
  - chorus.create_oracle('sei')    auto-detects mps:0 + smoke-predict ok
  - chrombpnet env now reports tf.config.list_physical_devices('GPU') = [GPU:0]
  - fine_map_causal_variant(lead_variant='rs12740374') ranks rs12740374
    composite=1.000 of 12 LD variants

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
lucapinello pushed a commit that referenced this pull request Apr 15, 2026
…ixes

Addresses the findings in audits/2026-04-16_application_and_normalization_audit.md
(PR #9). Four categories of change:

1. Delete two example applications the audit recommends removing:

   - examples/applications/variant_analysis/TERT_promoter/
     C228T is a melanoma-specific gain-of-function mutation; the example
     runs it in K562 (erythroleukemia) and shows all-negative effects.
     The biology is correct for the model but inverts the published
     direction. Rather than add a "wrong cell type" caveat, drop the
     example — SORT1 / FTO / BCL11A cover variant_analysis without
     teaching the reader a misleading result.

   - examples/applications/validation/HBG2_HPFH/
     Already self-documented as "Not reproduced" in
     validation/README.md: BCL11A / ZBTB7A aren't in AlphaGenome's track
     catalog, so the repressor-loss mechanism isn't visible. Keeping a
     "validation failed" example alongside the working
     SORT1_rs12740374_with_CEBP confuses readers. Drop it.

   Also updated: root README.md (replaces HBG2_HPFH link with
   SORT1_rs12740374_with_CEBP), examples/applications/variant_analysis/README.md
   (drops TERT prompt + section), examples/applications/validation/README.md
   (drops HBG2 row + section + reproduce snippet),
   scripts/regenerate_examples.py + scripts/internal/inject_analysis_request.py
   (both lose their TERT_promoter/HBG2_HPFH entries).

2. Normalizer: guard against zero-count CDF rows
   (chorus/analysis/normalization.py).

   Audit finding #1 (HIGH): the committed chrombpnet_pertrack.npz has
   DNASE:hindbrain with effect_counts[idx] == 0 and a zero-filled CDF
   row. effect_percentile() / activity_percentile() silently returned
   1.0 for every raw_score (including 0.0) because np.searchsorted on
   a zeros row returns len(row) for any non-negative probe and the
   denominator falls through to cdf_width. Same bug-class as the v2
   chrombpnet concurrent-download race that landed in PR #8 — the
   hindbrain ENCODE tar must have failed to extract cleanly during the
   original background build.

   New private helper _has_samples() returns False when
   counts[idx] == 0, which makes _lookup / _lookup_batch return None.
   Callers already render None as "—" in MD/HTML tables, so users now
   see "no background" instead of a silent false "100th percentile".
   Counts-less NPZs (older format, no counts field) are treated as
   valid — no regression.
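
   The failure mode and the guard can be reproduced in a few lines. This is a simplified sketch, not the code in chorus/analysis/normalization.py: `has_samples` and `percentile` are hypothetical stand-ins, and the stdlib `bisect_right` is used in place of np.searchsorted (the two agree for a sorted 1-D row when searchsorted is called with side='right').

```python
from bisect import bisect_right
from typing import Optional, Sequence


def has_samples(counts: Optional[Sequence[int]], idx: int) -> bool:
    """False when the background for this track was built from zero samples."""
    if counts is None:
        return True  # older files without a counts field are treated as valid
    return counts[idx] > 0


def percentile(cdf_row: Sequence[float], raw_score: float,
               counts: Optional[Sequence[int]] = None, idx: int = 0) -> Optional[float]:
    """Fraction of background values <= raw_score, or None without a background."""
    if not has_samples(counts, idx):
        return None
    # On an all-zero row every non-negative probe lands at len(cdf_row),
    # which is how the unguarded lookup reported 1.0 for any score.
    return bisect_right(cdf_row, raw_score) / len(cdf_row)
```

   Calling `percentile([0.0] * 4, 0.0)` without counts shows the silent 1.0, while passing `counts=[0]` returns None so the caller can render "no background".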

3. Report: suppress quantile_score when raw_score is in the noise floor
   (chorus/analysis/variant_report.py).

   Audit finding #6 (LOW): when |raw_score| < 1e-3 the effect CDF is
   so densely clustered around 0 that a 1-2% raw-score drift can swing
   the quantile by 0.5+ (observed in the Phase A rerun: committed
   quantile=1.0 vs rerun=0.21 for a CEBPB track with raw_score ~1e-4).
   Set quantile_score = None in that regime so the HTML/MD tables
   render "—" and readers don't misread noise as signal. Threshold
   chosen conservatively to cover both log2fc (pc=1.0) and logfc RNA
   (pc=0.001) without hiding real effects.
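
   A minimal sketch of the noise-floor rule, assuming a stand-in `TrackScore` dataclass and a hypothetical `apply_quantile` helper (the real class and `_apply_normalization` carry more fields and logic than shown here):

```python
from dataclasses import dataclass
from typing import Optional

NOISE_FLOOR = 1e-3  # threshold quoted in the commit message


@dataclass
class TrackScore:
    """Stand-in with only the two fields this rule touches."""
    raw_score: float
    quantile_score: Optional[float] = None


def apply_quantile(ts: TrackScore, quantile: Optional[float]) -> TrackScore:
    """Store the quantile unless |raw_score| is below the noise floor."""
    ts.quantile_score = None if abs(ts.raw_score) < NOISE_FLOOR else quantile
    return ts
```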

4. IGV.js: lazy-download the bundle into ~/.chorus/lib on first use
   (chorus/analysis/_igv_report.py + chorus/analysis/causal.py).

   Audit finding #2 (MEDIUM): reports embed a <script src="..."> to
   cdn.jsdelivr.net that gets evaluated every time the HTML is opened
   in a browser. Any viewer on an airgapped network / corporate proxy
   that MITMs TLS / during a jsdelivr outage sees IGV silently fail
   (2/19 audit reports hit ERR_CERT_AUTHORITY_INVALID). The local-
   cache code path already existed but was opt-in (user had to drop a
   file in ~/.chorus/lib/igv.min.js manually).

   New _ensure_igv_local() helper runs on the first report generation
   and populates the cache via chorus.utils.http.download_with_resume
   (the helper that landed in v2 PR #8). Reports written after the
   first successful download inline the JS directly — self-contained
   HTML that opens anywhere without network. Download failure is
   logged at WARNING and the CDN <script> tag is used as fallback,
   preserving the current behaviour for anyone who can't reach
   jsdelivr at generation time.
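
   The cache-then-fallback behaviour described above might be structured like this. The sketch is illustrative only: `igv_script_tag` is a hypothetical name, the CDN URL is passed in by the caller rather than assumed, and `download` stands in for whatever helper actually fetches the file.

```python
import logging
from pathlib import Path
from typing import Callable

logger = logging.getLogger(__name__)


def igv_script_tag(cache_file: Path, cdn_url: str,
                   download: Callable[[str, Path], None]) -> str:
    """Return an inline <script> from the local cache, or a CDN tag on failure."""
    if not cache_file.exists():
        try:
            cache_file.parent.mkdir(parents=True, exist_ok=True)
            download(cdn_url, cache_file)  # first use populates the cache
        except Exception as exc:
            logger.warning("IGV download failed (%s); falling back to CDN", exc)
            return f'<script src="{cdn_url}"></script>'
    # Inlining the bundle makes the report self-contained.
    return f"<script>{cache_file.read_text(encoding='utf-8')}</script>"
```

   Once the cache file exists, later calls never touch the network, so reports generated offline still embed the viewer.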

All changes are platform-agnostic; 287/287 pytest continue to pass;
fix verified behaviourally:

  >>> norm.effect_percentile('chrombpnet', 'DNASE:hindbrain', 0.0)
  None                      # was: 1.0
  >>> norm.effect_percentile('chrombpnet', 'DNASE:HepG2', 0.0)
  0.0                       # unchanged

  >>> ts = TrackScore(raw_score=0.0005, ...);
  >>> _apply_normalization(ts, ...); ts.quantile_score
  None                      # noise floor

See audits/2026-04-16_application_and_normalization_audit.md (PR #9)
for full context, per-app screenshots, and the Phase A / B / C
methodology behind each finding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
lucapinello added a commit that referenced this pull request Apr 17, 2026
Two scary-looking warnings surfaced while reading notebook cell outputs
in the v7 audit. Neither is a real problem but both alarm users:

1. chorus/core/base.py:323 — case-sensitive compare of reference allele
   vs genome. pyfaidx returns lowercase for softmasked (repetitive)
   regions; users always pass uppercase. The previous code fired
   'Provided reference allele is not the same as the genome reference'
   on every variant in a softmasked locus (e.g. GATA1 TSS in quickstart
   notebook cell 39, comprehensive notebook cells 35 and 51). Now uses
   .upper() on both sides; also includes the actual allele pair in the
   warning message so users can confirm.
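
   The case-insensitive comparison can be sketched as below. The helper names are hypothetical and the message wording only approximates the one quoted above; the point is that a lowercase (softmasked) genome base no longer triggers a mismatch.

```python
from typing import Optional


def ref_matches_genome(provided_ref: str, genome_ref: str) -> bool:
    """Compare alleles without regard to softmasking case."""
    return provided_ref.upper() == genome_ref.upper()


def mismatch_warning(provided_ref: str, genome_ref: str) -> Optional[str]:
    """Return a warning naming both alleles, or None when they agree."""
    if ref_matches_genome(provided_ref, genome_ref):
        return None
    return (
        "Provided reference allele is not the same as the genome reference "
        f"(provided={provided_ref!r}, genome={genome_ref!r})"
    )
```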

2. chorus/core/result.py:104 — 'Unknown implementation' warning fired
   for every Sei track (Stem cell / Multi-tissue / H3K4me3 etc.) that
   isn't in the hardcoded assay_type registry. The generic fallback
   works correctly; the warning was just noise. Downgraded to
   logger.debug.

Scientific review of outputs:
- SORT1 rs12740374: predictions match Musunuru 2010 mechanism (CEBPA/B
  binding gain, DNASE opening, H3K27ac gain, CAGE TSS increase) ✓
- BCL11A rs1427407: TAL1 binding loss + DNASE closing in K562 ✓
- FTO rs1421085: minimal effects in HepG2 (expected — adipose tissue) ✓
- TERT chr5:1295046 T>G: E2F1 binding gain + TERT TSS CAGE increase ✓
- SORT1 causal: rs12740374 ranks #1 composite=0.964 ✓

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lucapinello pushed a commit that referenced this pull request Apr 17, 2026
Fresh-install audit at e99fd66 verifying all 4 v10 fixes on a truly
clean slate. Teardown: 14.2 GB including tfhub_modules/ this time.

All 4 v10 fixes verified live:
- Fix #1 (tfhub recovery): code path exists + first-install smoke
  passes on wiped tfhub cache.
- Fix #2 (IGV HF fallback): 0/16 HTMLs fell back to CDN on the same
  SSL-MITM network that had 6/16 fallbacks in v10.
- Fix #3 (FTO README): accurate HepG2 framing + adipose assay_ids
  block for the ideal run.
- Fix #4 (bgzip PATH): 0 'bgzip is not installed' lines across 235
  notebook cells (v10 had 20/34/60 per notebook).

One minor regression exposed: Fix #4 makes tabix findable, which
reveals a pre-existing bug where download_gencode leaves a stale
.tbi file that coolbox's `tabix -p gff` rejects with "index file
exists". Workaround = delete .tbi; NB1 retry succeeded. Proposed
3-line follow-up fix to annotations.py documented in the report.
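
The workaround (delete the stale .tbi before indexing) could be automated along these lines. This is a guess at the shape of such a fix, not the follow-up proposed in the report: `remove_stale_index` is a hypothetical helper and the real change to annotations.py may differ.

```python
from pathlib import Path


def remove_stale_index(annotation_path: Path) -> bool:
    """Delete '<annotation>.tbi' if present so the file can be re-indexed.

    Returns True when an index file was removed.
    """
    index = annotation_path.with_name(annotation_path.name + ".tbi")
    if index.exists():
        index.unlink()
        return True
    return False
```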

Also verified:
- 308/308 pytest on fresh env (17.3 s)
- 6/6 oracle smoke (7 min 2 s) — first Enformer fresh-install with
  wiped tfhub cache
- 12/12 regen within AlphaGenome CPU non-determinism tolerance
- 0 orphan HTMLs after parallel regen
- 3 notebooks: 0 errors, 0 warnings, 0 bgzip spam
- 16/16 HTMLs clean in Selenium
- FTO README spot-check confirms Fix #3 committed correctly

After 11 audit passes, the last two have surfaced no actual chorus
bugs, only environmental quirks (tfhub cache, SSL MITM, PATH
inheritance, stale .tbi).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>