feat: unify track-rescale across IGV/matplotlib/CoolBox/notebooks + uniform DHS-augmented chrombpnet CDF#78
Merged
lucapinello merged 18 commits into main on May 10, 2026
Conversation
…F alias

ChromBPNet CHIP predictions emit track IDs like `CHIP:K562:REST:+`/`:-`, but the background CDF stores `CHIP:K562:REST` (no strand). All CHIP normalization lookups therefore silently returned None, falling back to raw unscaled values in IGV reports and percentile output. Fix: strip the `:+`/`:-` suffix as a fallback in `_lookup`, `_lookup_batch`, and `perbin_floor_rescale_batch` in `PerTrackNormalizer`, so both strand predictions correctly share the single background distribution row.

Also alias `alphagenome_pt` → `alphagenome` in `_ensure_loaded` and `get_pertrack_normalizer`: since both backends produce identical predictions (same model + weights), no separate CDF file is needed. The alias is bypassed automatically if a dedicated `alphagenome_pt_pertrack.npz` appears in the cache later.

Adds 8 unit tests covering strand-suffixed lookups and the strandless-key fallthrough.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
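The strandless-key fallback can be sketched as a plain dict lookup (hypothetical `lookup_cdf_row` helper and dict-based storage; the real `PerTrackNormalizer` internals differ):

```python
# Minimal sketch of the strandless-key fallback, assuming CDF rows
# are keyed by track ID in a plain dict (illustrative only).
def lookup_cdf_row(cdf_rows: dict, track_id: str):
    """Return the CDF row for track_id, falling back to the
    strand-stripped key so 'CHIP:K562:REST:+' and ':-' both
    resolve to the merged 'CHIP:K562:REST' row."""
    row = cdf_rows.get(track_id)
    if row is None and track_id.endswith((":+", ":-")):
        row = cdf_rows.get(track_id[:-2])  # strip ':+' / ':-'
    return row

rows = {"CHIP:K562:REST": [0.1, 0.5, 0.9]}
print(lookup_cdf_row(rows, "CHIP:K562:REST:+"))  # → [0.1, 0.5, 0.9]
print(lookup_cdf_row(rows, "DNASE:K562"))        # → None
```

Both strands hit the same row, and tracks with no row at all still return None so callers can fall through to the summary CDF.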
`chorus setup --oracle X` previously had no timeout for either the `mamba env create` subprocess or the weight-download phase. On slow or unstable connections (e.g. remote lab servers) a stalled download would hang indefinitely.

Changes:
- `chorus/cli/main.py`: add a `--setup-timeout SECONDS` flag (default: unlimited), passed through to both `create_environment()` and `prefetch_for_oracle()`.
- `chorus/core/environment/manager.py`: `create_environment()` gains a `timeout` parameter. A `threading.Timer` kills the `mamba env create` subprocess after N seconds and raises a descriptive `RuntimeError`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
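The timer-kills-subprocess pattern can be sketched roughly as follows (hypothetical `run_with_timeout` helper; the real `create_environment()` wires this into the mamba specifics):

```python
import subprocess
import sys
import threading

def run_with_timeout(cmd, timeout=None):
    """Run cmd, killing it after `timeout` seconds (None = unlimited).
    Illustrative sketch of the pattern, not chorus's actual code."""
    proc = subprocess.Popen(cmd)
    timed_out = threading.Event()
    timer = None
    if timeout is not None:
        def _kill():
            timed_out.set()
            proc.kill()
        timer = threading.Timer(timeout, _kill)
        timer.start()
    try:
        proc.wait()
    finally:
        if timer is not None:
            timer.cancel()
    if timed_out.is_set():
        raise RuntimeError(f"command timed out after {timeout}s: {cmd!r}")
    return proc.returncode

print(run_with_timeout([sys.executable, "-c", "pass"], timeout=30))  # → 0
```

Cancelling the timer in `finally` keeps the fast path clean; the event flag distinguishes a genuine timeout from a process that simply exited non-zero.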
Uninstalling chorus previously required manually removing 7 conda envs,
downloaded weights, background CDFs, and reference genomes. `chorus cleanup`
handles this in one command.
Usage:
chorus cleanup --oracle {name|all} # env + weights
chorus cleanup --backgrounds # ~/.chorus/backgrounds/*.npz
chorus cleanup --genomes # downloaded reference genomes
chorus cleanup --all # everything above
chorus cleanup --all --dry-run # preview without deleting
- Missing paths silently skipped (idempotent)
- Dry-run prints [DRY RUN] prefix on every action
- Summary line at end: "Removed N environment(s), M weight dir(s), K file(s)"
- README: Upgrading section updated to use `chorus cleanup --all`;
new Uninstalling subsection added; `--setup-timeout` usage note added
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…robe

Two bugs found during the scorched-earth teardown/reinstall test:

1. `chorus/utils/genome.py`: `download_with_resume` releases its lock after writing `hg38.fa.gz` but before decompression. A concurrent `chorus setup` process could decompress and delete the `.gz` between those steps, leaving the first process with a FileNotFoundError. Fix: check whether `fasta_path` already exists before decompressing, and use `unlink(missing_ok=True)` to tolerate concurrent deletions.

2. `chorus/core/weights_probe.py`: `_probe_chrombpnet` checked for `CHORUS_DOWNLOADS_DIR/chrombpnet/DNASE_K562` — the old ENCODE tarball path. Since v0.3 the default path (fold=0, chrombpnet_nobias) downloads via the HF slim mirror into `~/.cache/huggingface/`, so the probe always reported "Not installed" even after a successful setup. Fix: switch to `_probe_library_cached` (trust the setup marker), matching enformer and borzoi.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
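The race-tolerant decompression fix can be sketched like this (simplified; the real `download_with_resume` adds locking and resume logic around it):

```python
import gzip
import shutil
from pathlib import Path

def decompress_if_needed(gz_path: Path, fasta_path: Path) -> None:
    """Decompress gz_path -> fasta_path, tolerating a concurrent
    process having already done the work. Illustrative sketch."""
    if fasta_path.exists():
        # Another process already decompressed; just tidy up.
        gz_path.unlink(missing_ok=True)
        return
    with gzip.open(gz_path, "rb") as src, open(fasta_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    # missing_ok tolerates a concurrent deletion of the .gz.
    gz_path.unlink(missing_ok=True)
```

Both guards are needed: the existence check avoids re-reading a `.gz` another process may be deleting, and `missing_ok=True` makes the cleanup itself safe to race.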
dmitrypenzar1996
approved these changes
May 6, 2026
…GV

Add adaptive downsampling with max pooling for 1-bp tracks (ChromBPNet), per-bin size calculation based on native resolution, IGV windowFunction hints, and a CDF fallback for models without per-bin distributions (LegNet).
Resolves conflict in chorus/analysis/normalization.py:
- Adopt Lorenzo's _match_track_id / _find_matching_cdf helpers.
- Extend _match_track_id to also strip the CHIP strand suffix (:+/:-) so per-strand track IDs match merged CDF rows.
- Move the _has_samples guard inside _find_matching_cdf so failed-build perbin rows fall through to the summary CDF instead of saturating.
- Set the perbin_floor_rescale_batch max_value default to 3.0 (matches _DISPLAY_MAX in IGV).

Lorenzo's other files (_igv_report.py, multi_oracle_report.py, scripts/regenerate_*.py, example artefacts) auto-merged.
…tebooks

Single source of truth for normalization semantics: every renderer now goes through `chorus.analysis._igv_report.rescale_for_display()`. By default (no extra params) all four paths produce CDF-rescaled output — 1.0 = genome-wide p99, 3.0 cap, signed layers symmetric around 0.

Key changes:
- New `rescale_for_display(values, layer, normalizer, oracle_name, assay_id) → (out, cfg)` returns rescaled values plus a display config (ymin, ymax, signed flag) usable by any renderer.
- `apply_floor_rescale` (the IGV ref/alt wrapper) now returns a 4-tuple (rescaled, ref, alt, signed) so callers can pick symmetric vs unsigned scale_cfg.
- New `signed_floor_rescale_batch` rescales signed values to [-DISPLAY_MAX, +DISPLAY_MAX] using p99(|cdf|), so Borzoi RNA / Sei / LentiMPRA repressive effects are visible (they were clipped to 0 before).
- `is_signed()` and `_match_track_id()` now share fuzzy track-id matching, including CHIP `:+`/`:-` strand-suffix stripping, so LegNet (`LentiMPRA:HepG2` → CDF row `HepG2`) correctly registers as signed.
- `OraclePrediction.add()` backfills `track.assay_id` from the dict key so CoolBox/matplotlib autoload paths can find the right CDF row on tracks that left assay_id None (notably ChromBPNet).
- CoolBox `get_coolbox_representation()` and matplotlib `render_track_figures()` now auto-load the per-track normalizer from `~/.chorus/backgrounds/` when called with no kwargs; pass `normalize=False` to opt out for raw values.
- `ChromBPNetOracle.predict_sliding()` slides the 2114-bp model across arbitrary intervals with cigar substitutions preserved, so the multi-oracle IGV panel covers the full AlphaGenome 1 Mb locus instead of 0.2 % of it. `_predict()` auto-routes wide queries to it (PR #79's wider region had been triggering a pre-existing IndexError in `_predict_direct`'s sliding formula).
- `_calculate_track_bin_size` uses `(20, "max")` for ChromBPNet (was `(20, "mean")` — a code/description mismatch in PR #79); max-pooling preserves 1-bp peak heights instead of diluting them 20×.
- Lower per-layer floors so peaks have a visible base/shoulder: `chromatin_accessibility 0.95→0.90`, `promoter_activity 0.95→0.85`.
- Causal-report IGV (`causal._build_causal_igv`) now goes through the same helper as the variant and multi-oracle reports.

Lorenzo's PR #79 changes preserved: `_match_track_id` / `_find_matching_cdf` (with the CHIP-strand fuzzy match added on top), `_calculate_track_bin_size`, the `windowFunction: "max"` IGV hint, the `(per-track norm)` LegNet label suffix, and the `get_max_output_size()` multi-oracle region width.

Tests: 376 passed, 1 skipped (env-gating), 5 deselected (integration). Updated `test_perbin_none_for_scalar_oracles` (the perbin → summary fallback now succeeds for LegNet) and `test_apply_floor_rescale_passthrough` (4-tuple). Added `test_rescale_for_display_unified_helper`. Annotations directory and screenshot sweeps gitignored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
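The max-vs-mean aggregation difference is easy to see in a toy sketch (hypothetical `downsample` helper; the real binning lives in `_calculate_track_bin_size` plus the IGV rendering path):

```python
def downsample(values, bin_size=20, agg="max"):
    """Downsample a 1-bp-resolution track into bins of bin_size.
    agg='max' preserves sharp peak heights; agg='mean' dilutes a
    single 1-bp spike by ~bin_size. Illustrative sketch only."""
    fn = max if agg == "max" else (lambda b: sum(b) / len(b))
    return [fn(values[i:i + bin_size])
            for i in range(0, len(values), bin_size)]

track = [0.0] * 40
track[7] = 5.0  # a single 1-bp peak
print(downsample(track, 20, "max"))   # → [5.0, 0.0]
print(downsample(track, 20, "mean"))  # → [0.25, 0.0]
```

With mean aggregation the 5.0 peak renders as 0.25, which is exactly the 20× dilution the fix removes.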
Output of the code unification — same model predictions, new IGV display semantics:
- chrombpnet panel: max-pool over 20 bp/bin so peaks have body, not just spikes
- legnet panel: symmetric [-3, +3] scale shows the repressive (negative) half of LentiMPRA effects that was clipped to 0 before
- chromatin_accessibility floor lowered p95 → p90, promoter_activity p95 → p85 so the peak base/shoulder is visible

Variant scoring (Effect/Activity percentile values) is unchanged — the build script's window matches the live scoring window; only the display layer was touched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
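The symmetric signed rescale can be sketched as follows (hypothetical `p99_abs`/`signed_rescale` helpers with a crude percentile; `signed_floor_rescale_batch`'s real signature and interpolation differ):

```python
import math

DISPLAY_MAX = 3.0  # matches the _DISPLAY_MAX cap described above

def p99_abs(samples):
    """Crude 99th percentile of |samples| (nearest-rank; toy only)."""
    s = sorted(abs(x) for x in samples)
    return s[min(len(s) - 1, math.ceil(0.99 * (len(s) - 1)))]

def signed_rescale(values, background):
    """Scale by p99(|background|) and clip to ±DISPLAY_MAX, so the
    repressive (negative) half survives instead of clipping to 0."""
    unit = p99_abs(background) or 1.0
    return [max(-DISPLAY_MAX, min(DISPLAY_MAX, v / unit)) for v in values]

bg = [-0.5, -0.2, 0.1, 0.4, 1.0]             # toy background distribution
print(signed_rescale([2.0, -2.0, 5.0], bg))  # → [2.0, -2.0, 3.0]
```

Using p99 of the absolute values as the unit keeps the scale symmetric around 0, which is what lets an IGV panel draw a [-3, +3] axis for both halves.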
Documents what was tested, what passed, and one deferred follow-up (the DHS-augmented chrombpnet CDF needs to be rebuilt for all 786 tracks, including BPNet/CHIP, before uploading to HuggingFace).

Also fixes two stale README claims that survived the unification work: the display-rescale range is [0, 3.0], not [0, 1.5] (matching _DISPLAY_MAX in _igv_report.py), and a row is added for the new signed-layer symmetric [-3, +3] rescale semantics.

Re-regenerates the SORT1 chrombpnet + multi-oracle artefacts against the HF-shipped 786-track CDF (the production CDF every fresh install gets), dropping the local DHS-only 42-track CDF the previous regen run had used. SORT1 chrombpnet effect under the HF CDF: +0.318 log2FC, ≥99th %ile, Activity %ile 0.603 (same qualitative interpretation as the local-DHS run).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…et CDF
Augments the chrombpnet/BPNet CDF build with DHS-vocabulary-anchored
samples in addition to the random-genome reservoirs:
- ``--n-dhs-variants`` (default 10000): SNPs at random offsets within
±150 bp of Meuleman DHS peak summits, added to the effect CDF
reservoir alongside the existing ~10 K random SNPs.
- ``--n-dhs-peaks`` (default 5000): DHS summit positions added to the
baseline (activity) and per-bin reservoirs alongside the existing
random/cCRE/TSS positions.
Hooks live in ``build_all_models()`` so both ATAC_DNASE (42 models)
and CHIP (1259 BPNet/JASPAR models) build paths pick up the DHS
augmentation when the script is run with ``--assay all``.
Requires ``annotations/dhs_vocabulary_hg38.txt.gz`` (Meuleman 2020
hg38 DHS Index, ~90 MB). Download with::
    gdown --id 16wbuNmHnwsek3USWM04nR535vPavNZka \
      -O annotations/dhs_vocabulary_hg38.txt.gz
Run with shards on a multi-GPU host (the canonical 6-shard split was
used for the existing 786-track HF release; see PR #53). Single-GPU
runs are also supported but take ~6 days serial on M3 Ultra Metal.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
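The DHS-anchored variant sampling can be illustrated with a toy sketch (made-up summit coordinates; the real build script reads summits from the Meuleman index and uses the actual reference base as `ref` rather than a random one):

```python
import random

BASES = "ACGT"

def sample_dhs_variants(summits, n, flank=150, seed=0):
    """Sample n SNPs at random offsets within ±flank bp of DHS peak
    summits. Simplified sketch of the --n-dhs-variants sampling."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        chrom, summit = rng.choice(summits)
        pos = summit + rng.randint(-flank, flank)
        ref, alt = rng.sample(BASES, 2)  # toy ref->alt substitution
        variants.append((chrom, pos, ref, alt))
    return variants

summits = [("chr1", 1_000_000), ("chr2", 2_500_000)]
vars_ = sample_dhs_variants(summits, 5)
print(len(vars_))  # → 5
```

Anchoring the reservoir near summits concentrates the background on positions where a ChIP/ATAC model can actually respond, which is the point of the augmentation over purely random genome positions.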
The ChromBPNet/BPNet CDF rebuild pipeline (and any caller of ``load_dhs_vocabulary()``) now auto-fetches the Meuleman et al. 2020 DHS Index from ``huggingface.co/datasets/lucapinello/chorus-backgrounds`` if it is not already cached at ``annotations/dhs_vocabulary_hg38.txt.gz``. The mirror is a verbatim copy of the original Meuleman distribution (sha256 ``0a4d2150…1c1c48``, 86.6 MB).

Why: a multi-GPU CDF rebuild needs every shard to consume byte-identical input. The previous gdown step worked but required a manual download per machine and depended on Google Drive remaining reachable. Mirroring to HF gives every chorus install (every shard of every rebuild, every fresh-clone audit) the same input with no extra steps. Falls back to a clear error pointing at the manual gdown command if ``huggingface_hub`` isn't installed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Expands the "Adding a new oracle" walkthrough's CDF build step from
a single 35-line code stub into a per-layer recipe so external
contributors don't have to reverse-engineer what to sample for which
layer.
New subsections under "Step 4: Write the CDF build script":
- **What goes in each CDF** — what effect_cdfs / summary_cdfs /
perbin_cdfs each capture and target sample counts (~20K / ~30-35K
/ ~1M).
- **Reservoir sampling in practice** — points at the canonical
ReservoirSampler in build_backgrounds_chrombpnet.py for bounded
memory + ``to_cdf_matrix(n_points=10000)``.
- **What to sample, layer by layer** — concrete recipes for all
eight LAYER_CONFIGS layers, each with:
* the right scoring window / formula / signed flag
* which positions to sample for the variant reservoir (random
vs DHS-anchored vs splice-site vs etc.)
* which positions for the baseline reservoir (random vs cCRE
vs DHS vs TSS vs exon midpoints vs splice sites)
* pointers to the canonical example build script for that layer
(chrombpnet for chromatin/ChIP-TF/histone, borzoi for RNA exon-
based, alphagenome for CAGE-with-TSS routing, legnet for
element-level signed, sei for sequence-class signed)
- **Common pitfalls** — five gotchas we've actually hit (all-random
baseline producing inflated percentiles; missing signed_flags for
RNA/MPRA/Sei silently clipping repressive halves; build-vs-live
scoring window mismatch; perbin samples drawn only from peak
centers; chromosome-edge boundary effects).
- **Build script skeleton** — a 50-line template that mixes the
position sources from chorus.utils.annotations (sample_ccre_positions,
sample_dhs_positions, get_gene_tss, get_gene_exons, get_screen_ccres)
showing the canonical pattern for any new oracle.
This is the documentation a new ChromBPNet-class oracle developer
needed to build the 786-track DHS-augmented CDF without copy-pasting
build_backgrounds_chrombpnet.py and editing in the dark.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
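The bounded-memory reservoir pattern the guide points at is classic Algorithm R; a minimal sketch follows (the project's `ReservoirSampler` in `build_backgrounds_chrombpnet.py` additionally exposes `to_cdf_matrix(n_points=10000)` and other conveniences):

```python
import random

class ReservoirSampler:
    """Algorithm R: keep a uniform sample of k items from a stream
    of unknown length in O(k) memory. Illustrative sketch."""

    def __init__(self, k, seed=0):
        self.k, self.n = k, 0
        self.rng = random.Random(seed)
        self.samples = []

    def add(self, x):
        self.n += 1
        if len(self.samples) < self.k:
            self.samples.append(x)          # fill phase
        else:
            j = self.rng.randrange(self.n)  # replace with prob k/n
            if j < self.k:
                self.samples[j] = x

r = ReservoirSampler(k=100)
for v in range(200_000):
    r.add(v)
print(len(r.samples))  # → 100
```

This is why the build can stream millions of per-bin values per track while the NPZ stays tens of megabytes: only the k-element reservoirs are ever held in memory.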
Rebuilt the CHIP rows of chrombpnet_pertrack.npz with the new
--n-dhs-variants / --n-dhs-peaks DHS-vocabulary augmentation. Kept the
42 ATAC/DNASE rows from the v29 production NPZ untouched (random-genome
sampling already covers them well; DHS augmentation matters most for
sparse CHIP binding-site data).
Approach:
- Sliced production NPZ to 42-row ATAC/DNASE base
- Ran --assay CHIP --shard-of 2 variants + baselines on ml007 (2× A100-40)
- merge-shards stitched 744 new CHIP rows onto the 42-row base via
PerTrackNormalizer.append_tracks (dedup on track_id keeps ATAC/DNASE
intact)
Compute: ~6 h total wall on ml007 alone (variants 2h 51min + baselines
3h 12min). User estimate of 6 days serial on M3 Ultra Metal vs ~6 h on
2× A100 = ~24× faster per model. ml008 + ml003 fanout aborted due to
another user's job on ml008 + V100 cuDNN errors on ml003.
Final NPZ on HF:
repo: lucapinello/chorus-backgrounds (dataset)
HF commit: 47908dcdc36ab13b5cc1edbb1e3aafc0482d4d29
sha256: b8f8148453e8285195b77430970a2187ecd8df2d2a2b0074c5a0a68f37cb9906
size: 78.5 MB
rows: 786 (42 ATAC/DNASE preserved + 744 new DHS-augmented CHIP)
CHIP rows: effect_counts=18672 (10K random + 8.7K DHS, exact match to
spec), summary_counts=34004, perbin_counts=1088128, signed_flags False,
CDFs monotonic.
Round-trip verified: rm local + get_pertrack_normalizer('chrombpnet')
re-downloads from HF and matches sha byte-for-byte.
Stage 6 scorched-earth (wipe envs + reinstall on ml007) was deferred —
needs explicit user confirm before destructive run.
Lessons documented in the audit re: the --gpu flag overriding
CUDA_VISIBLE_DEVICES at line 206 of scripts/build_backgrounds_chrombpnet.py.
Audit + numbers: audits/2026-05-09_dhs_chrombpnet_full_rebuild.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
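A round-trip sha check like the one above can be done by streaming the downloaded file through `hashlib` (illustrative helper, not chorus's actual code):

```python
import hashlib

def sha256_of(path, chunk=1 << 20):
    """Stream a file's sha256 in 1 MB chunks (no full read into RAM)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while blk := f.read(chunk):
            h.update(blk)
    return h.hexdigest()

# sha256_of("chrombpnet_pertrack.npz") can then be compared to the
# published digest before trusting a re-downloaded copy.
```

Streaming matters for the 78.5 MB NPZ and larger artefacts; comparing the hex digest byte-for-byte against the upload is exactly the verification described.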
Closes the "ATAC/DNASE rows still v29 non-DHS" caveat from the GPU-machine handoff (commit 315a0be) by splicing the local 42-row Mac ATAC/DNASE rebuild (May 7, sha 896e72f1…b151431) into the 786-row HF NPZ produced on ml007 today, in place at the existing track-id positions. No further model recompute needed.

The new HF NPZ:
- repo: huggingface.co/datasets/lucapinello/chorus-backgrounds
- file: chrombpnet_pertrack.npz
- sha256: 526beb2ce8310f6fdb331f766eac55ce3262b67f1a43416532d8bad8f83183eb
- size: 78.5 MB
- 786 tracks, all effect_counts=18672, summary_counts=34004, perbin_counts=1088128 (uniform DHS coverage end-to-end)

Verification against the new NPZ (warm Mac M3 Ultra):
- pytest: 376 passed, 1 skipped, 5 deselected
- SORT1 chrombpnet single-oracle (DNASE:HepG2): +0.318 log2FC, %ile=0.96 (down from ≥99th under v29 non-DHS — an expected conservative shift, since the augmented background now includes ~8.7K DHS-anchored SNPs per track)
- 18 walkthrough HTMLs programmatically IGV-inspected: 0 issues, all panels show real data
- HF round-trip sha matches upload

Multi-oracle SORT1 + single-oracle SORT1 chrombpnet artefacts regenerated against the uniform NPZ. Audit doc updated with a "Follow-up — 2026-05-09" section documenting the splice + verification. Branch is mergeable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "DHS-augmented chrombpnet CDF — rebuild all 786 tracks then upload" item is no longer deferred — it shipped on 2026-05-09 via the hybrid build (744 CHIP on ml007 + local 42 ATAC/DNASE splice). Update the audit doc so the trail is self-consistent for Lorenzo: deferred-list item 1 marked ✅ closed, header caveat removed, and both inline references point to audits/2026-05-09_dhs_chrombpnet_full_rebuild.md for the sub-audit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wiped 8 chorus envs + ~/.chorus + genomes/ + annotations/, then followed README Steps 1-3 literally as a brand-new user. All documented commands worked verbatim, no P0/P1 friction.

Timing on M3 Ultra Metal w/ warm conda cache:
- Step 1 (env create + pip install -e .): 1m 54s
- Step 2 (chorus setup --hf-token …): 18m 35s
- Step 3 (β-globin SNP via Enformer): 49s
- pytest -m "not integration" cold: 5m 39s, 376 passed
- SORT1 multi-oracle cold: 6m 52s
- Total wall: ~33 min

Round-trip verifications:
- New uniform-DHS chrombpnet NPZ (sha 526beb2c…f83183eb, 78.5 MB) auto-fetched from HF on first install — sha matches upload ✓
- DHS vocabulary auto-fetched via load_dhs_vocabulary()'s HF auto-download (verified earlier today)
- All 18 walkthrough HTMLs IGV-inspected, 0 issues, every panel has data
- Cross-oracle SORT1 consensus: chrombpnet +1.241 + alphagenome +1.336 on chromatin accessibility, both ↑, agree on direction

Branch fix/post-v040-followups is safe to merge into main per this audit plus the two preceding ones (2026-05-08 PR-79-merge audit, 2026-05-09 DHS chrombpnet rebuild audit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #78 brings substantial changes that justify a minor bump:
- Unified track-rescale across IGV / matplotlib / CoolBox / notebooks (single `rescale_for_display()` helper; default-call auto-rescales)
- Symmetric signed rescale for Borzoi RNA / Sei / LentiMPRA
- Uniform DHS-augmented chrombpnet CDF on HuggingFace (786 tracks)
- DHS vocabulary auto-fetch from HF
- ChromBPNet predict_sliding for arbitrary-width regions
- Per-layer CDF-sampling guide for new oracle developers

Verification trail: audits/2026-05-08_post_pr79_merge_audit.md, audits/2026-05-09_dhs_chrombpnet_full_rebuild.md, audits/2026-05-09_scorched_earth_fresh_install.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Builds on @lorenzoruggerii's PR #79 (mixed-resolution IGV rendering + LegNet normalization fallback) with:
- `chorus.analysis._igv_report.rescale_for_display()` is now the single source of truth — IGV variant reports, multi-oracle reports, causal reports, matplotlib figures, CoolBox panels, and notebooks all go through it. Default-call behaviour (no kwargs) auto-loads the per-track normalizer so a fresh user calling `track.get_coolbox_representation()` or `render_track_figures(...)` gets CDF-rescaled output without thinking about it. `normalize=False` opts out.
- Signed layers rescale symmetrically to `[-3, +3]` using `p99(|cdf|)` as the unit (`signed_floor_rescale_batch`), so repressive (negative) effects are visible. Previously they clipped to 0.
- `OraclePrediction.add()` backfills `track.assay_id` from the dict key so ChromBPNet tracks (which previously left `assay_id=None`) participate cleanly in the unified rescale.
- `ChromBPNetOracle.predict_sliding()` slides the 2114-bp model across arbitrary intervals so the multi-oracle IGV panel actually shows ChromBPNet across AlphaGenome's 1 Mb window. `_predict()` auto-routes wide queries to it (also fixes a pre-existing IndexError in `_predict_direct`'s sliding formula that PR #79's wider region triggered).
- `_calculate_track_bin_size` corrected: chrombpnet now uses `agg="max"` (PR #79's docstring said max, code returned mean — diluting 1-bp peaks 20×).
- Per-layer floors lowered: `chromatin_accessibility 0.95→0.90`, `promoter_activity 0.95→0.85`.
- DHS vocabulary auto-fetches from HF via `load_dhs_vocabulary()`. No more `gdown` step needed for users who want to rebuild CDFs.
- Per-layer CDF-sampling guide in `docs/NORMALIZATION_GUIDE.md` for new-oracle developers (recipes for chromatin / TF / histone / CAGE / RNA / splicing / MPRA / Sei).

Verification
Three audit docs in `audits/`:
- `2026-05-08_post_pr79_merge_audit.md` — code-level merge of PR #79 + our follow-ups; CDF flow per oracle ("is this a hack or principled?" question); 376 tests passed warm
- `2026-05-09_dhs_chrombpnet_full_rebuild.md` — DHS-augmented CDF rebuild on ml007 (744 CHIP rows, ~6 h on 2× A100), Mac splice (42 ATAC/DNASE rows from local), uniform 786-track NPZ uploaded to HF, round-trip sha-verified
- `2026-05-09_scorched_earth_fresh_install.md` — wiped all 8 chorus envs + caches + genomes, then ran README Steps 1–3 literally as a brand-new user. No P0/P1 friction.

End-to-end on a clean install (M3 Ultra Metal, warm conda cache):
- Full setup (`chorus setup`): 18m 35s (README's 55–75 min estimate is conservative)
- Cross-oracle SORT1 consensus: chrombpnet `+1.241` + alphagenome `+1.336`, both ↑, agree on direction
- Uniform-DHS chrombpnet NPZ (sha `526beb2c…f83183eb`, 78.5 MB) auto-fetched from HF on first install — sha matches upload

Test plan
- `pytest -m "not integration"` → 376 passed cold
- `chorus setup` from scratch → all 7 oracles ready, ~19 min on M3 Ultra
- `LentiMPRA:HepG2` → autoscale-symmetric

Closes / supersedes
Closes #79 (Lorenzo's IGV multi-resolution normalization PR — merged into this branch as commit `63df601` with our follow-ups layered on top).

HuggingFace artifacts updated
- `lucapinello/chorus-backgrounds/chrombpnet_pertrack.npz` — uniform-DHS 786-track CDF (was 786-track non-DHS)
- `lucapinello/chorus-backgrounds/dhs_vocabulary_hg38.txt.gz` — Meuleman 2020 DHS index, mirrored from gdown

🤖 Generated with Claude Code