Document .modesep.json sidecar + quality tiers in Technical-algorithm
Cover the per-species mode-separation diagnostic output and the A/B/C/F
quality-tier rubric in the wiki, so the wiki is a complete reference for
the two-pass classifier (previously only fully documented in the repo's
docs/mode_separation.md, which is gitignored). Also describe the
boundary_mass diagnostic.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
50ba416
Update wiki for v2.7 (two-pass mode-separation + continuous discount)
Reframe the documentation around the v2.6+ two-pass architecture and the
v2.7 continuous per-intron discount:
- Technical-algorithm.md: rewrite Pipeline Overview from 5 stages to 7
(adaptive normalizer fit, first-pass classification, mode estimation +
gate, mode-separation second pass, continuous discount). Add new "Why
two passes?" rationale, first-pass / mode estimation / gate / second-pass
subsections, and a v2.7 continuous-discount section under Score
Adjustment. Reframe the legacy Bayesian valley-depth adjustment as the
gate-fail-only path. Update training-the-default-model section for the
v4_aug + v5_modesep_aug bundle (including HP optimality verification).
- Overview.md, Home.md, About.md: drop "five-stage" / single-ensemble
framing; describe two-pass + continuous discount.
- Output-files.md: reframe adjusted_score as the v2.7 calling column;
add v2.6 (first_pass_svm, modesep_route) and v2.7 (raw_sum,
svm_vs_naive, voting_frac) columns.
- Quick-start.md, Example-usage.md: refresh benchmark to v2.7 (HomSap
~40 min / ~5.3 GB at -p 5, RefSeq GFF, 257k scored introns; DroMel
~8 min / ~0.8 GB). In-memory figures were not re-measured for v2.7.
- Usage-info.md: threshold default 95 -> 90 (correcting stale line).
- Training-data-and-PWMs.md: update preamble for the v2.6+ two-pass
bundle (v4_aug + v5_modesep_aug; 502K-row second-pass corpus).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8462d15
Update normalizer docs for v2.4.2
Three pages described scaler / --load-normalizer / --normalizer-mode
behavior that changed between v2.4.0 and v2.4.2:
- Technical-algorithm: rewrote Normalization Modes to describe the
bundled multispecies fallback scaler (new in v2.4.2), the auto/
adaptive small-input fall-through at MIN_ADAPTIVE_INTRONS=200, and
--load-normalizer's role across streaming + in-memory paths. Added
a forward reference from the v3 default-model section.
- Example-usage: split "Custom normalization" into two cases —
reproducible saved-scaler workflow (unchanged) and forcing the
bundled multispecies scaler via --normalizer-mode human (new
guidance for U12-absent / outlier genomes).
- Usage-info: refreshed the --load-normalizer help block to match
the actual contract (works in both modes, overrides bundle scaler).
b71f01b
Document Fisher's discriminant valley projection (3D) in Technical-algorithm
Updates the Stage 5 / valley-depth section to reflect the change from
naive 2D centroid direction to Fisher's discriminant in 3D (5'z, BPz, 3'z).
Notes why adding 3'z is a win under Fisher's reweighting (it isn't under
the naive direction).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
72911c0
Exclude duplicates from "Why Normalize?" empirical stats
Re-runs the human raw-score range table on 257,123 deduplicated introns
(excludes the ~4,299 [d]-tagged duplicates that pre-v2.4 score_info
contained; the rest of the omitted introns have NA raw scores and were
already excluded). Numbers move by at most 0.1, since duplicates are only
~1.7% of the dataset, but the deduplicated count is the methodologically
correct one.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
c9838e3
Ground "Why Normalize?" example ranges in real human data
Replaces the ballpark per-region score ranges with empirical numbers
from scoring all 261,422 human (GRCh38 + Ensembl 104) introns after
background correction. The 5'SS range in particular was understated in
the old text: it spans far more than "-50 to +10" and is heavily
negative-biased (median ~-41) because most introns are U2-type.
Clarifies the rationale for normalization (5'SS would dominate the
kernel; all regions land on comparable scales after RobustScaler).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
f200013
Align v2.4 / v2.4.1 wiki claims with the actual training corpus
- Correct "97 species" → 90 training species + 7 evaluation-only holdouts
(5 recall + 2 protist) across About, Technical-algorithm, and
Training-data-and-PWMs
- Update the AT-AA paragraph to reflect the two-stage screen result
(moderate 5'SS / 3'SS discrimination, BPS at noise floor; recall
impact is academic so PWM addition is deferred until ≥1 other nc
subtype passes the same screen)
- Note the v2.4.1 bundling of the multispecies training set in
Technical-algorithm
- Reflect 126-model default in algorithm overview
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5c7aa00
Update Training-data-and-PWMs for v2.4.1 corpus bundling
The v3 multispecies training corpus is now bundled as the default for
intronIC train, with the legacy human-anchored v2.3 set retained as
*_human.introns.iic.gz. Documents the new layout, flank length (50 bp),
and the override flags for opting into the legacy set.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
78c8ea9
Wiki: add empirical PWM-rebuild evaluation note + bundling decision
PWMs section: document that the question "should we rebuild PWMs from
multispecies data" was actually evaluated, not just deferred by
omission. The pwm_analysis pipeline built fresh per-subtype PFMs from
the multispecies gold standard and compared to the bundled defaults
via Hellinger distance — divergences were small (~0.04-0.08 mean),
consensus base unchanged at every diverging position, and cross-clade
variation only 1.2-1.7× the within-clade noise. The AT-AA U12
subtype (76 IPA-validated introns, Hellinger ~0.13-0.17 at 3'SS) is
flagged as a deferred candidate for its own PWM, pending a sanity
check that AT-AA acceptors are real biology vs annotation artifacts.
Training corpus section: name the actual subset to be bundled — the
41,333-row training set the v3 model fits on (10,003 U12 + 31,330 U2),
not the broader 275k post-singleton-decay corpus. Targeted at v2.4.1.
6f0223f
Wiki: align training-data, classifier, and resource-usage pages with v2.4
Across pages: replace v2.3-era "human-only training" framing with the
v3 multispecies default; replace stale "~85% memory savings" /
"~2 GB vs ~12 GB" memory claims with the measured v2.4 figures
(~5.4 GB streaming vs ~10.1 GB in-memory on full human at -p 6); call
out bit-identical streaming/in-memory equivalence.
Per page:
- Technical-algorithm: reframe Species-Specific Background Correction
for v2.4 — its primary role shifts from "fix human-only model bias"
to "inference-time robustness layer for out-of-distribution species,"
with a note that the v3 corpus was scored with BG on so disabling at
inference creates a train/inference distribution mismatch. Update
Normalization Modes table to reflect that the v3 default has no saved
scaler and "auto" therefore falls through to "adaptive."
- Training-data-and-PWMs: clarify that the v3 multispecies training
corpus is not bundled (only the trained model is), and that
intronIC train still loads the v2.3 reference sets by default.
- About: drop the inaccurate "linear SVM" / "Platt scaling" wording —
v2.3+ uses an RBF SVM with isotonic calibration as the default
(Platt as cross-validated fallback).
- Home, Overview, Example-usage, Usage-info: update memory and runtime
numbers to match v2.4 reference benchmarks.
262b40c
Wiki: align resource usage with v2.4 streaming/in-memory equivalence
- Replace stale Streaming/Standard-mode memory and runtime estimates
(the old "~2-3 GB peak / ~6-10 min" line predates the v2.4
multispecies default model and the per-contig parallel pipeline).
- Add a reference benchmark table on full human GRCh38 + Ensembl 104
(~227k introns, -p 6): streaming ~16 min / 5.4 GB peak, in-memory
~15 min / 10.1 GB peak.
- Call out bit-identical equivalence between --streaming (default) and
--in-memory; mode choice is now purely a runtime/memory tradeoff.
- Note that --sequences and --bed input modes feed the in-memory path.
2dc2ebc
Wiki: align with v2.4 (multispecies default, threshold 90, streaming + v3)
- Home.md / Overview.md: update version to 2.4, model from 42- to
126-model multispecies ensemble, threshold from 95 to 90.
- Quick-start.md / Example-usage.md / Output-files.md /
Training-data-and-PWMs.md: same numerical updates.
- Technical-algorithm.md: rewrite "Training the Default Model" to
describe the v3 multispecies corpus (41,333 introns, 97 species,
14 clades, F1 = 1.000 vs 0.9975 for v2.3, FPR comparison on ~330k
scored introns). Document that streaming-classify supports both v2.3
and v3 bundles via the new per-contig adaptive-fit pre-pass.
Fix KaTeX rendering of the valley-depth formula.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6e3381d
Standardize U12/U2 terminology to U12-type/U2-type throughout
All public-facing references to intron types, PWMs, reference sets,
and scoring concepts now use the formal "U12-type"/"U2-type" suffix.
Internal format strings (motif columns, LaTeX formulas) and CLI help
text unchanged.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
9985135
Update wiki for v2.3.0: 6D features, 42-model ensemble, score adjustment
- Home: version 2.3, updated feature list
- Overview: 6D/42-model pipeline, 95% threshold, score adjustment stage
- Technical-algorithm: 6D feature space, BG correction section, score
adjustment section with formula/config, updated hyperparams and ensemble
- Output-files: 32 columns (added adjusted_score, ensemble_sigma),
adjusted score subsection, rel_score = adjusted_score - threshold
- Training-data: version labels to v2.3.0
- Usage-info: threshold default 95
- Example-usage: config/log examples updated
- Quick-start: threshold 95%, adjusted probability language
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2c35fcb
Technical details: add confident U12-type counts to valley depth examples
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
f4db309
Technical details: document species-level cluster validation (valley depth)
Describe the multi-bandwidth density valley detection algorithm, the
valley depth metric, interpretation guidelines with example species
values, and the warning message for no-valley cases.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2f0a0eb
Technical details: add reference column to scoring regions table
Clarify that 5'SS coordinates are relative to the intron 5' end while
BPS and 3'SS coordinates are relative to the intron 3' end.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
88f57c5
Overview: cite Larue & Roy 2023 (WtMTA) in scientific background
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
e64de83
Training data: remove implementation detail about duplicate removal
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8977a79
Technical details: note default settings work well for U12-absent species
The v2.2 model produces zero confident FPs in C. elegans with default
settings, so the prior adjustment is unlikely to be needed by most users.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3c89967
Technical details: improve parallelization wording
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
7a3981c
Fix WtMTA citation: add Larue & Roy 2023, keep Moyer et al. 2020
The WtMTA database paper is Larue & Roy 2023 (NAR 51:10884-10908),
separate from the original intronIC paper (Moyer et al. 2020).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
01c3500
Italicize species names, add Burke 2018 and WtMTA citations
- Italicize C. elegans and Ascaris suum in Technical-algorithm.md
- Add Burke et al. 2018 (spliceosome profiling) to branch point references
- Add Moyer et al. 2020 (WtMTA/intronIC) to U12-type intron databases
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8c16251
Technical details: remove specific bp_scan_confidence numbers
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
d6b48f7
Technical details: clarify bp_scan_confidence values are from training data
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
92b6537
Technical details: remove unsourced specific values for bp_scan_confidence
Replace training-data-specific numbers with a qualitative description.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3ec7983
Technical details: cite Pineda & Bradley 2018 for BP position distributions
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
c806b09
Use consistent U12-type/U2-type nomenclature throughout
Replace bare "U12 introns", "U2 introns", "U12 motifs" with
"U12-type introns", "U2-type introns", "U12-type motifs" in
user-facing prose.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
9403c61
Training data: clarify U12 set is superset, not replacement
The 472-intron U12 reference set contains all 387 original introns plus 85
IPA-conserved additions. Previous text incorrectly said it "supersedes"
the original set.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cd59366
Fix CoLa-seq citation: Luo et al. 2023 → Zeng et al. 2022
The CoLa-seq branch point data is from Zeng et al. (2022) Mol Cell
82:4681-4699, not "Luo et al. 2023". Add full citation to References.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ba62132