Document .modesep.json sidecar + quality tiers in Technical-algorithm
Cover the per-species mode-separation diagnostic output and the A/B/C/F
quality-tier rubric in the wiki, so the wiki is a complete reference for
the two-pass classifier (previously only fully documented in the repo's
docs/mode_separation.md, which is gitignored). Also describe the
boundary_mass diagnostic.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update wiki for v2.7 (two-pass mode-separation + continuous discount)
Reframe the documentation around the v2.6+ two-pass architecture and the
v2.7 continuous per-intron discount:
- Technical-algorithm.md: rewrite Pipeline Overview from 5 stages to 7
(adaptive normalizer fit, first-pass classification, mode estimation +
gate, mode-separation second pass, continuous discount). Add new "Why
two passes?" rationale, first-pass / mode estimation / gate / second-pass
subsections, and a v2.7 continuous-discount section under Score
Adjustment. Reframe the legacy Bayesian valley-depth adjustment as the
gate-fail-only path. Update training-the-default-model section for the
v4_aug + v5_modesep_aug bundle (including HP optimality verification).
- Overview.md, Home.md, About.md: drop "five-stage" / single-ensemble
framing; describe two-pass + continuous discount.
- Output-files.md: reframe adjusted_score as the v2.7 calling column;
add v2.6 (first_pass_svm, modesep_route) and v2.7 (raw_sum,
svm_vs_naive, voting_frac) columns.
- Quick-start.md, Example-usage.md: refresh benchmark to v2.7 (HomSap
~40 min / ~5.3 GB at -p 5, RefSeq GFF, 257k scored introns; DroMel
~8 min / ~0.8 GB). In-memory not re-measured for v2.7.
- Usage-info.md: threshold default 95 -> 90 (correcting stale line).
- Training-data-and-PWMs.md: update preamble for the v2.6+ two-pass
bundle (v4_aug + v5_modesep_aug; 502K-row second-pass corpus).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update normalizer docs for v2.4.2
Three pages described scaler / --load-normalizer / --normalizer-mode
behavior that had drifted between v2.4.0 and v2.4.2:
- Technical-algorithm: rewrote Normalization Modes to describe the
bundled multispecies fallback scaler (new in v2.4.2), the auto/
adaptive small-input fall-through at MIN_ADAPTIVE_INTRONS=200, and
--load-normalizer's role across streaming + in-memory paths. Added
a forward reference from the v3 default-model section.
- Example-usage: split "Custom normalization" into two cases —
reproducible saved-scaler workflow (unchanged) and forcing the
bundled multispecies scaler via --normalizer-mode human (new
guidance for U12-absent / outlier genomes).
- Usage-info: refreshed the --load-normalizer help block to match
the actual contract (works in both modes, overrides bundle scaler).
Document Fisher's discriminant valley projection (3D) in Technical-algorithm
Updates the Stage 5 / valley-depth section to reflect the change from
naive 2D centroid direction to Fisher's discriminant in 3D (5'z, BPz, 3'z).
Notes why adding 3'z is a win under Fisher's reweighting (it isn't under
the naive direction).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Exclude duplicates from "Why Normalize?" empirical stats
Re-runs the human raw-score range table on 257,123 deduplicated introns
(excluding the ~4,299 [d]-tagged duplicates that pre-v2.4 score_info
contained; the remaining omitted introns have NA raw scores and were
already excluded).
Numbers move by at most 0.1 — duplicates are ~1.7% of the dataset — but
the deduplicated count is the methodologically correct one.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ground "Why Normalize?" example ranges in real human data
Replaces the ballpark per-region score ranges with empirical numbers
from scoring all 261,422 human (GRCh38 + Ensembl 104) introns post
background correction. The 5'SS in particular was understated in the
old text — it spans far more than "-50 to +10" and is heavily
negative-biased (median ~-41) because most introns are U2-type.
Clarifies the rationale for normalization (5'SS would dominate the
kernel; all regions land on comparable scales after RobustScaler).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Align v2.4 / v2.4.1 wiki claims with the actual training corpus
- Correct "97 species" → 90 training species + 7 evaluation-only holdouts
(5 recall + 2 protist) across About, Technical-algorithm, and
Training-data-and-PWMs
- Update the AT-AA paragraph to reflect the two-stage screen result
(moderate 5'SS / 3'SS discrimination, BPS at noise floor; recall
impact is academic, so PWM addition is deferred until ≥1 other
non-canonical subtype passes the same screen)
- Note the v2.4.1 bundling of the multispecies training set in
Technical-algorithm
- Reflect 126-model default in algorithm overview
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wiki: align training-data, classifier, and resource-usage pages with v2.4
Across pages: replace v2.3-era "human-only training" framing with the
v3 multispecies default; replace stale "~85% memory savings" /
"~2 GB vs ~12 GB" memory claims with the measured v2.4 figures
(~5.4 GB streaming vs ~10.1 GB in-memory on full human at -p 6); call
out bit-identical streaming/in-memory equivalence.
Per page:
- Technical-algorithm: reframe Species-Specific Background Correction
for v2.4 — its primary role shifts from "fix human-only model bias"
to "inference-time robustness layer for out-of-distribution species,"
with a note that the v3 corpus was scored with BG on so disabling at
inference creates a train/inference distribution mismatch. Update
Normalization Modes table to reflect that the v3 default has no saved
scaler and "auto" therefore falls through to "adaptive."
- Training-data-and-PWMs: clarify that the v3 multispecies training
corpus is not bundled (only the trained model is), and that
intronIC train still loads the v2.3 reference sets by default.
- About: drop the inaccurate "linear SVM" / "Platt scaling" wording —
v2.3+ uses an RBF SVM with isotonic calibration as the default
(Platt as cross-validated fallback).
- Home, Overview, Example-usage, Usage-info: update memory and runtime
numbers to match v2.4 reference benchmarks.
Wiki: align resource usage with v2.4 streaming/in-memory equivalence
- Replace stale Streaming/Standard-mode memory and runtime estimates
(the old "~2-3 GB peak / ~6-10 min" line predates the v2.4
multispecies default model and the per-contig parallel pipeline).
- Add a reference benchmark table on full human GRCh38 + Ensembl 104
(~227k introns, -p 6): streaming ~16 min / 5.4 GB peak, in-memory
~15 min / 10.1 GB peak.
- Call out bit-identical equivalence between --streaming (default) and
--in-memory; mode choice is now purely a runtime/memory tradeoff.
- Note that --sequences and --bed input modes feed the in-memory path.
Wiki: align with v2.4 (multispecies default, threshold 90, streaming + v3)
- Home.md / Overview.md: update version to 2.4, model from 42- to
126-model multispecies ensemble, threshold from 95 to 90.
- Quick-start.md / Example-usage.md / Output-files.md /
Training-data-and-PWMs.md: same numerical updates.
- Technical-algorithm.md: rewrite "Training the Default Model" to
describe the v3 multispecies corpus (41,333 introns, 97 species,
14 clades; F1 = 1.000 vs 0.9975 for v2.3; FPR comparison on ~330k
scored introns). Document that streaming-classify supports both v2.3
and v3 bundles via the new per-contig adaptive-fit pre-pass.
Fix the valley-depth math formula KaTeX rendering.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Standardize U12/U2 terminology to U12-type/U2-type throughout
All public-facing references to intron types, PWMs, reference sets,
and scoring concepts now use the formal "U12-type"/"U2-type" suffix.
Internal format strings (motif columns, LaTeX formulas) and CLI help
text unchanged.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update wiki for v2.3.0: 6D features, 42-model ensemble, score adjustment
- Home: version 2.3, updated feature list
- Overview: 6D/42-model pipeline, 95% threshold, score adjustment stage
- Technical-algorithm: 6D feature space, BG correction section, score
adjustment section with formula/config, updated hyperparams and ensemble
- Output-files: 32 columns (added adjusted_score, ensemble_sigma),
adjusted score subsection, rel_score = adjusted_score - threshold
- Training-data: version labels to v2.3.0
- Usage-info: threshold default 95
- Example-usage: config/log examples updated
- Quick-start: threshold 95%, adjusted probability language
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Technical details: add confident U12-type counts to valley depth examples
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Technical details: document species-level cluster validation (valley depth)
Describe the multi-bandwidth density valley detection algorithm, the
valley depth metric, interpretation guidelines with example species
values, and the warning message for no-valley cases.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Technical details: add reference column to scoring regions table
Clarify that 5'SS coordinates are relative to the intron 5' end while
BPS and 3'SS coordinates are relative to the intron 3' end.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Technical details: note default settings work well for U12-absent species
The v2.2 model produces zero confident FPs in C. elegans with default
settings, so the prior adjustment is unlikely to be needed by most users.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Technical details: improve parallelization wording
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix WtMTA citation: add Larue & Roy 2023, keep Moyer et al. 2020
The WtMTA database paper is Larue & Roy 2023 (NAR 51:10884-10908),
separate from the original intronIC paper (Moyer et al. 2020).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Italicize species names, add Burke 2018 and WtMTA citations
- Italicize C. elegans and Ascaris suum in Technical-algorithm.md
- Add Burke et al. 2018 (spliceosome profiling) to branch point references
- Add Moyer et al. 2020 (WtMTA/intronIC) to U12-type intron databases
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Technical details: remove specific bp_scan_confidence numbers
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Technical details: clarify bp_scan_confidence values are from training data
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Technical details: remove unsourced specific values for bp_scan_confidence
Replace training-data-specific numbers with qualitative description.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Technical details: cite Pineda & Bradley 2018 for BP position distributions
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use consistent U12-type/U2-type nomenclature throughout
Replace bare "U12 introns", "U2 introns", "U12 motifs" with
"U12-type introns", "U2-type introns", "U12-type motifs" in
user-facing prose.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix CoLa-seq citation: Luo et al. 2023 → Zeng et al. 2022
The CoLa-seq branch point data is from Zeng et al. (2022) Mol Cell
82:4681-4699, not "Luo et al. 2023". Add full citation to References.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix technical discrepancies found during wiki-vs-code audit
- Overview: fix "Linear SVM" → RBF SVM, update BP distance range to 10-15 nt
- Output-files: fix log base (log_10 → log_2), fix awk filter ("." → "NA"),
fix frac_pos column index ($11 → $12), update attributes format to verbose
strings, remove outdated score_info example line, remove -s flag reference
- Technical-algorithm: add StandardScaler note, harmonize memory estimates,
fix BP search region length description
- Usage-info: remove defunct -s flag, note CLI vs config.yaml default
differences for scoring coords
- Quick-start: harmonize memory estimate with Technical-algorithm page
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Technical details: fix PWM fallback description to cover both U12 and U2
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Technical details: clarify PWM selection and scoring algorithm
Document per-intron dinucleotide-based PWM selection, U2 fallback
masking, and the two-step BPS scoring process (position selection
with U12 PWM, then log-ratio at the same position with both PWMs).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Technical details: add U2 AT-AC PWMs to matrix listing
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update wiki for v2.2.0: 8D RBF SVM default model
- Home: update version reference
- Technical details: document 8D feature set with linear coefficient reference,
RBF kernel, isotonic calibration, non-overlapping scoring regions, BPS scan
confidence metric, updated training data and evaluation results
- Training data and PWMs: document expanded reference sets (472 U12 + 30,155 U2),
CoLa-seq BPS PWMs with reference_offset, U2 AT-AC PWMs
- Output files: update score_info.iic column listing (30 columns), add bp_offset
and attributes columns to meta.iic
- Overview: update classification mode description
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>