History

Revisions

  • Document .modesep.json sidecar + quality tiers in Technical-algorithm Cover the per-species mode-separation diagnostic output and the A/B/C/F quality-tier rubric in the wiki, so the wiki is a complete reference for the two-pass classifier (previously only fully documented in the repo's docs/mode_separation.md, which is gitignored). Also describe the boundary_mass diagnostic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

    @glarue glarue committed May 20, 2026
    50ba416
  • Update wiki for v2.7 (two-pass mode-separation + continuous discount)
    Reframe the documentation around the v2.6+ two-pass architecture and the v2.7 continuous per-intron discount:
    - Technical-algorithm.md: rewrite Pipeline Overview from 5 stages to 7 (adaptive normalizer fit, first-pass classification, mode estimation + gate, mode-separation second pass, continuous discount). Add new "Why two passes?" rationale, first-pass / mode estimation / gate / second-pass subsections, and a v2.7 continuous-discount section under Score Adjustment. Reframe the legacy Bayesian valley-depth adjustment as the gate-fail-only path. Update training-the-default-model section for the v4_aug + v5_modesep_aug bundle (including HP optimality verification).
    - Overview.md, Home.md, About.md: drop "five-stage" / single-ensemble framing; describe two-pass + continuous discount.
    - Output-files.md: reframe adjusted_score as the v2.7 calling column; add v2.6 (first_pass_svm, modesep_route) and v2.7 (raw_sum, svm_vs_naive, voting_frac) columns.
    - Quick-start.md, Example-usage.md: refresh benchmark to v2.7 (HomSap ~40 min / ~5.3 GB at -p 5, RefSeq GFF, 257k scored introns; DroMel ~8 min / ~0.8 GB). In-memory not re-measured for v2.7.
    - Usage-info.md: threshold default 95 -> 90 (correcting stale line).
    - Training-data-and-PWMs.md: update preamble for the v2.6+ two-pass bundle (v4_aug + v5_modesep_aug; 502K-row second-pass corpus).
    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

    @glarue glarue committed May 20, 2026
    8462d15
  • Update normalizer docs for v2.4.2
    Three pages mentioned scaler / --load-normalizer / --normalizer-mode behavior that drifted between v2.4.0 and v2.4.2:
    - Technical-algorithm: rewrote Normalization Modes to describe the bundled multispecies fallback scaler (new in v2.4.2), the auto/adaptive small-input fall-through at MIN_ADAPTIVE_INTRONS=200, and --load-normalizer's role across streaming + in-memory paths. Added a forward reference from the v3 default-model section.
    - Example-usage: split "Custom normalization" into two cases: reproducible saved-scaler workflow (unchanged) and forcing the bundled multispecies scaler via --normalizer-mode human (new guidance for U12-absent / outlier genomes).
    - Usage-info: refreshed the --load-normalizer help block to match the actual contract (works in both modes, overrides bundle scaler).

    @glarue glarue committed May 10, 2026
    b71f01b
  • Document Fisher's discriminant valley projection (3D) in Technical-algorithm Updates the Stage 5 / valley-depth section to reflect the change from naive 2D centroid direction to Fisher's discriminant in 3D (5'z, BPz, 3'z). Notes why adding 3'z is a win under Fisher's reweighting (it isn't under the naive direction). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

    @glarue glarue committed May 10, 2026
    72911c0
  • Exclude duplicates from "Why Normalize?" empirical stats Re-runs the human raw-score range table on 257,123 deduplicated introns (excluded ~4,299 [d]-tagged duplicates that pre-v2.4 score_info contained; rest of the omitted introns have NA raws and were already excluded). Numbers move by at most 0.1 — duplicates are ~1.7% of the dataset — but the deduplicated count is the methodologically correct one. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

    @glarue glarue committed May 10, 2026
    c9838e3
  • Ground "Why Normalize?" example ranges in real human data Replaces the ballpark per-region score ranges with empirical numbers from scoring all 261,422 human (GRCh38 + Ensembl 104) introns post background correction. The 5'SS in particular was understated in the old text — it spans far more than "-50 to +10" and is heavily negative-biased (median ~-41) because most introns are U2-type. Clarifies the rationale for normalization (5'SS would dominate the kernel; all regions land on comparable scales after RobustScaler). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

    @glarue glarue committed May 10, 2026
    f200013
  • Align v2.4 / v2.4.1 wiki claims with the actual training corpus
    - Correct "97 species" → 90 training species + 7 evaluation-only holdouts (5 recall + 2 protist) across About, Technical-algorithm, and Training-data-and-PWMs
    - Update the AT-AA paragraph to reflect the two-stage screen result (moderate 5'SS / 3'SS discrimination, BPS at noise floor; recall impact is academic so PWM addition is deferred until ≥1 other nc subtype passes the same screen)
    - Note the v2.4.1 bundling of the multispecies training set in Technical-algorithm
    - Reflect 126-model default in algorithm overview
    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

    @glarue glarue committed May 10, 2026
    5c7aa00
  • Update Training-data-and-PWMs for v2.4.1 corpus bundling The v3 multispecies training corpus is now bundled as the default for intronIC train, with the legacy human-anchored v2.3 set retained as *_human.introns.iic.gz. Documents the new layout, flank length (50 bp), and the override flags for opting into the legacy set. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

    @glarue glarue committed May 10, 2026
    78c8ea9
  • Wiki: add empirical PWM-rebuild evaluation note + bundling decision
    PWMs section: document that the question "should we rebuild PWMs from multispecies data" was actually evaluated, not just deferred by omission. The pwm_analysis pipeline built fresh per-subtype PFMs from the multispecies gold standard and compared to the bundled defaults via Hellinger distance: divergences were small (~0.04-0.08 mean), consensus base unchanged at every diverging position, and cross-clade variation only 1.2-1.7× the within-clade noise. The AT-AA U12 subtype (76 IPA-validated introns, Hellinger ~0.13-0.17 at 3SS) is flagged as a deferred candidate for its own PWM, pending a sanity check that AT-AA acceptors are real biology vs annotation artifacts.
    Training corpus section: name the actual subset to be bundled: the 41,333-row training set the v3 model fits on (10,003 U12 + 31,330 U2), not the broader 275k post-singleton-decay corpus. Targeted at v2.4.1.

    @glarue glarue committed May 10, 2026
    6f0223f
  • Wiki: align training-data, classifier, and resource-usage pages with v2.4
    Across pages: replace v2.3-era "human-only training" framing with the v3 multispecies default; replace stale "~85% memory savings" / "~2 GB vs ~12 GB" memory claims with the measured v2.4 figures (~5.4 GB streaming vs ~10.1 GB in-memory on full human at -p 6); call out bit-identical streaming/in-memory equivalence.
    Per page:
    - Technical-algorithm: reframe Species-Specific Background Correction for v2.4: its primary role shifts from "fix human-only model bias" to "inference-time robustness layer for out-of-distribution species," with a note that the v3 corpus was scored with BG on so disabling at inference creates a train/inference distribution mismatch. Update Normalization Modes table to reflect that the v3 default has no saved scaler and "auto" therefore falls through to "adaptive."
    - Training-data-and-PWMs: clarify that the v3 multispecies training corpus is not bundled (only the trained model is), and that intronIC train still loads the v2.3 reference sets by default.
    - About: drop the inaccurate "linear SVM" / "Platt scaling" wording; v2.3+ uses an RBF SVM with isotonic calibration as the default (Platt as cross-validated fallback).
    - Home, Overview, Example-usage, Usage-info: update memory and runtime numbers to match v2.4 reference benchmarks.

    @glarue glarue committed May 10, 2026
    262b40c
  • Wiki: align resource usage with v2.4 streaming/in-memory equivalence
    - Replace stale Streaming/Standard-mode memory and runtime estimates (the old "~2-3 GB peak / ~6-10 min" line predates the v2.4 multispecies default model and the per-contig parallel pipeline).
    - Add a reference benchmark table on full human GRCh38 + Ensembl 104 (~227k introns, -p 6): streaming ~16 min / 5.4 GB peak, in-memory ~15 min / 10.1 GB peak.
    - Call out bit-identical equivalence between --streaming (default) and --in-memory; mode choice is now purely a runtime/memory tradeoff.
    - Note that --sequences and --bed input modes feed the in-memory path.

    @glarue glarue committed May 10, 2026
    2dc2ebc
  • Wiki: align with v2.4 (multispecies default, threshold 90, streaming + v3)
    - Home.md / Overview.md: update version to 2.4, model from 42- to 126-model multispecies ensemble, threshold from 95 to 90.
    - Quick-start.md / Example-usage.md / Output-files.md / Training-data-and-PWMs.md: same numerical updates.
    - Technical-algorithm.md: rewrite "Training the Default Model" to describe the v3 multispecies corpus (41,333 introns, 97 species, 14 clades, F1 = 1.000 vs v2.3 0.9975, ~330k scored introns FPR comparison). Document that streaming-classify supports both v2.3 and v3 bundles via the new per-contig adaptive-fit pre-pass. Fix the valley-depth math formula KaTeX rendering.
    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

    @glarue glarue committed May 9, 2026
    6e3381d
  • Standardize U12/U2 terminology to U12-type/U2-type throughout All public-facing references to intron types, PWMs, reference sets, and scoring concepts now use the formal "U12-type"/"U2-type" suffix. Internal format strings (motif columns, LaTeX formulas) and CLI help text unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 24, 2026
    9985135
  • Update wiki for v2.3.0: 6D features, 42-model ensemble, score adjustment
    - Home: version 2.3, updated feature list
    - Overview: 6D/42-model pipeline, 95% threshold, score adjustment stage
    - Technical-algorithm: 6D feature space, BG correction section, score adjustment section with formula/config, updated hyperparams and ensemble
    - Output-files: 32 columns (added adjusted_score, ensemble_sigma), adjusted score subsection, rel_score = adjusted_score - threshold
    - Training-data: version labels to v2.3.0
    - Usage-info: threshold default 95
    - Example-usage: config/log examples updated
    - Quick-start: threshold 95%, adjusted probability language
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 24, 2026
    2c35fcb
  • Technical details: add confident U12-type counts to valley depth examples Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 13, 2026
    f4db309
  • Technical details: document species-level cluster validation (valley depth) Describe the multi-bandwidth density valley detection algorithm, the valley depth metric, interpretation guidelines with example species values, and the warning message for no-valley cases. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 13, 2026
    2f0a0eb
  • Technical details: add reference column to scoring regions table Clarify that 5'SS coordinates are relative to the intron 5' end while BPS and 3'SS coordinates are relative to the intron 3' end. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 13, 2026
    88f57c5
  • Overview: cite Larue & Roy 2023 (WtMTA) in scientific background Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 13, 2026
    e64de83
  • Training data: remove implementation detail about duplicate removal Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 13, 2026
    8977a79
  • Technical details: note default settings work well for U12-absent species The v2.2 model produces zero confident FPs in C. elegans with default settings, so the prior adjustment is unlikely to be needed by most users. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 13, 2026
    3c89967
  • Technical details: improve parallelization wording Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 13, 2026
    7a3981c
  • Fix WtMTA citation: add Larue & Roy 2023, keep Moyer et al. 2020 The WtMTA database paper is Larue & Roy 2023 (NAR 51:10884-10908), separate from the original intronIC paper (Moyer et al. 2020). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 13, 2026
    01c3500
  • Italicize species names, add Burke 2018 and WtMTA citations
    - Italicize C. elegans and Ascaris suum in Technical-algorithm.md
    - Add Burke et al. 2018 (spliceosome profiling) to branch point references
    - Add Moyer et al. 2020 (WtMTA/intronIC) to U12-type intron databases
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 13, 2026
    8c16251
  • Technical details: remove specific bp_scan_confidence numbers Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 12, 2026
    d6b48f7
  • Technical details: clarify bp_scan_confidence values are from training data Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 12, 2026
    92b6537
  • Technical details: remove unsourced specific values for bp_scan_confidence Replace training-data-specific numbers with qualitative description. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 12, 2026
    3ec7983
  • Technical details: cite Pineda & Bradley 2018 for BP position distributions Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 12, 2026
    c806b09
  • Use consistent U12-type/U2-type nomenclature throughout Replace bare "U12 introns", "U2 introns", "U12 motifs" with "U12-type introns", "U2-type introns", "U12-type motifs" in user-facing prose. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 12, 2026
    9403c61
  • Training data: clarify U12 set is superset, not replacement The 472 U12 reference set contains all 387 original introns plus 85 IPA-conserved additions. Previous text incorrectly said it "supersedes" the original set. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 12, 2026
    cd59366
  • Fix CoLa-seq citation: Luo et al. 2023 → Zeng et al. 2022 The CoLa-seq branch point data is from Zeng et al. (2022) Mol Cell 82:4681-4699, not "Luo et al. 2023". Add full citation to References. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    @glarue glarue committed Apr 12, 2026
    ba62132