-
Notifications
You must be signed in to change notification settings - Fork 1
Technical algorithm
This page provides a detailed technical description of intronIC's algorithm, data flow, and machine learning architecture.
Input: Genome (FASTA) + Annotation (GFF3/GTF) + Species Name
Note: By default, intronIC uses streaming mode, which processes one chromosome at a time. Streaming and in-memory modes produce bit-identical classifications since v2.4 — the choice is purely a memory tradeoff (roughly 50% less RSS on full human at -p 6 for streaming).
-
Stage 1: Intron Extraction
- Parse annotation hierarchy (gene → transcript → CDS/exon)
- Infer intron coordinates from exon gaps
- Extract intron + flanking exon sequences from genome
- Filter duplicates, short introns, isoforms (configurable)
-
Stage 2: PWM Scoring
- Score 5' splice site, branch point, 3' splice site regions
- Calculate log-odds ratios:
$\log\left(\frac{P(\text{seq}|\text{U12})}{P(\text{seq}|\text{U2})}\right)$ - Select best branch point position from search window
-
Stage 3: Normalization
- Convert raw log-odds to z-scores via robust scaling
- Fit on reference data or adapt to experimental data
-
Stage 4: Classification
- 6D feature vector: 3 z-scores + support2 + bp_offset + bp_scan_confidence
- RBF SVM ensemble with balanced class weights (126 models in the v2.4 default = 3 seeds × 42 sub-models; v2.3 default was a single-seed 42-model ensemble)
- Probability calibration via isotonic regression
- Output: per-intron
$P(\text{U12-type})$ and ensemble$\sigma$
-
Stage 5: Score Adjustment
- Species-level valley prior from KDE bimodality along Fisher's discriminant of (5'z, BPz, 3'z)
- Per-intron ensemble disagreement penalty
- Adjusted probability → threshold (default 90%)
Output: .meta.iic, .bed.iic, .introns.iic, .score_info.iic, plots
Introns are inferred from gaps between consecutive exons (or CDS features) within the same transcript:
Exon 1 Intron 1 Exon 2 Intron 2 Exon 3
[==]------------------------------[==]------------------------------[==]
1-100 101-1600 1601-1700 1701-3200 3201-3300
Priority: CDS features are preferred over exon features when available, as they enable phase calculation.
Mixed CDS/Exon Handling: For transcripts with both CDS and exon features, introns are first generated from CDS features, then exon-only introns (typically in UTR regions) are added if they don't overlap existing CDS-derived introns. All introns are sorted by genomic position and assigned sequential indices, ensuring proper ordering (e.g., 5' UTR introns are numbered before CDS introns in coding direction).
Touching Exons (Annotation Artifacts): Some annotations contain adjacent exon features with no gap between them (zero-length "introns"). These are silently skipped and not included in the intron count (family_size). The intron index sequence remains contiguous (1, 2, 3...) with no gaps for these annotation artifacts.
| Filter | Default | Description |
|---|---|---|
| Duplicates | Exclude | Same coordinates from multiple isoforms |
| Longest isoform | Keep only | Can include all with -i
|
| Minimum length | 30 bp | Adjustable via --min-intron-len
|
| Ambiguous bases | Exclude | 'N' in scoring regions |
| Non-canonical | Include | Exclude with --no-nc
|
Position weight matrices (PWMs) capture the probability of observing each nucleotide at each position in a motif. intronIC includes PWMs for:
- U12-type: AT-AC and GT-AG subtypes
- U2-type: GT-AG, GC-AG, and AT-AC subtypes
Each subtype has PWMs for all three regions (5' splice site, branch point, 3' splice site). The U2-type AT-AC PWMs prevent inflated log-ratio scores when scoring AT-AC introns, where the U2-type fallback to GT-AG matrices would produce artificially low U2-type scores at the dinucleotide positions.
Default scoring windows:
| Region | Relative to | Start | End | Length | Description |
|---|---|---|---|---|---|
| 5' SS | Intron 5' end | -3 | +9 | 12 bp | Includes last 3 bp of upstream exon |
| Branch point | Intron 3' end | -55 | -5 (clamped at 3'SS boundary) | up to 50 bp | Search window, clamped to exclude 3'SS region |
| 3' SS | Intron 3' end | -6 | +4 | 10 bp | Core acceptor only (excludes PPT) |
Non-overlapping regions: The branch point search window is clamped at the 3'SS scoring boundary to prevent feature overlap. For a 100 bp intron, the BP search region covers positions 45–93 and the 3'SS covers 94–100.
For each intron, the terminal dinucleotides (e.g. GT-AG, AT-AC, GC-AG) determine which PWM subtype is used. Both the U12-type and U2-type PWMs are selected to match the intron's dinucleotide class:
- A GT-AG intron is scored with U12-type GT-AG and U2-type GT-AG matrices
- An AT-AC intron is scored with U12-type AT-AC and U2-type AT-AC matrices
- A GC-AG intron is scored with U12-type GT-AG (fallback) and U2-type GC-AG matrices
When either model lacks a direct match for the dinucleotide class (e.g. GC-AG introns use U12-type GT-AG as a fallback, rare boundary types fall back to GT-AG for both models), the fallback PWM's terminal dinucleotide positions are masked so that the score reflects the surrounding motif context rather than a dinucleotide mismatch penalty.
For each region, the raw score is a log-odds ratio computed by scoring the same sequence with both the U12-type and U2-type PWMs:
Where:
-
$b_i$ is the nucleotide at position$i$ - The product is taken over all positions in the scoring window
- Higher positive values favor U12-type
- Higher negative values favor U2-type
- Zero means equally likely under both models
For the 5'SS and 3'SS regions, the scoring window is fixed (see table above) and the log-ratio is computed directly.
For the branch point region, position selection and scoring are separate steps:
- Position selection: The search window is scanned with a sliding window using the U12-type BPS PWM only. The position with the highest raw U12-type probability product is selected as the putative branch point.
- Log-ratio scoring: The same sequence at the selected position is then scored with both the U12-type and U2-type BPS PWMs, and the log-ratio is computed. This ensures the ratio compares both models at the same location.
The branch point adenosine position (bp_offset) is reported relative to the 3' splice site, derived from the PWM's defined reference position (biological position 0 in the matrix). U12-type branch points cluster tightly at -10 to -15 nt from the 3'SS (median -13), while U2-type branch points are more dispersed (-20 to -35 nt) (Pineda & Bradley 2018).
The scan also computes a BPS confidence score (bp_scan_confidence): log2(best_score / mean_score) across all scan positions. This measures how sharply the best match stands out from background — high values indicate a well-defined motif (U12-like), low values indicate a flat landscape (U2-like).
Before normalization, intronIC optionally corrects the U2-type denominator PWMs using the species' own nucleotide composition.
In intronIC ≤ v2.3 this step was load-bearing: the default model was trained on human reference data, so non-human species with different composition produced systematically shifted z-score distributions. Background correction reanchored the U2-type log-odds against the target species' own distribution, recovering calibration.
In v2.4 the default model is the v3 multispecies SVM, trained on 41,333 introns drawn from 90 species across 14 broad clades. The original justification for background correction — adapting a human-only model to non-human composition — no longer applies in the same way. Background correction now plays the role of an inference-time robustness layer for out-of-distribution species: for genomes deep inside the training distribution it is largely redundant with what the multispecies model and the adaptive RobustScaler already handle, but for extreme-composition genomes (e.g., very AT-rich or GC-rich) or clades not represented in training, it remains the cheapest way to keep U2-type log-odds well-anchored. Because the v3 corpus was itself scored with background correction on, disabling it at inference would create a train/inference distribution mismatch and is not recommended for the bundled model.
The correction blends empirical per-position nucleotide frequencies from the target species' intron pool into the U2-type PWMs:
The blending weight is computed per dinucleotide subtype (GT-AG, GC-AG, AT-AC):
Where
Background correction is enabled by default. The key parameter is n0 (Bayesian shrinkage strength); use the same value at classification as was used to score the training corpus (default 1000) for the bundled model.
scoring:
species_background:
enabled: true
n0: 1000 # Shrinkage strength
min_introns: 500 # Skip correction below this countRaw log-odds scores have very different ranges and distributions per region. The numbers below come from scoring 257,123 deduplicated introns extracted from Homo sapiens GRCh38 + Ensembl 104 with the bundled v2.4 default settings (post-background-correction; [d]-tagged duplicates and unscored overlap introns excluded):
| Region | median | 1st–99th %ile | IQR |
|---|---|---|---|
| 5'SS | −41.0 | −63.8 to −8.9 | 16.7 |
| BPS | −3.4 | −20.6 to +11.7 | 9.0 |
| 3'SS | −1.3 | −5.6 to +3.2 | 2.9 |
Two things to notice. First, the absolute magnitudes are very different — the 5'SS IQR alone (≈17) is larger than the 3'SS full 1st–99th %ile spread (≈9). Without normalization the SVM kernel would be dominated by the 5'SS feature. Second, all three regions are negative-biased because raw scores are
intronIC uses robust z-score normalization via sklearn's RobustScaler:
Where IQR is the interquartile range (
Key properties:
- Robust to outliers: Median and IQR are resistant to extreme values
- Centered: Distribution centered around 0 for each feature
- Comparable scales: All three regions have similar variance after scaling
- Minimal U12-type contamination: Rare U12-type introns (~0.5% of all introns) do not significantly affect robust statistics
Note: A second normalization step (sklearn StandardScaler) is applied inside the classification pipeline to the full augmented feature vector (base z-scores + composite features + extra features). This ensures all features have zero mean and unit variance before SVM classification.
| Mode | Description | Use case |
|---|---|---|
adaptive |
Refit a fresh RobustScaler on the experimental input |
v3 default; recommended for any input ≥ 200 introns |
human |
Apply a saved (frozen) scaler — the v2.3 human-only scaler in v2.3 bundles, or the bundled multispecies fallback in v3 bundles | U12-absent / outlier genomes where you want to suppress per-species shift |
auto |
Use the v2.3-style saved scaler if the bundle provides one, otherwise adaptive (with automatic fall-through to the bundled multispecies fallback when the input is too small to fit) |
Default behavior |
Bundled multispecies fallback scaler. Since v2.4.2 the v3 bundle also ships a frozen RobustScaler (v3_fallback_normalizer.pkl) fit on 476,848 raw scores from the 90-species training corpus. It is loaded automatically alongside the model and is used by:
-
auto/adaptivemode on small inputs: when fewer thanMIN_ADAPTIVE_INTRONS(200) introns are scoreable, the per-input RobustScaler fit becomes too noisy (IQR standard error ≈ 20% at n=30 vs ≈ 7% at n=200), so the pipeline transparently falls through to the bundled fallback. This restores single-intron / tiny-annotation scoring (a regression in v2.4.0–v2.4.1). -
Explicit
--normalizer-mode human: applies the bundled fallback unconditionally — useful for U12-absent genomes where you want to suppress any per-species z-score shift. -
Explicit
--load-normalizer <path>: any saved scaler (the bundled one or one you saved with--save-normalizer) overrides whatever the bundle ships and skips the adaptive pre-pass. Honored by both streaming and in-memory paths since v2.4.2.
Notes:
- The v2.3 bundle ships its own saved scaler and is unchanged — passing
--model <v2.3-bundle>reproduces v2.3 behavior exactly. - In
--streamingmode, the v3 default (auto/adaptive) fits a freshRobustScalervia a lightweight per-contig pre-pass that scores all introns through the (BG-corrected) PWMs and pools their raw 5'/BP/3' scores before classification. Pre-pass is skipped when the bundled or--load-normalizerscaler is in effect. - For reproducible normalization across multiple runs on subsets of the same genome, pass
--save-normalizeron the first run and--load-normalizer <path>on subsequent runs.
The default model uses 6 features (6D):
Base features (3D):
-
s5z: 5' splice site z-score -
BPz: Branch point z-score -
s3z: 3' splice site z-score (core acceptor, -6 to +3)
Composite feature (1D):
-
support2: second-largest of max(0, s5z), max(0, BPz), max(0, s3z) — encodes "at least two scoring regions positively support U12-type classification." Zero when only one site is positive.
Extra features (2D):
-
bp_offset: branch point adenosine distance from 3'SS (negative integer). U12-type introns cluster at -10 to -15; U2-type at -20 to -35. -
bp_scan_confidence: log2(best/mean) of U12-type PWM scores across all BP scan positions. Measures motif sharpness; U12-type introns produce sharper peaks than U2-type.
The default model uses an RBF (radial basis function) kernel SVM (sklearn's SVC):
- Kernel: RBF (nonlinear boundary, can learn local patterns)
- Hyperparameters: C = 200, γ = 0.001 (v2.4 default; previously C = 35.11, γ = 0.01 in v2.3); selected via joint grid search on the multispecies training corpus
- Class weights: Balanced to handle ~0.5% U12-type intron prevalence
The RBF kernel allows the classifier to learn curved boundaries — for example, requiring both strong donor AND strong BP for confident U12-type calls, rather than allowing one to compensate for the other as a linear model would.
The raw SVM outputs decision distances (signed distance from hyperplane). These are converted to probabilities using sklearn's CalibratedClassifierCV. During training, both sigmoid (Platt scaling) and isotonic calibration are evaluated via cross-validation, and the method with lower log-loss is selected.
- Sigmoid (Platt scaling): Fits a logistic function with 2 parameters — more conservative probability estimates, fewer borderline calls
- Isotonic: Non-parametric monotonically increasing step function — more flexible, typically selected when sufficient training data is available
The default pretrained model uses isotonic calibration.
When --n-models > 1, multiple SVMs are trained with different U2-type subsamples:
The default v2.4 model uses a 126-model ensemble built from 3 independent random seeds × 42 sub-models per seed. Each sub-model sees all U12-type references but a different random 75% subset of U2-type references. Final predictions are the mean probability across all sub-models, and the standard deviation across them (ensemble_sigma) quantifies model agreement and feeds into the score adjustment step. (The v2.3 default used a single-seed 42-model ensemble.)
As of intronIC v2.4 the default pretrained model is the v3 multispecies SVM ensemble, replacing the human-only model used through v2.3. The PWMs and the score-adjustment layer are unchanged; the difference is purely in what reference data the SVM was trained on.
Training data:
- 41,333 introns drawn from 90 species across 14 broad evolutionary clades (vertebrates, arthropods, nematodes, monocots, eudicots, fungi, basal animals, protists, and others); a further 7 species are reserved as evaluation-only holdouts
- 10,003 U12-type positives + 31,330 U2-type and hard-negative records
- Per-intron labels are assigned by comparative genomic analysis: each intron's classification is supported by orthology evidence from multiple species, rather than a single reference annotation
- As of v2.4.1 the actual training set is bundled at
u{12,2}_reference_multispecies.introns.iic.gz(full intron sequences with 50 bp flanks); the legacy human-anchored v2.3 references remain available asu{12,2}_reference_human.introns.iic.gz - Branch point positions are derived from CoLa-seq empirical data (Zeng et al. 2022); all PWMs are unchanged from v2.3
Training process:
- For each species, score all introns through the standard intronIC PWM pipeline (12 bp BPS motifs)
- Normalize using robust scaling per-species (median/IQR), preserving cross-species composition differences
- Tune hyperparameters via 100-configuration joint grid search (C × γ × U2-subsample fraction); final values are C = 200, γ = 0.001, easy_fraction = 0.75
- Train 3 random seeds × 42 sub-models = 126-model ensemble; each sub-model sees a 75% subsample of U2-type negatives plus all U12-type positives
- Calibrate probabilities via isotonic regression (selected over sigmoid by log-loss)
Evaluation:
- Held-out test set (5 species): F1 = 1.000 (precision = 1.000, recall = 1.000), vs. 0.9975 for the v2.3 default
- 10-fold leave-clade-out cross-validation: every clade F1 ≥ 0.98
- False positive rate on 6 species lacking U12-type introns (P. tetraurelia, Y. lipolytica, C. muris, A. suum, C. elegans, T. thermophila): 12 confident calls across ~330,000 scored introns at threshold 90, vs. 26 with the v2.3 default — roughly half
The v2.3 default model remains accessible by passing --model <path-to-v2.3-bundle>. Streaming mode (--streaming, the default) supports both bundle formats: v2.3 bundles use their saved frozen scaler; v3 bundles trigger a lightweight per-contig pre-pass that fits a fresh adaptive RobustScaler before classification, falling through to a bundled multispecies fallback scaler when the input is too small for a stable per-input fit (see Normalization Modes).
After classification, intronIC performs a cluster validation step to assess whether the confident U12-type calls form a genuine population distinct from U2-type introns.
The algorithm projects all intron scores onto a 1D discriminating axis in (5'z, BPz, 3'z) space, then estimates the density along this axis using kernel density estimation (KDE). The projection direction is Fisher's linear discriminant:
where
A valley depth is then computed as the proportional drop in density between the U2-type and U12-type clusters:
where
A valley depth > 0.3 is considered evidence of bimodal separation. The metric is reported in the log output along with centroid_σ (the distance between U12-type and U2-type centroids in U2-type standard deviation units).
| Valley depth | Interpretation | Example species (depth; n confident U12-type) |
|---|---|---|
| > 0.9 | Strong bimodal separation — genuine U12-type population | H. sapiens (0.94; n=741), D. melanogaster (0.94; n=18), I. scapularis (0.93; n=416) |
| 0.5–0.9 | Moderate separation — likely real but smaller U12-type population | A. castellanii (0.59; n=13), V. jacobsoni (0.75; n=5) |
| < 0.3 | No valley — U12-type calls do not form a distinct cluster | C. elegans (0.00; n=0), A. suum (0.00; n=1) |
When no valley is detected, intronIC emits a warning:
⚠ No density valley detected (depth=0.000): U12-type intron calls may not form a
distinct cluster. Consider reviewing calls with caution.
This does not prevent classification — individual intron scores are still valid — but signals that the species may lack a distinct U12-type intron population.
After classification, intronIC adjusts the raw SVM probabilities to incorporate two independent signals: species-level population evidence and per-intron model agreement.
The adjustment operates in log-odds space:
Where:
-
$p_{\text{svm}}$ : raw ensemble mean probability (isotonic-calibrated) -
$\pi_{\text{species}}$ : species-level prior from valley depth -
$\sigma$ : ensemble standard deviation (std of per-model probabilities) -
$k_\sigma = 3.0$ : disagreement penalty coefficient
The species prior is derived from valley depth via a sigmoid:
Where
| Valley depth | Prior ( |
Effect on a 99% SVM score |
|---|---|---|
| 0.0 | 0.001 | Adjusted to ~9% |
| 0.15 | ~0.03 | Adjusted to ~75% |
| 0.30 | ~0.25 | Adjusted to ~98% |
| 0.65+ | ~0.50 | No change |
For species with strong U12-type intron populations (valley > 0.5), the adjustment is negligible. For species lacking U12-type introns (valley near 0), even high SVM scores are discounted below the threshold.
-
adjusted_scoreinscore_info.iic: the post-adjustment probability -
rel_scoreinmeta.iicandscore_info.iic:adjusted_score - threshold -
type_id(U12/U2): unchanged — still set by the raw SVM decision boundary -
svm_scoreinscore_info.iic: unchanged — still the raw ensemble mean
scoring:
threshold: 90.0
score_adjustment:
enabled: true
valley_midpoint: 0.3
transition_width: 0.25
prior_floor: 0.001
k_sigma: 3.0Species with different GC content than human may show shifted score distributions. Options:
-
Adaptive normalization: Refit scaler on experimental data (
--normalizer-mode adaptive) -
Prior adjustment: Adjust base rate expectation (
--species-prior)
The score adjustment automatically handles species lacking U12-type introns. When no valley is detected in the z-score distribution, the species prior drops to near-floor levels, discounting SVM probabilities below the confidence threshold. For example, C. elegans produces 1 confident call (vs. 4 without adjustment) and Ascaris suum produces 1 (vs. 7 without adjustment).
No special configuration is needed for species lacking U12-type introns.
The default human-trained model generalizes well to other vertebrates and most eukaryotes with U12-type introns. Performance may degrade for:
- Very distant lineages (plants, protists)
- Lineages with unusual U12-type motifs
- Species with extreme GC bias
Consider providing species-specific reference sequences for best results.
--streaming (default) and --in-memory produce bit-identical classifications since v2.4 (covered by tests/integration/test_streaming_equivalence.py); the choice is purely a runtime/memory tradeoff.
Per-contig pipeline:
- Writes intron sequences to a temporary SQLite database during extraction
- Keeps only scoring motifs in memory
- Each phase (extraction, BG correction, adaptive normalizer fit, classification) parallelizes across contigs via
multiprocessing.Pool
Loads all intron sequences into memory at extraction time. Used internally by --sequences and --bed input modes (those bypass the per-contig streaming path).
Homo sapiens GRCh38 + Ensembl 104 GTF, ~227k scored introns, single workstation with -p 6:
| Mode | Wall time | Peak RSS |
|---|---|---|
--streaming |
~16 min | ~5.4 GB |
--in-memory |
~15 min | ~10.1 GB |
Streaming uses roughly half the memory at essentially the same wall time on a multi-contig genome at this parallelism. In-memory's relative advantage grows on small single-contig inputs where the per-contig pipeline overhead dominates.
The -p N flag parallelizes PWM scoring and per-contig extraction:
- Scoring is CPU-bound and parallelizes efficiently
- Linear speedup up to ~8-16 cores
- Diminishing returns beyond that
- BG correction, adaptive-fit, and classify all dispatch through
Poolworkers in streaming mode
Original intronIC paper:
Moyer DC, Larue GE, Hershberger CE, Roy SW, Padgett RA. (2020) Comprehensive database and evolutionary dynamics of U12-type introns. Nucleic Acids Research 48(13):7066–7078. doi:10.1093/nar/gkaa464
U12-type intron databases:
Larue GE, Roy SW. (2023) Where the minor things are: a pan-eukaryotic survey suggests neutral processes may explain much of minor intron evolution. Nucleic Acids Research 51(20):10884-10908. doi:10.1093/nar/gkad797
Moyer DC, Larue GE, Hershberger CE, Roy SW, Padgett RA. (2020) Comprehensive database and evolutionary dynamics of U12-type introns. Nucleic Acids Research 48(13):7066-7078. doi:10.1093/nar/gkaa464
Alioto TS. (2007) U12DB: a database of orthologous U12-type spliceosomal introns. Nucleic Acids Research 35:D110-D115. doi:10.1093/nar/gkl842
Branch point mapping and spliceosome profiling:
Burke JE, Longhurst AD, Merkurjev D, Sales-Lee J, Rao B, Moresco JJ, Yates JR III, Li JJ, Madhani HD. (2018) Spliceosome profiling visualizes operations of a dynamic RNP at nucleotide resolution. Cell 173(4):1014-1030.e17. doi:10.1016/j.cell.2018.03.020
Zeng Y, Fair BJ, Zeng H, Krishnamohan A, Hou Y, Hall JM, Ruthenburg AJ, Li YI, Staley JP. (2022) Profiling lariat intermediates reveals genetic determinants of early and late co-transcriptional splicing. Molecular Cell 82(24):4681-4699. doi:10.1016/j.molcel.2022.11.004
Mercer TR, et al. (2015) Genome-wide discovery of human splicing branchpoints. Genome Research 25:290-303. doi:10.1101/gr.182477.114
Pineda JMB, Bradley RK. (2018) Most human introns are recognized via multiple and tissue-specific branchpoints. Genes & Development 32(7-8):577-591. doi:10.1101/gad.312058.118
U12-type intron retention and functional studies:
Niemelä EH, et al. (2014) Global analysis of the nuclear processing of transcripts with unspliced U12-type introns by the exosome. Nucleic Acids Research 42(11):7358-7369. doi:10.1093/nar/gku391
Madan V, et al. (2015) Aberrant splicing of U12-type introns is the hallmark of ZRSR2 mutant myelodysplastic syndrome. Nature Communications 6:6042. doi:10.1038/ncomms7042
Cologne A, et al. (2019) New insights into minor splicing—a transcriptomic analysis of cells derived from TALS patients. RNA 25(9):1130-1149. doi:10.1261/rna.071423.119
SVM probability calibration:
Platt JC. (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers pp. 61-74.