Technical algorithm
This page provides a detailed technical description of intronIC's algorithm, data flow, and machine learning architecture.
┌─────────────────────────────────────────────────────────────────────┐
│ INPUT │
│ Genome (FASTA) + Annotation (GFF3/GTF) + Species Name │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ STAGE 1: INTRON EXTRACTION │
│ • Parse annotation hierarchy (gene → transcript → CDS/exon) │
│ • Infer intron coordinates from exon gaps │
│ • Extract intron + flanking exon sequences from genome │
│ • Filter duplicates, short introns, isoforms (configurable) │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ STAGE 2: PWM SCORING │
│ • Score 5' splice site, branch point, 3' splice site regions │
│ • Calculate log-odds ratios: log(P(seq|U12) / P(seq|U2)) │
│ • Select best branch point position from search window │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ STAGE 3: NORMALIZATION │
│ • Convert raw log-odds to z-scores │
│ • Zero-anchored robust scaling (preserves semantic zero) │
│ • Fit on reference data or adapt to experimental data │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ STAGE 4: CLASSIFICATION │
│ • Feature augmentation (optional composite features) │
│ • Linear SVM with balanced class weights │
│ • Probability calibration via Platt scaling │
│ • Output: P(U12-type) from 0-100% │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ OUTPUT │
│ .meta.iic, .bed.iic, .introns.iic, .score_info.iic, plots │
└─────────────────────────────────────────────────────────────────────┘
Introns are inferred from gaps between consecutive exons (or CDS features) within the same transcript:
Exon 1 Intron 1 Exon 2 Intron 2 Exon 3
[====]----------[ ]--------[====]----------[ ]--------[====]
100-200 201-500 501-700 701-900 901-1100
Priority: CDS features are preferred over exon features when available, as they enable phase calculation.
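The gap-based inference above can be sketched in a few lines. This is an illustration rather than intronIC's own code: the function name `infer_introns` and the use of 1-based inclusive `(start, stop)` tuples are assumptions made for the example.

```python
def infer_introns(exons):
    """Return 1-based inclusive (start, stop) introns for one transcript's exons."""
    ordered = sorted(exons)
    introns = []
    for (_, upstream_end), (downstream_start, _) in zip(ordered, ordered[1:]):
        # An intron covers the bases strictly between two consecutive exons;
        # directly adjacent exons leave no gap and therefore yield no intron.
        if downstream_start - upstream_end > 1:
            introns.append((upstream_end + 1, downstream_start - 1))
    return introns


# Exon coordinates taken from the diagram above.
exons = [(100, 200), (501, 700), (901, 1100)]
print(infer_introns(exons))  # [(201, 500), (701, 900)]
```

The same function would apply to CDS features in place of exons, since only the gaps between consecutive features matter.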
| Filter | Default | Description |
|---|---|---|
| Duplicates | Exclude | Same coordinates from multiple isoforms |
| Longest isoform | Keep only | Can include all with `-i` |
| Minimum length | 30 bp | Adjustable via `--min-intron-len` |
| Ambiguous bases | Exclude | 'N' in scoring regions |
| Non-canonical | Include | Exclude with `--no-nc` |
Position weight matrices (PWMs) capture the probability of observing each nucleotide at each position in a motif. intronIC uses separate matrices for:
| Region | U12 AT-AC | U12 GT-AG | U2 GT-AG | U2 GC-AG |
|---|---|---|---|---|
| 5' splice site | ✓ | ✓ | ✓ | ✓ |
| Branch point | ✓ | ✓ | ✓ | ✓ |
| 3' splice site | ✓ | ✓ | ✓ | ✓ |
Default scoring windows relative to splice sites:
| Region | Start | End | Length | Description |
|---|---|---|---|---|
| 5' SS | -3 | +9 | 12 bp | Includes last 3 bp of upstream exon |
| Branch point | -55 | -5 | 50 bp | Search window from 3'SS |
| 3' SS | -6 | +4 | 10 bp | Includes first 4 bp of downstream exon |
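For each region, the raw score is a log-odds ratio summed over the positions of the scoring window:

$$S = \sum_{i} \log \frac{P_{\text{U12}}(b_i, i)}{P_{\text{U2}}(b_i, i)}$$

Where:
- $b_i$ is the nucleotide at position $i$
- $P_{\text{U12}}(b_i, i)$ and $P_{\text{U2}}(b_i, i)$ are the PWM probabilities of that nucleotide at that position under each model
- Higher positive values favor U12-type
- More negative values favor U2-type
- Zero means equally likely under both models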
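A minimal sketch of the per-region log-odds score follows. The matrices are made-up toy values, the name `log_odds_score` is hypothetical, and the natural logarithm is an assumption (the base only rescales the score). Real matrices need non-zero probabilities, for example via pseudocounts, for the ratio to be defined.

```python
import math


def log_odds_score(seq, u12_pwm, u2_pwm):
    """Sum per-position log-odds of the U12 versus U2 nucleotide probabilities."""
    return sum(
        math.log(u12_pwm[i][base] / u2_pwm[i][base])
        for i, base in enumerate(seq)
    )


# Toy two-position matrices: one dict of nucleotide probabilities per position.
u12_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
u2_pwm = [
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]

print(log_odds_score("AT", u12_pwm, u2_pwm))  # positive: favors U12-type
print(log_odds_score("CC", u12_pwm, u2_pwm))  # negative: favors U2-type
```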
The branch point region is searched for the highest-scoring 7-mer matching the U12 branch point consensus (TCCTTAAC or similar). Multiple adenosine position variants are tested, and the highest-scoring position is selected.
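The search described above amounts to a sliding-window maximum. The sketch below assumes a generic scoring callback; `best_window` and the toy `match_count` scorer are hypothetical names, and a real run would plug in a PWM log-odds scorer instead. The window width is left as a parameter.

```python
CONSENSUS = "TCCTTAAC"  # consensus quoted in the text above


def match_count(kmer):
    """Toy stand-in for a PWM scorer: number of positions matching the consensus."""
    return sum(a == b for a, b in zip(kmer, CONSENSUS))


def best_window(region, width, score_fn):
    """Return (score, offset, motif) for the highest-scoring window in region."""
    if len(region) < width:
        raise ValueError("region is shorter than the window width")
    candidates = (
        (score_fn(region[i:i + width]), i, region[i:i + width])
        for i in range(len(region) - width + 1)
    )
    # On ties, max() keeps the first (leftmost) window it encounters.
    return max(candidates, key=lambda c: c[0])


region = "GGGGTCCTTAACGGGG"
print(best_window(region, len(CONSENSUS), match_count))  # (8, 4, 'TCCTTAAC')
```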
Raw log-odds scores have different ranges and distributions for each region:
- 5'SS scores might range from -50 to +10
- BP scores might range from -20 to +5
- 3'SS scores might range from -5 to +3
Normalization converts these to comparable z-scores for the SVM.
intronIC uses zero-anchored robust scaling that preserves the semantic meaning of zero:
Key properties:
- Zero preserved: A raw score of 0 (equally plausible under U12 and U2) remains 0
- Sign preserved: Positive scores stay positive, negative stay negative
- Robust: Median is resistant to outliers and rare U12s (~0.5% of data)
- Winsorization: Extreme values clipped at 99.5th percentile before computing median
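The properties listed above can be illustrated with a small sketch. The exact scale estimator intronIC uses is not stated here, so this example assumes the scale is the median of the absolute raw scores after clipping at the given percentile, and that scores are divided by it without centering. The function names are hypothetical.

```python
import math
import statistics


def fit_zero_anchored_scale(raw_scores, clip_percentile=99.5):
    """Estimate a robust scale from score magnitudes (assumed estimator)."""
    magnitudes = sorted(abs(x) for x in raw_scores)
    # Nearest-rank percentile used as the winsorization cap.
    rank = max(1, math.ceil(clip_percentile / 100 * len(magnitudes)))
    cap = magnitudes[rank - 1]
    clipped = [min(m, cap) for m in magnitudes]
    scale = statistics.median(clipped)
    if scale == 0:
        raise ValueError("cannot scale: median magnitude is zero")
    return scale


def zero_anchored_transform(raw_scores, scale):
    """Divide by the scale only; with no centering, zero and sign are preserved."""
    return [x / scale for x in raw_scores]


raw = [-4.0, -2.0, 0.0, 1.0, 2.0]
scale = fit_zero_anchored_scale(raw)
print(zero_anchored_transform(raw, scale))  # [-2.0, -1.0, 0.0, 0.5, 1.0]
```

Because the transform is a pure rescaling, a scale fitted on one dataset can be stored and reapplied to another, which is what the reference-fitted and adaptive modes differ on.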
| Mode | Description | Use Case |
|---|---|---|
| `human` | Use scales from training (human) data | U12-absent species, cross-species |
| `adaptive` | Refit scales on experimental data | Species with different GC content |
| `auto` | Use human if available in model | Default behavior |
Important: For reproducible results on genome subsets, save and reload the normalizer from a full-genome run.
Base features (3D):
- `five_z_score`: 5' splice site z-score
- `bp_z_score`: Branch point z-score
- `three_z_score`: 3' splice site z-score

Augmented features (optional, 7D default):
- `min_all`: min(five_z, bp_z, three_z); requires ALL signals strong
- `absdiff_5_bp`: |five_z - bp_z|; penalizes 5'/BP imbalance
- `absdiff_5_3`: |five_z - three_z|; penalizes 5'/3' imbalance
- `absdiff_bp_3`: |bp_z - three_z|; penalizes BP/3' imbalance
The augmented features help the linear SVM reject "one-end-strong" false positives—U2 introns with one unusually U12-like signal.
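As a rough illustration, the 7D feature vector can be built directly from the three z-scores. The function name `augment` and the ordering of the output are assumptions for this sketch.

```python
def augment(five_z, bp_z, three_z):
    """Build the 7D vector: three base z-scores plus four composite features."""
    return [
        five_z,
        bp_z,
        three_z,
        min(five_z, bp_z, three_z),  # min_all
        abs(five_z - bp_z),          # absdiff_5_bp
        abs(five_z - three_z),       # absdiff_5_3
        abs(bp_z - three_z),         # absdiff_bp_3
    ]


# Consistent signals: high min_all, small imbalance terms.
print(augment(3.0, 2.5, 2.0))    # [3.0, 2.5, 2.0, 2.0, 0.5, 1.0, 0.5]
# "One-end-strong" case: strong 5' signal only, so min_all is low and imbalance is large.
print(augment(3.0, -1.0, -1.5))  # [3.0, -1.0, -1.5, -1.5, 4.0, 4.5, 0.5]
```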
intronIC uses a linear support vector machine (sklearn's LinearSVC):
- Kernel: Linear (interpretable coefficients)
- Class weights: Balanced to handle ~0.5% U12 prevalence
- Regularization: L2 penalty, C optimized via cross-validation
- Convergence: max_iter=50,000, tol=1e-4
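The settings listed above map onto scikit-learn's `LinearSVC` roughly as follows. This is a sketch on synthetic data, not the project's training code: the helper `make_classifier`, the value `C=1.0`, and the toy feature distributions are all assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC


def make_classifier(C=1.0):
    """Linear SVM configured with the settings listed above (C is a placeholder)."""
    return LinearSVC(
        penalty="l2",
        C=C,
        class_weight="balanced",  # compensates for the rarity of U12-type introns
        max_iter=50_000,
        tol=1e-4,
    )


# Synthetic, well-separated 3D z-scores: many U2-like rows, few U12-like rows.
rng = np.random.default_rng(0)
u2 = rng.normal(loc=-2.0, scale=0.5, size=(200, 3))
u12 = rng.normal(loc=2.0, scale=0.5, size=(5, 3))
X = np.vstack([u2, u12])
y = np.array([0] * 200 + [1] * 5)

clf = make_classifier().fit(X, y)
distances = clf.decision_function(X)  # signed distances from the hyperplane
print(clf.coef_)                      # one interpretable weight per feature
```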
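The raw SVM outputs decision distances (signed distance from hyperplane). These are converted to probabilities using Platt scaling (sigmoid calibration):

$$P(\text{U12} \mid f) = \frac{1}{1 + \exp(A f + B)}$$

Where $f$ is the SVM decision value and $A$ and $B$ are scalar parameters fitted during calibration.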
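A small sketch of the sigmoid mapping used in Platt scaling, assuming the standard form 1 / (1 + exp(A·f + B)). The parameter values below are illustrative only; in practice they are fitted on held-out decision values. The function name is hypothetical.

```python
import math


def platt_probability(decision_value, a, b):
    """Map a signed SVM decision value to a probability with a fitted sigmoid."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


# Illustrative parameters: a negative A makes larger distances more U12-like.
A, B = -2.0, 0.0

for f in (-3.0, 0.0, 3.0):
    percent = 100 * platt_probability(f, A, B)
    print(f"decision value {f:+.1f} -> {percent:.1f}% U12-type")
```

With these parameters a decision value of zero maps to exactly 50%, and the mapping is monotonic, so the ranking produced by the SVM is unchanged by calibration.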
When --n-models > 1, multiple SVMs are trained with different U2 subsamples:
- Each model sees all U12 references but a different 80% of U2 references
- Predictions are averaged across models
- Reduces variance and improves robustness
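The subsampling and averaging steps can be sketched independently of the classifier. The example assumes sampling without replacement and a fixed seed; the helper names are hypothetical, and training each model on all U12 references plus one U2 subset is left out.

```python
import random
from statistics import fmean


def u2_subsamples(n_u2, n_models, fraction=0.8, seed=0):
    """Draw one index subset of the U2 references per ensemble member."""
    rng = random.Random(seed)
    k = round(fraction * n_u2)
    return [sorted(rng.sample(range(n_u2), k)) for _ in range(n_models)]


def average_predictions(per_model_probs):
    """Average each intron's probability across models (one inner list per model)."""
    return [fmean(column) for column in zip(*per_model_probs)]


subsets = u2_subsamples(n_u2=10, n_models=3)
print([len(s) for s in subsets])  # [8, 8, 8]

# Two models scoring the same two introns.
print(average_predictions([[0.2, 0.9], [0.4, 0.7]]))
```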
The default pretrained model was trained on:
Training data:
- ~400 conserved human U12-type introns (multiple evidence sources)
- ~20,000 conserved human U2-type introns
Training process:
- Score all reference introns with PWMs
- Normalize using zero-anchored robust scaling
- Optimize SVM hyperparameters via 5-fold cross-validation
- Train final model on all reference data
- Calibrate probabilities via cross-validation
Evaluation:
- Nested cross-validation for unbiased performance estimates
- Balanced accuracy, F1-score, and PR-AUC reported
- Validated on held-out human introns and cross-species data
Species with different GC content than human may show shifted score distributions. Options:
- Adaptive normalization: Refit scaler on experimental data (`--normalizer-mode adaptive`)
- Prior adjustment: Adjust base rate expectation (`--species-prior`)
For species known to lack U12 introns (C. elegans, many fungi):
intronIC -g genome.fa -a annotation.gff -n species \
    --normalizer-mode human --species-prior 1e-6

The reduced prior shifts probability thresholds to minimize false positives.
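To see why a smaller prior suppresses false positives, consider the textbook prior-shift correction for a calibrated probability. How intronIC applies `--species-prior` internally is not described here, so the formula, the function name, and the training prior of 0.005 are assumptions chosen for illustration.

```python
def adjust_for_prior(p, train_prior, new_prior):
    """Re-weight a calibrated probability for a different class base rate."""
    w_pos = new_prior / train_prior
    w_neg = (1.0 - new_prior) / (1.0 - train_prior)
    return p * w_pos / (p * w_pos + (1.0 - p) * w_neg)


# A borderline call under an assumed 0.5% base rate, re-weighted for a species
# where U12-type introns are expected to be essentially absent.
print(adjust_for_prior(0.90, train_prior=0.005, new_prior=1e-6))

# An unchanged prior leaves the probability as it was.
print(adjust_for_prior(0.90, train_prior=0.005, new_prior=0.005))
```

Under this correction only introns with overwhelming evidence keep a high probability once the prior is lowered, which matches the intent of the command above.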
The default human-trained model generalizes well to other vertebrates and most eukaryotes with U12 introns. Performance may degrade for:
- Very distant lineages (plants, protists)
- Lineages with unusual U12 motifs
- Species with extreme GC bias
Consider providing species-specific reference sequences for best results.
Memory scales with annotation density:
- Loads all intron sequences into memory
- Human genome (~250k introns): ~12 GB peak
Dramatically reduced memory:
- Writes sequences to temporary SQLite database
- Keeps only scoring motifs in memory
- Human genome: ~2-3 GB peak (~85% reduction)
- Trade-off: Slightly slower I/O
The -p N flag parallelizes PWM scoring:
- Scoring is CPU-bound and embarrassingly parallel
- Linear speedup up to ~8-16 cores
- Diminishing returns beyond that
Original intronIC paper:
Moyer DC, Larue GE, Hershberger CE, Roy SW, Padgett RA. (2020) Comprehensive database and evolutionary dynamics of U12-type introns. Nucleic Acids Research 48(13):7066–7078. doi:10.1093/nar/gkaa464
U12 intron databases:
Alioto TS. (2007) U12DB: a database of orthologous U12-type spliceosomal introns. Nucleic Acids Research 35:D110-D115. doi:10.1093/nar/gkl842
Branch point mapping:
Mercer TR, et al. (2015) Genome-wide discovery of human splicing branchpoints. Genome Research 25:290-303. doi:10.1101/gr.182899.114
SVM probability calibration:
Platt JC. (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers pp. 61-74.