Technical algorithm
This page provides a detailed technical description of intronIC's algorithm, data flow, and machine learning architecture.
Input: Genome (FASTA) + Annotation (GFF3/GTF) + Species Name
Note: By default, intronIC uses streaming mode, which processes one chromosome at a time for ~85% memory reduction on large genomes.
- Stage 1: Intron Extraction
  - Parse annotation hierarchy (gene → transcript → CDS/exon)
  - Infer intron coordinates from exon gaps
  - Extract intron + flanking exon sequences from genome
  - Filter duplicates, short introns, isoforms (configurable)
- Stage 2: PWM Scoring
  - Score 5' splice site, branch point, 3' splice site regions
  - Calculate log-odds ratios: log(P(seq|U12) / P(seq|U2))
  - Select best branch point position from search window
- Stage 3: Normalization
  - Convert raw log-odds to z-scores via robust scaling
  - Fit on reference data or adapt to experimental data
- Stage 4: Classification
  - Feature augmentation (optional composite features)
  - Linear SVM with balanced class weights
  - Probability calibration via Platt scaling
  - Output: P(U12-type) from 0-100%
Output: .meta.iic, .bed.iic, .introns.iic, .score_info.iic, plots
Introns are inferred from gaps between consecutive exons (or CDS features) within the same transcript:
    Exon 1   Intron 1    Exon 2   Intron 2    Exon 3
    [=======]------------[=======]------------[=======]
    100-200  201-500     501-700  701-900     901-1100
Priority: CDS features are preferred over exon features when available, as they enable phase calculation.
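The gap inference described above can be sketched in a few lines of Python. This is a minimal illustration, not intronIC's own code: the function name `infer_introns` and the tuple representation of exons are assumptions, and coordinates are treated as 1-based inclusive, as in the diagram.

```python
def infer_introns(exons):
    """Return intron (start, end) pairs from the gaps between exons.

    `exons` is a list of (start, end) tuples, 1-based inclusive, all from
    the same transcript. Hypothetical helper for illustration only.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # An intron exists only if at least one base separates the exons.
        if next_start - prev_end > 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


exons = [(100, 200), (501, 700), (901, 1100)]
print(infer_introns(exons))  # [(201, 500), (701, 900)]
```

The same function works for CDS features, since only the start and end coordinates are used.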
| Filter | Default | Description |
|---|---|---|
| Duplicates | Exclude | Same coordinates from multiple isoforms |
| Longest isoform | Keep only | Can include all with `-i` |
| Minimum length | 30 bp | Adjustable via `--min-intron-len` |
| Ambiguous bases | Exclude | 'N' in scoring regions |
| Non-canonical | Include | Exclude with `--no-nc` |
Position weight matrices (PWMs) capture the probability of observing each nucleotide at each position in a motif. intronIC includes PWMs for:
- U12-type: AT-AC and GT-AG subtypes
- U2-type: GT-AG and GC-AG subtypes
Each subtype has PWMs for all three regions (5' splice site, branch point, 3' splice site).
Default scoring windows relative to splice sites:
| Region | Start | End | Length | Description |
|---|---|---|---|---|
| 5' SS | -3 | +9 | 12 bp | Includes last 3 bp of upstream exon |
| Branch point | -55 | -5 | 50 bp | Search window from 3'SS |
| 3' SS | -6 | +4 | 10 bp | Includes first 4 bp of downstream exon |
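As a rough illustration of how these windows map onto sequence, the sketch below slices the three regions from an intron and its flanking exons. The function name `scoring_regions` is hypothetical, and treating the offsets as half-open slice bounds around each splice junction is an assumption chosen so that the lengths match the table; intronIC's internal coordinate handling may differ.

```python
def scoring_regions(upstream_exon, intron, downstream_exon):
    """Slice the three scoring regions using the default windows.

    Offsets are interpreted as half-open slice bounds around each splice
    junction (an assumption made for this sketch).
    """
    five = upstream_exon[-3:] + intron[:9]       # -3 .. +9  -> 12 bp
    branch = intron[-55:-5]                      # -55 .. -5 -> 50 bp (shorter for short introns)
    three = intron[-6:] + downstream_exon[:4]    # -6 .. +4  -> 10 bp
    return five, branch, three


five, branch, three = scoring_regions("TTTCAG", "G" * 100, "ACGTAC")
print(len(five), len(branch), len(three))  # 12 50 10
```

For introns shorter than 55 bp, the branch point slice is simply truncated at the start of the intron.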
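For each region, the raw score is a log-odds ratio:

$$S = \sum_{i} \log \frac{P_{\text{U12}}(b_i, i)}{P_{\text{U2}}(b_i, i)}$$

Where $b_i$ is the nucleotide at position $i$, and $P_{\text{U12}}(b_i, i)$ and $P_{\text{U2}}(b_i, i)$ are the probabilities of that nucleotide at that position under the U12-type and U2-type PWMs.

- Higher positive values favor U12-type
- More negative values favor U2-type
- Zero means equally likely under both models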
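A minimal sketch of the log-odds calculation follows. The PWM representation (a list of per-position dictionaries), the toy probabilities, and the use of the natural logarithm are all assumptions made for illustration; they are not intronIC's actual matrices or data structures.

```python
import math


def log_odds_score(seq, pwm_u12, pwm_u2):
    """Sum per-position log-odds of U12 vs U2 nucleotide probabilities.

    Each PWM is a list of dicts mapping nucleotide -> probability, one
    dict per motif position (a toy representation assumed here).
    """
    return sum(
        math.log(pwm_u12[i][base] / pwm_u2[i][base])
        for i, base in enumerate(seq)
    )


# Toy two-position models: the "U12" model prefers A then T.
toy_u12 = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
toy_u2 = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}] * 2

print(log_odds_score("AT", toy_u12, toy_u2) > 0)  # True: favors the U12 model
print(log_odds_score("GG", toy_u12, toy_u2) < 0)  # True: favors the U2 model
```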
The branch point region is searched for the highest-scoring 7-mer matching the U12 branch point consensus (TCCTTAAC or similar). Multiple adenosine position variants are tested, and the highest-scoring position is selected.
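The window search can be pictured as a sliding scan that keeps the best-scoring position. The sketch below is a simplified illustration under stated assumptions: the helper names, the toy three-position PWM, and the uniform background are hypothetical, and the testing of multiple adenosine position variants is not modeled.

```python
import math


def consensus_pwm(consensus, hit=0.85):
    """Build a toy PWM that puts probability `hit` on each consensus base."""
    miss = (1.0 - hit) / 3
    return [{b: (hit if b == c else miss) for b in "ACGT"} for c in consensus]


def best_branch_point(window, pwm_u12, pwm_u2):
    """Slide the PWM across `window`; return (score, offset, motif) of the best hit."""
    k = len(pwm_u12)
    best = None
    for start in range(len(window) - k + 1):
        motif = window[start:start + k]
        score = sum(
            math.log(pwm_u12[i][b] / pwm_u2[i][b]) for i, b in enumerate(motif)
        )
        if best is None or score > best[0]:
            best = (score, start, motif)
    return best


toy_u12 = consensus_pwm("TAA")                 # hypothetical 3-mer, not the real motif
toy_u2 = [{b: 0.25 for b in "ACGT"}] * 3       # uniform background
score, offset, motif = best_branch_point("GGGTAAGGG", toy_u12, toy_u2)
print(offset, motif)  # 3 TAA
```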
Raw log-odds scores have different ranges and distributions for each region:
- 5'SS scores might range from -50 to +10
- BP scores might range from -20 to +5
- 3'SS scores might range from -5 to +3
Normalization converts these to comparable z-scores for the SVM.
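intronIC uses robust z-score normalization via sklearn's RobustScaler:

$$z = \frac{x - \text{median}}{\text{IQR}}$$

Where $x$ is a raw log-odds score, the median and IQR are computed over the data the scaler is fit on, and IQR is the interquartile range (75th percentile - 25th percentile).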
Key properties:
- Robust to outliers: Median and IQR are resistant to extreme values
- Centered: Distribution centered around 0 for each feature
- Comparable scales: All three regions have similar variance after scaling
- Minimal U12 contamination: Rare U12s (~0.5%) don't significantly affect robust statistics
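To make the robustness property concrete, here is a small standard-library sketch of median/IQR scaling. It is meant to mirror what `RobustScaler` does with its default settings for a single feature; the function name `robust_z_scores` and the toy values are assumptions for illustration.

```python
from statistics import median, quantiles


def robust_z_scores(values):
    """Scale values as (x - median) / IQR using linearly interpolated quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    center = median(values)
    iqr = q3 - q1
    return [(v - center) / iqr for v in values]


print(robust_z_scores([1, 2, 3, 4, 5]))    # [-1.0, -0.5, 0.0, 0.5, 1.0]
# Replacing the largest value with an outlier leaves the other scores unchanged,
# because the median and quartiles do not move.
print(robust_z_scores([1, 2, 3, 4, 100])[:4])  # [-1.0, -0.5, 0.0, 0.5]
```

A mean/standard-deviation z-score would shift every value in the second example, which is the behavior robust scaling avoids.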
| Mode | Description | Use Case |
|---|---|---|
| `human` | Use scales from training (human) data | U12-absent species, cross-species |
| `adaptive` | Refit scales on experimental data | Species with different GC content |
| `auto` | Use human if available in model | Default behavior |
Note: The pretrained model includes its scaler, so results are reproducible without additional steps when using --normalizer-mode human or auto.
Base features (3D):
- `five_z_score`: 5' splice site z-score
- `bp_z_score`: Branch point z-score
- `three_z_score`: 3' splice site z-score

Augmented features (optional, 7D default):
- `min_all`: min(five_z, bp_z, three_z); requires all signals to be strong
- `absdiff_5_bp`: |five_z - bp_z|; penalizes 5'/BP imbalance
- `absdiff_5_3`: |five_z - three_z|; penalizes 5'/3' imbalance
- `absdiff_bp_3`: |bp_z - three_z|; penalizes BP/3' imbalance
The augmented features help the linear SVM reject "one-end-strong" false positives—U2-type introns with one unusually U12-like signal.
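The feature construction is simple enough to show directly. In the sketch below, the function name `augment_features` and the ordering of the output vector are assumptions; only the feature definitions come from the lists above.

```python
def augment_features(five_z, bp_z, three_z):
    """Return the 7-dimensional vector: 3 base z-scores plus 4 composites."""
    return [
        five_z,
        bp_z,
        three_z,
        min(five_z, bp_z, three_z),   # min_all
        abs(five_z - bp_z),           # absdiff_5_bp
        abs(five_z - three_z),        # absdiff_5_3
        abs(bp_z - three_z),          # absdiff_bp_3
    ]


# All three signals strong: high min_all, small differences.
print(augment_features(4.0, 3.5, 3.0))    # [4.0, 3.5, 3.0, 3.0, 0.5, 1.0, 0.5]
# "One-end-strong" case: low min_all, large differences involving the 5' score.
print(augment_features(4.0, -1.0, -0.5))  # [4.0, -1.0, -0.5, -1.0, 5.0, 4.5, 0.5]
```

In the second case the strong 5' score alone could pull a purely additive model toward U12, while the low `min_all` and large differences give a linear classifier something to weigh against it.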
intronIC uses a linear support vector machine (sklearn's LinearSVC):
- Kernel: Linear (interpretable coefficients)
- Class weights: Balanced to handle ~0.5% U12 prevalence
- Regularization: L2 penalty, C optimized via cross-validation
- Convergence: max_iter=50,000, tol=1e-4
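A configuration along these lines can be written with scikit-learn as follows. This is a sketch on synthetic data, not intronIC's training code: the random feature values, the class sizes, and the fixed `C=1.0` (which the real pipeline tunes by cross-validation) are placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic stand-ins for z-scored features: many U2-like points, few U12-like.
X_u2 = rng.normal(loc=0.0, scale=1.0, size=(2000, 3))
X_u12 = rng.normal(loc=4.0, scale=1.0, size=(20, 3))
X = np.vstack([X_u2, X_u12])
y = np.array([0] * len(X_u2) + [1] * len(X_u12))

clf = LinearSVC(
    penalty="l2",
    C=1.0,                     # placeholder value; tuned by cross-validation in practice
    class_weight="balanced",   # reweights classes inversely to their frequency
    max_iter=50_000,
    tol=1e-4,
)
clf.fit(X, y)

print(clf.coef_)                       # one interpretable weight per feature
decision = clf.decision_function(X)    # signed decision values, later calibrated
```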
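The raw SVM outputs decision distances (signed distance from hyperplane). These are converted to probabilities using Platt scaling (sigmoid calibration):

$$P(\text{U12} \mid f) = \frac{1}{1 + \exp(A f + B)}$$

Where $f$ is the SVM decision value and $A$ and $B$ are parameters fit during calibration.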
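Platt scaling fits a two-parameter sigmoid to the decision values. The sketch below only applies an already-fitted sigmoid; the function name and the parameter values `a=-2.0`, `b=0.0` are illustrative assumptions, not values from the pretrained model.

```python
import math


def platt_probability(decision_value, a, b):
    """Map an SVM decision value f to a probability: 1 / (1 + exp(a*f + b))."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


# With a negative slope parameter, larger decision values give higher probabilities.
for f in (-2.0, 0.0, 2.0):
    print(f, round(platt_probability(f, a=-2.0, b=0.0), 3))
```

With these toy parameters, a decision value of 0 (a point on the hyperplane) maps to a probability of exactly 0.5.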
When --n-models > 1, multiple SVMs are trained with different U2 subsamples:
- Each model sees all U12 references but a different 80% of U2 references
- Predictions are averaged across models
- Reduces variance and improves robustness
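The subsample-and-average scheme can be sketched without any model code. The helper names below are hypothetical, and sampling without replacement with a fixed seed is an assumption about details the description above leaves open.

```python
import random


def u2_subsamples(n_u2, n_models, fraction=0.8, seed=0):
    """Yield one sorted list of U2 reference indices per model."""
    rng = random.Random(seed)
    k = int(n_u2 * fraction)
    for _ in range(n_models):
        # Each model gets its own random 80% of the U2 references.
        yield sorted(rng.sample(range(n_u2), k))


def average_predictions(per_model_probs):
    """Average probabilities across models, intron by intron."""
    n_models = len(per_model_probs)
    return [sum(col) / n_models for col in zip(*per_model_probs)]


subsets = list(u2_subsamples(n_u2=100, n_models=3))
print([len(s) for s in subsets])  # [80, 80, 80]

# Two models scoring the same two introns.
print(average_predictions([[0.9, 0.1], [0.7, 0.3]]))
```

Each model would be trained on all U12 references plus its own U2 subset, and the averaged output replaces any single model's prediction.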
The default pretrained model was trained on:
Training data:
- ~400 conserved human U12-type introns (multiple evidence sources)
- ~20,000 conserved human U2-type introns
Training process:
- Score all reference introns with PWMs
- Normalize using robust scaling (median/IQR)
- Optimize SVM hyperparameters via 5-fold cross-validation
- Train final model on all reference data
- Calibrate probabilities via cross-validation
Evaluation:
- Nested cross-validation for unbiased performance estimates
- Balanced accuracy, F1-score, and PR-AUC reported
- Validated on held-out human introns and cross-species data
Species with different GC content than human may show shifted score distributions. Options:
- Adaptive normalization: Refit scaler on experimental data (`--normalizer-mode adaptive`)
- Prior adjustment: Adjust base rate expectation (`--species-prior`)
For species known to lack U12-type introns (C. elegans, many fungi):
    intronIC -g genome.fa -a annotation.gff -n species \
        --normalizer-mode human --species-prior 1e-6

The reduced prior shifts probability thresholds to minimize false positives.
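To see why a lower prior suppresses false positives, consider the standard prior-shift correction on the odds. Whether intronIC applies exactly this formula is an assumption of this sketch, and the training prior of 0.005 is only an illustrative value based on the ~0.5% U12 prevalence mentioned earlier.

```python
def adjust_for_prior(prob, train_prior, species_prior):
    """Rescale a calibrated probability from one class prior to another.

    Standard prior-shift correction on the odds; not confirmed to be the
    exact computation intronIC performs. Requires 0 < prob < 1.
    """
    odds = prob / (1.0 - prob)
    prior_ratio = (species_prior / (1.0 - species_prior)) / (
        train_prior / (1.0 - train_prior)
    )
    new_odds = odds * prior_ratio
    return new_odds / (1.0 + new_odds)


# A seemingly confident call drops sharply when U12 introns are assumed
# to be nearly absent from the species.
print(adjust_for_prior(0.95, train_prior=0.005, species_prior=1e-6))
```

Under these assumptions, only introns with overwhelming evidence would still cross a fixed probability threshold.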
The default human-trained model generalizes well to other vertebrates and most eukaryotes with U12-type introns. Performance may degrade for:
- Very distant lineages (plants, protists)
- Lineages with unusual U12 motifs
- Species with extreme GC bias
Consider providing species-specific reference sequences for best results.
Without streaming, memory scales with annotation density:
- Loads all intron sequences into memory
- Human genome (~250k introns): ~12 GB peak
Streaming mode (the default) dramatically reduces memory:
- Writes sequences to temporary SQLite database
- Keeps only scoring motifs in memory
- Human genome: ~2-3 GB peak (~85% reduction)
- Trade-off: Slightly slower I/O
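The idea of spilling sequences to disk can be illustrated with Python's built-in `sqlite3` module. The table layout, function names, and intron identifiers below are hypothetical; they show the general pattern, not intronIC's actual schema.

```python
import os
import sqlite3
import tempfile


def store_sequences(records, db_path):
    """Write (intron_id, sequence) pairs to an on-disk SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS introns (id TEXT PRIMARY KEY, seq TEXT)")
    conn.executemany("INSERT INTO introns VALUES (?, ?)", records)
    conn.commit()
    return conn


def fetch_sequence(conn, intron_id):
    """Read one sequence back only when it is needed."""
    row = conn.execute(
        "SELECT seq FROM introns WHERE id = ?", (intron_id,)
    ).fetchone()
    return row[0] if row else None


with tempfile.TemporaryDirectory() as tmp:
    conn = store_sequences(
        [("intron_1", "GTAAGT" + "T" * 20 + "AG"),
         ("intron_2", "ATATCCTTT" + "C" * 30 + "AC")],
        os.path.join(tmp, "introns.sqlite"),
    )
    print(fetch_sequence(conn, "intron_2")[:9])  # ATATCCTTT
    conn.close()
```

Full sequences live on disk and are fetched on demand, so only the short scoring motifs need to stay in memory, at the cost of extra I/O.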
The -p N flag parallelizes PWM scoring:
- Scoring is CPU-bound and embarrassingly parallel
- Linear speedup up to ~8-16 cores
- Diminishing returns beyond that
Original intronIC paper:
Moyer DC, Larue GE, Hershberger CE, Roy SW, Padgett RA. (2020) Comprehensive database and evolutionary dynamics of U12-type introns. Nucleic Acids Research 48(13):7066–7078. doi:10.1093/nar/gkaa464
U12-type intron databases:
Alioto TS. (2007) U12DB: a database of orthologous U12-type spliceosomal introns. Nucleic Acids Research 35:D110-D115. doi:10.1093/nar/gkl842
Branch point mapping:
Mercer TR, et al. (2015) Genome-wide discovery of human splicing branchpoints. Genome Research 25:290-303. doi:10.1101/gr.182899.114
SVM probability calibration:
Platt JC. (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers pp. 61-74.