Technical algorithm

Technical Details

This page provides a detailed technical description of intronIC's algorithm, data flow, and machine learning architecture.

Pipeline Overview

┌─────────────────────────────────────────────────────────────────────┐
│                         INPUT                                       │
│   Genome (FASTA) + Annotation (GFF3/GTF) + Species Name             │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│ STAGE 1: INTRON EXTRACTION                                          │
│   • Parse annotation hierarchy (gene → transcript → CDS/exon)       │
│   • Infer intron coordinates from exon gaps                         │
│   • Extract intron + flanking exon sequences from genome            │
│   • Filter duplicates, short introns, isoforms (configurable)       │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│ STAGE 2: PWM SCORING                                                │
│   • Score 5' splice site, branch point, 3' splice site regions      │
│   • Calculate log-odds ratios: log(P(seq|U12) / P(seq|U2))          │
│   • Select best branch point position from search window            │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│ STAGE 3: NORMALIZATION                                              │
│   • Convert raw log-odds to z-scores                                │
│   • Zero-anchored robust scaling (preserves semantic zero)          │
│   • Fit on reference data or adapt to experimental data             │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│ STAGE 4: CLASSIFICATION                                             │
│   • Feature augmentation (optional composite features)              │
│   • Linear SVM with balanced class weights                          │
│   • Probability calibration via Platt scaling                       │
│   • Output: P(U12-type) from 0-100%                                 │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         OUTPUT                                      │
│   .meta.iic, .bed.iic, .introns.iic, .score_info.iic, plots         │
└─────────────────────────────────────────────────────────────────────┘

Stage 1: Intron Extraction

Coordinate Inference

Introns are inferred from gaps between consecutive exons (or CDS features) within the same transcript:

Exon 1          Intron 1        Exon 2          Intron 2        Exon 3
[====]----------[      ]--------[====]----------[      ]--------[====]
100-200         201-500         501-700         701-900         901-1100

Priority: CDS features are preferred over exon features when available, as they enable phase calculation.
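
The gap inference above can be sketched in a few lines. This is a minimal illustration, not intronIC's own code: the `infer_introns` name and the tuple representation are assumptions, and coordinates are taken to be 1-based and inclusive, as in GFF3/GTF.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # hypothetical representation: 1-based, inclusive (start, end)


def infer_introns(exons: List[Interval]) -> List[Interval]:
    """Return the gaps between consecutive exons of a single transcript."""
    ordered = sorted(exons)  # genomic order, regardless of input order
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # An intron exists only if at least one base separates the two exons.
        if next_start - prev_end > 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


exons = [(100, 200), (501, 700), (901, 1100)]
print(infer_introns(exons))  # [(201, 500), (701, 900)]
```

The result is in genomic order; for a minus-strand transcript the list would need to be reversed to follow transcript order.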

Filtering Criteria

Filter	Default	Description
Duplicates	Exclude	Same coordinates from multiple isoforms
Isoforms	Longest only	Include all isoforms with `-i`
Minimum length	30 bp	Adjustable via `--min-intron-len`
Ambiguous bases	Exclude	Introns with 'N' in a scoring region
Non-canonical	Include	Exclude with `--no-nc`
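
As a rough sketch of how three of these filters (duplicates, minimum length, ambiguous bases) might be applied in sequence: the `Intron` record and `filter_introns` function are hypothetical names for illustration, and the isoform and non-canonical filters are left out.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Intron:
    # Hypothetical record; field names are illustrative, not intronIC's.
    chrom: str
    strand: str
    start: int        # 1-based, inclusive
    end: int
    scoring_seq: str  # concatenated sequence of the scoring regions

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def filter_introns(introns: Iterable[Intron], min_len: int = 30) -> List[Intron]:
    seen = set()
    kept = []
    for intron in introns:
        key = (intron.chrom, intron.strand, intron.start, intron.end)
        if key in seen:
            continue  # same coordinates already seen in another isoform
        seen.add(key)
        if intron.length < min_len:
            continue  # shorter than the minimum length
        if "N" in intron.scoring_seq.upper():
            continue  # ambiguous base in a scoring region
        kept.append(intron)
    return kept


first = Intron("chr1", "+", 100, 200, "GTAAGTAG")
duplicate = Intron("chr1", "+", 100, 200, "GTAAGTAG")
too_short = Intron("chr1", "+", 300, 310, "GTAAGTAG")
ambiguous = Intron("chr1", "+", 400, 500, "GTNAGTAG")
kept = filter_introns([first, duplicate, too_short, ambiguous])
print(len(kept))  # 1
```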

Stage 2: PWM Scoring

Position Weight Matrices

Position weight matrices (PWMs) capture the probability of observing each nucleotide at each position in a motif. intronIC uses separate matrices for:

Region	U12 AT-AC	U12 GT-AG	U2 GT-AG	U2 GC-AG
5' splice site	✓	✓	✓	✓
Branch point	✓	✓	✓	✓
3' splice site	✓	✓	✓	✓

Scoring Regions

Default scoring windows relative to splice sites:

Region	Start	End	Length	Description
5' SS	-3	+9	12 bp	Includes last 3 bp of upstream exon
Branch point	-55	-5	50 bp	Search window upstream of the 3' SS
3' SS	-6	+4	10 bp	Includes first 4 bp of downstream exon

Log-Odds Ratio Calculation

For each region, the raw score is a log-odds ratio:

$$\text{LLR} = \sum_{i=1}^{n} \log\left(\frac{P(b_i | \text{U12 matrix})}{P(b_i | \text{U2 matrix})}\right)$$

Where:

$b_i$ is the nucleotide at position $i$
More positive values favor U12-type
More negative values favor U2-type
Zero means equally likely under both models
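
A minimal sketch of the sum above, assuming each matrix is stored as one `{base: probability}` dictionary per position. The toy matrices, the `log_odds` name, and the use of the natural logarithm are assumptions for illustration; the real matrices and log base may differ.

```python
import math
from typing import Dict, List

PWM = List[Dict[str, float]]  # one {base: probability} dict per motif position


def log_odds(seq: str, u12: PWM, u2: PWM) -> float:
    """Sum of per-position log(P(b_i | U12) / P(b_i | U2))."""
    if not (len(seq) == len(u12) == len(u2)):
        raise ValueError("sequence and both matrices must have the same length")
    return sum(math.log(u12[i][b] / u2[i][b]) for i, b in enumerate(seq))


# Toy two-position matrices (made-up numbers, not real splice-site frequencies).
toy_u12 = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
toy_u2 = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}] * 2

print(log_odds("AT", toy_u12, toy_u2) > 0)  # True: "AT" looks U12-like here
print(log_odds("CG", toy_u12, toy_u2) < 0)  # True: "CG" looks U2-like here
```

A zero probability in either matrix would break the logarithm, so in practice the matrices need pseudocounts or a probability floor.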

Branch Point Selection

The branch point region is searched for the highest-scoring 7-mer matching the U12 branch point consensus (TCCTTAAC or similar). Multiple adenosine position variants are tested, and the highest-scoring position is selected.
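
The window search can be pictured as a sliding scan that keeps the best-scoring position. The sketch below is illustrative only: `best_window` and `toy_score` are hypothetical names, the toy scorer simply counts matches to the consensus instead of using a PWM, and the adenosine-position variants are not modeled.

```python
from typing import Callable, Tuple

CONSENSUS = "TCCTTAAC"  # U12 branch point consensus quoted above


def toy_score(kmer: str) -> float:
    # Stand-in for a PWM log-odds score: number of positions matching the consensus.
    return sum(a == b for a, b in zip(kmer, CONSENSUS))


def best_window(region: str, width: int,
                score: Callable[[str], float]) -> Tuple[int, str, float]:
    """Return (offset, subsequence, score) of the highest-scoring window."""
    if len(region) < width:
        raise ValueError("search region is shorter than the motif width")
    candidates = ((score(region[i:i + width]), i)
                  for i in range(len(region) - width + 1))
    # max() keeps the first maximum it sees, so ties resolve to the leftmost window.
    top_score, top_i = max(candidates, key=lambda c: c[0])
    return top_i, region[top_i:top_i + width], top_score


region = "GGGTTTCCTTAACGGG"
print(best_window(region, len(CONSENSUS), toy_score))  # (5, 'TCCTTAAC', 8)
```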

Stage 3: Normalization

Why Normalize?

Raw log-odds scores have different ranges and distributions for each region:

5'SS scores might range from -50 to +10
BP scores might range from -20 to +5
3'SS scores might range from -5 to +3

Normalization puts these on a comparable scale (reported as z-scores) before they are passed to the SVM.

Zero-Anchored Robust Scaling

intronIC uses zero-anchored robust scaling that preserves the semantic meaning of zero:

$$z = \frac{\text{raw}}{\text{median}(|\text{raw}|)}$$

Key properties:

Zero preserved: A raw score of 0 (equally plausible under U12 and U2) remains 0
Sign preserved: Positive scores stay positive, negative stay negative
Robust: Median is resistant to outliers and rare U12s (~0.5% of data)
Winsorization: Extreme values are clipped at the 99.5th percentile before the median is computed
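
The scaling formula can be sketched as follows. This is a simplified illustration with hypothetical function names; the winsorization step is left out because the text does not specify exactly how the clipping is applied.

```python
from statistics import median
from typing import List, Sequence


def fit_scale(raw: Sequence[float]) -> float:
    """Scale factor: the median of the absolute raw scores."""
    scale = median(abs(x) for x in raw)
    if scale == 0:
        raise ValueError("median absolute score is zero; cannot scale")
    return scale


def transform(raw: Sequence[float], scale: float) -> List[float]:
    # Pure division: no centering, so zero stays zero and signs are kept.
    return [x / scale for x in raw]


raw_scores = [-4.0, -2.0, 0.0, 2.0, 6.0]
scale = fit_scale(raw_scores)        # median of [4, 2, 0, 2, 6] is 2.0
print(transform(raw_scores, scale))  # [-2.0, -1.0, 0.0, 1.0, 3.0]
```

Fitting the scale once and reusing it for later data corresponds to saving and reloading a normalizer.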

Normalization Modes

Mode	Description	Use Case
`human`	Use scales from training (human) data	U12-absent species, cross-species
`adaptive`	Refit scales on experimental data	Species with different GC content
`auto`	Use `human` scales if the model includes them	Default behavior

Important: For reproducible results on genome subsets, save and reload the normalizer from a full-genome run.

Stage 4: Classification

Feature Space

Base features (3D):

five_z_score: 5' splice site z-score
bp_z_score: Branch point z-score
three_z_score: 3' splice site z-score

Augmented features (optional, 7D default):

min_all: min(five_z, bp_z, three_z) — Requires ALL signals strong
absdiff_5_bp: |five_z - bp_z| — Penalizes 5'/BP imbalance
absdiff_5_3: |five_z - three_z| — Penalizes 5'/3' imbalance
absdiff_bp_3: |bp_z - three_z| — Penalizes BP/3' imbalance

The augmented features help the linear SVM reject "one-end-strong" false positives—U2 introns with one unusually U12-like signal.
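
To make the seven-dimensional vector concrete, here is a small sketch that builds it from the three z-scores. The `augment` function name is an assumption; the feature order follows the lists above.

```python
from typing import List


def augment(five_z: float, bp_z: float, three_z: float) -> List[float]:
    """Return [five_z, bp_z, three_z, min_all, absdiff_5_bp, absdiff_5_3, absdiff_bp_3]."""
    return [
        five_z,
        bp_z,
        three_z,
        min(five_z, bp_z, three_z),  # min_all
        abs(five_z - bp_z),          # absdiff_5_bp
        abs(five_z - three_z),       # absdiff_5_3
        abs(bp_z - three_z),         # absdiff_bp_3
    ]


balanced = augment(2.0, 1.5, 1.0)          # all three signals U12-like
one_end_strong = augment(3.0, -1.0, -0.5)  # only the 5' signal is U12-like

print(balanced)        # [2.0, 1.5, 1.0, 1.0, 0.5, 1.0, 0.5]
print(one_end_strong)  # [3.0, -1.0, -0.5, -1.0, 4.0, 3.5, 0.5]
```

The second intron has the higher 5' score, but its negative `min_all` and large imbalance terms give a linear model something to penalize.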

Linear SVM

intronIC uses a linear support vector machine (scikit-learn's `LinearSVC`):

Kernel: Linear (interpretable coefficients)
Class weights: Balanced to handle ~0.5% U12 prevalence
Regularization: L2 penalty, C optimized via cross-validation
Convergence: `max_iter=50000`, `tol=1e-4`

Probability Calibration

The raw SVM output is a decision value (the signed distance from the separating hyperplane). These values are converted to probabilities using Platt scaling (sigmoid calibration):

$$P(\text{U12}) = \frac{1}{1 + \exp(A \cdot f(x) + B)}$$

Where $f(x)$ is the decision function output, and $A, B$ are fitted parameters. With this sign convention, a fitted $A < 0$ makes $P(\text{U12})$ increase with $f(x)$.
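
The SVM settings and the sigmoid calibration described above map onto scikit-learn roughly as follows. This is a sketch on synthetic data, not the actual training code: the Gaussian clusters stand in for normalized scores, `C` is fixed at 1.0 instead of being tuned by cross-validation, and only the three base features are used.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic stand-ins for (5'SS, BP, 3'SS) z-scores; not real intron data.
X_u2 = rng.normal(loc=-2.0, scale=1.0, size=(500, 3))
X_u12 = rng.normal(loc=2.0, scale=1.0, size=(50, 3))
X = np.vstack([X_u2, X_u12])
y = np.array([0] * len(X_u2) + [1] * len(X_u12))  # 1 = U12-type

svm = LinearSVC(
    C=1.0,                    # assumption: the real value is chosen by cross-validation
    class_weight="balanced",  # compensates for the rarity of U12-type introns
    max_iter=50_000,
    tol=1e-4,
    random_state=0,
)

# method="sigmoid" is Platt scaling; it is fitted on held-out folds (cv=5).
model = CalibratedClassifierCV(svm, method="sigmoid", cv=5)
model.fit(X, y)

queries = np.array([[3.0, 3.0, 3.0], [-3.0, -3.0, -3.0]])
p_u12 = model.predict_proba(queries)[:, 1]  # column 1 is P(U12-type)
print(p_u12[0] > 0.5, p_u12[1] < 0.5)
```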

Ensemble Training

When `--n-models` > 1, multiple SVMs are trained with different U2 subsamples:

Each model sees all U12 references but a different 80% of U2 references
Predictions are averaged across models
Reduces variance and improves robustness
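
The subsample-and-average scheme can be sketched without any model code. The function names are hypothetical, and sampling without replacement within each model is an assumption.

```python
import random
from statistics import fmean
from typing import List, Sequence


def u2_subsamples(n_u2: int, n_models: int, fraction: float = 0.8,
                  seed: int = 0) -> List[List[int]]:
    """Pick a different subset of U2 reference indices for each model."""
    rng = random.Random(seed)
    k = int(round(n_u2 * fraction))
    # Assumption: each subset is drawn without replacement.
    return [sorted(rng.sample(range(n_u2), k)) for _ in range(n_models)]


def average_predictions(per_model: Sequence[Sequence[float]]) -> List[float]:
    """Average P(U12) for each intron across the models."""
    return [fmean(column) for column in zip(*per_model)]


subsets = u2_subsamples(n_u2=100, n_models=3)
print([len(s) for s in subsets])  # [80, 80, 80]

# Two models, two introns: one row of probabilities per model.
averaged = average_predictions([[0.9, 0.1], [0.7, 0.3]])
```

Every model would be trained on all U12 references plus its own U2 subset; only the subset selection and the averaging are shown here.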

Training the Default Model

The default pretrained model was built from the following data and procedure.

Training data:

~400 conserved human U12-type introns (multiple evidence sources)
~20,000 conserved human U2-type introns

Training process:

Score all reference introns with PWMs
Normalize using zero-anchored robust scaling
Optimize SVM hyperparameters via 5-fold cross-validation
Train final model on all reference data
Calibrate probabilities via cross-validation

Evaluation:

Nested cross-validation for unbiased performance estimates
Balanced accuracy, F1-score, and PR-AUC reported
Validated on held-out human introns and cross-species data

Species-Specific Considerations

GC Content Effects

Species whose GC content differs from that of human may show shifted score distributions. Options:

Adaptive normalization: Refit the scaler on experimental data (`--normalizer-mode adaptive`)
Prior adjustment: Adjust the expected base rate of U12-type introns (`--species-prior`)

U12-Absent Lineages

For species known to lack U12 introns (C. elegans, many fungi):

intronIC -g genome.fa -a annotation.gff -n species \
         --normalizer-mode human --species-prior 1e-6

The reduced prior shifts probability thresholds to minimize false positives.
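
How `--species-prior` is applied internally is not described here. One standard way to move a calibrated probability to a new base rate is to rescale its odds by the ratio of the prior odds, sketched below. The function and the `train_prior` value are assumptions for illustration, not intronIC's documented behavior.

```python
def adjust_for_prior(p: float, train_prior: float, target_prior: float) -> float:
    """Rescale the odds of p from one class prior to another (standard prior shift)."""
    if not 0.0 < p < 1.0:
        return p  # 0 and 1 are unaffected by any finite rescaling of the odds
    odds = p / (1.0 - p)
    prior_ratio = (target_prior / (1.0 - target_prior)) / (train_prior / (1.0 - train_prior))
    new_odds = odds * prior_ratio
    return new_odds / (1.0 + new_odds)


# Assumption: the model behaves as if classes were balanced (train_prior=0.5).
shifted = adjust_for_prior(0.9, train_prior=0.5, target_prior=1e-6)
print(shifted < 1e-4)  # True: a 90% call becomes very unlikely under the tiny prior
```

Under this kind of correction, an intron needs much stronger motif evidence to cross a fixed probability threshold when the prior is small.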

Cross-Species Performance

The default human-trained model generalizes well to other vertebrates and most eukaryotes with U12 introns. Performance may degrade for:

Very distant lineages (plants, protists)
Lineages with unusual U12 motifs
Species with extreme GC bias

Consider providing species-specific reference sequences for best results.

Memory and Performance

Standard Mode

Memory scales with annotation density:

Loads all intron sequences into memory
Human genome (~250k introns): ~12 GB peak

Streaming Mode (`--streaming`)

Dramatically reduced memory:

Writes sequences to temporary SQLite database
Keeps only scoring motifs in memory
Human genome: ~2-3 GB peak (~75-83% reduction from the ~12 GB of standard mode)
Trade-off: Slightly slower I/O
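
The general idea of spilling sequences to a temporary SQLite file and reading them back one row at a time can be sketched with the standard library. The table layout and function names are hypothetical, not intronIC's actual schema.

```python
import os
import sqlite3
import tempfile
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # hypothetical (intron_id, sequence) pair


def spill_sequences(records: Iterable[Record], db_path: str) -> None:
    """Write sequences to disk so they need not all be held in memory."""
    con = sqlite3.connect(db_path)
    try:
        with con:  # commits on success, rolls back on error
            con.execute(
                "CREATE TABLE IF NOT EXISTS introns (id TEXT PRIMARY KEY, seq TEXT NOT NULL)"
            )
            con.executemany("INSERT INTO introns VALUES (?, ?)", records)
    finally:
        con.close()


def stream_sequences(db_path: str) -> Iterator[Record]:
    """Yield rows lazily instead of loading the whole table."""
    con = sqlite3.connect(db_path)
    try:
        yield from con.execute("SELECT id, seq FROM introns ORDER BY id")
    finally:
        con.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "introns.sqlite")
    spill_sequences(((f"intron_{i}", "GT" + "A" * i + "AG") for i in range(3)), path)
    rows = list(stream_sequences(path))

print(rows)  # [('intron_0', 'GTAG'), ('intron_1', 'GTAAG'), ('intron_2', 'GTAAAG')]
```

The extra disk reads and writes are the I/O cost mentioned above.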

Parallelization

The `-p N` flag parallelizes PWM scoring:

Scoring is CPU-bound and embarrassingly parallel
Near-linear speedup up to ~8-16 cores
Diminishing returns beyond that

References

Original intronIC paper:

Moyer DC, Larue GE, Hershberger CE, Roy SW, Padgett RA. (2020) Comprehensive database and evolutionary dynamics of U12-type introns. Nucleic Acids Research 48(13):7066–7078. doi:10.1093/nar/gkaa464

U12 intron databases:

Alioto TS. (2007) U12DB: a database of orthologous U12-type spliceosomal introns. Nucleic Acids Research 35:D110-D115. doi:10.1093/nar/gkl842

Branch point mapping:

Mercer TR, et al. (2015) Genome-wide discovery of human splicing branchpoints. Genome Research 25:290-303. doi:10.1101/gr.182899.114

SVM probability calibration:

Platt JC. (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers pp. 61-74.
