Skip to content

Technical algorithm

Graham edited this page Apr 24, 2026 · 41 revisions

Technical Details

This page provides a detailed technical description of intronIC's algorithm, data flow, and machine learning architecture.

Pipeline Overview

Input: Genome (FASTA) + Annotation (GFF3/GTF) + Species Name

Note: By default, intronIC uses streaming mode, which processes one chromosome at a time for ~85% memory reduction on large genomes.

  1. Stage 1: Intron Extraction

    • Parse annotation hierarchy (gene → transcript → CDS/exon)
    • Infer intron coordinates from exon gaps
    • Extract intron + flanking exon sequences from genome
    • Filter duplicates, short introns, isoforms (configurable)
  2. Stage 2: PWM Scoring

    • Score 5' splice site, branch point, 3' splice site regions
    • Calculate log-odds ratios: $\log\left(\frac{P(\text{seq}|\text{U12})}{P(\text{seq}|\text{U2})}\right)$
    • Select best branch point position from search window
  3. Stage 3: Normalization

    • Convert raw log-odds to z-scores via robust scaling
    • Fit on reference data or adapt to experimental data
  4. Stage 4: Classification

    • 6D feature vector: 3 z-scores + support2 + bp_offset + bp_scan_confidence
    • 42-model RBF SVM ensemble with balanced class weights
    • Probability calibration via isotonic regression
    • Output: per-intron $P(\text{U12-type})$ and ensemble $\sigma$
  5. Stage 5: Score Adjustment

    • Species-level valley prior from 2D KDE bimodality (5'z, BPz)
    • Per-intron ensemble disagreement penalty
    • Adjusted probability → threshold (default 95%)

Output: .meta.iic, .bed.iic, .introns.iic, .score_info.iic, plots


Stage 1: Intron Extraction

Coordinate Inference

Introns are inferred from gaps between consecutive exons (or CDS features) within the same transcript:

Exon 1        Intron 1          Exon 2          Intron 2          Exon 3
[==]------------------------------[==]------------------------------[==]
1-100          101-1600         1601-1700        1701-3200        3201-3300

Priority: CDS features are preferred over exon features when available, as they enable phase calculation.

Mixed CDS/Exon Handling: For transcripts with both CDS and exon features, introns are first generated from CDS features, then exon-only introns (typically in UTR regions) are added if they don't overlap existing CDS-derived introns. All introns are sorted by genomic position and assigned sequential indices, ensuring proper ordering (e.g., 5' UTR introns are numbered before CDS introns in coding direction).

Touching Exons (Annotation Artifacts): Some annotations contain adjacent exon features with no gap between them (zero-length "introns"). These are silently skipped and not included in the intron count (family_size). The intron index sequence remains contiguous (1, 2, 3...) with no gaps for these annotation artifacts.

Filtering Criteria

Filter Default Description
Duplicates Exclude Same coordinates from multiple isoforms
Longest isoform Keep only Can include all with -i
Minimum length 30 bp Adjustable via --min-intron-len
Ambiguous bases Exclude 'N' in scoring regions
Non-canonical Include Exclude with --no-nc

Stage 2: PWM Scoring

Position Weight Matrices

Position weight matrices (PWMs) capture the probability of observing each nucleotide at each position in a motif. intronIC includes PWMs for:

  • U12-type: AT-AC and GT-AG subtypes
  • U2-type: GT-AG, GC-AG, and AT-AC subtypes

Each subtype has PWMs for all three regions (5' splice site, branch point, 3' splice site). The U2 AT-AC PWMs prevent inflated log-ratio scores when scoring AT-AC introns, where the U2 fallback to GT-AG matrices would produce artificially low U2 scores at the dinucleotide positions.

Scoring Regions

Default scoring windows:

Region Relative to Start End Length Description
5' SS Intron 5' end -3 +9 12 bp Includes last 3 bp of upstream exon
Branch point Intron 3' end -55 -5 (clamped at 3'SS boundary) up to 50 bp Search window, clamped to exclude 3'SS region
3' SS Intron 3' end -6 +4 10 bp Core acceptor only (excludes PPT)

Non-overlapping regions: The branch point search window is clamped at the 3'SS scoring boundary to prevent feature overlap. For a 100 bp intron, the BP search region covers positions 45–93 and the 3'SS covers 94–100.

PWM Selection

For each intron, the terminal dinucleotides (e.g. GT-AG, AT-AC, GC-AG) determine which PWM subtype is used. Both the U12 and U2 PWMs are selected to match the intron's dinucleotide class:

  • A GT-AG intron is scored with U12 GT-AG and U2 GT-AG matrices
  • An AT-AC intron is scored with U12 AT-AC and U2 AT-AC matrices
  • A GC-AG intron is scored with U12 GT-AG (fallback) and U2 GC-AG matrices

When either model lacks a direct match for the dinucleotide class (e.g. GC-AG introns use U12 GT-AG as a fallback, rare boundary types fall back to GT-AG for both models), the fallback PWM's terminal dinucleotide positions are masked so that the score reflects the surrounding motif context rather than a dinucleotide mismatch penalty.

Log-Odds Ratio Calculation

For each region, the raw score is a log-odds ratio computed by scoring the same sequence with both the U12 and U2 PWMs:

$$\text{LLR} = \log_2\left(\frac{\prod_{i} P(b_i | \text{U12 matrix})}{\prod_{i} P(b_i | \text{U2 matrix})}\right)$$

Where:

  • $b_i$ is the nucleotide at position $i$
  • The product is taken over all positions in the scoring window
  • Higher positive values favor U12-type
  • Higher negative values favor U2-type
  • Zero means equally likely under both models

For the 5'SS and 3'SS regions, the scoring window is fixed (see table above) and the log-ratio is computed directly.

Branch Point Selection

For the branch point region, position selection and scoring are separate steps:

  1. Position selection: The search window is scanned with a sliding window using the U12 BPS PWM only. The position with the highest raw U12 probability product is selected as the putative branch point.
  2. Log-ratio scoring: The same sequence at the selected position is then scored with both the U12 and U2 BPS PWMs, and the log-ratio is computed. This ensures the ratio compares both models at the same location.

The branch point adenosine position (bp_offset) is reported relative to the 3' splice site, derived from the PWM's defined reference position (biological position 0 in the matrix). U12-type branch points cluster tightly at -10 to -15 nt from the 3'SS (median -13), while U2-type branch points are more dispersed (-20 to -35 nt) (Pineda & Bradley 2018).

The scan also computes a BPS confidence score (bp_scan_confidence): log2(best_score / mean_score) across all scan positions. This measures how sharply the best match stands out from background — high values indicate a well-defined motif (U12-like), low values indicate a flat landscape (U2-like).


Species-Specific Background Correction

Before normalization, intronIC optionally corrects the U2 denominator PWMs using the species' own nucleotide composition. This addresses a systematic bias: species with different GC content than human (the training species) can produce shifted z-score distributions, inflating or deflating log-odds ratios.

Method

The correction blends empirical per-position nucleotide frequencies from the target species' intron pool into the U2 PWMs:

$$\text{PWM}_{\text{corrected}} = w \cdot \text{PWM}_{\text{empirical}} + (1 - w) \cdot \text{PWM}_{\text{human}}$$

The blending weight is computed per dinucleotide subtype (GT-AG, GC-AG, AT-AC):

$$w = \frac{n_{\text{subtype}}}{n_{\text{subtype}} + n_0}$$

Where $n_{\text{subtype}}$ is the number of introns with that terminal dinucleotide and $n_0$ is a shrinkage parameter (default 1000). Rare subtypes (e.g., AT-AC with few hundred introns) naturally stay close to the human prior. Common subtypes (e.g., GT-AG with 200k+ introns) shift toward the species' empirical distribution.

Only the U2 denominator PWMs are modified. The U12 numerator PWMs remain fixed, preserving the signal that defines U12-type introns.

Configuration

Background correction is enabled by default. The key parameter is n0 (Bayesian shrinkage strength). The model's internal scaler is fitted on BG-corrected z-scores during training, so classification runs should use the same n0 value (default 1000) for consistency.

scoring:
  species_background:
    enabled: true
    n0: 1000          # Shrinkage strength
    min_introns: 500  # Skip correction below this count

Stage 3: Normalization

Why Normalize?

Raw log-odds scores have different ranges and distributions for each region:

  • 5'SS scores might range from -50 to +10
  • BP scores might range from -20 to +5
  • 3'SS scores might range from -5 to +3

Normalization converts these to comparable z-scores for the SVM.

Robust Scaling

intronIC uses robust z-score normalization via sklearn's RobustScaler:

$$z = \frac{\text{raw} - \text{median}}{\text{IQR}}$$

Where IQR is the interquartile range ($Q_{75} - Q_{25}$).

Key properties:

  • Robust to outliers: Median and IQR are resistant to extreme values
  • Centered: Distribution centered around 0 for each feature
  • Comparable scales: All three regions have similar variance after scaling
  • Minimal U12 contamination: Rare U12s (~0.5%) don't significantly affect robust statistics

Note: A second normalization step (sklearn StandardScaler) is applied inside the classification pipeline to the full augmented feature vector (base z-scores + composite features + extra features). This ensures all features have zero mean and unit variance before SVM classification.

Normalization Modes

Mode Description Use Case
human Use scales from training (human) data U12-absent species, cross-species
adaptive Refit scales on experimental data Species with different GC content
auto Use human if available in model Default behavior

Note: The pretrained model includes its scaler, so results are reproducible without additional steps when using --normalizer-mode human or auto.


Stage 4: Classification

Feature Space

The default model uses 6 features (6D):

Base features (3D):

  • s5z: 5' splice site z-score
  • BPz: Branch point z-score
  • s3z: 3' splice site z-score (core acceptor, -6 to +3)

Composite feature (1D):

  • support2: second-largest of max(0, s5z), max(0, BPz), max(0, s3z) — encodes "at least two scoring regions positively support U12." Zero when only one site is positive.

Extra features (2D):

  • bp_offset: branch point adenosine distance from 3'SS (negative integer). U12-type introns cluster at -10 to -15; U2-type at -20 to -35.
  • bp_scan_confidence: log2(best/mean) of U12 PWM scores across all BP scan positions. Measures motif sharpness; U12-type introns produce sharper peaks than U2-type.

RBF SVM

The default model uses an RBF (radial basis function) kernel SVM (sklearn's SVC):

  • Kernel: RBF (nonlinear boundary, can learn local patterns)
  • Hyperparameters: C=35.11, gamma=0.01 (optimized via iterative grid search)
  • Class weights: Balanced to handle ~0.5% U12 prevalence

The RBF kernel allows the classifier to learn curved boundaries — for example, requiring both strong donor AND strong BP for confident U12 calls, rather than allowing one to compensate for the other as a linear model would.

Probability Calibration

The raw SVM outputs decision distances (signed distance from hyperplane). These are converted to probabilities using sklearn's CalibratedClassifierCV. During training, both sigmoid (Platt scaling) and isotonic calibration are evaluated via cross-validation, and the method with lower log-loss is selected.

  • Sigmoid (Platt scaling): Fits a logistic function with 2 parameters — more conservative probability estimates, fewer borderline calls
  • Isotonic: Non-parametric monotonically increasing step function — more flexible, typically selected when sufficient training data is available

The default pretrained model uses isotonic calibration.

Ensemble Training

When --n-models > 1, multiple SVMs are trained with different U2 subsamples:

The default model uses a 42-model ensemble. Each model sees all U12 references but a different random 75% subset of U2 references. Final predictions are the mean probability across all models. The standard deviation of per-model probabilities (ensemble_sigma) quantifies model agreement and feeds into the score adjustment step.


Training the Default Model

The default pretrained model (v2.3.0) was trained on:

Training data:

  • 472 human U12-type introns from IPA-conserved gold standard with CoLa-seq empirical branch points (Zeng et al. 2022)
  • 30,155 human U2-type introns (20,690 original + 9,465 IPA-conserved/hard-negative, deduplicated by sequence hash)

Training process:

  1. Score all reference introns with PWMs (12 bp BPS motifs derived from CoLa-seq data)
  2. Normalize using robust scaling (median/IQR)
  3. Optimize SVM hyperparameters (C, gamma, penalty, class weight multiplier) via iterative grid search with 7-fold cross-validation
  4. Train 42-model ensemble on all reference data with 75% U2 subsampling per model
  5. Calibrate probabilities via isotonic regression (selected over sigmoid by log-loss)

Evaluation:

  • 7-fold nested cross-validation: mean F1 = 0.9947 ± 0.0065, PR-AUC = 0.9999 ± 0.0002
  • Validated across 13 species spanning vertebrates, arthropods, nematodes, plants, fungi, and protists
  • False positive assessment in species thought to lack U12-type introns: 0 confident calls in C. elegans, 1 in Ascaris suum

Species-Level Cluster Validation

After classification, intronIC performs a cluster validation step to assess whether the confident U12-type calls form a genuine population distinct from U2-type introns.

Valley depth metric

The algorithm projects all intron scores onto the 1D axis connecting the U2 centroid to the U12 centroid in 5'z × BPz space, then estimates the density along this axis using kernel density estimation (KDE). A valley depth is computed as the proportional drop in density between the U2 and U12 clusters:

$$\text{valley_depth} = 1 - \frac{\text{density_min_in_gap}}{\min(\text{density_U2_side}, \text{density_U12_side})}$$

Values range from 0 (no separation) to 1 (complete separation). To ensure robustness, the density is evaluated at multiple KDE bandwidths (0.5× to 5× Silverman's rule) and the median depth across bandwidths is reported. This prevents both over-smoothing (which hides real valleys) and under-smoothing (which creates spurious valleys from noise).

A valley depth > 0.3 is considered evidence of bimodal separation. The metric is reported in the log output along with centroid_σ (the distance between U12 and U2 centroids in U2 standard deviation units).

Interpretation

Valley depth Interpretation Example species (depth; n confident U12-type)
> 0.9 Strong bimodal separation — genuine U12-type population H. sapiens (0.94; n=741), D. melanogaster (0.94; n=18), I. scapularis (0.93; n=416)
0.5–0.9 Moderate separation — likely real but smaller U12-type population A. castellanii (0.59; n=13), V. jacobsoni (0.75; n=5)
< 0.3 No valley — U12-type calls do not form a distinct cluster C. elegans (0.00; n=0), A. suum (0.00; n=1)

When no valley is detected, intronIC emits a warning:

⚠ No density valley detected (depth=0.000): U12-type intron calls may not form a
distinct cluster. Consider reviewing calls with caution.

This does not prevent classification — individual intron scores are still valid — but signals that the species may lack a distinct U12-type intron population.


Score Adjustment

After classification, intronIC adjusts the raw SVM probabilities to incorporate two independent signals: species-level population evidence and per-intron model agreement.

Formula

The adjustment operates in log-odds space:

$$\text{logit}(p_{\text{adj}}) = \text{logit}(p_{\text{svm}}) + \log\left(\frac{\pi_{\text{species}}}{0.5}\right) - k_\sigma \cdot \sigma$$

Where:

  • $p_{\text{svm}}$: raw ensemble mean probability (isotonic-calibrated)
  • $\pi_{\text{species}}$: species-level prior from valley depth
  • $\sigma$: ensemble standard deviation (std of per-model probabilities)
  • $k_\sigma = 3.0$: disagreement penalty coefficient

Valley-to-prior mapping

The species prior is derived from valley depth via a sigmoid:

$$\pi_{\text{species}} = \pi_{\text{floor}} + \frac{0.5 - \pi_{\text{floor}}}{1 + \exp\left(-\frac{4.394}{w} \cdot (d - m)\right)}$$

Where $d$ is valley depth, $m$ is the midpoint (default 0.3), $w$ is the transition width (default 0.25), and $\pi_{\text{floor}}$ is the minimum prior (default 0.001).

Valley depth Prior ($\pi$) Effect on a 99% SVM score
0.0 0.001 Adjusted to ~9%
0.15 ~0.03 Adjusted to ~75%
0.30 ~0.25 Adjusted to ~98%
0.65+ ~0.50 No change

For species with strong U12 populations (valley > 0.5), the adjustment is negligible. For U12-absent species (valley near 0), even high SVM scores are discounted below the threshold.

Effect on output

  • adjusted_score in score_info.iic: the post-adjustment probability
  • rel_score in meta.iic and score_info.iic: adjusted_score - threshold
  • type_id (U12/U2): unchanged — still set by the raw SVM decision boundary
  • svm_score in score_info.iic: unchanged — still the raw ensemble mean

Configuration

scoring:
  threshold: 95.0
  score_adjustment:
    enabled: true
    valley_midpoint: 0.3
    transition_width: 0.25
    prior_floor: 0.001
    k_sigma: 3.0

Species-Specific Considerations

GC Content Effects

Species with different GC content than human may show shifted score distributions. Options:

  1. Adaptive normalization: Refit scaler on experimental data (--normalizer-mode adaptive)
  2. Prior adjustment: Adjust base rate expectation (--species-prior)

U12-Absent Lineages

The score adjustment automatically handles species lacking U12-type introns. When no valley is detected in the z-score distribution, the species prior drops to near-floor levels, discounting SVM probabilities below the confidence threshold. For example, C. elegans produces 1 confident call (vs. 4 without adjustment) and Ascaris suum produces 1 (vs. 7 without adjustment).

No special configuration is needed for U12-absent species.

Cross-Species Performance

The default human-trained model generalizes well to other vertebrates and most eukaryotes with U12-type introns. Performance may degrade for:

  • Very distant lineages (plants, protists)
  • Lineages with unusual U12-type motifs
  • Species with extreme GC bias

Consider providing species-specific reference sequences for best results.


Memory and Performance

Standard Mode

Memory scales with annotation density:

  • Loads all intron sequences into memory
  • Human genome (~250k introns): ~12-18 GB peak (varies by system and annotation density)

Streaming Mode (--streaming)

Dramatically reduced memory:

  • Writes sequences to temporary SQLite database
  • Keeps only scoring motifs in memory
  • Human genome: ~2-3 GB peak (~85% reduction)
  • Trade-off: Slightly slower I/O

Parallelization

The -p N flag parallelizes PWM scoring:

  • Scoring is CPU-bound and parallelizes efficiently
  • Linear speedup up to ~8-16 cores
  • Diminishing returns beyond that

References

Original intronIC paper:

Moyer DC, Larue GE, Hershberger CE, Roy SW, Padgett RA. (2020) Comprehensive database and evolutionary dynamics of U12-type introns. Nucleic Acids Research 48(13):7066–7078. doi:10.1093/nar/gkaa464

U12-type intron databases:

Larue GE, Roy SW. (2023) Where the minor things are: a pan-eukaryotic survey suggests neutral processes may explain much of minor intron evolution. Nucleic Acids Research 51(20):10884-10908. doi:10.1093/nar/gkad797

Moyer DC, Larue GE, Hershberger CE, Roy SW, Padgett RA. (2020) Comprehensive database and evolutionary dynamics of U12-type introns. Nucleic Acids Research 48(13):7066-7078. doi:10.1093/nar/gkaa464

Alioto TS. (2007) U12DB: a database of orthologous U12-type spliceosomal introns. Nucleic Acids Research 35:D110-D115. doi:10.1093/nar/gkl842

Branch point mapping and spliceosome profiling:

Burke JE, Longhurst AD, Merkurjev D, Sales-Lee J, Rao B, Moresco JJ, Yates JR III, Li JJ, Madhani HD. (2018) Spliceosome profiling visualizes operations of a dynamic RNP at nucleotide resolution. Cell 173(4):1014-1030.e17. doi:10.1016/j.cell.2018.03.020

Zeng Y, Fair BJ, Zeng H, Krishnamohan A, Hou Y, Hall JM, Ruthenburg AJ, Li YI, Staley JP. (2022) Profiling lariat intermediates reveals genetic determinants of early and late co-transcriptional splicing. Molecular Cell 82(24):4681-4699. doi:10.1016/j.molcel.2022.11.004

Mercer TR, et al. (2015) Genome-wide discovery of human splicing branchpoints. Genome Research 25:290-303. doi:10.1101/gr.182477.114

Pineda JMB, Bradley RK. (2018) Most human introns are recognized via multiple and tissue-specific branchpoints. Genes & Development 32(7-8):577-591. doi:10.1101/gad.312058.118

U12-type intron retention and functional studies:

Niemelä EH, et al. (2014) Global analysis of the nuclear processing of transcripts with unspliced U12-type introns by the exosome. Nucleic Acids Research 42(11):7358-7369. doi:10.1093/nar/gku391

Madan V, et al. (2015) Aberrant splicing of U12-type introns is the hallmark of ZRSR2 mutant myelodysplastic syndrome. Nature Communications 6:6042. doi:10.1038/ncomms7042

Cologne A, et al. (2019) New insights into minor splicing—a transcriptomic analysis of cells derived from TALS patients. RNA 25(9):1130-1149. doi:10.1261/rna.071423.119

SVM probability calibration:

Platt JC. (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers pp. 61-74.

Clone this wiki locally