Prototype multi-effect candidates for splice-site variants #262

@iskandr

Background

Splice-site variants are the natural first test case for a multi-effect-candidate model because varcode already handles them — but in a way that forces a single winner where multiple outcomes are biologically real. Before tackling the harder problem of SV-induced splice variation (#259), we should get the modeling right for simple cases we already encounter: a point mutation near an exon boundary that simultaneously affects coding sequence and splice signals.

User-reported bugs related to this

Several filed issues trace back to the same underlying problems with splice-site handling:

  • Consider alternative effect of ExonicSpliceSite for "top priority" #233 — A variant's top priority effect was ExonicSpliceSite but the alternate_effect was a stop gain. The stop gain should have been surfaced. This was "fixed" by adding select_between_exonic_splice_site_and_alternate_effect() to the priority logic, which picks whichever has higher priority — but this is a band-aid that discards one of the two real effects instead of reporting both.
  • ExonicSpliceSite mutations are classified as Noncoding #136: ExonicSpliceSite mutations were dropped by drop_silent_and_noncoding() even when their alternate_effect was a Substitution (e.g., p.E448K). The workaround treats ExonicSpliceSite as coding if it has a coding alternate, but the real problem is that the coding effect shouldn't be buried inside a wrapper in the first place.
  • WT and mutation AA incorrect for variant in codon split across exon boundary #236 — Wrong amino acid reported for a variant in a codon split across an exon boundary (p.W138R instead of p.C138S). The wrong transcript was selected as "top priority" because it had a longer CDS, but it was a non-canonical isoform with different splicing. This is a symptom of the single-effect model: when you force one winner, isoform selection heuristics can pick the wrong one.
  • Exception when deletion extends beyond exon boundary #224 — Deletion extending beyond an exon boundary (GTGAAGG where exon ends at GTGAAG) threw a ValueError. The reporter noted: "This seems like it would likely cause a splice variant, since both the exonic and intronic splice signal is potentially damaged." This is exactly the kind of variant that should produce multiple candidate effects (coding deletion + splice disruption).

The current approach and its problems

1. ExonicSpliceSite wraps a single alternate — then one gets discarded

When a variant falls at an exon boundary, effect_prediction.py creates an ExonicSpliceSite that stores the coding effect as alternate_effect (line 269 of effect_classes.py):

class ExonicSpliceSite(Exonic, SpliceSite):
    def __init__(self, variant, transcript, exon, alternate_effect):
        ...

But then:

  • mutant_protein_sequence ignores the splice disruption entirely and returns the alternate coding effect's protein sequence, with the comment: "TODO: determine when exonic splice variants cause exon skipping vs. translation of the underlying modified coding sequence. For now just pretending like there is no effect on splicing."
  • top_priority_effect() calls select_between_exonic_splice_site_and_alternate_effect() which picks one of the two based on priority — either the splice effect or the coding effect, never both
  • ExonicSpliceSite sits below Substitution/Insertion/Deletion in the priority list, so the coding effect often "wins" and the splice disruption is silently dropped

This means: a missense SNV at the last position of an exon is reported as a substitution, with no indication that it also likely disrupts splicing. The real biology is that both outcomes may occur — some transcripts splice normally (producing the amino acid change), some mis-splice (producing exon skipping, intron retention, or cryptic splice site usage).

2. Intronic splice classification ignores sequence content

choose_intronic_effect_class() in effect_prediction.py classifies variants as SpliceDonor, SpliceAcceptor, or IntronicSpliceSite based solely on distance to the nearest exon (lines 230-278). It never examines:

  • What the reference splice signal actually is (maybe it's already non-canonical)
  • What the variant changes it to (a GT→GC donor mutation is very different from a GT→AT mutation in terms of residual splice activity)
  • Whether the mutation creates or strengthens a competing splice signal nearby

A mutation 2bp into an intron is always SpliceDonor, whether it changes the canonical GT to something else, leaves it intact (e.g., a variant whose coordinates fall within 2bp of the exon but whose alleles do not alter the GT dinucleotide), or even restores a broken splice signal.

3. Exonic splice-site detection has a likely bug

In changes_exonic_splice_site() (effect_helpers.py, lines 149-160):

end_of_reference_exon = transcript.sequence[exon_end_offset - 2:exon_end_offset + 1]
if matches_exon_end_pattern(end_of_reference_exon):
    end_of_variant_exon = end_of_reference_exon  # <-- no mutation applied!
    if matches_exon_end_pattern(end_of_variant_exon):
        return True

end_of_variant_exon is set to the reference sequence without applying the variant. The function then checks if the reference matches the splice pattern and, if so, checks the reference again — always returning True when the reference has a canonical splice signal, regardless of whether the variant disrupts it or not.

What this issue should accomplish

Multi-effect candidate model

Instead of forcing a single effect, splice-site variants should produce a set of candidate effects, representing the possible outcomes:

  • Normal splicing occurs → the coding effect (substitution, frameshift, etc.) is the outcome
  • Exon skipping → the affected exon is excluded from the transcript
  • Intron retention → the intron is included, likely causing a frameshift or premature stop
  • Cryptic splice site activation → a nearby alternative splice signal is used, producing a truncated or extended exon

Each candidate should ideally carry some indication of plausibility (even if initially just "canonical signal disrupted" vs. "canonical signal preserved").
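To make the shape of this concrete, here is a minimal sketch of what a candidate set could look like. None of these names exist in varcode today: Outcome, CandidateEffect, CandidateEffectSet, and example_candidate_set are hypothetical and only illustrate the idea of keeping every plausible outcome together with a coarse plausibility flag.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Outcome(Enum):
    """Hypothetical labels for the four outcomes listed above."""
    NORMAL_SPLICING = "normal_splicing"
    EXON_SKIPPING = "exon_skipping"
    INTRON_RETENTION = "intron_retention"
    CRYPTIC_SPLICE_SITE = "cryptic_splice_site"


@dataclass(frozen=True)
class CandidateEffect:
    """One possible consequence of a splice-site variant (hypothetical class)."""
    outcome: Outcome
    description: str
    # Coarse plausibility flag; None means "not assessed yet".
    canonical_signal_disrupted: Optional[bool] = None


@dataclass
class CandidateEffectSet:
    """All candidate effects for one variant on one transcript (hypothetical class)."""
    variant_description: str
    candidates: List[CandidateEffect] = field(default_factory=list)

    def add(self, candidate: CandidateEffect) -> None:
        self.candidates.append(candidate)

    def with_outcome(self, outcome: Outcome) -> List[CandidateEffect]:
        return [c for c in self.candidates if c.outcome is outcome]


def example_candidate_set() -> CandidateEffectSet:
    # Illustrative only: a missense SNV at an exon boundary keeps both
    # the coding change and a mis-splicing outcome instead of one winner.
    effects = CandidateEffectSet("missense SNV at the last base of an exon")
    effects.add(CandidateEffect(Outcome.NORMAL_SPLICING, "p.E448K", True))
    effects.add(CandidateEffect(Outcome.EXON_SKIPPING, "affected exon excluded", True))
    return effects
```

A design like this leaves prioritization as a separate step: callers that want a single answer can still pick one candidate, but nothing is discarded when the set is built.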

Sequence-aware splice prediction

The classification of splice effects should examine the actual nucleotide content:

  • Read the reference splice signal (donor: GT at +1/+2; acceptor: AG at -2/-1; exonic context: MAG|GT at the donor boundary and YAG|R at the acceptor boundary)
  • Determine whether the variant disrupts, preserves, or creates splice signals
  • Score how severely the consensus is broken (e.g., GT→GC is a known weak donor that sometimes still functions; GT→AA is complete loss)

This doesn't need to be a full splice prediction algorithm — MaxEntScan or similar could be integrated later. But even checking "does the variant change the canonical dinucleotide" would be a large improvement over pure distance-based classification.
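As a rough illustration of the "check the canonical dinucleotide" level of analysis, the sketch below compares the reference and variant dinucleotides at a donor or acceptor site. The function names, the return labels, and the three-level strength table are assumptions made for this example, not existing varcode code; the only biology encoded is what is stated above (GT donor, AG acceptor, GC as a weak donor).

```python
# Canonical dinucleotides, as described above.
CANONICAL = {"donor": "GT", "acceptor": "AG"}
# Assumption for this sketch: GC is the only "weak but sometimes functional" signal.
WEAK = {"donor": {"GC"}, "acceptor": set()}


def signal_strength(site, dinucleotide):
    """Return 2 for canonical, 1 for known-weak, 0 for anything else."""
    if dinucleotide == CANONICAL[site]:
        return 2
    if dinucleotide in WEAK[site]:
        return 1
    return 0


def classify_signal_change(site, ref_dinucleotide, alt_dinucleotide):
    """Classify how a variant changes the splice dinucleotide at `site`."""
    if site not in CANONICAL:
        raise ValueError("site must be 'donor' or 'acceptor'")
    ref = ref_dinucleotide.upper()
    alt = alt_dinucleotide.upper()
    if len(ref) != 2 or len(alt) != 2:
        raise ValueError("expected two-base sequences")
    if ref == alt:
        return "preserved"
    before = signal_strength(site, ref)
    after = signal_strength(site, alt)
    if after > before:
        # Includes restoring a previously broken signal.
        return "created_or_strengthened"
    if after == before:
        # With the tables above, this only happens when both are non-canonical.
        return "still_noncanonical"
    return "weakened" if after == 1 else "lost"
```

For example, classify_signal_change("donor", "GT", "GC") gives "weakened" while classify_signal_change("donor", "GT", "AA") gives "lost". A real scorer such as MaxEntScan would replace the strength table rather than the overall shape.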

Fix the exonic splice site detection bug

changes_exonic_splice_site() should apply the variant to the exon-end sequence before checking whether the splice pattern is disrupted.
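A possible shape for the fix, written as a standalone sketch rather than a patch: the helpers below are hypothetical stand-ins (they are not the functions in effect_helpers.py), the pattern check assumes the MAG exon-end consensus mentioned above, and only same-length substitutions are handled. The sketch also assumes the intended meaning is "return True when the variant breaks a signal that the reference had".

```python
def matches_mag_pattern(seq):
    """True if the last three bases are MAG, where M is A or C (assumed pattern)."""
    return len(seq) >= 3 and seq[-3] in ("A", "C") and seq[-2] == "A" and seq[-1] == "G"


def apply_substitution(sequence, offset, ref, alt):
    """Return `sequence` with `ref` at `offset` replaced by `alt`."""
    if sequence[offset:offset + len(ref)] != ref:
        raise ValueError("reference allele does not match sequence at offset")
    return sequence[:offset] + alt + sequence[offset + len(ref):]


def disrupts_exon_end_signal(sequence, exon_end_offset, variant_offset, ref, alt):
    """
    Hypothetical replacement logic: `exon_end_offset` is the index of the
    last base of the exon, matching the slice used in the snippet above.
    """
    if len(ref) != len(alt):
        raise ValueError("this sketch only handles same-length substitutions")
    start, end = exon_end_offset - 2, exon_end_offset + 1
    if not matches_mag_pattern(sequence[start:end]):
        # The reference has no canonical signal here, so nothing to disrupt.
        return False
    # The key difference from the current code: mutate first, then re-check.
    mutated = apply_substitution(sequence, variant_offset, ref, alt)
    return not matches_mag_pattern(mutated[start:end])
```

Because the window is re-read from the mutated sequence, a variant outside the last three exonic bases, or one that swaps A for C at the M position, no longer counts as a disruption.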

Tests that should exist but don't

The existing test suite for splice behavior is extremely thin — test_exonic_splice_site.py has a single test, and test_effect_classes.py has one ExonicSpliceSite test plus a TODO comment (# TODO: SpliceDonor, SpliceReceptor). There are zero tests for SpliceDonor, SpliceAcceptor, or IntronicSpliceSite effects.

Tests that would have caught the known bugs:

  1. SNV at exon boundary that is both a coding change and splice disruption — assert that both effects are accessible, not just the priority winner. (Would have prevented Consider alternative effect of ExonicSpliceSite for "top priority" #233 and ExonicSpliceSite mutations are classified as Noncoding #136 from being a surprise.)

  2. SNV at exon boundary that does NOT disrupt the splice signal — e.g., a synonymous change at the -3 position (MAG) that preserves the M base. Should not be classified as ExonicSpliceSite. (Would have caught the changes_exonic_splice_site bug where the reference is checked against itself.)

  3. SNV at exon boundary that DOES disrupt the splice signal — e.g., changing the G in MAG to T. Verify that the splice disruption is detected. (Would have caught the same bug from the other direction.)

  4. Intronic variant at +1/+2 that changes the canonical GT donor — verify SpliceDonor is returned AND that the mutated sequence is noted. (Would have exposed the lack of sequence content checking.)

  5. Intronic variant classified as within 2bp of the exon that does NOT change the canonical GT (e.g., a variant at +3 that happens to be 2bp from the exon due to coordinate math). Should not be SpliceDonor if the GT is intact. (Would have caught pure-distance classification errors.)

  6. Intronic variant at -1/-2 that changes the canonical AG acceptor — verify SpliceAcceptor and note the sequence change.

  7. Deletion spanning exon-intron boundary — should produce both a coding effect and a splice effect, not crash (Exception when deletion extends beyond exon boundary #224) or silently pick one.

  8. drop_silent_and_noncoding() on ExonicSpliceSite with coding alternate — should retain the effect (ExonicSpliceSite mutations are classified as Noncoding #136).

  9. top_priority_effect() on ExonicSpliceSite with PrematureStop alternate — should return the PrematureStop (Consider alternative effect of ExonicSpliceSite for "top priority" #233). (This test exists now in test_exonic_splice_site.py but was added after the bug was found.)

  10. Variant in codon split across exon boundary — verify correct amino acids are reported regardless of which transcript is chosen (WT and mutation AA incorrect for variant in codon split across exon boundary #236).
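As an example of how small these tests can be, here is a sketch of test 7 against a hypothetical helper. split_deletion_at_exon_end and candidate_labels do not exist in varcode; the coordinates (start 100, exon end 105) are made up to mirror the GTGAAGG case from #224, and the sketch assumes a forward-strand exon with the deletion starting at or after the exon start.

```python
def split_deletion_at_exon_end(deletion_start, deletion_length, exon_end):
    """
    Split a deletion into exonic and intronic parts (hypothetical helper).

    Positions are 0-based; `exon_end` is the last exonic position.
    Returns (n_exonic, n_intronic, touches_donor), where touches_donor
    is True if intronic position +1 or +2 is deleted.
    """
    if deletion_length <= 0:
        raise ValueError("deletion_length must be positive")
    deletion_end = deletion_start + deletion_length - 1  # inclusive
    n_exonic = max(0, min(deletion_end, exon_end) - deletion_start + 1)
    n_intronic = deletion_length - n_exonic
    touches_donor = deletion_start <= exon_end + 2 and deletion_end >= exon_end + 1
    return n_exonic, n_intronic, touches_donor


def candidate_labels(deletion_start, deletion_length, exon_end):
    """Report every candidate effect instead of raising or picking one."""
    n_exonic, _, touches_donor = split_deletion_at_exon_end(
        deletion_start, deletion_length, exon_end)
    labels = []
    if n_exonic > 0:
        labels.append("coding_deletion")
    if touches_donor:
        labels.append("splice_disruption")
    return labels


def test_deletion_spanning_exon_end_yields_both_candidates():
    # GTGAAGG deleted where the exon ends at GTGAAG: six exonic bases plus +1.
    assert split_deletion_at_exon_end(100, 7, 105) == (6, 1, True)
    assert candidate_labels(100, 7, 105) == ["coding_deletion", "splice_disruption"]


def test_purely_exonic_deletion_has_no_splice_candidate():
    assert candidate_labels(90, 3, 105) == ["coding_deletion"]
```

The real tests would build varcode Variant objects and inspect the returned effects, but the assertion style (both candidates present, no exception) would be the same.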

Relationship to other issues

This is the proving ground for the multi-effect model described in #259. The design patterns established here (how candidate effects are represented, stored, prioritized, and optionally resolved by RNA evidence) should generalize to the harder SV cases. If the model works for "SNV at exon boundary produces both a coding change and a splice disruption," it should scale to "translocation produces three possible fusion isoforms."
