Background
Splice-site variants are the natural first test case for a multi-effect-candidate model because varcode already handles them — but in a way that forces a single winner where multiple outcomes are biologically real. Before tackling the harder problem of SV-induced splice variation (#259), we should get the modeling right for simple cases we already encounter: a point mutation near an exon boundary that simultaneously affects coding sequence and splice signals.
User-reported bugs related to this
Several filed issues trace back to the same underlying problems with splice-site handling:
Consider alternative effect of ExonicSpliceSite for "top priority" #233 — A variant's top priority effect was ExonicSpliceSite but the alternate_effect was a stop gain. The stop gain should have been surfaced. This was "fixed" by adding select_between_exonic_splice_site_and_alternate_effect() to the priority logic, which picks whichever has higher priority — but this is a band-aid that discards one of the two real effects instead of reporting both.
ExonicSpliceSite mutations are classified as Noncoding #136 — ExonicSpliceSite mutations were dropped by drop_silent_and_noncoding() even when their alternate_effect was a Substitution (e.g., p.E448K). The workaround treats ExonicSpliceSite as coding if it has a coding alternate, but the real problem is that the coding effect shouldn't be buried inside a wrapper in the first place.
WT and mutation AA incorrect for variant in codon split across exon boundary #236 — Wrong amino acid reported for a variant in a codon split across an exon boundary (p.W138R instead of p.C138S). The wrong transcript was selected as "top priority" because it had a longer CDS, but it was a non-canonical isoform with different splicing. This is a symptom of the single-effect model: when you force one winner, isoform selection heuristics can pick the wrong one.
Exception when deletion extends beyond exon boundary #224 — Deletion extending beyond an exon boundary (GTGAAGG where exon ends at GTGAAG) threw a ValueError. The reporter noted: "This seems like it would likely cause a splice variant, since both the exonic and intronic splice signal is potentially damaged." This is exactly the kind of variant that should produce multiple candidate effects (coding deletion + splice disruption).
The current approach and its problems
1. ExonicSpliceSite wraps a single alternate — then one gets discarded
When a variant falls at an exon boundary, effect_prediction.py creates an ExonicSpliceSite that stores the coding effect as alternate_effect (line 269 of effect_classes.py). But then:
mutant_protein_sequence ignores the splice disruption entirely and returns the alternate coding effect's protein sequence, with the comment: "TODO: determine when exonic splice variants cause exon skipping vs. translation of the underlying modified coding sequence. For now just pretending like there is no effect on splicing."
top_priority_effect() calls select_between_exonic_splice_site_and_alternate_effect(), which picks one of the two based on priority: either the splice effect or the coding effect, never both.
ExonicSpliceSite sits below Substitution/Insertion/Deletion in the priority list, so the coding effect often "wins" and the splice disruption is silently dropped.
This means: a missense SNV at the last position of an exon is reported as a substitution, with no indication that it also likely disrupts splicing. The real biology is that both outcomes may occur — some transcripts splice normally (producing the amino acid change), some mis-splice (producing exon skipping, intron retention, or cryptic splice site usage).
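To make the information loss concrete, here is a minimal sketch contrasting "pick one winner" with "keep every candidate." The priority ordering and both function names are illustrative assumptions, not varcode's actual priority list or API; only the relative ranking of ExonicSpliceSite below the coding effects is taken from the description above.

```python
# Illustrative ordering only, lowest to highest priority. This is NOT
# varcode's actual priority list; it just mirrors the relationships described
# above (ExonicSpliceSite ranks below the coding effects).
ILLUSTRATIVE_PRIORITY = ["ExonicSpliceSite", "Substitution", "PrematureStop"]


def single_winner(effect_names):
    """Mimic the current behavior: keep only the highest-priority effect."""
    return max(effect_names, key=ILLUSTRATIVE_PRIORITY.index)


def ranked_candidates(effect_names):
    """Hypothetical alternative: keep every effect, highest priority first."""
    return sorted(effect_names, key=ILLUSTRATIVE_PRIORITY.index, reverse=True)


effects = ["ExonicSpliceSite", "Substitution"]
print(single_winner(effects))      # the splice effect is no longer visible
print(ranked_candidates(effects))  # both effects survive, in priority order
```

The ranked list still lets callers ask for a single "top" effect (its first element) without discarding the rest.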
2. Intronic splice classification ignores sequence content
choose_intronic_effect_class() in effect_prediction.py classifies variants as SpliceDonor, SpliceAcceptor, or IntronicSpliceSite based solely on distance to the nearest exon (lines 230-278). It never examines:
What the reference splice signal actually is (maybe it's already non-canonical)
What the variant changes it to (a GT→GC donor mutation is very different from a GT→AT mutation in terms of residual splice activity)
Whether the mutation creates or strengthens a competing splice signal nearby
A mutation 2bp into an intron is always SpliceDonor whether it changes the canonical GT to something else, leaves it intact (e.g., a variant at +2 that doesn't touch the GT dinucleotide), or even restores a broken splice signal.
3. Exonic splice-site detection has a likely bug
In changes_exonic_splice_site() (effect_helpers.py, lines 149-160):
```python
end_of_reference_exon = transcript.sequence[exon_end_offset - 2:exon_end_offset + 1]
if matches_exon_end_pattern(end_of_reference_exon):
    end_of_variant_exon = end_of_reference_exon  # <-- no mutation applied!
    if matches_exon_end_pattern(end_of_variant_exon):
        return True
```
end_of_variant_exon is set to the reference sequence without applying the variant. The function then checks if the reference matches the splice pattern and, if so, checks the reference again — always returning True when the reference has a canonical splice signal, regardless of whether the variant disrupts it or not.
What this issue should accomplish
Multi-effect candidate model
Instead of forcing a single effect, splice-site variants should produce a set of candidate effects, representing the possible outcomes:
Normal splicing occurs → the coding effect (substitution, frameshift, etc.) is the outcome
Exon skipping → the affected exon is excluded from the transcript
Intron retention → the intron is included, likely causing a frameshift or premature stop
Cryptic splice site activation → a nearby alternative splice signal is used, producing a truncated or extended exon
Each candidate should ideally carry some indication of plausibility (even if initially just "canonical signal disrupted" vs. "canonical signal preserved").
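One possible shape for such a candidate set is sketched below. Every name here (Outcome, CandidateEffect, SpliceCandidates, build_candidates) is hypothetical and chosen for illustration; none of them exist in varcode today. The plausibility indication is reduced to the single boolean suggested above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Outcome(Enum):
    """The four possible outcomes listed above."""
    NORMAL_SPLICING = "normal splicing"
    EXON_SKIPPING = "exon skipping"
    INTRON_RETENTION = "intron retention"
    CRYPTIC_SPLICE_SITE = "cryptic splice site activation"


@dataclass
class CandidateEffect:
    outcome: Outcome
    description: str


@dataclass
class SpliceCandidates:
    # Coarse plausibility hint: was the canonical signal disrupted or preserved?
    canonical_signal_disrupted: bool
    candidates: List[CandidateEffect] = field(default_factory=list)


def build_candidates(coding_description, canonical_signal_disrupted):
    """Return all four candidate outcomes for one splice-site variant."""
    result = SpliceCandidates(canonical_signal_disrupted)
    # Normal splicing: the ordinary coding effect is what gets translated.
    result.candidates.append(
        CandidateEffect(Outcome.NORMAL_SPLICING, coding_description))
    # Mis-splicing outcomes are kept alongside it rather than discarded.
    for outcome in (Outcome.EXON_SKIPPING,
                    Outcome.INTRON_RETENTION,
                    Outcome.CRYPTIC_SPLICE_SITE):
        result.candidates.append(CandidateEffect(outcome, outcome.value))
    return result


candidate_set = build_candidates("p.E448K", canonical_signal_disrupted=True)
for candidate in candidate_set.candidates:
    print(candidate.outcome.name, "-", candidate.description)
```

Keeping the plausibility flag on the set rather than on each candidate is just the simplest starting point; per-candidate scores could replace it later.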
Sequence-aware splice prediction
The classification of splice effects should examine the actual nucleotide content:
Read the reference splice signal (donor: GT at +1/+2; acceptor: AG at -2/-1; exonic context: MAG|GU and YAG|R)
Determine whether the variant disrupts, preserves, or creates splice signals
Score how severely the consensus is broken (e.g., GT→GC is a known weak donor that sometimes still functions; GT→AA is complete loss)
This doesn't need to be a full splice prediction algorithm — MaxEntScan or similar could be integrated later. But even checking "does the variant change the canonical dinucleotide" would be a large improvement over pure distance-based classification.
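As a rough illustration of the "does the variant change the canonical dinucleotide" check, the sketch below classifies a donor-site change from the reference and mutated +1/+2 bases. The function names and the category labels are assumptions made for this example, and the categories are deliberately coarse; this is not a splice-strength model.

```python
def mutate_donor(ref_dinucleotide, intron_position, alt_base):
    """Apply an SNV at intronic position +1 or +2 to the donor dinucleotide."""
    if intron_position not in (1, 2):
        raise ValueError("only +1 and +2 fall inside the donor dinucleotide")
    bases = list(ref_dinucleotide.upper())
    bases[intron_position - 1] = alt_base.upper()
    return "".join(bases)


def classify_donor_change(ref_dinucleotide, alt_dinucleotide):
    """Coarsely classify how a variant changes the donor dinucleotide."""
    ref = ref_dinucleotide.upper()
    alt = alt_dinucleotide.upper()
    if ref == alt:
        return "preserved"            # variant does not touch the signal
    if alt == "GT":
        return "created"              # non-canonical reference becomes GT
    if ref != "GT":
        return "no_canonical_signal"  # neither allele has the canonical GT
    if alt == "GC":
        return "weakened"             # GC donors sometimes still function
    return "lost"                     # e.g. GT -> AA


print(classify_donor_change("GT", mutate_donor("GT", 2, "C")))
print(classify_donor_change("GT", "AA"))
```

An acceptor-side check (AG at -2/-1) could follow the same pattern, and a scoring method such as MaxEntScan could later replace the string labels.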
Fix the exonic splice site detection bug
changes_exonic_splice_site() should apply the variant to the exon-end sequence before checking whether the splice pattern is disrupted.
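A minimal sketch of what "apply the variant first" could look like is shown below, under several assumptions: it works on a plain string rather than a transcript object, it only handles same-length substitutions, the function name substitution_disrupts_exon_end is hypothetical, and matches_exon_end_pattern here is a simplified stand-in for the MAG check rather than varcode's own helper. As in the snippet quoted earlier, exon_end_offset is taken to be the index of the last exon base.

```python
def matches_exon_end_pattern(seq):
    """Stand-in check for the MAG exon-end consensus (M = A or C)."""
    return len(seq) == 3 and seq[0] in "AC" and seq[1] == "A" and seq[2] == "G"


def substitution_disrupts_exon_end(sequence, exon_end_offset,
                                   variant_offset, ref, alt):
    """Return True if a substitution breaks a reference MAG at the exon end."""
    if len(ref) != len(alt):
        raise ValueError("this sketch only handles same-length substitutions")
    if sequence[variant_offset:variant_offset + len(ref)] != ref:
        raise ValueError("ref allele does not match the sequence")

    window_start = exon_end_offset - 2
    end_of_reference_exon = sequence[window_start:exon_end_offset + 1]
    if not matches_exon_end_pattern(end_of_reference_exon):
        # No canonical signal in the reference, so nothing to disrupt.
        return False

    # Apply the variant BEFORE re-checking the pattern.
    mutated = (sequence[:variant_offset] + alt +
               sequence[variant_offset + len(ref):])
    end_of_variant_exon = mutated[window_start:exon_end_offset + 1]
    return not matches_exon_end_pattern(end_of_variant_exon)


# Exon ends at index 8, so the last three exon bases are "CAG".
seq = "ATGTTTCAGGCT"
print(substitution_disrupts_exon_end(seq, 8, 8, "G", "T"))  # CAG -> CAT
print(substitution_disrupts_exon_end(seq, 8, 6, "C", "A"))  # CAG -> AAG
```

Indels would need extra care because they shift the exon-end offset, which is one reason this sketch restricts itself to substitutions.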
Tests that should exist but don't
The existing test suite for splice behavior is extremely thin — test_exonic_splice_site.py has a single test, and test_effect_classes.py has one ExonicSpliceSite test plus a TODO comment (# TODO: SpliceDonor, SpliceReceptor). There are zero tests for SpliceDonor, SpliceAcceptor, or IntronicSpliceSite effects.
Tests that would have caught the known bugs:
SNV at exon boundary that is both a coding change and splice disruption: assert that both effects are accessible, not just the priority winner. (Would have prevented #233 and #136 from being a surprise.)
SNV at exon boundary that does NOT disrupt the splice signal: e.g., a synonymous change at the -3 position (MAG) that preserves the M base. Should not be classified as ExonicSpliceSite. (Would have caught the changes_exonic_splice_site bug where the reference is checked against itself.)
SNV at exon boundary that DOES disrupt the splice signal: e.g., changing the G in MAG to T. Verify that the splice disruption is detected. (Would have caught the same bug from the other direction.)
Intronic variant at +1/+2 that changes the canonical GT donor: verify SpliceDonor is returned AND that the mutated sequence is noted. (Would have exposed the lack of sequence content checking.)
Intronic variant at +1/+2 that does NOT change the canonical GT: e.g., a variant at +3 that happens to be 2bp from the exon due to coordinate math. Should not be SpliceDonor if the GT is intact. (Would have caught pure-distance classification errors.)
Intronic variant at -1/-2 that changes the canonical AG acceptor: verify SpliceAcceptor and note the sequence change.
Deletion spanning exon-intron boundary: should produce both a coding effect and a splice effect, not crash (#224) or silently pick one.
drop_silent_and_noncoding() on ExonicSpliceSite with coding alternate: should retain the effect (#136).
top_priority_effect() on ExonicSpliceSite with PrematureStop alternate: should return the PrematureStop (#233). (This test exists now in test_exonic_splice_site.py but was added after the bug was found.)
Variant in codon split across exon boundary: verify correct amino acids are reported regardless of which transcript is chosen (#236).
Relationship to other issues
This is the proving ground for the multi-effect model described in #259. The design patterns established here (how candidate effects are represented, stored, prioritized, and optionally resolved by RNA evidence) should generalize to the harder SV cases. If the model works for "SNV at exon boundary produces both a coding change and a splice disruption," it should scale to "translocation produces three possible fusion isoforms."