Fix off-by-one in predict_variant_effect ref-allele check#32
Merged
lucapinello merged 1 commit intochorus-applicationsfrom Apr 21, 2026
Merged
Conversation
The ref-allele sanity check at chorus/core/base.py:322 was reading the
base one position to the right of what the user asked for, so
predict_variant_effect(
genomic_region="chr1:109274000-109275000",
variant_position="chr1:109274968",
alleles=["G", "T"], # rs12740374, dbSNP/UCSC
...)
triggered
WARNING Provided reference allele 'G' does not match the genome at
this position ('T'). Chorus will use the provided allele.
even though the user-supplied ref allele is exactly what dbSNP/UCSC/IGV
say is at that position. The substitution at line 335 then placed the
ref/alt base one position off the user's intended coordinate, so every
shipped walkthrough was computing predictions at pos+1 rather than pos.
The regulatory signal happens to be coherent across ±1 bp on rs12740374
et al., so effect *directions* still reproduce — but the check was
broken and the warning was misleading.
Root cause: `genomic_region`/`variant_position` strings follow dbSNP/
UCSC/IGV and `chorus.utils.extract_sequence` — 1-based inclusive — but
`GenomeRef` uses 0-based half-open (documented at interval.py:26).
predict_variant_effect was handing the 1-based values straight to
GenomeRef and to `ref2query`, conflating the two conventions.
Fix: convert `region_start` and `var_pos` to 0-based at the point we
build the Interval. Verified against 7 famous SNPs (rs12740374,
rs1421085, rs1427407, rs4988235, rs334, rs6025, rs7412) — internal
ref-check now agrees with extract_sequence for all of them and no
warning fires when the user supplies the correct dbSNP ref allele.
Also includes a regression test (tests/test_prediction_methods.py::
test_variant_position_is_1_based) that builds a synthetic genome with a
single 'G' anchor at 1-based position 100,000 and asserts:
- no warning when user passes ref='G'
- warning DOES fire when user passes ref='T' (proves the check still
catches real mismatches — we haven't just silenced it)
Scope: this PR only fixes predict_variant_effect. predict_region_replacement
and predict_region_insertion_at treat their coordinates as 0-based half-
open internally; changing that would silently shift existing users'
sequence replacements by 1 bp, so it's deferred to a separate PR with
full example-output regeneration.
Consequence for shipped walkthroughs: variant effect predictions will
now substitute the ref/alt base at the correct 1-based position. Effect
sizes may drift by a small amount (1-bp shift relative to existing
committed numbers); regenerate example_output.* as follow-up.
Tests: 334 passed / 1 skipped (fast suite, 8m 34s), up from 333.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Apr 21, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The ref-allele sanity check at
chorus/core/base.py:322was reading the base one position to the right of what the user asked for. Sotriggered
even though
Gat chr1:109274968 is exactly what dbSNP/UCSC/IGV/extract_sequence('chr1:109274968-109274968')all say. The substitution at line 335 then placed the ref/alt base at pos+1, so every shipped walkthrough was computing predictions at the wrong coordinate. The regulatory signal happens to be coherent across ±1 bp on rs12740374, rs1421085, rs1427407 etc., so effect directions still reproduced — which is why this has been shipping.Root cause
genomic_region/variant_positionstrings follow dbSNP/UCSC/IGV andchorus.utils.extract_sequence— 1-based inclusive.GenomeRefuses 0-based half-open (documented atinterval.py:26).predict_variant_effecthanded the 1-based values straight toGenomeRefand toref2query, conflating the two conventions.Fix
Convert
region_startandvar_posto 0-based at the point we build theInterval. That's it — just two-1s with an explanatory comment.Verification
Tested against 7 famous SNPs — for every one, the internal ref-check now agrees with
extract_sequenceand dbSNP:Also works for the MCP
_auto_regioncase (a 2 bp region exactly spanning the variant) — the old code silently read pos+1; the fix correctly lands on pos.New regression test
tests/test_prediction_methods.py::test_variant_position_is_1_basedbuilds a synthetic genome with a singleGanchor at 1-based position 100,000 and asserts:ref='G'ref='T'(proves the check still catches real mismatches — we haven't just silenced it)Scope
Only
predict_variant_effectis touched.predict_region_replacementandpredict_region_insertion_atalso treat their string coordinates as 0-based half-open internally; changing those would silently shift existing sequence-engineering results by 1 bp, so that's deferred to a separate PR with example-output regeneration.Downstream consequence
Variant effect predictions will now substitute the ref/alt base at the correct 1-based position. Effect sizes in the shipped walkthroughs may drift by a small amount relative to the currently committed
example_output.*files (±1 bp shift in where the substitution lands). Recommend regenerating those in a follow-up once this merges — the effect direction and magnitudes should stay within CPU non-determinism for AlphaGenome, and will likely shift more noticeably for ChromBPNet (1 bp native resolution) where a 1-bp shift lands on a different bin.Test plan
pytest tests/ --ignore=tests/test_smoke_predict.py -q→ 334 passed / 1 skipped (up from 333 — the new test)🤖 Generated with Claude Code