Skip to content

Fix off-by-one in predict_variant_effect ref-allele check#32

Merged
lucapinello merged 1 commit intochorus-applicationsfrom
fix/2026-04-21-ref-allele-off-by-one
Apr 21, 2026
Merged

Fix off-by-one in predict_variant_effect ref-allele check#32
lucapinello merged 1 commit intochorus-applicationsfrom
fix/2026-04-21-ref-allele-off-by-one

Conversation

@lucapinello
Copy link
Copy Markdown
Contributor

Problem

The ref-allele sanity check at chorus/core/base.py:322 was reading the base one position to the right of what the user asked for. So

predict_variant_effect(
    genomic_region="chr1:109274000-109275000",
    variant_position="chr1:109274968",
    alleles=["G", "T"],       # rs12740374, as in dbSNP/UCSC
    ...
)

triggered

WARNING  Provided reference allele 'G' does not match the genome at this
         position ('T'). Chorus will use the provided allele.

even though G at chr1:109274968 is exactly what dbSNP/UCSC/IGV/extract_sequence('chr1:109274968-109274968') all say. The substitution at line 335 then placed the ref/alt base at pos+1, so every shipped walkthrough was computing predictions at the wrong coordinate. The regulatory signal happens to be coherent across ±1 bp on rs12740374, rs1421085, rs1427407 etc., so effect directions still reproduced — which is why this has been shipping.

Root cause

  • genomic_region / variant_position strings follow dbSNP/UCSC/IGV and chorus.utils.extract_sequence1-based inclusive.
  • GenomeRef uses 0-based half-open (documented at interval.py:26).
  • predict_variant_effect handed the 1-based values straight to GenomeRef and to ref2query, conflating the two conventions.

Fix

Convert region_start and var_pos to 0-based at the point we build the Interval. That's it — just two -1s with an explanatory comment.

Verification

Tested against 7 famous SNPs — for every one, the internal ref-check now agrees with extract_sequence and dbSNP:

rsID Gene hg38 dbSNP ref internal (before) internal (after)
rs12740374 SORT1 chr1:109274968 G T G ✓
rs1421085 FTO chr16:53767042 T A T ✓
rs1427407 BCL11A chr2:60718347 G C G ✓
rs4988235 LCT chr2:135851076 G (+ strand) C G ✓
rs334 HBB chr11:5227002 T (+ strand) C T ✓
rs6025 F5 Leiden chr1:169519049 T (+ strand) C T ✓
rs7412 APOE chr19:44908822 C G C ✓

Also works for the MCP _auto_region case (a 2 bp region exactly spanning the variant) — the old code silently read pos+1; the fix correctly lands on pos.

New regression test

tests/test_prediction_methods.py::test_variant_position_is_1_based builds a synthetic genome with a single G anchor at 1-based position 100,000 and asserts:

  • no warning when the user passes ref='G'
  • warning does fire when the user passes ref='T' (proves the check still catches real mismatches — we haven't just silenced it)

Scope

Only predict_variant_effect is touched. predict_region_replacement and predict_region_insertion_at also treat their string coordinates as 0-based half-open internally; changing those would silently shift existing sequence-engineering results by 1 bp, so that's deferred to a separate PR with example-output regeneration.

Downstream consequence

Variant effect predictions will now substitute the ref/alt base at the correct 1-based position. Effect sizes in the shipped walkthroughs may drift by a small amount relative to the currently committed example_output.* files (±1 bp shift in where the substitution lands). Recommend regenerating those in a follow-up once this merges — the effect direction and magnitudes should stay within CPU non-determinism for AlphaGenome, and will likely shift more noticeably for ChromBPNet (1 bp native resolution) where a 1-bp shift lands on a different bin.

Test plan

  • pytest tests/ --ignore=tests/test_smoke_predict.py -q334 passed / 1 skipped (up from 333 — the new test)
  • Manual check against 7 famous SNPs — all consistent
  • Regression test validates both the fixed path and the still-working mismatch path

🤖 Generated with Claude Code

The ref-allele sanity check at chorus/core/base.py:322 was reading the
base one position to the right of what the user asked for, so

  predict_variant_effect(
    genomic_region="chr1:109274000-109275000",
    variant_position="chr1:109274968",
    alleles=["G", "T"],       # rs12740374, dbSNP/UCSC
    ...)

triggered

  WARNING  Provided reference allele 'G' does not match the genome at
           this position ('T'). Chorus will use the provided allele.

even though the user-supplied ref allele is exactly what dbSNP/UCSC/IGV
say is at that position. The substitution at line 335 then placed the
ref/alt base one position off the user's intended coordinate, so every
shipped walkthrough was computing predictions at pos+1 rather than pos.
The regulatory signal happens to be coherent across ±1 bp on rs12740374
et al., so effect *directions* still reproduce — but the check was
broken and the warning was misleading.

Root cause: `genomic_region`/`variant_position` strings follow dbSNP/
UCSC/IGV and `chorus.utils.extract_sequence` — 1-based inclusive — but
`GenomeRef` uses 0-based half-open (documented at interval.py:26).
predict_variant_effect was handing the 1-based values straight to
GenomeRef and to `ref2query`, conflating the two conventions.

Fix: convert `region_start` and `var_pos` to 0-based at the point we
build the Interval. Verified against 7 famous SNPs (rs12740374,
rs1421085, rs1427407, rs4988235, rs334, rs6025, rs7412) — internal
ref-check now agrees with extract_sequence for all of them and no
warning fires when the user supplies the correct dbSNP ref allele.

Also includes a regression test (tests/test_prediction_methods.py::
test_variant_position_is_1_based) that builds a synthetic genome with a
single 'G' anchor at 1-based position 100,000 and asserts:
- no warning when user passes ref='G'
- warning DOES fire when user passes ref='T' (proves the check still
  catches real mismatches — we haven't just silenced it)

Scope: this PR only fixes predict_variant_effect. predict_region_replacement
and predict_region_insertion_at treat their coordinates as 0-based half-
open internally; changing that would silently shift existing users'
sequence replacements by 1 bp, so it's deferred to a separate PR with
full example-output regeneration.

Consequence for shipped walkthroughs: variant effect predictions will
now substitute the ref/alt base at the correct 1-based position. Effect
sizes may drift by a small amount (1-bp shift relative to existing
committed numbers); regenerate example_output.* as follow-up.

Tests: 334 passed / 1 skipped (fast suite, 8m 34s), up from 333.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lucapinello lucapinello merged commit b6aed3d into chorus-applications Apr 21, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant