Skip to content

Drop WT leak from "_wt" rows in parse_stripped_hgvs#34

Merged
odcambc merged 1 commit into
mainfrom
fix-rosace-wt-leak
May 25, 2026
Merged

Drop WT leak from "_wt" rows in parse_stripped_hgvs#34
odcambc merged 1 commit into
mainfrom
fix-rosace-wt-leak

Conversation

@odcambc
Copy link
Copy Markdown
Owner

@odcambc odcambc commented May 25, 2026

run_rosace.R:62 (and the mirrored helper at tests/r/testthat/ helper-functions.R:28) ran:

WT <- substr(hgvs_string, 1, 1)

ABOVE the "wt" short-circuit. For wildtype rows this set WT to "" before the function returned, leaving variant metadata with WT = "_" (the underscore from the sentinel string, not an amino acid). finalize_variants_in_rosace then propagated that into rosace_obj@var.data$wildtype, where it sat alongside real amino acid codes like "A", "M", etc.

Every downstream regex branch (synonymous, missense, nonsense, deletion, insertion) sets WT from its own str_match capture group, so the pre-branch substr() was a redundant initial guess that got overwritten in every successful parse. Removing it makes _wt return with WT = "" (matches the function's default initializer above), which finalize_variants_in_rosace happily passes through.

Side effect for malformed inputs that match no regex branch: WT changes from substr(hgvs, 1, 1) to "". The malformed-return case already has variant="", pos=-1, len=-1, mutation_type="" — adding WT="" to that list is more honest about "we couldn't parse this" than echoing back a stray first character.

Tighten the existing _wt test to assert WT = "" explicitly. The old test asserted variant/pos/mutation_type but skipped WT, which is how this lived for so long.

run_rosace.R:62 (and the mirrored helper at tests/r/testthat/
helper-functions.R:28) ran:

  WT <- substr(hgvs_string, 1, 1)

ABOVE the "_wt" short-circuit. For wildtype rows this set WT to "_"
before the function returned, leaving variant metadata with
WT = "_" (the underscore from the sentinel string, not an amino acid).
finalize_variants_in_rosace then propagated that into
rosace_obj@var.data$wildtype, where it sat alongside real amino acid
codes like "A", "M", etc.

Every downstream regex branch (synonymous, missense, nonsense,
deletion, insertion) sets WT from its own str_match capture group,
so the pre-branch substr() was a redundant initial guess that got
overwritten in every successful parse. Removing it makes _wt return
with WT = "" (matches the function's default initializer above),
which finalize_variants_in_rosace happily passes through.

Side effect for malformed inputs that match no regex branch: WT
changes from substr(hgvs, 1, 1) to "". The malformed-return case
already has variant="", pos=-1, len=-1, mutation_type="" — adding
WT="" to that list is more honest about "we couldn't parse this"
than echoing back a stray first character.

Tighten the existing _wt test to assert WT = "" explicitly. The old
test asserted variant/pos/mutation_type but skipped WT, which is
how this lived for so long.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@odcambc odcambc merged commit 3b60f8d into main May 25, 2026
@odcambc odcambc deleted the fix-rosace-wt-leak branch May 25, 2026 17:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant