Add pVACseq report loader (closes #94)#167
Conversation
`read_pvacseq(path)` parses both pVACseq output flavors into Topiary long form:
- `*.all_epitopes.aggregated.tsv` — one row per variant; Best Peptide x
Allele chosen by Median IC50.
- `*.all_epitopes.tsv` — one row per peptide x allele x length candidate.
Median MT IC50 / percentile populate `value` / `percentile_rank` with
`prediction_method_name="pvacseq"`; WT IC50 / percentile populate the
`wt_*` schema so existing DSL expressions (`Affinity.value`,
`wt.Affinity.value`) work without changes.
For aggregated rows the WT peptide sequence is reconstructed from
`Best Peptide` + `Pos` + `AA Change` for unambiguous missense (292 of
317 rows on HCC1395; non-missense and flanking rows stay NaN, where
the unaggregated `all_epitopes.tsv` carries `WT Epitope Seq` directly).
Per-algorithm columns in the all_epitopes flavor pass through as
`pvacseq_<algo>_{ic50,pct}_{mt,wt}` annotation columns, accessible via
`Column("...")`.
Multiple files (MHC-I + MHC-II, or a mix of flavors) compose through
`topiary.concat([read_pvacseq(p1), read_pvacseq(p2)])`.
Fixtures: 10-row slices of the HCC1395 MHC-I and MHC-II aggregated
files from the cAIrn WashU protocol, plus a hand-crafted all_epitopes
TSV covering 2 variants x 2 alleles x 2 lengths.
Review feedback from PR #167: - Replace `df.get(col, [])` fallbacks (which silently produce length mismatches if a column is missing) with explicit `if col in df.columns` guards. - Trim `na_values=["NA", "X"]` to just `["X"]`: "NA" is pandas' default; "X" is pVACseq's "this algorithm didn't score" sentinel. - Use `pd.NA` for `predictor_version` / `wt_predictor_version` instead of `None` so the column dtype is consistent. - Fix stale comment in `_finalize` that referenced a fallback branch the code never takes. - Document the per-algorithm-column DSL-invisibility tradeoff in the module docstring. New tests: - `_reconstruct_wt_peptide` happy path + position-out-of-range + pos/AA Change disagreement + non-missense (FS, multi-AA) + non-string inputs (5 cases). - `tag=` kwarg overrides source label. - Round-trip a loaded pVACseq result through `to_tsv` / `read_tsv` and confirm columns and numeric values survive.
… tests Second-pass review fixes: Docs: - Module docstring + CHANGELOG cited `wt(Affinity.value)`, but `wt` is a Scope object, not callable. Correct to `wt.Affinity.value`. Parser symmetry: - Replace `_parse_all_epitopes`'s chained `df.get(a, df.get(b))` calls with a `_first_present_column` helper so both flavor parsers use the same "guard then read" pattern. - `_finalize` no longer mutates its input — adds a defensive `.copy()` at entry rather than relying on callers discarding the original. Effect classifier: - Tighten frameshift recognition from `"fs" in s.lower()` (broad substring match) to `^FS\d` (pVACseq's actual format). Adds explicit test that a missense like "F206S" does not get misclassified. Tests: - Round-trip assertion strengthens to set-equality (minus the `source` provenance column read_tsv adds) and checks peptide/allele/wt_peptide plus the four numeric score columns. - New chr-coord variant fallback test exercises `_parse_all_epitopes`'s Index-absent branch (previously unreached by fixtures). - New `TestClassifyEffect` covers missense / frameshift / multi-residue Substitution / variant-type-wins / unknown-returns-None, including the F206S false-positive regression. 40 tests pass; 568 across pvacseq/io/io_lens/dsl/ranking/wide clean.
…wt length Final-pass cleanups: - Remove unused module-level `logging.getLogger(__name__)` and the `import logging` it required. - Restore the rationale comment on `peptide_offset = 0` (pVACseq doesn't ship source-protein offset; matches LENS loader convention). - Restructure `_finalize` to allocate the augmented frame once via a single `DataFrame.assign(**...)` call instead of mutating a defensive copy in place and copying again on the way out. Extracts a `_build_source_sequence_name` helper for the gene/variant fallback. - `wt_peptide_length` is now nullable `Int64` (was `object` with mixed int / `pd.NA`), tightening semantics and easing downstream comparisons. - `_parse_all_epitopes` uses `pd.NA` instead of the scalar `None` when `WT Epitope Seq` / `Variant Type` are absent — round-trips cleaner through pandas' nullable type system. Tests: - New `mhc_ii_all_epitopes.tsv` fixture (3 rows, DRB1 + DPA1-DPB1 heterodimer alleles, NetMHCIIpan per-algorithm columns). - `test_mhc_ii_all_epitopes_loads` asserts class II all_epitopes parses end-to-end, heterodimer alleles round-trip through mhcgnomes, and the per-algorithm passthrough handles the class-II algorithm name. - `test_wt_peptide_length_is_nullable_int` pins the new Int64 dtype. 42 tests pass; 570 across pvacseq/io/io_lens/dsl/ranking/wide clean.
Drop the _AE abbreviation in favor of _ALL across: - module constants: _AE_SIGNATURE/_AE_MEDIAN/_AE_BEST/_AE_ANNOTATIONS - test fixture handles: MHC_I_AE/MHC_II_AE The file-flavor identifier "all_epitopes" itself is unchanged — that's pVACseq's actual format name.
Derived columns matching TopiaryPredictor output so downstream
consumers (vaxrank in particular) don't need to special-case loader
source:
- `mhc_class` ("I" / "II") — derived from the allele string. Lets
concat-ed MHC-I + MHC-II results be split or filtered by class
without re-parsing alleles.
- `contains_mutant_residues` (nullable boolean) — true iff pVACseq's
reported mutation position falls inside the candidate peptide.
False for flanking-only peptides where pVACseq scored a 9-mer next
to the mutation but the mutation itself is outside the window
(28 of 634 HCC1395 rows).
- `mutation_start_in_peptide` / `mutation_end_in_peptide` (Int64,
0-based half-open) — derived from pVACseq's 1-based Pos /
Mutation Position. Single-residue semantics; multi-residue
mutations collapse to a representative position.
- `source` (per-row provenance label, matching topiary.read_tsv) — keeps
multi-file concats distinguishable without rooting through Metadata.
Review nits also closed:
- `_build_source_sequence_name` now raises an informative KeyError
rather than tripping a misleading `KeyError: 'gene'` if both
source columns are missing (unreachable from real pVACseq input,
but defensive).
- Round-trip test uses `Series.equals` for NaN-aware equality
instead of the `.fillna(-1)` sentinel trick.
- Added `test_predictor_version_is_na` so concat-warning regressions
don't sneak back in.
- Loosened `test_mhc_ii_alleles_get_hla_prefix` to `HLA-D` so it
survives a future re-slice of the fixture to include heterodimers.
Verified against real HCC1395:
- 634 combined rows; 2 distinct `source` labels; 317 class I + 317
class II.
- 606 contains_mutant_residues=True, 28 False (flanking-only).
- 584 WT peptides reconstructed (subset of contains_mutant_residues
that are unambiguous missense).
50 tests pass; 578 across pvacseq/io/io_lens/dsl/ranking/wide clean.
Three add-ons that close the remaining gap between read_pvacseq's
output and what vaxrank's downstream code expects from a TopiaryPredictor
run, plus integration tests that lock the contract from topiary's side.
`Metadata.extra["kind_support"]` is populated with the same shape as
`TopiaryPredictor.kind_support` (model_key -> {kind -> {mhc_dependence,
mhc_class}}). pVACseq always scores per allele, so mhc_dependence is
fixed at "single_allele"; mhc_class reflects what's in the file
("I" / "II" / "both" / "none"). Vaxrank's evaluate_scores can pass
this straight through to apply_filter / EvalContext without building
a parallel metadata dict.
`melt_pvacseq_algorithms(result)` expands the all_epitopes flavor's
per-algorithm columns into separate prediction_method_name=<algo>
rows so the DSL's `Affinity["mhcflurry"].value` /
`Affinity["netmhcpan"].score` selectors reach individual algorithms
natively. Median rows (prediction_method_name="pvacseq") are
preserved; melt is a no-op on aggregated input. The function
extends kind_support to register each melted algorithm under the
same mhc_class as the parent pvacseq entry.
Integration tests:
- TestKindSupport — pins the file-level mhc_class summary
("I" / "II" / "both") and the kind_support dict shape.
- TestMeltAlgorithms — verifies aggregated melt is a no-op, all_epitopes
expands to N+1 rows per group, per-algorithm values lift from
passthrough columns correctly, kind_support extends, and the DSL
`Affinity["mhcflurry"]` selector finds melted rows.
- TestVaxrankComposition — bridge test pinning the realistic vaxrank
filter shape (pandas masking for categorical clauses on mhc_class /
contains_mutant_residues, topiary DSL for the numeric Affinity.value
clause). Also verifies exclude_by(df, ref_sequences) composes with
loaded pVACseq rows.
`read_pvacseq_fragments` removed from `docs/fragments.md`'s queued list:
pVACseq output is already at peptide x allele granularity, so the
fragment-window abstraction doesn't apply. Re-prediction composes via
`predict_from_named_peptides`, not via ProteinFragment.
61 pvacseq tests pass; 589 across pvacseq/io/io_lens/dsl/ranking/wide.
Adds an `IsIn` DSL node plus `Column.eq(value)` / `Column.ne(value)` /
`Column.isin(values)` methods so non-numeric columns (mhc_class,
source, gene, ...) are filterable natively in the DSL without
pre-masking via pandas. The string parser additionally accepts
string literals on the right-hand side of `==` and `!=` so
`parse('mhc_class == "I"')` works end-to-end.
`DSLNode.__eq__` stays unoverridden (nodes remain hashable for use in
sets/dicts) — the explicit `.eq()` / `.ne()` / `.isin()` methods plus
the parser path are the supported entry points. `IsIn` reads its
column raw, bypassing the float cast in `Column.eval`, so any dtype
works.
Why:
- pVACseq loader (5.16.0 first commit) emits a per-row `mhc_class`
column ("I" / "II") so concat-ed multi-class results can be split.
Filtering by class previously required pre-masking via pandas
because the DSL couldn't compare strings; vaxrank-shape filters
that combine numeric + categorical clauses needed two paths.
- Same path now supports filtering by `source` (provenance) and any
other string column added by future loaders.
Implementation:
- New `IsIn(col_name, values, negate=False)` node in `ranking/nodes.py`.
Constructor accepts scalar or iterable values; reads the column
raw (no float cast); evaluates to a boolean Series indexed by
`group_index`. `__invert__` flips `negate`.
- `Column.eq` / `Column.ne` / `Column.isin` produce `IsIn` instances.
- Parser's `_cmp` detects a STRING token on the RHS of `==` / `!=`
and produces `IsIn` directly; comparison with `<` / `<=` / `>` /
`>=` against a string raises with a clear error.
- `apply._collect_column_names` recognizes `IsIn` so column-existence
validation runs the same way as for `Column`.
- Exports: `IsIn` surfaces through `topiary.ranking` and the top-level
`topiary` package.
Tests (138 across DSL + pvacseq suites, 696 across all relevant io /
dsl / ranking / wide suites; no regressions):
- `TestIsIn` in `tests/test_dsl_extensions.py` — happy paths,
invert, compound with affinity, non-numeric column survives,
parser emits IsIn for string RHS, parser rejects `< "..."`,
parser rejects string-LHS, validation collects IsIn columns.
- `TestVaxrankComposition` in `tests/test_io_pvacseq.py` rewritten
to use the native DSL form (`Column("mhc_class").eq("I")`),
and a new parser-string variant.
- `topiary.class_i` and `topiary.class_ii` — pre-built `IsIn` nodes
referencing the `mhc_class` column. Compose with any other DSL node:
`apply_filter(df, class_i & (Affinity.value <= 500))`.
- `topiary.derive_mhc_class(allele_series)` — public utility for
stamping `mhc_class` onto a DataFrame produced outside the pvacseq
loader (e.g. a fresh `TopiaryPredictor` result that carries class
only at the model-level `kind_support`). Returns a Series of
`"I"` / `"II"` / `pd.NA`.
`docs/ranking.md` — `Column()` section gets an "Equality / membership
on any dtype" subsection covering:
- `Column("x").eq(value)` / `.ne(value)` / `.isin(values)` and why
`Column("x") == "y"` doesn't compose (`__eq__` deliberately not
overridden — nodes stay hashable).
- The `parse('mhc_class == "I"')` string-DSL form and the
`<` / `<=` / `>` / `>=` rejection on string operands.
- The `class_i` / `class_ii` shortcuts and the `derive_mhc_class`
utility for non-pvacseq DataFrames.
Also tightens the "Only numeric columns" wording — it's still true
for arithmetic and ordered comparisons, but no longer for equality /
membership.
Tests: 4 new `TestIsIn` cases for the shortcuts and their composition
with numeric clauses; `test_derive_mhc_class_public_utility` covers
the new package-level export across class I / class II / heterodimer /
unknown / None inputs.
143 dsl + pvacseq tests pass; 701 across pvacseq / io / io_lens / dsl /
ranking / wide.
Five tactical cleanups from the latest review of PR #167: - `IsIn.__init__` had `isinstance(values, float) and math.isnan(values)` as a fallback scalar-detection arm. `float` is already in the preceding `isinstance` tuple, so any NaN short-circuits there. Drop the unreachable arm; add a one-line comment noting NaN's float membership. - `IsIn.__repr__` was a three-level nested ternary. Rewrite as a straight `if len(values) == 1` / else split — three lines shorter and reads in one pass. - `_derive_mhc_class = derive_mhc_class` was a "backwards-compatible alias" with exactly one internal caller and no external exposure. Update the caller to the public name and drop the alias. - `test_invert_negates` only asserted "I" is absent from the result; strengthen to assert positive retention (DDD + EEE survive) and the NaN-survives-`~eq` pandas-vs-SQL semantic. - `test_parser_string_lhs_raises` was triggering the unrelated "Unexpected token" path in `_atom`, not the parser's explicit "column reference on the LHS" guard. Add a separate test using `affinity.value == "I"` that does parse the LHS and gets rejected at the IsIn-construction site. Docs: `ranking.md` Equality / membership subsection gains a NaN semantics footnote — pandas-style (NaN → False for `.eq` / `.isin`, True for `.ne` / `~.eq`), not SQL `IS NOT NULL`. 144 dsl + pvacseq tests pass; 702 across pvacseq / io / io_lens / dsl / ranking / wide.
Three fixes from the review nudge after the prior "clean stopping
point" claim — all real, not nits.
1. `IsIn.eval` empty-DataFrame branch built a Series from a
zero-length boolean array indexed by `ctx.group_index`. Crashes
if the context happened to carry a non-empty index alongside an
empty df ("Length of values (0) does not match length of
index (N)"). Unreachable on current callers but fragile, and
inconsistent with `Column.eval`'s empty path. Switch to
`ctx.empty_series().astype("boolean")` for index alignment and
pattern parity.
2. `_collect_column_names` walker grew an `isinstance + elif` chain
when `IsIn` landed. The docstring promised "implement
child_nodes() and you're done" — duck-type the leaf check via
`getattr(n, "col_name", None)` so future column-bearing leaves
work without touching the walker. Drops the now-unused `Column`
and `IsIn` imports from `apply.py`.
3. `docs/api.md` was missing four real public exports: `IsIn`,
`class_i`, `class_ii`, `derive_mhc_class`. Add `IsIn` to the
DSL leaves table; new "Categorical equality and membership"
subsection under Column covering the methods, the string-DSL
`==` / `!=` form, and NaN semantics; pre-built shortcuts table
for `class_i` / `class_ii`; row for `derive_mhc_class` in the
input-functions table.
Tests: new `test_empty_dataframe` exercising IsIn against an
empty input (regression guard); 145 dsl + pvacseq tests pass; 705
across pvacseq / io / io_lens / dsl / ranking / wide / dataframe.
New `docs/pvacseq.md` (~260 lines) covers the full read_pvacseq surface end-to-end: - What pVACseq is, why import its output (re-score / compare / re-rank / combine MHC-I + MHC-II). - The two TSV flavors (aggregated vs all_epitopes) — what differs, how the loader auto-detects. - Schema mapping table: pVACseq column → topiary column for both flavors. - Derived columns (mhc_class, contains_mutant_residues, mutation_start/end_in_peptide, source) — what they are and why vaxrank cares. - Annotation passthroughs naming. - Worked examples: single-class load, MHC-I + MHC-II concat, filter by class with shortcuts and the parsed-string DSL form. - Per-algorithm scores: the pvacseq_<algo>_<field>_<mtwt> passthrough pattern + when to reach for melt_pvacseq_algorithms. - WT peptide reconstruction (the missense-only path, with the HCC1395 292/317 stat). - Re-scoring with TopiaryPredictor.predict_from_named_peptides via the derived peptide dict. - derive_mhc_class for fresh prediction frames without an mhc_class column. - Metadata stamping — pvacseq_format, kind_support shape. - Caveats: metrics.json not used, flanking peptides, kind collapsing, no CLI. Wired into mkdocs.yml nav (between Cached Predictions and Cross-reactivity) and mentioned in docs/index.md's "Multiple input modes" feature bullet.
|
Downstream signal: vaxrank intends to adopt this — filed openvax/vaxrank#297 to track the migration. The plan is to replace vaxrank's own
I'm using the same HCC1395 sample (from cAIrn/JLF reviewer protocol) as my downstream smoke, so if you want a real end-to-end test case for any edge-case rows during review, happy to share. One question for the API: does No blocking feedback — happy to wait for this to land + release, then drive the vaxrank-side adoption. |
Three issues found in the last review pass of `docs/pvacseq.md`: 1. The re-scoring section's `concat([r, fresh_result])` example referenced an undefined variable (`fresh_result` vs `fresh` from the assignment above), and even with the name fixed `concat` doesn't accept bare DataFrames — `TopiaryPredictor.predict_*` returns a DataFrame, not a TopiaryResult. Rather than widen `concat` (scope creep beyond the loader's needs) or wrap the DataFrame artificially, drop the downstream-composition example entirely. The re-scoring section shows how to fresh-predict; users compose with pandas as needed. 2. `"class lives in `Metadata.kind_support`"` — `Metadata` (the comment-block dataclass) doesn't have a `kind_support` field. The correct location is `TopiaryPredictor.kind_support`. 3. The schema-mapping table's `(or Best MT IC50 Score)` parenthetical read as "either column is acceptable." Clarified to tie the fallback to pVACseq's `--top-score-metric=Best` configuration; same for the WT percentile column (Median vs Corresponding). Also tightens the peptide-dict construction to a dict-comprehension that makes the intent explicit instead of a dict(zip(...)) expression that collapses duplicates as a side effect. Two follow-up issues filed for the design surface that came out of the review but doesn't belong in this PR: - #168 — Add `allele_set` column to faithfully store haplotype-mode predictions (schema layer) - #169 — DSL: explicit per-`mhc_dependence` peptide-level projection (query layer) Together those two unblock cross-source / cross-kind prediction comparison cleanly; without them, validator / combine-predictions helpers can only approximate. Out of scope for the pVACseq loader.
Two surface-quality fixes from another review pass:
- `detect_pvacseq_format` had a 5-word docstring for a public export
("Return 'aggregated', 'all_epitopes', or None.") — fine as a one-liner
for an internal helper, but this one is at the package root. Replaced
with a proper Numpy-style docstring documenting parameters, return
values, and the use case (callers writing their own readers / routing).
- Added return-type hints on `derive_mhc_class` and
`melt_pvacseq_algorithms` for consistency with `read_pvacseq` and
`detect_pvacseq_format`, which already had them.
No behavior change.
Summary
read_pvacseq(path)parses both pVACseq output flavors into Topiary long form:*.all_epitopes.aggregated.tsv— one row per variant; Best Peptide × Allele chosen by Median IC50*.all_epitopes.tsv— one row per peptide × allele × length candidateMedian MT IC50 / percentile populate
value/percentile_rank; WT IC50 / percentile populate thewt_*schema so existing DSL works without changes (Affinity.value,wt.Affinity.value,Affinity.value - wt.Affinity.value, etc.).For aggregated rows, the WT peptide sequence is reconstructed from
Best Peptide+Pos+AA Changefor unambiguous missense (covers 292 of 317 rows on the HCC1395 sample; non-missense and flanking-peptide rows stay NaN — for those, use the unaggregatedall_epitopes.tsv, which hasWT Epitope Seqdirectly).Per-algorithm columns in the all_epitopes flavor pass through as
pvacseq_<algo>_{ic50,pct}_{mt,wt}annotation columns, accessible viaColumn("...").Multiple files (MHC-I + MHC-II, or a mix of flavors) compose through
topiary.concat([read_pvacseq(p1), read_pvacseq(p2)])— no dedicated multi-file entry point needed.Closes #94.
Test plan
pytest tests/test_io_pvacseq.py— 26 tests pass (format detection, aggregated load, missense WT reconstruction, all_epitopes load, multi-file concat, error path, DSL filter/sort/wt-scope round-trips)Affinity.value <= 100round-trips to 18 strong binderspytest tests/test_io.py tests/test_io_lens.py tests/test_dsl_extensions.py tests/test_ranking.py— 508 passed, no regressions