Add pVACseq report loader (closes #94) by iskandr · Pull Request #167 · openvax/topiary

iskandr · 2026-05-13T19:47:22Z

Summary

read_pvacseq(path) parses both pVACseq output flavors into Topiary long form:

*.all_epitopes.aggregated.tsv — one row per variant; Best Peptide × Allele chosen by Median IC50
*.all_epitopes.tsv — one row per peptide × allele × length candidate

Median MT IC50 / percentile populate value / percentile_rank; WT IC50 / percentile populate the wt_* schema so existing DSL works without changes (Affinity.value, wt.Affinity.value, Affinity.value - wt.Affinity.value, etc.).

For aggregated rows, the WT peptide sequence is reconstructed from Best Peptide + Pos + AA Change for unambiguous missense (covers 292 of 317 rows on the HCC1395 sample; non-missense and flanking-peptide rows stay NaN — for those, use the unaggregated all_epitopes.tsv, which has WT Epitope Seq directly).

Per-algorithm columns in the all_epitopes flavor pass through as pvacseq_<algo>_{ic50,pct}_{mt,wt} annotation columns, accessible via Column("...").

Multiple files (MHC-I + MHC-II, or a mix of flavors) compose through topiary.concat([read_pvacseq(p1), read_pvacseq(p2)]) — no dedicated multi-file entry point needed.

Closes #94.

Test plan

pytest tests/test_io_pvacseq.py — 26 tests pass (format detection, aggregated load, missense WT reconstruction, all_epitopes load, multi-file concat, error path, DSL filter/sort/wt-scope round-trips)
Smoke-loads the real HCC1395 MHC-I (317 rows) and MHC-II (317 rows) aggregated files from cAIrn; 292/317 missense WT peptides reconstruct correctly; DSL Affinity.value <= 100 round-trips to 18 strong binders
pytest tests/test_io.py tests/test_io_lens.py tests/test_dsl_extensions.py tests/test_ranking.py — 508 passed, no regressions

coveralls · 2026-05-13T19:57:38Z

coverage: 89.87% (+0.2%) from 89.625% — pvacseq-loader into master

`read_pvacseq(path)` parses both pVACseq output flavors into Topiary long form: - `*.all_epitopes.aggregated.tsv` — one row per variant; Best Peptide x Allele chosen by Median IC50. - `*.all_epitopes.tsv` — one row per peptide x allele x length candidate. Median MT IC50 / percentile populate `value` / `percentile_rank` with `prediction_method_name="pvacseq"`; WT IC50 / percentile populate the `wt_*` schema so existing DSL expressions (`Affinity.value`, `wt.Affinity.value`) work without changes. For aggregated rows the WT peptide sequence is reconstructed from `Best Peptide` + `Pos` + `AA Change` for unambiguous missense (292 of 317 rows on HCC1395; non-missense and flanking rows stay NaN, where the unaggregated `all_epitopes.tsv` carries `WT Epitope Seq` directly). Per-algorithm columns in the all_epitopes flavor pass through as `pvacseq_<algo>_{ic50,pct}_{mt,wt}` annotation columns, accessible via `Column("...")`. Multiple files (MHC-I + MHC-II, or a mix of flavors) compose through `topiary.concat([read_pvacseq(p1), read_pvacseq(p2)])`. Fixtures: 10-row slices of the HCC1395 MHC-I and MHC-II aggregated files from the cAIrn WashU protocol, plus a hand-crafted all_epitopes TSV covering 2 variants x 2 alleles x 2 lengths.

Review feedback from PR #167: - Replace `df.get(col, [])` fallbacks (which silently produce length mismatches if a column is missing) with explicit `if col in df.columns` guards. - Trim `na_values=["NA", "X"]` to just `["X"]`: "NA" is pandas' default; "X" is pVACseq's "this algorithm didn't score" sentinel. - Use `pd.NA` for `predictor_version` / `wt_predictor_version` instead of `None` so the column dtype is consistent. - Fix stale comment in `_finalize` that referenced a fallback branch the code never takes. - Document the per-algorithm-column DSL-invisibility tradeoff in the module docstring. New tests: - `_reconstruct_wt_peptide` happy path + position-out-of-range + pos/AA Change disagreement + non-missense (FS, multi-AA) + non-string inputs (5 cases). - `tag=` kwarg overrides source label. - Round-trip a loaded pVACseq result through `to_tsv` / `read_tsv` and confirm columns and numeric values survive.

… tests Second-pass review fixes: Docs: - Module docstring + CHANGELOG cited `wt(Affinity.value)`, but `wt` is a Scope object, not callable. Correct to `wt.Affinity.value`. Parser symmetry: - Replace `_parse_all_epitopes`'s chained `df.get(a, df.get(b))` calls with a `_first_present_column` helper so both flavor parsers use the same "guard then read" pattern. - `_finalize` no longer mutates its input — adds a defensive `.copy()` at entry rather than relying on callers discarding the original. Effect classifier: - Tighten frameshift recognition from `"fs" in s.lower()` (broad substring match) to `^FS\d` (pVACseq's actual format). Adds explicit test that a missense like "F206S" does not get misclassified. Tests: - Round-trip assertion strengthens to set-equality (minus the `source` provenance column read_tsv adds) and checks peptide/allele/wt_peptide plus the four numeric score columns. - New chr-coord variant fallback test exercises `_parse_all_epitopes`'s Index-absent branch (previously unreached by fixtures). - New `TestClassifyEffect` covers missense / frameshift / multi-residue Substitution / variant-type-wins / unknown-returns-None, including the F206S false-positive regression. 40 tests pass; 568 across pvacseq/io/io_lens/dsl/ranking/wide clean.

…wt length Final-pass cleanups: - Remove unused module-level `logging.getLogger(__name__)` and the `import logging` it required. - Restore the rationale comment on `peptide_offset = 0` (pVACseq doesn't ship source-protein offset; matches LENS loader convention). - Restructure `_finalize` to allocate the augmented frame once via a single `DataFrame.assign(**...)` call instead of mutating a defensive copy in place and copying again on the way out. Extracts a `_build_source_sequence_name` helper for the gene/variant fallback. - `wt_peptide_length` is now nullable `Int64` (was `object` with mixed int / `pd.NA`), tightening semantics and easing downstream comparisons. - `_parse_all_epitopes` uses `pd.NA` instead of the scalar `None` when `WT Epitope Seq` / `Variant Type` are absent — round-trips cleaner through pandas' nullable type system. Tests: - New `mhc_ii_all_epitopes.tsv` fixture (3 rows, DRB1 + DPA1-DPB1 heterodimer alleles, NetMHCIIpan per-algorithm columns). - `test_mhc_ii_all_epitopes_loads` asserts class II all_epitopes parses end-to-end, heterodimer alleles round-trip through mhcgnomes, and the per-algorithm passthrough handles the class-II algorithm name. - `test_wt_peptide_length_is_nullable_int` pins the new Int64 dtype. 42 tests pass; 570 across pvacseq/io/io_lens/dsl/ranking/wide clean.

Drop the _AE abbreviation in favor of _ALL across: - module constants: _AE_SIGNATURE/_AE_MEDIAN/_AE_BEST/_AE_ANNOTATIONS - test fixture handles: MHC_I_AE/MHC_II_AE The file-flavor identifier "all_epitopes" itself is unchanged — that's pVACseq's actual format name.

Derived columns matching TopiaryPredictor output so downstream consumers (vaxrank in particular) don't need to special-case loader source: - `mhc_class` ("I" / "II") — derived from the allele string. Lets concat-ed MHC-I + MHC-II results be split or filtered by class without re-parsing alleles. - `contains_mutant_residues` (nullable boolean) — true iff pVACseq's reported mutation position falls inside the candidate peptide. False for flanking-only peptides where pVACseq scored a 9-mer next to the mutation but the mutation itself is outside the window (28 of 634 HCC1395 rows). - `mutation_start_in_peptide` / `mutation_end_in_peptide` (Int64, 0-based half-open) — derived from pVACseq's 1-based Pos / Mutation Position. Single-residue semantics; multi-residue mutations collapse to a representative position. - `source` (per-row provenance label, matching topiary.read_tsv) — keeps multi-file concats distinguishable without rooting through Metadata. Review nits also closed: - `_build_source_sequence_name` now raises an informative KeyError rather than tripping a misleading `KeyError: 'gene'` if both source columns are missing (unreachable from real pVACseq input, but defensive). - Round-trip test uses `Series.equals` for NaN-aware equality instead of the `.fillna(-1)` sentinel trick. - Added `test_predictor_version_is_na` so concat-warning regressions don't sneak back in. - Loosened `test_mhc_ii_alleles_get_hla_prefix` to `HLA-D` so it survives a future re-slice of the fixture to include heterodimers. Verified against real HCC1395: - 634 combined rows; 2 distinct `source` labels; 317 class I + 317 class II. - 606 contains_mutant_residues=True, 28 False (flanking-only). - 584 WT peptides reconstructed (subset of contains_mutant_residues that are unambiguous missense). 50 tests pass; 578 across pvacseq/io/io_lens/dsl/ranking/wide clean.

Three add-ons that close the remaining gap between read_pvacseq's output and what vaxrank's downstream code expects from a TopiaryPredictor run, plus integration tests that lock the contract from topiary's side. `Metadata.extra["kind_support"]` is populated with the same shape as `TopiaryPredictor.kind_support` (model_key -> {kind -> {mhc_dependence, mhc_class}}). pVACseq always scores per allele, so mhc_dependence is fixed at "single_allele"; mhc_class reflects what's in the file ("I" / "II" / "both" / "none"). Vaxrank's evaluate_scores can pass this straight through to apply_filter / EvalContext without building a parallel metadata dict. `melt_pvacseq_algorithms(result)` expands the all_epitopes flavor's per-algorithm columns into separate prediction_method_name=<algo> rows so the DSL's `Affinity["mhcflurry"].value` / `Affinity["netmhcpan"].score` selectors reach individual algorithms natively. Median rows (prediction_method_name="pvacseq") are preserved; melt is a no-op on aggregated input. The function extends kind_support to register each melted algorithm under the same mhc_class as the parent pvacseq entry. Integration tests: - TestKindSupport — pins the file-level mhc_class summary ("I" / "II" / "both") and the kind_support dict shape. - TestMeltAlgorithms — verifies aggregated melt is a no-op, all_epitopes expands to N+1 rows per group, per-algorithm values lift from passthrough columns correctly, kind_support extends, and the DSL `Affinity["mhcflurry"]` selector finds melted rows. - TestVaxrankComposition — bridge test pinning the realistic vaxrank filter shape (pandas masking for categorical clauses on mhc_class / contains_mutant_residues, topiary DSL for the numeric Affinity.value clause). Also verifies exclude_by(df, ref_sequences) composes with loaded pVACseq rows. `read_pvacseq_fragments` removed from `docs/fragments.md`'s queued list: pVACseq output is already at peptide x allele granularity, so the fragment-window abstraction doesn't apply. Re-prediction composes via `predict_from_named_peptides`, not via ProteinFragment. 61 pvacseq tests pass; 589 across pvacseq/io/io_lens/dsl/ranking/wide.

Adds an `IsIn` DSL node plus `Column.eq(value)` / `Column.ne(value)` / `Column.isin(values)` methods so non-numeric columns (mhc_class, source, gene, ...) are filterable natively in the DSL without pre-masking via pandas. The string parser additionally accepts string literals on the right-hand side of `==` and `!=` so `parse('mhc_class == "I"')` works end-to-end. `DSLNode.__eq__` stays unoverridden (nodes remain hashable for use in sets/dicts) — the explicit `.eq()` / `.ne()` / `.isin()` methods plus the parser path are the supported entry points. `IsIn` reads its column raw, bypassing the float cast in `Column.eval`, so any dtype works. Why: - pVACseq loader (5.16.0 first commit) emits a per-row `mhc_class` column ("I" / "II") so concat-ed multi-class results can be split. Filtering by class previously required pre-masking via pandas because the DSL couldn't compare strings; vaxrank-shape filters that combine numeric + categorical clauses needed two paths. - Same path now supports filtering by `source` (provenance) and any other string column added by future loaders. Implementation: - New `IsIn(col_name, values, negate=False)` node in `ranking/nodes.py`. Constructor accepts scalar or iterable values; reads the column raw (no float cast); evaluates to a boolean Series indexed by `group_index`. `__invert__` flips `negate`. - `Column.eq` / `Column.ne` / `Column.isin` produce `IsIn` instances. - Parser's `_cmp` detects a STRING token on the RHS of `==` / `!=` and produces `IsIn` directly; comparison with `<` / `<=` / `>` / `>=` against a string raises with a clear error. - `apply._collect_column_names` recognizes `IsIn` so column-existence validation runs the same way as for `Column`. - Exports: `IsIn` surfaces through `topiary.ranking` and the top-level `topiary` package. Tests (138 across DSL + pvacseq suites, 696 across all relevant io / dsl / ranking / wide suites; no regressions): - `TestIsIn` in `tests/test_dsl_extensions.py` — happy paths, invert, compound with affinity, non-numeric column survives, parser emits IsIn for string RHS, parser rejects `< "..."`, parser rejects string-LHS, validation collects IsIn columns. - `TestVaxrankComposition` in `tests/test_io_pvacseq.py` rewritten to use the native DSL form (`Column("mhc_class").eq("I")`), and a new parser-string variant.

- `topiary.class_i` and `topiary.class_ii` — pre-built `IsIn` nodes referencing the `mhc_class` column. Compose with any other DSL node: `apply_filter(df, class_i & (Affinity.value <= 500))`. - `topiary.derive_mhc_class(allele_series)` — public utility for stamping `mhc_class` onto a DataFrame produced outside the pvacseq loader (e.g. a fresh `TopiaryPredictor` result that carries class only at the model-level `kind_support`). Returns a Series of `"I"` / `"II"` / `pd.NA`. `docs/ranking.md` — `Column()` section gets an "Equality / membership on any dtype" subsection covering: - `Column("x").eq(value)` / `.ne(value)` / `.isin(values)` and why `Column("x") == "y"` doesn't compose (`__eq__` deliberately not overridden — nodes stay hashable). - The `parse('mhc_class == "I"')` string-DSL form and the `<` / `<=` / `>` / `>=` rejection on string operands. - The `class_i` / `class_ii` shortcuts and the `derive_mhc_class` utility for non-pvacseq DataFrames. Also tightens the "Only numeric columns" wording — it's still true for arithmetic and ordered comparisons, but no longer for equality / membership. Tests: 4 new `TestIsIn` cases for the shortcuts and their composition with numeric clauses; `test_derive_mhc_class_public_utility` covers the new package-level export across class I / class II / heterodimer / unknown / None inputs. 143 dsl + pvacseq tests pass; 701 across pvacseq / io / io_lens / dsl / ranking / wide.

Five tactical cleanups from the latest review of PR #167: - `IsIn.__init__` had `isinstance(values, float) and math.isnan(values)` as a fallback scalar-detection arm. `float` is already in the preceding `isinstance` tuple, so any NaN short-circuits there. Drop the unreachable arm; add a one-line comment noting NaN's float membership. - `IsIn.__repr__` was a three-level nested ternary. Rewrite as a straight `if len(values) == 1` / else split — three lines shorter and reads in one pass. - `_derive_mhc_class = derive_mhc_class` was a "backwards-compatible alias" with exactly one internal caller and no external exposure. Update the caller to the public name and drop the alias. - `test_invert_negates` only asserted "I" is absent from the result; strengthen to assert positive retention (DDD + EEE survive) and the NaN-survives-`~eq` pandas-vs-SQL semantic. - `test_parser_string_lhs_raises` was triggering the unrelated "Unexpected token" path in `_atom`, not the parser's explicit "column reference on the LHS" guard. Add a separate test using `affinity.value == "I"` that does parse the LHS and gets rejected at the IsIn-construction site. Docs: `ranking.md` Equality / membership subsection gains a NaN semantics footnote — pandas-style (NaN → False for `.eq` / `.isin`, True for `.ne` / `~.eq`), not SQL `IS NOT NULL`. 144 dsl + pvacseq tests pass; 702 across pvacseq / io / io_lens / dsl / ranking / wide.

Three fixes from the review nudge after the prior "clean stopping point" claim — all real, not nits. 1. `IsIn.eval` empty-DataFrame branch built a Series from a zero-length boolean array indexed by `ctx.group_index`. Crashes if the context happened to carry a non-empty index alongside an empty df ("Length of values (0) does not match length of index (N)"). Unreachable on current callers but fragile, and inconsistent with `Column.eval`'s empty path. Switch to `ctx.empty_series().astype("boolean")` for index alignment and pattern parity. 2. `_collect_column_names` walker grew an `isinstance + elif` chain when `IsIn` landed. The docstring promised "implement child_nodes() and you're done" — duck-type the leaf check via `getattr(n, "col_name", None)` so future column-bearing leaves work without touching the walker. Drops the now-unused `Column` and `IsIn` imports from `apply.py`. 3. `docs/api.md` was missing four real public exports: `IsIn`, `class_i`, `class_ii`, `derive_mhc_class`. Add `IsIn` to the DSL leaves table; new "Categorical equality and membership" subsection under Column covering the methods, the string-DSL `==` / `!=` form, and NaN semantics; pre-built shortcuts table for `class_i` / `class_ii`; row for `derive_mhc_class` in the input-functions table. Tests: new `test_empty_dataframe` exercising IsIn against an empty input (regression guard); 145 dsl + pvacseq tests pass; 705 across pvacseq / io / io_lens / dsl / ranking / wide / dataframe.

New `docs/pvacseq.md` (~260 lines) covers the full read_pvacseq surface end-to-end: - What pVACseq is, why import its output (re-score / compare / re-rank / combine MHC-I + MHC-II). - The two TSV flavors (aggregated vs all_epitopes) — what differs, how the loader auto-detects. - Schema mapping table: pVACseq column → topiary column for both flavors. - Derived columns (mhc_class, contains_mutant_residues, mutation_start/end_in_peptide, source) — what they are and why vaxrank cares. - Annotation passthroughs naming. - Worked examples: single-class load, MHC-I + MHC-II concat, filter by class with shortcuts and the parsed-string DSL form. - Per-algorithm scores: the pvacseq_<algo>_<field>_<mtwt> passthrough pattern + when to reach for melt_pvacseq_algorithms. - WT peptide reconstruction (the missense-only path, with the HCC1395 292/317 stat). - Re-scoring with TopiaryPredictor.predict_from_named_peptides via the derived peptide dict. - derive_mhc_class for fresh prediction frames without an mhc_class column. - Metadata stamping — pvacseq_format, kind_support shape. - Caveats: metrics.json not used, flanking peptides, kind collapsing, no CLI. Wired into mkdocs.yml nav (between Cached Predictions and Cross-reactivity) and mentioned in docs/index.md's "Multiple input modes" feature bullet.

iskandr · 2026-05-14T18:27:50Z

Downstream signal: vaxrank intends to adopt this — filed openvax/vaxrank#297 to track the migration.

The plan is to replace vaxrank's own load_pvacseq (which only handles the aggregated flavor and has known WT-comparator gaps on rows missing WT Epitope Seq) with a thin adapter over read_pvacseq. The wins on the vaxrank side:

Drops the duplicate aggregated-TSV parser
Adds the all_epitopes flavor — much more useful for actual expert-review-style vaccine design than pVACseq's pre-picked "best peptide"
Per-algorithm columns become DSL-accessible, so users can write combined_score_expr like max(Column(\"pvacseq_mhcflurry_ic50_mt\"), Column(\"pvacseq_netmhcpan_ic50_mt\"))
MHC-I + MHC-II in one ranking run via topiary.concat
Closes openvax/vaxrank#282 (presentation-percentile / presentation-score gap)

I'm using the same HCC1395 sample (from cAIrn/JLF reviewer protocol) as my downstream smoke, so if you want a real end-to-end test case for any edge-case rows during review, happy to share. One question for the API: does read_pvacseq return the standard topiary long-form pd.DataFrame, or a wrapped object? Asking because the vaxrank adapter needs to reverse-map back to CandidateEpitope and I want to mirror however read_lens / native loaders do it.

No blocking feedback — happy to wait for this to land + release, then drive the vaxrank-side adoption.

Three issues found in the last review pass of `docs/pvacseq.md`: 1. The re-scoring section's `concat([r, fresh_result])` example referenced an undefined variable (`fresh_result` vs `fresh` from the assignment above), and even with the name fixed `concat` doesn't accept bare DataFrames — `TopiaryPredictor.predict_*` returns a DataFrame, not a TopiaryResult. Rather than widen `concat` (scope creep beyond the loader's needs) or wrap the DataFrame artificially, drop the downstream-composition example entirely. The re-scoring section shows how to fresh-predict; users compose with pandas as needed. 2. `"class lives in `Metadata.kind_support`"` — `Metadata` (the comment-block dataclass) doesn't have a `kind_support` field. The correct location is `TopiaryPredictor.kind_support`. 3. The schema-mapping table's `(or Best MT IC50 Score)` parenthetical read as "either column is acceptable." Clarified to tie the fallback to pVACseq's `--top-score-metric=Best` configuration; same for the WT percentile column (Median vs Corresponding). Also tightens the peptide-dict construction to a dict-comprehension that makes the intent explicit instead of a dict(zip(...)) expression that collapses duplicates as a side effect. Two follow-up issues filed for the design surface that came out of the review but doesn't belong in this PR: - #168 — Add `allele_set` column to faithfully store haplotype-mode predictions (schema layer) - #169 — DSL: explicit per-`mhc_dependence` peptide-level projection (query layer) Together those two unblock cross-source / cross-kind prediction comparison cleanly; without them, validator / combine-predictions helpers can only approximate. Out of scope for the pVACseq loader.

Two surface-quality fixes from another review pass: - `detect_pvacseq_format` had a 5-word docstring for a public export ("Return 'aggregated', 'all_epitopes', or None.") — fine as a one-liner for an internal helper, but this one is at the package root. Replaced with a proper Numpy-style docstring documenting parameters, return values, and the use case (callers writing their own readers / routing). - Added return-type hints on `derive_mhc_class` and `melt_pvacseq_algorithms` for consistency with `read_pvacseq` and `detect_pvacseq_format`, which already had them. No behavior change.

iskandr force-pushed the pvacseq-loader branch from d868545 to 1b44867 Compare May 13, 2026 22:11

iskandr added 4 commits May 13, 2026 22:14

iskandr mentioned this pull request May 14, 2026

Migrate --input-pvacseq to topiary.read_pvacseq (unaggregated flavor, per-algorithm scores, MHC-II) openvax/vaxrank#297

Closed

iskandr added 7 commits May 14, 2026 10:57

This was referenced May 14, 2026

Add allele_set column to faithfully store haplotype-mode predictions #168

Open

DSL: explicit per-mhc_dependence peptide-level projection #169

Open

iskandr mentioned this pull request May 14, 2026

Combine predictions from separate TopiaryPredictor runs into one result #170

Closed

iskandr added 2 commits May 14, 2026 16:51

Bump version to 5.16.0

0cb7493

iskandr merged commit 330809a into master May 14, 2026
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add pVACseq report loader (closes #94)#167

Add pVACseq report loader (closes #94)#167
iskandr merged 15 commits into
masterfrom
pvacseq-loader

iskandr commented May 13, 2026 •

edited

Loading

Uh oh!

coveralls commented May 13, 2026 •

edited

Loading

Uh oh!

iskandr commented May 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

iskandr commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

coveralls commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

iskandr commented May 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

iskandr commented May 13, 2026 •

edited

Loading

coveralls commented May 13, 2026 •

edited

Loading