Skip to content

Add pVACseq report loader (closes #94)#167

Merged
iskandr merged 15 commits into
masterfrom
pvacseq-loader
May 14, 2026
Merged

Add pVACseq report loader (closes #94)#167
iskandr merged 15 commits into
masterfrom
pvacseq-loader

Conversation

@iskandr
Copy link
Copy Markdown
Contributor

@iskandr iskandr commented May 13, 2026

Summary

read_pvacseq(path) parses both pVACseq output flavors into Topiary long form:

  • *.all_epitopes.aggregated.tsv — one row per variant; Best Peptide × Allele chosen by Median IC50
  • *.all_epitopes.tsv — one row per peptide × allele × length candidate

Median MT IC50 / percentile populate value / percentile_rank; WT IC50 / percentile populate the wt_* schema so existing DSL works without changes (Affinity.value, wt.Affinity.value, Affinity.value - wt.Affinity.value, etc.).

For aggregated rows, the WT peptide sequence is reconstructed from Best Peptide + Pos + AA Change for unambiguous missense (covers 292 of 317 rows on the HCC1395 sample; non-missense and flanking-peptide rows stay NaN — for those, use the unaggregated all_epitopes.tsv, which has WT Epitope Seq directly).

Per-algorithm columns in the all_epitopes flavor pass through as pvacseq_<algo>_{ic50,pct}_{mt,wt} annotation columns, accessible via Column("...").

Multiple files (MHC-I + MHC-II, or a mix of flavors) compose through topiary.concat([read_pvacseq(p1), read_pvacseq(p2)]) — no dedicated multi-file entry point needed.

Closes #94.

Test plan

  • pytest tests/test_io_pvacseq.py — 26 tests pass (format detection, aggregated load, missense WT reconstruction, all_epitopes load, multi-file concat, error path, DSL filter/sort/wt-scope round-trips)
  • Smoke-loads the real HCC1395 MHC-I (317 rows) and MHC-II (317 rows) aggregated files from cAIrn; 292/317 missense WT peptides reconstruct correctly; DSL Affinity.value <= 100 round-trips to 18 strong binders
  • pytest tests/test_io.py tests/test_io_lens.py tests/test_dsl_extensions.py tests/test_ranking.py — 508 passed, no regressions

@coveralls
Copy link
Copy Markdown

coveralls commented May 13, 2026

Coverage Status

coverage: 89.87% (+0.2%) from 89.625% — pvacseq-loader into master

`read_pvacseq(path)` parses both pVACseq output flavors into Topiary long form:

- `*.all_epitopes.aggregated.tsv` — one row per variant; Best Peptide x
  Allele chosen by Median IC50.
- `*.all_epitopes.tsv` — one row per peptide x allele x length candidate.

Median MT IC50 / percentile populate `value` / `percentile_rank` with
`prediction_method_name="pvacseq"`; WT IC50 / percentile populate the
`wt_*` schema so existing DSL expressions (`Affinity.value`,
`wt.Affinity.value`) work without changes.

For aggregated rows the WT peptide sequence is reconstructed from
`Best Peptide` + `Pos` + `AA Change` for unambiguous missense (292 of
317 rows on HCC1395; non-missense and flanking rows stay NaN, where
the unaggregated `all_epitopes.tsv` carries `WT Epitope Seq` directly).

Per-algorithm columns in the all_epitopes flavor pass through as
`pvacseq_<algo>_{ic50,pct}_{mt,wt}` annotation columns, accessible via
`Column("...")`.

Multiple files (MHC-I + MHC-II, or a mix of flavors) compose through
`topiary.concat([read_pvacseq(p1), read_pvacseq(p2)])`.

Fixtures: 10-row slices of the HCC1395 MHC-I and MHC-II aggregated
files from the cAIrn WashU protocol, plus a hand-crafted all_epitopes
TSV covering 2 variants x 2 alleles x 2 lengths.
iskandr added 4 commits May 13, 2026 22:14
Review feedback from PR #167:

- Replace `df.get(col, [])` fallbacks (which silently produce length
  mismatches if a column is missing) with explicit `if col in df.columns`
  guards.
- Trim `na_values=["NA", "X"]` to just `["X"]`: "NA" is pandas' default;
  "X" is pVACseq's "this algorithm didn't score" sentinel.
- Use `pd.NA` for `predictor_version` / `wt_predictor_version` instead
  of `None` so the column dtype is consistent.
- Fix stale comment in `_finalize` that referenced a fallback branch the
  code never takes.
- Document the per-algorithm-column DSL-invisibility tradeoff in the
  module docstring.

New tests:

- `_reconstruct_wt_peptide` happy path + position-out-of-range +
  pos/AA Change disagreement + non-missense (FS, multi-AA) + non-string
  inputs (5 cases).
- `tag=` kwarg overrides source label.
- Round-trip a loaded pVACseq result through `to_tsv` / `read_tsv` and
  confirm columns and numeric values survive.
… tests

Second-pass review fixes:

Docs:
- Module docstring + CHANGELOG cited `wt(Affinity.value)`, but `wt` is a
  Scope object, not callable.  Correct to `wt.Affinity.value`.

Parser symmetry:
- Replace `_parse_all_epitopes`'s chained `df.get(a, df.get(b))` calls
  with a `_first_present_column` helper so both flavor parsers use the
  same "guard then read" pattern.
- `_finalize` no longer mutates its input — adds a defensive `.copy()`
  at entry rather than relying on callers discarding the original.

Effect classifier:
- Tighten frameshift recognition from `"fs" in s.lower()` (broad
  substring match) to `^FS\d` (pVACseq's actual format).  Adds explicit
  test that a missense like "F206S" does not get misclassified.

Tests:
- Round-trip assertion strengthens to set-equality (minus the `source`
  provenance column read_tsv adds) and checks peptide/allele/wt_peptide
  plus the four numeric score columns.
- New chr-coord variant fallback test exercises `_parse_all_epitopes`'s
  Index-absent branch (previously unreached by fixtures).
- New `TestClassifyEffect` covers missense / frameshift / multi-residue
  Substitution / variant-type-wins / unknown-returns-None, including
  the F206S false-positive regression.

40 tests pass; 568 across pvacseq/io/io_lens/dsl/ranking/wide clean.
…wt length

Final-pass cleanups:

- Remove unused module-level `logging.getLogger(__name__)` and the
  `import logging` it required.
- Restore the rationale comment on `peptide_offset = 0` (pVACseq doesn't
  ship source-protein offset; matches LENS loader convention).
- Restructure `_finalize` to allocate the augmented frame once via a
  single `DataFrame.assign(**...)` call instead of mutating a defensive
  copy in place and copying again on the way out.  Extracts a
  `_build_source_sequence_name` helper for the gene/variant fallback.
- `wt_peptide_length` is now nullable `Int64` (was `object` with mixed
  int / `pd.NA`), tightening semantics and easing downstream comparisons.
- `_parse_all_epitopes` uses `pd.NA` instead of the scalar `None` when
  `WT Epitope Seq` / `Variant Type` are absent — round-trips cleaner
  through pandas' nullable type system.

Tests:

- New `mhc_ii_all_epitopes.tsv` fixture (3 rows, DRB1 + DPA1-DPB1
  heterodimer alleles, NetMHCIIpan per-algorithm columns).
- `test_mhc_ii_all_epitopes_loads` asserts class II all_epitopes parses
  end-to-end, heterodimer alleles round-trip through mhcgnomes, and the
  per-algorithm passthrough handles the class-II algorithm name.
- `test_wt_peptide_length_is_nullable_int` pins the new Int64 dtype.

42 tests pass; 570 across pvacseq/io/io_lens/dsl/ranking/wide clean.
Drop the _AE abbreviation in favor of _ALL across:
- module constants: _AE_SIGNATURE/_AE_MEDIAN/_AE_BEST/_AE_ANNOTATIONS
- test fixture handles: MHC_I_AE/MHC_II_AE

The file-flavor identifier "all_epitopes" itself is unchanged — that's
pVACseq's actual format name.
iskandr added 7 commits May 14, 2026 10:57
Derived columns matching TopiaryPredictor output so downstream
consumers (vaxrank in particular) don't need to special-case loader
source:

- `mhc_class` ("I" / "II") — derived from the allele string.  Lets
  concat-ed MHC-I + MHC-II results be split or filtered by class
  without re-parsing alleles.
- `contains_mutant_residues` (nullable boolean) — true iff pVACseq's
  reported mutation position falls inside the candidate peptide.
  False for flanking-only peptides where pVACseq scored a 9-mer next
  to the mutation but the mutation itself is outside the window
  (28 of 634 HCC1395 rows).
- `mutation_start_in_peptide` / `mutation_end_in_peptide` (Int64,
  0-based half-open) — derived from pVACseq's 1-based Pos /
  Mutation Position.  Single-residue semantics; multi-residue
  mutations collapse to a representative position.
- `source` (per-row provenance label, matching topiary.read_tsv) — keeps
  multi-file concats distinguishable without rooting through Metadata.

Review nits also closed:

- `_build_source_sequence_name` now raises an informative KeyError
  rather than tripping a misleading `KeyError: 'gene'` if both
  source columns are missing (unreachable from real pVACseq input,
  but defensive).
- Round-trip test uses `Series.equals` for NaN-aware equality
  instead of the `.fillna(-1)` sentinel trick.
- Added `test_predictor_version_is_na` so concat-warning regressions
  don't sneak back in.
- Loosened `test_mhc_ii_alleles_get_hla_prefix` to `HLA-D` so it
  survives a future re-slice of the fixture to include heterodimers.

Verified against real HCC1395:
- 634 combined rows; 2 distinct `source` labels; 317 class I + 317
  class II.
- 606 contains_mutant_residues=True, 28 False (flanking-only).
- 584 WT peptides reconstructed (subset of contains_mutant_residues
  that are unambiguous missense).

50 tests pass; 578 across pvacseq/io/io_lens/dsl/ranking/wide clean.
Three add-ons that close the remaining gap between read_pvacseq's
output and what vaxrank's downstream code expects from a TopiaryPredictor
run, plus integration tests that lock the contract from topiary's side.

`Metadata.extra["kind_support"]` is populated with the same shape as
`TopiaryPredictor.kind_support` (model_key -> {kind -> {mhc_dependence,
mhc_class}}).  pVACseq always scores per allele, so mhc_dependence is
fixed at "single_allele"; mhc_class reflects what's in the file
("I" / "II" / "both" / "none").  Vaxrank's evaluate_scores can pass
this straight through to apply_filter / EvalContext without building
a parallel metadata dict.

`melt_pvacseq_algorithms(result)` expands the all_epitopes flavor's
per-algorithm columns into separate prediction_method_name=<algo>
rows so the DSL's `Affinity["mhcflurry"].value` /
`Affinity["netmhcpan"].score` selectors reach individual algorithms
natively.  Median rows (prediction_method_name="pvacseq") are
preserved; melt is a no-op on aggregated input.  The function
extends kind_support to register each melted algorithm under the
same mhc_class as the parent pvacseq entry.

Integration tests:

- TestKindSupport — pins the file-level mhc_class summary
  ("I" / "II" / "both") and the kind_support dict shape.
- TestMeltAlgorithms — verifies aggregated melt is a no-op, all_epitopes
  expands to N+1 rows per group, per-algorithm values lift from
  passthrough columns correctly, kind_support extends, and the DSL
  `Affinity["mhcflurry"]` selector finds melted rows.
- TestVaxrankComposition — bridge test pinning the realistic vaxrank
  filter shape (pandas masking for categorical clauses on mhc_class /
  contains_mutant_residues, topiary DSL for the numeric Affinity.value
  clause).  Also verifies exclude_by(df, ref_sequences) composes with
  loaded pVACseq rows.

`read_pvacseq_fragments` removed from `docs/fragments.md`'s queued list:
pVACseq output is already at peptide x allele granularity, so the
fragment-window abstraction doesn't apply.  Re-prediction composes via
`predict_from_named_peptides`, not via ProteinFragment.

61 pvacseq tests pass; 589 across pvacseq/io/io_lens/dsl/ranking/wide.
Adds an `IsIn` DSL node plus `Column.eq(value)` / `Column.ne(value)` /
`Column.isin(values)` methods so non-numeric columns (mhc_class,
source, gene, ...) are filterable natively in the DSL without
pre-masking via pandas.  The string parser additionally accepts
string literals on the right-hand side of `==` and `!=` so
`parse('mhc_class == "I"')` works end-to-end.

`DSLNode.__eq__` stays unoverridden (nodes remain hashable for use in
sets/dicts) — the explicit `.eq()` / `.ne()` / `.isin()` methods plus
the parser path are the supported entry points.  `IsIn` reads its
column raw, bypassing the float cast in `Column.eval`, so any dtype
works.

Why:

- pVACseq loader (5.16.0 first commit) emits a per-row `mhc_class`
  column ("I" / "II") so concat-ed multi-class results can be split.
  Filtering by class previously required pre-masking via pandas
  because the DSL couldn't compare strings; vaxrank-shape filters
  that combine numeric + categorical clauses needed two paths.
- Same path now supports filtering by `source` (provenance) and any
  other string column added by future loaders.

Implementation:

- New `IsIn(col_name, values, negate=False)` node in `ranking/nodes.py`.
  Constructor accepts scalar or iterable values; reads the column
  raw (no float cast); evaluates to a boolean Series indexed by
  `group_index`.  `__invert__` flips `negate`.
- `Column.eq` / `Column.ne` / `Column.isin` produce `IsIn` instances.
- Parser's `_cmp` detects a STRING token on the RHS of `==` / `!=`
  and produces `IsIn` directly; comparison with `<` / `<=` / `>` /
  `>=` against a string raises with a clear error.
- `apply._collect_column_names` recognizes `IsIn` so column-existence
  validation runs the same way as for `Column`.
- Exports: `IsIn` surfaces through `topiary.ranking` and the top-level
  `topiary` package.

Tests (138 across DSL + pvacseq suites, 696 across all relevant io /
dsl / ranking / wide suites; no regressions):

- `TestIsIn` in `tests/test_dsl_extensions.py` — happy paths,
  invert, compound with affinity, non-numeric column survives,
  parser emits IsIn for string RHS, parser rejects `< "..."`,
  parser rejects string-LHS, validation collects IsIn columns.
- `TestVaxrankComposition` in `tests/test_io_pvacseq.py` rewritten
  to use the native DSL form (`Column("mhc_class").eq("I")`),
  and a new parser-string variant.
- `topiary.class_i` and `topiary.class_ii` — pre-built `IsIn` nodes
  referencing the `mhc_class` column.  Compose with any other DSL node:
  `apply_filter(df, class_i & (Affinity.value <= 500))`.
- `topiary.derive_mhc_class(allele_series)` — public utility for
  stamping `mhc_class` onto a DataFrame produced outside the pvacseq
  loader (e.g. a fresh `TopiaryPredictor` result that carries class
  only at the model-level `kind_support`).  Returns a Series of
  `"I"` / `"II"` / `pd.NA`.

`docs/ranking.md` — `Column()` section gets an "Equality / membership
on any dtype" subsection covering:

- `Column("x").eq(value)` / `.ne(value)` / `.isin(values)` and why
  `Column("x") == "y"` doesn't compose (`__eq__` deliberately not
  overridden — nodes stay hashable).
- The `parse('mhc_class == "I"')` string-DSL form and the
  `<` / `<=` / `>` / `>=` rejection on string operands.
- The `class_i` / `class_ii` shortcuts and the `derive_mhc_class`
  utility for non-pvacseq DataFrames.

Also tightens the "Only numeric columns" wording — it's still true
for arithmetic and ordered comparisons, but no longer for equality /
membership.

Tests: 4 new `TestIsIn` cases for the shortcuts and their composition
with numeric clauses; `test_derive_mhc_class_public_utility` covers
the new package-level export across class I / class II / heterodimer /
unknown / None inputs.

143 dsl + pvacseq tests pass; 701 across pvacseq / io / io_lens / dsl /
ranking / wide.
Five tactical cleanups from the latest review of PR #167:

- `IsIn.__init__` had `isinstance(values, float) and math.isnan(values)`
  as a fallback scalar-detection arm.  `float` is already in the
  preceding `isinstance` tuple, so any NaN short-circuits there.  Drop
  the unreachable arm; add a one-line comment noting NaN's float
  membership.
- `IsIn.__repr__` was a three-level nested ternary.  Rewrite as a
  straight `if len(values) == 1` / else split — three lines shorter
  and reads in one pass.
- `_derive_mhc_class = derive_mhc_class` was a "backwards-compatible
  alias" with exactly one internal caller and no external exposure.
  Update the caller to the public name and drop the alias.
- `test_invert_negates` only asserted "I" is absent from the result;
  strengthen to assert positive retention (DDD + EEE survive) and the
  NaN-survives-`~eq` pandas-vs-SQL semantic.
- `test_parser_string_lhs_raises` was triggering the unrelated
  "Unexpected token" path in `_atom`, not the parser's explicit
  "column reference on the LHS" guard.  Add a separate test using
  `affinity.value == "I"` that does parse the LHS and gets rejected
  at the IsIn-construction site.

Docs: `ranking.md` Equality / membership subsection gains a NaN
semantics footnote — pandas-style (NaN → False for `.eq` / `.isin`,
True for `.ne` / `~.eq`), not SQL `IS NOT NULL`.

144 dsl + pvacseq tests pass; 702 across pvacseq / io / io_lens / dsl
/ ranking / wide.
Three fixes from the review nudge after the prior "clean stopping
point" claim — all real, not nits.

1. `IsIn.eval` empty-DataFrame branch built a Series from a
   zero-length boolean array indexed by `ctx.group_index`.  Crashes
   if the context happened to carry a non-empty index alongside an
   empty df ("Length of values (0) does not match length of
   index (N)").  Unreachable on current callers but fragile, and
   inconsistent with `Column.eval`'s empty path.  Switch to
   `ctx.empty_series().astype("boolean")` for index alignment and
   pattern parity.

2. `_collect_column_names` walker grew an `isinstance + elif` chain
   when `IsIn` landed.  The docstring promised "implement
   child_nodes() and you're done" — duck-type the leaf check via
   `getattr(n, "col_name", None)` so future column-bearing leaves
   work without touching the walker.  Drops the now-unused `Column`
   and `IsIn` imports from `apply.py`.

3. `docs/api.md` was missing four real public exports: `IsIn`,
   `class_i`, `class_ii`, `derive_mhc_class`.  Add `IsIn` to the
   DSL leaves table; new "Categorical equality and membership"
   subsection under Column covering the methods, the string-DSL
   `==` / `!=` form, and NaN semantics; pre-built shortcuts table
   for `class_i` / `class_ii`; row for `derive_mhc_class` in the
   input-functions table.

Tests: new `test_empty_dataframe` exercising IsIn against an
empty input (regression guard); 145 dsl + pvacseq tests pass; 705
across pvacseq / io / io_lens / dsl / ranking / wide / dataframe.
New `docs/pvacseq.md` (~260 lines) covers the full read_pvacseq surface
end-to-end:

- What pVACseq is, why import its output (re-score / compare / re-rank
  / combine MHC-I + MHC-II).
- The two TSV flavors (aggregated vs all_epitopes) — what differs,
  how the loader auto-detects.
- Schema mapping table: pVACseq column → topiary column for both
  flavors.
- Derived columns (mhc_class, contains_mutant_residues,
  mutation_start/end_in_peptide, source) — what they are and why
  vaxrank cares.
- Annotation passthroughs naming.
- Worked examples: single-class load, MHC-I + MHC-II concat, filter
  by class with shortcuts and the parsed-string DSL form.
- Per-algorithm scores: the pvacseq_<algo>_<field>_<mtwt> passthrough
  pattern + when to reach for melt_pvacseq_algorithms.
- WT peptide reconstruction (the missense-only path, with the
  HCC1395 292/317 stat).
- Re-scoring with TopiaryPredictor.predict_from_named_peptides via
  the derived peptide dict.
- derive_mhc_class for fresh prediction frames without an
  mhc_class column.
- Metadata stamping — pvacseq_format, kind_support shape.
- Caveats: metrics.json not used, flanking peptides, kind collapsing,
  no CLI.

Wired into mkdocs.yml nav (between Cached Predictions and
Cross-reactivity) and mentioned in docs/index.md's "Multiple input
modes" feature bullet.
@iskandr
Copy link
Copy Markdown
Contributor Author

iskandr commented May 14, 2026

Downstream signal: vaxrank intends to adopt this — filed openvax/vaxrank#297 to track the migration.

The plan is to replace vaxrank's own load_pvacseq (which only handles the aggregated flavor and has known WT-comparator gaps on rows missing WT Epitope Seq) with a thin adapter over read_pvacseq. The wins on the vaxrank side:

  • Drops the duplicate aggregated-TSV parser
  • Adds the all_epitopes flavor — much more useful for actual expert-review-style vaccine design than pVACseq's pre-picked "best peptide"
  • Per-algorithm columns become DSL-accessible, so users can write combined_score_expr like max(Column(\"pvacseq_mhcflurry_ic50_mt\"), Column(\"pvacseq_netmhcpan_ic50_mt\"))
  • MHC-I + MHC-II in one ranking run via topiary.concat
  • Closes openvax/vaxrank#282 (presentation-percentile / presentation-score gap)

I'm using the same HCC1395 sample (from cAIrn/JLF reviewer protocol) as my downstream smoke, so if you want a real end-to-end test case for any edge-case rows during review, happy to share. One question for the API: does read_pvacseq return the standard topiary long-form pd.DataFrame, or a wrapped object? Asking because the vaxrank adapter needs to reverse-map back to CandidateEpitope and I want to mirror however read_lens / native loaders do it.

No blocking feedback — happy to wait for this to land + release, then drive the vaxrank-side adoption.

Three issues found in the last review pass of `docs/pvacseq.md`:

1. The re-scoring section's `concat([r, fresh_result])` example referenced
   an undefined variable (`fresh_result` vs `fresh` from the assignment
   above), and even with the name fixed `concat` doesn't accept bare
   DataFrames — `TopiaryPredictor.predict_*` returns a DataFrame, not
   a TopiaryResult.  Rather than widen `concat` (scope creep beyond
   the loader's needs) or wrap the DataFrame artificially, drop the
   downstream-composition example entirely.  The re-scoring section
   shows how to fresh-predict; users compose with pandas as needed.

2. `"class lives in `Metadata.kind_support`"` — `Metadata` (the
   comment-block dataclass) doesn't have a `kind_support` field.
   The correct location is `TopiaryPredictor.kind_support`.

3. The schema-mapping table's `(or Best MT IC50 Score)` parenthetical
   read as "either column is acceptable."  Clarified to tie the
   fallback to pVACseq's `--top-score-metric=Best` configuration;
   same for the WT percentile column (Median vs Corresponding).

Also tightens the peptide-dict construction to a dict-comprehension
that makes the intent explicit instead of a dict(zip(...)) expression
that collapses duplicates as a side effect.

Two follow-up issues filed for the design surface that came out of
the review but doesn't belong in this PR:

- #168 — Add `allele_set` column to faithfully store haplotype-mode
  predictions (schema layer)
- #169 — DSL: explicit per-`mhc_dependence` peptide-level projection
  (query layer)

Together those two unblock cross-source / cross-kind prediction
comparison cleanly; without them, validator / combine-predictions
helpers can only approximate.  Out of scope for the pVACseq loader.
iskandr added 2 commits May 14, 2026 16:51
Two surface-quality fixes from another review pass:

- `detect_pvacseq_format` had a 5-word docstring for a public export
  ("Return 'aggregated', 'all_epitopes', or None.") — fine as a one-liner
  for an internal helper, but this one is at the package root.  Replaced
  with a proper Numpy-style docstring documenting parameters, return
  values, and the use case (callers writing their own readers / routing).
- Added return-type hints on `derive_mhc_class` and
  `melt_pvacseq_algorithms` for consistency with `read_pvacseq` and
  `detect_pvacseq_format`, which already had them.

No behavior change.
@iskandr iskandr merged commit 330809a into master May 14, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Import and re-score pVACseq results

2 participants