Skip to content

CachedPredictor CLI + real NetMHC fixtures + README reorder#136

Merged
iskandr merged 8 commits into
masterfrom
cached-predictor-cli
Apr 16, 2026
Merged

CachedPredictor CLI + real NetMHC fixtures + README reorder#136
iskandr merged 8 commits into
masterfrom
cached-predictor-cli

Conversation

@iskandr
Copy link
Copy Markdown
Contributor

@iskandr iskandr commented Apr 15, 2026

Summary

Three related changes on the cached-predictions story (orthogonal to PR #135 which is still in review on the SelfProteome side):

  1. CLI support for CachedPredictor. New --mhc-cache-* argument group lets users run topiary entirely from pre-computed prediction files without invoking a live MHC predictor.
  2. Real NetMHC-family fixtures captured from the netmhc-bundle binaries for peptide SLLQHLIGL at HLA-A*02:01 / A*24:02 / B*07:02 (substituted for B*07:01 which NetMHCpan doesn't recognize). 15 new tests exercise the full parsing + CLI chain against real data.
  3. README section reorder — "MHC prediction models" moved near the top; "Cached predictions" moved to the end.

CLI shape

topiary --peptide-csv peptides.csv \
    --mhc-cache-file netmhcpan_run.out \
    --mhc-cache-format netmhcpan_stdout \
    --output-csv results.csv

Supported formats (via --mhc-cache-format): topiary_output, mhcflurry, tsv, netmhcpan_stdout, netmhc_stdout, netmhcpan_cons_stdout, netmhciipan_stdout, netmhcstabpan_stdout. Plus --mhc-cache-directory with --mhc-cache-directory-pattern for sharded caches.

--mhc-predictor / --mhc-alleles become optional when a cache is supplying predictions. Validation covers missing format, mutually-exclusive file vs directory, and no-source-at-all cases.

Full flag reference is documented in docs/cached.md.

Fixtures

tests/data/netmhc_fixtures/:

  • netmhcpan_41_SLLQHLIGL_{A0201,A2402,B0702}.out — NetMHCpan 4.1b stdout, single-allele per file.
  • netmhcpan_40_SLLQHLIGL_*.out — NetMHCpan 4.0 stdout.
  • netmhc_40_SLLQHLIGL.out — NetMHC classic 4.0 (multi-allele works).
  • netmhcstabpan_SLLQHLIGL_*.out — NetMHCstabpan single-allele.
  • Multi-allele NetMHCpan and stabpan files are also committed but don't parse today due to mhctools' per-allele-header-line bug, filed as parse_netmhcpan_stdout crashes on multi-allele NetMHCpan output mhctools#195.

NetMHCcons skipped — the binary calls python2 which isn't on PATH here.

Minor fix

topiary/cached.py: expanded the version-header regex search window from 2000 to 10000 chars. NetMHCpan preambles with full argument dumps push the # NetMHCpan version X.Yb line past 2000 chars (real captures show it at char ~2700), which was making predictor_version default to "unknown" instead of parsing from the header.

Test plan

  • 6 new TestRealNetMHCFixtures tests in tests/test_cached_predictor.py assert actual numeric predictions (e.g. SLLQHLIGL × A*02:01 = 8.82 nM, 0.105%rank) through each from_*_stdout loader.
  • 9 new tests in `tests/test_cli_cached_predictor.py` exercise the CLI end-to-end — error paths, happy paths for netmhcpan/netmhc/netmhcstabpan/topiary-output/tsv formats, explicit version override, TSV column mapping.
  • ./test.sh — 1126 passed, 3 skipped.
  • ./lint.sh — clean.
  • mkdocs build --strict — clean.
  • CI green.

Related

## CLI

New `--mhc-cache-*` argument group lets users run topiary entirely
from pre-computed prediction files, without invoking a live MHC
predictor.  Every CachedPredictor loader has a CLI path:

  topiary --peptide-csv peptides.csv \
      --mhc-cache-file netmhcpan_run.out \
      --mhc-cache-format netmhcpan_stdout \
      --output-csv results.csv

Flags: --mhc-cache-file, --mhc-cache-directory (+ pattern),
--mhc-cache-format (8 choices), --mhc-cache-predictor-name /
--mhc-cache-predictor-version overrides, --mhc-cache-tsv-column
(repeatable), --mhc-cache-tsv-sep, plus format-specific mode and
version knobs for the NetMHC family.

--mhc-predictor / --mhc-alleles become optional when a cache is
supplying predictions.  A combined --mhc-cache-file +
--mhc-cache-directory raises; missing --mhc-cache-format with a
--mhc-cache-file raises; etc. — all validated with actionable
messages.

## Real NetMHC fixtures

tests/data/netmhc_fixtures/ now holds captured stdout from the
netmhc-bundle binaries for peptide SLLQHLIGL against HLA-A*02:01,
HLA-A*24:02, HLA-B*07:02 (substituted for B*07:01 which isn't in
NetMHCpan's allele list):

- netmhcpan-4.1 and 4.0 output (single-allele per file)
- netmhc-4.0 output (multi-allele works)
- netmhcstabpan output (single-allele)

Multi-allele NetMHCpan outputs are also committed but currently
don't parse through mhctools; filed as openvax/mhctools#195 with
the root-cause analysis (per-allele "Distance to training data"
header lines confuse the row parser).  NetMHCcons skipped because
the binary requires python2 which isn't on this machine's PATH.

## Tests

- 6 new TestRealNetMHCFixtures tests assert the actual numeric
  predictions (e.g. SLLQHLIGL × HLA-A*02:01 = 8.82 nM, 0.105%rank)
  parse through each from_*_stdout loader.
- 9 new tests in tests/test_cli_cached_predictor.py exercise the
  CLI end-to-end: error paths, happy paths for netmhcpan / netmhc /
  netmhcstabpan / topiary-output / tsv formats, explicit version
  override, TSV column mapping.

1126 tests pass (up from 1111 on this branch's base), lint clean.

## Minor fix

topiary/cached.py: expanded the stdout-header version-regex search
window from 2000 to 10000 chars.  NetMHCpan preambles with full
argument dumps push the "# NetMHCpan version X.Yb" line past 2000
chars, which was causing predictor_version to default to "unknown"
instead of parsing from the header.  Real captured fixtures showed
the version line at char ~2700.

## README reorder

Moved "MHC prediction models" section from line 481 to just after
"Predicting MHC binding" at line 66 — it's more fundamental for
readers than where to find it buried.  Moved "Cached predictions"
from line 442 to the end of the README — it's an advanced/optional
feature that belongs after the basics.  New order: Installation →
Predicting MHC binding → MHC prediction models → Filtering and
ranking → Expression data → CLI usage → Output → Peptide properties
→ Protein sources → Cached predictions.

## Docs

docs/cached.md gains a "From the CLI" section with a full flag
table and per-format command examples.  README's Cached predictions
section shows a one-liner CLI example with a link to the full
reference.
@coveralls
Copy link
Copy Markdown

coveralls commented Apr 15, 2026

Coverage Status

coverage: 88.202% (-0.04%) from 88.246% — cached-predictor-cli into master

@iskandr
Copy link
Copy Markdown
Contributor Author

iskandr commented Apr 15, 2026

Why do you need _stdout at the end of all these?

netmhcpan_stdout, netmhc_stdout, netmhcpan_cons_stdout, netmhciipan_stdout, netmhcstabpan_stdout

@iskandr
Copy link
Copy Markdown
Contributor Author

iskandr commented Apr 15, 2026

And is there a CLI option for version specification?

And can the format and version both be sniffed from the text file?

## Bug fix (critical)

Cached-predictor loaders were not stamping the `kind` column, so DSL
expressions like `Affinity.value <= 500` (which dispatches on
`kind == "pMHC_affinity"`) matched zero rows.  Surfaced by a new CLI
filter/sort test.

Fix: `_bindings_to_dataframe(preds, *, kind)` now requires a kind
argument; each `from_*_stdout` loader passes the right value
(`pMHC_affinity` for binding-affinity mode, `pMHC_presentation` for
elution-score mode, `pMHC_stability` for NetMHCstabpan, inferred from
mhcflurry column signatures for `from_mhcflurry`).

## User feedback

- **Drop `_stdout` suffix from CLI format names.** `netmhcpan_stdout`
  → `netmhcpan`, etc.  The suffix was noise; users never wondered if
  it was stdout vs .xls because we only ship stdout today.  Internal
  loader classmethod names kept their `_stdout` suffix for the
  separation between stdout and future .xls parsers in the Python
  API.
- **Auto-sniff `--mhc-cache-format` from file content.**  New
  `_sniff_format(path)` helper detects NetMHCpan / NetMHC /
  NetMHCstabpan / NetMHCcons / NetMHCIIpan from stdout preamble
  lines, mhcflurry from `mhcflurry_*` column names, topiary_output
  from `prediction_method_name` + `predictor_version` column headers,
  and parquet via magic bytes.  `--mhc-cache-format` becomes
  optional; only the generic `tsv` path still strictly requires it
  (nothing to sniff).
- **Version specification clarified.**  Existing flags kept:
  `--mhc-cache-predictor-version` for the identity stamp (auto-parsed
  from stdout preamble, auto-composed for mhcflurry); tool-version
  flags `--mhc-cache-netmhc-version` and
  `--mhc-cache-netmhciipan-version` pick between version-specific
  parsers when multiple are supported.

## Review feedback

- **Multi-allele fixtures get xfail tests** (blocker from agent
  review).  3 new strict-xfail tests in `TestMultiAlleleFixturesKnownBroken`
  document the expected mhctools#195 fix; they'll flip to passing
  when upstream lands, producing a reviewable diff that promotes them
  to happy-path tests.
- **CLI directory sharding test** (should fix).
  `TestCachedPredictorCliSharding::test_mhc_cache_directory_merges_shards`
  exercises the full CLI path through `--mhc-cache-directory` with a
  pattern glob.
- **CLI filter/sort interaction tests** (should fix).  Two tests
  verify `--filter-by` / `--sort-by` work against cached-prediction
  output end-to-end.
- **Fixture-pinning docstring** (nit #4).  `tests/test_cached_predictor.py`
  notes that numeric assertions pin specific NetMHC versions and
  headers carry machine-specific comment lines that don't affect
  parsing.
- **`_actions` mutation comment** (nit #5).  Explains why reaching
  into argparse internals is safe and what would happen if mhctools
  changes its action layout.

## Stats

1133 passed, 4 skipped, 3 xfailed (up from 1126 baseline).  Lint
clean, mkdocs --strict clean.
@iskandr
Copy link
Copy Markdown
Contributor Author

iskandr commented Apr 15, 2026

Responding to the three questions plus the agent-review items — all in 4f66745.

Your feedback

Dropped `_stdout` suffix from CLI format names. New choices: `topiary_output`, `mhcflurry`, `tsv`, `netmhcpan`, `netmhc`, `netmhccons`, `netmhciipan`, `netmhcstabpan`. Cleaner. (Python API classmethod names kept their `_stdout` suffix to leave room for future `_xls` parsers.)

`--mhc-cache-format` is now optional when content is self-identifying. New `_sniff_format(path)` helper detects:

  • NetMHCpan / NetMHC / NetMHCstabpan / NetMHCcons / NetMHCIIpan from the "version X.Yb" preamble lines
  • mhcflurry from `mhcflurry_*` column names in the first line
  • topiary's own output from `prediction_method_name` + `predictor_version` columns
  • Parquet via magic bytes (`PAR1`)

Only the generic `tsv` format still needs an explicit `--mhc-cache-format` (nothing to sniff). Error message for the unsniffable case is actionable.

# Sniffed — no --mhc-cache-format needed
topiary --peptide-csv peptides.csv \\
    --mhc-cache-file netmhcpan_run.out \\
    --output-csv results.csv

Version specification via CLI — already supported, now clarified in help text:

  • `--mhc-cache-predictor-version` sets the identity stamp. Auto-parsed from the NetMHC stdout preamble, or auto-composed from the local mhcflurry install, when omitted.
  • `--mhc-cache-netmhc-version` picks between NetMHC classic parser variants (3 / 4 / 4.1). Same for `--mhc-cache-netmhciipan-version` (legacy / 4 / 4.3).

So version is fully discoverable from the file for the NetMHC family: the preamble supplies both the tool identity (for sniffing) and the version stamp (for the invariant).

Bonus: found and fixed a real bug

The filter/sort integration test from the agent review ("do `--filter-by` / `--sort-by` work on cached output?") surfaced a bug: cached-predictor loaders weren't stamping the `kind` column, so DSL expressions like `Affinity.value <= 500` (which dispatches on `kind == "pMHC_affinity"`) matched zero rows silently. Fixed by requiring `_bindings_to_dataframe(..., kind=...)` and having each loader pass the right value (`pMHC_affinity` / `pMHC_presentation` / `pMHC_stability` by tool + mode).

Review-agent items resolved

  • Multi-allele fixtures now have 3 strict-xfail tests cross-linked to mhctools#195 — flip to passing on upstream fix.
  • CLI `--mhc-cache-directory` sharding test added.
  • CLI `--filter-by` / `--sort-by` interaction tests added.
  • Fixture-pinning docstring explains version-drift semantics.
  • `_actions` mutation comment explains the argparse-internals access.

1133 tests pass (up from 1126), lint + mkdocs clean.

## Blocker: stale '_stdout'-suffixed help strings

The prior rename missed four CLI help strings that still referenced
the old 'netmhcpan_stdout' / 'netmhc_stdout' / 'netmhciipan_stdout'
format names (the actual choices are now the unsuffixed forms).
Renamed.  Functional behavior unchanged; user-visible help text now
matches the --mhc-cache-format choices.

## Should fix: TSV loader now stamps 'kind'

from_tsv() was silently dropping the kind column (not in
_CACHE_COLUMNS' required list, never populated), so DSL expressions
like Affinity.value <= 500 on a TSV-loaded cache matched zero rows —
same class of bug the previous round fixed for the NetMHC-family
loaders.

Fix: from_tsv gains a kind="pMHC_affinity" kwarg (default).  If the
TSV file itself has a kind column, it wins; otherwise the kwarg value
is stamped on every row.  CLI exposes the knob as
--mhc-cache-tsv-kind with choices {pMHC_affinity, pMHC_presentation,
pMHC_stability, antigen_processing}, default pMHC_affinity.

Two new regression tests:

- test_tsv_with_column_mapping now asserts
  row['kind'] == 'pMHC_affinity' (default).
- test_tsv_explicit_kind covers --mhc-cache-tsv-kind=pMHC_stability
  for stability predictor output.

## Nit: deleted the unconditionally-skipped obsolete test

test_cache_file_without_format_raises was superseded by
test_tsv_not_autodetected and kept as pytest.skip(...) body.
Unconditional skips add noise to test reports.  Deleted outright.

Full suite 1134 pass (up from 1133 on the last push).  Lint clean.
iskandr added 2 commits April 15, 2026 15:20
Back out the half-done multi-kind explode for mhcflurry in this PR
because the cache's internal index keys on (peptide, allele,
peptide_length) — storing multiple kinds under the same key collapses
to the last write.  Supporting multi-kind cleanly means expanding the
key to a 4-tuple across _build_index / predict_peptides_dataframe /
_fallback_resolve / concat, a non-trivial refactor that doesn't
belong in this CLI-focused PR.

from_mhcflurry reverts to the single-kind-picking heuristic with a
TODO comment cross-referencing the follow-up issue.  Test renamed
back to test_from_mhcflurry with a TODO note on the limitation.

Tracked as #137.

Full suite 1134 pass, lint clean.
CachedPredictor now keys internally on
(peptide, allele, peptide_length, kind) instead of the 3-tuple, so a
single cache holds every kind mhcflurry's class1_presentation pipeline
or NetMHCpan's -BA flag emits — affinity + presentation + processing
rows coexist under the same (peptide, allele) without collisions.

## Loader changes

**from_mhcflurry** now explodes wide-format CSVs into one row per
(peptide, allele, kind):

  mhcflurry_affinity / _affinity_percentile   → pMHC_affinity row
  mhcflurry_presentation_score / _percentile  → pMHC_presentation row
  mhcflurry_processing_score                  → antigen_processing row

Rows whose columns for a given kind are all NaN are skipped, so
affinity-only and presentation-only CSVs still produce sensible
single-kind output.  n_flank / c_flank / source_sequence_name /
peptide_offset / sample_name columns are preserved per-kind.

**from_netmhcpan_stdout** switches from parse_netmhcpan_stdout (kind-
collapses via mode=) to parse_netmhcpan_to_preds which returns every
Prediction in the stdout.  A NetMHCpan -BA run now lands in the cache
as affinity + presentation rows — users no longer pick one at load
time.  Dropped the mode= kwarg entirely (no longer meaningful).

**from_tsv** now requires a 'kind' column on every row (either natively
or via the `columns=` mapping).  Multi-kind TSVs work: add a kind
column, list one value per row.  Clearer than the --mhc-cache-tsv-kind
crutch.

## CLI changes

- Dropped --mhc-cache-tsv-kind (superseded by the required kind column
  on TSV input).
- Dropped --mhc-cache-netmhcpan-mode (parse_netmhcpan_to_preds returns
  everything; no mode pick needed).

## Schema additions

_CACHE_COLUMNS now carries n_flank, c_flank, source_sequence_name,
peptide_offset, sample_name when the source file provides them.
Optional — absent → None.  Round-trips through save/from_topiary_output
intact.

## Version invariant

Unchanged.  Every row of a cache still shares exactly one
(prediction_method_name, predictor_version) pair — multi-kind just
means the pair covers all the kinds that predictor emits.

## Tests

- TestLoaders gains affinity-only / affinity+presentation /
  full-presentation-pipeline mhcflurry tests.
- TestNetMHCLoaders asserts the 2× row count for NetMHCpan -BA.
- TestCachedPredictorCliTsv gains test_tsv_multi_kind_in_one_file.
- Existing tests updated to use the 4-tuple index key where they
  probed _index directly, and to expect multi-kind outputs where
  applicable (filter/sort, auto-sniff, CLI happy paths).

1137 tests pass (up from 1133), lint + mkdocs clean.

Closes #137.
@iskandr
Copy link
Copy Markdown
Contributor Author

iskandr commented Apr 15, 2026

Shipped multi-kind support in `ff1fe12` (closes #137).

What changed

`CachedPredictor` now keys internally on `(peptide, allele, peptide_length, kind)` instead of the 3-tuple, so a single cache holds every kind mhcflurry's class1_presentation pipeline or NetMHCpan `-BA` emits — no more silent data loss.

mhcflurry

`from_mhcflurry` explodes a wide-format CSV into one row per (peptide, allele, kind):

```
mhcflurry_affinity / _affinity_percentile → pMHC_affinity row
mhcflurry_presentation_score / _percentile → pMHC_presentation row
mhcflurry_processing_score → antigen_processing row
```

Rows whose columns for a given kind are all NaN are skipped — affinity-only and presentation-only CSVs still produce sensible single-kind output.

NetMHCpan

`from_netmhcpan_stdout` now uses `parse_netmhcpan_to_preds` (which returns every Prediction in the stdout) instead of the kind-collapsing `parse_netmhcpan_stdout`. A `-BA` run lands in the cache with both affinity + presentation rows per (peptide, allele) — users no longer pick one at load time. Dropped the `mode=` kwarg entirely.

Provenance columns preserved

`n_flank`, `c_flank`, `source_sequence_name`, `peptide_offset`, `sample_name` are now carried through when the source file has them. Same schema across loaders; round-trips cleanly through `save` / `from_topiary_output`.

TSV loader

`from_tsv` now requires a `kind` column per row (natively or via `columns=` mapping). Multi-kind TSVs work: add the column, list one kind per row.

CLI simplification

Dropped `--mhc-cache-tsv-kind` (superseded by the required `kind` column) and `--mhc-cache-netmhcpan-mode` (no longer meaningful — we return every kind).

Version invariant

Unchanged. Every row of a cache shares exactly one `(prediction_method_name, predictor_version)` pair. "Multi-kind" just means that pair covers all the kinds that predictor emits in one run.

Tests

1137 total (up from 1133):

  • `TestLoaders`: affinity-only / affinity+presentation / full-pipeline mhcflurry tests.
  • `TestNetMHCLoaders::test_from_netmhcpan_stdout`: expects 4 rows (2 peptides × 2 kinds).
  • `TestCachedPredictorCliTsv::test_tsv_multi_kind_in_one_file`: multi-kind TSV end-to-end through the CLI.
  • Existing tests updated for the 4-tuple index key and multi-kind outputs.

Lint + mkdocs clean.

Closes #137.

iskandr added 2 commits April 15, 2026 16:06
Flanks affect predictions for mhcflurry's antigen_processing and
pMHC_presentation kinds — the same peptide at two different protein
positions (different flanking context) produces different scores.
The previous 4-tuple key (peptide, allele, length, kind) silently
collapsed those distinct predictions.

Key is now a 6-tuple (peptide, allele, length, kind, n_flank,
c_flank).  None flanks (from predictors / sources that don't use
them) coexist cleanly — they just produce a single (None, None)
entry per (peptide, allele, kind).

## Changes

- _KEY_COLS gains n_flank, c_flank.
- _build_index and _fallback_resolve's row-key helpers use the
  6-tuple.
- _normalize backfills n_flank / c_flank with None when absent so
  every row has a well-defined key shape.
- _resolve_overlaps' callable path uses groupby(..., dropna=False)
  so None-flank rows don't get silently dropped by the default
  groupby NaN behavior.
- predict_peptides_dataframe (which receives peptide strings without
  flank context) iterates the index with a 3-element prefix match
  and returns every kind + flank combo.  A new _lookup_by_prefix
  helper encapsulates this.
- A new _flank_key module helper normalizes NaN / None / "" into a
  single None sentinel so the hash-key is stable across
  round-trip / TSV / parquet differences in null representation.

## Tests

New TestLoaders::test_from_mhcflurry_same_peptide_different_flanks
covers the key differentiator: two mhcflurry processing predictions
at the same (peptide, allele) but different flanks both survive
caching.  Also updates the _index direct-probe tests to use 6-tuple
keys (existing test fixtures have None flanks).

1137 pass, lint + mkdocs clean.
Simpler than None/NaN handling:
- No NaN-vs-None hash ambiguity in tuple keys.
- No special groupby(..., dropna=False) — "" is a regular string
  that pandas groups without surprises.
- _flank_key coerces NaN / None / whitespace all to "" on normalize,
  so TSV / Parquet round-trips stay stable.
- _normalize fills missing flank columns with "" and passes existing
  values through _flank_key (uppercase + strip + NaN coercion) so
  every row in the cache ends up with consistent flank shapes.

Test-level _index probes updated from None to "" for flank
components.  No behavior change beyond the sentinel value.
@iskandr
Copy link
Copy Markdown
Contributor Author

iskandr commented Apr 15, 2026

Vaxrank-consumer review — PR #136

Read the diff. Vaxrank itself won't feel the CLI surface directly (vaxrank uses topiary's Python API, not topiary --mhc-cache-*), but the infrastructure here — real-fixture parser coverage, version-header regex expansion, sharding plumbed to CLI — is the validation layer that makes the Python-side CachedPredictor adoption in vaxrank#227 lower-risk. Concrete feedback below.

What's good

  • Format auto-detection (_sniff_format) with "more-specific tool name first" ordering (NetMHCstabpan → NetMHCIIpan → NetMHCcons → NetMHCpan → NetMHC) is exactly the right precedence. Users passing netmhcstabpan.out won't accidentally get mis-classified as NetMHC.
  • Real-fixture tests — asserting SLLQHLIGL × A*02:01 = 8.82 nM / 0.105%rank through the actual parser is much stronger than synthetic input. This is how parser tests should work.
  • Version-header regex window 2000 → 10000 chars — a sensible fix. NetMHCpan's argument-dump preamble really does push the version line past 2000 chars in the real captures I've seen in my own pipelines.
  • Directory sharding via --mhc-cache-directory-pattern composes cleanly with the Add CachedPredictor sharding — v5.6.0 (closes #128) #133 from_directory work. Users can do topiary --mhc-cache-directory cache/ --mhc-cache-directory-pattern '*.parquet' and it Just Works.

Concerns / suggestions

1. Mutating arg_parser._actions to flip required=False on mhctools flags is a real hack. The comment acknowledges it, but in practice:

  • If mhctools changes how add_mhc_args registers its options (renames, adds subparsers, migrates to a decorator), this silently fails open — cached-predictions callers will hit mhctools requiring --mhc-predictor again without topiary noticing.
  • A CI regression test that runs topiary --mhc-cache-file fixtures/... --peptide-csv ... without --mhc-predictor would at least fail loudly if the hack stops working.

Alternative: upstream a keyword arg to mhctools.cli.add_mhc_args(required=False) and pass it through. A few lines in mhctools, cleaner contract, no private-attribute access. Worth filing if you agree.

2. _sniff_format reads the file twice. Opens binary (first 4 bytes for Parquet magic), then opens text (first 10000 chars). Open once in binary, check magic, decode the rest of the buffer if not Parquet. Minor IO/clarity win, and it eliminates the race where a file could change between the two opens.

3. Format sniffing has no tests. The CLI tests pass explicit --mhc-cache-format every time. The sniffing branches — Parquet magic, NetMHC-family preambles, mhcflurry/topiary-output column detection — are entirely untested. Given how much depends on this (users won't pass --mhc-cache-format for the common case), a parameterized test running each fixture through _sniff_format() and asserting the right format-name comes back would pin this down cheaply.

4. Multi-allele NetMHCpan fixtures are aspirational. The PR commits netmhcpan_41_SLLQHLIGL.out (multi-allele, no _AXXXX suffix) but notes they don't parse today due to mhctools#195. Two options:

  • Mark them explicitly as "parsing blocked on mhctools#195" in a README or xfail-marked test so future contributors don't treat them as working fixtures.
  • Delete until mhctools#195 lands. Dead fixtures rot.

5. _CACHE_FORMATS includes netmhciipan but there's no NetMHCIIpan fixture. If from_netmhciipan_stdout exists and is in the CLI whitelist, at least one fixture file (even a minimal one) would complete the coverage. If no parser exists yet, remove from _CACHE_FORMATS until it does.

6. --mhc-cache-netmhciipan-version defaults to "4.3". Confirming: is NetMHCIIpan 4.3 actually released, or is "4" the current stable? The current choices=("legacy", "4", "4.3") enumerates three options — worth a docstring line clarifying what distinguishes "4" from "4.3" so users pick the right one without digging into the parser source.

7. --mhc-cache-tsv-column repeatable flag vs a single comma-joined string. --mhc-cache-tsv-columns affinity=IC50,percentile_rank=Rank halves the keystrokes. Minor UX choice — the repeatable form is also fine, but worth documenting the pattern in the error message when users mistype (expects KEY=VALUE; got 'affinity=IC50,percentile_rank=Rank' is helpful if they go the comma route accidentally).

Vaxrank-side implication

The headline win for vaxrank adoption of CachedPredictor (issue #227 item #3 equivalent) is "pre-run NetMHCpan offline, pipe the stdout files straight into vaxrank." That path now exists end-to-end once this PR lands. The multi-allele fixture blocker (mhctools#195) matters because vaxrank users frequently run multi-allele NetMHCpan batches — worth tracking as a separate dependency.

Overall: approve after (3) format-sniffer tests land. (1) is a real-but-acceptable hack for this PR; fix when mhctools can absorb a keyword. (4–7) are polish.

## Stale docs cleaned

- docs/cached.md: removed the stale `mode="binding_affinity"` param
  from the from_netmhcpan_stdout example — that kwarg no longer exists.
  Added a note explaining multi-kind behavior for -BA runs.  Removed
  the stale --mhc-cache-netmhcpan-mode row from the CLI flag table.
- docs/api.md: updated the CachedPredictor.from_netmhcpan_stdout entry
  to drop mode= and explain multi-kind output.

## Multi-allele fixtures promoted

The three tests in TestMultiAlleleFixturesKnownBroken are now XPASS
(strict) — they pass because from_netmhcpan_stdout now uses
parse_netmhcpan_to_preds, which handles multi-allele stdout correctly
(the old parse_netmhcpan_stdout crashed on per-allele header lines,
which was mhctools#195).

Renamed the class from TestMultiAlleleFixturesKnownBroken to
TestMultiAlleleFixtures and dropped all xfail markers.  Added a
richer assertion for NetMHCpan 4.1 multi-allele: 3 alleles × 2 kinds
= 6 rows expected.

## Net

0 xfail remaining (were 3 strict xfail, now 0).  1141 tests pass
(up from 1138), lint clean, mkdocs clean.
@iskandr iskandr merged commit d17db1f into master Apr 16, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants