CachedPredictor CLI + real NetMHC fixtures + README reorder by iskandr · Pull Request #136 · openvax/topiary

iskandr · 2026-04-15T17:49:25Z

Summary

Three related changes on the cached-predictions story (orthogonal to PR #135 which is still in review on the SelfProteome side):

CLI support for CachedPredictor. New --mhc-cache-* argument group lets users run topiary entirely from pre-computed prediction files without invoking a live MHC predictor.
Real NetMHC-family fixtures captured from the netmhc-bundle binaries for peptide SLLQHLIGL at HLA-A*02:01 / A*24:02 / B*07:02 (substituted for B*07:01 which NetMHCpan doesn't recognize). 15 new tests exercise the full parsing + CLI chain against real data.
README section reorder — "MHC prediction models" moved near the top; "Cached predictions" moved to the end.

CLI shape

topiary --peptide-csv peptides.csv \
    --mhc-cache-file netmhcpan_run.out \
    --mhc-cache-format netmhcpan_stdout \
    --output-csv results.csv

Supported formats (via --mhc-cache-format): topiary_output, mhcflurry, tsv, netmhcpan_stdout, netmhc_stdout, netmhcpan_cons_stdout, netmhciipan_stdout, netmhcstabpan_stdout. Plus --mhc-cache-directory with --mhc-cache-directory-pattern for sharded caches.

--mhc-predictor / --mhc-alleles become optional when a cache is supplying predictions. Validation covers missing format, mutually-exclusive file vs directory, and no-source-at-all cases.

Full flag reference is documented in docs/cached.md.

Fixtures

tests/data/netmhc_fixtures/:

netmhcpan_41_SLLQHLIGL_{A0201,A2402,B0702}.out — NetMHCpan 4.1b stdout, single-allele per file.
netmhcpan_40_SLLQHLIGL_*.out — NetMHCpan 4.0 stdout.
netmhc_40_SLLQHLIGL.out — NetMHC classic 4.0 (multi-allele works).
netmhcstabpan_SLLQHLIGL_*.out — NetMHCstabpan single-allele.
Multi-allele NetMHCpan and stabpan files are also committed but don't parse today due to mhctools' per-allele-header-line bug, filed as parse_netmhcpan_stdout crashes on multi-allele NetMHCpan output mhctools#195.

NetMHCcons skipped — the binary calls python2 which isn't on PATH here.

Minor fix

topiary/cached.py: expanded the version-header regex search window from 2000 to 10000 chars. NetMHCpan preambles with full argument dumps push the # NetMHCpan version X.Yb line past 2000 chars (real captures show it at char ~2700), which was making predictor_version default to "unknown" instead of parsing from the header.

Test plan

6 new TestRealNetMHCFixtures tests in tests/test_cached_predictor.py assert actual numeric predictions (e.g. SLLQHLIGL × A*02:01 = 8.82 nM, 0.105%rank) through each from_*_stdout loader.
9 new tests in `tests/test_cli_cached_predictor.py` exercise the CLI end-to-end — error paths, happy paths for netmhcpan/netmhc/netmhcstabpan/topiary-output/tsv formats, explicit version override, TSV column mapping.
./test.sh — 1126 passed, 3 skipped.
./lint.sh — clean.
mkdocs build --strict — clean.
CI green.

## CLI New `--mhc-cache-*` argument group lets users run topiary entirely from pre-computed prediction files, without invoking a live MHC predictor. Every CachedPredictor loader has a CLI path: topiary --peptide-csv peptides.csv \ --mhc-cache-file netmhcpan_run.out \ --mhc-cache-format netmhcpan_stdout \ --output-csv results.csv Flags: --mhc-cache-file, --mhc-cache-directory (+ pattern), --mhc-cache-format (8 choices), --mhc-cache-predictor-name / --mhc-cache-predictor-version overrides, --mhc-cache-tsv-column (repeatable), --mhc-cache-tsv-sep, plus format-specific mode and version knobs for the NetMHC family. --mhc-predictor / --mhc-alleles become optional when a cache is supplying predictions. A combined --mhc-cache-file + --mhc-cache-directory raises; missing --mhc-cache-format with a --mhc-cache-file raises; etc. — all validated with actionable messages. ## Real NetMHC fixtures tests/data/netmhc_fixtures/ now holds captured stdout from the netmhc-bundle binaries for peptide SLLQHLIGL against HLA-A*02:01, HLA-A*24:02, HLA-B*07:02 (substituted for B*07:01 which isn't in NetMHCpan's allele list): - netmhcpan-4.1 and 4.0 output (single-allele per file) - netmhc-4.0 output (multi-allele works) - netmhcstabpan output (single-allele) Multi-allele NetMHCpan outputs are also committed but currently don't parse through mhctools; filed as openvax/mhctools#195 with the root-cause analysis (per-allele "Distance to training data" header lines confuse the row parser). NetMHCcons skipped because the binary requires python2 which isn't on this machine's PATH. ## Tests - 6 new TestRealNetMHCFixtures tests assert the actual numeric predictions (e.g. SLLQHLIGL × HLA-A*02:01 = 8.82 nM, 0.105%rank) parse through each from_*_stdout loader. - 9 new tests in tests/test_cli_cached_predictor.py exercise the CLI end-to-end: error paths, happy paths for netmhcpan / netmhc / netmhcstabpan / topiary-output / tsv formats, explicit version override, TSV column mapping. 1126 tests pass (up from 1111 on this branch's base), lint clean. ## Minor fix topiary/cached.py: expanded the stdout-header version-regex search window from 2000 to 10000 chars. NetMHCpan preambles with full argument dumps push the "# NetMHCpan version X.Yb" line past 2000 chars, which was causing predictor_version to default to "unknown" instead of parsing from the header. Real captured fixtures showed the version line at char ~2700. ## README reorder Moved "MHC prediction models" section from line 481 to just after "Predicting MHC binding" at line 66 — it's more fundamental for readers than where to find it buried. Moved "Cached predictions" from line 442 to the end of the README — it's an advanced/optional feature that belongs after the basics. New order: Installation → Predicting MHC binding → MHC prediction models → Filtering and ranking → Expression data → CLI usage → Output → Peptide properties → Protein sources → Cached predictions. ## Docs docs/cached.md gains a "From the CLI" section with a full flag table and per-format command examples. README's Cached predictions section shows a one-liner CLI example with a link to the full reference.

coveralls · 2026-04-15T17:58:13Z

coverage: 88.202% (-0.04%) from 88.246% — cached-predictor-cli into master

iskandr · 2026-04-15T18:03:18Z

Why do you need _stdout at the end of all these?

netmhcpan_stdout, netmhc_stdout, netmhcpan_cons_stdout, netmhciipan_stdout, netmhcstabpan_stdout

iskandr · 2026-04-15T18:04:47Z

And is there a CLI option for version specification?

And can the format and version both be sniffed from the text file?

## Bug fix (critical) Cached-predictor loaders were not stamping the `kind` column, so DSL expressions like `Affinity.value <= 500` (which dispatches on `kind == "pMHC_affinity"`) matched zero rows. Surfaced by a new CLI filter/sort test. Fix: `_bindings_to_dataframe(preds, *, kind)` now requires a kind argument; each `from_*_stdout` loader passes the right value (`pMHC_affinity` for binding-affinity mode, `pMHC_presentation` for elution-score mode, `pMHC_stability` for NetMHCstabpan, inferred from mhcflurry column signatures for `from_mhcflurry`). ## User feedback - **Drop `_stdout` suffix from CLI format names.** `netmhcpan_stdout` → `netmhcpan`, etc. The suffix was noise; users never wondered if it was stdout vs .xls because we only ship stdout today. Internal loader classmethod names kept their `_stdout` suffix for the separation between stdout and future .xls parsers in the Python API. - **Auto-sniff `--mhc-cache-format` from file content.** New `_sniff_format(path)` helper detects NetMHCpan / NetMHC / NetMHCstabpan / NetMHCcons / NetMHCIIpan from stdout preamble lines, mhcflurry from `mhcflurry_*` column names, topiary_output from `prediction_method_name` + `predictor_version` column headers, and parquet via magic bytes. `--mhc-cache-format` becomes optional; only the generic `tsv` path still strictly requires it (nothing to sniff). - **Version specification clarified.** Existing flags kept: `--mhc-cache-predictor-version` for the identity stamp (auto-parsed from stdout preamble, auto-composed for mhcflurry); tool-version flags `--mhc-cache-netmhc-version` and `--mhc-cache-netmhciipan-version` pick between version-specific parsers when multiple are supported. ## Review feedback - **Multi-allele fixtures get xfail tests** (blocker from agent review). 3 new strict-xfail tests in `TestMultiAlleleFixturesKnownBroken` document the expected mhctools#195 fix; they'll flip to passing when upstream lands, producing a reviewable diff that promotes them to happy-path tests. - **CLI directory sharding test** (should fix). `TestCachedPredictorCliSharding::test_mhc_cache_directory_merges_shards` exercises the full CLI path through `--mhc-cache-directory` with a pattern glob. - **CLI filter/sort interaction tests** (should fix). Two tests verify `--filter-by` / `--sort-by` work against cached-prediction output end-to-end. - **Fixture-pinning docstring** (nit #4). `tests/test_cached_predictor.py` notes that numeric assertions pin specific NetMHC versions and headers carry machine-specific comment lines that don't affect parsing. - **`_actions` mutation comment** (nit #5). Explains why reaching into argparse internals is safe and what would happen if mhctools changes its action layout. ## Stats 1133 passed, 4 skipped, 3 xfailed (up from 1126 baseline). Lint clean, mkdocs --strict clean.

iskandr · 2026-04-15T18:52:22Z

Responding to the three questions plus the agent-review items — all in 4f66745.

Your feedback

Dropped `_stdout` suffix from CLI format names. New choices: `topiary_output`, `mhcflurry`, `tsv`, `netmhcpan`, `netmhc`, `netmhccons`, `netmhciipan`, `netmhcstabpan`. Cleaner. (Python API classmethod names kept their `_stdout` suffix to leave room for future `_xls` parsers.)

`--mhc-cache-format` is now optional when content is self-identifying. New `_sniff_format(path)` helper detects:

NetMHCpan / NetMHC / NetMHCstabpan / NetMHCcons / NetMHCIIpan from the "version X.Yb" preamble lines
mhcflurry from `mhcflurry_*` column names in the first line
topiary's own output from `prediction_method_name` + `predictor_version` columns
Parquet via magic bytes (`PAR1`)

Only the generic `tsv` format still needs an explicit `--mhc-cache-format` (nothing to sniff). Error message for the unsniffable case is actionable.

# Sniffed — no --mhc-cache-format needed
topiary --peptide-csv peptides.csv \\
    --mhc-cache-file netmhcpan_run.out \\
    --output-csv results.csv

Version specification via CLI — already supported, now clarified in help text:

`--mhc-cache-predictor-version` sets the identity stamp. Auto-parsed from the NetMHC stdout preamble, or auto-composed from the local mhcflurry install, when omitted.
`--mhc-cache-netmhc-version` picks between NetMHC classic parser variants (3 / 4 / 4.1). Same for `--mhc-cache-netmhciipan-version` (legacy / 4 / 4.3).

So version is fully discoverable from the file for the NetMHC family: the preamble supplies both the tool identity (for sniffing) and the version stamp (for the invariant).

Bonus: found and fixed a real bug

The filter/sort integration test from the agent review ("do `--filter-by` / `--sort-by` work on cached output?") surfaced a bug: cached-predictor loaders weren't stamping the `kind` column, so DSL expressions like `Affinity.value <= 500` (which dispatches on `kind == "pMHC_affinity"`) matched zero rows silently. Fixed by requiring `_bindings_to_dataframe(..., kind=...)` and having each loader pass the right value (`pMHC_affinity` / `pMHC_presentation` / `pMHC_stability` by tool + mode).

Review-agent items resolved

Multi-allele fixtures now have 3 strict-xfail tests cross-linked to mhctools#195 — flip to passing on upstream fix.
CLI `--mhc-cache-directory` sharding test added.
CLI `--filter-by` / `--sort-by` interaction tests added.
Fixture-pinning docstring explains version-drift semantics.
`_actions` mutation comment explains the argparse-internals access.

1133 tests pass (up from 1126), lint + mkdocs clean.

## Blocker: stale '_stdout'-suffixed help strings The prior rename missed four CLI help strings that still referenced the old 'netmhcpan_stdout' / 'netmhc_stdout' / 'netmhciipan_stdout' format names (the actual choices are now the unsuffixed forms). Renamed. Functional behavior unchanged; user-visible help text now matches the --mhc-cache-format choices. ## Should fix: TSV loader now stamps 'kind' from_tsv() was silently dropping the kind column (not in _CACHE_COLUMNS' required list, never populated), so DSL expressions like Affinity.value <= 500 on a TSV-loaded cache matched zero rows — same class of bug the previous round fixed for the NetMHC-family loaders. Fix: from_tsv gains a kind="pMHC_affinity" kwarg (default). If the TSV file itself has a kind column, it wins; otherwise the kwarg value is stamped on every row. CLI exposes the knob as --mhc-cache-tsv-kind with choices {pMHC_affinity, pMHC_presentation, pMHC_stability, antigen_processing}, default pMHC_affinity. Two new regression tests: - test_tsv_with_column_mapping now asserts row['kind'] == 'pMHC_affinity' (default). - test_tsv_explicit_kind covers --mhc-cache-tsv-kind=pMHC_stability for stability predictor output. ## Nit: deleted the unconditionally-skipped obsolete test test_cache_file_without_format_raises was superseded by test_tsv_not_autodetected and kept as pytest.skip(...) body. Unconditional skips add noise to test reports. Deleted outright. Full suite 1134 pass (up from 1133 on the last push). Lint clean.

Back out the half-done multi-kind explode for mhcflurry in this PR because the cache's internal index keys on (peptide, allele, peptide_length) — storing multiple kinds under the same key collapses to the last write. Supporting multi-kind cleanly means expanding the key to a 4-tuple across _build_index / predict_peptides_dataframe / _fallback_resolve / concat, a non-trivial refactor that doesn't belong in this CLI-focused PR. from_mhcflurry reverts to the single-kind-picking heuristic with a TODO comment cross-referencing the follow-up issue. Test renamed back to test_from_mhcflurry with a TODO note on the limitation. Tracked as #137. Full suite 1134 pass, lint clean.

CachedPredictor now keys internally on (peptide, allele, peptide_length, kind) instead of the 3-tuple, so a single cache holds every kind mhcflurry's class1_presentation pipeline or NetMHCpan's -BA flag emits — affinity + presentation + processing rows coexist under the same (peptide, allele) without collisions. ## Loader changes **from_mhcflurry** now explodes wide-format CSVs into one row per (peptide, allele, kind): mhcflurry_affinity / _affinity_percentile → pMHC_affinity row mhcflurry_presentation_score / _percentile → pMHC_presentation row mhcflurry_processing_score → antigen_processing row Rows whose columns for a given kind are all NaN are skipped, so affinity-only and presentation-only CSVs still produce sensible single-kind output. n_flank / c_flank / source_sequence_name / peptide_offset / sample_name columns are preserved per-kind. **from_netmhcpan_stdout** switches from parse_netmhcpan_stdout (kind- collapses via mode=) to parse_netmhcpan_to_preds which returns every Prediction in the stdout. A NetMHCpan -BA run now lands in the cache as affinity + presentation rows — users no longer pick one at load time. Dropped the mode= kwarg entirely (no longer meaningful). **from_tsv** now requires a 'kind' column on every row (either natively or via the `columns=` mapping). Multi-kind TSVs work: add a kind column, list one value per row. Clearer than the --mhc-cache-tsv-kind crutch. ## CLI changes - Dropped --mhc-cache-tsv-kind (superseded by the required kind column on TSV input). - Dropped --mhc-cache-netmhcpan-mode (parse_netmhcpan_to_preds returns everything; no mode pick needed). ## Schema additions _CACHE_COLUMNS now carries n_flank, c_flank, source_sequence_name, peptide_offset, sample_name when the source file provides them. Optional — absent → None. Round-trips through save/from_topiary_output intact. ## Version invariant Unchanged. Every row of a cache still shares exactly one (prediction_method_name, predictor_version) pair — multi-kind just means the pair covers all the kinds that predictor emits. ## Tests - TestLoaders gains affinity-only / affinity+presentation / full-presentation-pipeline mhcflurry tests. - TestNetMHCLoaders asserts the 2× row count for NetMHCpan -BA. - TestCachedPredictorCliTsv gains test_tsv_multi_kind_in_one_file. - Existing tests updated to use the 4-tuple index key where they probed _index directly, and to expect multi-kind outputs where applicable (filter/sort, auto-sniff, CLI happy paths). 1137 tests pass (up from 1133), lint + mkdocs clean. Closes #137.

iskandr · 2026-04-15T19:43:38Z

Shipped multi-kind support in `ff1fe12` (closes #137).

What changed

`CachedPredictor` now keys internally on `(peptide, allele, peptide_length, kind)` instead of the 3-tuple, so a single cache holds every kind mhcflurry's class1_presentation pipeline or NetMHCpan `-BA` emits — no more silent data loss.

mhcflurry

`from_mhcflurry` explodes a wide-format CSV into one row per (peptide, allele, kind):

```
mhcflurry_affinity / _affinity_percentile → pMHC_affinity row
mhcflurry_presentation_score / _percentile → pMHC_presentation row
mhcflurry_processing_score → antigen_processing row
```

Rows whose columns for a given kind are all NaN are skipped — affinity-only and presentation-only CSVs still produce sensible single-kind output.

NetMHCpan

`from_netmhcpan_stdout` now uses `parse_netmhcpan_to_preds` (which returns every Prediction in the stdout) instead of the kind-collapsing `parse_netmhcpan_stdout`. A `-BA` run lands in the cache with both affinity + presentation rows per (peptide, allele) — users no longer pick one at load time. Dropped the `mode=` kwarg entirely.

Provenance columns preserved

`n_flank`, `c_flank`, `source_sequence_name`, `peptide_offset`, `sample_name` are now carried through when the source file has them. Same schema across loaders; round-trips cleanly through `save` / `from_topiary_output`.

TSV loader

`from_tsv` now requires a `kind` column per row (natively or via `columns=` mapping). Multi-kind TSVs work: add the column, list one kind per row.

CLI simplification

Dropped `--mhc-cache-tsv-kind` (superseded by the required `kind` column) and `--mhc-cache-netmhcpan-mode` (no longer meaningful — we return every kind).

Version invariant

Unchanged. Every row of a cache shares exactly one `(prediction_method_name, predictor_version)` pair. "Multi-kind" just means that pair covers all the kinds that predictor emits in one run.

Tests

1137 total (up from 1133):

`TestLoaders`: affinity-only / affinity+presentation / full-pipeline mhcflurry tests.
`TestNetMHCLoaders::test_from_netmhcpan_stdout`: expects 4 rows (2 peptides × 2 kinds).
`TestCachedPredictorCliTsv::test_tsv_multi_kind_in_one_file`: multi-kind TSV end-to-end through the CLI.
Existing tests updated for the 4-tuple index key and multi-kind outputs.

Lint + mkdocs clean.

Closes #137.

Flanks affect predictions for mhcflurry's antigen_processing and pMHC_presentation kinds — the same peptide at two different protein positions (different flanking context) produces different scores. The previous 4-tuple key (peptide, allele, length, kind) silently collapsed those distinct predictions. Key is now a 6-tuple (peptide, allele, length, kind, n_flank, c_flank). None flanks (from predictors / sources that don't use them) coexist cleanly — they just produce a single (None, None) entry per (peptide, allele, kind). ## Changes - _KEY_COLS gains n_flank, c_flank. - _build_index and _fallback_resolve's row-key helpers use the 6-tuple. - _normalize backfills n_flank / c_flank with None when absent so every row has a well-defined key shape. - _resolve_overlaps' callable path uses groupby(..., dropna=False) so None-flank rows don't get silently dropped by the default groupby NaN behavior. - predict_peptides_dataframe (which receives peptide strings without flank context) iterates the index with a 3-element prefix match and returns every kind + flank combo. A new _lookup_by_prefix helper encapsulates this. - A new _flank_key module helper normalizes NaN / None / "" into a single None sentinel so the hash-key is stable across round-trip / TSV / parquet differences in null representation. ## Tests New TestLoaders::test_from_mhcflurry_same_peptide_different_flanks covers the key differentiator: two mhcflurry processing predictions at the same (peptide, allele) but different flanks both survive caching. Also updates the _index direct-probe tests to use 6-tuple keys (existing test fixtures have None flanks). 1137 pass, lint + mkdocs clean.

Simpler than None/NaN handling: - No NaN-vs-None hash ambiguity in tuple keys. - No special groupby(..., dropna=False) — "" is a regular string that pandas groups without surprises. - _flank_key coerces NaN / None / whitespace all to "" on normalize, so TSV / Parquet round-trips stay stable. - _normalize fills missing flank columns with "" and passes existing values through _flank_key (uppercase + strip + NaN coercion) so every row in the cache ends up with consistent flank shapes. Test-level _index probes updated from None to "" for flank components. No behavior change beyond the sentinel value.

iskandr · 2026-04-15T22:32:29Z

Vaxrank-consumer review — PR #136

Read the diff. Vaxrank itself won't feel the CLI surface directly (vaxrank uses topiary's Python API, not topiary --mhc-cache-*), but the infrastructure here — real-fixture parser coverage, version-header regex expansion, sharding plumbed to CLI — is the validation layer that makes the Python-side CachedPredictor adoption in vaxrank#227 lower-risk. Concrete feedback below.

What's good

Format auto-detection (_sniff_format) with "more-specific tool name first" ordering (NetMHCstabpan → NetMHCIIpan → NetMHCcons → NetMHCpan → NetMHC) is exactly the right precedence. Users passing netmhcstabpan.out won't accidentally get mis-classified as NetMHC.
Real-fixture tests — asserting SLLQHLIGL × A*02:01 = 8.82 nM / 0.105%rank through the actual parser is much stronger than synthetic input. This is how parser tests should work.
Version-header regex window 2000 → 10000 chars — a sensible fix. NetMHCpan's argument-dump preamble really does push the version line past 2000 chars in the real captures I've seen in my own pipelines.
Directory sharding via --mhc-cache-directory-pattern composes cleanly with the Add CachedPredictor sharding — v5.6.0 (closes #128) #133 from_directory work. Users can do topiary --mhc-cache-directory cache/ --mhc-cache-directory-pattern '*.parquet' and it Just Works.

Concerns / suggestions

1. Mutating arg_parser._actions to flip required=False on mhctools flags is a real hack. The comment acknowledges it, but in practice:

If mhctools changes how add_mhc_args registers its options (renames, adds subparsers, migrates to a decorator), this silently fails open — cached-predictions callers will hit mhctools requiring --mhc-predictor again without topiary noticing.
A CI regression test that runs topiary --mhc-cache-file fixtures/... --peptide-csv ... without --mhc-predictor would at least fail loudly if the hack stops working.

Alternative: upstream a keyword arg to mhctools.cli.add_mhc_args(required=False) and pass it through. A few lines in mhctools, cleaner contract, no private-attribute access. Worth filing if you agree.

2. _sniff_format reads the file twice. Opens binary (first 4 bytes for Parquet magic), then opens text (first 10000 chars). Open once in binary, check magic, decode the rest of the buffer if not Parquet. Minor IO/clarity win, and it eliminates the race where a file could change between the two opens.

3. Format sniffing has no tests. The CLI tests pass explicit --mhc-cache-format every time. The sniffing branches — Parquet magic, NetMHC-family preambles, mhcflurry/topiary-output column detection — are entirely untested. Given how much depends on this (users won't pass --mhc-cache-format for the common case), a parameterized test running each fixture through _sniff_format() and asserting the right format-name comes back would pin this down cheaply.

4. Multi-allele NetMHCpan fixtures are aspirational. The PR commits netmhcpan_41_SLLQHLIGL.out (multi-allele, no _AXXXX suffix) but notes they don't parse today due to mhctools#195. Two options:

Mark them explicitly as "parsing blocked on mhctools#195" in a README or xfail-marked test so future contributors don't treat them as working fixtures.
Delete until mhctools#195 lands. Dead fixtures rot.

5. _CACHE_FORMATS includes netmhciipan but there's no NetMHCIIpan fixture. If from_netmhciipan_stdout exists and is in the CLI whitelist, at least one fixture file (even a minimal one) would complete the coverage. If no parser exists yet, remove from _CACHE_FORMATS until it does.

6. --mhc-cache-netmhciipan-version defaults to "4.3". Confirming: is NetMHCIIpan 4.3 actually released, or is "4" the current stable? The current choices=("legacy", "4", "4.3") enumerates three options — worth a docstring line clarifying what distinguishes "4" from "4.3" so users pick the right one without digging into the parser source.

7. --mhc-cache-tsv-column repeatable flag vs a single comma-joined string. --mhc-cache-tsv-columns affinity=IC50,percentile_rank=Rank halves the keystrokes. Minor UX choice — the repeatable form is also fine, but worth documenting the pattern in the error message when users mistype (expects KEY=VALUE; got 'affinity=IC50,percentile_rank=Rank' is helpful if they go the comma route accidentally).

Vaxrank-side implication

The headline win for vaxrank adoption of CachedPredictor (issue #227 item #3 equivalent) is "pre-run NetMHCpan offline, pipe the stdout files straight into vaxrank." That path now exists end-to-end once this PR lands. The multi-allele fixture blocker (mhctools#195) matters because vaxrank users frequently run multi-allele NetMHCpan batches — worth tracking as a separate dependency.

Overall: approve after (3) format-sniffer tests land. (1) is a real-but-acceptable hack for this PR; fix when mhctools can absorb a keyword. (4–7) are polish.

## Stale docs cleaned - docs/cached.md: removed the stale `mode="binding_affinity"` param from the from_netmhcpan_stdout example — that kwarg no longer exists. Added a note explaining multi-kind behavior for -BA runs. Removed the stale --mhc-cache-netmhcpan-mode row from the CLI flag table. - docs/api.md: updated the CachedPredictor.from_netmhcpan_stdout entry to drop mode= and explain multi-kind output. ## Multi-allele fixtures promoted The three tests in TestMultiAlleleFixturesKnownBroken are now XPASS (strict) — they pass because from_netmhcpan_stdout now uses parse_netmhcpan_to_preds, which handles multi-allele stdout correctly (the old parse_netmhcpan_stdout crashed on per-allele header lines, which was mhctools#195). Renamed the class from TestMultiAlleleFixturesKnownBroken to TestMultiAlleleFixtures and dropped all xfail markers. Added a richer assertion for NetMHCpan 4.1 multi-allele: 3 alleles × 2 kinds = 6 rows expected. ## Net 0 xfail remaining (were 3 strict xfail, now 0). 1141 tests pass (up from 1138), lint clean, mkdocs clean.

iskandr mentioned this pull request Apr 15, 2026

CachedPredictor: support multi-kind cache (affinity + presentation + processing per key) #137

Closed

iskandr added 2 commits April 15, 2026 15:20

iskandr added 2 commits April 15, 2026 16:06

iskandr merged commit d17db1f into master Apr 16, 2026
8 checks passed

iskandr added a commit that referenced this pull request Apr 16, 2026

Bump to v5.7.0 + CHANGELOG for CachedPredictor CLI + multi-kind (#136)

6bea6b0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CachedPredictor CLI + real NetMHC fixtures + README reorder#136

CachedPredictor CLI + real NetMHC fixtures + README reorder#136
iskandr merged 8 commits into
masterfrom
cached-predictor-cli

iskandr commented Apr 15, 2026 •

edited

Loading

Uh oh!

coveralls commented Apr 15, 2026 •

edited

Loading

Uh oh!

iskandr commented Apr 15, 2026

Uh oh!

iskandr commented Apr 15, 2026

Uh oh!

iskandr commented Apr 15, 2026

Uh oh!

iskandr commented Apr 15, 2026

Uh oh!

iskandr commented Apr 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

iskandr commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

CLI shape

Fixtures

Minor fix

Test plan

Related

Uh oh!

coveralls commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

iskandr commented Apr 15, 2026

Uh oh!

iskandr commented Apr 15, 2026

Uh oh!

iskandr commented Apr 15, 2026

Your feedback

Bonus: found and fixed a real bug

Review-agent items resolved

Uh oh!

iskandr commented Apr 15, 2026

What changed

mhcflurry

NetMHCpan

Provenance columns preserved

TSV loader

CLI simplification

Version invariant

Tests

Uh oh!

iskandr commented Apr 15, 2026

Vaxrank-consumer review — PR #136

What's good

Concerns / suggestions

Vaxrank-side implication

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

iskandr commented Apr 15, 2026 •

edited

Loading

coveralls commented Apr 15, 2026 •

edited

Loading