CachedPredictor CLI + real NetMHC fixtures + README reorder#136
Conversation
## CLI
New `--mhc-cache-*` argument group lets users run topiary entirely
from pre-computed prediction files, without invoking a live MHC
predictor. Every CachedPredictor loader has a CLI path:
topiary --peptide-csv peptides.csv \
--mhc-cache-file netmhcpan_run.out \
--mhc-cache-format netmhcpan_stdout \
--output-csv results.csv
Flags: --mhc-cache-file, --mhc-cache-directory (+ pattern),
--mhc-cache-format (8 choices), --mhc-cache-predictor-name /
--mhc-cache-predictor-version overrides, --mhc-cache-tsv-column
(repeatable), --mhc-cache-tsv-sep, plus format-specific mode and
version knobs for the NetMHC family.
--mhc-predictor / --mhc-alleles become optional when a cache is
supplying predictions. A combined --mhc-cache-file +
--mhc-cache-directory raises; missing --mhc-cache-format with a
--mhc-cache-file raises; etc. — all validated with actionable
messages.
## Real NetMHC fixtures
tests/data/netmhc_fixtures/ now holds captured stdout from the
netmhc-bundle binaries for peptide SLLQHLIGL against HLA-A*02:01,
HLA-A*24:02, HLA-B*07:02 (substituted for B*07:01 which isn't in
NetMHCpan's allele list):
- netmhcpan-4.1 and 4.0 output (single-allele per file)
- netmhc-4.0 output (multi-allele works)
- netmhcstabpan output (single-allele)
Multi-allele NetMHCpan outputs are also committed but currently
don't parse through mhctools; filed as openvax/mhctools#195 with
the root-cause analysis (per-allele "Distance to training data"
header lines confuse the row parser). NetMHCcons skipped because
the binary requires python2 which isn't on this machine's PATH.
## Tests
- 6 new TestRealNetMHCFixtures tests assert the actual numeric
predictions (e.g. SLLQHLIGL × HLA-A*02:01 = 8.82 nM, 0.105%rank)
parse through each from_*_stdout loader.
- 9 new tests in tests/test_cli_cached_predictor.py exercise the
CLI end-to-end: error paths, happy paths for netmhcpan / netmhc /
netmhcstabpan / topiary-output / tsv formats, explicit version
override, TSV column mapping.
1126 tests pass (up from 1111 on this branch's base), lint clean.
## Minor fix
topiary/cached.py: expanded the stdout-header version-regex search
window from 2000 to 10000 chars. NetMHCpan preambles with full
argument dumps push the "# NetMHCpan version X.Yb" line past 2000
chars, which was causing predictor_version to default to "unknown"
instead of parsing from the header. Real captured fixtures showed
the version line at char ~2700.
## README reorder
Moved "MHC prediction models" section from line 481 to just after
"Predicting MHC binding" at line 66 — it's more fundamental for
readers than where to find it buried. Moved "Cached predictions"
from line 442 to the end of the README — it's an advanced/optional
feature that belongs after the basics. New order: Installation →
Predicting MHC binding → MHC prediction models → Filtering and
ranking → Expression data → CLI usage → Output → Peptide properties
→ Protein sources → Cached predictions.
## Docs
docs/cached.md gains a "From the CLI" section with a full flag
table and per-format command examples. README's Cached predictions
section shows a one-liner CLI example with a link to the full
reference.
|
Why do you need |
|
And is there a CLI option for version specification? And can the format and version both be sniffed from the text file? |
## Bug fix (critical) Cached-predictor loaders were not stamping the `kind` column, so DSL expressions like `Affinity.value <= 500` (which dispatches on `kind == "pMHC_affinity"`) matched zero rows. Surfaced by a new CLI filter/sort test. Fix: `_bindings_to_dataframe(preds, *, kind)` now requires a kind argument; each `from_*_stdout` loader passes the right value (`pMHC_affinity` for binding-affinity mode, `pMHC_presentation` for elution-score mode, `pMHC_stability` for NetMHCstabpan, inferred from mhcflurry column signatures for `from_mhcflurry`). ## User feedback - **Drop `_stdout` suffix from CLI format names.** `netmhcpan_stdout` → `netmhcpan`, etc. The suffix was noise; users never wondered if it was stdout vs .xls because we only ship stdout today. Internal loader classmethod names kept their `_stdout` suffix for the separation between stdout and future .xls parsers in the Python API. - **Auto-sniff `--mhc-cache-format` from file content.** New `_sniff_format(path)` helper detects NetMHCpan / NetMHC / NetMHCstabpan / NetMHCcons / NetMHCIIpan from stdout preamble lines, mhcflurry from `mhcflurry_*` column names, topiary_output from `prediction_method_name` + `predictor_version` column headers, and parquet via magic bytes. `--mhc-cache-format` becomes optional; only the generic `tsv` path still strictly requires it (nothing to sniff). - **Version specification clarified.** Existing flags kept: `--mhc-cache-predictor-version` for the identity stamp (auto-parsed from stdout preamble, auto-composed for mhcflurry); tool-version flags `--mhc-cache-netmhc-version` and `--mhc-cache-netmhciipan-version` pick between version-specific parsers when multiple are supported. ## Review feedback - **Multi-allele fixtures get xfail tests** (blocker from agent review). 3 new strict-xfail tests in `TestMultiAlleleFixturesKnownBroken` document the expected mhctools#195 fix; they'll flip to passing when upstream lands, producing a reviewable diff that promotes them to happy-path tests. - **CLI directory sharding test** (should fix). `TestCachedPredictorCliSharding::test_mhc_cache_directory_merges_shards` exercises the full CLI path through `--mhc-cache-directory` with a pattern glob. - **CLI filter/sort interaction tests** (should fix). Two tests verify `--filter-by` / `--sort-by` work against cached-prediction output end-to-end. - **Fixture-pinning docstring** (nit #4). `tests/test_cached_predictor.py` notes that numeric assertions pin specific NetMHC versions and headers carry machine-specific comment lines that don't affect parsing. - **`_actions` mutation comment** (nit #5). Explains why reaching into argparse internals is safe and what would happen if mhctools changes its action layout. ## Stats 1133 passed, 4 skipped, 3 xfailed (up from 1126 baseline). Lint clean, mkdocs --strict clean.
|
Responding to the three questions plus the agent-review items — all in Your feedbackDropped `_stdout` suffix from CLI format names. New choices: `topiary_output`, `mhcflurry`, `tsv`, `netmhcpan`, `netmhc`, `netmhccons`, `netmhciipan`, `netmhcstabpan`. Cleaner. (Python API classmethod names kept their `_stdout` suffix to leave room for future `_xls` parsers.) `--mhc-cache-format` is now optional when content is self-identifying. New `_sniff_format(path)` helper detects:
Only the generic `tsv` format still needs an explicit `--mhc-cache-format` (nothing to sniff). Error message for the unsniffable case is actionable. # Sniffed — no --mhc-cache-format needed
topiary --peptide-csv peptides.csv \\
--mhc-cache-file netmhcpan_run.out \\
--output-csv results.csvVersion specification via CLI — already supported, now clarified in help text:
So version is fully discoverable from the file for the NetMHC family: the preamble supplies both the tool identity (for sniffing) and the version stamp (for the invariant). Bonus: found and fixed a real bugThe filter/sort integration test from the agent review ("do `--filter-by` / `--sort-by` work on cached output?") surfaced a bug: cached-predictor loaders weren't stamping the `kind` column, so DSL expressions like `Affinity.value <= 500` (which dispatches on `kind == "pMHC_affinity"`) matched zero rows silently. Fixed by requiring `_bindings_to_dataframe(..., kind=...)` and having each loader pass the right value (`pMHC_affinity` / `pMHC_presentation` / `pMHC_stability` by tool + mode). Review-agent items resolved
1133 tests pass (up from 1126), lint + mkdocs clean. |
## Blocker: stale '_stdout'-suffixed help strings
The prior rename missed four CLI help strings that still referenced
the old 'netmhcpan_stdout' / 'netmhc_stdout' / 'netmhciipan_stdout'
format names (the actual choices are now the unsuffixed forms).
Renamed. Functional behavior unchanged; user-visible help text now
matches the --mhc-cache-format choices.
## Should fix: TSV loader now stamps 'kind'
from_tsv() was silently dropping the kind column (not in
_CACHE_COLUMNS' required list, never populated), so DSL expressions
like Affinity.value <= 500 on a TSV-loaded cache matched zero rows —
same class of bug the previous round fixed for the NetMHC-family
loaders.
Fix: from_tsv gains a kind="pMHC_affinity" kwarg (default). If the
TSV file itself has a kind column, it wins; otherwise the kwarg value
is stamped on every row. CLI exposes the knob as
--mhc-cache-tsv-kind with choices {pMHC_affinity, pMHC_presentation,
pMHC_stability, antigen_processing}, default pMHC_affinity.
Two new regression tests:
- test_tsv_with_column_mapping now asserts
row['kind'] == 'pMHC_affinity' (default).
- test_tsv_explicit_kind covers --mhc-cache-tsv-kind=pMHC_stability
for stability predictor output.
## Nit: deleted the unconditionally-skipped obsolete test
test_cache_file_without_format_raises was superseded by
test_tsv_not_autodetected and kept as pytest.skip(...) body.
Unconditional skips add noise to test reports. Deleted outright.
Full suite 1134 pass (up from 1133 on the last push). Lint clean.
Back out the half-done multi-kind explode for mhcflurry in this PR because the cache's internal index keys on (peptide, allele, peptide_length) — storing multiple kinds under the same key collapses to the last write. Supporting multi-kind cleanly means expanding the key to a 4-tuple across _build_index / predict_peptides_dataframe / _fallback_resolve / concat, a non-trivial refactor that doesn't belong in this CLI-focused PR. from_mhcflurry reverts to the single-kind-picking heuristic with a TODO comment cross-referencing the follow-up issue. Test renamed back to test_from_mhcflurry with a TODO note on the limitation. Tracked as #137. Full suite 1134 pass, lint clean.
CachedPredictor now keys internally on (peptide, allele, peptide_length, kind) instead of the 3-tuple, so a single cache holds every kind mhcflurry's class1_presentation pipeline or NetMHCpan's -BA flag emits — affinity + presentation + processing rows coexist under the same (peptide, allele) without collisions. ## Loader changes **from_mhcflurry** now explodes wide-format CSVs into one row per (peptide, allele, kind): mhcflurry_affinity / _affinity_percentile → pMHC_affinity row mhcflurry_presentation_score / _percentile → pMHC_presentation row mhcflurry_processing_score → antigen_processing row Rows whose columns for a given kind are all NaN are skipped, so affinity-only and presentation-only CSVs still produce sensible single-kind output. n_flank / c_flank / source_sequence_name / peptide_offset / sample_name columns are preserved per-kind. **from_netmhcpan_stdout** switches from parse_netmhcpan_stdout (kind- collapses via mode=) to parse_netmhcpan_to_preds which returns every Prediction in the stdout. A NetMHCpan -BA run now lands in the cache as affinity + presentation rows — users no longer pick one at load time. Dropped the mode= kwarg entirely (no longer meaningful). **from_tsv** now requires a 'kind' column on every row (either natively or via the `columns=` mapping). Multi-kind TSVs work: add a kind column, list one value per row. Clearer than the --mhc-cache-tsv-kind crutch. ## CLI changes - Dropped --mhc-cache-tsv-kind (superseded by the required kind column on TSV input). - Dropped --mhc-cache-netmhcpan-mode (parse_netmhcpan_to_preds returns everything; no mode pick needed). ## Schema additions _CACHE_COLUMNS now carries n_flank, c_flank, source_sequence_name, peptide_offset, sample_name when the source file provides them. Optional — absent → None. Round-trips through save/from_topiary_output intact. ## Version invariant Unchanged. Every row of a cache still shares exactly one (prediction_method_name, predictor_version) pair — multi-kind just means the pair covers all the kinds that predictor emits. ## Tests - TestLoaders gains affinity-only / affinity+presentation / full-presentation-pipeline mhcflurry tests. - TestNetMHCLoaders asserts the 2× row count for NetMHCpan -BA. - TestCachedPredictorCliTsv gains test_tsv_multi_kind_in_one_file. - Existing tests updated to use the 4-tuple index key where they probed _index directly, and to expect multi-kind outputs where applicable (filter/sort, auto-sniff, CLI happy paths). 1137 tests pass (up from 1133), lint + mkdocs clean. Closes #137.
|
Shipped multi-kind support in `ff1fe12` (closes #137). What changed`CachedPredictor` now keys internally on `(peptide, allele, peptide_length, kind)` instead of the 3-tuple, so a single cache holds every kind mhcflurry's class1_presentation pipeline or NetMHCpan `-BA` emits — no more silent data loss. mhcflurry`from_mhcflurry` explodes a wide-format CSV into one row per (peptide, allele, kind): ``` Rows whose columns for a given kind are all NaN are skipped — affinity-only and presentation-only CSVs still produce sensible single-kind output. NetMHCpan`from_netmhcpan_stdout` now uses `parse_netmhcpan_to_preds` (which returns every Prediction in the stdout) instead of the kind-collapsing `parse_netmhcpan_stdout`. A `-BA` run lands in the cache with both affinity + presentation rows per (peptide, allele) — users no longer pick one at load time. Dropped the `mode=` kwarg entirely. Provenance columns preserved`n_flank`, `c_flank`, `source_sequence_name`, `peptide_offset`, `sample_name` are now carried through when the source file has them. Same schema across loaders; round-trips cleanly through `save` / `from_topiary_output`. TSV loader`from_tsv` now requires a `kind` column per row (natively or via `columns=` mapping). Multi-kind TSVs work: add the column, list one kind per row. CLI simplificationDropped `--mhc-cache-tsv-kind` (superseded by the required `kind` column) and `--mhc-cache-netmhcpan-mode` (no longer meaningful — we return every kind). Version invariantUnchanged. Every row of a cache shares exactly one `(prediction_method_name, predictor_version)` pair. "Multi-kind" just means that pair covers all the kinds that predictor emits in one run. Tests1137 total (up from 1133):
Lint + mkdocs clean. Closes #137. |
Flanks affect predictions for mhcflurry's antigen_processing and pMHC_presentation kinds — the same peptide at two different protein positions (different flanking context) produces different scores. The previous 4-tuple key (peptide, allele, length, kind) silently collapsed those distinct predictions. Key is now a 6-tuple (peptide, allele, length, kind, n_flank, c_flank). None flanks (from predictors / sources that don't use them) coexist cleanly — they just produce a single (None, None) entry per (peptide, allele, kind). ## Changes - _KEY_COLS gains n_flank, c_flank. - _build_index and _fallback_resolve's row-key helpers use the 6-tuple. - _normalize backfills n_flank / c_flank with None when absent so every row has a well-defined key shape. - _resolve_overlaps' callable path uses groupby(..., dropna=False) so None-flank rows don't get silently dropped by the default groupby NaN behavior. - predict_peptides_dataframe (which receives peptide strings without flank context) iterates the index with a 3-element prefix match and returns every kind + flank combo. A new _lookup_by_prefix helper encapsulates this. - A new _flank_key module helper normalizes NaN / None / "" into a single None sentinel so the hash-key is stable across round-trip / TSV / parquet differences in null representation. ## Tests New TestLoaders::test_from_mhcflurry_same_peptide_different_flanks covers the key differentiator: two mhcflurry processing predictions at the same (peptide, allele) but different flanks both survive caching. Also updates the _index direct-probe tests to use 6-tuple keys (existing test fixtures have None flanks). 1137 pass, lint + mkdocs clean.
Simpler than None/NaN handling: - No NaN-vs-None hash ambiguity in tuple keys. - No special groupby(..., dropna=False) — "" is a regular string that pandas groups without surprises. - _flank_key coerces NaN / None / whitespace all to "" on normalize, so TSV / Parquet round-trips stay stable. - _normalize fills missing flank columns with "" and passes existing values through _flank_key (uppercase + strip + NaN coercion) so every row in the cache ends up with consistent flank shapes. Test-level _index probes updated from None to "" for flank components. No behavior change beyond the sentinel value.
Vaxrank-consumer review — PR #136Read the diff. Vaxrank itself won't feel the CLI surface directly (vaxrank uses topiary's Python API, not What's good
Concerns / suggestions1. Mutating
Alternative: upstream a keyword arg to 2. 3. Format sniffing has no tests. The CLI tests pass explicit 4. Multi-allele NetMHCpan fixtures are aspirational. The PR commits
5. 6. 7. Vaxrank-side implicationThe headline win for vaxrank adoption of CachedPredictor (issue #227 item #3 equivalent) is "pre-run NetMHCpan offline, pipe the stdout files straight into vaxrank." That path now exists end-to-end once this PR lands. The multi-allele fixture blocker (mhctools#195) matters because vaxrank users frequently run multi-allele NetMHCpan batches — worth tracking as a separate dependency. Overall: approve after (3) format-sniffer tests land. (1) is a real-but-acceptable hack for this PR; fix when mhctools can absorb a keyword. (4–7) are polish. |
## Stale docs cleaned - docs/cached.md: removed the stale `mode="binding_affinity"` param from the from_netmhcpan_stdout example — that kwarg no longer exists. Added a note explaining multi-kind behavior for -BA runs. Removed the stale --mhc-cache-netmhcpan-mode row from the CLI flag table. - docs/api.md: updated the CachedPredictor.from_netmhcpan_stdout entry to drop mode= and explain multi-kind output. ## Multi-allele fixtures promoted The three tests in TestMultiAlleleFixturesKnownBroken are now XPASS (strict) — they pass because from_netmhcpan_stdout now uses parse_netmhcpan_to_preds, which handles multi-allele stdout correctly (the old parse_netmhcpan_stdout crashed on per-allele header lines, which was mhctools#195). Renamed the class from TestMultiAlleleFixturesKnownBroken to TestMultiAlleleFixtures and dropped all xfail markers. Added a richer assertion for NetMHCpan 4.1 multi-allele: 3 alleles × 2 kinds = 6 rows expected. ## Net 0 xfail remaining (were 3 strict xfail, now 0). 1141 tests pass (up from 1138), lint clean, mkdocs clean.
Summary
Three related changes on the cached-predictions story (orthogonal to PR #135 which is still in review on the SelfProteome side):
CachedPredictor. New--mhc-cache-*argument group lets users run topiary entirely from pre-computed prediction files without invoking a live MHC predictor.netmhc-bundlebinaries for peptideSLLQHLIGLatHLA-A*02:01 / A*24:02 / B*07:02(substituted forB*07:01which NetMHCpan doesn't recognize). 15 new tests exercise the full parsing + CLI chain against real data.CLI shape
topiary --peptide-csv peptides.csv \ --mhc-cache-file netmhcpan_run.out \ --mhc-cache-format netmhcpan_stdout \ --output-csv results.csvSupported formats (via
--mhc-cache-format):topiary_output,mhcflurry,tsv,netmhcpan_stdout,netmhc_stdout,netmhcpan_cons_stdout,netmhciipan_stdout,netmhcstabpan_stdout. Plus--mhc-cache-directorywith--mhc-cache-directory-patternfor sharded caches.--mhc-predictor/--mhc-allelesbecome optional when a cache is supplying predictions. Validation covers missing format, mutually-exclusive file vs directory, and no-source-at-all cases.Full flag reference is documented in
docs/cached.md.Fixtures
tests/data/netmhc_fixtures/:netmhcpan_41_SLLQHLIGL_{A0201,A2402,B0702}.out— NetMHCpan 4.1b stdout, single-allele per file.netmhcpan_40_SLLQHLIGL_*.out— NetMHCpan 4.0 stdout.netmhc_40_SLLQHLIGL.out— NetMHC classic 4.0 (multi-allele works).netmhcstabpan_SLLQHLIGL_*.out— NetMHCstabpan single-allele.NetMHCcons skipped — the binary calls
python2which isn't on PATH here.Minor fix
topiary/cached.py: expanded the version-header regex search window from 2000 to 10000 chars. NetMHCpan preambles with full argument dumps push the# NetMHCpan version X.Ybline past 2000 chars (real captures show it at char ~2700), which was makingpredictor_versiondefault to"unknown"instead of parsing from the header.Test plan
TestRealNetMHCFixturestests intests/test_cached_predictor.pyassert actual numeric predictions (e.g. SLLQHLIGL × A*02:01 = 8.82 nM, 0.105%rank) through eachfrom_*_stdoutloader../test.sh— 1126 passed, 3 skipped../lint.sh— clean.mkdocs build --strict— clean.Related