Add CachedPredictor — predictions-as-data core (part 1 of #128) by iskandr · Pull Request #130 · openvax/topiary

iskandr · 2026-04-14T18:17:14Z

Summary

First PR implementing #128 (pluggable prediction sources). Ships the `CachedPredictor` class, the core `(prediction_method_name, predictor_version)` invariant, fallback semantics, and four loaders. Does not include sharding (`concat` / `from_directory`) or the NetMHCpan .xls parser — separate follow-up PRs.

No version bump, no PyPI release: the feature lands across multiple PRs; release once #128 is substantially complete.

What's in

`topiary/cached.py` — `CachedPredictor` class. Implements the mhctools predictor protocol (`predict_peptides_dataframe`, `predict_proteins_dataframe`, `alleles`, `default_peptide_lengths`) by looking up rows in a normalized DataFrame keyed on `(peptide, allele, peptide_length)`.
Core invariant: a single cache holds predictions from exactly one `(prediction_method_name, predictor_version)` pair. Enforced at construction (mixed rows → `ValueError`) and on first fallback call (lazy version-check against the fallback's output).
Fallback: `fallback=None` → miss raises `KeyError` with missed peptides listed. `fallback=` → miss delegates and the result is merged back into the cache. No `populate_on_miss` flag — caching fallback hits is always the right default.
Loaders: `from_dataframe`, `from_topiary_output` (Parquet / TSV), `from_tsv` (generic w/ column mapping), `from_mhcflurry` (maps `mhcflurry_*` columns onto canonical names).
`save(path)` round-trip for Parquet / TSV / CSV so caches persist across sessions.
`topiary.CachedPredictor` exported at the top level.

What's deferred

`from_netmhcpan` (.xls parsing — both Class-I and Class-II formats)
`concat([...])`, `from_directory(...)`, overlap-resolution policy
CHANGELOG entry (deferred to the release-prep PR when Pluggable prediction sources — external formats, topiary round-trip, sharded caches #128 lands in full)

Test plan

`python -m pytest tests/test_cached_predictor.py` — 23 passed
`./test.sh` — 1075 passed, 3 skipped (up from 1052)
`./lint.sh` — clean
CI green

Part 1 of #128.

… of #128) Core class + four loaders, no sharding or version-match tests yet. Implements enough of the mhctools predictor protocol (predict_peptides_dataframe, predict_proteins_dataframe, alleles, default_peptide_lengths) to drop into TopiaryPredictor(models=cache). Core invariant: a single CachedPredictor holds predictions from exactly one (prediction_method_name, predictor_version) pair — never mixes versions. Enforced at construction and on the first fallback call (lazy version-check against the fallback's output). Fallback semantics: - fallback=None (default): miss raises KeyError with missed peptides. - fallback=<predictor>: miss delegates to the fallback, result is merged back into the cache so subsequent queries serve locally. No populate-on-miss flag — caching fallback hits is always the right default for the expected (batch prediction) workload. Loaders: - from_dataframe: in-memory, with optional predictor_name / predictor_version backfill for DataFrames that lack those columns. - from_topiary_output: Parquet / TSV round-trip of topiary's own predict_* output schema. - from_tsv: generic tab- or comma-delimited third-party files with a caller-supplied column-name mapping. - from_mhcflurry: mhcflurry-predict CSV — maps the mhcflurry_* columns onto canonical affinity / percentile_rank / score. Defer to follow-up PRs: - from_netmhcpan (.xls parsing is meatier; separate scope). - concat / from_directory sharding. 23 tests covering construction, version invariant (mixed versions rejected, name/version round-trips as string), predict_peptides + predict_proteins sliding-window, fallback hit + miss + version mismatch, all four loaders, and end-to-end integration with TopiaryPredictor(models=cache).

coveralls · 2026-04-14T18:26:50Z

coverage: 88.297% (+0.2%) from 88.127% — add-cached-predictor into master

Six add-ons from the PR #128 design discussion, bundled: 1. Empty-cache + fallback mode — CachedPredictor(fallback=p) with no df is now valid; (name, version) are discovered from the fallback's output on the first query and locked in thereafter. Pure read- through cache. 2. Reject null / NaN / empty-string on prediction_method_name and predictor_version at construction. Silent "I don't know" agreement would mask the invariant; force the user to supply a real identity. 3. also_accept_versions opt-in equivalence set. A fallback's version passes if it equals the cache's OR is in the caller's accepted set. Names stay strict — different predictors are different. Default (empty set) keeps today's strict equality. 4. mhcflurry_composite_version() helper and auto-default in from_mhcflurry. Discovers mhcflurry.__version__ plus mhcflurry.downloads.get_current_release() and composes e.g. "2.2.1+release-2.2.0". Users never enumerate model bundles — CachedPredictor.from_mhcflurry("predictions.csv") just works. Explicit predictor_version= still overrides when a custom label is needed. 5. from_mhcflurry docstring now explains the mhcflurry-specific package-version + model-release composition (scoped to this loader, not the generic class — NetMHCpan et al. have their model data baked into the binary and don't need composite versions). 6. Lookup refactor: one pass to identify peptides with any miss, one fallback call per round, one pass to assemble rows from the updated cache. Fixes a latent duplicate-row bug in the prior implementation (allele-level partial results + fallback union). 15 new tests across null-version rejection, empty-cache-plus-fallback identity discovery, equivalence-set acceptance vs. rejection, the helper's composition + error paths (stubbed mhcflurry module — avoids tensorflow/libomp collisions in the test env), and auto-compose vs. explicit-override for from_mhcflurry.

- docs/cached.md: new page covering motivation, all four loaders, the version invariant + also_accept_versions opt-in, mhcflurry's composite-version auto-composition, fallback semantics (including pure read-through), persistence, and when NOT to use the cache. - mkdocs.yml: wire "Cached Predictions" into the nav between "Protein Fragments" and "Ranking DSL". - docs/quickstart.md: new "From cached predictions" subsection at the end of the Python-API examples, linking to cached.md. - docs/api.md: "CachedPredictor" section with a loader-table, constructor knobs, and the mhcflurry_composite_version helper. - docs/index.md: add to the feature list and cross-link paragraph. - README.md: new top-level "Cached predictions" section. mkdocs build --strict passes.

iskandr · 2026-04-14T19:35:06Z

Vaxrank-consumer review

Reading from vaxrank 2.1.2's angle — this lands on a real vaxrank pain point I hadn't explicitly asked for in #121: vaxrank calls predict_from_named_sequences twice per mutation (mutant + WT), and the two share most of their sliding-window 9-mers. A CachedPredictor(fallback=mhc_predictor) persisted across a vaxrank run collapses those duplicate queries end-to-end — this is a bigger win than the straight WT-pass (topiary#123) on realistic inputs.

Design calls I like

Single (name, version) invariant, enforced at construction and lazily on fallback attach. ValueError with full multi-pair listing is exactly the diagnostic I'd want.
also_accept_versions as opt-in, not auto-bypass. Upgrading a predictor across a patch release is semantically risky; forcing the caller to name the specific accepted version captures intent without a silent trap.
mhcflurry_composite_version() composing pkg + model-release is the right granularity. The auto-compose on from_mhcflurry(..., predictor_version=None) means users never hand-enumerate bundles.
TopiaryPredictor(models=cache) singular form (line test_cache_as_model_predict_from_sequences) — nice to see; matches what I'd asked for on Refactor predict_from_variants onto AntigenFragment (PR B) #121. Is single-model auto-wrapping now always supported, or just when the arg is a cache?

Concerns / suggestions

Partial-allele coverage over-queries the fallback. In predict_peptides_dataframe the miss-detection is at peptide granularity: if peptide P1 has cache hits for allele A but is missing for allele B, the whole P1 gets sent to fallback.predict_peptides_dataframe([P1]). The fallback typically predicts across all its alleles, so (P1, A) gets re-predicted and merged back, overwriting the cached row with the fallback's output. Correct, but wasteful if the caller's cache was partial-on-purpose. Worth either (a) documenting "fallback results overwrite overlapping cache rows" or (b) filtering the fallback output to only the actually-missed (peptide, allele) pairs before merging. Vaxrank usage probably has uniform allele coverage so won't hit this, but doc is cheap.
peptide_length lock-in is invisible. A cache built from only 9-mers has default_peptide_lengths == [9]; predict_proteins_dataframe will silently slide 9-mers only, dropping 8/10/11 windows. For a cache backed by a fallback that supports more lengths this is fine, but a cache used standalone can silently under-return. Worth a docstring note on default_peptide_lengths — or, if the cache has a fallback, maybe union the fallback's lengths on empty-cache to avoid surprises (which is what you're already doing — good; just make that explicit in the protein-path doc).
_index as dict-of-dicts has O(rows) Python-object overhead. Fine for vaxrank (thousands of peptides per run); but whole-proteome caches (millions of rows, which I'd expect some topiary consumers to build) will feel it. A row-index lookup into the DataFrame, or a MultiIndex, would be more memory-efficient. Not blocking — flag for a follow-up benchmark.
Thread safety. _fallback_resolve mutates self._df and self._index without locking. Two threads concurrently missing the same peptide would both call the fallback and race on the merge. Vaxrank is single-threaded, so no blocker — but a one-line docstring note ("not thread-safe; wrap in a lock if shared across workers") would save a future bug report.
Pristine empty cache + save() edge case. CachedPredictor(fallback=X) (empty _df, prediction_method_name=None) lets you save before any queries run. The resulting TSV/Parquet has the schema columns but no rows, and reloading via from_topiary_output would hit _unique_version_pair with empty pairs → ValueError("empty DataFrame"). So roundtripping a never-queried cache breaks. Minor, and probably not a real use case, but either error-early in save() when identity is unset or document "save a populated cache only."
mhctools protocol shape. predict_dataframe = predict_peptides_dataframe — nice compat alias. Worth checking whether downstream code paths in topiary's own TopiaryPredictor._predict_raw* use the positional-arg vs keyword-arg shape consistently; I noticed vaxrank wraps bare mhctools predictors on every call (TopiaryPredictor(models=[mhc_predictor]) in epitope_logic.py) rather than keeping a long-lived predictor. A CachedPredictor wrapping that bare predictor plus persistence across calls would be the idiomatic fix on the vaxrank side.

Vaxrank follow-up sketch (for my own tracking)

Once this lands:

One CachedPredictor(fallback=mhc_predictor, save_path=...) per vaxrank run, shared across the mutant + WT predict_from_named_sequences calls.
Combined with topiary#123 (predict_wt=True), the ~40 lines of WT plumbing in vaxrank/epitope_logic.py:120–161 collapse to a single call with both directions cached.
The predictor_version stamping gives users a natural reproducibility artifact (current vaxrank has none).

Overall: this is a well-scoped part 1. Ship it; the concerns above are mostly docs/edge cases.

5.5.0 is a minor bump per project convention (new feature, non- breaking). CHANGELOG entry documents the full CachedPredictor surface shipped in this PR: loaders, version invariant, opt-in equivalence, mhcflurry auto-composition, fallback, and the related mhctools#193 issue.

Addresses concrete concerns from the consumer review on PR #130: 1. **Preserve pre-loaded cache rows through fallback overlap.** The fallback typically returns rows for every allele it's configured with, not just the specific (peptide, allele) that missed. A partial-allele cache (peptide P hit for A, miss for B) would see its (P, A) row silently overwritten when (P, B) triggered the fallback. _fallback_resolve now filters fallback output to keys not already in the index before merging. 2. **Docstring note on silent peptide-length lock-in.** A cache loaded from a file containing only one peptide length (say 9-mers) will only scan 9-mer windows in predict_proteins_dataframe. The current behavior is correct but easy to miss; the class docstring now calls it out explicitly so users check default_peptide_lengths after loading. 3. **Thread-safety disclaimer in the class docstring.** _fallback_resolve mutates _index and _df without locking; concurrent misses would race. Single-threaded usage is the norm, but document the constraint so downstream code doesn't hit it by surprise. 4. **save() on an empty, never-queried cache raises.** A CachedPredictor(fallback=p) with no pre-loaded data and no queries yet has no (prediction_method_name, predictor_version) and no rows; saving produced a schema-only file that from_topiary_output couldn't round-trip. Fail fast at save time with a message pointing users at the fix (run at least one query first). 3 new regression tests (41 total for cached, up from 38): partial-allele preservation, save-before-query rejection, and save-after-populate round-trip success. Full suite 1093 passed (up from 1090), lint clean.

Follow-up to #130 addressing concrete concerns from the vaxrank-consumer review (comment 4246617648): - Partial-allele cache preservation: _fallback_resolve filters fallback output to keys not already in the index before merging, so pre-loaded rows aren't overwritten when the fallback covers overlapping alleles. - Docstring notes on silent peptide-length lock-in and non-thread-safety. - save() raises on empty never-queried caches so users don't write schema-only files that can't be round-tripped. 3 new regression tests (41 total for CachedPredictor, up from 38), full suite 1093 passing. Concern #3 (_index memory footprint on whole-proteome caches) is deferred to a benchmark-driven follow-up.

iskandr added 2 commits April 14, 2026 15:24

iskandr merged commit 1ea85e9 into master Apr 14, 2026
8 checks passed

iskandr mentioned this pull request Apr 14, 2026

CachedPredictor vaxrank-review polish (#130 follow-up) #131

Merged

4 tasks

This was referenced Apr 15, 2026

CachedPredictor._index memory footprint on whole-proteome caches #134

Closed

TopiaryPredictor(predict_wt=True) — first-class WT predictions pass #123

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add CachedPredictor — predictions-as-data core (part 1 of #128)#130

Add CachedPredictor — predictions-as-data core (part 1 of #128)#130
iskandr merged 4 commits into
masterfrom
add-cached-predictor

iskandr commented Apr 14, 2026

Uh oh!

coveralls commented Apr 14, 2026 •

edited

Loading

Uh oh!

iskandr commented Apr 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

iskandr commented Apr 14, 2026

Summary

What's in

What's deferred

Test plan

Uh oh!

coveralls commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

iskandr commented Apr 14, 2026

Vaxrank-consumer review

Design calls I like

Concerns / suggestions

Vaxrank follow-up sketch (for my own tracking)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coveralls commented Apr 14, 2026 •

edited

Loading