Skip to content

Add CachedPredictor — predictions-as-data core (part 1 of #128)#130

Merged
iskandr merged 4 commits into
masterfrom
add-cached-predictor
Apr 14, 2026
Merged

Add CachedPredictor — predictions-as-data core (part 1 of #128)#130
iskandr merged 4 commits into
masterfrom
add-cached-predictor

Conversation

@iskandr
Copy link
Copy Markdown
Contributor

@iskandr iskandr commented Apr 14, 2026

Summary

First PR implementing #128 (pluggable prediction sources). Ships the `CachedPredictor` class, the core `(prediction_method_name, predictor_version)` invariant, fallback semantics, and four loaders. Does not include sharding (`concat` / `from_directory`) or the NetMHCpan .xls parser — separate follow-up PRs.

No version bump, no PyPI release: the feature lands across multiple PRs; release once #128 is substantially complete.

What's in

  • `topiary/cached.py` — `CachedPredictor` class. Implements the mhctools predictor protocol (`predict_peptides_dataframe`, `predict_proteins_dataframe`, `alleles`, `default_peptide_lengths`) by looking up rows in a normalized DataFrame keyed on `(peptide, allele, peptide_length)`.
  • Core invariant: a single cache holds predictions from exactly one `(prediction_method_name, predictor_version)` pair. Enforced at construction (mixed rows → `ValueError`) and on first fallback call (lazy version-check against the fallback's output).
  • Fallback: `fallback=None` → miss raises `KeyError` with missed peptides listed. `fallback=` → miss delegates and the result is merged back into the cache. No `populate_on_miss` flag — caching fallback hits is always the right default.
  • Loaders: `from_dataframe`, `from_topiary_output` (Parquet / TSV), `from_tsv` (generic w/ column mapping), `from_mhcflurry` (maps `mhcflurry_*` columns onto canonical names).
  • `save(path)` round-trip for Parquet / TSV / CSV so caches persist across sessions.
  • `topiary.CachedPredictor` exported at the top level.

What's deferred

Test plan

  • `python -m pytest tests/test_cached_predictor.py` — 23 passed
  • `./test.sh` — 1075 passed, 3 skipped (up from 1052)
  • `./lint.sh` — clean
  • CI green

Part 1 of #128.

… of #128)

Core class + four loaders, no sharding or version-match tests yet.
Implements enough of the mhctools predictor protocol
(predict_peptides_dataframe, predict_proteins_dataframe, alleles,
default_peptide_lengths) to drop into TopiaryPredictor(models=cache).

Core invariant: a single CachedPredictor holds predictions from
exactly one (prediction_method_name, predictor_version) pair — never
mixes versions.  Enforced at construction and on the first fallback
call (lazy version-check against the fallback's output).

Fallback semantics:
- fallback=None (default): miss raises KeyError with missed peptides.
- fallback=<predictor>: miss delegates to the fallback, result is
  merged back into the cache so subsequent queries serve locally.
  No populate-on-miss flag — caching fallback hits is always the
  right default for the expected (batch prediction) workload.

Loaders:
- from_dataframe: in-memory, with optional predictor_name /
  predictor_version backfill for DataFrames that lack those columns.
- from_topiary_output: Parquet / TSV round-trip of topiary's own
  predict_* output schema.
- from_tsv: generic tab- or comma-delimited third-party files with a
  caller-supplied column-name mapping.
- from_mhcflurry: mhcflurry-predict CSV — maps the mhcflurry_*
  columns onto canonical affinity / percentile_rank / score.

Defer to follow-up PRs:
- from_netmhcpan (.xls parsing is meatier; separate scope).
- concat / from_directory sharding.

23 tests covering construction, version invariant (mixed versions
rejected, name/version round-trips as string), predict_peptides +
predict_proteins sliding-window, fallback hit + miss + version
mismatch, all four loaders, and end-to-end integration with
TopiaryPredictor(models=cache).
@coveralls
Copy link
Copy Markdown

coveralls commented Apr 14, 2026

Coverage Status

coverage: 88.297% (+0.2%) from 88.127% — add-cached-predictor into master

iskandr added 2 commits April 14, 2026 15:24
Six add-ons from the PR #128 design discussion, bundled:

1. Empty-cache + fallback mode — CachedPredictor(fallback=p) with no
   df is now valid; (name, version) are discovered from the fallback's
   output on the first query and locked in thereafter.  Pure read-
   through cache.

2. Reject null / NaN / empty-string on prediction_method_name and
   predictor_version at construction.  Silent "I don't know" agreement
   would mask the invariant; force the user to supply a real identity.

3. also_accept_versions opt-in equivalence set.  A fallback's version
   passes if it equals the cache's OR is in the caller's accepted set.
   Names stay strict — different predictors are different.  Default
   (empty set) keeps today's strict equality.

4. mhcflurry_composite_version() helper and auto-default in
   from_mhcflurry.  Discovers mhcflurry.__version__ plus
   mhcflurry.downloads.get_current_release() and composes e.g.
   "2.2.1+release-2.2.0".  Users never enumerate model bundles —
   CachedPredictor.from_mhcflurry("predictions.csv") just works.
   Explicit predictor_version= still overrides when a custom label is
   needed.

5. from_mhcflurry docstring now explains the mhcflurry-specific
   package-version + model-release composition (scoped to this loader,
   not the generic class — NetMHCpan et al. have their model data baked
   into the binary and don't need composite versions).

6. Lookup refactor: one pass to identify peptides with any miss, one
   fallback call per round, one pass to assemble rows from the updated
   cache.  Fixes a latent duplicate-row bug in the prior implementation
   (allele-level partial results + fallback union).

15 new tests across null-version rejection, empty-cache-plus-fallback
identity discovery, equivalence-set acceptance vs. rejection, the
helper's composition + error paths (stubbed mhcflurry module — avoids
tensorflow/libomp collisions in the test env), and auto-compose vs.
explicit-override for from_mhcflurry.
- docs/cached.md: new page covering motivation, all four loaders,
  the version invariant + also_accept_versions opt-in, mhcflurry's
  composite-version auto-composition, fallback semantics (including
  pure read-through), persistence, and when NOT to use the cache.
- mkdocs.yml: wire "Cached Predictions" into the nav between
  "Protein Fragments" and "Ranking DSL".
- docs/quickstart.md: new "From cached predictions" subsection at
  the end of the Python-API examples, linking to cached.md.
- docs/api.md: "CachedPredictor" section with a loader-table,
  constructor knobs, and the mhcflurry_composite_version helper.
- docs/index.md: add to the feature list and cross-link paragraph.
- README.md: new top-level "Cached predictions" section.

mkdocs build --strict passes.
@iskandr
Copy link
Copy Markdown
Contributor Author

iskandr commented Apr 14, 2026

Vaxrank-consumer review

Reading from vaxrank 2.1.2's angle — this lands on a real vaxrank pain point I hadn't explicitly asked for in #121: vaxrank calls predict_from_named_sequences twice per mutation (mutant + WT), and the two share most of their sliding-window 9-mers. A CachedPredictor(fallback=mhc_predictor) persisted across a vaxrank run collapses those duplicate queries end-to-end — this is a bigger win than the straight WT-pass (topiary#123) on realistic inputs.

Design calls I like

  • Single (name, version) invariant, enforced at construction and lazily on fallback attach. ValueError with full multi-pair listing is exactly the diagnostic I'd want.
  • also_accept_versions as opt-in, not auto-bypass. Upgrading a predictor across a patch release is semantically risky; forcing the caller to name the specific accepted version captures intent without a silent trap.
  • mhcflurry_composite_version() composing pkg + model-release is the right granularity. The auto-compose on from_mhcflurry(..., predictor_version=None) means users never hand-enumerate bundles.
  • TopiaryPredictor(models=cache) singular form (line test_cache_as_model_predict_from_sequences) — nice to see; matches what I'd asked for on Refactor predict_from_variants onto AntigenFragment (PR B) #121. Is single-model auto-wrapping now always supported, or just when the arg is a cache?

Concerns / suggestions

  1. Partial-allele coverage over-queries the fallback. In predict_peptides_dataframe the miss-detection is at peptide granularity: if peptide P1 has cache hits for allele A but is missing for allele B, the whole P1 gets sent to fallback.predict_peptides_dataframe([P1]). The fallback typically predicts across all its alleles, so (P1, A) gets re-predicted and merged back, overwriting the cached row with the fallback's output. Correct, but wasteful if the caller's cache was partial-on-purpose. Worth either (a) documenting "fallback results overwrite overlapping cache rows" or (b) filtering the fallback output to only the actually-missed (peptide, allele) pairs before merging. Vaxrank usage probably has uniform allele coverage so won't hit this, but doc is cheap.

  2. peptide_length lock-in is invisible. A cache built from only 9-mers has default_peptide_lengths == [9]; predict_proteins_dataframe will silently slide 9-mers only, dropping 8/10/11 windows. For a cache backed by a fallback that supports more lengths this is fine, but a cache used standalone can silently under-return. Worth a docstring note on default_peptide_lengths — or, if the cache has a fallback, maybe union the fallback's lengths on empty-cache to avoid surprises (which is what you're already doing — good; just make that explicit in the protein-path doc).

  3. _index as dict-of-dicts has O(rows) Python-object overhead. Fine for vaxrank (thousands of peptides per run); but whole-proteome caches (millions of rows, which I'd expect some topiary consumers to build) will feel it. A row-index lookup into the DataFrame, or a MultiIndex, would be more memory-efficient. Not blocking — flag for a follow-up benchmark.

  4. Thread safety. _fallback_resolve mutates self._df and self._index without locking. Two threads concurrently missing the same peptide would both call the fallback and race on the merge. Vaxrank is single-threaded, so no blocker — but a one-line docstring note ("not thread-safe; wrap in a lock if shared across workers") would save a future bug report.

  5. Pristine empty cache + save() edge case. CachedPredictor(fallback=X) (empty _df, prediction_method_name=None) lets you save before any queries run. The resulting TSV/Parquet has the schema columns but no rows, and reloading via from_topiary_output would hit _unique_version_pair with empty pairs → ValueError("empty DataFrame"). So roundtripping a never-queried cache breaks. Minor, and probably not a real use case, but either error-early in save() when identity is unset or document "save a populated cache only."

  6. mhctools protocol shape. predict_dataframe = predict_peptides_dataframe — nice compat alias. Worth checking whether downstream code paths in topiary's own TopiaryPredictor._predict_raw* use the positional-arg vs keyword-arg shape consistently; I noticed vaxrank wraps bare mhctools predictors on every call (TopiaryPredictor(models=[mhc_predictor]) in epitope_logic.py) rather than keeping a long-lived predictor. A CachedPredictor wrapping that bare predictor plus persistence across calls would be the idiomatic fix on the vaxrank side.

Vaxrank follow-up sketch (for my own tracking)

Once this lands:

  • One CachedPredictor(fallback=mhc_predictor, save_path=...) per vaxrank run, shared across the mutant + WT predict_from_named_sequences calls.
  • Combined with topiary#123 (predict_wt=True), the ~40 lines of WT plumbing in vaxrank/epitope_logic.py:120–161 collapse to a single call with both directions cached.
  • The predictor_version stamping gives users a natural reproducibility artifact (current vaxrank has none).

Overall: this is a well-scoped part 1. Ship it; the concerns above are mostly docs/edge cases.

5.5.0 is a minor bump per project convention (new feature, non-
breaking).  CHANGELOG entry documents the full CachedPredictor
surface shipped in this PR: loaders, version invariant, opt-in
equivalence, mhcflurry auto-composition, fallback, and the related
mhctools#193 issue.
@iskandr iskandr merged commit 1ea85e9 into master Apr 14, 2026
8 checks passed
iskandr added a commit that referenced this pull request Apr 14, 2026
Addresses concrete concerns from the consumer review on PR #130:

1. **Preserve pre-loaded cache rows through fallback overlap.**
   The fallback typically returns rows for every allele it's
   configured with, not just the specific (peptide, allele) that
   missed.  A partial-allele cache (peptide P hit for A, miss for B)
   would see its (P, A) row silently overwritten when (P, B) triggered
   the fallback.  _fallback_resolve now filters fallback output to
   keys not already in the index before merging.

2. **Docstring note on silent peptide-length lock-in.**
   A cache loaded from a file containing only one peptide length
   (say 9-mers) will only scan 9-mer windows in
   predict_proteins_dataframe.  The current behavior is correct but
   easy to miss; the class docstring now calls it out explicitly so
   users check default_peptide_lengths after loading.

3. **Thread-safety disclaimer in the class docstring.**
   _fallback_resolve mutates _index and _df without locking; concurrent
   misses would race.  Single-threaded usage is the norm, but document
   the constraint so downstream code doesn't hit it by surprise.

4. **save() on an empty, never-queried cache raises.**
   A CachedPredictor(fallback=p) with no pre-loaded data and no
   queries yet has no (prediction_method_name, predictor_version) and
   no rows; saving produced a schema-only file that from_topiary_output
   couldn't round-trip.  Fail fast at save time with a message pointing
   users at the fix (run at least one query first).

3 new regression tests (41 total for cached, up from 38):
partial-allele preservation, save-before-query rejection, and
save-after-populate round-trip success.  Full suite 1093 passed
(up from 1090), lint clean.
iskandr added a commit that referenced this pull request Apr 14, 2026
Follow-up to #130 addressing concrete concerns from the vaxrank-consumer review (comment 4246617648):

- Partial-allele cache preservation: _fallback_resolve filters fallback output to keys not already in the index before merging, so pre-loaded rows aren't overwritten when the fallback covers overlapping alleles.
- Docstring notes on silent peptide-length lock-in and non-thread-safety.
- save() raises on empty never-queried caches so users don't write schema-only files that can't be round-tripped.

3 new regression tests (41 total for CachedPredictor, up from 38), full suite 1093 passing.

Concern #3 (_index memory footprint on whole-proteome caches) is deferred to a benchmark-driven follow-up.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants