Add CachedPredictor — predictions-as-data core (part 1 of #128)#130
Conversation
… of #128) Core class + four loaders, no sharding or version-match tests yet. Implements enough of the mhctools predictor protocol (predict_peptides_dataframe, predict_proteins_dataframe, alleles, default_peptide_lengths) to drop into TopiaryPredictor(models=cache). Core invariant: a single CachedPredictor holds predictions from exactly one (prediction_method_name, predictor_version) pair — never mixes versions. Enforced at construction and on the first fallback call (lazy version-check against the fallback's output). Fallback semantics: - fallback=None (default): miss raises KeyError with missed peptides. - fallback=<predictor>: miss delegates to the fallback, result is merged back into the cache so subsequent queries serve locally. No populate-on-miss flag — caching fallback hits is always the right default for the expected (batch prediction) workload. Loaders: - from_dataframe: in-memory, with optional predictor_name / predictor_version backfill for DataFrames that lack those columns. - from_topiary_output: Parquet / TSV round-trip of topiary's own predict_* output schema. - from_tsv: generic tab- or comma-delimited third-party files with a caller-supplied column-name mapping. - from_mhcflurry: mhcflurry-predict CSV — maps the mhcflurry_* columns onto canonical affinity / percentile_rank / score. Defer to follow-up PRs: - from_netmhcpan (.xls parsing is meatier; separate scope). - concat / from_directory sharding. 23 tests covering construction, version invariant (mixed versions rejected, name/version round-trips as string), predict_peptides + predict_proteins sliding-window, fallback hit + miss + version mismatch, all four loaders, and end-to-end integration with TopiaryPredictor(models=cache).
Six add-ons from the PR #128 design discussion, bundled: 1. Empty-cache + fallback mode — CachedPredictor(fallback=p) with no df is now valid; (name, version) are discovered from the fallback's output on the first query and locked in thereafter. Pure read- through cache. 2. Reject null / NaN / empty-string on prediction_method_name and predictor_version at construction. Silent "I don't know" agreement would mask the invariant; force the user to supply a real identity. 3. also_accept_versions opt-in equivalence set. A fallback's version passes if it equals the cache's OR is in the caller's accepted set. Names stay strict — different predictors are different. Default (empty set) keeps today's strict equality. 4. mhcflurry_composite_version() helper and auto-default in from_mhcflurry. Discovers mhcflurry.__version__ plus mhcflurry.downloads.get_current_release() and composes e.g. "2.2.1+release-2.2.0". Users never enumerate model bundles — CachedPredictor.from_mhcflurry("predictions.csv") just works. Explicit predictor_version= still overrides when a custom label is needed. 5. from_mhcflurry docstring now explains the mhcflurry-specific package-version + model-release composition (scoped to this loader, not the generic class — NetMHCpan et al. have their model data baked into the binary and don't need composite versions). 6. Lookup refactor: one pass to identify peptides with any miss, one fallback call per round, one pass to assemble rows from the updated cache. Fixes a latent duplicate-row bug in the prior implementation (allele-level partial results + fallback union). 15 new tests across null-version rejection, empty-cache-plus-fallback identity discovery, equivalence-set acceptance vs. rejection, the helper's composition + error paths (stubbed mhcflurry module — avoids tensorflow/libomp collisions in the test env), and auto-compose vs. explicit-override for from_mhcflurry.
- docs/cached.md: new page covering motivation, all four loaders, the version invariant + also_accept_versions opt-in, mhcflurry's composite-version auto-composition, fallback semantics (including pure read-through), persistence, and when NOT to use the cache. - mkdocs.yml: wire "Cached Predictions" into the nav between "Protein Fragments" and "Ranking DSL". - docs/quickstart.md: new "From cached predictions" subsection at the end of the Python-API examples, linking to cached.md. - docs/api.md: "CachedPredictor" section with a loader-table, constructor knobs, and the mhcflurry_composite_version helper. - docs/index.md: add to the feature list and cross-link paragraph. - README.md: new top-level "Cached predictions" section. mkdocs build --strict passes.
Vaxrank-consumer reviewReading from vaxrank 2.1.2's angle — this lands on a real vaxrank pain point I hadn't explicitly asked for in #121: vaxrank calls Design calls I like
Concerns / suggestions
Vaxrank follow-up sketch (for my own tracking)Once this lands:
Overall: this is a well-scoped part 1. Ship it; the concerns above are mostly docs/edge cases. |
5.5.0 is a minor bump per project convention (new feature, non- breaking). CHANGELOG entry documents the full CachedPredictor surface shipped in this PR: loaders, version invariant, opt-in equivalence, mhcflurry auto-composition, fallback, and the related mhctools#193 issue.
Addresses concrete concerns from the consumer review on PR #130: 1. **Preserve pre-loaded cache rows through fallback overlap.** The fallback typically returns rows for every allele it's configured with, not just the specific (peptide, allele) that missed. A partial-allele cache (peptide P hit for A, miss for B) would see its (P, A) row silently overwritten when (P, B) triggered the fallback. _fallback_resolve now filters fallback output to keys not already in the index before merging. 2. **Docstring note on silent peptide-length lock-in.** A cache loaded from a file containing only one peptide length (say 9-mers) will only scan 9-mer windows in predict_proteins_dataframe. The current behavior is correct but easy to miss; the class docstring now calls it out explicitly so users check default_peptide_lengths after loading. 3. **Thread-safety disclaimer in the class docstring.** _fallback_resolve mutates _index and _df without locking; concurrent misses would race. Single-threaded usage is the norm, but document the constraint so downstream code doesn't hit it by surprise. 4. **save() on an empty, never-queried cache raises.** A CachedPredictor(fallback=p) with no pre-loaded data and no queries yet has no (prediction_method_name, predictor_version) and no rows; saving produced a schema-only file that from_topiary_output couldn't round-trip. Fail fast at save time with a message pointing users at the fix (run at least one query first). 3 new regression tests (41 total for cached, up from 38): partial-allele preservation, save-before-query rejection, and save-after-populate round-trip success. Full suite 1093 passed (up from 1090), lint clean.
Follow-up to #130 addressing concrete concerns from the vaxrank-consumer review (comment 4246617648): - Partial-allele cache preservation: _fallback_resolve filters fallback output to keys not already in the index before merging, so pre-loaded rows aren't overwritten when the fallback covers overlapping alleles. - Docstring notes on silent peptide-length lock-in and non-thread-safety. - save() raises on empty never-queried caches so users don't write schema-only files that can't be round-tripped. 3 new regression tests (41 total for CachedPredictor, up from 38), full suite 1093 passing. Concern #3 (_index memory footprint on whole-proteome caches) is deferred to a benchmark-driven follow-up.
Summary
First PR implementing #128 (pluggable prediction sources). Ships the `CachedPredictor` class, the core `(prediction_method_name, predictor_version)` invariant, fallback semantics, and four loaders. Does not include sharding (`concat` / `from_directory`) or the NetMHCpan .xls parser — separate follow-up PRs.
No version bump, no PyPI release: the feature lands across multiple PRs; release once #128 is substantially complete.
What's in
What's deferred
Test plan
Part 1 of #128.