Add CachedPredictor sharding - v5.6.0 (closes #128) #133
Merged
Final piece of the pluggable prediction-sources work. concat() merges multiple caches into one; from_directory() globs a directory and concats every matching file through from_topiary_output. All shards must share (predictor_name, predictor_version) per the core invariant.

Overlap resolution (on_overlap):

- "raise" (default): fail if any (peptide, allele, peptide_length) appears in more than one shard; the error includes a sample of conflicts.
- "last": later shard in the list wins.
- "first": earlier shard wins.
- callable(row_a, row_b) -> row: user-supplied pairwise resolver per duplicate group. Canonical pattern: keep the row with lower affinity (stronger binder).

Docs updated: new "Sharding" section in docs/cached.md with examples for each policy; docs/api.md table extended with concat + from_directory.

Version bumped to 5.6.0 (minor; new features since 5.5.0: #131 vaxrank polish, #132 NetMHC loaders, this PR sharding). CHANGELOG documents all three work streams under the 5.6.0 section.

12 new sharding tests (1111 total, up from 1099).
iskandr added a commit that referenced this pull request on Apr 15, 2026:
README.md, docs/index.md, and docs/quickstart.md advertised CachedPredictor's mhcflurry + TSV + topiary-round-trip loaders but not the NetMHC-family loaders (#132) or sharding (#133) shipped in v5.6.0. docs/cached.md and docs/api.md already cover them comprehensively; this commit brings the overview pages into parity so readers can discover the features without first digging into the detailed docs.
Summary
Final part of #128. Adds `CachedPredictor.concat([caches])` and `CachedPredictor.from_directory(path)` with a pluggable overlap-resolution policy, bumps to v5.6.0, documents the full surface.
```python
CachedPredictor.concat([shard_a, shard_b, shard_c])
CachedPredictor.from_directory("caches/", pattern="*.parquet")
```
All shards must share `(predictor_name, predictor_version)` per the core invariant — the constructor's existing check fires when the combined DataFrame has mixed pairs.
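The glob-then-concat flow and the shared-pair invariant can be illustrated without the library. The following is a minimal standalone sketch, not the project's implementation: `read_shard` is a hypothetical stand-in for the real loader (`from_topiary_output`), shards are assumed to be tab-separated files carrying `predictor_name` and `predictor_version` columns, and visiting files in sorted order is an assumption made here for determinism.

```python
import csv
from pathlib import Path


def read_shard(path):
    """Hypothetical loader: read one tab-separated shard into row dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def concat_from_directory(directory, pattern="*.tsv"):
    """Glob `directory`, read every matching shard, and concatenate the rows.

    Files are visited in sorted order (an assumption of this sketch) so that
    "earlier" and "later" shards are well defined.
    """
    paths = sorted(Path(directory).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {directory}")

    rows = []
    for path in paths:
        rows.extend(read_shard(path))

    # Invariant: the combined rows carry exactly one
    # (predictor_name, predictor_version) pair.
    pairs = {(r["predictor_name"], r["predictor_version"]) for r in rows}
    if len(pairs) != 1:
        raise ValueError(f"expected one predictor pair, found: {sorted(pairs)}")
    return rows
```

Running the check on the combined rows, rather than per file, mirrors the description above: a single mixed shard is enough to make the whole load fail.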
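Overlap resolution (`on_overlap=`)
- `"raise"` (default): fail if any `(peptide, allele, peptide_length)` appears in more than one shard; the error includes a sample of conflicts.
- `"last"`: later shard in the list wins.
- `"first"`: earlier shard wins.
- `callable(row_a, row_b) -> row`: user-supplied pairwise resolver per duplicate group.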
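How the `on_overlap` policies (`"raise"`, `"first"`, `"last"`, or a callable) might behave can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: shards are modeled as lists of row dicts, `resolve_overlaps` and `keep_stronger_binder` are hypothetical names, and the `affinity` column name is assumed.

```python
from collections import defaultdict

KEY = ("peptide", "allele", "peptide_length")


def resolve_overlaps(shards, on_overlap="raise"):
    """Merge row-dict shards, resolving duplicate keys per `on_overlap`."""
    if not callable(on_overlap) and on_overlap not in ("raise", "first", "last"):
        raise ValueError(f"unknown on_overlap policy: {on_overlap!r}")

    # Group rows by key; each group keeps rows in shard order.
    groups = defaultdict(list)
    for shard in shards:
        for row in shard:
            groups[tuple(row[k] for k in KEY)].append(row)

    if on_overlap == "raise":
        conflicts = [key for key, rows in groups.items() if len(rows) > 1]
        if conflicts:
            # Report only a sample so the message stays readable.
            raise ValueError(
                f"{len(conflicts)} overlapping keys, e.g. {conflicts[:5]}"
            )

    merged = []
    for rows in groups.values():
        if on_overlap == "last":
            winner = rows[-1]
        elif callable(on_overlap):
            # Fold the pairwise resolver across the duplicate group.
            winner = rows[0]
            for other in rows[1:]:
                winner = on_overlap(winner, other)
        else:  # "first", or "raise" with no conflicts (single-row groups)
            winner = rows[0]
        merged.append(winner)
    return merged


def keep_stronger_binder(row_a, row_b):
    """Example resolver: lower affinity value means a stronger binder."""
    return row_a if row_a["affinity"] <= row_b["affinity"] else row_b
```

Passing `on_overlap=keep_stronger_binder` corresponds to the "keep the row with lower affinity" pattern; because the resolver is applied pairwise, it also handles keys that appear in three or more shards.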
Version bump to v5.6.0
Accumulates three merged-but-unreleased work streams since v5.5.0:
- #131: vaxrank polish
- #132: NetMHC loaders
- #133 (this PR): sharding
All additive / non-breaking → minor bump per project convention.
Test plan
Closes #128.