Add CachedPredictor sharding — v5.6.0 (closes #128)#133

Merged
iskandr merged 1 commit into master from add-cache-sharding on Apr 15, 2026


@iskandr iskandr commented Apr 15, 2026

Summary

Final part of #128. Adds `CachedPredictor.concat([caches])` and `CachedPredictor.from_directory(path)` with a pluggable overlap-resolution policy, bumps the version to v5.6.0, and documents the full surface.

```python
CachedPredictor.concat([shard_a, shard_b, shard_c])
CachedPredictor.from_directory("caches/", pattern="*.parquet")
```

All shards must share `(predictor_name, predictor_version)` per the core invariant — the constructor's existing check fires when the combined DataFrame has mixed pairs.
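The invariant above can be illustrated with a minimal, library-independent sketch. It assumes rows are plain dicts carrying `predictor_name` and `predictor_version` keys; the helper name `check_single_predictor` is hypothetical and is not part of the `CachedPredictor` API, whose actual check lives in the constructor.

```python
from typing import Iterable, Mapping, Tuple


def check_single_predictor(rows: Iterable[Mapping[str, object]]) -> Tuple[object, object]:
    """Return the one (predictor_name, predictor_version) pair, or raise.

    Hypothetical helper for illustration only; the key names are assumptions.
    """
    # Collect every distinct pair seen across the combined rows.
    pairs = {(row["predictor_name"], row["predictor_version"]) for row in rows}
    if len(pairs) != 1:
        # Mixed (or missing) pairs violate the single-predictor invariant.
        raise ValueError(
            f"expected exactly one (predictor_name, predictor_version) pair, "
            f"found {len(pairs)}: {sorted(pairs, key=repr)}"
        )
    return next(iter(pairs))
```

Running a check like this on the combined rows, rather than shard by shard, mirrors the description above: the mismatch surfaces only once the shards are put together.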

Overlap resolution (`on_overlap=`)

  • `"raise"` (default) — fail if any `(peptide, allele, peptide_length)` appears in more than one shard; sample of conflicts in the error.
  • `"last"` — later shard in the input list wins.
  • `"first"` — earlier shard wins.
  • `callable(row_a, row_b) -> row` — user-supplied resolver, called pairwise per duplicate group. Canonical pattern: `lambda a, b: a if a["affinity"] <= b["affinity"] else b` (keep the stronger binder).
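The four policies can be sketched on plain Python rows to show how they differ. This is an illustrative stand-in under stated assumptions, not the library's implementation: `merge_shards` is a hypothetical name, rows are assumed to be dicts with `peptide`, `allele`, `peptide_length`, and `affinity` keys, and a duplicate inside a single shard is treated the same as a duplicate across shards.

```python
from typing import Callable, Dict, List, Tuple, Union

Row = Dict[str, object]
Key = Tuple[object, object, object]
Resolver = Callable[[Row, Row], Row]


def merge_shards(
    shards: List[List[Row]],
    on_overlap: Union[str, Resolver] = "raise",
) -> List[Row]:
    """Merge shards keyed on (peptide, allele, peptide_length).

    Hypothetical sketch of the on_overlap policies; not CachedPredictor.concat.
    """
    if not (callable(on_overlap) or on_overlap in ("raise", "first", "last")):
        raise ValueError(f"unknown on_overlap policy: {on_overlap!r}")

    merged: Dict[Key, Row] = {}
    conflicts: List[Key] = []
    for shard in shards:  # shards are visited in input-list order
        for row in shard:
            key = (row["peptide"], row["allele"], row["peptide_length"])
            if key not in merged:
                merged[key] = row
            elif callable(on_overlap):
                # Pairwise resolver: current winner vs. the newly seen row.
                merged[key] = on_overlap(merged[key], row)
            elif on_overlap == "last":
                merged[key] = row  # later shard wins
            elif on_overlap == "raise":
                conflicts.append(key)  # collect, then report a sample below
            # "first": keep the earlier row, nothing to do
    if conflicts:
        raise ValueError(
            f"{len(conflicts)} overlapping key(s), e.g. {conflicts[:5]}"
        )
    return list(merged.values())


shard_a = [{"peptide": "SIINFEKL", "allele": "HLA-A*02:01",
            "peptide_length": 8, "affinity": 120.0}]
shard_b = [{"peptide": "SIINFEKL", "allele": "HLA-A*02:01",
            "peptide_length": 8, "affinity": 45.0}]

# Canonical resolver from the description: keep the stronger binder.
stronger = lambda a, b: a if a["affinity"] <= b["affinity"] else b
merged_rows = merge_shards([shard_a, shard_b], on_overlap=stronger)
```

With these two shards, `"first"` keeps the 120.0 row, while `"last"` and the `stronger` resolver both keep the 45.0 row. The default `"raise"` fails because the same key appears in both shards.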

Version bump to v5.6.0

Accumulates three merged-but-unreleased work streams since v5.5.0:

  • #131: vaxrank polish
  • #132: NetMHC loaders
  • this PR: sharding

All additive / non-breaking → minor bump per project convention.

Test plan

  • 12 new sharding tests in `tests/test_cached_predictor.py` (59 total, up from 47)
  • `./test.sh` — 1111 passed, 3 skipped (up from 1099)
  • `./lint.sh` — clean
  • `mkdocs build --strict` — clean
  • CI green

Closes #128.

Final piece of the pluggable prediction-sources work.  concat()
merges multiple caches into one, from_directory() globs a directory
and concats every matching file through from_topiary_output.  All
shards must share (predictor_name, predictor_version) per the core
invariant.

Overlap resolution (on_overlap):
- "raise" (default) — fail if any (peptide, allele, peptide_length)
  appears in more than one shard; sample of conflicts in the error.
- "last" — later shard in the list wins.
- "first" — earlier shard wins.
- callable(row_a, row_b) -> row — user-supplied pairwise resolver
  per duplicate group.  Canonical pattern: keep the row with lower
  affinity (stronger binder).

Docs updated: new "Sharding" section in docs/cached.md with
examples for each policy; docs/api.md table extended with concat +
from_directory.

Version bumped to 5.6.0 (minor — new features since 5.5.0:
#131 vaxrank polish, #132 NetMHC loaders, this PR sharding).
CHANGELOG documents all three work streams under the 5.6.0
section.  12 new sharding tests (1111 total, up from 1099).
@iskandr iskandr merged commit bf01e57 into master Apr 15, 2026
7 checks passed
@coveralls

Coverage Status

coverage: 88.246% (+0.1%) from 88.11% — add-cache-sharding into master

iskandr added a commit that referenced this pull request Apr 15, 2026
README.md, docs/index.md, and docs/quickstart.md advertised
CachedPredictor's mhcflurry + TSV + topiary-round-trip loaders but
not the NetMHC-family loaders (#132) or sharding (#133) shipped in
v5.6.0.  docs/cached.md and docs/api.md already cover them
comprehensively; this commit brings the overview pages into parity
so readers discover the features without diving in first.

Development

Successfully merging this pull request may close these issues.

Pluggable prediction sources — external formats, topiary round-trip, sharded caches
