v0.4.2: Native LIK in PyLate and colpali-engine

tonywu71 released this 09 Jun 13:29


0a89a6b

Added

Real-recipe e2e training benchmarks for ColQwen2 and PyLate
(bench_colpali_e2e.py, bench_pylate_e2e.py). Both instrument the loss
head to record per-MaxSim-call VRAM during training, replay each recorded
shape on an isolated graph (exact forward/saved/backward brackets), and
treat OOM as a recorded sweep outcome rather than a crash. --variant
vanilla|lik toggles the patch; summarize_*_e2e.py and
scripts/sky_*_e2e.yaml drive the sweep, with a fresh process per cell.

Measured on 1×H100 80 GB:

ColQwen2: the MaxSim op costs 7.81 GiB vanilla vs 61 MiB with LIK at B=128
(~130×), with step time at parity. Vanilla OOMs at B=128 (a 1.81 GiB
request with 25 GiB reserved but unallocated) where LIK trains, which
gives 2× batch headroom.

PyLate (grad-ckpt regime): step peak drops from 54.1 to 29.7 GiB at B=512,
steps run 1.07–1.12× faster, and B=1024 trains where vanilla OOMs.

The ColQwen2 bench targets the released colpali-engine 0.3.16 and shims
its two ContrastiveTrainer bugs under transformers 5.x (fixed upstream in
colpali#412, unreleased). Tables are in docs/benchmarks.md.
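
The "OOM as a recorded sweep outcome" and "fresh process per cell" ideas above can be sketched in plain Python. This is a minimal illustration and not the code in bench_colpali_e2e.py or bench_pylate_e2e.py: the child program, the run_cell / sweep names, and the simulated memory limit are all hypothetical, and MemoryError stands in for a real CUDA out-of-memory error.

```python
import json
import subprocess
import sys

# Hypothetical child program standing in for one training cell.
# It simulates an OOM by raising MemoryError above a made-up batch limit,
# and always reports its outcome as one JSON line instead of crashing.
CHILD = r"""
import json, sys
batch, limit = int(sys.argv[1]), int(sys.argv[2])
try:
    if batch > limit:
        raise MemoryError("simulated out-of-memory")
    print(json.dumps({"batch": batch, "status": "ok"}))
except MemoryError as exc:
    print(json.dumps({"batch": batch, "status": "oom", "error": str(exc)}))
"""


def run_cell(batch, limit):
    """Run one sweep cell in a fresh interpreter and return its record."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD, str(batch), str(limit)],
        capture_output=True,
        text=True,
        check=True,  # the child exits 0 even on a simulated OOM
    )
    return json.loads(proc.stdout)


def sweep(batches, limit):
    """Run every batch size in its own process; an OOM is just another row."""
    return [run_cell(batch, limit) for batch in batches]


if __name__ == "__main__":
    for record in sweep([64, 128], limit=64):
        print(record)
```

Running each cell in its own process means allocator state left behind by a failed cell cannot affect the next one, which is presumably why the sweep described above does the same.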

Changed

patch_pylate() / patch_colpali_engine() defer to the native LIK
backends. PyLate ≥ 1.5.1 (pylate#222) and colpali-engine ≥ 0.3.17
(colpali#412) now ship their own LIK dispatch (pip install "pylate[lik]" /
"colpali-engine[lik]", via auto / PYLATE_SCORES_BACKEND /
COLPALI_SCORES_BACKEND). On those versions the patches are deprecated
no-ops: they detect native support by package version and step aside
(patching PyLate would also break ColBERTScores, which forwards backend=).
Older versions are unaffected. The native backends call maxsim /
maxsim_pairs / maxsim_mps by keyword, so those signatures are now pinned
by a test.

benchmarks/ is grouped per comparison stack: kernels/ (including the
platform-specific bench_mps.py), plaid/, colpali/, and pylate/, with each
e2e bench next to its summarizer. These are pure moves: --only tags and
JSON output names are unchanged, so existing results stay comparable.
bench_lateon.py is renamed to kernels/bench_longdoc.py (its value lies in
the long-document regime, Ld up to 16 384), and the
sky_run_all_benchmarks.yaml RUN_ONLY tag lateon becomes longdoc.
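
The version-based step-aside described for patch_pylate() / patch_colpali_engine() above could look roughly like the sketch below. The minimum versions come from these notes; everything else (parse_version, has_native_lik, patch_if_needed, and the simplified version parsing that ignores pre-release suffixes) is a hypothetical illustration rather than the project's actual implementation.

```python
from importlib import metadata

# Minimum versions that ship native LIK dispatch, as stated in these notes.
NATIVE_LIK_MIN = {"pylate": (1, 5, 1), "colpali-engine": (0, 3, 17)}


def parse_version(text):
    """Turn '1.5.1' or '0.3.17.dev0' into a tuple of leading integers.

    Simplification: pre-release and dev suffixes are ignored.
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def has_native_lik(dist_name, installed=None):
    """Return True when the distribution is new enough to dispatch natively."""
    if installed is None:
        try:
            installed = metadata.version(dist_name)
        except metadata.PackageNotFoundError:
            return False
    return parse_version(installed) >= NATIVE_LIK_MIN[dist_name]


def patch_if_needed(dist_name, apply_patch, installed=None):
    """Step aside on native versions; otherwise apply the legacy patch."""
    if has_native_lik(dist_name, installed):
        return "native"
    apply_patch()
    return "patched"
```

For example, patch_if_needed("pylate", some_patch, installed="1.4.0") would call some_patch, while the same call with installed="1.5.2" would leave the library untouched.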

Fixed

patch_pylate() works on PyLate 1.5 again. PyLate 1.5 renamed the scoring
module (pylate.scores.scores → pylate.scores.colbert) and rerouted the
contrastive losses through ColBERTScores. The patch now detects the
layout, patches the defining module (which covers the loss path), and
rewrites only Distillation's import-time capture on 1.5. The pylate extra's
>=1.3.3,<2 range is accurate again, so the 1.3.3 pin is gone.
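
Detecting which module layout is installed, as described above, can be done by trying candidate module paths in order. The sketch below is an assumption about how such detection might work, not the patch's real code: find_first_module and PYLATE_SCORE_MODULES are hypothetical names, and only the two module paths come from these notes.

```python
import importlib

# Candidate order: the 1.5 layout first, then the older one.
PYLATE_SCORE_MODULES = ["pylate.scores.colbert", "pylate.scores.scores"]


def find_first_module(candidates):
    """Return (name, module) for the first importable candidate.

    Returns (None, None) when none of the candidates can be imported,
    for example when the package is not installed at all.
    """
    for name in candidates:
        try:
            return name, importlib.import_module(name)
        except ImportError:  # also covers ModuleNotFoundError
            continue
    return None, None
```

A caller would run find_first_module(PYLATE_SCORE_MODULES) and patch the returned module, so the same code path serves both layouts.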

Removed

The previous e2e training benches (bench_colpali_training.py,
bench_colpali_realdata.py, bench_pylate_training.py,
bench_pylate_realdata.py, bench_pylate_lateon.py), their shared
_bench_common.py, and the sky_colpali_benchmark.yaml /
sky_pylate_benchmark.yaml jobs, all superseded by the e2e harnesses above
(bench_colpali_loss.py is kept; historical numbers stay in
docs/benchmarks.md). Also removed are four stale one-offs:
bench_backward_0_5.py, bench_fastplaid.py, bench_training.py, and the
autotune-persistence reproducer (scripts/_bench_autotune_persistence.py +
scripts/sky_bench_autotune_persistence.yaml).
