Add support for late-interaction-kernels (LIK) by tonywu71 · Pull Request #412 · illuin-tech/colpali

tonywu71 · 2026-05-22T16:30:49Z

Summary

Adds an optional [lik] extra that routes ColBERT MaxSim scoring through late-interaction-kernels (LIK), a fused Triton kernel, on CUDA Ampere+ / Apple Silicon, with a transparent pure-torch fallback everywhere else. It is opt-in and feature-flagged (COLPALI_SCORES_BACKEND selects auto/torch/lik), with no change to the public API or training semantics: the kernel and the torch reference return the same scores and the same loss.

On the MaxSim operation itself the kernel's speedup is unambiguous: isolating the loss head on ColPali-like shapes (no encoder), LIK runs the forward+backward 2.5–4.3× faster at large batch×negatives, with the win growing with B × n_neg (LIK 0.4.1 benchmarks). In ColVision training, however, MaxSim is a few milliseconds inside a step dominated by the 2B-parameter model forward/backward, so the op speedup dilutes to per-step parity end-to-end. What survives the dilution is the memory win: measured at the op level, vanilla MaxSim costs 7.8 GiB of VRAM at B=128 where LIK costs 62 MiB (129×), and that B²-growing term is exactly what caps the trainable batch size. Removing it doubles the batch on an 80 GB H100 (vanilla OOMs at B=128, LIK trains it).

What this PR adds

colpali_engine/utils/maxsim.py: the dispatcher (maxsim_inbatch, maxsim_kd) selecting between the LIK backend and the torch einsum + amax + sum reference per COLPALI_SCORES_BACKEND.
colpali_engine/utils/_lik_backend.py: the lazily-imported LIK implementations, input validation, and the LIKUnsupportedError sentinel.
Routes score_multi_vector, the three in-batch ColBERT losses (ColbertLoss, ColbertPairwiseCELoss, ColbertSigmoidLoss), and the two negative-doc losses (ColbertNegativeCELoss, ColbertPairwiseNegativeCELoss, via LIK's kd_layout) through the dispatcher.
pyproject.toml: optional extra lik = ["late-interaction-kernels>=0.4.1,<0.5.0"]; README section documenting the extra and the env var.
CPU dispatch tests plus CUDA parity and training-smoke tests.
[Fix] Two transformers-5.x trainer fixes hit while validating the path (_get_train_sampler signature; single-dataset compute_loss prefixes).

The benchmarking harnesses used for the numbers below were added and then removed within this PR's history, keeping the final tree lean: the op-level VRAM harness and its results live at 1717e37, the original batch-size sweep at 2749bd5 (a pre-rebase commit GitHub keeps accessible).

Design

maxsim_inbatch(Q, D) handles the in-batch [B, Lq, d] x [B, Ld, d] grid (used by score_multi_vector and the in-batch losses); maxsim_kd(Q, D) handles the per-query candidate layout [B, N, Ld, d] (negative-doc losses). The LIK implementations live in a lazily-imported _lik_backend module that validates each call (CUDA Ampere+ or MPS, embedding dim above the kernel's tile floor, matching devices) and raises a LIKUnsupportedError sentinel when the kernel cannot run; real kernel errors always propagate. Both paths treat padded tokens as exactly-zero embeddings rather than an explicit mask; ColQwen2 already zeroes padded positions through the attention mask, so the scores match.
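For reference, the two layouts can be written out with the standard ColBERT MaxSim definition. This is a notational sketch rather than a statement of the dispatcher's exact contract: the index names are illustrative, and the [B, N] output shape for the candidate layout is an assumption inferred from the input shape.

```latex
% In-batch layout: Q \in \mathbb{R}^{B \times L_q \times d}, D \in \mathbb{R}^{B \times L_d \times d}
S_{a,b} = \sum_{i=1}^{L_q} \max_{1 \le j \le L_d} \langle Q_{a,i}, D_{b,j} \rangle,
\qquad a, b \in \{1, \dots, B\}

% Per-query candidate layout: D \in \mathbb{R}^{B \times N \times L_d \times d}
S_{b,n} = \sum_{i=1}^{L_q} \max_{1 \le j \le L_d} \langle Q_{b,i}, D_{b,n,j} \rangle,
\qquad n \in \{1, \dots, N\}
```

Under this definition a padded query token with an all-zero embedding has inner product 0 with every document token, so it adds 0 to the sum. A padded document token with an all-zero embedding contributes a candidate of exactly 0 to the max, and it does so identically on both backends.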

The design deliberately matches PyLate's integration (lightonai/pylate#222), so using both libraries means one mental model: the extra is [lik], the backend module split is the same, and COLPALI_SCORES_BACKEND (read per call) mirrors PYLATE_SCORES_BACKEND with the same three values: auto (default) uses LIK when eligible and silently falls back to torch, torch forces the reference, and lik is strict, raising LIKUnsupportedError instead of falling back.
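The three-value semantics can be sketched in a few lines of plain Python. This is a minimal illustration of the behaviour described above, not the code in colpali_engine/utils/maxsim.py: `select_backend`, the `lik_eligible` flag, the local `LIKUnsupportedError` definition, and the handling of unknown values are all hypothetical stand-ins.

```python
import os


class LIKUnsupportedError(RuntimeError):
    """Stand-in for the sentinel described above (hypothetical definition)."""


def select_backend(lik_eligible: bool) -> str:
    """Pick a backend name from COLPALI_SCORES_BACKEND, read on every call."""
    mode = os.environ.get("COLPALI_SCORES_BACKEND", "auto")
    if mode == "torch":
        # Always use the reference implementation.
        return "torch"
    if mode == "lik":
        # Strict mode: never fall back, raise if the kernel cannot run.
        if not lik_eligible:
            raise LIKUnsupportedError("LIK was requested but is not eligible for this call")
        return "lik"
    if mode == "auto":
        # Prefer the kernel, silently fall back to torch otherwise.
        return "lik" if lik_eligible else "torch"
    # Assumption: unknown values are rejected; the PR does not say how they are handled.
    raise ValueError(f"Unknown COLPALI_SCORES_BACKEND value: {mode!r}")
```

Reading the variable inside the function rather than at import time is what lets a test or a notebook flip the backend between calls without reloading the module.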

Results

The kernel itself is much faster than the einsum it replaces. LIK's 0.4.1 benchmarks isolate the loss head on ColPali-like shapes (Lq=32, Ld=1030, no encoder, forward+backward at matched numerics): the speedup climbs 1.13× → 4.31× as B × n_neg grows (2.50× at B256×n8, 4.31× at B256×n16), with ~25–30% lower peak memory on the head. In LIK's own words, this is "the throughput the encoder hides" in end-to-end training.

End-to-end, the model forward dilutes that speedup to parity. A ColQwen2 training step is dominated by the 2B-parameter doc/query towers; MaxSim is a few milliseconds of a >1 s step. We measure per-step parity (B=64: 7.19 vs 7.23 samples/s), and LIK's own end-to-end ColQwen2 table shows the same 0.97–1.02×. The dilution is mechanical, not a kernel property: on a 17M encoder, where MaxSim is a bigger slice of the step, the same kernel shows up as a 1.1–1.3× end-to-end speedup.
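The dilution follows from Amdahl's law. The sketch below is illustrative only: the 4.31× op speedup is the figure quoted above, while the fractions of step time spent in MaxSim are made-up values chosen to show the trend, not measurements from this PR.

```python
def end_to_end_speedup(op_fraction: float, op_speedup: float) -> float:
    """Amdahl's law: overall speedup when only a fraction of the step is accelerated."""
    if not 0.0 <= op_fraction <= 1.0:
        raise ValueError("op_fraction must be in [0, 1]")
    if op_speedup <= 0.0:
        raise ValueError("op_speedup must be positive")
    return 1.0 / ((1.0 - op_fraction) + op_fraction / op_speedup)


# Hypothetical shares of step time spent in MaxSim (assumptions, not measured).
for fraction in (0.005, 0.05, 0.25):
    overall = end_to_end_speedup(fraction, op_speedup=4.31)
    print(f"MaxSim share {fraction:.3f} -> end-to-end {overall:.3f}x")
```

When the op is a fraction of a percent of the step, even a 4× kernel win rounds to parity. The larger the op's share, the more of the kernel speedup shows up end-to-end.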

What survives at ColQwen2 scale is the memory win, measured at the op level. We instrumented the dispatcher during real ColQwen2 training steps, then replayed each recorded shape on an isolated graph where the op's backward can be measured exactly (the replayed forward numbers match the in-train ones to the MiB). The VRAM attributable to MaxSim:

batch size	vanilla: held	vanilla: bwd spike	vanilla: total	LIK: held	LIK: bwd spike	LIK: total
16	32 MiB	73 MiB	106 MiB	1 MiB	7 MiB	7 MiB
32	151 MiB	339 MiB	489 MiB	1 MiB	13 MiB	14 MiB
64	615 MiB	1.35 GiB	1.95 GiB	3 MiB	26 MiB	29 MiB
128	2.40 GiB	5.41 GiB	7.81 GiB	9 MiB	53 MiB	62 MiB

The score grid is fp32 in practice (autocast computes the embedding L2-norm in fp32 and the division promotes, so the loss runs on fp32 embeddings). At B=128 the [B,B,Lq,Ld] tensor is 2.4 GiB, held from the op's forward until its backward, where the op spikes another 2.25× that (the grid's gradient plus the amax scatter): 5.4 GiB. LIK holds only the [B,B] output and its backward allocates only the input gradients (dominated by grad_D), so its footprint grows linearly in B instead of quadratically: 62 MiB total at B=128, a 129× reduction.
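The headline ratios can be re-derived from the op-level table. The snippet below only redoes arithmetic on the reported totals (GiB entries converted to MiB), and the helper name `growth_exponent` is illustrative.

```python
import math

MIB_PER_GIB = 1024

# MaxSim-attributable VRAM totals in MiB, copied from the op-level table above.
vanilla_total = {16: 106, 32: 489, 64: 1.95 * MIB_PER_GIB, 128: 7.81 * MIB_PER_GIB}
lik_total = {16: 7, 32: 14, 64: 29, 128: 62}


def growth_exponent(totals: dict, b_small: int, b_large: int) -> float:
    """Empirical exponent p such that memory scales roughly like B**p between two batch sizes."""
    return math.log(totals[b_large] / totals[b_small]) / math.log(b_large / b_small)


reduction_at_128 = vanilla_total[128] / lik_total[128]
backward_over_held = 5.41 / 2.40  # vanilla backward spike relative to the held grid at B=128

print(f"reduction at B=128: {reduction_at_128:.0f}x")
print(f"backward spike / held grid: {backward_over_held:.2f}x")
print(f"vanilla exponent 64->128: {growth_exponent(vanilla_total, 64, 128):.2f}")
print(f"LIK exponent 64->128: {growth_exponent(lik_total, 64, 128):.2f}")
```

An exponent near 2 for vanilla and near 1 for LIK is the quadratic-versus-linear growth stated in the paragraph above.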

That op footprint is what sets the batch-size ceiling. Sweeping per_device_train_batch_size until OOM (ColQwen2 + LoRA, grad-checkpointing, bf16, 80 GB H100): whole-step peak allocated VRAM is identical while both fit, then splits at B=128.

batch size	vanilla (LIK off)	LIK on
16	10.9 GiB	10.9 GiB
32	17.1 GiB	17.1 GiB
64	29.5 GiB	29.5 GiB
128	OOM	54.4 GiB
256	OOM	OOM

Note

Vanilla and LIK look identical in the VRAM table up to B=64 because the score tensors are freed before the peak: the op's backward runs first, then the model backward where the peak lives. At B=128 vanilla dies not because it uses more peak memory, but because its score grid needs multi-GiB contiguous blocks that memory fragmentation makes impossible to satisfy (the observed OOM is a 1.81 GiB request failing while 25 GiB sit reserved but unallocated). LIK's 62 MiB fits in whatever scraps remain.

Vanilla maxes out at B=64, LIK at B=128 (2× headroom). At B=256 both paths OOM: pushing 256 pages of ~768 visual tokens each through the 2B doc tower is the limit, regardless of the score tensor. The first steps pay a one-time Triton autotune warmup that amortizes over a full run. The loss matches the torch reference within bf16 noise.

Full sweep table

1× H100, vidore/colqwen2-base, ColbertPairwiseCELoss, grad-checkpointing on, LIK 0.4.1. Fresh process per (B, backend) so an OOM is isolated.

batch size	vanilla fits?	vanilla peak alloc (MiB)	vanilla samples/s	LIK fits?	LIK peak alloc (MiB)	LIK samples/s
16	yes	11134	5.50	yes	11133	2.51
32	yes	17497	6.85	yes	17497	5.76
64	yes	30247	7.23	yes	30247	7.19
128	no (OOM)	54.0 GiB pre-OOM	OOM	yes	55691	7.98
256	no (OOM)	n/a	OOM	no (OOM)	n/a	OOM
384	no (OOM)	n/a	OOM	no (OOM)	n/a	OOM
512	no (OOM)	n/a	OOM	no (OOM)	n/a	OOM

Throughput is from 4-step runs, so it is autotune-warmup-affected (LIK looks slower at B=16 only because warmup dominates 4 steps). The point that matters: LIK runs B=128 at 7.98 samples/s, which vanilla cannot reach.

Reproduce

The op-level VRAM harness and its result JSONs were added and removed within this PR's history; check out 1717e37 to get both. The harness wraps maxsim_inbatch during training to record the forward peak and the bytes held for backward, then replays each recorded shape on an isolated graph to bracket the op's backward exactly (a grad hook cannot bracket it in-train: it fires as a pre-hook of the producing node, after the whole doc-tower backward).

git checkout 1717e37                                                  # op-level harness + results present here
pip install -e ".[train,lik]"
sky launch -c colpali-lik-vram bench_lik/sky_maxsim_vram_sweep.yaml   # B in {16..128} x {auto, torch}
rsync -avP 'colpali-lik-vram:sky_workdir/bench_lik/results/' ./bench_lik/results/
python bench_lik/summarize_maxsim_vram.py --results-dir bench_lik/results   # op table + maxsim_vram.png

The whole-step batch-size sweep (B up to 512) ran on an earlier harness iteration at 2749bd5 (sky_batch_sweep.yaml + summarize_sweep.py, pre-rebase commit kept accessible by GitHub). The CUDA test runner (scripts/sky_test_lik.yaml) lives at 7ae3402; the slow suite itself stays in-tree (pytest -m slow tests/utils/test_maxsim_cuda.py on a CUDA Ampere+ host).

Force a single run onto a backend with COLPALI_SCORES_BACKEND=auto|torch|lik (lik errors instead of silently falling back).

Next steps

When this PR is merged, drop the corresponding patches in the late-interaction-kernels repository.
Cut a new colpali-engine release once merged (0.3.17 if nothing else lands in between) so the [lik] extra is installable from PyPI.

ManuelFay

this is super cool !
I think we need to run a real training with this kernel to see before merging what the gains are: how much can we increase batch size by, how much more speed do we gain?

this would also be a good opportunity to reverify some of the training scripts and the doc here to make sure it's straightforward to do (i am not so sure).

@mlconti1 @antoineedy ?

- Add maxsim dispatcher (maxsim_inbatch, maxsim_kd) with a pure-torch einsum reference and a lazily-imported LIK backend
- Mirror PyLate's design: [lik] extra, COLPALI_SCORES_BACKEND env var (auto/torch/lik) read per call, LIKUnsupportedError sentinel
- Route score_multi_vector and the five ColBERT losses through the dispatcher (negative-doc losses via LIK's kd_layout)
- Add CPU dispatch tests plus CUDA parity and training-smoke tests
- Fix transformers-5.x trainer breakage (_get_train_sampler signature, single-dataset compute_loss prefixes)
- Document the extra and the backend toggle in README and CHANGELOG

- Add bench_train.py: runs training steps with the maxsim dispatcher instrumented (per-call forward peak and bytes held for backward), then replays each recorded shape on an isolated graph to bracket the op's backward exactly
- Add SkyPilot sweep over B in {16..128} x {auto, torch}, fresh process per cell so an OOM is isolated
- Add summarizer emitting the markdown table and the log-log plot
- Add train-subset loaders so runs skip the full 52 GB train set


Copilot

Pull request overview

Adds an opt-in MaxSim backend that can route late-interaction scoring through late-interaction-kernels (LIK) when eligible (CUDA Ampere+ / Apple Silicon), with a transparent pure-torch fallback otherwise. This integrates the dispatcher into scoring and ColBERT-style losses, adds an optional dependency extra, and introduces targeted tests and CI caching to keep the default CPU test suite stable.

Changes:

Introduce maxsim_inbatch / maxsim_kd dispatchers with a lazy LIK backend and env-var selection (COLPALI_SCORES_BACKEND=auto|torch|lik).
Route score_multi_vector and ColBERT losses through the dispatcher (keeping smooth-max on torch).
Add CPU/CUDA parity + smoke tests and CI Hugging Face cache support; document the new optional [lik] extra.

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 1 comment.


File	Description
`tests/utils/test_maxsim.py`	CPU-side dispatcher semantics + parity tests against the torch reference.
`tests/utils/test_maxsim_cuda.py`	CUDA-gated LIK forward/backward parity + training smoke tests for updated losses.
`tests/conftest.py`	CI-only Hugging Face cache-first monkeypatch to reduce flaky Hub/network dependency.
`README.md`	Document optional fused MaxSim kernels and `COLPALI_SCORES_BACKEND`.
`pyproject.toml`	Add `[lik]` optional extra and include it in `[all]`.
`colpali_engine/utils/processing_utils.py`	Switch `score_multi_vector` to use the MaxSim dispatcher.
`colpali_engine/utils/maxsim.py`	New MaxSim dispatcher + torch reference implementations.
`colpali_engine/utils/_lik_backend.py`	New lazily-imported LIK backend with eligibility checks and sentinel error.
`colpali_engine/trainer/contrastive_trainer.py`	Transformers 5.x compatibility fix + initialize dataset prefix fields for single-dataset path.
`colpali_engine/loss/late_interaction_losses.py`	Route hard-max paths through dispatcher for in-batch + negative-doc losses.
`CHANGELOG.md`	Changelog entry for the new optional `[lik]` backend and env var.
`.gitignore`	Ignore `uv.lock`.
`.github/workflows/test.yml`	Cache Hugging Face files in CI prior to running the test suite.


QuentinJGMace

Looks great ! Just waiting for some results on ViDoRe to confirm everything works seamlessly with no accuracy degradation.

Thanks for the contrib <3

tonywu71 · 2026-06-05T15:12:57Z

> Looks great ! Just waiting for some results on ViDoRe to confirm everything works seamlessly with no accuracy degradation.
>
> Thanks for the contrib <3

Hey @QuentinJGMace, tsm for the review! 🙌🏼

What experiment do you have in mind exactly? Imo running full end-to-end training on the entire train set is neither feasible nor necessary here. I already have tests showing that, given the same inputs, LIK and einsum produce the same outputs.

Here’s what I’d suggest: check that, with and without LIK, we get the same training loss after a given number of steps. Would that work for you?

QuentinJGMace · 2026-06-05T15:34:49Z

> > Looks great ! Just waiting for some results on ViDoRe to confirm everything works seamlessly with no accuracy degradation.
> > Thanks for the contrib <3
>
> Hey @QuentinJGMace, tsm for the review! 🙌🏼
>
> What experiment do you have in mind exactly? Imo running full end-to-end training on the entire train set is neither feasible nor necessary here. I already have tests showing that, given the same inputs, LIK and einsum produce the same outputs.
>
> Here’s what I’d suggest: check that, with and without LIK, we get the same training loss after a given number of steps. Would that work for you?

I agree that training end to end is not completely necessary, I just thought you had done it.

I think a convincing experiment would be to have 3 runs of 50-100 steps (same seed for each collator):

1 on the current colpali main version
1 on the version from this branch without using lik
1 on the version from this branch with lik

It would ensure that lik does not break anything (even if as I understand it, we're already pretty sure of that) and that the code changes done with this PR did not change the way training behaved with previous versions.

wdyt ?

tonywu71 · 2026-06-05T15:40:50Z

> I think a convincing experiment would be to have 3 runs of 50-100 steps (same seed for each collator):
>
> 1 on the current colpali main version
>
> 1 on the version from this branch without using lik
>
> 1 on the version from this branch with lik
>
> It would ensure that lik does not break anything (even if as I understand it, we're already pretty sure of that) and that the code changes done with this PR did not change the way training behaved with previous versions.
>
> wdyt ?

Sounds good, I'm on it! I'll share the details on my experiments in a comment not to pollute the commit history.

tonywu71 · 2026-06-05T16:09:22Z

@QuentinJGMace Ran the 3 trainings you asked for. Same seed (seed=42, data_seed=42), same collator, same data order, 50 steps each on a single H100.

Settings

Model: vidore/colpaligemma-3b-pt-448-base (bf16, sdpa) + LoRA r=32 on the language model layers and custom_text_proj
Loss: ColbertPairwiseCELoss (in-batch, so it exercises the maxsim_inbatch dispatch path)
Data: first 256 rows of vidore/colpali_train_set, batch size 4, lr 5e-5, no warmup
Stack: transformers 5.10.2, torch 2.11.0, late-interaction-kernels 0.4.1
Both installs share the exact same dependency set; the main run swaps in upstream colpali_engine (2e0b927) with --no-deps, so the only variable is this PR's code
Run 3 used COLPALI_SCORES_BACKEND=lik in strict mode, which raises instead of falling back, so the kernel is guaranteed to have run

Results

	main	branch (torch)	branch (LIK)
loss @ step 1	0.6938	0.6938	0.6938
loss @ step 50	0.6887	0.6871	0.6880
mean loss	0.6903	0.6896	0.6901
max Δ vs main	–	0.0039	0.0033
mean Δ vs main	–	0.0009	0.0008

max |LIK − torch| on the branch is 0.0039 (mean 0.0008). For scale: main vs branch-torch run the same math, and they differ by just as much. The LIK deltas sit at bf16 step-to-step noise level, and the three curves overlap.
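As a rough illustration of how such deltas are computed, the snippet below uses only the two loss values per run reported in the table (steps 1 and 50). The full 50-step curves are not reproduced in this thread, so these two-point deltas are necessarily smaller than or equal to the reported maxima. The helper name `abs_deltas` is illustrative.

```python
# Loss values copied from the results table above (steps 1 and 50 only).
reported = {
    "main": {1: 0.6938, 50: 0.6887},
    "branch (torch)": {1: 0.6938, 50: 0.6871},
    "branch (LIK)": {1: 0.6938, 50: 0.6880},
}


def abs_deltas(reference: dict, other: dict) -> dict:
    """Per-step absolute loss difference between a run and the reference run."""
    return {step: abs(other[step] - reference[step]) for step in reference}


for name in ("branch (torch)", "branch (LIK)"):
    deltas = abs_deltas(reported["main"], reported[name])
    print(name, {step: round(delta, 4) for step, delta in deltas.items()})
```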

One caveat to be transparent about

Current main can't run this training as-is. With the declared dependency range (transformers>=5.3,<6) it crashes twice in the single-dataset path:

_get_train_sampler(self) doesn't accept the dataset argument that every transformers 5.x release passes positionally (TypeError at dataloader build)
query_prefix/pos_prefix/neg_prefix are only set in the multi-dataset branch of get_train_dataloader, but compute_loss reads them on every step (AttributeError)

This branch already fixes both, which is why runs 2 and 3 needed no changes. For run 1, I shimmed main in the harness only (adapted the sampler call convention and primed the prefix attributes from the collator), with zero changes to any scoring or loss math. The produced sampler is identical, since transformers passes self.train_dataset and main falls back to that same attribute.
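The shape of the two fixes can be sketched without transformers. The class below is a hypothetical stand-in, not the actual ContrastiveTrainer: the default prefix values and the list-of-indices "sampler" are placeholders chosen so the example runs on its own.

```python
class SamplerCompatSketch:
    """Toy stand-in showing the two single-dataset fixes (not the real trainer)."""

    def __init__(self, train_dataset):
        self.train_dataset = train_dataset
        # Fix 2: set the prefix attributes up front so a compute_loss-style
        # method can read them on every step, even on the single-dataset path.
        # The empty-string defaults are an assumption made for this sketch.
        self.query_prefix = ""
        self.pos_prefix = ""
        self.neg_prefix = ""

    def _get_train_sampler(self, train_dataset=None):
        # Fix 1: accept the dataset argument that transformers 5.x passes
        # positionally, and fall back to self.train_dataset when it is absent.
        dataset = train_dataset if train_dataset is not None else self.train_dataset
        # Placeholder "sampler": sequential indices over the chosen dataset.
        return list(range(len(dataset)))


trainer = SamplerCompatSketch(train_dataset=["a", "b", "c"])
print(trainer._get_train_sampler())                       # old call convention
print(trainer._get_train_sampler(trainer.train_dataset))  # transformers 5.x convention
```

Both call conventions resolve to the same dataset here, which mirrors the observation above that the sampler produced under the shim is identical.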

One shim applied to all three runs equally: transformers 5.x leaves the tied lm_head.weight on the meta device when loading ColPali, so the harness re-ties it to the input embeddings before training. ColPali never uses the LM head.

The LIK run's wall time (54s vs 21s for torch) is the cold-start autotune sweep discussed earlier in this thread. It's a fixed one-off cost, irrelevant beyond a few steps.


QuentinJGMace · 2026-06-08T14:06:04Z

Thanks ! merging now

tonywu71 mentioned this pull request May 22, 2026

Add support for late-interaction-kernels tonywu71/colpali#1

Closed

tonywu71 changed the title ~~Add support for late interaction kernels~~ Add support for late-interaction-kernels (LIK) May 22, 2026

ManuelFay reviewed May 22, 2026


tonywu71 force-pushed the add-support-for-late-interaction-kernels branch from 4a733b6 to 1223fbb Compare June 4, 2026 17:55

tonywu71 added 4 commits June 4, 2026 22:45

test: add MaxSim VRAM sweep results

1717e37

8 cells from a 1x H100 run (LIK 0.4.1, ColQwen2 + LoRA): per-op forward/held/backward VRAM plus whole-step peaks; the vanilla B=128 cell records the fragmentation OOM message.

chore: remove benchmarking harness and results

1d3363c

Keep the final tree lean: the harness and results stay reachable at the two prior commits, referenced from the PR description.

tonywu71 force-pushed the add-support-for-late-interaction-kernels branch from cacf178 to 1d3363c Compare June 4, 2026 20:48

tonywu71 added 2 commits June 4, 2026 22:58

docs: scope the LIK batch-size claim to the measured benchmark

10cf76c

ci: make tests resilient to Hugging Face Hub rate limiting

dd096be

- Cache ~/.cache/huggingface across runs, keyed on test file contents
- Serve cached files without network calls in CI via a conftest shim
- Surface .no_exist markers as EntryNotFoundError, matching online 404 handling

tonywu71 force-pushed the add-support-for-late-interaction-kernels branch from a758065 to dd096be Compare June 5, 2026 09:55

tonywu71 mentioned this pull request Jun 5, 2026

Rebuild the e2e training benchmarks around per-op MaxSim VRAM (ColQwen2 + PyLate) hcompai/late-interaction-kernels#102

Merged

QuentinJGMace requested review from QuentinJGMace and Copilot June 5, 2026 13:55

Copilot started reviewing on behalf of QuentinJGMace June 5, 2026 13:55

Copilot AI reviewed Jun 5, 2026


Comment thread tests/conftest.py

fix: return real snapshot path from cache shim instead of empty string

0c8ea96

QuentinJGMace approved these changes Jun 5, 2026


tonywu71 and others added 2 commits June 5, 2026 18:13

docs: document ContrastiveTrainer single-dataset fixes in changelog

0b3afd6

chore: add Aurélien Lac as co-author

7cd6b0e

Co-authored-by: Aurélien Lac <56725662+uminaty@users.noreply.github.com>

QuentinJGMace merged commit 0c630e3 into illuin-tech:main Jun 8, 2026
6 checks passed

h-aurelien-lac mentioned this pull request Jun 8, 2026

fix: defer to native LIK backends in PyLate & colpali-engine hcompai/late-interaction-kernels#105

Merged


QuentinJGMace merged 9 commits into illuin-tech:main from tonywu71:add-support-for-late-interaction-kernels
