v0.4.4: Perf roundup — one-shot norm check, shape-stable compiles, autotuned backwards
LatestFixed
maxsim(normalize=False)ran a.item()norm-check device sync on every
call; it now runs once per process, restoring CUDA-graph capture on the
PyLate / colpali-engine hot path.
Changed
- PLAID centroid codes are handled as int32 end to end (any integer dtype is
still accepted; out-of-range codes inplaid_approx_scoreclamp to
centroid 0). - Batch sizes (
Nq,Nd) are runtime kernel arguments, not constexpr — no
more recompiles per batch shape under dynamic batching. - GPU family detection is keyed on compute capability, not the device name;
Blackwell now gets first-class autotune configs. - The varlen, packed-pairs and residual backward launches are autotuned like
the dense backwards (up to 1.42× on H100 at training shapes). - Backward gradient buffers that the kernels overwrite in full use
torch.emptyinstead oftorch.zeros(atomic-scatter buffers stay zeroed). - The fp8 autotune key no longer splits on mask presence.
Full rationale and H100 measurements in #111.