v5.6.0 - Fixes for Causal LM Rerankers, Hard-Negative Mining, and More

@tomaarsen tomaarsen released this 16 Jun 14:02
· 1 commit to main since this release

This minor version is a correctness- and robustness-focused release. It fixes a silent scoring bug for causal-LM rerankers, corrects several hard-negative mining and GIST loss edge cases, restores TSDAE on transformers v5, and adds Apple Silicon (MPS) support for the cached losses.

The headline fix affects chat-template models that read the final token position, i.e. causal-LM rerankers (like Qwen3-Reranker) and last-token-pooling embedders: when an over-long input was truncated, the chat template's trailing suffix (e.g. the assistant prefill the model scores from) was silently dropped, producing wrong scores with no error. There's also a forward-looking deprecation: loading local custom code without trust_remote_code=True now emits a warning, and from v6.0 trust_remote_code=True will be required.

Install this version with

# Training + Inference
pip install sentence-transformers[train]==5.6.0

# Inference only, use one of:
pip install sentence-transformers==5.6.0
pip install sentence-transformers[onnx-gpu]==5.6.0
pip install sentence-transformers[onnx]==5.6.0
pip install sentence-transformers[openvino]==5.6.0

# Multimodal dependencies (optional):
pip install sentence-transformers[image]==5.6.0
pip install sentence-transformers[audio]==5.6.0
pip install sentence-transformers[video]==5.6.0

# Or combine as needed:
pip install sentence-transformers[train,onnx,image]==5.6.0

Fixed silently wrong scores when truncation drops chat-template suffixes (#3787)

Chat-template models render the full conversation to a flat string before tokenizing. When the rendered input is longer than the tokenizer's model_max_length, the tokenizer truncates it from the right and drops the template's trailing suffix, i.e. the fixed tokens a template appends after the content (e.g. a prompt, instruction, [/INST], or a trailing EOS). For models that read the final token position, this silently corrupted the result:

  • causal-LM rerankers (e.g. Qwen/Qwen3-Reranker-0.6B) score a pair from the last token's yes/no logits, and
  • last-token-pooling embedders read the final hidden state.

When the suffix was truncated away, that final position landed mid-document instead of after the prefill, so the score or embedding came from the wrong place.

Transformer.preprocess now detects when truncation drops the suffix and splices it back onto the tail of each truncated row. Because the fix lives in the shared base Transformer, it applies across SentenceTransformer, CrossEncoder, and SparseEncoder. It's enabled by default and saved to the model configuration. Pass processing_kwargs={"chat_template": {"restore_suffix": False}} to opt back into raw truncation.
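As a rough illustration of the idea (not the library's actual implementation), the sketch below contrasts plain right truncation with truncation that splices the suffix back on. The function name truncate_keep_suffix and the toy token ids are hypothetical, and the sketch assumes the rendered input already ends with the suffix and that the suffix is shorter than the length limit.

```python
def truncate_keep_suffix(token_ids, suffix_ids, max_length):
    """Truncate from the right, then splice the template suffix back on.

    Hypothetical helper: assumes token_ids already ends with suffix_ids
    (as a rendered chat template would) and that the suffix fits.
    """
    if len(token_ids) <= max_length:
        return list(token_ids)  # nothing was truncated
    if len(suffix_ids) >= max_length:
        raise ValueError("suffix does not fit within max_length")
    keep = max_length - len(suffix_ids)  # room left for the content
    return list(token_ids[:keep]) + list(suffix_ids)


# Toy ids: 10 content tokens followed by a 2-token suffix (ids 100, 101).
content = list(range(10))
suffix = [100, 101]
rendered = content + suffix

naive = rendered[:8]  # plain right truncation: the suffix is lost
restored = truncate_keep_suffix(rendered, suffix, max_length=8)

print(naive)     # [0, 1, 2, 3, 4, 5, 6, 7]
print(restored)  # [0, 1, 2, 3, 4, 5, 100, 101]
```

With plain truncation the final position is content token 7, i.e. mid-document; with the suffix restored, the final position is the last suffix token again, at the cost of two fewer content tokens.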

Hard-negative mining and GIST loss correctness (#3821, #3817, #3816)

A trio of correctness and scalability fixes for hard-negative mining and the GIST losses:

  • Sign-independent relative margin: mine_hard_negatives(relative_margin=...) and the margin_strategy="relative" branch of GISTEmbedLoss / CachedGISTEmbedLoss used a multiplicative threshold (positive * (1 - margin)) that only behaves correctly when the positive-pair similarity is positive. When that similarity was negative, the threshold moved the wrong way and let through false negatives: candidates more similar to the anchor than the true positive. The threshold is now positive - |positive| * margin, identical to before for positive scores but correct for negative ones.
  • Distributed positive masking in the GIST losses: with gather_across_devices=True and a non-zero margin, the false-negative suppression mask protected the wrong columns on ranks beyond the first (it ignored the per-rank offset into the gathered batch), which set the true positive's logit to -inf and produced a +inf loss. The mask now accounts for the cross-rank offset, so multi-GPU GIST training stays finite.
  • Memory-bounded mining without FAISS: mine_hard_negatives(use_faiss=False) (the default) materialized the full (queries × corpus) similarity matrix at once, which could OOM on large corpora. It now batches over the query axis (controlled by faiss_batch_size, default 16384), bounding peak memory while producing identical results.
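The relative-margin change in the first bullet can be checked with a little arithmetic. The sketch below evaluates both threshold formulas from the text; the function names are hypothetical, and it assumes a candidate is accepted as a negative only when its similarity is below the threshold. The values are chosen to be exactly representable in floating point.

```python
def old_threshold(positive, margin):
    # Multiplicative form used before v5.6.0.
    return positive * (1 - margin)


def new_threshold(positive, margin):
    # Sign-independent form used from v5.6.0.
    return positive - abs(positive) * margin


margin = 0.25

# Positive similarity: both formulas agree.
print(old_threshold(0.5, margin), new_threshold(0.5, margin))    # 0.375 0.375

# Negative similarity: the old threshold lands ABOVE the positive score.
print(old_threshold(-0.5, margin), new_threshold(-0.5, margin))  # -0.375 -0.625

# A candidate at -0.4 is more similar to the anchor than the positive (-0.5).
candidate = -0.4
print(candidate < old_threshold(-0.5, margin))  # True: wrongly accepted
print(candidate < new_threshold(-0.5, margin))  # False: correctly rejected
```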

TSDAE weight tying restored on transformers v5 (#3781)

transformers v5 removed the private PreTrainedModel._tie_encoder_decoder_weights helper that DenoisingAutoEncoderLoss (TSDAE) used to tie its separate encoder and decoder. As a stopgap, v5.5 raised a RuntimeError for the default tie_encoder_decoder=True on transformers >= 5.0.0, effectively breaking TSDAE there unless you pinned an older transformers or disabled tying. TSDAE now ships its own tying routine that shares storage between encoder and decoder, so it works on both transformers <5 and >=5 with the default settings.

Deprecation: loading local custom code without trust_remote_code (#3807)

Sentence Transformers has historically treated any local model directory as implicitly trusted: local custom code (e.g. modeling_*.py) loaded even with trust_remote_code=False, unlike transformers. Because this discrepancy can be surprising, loading local custom code without trust_remote_code=True now emits a FutureWarning, and from v6.0 it will require trust_remote_code=True, matching transformers.
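One way to find affected call sites before v6.0 is to promote FutureWarning to an error in tests, using Python's standard warnings module. The sketch below is a minimal, self-contained illustration: load_local_model is a hypothetical stand-in for a loader, not the library's API, and it warns unconditionally when trust_remote_code is false, whereas the real warning applies only to directories that contain custom code.

```python
import warnings


def load_local_model(path, trust_remote_code=False):
    """Hypothetical stand-in that mimics the deprecation described above."""
    if not trust_remote_code:
        warnings.warn(
            "Loading local custom code without trust_remote_code=True is deprecated.",
            FutureWarning,
            stacklevel=2,
        )
    return f"model loaded from {path}"


def loads_cleanly(**kwargs):
    """Return True if loading raises no FutureWarning."""
    with warnings.catch_warnings():
        # Inside this block, any FutureWarning is raised as an exception.
        warnings.simplefilter("error", FutureWarning)
        try:
            load_local_model("path/to/local/model", **kwargs)
        except FutureWarning:
            return False
    return True


print(loads_cleanly(trust_remote_code=False))  # False
print(loads_cleanly(trust_remote_code=True))   # True
```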

Apple Silicon (MPS) support (#3812, #3818)

Two fixes for training on Apple Silicon:

  • Cached losses on MPS: CachedMultipleNegativesRankingLoss and CachedGISTEmbedLoss crashed at construction on MPS because their RandContext used a CUDA-only RNG path. They now run on MPS with deterministic replay preserved.
  • Legacy fit path and SparseEncoder sparsity on MPS: the legacy model.fit(..., use_amp=True) path hard-coded CUDA's AMP GradScaler / autocast, and SparseEncoder sparsity statistics called to_sparse_csr(), which is unimplemented on MPS. Both now work on Apple Silicon.

Bug Fixes

  • Cast learning-to-rank loss logits to float32 in #3800: the listwise learning-to-rank losses scatter the model's logits into a float32 matrix, which crashed with a dtype mismatch when the model itself was in bf16/fp16 and trained without bf16=True/fp16=True (with those enabled, autocast outputs float32 logits, so the common path was unaffected). Logits are now upcast to float32 in the loss.
  • Don't override device_map placement with the device argument in #3823: loading with model_kwargs={"device_map": ...} previously placed the backbone via accelerate, then immediately moved it to the default device, defeating device_map. It now keeps the backbone in place (moving the other modules onto its device), and warns if both device and device_map are passed.
  • Collapse single-key multimodal dicts to the bare modality in #3779: a single-key input like {"image": img} was classified as a combined modality and rejected by models that support only that one modality (e.g. BGE-VL), with a self-contradicting error. It is now treated as the bare "image" input, which unblocks vision-retrieval benchmarks like MTEB that pass {"image": ...}.
  • Clarify unsupported-modality error messages in #3792: mixed or combined multimodal inputs on models that can't fuse modalities produced confusing errors (e.g. Modality 'message' is not supported). The errors are now scenario-specific and suggest what to do, such as encoding each modality separately.
  • Guard distributed APIs in get_device_name in #3798: on PyTorch builds where torch.distributed is present but unavailable (some ROCm and CPU-only builds), get_device_name() crashed with AttributeError: module 'torch.distributed' has no attribute 'is_initialized'. It now checks is_available() first, across all distributed call sites.
  • Fix OpenVINO static quantization for optimum-intel 2.0 / OpenVINO 2026 in #3814: export_static_quantized_openvino_model defaulted its calibration dataset to the bare id "glue", which the stricter Hub repo-id validation now rejects. The default is now the namespaced "nyu-mll/glue".
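The guard described for get_device_name follows a general pattern: call is_available() before touching any other distributed API. The sketch below shows that pattern on stub objects so it runs without PyTorch; the helper name distributed_is_initialized is hypothetical, and in practice the argument would be the torch.distributed module.

```python
from types import SimpleNamespace


def distributed_is_initialized(dist):
    """Return True only if the backend is available AND initialized.

    The `and` short-circuits, so is_initialized() is never looked up
    when is_available() returns False.
    """
    return bool(dist.is_available() and dist.is_initialized())


# Stub for a build where the module exists but is unavailable:
# it has no is_initialized attribute at all.
unavailable = SimpleNamespace(is_available=lambda: False)

# Stub for a fully working, initialized backend.
ready = SimpleNamespace(is_available=lambda: True, is_initialized=lambda: True)

print(distributed_is_initialized(unavailable))  # False
print(distributed_is_initialized(ready))        # True
```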

Examples, Documentation, and Notebooks

This release also modernizes a batch of examples and documentation, mostly by migrating example scripts off deprecated datasets script-loaders and bare ids onto maintained Hugging Face datasets so they run on datasets 4.x:

  • Fix example datasets that crash on datasets 4.x in #3782 (e.g. quora, nq_open, yahoo_answers_topics).
  • Migrate the MS MARCO examples to datasets in #3783.
  • Migrate the multilingual parallel-sentences data-prep scripts to datasets in #3784.
  • Migrate the Quora semantic-search and clustering examples to datasets in #3785.
  • Modernize the AugSBERT data-augmentation STSb scripts in #3806, also fixing the QQP cross-domain script to use the quora-duplicates labels it previously ignored.
  • Refresh the CLIP / image-search notebooks in #3780, and modernize the CLIP training notebook in #3805.
  • Fix a reStructuredText table in the ContrastiveTensionLoss docstring in #3788, and add a low-VRAM hardware note to the efficiency docs in #3802.
  • Fix package build warnings and errors in #3809, and a batch of Sphinx doc-build problems in #3811.

All Changes

  • [chore] Increment dev version by @tomaarsen in #3775
  • [fix] Collapse single-key multimodal dicts to bare modality by @tomaarsen in #3779
  • chore: enable Dependabot weekly GitHub Actions bumps by @hf-dependantbot-rollout[bot] in #3786
  • Bump the actions group with 3 updates by @dependabot[bot] in #3790
  • Fix example datasets that crash on datasets 4.x by @omkar-334 in #3782
  • CLIP Notebooks Refresh by @lbourdois in #3780
  • fix TSDAE weight tying on transformers v5 by @omkar-334 in #3781
  • Migrate MS MARCO examples to hf datasets by @omkar-334 in #3783
  • Migrate multilingual parallel-sentences scripts to hf datasets by @omkar-334 in #3784
  • Fix formatting in contrastive_tension.py documentation by @jadermcs in #3788
  • migrate quora semantic-search / clustering examples to hf by @omkar-334 in #3785
  • Bump actions/checkout from 6 to 6.0.2 in the actions group by @dependabot[bot] in #3797
  • [fix] Cast learning-to-rank loss logits to float32 for bf16/fp16 CrossEncoder training by @Incheonkirin in #3800
  • Guard distributed APIs in get_device_name by @swankystark in #3798
  • Update GitHub Actions to resolve Node 20 deprecation warnings by @kurtmckee in #3803
  • [ci] Pass HF read token for main CI by @tomaarsen in #3808
  • Fix package build warnings and errors by @kurtmckee in #3809
  • [tests] Reuse model/dataset fixtures more in tests by @tomaarsen in #3810
  • Modernize the data-augmentation STSb scripts by @omkar-334 in #3806
  • modernize clip notebook by @omkar-334 in #3805
  • docs : added a hardware constraint warning for low vram GPUs for preventing memory thrashing by @sreyanshacharya in #3802
  • [ci] Avoid model2vec distill install in CI as it limits the transformers version by @tomaarsen in #3813
  • [openvino] Fix calibration default & tests for optimum-intel 2.0 / openvino 2026 by @tomaarsen in #3814
  • Bump the actions group with 2 updates by @dependabot[bot] in #3815
  • [fix] Clarify unsupported-modality error messages by @tomaarsen in #3792
  • [fix] Avoid materializing the full similarity matrix in mine_hard_negatives without FAISS by @Incheonkirin in #3816
  • Fix doc build problems (part 1) by @kurtmckee in #3811
  • [fix] Fix positive masking in GIST losses with gather_across_devices by @Incheonkirin in #3817
  • fix MPS errors by @omkar-334 in #3818
  • [fix] Support MPS in the cached losses' RandContext by @Incheonkirin in #3812
  • [deprecation] Warn when loading local custom code without trust_remote_code by @tomaarsen in #3807
  • [fix] Make relative margin sign-independent in mining and GIST losses by @Incheonkirin in #3821
  • Fix causal LM reranker scoring when max_length truncates chat-template suffix by @hotchpotch in #3787
  • [fix] Don't override device_map placement with the device argument by @tomaarsen in #3823

Full Changelog: v5.5.1...v5.6.0