Release v5.6.0 - Fixes for Causal LM Rerankers, Hard-Negative Mining, and More · huggingface/sentence-transformers

This minor version is a correctness- and robustness-focused release. It fixes a silent scoring bug for causal-LM rerankers, corrects several hard-negative mining and GIST loss edge cases, restores TSDAE on transformers v5, and adds Apple Silicon (MPS) support for the cached losses.

The headline fix affects chat-template models that read the final token position, i.e. causal-LM rerankers (like Qwen3-Reranker) and last-token-pooling embedders: when an over-long input was truncated, the chat template's trailing suffix (e.g. the assistant prefill the model scores from) was silently dropped, producing wrong scores with no error. There's also a forward-looking deprecation: loading local custom code without trust_remote_code=True now warns, and will require it from v6.0.

Install this version with:

# Training + Inference
pip install sentence-transformers[train]==5.6.0

# Inference only, use one of:
pip install sentence-transformers==5.6.0
pip install sentence-transformers[onnx-gpu]==5.6.0
pip install sentence-transformers[onnx]==5.6.0
pip install sentence-transformers[openvino]==5.6.0

# Multimodal dependencies (optional):
pip install sentence-transformers[image]==5.6.0
pip install sentence-transformers[audio]==5.6.0
pip install sentence-transformers[video]==5.6.0

# Or combine as needed:
pip install sentence-transformers[train,onnx,image]==5.6.0

Fixed silently wrong scores when truncation drops chat-template suffixes (#3787)

Chat-template models render the full conversation to a flat string before tokenizing, so when the rendered input is longer than the tokenizer's model_max_length, the tokenizer truncates it from the right and drops the template's trailing suffix: the fixed tokens a template appends after the content, e.g. a prompt, instruction, [/INST], or a trailing EOS. For models that read the final token position, this silently corrupted the result:

causal-LM rerankers (e.g. Qwen/Qwen3-Reranker-0.6B) score a pair from the last token's yes/no logits, and
last-token-pooling embedders read the final hidden state.

When the suffix was truncated away, that final position landed mid-document instead of after the prefill, so the score or embedding came from the wrong place.

Transformer.preprocess now detects when truncation drops the suffix and splices it back onto the tail of each truncated row. Because the fix lives in the shared base Transformer, it applies across SentenceTransformer, CrossEncoder, and SparseEncoder. It's enabled by default and saved to the model configuration. Pass processing_kwargs={"chat_template": {"restore_suffix": False}} to opt back into raw truncation.
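To make the failure mode concrete, here is a minimal sketch of the idea on plain token lists. It is an illustration under stated assumptions, not the library's implementation: the helper names `truncate_right` and `truncate_and_restore_suffix`, the toy tokens, and the exact splicing rule are all hypothetical.

```python
def truncate_right(tokens, max_length):
    """Plain right-truncation: keep only the first max_length tokens."""
    return tokens[:max_length]


def truncate_and_restore_suffix(tokens, suffix, max_length):
    """Truncate to max_length, but keep the template suffix at the tail.

    Hypothetical helper for illustration only; the real logic lives in
    Transformer.preprocess and may differ in detail.
    """
    if len(tokens) <= max_length:
        return tokens  # nothing is truncated, so the suffix is intact
    if not suffix or tokens[-len(suffix):] != suffix:
        return tokens[:max_length]  # no known suffix to restore
    if len(suffix) >= max_length:
        return suffix[-max_length:]  # degenerate case: only the suffix fits
    # Overwrite the tail of the truncated row with the suffix tokens.
    return tokens[: max_length - len(suffix)] + suffix


# Toy rendered chat template: user turn + document + assistant prefill suffix.
suffix = ["<|im_end|>", "<|assistant|>"]
tokens = ["<|user|>", "q", "d1", "d2", "d3", "d4", "d5"] + suffix

naive = truncate_right(tokens, max_length=6)
restored = truncate_and_restore_suffix(tokens, suffix, max_length=6)

print(naive[-1])      # final position lands mid-document
print(restored[-2:])  # final positions are the template suffix again
```

With plain truncation, the last position is a document token, so a model that reads that position scores from the wrong place. With the suffix spliced back, the row stays within `max_length` and still ends on the prefill.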

Hard-negative mining and GIST loss correctness (#3821, #3817, #3816)

A trio of correctness and scalability fixes for hard-negative mining and the GIST losses:

Sign-independent relative margin: mine_hard_negatives(relative_margin=...) and the margin_strategy="relative" branch of GISTEmbedLoss / CachedGISTEmbedLoss used a multiplicative threshold (positive * (1 - margin)) that only behaves correctly when the positive-pair similarity is positive. When that similarity was negative, the threshold moved the wrong way and let through false negatives: candidates more similar to the anchor than the true positive. The threshold is now positive - |positive| * margin, identical to before for positive scores but correct for negative ones.
Distributed positive masking in the GIST losses: with gather_across_devices=True and a non-zero margin, the false-negative suppression mask protected the wrong columns on ranks beyond the first (it ignored the per-rank offset into the gathered batch), which set the true positive's logit to -inf and produced a +inf loss. The mask now accounts for the cross-rank offset, so multi-GPU GIST training stays finite.
Memory-bounded mining without FAISS: mine_hard_negatives(use_faiss=False) (the default) materialized the full (queries × corpus) similarity matrix at once, which could OOM on large corpora. It now batches over the query axis (controlled by faiss_batch_size, default 16384), bounding peak memory while producing identical results.
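The sign issue in the relative margin is easiest to see with numbers. The sketch below compares the two threshold formulas quoted above. The function names are hypothetical, and the keep rule (`candidate <= threshold`) is an assumption about how the threshold is applied, not a copy of the library's code.

```python
import math


def old_relative_threshold(positive, margin):
    """Previous multiplicative threshold: positive * (1 - margin)."""
    return positive * (1 - margin)


def new_relative_threshold(positive, margin):
    """Sign-independent threshold: positive - |positive| * margin."""
    return positive - abs(positive) * margin


def is_kept_as_negative(candidate, threshold):
    """Assumed rule: a candidate is kept only if it scores at or below the threshold."""
    return candidate <= threshold


margin = 0.1

# Positive similarity > 0: both formulas agree (up to floating-point error).
print(math.isclose(old_relative_threshold(0.8, margin), new_relative_threshold(0.8, margin)))

# Positive similarity < 0: the old threshold sits ABOVE the positive score,
# so a candidate more similar than the true positive slips through.
positive = -0.2
candidate = -0.19  # more similar to the anchor than the positive
print(is_kept_as_negative(candidate, old_relative_threshold(positive, margin)))  # True
print(is_kept_as_negative(candidate, new_relative_threshold(positive, margin)))  # False
```

For a negative positive-pair score, the old formula moves the threshold toward zero (about -0.18) while the new one moves it away from zero (about -0.22), which is the direction a margin is meant to move it.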

TSDAE weight tying restored on `transformers` v5 (#3781)

transformers v5 removed the private PreTrainedModel._tie_encoder_decoder_weights helper that DenoisingAutoEncoderLoss (TSDAE) used to tie its separate encoder and decoder. As a stopgap, v5.5 raised a RuntimeError for the default tie_encoder_decoder=True on transformers >= 5.0.0, effectively breaking TSDAE there unless you pinned an older transformers or disabled tying. TSDAE now ships its own tying routine that shares storage between encoder and decoder, so it works on both transformers <5 and >=5 with the default settings.

Deprecation: loading local custom code without `trust_remote_code` (#3807)

Sentence Transformers has historically treated any local model directory as implicitly trusted: local custom code (e.g. modeling_*.py) loaded even with trust_remote_code=False, unlike transformers. Because this discrepancy can be surprising, loading local custom code this way now emits a FutureWarning, and from v6.0 it will require trust_remote_code=True, as transformers already does.
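As a rough illustration of this kind of deprecation, the sketch below warns when a local directory contains custom modeling code and trust_remote_code is not set. The function `warn_on_untrusted_local_code`, its file-matching rule, and the warning text are hypothetical and are not taken from the library.

```python
import tempfile
import warnings
from pathlib import Path


def warn_on_untrusted_local_code(model_dir, trust_remote_code=False):
    """Hypothetical check: warn if a local directory ships custom modeling code."""
    custom_files = sorted(p.name for p in Path(model_dir).glob("modeling_*.py"))
    if custom_files and not trust_remote_code:
        warnings.warn(
            f"Loading local custom code ({', '.join(custom_files)}) without "
            "trust_remote_code=True is deprecated.",
            FutureWarning,
            stacklevel=2,
        )
    return custom_files


with tempfile.TemporaryDirectory() as tmp:
    # Simulate a local model directory that ships custom code.
    (Path(tmp) / "modeling_custom.py").write_text("# custom model code\n")

    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        warn_on_untrusted_local_code(tmp)                          # warns
        warn_on_untrusted_local_code(tmp, trust_remote_code=True)  # silent

warning_categories = [w.category for w in caught]
print(warning_categories)
```

In practice, passing trust_remote_code=True when loading a local model that relies on custom code avoids the warning now and keeps the load working once v6.0 makes it mandatory.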

Apple Silicon (MPS) support (#3812, #3818)

Two fixes for training on Apple Silicon:

Cached losses on MPS: CachedMultipleNegativesRankingLoss and CachedGISTEmbedLoss crashed at construction on MPS because their RandContext used a CUDA-only RNG path. They now run on MPS with deterministic replay preserved.
Legacy fit path and SparseEncoder sparsity on MPS: the legacy model.fit(..., use_amp=True) path hard-coded CUDA's AMP GradScaler / autocast, and SparseEncoder sparsity statistics called to_sparse_csr(), which is unimplemented on MPS. Both now work on Apple Silicon.

Bug Fixes

Cast learning-to-rank loss logits to float32 in #3800: the listwise learning-to-rank losses scatter the model's logits into a float32 matrix, which crashed with a dtype mismatch when the model itself was in bf16/fp16 and trained without bf16=True/fp16=True (with those enabled, autocast outputs float32 logits, so the common path was unaffected). Logits are now upcast to float32 in the loss.
Don't override device_map placement with the device argument in #3823: loading with model_kwargs={"device_map": ...} previously placed the backbone via accelerate, then immediately moved it to the default device, defeating device_map. It now keeps the backbone in place (moving the other modules onto its device), and warns if both device and device_map are passed.
Collapse single-key multimodal dicts to the bare modality in #3779: a single-key input like {"image": img} was classified as a combined modality and rejected by models that support only that one modality (e.g. BGE-VL), with a self-contradicting error. It is now treated as the bare "image" input, which unblocks vision-retrieval benchmarks like MTEB that pass {"image": ...}.
Clarify unsupported-modality error messages in #3792: mixed or combined multimodal inputs on models that can't fuse modalities produced confusing errors (e.g. Modality 'message' is not supported). The errors are now scenario-specific and suggest what to do, such as encoding each modality separately.
Guard distributed APIs in get_device_name in #3798: on PyTorch builds where torch.distributed is present but unavailable (some ROCm and CPU-only builds), get_device_name() crashed with AttributeError: module 'torch.distributed' has no attribute 'is_initialized'. It now checks is_available() first, across all distributed call sites.
Fix OpenVINO static quantization for optimum-intel 2.0 / OpenVINO 2026 in #3814: export_static_quantized_openvino_model defaulted its calibration dataset to the bare id "glue", which the stricter Hub repo-id validation now rejects. The default is now the namespaced "nyu-mll/glue".
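The get_device_name fix above follows a common guard pattern: check is_available() before touching any other distributed API. The sketch below shows that pattern with stand-in objects instead of torch.distributed, so it runs without PyTorch. The function names and stubs are hypothetical, and this is not the library's get_device_name.

```python
from types import SimpleNamespace


def unguarded_rank(dist):
    """Calls is_initialized() directly, which fails on builds that lack it."""
    return dist.get_rank() if dist.is_initialized() else 0


def guarded_rank(dist):
    """Checks is_available() first, so the other calls are only reached when safe."""
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    return 0


# Stand-in for a build where torch.distributed exists but is unavailable
# (hypothetical stub; only is_available() is defined on it).
unavailable_dist = SimpleNamespace(is_available=lambda: False)

# Stand-in for an initialized process group on rank 3.
initialized_dist = SimpleNamespace(
    is_available=lambda: True,
    is_initialized=lambda: True,
    get_rank=lambda: 3,
)

try:
    unguarded_rank(unavailable_dist)
    unguarded_failed = False
except AttributeError:
    unguarded_failed = True

print(unguarded_failed)                # the unguarded call raised AttributeError
print(guarded_rank(unavailable_dist))  # falls back to rank 0
print(guarded_rank(initialized_dist))  # reads the real rank
```

Because `and` short-circuits, is_initialized() is never evaluated when is_available() returns False, which is what keeps the guarded version from crashing on such builds.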

Examples, Documentation, and Notebooks

A batch of example and documentation modernization, mostly migrating example scripts off deprecated datasets script-loaders and bare ids onto maintained Hugging Face datasets so they run on datasets 4.x:

Fix example datasets that crash on datasets 4.x in #3782 (e.g. quora, nq_open, yahoo_answers_topics).
Migrate the MS MARCO examples to datasets in #3783.
Migrate the multilingual parallel-sentences data-prep scripts to datasets in #3784.
Migrate the Quora semantic-search and clustering examples to datasets in #3785.
Modernize the AugSBERT data-augmentation STSb scripts in #3806, also fixing the QQP cross-domain script to use the quora-duplicates labels it previously ignored.
Refresh the CLIP / image-search notebooks in #3780, and modernize the CLIP training notebook in #3805.
Fix a reStructuredText table in the ContrastiveTensionLoss docstring in #3788, and add a low-VRAM hardware note to the efficiency docs in #3802.
Fix package build warnings and errors in #3809, and a batch of Sphinx doc-build problems in #3811.

All Changes

[chore] Increment dev version by @tomaarsen in #3775
[fix] Collapse single-key multimodal dicts to bare modality by @tomaarsen in #3779
chore: enable Dependabot weekly GitHub Actions bumps by @hf-dependantbot-rollout[bot] in #3786
Bump the actions group with 3 updates by @dependabot[bot] in #3790
Fix example datasets that crash on datasets 4.x by @omkar-334 in #3782
CLIP Notebooks Refresh by @lbourdois in #3780
fix TSDAE weight tying on transformers v5 by @omkar-334 in #3781
Migrate MS MARCO examples to hf datasets by @omkar-334 in #3783
Migrate multilingual parallel-sentences scripts to hf datasets by @omkar-334 in #3784
Fix formatting in contrastive_tension.py documentation by @jadermcs in #3788
migrate quora semantic-search / clustering examples to hf by @omkar-334 in #3785
Bump actions/checkout from 6 to 6.0.2 in the actions group by @dependabot[bot] in #3797
[fix] Cast learning-to-rank loss logits to float32 for bf16/fp16 CrossEncoder training by @Incheonkirin in #3800
Guard distributed APIs in get_device_name by @swankystark in #3798
Update GitHub Actions to resolve Node 20 deprecation warnings by @kurtmckee in #3803
[ci] Pass HF read token for main CI by @tomaarsen in #3808
Fix package build warnings and errors by @kurtmckee in #3809
[tests] Reuse model/dataset fixtures more in tests by @tomaarsen in #3810
Modernize the data-augmentation STSb scripts by @omkar-334 in #3806
modernize clip notebook by @omkar-334 in #3805
docs : added a hardware constraint warning for low vram GPUs for preventing memory thrashing by @sreyanshacharya in #3802
[ci] Avoid model2vec distill install in CI as it limits the transformers version by @tomaarsen in #3813
[openvino] Fix calibration default & tests for optimum-intel 2.0 / openvino 2026 by @tomaarsen in #3814
Bump the actions group with 2 updates by @dependabot[bot] in #3815
[fix] Clarify unsupported-modality error messages by @tomaarsen in #3792
[fix] Avoid materializing the full similarity matrix in mine_hard_negatives without FAISS by @Incheonkirin in #3816
Fix doc build problems (part 1) by @kurtmckee in #3811
[fix] Fix positive masking in GIST losses with gather_across_devices by @Incheonkirin in #3817
fix MPS errors by @omkar-334 in #3818
[fix] Support MPS in the cached losses' RandContext by @Incheonkirin in #3812
[deprecation] Warn when loading local custom code without trust_remote_code by @tomaarsen in #3807
[fix] Make relative margin sign-independent in mining and GIST losses by @Incheonkirin in #3821
Fix causal LM reranker scoring when max_length truncates chat-template suffix by @hotchpotch in #3787
[fix] Don't override device_map placement with the device argument by @tomaarsen in #3823

New Contributors

@lbourdois made their first contribution in #3780
@Incheonkirin made their first contribution in #3800
@swankystark made their first contribution in #3798
@kurtmckee made their first contribution in #3803
@sreyanshacharya made their first contribution in #3802

Full Changelog: v5.5.1...v5.6.0
