This minor version is a correctness- and robustness-focused release. It fixes a silent scoring bug for causal-LM rerankers, corrects several hard-negative mining and GIST loss edge cases, restores TSDAE on transformers v5, and adds Apple Silicon (MPS) support for the cached losses.
The headline fix affects chat-template models that read the final token position, i.e. causal-LM rerankers (like Qwen3-Reranker) and last-token-pooling embedders: when an over-long input was truncated, the chat template's trailing suffix (e.g. the assistant prefill the model scores from) was silently dropped, producing wrong scores with no error. There's also a forward-looking deprecation: loading local custom code without trust_remote_code=True now warns, and will require it from v6.0.
Install this version with:
# Training + Inference
pip install sentence-transformers[train]==5.6.0
# Inference only, use one of:
pip install sentence-transformers==5.6.0
pip install sentence-transformers[onnx-gpu]==5.6.0
pip install sentence-transformers[onnx]==5.6.0
pip install sentence-transformers[openvino]==5.6.0
# Multimodal dependencies (optional):
pip install sentence-transformers[image]==5.6.0
pip install sentence-transformers[audio]==5.6.0
pip install sentence-transformers[video]==5.6.0
# Or combine as needed:
pip install sentence-transformers[train,onnx,image]==5.6.0
Fixed silently wrong scores when truncation drops chat-template suffixes (#3787)
Chat-template models render the full conversation to a flat string before tokenizing, so when the rendered input is longer than the tokenizer's model_max_length, the tokenizer truncates it from the right and drops the template's trailing suffix: the fixed tokens a template appends after the content, e.g. a prompt, instruction, [/INST], or a trailing EOS. For models that read the final token position, this silently corrupted the result:
- causal-LM rerankers (e.g. Qwen/Qwen3-Reranker-0.6B) score a pair from the last token's yes/no logits, and
- last-token-pooling embedders read the final hidden state.
When the suffix was truncated away, that final position landed mid-document instead of after the prefill, so the score or embedding came from the wrong place.
Transformer.preprocess now detects when truncation drops the suffix and splices it back onto the tail of each truncated row. Because the fix lives in the shared base Transformer, it applies across SentenceTransformer, CrossEncoder, and SparseEncoder. It's enabled by default and saved to the model configuration. Pass processing_kwargs={"chat_template": {"restore_suffix": False}} to opt back into raw truncation.
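The effect of splicing the suffix back can be illustrated with plain token-id lists. This is a toy sketch, not the library's implementation: the function names, the token ids, and the choice to make room for the suffix by dropping extra content tokens are all assumptions made for illustration.

```python
def truncate_right(token_ids, max_length):
    """Plain right-truncation: keep only the first max_length tokens."""
    return list(token_ids[:max_length])


def truncate_and_restore_suffix(token_ids, suffix_ids, max_length):
    """Right-truncate, then put the template suffix back on the tail.

    Toy illustration only (hypothetical helper, not the library's code).
    """
    if len(token_ids) <= max_length:
        return list(token_ids)  # nothing was truncated, suffix is intact
    if len(suffix_ids) >= max_length:
        raise ValueError("suffix does not fit within max_length")
    keep = max_length - len(suffix_ids)  # room left for prefix + content
    return list(token_ids[:keep]) + list(suffix_ids)


# Made-up ids: 1 = template prefix, 10..19 = document, 90/91 = template suffix.
prefix, document, suffix = [1], list(range(10, 20)), [90, 91]
rendered = prefix + document + suffix

raw = truncate_right(rendered, max_length=8)
fixed = truncate_and_restore_suffix(rendered, suffix, max_length=8)

print(raw)    # final position is a document token, mid-document
print(fixed)  # final positions are the suffix tokens again
```

With raw truncation the final position is a document token, so a model that reads the last position scores from the wrong place. With the suffix restored, the final positions match what the template intended, at the cost of a few more dropped content tokens.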
Hard-negative mining and GIST loss correctness (#3821, #3817, #3816)
A trio of correctness and scalability fixes for hard-negative mining and the GIST losses:
- Sign-independent relative margin: mine_hard_negatives(relative_margin=...) and the margin_strategy="relative" branch of GISTEmbedLoss/CachedGISTEmbedLoss used a multiplicative threshold (positive * (1 - margin)) that only behaves correctly when the positive-pair similarity is positive. When that similarity was negative, the threshold moved the wrong way and let through false negatives: candidates more similar to the anchor than the true positive. The threshold is now positive - |positive| * margin, identical to before for positive scores but correct for negative ones.
- Distributed positive masking in the GIST losses: with gather_across_devices=True and a non-zero margin, the false-negative suppression mask protected the wrong columns on ranks beyond the first (it ignored the per-rank offset into the gathered batch), which set the true positive's logit to -inf and produced a +inf loss. The mask now accounts for the cross-rank offset, so multi-GPU GIST training stays finite.
- Memory-bounded mining without FAISS: mine_hard_negatives(use_faiss=False) (the default) materialized the full (queries × corpus) similarity matrix at once, which could OOM on large corpora. It now batches over the query axis (controlled by faiss_batch_size, default 16384), bounding peak memory while producing identical results.
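The sign issue in the relative margin is easy to check by hand. The sketch below compares the two threshold formulas quoted above. The helper names are hypothetical, and the acceptance rule (a candidate counts as a negative when its score is strictly below the threshold) is an assumption for illustration; the library's exact comparison may differ.

```python
import math


def old_threshold(positive, margin):
    """Multiplicative form: only well-behaved when positive > 0."""
    return positive * (1 - margin)


def new_threshold(positive, margin):
    """Sign-independent form: always sits at or below the positive score."""
    return positive - abs(positive) * margin


def accepted_as_negative(score, threshold):
    """Assumed rule: keep a candidate as a negative if it scores below the threshold."""
    return score < threshold


margin = 0.1

# Positive similarity: both formulas agree (up to floating-point rounding).
print(math.isclose(old_threshold(0.8, margin), new_threshold(0.8, margin)))

# Negative similarity: the old threshold lands ABOVE the positive score.
positive = -0.2
candidate = -0.19  # more similar to the anchor than the true positive
print(accepted_as_negative(candidate, old_threshold(positive, margin)))
print(accepted_as_negative(candidate, new_threshold(positive, margin)))
```

For a positive score of -0.2 and a margin of 0.1, the old threshold is about -0.18, which is above the positive itself, so the candidate at -0.19 slips through. The new threshold is about -0.22, so the same candidate is rejected.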
TSDAE weight tying restored on transformers v5 (#3781)
transformers v5 removed the private PreTrainedModel._tie_encoder_decoder_weights helper that DenoisingAutoEncoderLoss (TSDAE) used to tie its separate encoder and decoder. As a stopgap, v5.5 raised a RuntimeError for the default tie_encoder_decoder=True on transformers >= 5.0.0, effectively breaking TSDAE there unless you pinned an older transformers or disabled tying. TSDAE now ships its own tying routine that shares storage between encoder and decoder, so it works on both transformers <5 and >=5 with the default settings.
Deprecation: loading local custom code without trust_remote_code (#3807)
Sentence Transformers has historically treated any local model directory as implicitly trusted: local custom code (e.g. modeling_*.py) loaded even with trust_remote_code=False, unlike transformers. This discrepancy might be unexpected, so loading local custom code this way now emits a FutureWarning, and from v6.0 it will require trust_remote_code=True like in transformers.
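One way to find affected loads ahead of v6.0 is to promote FutureWarning to an error in a test or CI run, using Python's standard warnings module. The sketch below uses a hypothetical stand-in loader, load_local_model, so that it runs without any model files; the warning message text is made up and will not match the library's wording.

```python
import warnings


def load_local_model(path, trust_remote_code=False):
    """Hypothetical stand-in for a loader that finds custom code in `path`."""
    if not trust_remote_code:
        warnings.warn(
            "Loading local custom code without trust_remote_code=True is deprecated.",
            FutureWarning,
            stacklevel=2,
        )
    return f"model loaded from {path}"


def loads_without_deprecation(path, **kwargs):
    """Return True if loading raises no FutureWarning, False otherwise."""
    with warnings.catch_warnings():
        # Inside this block, any FutureWarning is raised as an exception.
        warnings.simplefilter("error", FutureWarning)
        try:
            load_local_model(path, **kwargs)
        except FutureWarning:
            return False
    return True


print(loads_without_deprecation("my-local-model"))
print(loads_without_deprecation("my-local-model", trust_remote_code=True))
```

In a real project the same filter would wrap the actual model-loading call, so that any load relying on the implicit trust fails the test instead of only logging a warning.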
Apple Silicon (MPS) support (#3812, #3818)
Two fixes for training on Apple Silicon:
- Cached losses on MPS: CachedMultipleNegativesRankingLoss and CachedGISTEmbedLoss crashed at construction on MPS because their RandContext used a CUDA-only RNG path. They now run on MPS with deterministic replay preserved.
- Legacy fit path and SparseEncoder sparsity on MPS: the legacy model.fit(..., use_amp=True) path hard-coded CUDA's AMP GradScaler/autocast, and SparseEncoder sparsity statistics called to_sparse_csr(), which is unimplemented on MPS. Both now work on Apple Silicon.
Bug Fixes
- Cast learning-to-rank loss logits to float32 in #3800: the listwise learning-to-rank losses scatter the model's logits into a float32 matrix, which crashed with a dtype mismatch when the model itself was in bf16/fp16 and trained without bf16=True/fp16=True (with those enabled, autocast outputs float32 logits, so the common path was unaffected). Logits are now upcast to float32 in the loss.
- Don't override device_map placement with the device argument in #3823: loading with model_kwargs={"device_map": ...} previously placed the backbone via accelerate, then immediately moved it to the default device, defeating device_map. It now keeps the backbone in place (moving the other modules onto its device), and warns if both device and device_map are passed.
- Collapse single-key multimodal dicts to the bare modality in #3779: a single-key input like {"image": img} was classified as a combined modality and rejected by models that support only that one modality (e.g. BGE-VL), with a self-contradicting error. It is now treated as the bare "image" input, which unblocks vision-retrieval benchmarks like MTEB that pass {"image": ...}.
- Clarify unsupported-modality error messages in #3792: mixed or combined multimodal inputs on models that can't fuse modalities produced confusing errors (e.g. Modality 'message' is not supported). The errors are now scenario-specific and suggest what to do, such as encoding each modality separately.
- Guard distributed APIs in get_device_name in #3798: on PyTorch builds where torch.distributed is present but unavailable (some ROCm and CPU-only builds), get_device_name() crashed with AttributeError: module 'torch.distributed' has no attribute 'is_initialized'. It now checks is_available() first, across all distributed call sites.
- Fix OpenVINO static quantization for optimum-intel 2.0 / OpenVINO 2026 in #3814: export_static_quantized_openvino_model defaulted its calibration dataset to the bare id "glue", which the stricter Hub repo-id validation now rejects. The default is now the namespaced "nyu-mll/glue".
Examples, Documentation, and Notebooks
A batch of example and documentation modernization, mostly migrating example scripts off deprecated datasets script-loaders and bare ids onto maintained Hugging Face datasets so they run on datasets 4.x:
- Fix example datasets that crash on datasets 4.x in #3782 (e.g. quora, nq_open, yahoo_answers_topics).
- Migrate the MS MARCO examples to datasets in #3783.
- Migrate the multilingual parallel-sentences data-prep scripts to datasets in #3784.
- Migrate the Quora semantic-search and clustering examples to datasets in #3785.
- Modernize the AugSBERT data-augmentation STSb scripts in #3806, also fixing the QQP cross-domain script to use the quora-duplicates labels it previously ignored.
- Refresh the CLIP / image-search notebooks in #3780, and modernize the CLIP training notebook in #3805.
- Fix a reStructuredText table in the ContrastiveTensionLoss docstring in #3788, and add a low-VRAM hardware note to the efficiency docs in #3802.
- Fix package build warnings and errors in #3809, and a batch of Sphinx doc-build problems in #3811.
All Changes
- [chore] Increment dev version by @tomaarsen in #3775
- [fix] Collapse single-key multimodal dicts to bare modality by @tomaarsen in #3779
- chore: enable Dependabot weekly GitHub Actions bumps by @hf-dependantbot-rollout[bot] in #3786
- Bump the actions group with 3 updates by @dependabot[bot] in #3790
- Fix example datasets that crash on datasets 4.x by @omkar-334 in #3782
- CLIP Notebooks Refresh by @lbourdois in #3780
- fix TSDAE weight tying on transformers v5 by @omkar-334 in #3781
- Migrate MS MARCO examples to hf datasets by @omkar-334 in #3783
- Migrate multilingual parallel-sentences scripts to hf datasets by @omkar-334 in #3784
- Fix formatting in contrastive_tension.py documentation by @jadermcs in #3788
- migrate quora semantic-search / clustering examples to hf by @omkar-334 in #3785
- Bump actions/checkout from 6 to 6.0.2 in the actions group by @dependabot[bot] in #3797
- [fix] Cast learning-to-rank loss logits to float32 for bf16/fp16 CrossEncoder training by @Incheonkirin in #3800
- Guard distributed APIs in get_device_name by @swankystark in #3798
- Update GitHub Actions to resolve Node 20 deprecation warnings by @kurtmckee in #3803
- [ci] Pass HF read token for main CI by @tomaarsen in #3808
- Fix package build warnings and errors by @kurtmckee in #3809
- [tests] Reuse model/dataset fixtures more in tests by @tomaarsen in #3810
- Modernize the data-augmentation STSb scripts by @omkar-334 in #3806
- modernize clip notebook by @omkar-334 in #3805
- docs : added a hardware constraint warning for low vram GPUs for preventing memory thrashing by @sreyanshacharya in #3802
- [ci] Avoid model2vec distill install in CI as it limits the transformers version by @tomaarsen in #3813
- [openvino] Fix calibration default & tests for optimum-intel 2.0 / openvino 2026 by @tomaarsen in #3814
- Bump the actions group with 2 updates by @dependabot[bot] in #3815
- [fix] Clarify unsupported-modality error messages by @tomaarsen in #3792
- [fix] Avoid materializing the full similarity matrix in mine_hard_negatives without FAISS by @Incheonkirin in #3816
- Fix doc build problems (part 1) by @kurtmckee in #3811
- [fix] Fix positive masking in GIST losses with gather_across_devices by @Incheonkirin in #3817
- fix MPS errors by @omkar-334 in #3818
- [fix] Support MPS in the cached losses' RandContext by @Incheonkirin in #3812
- [deprecation] Warn when loading local custom code without trust_remote_code by @tomaarsen in #3807
- [fix] Make relative margin sign-independent in mining and GIST losses by @Incheonkirin in #3821
- Fix causal LM reranker scoring when max_length truncates chat-template suffix by @hotchpotch in #3787
- [fix] Don't override device_map placement with the device argument by @tomaarsen in #3823
New Contributors
- @lbourdois made their first contribution in #3780
- @Incheonkirin made their first contribution in #3800
- @swankystark made their first contribution in #3798
- @kurtmckee made their first contribution in #3803
- @sreyanshacharya made their first contribution in #3802
Full Changelog: v5.5.1...v5.6.0