When using evaluate.load("bertscore") with transformers>=5.0 and models that don't define model_max_length (e.g., microsoft/deberta-xlarge-mnli), the computation fails with:
OverflowError: int too big to convert
The error occurs at transformers/tokenization_utils_tokenizers.py in self._tokenizer.enable_truncation(**target), where the Rust tokenizers backend receives an integer (~10^30) that exceeds its maximum value.
Root Cause
Models like DeBERTa don't define model_max_length in their tokenizer config, so Transformers assigns the sentinel VERY_LARGE_INTEGER (~10^30) as the default. In transformers>=5, this sentinel is passed to the Rust tokenizers backend via enable_truncation(), which overflows.
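A minimal sketch of the mismatch, assuming VERY_LARGE_INTEGER is defined as int(1e30) (as in transformers.tokenization_utils_base) and that the Rust backend stores truncation lengths as a 64-bit unsigned integer:

```python
# Assumed values: VERY_LARGE_INTEGER matches the transformers sentinel,
# and the Rust tokenizers backend can hold at most a 64-bit unsigned value.
VERY_LARGE_INTEGER = int(1e30)
U64_MAX = 2**64 - 1

# The sentinel vastly exceeds what the backend can represent, hence
# "OverflowError: int too big to convert" when it is forwarded unchecked.
print(VERY_LARGE_INTEGER > U64_MAX)  # True
```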
Reproduction:
import evaluate

bertscore = evaluate.load("bertscore")
results = bertscore.compute(
    predictions=["Hello world"],
    references=["Hi world"],
    model_type="microsoft/deberta-xlarge-mnli",
    num_layers=40,
    rescale_with_baseline=True,
    lang="en",
)
Environment:
transformers==5.2.0
tokenizers==0.22.2
evaluate==0.4.6
bert-score==0.3.13
Python 3.11
Suggested Fixes:
1. Pass through the max_length parameter: The underlying bert_score.score() function supports a max_length parameter, but evaluate's BERTScore wrapper (BERTScore._compute()) does not accept or forward it. Adding support for this parameter would let users work around the issue.
2. Cap model_max_length internally: Before calling the tokenizer, check whether model_max_length exceeds a reasonable threshold and cap it (e.g., to the model's actual supported max length).
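The capping fix could look like the following hypothetical helper; the function name, threshold, and fallback value are illustrative and not part of evaluate or bert-score:

```python
# Assumed threshold: any value this large is the VERY_LARGE_INTEGER
# sentinel rather than a real model limit.
MAX_SANE_LENGTH = 1_000_000

def capped_max_length(model_max_length, fallback=512):
    """Hypothetical helper: replace the sentinel with a usable cap
    before truncation is enabled on the Rust tokenizers backend."""
    return model_max_length if model_max_length <= MAX_SANE_LENGTH else fallback

print(capped_max_length(512))        # 512: a real config value passes through
print(capped_max_length(int(1e30)))  # 512: the sentinel is capped
```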