
fix: resolve false-positive regex warning for non-mistral models#44736

Closed
yunhaoli24 wants to merge 1 commit into huggingface:main from yunhaoli24:fix/regex-pattern-warning

Conversation

@yunhaoli24

What does this PR do?

Fixes #44031

The Problem

The condition for calling _patch_mistral_regex was too broad (vocab_size > 100000), causing non-Mistral models such as Qwen, LLaMA, and BGE-Reranker to emit an incorrect regex-pattern warning even though they do not need the regex fix.

Example from issue #44031:

from transformers import AutoTokenizer

# Save and reload Qwen tokenizer locally
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
tokenizer.save_pretrained("/tmp/qwen")

# This incorrectly triggers the warning!
tokenizer = AutoTokenizer.from_pretrained("/tmp/qwen")
# Warning: The tokenizer you are loading from '/tmp/qwen' with an incorrect regex pattern...

This is misleading because Qwen doesn't have the Mistral regex issue.

The Solution

Add a path-based filter before calling _patch_mistral_regex. The function is now called only when both of the following hold:

  1. vocab_size > 100000 (indicates a potential need for the fix)
  2. the model path contains "mistral" (indicating a Mistral model), OR the path is empty (local models — let the internal logic decide)

This ensures:

  • ✅ Mistral models still get the regex fix when needed
  • ✅ Non-Mistral models don't see false-positive warnings
  • ✅ Local models (empty path) still get proper checking via internal logic
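The combined condition can be sketched as a standalone predicate. Note that should_patch_mistral_regex is a hypothetical helper name used here for illustration; the actual PR edits the condition inline inside the tokenizer loading path:

```python
def should_patch_mistral_regex(vocab_size, name_or_path, has_pre_tokenizer=True):
    """Illustrative predicate: should _patch_mistral_regex be attempted?"""
    # Gate 1: a large vocab plus a pre_tokenizer indicates a potential need.
    if vocab_size <= 100000 or not has_pre_tokenizer:
        return False
    # Gate 2: only Mistral-looking paths, or an empty path (local models,
    # where the internal logic inside _patch_mistral_regex decides).
    return "mistral" in name_or_path.lower() or name_or_path == ""
```

With this shape, a Qwen-style path with a large vocabulary is filtered out before the patch function (and its warning) is ever reached.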

Code Change

# Before:
if vocab_size > 100000 and getattr(self._tokenizer, "pre_tokenizer", None) is not None:
    kwargs.pop("tokenizer", None)
    self._tokenizer = self._patch_mistral_regex(...)

# After:
name_or_path = self.init_kwargs.get("name_or_path", "")
if (
    vocab_size > 100000
    and getattr(self._tokenizer, "pre_tokenizer", None) is not None
    and ("mistral" in name_or_path.lower() or name_or_path == "")
):
    kwargs.pop("tokenizer", None)
    self._tokenizer = self._patch_mistral_regex(...)

Testing

The fix ensures:

  • Loading Qwen tokenizer from HuggingFace Hub no longer shows the warning
  • Loading Mistral tokenizer still shows the warning when needed
  • Loading local tokenizers works as expected via the internal checking
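These cases can be exercised against the new condition without downloading any real tokenizer. The maybe_warn function below is a self-contained, hypothetical stand-in for the warning path, not the actual transformers code:

```python
import warnings

def maybe_warn(vocab_size, name_or_path):
    # Stand-in for the loading path: warn only when the new,
    # narrowed condition from this PR is satisfied.
    if vocab_size > 100000 and ("mistral" in name_or_path.lower() or name_or_path == ""):
        warnings.warn(
            f"The tokenizer you are loading from '{name_or_path}' "
            "with an incorrect regex pattern..."
        )

# Non-Mistral local path (the issue's repro case): no warning expected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    maybe_warn(151000, "/tmp/qwen")
assert len(caught) == 0

# Mistral path with a large vocab: the warning still fires.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    maybe_warn(151000, "mistralai/Mistral-7B-v0.1")
assert len(caught) == 1
```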

Alternatives Considered

  1. Check model_type in config.json: Similar to closed PR Fix incorrect fix_mistral_regex warning for local non-Mistral tokenizers #42605

    • Pros: More accurate, checks actual model type
    • Cons: Requires reading config.json (extra I/O), PR was closed
  2. Maintain a list of Mistral variants:

    • Pros: Explicit allowlist
    • Cons: Needs updating when new variants appear
  3. Chosen: Path-based filter (current approach)

    • Pros: Simple, fast, minimal code change
    • Cons: Path-dependent (mitigated by empty path check)
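For comparison, alternative 2 might look like the sketch below. The variant names are illustrative assumptions, not an actual list from transformers, and keeping such a list current is exactly the maintenance cost noted above:

```python
# Illustrative allowlist of Mistral-family name fragments (assumed, not
# taken from transformers); new variants would require updating this set.
MISTRAL_VARIANTS = {"mistral", "mixtral", "ministral"}

def path_matches_variant(name_or_path):
    """Return True if the path mentions any known Mistral variant."""
    lowered = name_or_path.lower()
    return any(variant in lowered for variant in MISTRAL_VARIANTS)
```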

Related Issues

The condition for calling _patch_mistral_regex was too broad (vocab_size > 100000),
causing non-Mistral models like Qwen to show incorrect regex pattern warnings.
This change tightens the condition so it only applies to actual Mistral models.

Fixes #44031
@Rocketknight1
Member

Tell your code agent to check if the error still exists on main before it makes a PR!

Development

Successfully merging this pull request may close these issues.

All tokenizers raise incorrect regex pattern warning after version 4.57.3?

2 participants