
fix: resolve false-positive regex warning for non-mistral models#44736

Closed
yunhaoli24 wants to merge 1 commit into huggingface:main from yunhaoli24:fix/regex-pattern-warning

Conversation

@yunhaoli24

What does this PR do?

Fixes #44031

The Problem

The condition for calling _patch_mistral_regex was too broad (vocab_size > 100000), causing non-Mistral models such as Qwen, LLaMA, and BGE-Reranker to emit an incorrect regex-pattern warning even though they do not need the regex fix.

Example from issue #44031:

from transformers import AutoTokenizer

# Save and reload Qwen tokenizer locally
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
tokenizer.save_pretrained("/tmp/qwen")

# This incorrectly triggers the warning!
tokenizer = AutoTokenizer.from_pretrained("/tmp/qwen")
# Warning: The tokenizer you are loading from '/tmp/qwen' with an incorrect regex pattern...

This is misleading because Qwen doesn't have the Mistral regex issue.

The Solution

Add a path-based filter before calling _patch_mistral_regex. The function is now called only when both of the following hold:

  1. vocab_size > 100000 (indicates a potential need for the fix)
  2. the model path contains "mistral" (indicating a Mistral model), OR the path is empty (local models — let the internal logic decide)

This ensures:

  • ✅ Mistral models still get the regex fix when needed
  • ✅ Non-Mistral models don't see false-positive warnings
  • ✅ Local models (empty path) still get proper checking via internal logic
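The combined condition can be sketched as a standalone predicate. Note that should_patch_mistral_regex is a hypothetical helper name used here for illustration; the actual PR edits the condition inline inside the tokenizer loading path:

```python
def should_patch_mistral_regex(vocab_size, name_or_path, has_pre_tokenizer=True):
    """Illustrative predicate: should _patch_mistral_regex be attempted?"""
    # Gate 1: a large vocab plus a pre_tokenizer indicates a potential need.
    if vocab_size <= 100000 or not has_pre_tokenizer:
        return False
    # Gate 2: only Mistral-looking paths, or an empty path (local models,
    # where the internal logic inside _patch_mistral_regex decides).
    return "mistral" in name_or_path.lower() or name_or_path == ""
```

With this shape, a Qwen-style path with a large vocabulary is filtered out before the patch function (and its warning) is ever reached.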

Code Change

# Before:
if vocab_size > 100000 and getattr(self._tokenizer, "pre_tokenizer", None) is not None:
    kwargs.pop("tokenizer", None)
    self._tokenizer = self._patch_mistral_regex(...)

# After:
name_or_path = self.init_kwargs.get("name_or_path", "")
if (
    vocab_size > 100000
    and getattr(self._tokenizer, "pre_tokenizer", None) is not None
    and ("mistral" in name_or_path.lower() or name_or_path == "")
):
    kwargs.pop("tokenizer", None)
    self._tokenizer = self._patch_mistral_regex(...)

Testing

The fix ensures:

  • Loading Qwen tokenizer from HuggingFace Hub no longer shows the warning
  • Loading Mistral tokenizer still shows the warning when needed
  • Loading local tokenizers works as expected via the internal checking
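These cases can be exercised against the new condition without downloading any real tokenizer. The maybe_warn function below is a self-contained, hypothetical stand-in for the warning path, not the actual transformers code:

```python
import warnings

def maybe_warn(vocab_size, name_or_path):
    # Stand-in for the loading path: warn only when the new,
    # narrowed condition from this PR is satisfied.
    if vocab_size > 100000 and ("mistral" in name_or_path.lower() or name_or_path == ""):
        warnings.warn(
            f"The tokenizer you are loading from '{name_or_path}' "
            "with an incorrect regex pattern..."
        )

# Non-Mistral local path (the issue's repro case): no warning expected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    maybe_warn(151000, "/tmp/qwen")
assert len(caught) == 0

# Mistral path with a large vocab: the warning still fires.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    maybe_warn(151000, "mistralai/Mistral-7B-v0.1")
assert len(caught) == 1
```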

Alternatives Considered

  1. Check model_type in config.json: Similar to closed PR Fix incorrect fix_mistral_regex warning for local non-Mistral tokenizers #42605

    • Pros: More accurate, checks actual model type
    • Cons: Requires reading config.json (extra I/O), PR was closed
  2. Maintain a list of Mistral variants:

    • Pros: Explicit allowlist
    • Cons: Needs updating when new variants appear
  3. Chosen: Path-based filter (current approach)

    • Pros: Simple, fast, minimal code change
    • Cons: Path-dependent (mitigated by empty path check)
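For comparison, alternative 2 might look like the sketch below. The variant names are illustrative assumptions, not an actual list from transformers, and keeping such a list current is exactly the maintenance cost noted above:

```python
# Illustrative allowlist of Mistral-family name fragments (assumed, not
# taken from transformers); new variants would require updating this set.
MISTRAL_VARIANTS = {"mistral", "mixtral", "ministral"}

def path_matches_variant(name_or_path):
    """Return True if the path mentions any known Mistral variant."""
    lowered = name_or_path.lower()
    return any(variant in lowered for variant in MISTRAL_VARIANTS)
```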

Related Issues

The condition for calling _patch_mistral_regex was too broad (vocab_size > 100000),
causing non-Mistral models like Qwen to show incorrect regex pattern warnings.
This change tightens the condition so it only applies to actual Mistral models.

Fixes #44031
@Rocketknight1
Member

Tell your code agent to check if the error still exists on main before it makes a PR!

Development

Successfully merging this pull request may close these issues.

All tokenizers raise incorrect regex pattern warning after version 4.57.3?

2 participants