Conversation

@pandora-s-git

The current Mistral tokenizer converter script (from Mistral Common to Transformers) uses a default regex pattern for the pre-tokenizer; however, Mistral models actually require a specific, Mistral-provided pattern.

More context can be found here: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84

This PR aims to fix this issue by initializing the tokenizer with the correct pattern.

cc @ArthurZucker
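
For illustration, a minimal sketch of building a pre-tokenizer with the required pattern using the tokenizers library; the constant name MISTRAL_PATTERN and the helper are hypothetical, not the actual converter script:

from tokenizers import Regex, pre_tokenizers

# Hypothetical sketch, not the actual converter code. The pattern is the
# one Mistral models expect (see the linked discussion above).
MISTRAL_PATTERN = (
    r"[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+"
    r"|[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*"
    r"|\p{N}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n/]*"
    r"|\s*[\r\n]+"
    r"|\s+(?!\S)"
    r"|\s+"
)

def build_mistral_pre_tokenizer():
    # Split on the Mistral pattern first, then apply the usual byte-level
    # mapping without the default GPT-2 split regex.
    return pre_tokenizers.Sequence([
        pre_tokenizers.Split(pattern=Regex(MISTRAL_PATTERN), behavior="isolated"),
        pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False),
    ])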

@pandora-s-git changed the title from "tokenizer initialization with pattern argument" to "Mistral Tokenizer Converter Script - Initialization with Pattern Argument" on Nov 10, 2025
@patrickvonplaten
Contributor

@pandora-s-git any chance we can also directly put something like:

if self.pre_tokenizer[0] is incorrect:  # pseudocode: detect the wrong default pattern
    self.pre_tokenizer[0] = pre_tokenizers.Split(pattern=Regex(r"[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+|[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n/]*|\s*[\r\n]+|\s+(?!\S)|\s+"), behavior="isolated")
    logger.warning("Fix your file ...")

directly in the code so that all incorrect tokenizers will be corrected on-the-fly? 🙏
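
(For illustration, the "is incorrect" check above could be implemented by inspecting the serialized pre-tokenizer; this is a sketch under that assumption, and has_wrong_split_pattern is a hypothetical helper, not existing transformers code:)

import json
from tokenizers import Tokenizer

def has_wrong_split_pattern(tokenizer: Tokenizer, expected_pattern: str) -> bool:
    # Hypothetical sketch: read the tokenizer's JSON serialization and
    # compare the first Split pre-tokenizer's regex to the expected one.
    config = json.loads(tokenizer.to_str())
    pre = config.get("pre_tokenizer") or {}
    # A Sequence pre-tokenizer nests its steps under "pretokenizers".
    steps = pre.get("pretokenizers", [pre])
    for step in steps:
        if step.get("type") == "Split":
            return step.get("pattern", {}).get("Regex") != expected_pattern
    return False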

@pandora-s-git
Author

@patrickvonplaten I don't think it would be a good idea to force-replace it, since there are fine-tuned models around that used the old tokenizer, and it's hard to judge whether they would behave better or worse with the fix. But we could potentially add a warning only?
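
(A warning-only variant could then look roughly like this; again a sketch, reusing the hypothetical helper above:)

import logging

logger = logging.getLogger(__name__)

def warn_if_wrong_pattern(tokenizer, expected_pattern: str) -> None:
    # Warn without mutating, so fine-tuned checkpoints trained with the
    # old pre-tokenizer keep their current behavior.
    if has_wrong_split_pattern(tokenizer, expected_pattern):
        logger.warning(
            "This tokenizer was converted with an incorrect pre-tokenizer "
            "pattern; consider re-converting it with the fixed script."
        )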

Collaborator

@ArthurZucker left a comment

Nice, thanks for fixing! And yes, this does not work on the fly!
