optionally override tokenizer class with serialized tokenizer by itazap · Pull Request #44606 · huggingface/transformers

itazap · 2026-03-11T17:29:12Z

In v5, we enforce creating a model-specific tokenizer object (e.g. LlamaTokenizer, Qwen2Tokenizer, etc.) when one is specified.

For instance, when tokenizer_class is set in tokenizer_config.json, or when the tokenizer class is auto-mapped from the model_type in config.json.

In v4, we always loaded the tokenizer object directly from tokenizer.json, tokenizer.model, or another tokenizer file, regardless of what the configs said.

This exposed a lot of stale and incorrect tokenizer classes on the Hub. For v5, we could introduce an optional way (as in v4) to use whatever is in the _tokenizer object exactly as the tokenizer file has it.

We introduced MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS to force TokenizersBackend, but some checkpoints don't have a config.json to create a mapping from (e.g. https://huggingface.co/jashing/tinyllama-colorist-lora/blob/main/tokenizer_config.json), so we need another way to override.

In this PR: do a light check that the components of the two _tokenizer objects have the same types (type(normalizer) == type(normalizer), etc.). If they don't, and override is allowed, we override with the serialized tokenizer.
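A minimal sketch of what such a light check could look like, not the code in this PR. The helper names (mismatched_components, pick_backend), the allow_override flag, and the stand-in classes are all hypothetical; only the component attribute names follow the ones exposed by a tokenizers.Tokenizer object.

```python
from types import SimpleNamespace

# Component attributes to compare; names mirror those on `tokenizers.Tokenizer`.
COMPONENTS = ("normalizer", "pre_tokenizer", "model", "post_processor", "decoder")


def mismatched_components(class_built, serialized):
    """Return the names of components whose types differ (hypothetical helper)."""
    return [
        name
        for name in COMPONENTS
        if type(getattr(class_built, name, None))
        is not type(getattr(serialized, name, None))
    ]


def pick_backend(class_built, serialized, allow_override=False):
    """Prefer the serialized tokenizer only on a mismatch and when override is allowed."""
    if allow_override and mismatched_components(class_built, serialized):
        return serialized
    return class_built


# Stand-in component classes so the sketch runs without the `tokenizers` package.
class NFC: ...
class Lowercase: ...
class BPE: ...


from_class = SimpleNamespace(normalizer=NFC(), model=BPE())
from_file = SimpleNamespace(normalizer=Lowercase(), model=BPE())

print(mismatched_components(from_class, from_file))
print(pick_backend(from_class, from_file, allow_override=True) is from_file)
```

Comparing types only is deliberately cheap: two components of the same type with different parameters would still count as matching, so this catches a wrong tokenizer class, not every divergence from the file.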

HuggingFaceDocBuilderDev · 2026-03-11T17:39:17Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker

This is a hard one... I'm really against it. IMO we need to find all model types that have an issue, but when there is no config.json but there is a tokenizer.json, we can set the class to TokenizersBackend?

IMO we need to just isolate all model classes that are wrong
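One possible reading of the fallback suggested above, sketched as a standalone function. The function name and its arguments are hypothetical and do not come from transformers; the rule itself is only the one stated in the comment (no config.json, but a tokenizer.json is present).

```python
def fallback_tokenizer_class(checkpoint_files, configured_class=None):
    """Pick a tokenizer class name from the files present in a checkpoint.

    checkpoint_files: set of file names in the checkpoint (hypothetical input).
    configured_class: class name from tokenizer_config.json, if any.
    """
    has_config = "config.json" in checkpoint_files
    has_serialized = "tokenizer.json" in checkpoint_files
    if not has_config and has_serialized:
        # No model_type to map from, so trust the serialized tokenizer.
        return "TokenizersBackend"
    return configured_class


print(fallback_tokenizer_class({"tokenizer.json", "tokenizer_config.json"}, "LlamaTokenizer"))
```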

optionally override tokenizer class with serialized tokenizer from file, when they don't match

1cfa028

itazap requested a review from ArthurZucker March 11, 2026 17:29

hmellor mentioned this pull request Mar 17, 2026

Update to transformers v5 vllm-project/vllm#30566

Open

ArthurZucker reviewed Mar 17, 2026

itazap wants to merge 1 commit into main from override_tokenizer
