optionally override tokenizer class with serialized tokenizer #44606

Open
itazap wants to merge 1 commit into main from override_tokenizer

Conversation


@itazap itazap commented Mar 11, 2026

In v5, we enforce creating a model-specific tokenizer object (e.g. LlamaTokenizer, Qwen2Tokenizer, etc.) when one is specified:

  1. when tokenizer_class is set in tokenizer_config.json, or
  2. when using the auto-mapped tokenizer_class based on the model_type in config.json.

In v4, we always loaded the tokenizer object directly from tokenizer.json, tokenizer.model, or another serialized file, regardless of what the configs said.

This surfaced a lot of stale and incorrect tokenizer classes on the Hub. For v5, we could introduce a way to optionally (as in v4) load the _tokenizer object exactly as the serialized tokenizer file has it.

We introduced MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS to force TokenizersBackend, but some checkpoints don't have a config.json to create a mapping from (ex: https://huggingface.co/jashing/tinyllama-colorist-lora/blob/main/tokenizer_config.json), so we need another option to override.

In this PR: do a light check that the class-built and serialized _tokenizer objects have components of the same types (e.g. the two normalizers compare equal under type(), and so on for the other components). If they don't match and overriding is allowed, we use the serialized tokenizer.
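The "light check" above could be sketched roughly as follows. This is a hypothetical illustration, not the actual PR code: the classes here are stand-ins for the real transformers/tokenizers objects, and the function names (components_match, resolve_tokenizer) are invented for the example.

```python
# Illustrative sketch of the type-based override check. All names here are
# hypothetical stand-ins, not the real transformers/tokenizers API.

class NFC:
    """Stand-in for one normalizer type."""

class Lowercase:
    """Stand-in for a different normalizer type."""

class BackendTokenizer:
    """Minimal stand-in for a backend tokenizer with pluggable components."""
    def __init__(self, normalizer=None, pre_tokenizer=None):
        self.normalizer = normalizer
        self.pre_tokenizer = pre_tokenizer

def components_match(expected, serialized):
    # Light structural check: corresponding components have the same types.
    return (
        type(expected.normalizer) is type(serialized.normalizer)
        and type(expected.pre_tokenizer) is type(serialized.pre_tokenizer)
    )

def resolve_tokenizer(expected, serialized, allow_override=True):
    # On mismatch, if overriding is allowed, trust the serialized file
    # (the v4 behaviour); otherwise keep the class-built tokenizer.
    if allow_override and not components_match(expected, serialized):
        return serialized
    return expected
```

The check is deliberately shallow (types only, not component parameters), which keeps it cheap while still catching the common case of a Hub class that disagrees with what tokenizer.json actually contains.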

@itazap itazap requested a review from ArthurZucker March 11, 2026 17:29
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


@ArthurZucker ArthurZucker left a comment


This is a hard one... I'm really against it. IMO we need to find all model types that have an issue, but when there is no config.json and there is a tokenizer.json, we can set the class to TokenizersBackend?

IMO we need to just isolate all model classes that are wrong
