validate tokenizer components by itazap · Pull Request #42816 · huggingface/transformers

itazap · 2025-12-11T15:31:54Z

Check whether the tokenizer object in the tokenizer.json being loaded actually matches the tokenizer mapped in TOKENIZER_MAPPINGS in tokenization_auto.py.
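For readers following along, here is a minimal sketch of how the component types might be read out of a parsed tokenizer.json. It is not the code from this PR. It assumes the usual top-level layout (`normalizer`, `pre_tokenizer`, `decoder`, `model`, each either null or an object with a `type` field). The names `COMPONENT_NAMES` and `component_types` are hypothetical.

```python
import json

# The four components named in the PR's comparison loop.
COMPONENT_NAMES = ("normalizer", "pre_tokenizer", "decoder", "model")


def component_types(tokenizer_json: dict) -> dict:
    """Map each component name to its "type" string, or None if absent.

    Assumption: every component is either null or an object carrying a
    "type" key. Files that omit "type" simply yield None here.
    """
    types = {}
    for name in COMPONENT_NAMES:
        section = tokenizer_json.get(name)
        types[name] = section.get("type") if isinstance(section, dict) else None
    return types


def component_types_from_file(path: str) -> dict:
    """Same as component_types, but reads the JSON from disk first."""
    with open(path, encoding="utf-8") as handle:
        return component_types(json.load(handle))
```

The resulting dict could then be compared with whatever types the mapped tokenizer class is expected to use.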

ArthurZucker

ty

ArthurZucker · 2025-12-11T15:38:09Z

 TIKTOKEN_VOCAB_FILE = "tokenizer.model"

+
+def _validate_tokenizer_components(tokenizer_class, tokenizer_json_path):


Suggested change

def _validate_tokenizer_components(tokenizer_class, tokenizer_json_path):

def _validate_tokenizer_components(tokenizer_instance, tokenizer_json_path):

IMO we'll do ourselves a favor by passing an instance

ArthurZucker · 2025-12-11T15:39:10Z

+        }
+
+        # Compare and warn on mismatches
+        mismatches = []


Suggested change

mismatches = []

mismatches = set(expected_components) - set(json_components)

ArthurZucker · 2025-12-11T15:39:20Z

+        for name in ["normalizer", "pre_tokenizer", "decoder", "model"]:
+            json_val = json_components[name]
+            expected_val = expected_components[name]
+            if json_val != expected_val:
+                mismatches.append(f"{name}: expected {expected_val}, found {json_val}")


Suggested change

for name in ["normalizer", "pre_tokenizer", "decoder", "model"]:
    json_val = json_components[name]
    expected_val = expected_components[name]
    if json_val != expected_val:
        mismatches.append(f"{name}: expected {expected_val}, found {json_val}")

How do we check if they match?

With my suggestion above:

set(expected_components) - set(json_components)
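A note on the one-liner, as a sketch only: if `expected_components` and `json_components` are dicts mapping component name to type name (an assumption, since the full diff is not shown here), `set(d)` yields just the keys, so the difference would be empty whenever both dicts have the same four keys. Comparing `.items()` keeps the values in play. The helper name `find_mismatches` is hypothetical.

```python
def find_mismatches(expected: dict, found: dict) -> list:
    """Return one message per component whose expected type differs.

    Assumes both dicts map component name -> hashable type name
    (a string or None), so (name, type) pairs can live in a set.
    """
    differing = set(expected.items()) - set(found.items())
    return [
        f"{name}: expected {value}, found {found.get(name)}"
        for name, value in sorted(differing, key=lambda pair: pair[0])
    ]
```

With identical inputs this returns an empty list. A differing `normalizer` entry produces a single message in the same format the loop above used.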

ArthurZucker · 2025-12-11T15:39:57Z

        if tokenizer_object is not None:
            fast_tokenizer = copy.deepcopy(tokenizer_object)
        elif fast_tokenizer_file is not None and os.path.isfile(fast_tokenizer_file):
+            _validate_tokenizer_components(self.__class__, fast_tokenizer_file)


No, the validation should go after init, most probably in the convert_to_native_format function.

HuggingFaceDocBuilderDev · 2025-12-11T15:46:05Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

github-actions · 2025-12-11T16:11:17Z

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42816&sha=ca3742

itazap and others added 2 commits December 11, 2025 16:28

validate tokenizer components

5655097

Merge branch 'main' into tokenizer-validation

f1f7393

ArthurZucker reviewed Dec 11, 2025

update to check instance

ca3742b

itazap wants to merge 3 commits into main from tokenizer-validation

3 participants
