Skip to content

validate tokenizer components#42816

Open
itazap wants to merge 3 commits intomainfrom
tokenizer-validation
Open

validate tokenizer components#42816
itazap wants to merge 3 commits intomainfrom
tokenizer-validation

Conversation

@itazap
Copy link
Copy Markdown
Collaborator

@itazap itazap commented Dec 11, 2025

check if the tokenizer object in tokenizer.json that is being loaded is actually matching that of mapped tokenizer in TOKENIZER_MAPPINGS in tokenization_auto.py

Copy link
Copy Markdown
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ty

TIKTOKEN_VOCAB_FILE = "tokenizer.model"


def _validate_tokenizer_components(tokenizer_class, tokenizer_json_path):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _validate_tokenizer_components(tokenizer_class, tokenizer_json_path):
def _validate_tokenizer_components(tokenizer_instance tokenizer_json_path):

will do ourselves a favor IMO to have an instance

}

# Compare and warn on mismatches
mismatches = []
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
mismatches = []
mismatches = set(expected_components) - set(json_components)

Comment on lines +109 to +113
for name in ["normalizer", "pre_tokenizer", "decoder", "model"]:
json_val = json_components[name]
expected_val = expected_components[name]
if json_val != expected_val:
mismatches.append(f"{name}: expected {expected_val}, found {json_val}")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for name in ["normalizer", "pre_tokenizer", "decoder", "model"]:
json_val = json_components[name]
expected_val = expected_components[name]
if json_val != expected_val:
mismatches.append(f"{name}: expected {expected_val}, found {json_val}")

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do we check if they match?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with m y above:

 set(expected_components) - set(json_components)

if tokenizer_object is not None:
fast_tokenizer = copy.deepcopy(tokenizer_object)
elif fast_tokenizer_file is not None and os.path.isfile(fast_tokenizer_file):
_validate_tokenizer_components(self.__class__, fast_tokenizer_file)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No validate should go after INIT, most probably in the convert_to_native_format function

@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@github-actions
Copy link
Copy Markdown
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42816&sha=ca3742

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants