
Add Context-Aware Tokenizer Selection Utility Based on Corpus Analysis #40515

Open
Aishwarya0811 wants to merge 6 commits into huggingface:main from Aishwarya0811:feature/context-aware-tokenizer-selection

Conversation


@Aishwarya0811 Aishwarya0811 commented Aug 28, 2025

What does this PR do?

This PR introduces a new utility module that automates tokenizer selection and configuration based on corpus characteristics, addressing the need to reduce manual trial-and-error in tokenizer selection and improve model performance with minimal user effort.

Key Features:

CorpusAnalyzer: Extracts statistical features from text corpora (vocabulary size, morphological complexity, character diversity, language patterns)
TokenizerRecommender: Maps corpus features to optimal tokenizer types (BPE, WordPiece, SentencePiece) using rule-based heuristics
TokenizerSelector: End-to-end utility that analyzes corpus, recommends tokenizer type, and optionally trains it using existing infrastructure
Language-aware recommendations: Handles different script types (Latin, CJK, mixed) appropriately
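As a rough illustration of how the analyze-then-recommend flow described above might look, here is a minimal sketch. All names, thresholds, and heuristics below are hypothetical stand-ins, not the PR's actual implementation:

```python
# Hypothetical sketch of a corpus-analysis + rule-based recommendation
# flow. Function names and thresholds are illustrative only.
from collections import Counter


def analyze_corpus(texts):
    """Extract simple statistical features from a list of strings."""
    tokens = [tok for text in texts for tok in text.split()]
    chars = Counter(ch for text in texts for ch in text)
    # CJK Unified Ideographs block as a crude script-detection signal.
    cjk = sum(n for ch, n in chars.items() if "\u4e00" <= ch <= "\u9fff")
    total_chars = sum(chars.values())
    return {
        "vocab_size": len(set(tokens)),
        "char_diversity": len(chars),
        "cjk_ratio": cjk / total_chars if total_chars else 0.0,
    }


def recommend_tokenizer(features):
    """Map corpus features to a tokenizer family via simple heuristics."""
    if features["cjk_ratio"] > 0.3:
        # CJK text lacks whitespace word boundaries; SentencePiece
        # operates on raw text, so it suits this case.
        return "sentencepiece"
    if features["vocab_size"] > 50000:
        # Large open vocabularies favour BPE's merge-based subwords.
        return "bpe"
    # Compact Latin-script corpora: WordPiece as a default.
    return "wordpiece"


features = analyze_corpus(["the quick brown fox", "jumps over the lazy dog"])
print(recommend_tokenizer(features))  # prints "wordpiece"
```

The real utility would presumably weigh more features (morphological complexity, character diversity) before choosing, but the shape of the decision is the same: features in, tokenizer family out.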

Implementation Details:

Minimal changes: Single new file src/transformers/utils/tokenizer_selection.py
Zero modifications to existing tokenizer classes
Uses lazy imports to avoid circular dependencies
Integrates with existing train_new_from_iterator method
Comprehensive test coverage (16 tests)
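The lazy-import point above can be sketched as follows. The function name is hypothetical and the stdlib imports merely stand in for the transformers modules the PR actually defers; the pattern is what matters:

```python
def character_entropy(text):
    """Shannon entropy of the character distribution, in bits (sketch)."""
    # Lazy imports: nothing is loaded until the function is first called,
    # so importing the utility module stays cheap and cannot trigger a
    # circular dependency. The PR defers its transformers imports the
    # same way; math and Counter are stand-ins for illustration.
    import math
    from collections import Counter

    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum(n / total * math.log2(n / total) for n in counts.values())
```

A uniform two-character string yields exactly 1 bit per character, while a single repeated character yields 0, which is the kind of cheap statistic a corpus analyzer can compute without touching the tokenizer import chain.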

Fixes #40512


Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@Aishwarya0811
Author

The test failures appeared after clicking "Update branch" and are in SmolVLM image processing, completely unrelated to the tokenizer selection utility I implemented. These appear to be pre-existing issues in the main branch.

@Rocketknight1
Member

Mentioned in #40512, but I don't think we want this in Transformers right now! A lot of our focus is on running and fine-tuning existing models, rather than tools for making these kinds of baseline decisions about brand-new model architectures. Maybe make your own repo for it?

Collaborator

@ArthurZucker ArthurZucker left a comment


Yep, as @Rocketknight1 said, but we are happy to help you get visibility: you should make it into an HF community post!


Development

Successfully merging this pull request may close these issues.

Context-Aware Tokenizer Suggestions

3 participants