
Add Context-Aware Tokenizer Selection Utility Based on Corpus Analysis #40515

Open
Aishwarya0811 wants to merge 6 commits into huggingface:main from Aishwarya0811:feature/context-aware-tokenizer-selection

Conversation


@Aishwarya0811 Aishwarya0811 commented Aug 28, 2025

What does this PR do?

This PR introduces a new utility module that automates tokenizer selection and configuration based on corpus characteristics, addressing the need to reduce manual trial-and-error in tokenizer selection and improve model performance with minimal user effort.

Key Features:

CorpusAnalyzer: Extracts statistical features from text corpora (vocabulary size, morphological complexity, character diversity, language patterns)
TokenizerRecommender: Maps corpus features to optimal tokenizer types (BPE, WordPiece, SentencePiece) using rule-based heuristics
TokenizerSelector: End-to-end utility that analyzes corpus, recommends tokenizer type, and optionally trains it using existing infrastructure
Language-aware recommendations: Handles different script types (Latin, CJK, mixed) appropriately
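As a rough illustration of how the analyze-then-recommend flow described above might look, here is a minimal sketch. All names, thresholds, and heuristics below are hypothetical stand-ins, not the PR's actual implementation:

```python
# Hypothetical sketch of a corpus-analysis + rule-based recommendation
# flow. Function names and thresholds are illustrative only.
from collections import Counter


def analyze_corpus(texts):
    """Extract simple statistical features from a list of strings."""
    tokens = [tok for text in texts for tok in text.split()]
    chars = Counter(ch for text in texts for ch in text)
    # CJK Unified Ideographs block as a crude script-detection signal.
    cjk = sum(n for ch, n in chars.items() if "\u4e00" <= ch <= "\u9fff")
    total_chars = sum(chars.values())
    return {
        "vocab_size": len(set(tokens)),
        "char_diversity": len(chars),
        "cjk_ratio": cjk / total_chars if total_chars else 0.0,
    }


def recommend_tokenizer(features):
    """Map corpus features to a tokenizer family via simple heuristics."""
    if features["cjk_ratio"] > 0.3:
        # CJK text lacks whitespace word boundaries; SentencePiece
        # operates on raw text, so it suits this case.
        return "sentencepiece"
    if features["vocab_size"] > 50000:
        # Large open vocabularies favour BPE's merge-based subwords.
        return "bpe"
    # Compact Latin-script corpora: WordPiece as a default.
    return "wordpiece"


features = analyze_corpus(["the quick brown fox", "jumps over the lazy dog"])
print(recommend_tokenizer(features))  # prints "wordpiece"
```

The real utility would presumably weigh more features (morphological complexity, character diversity) before choosing, but the shape of the decision is the same: features in, tokenizer family out.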

Implementation Details:

Minimal changes: Single new file src/transformers/utils/tokenizer_selection.py
Zero modifications to existing tokenizer classes
Uses lazy imports to avoid circular dependencies
Integrates with existing train_new_from_iterator method
Comprehensive test coverage (16 tests)
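The lazy-import point above can be sketched as follows. The function name is hypothetical and the stdlib imports merely stand in for the transformers modules the PR actually defers; the pattern is what matters:

```python
def character_entropy(text):
    """Shannon entropy of the character distribution, in bits (sketch)."""
    # Lazy imports: nothing is loaded until the function is first called,
    # so importing the utility module stays cheap and cannot trigger a
    # circular dependency. The PR defers its transformers imports the
    # same way; math and Counter are stand-ins for illustration.
    import math
    from collections import Counter

    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum(n / total * math.log2(n / total) for n in counts.values())
```

A uniform two-character string yields exactly 1 bit per character, while a single repeated character yields 0, which is the kind of cheap statistic a corpus analyzer can compute without touching the tokenizer import chain.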

Fixes #40512


Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@Aishwarya0811
Author

The test failures appeared after clicking "Update branch" and are in SmolVLM image processing, completely unrelated to the tokenizer selection utility I implemented. These appear to be pre-existing issues in the main branch.

@Rocketknight1
Member

Mentioned in #40512, but I don't think we want this in Transformers right now! A lot of our focus is on running and fine-tuning existing models, rather than tools for making these kinds of baseline decisions about brand-new model architectures. Maybe make your own repo for it?

Collaborator

@ArthurZucker ArthurZucker left a comment


Yep, as @Rocketknight1 said, but we are happy to help you get visibility: you should make it into an HF community post!


Development

Successfully merging this pull request may close these issues.

Context-Aware Tokenizer Suggestions

3 participants