feat: Normalizer::normalize_str — skip NormalizedString allocation#2020
Open
ArthurZucker wants to merge 4 commits into main
…ization

Add a default normalize_str(&str) -> Result<String> method to the Normalizer trait that produces the normalized output without allocating a full NormalizedString (which carries original + normalized + alignment vectors: 3 allocations + O(n) alignment entries per call).

Specialized fast paths:
- Lowercase: direct s.to_lowercase(), no NormalizedString
- ByteLevel: direct byte→char mapping into a pre-allocated String
- Sequence: chains normalize_str calls without intermediate NormalizedString
- NormalizerWrapper: forwards to the concrete normalizer's fast path

Python binding updated to use normalize_str directly.

All other normalizers fall back to the default implementation, which still allocates a NormalizedString. They can be optimized individually in follow-ups.
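The shape of this change can be sketched as a trait with a default method plus one override. This is only an illustration under stated assumptions: ToyNormalizedString is a simplified stand-in for NormalizedString, the Uppercase normalizer is hypothetical, and the Result alias is defined locally rather than taken from the crate.

```rust
use std::error::Error;

type Result<T> = std::result::Result<T, Box<dyn Error + Send + Sync>>;

/// Simplified stand-in for NormalizedString (hypothetical; the real type
/// maintains its alignments with richer semantics).
#[allow(dead_code)]
struct ToyNormalizedString {
    original: String,
    normalized: String,
    alignments: Vec<(usize, usize)>,
}

impl ToyNormalizedString {
    fn new(s: &str) -> Self {
        Self {
            original: s.to_owned(),   // allocation 1
            normalized: s.to_owned(), // allocation 2
            // allocation 3: one entry per byte (toy ranges)
            alignments: (0..s.len()).map(|i| (i, i + 1)).collect(),
        }
    }

    fn get(&self) -> &str {
        &self.normalized
    }
}

trait Normalizer {
    fn normalize(&self, normalized: &mut ToyNormalizedString) -> Result<()>;

    /// Default: build the full structure, normalize, copy the result out.
    fn normalize_str(&self, s: &str) -> Result<String> {
        let mut n = ToyNormalizedString::new(s);
        self.normalize(&mut n)?;
        Ok(n.get().to_owned())
    }
}

struct Lowercase;

impl Normalizer for Lowercase {
    fn normalize(&self, normalized: &mut ToyNormalizedString) -> Result<()> {
        // Toy version: alignments are not updated here.
        normalized.normalized = normalized.normalized.to_lowercase();
        Ok(())
    }

    /// Fast path: no ToyNormalizedString is built at all.
    fn normalize_str(&self, s: &str) -> Result<String> {
        Ok(s.to_lowercase())
    }
}

/// Hypothetical normalizer that has not opted in, so it uses the default.
struct Uppercase;

impl Normalizer for Uppercase {
    fn normalize(&self, normalized: &mut ToyNormalizedString) -> Result<()> {
        normalized.normalized = normalized.normalized.to_uppercase();
        Ok(())
    }
}
```

Calling Lowercase.normalize_str("HeLLo") takes the override, while Uppercase.normalize_str("abc") goes through the default and still pays for the three buffers.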
Instead of a HashMap lookup plus a per-char push, use a flat [Utf8Entry; 256] array with pre-encoded UTF-8 bytes. Each input byte maps to 1-2 output bytes via extend_from_slice, with no HashMap, no char encoding, and no branching.
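A minimal sketch of such a table follows. The Utf8Entry field layout and both function names are assumptions made for illustration, and the byte-to-char mapping is the GPT-2 style scheme (printable bytes map to themselves, the rest are shifted to code points from U+0100 upward), which may differ in detail from the code in this PR.

```rust
/// Pre-encoded UTF-8 form of the char that one input byte maps to.
/// (Hypothetical layout: two bytes of storage plus a length.)
#[derive(Clone, Copy)]
struct Utf8Entry {
    bytes: [u8; 2],
    len: u8,
}

/// Build the 256-entry table once, up front.
fn build_table() -> [Utf8Entry; 256] {
    let mut table = [Utf8Entry { bytes: [0; 2], len: 0 }; 256];
    let mut shifted = 0u32; // how many non-printable bytes were seen so far
    for b in 0u32..256 {
        let printable = (0x21..=0x7E).contains(&b)
            || (0xA1..=0xAC).contains(&b)
            || (0xAE..=0xFF).contains(&b);
        let code_point = if printable {
            b
        } else {
            let cp = 256 + shifted;
            shifted += 1;
            cp
        };
        let ch = char::from_u32(code_point).expect("valid code point");
        // Every code point here is below U+0800, so it encodes to 1 or 2 bytes.
        let mut buf = [0u8; 4];
        let n = ch.encode_utf8(&mut buf).len();
        let mut bytes = [0u8; 2];
        bytes[..n].copy_from_slice(&buf[..n]);
        table[b as usize] = Utf8Entry { bytes, len: n as u8 };
    }
    table
}

/// Map every input byte through the table into a pre-sized buffer.
fn byte_level_normalize(s: &str, table: &[Utf8Entry; 256]) -> String {
    // Worst case is two output bytes per input byte.
    let mut out: Vec<u8> = Vec::with_capacity(s.len() * 2);
    for &b in s.as_bytes() {
        let entry = &table[b as usize];
        out.extend_from_slice(&entry.bytes[..entry.len as usize]);
    }
    String::from_utf8(out).expect("table entries are valid UTF-8")
}
```

With this mapping the space byte 0x20 becomes U+0120, so byte_level_normalize("a b", &build_table()) yields "aĠb". In real code the table would be built once and cached rather than rebuilt per call.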
Fast paths (no NormalizedString allocation):
- Lowercase: s.to_lowercase()
- ByteLevel: pre-encoded UTF-8 table lookup
- NFD/NFKD/NFC/NFKC: direct unicode_normalization iterator
- Nmt: inline filter + map over chars
- Strip: trim_start/trim_end
- StripAccents: filter combining marks
- Prepend: format!
- Sequence: chain normalize_str calls

Still using default fallback (NormalizedString):
- BertNormalizer (complex multi-step logic)
- Replace (regex-based, needs NormalizedString::transform)
- Precompiled (sentencepiece precompiled charsmap)
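To illustrate how a few of these fast paths compose, here is a rough std-only sketch of Strip, Prepend, Lowercase, and a Sequence that chains them. The StrNormalizer trait and the struct fields are hypothetical simplifications, not the crate's actual definitions, and edge cases such as empty input are not modeled.

```rust
use std::error::Error;

type Result<T> = std::result::Result<T, Box<dyn Error + Send + Sync>>;

/// Hypothetical trait reduced to the string-only entry point.
trait StrNormalizer {
    fn normalize_str(&self, s: &str) -> Result<String>;
}

struct Lowercase;

impl StrNormalizer for Lowercase {
    fn normalize_str(&self, s: &str) -> Result<String> {
        Ok(s.to_lowercase())
    }
}

/// Field names are assumptions for this sketch.
struct Strip {
    left: bool,
    right: bool,
}

impl StrNormalizer for Strip {
    fn normalize_str(&self, s: &str) -> Result<String> {
        let trimmed = match (self.left, self.right) {
            (true, true) => s.trim(),
            (true, false) => s.trim_start(),
            (false, true) => s.trim_end(),
            (false, false) => s,
        };
        Ok(trimmed.to_owned())
    }
}

struct Prepend {
    prefix: String,
}

impl StrNormalizer for Prepend {
    fn normalize_str(&self, s: &str) -> Result<String> {
        Ok(format!("{}{}", self.prefix, s))
    }
}

struct Sequence {
    normalizers: Vec<Box<dyn StrNormalizer>>,
}

impl StrNormalizer for Sequence {
    /// Feed each step's output String into the next step.
    fn normalize_str(&self, s: &str) -> Result<String> {
        let mut current = s.to_owned();
        for normalizer in &self.normalizers {
            current = normalizer.normalize_str(&current)?;
        }
        Ok(current)
    }
}

/// Example pipeline: strip both sides, lowercase, then prepend a marker.
fn example_sequence() -> Sequence {
    Sequence {
        normalizers: vec![
            Box::new(Strip { left: true, right: true }),
            Box::new(Lowercase),
            Box::new(Prepend { prefix: "▁".to_owned() }),
        ],
    }
}
```

Each step still allocates one String for its output, but none of them builds alignment vectors along the way.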
Summary
Adds a normalize_str(&str) -> Result<String> method to the Normalizer trait that produces the normalized output without allocating a full NormalizedString.

Problem

NormalizedString::from(s) allocates:
- original: String (clone of input)
- normalized: String (another clone)
- alignments: Vec<(usize, usize)> (one entry per byte)

For callers that only need the normalized string (like add_tokens building the normalized cache, or Python's normalize_str), this is pure overhead.

Solution

- Default normalize_str on the trait: falls back to NormalizedString for normalizers that haven't opted in yet
- Lowercase: s.to_lowercase(), zero intermediate allocations
- ByteLevel: direct byte→char mapping into a pre-allocated String
- Sequence: chains normalize_str calls without intermediate NormalizedString
- Python binding: normalize_str now calls the trait method directly

Follow-ups

Other normalizers (NFC, NFD, NFKC, NFKD, BertNormalizer, etc.) still fall back to the default. They can be optimized individually; the trait method is ready.
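For a rough sense of the overhead described under Problem, the sketch below counts the payload bytes of the three buffers for a given input. Both function names are hypothetical, and the way alignments are filled (one (start, end) pair per byte, pointing at the enclosing char) is an assumption about the layout rather than a copy of the real constructor.

```rust
use std::mem::size_of;

/// Payload bytes of the three buffers built eagerly for an input `s`
/// (assumed layout: two string copies plus one alignment pair per byte).
fn eager_payload_bytes(s: &str) -> usize {
    let original = s.to_owned();
    let normalized = s.to_owned();
    let mut alignments: Vec<(usize, usize)> = Vec::with_capacity(s.len());
    for (start, ch) in s.char_indices() {
        let end = start + ch.len_utf8();
        // Every byte of a char records that char's byte range.
        for _ in 0..ch.len_utf8() {
            alignments.push((start, end));
        }
    }
    original.len() + normalized.len() + alignments.len() * size_of::<(usize, usize)>()
}

/// Payload bytes when only the output String is produced.
fn fast_path_payload_bytes(s: &str) -> usize {
    s.to_lowercase().len()
}
```

On a 64-bit target each alignment pair is 16 bytes, so the eager path holds roughly 18 bytes per input byte, compared with about 1 byte per input byte for a string-only path such as lowercasing ASCII text.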