feat: Normalizer::normalize_str — skip NormalizedString allocation#2020

Open
ArthurZucker wants to merge 4 commits into main from fast-normalize-str

@ArthurZucker
Collaborator

Summary

Adds a normalize_str(&str) -> Result<String> method to the Normalizer trait that produces the normalized output without allocating a full NormalizedString.

Problem

NormalizedString::from(s) allocates:

  1. original: String — clone of input
  2. normalized: String — another clone
  3. alignments: Vec<(usize, usize)> — one entry per byte

For callers that only need the normalized string (like add_tokens building the normalized cache, or Python's normalize_str), this is pure overhead.
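
To make the cost concrete, here is a rough sketch of the three allocations listed above. `NormalizedSketch` and `heap_bytes` are hypothetical names used only for illustration; the field layout is reduced to what this description mentions and is not the library's actual definition.

```rust
/// Hypothetical stand-in for the alignment-tracking string, reduced to the
/// three fields listed in the PR description.
pub struct NormalizedSketch {
    pub original: String,
    pub normalized: String,
    pub alignments: Vec<(usize, usize)>,
}

impl NormalizedSketch {
    pub fn new(s: &str) -> Self {
        // One (start, end) pair per byte: every byte of a char points at
        // that char's byte range in the original string.
        let mut alignments = Vec::with_capacity(s.len());
        for (start, ch) in s.char_indices() {
            let end = start + ch.len_utf8();
            for _ in 0..ch.len_utf8() {
                alignments.push((start, end));
            }
        }
        NormalizedSketch {
            original: s.to_owned(),   // allocation 1: copy of the input
            normalized: s.to_owned(), // allocation 2: second copy
            alignments,               // allocation 3: one pair per byte
        }
    }

    /// Heap bytes in use by the three buffers (length-based, ignores spare capacity).
    pub fn heap_bytes(&self) -> usize {
        self.original.len()
            + self.normalized.len()
            + self.alignments.len() * std::mem::size_of::<(usize, usize)>()
    }
}
```

On a 64-bit target each `(usize, usize)` pair is 16 bytes, so under this layout the alignment vector alone costs 16 bytes per input byte, on top of the two string copies.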

Solution

  • Default normalize_str on the trait — falls back to NormalizedString for normalizers that haven't opted in yet
  • Lowercase: s.to_lowercase() — zero intermediate allocations
  • ByteLevel: direct byte→char mapping into a pre-allocated String
  • Sequence: chains normalize_str calls without intermediate NormalizedString
  • NormalizerWrapper: forwards to the concrete normalizer's fast path
  • Python binding: normalize_str now calls the trait method directly
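
The shape of the change can be sketched with a toy trait. Everything below (`Tracked`, `TrimEnd`, the error alias) is a hypothetical stand-in and not the crate's real types; it only illustrates the default fallback, an opted-in override, and how a sequence can chain the string path.

```rust
// Toy model of the trait change; none of these types are the library's own.
type Result<T> = std::result::Result<T, Box<dyn std::error::Error + Send + Sync>>;

/// Hypothetical, simplified stand-in for the alignment-tracking string.
pub struct Tracked {
    pub normalized: String,
    pub alignments: Vec<(usize, usize)>,
}

impl From<&str> for Tracked {
    fn from(s: &str) -> Self {
        Tracked {
            normalized: s.to_owned(),
            // Toy alignment: one (i, i + 1) pair per byte.
            alignments: (0..s.len()).map(|i| (i, i + 1)).collect(),
        }
    }
}

pub trait Normalizer {
    /// Slow path: mutate the alignment-tracking value in place.
    fn normalize(&self, n: &mut Tracked) -> Result<()>;

    /// Default string path: falls back to the slow path, so normalizers
    /// that have not opted in keep working unchanged.
    fn normalize_str(&self, s: &str) -> Result<String> {
        let mut n = Tracked::from(s);
        self.normalize(&mut n)?;
        Ok(n.normalized)
    }
}

pub struct Lowercase;

impl Normalizer for Lowercase {
    fn normalize(&self, n: &mut Tracked) -> Result<()> {
        n.normalized = n.normalized.to_lowercase();
        // Toy bookkeeping: keep one entry per byte; real realignment is omitted.
        n.alignments.resize(n.normalized.len(), (0, 0));
        Ok(())
    }

    /// Opted-in fast path: no Tracked value is built at all.
    fn normalize_str(&self, s: &str) -> Result<String> {
        Ok(s.to_lowercase())
    }
}

/// Hypothetical normalizer that has NOT opted in; it uses the default fallback.
pub struct TrimEnd;

impl Normalizer for TrimEnd {
    fn normalize(&self, n: &mut Tracked) -> Result<()> {
        let len = n.normalized.trim_end().len();
        n.normalized.truncate(len);
        n.alignments.truncate(len);
        Ok(())
    }
}

pub struct Sequence(pub Vec<Box<dyn Normalizer>>);

impl Normalizer for Sequence {
    fn normalize(&self, n: &mut Tracked) -> Result<()> {
        for step in &self.0 {
            step.normalize(n)?;
        }
        Ok(())
    }

    /// Chains the string path, so each step uses its own fast path or fallback.
    fn normalize_str(&self, s: &str) -> Result<String> {
        let mut current = s.to_owned();
        for step in &self.0 {
            current = step.normalize_str(&current)?;
        }
        Ok(current)
    }
}
```

In this sketch a `Sequence` of `TrimEnd` and `Lowercase` still works through `normalize_str`: the first step goes through the fallback and the second through its override, which is why the default method lets normalizers be migrated one at a time.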

Follow-ups

Other normalizers (NFC, NFD, NFKC, NFKD, BertNormalizer, etc.) still fall back to the default. They can be optimized individually — the trait method is ready.

…ization

Add a default normalize_str(&str) -> Result<String> method to the
Normalizer trait that produces the normalized output without allocating
a full NormalizedString (which carries original + normalized + alignment
vectors — 3 allocations + O(n) alignment entries per call).

Specialized fast paths:
- Lowercase: direct s.to_lowercase(), no NormalizedString
- ByteLevel: direct byte→char mapping into a pre-allocated String
- Sequence: chains normalize_str calls without intermediate NormalizedString
- NormalizerWrapper: forwards to the concrete normalizer's fast path

Python binding updated to use normalize_str directly.

All other normalizers fall back to the default implementation which
still allocates a NormalizedString. They can be optimized individually
in follow-ups.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Instead of HashMap lookup + per-char push, use a flat [Utf8Entry; 256]
array with pre-encoded UTF-8 bytes. Each input byte maps to 1-2 output
bytes via extend_from_slice — no HashMap, no char encoding, no branching.
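
A minimal sketch of such a table, assuming the GPT-2 style byte-to-character mapping (printable bytes map to themselves, the remaining bytes map to code points starting at U+0100). The `Utf8Entry` name comes from the commit message; its field layout, `build_table`, and `byte_level_str` are assumptions made for illustration.

```rust
/// Pre-encoded UTF-8 for one input byte. The layout is an assumption;
/// only the type name appears in the commit message.
#[derive(Clone, Copy)]
pub struct Utf8Entry {
    bytes: [u8; 2],
    len: u8,
}

/// Builds the 256-entry table, assuming the GPT-2 style byte-to-char mapping.
pub fn build_table() -> [Utf8Entry; 256] {
    let mut table = [Utf8Entry { bytes: [0; 2], len: 0 }; 256];
    let mut next = 0u32; // counts bytes that are remapped above U+00FF
    for b in 0..=255u8 {
        let keeps_own_code_point = matches!(b, b'!'..=b'~' | 0xA1..=0xAC | 0xAE..=0xFF);
        let code_point = if keeps_own_code_point {
            b as u32
        } else {
            let cp = 256 + next;
            next += 1;
            cp
        };
        // Every code point here is below U+0800, so it encodes to 1 or 2 bytes.
        let ch = char::from_u32(code_point).expect("valid scalar value");
        let mut buf = [0u8; 4];
        let encoded = ch.encode_utf8(&mut buf).as_bytes();
        let mut bytes = [0u8; 2];
        bytes[..encoded.len()].copy_from_slice(encoded);
        table[b as usize] = Utf8Entry { bytes, len: encoded.len() as u8 };
    }
    table
}

/// Maps every input byte through the table into a pre-allocated buffer.
pub fn byte_level_str(s: &str, table: &[Utf8Entry; 256]) -> String {
    // Each input byte yields at most 2 output bytes.
    let mut out: Vec<u8> = Vec::with_capacity(s.len() * 2);
    for &b in s.as_bytes() {
        let entry = &table[b as usize];
        out.extend_from_slice(&entry.bytes[..entry.len as usize]);
    }
    // The checked conversion re-validates the output; a real fast path
    // might skip this step, which is not shown here.
    String::from_utf8(out).expect("table entries are valid UTF-8")
}
```

With this mapping a space becomes `Ġ` (U+0120) and a newline becomes `Ċ` (U+010A). The table would normally be built once and reused across calls.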

@ArthurZucker requested a review from McPatate April 10, 2026 14:55

Fast paths (no NormalizedString allocation):
- Lowercase: s.to_lowercase()
- ByteLevel: pre-encoded UTF-8 table lookup
- NFD/NFKD/NFC/NFKC: direct unicode_normalization iterator
- Nmt: inline filter + map over chars
- Strip: trim_start/trim_end
- StripAccents: filter combining marks
- Prepend: format!
- Sequence: chain normalize_str calls

Still using default fallback (NormalizedString):
- BertNormalizer (complex multi-step logic)
- Replace (regex-based, needs NormalizedString::transform)
- Precompiled (sentencepiece precompiled charsmap)
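
For a sense of how small the simpler fast paths in the first list can be, here is a sketch using only the standard library. The function names are hypothetical, and `filter_map_chars` uses a deliberately simplified rule set (drop non-whitespace control characters, turn whitespace into a plain space) rather than the actual Nmt tables. The Unicode normalization paths are left out because they depend on an external crate.

```rust
/// Strip-style fast path: trims whitespace on the requested sides.
pub fn strip_str(s: &str, left: bool, right: bool) -> String {
    let s = if left { s.trim_start() } else { s };
    let s = if right { s.trim_end() } else { s };
    s.to_owned()
}

/// Prepend-style fast path: a single formatted allocation.
/// Any special cases the real normalizer handles are not reproduced here.
pub fn prepend_str(prefix: &str, s: &str) -> String {
    format!("{}{}", prefix, s)
}

/// Nmt-style filter + map over chars, with a simplified, assumed rule set:
/// control characters that are not whitespace are dropped, and every
/// whitespace character is replaced by a plain space.
pub fn filter_map_chars(s: &str) -> String {
    s.chars()
        .filter(|c| !c.is_control() || c.is_whitespace())
        .map(|c| if c.is_whitespace() { ' ' } else { c })
        .collect()
}
```

Each function returns a fresh `String` and never builds alignment data, which is the property the fast paths share.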