Step 4: BPE Tokenizer (SentencePiece + tiktoken) (#6) · PR #16
Conversation
Adds a dictionary-node trie supporting O(L) longest-prefix lookup, used by the BPE merge loop to check whether a symbol-pair concatenation exists in the vocab. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
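The idea behind that commit can be sketched as follows. This is a minimal illustration, not the actual `DotLLM.Tokenizers` API; class and member names are invented for the example:

```csharp
// Minimal dictionary-node trie with O(L) longest-prefix lookup,
// where L is the length of the queried span. Names are illustrative.
using System;
using System.Collections.Generic;

public sealed class VocabTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new();
        public int TokenId = -1; // -1: no vocab entry ends at this node
    }

    private readonly Node _root = new();

    public void Add(string token, int tokenId)
    {
        Node node = _root;
        foreach (char c in token)
        {
            if (!node.Children.TryGetValue(c, out Node? child))
                node.Children[c] = child = new Node();
            node = child;
        }
        node.TokenId = tokenId;
    }

    // Walks the trie once over `text`; returns the token ID of the longest
    // vocab entry that is a prefix of `text`, or -1 if none matches.
    public int LongestPrefix(ReadOnlySpan<char> text, out int matchedLength)
    {
        Node node = _root;
        int bestId = -1;
        matchedLength = 0;
        for (int i = 0; i < text.Length; i++)
        {
            if (!node.Children.TryGetValue(text[i], out Node? child)) break;
            node = child;
            if (node.TokenId >= 0) { bestId = node.TokenId; matchedLength = i + 1; }
        }
        return bestId;
    }
}
```

With a structure like this, the merge loop can ask whether `left + right` exists in the vocab with a single walk over the combined span instead of a dictionary lookup on a freshly concatenated string.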
Implements BPE encoding and decoding for both tokenizer styles:
- SentencePiece (Llama 1/2, Mistral, SmolLM): ▁ space markers, score-based merge priority.
- tiktoken/GPT-2 (Llama 3, GPT-4): byte-level Unicode encoding, merge-rank priority.

The BPE merge loop uses an index-based doubly-linked symbol list (ArrayPool) and a PriorityQueue with staleness detection (adjacency + token-ID checks). GPT-2 byte-to-unicode tables handle the Ġ/space encoding correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
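The byte-to-unicode table referenced above follows the well-known GPT-2 construction: every byte maps to a printable character, and non-printable Latin-1 bytes are shifted up past U+0100. A minimal sketch (names illustrative, not the DotLLM implementation):

```csharp
// Sketch of the classic GPT-2 byte-to-unicode table. Printable Latin-1
// bytes map to themselves; the rest are assigned consecutive code points
// starting at U+0100, which is how space (0x20) becomes Ġ (U+0120).
using System.Collections.Generic;

static class Gpt2ByteTable
{
    public static (char[] ByteToChar, Dictionary<char, byte> CharToByte) Build()
    {
        var byteToChar = new char[256];
        var charToByte = new Dictionary<char, byte>(256);
        int shifted = 0;
        for (int b = 0; b < 256; b++)
        {
            bool printable = (b >= '!' && b <= '~')
                          || (b >= 0xA1 && b <= 0xAC)
                          || (b >= 0xAE && b <= 0xFF);
            char c = printable ? (char)b : (char)(256 + shifted++);
            byteToChar[b] = c;
            charToByte[c] = (byte)b;
        }
        return (byteToChar, charToByte);
    }
}
```

With this table, `Build().ByteToChar[0x20]` is `'Ġ'` (U+0120), which is why GPT-2-style vocab entries for space-prefixed words carry a leading Ġ; decoding inverts the map before UTF-8 interpretation.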
Reads tokenizer.ggml.* metadata keys and dispatches to BpeTokenizer.CreateSentencePiece (llama/mistral) or BpeTokenizer.CreateTiktoken (gpt2/llama3). Adds project reference from DotLLM.Models to DotLLM.Tokenizers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Unit tests (21): synthetic in-memory vocabs covering encode/decode roundtrips, merge priority, staleness, byte fallback, space marker, BOS/EOS, and GgufBpeTokenizerFactory — no file I/O. Integration tests (9): against real SmolLM-135M Q8_0 GGUF via SmallModelFixture. Verifies vocab size, BOS/EOS range, encode smoke tests, and encode-decode roundtrips for ASCII, multi-word, Unicode, and numbers/punctuation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codex please review this PR - also from performance and clean code perspectives
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1f9d38daa8
```csharp
// TODO: implement regex pre-tokenization using tokenizer.ggml.pre pattern.
// Without it, this path treats the whole GPT-2-encoded text as one segment, which is
// incorrect for most tiktoken models (splits should happen at word boundaries first).
Symbol[] symbols = ArrayPool<Symbol>.Shared.Rent(gpt2Text.Length * 2);
```
Apply tokenizer.ggml.pre chunking before tiktoken merges
The tiktoken path currently runs BPE over the entire GPT-2-encoded string and explicitly skips regex pre-tokenization, but llama3/gpt2 models are routed into this code path by the GGUF factory. For those models, merges are defined relative to pre-tokenized chunks, so global merging can produce different token IDs/counts than reference tokenizers (especially around punctuation/whitespace), which can break prompt compatibility and any token-budget logic.
Acknowledged — and explicitly deferred by design. The TODO comment at line 445 of the original BpeTokenizer.cs (now preserved in Gpt2TiktokenEncoding.Encode) documents this. A follow-up issue will track proper tokenizer.ggml.pre regex parsing.
Our integration roundtrip tests pass because the SmolLM test model (SentencePiece variant) doesn't exercise this path, and short tiktoken inputs produce correct output without word-boundary pre-splits. Regex pre-tokenization will be wired up when the first tiktoken model with a tokenizer.ggml.pre key is used in integration tests.
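When that follow-up lands, the wiring could look roughly like the sketch below. The regex shown is the original GPT-2 pre-tokenizer pattern, used here only as a stand-in; real models would select a pattern based on the `tokenizer.ggml.pre` key, and `encodeSegment` is a hypothetical placeholder for the existing per-segment merge loop:

```csharp
// Sketch only: split text into pre-tokenization chunks, then run BPE merges
// per chunk so merges never cross chunk boundaries. The pattern is the
// classic GPT-2 pre-tokenizer regex, not a value read from any model.
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class PreTokenizer
{
    private static readonly Regex Gpt2Pattern = new(
        @"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+",
        RegexOptions.Compiled);

    public static List<int> Encode(string text, Func<string, IEnumerable<int>> encodeSegment)
    {
        var ids = new List<int>();
        foreach (Match m in Gpt2Pattern.Matches(text))
            ids.AddRange(encodeSegment(m.Value)); // merges stay within one chunk
        return ids;
    }
}
```

The key property is that each chunk is merged independently, matching how the merge table was learned, so punctuation and whitespace boundaries produce the same token IDs as reference tokenizers.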
```csharp
int byteId = _byteToTokenId[utf8[b]];
if (byteId < 0) continue; // no token for this byte; skip
symbols[count] = new Symbol { Prev = count - 1, Next = count + 1, TokenId = byteId };
```
Emit unk token when byte fallback token is absent
When a code point misses vocab lookup, the byte-fallback loop silently skips bytes that have no <0xNN> token. In SentencePiece vocabs without byte fallback (or with incomplete byte literals), this drops characters from the encoded output instead of emitting an unknown token, so input text can be lossy and Decode(Encode(text)) can become unexpectedly shorter.
Fixed in this update. BuildInitialSymbols in SentencePieceEncoding (and the parallel method in Gpt2TiktokenEncoding) now emits _unkId instead of silently dropping unmapped bytes:

```csharp
int byteId = _byteToTokenId[utf8[b]];
int effectiveId = byteId >= 0 ? byteId : _unkId;
symbols[count] = new Symbol { Prev = count - 1, Next = count + 1, TokenId = effectiveId };
count++;
```

_unkId is resolved at construction via `Array.FindIndex(tokens, t => t is "<unk>" or "<UNK>")`, defaulting to 0.
```csharp
string mergeKey = leftText + " " + rightText;
if (!_mergeRanks.TryGetValue(mergeKey, out int rank)) return;
```
Remove per-bigram merge-key allocations in tiktoken loop
The inner tiktoken merge candidate path allocates a new concatenated mergeKey string for every adjacent pair. On long inputs this is called many times, creating substantial transient allocations and GC pressure in the tokenizer hot path, which undermines the performance improvements from pooling elsewhere in this implementation.
Fixed in this update. _mergeRanks in Gpt2TiktokenEncoding now uses Dictionary<(int, int), int> tuple keys, with zero allocation per bigram lookup:

```csharp
// Construction: parse "A B" merges into token-ID tuple keys (one-time cost)
var tokenToId = new Dictionary<string, int>(tokens.Length, StringComparer.Ordinal);
for (int i = 0; i < tokens.Length; i++) tokenToId[tokens[i]] = i;
var mergeRanks = new Dictionary<(int, int), int>(merges.Length);
for (int rank = 0; rank < merges.Length; rank++) {
    int sep = merges[rank].IndexOf(' ');
    if (sep < 0) continue;
    if (tokenToId.TryGetValue(merges[rank][..sep], out int idA)
        && tokenToId.TryGetValue(merges[rank][(sep + 1)..], out int idB))
        mergeRanks[(idA, idB)] = rank;
}

// Hot path: zero allocation, value-type tuple key
if (!_mergeRanks.TryGetValue((symbols[leftIdx].TokenId, symbols[rightIdx].TokenId), out int rank)) return;
```

The trie lookup for the merged token ID still uses a stack-allocated Span<char> concat (already allocation-free). Only the per-bigram string mergeKey allocation is eliminated.
Split monolithic BpeTokenizer.cs into four focused files:
- BpeCore.cs: IBpeEncoding interface + shared Symbol/BgramEntry types + static utilities
- SentencePieceEncoding.cs: SP encode/decode implementing IBpeEncoding
- Gpt2TiktokenEncoding.cs: tiktoken encode/decode implementing IBpeEncoding
- BpeTokenizer.cs: thin public facade (~70 lines); factory methods delegate to IBpeEncoding

Remove the BpeVariant enum: dispatch is now structural, not conditional. New variants (e.g. Cl100kEncoding) require only a new IBpeEncoding impl + factory method.

Also fixes two Codex review comments from PR #16:
- P1 (emit unk on missing byte fallback): BuildInitialSymbols now emits _unkId instead of silently dropping bytes with no <0xNN> token; applies to both encodings.
- P2 (hot-path string alloc): Gpt2TiktokenEncoding._mergeRanks now uses Dictionary<(int,int),int> tuple keys; no string allocation per bigram check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
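The structural dispatch described in that commit could look roughly like this. Only the type names (IBpeEncoding, BpeTokenizer, the factory method names) come from the commit message; the method signatures are assumptions for illustration:

```csharp
// Sketch of the facade-over-interface shape: the public tokenizer holds an
// IBpeEncoding and never branches on a variant enum. Signatures are assumed.
using System.Collections.Generic;

public interface IBpeEncoding
{
    IReadOnlyList<int> Encode(string text, bool addBos, bool addEos);
    string Decode(IReadOnlyList<int> tokenIds);
}

public sealed class BpeTokenizer
{
    private readonly IBpeEncoding _encoding;
    private BpeTokenizer(IBpeEncoding encoding) => _encoding = encoding;

    // In the real code, CreateSentencePiece/CreateTiktoken construct the
    // concrete encodings from vocab + scores/merges; elided here.
    public static BpeTokenizer Create(IBpeEncoding encoding) => new(encoding);

    public IReadOnlyList<int> Encode(string text, bool addBos = true, bool addEos = false)
        => _encoding.Encode(text, addBos, addEos);

    public string Decode(IReadOnlyList<int> tokenIds) => _encoding.Decode(tokenIds);
}
```

The payoff of this shape is exactly what the commit claims: a new variant is a new IBpeEncoding implementation plus one factory method, with no conditional logic touching the existing encodings.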
GgufFactory_LoadsSentencePieceTokenizer used a mini-vocab without ▁. With addBosSpace=true the encoder hits byte fallback for ▁'s UTF-8 bytes (0xE2/0x96/0x81); the P1 fix now emits <unk> for each rather than silently dropping them, changing the output from [5] to [0,0,0,5].

Fix: add ▁ (token 6) to the test vocab, matching real SentencePiece models, which always include ▁ as a direct token. The expected encode result is now [6, 5] (▁ + the ab merge), exercising both the direct ▁ match and the BPE a+b→ab merge.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@gemini please review this PR - also from performance and clean code perspectives

Gemini ✦ This is an excellently structured PR. The separation of BPE variants, the zero-allocation tuple keys for merge ranks, and the comprehensive documentation (the …) are all strong. Given dotLLM's strict philosophy of zero GC pressure on the inference hot path, there are a few hidden allocations in the encoding/decoding loops that we should eliminate before merging. Here is my review focusing on performance and .NET best practices:

1. Hidden String Allocations in …
Claude ✦ Thanks for the thorough review, Gemini! All four findings addressed:

Findings 1–3: Fixed ✅
Finding 4: Follow-up issue created → #19

All 169 tests passing (141 unit + 28 integration), including a new edge-case test for raw U+2581 input.
Address Gemini review findings 1-3 on PR #16.

SentencePieceEncoding:
- Encode: ArrayPool<char> + MemoryExtensions.Replace instead of string allocs
- Decode: sb.Append + sb.Replace pattern, sb.ToString(1, len-1) for strip
- DecodeToken: Contains guard + stackalloc byte for Latin1
- BuildInitialSymbols: ReadOnlySpan<char> parameter
- TryEnqueueBigram: ArrayPool fallback for >256 char path

Gpt2TiktokenEncoding:
- Encode: pooled byte[] + char[], no intermediate string
- DecodeToken: stackalloc with ArrayPool fallback
- BuildInitialSymbols: ReadOnlySpan<char> parameter
- TryEnqueueBigram: ArrayPool fallback for >256 char path

Both files: PriorityQueue pre-sized with symbol count. Finding 4 (Trie node overhead) deferred to #19.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
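The stackalloc-with-ArrayPool-fallback pattern this commit applies in TryEnqueueBigram and DecodeToken can be sketched generically as follows. This is not the actual DotLLM code; only the 256-char threshold is taken from the commit message, and the helper name is invented:

```csharp
// Generic sketch: small concatenations live on the stack; rare long ones
// rent from ArrayPool<char>, so neither path allocates on the GC heap.
using System;
using System.Buffers;

static class PooledConcat
{
    public delegate void SpanAction(ReadOnlySpan<char> span);

    public static void WithConcat(ReadOnlySpan<char> left, ReadOnlySpan<char> right, SpanAction use)
    {
        int len = left.Length + right.Length;
        char[]? rented = null;
        // Target-typed conditional: stackalloc when it fits, pooled array otherwise.
        Span<char> buf = len <= 256
            ? stackalloc char[256]
            : (rented = ArrayPool<char>.Shared.Rent(len));
        try
        {
            left.CopyTo(buf);
            right.CopyTo(buf[left.Length..]);
            use(buf[..len]); // e.g. a trie lookup over the combined span
        }
        finally
        {
            if (rented is not null) ArrayPool<char>.Shared.Return(rented);
        }
    }
}
```

The callback shape is one way to keep the span from escaping its stack frame; the real code presumably inlines the lookup instead, but the buffer-selection logic is the same idea.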
Summary
- Trie prefix-matching data structure in DotLLM.Tokenizers for O(L) vocab lookup during BPE merge loops
- BpeTokenizer supporting both SentencePiece (Llama 1/2, Mistral, SmolLM) and tiktoken/GPT-2 (Llama 3) BPE variants
- SentencePiece: ▁ space markers, score-based merge priority, <0xNN> byte fallback
- tiktoken/GPT-2: byte-level Unicode encoding (Ġ = space, etc.), merge-rank priority; BPE staleness detection using token-ID checks
- GgufBpeTokenizerFactory that reads tokenizer.ggml.* metadata and dispatches to the correct variant

Test plan
- dotnet build: clean, 0 warnings, 0 errors
- dotnet test tests/DotLLM.Tests.Unit: 140/140 passing
- dotnet test tests/DotLLM.Tests.Integration: 28/28 passing (includes 9 new BPE tests)
- Encode-decode roundtrips for ASCII, multi-word, Unicode (café au lait), numbers+punctuation against SmolLM-135M

Closes #6
🤖 Generated with Claude Code