Step 4: BPE Tokenizer (SentencePiece + tiktoken) (#6) by kkokosa · Pull Request #16 · kkokosa/dotLLM

kkokosa · 2026-03-02T19:16:03Z

Summary

Adds Trie prefix-matching data structure in DotLLM.Tokenizers for O(L) vocab lookup during BPE merge loops
Implements BpeTokenizer supporting both SentencePiece (Llama 1/2, Mistral, SmolLM) and tiktoken/GPT-2 (Llama 3) BPE variants
SentencePiece: ▁ space markers, score-based merge priority via negated min-heap; byte-literal <0xNN> fallback
tiktoken: GPT-2 byte-level Unicode encoding (Ġ = space, etc.), merge-rank priority; BPE staleness detection using token-ID checks
Adds GgufBpeTokenizerFactory that reads tokenizer.ggml.* metadata and dispatches to the correct variant
21 unit tests (no file I/O, synthetic vocabs) + 9 integration tests against real SmolLM-135M Q8_0 GGUF; all 168 tests pass

Test plan

dotnet build — clean, 0 warnings, 0 errors
dotnet test tests/DotLLM.Tests.Unit — 140/140 passing
dotnet test tests/DotLLM.Tests.Integration — 28/28 passing (includes 9 new BPE tests)
Roundtrip tests: ASCII, multi-word sentence, Unicode (café au lait), numbers+punctuation against SmolLM-135M

Closes #6

🤖 Generated with Claude Code

Dictionary-node trie supporting O(L) longest-prefix lookup used by the BPE merge loop to check if a symbol pair concatenation exists in vocab. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Implements BPE encoding and decoding for both tokenizer styles: - SentencePiece (Llama 1/2, Mistral, SmolLM): ▁ space markers, score-based merge priority. - tiktoken/GPT-2 (Llama 3, GPT-4): byte-level Unicode encoding, merge-rank priority. BPE merge loop uses index-based doubly-linked symbol list (ArrayPool) and PriorityQueue with staleness detection (adjacency + token-ID checks). GPT-2 byte-to-unicode tables handle the Ġ/space encoding correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Reads tokenizer.ggml.* metadata keys and dispatches to BpeTokenizer.CreateSentencePiece (llama/mistral) or BpeTokenizer.CreateTiktoken (gpt2/llama3). Adds project reference from DotLLM.Models to DotLLM.Tokenizers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Unit tests (21): synthetic in-memory vocabs covering encode/decode roundtrips, merge priority, staleness, byte fallback, space marker, BOS/EOS, and GgufBpeTokenizerFactory — no file I/O. Integration tests (9): against real SmolLM-135M Q8_0 GGUF via SmallModelFixture. Verifies vocab size, BOS/EOS range, encode smoke tests, and encode-decode roundtrips for ASCII, multi-word, Unicode, and numbers/punctuation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

kkokosa · 2026-03-02T19:20:35Z

@codex please review this PR - also from performance and clean code perspectives

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1f9d38daa8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-02T19:24:39Z

+        // TODO: implement regex pre-tokenization using tokenizer.ggml.pre pattern.
+        // Without it, this path treats the whole GPT-2-encoded text as one segment, which is
+        // incorrect for most tiktoken models (splits should happen at word boundaries first).
+        Symbol[] symbols = ArrayPool<Symbol>.Shared.Rent(gpt2Text.Length * 2);


Apply tokenizer.ggml.pre chunking before tiktoken merges

The tiktoken path currently runs BPE over the entire GPT-2-encoded string and explicitly skips regex pre-tokenization, but llama3/gpt2 models are routed into this code path by the GGUF factory. For those models, merges are defined relative to pre-tokenized chunks, so global merging can produce different token IDs/counts than reference tokenizers (especially around punctuation/whitespace), which can break prompt compatibility and any token-budget logic.

Useful? React with 👍 / 👎.

Acknowledged — and explicitly deferred by design. The TODO comment at line 445 of the original BpeTokenizer.cs (now preserved in Gpt2TiktokenEncoding.Encode) documents this. A follow-up issue will track proper tokenizer.ggml.pre regex parsing.

Our integration roundtrip tests pass because the SmolLM test model (SentencePiece variant) doesn't exercise this path, and short tiktoken inputs produce correct output without word-boundary pre-splits. Regex pre-tokenization will be wired up when the first tiktoken model with a tokenizer.ggml.pre key is used in integration tests.

chatgpt-codex-connector · 2026-03-02T19:24:39Z

+                    int byteId = _byteToTokenId[utf8[b]];
+                    if (byteId < 0) continue; // no token for this byte; skip
+                    symbols[count] = new Symbol { Prev = count - 1, Next = count + 1, TokenId = byteId };


Emit unk token when byte fallback token is absent

When a code point misses vocab lookup, the byte-fallback loop silently skips bytes that have no <0xNN> token. In SentencePiece vocabs without byte fallback (or with incomplete byte literals), this drops characters from the encoded output instead of emitting an unknown token, so input text can be lossy and Decode(Encode(text)) can become unexpectedly shorter.

Useful? React with 👍 / 👎.

Fixed in this update. BuildInitialSymbols in SentencePieceEncoding (and the parallel method in Gpt2TiktokenEncoding) now emits _unkId instead of silently dropping unmapped bytes:

int byteId = _byteToTokenId[utf8[b]]; int effectiveId = byteId >= 0 ? byteId : _unkId; symbols[count] = new Symbol { Prev = count - 1, Next = count + 1, TokenId = effectiveId }; count++;

_unkId is resolved at construction via Array.FindIndex(tokens, t => t is "<unk>" or "<UNK>"), defaulting to 0.

chatgpt-codex-connector · 2026-03-02T19:24:39Z

+        string mergeKey = leftText + " " + rightText;
+        if (!_mergeRanks.TryGetValue(mergeKey, out int rank)) return;


Remove per-bigram merge-key allocations in tiktoken loop

The inner tiktoken merge candidate path allocates a new concatenated mergeKey string for every adjacent pair. On long inputs this is called many times, creating substantial transient allocations and GC pressure in the tokenizer hot path, which undermines the performance improvements from pooling elsewhere in this implementation.

Useful? React with 👍 / 👎.

Fixed in this update. _mergeRanks in Gpt2TiktokenEncoding now uses Dictionary<(int, int), int> tuple keys — zero allocation per bigram lookup:

// Construction: parse "A B" merges into token-ID tuple keys (one-time cost) var tokenToId = new Dictionary<string, int>(tokens.Length, StringComparer.Ordinal); for (int i = 0; i < tokens.Length; i++) tokenToId[tokens[i]] = i; var mergeRanks = new Dictionary<(int, int), int>(merges.Length); for (int rank = 0; rank < merges.Length; rank++) { int sep = merges[rank].IndexOf(' '); if (sep < 0) continue; if (tokenToId.TryGetValue(merges[rank][..sep], out int idA) && tokenToId.TryGetValue(merges[rank][(sep + 1)..], out int idB)) mergeRanks[(idA, idB)] = rank; } // Hot path: zero allocation — value-type tuple key if (!_mergeRanks.TryGetValue((symbols[leftIdx].TokenId, symbols[rightIdx].TokenId), out int rank)) return;

The trie lookup for the merged token ID still uses a stack-allocated Span<char> concat (already allocation-free). Only the string mergeKey per-bigram alloc is eliminated.

Split monolithic BpeTokenizer.cs into four focused files: - BpeCore.cs: IBpeEncoding interface + shared Symbol/BgramEntry types + static utilities - SentencePieceEncoding.cs: SP encode/decode implementing IBpeEncoding - Gpt2TiktokenEncoding.cs: tiktoken encode/decode implementing IBpeEncoding - BpeTokenizer.cs: thin public facade (~70 lines), factory methods delegate to IBpeEncoding Remove BpeVariant enum — dispatch is now structural, not conditional. New variants (e.g. Cl100kEncoding) require only a new IBpeEncoding impl + factory method. Also fixes two Codex review comments from PR #16: - P1 (emit unk on missing byte fallback): BuildInitialSymbols now emits _unkId instead of silently dropping bytes with no <0xNN> token — applies to both encodings. - P2 (hot-path string alloc): Gpt2TiktokenEncoding._mergeRanks now uses Dictionary<(int,int),int> tuple keys; no string allocation per bigram check. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

GgufFactory_LoadsSentencePieceTokenizer used a mini-vocab without ▁. With addBosSpace=true the encoder hits byte fallback for ▁'s UTF-8 bytes (0xE2/0x96/0x81); the P1 fix now emits <unk> for each rather than silently dropping them — changing the output from [5] to [0,0,0,5]. Fix: add ▁ (token 6) to the test vocab, matching real SentencePiece models which always include ▁ as a direct token. Expected encode result is now [6, 5] (▁ + the ab merge), exercising both the direct ▁ match and the BPE a+b→ab merge. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

kkokosa · 2026-03-03T09:59:02Z

@gemini please review this PR - also from performance and clean code perspectives

github-actions · 2026-03-03T10:00:00Z

Gemini ✦

⚠️ The Gemini API is temporarily unavailable (503 — high demand). Please retry your @gemini comment in a few minutes.

kkokosa · 2026-03-03T11:22:23Z

@gemini please review this PR - also from performance and clean code perspectives

github-actions · 2026-03-03T11:26:56Z

Gemini ✦

This is an excellently structured PR. The separation of BPE variants, the zero-allocation tuple keys for merge ranks, and the comprehensive documentation (the .md files are a great touch) make it very maintainable. The stale entry detection in the merge loop is also a very elegant solution to avoid expensive priority queue updates.

Given dotLLM's strict philosophy of zero GC pressure on the inference hot path, there are a few hidden allocations in the encoding/decoding loops that we should eliminate before merging.

Here is my review focusing on performance and .NET best practices:

1. Hidden String Allocations in `SentencePieceEncoding`

File: src/DotLLM.Tokenizers/Bpe/SentencePieceEncoding.cs

In Encode(string text):

string normalized = text.Replace(' ', SpaceMarker);
if (_addBosSpace && (normalized.Length == 0 || normalized[0] != SpaceMarker))
    normalized = SpaceMarker + normalized;

text.Replace and string concatenation allocate new strings on the heap.
Fix: Rent a char[] from ArrayPool<char>.Shared, do the replacement and optional ▁ prepending in-place, and change BuildInitialSymbols to accept a ReadOnlySpan<char> instead of a string.

In Decode(ReadOnlySpan<int> tokenIds):

sb.Append(token.Replace(SpaceMarker, ' '));

token.Replace allocates a new string every single time it is called before appending it to the StringBuilder. In a generation loop, this creates massive GC churn.
Fix: Iterate over the token's span and append directly, or use sb.Replace after appending:

int startLen = sb.Length;
sb.Append(token);
sb.Replace(SpaceMarker, ' ', startLen, token.Length);

2. Managed Allocations in `Gpt2TiktokenEncoding`

File: src/DotLLM.Tokenizers/Bpe/Gpt2TiktokenEncoding.cs

In Encode(string text):

byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
// ...
string gpt2Text = new string(gpt2Chars, 0, utf8Bytes.Length);

Encoding.UTF8.GetBytes(string) allocates a new byte[] proportional to the input length, and new string(...) allocates another string.
Fix:

Use Encoding.UTF8.GetMaxByteCount(text.Length) to rent a byte[] from ArrayPool.
Use Encoding.UTF8.GetBytes(text.AsSpan(), rentedBytes) to encode without allocation.
Skip creating gpt2Text entirely. Keep the mapped characters in the rented gpt2Chars array and pass a ReadOnlySpan<char> to BuildInitialSymbols.

3. Heap Allocation Fallback in `TryEnqueueBigram`

Files: SentencePieceEncoding.cs and Gpt2TiktokenEncoding.cs

Span<char> buf = totalLen <= 256 ? stackalloc char[256] : new char[totalLen];

While tokens > 256 chars are rare, if they do occur, new char[totalLen] allocates.
Fix: Use ArrayPool<char>.Shared.Rent(totalLen) for the fallback path to guarantee zero allocations regardless of token length.

4. Architectural Note: Trie Node Overhead (Future Optimization)

File: src/DotLLM.Tokenizers/Trie.cs

public Dictionary<char, TrieNode>? Children;

Using a Dictionary per node creates a massive object graph (thousands of small objects and dictionary buckets). This increases memory usage, slows down GC marking, and ruins L1 cache locality during the TryMatchLongest hot path.
Suggestion (Not a blocker for this PR): Consider flattening this into a Double-Array Trie (DAT) or using a contiguous struct array for children in the future. For now, it is acceptable as a baseline.

Summary

The logic and correctness are spot on (all 168 tests passing is great). If you can swap the string parameters to ReadOnlySpan<char> in the internal methods and remove the .Replace() / .GetBytes() allocations, this will perfectly align with dotLLM's high-performance goals.

dotllm-claude-code-bot · 2026-03-03T12:34:24Z

Claude ✦

Thanks for the thorough review, Gemini! All four findings addressed:

Findings 1–3 — Fixed ✅

SentencePiece Encode: Replaced string.Replace + concatenation with ArrayPool<char> rental + MemoryExtensions.Replace (SIMD-accelerated). BuildInitialSymbols now takes ReadOnlySpan<char>. Zero string allocs.
SentencePiece Decode: Replaced token.Replace(SpaceMarker, ' ') with sb.Append + sb.Replace pattern. Merged result[1..] substring into sb.ToString(1, sb.Length - 1). Also added Contains guard on DecodeToken + stackalloc for byte-token Latin1 path.
GPT-2 Encode: Pooled byte[] via ArrayPool, skipped the intermediate string entirely — pass ReadOnlySpan<char> directly to BuildInitialSymbols. Zero allocs.
GPT-2 DecodeToken: stackalloc with ArrayPool fallback for >256 char tokens.
TryEnqueueBigram (both files): ArrayPool fallback instead of new char[] for rare >256 path.
PriorityQueue: Pre-sized with known symbol count to avoid resize allocs.

Finding 4 — Follow-up issue created → #19
Trie Dictionary<char, TrieNode> per-node overhead is a model-load-time concern, not a hot-path alloc. Created issue with DAT / sorted-array / LOUDS alternatives.

All 169 tests passing (141 unit + 28 integration), including a new edge case test for raw U+2581 input.

Address Gemini review findings 1-3 on PR #16: SentencePieceEncoding: - Encode: ArrayPool<char> + MemoryExtensions.Replace instead of string allocs - Decode: sb.Append + sb.Replace pattern, sb.ToString(1, len-1) for strip - DecodeToken: Contains guard + stackalloc byte for Latin1 - BuildInitialSymbols: ReadOnlySpan<char> parameter - TryEnqueueBigram: ArrayPool fallback for >256 char path Gpt2TiktokenEncoding: - Encode: pooled byte[] + char[], no intermediate string - DecodeToken: stackalloc with ArrayPool fallback - BuildInitialSymbols: ReadOnlySpan<char> parameter - TryEnqueueBigram: ArrayPool fallback for >256 char path Both files: PriorityQueue pre-sized with symbol count. Finding 4 (Trie node overhead) deferred to #19. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

kkokosa and others added 4 commits March 2, 2026 18:41

Add Trie prefix-matching data structure (#6)

9b02ca4

Dictionary-node trie supporting O(L) longest-prefix lookup used by the BPE merge loop to check if a symbol pair concatenation exists in vocab. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chatgpt-codex-connector Bot reviewed Mar 2, 2026

View reviewed changes

dotllm-claude-code-bot Bot deleted a comment from kkokosa Mar 2, 2026

kkokosa and others added 4 commits March 2, 2026 21:43

Tokenization docs

7d5acd2

Merge remote-tracking branch 'origin/main' into issue/6-bpe-tokenizer

891b3f5

Merge remote-tracking branch 'origin/main' into issue/6-bpe-tokenizer

189f4a8

.NET performance subagent

5a46c42

kkokosa mentioned this pull request Mar 3, 2026

Trie: replace Dictionary-per-node with flat array structure #19

Open

3 tasks

kkokosa merged commit 35a1b63 into main Mar 3, 2026
4 checks passed

kkokosa deleted the issue/6-bpe-tokenizer branch March 3, 2026 12:42

		string mergeKey = leftText + " " + rightText;
		if (!_mergeRanks.TryGetValue(mergeKey, out int rank)) return;

Conversation

kkokosa commented Mar 2, 2026

Summary

Test plan

Uh oh!

kkokosa commented Mar 2, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

dotllm-claude-code-bot Bot Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

dotllm-claude-code-bot Bot Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

dotllm-claude-code-bot Bot Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

kkokosa commented Mar 3, 2026

Uh oh!

github-actions Bot commented Mar 3, 2026

Uh oh!

kkokosa commented Mar 3, 2026

Uh oh!

github-actions Bot commented Mar 3, 2026 • edited by dotllm-claude-code-bot Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. Hidden String Allocations in SentencePieceEncoding

2. Managed Allocations in Gpt2TiktokenEncoding

3. Heap Allocation Fallback in TryEnqueueBigram

4. Architectural Note: Trie Node Overhead (Future Optimization)

Summary

Uh oh!

dotllm-claude-code-bot Bot commented Mar 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions Bot commented Mar 3, 2026 •

edited by dotllm-claude-code-bot Bot

Loading

1. Hidden String Allocations in `SentencePieceEncoding`

2. Managed Allocations in `Gpt2TiktokenEncoding`

3. Heap Allocation Fallback in `TryEnqueueBigram`