Skip to content

Step 4: BPE Tokenizer (SentencePiece + tiktoken) (#6)#16

Merged
kkokosa merged 11 commits intomainfrom
issue/6-bpe-tokenizer
Mar 3, 2026
Merged

Step 4: BPE Tokenizer (SentencePiece + tiktoken) (#6)#16
kkokosa merged 11 commits intomainfrom
issue/6-bpe-tokenizer

Conversation

@kkokosa
Copy link
Copy Markdown
Owner

@kkokosa kkokosa commented Mar 2, 2026

Summary

  • Adds Trie prefix-matching data structure in DotLLM.Tokenizers for O(L) vocab lookup during BPE merge loops
  • Implements BpeTokenizer supporting both SentencePiece (Llama 1/2, Mistral, SmolLM) and tiktoken/GPT-2 (Llama 3) BPE variants
  • SentencePiece: ▁ space markers, score-based merge priority via negated min-heap; byte-literal <0xNN> fallback
  • tiktoken: GPT-2 byte-level Unicode encoding (Ġ = space, etc.), merge-rank priority; BPE staleness detection using token-ID checks
  • Adds GgufBpeTokenizerFactory that reads tokenizer.ggml.* metadata and dispatches to the correct variant
  • 21 unit tests (no file I/O, synthetic vocabs) + 9 integration tests against real SmolLM-135M Q8_0 GGUF; all 168 tests pass

Test plan

  • dotnet build — clean, 0 warnings, 0 errors
  • dotnet test tests/DotLLM.Tests.Unit — 140/140 passing
  • dotnet test tests/DotLLM.Tests.Integration — 28/28 passing (includes 9 new BPE tests)
  • Roundtrip tests: ASCII, multi-word sentence, Unicode (café au lait), numbers+punctuation against SmolLM-135M

Closes #6

🤖 Generated with Claude Code

kkokosa and others added 4 commits March 2, 2026 18:41
Dictionary-node trie supporting O(L) longest-prefix lookup used by the
BPE merge loop to check if a symbol pair concatenation exists in vocab.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements BPE encoding and decoding for both tokenizer styles:
- SentencePiece (Llama 1/2, Mistral, SmolLM): ▁ space markers, score-based merge priority.
- tiktoken/GPT-2 (Llama 3, GPT-4): byte-level Unicode encoding, merge-rank priority.

BPE merge loop uses index-based doubly-linked symbol list (ArrayPool) and
PriorityQueue with staleness detection (adjacency + token-ID checks).
GPT-2 byte-to-unicode tables handle the Ġ/space encoding correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reads tokenizer.ggml.* metadata keys and dispatches to
BpeTokenizer.CreateSentencePiece (llama/mistral) or
BpeTokenizer.CreateTiktoken (gpt2/llama3). Adds project reference
from DotLLM.Models to DotLLM.Tokenizers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Unit tests (21): synthetic in-memory vocabs covering encode/decode
roundtrips, merge priority, staleness, byte fallback, space marker,
BOS/EOS, and GgufBpeTokenizerFactory — no file I/O.

Integration tests (9): against real SmolLM-135M Q8_0 GGUF via
SmallModelFixture. Verifies vocab size, BOS/EOS range, encode smoke
tests, and encode-decode roundtrips for ASCII, multi-word, Unicode,
and numbers/punctuation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kkokosa
Copy link
Copy Markdown
Owner Author

kkokosa commented Mar 2, 2026

@codex please review this PR - also from performance and clean code perspectives

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1f9d38daa8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +445 to +448
// TODO: implement regex pre-tokenization using tokenizer.ggml.pre pattern.
// Without it, this path treats the whole GPT-2-encoded text as one segment, which is
// incorrect for most tiktoken models (splits should happen at word boundaries first).
Symbol[] symbols = ArrayPool<Symbol>.Shared.Rent(gpt2Text.Length * 2);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Apply tokenizer.ggml.pre chunking before tiktoken merges

The tiktoken path currently runs BPE over the entire GPT-2-encoded string and explicitly skips regex pre-tokenization, but llama3/gpt2 models are routed into this code path by the GGUF factory. For those models, merges are defined relative to pre-tokenized chunks, so global merging can produce different token IDs/counts than reference tokenizers (especially around punctuation/whitespace), which can break prompt compatibility and any token-budget logic.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged — and explicitly deferred by design. The TODO comment at line 445 of the original BpeTokenizer.cs (now preserved in Gpt2TiktokenEncoding.Encode) documents this. A follow-up issue will track proper tokenizer.ggml.pre regex parsing.

Our integration roundtrip tests pass because the SmolLM test model (SentencePiece variant) doesn't exercise this path, and short tiktoken inputs produce correct output without word-boundary pre-splits. Regex pre-tokenization will be wired up when the first tiktoken model with a tokenizer.ggml.pre key is used in integration tests.

Comment on lines +356 to +358
int byteId = _byteToTokenId[utf8[b]];
if (byteId < 0) continue; // no token for this byte; skip
symbols[count] = new Symbol { Prev = count - 1, Next = count + 1, TokenId = byteId };
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Emit unk token when byte fallback token is absent

When a code point misses vocab lookup, the byte-fallback loop silently skips bytes that have no <0xNN> token. In SentencePiece vocabs without byte fallback (or with incomplete byte literals), this drops characters from the encoded output instead of emitting an unknown token, so input text can be lossy and Decode(Encode(text)) can become unexpectedly shorter.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in this update. BuildInitialSymbols in SentencePieceEncoding (and the parallel method in Gpt2TiktokenEncoding) now emits _unkId instead of silently dropping unmapped bytes:

int byteId = _byteToTokenId[utf8[b]];
int effectiveId = byteId >= 0 ? byteId : _unkId;
symbols[count] = new Symbol { Prev = count - 1, Next = count + 1, TokenId = effectiveId };
count++;

_unkId is resolved at construction via Array.FindIndex(tokens, t => t is "<unk>" or "<UNK>"), defaulting to 0.

Comment on lines +509 to +510
string mergeKey = leftText + " " + rightText;
if (!_mergeRanks.TryGetValue(mergeKey, out int rank)) return;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Remove per-bigram merge-key allocations in tiktoken loop

The inner tiktoken merge candidate path allocates a new concatenated mergeKey string for every adjacent pair. On long inputs this is called many times, creating substantial transient allocations and GC pressure in the tokenizer hot path, which undermines the performance improvements from pooling elsewhere in this implementation.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in this update. _mergeRanks in Gpt2TiktokenEncoding now uses Dictionary<(int, int), int> tuple keys — zero allocation per bigram lookup:

// Construction: parse "A B" merges into token-ID tuple keys (one-time cost)
var tokenToId = new Dictionary<string, int>(tokens.Length, StringComparer.Ordinal);
for (int i = 0; i < tokens.Length; i++) tokenToId[tokens[i]] = i;

var mergeRanks = new Dictionary<(int, int), int>(merges.Length);
for (int rank = 0; rank < merges.Length; rank++) {
    int sep = merges[rank].IndexOf(' ');
    if (sep < 0) continue;
    if (tokenToId.TryGetValue(merges[rank][..sep], out int idA)
        && tokenToId.TryGetValue(merges[rank][(sep + 1)..], out int idB))
        mergeRanks[(idA, idB)] = rank;
}

// Hot path: zero allocation — value-type tuple key
if (!_mergeRanks.TryGetValue((symbols[leftIdx].TokenId, symbols[rightIdx].TokenId), out int rank)) return;

The trie lookup for the merged token ID still uses a stack-allocated Span<char> concat (already allocation-free). Only the string mergeKey per-bigram alloc is eliminated.

Split monolithic BpeTokenizer.cs into four focused files:
- BpeCore.cs: IBpeEncoding interface + shared Symbol/BgramEntry types + static utilities
- SentencePieceEncoding.cs: SP encode/decode implementing IBpeEncoding
- Gpt2TiktokenEncoding.cs: tiktoken encode/decode implementing IBpeEncoding
- BpeTokenizer.cs: thin public facade (~70 lines), factory methods delegate to IBpeEncoding

Remove BpeVariant enum — dispatch is now structural, not conditional.
New variants (e.g. Cl100kEncoding) require only a new IBpeEncoding impl + factory method.

Also fixes two Codex review comments from PR #16:
- P1 (emit unk on missing byte fallback): BuildInitialSymbols now emits _unkId
  instead of silently dropping bytes with no <0xNN> token — applies to both encodings.
- P2 (hot-path string alloc): Gpt2TiktokenEncoding._mergeRanks now uses
  Dictionary<(int,int),int> tuple keys; no string allocation per bigram check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@dotllm-claude-code-bot dotllm-claude-code-bot Bot deleted a comment from kkokosa Mar 2, 2026
@dotllm-claude-code-bot dotllm-claude-code-bot Bot deleted a comment from kkokosa Mar 2, 2026
@dotllm-claude-code-bot dotllm-claude-code-bot Bot deleted a comment from kkokosa Mar 2, 2026
kkokosa and others added 4 commits March 2, 2026 21:43
GgufFactory_LoadsSentencePieceTokenizer used a mini-vocab without ▁.
With addBosSpace=true the encoder hits byte fallback for ▁'s UTF-8 bytes
(0xE2/0x96/0x81); the P1 fix now emits <unk> for each rather than
silently dropping them — changing the output from [5] to [0,0,0,5].

Fix: add ▁ (token 6) to the test vocab, matching real SentencePiece
models which always include ▁ as a direct token. Expected encode result
is now [6, 5] (▁ + the ab merge), exercising both the direct ▁ match
and the BPE a+b→ab merge.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kkokosa
Copy link
Copy Markdown
Owner Author

kkokosa commented Mar 3, 2026

@gemini please review this PR - also from performance and clean code perspectives

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 3, 2026

Gemini

⚠️ The Gemini API is temporarily unavailable (503 — high demand). Please retry your @gemini comment in a few minutes.

@kkokosa
Copy link
Copy Markdown
Owner Author

kkokosa commented Mar 3, 2026

@gemini please review this PR - also from performance and clean code perspectives

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 3, 2026

Gemini

This is an excellently structured PR. The separation of BPE variants, the zero-allocation tuple keys for merge ranks, and the comprehensive documentation (the .md files are a great touch) make it very maintainable. The stale entry detection in the merge loop is also a very elegant solution to avoid expensive priority queue updates.

Given dotLLM's strict philosophy of zero GC pressure on the inference hot path, there are a few hidden allocations in the encoding/decoding loops that we should eliminate before merging.

Here is my review focusing on performance and .NET best practices:

1. Hidden String Allocations in SentencePieceEncoding

File: src/DotLLM.Tokenizers/Bpe/SentencePieceEncoding.cs

In Encode(string text):

string normalized = text.Replace(' ', SpaceMarker);
if (_addBosSpace && (normalized.Length == 0 || normalized[0] != SpaceMarker))
    normalized = SpaceMarker + normalized;

text.Replace and string concatenation allocate new strings on the heap.
Fix: Rent a char[] from ArrayPool<char>.Shared, do the replacement and optional prepending in-place, and change BuildInitialSymbols to accept a ReadOnlySpan<char> instead of a string.

In Decode(ReadOnlySpan<int> tokenIds):

sb.Append(token.Replace(SpaceMarker, ' '));

token.Replace allocates a new string every single time it is called before appending it to the StringBuilder. In a generation loop, this creates massive GC churn.
Fix: Iterate over the token's span and append directly, or use sb.Replace after appending:

int startLen = sb.Length;
sb.Append(token);
sb.Replace(SpaceMarker, ' ', startLen, token.Length);

2. Managed Allocations in Gpt2TiktokenEncoding

File: src/DotLLM.Tokenizers/Bpe/Gpt2TiktokenEncoding.cs

In Encode(string text):

byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
// ...
string gpt2Text = new string(gpt2Chars, 0, utf8Bytes.Length);

Encoding.UTF8.GetBytes(string) allocates a new byte[] proportional to the input length, and new string(...) allocates another string.
Fix:

  1. Use Encoding.UTF8.GetMaxByteCount(text.Length) to rent a byte[] from ArrayPool.
  2. Use Encoding.UTF8.GetBytes(text.AsSpan(), rentedBytes) to encode without allocation.
  3. Skip creating gpt2Text entirely. Keep the mapped characters in the rented gpt2Chars array and pass a ReadOnlySpan<char> to BuildInitialSymbols.

3. Heap Allocation Fallback in TryEnqueueBigram

Files: SentencePieceEncoding.cs and Gpt2TiktokenEncoding.cs

Span<char> buf = totalLen <= 256 ? stackalloc char[256] : new char[totalLen];

While tokens > 256 chars are rare, if they do occur, new char[totalLen] allocates.
Fix: Use ArrayPool<char>.Shared.Rent(totalLen) for the fallback path to guarantee zero allocations regardless of token length.

4. Architectural Note: Trie Node Overhead (Future Optimization)

File: src/DotLLM.Tokenizers/Trie.cs

public Dictionary<char, TrieNode>? Children;

Using a Dictionary per node creates a massive object graph (thousands of small objects and dictionary buckets). This increases memory usage, slows down GC marking, and ruins L1 cache locality during the TryMatchLongest hot path.
Suggestion (Not a blocker for this PR): Consider flattening this into a Double-Array Trie (DAT) or using a contiguous struct array for children in the future. For now, it is acceptable as a baseline.

Summary

The logic and correctness are spot on (all 168 tests passing is great). If you can swap the string parameters to ReadOnlySpan<char> in the internal methods and remove the .Replace() / .GetBytes() allocations, this will perfectly align with dotLLM's high-performance goals.

@dotllm-claude-code-bot
Copy link
Copy Markdown

Claude

Thanks for the thorough review, Gemini! All four findings addressed:

Findings 1–3 — Fixed

  • SentencePiece Encode: Replaced string.Replace + concatenation with ArrayPool<char> rental + MemoryExtensions.Replace (SIMD-accelerated). BuildInitialSymbols now takes ReadOnlySpan<char>. Zero string allocs.
  • SentencePiece Decode: Replaced token.Replace(SpaceMarker, ' ') with sb.Append + sb.Replace pattern. Merged result[1..] substring into sb.ToString(1, sb.Length - 1). Also added Contains guard on DecodeToken + stackalloc for byte-token Latin1 path.
  • GPT-2 Encode: Pooled byte[] via ArrayPool, skipped the intermediate string entirely — pass ReadOnlySpan<char> directly to BuildInitialSymbols. Zero allocs.
  • GPT-2 DecodeToken: stackalloc with ArrayPool fallback for >256 char tokens.
  • TryEnqueueBigram (both files): ArrayPool fallback instead of new char[] for rare >256 path.
  • PriorityQueue: Pre-sized with known symbol count to avoid resize allocs.

Finding 4 — Follow-up issue created#19
Trie Dictionary<char, TrieNode> per-node overhead is a model-load-time concern, not a hot-path alloc. Created issue with DAT / sorted-array / LOUDS alternatives.

All 169 tests passing (141 unit + 28 integration), including a new edge case test for raw U+2581 input.

Address Gemini review findings 1-3 on PR #16:

SentencePieceEncoding:
- Encode: ArrayPool<char> + MemoryExtensions.Replace instead of string allocs
- Decode: sb.Append + sb.Replace pattern, sb.ToString(1, len-1) for strip
- DecodeToken: Contains guard + stackalloc byte for Latin1
- BuildInitialSymbols: ReadOnlySpan<char> parameter
- TryEnqueueBigram: ArrayPool fallback for >256 char path

Gpt2TiktokenEncoding:
- Encode: pooled byte[] + char[], no intermediate string
- DecodeToken: stackalloc with ArrayPool fallback
- BuildInitialSymbols: ReadOnlySpan<char> parameter
- TryEnqueueBigram: ArrayPool fallback for >256 char path

Both files: PriorityQueue pre-sized with symbol count.
Finding 4 (Trie node overhead) deferred to #19.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kkokosa kkokosa merged commit 35a1b63 into main Mar 3, 2026
4 checks passed
@kkokosa kkokosa deleted the issue/6-bpe-tokenizer branch March 3, 2026 12:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement BPE tokenizer with GGUF vocabulary loading

1 participant