Architecture
title: Architecture
summary: "The tokenization pipeline: a Source over the input, a Lexer that applies match and ignore Patterns with maximal-munch selection, the Symbol it emits, and the fluent VocabularyBuilder."
tags: [lexi, concept, architecture, lexer, tokenizer, regex, csharp]
created: 2026-06-24
status: draft
lexi scans an input string into tokens one match at a time. Everything is built around ref struct value types, so scanning allocates almost nothing. The pipeline is direct: the input string becomes a Source, the Lexer tests its patterns at the current offset and picks the best match, and it returns a Symbol wrapped in a MatchResult that carries an advanced Source for the next call.
Source is a readonly ref struct holding the text and the current offset (source: Source.cs). It converts implicitly to and from string, reports Remaining(), and extracts a token's text with ReadSymbol.
Lexer is the scanner (source: Lexer.cs). It holds match patterns and ignore patterns. NextMatch first advances past any ignored span — whitespace, comments — then applies the Dragon-book maximal-munch rule: it runs every match pattern at the offset and picks the best with CompareAndSwap, where the longest match wins and ties break to the lowest pattern index.
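The skip-then-select rule can be sketched without lexi itself. The following is a minimal illustration using only System.Text.RegularExpressions; the class MaximalMunchSketch and its methods SkipIgnored and BestMatch are hypothetical names, not the library's API, and the real CompareAndSwap in Lexer.cs may be implemented differently.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the scanner's selection logic; not lexi's actual code.
public static class MaximalMunchSketch
{
    // Advances past every ignored span (whitespace, comments) starting at `offset`.
    public static int SkipIgnored(string text, int offset, Regex[] ignore)
    {
        bool advanced = true;
        while (advanced)
        {
            advanced = false;
            foreach (Regex r in ignore)
            {
                Match m = r.Match(text, offset); // \G pins the match to `offset`
                if (m.Success && m.Length > 0)
                {
                    offset += m.Length;
                    advanced = true;
                }
            }
        }
        return offset;
    }

    // Returns the index and length of the best pattern at `offset`,
    // or (-1, 0) when no pattern produces a non-empty match there.
    public static (int PatternIndex, int Length) BestMatch(string text, int offset, Regex[] patterns)
    {
        int bestIndex = -1;
        int bestLength = 0;
        for (int i = 0; i < patterns.Length; i++)
        {
            Match m = patterns[i].Match(text, offset);
            // Strict '>' keeps the earlier pattern on a tie, so the lowest index wins.
            if (m.Success && m.Length > bestLength)
            {
                bestIndex = i;
                bestLength = m.Length;
            }
        }
        return (bestIndex, bestLength);
    }
}
```

With a keyword pattern `\Gif` at index 0 and an identifier pattern `\G[a-z]+` at index 1, the input "iffy" goes to the identifier because its match is longer, while a bare "if" goes to the keyword because the lengths tie and the lower index wins.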
- Pattern wraps a Regex anchored with \G at the scan offset, paired with its token id (source: Pattern.cs). It reserves the token ids EndOfSource (1 << 31) and NoMatch (1 << 30).
- Symbol is a readonly ref struct token of Offset, Length, and TokenId, with IsMatch and IsEndOfSource predicates (source: Symbol.cs).
- MatchResult pairs a Symbol with the post-match Source, threading position through successive NextMatch calls (source: MatchResult.cs).
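To illustrate why the \G anchor matters, here is a small sketch, assuming nothing about lexi beyond what is described above. AnchoredPatternSketch and MatchLengthAt are hypothetical names, and declaring the reserved ids as int constants is an assumption; Pattern.cs may use a different type.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical illustration of an offset-anchored pattern; not lexi's Pattern type.
public static class AnchoredPatternSketch
{
    // Values mirror the reserved ids named in the text. The int type is an assumption:
    // as an int, 1 << 31 is int.MinValue, so both ids sit in the two highest bits.
    public const int EndOfSource = 1 << 31;
    public const int NoMatch = 1 << 30;

    // Returns the match length at exactly `offset`, or 0 when the pattern
    // does not match there. Without \G the regex engine would scan forward
    // and report a later match, which a lexer must not accept.
    public static int MatchLengthAt(string pattern, string text, int offset)
    {
        var anchored = new Regex(@"\G(?:" + pattern + ")");
        Match m = anchored.Match(text, offset);
        return m.Success ? m.Length : 0;
    }
}
```

For the input "ab12", the pattern `\d+` yields no match at offset 0 and a match of length 2 at offset 2. Keeping the reserved ids in the high bits leaves small positive integers free for user-defined token ids.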
VocabularyBuilder is the fluent entry point (source: VocabularyBuilder.cs). .Match(...) and .Ignore(...) register patterns from a string, a Regex, or a prebuilt Pattern[], and .Build() produces a configured Lexer. Reusable regexes — identifiers, integer, float, scientific, string and char literals, whitespace, newline — live in CommonPatterns, using source-generated [GeneratedRegex] on net7.0 and later and plain Regex on net6.0 (source: CommonPatterns.cs).
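The split between source-generated and plain regexes can be sketched with conditional compilation. This is an illustration only: CommonPatternsSketch, Identifier, and the identifier pattern itself are assumptions, not the contents of CommonPatterns.cs. The [GeneratedRegex] attribute and the NET7_0_OR_GREATER symbol are standard .NET features.

```csharp
using System.Text.RegularExpressions;

// Hypothetical sketch of a multi-targeted pattern holder; not lexi's CommonPatterns.
public static partial class CommonPatternsSketch
{
    // Illustrative identifier pattern, anchored with \G like the library's patterns.
    private const string IdentifierPattern = @"\G[A-Za-z_][A-Za-z0-9_]*";

#if NET7_0_OR_GREATER
    // On net7.0 and later the regex source generator emits the matcher at build time.
    [GeneratedRegex(IdentifierPattern)]
    private static partial Regex IdentifierRegex();

    public static Regex Identifier => IdentifierRegex();
#else
    // net6.0 has no [GeneratedRegex], so fall back to a regex built at run time.
    public static Regex Identifier { get; } = new Regex(IdentifierPattern, RegexOptions.Compiled);
#endif
}
```

Both branches expose the same Identifier member, so calling code does not need to know which target framework it was compiled for.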