Architecture

Mark Lauter edited this page Jun 24, 2026 · 1 revision

title: Architecture
summary: "The tokenization pipeline: a Source over the input, a Lexer that applies match and ignore Patterns with maximal-munch selection, the Symbol it emits, and the fluent VocabularyBuilder."
tags: [lexi, concept, architecture, lexer, tokenizer, regex, csharp]
created: 2026-06-24
status: draft

lexi scans an input string into tokens one match at a time. Everything is built around ref struct value types, so scanning allocates almost nothing. The pipeline is direct: the input string becomes a Source, the Lexer tests its patterns at the current offset and picks the best match, and it returns a Symbol wrapped in a MatchResult that carries an advanced Source for the next call.

The source

Source is a readonly ref struct holding the text and the current offset (source: Source.cs). It converts implicitly to and from string, reports Remaining(), and extracts a token's text with ReadSymbol.
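To make the shape concrete, here is a minimal sketch of a cursor in the same style. It is not lexi's Source: the type name TextCursor and the members Slice and Advance are hypothetical, the int return type of Remaining() is an assumption, and only the string-to-cursor conversion is shown.

```csharp
using System;

// Hypothetical stand-in for a Source-like cursor; names and signatures are assumptions.
public readonly ref struct TextCursor
{
    public string Text { get; }
    public int Offset { get; }

    public TextCursor(string text, int offset = 0)
    {
        Text = text ?? throw new ArgumentNullException(nameof(text));
        Offset = offset;
    }

    // Number of characters between the current offset and the end of the text.
    public int Remaining() => Text.Length - Offset;

    // A view over part of the text; no new string is allocated until ToString() is called.
    public ReadOnlySpan<char> Slice(int offset, int length) => Text.AsSpan(offset, length);

    // The struct is immutable, so moving forward returns a new cursor.
    public TextCursor Advance(int count) => new TextCursor(Text, Offset + count);

    // Lets a plain string be passed wherever a cursor is expected.
    public static implicit operator TextCursor(string text) => new TextCursor(text);
}
```

Because the struct is immutable and lives on the stack, advancing produces a new value instead of mutating shared state, which is what allows a position to be handed from one call to the next without heap allocation.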

The lexer

Lexer is the scanner (source: Lexer.cs). It holds match patterns and ignore patterns. NextMatch first advances past any ignored span — whitespace, comments — then applies the Dragon-book maximal-munch rule: it runs every match pattern at the offset and picks the best with CompareAndSwap, where the longest match wins and ties break to the lowest pattern index.
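The selection rule can be sketched on its own with System.Text.RegularExpressions. This is an illustration of longest-match-wins with ties going to the lowest index, not a reproduction of CompareAndSwap; the class MaximalMunch and the method Best are hypothetical names, and skipping of ignored spans is left out.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating maximal munch; not part of lexi.
public static class MaximalMunch
{
    // Returns the index and length of the winning pattern at 'offset',
    // or (-1, 0) when no pattern matches a non-empty span there.
    public static (int Index, int Length) Best(string text, int offset, Regex[] patterns)
    {
        int bestIndex = -1;
        int bestLength = 0;
        for (int i = 0; i < patterns.Length; i++)
        {
            // Each pattern is expected to start with \G so it can only match at 'offset'.
            Match m = patterns[i].Match(text, offset);

            // Strictly greater: on a tie the earlier (lower-index) pattern is kept.
            if (m.Success && m.Length > bestLength)
            {
                bestIndex = i;
                bestLength = m.Length;
            }
        }
        return (bestIndex, bestLength);
    }
}
```

With the patterns \Gif, \G[a-z]+ and \G[0-9]+ (in that order) and the input "iffy if 42", offset 0 selects the identifier pattern because "iffy" is longer than "if", while offset 5 selects the keyword pattern because both match two characters and the keyword comes first.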

Patterns and tokens

  • Pattern wraps a Regex anchored with \G at the scan offset, paired with its token id (source: Pattern.cs). It reserves the token ids EndOfSource (1 << 31) and NoMatch (1 << 30).
  • Symbol is the token itself: a readonly ref struct carrying Offset, Length, and TokenId, with IsMatch and IsEndOfSource predicates (source: Symbol.cs).
  • MatchResult pairs a Symbol with the post-match Source, threading position through successive NextMatch calls (source: MatchResult.cs).
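The three pieces above fit together roughly as follows. This is a self-contained sketch, not lexi's code: ScanSketch, Token, NextToken and ScanAll are hypothetical names, the token id is assumed to be an int, a plain record struct stands in for the ref struct, and treating IsMatch as "neither reserved id" is an assumption. The reserved values are the ones quoted above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical illustration of \G anchoring, reserved ids, and position threading.
public static class ScanSketch
{
    public const int EndOfSource = 1 << 31; // as an int this is int.MinValue
    public const int NoMatch = 1 << 30;

    public readonly record struct Token(int Offset, int Length, int TokenId)
    {
        public bool IsEndOfSource => TokenId == EndOfSource;
        public bool IsMatch => TokenId != NoMatch && TokenId != EndOfSource;
    }

    // Returns the token found at 'offset' together with the offset to resume from.
    public static (Token Token, int Next) NextToken(
        string text, int offset, Regex ignore, (Regex Rx, int Id)[] match)
    {
        // \G pins every pattern to the start position passed to Match.
        Match skip = ignore.Match(text, offset);
        if (skip.Success) offset += skip.Length;

        if (offset >= text.Length)
            return (new Token(offset, 0, EndOfSource), offset);

        int bestId = NoMatch, bestLength = 0;
        foreach (var (rx, id) in match)
        {
            Match m = rx.Match(text, offset);
            if (m.Success && m.Length > bestLength) { bestId = id; bestLength = m.Length; }
        }
        return (new Token(offset, bestLength, bestId), offset + bestLength);
    }

    // Threads the returned offset into the next call until a reserved id appears.
    public static List<Token> ScanAll(string text, Regex ignore, (Regex Rx, int Id)[] match)
    {
        var tokens = new List<Token>();
        int offset = 0;
        while (true)
        {
            var (token, next) = NextToken(text, offset, ignore, match);
            tokens.Add(token);
            if (!token.IsMatch) break;
            offset = next;
        }
        return tokens;
    }
}
```

Keeping the reserved ids in the two highest bits leaves the low values free for user-defined token ids, and returning the next position with each token means the caller never mutates shared scanner state.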

Building a vocabulary

VocabularyBuilder is the fluent entry point (source: VocabularyBuilder.cs). .Match(...) and .Ignore(...) register patterns from a string, a Regex, or a prebuilt Pattern[], and .Build() produces a configured Lexer. Reusable regexes — identifiers, integer, float, scientific, string and char literals, whitespace, newline — live in CommonPatterns, using source-generated [GeneratedRegex] on net7.0 and later and plain Regex on net6.0 (source: CommonPatterns.cs).
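The net7.0 versus net6.0 split mentioned above is typically done with conditional compilation. The following is a sketch of that general technique, not the contents of CommonPatterns.cs: the class name SharedPatterns, the member names, and the identifier regex itself are illustrative assumptions.

```csharp
using System.Text.RegularExpressions;

// Hypothetical example of multi-targeting a shared regex; not lexi's CommonPatterns.
public static partial class SharedPatterns
{
#if NET7_0_OR_GREATER
    // On .NET 7 and later the source generator emits the matcher at compile time.
    [GeneratedRegex(@"[A-Za-z_][A-Za-z0-9_]*")]
    private static partial Regex IdentifierRegex();

    public static Regex Identifier => IdentifierRegex();
#else
    // Older targets fall back to a Regex constructed once at startup.
    private static readonly Regex IdentifierField =
        new Regex(@"[A-Za-z_][A-Za-z0-9_]*", RegexOptions.Compiled);

    public static Regex Identifier => IdentifierField;
#endif
}
```

Callers see the same Identifier property on every target framework, so the choice of regex engine stays an implementation detail of the pattern library.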