langpkg/lexer



Fast, general-purpose lexer for building language tooling.
Up to ~2.5x faster than moo on equivalent specs.




  • Benchmark


    To run the benchmark, use: bun run bench

    The benchmark code lives in ./bench/index.bench.ts.



  • Quick Start πŸ”₯

    bun add @langpkg/lexer
    # or
    npm install @langpkg/lexer

    import { compile, keywords } from '@langpkg/lexer'
    
    const lexer = compile({
        WS      : /[ \t]+/,
        NL      : { match: /\n/, lineBreaks: true },
        NUM     : /[0-9]+/,
        IDENT   : { match: /[a-zA-Z_][a-zA-Z0-9_]*/, type: keywords({ KW: ['if','else','return'] }) },
        EQ3     : '===',
        ARROW   : '=>',
        EQ      : '=',
        PLUS    : '+',
        SEMI    : ';',
    })
    
    lexer.reset('if x === 1 { return x + 1; }')
    
    let tok
    while ((tok = lexer.next()) !== undefined) {
        console.log(tok.type, JSON.stringify(tok.value))
    }
    // KW    "if"
    // WS    " "
    // IDENT "x"
    // EQ3   "==="
    // NUM   "1"
    // ...


  • Documentation πŸ“‘

    • How it works

      The lexer compiles your spec into a per-character dispatch table:

      1. Each ASCII charCode maps to an ordered list of candidate rules.

      2. next() does one array lookup on charCodeAt(pos).

      3. Candidates are tried with re.test(buf) (sticky regex, no exec, no array allocation).

      4. 94%+ of charCodes have exactly one candidate -- they skip the loop entirely.

      No combined mega-regex. No alternation backtracking.

      Each rule has its own sticky regex anchored at the current position.
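      The idea behind steps 1-4 can be sketched in standalone TypeScript. This is a simplified stand-in, not the library's internals; the names (`rules`, `table`, `next`) and the crude first-character check are illustrative only:

```typescript
// Minimal sketch of a per-character dispatch table with per-rule sticky regexes.
// The 'y' (sticky) flag makes each regex match only at the position we set.
type Rule = { type: string; re: RegExp }

const rules: Rule[] = [
  { type: 'NUM',   re: /[0-9]+/y },
  { type: 'IDENT', re: /[a-zA-Z_][a-zA-Z0-9_]*/y },
  { type: 'WS',    re: /[ \t]+/y },
]

// Build the table: each ASCII charCode maps to an ordered list of candidates.
const table: Rule[][] = Array.from({ length: 128 }, () => [])
for (let c = 0; c < 128; c++) {
  const ch = String.fromCharCode(c)
  for (const r of rules) {
    r.re.lastIndex = 0
    if (r.re.test(ch)) table[c].push(r) // crude first-char check for the sketch
  }
}

function next(buf: string, pos: number): { type: string; value: string } | undefined {
  if (pos >= buf.length) return undefined
  for (const r of table[buf.charCodeAt(pos)]) {
    r.re.lastIndex = pos                // anchor the sticky regex at pos
    if (r.re.test(buf)) {               // test(), not exec(): no array allocation
      return { type: r.type, value: buf.slice(pos, r.re.lastIndex) }
    }
  }
  throw new Error(`no rule matches at position ${pos}`)
}
```

      Because `test()` on a sticky regex advances `lastIndex` past the match, the matched text falls out of the index delta with no capture groups needed.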


    • API

      • compile(spec, options?): Lexer

        Compiles a rule spec into a Lexer. Call once, reuse many times.

        const lexer = compile({
          // string literal -- exact match
          PLUS  : '+',
        
          // multiple literals for one type
          OP    : ['+=', '+'],
        
          // RegExp -- no /g /i /y /m flags, no capture groups
          NUM   : /[0-9]+/,
        
          // full rule object
          NL    :  { match: /\n/, lineBreaks: true },
        
          // value transform -- token.value is the stripped version, token.text is raw
          STR   : { match: /"[^"]*"/, value: s => s.slice(1, -1) },
        
          // error recovery -- returns an error token instead of throwing
          ERR   : { error: true },
        }, { kw_ends_with_token: true })

        Options:

        • kw_ends_with_token (boolean, optional):

          When enabled, keywords matched via keywords() only match if followed by a non-identifier character or EOF.

          This prevents keywords from matching in the middle of identifiers.

          For example, with this option enabled, the text "assert" won't be split into the keyword "as" + identifier "sert".

          Default is false.

        Matching priority:

        1. Longer string literals always beat shorter ones -- '===' beats '=>' beats '=', regardless of declaration order.

        2. RegExp rules sharing the same first character run in declaration order, after all string literals.
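        Rule 1 boils down to sorting literals longest-first before matching. A standalone sketch of that ordering (illustrative, not the library's compiler):

```typescript
// Sort string literals longest-first so '===' is tried before '=>' before '='.
const literals = ['=', '=>', '===']   // declaration order doesn't matter
const ordered = [...literals].sort((a, b) => b.length - a.length)

function matchLiteral(buf: string, pos: number): string | undefined {
  return ordered.find(lit => buf.startsWith(lit, pos))
}
```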


      • keywords(map): TypeTransform

        Remaps matched identifiers to keyword types. Handles the longest-match edge case correctly -- className is never split into class + Name.

        compile({
            IDENT: {
                match   : /[a-zA-Z_][a-zA-Z0-9_]*/,
                type    :  keywords({
                    'kw-if'     : 'if',
                    'kw-else'   : 'else',
                    KW          : ['while', 'for', 'return'],  // multiple keywords, one type
                }),
            },
        })

      • kw_ends_with_token Option

        When enabled via compile(spec, { kw_ends_with_token: true }), keywords only match if followed by a non-identifier character or EOF.

        This is useful for languages where keywords like "as" should not split identifiers like "assert" into "as" + "sert".

        Without the option (default behavior):

        const lexer = compile({
            IDENT: { match: /[a-zA-Z_][a-zA-Z0-9_]*/, type: keywords({ KW: ['as'] }) },
        });
        // Input: "assert"  β†’  Token type: IDENT (full identifier, "as" not matched as keyword)
        // Input: "as"      β†’  Token type: KW

        With the option enabled:

        const lexer = compile(
            {
                IDENT: { match: /[a-zA-Z_][a-zA-Z0-9_]*/, type: keywords({ KW: ['as'] }) },
            },
            { kw_ends_with_token: true }
        );
        // Input: "assert"   β†’  Token type: IDENT (keyword doesn't match, next char is 's')
        // Input: "as"       β†’  Token type: KW (EOF after keyword, match is valid)
        // Input: "as x"     β†’  Token types: KW, WS, IDENT (space after keyword, match is valid)

        Real-world example (Zig language):

        const zigLexer = compile(
            {
                AT: '@',
                IDENT: {
                    match: /[a-zA-Z_][a-zA-Z0-9_]*/,
                    type: keywords({ KW: ['assert', 'as', 'if', 'else'] })
                },
                WS: /[ \t]+/,
            },
            { kw_ends_with_token: true }
        );
        zigLexer.reset('@assert');
        // Tokens: AT("@"), KW("assert") β€” not split into AT + KW("as") + IDENT("sert")
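        The boundary check behind this option can be sketched as follows. This is a hypothetical helper written to match the documented behavior, not the library's actual code:

```typescript
// A keyword candidate is only valid when the character right after it
// cannot continue an identifier (or we are at EOF).
const isIdentChar = (ch: string): boolean => /[a-zA-Z0-9_]/.test(ch)

function keywordMatches(buf: string, pos: number, kw: string): boolean {
  if (!buf.startsWith(kw, pos)) return false
  const after = buf[pos + kw.length]
  return after === undefined || !isIdentChar(after) // EOF or non-identifier char
}
```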

      • lexer.reset(input?, state?): this

        Load new input. Resets position to 0, line/col to 1 (unless state is passed).

        Returns this for chaining: lexer.reset(src).next().

      • lexer.next(): Token | undefined

        Return the next token, or undefined at EOF.

      • lexer.save(): LexerState

        Snapshot { line, col } for later reset().

      • lexer.formatError(token, message?): string

        Return a human-readable error string: "<message> at line N col N".
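        A minimal stand-in for that format string (illustrative, not the library's implementation; the default message is assumed):

```typescript
type TokenLoc = { line: number; col: number }

// Builds the documented "<message> at line N col N" shape.
function formatError(token: TokenLoc, message = 'invalid syntax'): string {
  return `${message} at line ${token.line} col ${token.col}`
}
```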


      • Token fields

        | Field        | Type   | Description                                                     |
        | ------------ | ------ | --------------------------------------------------------------- |
        | `type`       | string | Token type name from the spec                                   |
        | `value`      | string | Matched text, transformed if `value()` was set                  |
        | `text`       | string | Raw matched text, always untransformed                          |
        | `offset`     | number | Character offset from the start of the input                    |
        | `lineBreaks` | number | Newlines in the match (0 unless the rule sets `lineBreaks: true`) |
        | `line`       | number | 1-based line number at match start                              |
        | `col`        | number | 1-based column at match start                                   |
        | `toString()` | method | Returns `value`                                                 |
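        The fields above correspond to a shape like this (a sketch of the documented fields, not the package's exported type):

```typescript
// Hypothetical Token shape matching the documented fields.
interface Token {
  type: string        // token type name from the spec
  value: string       // matched text, after any value() transform
  text: string        // raw matched text, untransformed
  offset: number      // offset from the start of the input
  lineBreaks: number  // newlines inside the match
  line: number        // 1-based line at match start
  col: number         // 1-based column at match start
}

const tok: Token = {
  type: 'NUM', value: '42', text: '42',
  offset: 0, lineBreaks: 0, line: 1, col: 1,
}
```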


  • Credits ❀️

    Inspired by moo -- I kept the same familiar API (compile, keywords, reset, next) while replacing the internals with a per-character dispatch table and per-rule sticky regexes, which eliminates alternation backtracking and gives up to ~2.5x better throughput.

    Built as part of the Mine language compiler toolchain.



  • Dev Notes πŸ“

    I'm currently working on 10+ packages simultaneously, so I sometimes use AI to write parts of the documentation -- if you spot something incorrect that I may have missed, please open an issue and let me know.

    And if you'd like to fix something yourself, feel free to fork the repo and open a pull request -- I'll review it and happily merge it if it looks good.

    Thank you!



