langpkg/lexer



Fast, general-purpose lexer for building language tooling.
Up to ~2.5x faster than moo on equivalent specs.




  • Benchmark


    To run the benchmark, use: bun run bench

    The benchmark code lives in ./bench/index.bench.ts.



  • Quick Start πŸ”₯

    bun add @langpkg/lexer
    # or
    npm install @langpkg/lexer

    import { compile, keywords } from '@langpkg/lexer'
    
    const lexer = compile({
        WS      : /[ \t]+/,
        NL      : { match: /\n/, lineBreaks: true },
        NUM     : /[0-9]+/,
        IDENT   : { match: /[a-zA-Z_][a-zA-Z0-9_]*/, type: keywords({ KW: ['if','else','return'] }) },
        EQ3     : '===',
        ARROW   : '=>',
        EQ      : '=',
        PLUS    : '+',
        SEMI    : ';',
    })
    
    lexer.reset('if x === 1 { return x + 1; }')
    
    let tok
    while ((tok = lexer.next()) !== undefined) {
        console.log(tok.type, JSON.stringify(tok.value))
    }
    // KW    "if"
    // WS    " "
    // IDENT "x"
    // EQ3   "==="
    // NUM   "1"
    // ...


  • Documentation πŸ“‘

    • How it works

      The lexer compiles your spec into a per-character dispatch table:

      1. Each ASCII charCode maps to an ordered list of candidate rules.

      2. next() does one array lookup on charCodeAt(pos).

      3. Candidates are tried with re.test(buf) (sticky regex, no exec, no array allocation).

      4. 94%+ of charCodes have exactly one candidate -- they skip the loop entirely.

      No combined mega-regex. No alternation backtracking.

      Each rule has its own sticky regex anchored at the current position.
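      The idea behind steps 1-4 can be sketched in standalone TypeScript. This is a simplified stand-in, not the library's internals; the names (`rules`, `table`, `next`) and the crude first-character check are illustrative only:

```typescript
// Minimal sketch of a per-character dispatch table with per-rule sticky regexes.
// The 'y' (sticky) flag makes each regex match only at the position we set.
type Rule = { type: string; re: RegExp }

const rules: Rule[] = [
  { type: 'NUM',   re: /[0-9]+/y },
  { type: 'IDENT', re: /[a-zA-Z_][a-zA-Z0-9_]*/y },
  { type: 'WS',    re: /[ \t]+/y },
]

// Build the table: each ASCII charCode maps to an ordered list of candidates.
const table: Rule[][] = Array.from({ length: 128 }, () => [])
for (let c = 0; c < 128; c++) {
  const ch = String.fromCharCode(c)
  for (const r of rules) {
    r.re.lastIndex = 0
    if (r.re.test(ch)) table[c].push(r) // crude first-char check for the sketch
  }
}

function next(buf: string, pos: number): { type: string; value: string } | undefined {
  if (pos >= buf.length) return undefined
  for (const r of table[buf.charCodeAt(pos)]) {
    r.re.lastIndex = pos                // anchor the sticky regex at pos
    if (r.re.test(buf)) {               // test(), not exec(): no array allocation
      return { type: r.type, value: buf.slice(pos, r.re.lastIndex) }
    }
  }
  throw new Error(`no rule matches at position ${pos}`)
}
```

      Because `test()` on a sticky regex advances `lastIndex` past the match, the matched text falls out of the index delta with no capture groups needed.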


    • API

      • compile(spec, options?): Lexer

        Compiles a rule spec into a Lexer. Call once, reuse many times.

        const lexer = compile({
          // string literal -- exact match
          PLUS  : '+',
        
          // multiple literals for one type
          OP    : ['+=', '+'],
        
          // RegExp -- no /g /i /y /m flags, no capture groups
          NUM   : /[0-9]+/,
        
          // full rule object
          NL    :  { match: /\n/, lineBreaks: true },
        
          // value transform -- token.value is the stripped version, token.text is raw
          STR   : { match: /"[^"]*"/, value: s => s.slice(1, -1) },
        
          // error recovery -- returns an error token instead of throwing
          ERR   : { error: true },
        }, { kw_ends_with_token: true })

        Options:

        • kw_ends_with_token (boolean, optional):

          When enabled, keywords matched via keywords() only match if followed by a non-identifier character or EOF.

          This prevents keywords from matching in the middle of identifiers.

          For example, with this option enabled, the text "assert" won't be split into the keyword "as" + identifier "sert".

          Default is false.

        Matching priority:

        1. Longer string literals always beat shorter ones -- '===' beats '=>' beats '=', regardless of declaration order.

        2. RegExp rules sharing the same first character run in declaration order, after all string literals.
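        Rule 1 boils down to sorting literals longest-first before matching. A standalone sketch of that ordering (illustrative, not the library's compiler):

```typescript
// Sort string literals longest-first so '===' is tried before '=>' before '='.
const literals = ['=', '=>', '===']   // declaration order doesn't matter
const ordered = [...literals].sort((a, b) => b.length - a.length)

function matchLiteral(buf: string, pos: number): string | undefined {
  return ordered.find(lit => buf.startsWith(lit, pos))
}
```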


      • keywords(map): TypeTransform

        Remaps matched identifiers to keyword types. Handles the longest-match edge case correctly -- className is never split into class + Name.

        compile({
            IDENT: {
                match   : /[a-zA-Z_][a-zA-Z0-9_]*/,
                type    :  keywords({
                    'kw-if'     : 'if',
                    'kw-else'   : 'else',
                    KW          : ['while', 'for', 'return'],  // multiple keywords, one type
                }),
            },
        })

      • kw_ends_with_token Option

        When enabled via compile(spec, { kw_ends_with_token: true }), keywords only match if followed by a non-identifier character or EOF.

        This is useful for languages where keywords like "as" should not split identifiers like "assert" into "as" + "sert".

        Without the option (default behavior):

        const lexer = compile({
            IDENT: { match: /[a-zA-Z_][a-zA-Z0-9_]*/, type: keywords({ KW: ['as'] }) },
        });
        // Input: "assert"  β†’  Token type: IDENT (full identifier, "as" not matched as keyword)
        // Input: "as"      β†’  Token type: KW

        With the option enabled:

        const lexer = compile(
            {
                IDENT: { match: /[a-zA-Z_][a-zA-Z0-9_]*/, type: keywords({ KW: ['as'] }) },
            },
            { kw_ends_with_token: true }
        );
        // Input: "assert"   β†’  Token type: IDENT (keyword doesn't match, next char is 's')
        // Input: "as"       β†’  Token type: KW (EOF after keyword, match is valid)
        // Input: "as x"     β†’  Token types: KW, WS, IDENT (space after keyword, match is valid)

        Real-world example (Zig language):

        const zigLexer = compile(
            {
                AT: '@',
                IDENT: {
                    match: /[a-zA-Z_][a-zA-Z0-9_]*/,
                    type: keywords({ KW: ['assert', 'as', 'if', 'else'] })
                },
                WS: /[ \t]+/,
            },
            { kw_ends_with_token: true }
        );
        zigLexer.reset('@assert');
        // Tokens: AT("@"), KW("assert") β€” not split into AT + KW("as") + IDENT("sert")
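        The boundary check behind this option can be sketched as follows. This is a hypothetical helper written to match the documented behavior, not the library's actual code:

```typescript
// A keyword candidate is only valid when the character right after it
// cannot continue an identifier (or we are at EOF).
const isIdentChar = (ch: string): boolean => /[a-zA-Z0-9_]/.test(ch)

function keywordMatches(buf: string, pos: number, kw: string): boolean {
  if (!buf.startsWith(kw, pos)) return false
  const after = buf[pos + kw.length]
  return after === undefined || !isIdentChar(after) // EOF or non-identifier char
}
```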

      • lexer.reset(input?, state?): this

        Load new input. Resets position to 0, line/col to 1 (unless state is passed).

        Returns this for chaining: lexer.reset(src).next().

      • lexer.next(): Token | undefined

        Return the next token, or undefined at EOF.

      • lexer.save(): LexerState

        Snapshot { line, col } for later reset().

      • lexer.formatError(token, message?): string

        Return a human-readable error string: "<message> at line N col N".
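        A minimal stand-in for that format string (illustrative, not the library's implementation; the default message is assumed):

```typescript
type TokenLoc = { line: number; col: number }

// Builds the documented "<message> at line N col N" shape.
function formatError(token: TokenLoc, message = 'invalid syntax'): string {
  return `${message} at line ${token.line} col ${token.col}`
}
```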


      • Token fields

        | Field        | Type   | Description                                                     |
        | ------------ | ------ | --------------------------------------------------------------- |
        | `type`       | string | Token type name from the spec                                   |
        | `value`      | string | Matched text, transformed if `value()` was set                  |
        | `text`       | string | Raw matched text, always untransformed                          |
        | `offset`     | number | Character offset from the start of the input                    |
        | `lineBreaks` | number | Newlines in the match (0 unless the rule sets `lineBreaks: true`) |
        | `line`       | number | 1-based line number at match start                              |
        | `col`        | number | 1-based column at match start                                   |
        | `toString()` | method | Returns `value`                                                 |
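        The fields above correspond to a shape like this (a sketch of the documented fields, not the package's exported type):

```typescript
// Hypothetical Token shape matching the documented fields.
interface Token {
  type: string        // token type name from the spec
  value: string       // matched text, after any value() transform
  text: string        // raw matched text, untransformed
  offset: number      // offset from the start of the input
  lineBreaks: number  // newlines inside the match
  line: number        // 1-based line at match start
  col: number         // 1-based column at match start
}

const tok: Token = {
  type: 'NUM', value: '42', text: '42',
  offset: 0, lineBreaks: 0, line: 1, col: 1,
}
```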


  • Credits ❀️

    Inspired by moo -- I kept the same familiar API (compile, keywords, reset, next) while replacing the internals with a per-character dispatch table and per-rule sticky regexes, which eliminates alternation backtracking and gives up to ~2.5x better throughput.

    Built as part of the Mine language compiler toolchain.



  • Dev Notes πŸ“

    I'm currently working on 10+ packages simultaneously, so I sometimes use AI to write parts of the documentation -- if you spot something incorrect that I may have missed, please open an issue and let me know.

    And if you'd like to fix something yourself, feel free to fork the repo and open a pull request -- I'll review it and happily merge it if it looks good.

    Thank you!



