Fast, general-purpose lexer for building language tooling.
Up to ~2.5x faster than moo on equivalent specs.
## Benchmark

To run the benchmark:

```sh
bun run bench
```

The benchmark code is in `./bench/index.bench.ts`.
## Install

```sh
bun add @langpkg/lexer # or npm install @langpkg/lexer
```

## Quick start

```ts
import { compile, keywords } from '@langpkg/lexer'

const lexer = compile({
  WS     : /[ \t]+/,
  NL     : { match: /\n/, lineBreaks: true },
  NUM    : /[0-9]+/,
  IDENT  : { match: /[a-zA-Z_][a-zA-Z0-9_]*/, type: keywords({ KW: ['if', 'else', 'return'] }) },
  EQ3    : '===',
  ARROW  : '=>',
  EQ     : '=',
  PLUS   : '+',
  SEMI   : ';',
  LBRACE : '{', // added so the braces in the sample input below lex cleanly
  RBRACE : '}',
})

lexer.reset('if x === 1 { return x + 1; }')

let tok
while ((tok = lexer.next()) !== undefined) {
  console.log(tok.type, JSON.stringify(tok.value))
}
// KW "if"
// WS " "
// IDENT "x"
// EQ3 "==="
// NUM "1"
// ...
```
## How it works

The lexer compiles your spec into a per-character dispatch table:

- Each ASCII charCode maps to an ordered list of candidate rules.
- `next()` does one array lookup on `charCodeAt(pos)`.
- Candidates are tried with `re.test(buf)` (sticky regex, no `exec`, no array allocation).
- 94%+ of charCodes have exactly one candidate -- they skip the loop entirely.

No combined mega-regex. No alternation backtracking. Each rule has its own sticky regex anchored at the current position.
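The mechanism above can be sketched in a few lines. This is an illustrative toy, not the library's actual internals: it registers each rule under every charCode it could start with (probed with single characters, which works for these rules), then anchors the rule's sticky regex at the current position.

```typescript
// Toy sketch of a per-character dispatch table with per-rule sticky regexes.
type Rule = { type: string; re: RegExp }

function buildTable(rules: Rule[]): Rule[][] {
  const table: Rule[][] = Array.from({ length: 128 }, () => [])
  for (const rule of rules) {
    for (let c = 0; c < 128; c++) {
      rule.re.lastIndex = 0 // reset before each probe
      if (rule.re.test(String.fromCharCode(c))) table[c].push(rule)
    }
  }
  return table
}

function nextToken(buf: string, pos: number, table: Rule[][]) {
  // One array lookup on charCodeAt(pos); NaN at EOF falls through to [].
  const candidates = table[buf.charCodeAt(pos)] ?? []
  for (const rule of candidates) {
    rule.re.lastIndex = pos // sticky regex: anchor the match at pos
    if (rule.re.test(buf)) {
      // After a sticky test(), lastIndex points just past the match.
      return { type: rule.type, value: buf.slice(pos, rule.re.lastIndex) }
    }
  }
  return undefined
}

const table = buildTable([
  { type: 'NUM',   re: /[0-9]+/y },
  { type: 'IDENT', re: /[a-z]+/y },
  { type: 'WS',    re: /[ ]+/y },
])

const src = 'abc 42'
const out: string[] = []
let pos = 0
let tok
while ((tok = nextToken(src, pos, table)) !== undefined) {
  out.push(`${tok.type}:${JSON.stringify(tok.value)}`)
  pos += tok.value.length
}
console.log(out.join(' ')) // IDENT:"abc" WS:" " NUM:"42"
```

The key trick is that a sticky (`/y`) regex respects `lastIndex`, so `re.test(buf)` matches only at the current position without slicing the input or allocating a match array.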
## API

### `compile(spec, options?)`

Compiles a rule spec into a `Lexer`. Call once, reuse many times.

```ts
const lexer = compile({
  // string literal -- exact match
  PLUS : '+',

  // multiple literals for one type
  OP : ['+=', '+'],

  // RegExp -- no /g /i /y /m flags, no capture groups
  NUM : /[0-9]+/,

  // full rule object
  NL : { match: /\n/, lineBreaks: true },

  // value transform -- token.value is the stripped version, token.text is raw
  STR : { match: /"[^"]*"/, value: s => s.slice(1, -1) },

  // error recovery -- returns an error token instead of throwing
  ERR : { error: true },
}, { kw_ends_with_token: true })
```
Options:

- `kw_ends_with_token` (boolean, optional): when enabled, keywords matched via `keywords()` only match if followed by a non-identifier character or EOF. This prevents keywords from matching in the middle of identifiers. For example, with this option enabled, the text `"assert"` won't be split into the keyword `"as"` + identifier `"sert"`. Default is `false`.
Matching priority:

- Longer string literals always beat shorter ones -- `'==='` beats `'=>'` beats `'='`, regardless of declaration order.
- RegExp rules sharing the same first character run in declaration order, after all string literals.
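The longest-literal-wins rule can be illustrated with a standalone sketch (an assumption about the general technique, not the library's code): literals are sorted by length once, so declaration order never matters for them.

```typescript
// Sketch: string literals sorted longest-first at "compile" time,
// so '===' always wins over '=>' and '=' at the same position.
const literals = [
  { type: 'EQ',    text: '=' },   // declared first, but shortest
  { type: 'ARROW', text: '=>' },
  { type: 'EQ3',   text: '===' },
]

const byLength = [...literals].sort((a, b) => b.text.length - a.text.length)

function matchLiteral(buf: string, pos: number): string | undefined {
  for (const lit of byLength) {
    if (buf.startsWith(lit.text, pos)) return lit.type
  }
  return undefined
}

console.log(matchLiteral('=== x', 0)) // EQ3 -- not EQ, despite EQ being declared first
console.log(matchLiteral('=> x', 0))  // ARROW
```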
### `keywords(map)`

Remaps matched identifiers to keyword types. Handles the longest-match edge case correctly -- `className` is never split into `class` + `Name`.

```ts
compile({
  IDENT: {
    match : /[a-zA-Z_][a-zA-Z0-9_]*/,
    type  : keywords({
      'kw-if'   : 'if',
      'kw-else' : 'else',
      KW        : ['while', 'for', 'return'], // multiple keywords, one type
    }),
  },
})
```
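The reason `className` is never split is the standard keywords-as-identifiers technique, sketched below under that assumption (not the library's actual code): the identifier rule greedily matches the whole lexeme first, and only then is the full lexeme looked up in a keyword map.

```typescript
// Sketch: match the full identifier, then remap the whole lexeme.
const keywordMap = new Map<string, string>([
  ['if', 'kw-if'],
  ['else', 'kw-else'],
  ['while', 'KW'], ['for', 'KW'], ['return', 'KW'],
])

function classify(identLexeme: string): string {
  // The lookup sees the complete match, so 'className' can never
  // be reinterpreted as the keyword 'class' plus leftovers.
  return keywordMap.get(identLexeme) ?? 'IDENT'
}

console.log(classify('while'))     // KW
console.log(classify('className')) // IDENT
```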
### `kw_ends_with_token`

When enabled via `compile(spec, { kw_ends_with_token: true })`, keywords only match if followed by a non-identifier character or EOF. This is useful for languages where keywords like `"as"` should not split identifiers like `"assert"` into `"as"` + `"sert"`.

Without the option (default behavior):

```ts
const lexer = compile({
  IDENT: { match: /[a-zA-Z_][a-zA-Z0-9_]*/, type: keywords({ KW: ['as'] }) },
});
// Input: "assert" → Token type: IDENT (full identifier, "as" not matched as keyword)
// Input: "as"     → Token type: KW
```

With the option enabled:

```ts
const lexer = compile(
  {
    IDENT: { match: /[a-zA-Z_][a-zA-Z0-9_]*/, type: keywords({ KW: ['as'] }) },
    WS: /[ \t]+/, // added so the "as x" input below lexes cleanly
  },
  { kw_ends_with_token: true }
);
// Input: "assert" → Token type: IDENT (keyword doesn't match, next char is 's')
// Input: "as"     → Token type: KW (EOF after keyword, match is valid)
// Input: "as x"   → Token types: KW, WS, IDENT (space after keyword, match is valid)
```

Real-world example (Zig language):

```ts
const zigLexer = compile(
  {
    AT    : '@',
    IDENT : { match: /[a-zA-Z_][a-zA-Z0-9_]*/, type: keywords({ KW: ['assert', 'as', 'if', 'else'] }) },
    WS    : /[ \t]+/,
  },
  { kw_ends_with_token: true }
);

zigLexer.reset('@assert');
// Tokens: AT("@"), KW("assert") → not split into AT + KW("as") + IDENT("sert")
```
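The boundary check itself is simple, sketched here as an assumed technique rather than the library's implementation: after a keyword candidate matches, peek at the next character and reject the keyword if that character could continue an identifier.

```typescript
// Sketch of the kw_ends_with_token boundary check.
const isIdentChar = (ch: string | undefined): boolean =>
  ch !== undefined && /[a-zA-Z0-9_]/.test(ch)

function keywordAt(buf: string, pos: number, kw: string): boolean {
  if (!buf.startsWith(kw, pos)) return false
  // Valid only if followed by a non-identifier character or EOF.
  return !isIdentChar(buf[pos + kw.length])
}

console.log(keywordAt('assert', 0, 'as')) // false -- 's' continues the identifier
console.log(keywordAt('as x', 0, 'as'))   // true  -- space follows
console.log(keywordAt('as', 0, 'as'))     // true  -- EOF follows
```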
## Lexer methods

- `reset(input, state?)` -- load new input. Resets position to 0 and line/col to 1 (unless `state` is passed). Returns `this` for chaining: `lexer.reset(src).next()`.
- `next()` -- return the next token, or `undefined` at EOF.
- `save()` -- snapshot `{ line, col }` for a later `reset()`.
- `formatError()` -- return a human-readable error string: `"<message> at line N col N"`.
## Token fields

| Field        | Type     | Description |
|--------------|----------|-------------|
| `type`       | `string` | Token type name from the spec |
| `value`      | `string` | Matched text, transformed if `value()` was set |
| `text`       | `string` | Raw matched text, always untransformed |
| `offset`     | `number` | Byte offset from start of input |
| `lineBreaks` | `number` | Newlines in match (0 unless rule sets `lineBreaks: true`) |
| `line`       | `number` | 1-based line number at match start |
| `col`        | `number` | 1-based column at match start |
| `toString()` |          | Returns `value` |
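The table above can be written out as a TypeScript shape. This interface is a sketch derived from the table, not necessarily a type the package exports:

```typescript
// Token shape as described by the fields table (illustrative only).
interface Token {
  type: string        // token type name from the spec
  value: string       // matched text, transformed if value() was set
  text: string        // raw matched text, always untransformed
  offset: number      // offset from start of input
  lineBreaks: number  // newlines in the match
  line: number        // 1-based line at match start
  col: number         // 1-based column at match start
  toString(): string  // returns value
}

const sample: Token = {
  type: 'NUM', value: '42', text: '42', offset: 0,
  lineBreaks: 0, line: 1, col: 1,
  toString() { return this.value },
}

console.log(String(sample)) // "42" -- toString() forwards to value
```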
## Credits

Inspired by moo -- I kept the same familiar API (`compile`, `keywords`, `reset`, `next`) while replacing the internals with a per-character dispatch table and per-rule sticky regexes, which eliminates alternation backtracking and gives noticeably better throughput (up to ~2.5x on the included benchmarks).

Built as part of the Mine language compiler toolchain.
-
I'm currently working on +10 packages simultaneously, so sometimes i use the AI to write some parts of the documentation -- if you spot something incorrect that I may have missed, please open an issue and let me know.
And if you'd like to fix something yourself, feel free to fork the repo and open a pull request -- I'll review it and happily merge it if it looks good.
Thank you!


