Emit a standalone parser, equal to the interpreter — + integer token kinds + sticky/dispatch lexer (~5.0x the official TS parser) by johnsoncodehk · Pull Request #4 · johnsoncodehk/monogram

johnsoncodehk · 2026-06-05T14:51:14Z

What

createParser(grammar) interprets the RuleExpr trees at runtime — it's the one derived output that isn't emitted as code (the TextMate / Monarch / tree-sitter highlighters all are). This PR adds emitParser(grammar) (src/emit-parser.ts): a self-contained JS parser that bakes the grammar's static analysis into constants and specializes each rule into straight-line code, removing the central matchExpr tree-walk. It reuses the lexer runtime — so the parser is what's specialized — and never imports the grammar-definition object. Then two perf levers, both proven byte-identical against createParser as the oracle:

Stage A — the emitter. Per-rule + per-arm specialized matchers; the Pratt / left-rec / longest-match / mixfix / memo cores stay as a shared runtime the emitted code calls.
Integer token kinds (emit-parser.ts). String-keyed dispatch (tok.type === key, Set.has, startsWith) → integer compares, via a two-int interning pass after tokenize (type-kind + literal-kind; two fields because a keyword is an identifier-with-text here, so it must match both its keyword key and the identifier token-name key).
Sticky-regex + first-char dispatch lexer (src/gen-lexer.ts, shared, so it also speeds up the interpreter and every language). tokenize no longer tries every matcher per position against a fresh source.slice(pos): matchers are sticky regexes exec'd in place, and a conservative first-char analysis skips matchers whose possible first chars can't match the current char (so `(` and `;` skip the word/number/string regexes).
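To make the Stage A idea concrete, here is a minimal sketch of the difference between interpreting an expression tree and emitting straight-line code for it. The `Expr` mini-IR, `emitExpr`, and `emitMatcher` are hypothetical names invented for this illustration; the real RuleExpr shape and the output of emitParser are not shown in this PR, and the Pratt / left-rec / memo cores are left out entirely.

```typescript
// Hypothetical mini-IR for illustration only (not monogram's RuleExpr).
type Expr =
  | { kind: "lit"; text: string }
  | { kind: "seq"; items: Expr[] }
  | { kind: "alt"; arms: Expr[] };

// A matcher returns the new token position, or -1 on failure.
type EmittedMatcher = (toks: string[], pos: number) => number;

// Interpreter: re-walks the expression tree on every call.
function matchExpr(e: Expr, toks: string[], pos: number): number {
  if (e.kind === "lit") return toks[pos] === e.text ? pos + 1 : -1;
  if (e.kind === "seq") {
    let p = pos;
    for (const item of e.items) {
      p = matchExpr(item, toks, p);
      if (p < 0) return -1;
    }
    return p;
  }
  for (const arm of e.arms) {
    const p = matchExpr(arm, toks, pos);
    if (p >= 0) return p;
  }
  return -1;
}

// Emitter: walks the tree once and appends straight-line statements to `out`.
// Returns the name of the variable that holds the resulting position.
function emitExpr(e: Expr, inVar: string, out: string[]): string {
  if (e.kind === "seq") {
    let cur = inVar;
    for (const item of e.items) cur = emitExpr(item, cur, out);
    return cur;
  }
  if (e.kind === "lit") {
    const v = `p${out.length}`;
    const text = JSON.stringify(e.text);
    out.push(`const ${v} = ${inVar} >= 0 && toks[${inVar}] === ${text} ? ${inVar} + 1 : -1;`);
    return v;
  }
  // Ordered choice: first arm that succeeded wins (arms have no side effects).
  const arms = e.arms.map(arm => emitExpr(arm, inVar, out));
  const v = `p${out.length}`;
  out.push(`const ${v} = ${arms.map(a => `${a} >= 0 ? ${a} : `).join("")}-1;`);
  return v;
}

function emitMatcher(e: Expr): string {
  const out: string[] = [];
  const result = emitExpr(e, "pos", out);
  return out.join("\n") + `\nreturn ${result};`;
}

const grammar: Expr = {
  kind: "seq",
  items: [
    { kind: "lit", text: "let" },
    { kind: "alt", arms: [{ kind: "lit", text: "x" }, { kind: "lit", text: "y" }] },
    { kind: "lit", text: ";" },
  ],
};

// The emitted body contains no reference to the grammar object or the tree-walk.
const body = emitMatcher(grammar);
const emitted = new Function("toks", "pos", body) as EmittedMatcher;
```

The interpreter stays useful as the oracle: both functions can be run over the same inputs and compared, which is the role createParser plays for the emitted parser here.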

Measured (TS grammar)

Correctness: byte-identical CST + accept/reject vs createParser across the full tests/cases corpus — 18,805 / 18,805 at every step. The shared lexer change is byte-identical too: identical token stream across TS+JS+YAML corpora (0 differing tokens of ~54k) and conformance unchanged (run-conformance 5386/5659).
Emitted parser, four bench files (best-of-5):

                          emitted    vs official tsc
Stage A                   50.9 ms    6.3×
+ integer kinds           42.6 ms    5.6×
+ sticky/dispatch lexer   ~38 ms     ~5.0×

Against the interpreter, the emitted parser is now ~1.9× faster. The lexer lever (~1.25× on tokenize) also speeds up the interpreter, so both improve. tsc ≈ 7.6 ms.
Context: the gap to the official parser is implementation, not a V8 ceiling — tsc is itself JS. The decomposition is even (lexer ≈ parser layer); the remaining lexer cost is regex-per-token execution vs tsc's hand-tuned char scanner.

Scope

src/emit-parser.ts (new) + src/gen-lexer.ts (shared lexer; byte-identical token stream, verified across languages) + test/emit-parser-verify.ts (correctness gate) + test/emit-parser-bench.ts (benchmark). createParser (the oracle), the grammars, and the generated artifacts are untouched.
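A correctness gate of this kind can be sketched as a differential test: run the oracle and the candidate over the same corpus and require identical serialized results, including identical rejections. This is only an illustration of the idea, not the contents of test/emit-parser-verify.ts; `ParseFn`, `verify`, and the two toy parsers are hypothetical.

```typescript
// Hypothetical parser signature: either a CST or a rejection.
type ParseResult = { ok: true; cst: unknown } | { ok: false };
type ParseFn = (source: string) => ParseResult;

interface VerifyReport { passed: number; failures: string[] }

// Compare serialized results so that both the tree and accept/reject must agree.
// Assumes both parsers build objects with the same key order.
function verify(oracle: ParseFn, candidate: ParseFn, corpus: string[]): VerifyReport {
  const failures: string[] = [];
  for (const src of corpus) {
    if (JSON.stringify(oracle(src)) !== JSON.stringify(candidate(src))) failures.push(src);
  }
  return { passed: corpus.length - failures.length, failures };
}

// Toy oracle: accepts non-empty digit strings.
const oracleParse: ParseFn = src =>
  /^\d+$/.test(src) ? { ok: true, cst: { num: src } } : { ok: false };

// Toy candidate: same language, implemented as a hand-written loop.
const loopParse: ParseFn = src => {
  if (src.length === 0) return { ok: false };
  for (const ch of src) if (ch < "0" || ch > "9") return { ok: false };
  return { ok: true, cst: { num: src } };
};

// Deliberately wrong candidate: forgets to reject the empty string.
const buggyParse: ParseFn = src => {
  for (const ch of src) if (ch < "0" || ch > "9") return { ok: false };
  return { ok: true, cst: { num: src } };
};

const corpus = ["", "0", "42", "4a", "007", " 1"];
const good = verify(oracleParse, loopParse, corpus);
const bad = verify(oracleParse, buggyParse, corpus);
```

A CI job would fail whenever `failures` is non-empty, which is what keeps the two implementations from drifting apart.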

Follow-ups (tracked as issues)

#5: Emit the lexer: author tokens as a target-neutral IR (combinators) instead of regex strings, deriving a char-code DFA scanner. The next ceiling (the lexer is still the reused regex runtime) and what makes the parser self-contained for Go/Rust.
#6: Target-agnostic emitter: emitParser(grammar, target) with JS-specific emission behind a Target config (Go / Rust / native).
#7: CI gate: run emit-parser-verify so emitted == createParser can't silently rot.
#8: Reduce parser-layer backtracking / longest-match (the remaining parser-side gap to the official parser).

Notes

The emitted parser (~290 KB for TS) is written to /tmp by the scripts and is not committed. Subtleties handled: parseRule sets the Pratt context + packrat memo for both Pratt and left-rec rules (left-rec rules go through the same wrapper, else ${…} in a template-literal type resolves to Expr not Type); the int-kinds vocabulary must be a superset of every dispatch key, including operators that appear only inside a not (reserved-word) lookahead; the first-char dispatch is conservative (proven first-char superset, verified by a 5k-input fuzz + full token byte-identity).
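The "conservative superset" property of the first-char dispatch can be checked mechanically: whenever a matcher's regex matches at the start of an input, its first-char predicate must have allowed that character. Below is a minimal fuzz sketch of that check. The matcher table, the seeded generator, and `findViolations` are assumptions made for this illustration, not the PR's actual fuzz harness.

```typescript
interface FirstCharMatcher {
  name: string;
  re: RegExp;                     // sticky regex
  first: (ch: string) => boolean; // claimed superset of possible first chars
}

const checked: FirstCharMatcher[] = [
  { name: "ident", re: /[A-Za-z_$][\w$]*/y, first: ch => /[A-Za-z_$]/.test(ch) },
  { name: "number", re: /\d+/y, first: ch => ch >= "0" && ch <= "9" },
  { name: "punct", re: /[();=]/y, first: ch => "();=".includes(ch) },
];

// Small seeded LCG so the fuzz run is reproducible.
function makeRandom(seed: number): () => number {
  let s = seed >>> 0;
  return () => {
    s = (Math.imul(s, 1664525) + 1013904223) >>> 0;
    return s / 4294967296;
  };
}

// Returns a description of every input where the regex matched at position 0
// but the first-char predicate would have skipped the matcher.
function findViolations(ms: FirstCharMatcher[], inputs: number, alphabet: string): string[] {
  const rnd = makeRandom(1);
  const violations: string[] = [];
  for (let i = 0; i < inputs; i++) {
    const len = 1 + Math.floor(rnd() * 8);
    let s = "";
    for (let j = 0; j < len; j++) s += alphabet[Math.floor(rnd() * alphabet.length)];
    for (const m of ms) {
      m.re.lastIndex = 0;
      if (m.re.exec(s) !== null && !m.first(s[0])) {
        violations.push(`${m.name}: ${JSON.stringify(s)}`);
      }
    }
  }
  return violations;
}

// A predicate that is too narrow: the regex accepts any lowercase letter first.
const tooNarrow: FirstCharMatcher = { name: "bad", re: /[a-z]+/y, first: ch => ch === "a" };

const okRun = findViolations(checked, 5000, "ab_$019();= ");
const badRun = findViolations([tooNarrow], 5000, "ab");
```

A predicate that is too wide only costs a wasted regex run, while one that is too narrow silently drops tokens, so the check is deliberately one-directional.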

…preter, ~1.5x faster

`createParser` interprets the RuleExpr trees at runtime. This adds `emitParser(grammar)`, which emits self-contained JS: it bakes the analysis (precedence, FIRST sets, NUD/LED classification) into constants and specializes each rule into straight-line code, removing the matchExpr tree-walk. It reuses the existing lexer runtime (fed baked token/config data), so the parser, not the lexer, is what's specialized, and it never imports the grammar definition object.

Measured on the TS grammar: byte-identical CST + accept/reject vs createParser across the full tests/cases corpus (18,805/18,805), and ~1.53x end-to-end on the four bench files. The interpretive dispatch is gone from the profile; the remaining floor is the shared lexer (~33%) and FIRST-set membership checks (~22%).

Additive only. createParser (the oracle), the grammars, and the generated artifacts are untouched:
- src/emit-parser.ts: emitParser(grammar): string
- test/emit-parser-verify.ts: correctness gate, emitted CST must equal createParser's
- test/emit-parser-bench.ts: interpreter-vs-emitted benchmark

…Stage A)

Replaces the emitted parser's string-keyed dispatch with integer compares: a two-int interning pass after tokenize tags each token with a type-kind and a literal-kind (keyword/punct), and the FIRST-set guards, canStart, matchLiteral and matchToken are emitted as integer checks against pre-classified keys. Two fields are needed because a keyword is an identifier-with-text in this grammar, so a keyword token must match both its keyword key and the identifier token-name key.

Emitted-only change (createParser, the lexer, the grammars, and the generated artifacts are untouched), byte-identical to createParser across the full tests/cases corpus (18,805/18,805). String dispatch drops from ~45% of the parser layer to ~22%; the emitted parser goes 49->43 ms aggregate on the four bench files (~1.15x), and interpreter-vs-emitted rises 1.53x->1.84x. The win is capped by the shared lexer (tokenize, ~40% of total, out of scope here).
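The two-field interning described above can be sketched as follows. This is an illustration under assumptions: the `tok:` / `lit:` key prefixes, the `Token` shape, and `internTokens` are invented here, and only the names matchToken and matchLiteral come from the commit message (their real signatures are not shown).

```typescript
interface Token {
  type: string;      // token name from the lexer, e.g. "Identifier"
  text: string;
  typeKind: number;  // interned token-name key, 0 = not in the vocabulary
  litKind: number;   // interned literal key (keyword/punct), 0 = none
}

// Vocabulary: every key the parser can dispatch on gets a small integer.
// The "tok:" / "lit:" prefixes are an assumption of this sketch.
const vocab = new Map<string, number>();
function kindOf(key: string): number {
  let k = vocab.get(key);
  if (k === undefined) {
    k = vocab.size + 1;
    vocab.set(key, k);
  }
  return k;
}

// In an emitted parser these would be baked in as constants.
const K_IDENT = kindOf("tok:Identifier");
const K_IF = kindOf("lit:if");
const K_LPAREN = kindOf("lit:(");

// One pass after tokenize: two map lookups per token, then integers only.
function internTokens(raw: { type: string; text: string }[]): Token[] {
  return raw.map(t => ({
    ...t,
    typeKind: vocab.get("tok:" + t.type) ?? 0,
    litKind: vocab.get("lit:" + t.text) ?? 0,
  }));
}

const matchToken = (t: Token, kind: number): boolean => t.typeKind === kind;
const matchLiteral = (t: Token, kind: number): boolean => t.litKind === kind;

const toks = internTokens([
  { type: "Identifier", text: "if" },  // keyword lexed as an identifier with text
  { type: "Punct", text: "(" },
  { type: "Identifier", text: "foo" },
]);
```

With a single kind field, the `if` token would have to choose between being "the keyword if" and "an Identifier"; carrying both integers lets either check succeed without falling back to string compares.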

…-identical)

tokenize tried every token matcher per position against `source.slice(pos)` (a fresh rest-of-file copy). Now the matchers are sticky regexes exec'd in place at lastIndex (no slice), and a conservative first-char analysis skips matchers whose possible first chars can't match the current char, so a `(` or `;` no longer runs the identifier/number/string/comment regexes. The first-char dispatch is the real win (most positions are punctuation that fell through all 16 matchers); the slice removal is minor (V8's flat-string slice is cheap).

Shared lexer change → speeds up the interpreter, the emitted parser, and every language (TS/JS/HTML/YAML/Vue). Byte-identical token stream (verified across TS+JS+YAML corpora, 0 differing tokens; full conformance unchanged at 5386/5659). Lexer ~21->17 ms on the bench files (~1.25x); the emitted parser's gap to the official TS parser narrows 5.6x->5.0x.
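The lexer change can be sketched with a toy tokenizer that has both levers: sticky regexes run in place via lastIndex, and a first-char predicate that decides whether a matcher is worth running at all. The four matchers and the `dispatch` switch below are assumptions for illustration; the real src/gen-lexer.ts derives its first-char sets from the grammar's token definitions.

```typescript
interface TokenMatcher {
  name: string;
  re: RegExp;                     // must carry the sticky ("y") flag
  first: (ch: string) => boolean; // conservative: true whenever re could match at ch
}

const tokenMatchers: TokenMatcher[] = [
  { name: "ws", re: /\s+/y, first: ch => /\s/.test(ch) },
  { name: "ident", re: /[A-Za-z_$][\w$]*/y, first: ch => /[A-Za-z_$]/.test(ch) },
  { name: "number", re: /\d+/y, first: ch => ch >= "0" && ch <= "9" },
  { name: "punct", re: /[();=+]/y, first: ch => "();=+".includes(ch) },
];

interface LexResult {
  tokens: { name: string; text: string }[];
  regexRuns: number; // how many times a regex was actually executed
}

function tokenize(source: string, dispatch: boolean): LexResult {
  const tokens: { name: string; text: string }[] = [];
  let regexRuns = 0;
  let pos = 0;
  while (pos < source.length) {
    const ch = source[pos];
    let matched = false;
    for (const m of tokenMatchers) {
      if (dispatch && !m.first(ch)) continue; // skip without running the regex
      m.re.lastIndex = pos;                   // match in place, no source.slice(pos)
      regexRuns++;
      const r = m.re.exec(source);
      if (r !== null) {
        tokens.push({ name: m.name, text: r[0] });
        pos += r[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error(`unexpected character at ${pos}`);
  }
  return { tokens, regexRuns };
}

const src = "let x = (42);";
const plain = tokenize(src, false); // every matcher tried in order
const fast = tokenize(src, true);   // matchers filtered by first char
```

Both calls produce the same token stream; only `regexRuns` differs, because punctuation no longer falls through the whitespace, identifier, and number regexes before reaching its own matcher.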

johnsoncodehk added 2 commits June 5, 2026 22:50

johnsoncodehk changed the title ~~Emit a standalone parser (Stage A): equal to the interpreter, ~1.5x faster~~ Emit a standalone parser: 100% equal to the interpreter, ~1.8x faster (Stage A + integer token kinds) Jun 5, 2026

johnsoncodehk changed the title ~~Emit a standalone parser: 100% equal to the interpreter, ~1.8x faster (Stage A + integer token kinds)~~ Emit a standalone parser, equal to the interpreter — + integer token kinds + sticky/dispatch lexer (~5.0x the official TS parser) Jun 5, 2026

johnsoncodehk merged commit cca4c6a into master Jun 5, 2026
2 checks passed

johnsoncodehk deleted the emit-parser-stage-a branch June 5, 2026 18:00

johnsoncodehk merged 3 commits into master from emit-parser-stage-a
