
Emit a standalone parser, equal to the interpreter — + integer token kinds + sticky/dispatch lexer (~5.0x the official TS parser)#4

Merged: johnsoncodehk merged 3 commits into master from emit-parser-stage-a on Jun 5, 2026


@johnsoncodehk johnsoncodehk commented Jun 5, 2026

What

createParser(grammar) interprets the RuleExpr trees at runtime; it is the one derived output that isn't emitted as code (the TextMate / Monarch / tree-sitter highlighters all are). This PR adds emitParser(grammar) (src/emit-parser.ts), which emits a self-contained JS parser that bakes the grammar's static analysis into constants and specializes each rule into straight-line code, removing the central matchExpr tree-walk. It reuses the lexer runtime, so the parser is what's specialized, and it never imports the grammar-definition object. The emitter is followed by two perf levers; every step is proven byte-identical against createParser as the oracle:

  1. Stage A — the emitter. Per-rule + per-arm specialized matchers; the Pratt / left-rec / longest-match / mixfix / memo cores stay as a shared runtime the emitted code calls.
  2. Integer token kinds (emit-parser.ts). String-keyed dispatch (tok.type === key, Set.has, startsWith) → integer compares, via a two-int interning pass after tokenize (type-kind + literal-kind; two fields because a keyword is an identifier-with-text here, so it must match both its keyword key and the identifier token-name key).
  3. Sticky-regex + first-char dispatch lexer (src/gen-lexer.ts, shared, so it also speeds up the interpreter and every language). tokenize no longer tries every matcher per position against a fresh source.slice(pos): matchers are sticky regexes exec'd in place, and a conservative first-char analysis skips matchers whose possible first chars can't match the current char (so `(` and `;` skip the word/number/string regexes).

Measured (TS grammar)

  • Correctness: byte-identical CST + accept/reject vs createParser across the full tests/cases corpus — 18,805 / 18,805 at every step. The shared lexer change is byte-identical too: identical token stream across TS+JS+YAML corpora (0 differing tokens of ~54k) and conformance unchanged (run-conformance 5386/5659).

  • Emitted parser, four bench files (best-of-5):

    | step | emitted | vs official tsc |
    | --- | --- | --- |
    | Stage A | 50.9 ms | 6.3× |
    | + integer kinds | 42.6 ms | 5.6× |
    | + sticky/dispatch lexer | ~38 ms | ~5.0× |

    Against the interpreter, the emitted parser is now ~1.9× faster; the lexer lever (~1.25× on tokenize) also speeds up the interpreter, so both improve. The official tsc parser takes ≈ 7.6 ms.

  • Context: the gap to the official parser is implementation, not a V8 ceiling — tsc is itself JS. The decomposition is even (lexer ≈ parser layer); the remaining lexer cost is regex-per-token execution vs tsc's hand-tuned char scanner.
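The correctness gate described in the first bullet is a form of differential testing against an oracle. The following is only a minimal sketch of that idea, not the contents of test/emit-parser-verify.ts: the `ParseResult` shape, `verifyAgainstOracle`, and both toy parsers are hypothetical names invented for illustration.

```typescript
// Hypothetical result shape; the project's real CST type is not shown here.
type ParseResult = { ok: true; cst: unknown } | { ok: false };
type Parser = (source: string) => ParseResult;

// Runs both parsers over every input and compares the serialized results,
// so the accept/reject decision and the full tree must both agree.
function verifyAgainstOracle(oracle: Parser, candidate: Parser, corpus: string[]) {
  const failures: string[] = [];
  for (const source of corpus) {
    if (JSON.stringify(oracle(source)) !== JSON.stringify(candidate(source))) {
      failures.push(source);
    }
  }
  return { total: corpus.length, passed: corpus.length - failures.length, failures };
}

// Toy oracle: a comma-separated list of single digits, checked with a regex.
const oracle: Parser = (src) =>
  /^\d(,\d)*$/.test(src) ? { ok: true, cst: src.split(",") } : { ok: false };

// Toy candidate: the same language, recognized by a hand-written scan.
const candidate: Parser = (src) => {
  const items: string[] = [];
  for (let i = 0; i < src.length; i += 2) {
    const d = src[i];
    if (d < "0" || d > "9") return { ok: false };
    // A digit may only be followed by a comma that is itself followed by more input.
    if (i + 1 < src.length && (src[i + 1] !== "," || i + 2 >= src.length)) {
      return { ok: false };
    }
    items.push(d);
  }
  return items.length > 0 ? { ok: true, cst: items } : { ok: false };
};

const report = verifyAgainstOracle(oracle, candidate, ["7", "1,2", "1,", "", "12", "a"]);
console.log(`${report.passed} / ${report.total}`, report.failures);
```

Comparing serialized output keeps the check strict: any difference in node order, field names, or accept/reject status shows up as a failing input rather than being tolerated.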

Scope

src/emit-parser.ts (new) + src/gen-lexer.ts (shared lexer; byte-identical token stream, verified across languages) + test/emit-parser-verify.ts (correctness gate) + test/emit-parser-bench.ts (benchmark). createParser (the oracle), the grammars, and the generated artifacts are untouched.

Follow-ups (tracked as issues)

Notes

The emitted parser (~290 KB for TS) is written to /tmp by the scripts and is not committed. Subtleties handled: parseRule sets the Pratt context + packrat memo for both Pratt and left-rec rules (left-rec rules go through the same wrapper, else ${…} in a template-literal type resolves to Expr not Type); the int-kinds vocabulary must be a superset of every dispatch key, including operators that appear only inside a not (reserved-word) lookahead; the first-char dispatch is conservative (proven first-char superset, verified by a 5k-input fuzz + full token byte-identity).
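The superset property mentioned for the first-char dispatch can be checked mechanically. The sketch below is one hypothetical way to fuzz it, not the project's actual fuzz test: `findFirstCharCounterexample`, the toy number regex, and the alphabet are assumptions made for illustration. The property is that whenever a sticky matcher succeeds at position 0, the first character of the input must be in the claimed first-char set.

```typescript
// Returns an input that the matcher accepts although its first char is missing
// from `firstChars`, or null if no such input was found in `runs` attempts.
// `re` is assumed to be a sticky regex, so exec only matches at lastIndex.
function findFirstCharCounterexample(
  re: RegExp,
  firstChars: Set<string>,
  alphabet: string,
  runs: number,
): string | null {
  // Small deterministic generator (Park-Miller) so a failure is reproducible.
  let seed = 12345;
  const next = (): number => (seed = (seed * 48271) % 2147483647);

  for (let i = 0; i < runs; i++) {
    const len = 1 + (next() % 6);
    let input = "";
    for (let j = 0; j < len; j++) input += alphabet[next() % alphabet.length];

    re.lastIndex = 0;
    const hit = re.exec(input);
    if (hit !== null && hit[0].length > 0 && !firstChars.has(input[0])) {
      return input; // the claimed set is not a superset
    }
  }
  return null;
}

// Toy matcher: digits, or a leading dot followed by digits.
const numberRe = /[0-9]+|\.[0-9]+/y;
const digits = "0123456789".split("");

// Too small: forgets that "." can start a number.
const tooSmall = findFirstCharCounterexample(numberRe, new Set(digits), "0123456789.ab", 5000);
// Conservative: digits plus ".".
const conservative = findFirstCharCounterexample(numberRe, new Set([...digits, "."]), "0123456789.ab", 5000);
console.log({ tooSmall, conservative });
```

A set that is too large only costs a wasted regex attempt, while a set that is too small silently drops tokens, which is why the analysis errs on the side of over-approximation.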

…preter, ~1.5x faster

`createParser` interprets the RuleExpr trees at runtime. This adds `emitParser(grammar)`,
which emits self-contained JS: it bakes the analysis (precedence, FIRST sets, NUD/LED
classification) into constants and specializes each rule into straight-line code, removing
the matchExpr tree-walk. It reuses the existing lexer runtime (fed baked token/config data),
so the parser — not the lexer — is what's specialized, and it never imports the grammar
definition object.
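To make the difference between interpreting and emitting concrete, here is a minimal sketch on a toy grammar. It is not the code in src/emit-parser.ts: the three-variant `RuleExpr`, this `matchExpr`, and `emitToyParser` are simplified, hypothetical stand-ins, and the real emitter also handles Pratt, left-recursion, memoization, and CST construction, which are omitted here.

```typescript
// Hypothetical toy grammar shape; the project's real RuleExpr is richer.
type RuleExpr =
  | { kind: "tok"; type: string }
  | { kind: "seq"; items: RuleExpr[] }
  | { kind: "alt"; arms: RuleExpr[] };
type Tok = { type: string };

// Interpreter: re-walks the expression tree on every call.
// Returns the position after the match, or -1 on failure.
function matchExpr(e: RuleExpr, toks: Tok[], pos: number): number {
  switch (e.kind) {
    case "tok":
      return pos < toks.length && toks[pos].type === e.type ? pos + 1 : -1;
    case "seq": {
      let p = pos;
      for (const item of e.items) {
        p = matchExpr(item, toks, p);
        if (p < 0) return -1;
      }
      return p;
    }
    case "alt":
      for (const arm of e.arms) {
        const p = matchExpr(arm, toks, pos);
        if (p >= 0) return p;
      }
      return -1;
  }
}

// Emitter: walks the tree once and produces JS source in which every node is
// its own small function with the token type baked in as a constant.
function emitToyParser(root: RuleExpr): string {
  const defs: string[] = [];
  const emit = (e: RuleExpr): string => {
    let body: string;
    if (e.kind === "tok") {
      body = `return p < toks.length && toks[p].type === ${JSON.stringify(e.type)} ? p + 1 : -1;`;
    } else if (e.kind === "seq") {
      body = e.items.map(emit).map((f) => `p = ${f}(p); if (p < 0) return -1;`).join(" ") + " return p;";
    } else {
      body = "let r; " + e.arms.map(emit).map((f) => `r = ${f}(p); if (r >= 0) return r;`).join(" ") + " return -1;";
    }
    const name = `f${defs.length}`; // children are already emitted, so the index is unique
    defs.push(`function ${name}(p) { ${body} }`);
    return name;
  };
  const entry = emit(root);
  return `${defs.join("\n")}\nreturn ${entry}(0);`;
}

const toyGrammar: RuleExpr = {
  kind: "seq",
  items: [
    { kind: "tok", type: "ident" },
    { kind: "alt", arms: [{ kind: "tok", type: "semi" }, { kind: "tok", type: "comma" }] },
  ],
};
const emittedSource = emitToyParser(toyGrammar);
const emittedParse = new Function("toks", emittedSource) as (toks: Tok[]) => number;

const accepted: Tok[] = [{ type: "ident" }, { type: "comma" }];
const rejected: Tok[] = [{ type: "ident" }, { type: "ident" }];
console.log(matchExpr(toyGrammar, accepted, 0), emittedParse(accepted));
console.log(matchExpr(toyGrammar, rejected, 0), emittedParse(rejected));
```

The emitted source contains no `switch` on node kind and never touches the grammar object at parse time, which is the property the commit message describes; the interpreter doubles as the oracle for checking that both agree.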

Measured on the TS grammar: byte-identical CST + accept/reject vs createParser across the
full tests/cases corpus (18,805/18,805), and ~1.53x end-to-end on the four bench files. The
interpretive dispatch is gone from the profile; the remaining floor is the shared lexer
(~33%) and FIRST-set membership checks (~22%).

Additive only — createParser (the oracle), the grammars, and the generated artifacts are
untouched:
- src/emit-parser.ts          emitParser(grammar): string
- test/emit-parser-verify.ts  correctness gate: emitted CST must equal createParser's
- test/emit-parser-bench.ts   interpreter-vs-emitted benchmark
…Stage A)

Replaces the emitted parser's string-keyed dispatch with integer compares: a
two-int interning pass after tokenize tags each token with a type-kind and a
literal-kind (keyword/punct), and the FIRST-set guards, canStart, matchLiteral
and matchToken are emitted as integer checks against pre-classified keys. Two
fields are needed because a keyword is an identifier-with-text in this grammar,
so a keyword token must match both its keyword key and the identifier
token-name key.
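A minimal sketch of the two-field interning, assuming hypothetical names throughout (`internKinds`, `typeKind`, `litKind`, and the vocabulary values are illustrative, not the emitted parser's actual identifiers):

```typescript
type RawToken = { type: string; text: string };
type KindedToken = RawToken & { typeKind: number; litKind: number };

// Hypothetical vocabularies, as an emitter might bake them in. 0 means "no kind".
const TYPE_KINDS = new Map<string, number>([["Identifier", 1], ["Number", 2], ["Punct", 3]]);
const LIT_KINDS = new Map<string, number>([["if", 1], ["return", 2], ["(", 3], [";", 4]]);
const K_IDENTIFIER = 1;
const L_IF = 1;

// One pass after tokenize; every later check is an integer compare.
function internKinds(tokens: RawToken[]): KindedToken[] {
  return tokens.map((t) => ({
    ...t,
    typeKind: TYPE_KINDS.get(t.type) ?? 0,
    // Simplification for this sketch: only identifiers and punctuation
    // carry a literal-kind.
    litKind: (t.type === "Identifier" || t.type === "Punct") ? (LIT_KINDS.get(t.text) ?? 0) : 0,
  }));
}

// What a string compare such as `tok.type === "Identifier"` becomes.
const matchesIdentifier = (t: KindedToken): boolean => t.typeKind === K_IDENTIFIER;
// What a literal compare such as `tok.text === "if"` becomes.
const matchesKeywordIf = (t: KindedToken): boolean => t.litKind === L_IF;

const [kw, name] = internKinds([
  { type: "Identifier", text: "if" },
  { type: "Identifier", text: "foo" },
]);
console.log(matchesIdentifier(kw), matchesKeywordIf(kw));     // the keyword matches both keys
console.log(matchesIdentifier(name), matchesKeywordIf(name)); // a plain name matches only one
```

With a single kind field, the `if` token would have to be either a keyword or an identifier, and one of the two checks would fail; keeping both fields lets each check stay a single integer compare.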

Emitted-only change (createParser, the lexer, the grammars, and the generated
artifacts are untouched), byte-identical to createParser across the full
tests/cases corpus (18,805/18,805). String dispatch drops from ~45% of the
parser layer to ~22%; the emitted parser goes 49->43 ms aggregate on the four
bench files (~1.15x), and interpreter-vs-emitted rises 1.53x->1.84x. The win is
capped by the shared lexer (tokenize, ~40% of total, out of scope here).
johnsoncodehk changed the title from "Emit a standalone parser (Stage A): equal to the interpreter, ~1.5x faster" to "Emit a standalone parser: 100% equal to the interpreter, ~1.8x faster (Stage A + integer token kinds)" on Jun 5, 2026
…-identical)

tokenize tried every token matcher per position against `source.slice(pos)` (a
fresh rest-of-file copy). Now the matchers are sticky regexes exec'd in place at
lastIndex (no slice), and a conservative first-char analysis skips matchers whose
possible first chars can't match the current char — so a `(` or `;` no longer
runs the identifier/number/string/comment regexes. The first-char dispatch is the
real win (most positions are punctuation that fell through all 16 matchers); the
slice removal is minor (V8's flat-string slice is cheap).
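The two techniques can be sketched together as follows. This is a simplified illustration, not the code in src/gen-lexer.ts: the matcher table, the hand-written `firstChars` sets, and `tokenizeSketch` are assumptions, and the real lexer derives its first-char sets by analysis rather than listing them by hand.

```typescript
type Matcher = {
  type: string;
  re: RegExp; // sticky: exec only matches exactly at lastIndex
  firstChars: Set<string> | null; // superset of possible first chars; null = always try
};

const chars = (s: string): Set<string> => new Set(s.split(""));

// Hypothetical matcher table for a tiny language.
const matchers: Matcher[] = [
  { type: "Whitespace", re: /[ \t\r\n]+/y, firstChars: chars(" \t\r\n") },
  {
    type: "Identifier",
    re: /[A-Za-z_$][A-Za-z0-9_$]*/y,
    firstChars: chars("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_$"),
  },
  { type: "Number", re: /[0-9]+/y, firstChars: chars("0123456789") },
  { type: "Punct", re: /[(){};,=+]/y, firstChars: null },
];

function tokenizeSketch(source: string): { type: string; text: string }[] {
  const out: { type: string; text: string }[] = [];
  let pos = 0;
  outer: while (pos < source.length) {
    const ch = source[pos];
    for (const m of matchers) {
      // First-char dispatch: skip matchers that cannot start with `ch`.
      if (m.firstChars !== null && !m.firstChars.has(ch)) continue;
      // Sticky exec in place: no source.slice(pos) copy is needed.
      m.re.lastIndex = pos;
      const hit = m.re.exec(source);
      if (hit !== null && hit[0].length > 0) {
        if (m.type !== "Whitespace") out.push({ type: m.type, text: hit[0] });
        pos += hit[0].length;
        continue outer;
      }
    }
    throw new Error(`Unexpected character ${JSON.stringify(ch)} at ${pos}`);
  }
  return out;
}

const sketchTokens = tokenizeSketch("foo (42);");
console.log(sketchTokens.map((t) => `${t.type}:${t.text}`).join(" "));
```

In this sketch a `(` runs only the Punct regex, where a naive loop would first try the whitespace, identifier, and number regexes and fail each one.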

Shared lexer change → speeds up the interpreter, the emitted parser, and every
language (TS/JS/HTML/YAML/Vue). Byte-identical token stream (verified across
TS+JS+YAML corpora, 0 differing tokens; full conformance unchanged at 5386/5659).
Lexer ~21->17 ms on the bench files (~1.25x); the emitted parser's gap to the
official TS parser narrows 5.6x->5.0x.
johnsoncodehk changed the title from "Emit a standalone parser: 100% equal to the interpreter, ~1.8x faster (Stage A + integer token kinds)" to "Emit a standalone parser, equal to the interpreter — + integer token kinds + sticky/dispatch lexer (~5.0x the official TS parser)" on Jun 5, 2026
@johnsoncodehk johnsoncodehk merged commit cca4c6a into master Jun 5, 2026
2 checks passed
@johnsoncodehk johnsoncodehk deleted the emit-parser-stage-a branch June 5, 2026 18:00