Emit a standalone parser, equal to the interpreter, + integer token kinds + sticky/dispatch lexer (~5.0x the official TS parser) #4
Merged
…preter, ~1.5x faster

`createParser` interprets the RuleExpr trees at runtime. This adds `emitParser(grammar)`, which emits self-contained JS: it bakes the analysis (precedence, FIRST sets, NUD/LED classification) into constants and specializes each rule into straight-line code, removing the matchExpr tree-walk. It reuses the existing lexer runtime (fed baked token/config data), so the parser, not the lexer, is what's specialized, and it never imports the grammar definition object.

Measured on the TS grammar: byte-identical CST + accept/reject vs createParser across the full tests/cases corpus (18,805/18,805), and ~1.53x end-to-end on the four bench files. The interpretive dispatch is gone from the profile; the remaining floor is the shared lexer (~33%) and FIRST-set membership checks (~22%).

Additive only: createParser (the oracle), the grammars, and the generated artifacts are untouched:
 - src/emit-parser.ts          emitParser(grammar): string
 - test/emit-parser-verify.ts  correctness gate: emitted CST must equal createParser's
 - test/emit-parser-bench.ts   interpreter-vs-emitted benchmark
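To make the interpret-versus-emit distinction concrete, here is a minimal sketch, not the code in `src/emit-parser.ts`. The three-variant `RuleExpr` shape, `emitExpr`, and `emitMatcher` are hypothetical stand-ins invented for illustration; the real emitter also bakes in precedence, FIRST sets, and CST construction, none of which is shown. The sketch only demonstrates the tree walk moving from parse time to emit time.

```typescript
// Hypothetical miniature grammar shape; not the project's real RuleExpr.
type RuleExpr =
  | { kind: "tok"; type: string }
  | { kind: "seq"; items: RuleExpr[] }
  | { kind: "alt"; items: RuleExpr[] };

interface Tok { type: string }

// Interpreter: re-walks the expression tree on every call.
// Returns the position after the match, or -1 on failure.
function matchExpr(e: RuleExpr, t: Tok[], pos: number): number {
  if (e.kind === "tok") {
    return pos < t.length && t[pos].type === e.type ? pos + 1 : -1;
  }
  if (e.kind === "seq") {
    let p = pos;
    for (const item of e.items) {
      p = matchExpr(item, t, p);
      if (p < 0) return -1;
    }
    return p;
  }
  for (const item of e.items) {
    const p = matchExpr(item, t, pos);
    if (p >= 0) return p;
  }
  return -1;
}

// Emitter: walks the tree once, ahead of time, and prints nested ifs.
// `inV` names a variable holding the current position (always >= 0 here);
// the emitted code declares `outV` and leaves the result in it.
function emitExpr(e: RuleExpr, inV: string, outV: string, fresh: () => string): string {
  if (e.kind === "tok") {
    const key = JSON.stringify(e.type);
    return `let ${outV} = ${inV} < t.length && t[${inV}].type === ${key} ? ${inV} + 1 : -1;\n`;
  }
  let code = `let ${outV} = -1;\n`;
  let close = "";
  if (e.kind === "seq") {
    let cur = inV;
    for (const item of e.items) {
      const v = fresh();
      code += emitExpr(item, cur, v, fresh) + `if (${v} >= 0) {\n`;
      close += "}\n";
      cur = v;
    }
    return code + `${outV} = ${cur};\n` + close;
  }
  for (const item of e.items) {
    const v = fresh();
    code += emitExpr(item, inV, v, fresh) + `if (${v} >= 0) { ${outV} = ${v}; } else {\n`;
    close += "}\n";
  }
  return code + close;
}

function emitMatcher(e: RuleExpr): string {
  let n = 0;
  return emitExpr(e, "pos", "out", () => `p${n++}`) + "return out;";
}

const rule: RuleExpr = {
  kind: "seq",
  items: [
    { kind: "tok", type: "a" },
    { kind: "alt", items: [{ kind: "tok", type: "b" }, { kind: "tok", type: "c" }] },
  ],
};

// The emitted source contains no reference to `rule`; it is plain comparisons.
const emitted = new Function("t", "pos", emitMatcher(rule)) as (t: Tok[], pos: number) => number;
```

Calling `emitted(tokens, 0)` and `matchExpr(rule, tokens, 0)` on the same input should agree, which is the same oracle relationship the verify script checks at full scale.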
…Stage A)

Replaces the emitted parser's string-keyed dispatch with integer compares: a two-int interning pass after tokenize tags each token with a type-kind and a literal-kind (keyword/punct), and the FIRST-set guards, canStart, matchLiteral and matchToken are emitted as integer checks against pre-classified keys. Two fields are needed because a keyword is an identifier-with-text in this grammar, so a keyword token must match both its keyword key and the identifier token-name key.

Emitted-only change (createParser, the lexer, the grammars, and the generated artifacts are untouched), byte-identical to createParser across the full tests/cases corpus (18,805/18,805). String dispatch drops from ~45% of the parser layer to ~22%; the emitted parser goes 49->43 ms aggregate on the four bench files (~1.15x), and interpreter-vs-emitted rises 1.53x->1.84x. The win is capped by the shared lexer (tokenize, ~40% of total, out of scope here).
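The two-field tagging can be illustrated with a short sketch. Everything here is assumed for illustration: the `TYPE_KIND` and `LIT_KIND` vocabularies, the token shape, and the numeric values are hypothetical, and the real pass derives its keys from the grammar rather than from hand-written tables.

```typescript
// Hypothetical vocabularies; 0 is reserved for "no kind".
const TYPE_KIND = new Map<string, number>([["Identifier", 1], ["Number", 2], ["Punct", 3]]);
const LIT_KIND = new Map<string, number>([["if", 1], ["return", 2], ["(", 3], [";", 4]]);

interface RawToken { type: string; text: string }
interface KindedToken extends RawToken { typeKind: number; litKind: number }

// One pass after tokenize: tag each token with both integers.
function internKinds(tokens: RawToken[]): KindedToken[] {
  return tokens.map((tok) => {
    // In this sketch only identifier-like and punctuation tokens can be literals.
    const canBeLiteral = tok.type === "Identifier" || tok.type === "Punct";
    return {
      type: tok.type,
      text: tok.text,
      typeKind: TYPE_KIND.get(tok.type) ?? 0,
      litKind: canBeLiteral ? (LIT_KIND.get(tok.text) ?? 0) : 0,
    };
  });
}

// Emitted guards become integer compares instead of string compares.
const K_IDENTIFIER = 1;
const L_IF = 1;
const matchToken = (tok: KindedToken, typeKind: number): boolean => tok.typeKind === typeKind;
const matchLiteral = (tok: KindedToken, litKind: number): boolean => tok.litKind === litKind;

const kinded = internKinds([
  { type: "Identifier", text: "if" },
  { type: "Punct", text: "(" },
  { type: "Identifier", text: "x" },
]);
```

The keyword `if` carries both `typeKind === K_IDENTIFIER` and `litKind === L_IF`, so it satisfies an identifier guard and a keyword guard alike, whereas `x` satisfies only the first. A single integer field could not express that overlap.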
…-identical)

tokenize tried every token matcher per position against `source.slice(pos)` (a fresh rest-of-file copy). Now the matchers are sticky regexes exec'd in place at lastIndex (no slice), and a conservative first-char analysis skips matchers whose possible first chars can't match the current char, so a `(` or `;` no longer runs the identifier/number/string/comment regexes. The first-char dispatch is the real win (most positions are punctuation that fell through all 16 matchers); the slice removal is minor (V8's flat-string slice is cheap).

Shared lexer change → speeds up the interpreter, the emitted parser, and every language (TS/JS/HTML/YAML/Vue). Byte-identical token stream (verified across TS+JS+YAML corpora, 0 differing tokens; full conformance unchanged at 5386/5659). Lexer ~21->17 ms on the bench files (~1.25x); the emitted parser's gap to the official TS parser narrows 5.6x->5.0x.
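A minimal sketch of the two lexer changes together, under stated assumptions: the four matchers, the `canStart` strings, and the `DISPATCH` table are hypothetical and hand-written here, whereas the change described above derives the first-char sets by analysis. Only the mechanics (sticky `exec` at `lastIndex`, plus a per-character candidate list) are meant to carry over.

```typescript
interface Matcher {
  type: string;
  re: RegExp;       // must carry the sticky (y) flag
  canStart: string; // hand-written superset of possible first chars (assumption)
}
interface Token { type: string; text: string; start: number }

const LETTERS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
const MATCHERS: Matcher[] = [
  { type: "Whitespace", re: /[ \t\n]+/y, canStart: " \t\n" },
  { type: "Identifier", re: /[A-Za-z_$][A-Za-z0-9_$]*/y, canStart: LETTERS + "_$" },
  { type: "Number", re: /[0-9]+/y, canStart: "0123456789" },
  { type: "Punct", re: /[(){};=+]/y, canStart: "(){};=+" },
];

// Per ASCII code, the matchers that could possibly start at that character.
const DISPATCH: Matcher[][] = [];
for (let code = 0; code < 128; code++) {
  const ch = String.fromCharCode(code);
  DISPATCH.push(MATCHERS.filter((m) => m.canStart.includes(ch)));
}

function tokenize(source: string): Token[] {
  const out: Token[] = [];
  let pos = 0;
  while (pos < source.length) {
    const code = source.charCodeAt(pos);
    // Non-ASCII falls back to trying every matcher, which keeps the skip conservative.
    const candidates = code < 128 ? DISPATCH[code] : MATCHERS;
    let hit: RegExpExecArray | null = null;
    let type = "";
    for (const m of candidates) {
      m.re.lastIndex = pos; // match in place, no source.slice(pos)
      hit = m.re.exec(source);
      if (hit !== null) {
        type = m.type;
        break;
      }
    }
    if (hit === null) throw new Error(`Unexpected character at offset ${pos}`);
    out.push({ type, text: hit[0], start: pos });
    pos += hit[0].length;
  }
  return out;
}

const tokens = tokenize("x = (42);");
```

For a `(` the candidate list holds only the `Punct` matcher, so the identifier and number regexes never run at that position. Candidate lists preserve the original matcher order, which is what keeps the token stream identical to a try-everything loop as long as each `canStart` really is a superset.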
What
`createParser(grammar)` interprets the `RuleExpr` trees at runtime; it's the one derived output that isn't emitted as code (the TextMate / Monarch / tree-sitter highlighters all are). This PR adds `emitParser(grammar)` (`src/emit-parser.ts`): a self-contained JS parser that bakes the grammar's static analysis into constants and specializes each rule into straight-line code, removing the central `matchExpr` tree-walk. It reuses the lexer runtime, so the parser is what's specialized, and never imports the grammar-definition object.

Then two perf levers, both proven byte-identical against `createParser` as the oracle:
- Integer token kinds (`emit-parser.ts`). String-keyed dispatch (`tok.type === key`, `Set.has`, `startsWith`) → integer compares, via a two-int interning pass after tokenize (type-kind + literal-kind; two fields because a keyword is an identifier-with-text here, so it must match both its keyword key and the identifier token-name key).
- Sticky/dispatch lexer (`src/gen-lexer.ts`, shared; also speeds up the interpreter and every language). `tokenize` no longer tries every matcher per position against a fresh `source.slice(pos)`: matchers are sticky regexes exec'd in place, and a conservative first-char analysis skips matchers whose possible first chars can't match the current char (so `(` / `;` skip the word/number/string regexes).

Measured (TS grammar)
Correctness: byte-identical CST + accept/reject vs `createParser` across the full `tests/cases` corpus (18,805 / 18,805 at every step). The shared lexer change is byte-identical too: identical token stream across TS+JS+YAML corpora (0 differing tokens of ~54k) and conformance unchanged (run-conformance 5386/5659).

Emitted parser, four bench files (best-of-5):
vs the interpreter the emitted parser is now ~1.9×; the lexer lever (~1.25× on tokenize) also speeds up the interpreter, so both improve. tsc ≈ 7.6 ms.
Context: the gap to the official parser is implementation, not a V8 ceiling — tsc is itself JS. The decomposition is even (lexer ≈ parser layer); the remaining lexer cost is regex-per-token execution vs tsc's hand-tuned char scanner.
Scope
`src/emit-parser.ts` (new) + `src/gen-lexer.ts` (shared lexer; byte-identical token stream, verified across languages) + `test/emit-parser-verify.ts` (correctness gate) + `test/emit-parser-bench.ts` (benchmark). `createParser` (the oracle), the grammars, and the generated artifacts are untouched.

Follow-ups (tracked as issues)
- `emitParser(grammar, target)` with JS-specific emission behind a `Target` config (Go / Rust / native).
- … `emit-parser-verify` so `emitted == createParser` can't silently rot.

Notes
The emitted parser (~290 KB for TS) is written to /tmp by the scripts and is not committed. Subtleties handled:
- `parseRule` sets the Pratt context + packrat memo for both Pratt and left-rec rules (left-rec rules go through the same wrapper, else `${…}` in a template-literal type resolves to `Expr` not `Type`);
- the int-kinds vocabulary must be a superset of every dispatch key, including operators that appear only inside a `not` (reserved-word) lookahead;
- the first-char dispatch is conservative (proven first-char superset, verified by a 5k-input fuzz + full token byte-identity).
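The superset property behind the first-char dispatch can be checked mechanically. The sketch below is one way such a fuzz might look, not the PR's actual fuzz harness: `makeInputs`, `findSkippedMatches`, the matcher tables, and the alphabet are all hypothetical. The invariant it tests is that whenever a matcher matches at some position, the character at that position is in the matcher's declared first-char set, so the dispatch can never skip a matcher that would have matched.

```typescript
interface Matcher { type: string; re: RegExp; canStart: string }

// Small deterministic generator so a failing input can be reproduced from its seed.
function makeInputs(count: number, length: number, alphabet: string, seed: number): string[] {
  let state = seed >>> 0;
  const inputs: string[] = [];
  for (let i = 0; i < count; i++) {
    let s = "";
    for (let j = 0; j < length; j++) {
      state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
      s += alphabet[(state >>> 16) % alphabet.length];
    }
    inputs.push(s);
  }
  return inputs;
}

// Reports every (matcher, position) where the regex matches but the
// declared first-char set would have caused the dispatch to skip it.
function findSkippedMatches(matchers: Matcher[], inputs: string[]): string[] {
  const violations: string[] = [];
  for (const input of inputs) {
    for (let pos = 0; pos < input.length; pos++) {
      for (const m of matchers) {
        m.re.lastIndex = pos; // sticky regex: test exactly at pos
        if (m.re.test(input) && !m.canStart.includes(input[pos])) {
          violations.push(`${m.type} matches ${JSON.stringify(input)} at ${pos} but would be skipped`);
        }
      }
    }
  }
  return violations;
}

const LETTERS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
const soundMatchers: Matcher[] = [
  { type: "Identifier", re: /[A-Za-z_][A-Za-z0-9_]*/y, canStart: LETTERS + "_" },
  { type: "Number", re: /[0-9]+/y, canStart: "0123456789" },
];
// Deliberately wrong: the identifier regex accepts "_" but canStart omits it.
const unsoundMatchers: Matcher[] = [
  { type: "Identifier", re: /[A-Za-z_][A-Za-z0-9_]*/y, canStart: LETTERS },
];

const fuzzInputs = makeInputs(5000, 12, "ab_Z09(; ", 1);
const soundViolations = findSkippedMatches(soundMatchers, fuzzInputs);
const unsoundViolations = findSkippedMatches(unsoundMatchers, ["_a"]);
```

A fuzz like this can only find counterexamples, it cannot prove their absence, which is presumably why the note above pairs it with full token byte-identity on the corpora.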