Prune longest-match alternatives by their deep FIRST set (#8) #34
Merged
…st token (#8)

The three longest-match dispatch loops (parseNonRec, parseLeftRec atoms, parsePratt nuds) decided whether to try an alternative from its SHALLOW first token (firstTokenOf): a leading rule ref defeated it (→ always tried), so a rule-ref-led alt was speculatively parsed and usually failed at every position: ~57% of all alternative attempts on the TS corpus.

Resolve each alternative's FIRST set through the transitive firstSets (already computed, previously used only for rule-ref guards) and skip the alt when the lookahead is provably outside it. Sound by construction: exprFirst is a complete over-approximation (never omits a startable token), and a nullable alt is always tried; its only extra match is the empty one, which never wins the longest-match comparison.

Mirrored in both createParser (gen-parser.ts) and emitParser (emit-parser.ts). The per-alt FIRST descriptor arrays are precomputed (interpreter) / hoisted to deduped module-level consts (emitter) so the guard costs nothing per call, which also removes the same per-call array allocation from the existing rule-ref firstGuard and shrinks the emitted TS parser 328KB → 314KB.

Parser-layer ~9% faster (up to ~15% on expression-dense files). Zero behaviour change: byte-identical CST + accept/reject across the full 18,805-file corpus (createParser ≡ emitParser), interpreter CST identical to HEAD, run-conformance 5386/5659 unchanged, 26/26 gates pass.
What
The longest-match loops (parseNonRec, the parseLeftRec atoms, the parsePratt nuds) try every alternative the lookahead doesn't rule out and keep the longest. The "rule out" test was shallow (firstTokenOf): it pruned an alternative only when its first element was a literal or a token ref. A leading rule ref (Decl …, Expr …) defeated it: firstTokenOf returned null, so the alternative was always tried. As a result, a rule-ref-led alternative was speculatively parsed (and usually failed) at every position.

Instrumented on the TS corpus, that's the bulk of the waste: the NUD/atom loops try ~4.4 alternatives per dispatch, ~80% of those matchExpr attempts never win the longest-match, and ~57% of all attempts are alternatives a deep FIRST set would have ruled out (70% of the wasted ones).

This PR resolves each alternative's FIRST set through the transitive firstSets (already computed for the rule-ref guards) and skips the alternative when the lookahead is provably outside it. This is issue #8's lever 2 ("strengthen first-token dispatch so provably-dead alternatives are never entered").

Sound by construction (zero behaviour change)

exprFirst is a complete over-approximation: it never omits a token a non-empty match could begin with, so an alternative pruned here genuinely cannot match non-empty at this token. A nullable alternative is always tried: its only extra match is the empty one, and an empty match never wins the longest-match comparison (pos === saved, never > bestPos). The deep test strictly dominates the shallow one: whenever the shallow check pruned, the deep FIRST set, whose leading member is that same literal/token, prunes too.

Verified:

- createParser ≡ emitParser, full corpus
- run-conformance accept/reject set
- npm run check

Results as recorded in the commit message: byte-identical CST + accept/reject across the full 18,805-file corpus, interpreter CST identical to HEAD, run-conformance 5386/5659 unchanged, 26/26 gates pass.

Mirrored in both engines: createParser (gen-parser.ts, altMightStart) and the emitted emitParser (emit-parser.ts, altGuard). The per-alt FIRST descriptor arrays are precomputed (interpreter) / hoisted to deduped module-level consts (emitter), so the guard allocates nothing per call. That also removes the same per-call array allocation from the existing rule-ref firstGuard, and shrinks the emitted TS parser 328 KB → 314 KB.

Measured
Isolating the parser layer (full parse − tokenize, since the lexer is untouched) and comparing interleaved best-of-N runs against the prior engine on the four bench files, the parser layer is ~9% faster, up to ~15% on expression-dense files.
Across the full pipeline that's ~4–5%, since (per #4) the lexer is ~half the time. The headroom here is inherently small: the parser layer is already ~1.6× tsc and on some files beats it. The dominant remaining gap is the lexer (regex-per-token vs tsc's hand-tuned char scanner), which is #5's territory.
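For readers who want the pruning rule in concrete form, the following is a minimal sketch of a deep FIRST-set guard next to the shallow check it replaces. It is not the code in gen-parser.ts or emit-parser.ts: the grammar model (Elem, Alt, Grammar), the helper names (computeFirstSets, altFirst, mightStart, shallowFirstToken) and the toy grammar are all assumptions made for illustration, and every rule ref is assumed to be defined.

```typescript
// Hypothetical grammar model, for illustration only.
type Elem =
  | { kind: "tok"; name: string }   // literal or token ref
  | { kind: "rule"; name: string }  // rule ref
  | { kind: "opt"; inner: Elem };   // optional element, hence nullable
type Alt = Elem[];                  // one alternative = a sequence of elements
type Grammar = { [rule: string]: Alt[] };

interface First { tokens: Set<string>; nullable: boolean }

// Copy src into dst; report whether dst grew.
function addAll(dst: Set<string>, src: Set<string>): boolean {
  let grew = false;
  src.forEach((t) => { if (!dst.has(t)) { dst.add(t); grew = true; } });
  return grew;
}

function elemFirst(e: Elem, first: Map<string, First>): First {
  switch (e.kind) {
    case "tok": return { tokens: new Set([e.name]), nullable: false };
    case "rule": return first.get(e.name)!; // assumes the rule exists
    case "opt": return { tokens: elemFirst(e.inner, first).tokens, nullable: true };
  }
}

// FIRST of a sequence: union element FIRST sets until a non-nullable element.
function altFirst(alt: Alt, first: Map<string, First>): First {
  const tokens = new Set<string>();
  for (const e of alt) {
    const f = elemFirst(e, first);
    addAll(tokens, f.tokens);
    if (!f.nullable) return { tokens, nullable: false };
  }
  return { tokens, nullable: true };
}

// Transitive FIRST sets per rule, iterated to a fixed point.
function computeFirstSets(g: Grammar): Map<string, First> {
  const first = new Map<string, First>();
  for (const r of Object.keys(g)) first.set(r, { tokens: new Set(), nullable: false });
  let changed = true;
  while (changed) {
    changed = false;
    for (const r of Object.keys(g)) {
      const cur = first.get(r)!;
      for (const alt of g[r]) {
        const f = altFirst(alt, first);
        if (addAll(cur.tokens, f.tokens)) changed = true;
        if (f.nullable && !cur.nullable) { cur.nullable = true; changed = true; }
      }
    }
  }
  return first;
}

// Shallow check: only a leading literal/token ref gives an answer.
function shallowFirstToken(alt: Alt): string | null {
  const head = alt[0];
  return head !== undefined && head.kind === "tok" ? head.name : null;
}

// Deep guard: a nullable alternative is always tried; otherwise the
// lookahead must be in the alternative's FIRST set.
function mightStart(f: First, lookahead: string): boolean {
  return f.nullable || f.tokens.has(lookahead);
}

const tok = (name: string): Elem => ({ kind: "tok", name });
const rule = (name: string): Elem => ({ kind: "rule", name });

// Toy grammar: Stmt has two rule-ref-led alternatives.
const grammar: Grammar = {
  Stmt: [[rule("Decl"), tok(";")], [rule("Expr"), tok(";")]],
  Decl: [[tok("let"), tok("ident")], [tok("const"), tok("ident")]],
  Expr: [[tok("ident")], [tok("num")], [tok("("), rule("Expr"), tok(")")]],
};

const firstSets = computeFirstSets(grammar);
const declLed = altFirst(grammar.Stmt[0], firstSets);
const exprLed = altFirst(grammar.Stmt[1], firstSets);
```

In this toy grammar the shallow check returns null for both Stmt alternatives, so both would be attempted on a `let` token. The deep guard keeps the Decl-led alternative and skips the Expr-led one, because `let` is not in the Expr-led alternative's FIRST set. In a real engine the First descriptors would be computed once up front rather than per call, which is the point of the precomputed/hoisted arrays described above.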
Scope
src/gen-parser.ts + src/emit-parser.ts only. The grammars, generated artifacts, and createParser's public behaviour are untouched. Lever 1 (committing through genuine shared-first-token ambiguity: ( → paren/arrow, ident → ident/arrow) is not attempted: FIRST sets can't separate those, and with the parser layer already near tsc the headroom doesn't justify the structural risk.
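To illustrate why FIRST sets cannot resolve the lever 1 cases, here is a small sketch with hand-written FIRST sets. The alternative names and token names are assumptions chosen to mirror the ( → paren/arrow and ident → ident/arrow examples; they are not taken from the actual grammar.

```typescript
// Hypothetical alternatives with hand-written FIRST sets, for illustration only.
const alts: { name: string; first: ReadonlySet<string> }[] = [
  { name: "paren", first: new Set(["("]) },           // ( Expr )
  { name: "arrow", first: new Set(["(", "ident"]) },  // (params) => body, or x => body
  { name: "ident", first: new Set(["ident"]) },
  { name: "number", first: new Set(["num"]) },
];

// Names of the alternatives a FIRST-set guard would still let through.
function survivors(lookahead: string): string[] {
  return alts.filter((a) => a.first.has(lookahead)).map((a) => a.name);
}

console.log(survivors("("));     // paren and arrow both survive
console.log(survivors("ident")); // arrow and ident both survive
console.log(survivors("num"));   // only number survives
```

When the lookahead belongs to more than one alternative's FIRST set, the guard prunes nothing among them, so those alternatives still have to be parsed speculatively and compared by length.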