Claude/bnf jsonicjs converter n aurn#72
Pin the desired behaviour for the runtime-guard handling of indirect left-recursive cycles before implementing it. Seven new tests currently fail (legal inputs produce `[]` or throw); an eighth, a simple rejection, happens to reject for the wrong reason. The next commit adds the k+sI runtime guard that makes them pass.
Cases covered:
- two-rule cycle (p → q x ; q → p y | z) with shortest, one-step, and two-step derivations, plus three rejection cases,
- three-rule cycle (a → b 1 ; b → c 2 ; c → a 3 | x) with shortest and one-step derivations plus a premature-stop rejection,
- indirect cycle alongside a production that also has a direct seed alternative.
Every test carries an explicit 2s timeout so a runaway parse surfaces as a test failure rather than hanging CI.
Extend the left-recursion eliminator to Paull's algorithm: order the productions and, for each A_i, inline any leading reference to A_j (j < i) into A_i's alternatives before eliminating direct left recursion. After the pass, every cycle in the first-position graph has been collapsed to a direct cycle and then rewritten to the iterative form.
The inlining moves cycle work into the star-helper patterns already handled by the direct-LR eliminator, which revealed a pre-existing LL(1) dispatch limitation: the star helper's dispatcher only peeked one token, so an iteration's first terminal collided with the FOLLOW context (the `x` in `p = q x` and the next `(x y)` iteration inside `q = z (x y)*`). Generalise the dispatcher to enumerate concrete token prefixes through the grammar (`altPrefixes`), fanning out one dispatch entry per distinct path up to four tokens deep. Cycles freeze the path at the tokens accumulated so far (a `done` flag propagates out of recursive calls so a truncated sub-prefix isn't silently extended with tokens from the outer alt's trailing elements).
Result: the failing tests committed in the previous change now pass. Two-rule and three-rule cycles parse their shortest seed through several unfoldings, rejections are rejected, and an indirect cycle alongside a direct seed resolves to the right alternative on each input. 379/379 tests pass, no existing grammar regresses.
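
The ordering-and-inlining pass can be sketched in a few lines. This is a minimal model only: the `Alt` / `Grammar` shapes and the `paull` / `eliminateDirect` names are hypothetical, symbols are plain strings, and it emits a primed tail rule where the converter is described as emitting a star helper.

```typescript
// Hypothetical grammar shape for illustration; not the converter's real AST.
type Alt = string[];                 // a sequence of symbols; [] stands for ε
type Grammar = Map<string, Alt[]>;   // rule name -> alternatives

// Direct left recursion:  A = A α | β   becomes   A = β A' ; A' = α A' | ε
function eliminateDirect(g: Grammar, name: string): void {
  const alts = g.get(name) ?? [];
  const rec = alts.filter((a) => a[0] === name).map((a) => a.slice(1));
  if (rec.length === 0) return;
  const seeds = alts.filter((a) => a[0] !== name);
  const tail = name + "'";
  g.set(name, seeds.map((b) => [...b, tail]));
  g.set(tail, [...rec.map((a) => [...a, tail]), []]);
}

// For each A_i (in declaration order), inline leading references to every
// earlier A_j, then remove the direct left recursion the inlining exposed.
function paull(input: Grammar): Grammar {
  const g: Grammar = new Map();
  input.forEach((alts, name) => g.set(name, alts.map((a) => [...a])));
  const order = Array.from(input.keys());
  order.forEach((ai, i) => {
    for (const aj of order.slice(0, i)) {
      const ajAlts = g.get(aj) ?? [];
      const expanded: Alt[] = [];
      for (const alt of g.get(ai) ?? []) {
        if (alt[0] === aj) {
          for (const sub of ajAlts) expanded.push([...sub, ...alt.slice(1)]);
        } else {
          expanded.push(alt);
        }
      }
      g.set(ai, expanded);
    }
    eliminateDirect(g, ai);
  });
  return g;
}

// The two-rule cycle from the tests: p → q x ; q → p y | z
const paullDemo = paull(
  new Map<string, Alt[]>([
    ["p", [["q", "x"]]],
    ["q", [["p", "y"], ["z"]]],
  ]),
);
```

In this model `q` ends up as `z q'` with `q' = x y q' | ε`, which corresponds to the `q = z (x y)*` form mentioned above.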
Replace the fixed two-slot consumed-token history (ctx.v1 / ctx.v2) with a growing stack ctx.v that records every token consumed, in order. Expose two helpers on ctx:
- ctx.mark(): returns the current v.length, an opaque position.
- ctx.rewind(mark): pops the tokens consumed since that mark and unshifts them onto the active lexer's pending-token queue (pnt.token, which already exists and is consulted by lex.next before running matchers). Also invalidates the lookahead buffer so parse_alts re-fetches fresh.
Legacy ctx.v1 / ctx.v2 keep working as getters on the top of the stack, so existing grammar code is unaffected. The v-history push is moved to run *before* alt actions so ctx.rewind from inside an `a:` action sees the just-matched tokens on the stack. The lookahead-buffer shift stays at the end of process(), so non-action code paths behave identically.
No changes to the lexer or to the parse loop; only rules.ts / parser.ts / types.ts are touched. Full suite: 385/385 tests pass (379 existing + 6 new in test/rewind.test.js covering mark/rewind semantics, partial rewind, no-op rewind, and close-state rewind).
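
As a rough illustration of the mark/rewind contract described above, here is a toy model. `RewindCtx`, `consume`, and the `pending` array are hypothetical stand-ins (for ctx, the parser's token intake, and the lexer's pending-token queue); tokens are plain strings and lookahead invalidation is left out.

```typescript
// Toy model of the consumed-token stack; not jsonic's actual Context type.
class RewindCtx {
  v: string[] = [];        // every consumed token, in order
  pending: string[] = [];  // tokens the lexer would hand out before lexing more

  consume(tok: string): void {
    this.v.push(tok);
  }

  // Legacy accessors: the top two entries of the stack.
  get v1(): string | undefined {
    return this.v[this.v.length - 1];
  }
  get v2(): string | undefined {
    return this.v[this.v.length - 2];
  }

  // An opaque position: the current stack height.
  mark(): number {
    return this.v.length;
  }

  // Pop everything consumed since `mark` and put it back at the front of
  // the pending queue, preserving the original token order.
  rewind(mark: number): void {
    const popped = this.v.splice(mark);
    this.pending.unshift(...popped);
  }
}

const rewindDemo = new RewindCtx();
rewindDemo.consume("a");
const rewindMark = rewindDemo.mark();
rewindDemo.consume("b");
rewindDemo.consume("c");
rewindDemo.rewind(rewindMark);
```

After the rewind, `rewindDemo.v` holds only `a` and the pending queue holds `b`, `c` in their original order, so a re-parse would see them again.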
Two changes to eliminateLeftRecursion that together cover the
nullable-through-cycle case the previous commit documented as a
silent failure:
1. Topologically sort productions before running Paull's: an edge
A → B exists when A has an alternative whose leading element is
a reference to B. Processing B before A means Paull's inner
substitution loop (which only inlines A_j for j < i) actually
reaches the nullable rule and can propagate ε into A's leading
position.
2. Silently drop trivial `[P]` self-reference alternatives produced
by nullable-prefix expansion. `P ::= P` adds nothing to the
language, and an alt that reduces to exactly `[P]` after
stripping the leading self-ref is semantically a no-op. The
previous behaviour — throwing on a "trivial left-recursive
alternative" — rejected valid grammars that happened to contain
a pure self-ref alt.
The caller's declared production order is restored after Paull's so
the start rule still ends up first in the emitted spec.
End-to-end: the classic hidden-LR grammar
    <a> ::= <b> <a> | "x"
    <b> ::= "y" |
now parses `x`, `y x`, `y y x`, ... correctly and rejects `y`, `z`.
Longer nullable chains (a → b → c, with c nullable) also work.
Adds two hidden-LR tests and adjusts two existing tests whose
assertions depended on the rejected behaviour (trivial alts now
silently drop rather than throw; the topo reorder means Paull's
inlines earlier rules more aggressively in multi-production cases).
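
A sketch of the leading-reference ordering from step 1, assuming a depth-first post-order is an acceptable way to get "B before A". The `leadingRefOrder` name and the map-of-string-arrays grammar shape are illustrative only; cycles are cut at the first revisit.

```typescript
// rules: name -> alternatives, each alternative a list of symbols.
// A symbol is a rule reference if it is a key of the map.
function leadingRefOrder(rules: Map<string, string[][]>): string[] {
  const order: string[] = [];
  const seen = new Set<string>();
  const visit = (name: string): void => {
    if (seen.has(name)) return; // already placed, or a cycle back-edge
    seen.add(name);
    for (const alt of rules.get(name) ?? []) {
      const head = alt[0]; // undefined for an empty (ε) alternative
      if (head !== undefined && rules.has(head)) visit(head);
    }
    order.push(name); // post-order: dependencies land first
  };
  rules.forEach((_alts, name) => visit(name));
  return order;
}

// Hidden-LR grammar from above: a = b a | "x" ; b = "y" | ε
const hiddenLrOrder = leadingRefOrder(
  new Map<string, string[][]>([
    ["a", [["b", "a"], ['"x"']]],
    ["b", [['"y"'], []]],
  ]),
);
```

Here `b` is ordered ahead of `a`, so an inlining loop that only looks at earlier rules can reach the nullable rule.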
Add `options.rewind.history`, a capacity cap on the consumed-token
history used by ctx.rewind. Default is Infinity (unchanged
behaviour); set a finite value to bound parse-time memory.
Implementation:
- New `cfg.rewind.history` flows from Options through utility.ts.
- rules.ts keeps pushing each consumed token onto ctx.v, then
batch-evicts from the front once v.length exceeds 2*cap, pulling
it back down to cap. Amortised O(1) per push regardless of
capacity (a naive Array.shift() per overflow push would be O(cap)).
- mark/rewind now key off a new absolute counter ctx.vAbs (total
pushed minus rewound) instead of ctx.v.length. That way, when
the ring evicts old tokens from the front, outstanding marks
captured within the retained window still rewind correctly.
- ctx.rewind(mark) throws a clear error if the mark has been
evicted ("increase options.rewind.history").
Empirical: with history=64 on an input producing 20001 consumed
tokens, ctx.v peaks at 128 entries (2*cap), vs 20001 for the
unbounded default. For small inputs (fewer tokens than the cap),
behaviour is identical to unbounded.
Adds four tests: cap enforcement, marks surviving eviction,
rewind-too-deep error, and the default remaining unbounded.
Full suite: 392/392 tests pass.
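
The batch-eviction and absolute-counter scheme might look roughly like the following. `BoundedHistory` is a hypothetical standalone class, not the code in rules.ts; it only models the stack, the `vAbs` counter, and the eviction error.

```typescript
class BoundedHistory<T> {
  v: T[] = [];
  vAbs = 0; // absolute position: total pushed minus rewound
  private readonly cap: number;

  constructor(cap: number) {
    this.cap = cap;
  }

  push(tok: T): void {
    this.v.push(tok);
    this.vAbs++;
    // Evict in one batch once the array exceeds 2*cap, back down to cap.
    if (this.v.length > 2 * this.cap) {
      this.v.splice(0, this.v.length - this.cap);
    }
  }

  mark(): number {
    return this.vAbs;
  }

  // Returns the popped tokens in their original order.
  rewind(mark: number): T[] {
    const count = this.vAbs - mark;
    if (count <= 0) return [];
    if (count > this.v.length) {
      throw new Error("rewind mark evicted; increase options.rewind.history");
    }
    this.vAbs = mark;
    return this.v.splice(this.v.length - count);
  }
}

// Reproduce the shape of the empirical check: cap 64, 20001 tokens.
const boundedDemo = new BoundedHistory<number>(64);
let boundedPeak = 0;
for (let i = 0; i < 20001; i++) {
  boundedDemo.push(i);
  boundedPeak = Math.max(boundedPeak, boundedDemo.v.length);
}
```

Because marks are absolute positions rather than array indices, a mark taken inside the retained window stays valid after the front of the array is evicted.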
Cap the consumed-token retention at 64 by default so parses of large inputs don't accidentally pin every token in memory. Small parses (under the cap) are unaffected — every token is retained, identical to the unbounded behaviour. Grammars that need deeper rewinds can raise the limit (or set it to Infinity) via options.rewind.history. Updates the test that covered the old "default is unbounded" behaviour and adds a new test pinning that Infinity opts back in to unbounded retention. Full suite: 393/393 tests pass.
Drop the angle-bracket rule-name delimiters and change the definition operator from `::=` to `=`, matching ABNF (RFC 5234). References are now bare identifiers everywhere; quoted strings remain the only way to write a literal, and parentheses still group.
Changes are confined to the BNF parser grammar (bnfRules), the BNF-parser jsonic instance (drop #LT/#GT, remap #DEF), the test fixtures (.bnf files) and the inline grammars in tests. The downstream AST, left-recursion pass, desugar, and emitter are unchanged, because the AST shape (name: string, refs by name) was already ABNF-compatible.
Production-boundary detection shrinks from a 4-token peek (#LT #TX #GT #DEF) to a 2-token peek (#TX #DEF); seq/prod rules updated accordingly.
Remaining BNF-isms (alternation `|`, optional `?`, repetition `*`/`+`, comment `#`) are unchanged in this stage; later stages replace them in turn. 393/393 tests pass.
Change the alternation operator to `/`, matching ABNF (RFC 5234).
A side effect is that the regex-terminal extension
(`/pattern/flags`) had to go: ABNF has no regex terminal, and
the `/` delimiters would be ambiguous with the alternation
operator. Numeric ranges (`%x30-39`, etc.) will take over that
role in a later stage.
Changes:
- bnfRules grammar renames `#PIPE` (`|`) to `#ALT` (`/`),
  drops the `#RX` match-token matcher and the elem alt that
  recognised regex syntax.
- The internal AST `{ kind: 'regex', pattern, flags }` and the
  emitter's regex-token allocation stay in place; they'll be
  reused when `%x` ranges land.
- Fixtures (`greet.bnf`, `pair.bnf`, `arith.bnf`,
  `arith-leftrec.bnf`, `json-subset.bnf`) rewritten to use
  `/` alternation; regex-dependent multi-digit/letter
  terminals replaced with explicit `"0" / "1" / …` alternation
  or single-letter ruleset until the %x stage.
- Inline grammars in tests rewritten; the `regex terminals`
  describe block is dropped and marked for return after %x.
- Assertions that pinned an inlined seed alt to `kind: 'regex'`
  updated to `'term'` (the fixture now uses "1" instead of a
  regex digit class).
389/389 tests pass.
Replace the BNF postfix-question-mark optional with ABNF's bracket
form. Drop the `#QM` (`?`) fixed token entirely; remap `#OB` /
`#CB` from JSON's `{` / `}` to ABNF's `[` / `]`. Add an
elem.open alt that pushes `alts` for `[`, an elem.close alt that
consumes `]` (gated on `r.u.groupKind === 'opt'`), and a
matching seq end-of-sequence guard for `]`.
`r.u.group` becomes `r.u.groupKind` (`'group'` for parens,
`'opt'` for brackets) so the close state knows which terminator
to expect and how to wrap the result. The internal AST shape for
optional is unchanged: `{ kind: 'opt', inner: {kind:'group', …} }`,
identical to what `(A)?` used to produce, so the desugar /
emitter pipeline didn't need any changes.
Fixtures and inline tests updated to use the bracket form. The
`(A)?` combination in the BNF source no longer exists; users wrap
in brackets directly. Repetition postfixes `*` / `+` and
comments `#` are still in their BNF form pending later stages.
389/389 tests pass.
Replace BNF postfix `A*` / `A+` with ABNF's prefix repetition:
    *A     = 0 or more        (was A*)
    1*A    = 1 or more        (was A+)
    m*nA   = between m and n  (NEW)
    *nA    = at most n        (NEW)
    m*A    = at least m       (NEW)
    nA     = exactly n        (NEW)
Add a #NUM match-token (decimal digits) and an explicit `atom`
rule split out of `elem`. `elem` now parses an optional
repetition prefix (NUM/'*' combinations) into r.u.min/r.u.max,
pushes `atom` to consume the actual element body, then wraps the
returned atom into the right AST node: `star` / `plus` /
`opt` for the boundary cases, `rep` for general m*n.
The new `rep` AST kind desugars to a concatenation of `min`
mandatory copies followed by either a star tail (`min, ∞`) or
`max - min` nested optionals, staying within the existing
emitter's vocabulary.
#PLUS (`+`) is dropped entirely; ABNF has no postfix repetition.
atom.close grew tcol-only `{ s: TOK, b: 1 }` declarations for
every token that can legitimately follow an atom. Without these,
the lexer would fall through to the default number-matcher and
emit #NR for digits, leaving the enclosing seq.close unable to
recognise `1` as the start of a repetition prefix.
Tests cover *A / 1*A / m*nA / nA / *nA. All 392 tests pass.
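
The `rep` desugaring described above can be sketched as follows. The `El` union is a cut-down, hypothetical AST (the real one also has `term`, `group`, `plus`, and so on), and `show` exists only to make the result readable.

```typescript
// Simplified element AST for illustration only.
type El =
  | { kind: "ref"; name: string }
  | { kind: "seq"; items: El[] }
  | { kind: "star"; inner: El }
  | { kind: "opt"; inner: El };

// m*nA  ->  A ... A (min copies), then *A if unbounded,
//           else (max - min) nested optionals: [A [A [A]]]
function desugarRep(inner: El, min: number, max: number): El {
  const items: El[] = [];
  for (let i = 0; i < min; i++) items.push(inner);
  if (max === Infinity) {
    items.push({ kind: "star", inner });
  } else {
    let tail: El | undefined;
    for (let i = 0; i < max - min; i++) {
      tail = {
        kind: "opt",
        inner: tail ? { kind: "seq", items: [inner, tail] } : inner,
      };
    }
    if (tail) items.push(tail);
  }
  return { kind: "seq", items };
}

// Render back to ABNF-like text (adequate for single-ref inners).
function show(el: El): string {
  switch (el.kind) {
    case "ref":
      return el.name;
    case "seq":
      return el.items.map(show).join(" ");
    case "star":
      return "*" + show(el.inner);
    case "opt":
      return "[" + show(el.inner) + "]";
  }
}

const refA: El = { kind: "ref", name: "A" };
const rep2to4 = show(desugarRep(refA, 2, 4));
const rep1toInf = show(desugarRep(refA, 1, Infinity));
```

Nesting the optionals (rather than emitting `[A] [A]`) keeps the expansion unambiguous: a later copy can only match if every earlier copy did.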
Override the BNF-parser jsonic instance's comment definitions: remap `hash` (default `#`) to `;` and disable the `slash` and `multi` variants (we'd never want `//` or `/*` ambiguous with the ABNF `/` alternation operator). All five fixtures and the inline comment tests now use `;`.
This stage finishes the syntax migration of features the converter already supported. Remaining work is all additive: %x numeric ranges, case-insensitive strings, incremental alternatives, line continuation, and the RFC 5234 core rules library. 392/392 tests pass.
Add RFC 5234 numeric-value syntax as a new atom form:
    %xNN         single hex code point  (%x61 matches 'a')
    %dNN         decimal                (%d97 matches 'a')
    %bNN         binary                 (%b1100001 matches 'a')
    %xNN-NN      character range        (%x30-39 matches any digit)
    %xNN.NN.NN   concatenated points    (%x66.6f.6f matches "foo")
Implementation:
- New #NV match-token with a permissive hex regex;
parseNumericValue() decodes the matched source based on the base
prefix (x/d/b).
- Single values and concatenations produce a `{kind:'term', literal}`
(one or many chars); ranges produce a `{kind:'regex', pattern}`
over the corresponding `\\uNNNN-\\uNNNN` class.
- atom.open gains a `#NV` alt; atom.close grew a `{s:'#NV',b:1}`
tcol-only declaration so the lexer emits #NV even when the rule
state it falls into isn't explicitly expecting it.
- elem.open's repetition-prefix alts now end with a new `#ATOM`
tokenset (#ST / #NV / #TX / #LP / #OB) plus `b:1`, so tcol at the
atom-starter position includes every matcher tin — without this
the lexer would fall through to #TX on `1*%x30-39` and the
downstream emitter would try to treat `%x30-39` as a rule name.
The `arith.bnf` / `arith-leftrec.bnf` fixtures collapse their
single-digit `"0" / "1" / … / "9"` enumeration to `number =
1*DIGIT` with `DIGIT = %x30-39`. Tests cover single values in
each base, concatenation, ranges, and ranges under repetition.
397/397 tests pass.
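
A minimal decoding sketch for the three numeric-value forms. `decodeNumericValue`, `radixOf`, and `uEscape` are illustrative names, the bracketed class is one plausible rendering of the `\uNNNN-\uNNNN` range, and the input is assumed to be well formed: digit validation is deliberately out of scope here.

```typescript
type NumericAtom =
  | { kind: "term"; literal: string }
  | { kind: "regex"; pattern: string };

function radixOf(prefix: string): number {
  return prefix === "x" ? 16 : prefix === "d" ? 10 : 2;
}

// \uNNNN escape for a code point in the Basic Multilingual Plane.
function uEscape(code: number): string {
  return "\\u" + ("0000" + code.toString(16)).slice(-4);
}

// Precondition (assumed): src looks like %x61, %d97, %x30-39 or %x66.6f.6f
// and every digit is valid for its base.
function decodeNumericValue(src: string): NumericAtom {
  const radix = radixOf(src.charAt(1).toLowerCase());
  const body = src.slice(2);
  const dash = body.indexOf("-");
  if (dash >= 0) {
    const lo = parseInt(body.slice(0, dash), radix);
    const hi = parseInt(body.slice(dash + 1), radix);
    return { kind: "regex", pattern: "[" + uEscape(lo) + "-" + uEscape(hi) + "]" };
  }
  const chars = body
    .split(".")
    .map((n) => String.fromCharCode(parseInt(n, radix)));
  return { kind: "term", literal: chars.join("") };
}

const nvFoo = decodeNumericValue("%x66.6f.6f");
const nvDigit = decodeNumericValue("%x30-39");
```

With this model `%x66.6f.6f` becomes a `term` with literal `foo`, and `%x30-39` becomes a `regex` whose class matches any ASCII digit.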
Match RFC 5234's default: a bare quoted-string terminal `"foo"`
accepts `foo`, `Foo`, `FOO`, `fOo`, i.e. every case-fold of
the literal. New prefixes:
    %s"foo"   force a case-sensitive match
    %i"foo"   explicit case-insensitive (same as the bare form)
Implementation:
- BnfElement.term grows an optional `caseSensitive` flag; the
  default ABNF semantics kick in when it's omitted.
- #SS / #SI match-tokens lex `%s` and `%i` (gated by a `(?=")`
  lookahead so they don't steal the `%` from `%xNN` numeric
  values). atom.open has three string alts (%s / %i / plain) and
  stamps the flag appropriately.
- The emitter keys its literal-token map on (literal,
  effective-sensitivity), via `termKey(el)`, so a `%s"foo"` and a
  bare `"foo"` produce distinct tokens. Literals whose effective
  sensitivity is "sensitive" (either `%s` or no letters) emit as
  fixed tokens; insensitive literals emit as anchored
  `/^literal/i` regexes with a new `eager$` flag on the
  RegExp instance.
To make `eager$` meaningful, lexer.ts's tcol-gating skips the
gate for matchers that carry the flag. Without it, the default
text-matcher would eat `end` as #TX inside a narrower star-helper
context before the outer rule could recognise it as its own
#END token.
Tests cover: bare literal matching any case, `%s` forcing exact
match, `%i` behaving identically to bare, literals with no
letters, and sensitive / insensitive variants of the same string
coexisting in one grammar. 402/402 tests pass.
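
One way to picture the effective-sensitivity rule and the keyed token map. The `Term` interface mirrors the flag named above, but the key format in `termKey` and the `insensitiveMatcher` helper are assumptions made for this sketch, not the emitter's actual code.

```typescript
interface Term {
  kind: "term";
  literal: string;
  caseSensitive?: boolean; // omitted => ABNF default (case-insensitive)
}

// "Sensitive" if forced with %s, or if there are no letters to fold.
function isEffectivelySensitive(el: Term): boolean {
  return el.caseSensitive === true || !/[A-Za-z]/.test(el.literal);
}

// Key on (literal, effective sensitivity) so %s"foo" and "foo" differ.
// The "s:" / "i:" prefix is a hypothetical encoding.
function termKey(el: Term): string {
  return (isEffectivelySensitive(el) ? "s:" : "i:") + el.literal;
}

// Anchored case-insensitive regex for an insensitive literal, with regex
// metacharacters in the literal escaped first.
function insensitiveMatcher(literal: string): RegExp {
  const escaped = literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped, "i");
}

const bareFoo: Term = { kind: "term", literal: "foo" };
const strictFoo: Term = { kind: "term", literal: "foo", caseSensitive: true };
const plusSign: Term = { kind: "term", literal: "+" };
```

A literal such as `"+"` has no letters, so it is treated as sensitive and can stay a fixed token even without `%s`.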
Add RFC 5234's `=/` operator: a production header of the form
    name =/ alt1 / alt2 / …
folds its alternatives into the production named `name`
declared earlier. Parser flags each incremental occurrence on
BnfProduction; a new `mergeIncrementals` post-parse pass walks
the list and appends each incremental's `alts` onto the base
(throwing if the base hasn't been seen yet).
Implementation details:
- New `#DEFA` fixed token for `=/`. Jsonic's longest-match-wins
  fixed matcher picks it over `#DEF` (`=`) automatically.
- prod.open gains a second alt for `#TX #DEFA`; prod.close and
  seq's end-guards also recognise `#TX #DEFA` as a
  production-boundary lookahead.
- BnfProduction grows an optional `incremental` flag; the merge
  pass drops it so the emitter only ever sees canonical
  productions.
Tests cover: multi-increment chaining (as in `command =/ "get" /
"post"` split across three lines), a single-alt append, a
multi-alt append, and the error on orphaned `=/`. 406/406 tests
pass.
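
The merge pass could be as small as the following sketch. The `Production` shape is simplified (alternatives are string arrays) and the error text is invented; only the `mergeIncrementals` name and the `incremental` flag come from the description above.

```typescript
interface Production {
  name: string;
  alts: string[][];
  incremental?: boolean; // set by the parser for `name =/ ...`
}

// Fold each `=/` production into the earlier base of the same name and
// return only canonical productions (no `incremental` flag).
function mergeIncrementals(prods: Production[]): Production[] {
  const byName = new Map<string, Production>();
  const out: Production[] = [];
  for (const p of prods) {
    if (p.incremental) {
      const base = byName.get(p.name);
      if (!base) {
        throw new Error("'" + p.name + " =/' has no earlier definition");
      }
      base.alts.push(...p.alts);
    } else {
      const fresh: Production = { name: p.name, alts: [...p.alts] };
      byName.set(p.name, fresh);
      out.push(fresh);
    }
  }
  return out;
}

const mergedDemo = mergeIncrementals([
  { name: "command", alts: [['"get"']] },
  { name: "command", alts: [['"post"']], incremental: true },
  { name: "command", alts: [['"delete"']], incremental: true },
]);
```

The three input productions collapse to a single `command` with three alternatives, in declaration order.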
Document and pin the fact that rules can span multiple lines with
no special parser support: newlines are IGNORE whitespace in the
BNF-parser's lexer, and production boundaries are detected by the
`name =` lookahead rather than by end-of-line. A rule like
    command = "get"
            / "post"
            / "delete"
parses as a single 3-alt production. No implementation change is
needed; three tests pin the behaviour (wrapped alternation,
wrapped sequence, and multiple wrapped rules coexisting). 409/409
tests pass.
Auto-include the RFC 5234 core rules (ALPHA, BIT, CHAR, CR, LF, CRLF, CTL, DIGIT, DQUOTE, HEXDIG, HTAB, OCTET, SP, VCHAR, WSP) when a user grammar references them but doesn't define them locally. Resolution is transitive (e.g. HEXDIG pulls in DIGIT), and a local definition of a core-rule name always wins over the built-in.
Implementation: a static CORE_RULES_ABNF string is parsed once on first use via the existing BNF parser; each user grammar gets scanned for references not satisfied by its own productions, and the needed subset of core rules is appended. No run-time cost when no core rule is referenced.
Fixtures simplified: `arith.bnf` and `arith-leftrec.bnf` now use `number = 1*DIGIT` (single core reference) instead of open-coding `%x30-39`. Six new tests cover auto-inclusion, upper/lower ALPHA, transitive HEXDIG → DIGIT, BIT, local override, and the "no core added" negative case. 415/415 tests pass.
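
The transitive lookup can be modelled with a small worklist. This sketch hard-codes a few core rules and the rules each one references (per RFC 5234) in a hypothetical `CORE_REFS` table; the converter itself is described as parsing a CORE_RULES_ABNF string instead.

```typescript
// A subset of RFC 5234 core rules, each with the core rules it references.
const CORE_REFS = new Map<string, string[]>([
  ["ALPHA", []],
  ["DIGIT", []],
  ["HEXDIG", ["DIGIT"]],
  ["CR", []],
  ["LF", []],
  ["CRLF", ["CR", "LF"]],
  ["SP", []],
  ["HTAB", []],
  ["WSP", ["SP", "HTAB"]],
]);

// userRefs: rule names referenced by the user grammar.
// userDefs: rule names the user grammar defines itself (these always win).
function neededCoreRules(userRefs: string[], userDefs: string[]): string[] {
  const local = new Set(userDefs);
  const needed = new Set<string>();
  const queue = userRefs.slice();
  while (queue.length > 0) {
    const name = queue.pop() as string;
    if (local.has(name) || needed.has(name)) continue;
    const refs = CORE_REFS.get(name);
    if (refs === undefined) continue; // not a core rule
    needed.add(name);
    queue.push(...refs);
  }
  return Array.from(needed).sort();
}

const coreForHex = neededCoreRules(["HEXDIG"], []);
const coreWithLocalDigit = neededCoreRules(["HEXDIG"], ["DIGIT"]);
```

Referencing `HEXDIG` pulls in `DIGIT` as well, unless the grammar defines its own `DIGIT`, in which case only `HEXDIG` is added.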
New `test/grammar/rfc3986-uri.abnf` contains the Appendix A collected ABNF from RFC 3986: 27 user-authored productions covering URI, scheme, authority, userinfo, host (with IP-literal, IPv4address, reg-name alternatives), full IPv6address, path-abempty / absolute / rootless / empty, query, fragment, pct-encoded, unreserved, sub-delims, written in the same syntax the RFC prints. The converter transitively pulls in ALPHA, DIGIT, and HEXDIG from the core-rules library. The only adjustment made to the RFC text is replacing `path-empty = 0<pchar>` (which uses prose-val notation outside the scope of ABNF literal interpretation) with `path-empty = ""`, semantically equivalent.
New `test/rfc3986.test.js` exercises the grammar in three parts:
1. Compilation: the full grammar builds without error, and every named production from the RFC survives into the emitted spec alongside the auto-included core rules.
2. URI acceptance: parses URIs that don't exercise the grammar's LL(k) ambiguities (`urn:isbn:0451450523`, `mailto:alice@example.com`, `tag:yaml.org,2002:int`, `http://[::1]/`) and rejects structurally-invalid inputs.
3. Known limitations: explicitly documents that URIs like `http://example.com` and `ftp://user@host` currently throw. The `authority` rule's `[ userinfo "@" ]` optional is the root cause: userinfo and reg-name share a leading character set, and the converter's static FIRST-set dispatcher can't peek past the userinfo body to decide whether the trailing `@` is there. ctx.rewind, shipped earlier in this project, is the primitive a future emitter change could use to close that gap.
Full suite: 425/425 tests pass.
The BNF converter now detects optional-prefix ambiguities of the form `[X D] Y` where X and Y share a character vocabulary and D is a terminal disambiguator, and emits a probe + phase-retry dispatcher using only standard jsonic primitives (r:, k:, c:, ctx.mark/rewind/t). On probe return the decide action peeks ctx.t[0], rewinds to the recorded mark, and retries the dispatcher with the committed phase. RFC 3986's authority ambiguity (userinfo vs reg-name) now parses all URI shapes (http://example.com, ftp://user@host, etc).
Also fixes two latent jsonic bugs the new pattern exposed: ctx.rewind now re-queues pre-lexed lookahead and clears the cached end-of-source token, and the alt bitset check no longer ORs bitAA into partitions other than 0 (it only belongs where #AA itself lives).
https://claude.ai/code/session_01F3V3mSpQonb9XHG5tr2265
When an alt's position lists #AA (ANY), drop the per-partition bitset to the existing null sentinel so the parse_alts matcher skips the bitset check entirely. Previously the alt's S[i] was sized from AA's own tin (=4, partition 0), so tokens with tin >= 31 hit an undefined higher partition and silently failed to match. The null-S[i] path was already the "no constraint" contract in the matcher; this just routes the wildcard through it. tcol collation is unaffected: t[i] still carries the raw tin list.
https://claude.ai/code/session_01F3V3mSpQonb9XHG5tr2265
Every user-declared rule now produces an AST node `{rule, src, kids}`
where `rule` is the grammar-rule name, `src` is the matched source
text (concatenated), and `kids` is the array of child user-rule
nodes. Core rules (ALPHA, DIGIT, HEXDIG, …) and synthesised helpers
(desugar / dispatcher / chain-step) are transparent: their `src`
appends to the enclosing user rule and their `kids` extend it.
For grammars without leading-ref inlining the tree mirrors the ABNF
almost exactly: an RFC 3986 URI parse surfaces `hier-part`,
`authority`, `host`, `path-abempty`, etc. as named nodes. Leading
refs at the head of a production alt still get flattened by Paull's
substitution; this is documented as a current limitation.
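
To illustrate the transparency rule, here is a small tree builder over a hypothetical `Raw` parse record (the converter builds nodes during parsing; this post-hoc fold is only a model). User rules become `{rule, src, kids}` nodes, and transparent rules contribute their text and children to the nearest enclosing user rule.

```typescript
interface RuleNode {
  rule: string;
  src: string;
  kids: RuleNode[];
}

// Hypothetical raw record: a leaf carries `text`, an inner node carries
// `children`. `transparent` marks core rules and synthesised helpers.
interface Raw {
  rule: string;
  transparent: boolean;
  text?: string;
  children: Raw[];
}

function build(raw: Raw): RuleNode {
  const node: RuleNode = { rule: raw.rule, src: "", kids: [] };
  const absorb = (r: Raw): void => {
    if (r.text !== undefined) node.src += r.text;
    for (const c of r.children) {
      if (c.transparent) {
        absorb(c); // fold src and kids into the enclosing user rule
      } else {
        const kid = build(c);
        node.src += kid.src;
        node.kids.push(kid);
      }
    }
  };
  absorb(raw);
  return node;
}

const leaf = (rule: string, text: string): Raw => ({
  rule, transparent: true, text, children: [],
});
const num = (d: string): Raw => ({
  rule: "number", transparent: false, children: [leaf("DIGIT", d)],
});

// expr = number *( "+" number ), parsed over "1+2"
const astDemo = build({
  rule: "expr",
  transparent: false,
  children: [
    num("1"),
    { rule: "expr$star", transparent: true, children: [leaf("+", "+"), num("2")] },
  ],
});
```

The helper `expr$star` and the `DIGIT` leaves leave no nodes of their own: `astDemo` is an `expr` node with `src` `1+2` and two `number` kids.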
Also fixes a jsonic bug the tree-builder surfaced: `r:` rule
replacement wasn't updating the parent's `child` pointer, so the
parent's close-state actions saw the stale pre-replacement rule
instead of the replacement. Rewound the AST tree for probe-dispatch
paths like authority's.
https://claude.ai/code/session_01F3V3mSpQonb9XHG5tr2265
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2648edde42
        (tokenMatcher as any).tin$ &&
        !(tokenMatcher as any).eager$ &&
        !rule.spec.def.tcol[oc][tI].includes((tokenMatcher as any).tin$)
    ) {
Prevent eager matchers from preempting expected literals
By skipping the tcol check for eager$ matchers, the lexer now accepts the first regex token that matches even when that token is not valid in the current rule position. Because token matchers are iterated by tin order and return on first match, a shorter case-insensitive literal can consume input before a longer literal that is actually expected (e.g. grammar contains both "A" and "AB", with "A" allocated first, and start rule expects only "AB"; parsing ab fails). This introduces false parse errors for valid ABNF grammars with prefix-overlapping literals.
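
The scenario can be reproduced with a toy model of ungated, first-match-wins matching. The `Matcher` shape and both functions are hypothetical; `longestMatch` is shown only as one possible mitigation to compare against, not as the fix adopted in this PR.

```typescript
interface Matcher {
  tin: number; // token id; matchers are tried in tin order
  re: RegExp;  // anchored, case-insensitive literal regex
}

// Model of the reported behaviour: no tcol gate, first match wins.
function firstMatch(ms: Matcher[], input: string): number | undefined {
  for (const m of ms) {
    if (m.re.test(input)) return m.tin;
  }
  return undefined;
}

// One possible mitigation (an assumption): prefer the longest match.
function longestMatch(ms: Matcher[], input: string): number | undefined {
  let best: { tin: number; len: number } | undefined;
  for (const m of ms) {
    const hit = m.re.exec(input);
    if (hit && (best === undefined || hit[0].length > best.len)) {
      best = { tin: m.tin, len: hit[0].length };
    }
  }
  return best?.tin;
}

// "A" was allocated first (tin 1), "AB" second (tin 2).
const eagerMatchers: Matcher[] = [
  { tin: 1, re: /^a/i },
  { tin: 2, re: /^ab/i },
];
const eagerFirst = firstMatch(eagerMatchers, "ab");
const eagerLongest = longestMatch(eagerMatchers, "ab");
```

With first-match-wins the input `ab` lexes as the `"A"` token (tin 1), which a rule expecting only `"AB"` then rejects; the longest-match variant picks tin 2.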
    const parts = body.split('.')
    const chars = parts.map((n) => String.fromCharCode(parseInt(n, radix)))
    return { kind: 'term', literal: chars.join('') }
Validate numeric-value digits against the declared base
parseNumericValue uses parseInt directly on each numeric chunk, but parseInt accepts partial strings, so invalid ABNF such as %d6A or %b2 is silently accepted and compiled into unintended characters/ranges instead of being rejected. That produces grammars that parse the wrong language and makes numeric-value mistakes hard to diagnose.
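
A short sketch of the kind of check being asked for. `parseChunk` is a hypothetical helper, not a function in the PR: it tests each chunk against a per-base digit class before calling `parseInt`, so `6A` under `%d` and `2` under `%b` are rejected rather than truncated or turned into NaN.

```typescript
type Base = "x" | "d" | "b";

// Parse one numeric chunk strictly: every character must be a valid digit
// for the declared base, otherwise throw.
function parseChunk(chunk: string, base: Base): number {
  const digits =
    base === "x" ? /^[0-9A-Fa-f]+$/ : base === "d" ? /^[0-9]+$/ : /^[01]+$/;
  if (!digits.test(chunk)) {
    throw new Error("invalid %" + base + " value: " + chunk);
  }
  return parseInt(chunk, base === "x" ? 16 : base === "d" ? 10 : 2);
}

// What plain parseInt does with the inputs from the review comment:
const lenientDecimal = parseInt("6A", 10); // stops at "A", yields 6
const lenientBinary = parseInt("2", 2);    // no valid digit, yields NaN
```

Plain `parseInt` reads `6A` in base 10 as 6 and returns NaN for `2` in base 2, and `String.fromCharCode(NaN)` produces U+0000, which is how a typo can turn into an unintended character.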