Claude/bnf jsonicjs converter n aurn by rjrodger · Pull Request #72 · jsonicjs/jsonic

rjrodger · 2026-04-21T10:26:45Z

No description provided.

Pin the desired behaviour for the runtime-guard handling of indirect left-recursive cycles before implementing it. Seven new tests currently fail (legal inputs produce `[]` or throw; the eighth, a simple rejection, happens to reject for the wrong reason); the next commit adds the k+sI runtime guard that makes them pass.

Cases covered:
- two-rule cycle (p → q x ; q → p y | z) with shortest, one-step, and two-step derivations, plus three rejection cases,
- three-rule cycle (a → b 1 ; b → c 2 ; c → a 3 | x) with shortest and one-step derivations plus a premature-stop rejection,
- indirect cycle alongside a production that also has a direct seed alternative.

Every test carries an explicit 2s timeout so a runaway parse surfaces as a test failure rather than hanging CI.

Extend the left-recursion eliminator to Paull's algorithm: order the productions and, for each A_i, inline any leading reference to A_j (j < i) into A_i's alternatives before eliminating direct left recursion. After the pass, every cycle in the first-position graph has been collapsed to a direct cycle and then rewritten to the iterative form.

The inlining moves cycle work into the star-helper patterns already handled by the direct-LR eliminator, which revealed a pre-existing LL(1) dispatch limitation: the star helper's dispatcher only peeked one token, so an iteration's first terminal collided with the FOLLOW context (the `x` in `p = q x` and the next `(x y)` iteration inside `q = z (x y)*`).

Generalise the dispatcher to enumerate concrete token prefixes through the grammar (`altPrefixes`), fanning out one dispatch entry per distinct path up to four tokens deep. Cycles freeze the path at the tokens accumulated so far (a `done` flag propagates out of recursive calls so a truncated sub-prefix isn't silently extended with tokens from the outer alt's trailing elements).

Result: the failing tests committed in the previous change now pass. Two-rule and three-rule cycles parse their shortest seed through several unfoldings, rejections are rejected, and an indirect cycle alongside a direct seed resolves to the right alternative on each input. 379/379 tests pass, no existing grammar regresses.
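The substitution step described above can be illustrated with a minimal sketch. It is not the converter's code: the `Grammar` shape, the `_tail` helper naming, and the textbook tail-rule rewrite (rather than the star-helper iterative form the commit emits) are all assumptions made for illustration.

```typescript
// Hypothetical grammar model: rule name -> alternatives, each a list of symbols.
type Alt = string[]
type Grammar = Record<string, Alt[]>

// Rewrite A -> A r | s into A -> s A_tail ; A_tail -> r A_tail | (empty alt).
function eliminateDirect(name: string, alts: Alt[], out: Grammar): void {
  const recursive = alts
    .filter((a) => a[0] === name)
    .map((a) => a.slice(1))
    .filter((rest) => rest.length > 0) // a bare A -> A alternative adds nothing
  const seeds = alts.filter((a) => a[0] !== name)
  if (recursive.length === 0) {
    out[name] = seeds
    return
  }
  const tail = name + "_tail" // assumed helper-naming scheme
  out[name] = seeds.map((s) => [...s, tail])
  out[tail] = [...recursive.map((r) => [...r, tail]), []]
}

// Paull's algorithm: for each A_i, inline leading references to A_j (j < i),
// then remove the direct left recursion that remains.
function paull(grammar: Grammar, order: string[]): Grammar {
  const out: Grammar = {}
  order.forEach((name, i) => {
    let alts: Alt[] = grammar[name] ?? []
    for (const earlier of order.slice(0, i)) {
      const next: Alt[] = []
      for (const alt of alts) {
        if (alt[0] === earlier) {
          for (const sub of out[earlier] ?? []) next.push([...sub, ...alt.slice(1)])
        } else {
          next.push(alt)
        }
      }
      alts = next
    }
    eliminateDirect(name, alts, out)
  })
  return out
}

// The two-rule cycle from the tests: p -> q x ; q -> p y | z
const rewritten = paull({ p: [["q", "x"]], q: [["p", "y"], ["z"]] }, ["p", "q"])
// rewritten.q is [["z", "q_tail"]] and rewritten.q_tail is [["x", "y", "q_tail"], []],
// which corresponds to q = z (x y)*
```

In this model the indirect cycle through `p` becomes a direct cycle on `q` once `p`'s alternative is inlined, which is why the direct-LR eliminator is enough afterwards.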

Replace the fixed two-slot consumed-token history (ctx.v1 / ctx.v2) with a growing stack ctx.v that records every token consumed, in order. Expose two helpers on ctx:
- ctx.mark(): returns the current v.length, an opaque position.
- ctx.rewind(mark): pops the tokens consumed since that mark and unshifts them onto the active lexer's pending-token queue (pnt.token, which already exists and is consulted by lex.next before running matchers). Also invalidates the lookahead buffer so parse_alts re-fetches fresh.

Legacy ctx.v1 / ctx.v2 keep working as getters on the top of the stack, so existing grammar code is unaffected.

The v-history push is moved to run *before* alt actions so ctx.rewind from inside an `a:` action sees the just-matched tokens on the stack. The lookahead-buffer shift stays at the end of process(), so non-action code paths behave identically.

No changes to the lexer or to the parse loop; only rules.ts / parser.ts / types.ts are touched.

Full suite: 385/385 tests pass (379 existing + 6 new in test/rewind.test.js covering mark/rewind semantics, partial rewind, no-op rewind, and close-state rewind).
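A rough model of the mark/rewind contract described above, as a standalone sketch. The `TokenHistory` class, its `pending` queue, and the reading of `v1`/`v2` as "last" and "second-to-last" consumed token are assumptions for illustration; the real implementation lives on jsonic's ctx and lexer and also resets the lookahead buffer, which this sketch omits.

```typescript
interface Token {
  src: string
}

// Hypothetical stand-in for ctx.v plus the lexer's pending-token queue.
class TokenHistory {
  v: Token[] = [] // every consumed token, oldest first
  pending: Token[] = [] // tokens to hand out again before lexing fresh input

  consume(tkn: Token): void {
    this.v.push(tkn)
  }

  // Opaque position: the number of tokens consumed so far.
  mark(): number {
    return this.v.length
  }

  // Pop everything consumed since `mark` and queue it, in order, for re-reading.
  rewind(mark: number): void {
    const undone = this.v.splice(mark)
    this.pending.unshift(...undone)
  }

  // Legacy-style accessors (assumed meaning: most recent and the one before it).
  get v1(): Token | undefined {
    return this.v[this.v.length - 1]
  }
  get v2(): Token | undefined {
    return this.v[this.v.length - 2]
  }
}

const history = new TokenHistory()
history.consume({ src: "a" })
const afterA = history.mark()
history.consume({ src: "b" })
history.consume({ src: "c" })
history.rewind(afterA) // "b" and "c" go back onto the pending queue, in that order
```

A rewind to the current mark is a no-op in this model because `splice` returns an empty array.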

Two changes to eliminateLeftRecursion that together cover the nullable-through-cycle case the previous commit documented as a silent failure:

1. Topologically sort productions before running Paull's: an edge A → B exists when A has an alternative whose leading element is a reference to B. Processing B before A means Paull's inner substitution loop (which only inlines A_j for j < i) actually reaches the nullable rule and can propagate ε into A's leading position.
2. Silently drop trivial `[P]` self-reference alternatives produced by nullable-prefix expansion. `P ::= P` adds nothing to the language, and an alt that reduces to exactly `[P]` after stripping the leading self-ref is semantically a no-op. The previous behaviour (throwing on a "trivial left-recursive alternative") rejected valid grammars that happened to contain a pure self-ref alt.

The caller's declared production order is restored after Paull's so the start rule still ends up first in the emitted spec.

End-to-end: the classic hidden-LR grammar

    <a> ::= <b> <a> | "x"
    <b> ::= "y" |

now parses `x`, `y x`, `y y x`, ... correctly and rejects `y`, `z`. Longer nullable chains (a → b → c, with c nullable) also work.

Adds two hidden-LR tests and adjusts two existing tests whose assertions depended on the rejected behaviour (trivial alts now silently drop rather than throw; the topo reorder means Paull's inlines earlier rules more aggressively in multi-production cases).
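The ordering in step 1 can be sketched as a depth-first post-order over leading references. The `Grammar` shape and the function name are hypothetical, and how the real pass breaks cycles is not stated in the commit; this sketch simply stops descending when it meets a rule already being visited.

```typescript
// Hypothetical grammar model: rule name -> alternatives, each a list of symbols.
type Alt = string[]
type Grammar = Map<string, Alt[]>

// Order rules so that a rule referenced in leading position comes before the
// rules that reference it (edge A -> B means "process B before A").
function leadingRefOrder(grammar: Grammar): string[] {
  const order: string[] = []
  const seen = new Set<string>()
  const visit = (name: string): void => {
    if (seen.has(name)) return // already placed, or part of a cycle in progress
    seen.add(name)
    for (const alt of grammar.get(name) ?? []) {
      const head = alt[0]
      if (head !== undefined && grammar.has(head)) visit(head)
    }
    order.push(name)
  }
  Array.from(grammar.keys()).forEach(visit)
  return order
}

// The hidden-LR grammar: a = b a | "x" ; b = "y" | (empty)
const hidden: Grammar = new Map([
  ["a", [["b", "a"], ['"x"']]],
  ["b", [['"y"'], []]],
])
const processingOrder = leadingRefOrder(hidden) // b is placed before a
```

Because `b` is processed first, a Paull-style pass that only inlines earlier rules gets the chance to substitute `b`'s empty alternative into `a`.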

Add `options.rewind.history`, a capacity cap on the consumed-token history used by ctx.rewind. Default is Infinity (unchanged behaviour); set a finite value to bound parse-time memory.

Implementation:
- New `cfg.rewind.history` flows from Options through utility.ts.
- rules.ts keeps pushing each consumed token onto ctx.v, then batch-evicts from the front once v.length exceeds 2*cap, pulling it back down to cap. Amortised O(1) per push regardless of capacity (a naive Array.shift() per overflow push would be O(cap)).
- mark/rewind now key off a new absolute counter ctx.vAbs (total pushed minus rewound) instead of ctx.v.length. That way, when the ring evicts old tokens from the front, outstanding marks captured within the retained window still rewind correctly.
- ctx.rewind(mark) throws a clear error if the mark has been evicted ("increase options.rewind.history").

Empirical: with history=64 on an input producing 20001 consumed tokens, ctx.v peaks at 128 entries (2*cap), vs 20001 for the unbounded default. For small inputs (fewer tokens than the cap), behaviour is identical to unbounded.

Adds four tests: cap enforcement, marks surviving eviction, rewind-too-deep error, and the default remaining unbounded. Full suite: 392/392 tests pass.
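The batch-eviction scheme and the absolute counter can be modelled in isolation. The class below is a sketch under assumptions: `BoundedHistory`, `size`, and the error wording are invented for illustration, and only the 2*cap eviction rule and the "total pushed minus rewound" counter come from the commit text.

```typescript
// Hypothetical model of a capped consumed-token history with absolute marks.
class BoundedHistory<T> {
  private v: T[] = []
  private vAbs = 0 // total pushed minus rewound
  private cap: number

  constructor(cap: number) {
    this.cap = cap
  }

  get size(): number {
    return this.v.length
  }

  push(item: T): void {
    this.v.push(item)
    this.vAbs++
    // Evict in one batch once the array exceeds 2*cap, back down to cap.
    // With cap = Infinity this condition never holds, so nothing is evicted.
    if (this.v.length > 2 * this.cap) {
      this.v.splice(0, this.v.length - this.cap)
    }
  }

  mark(): number {
    return this.vAbs
  }

  // Returns the items consumed since `mark`, oldest first.
  rewind(mark: number): T[] {
    const count = this.vAbs - mark
    if (count <= 0) return []
    if (count > this.v.length) {
      throw new Error("mark has been evicted; increase the history capacity")
    }
    this.vAbs = mark
    return this.v.splice(this.v.length - count)
  }
}

const bounded = new BoundedHistory<number>(64)
const staleMark = bounded.mark()
let peak = 0
for (let i = 0; i < 20001; i++) {
  bounded.push(i)
  peak = Math.max(peak, bounded.size)
}
const recentMark = bounded.mark()
bounded.push(-1)
bounded.push(-2)
const undone = bounded.rewind(recentMark) // the two items pushed after recentMark
```

Keying marks off the absolute counter is what lets `recentMark` survive earlier evictions, while `staleMark` can be detected as too old instead of silently rewinding to the wrong place.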

Cap the consumed-token retention at 64 by default so parses of large inputs don't accidentally pin every token in memory. Small parses (under the cap) are unaffected — every token is retained, identical to the unbounded behaviour. Grammars that need deeper rewinds can raise the limit (or set it to Infinity) via options.rewind.history. Updates the test that covered the old "default is unbounded" behaviour and adds a new test pinning that Infinity opts back in to unbounded retention. Full suite: 393/393 tests pass.

Drop the angle-bracket rule-name delimiters and change the definition operator from `::=` to `=`, matching ABNF (RFC 5234). References are now bare identifiers everywhere; quoted strings remain the only way to write a literal, and parentheses still group.

Changes are confined to the BNF parser grammar (bnfRules), the BNF-parser jsonic instance (drop #LT/#GT, remap #DEF), the test fixtures (.bnf files) and the inline grammars in tests. The downstream AST, left-recursion pass, desugar, and emitter are unchanged: the AST shape (name: string, refs by name) was already ABNF-compatible.

Production-boundary detection shrinks from a 4-token peek (#LT #TX #GT #DEF) to a 2-token peek (#TX #DEF); seq/prod rules updated accordingly.

Remaining BNF-isms (alternation `|`, optional `?`, repetition `*`/`+`, comment `#`) are unchanged in this stage; later stages replace them in turn. 393/393 tests pass.

Change the alternation operator to `/`, matching ABNF (RFC 5234). A side-effect is that the regex-terminal extension (`/pattern/flags`) had to be dropped: ABNF has no regex terminal, and the `/` delimiters would be ambiguous with the alternation operator. Numeric ranges (`%x30-39`, etc.) will take over that role in a later stage.

Changes:
- bnfRules grammar renames `#PIPE` (`|`) to `#ALT` (`/`), drops the `#RX` match-token matcher and the elem alt that recognised regex syntax.
- The internal AST `{ kind: 'regex', pattern, flags }` and the emitter's regex-token allocation stay in place; they'll be reused when `%x` ranges land.
- Fixtures (`greet.bnf`, `pair.bnf`, `arith.bnf`, `arith-leftrec.bnf`, `json-subset.bnf`) rewritten to use `/` alternation; regex-dependent multi-digit/letter terminals replaced with explicit `"0" / "1" / …` alternation or single-letter ruleset until the %x stage.
- Inline grammars in tests rewritten; the `regex terminals` describe block is dropped and marked for return after %x.
- Assertions that pinned an inlined seed alt to `kind: 'regex'` updated to `'term'` (the fixture now uses "1" instead of a regex digit class).

389/389 tests pass.

Replace the BNF postfix-question-mark optional with ABNF's bracket form. Drop the `#QM` (`?`) fixed token entirely; remap `#OB` / `#CB` from JSON's `{` / `}` to ABNF's `[` / `]`. Add an elem.open alt that pushes `alts` for `[`, an elem.close alt that consumes `]` (gated on `r.u.groupKind === 'opt'`), and a matching seq end-of-sequence guard for `]`. `r.u.group` becomes `r.u.groupKind` (`'group'` for parens, `'opt'` for brackets) so the close state knows which terminator to expect and how to wrap the result.

The internal AST shape for optional is unchanged: `{ kind: 'opt', inner: {kind:'group', …} }`, identical to what `(A)?` used to produce, so the desugar / emitter pipeline didn't need any changes.

Fixtures and inline tests updated to use the bracket form. The `(A)?` combination in the BNF source no longer exists; users wrap in brackets directly. Repetition postfixes `*` / `+` and comments `#` are still in their BNF form pending later stages. 389/389 tests pass.

Replace BNF postfix `A*` / `A+` with ABNF's prefix repetition:

    *A    = 0 or more        (was A*)
    1*A   = 1 or more        (was A+)
    m*nA  = between m and n  (NEW)
    *nA   = at most n        (NEW)
    m*A   = at least m       (NEW)
    nA    = exactly n        (NEW)

Add a #NUM match-token (decimal digits) and an explicit `atom` rule split out of `elem`. `elem` now parses an optional repetition prefix (NUM/'*' combinations) into r.u.min/r.u.max, pushes `atom` to consume the actual element body, then wraps the returned atom into the right AST node: `star` / `plus` / `opt` for the boundary cases, `rep` for general m*n.

The new `rep` AST kind desugars to a concatenation of `min` mandatory copies followed by either a star tail (`min, ∞`) or `max - min` nested optionals, staying within the existing emitter's vocabulary.

#PLUS (`+`) is dropped entirely; ABNF has no postfix repetition.

atom.close grew tcol-only `{ s: TOK, b: 1 }` declarations for every token that can legitimately follow an atom. Without these, the lexer would fall through to the default number-matcher and emit #NR for digits, leaving the enclosing seq.close unable to recognise `1` as the start of a repetition prefix.

Tests cover *A / 1*A / m*nA / nA / *nA. All 392 tests pass.
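The `rep` desugaring described above can be sketched as follows. The `El` node shapes and the `show` printer are hypothetical simplifications (the real AST has more kinds); only the rule "`min` mandatory copies, then a star tail or `max - min` nested optionals" is taken from the commit.

```typescript
// Hypothetical, reduced AST for illustration.
type El =
  | { kind: "ref"; name: string }
  | { kind: "seq"; items: El[] }
  | { kind: "star"; inner: El }
  | { kind: "opt"; inner: El }

// Desugar m*nA into: A repeated `min` times, then either *(A) or nested optionals.
function desugarRep(inner: El, min: number, max: number): El {
  if (max < min) throw new Error(`invalid repetition ${min}*${max}`)
  const items: El[] = []
  for (let i = 0; i < min; i++) items.push(inner)
  if (max === Infinity) {
    items.push({ kind: "star", inner })
  } else if (max > min) {
    // Build [A [A [A]]] from the innermost optional outwards.
    let tail: El = { kind: "opt", inner }
    for (let i = 1; i < max - min; i++) {
      tail = { kind: "opt", inner: { kind: "seq", items: [inner, tail] } }
    }
    items.push(tail)
  }
  return { kind: "seq", items }
}

// Print in ABNF-like notation so the result is easy to read.
function show(el: El): string {
  switch (el.kind) {
    case "ref":
      return el.name
    case "seq":
      return el.items.map(show).join(" ")
    case "star":
      return "*(" + show(el.inner) + ")"
    case "opt":
      return "[" + show(el.inner) + "]"
  }
}

const A: El = { kind: "ref", name: "A" }
const oneToThree = show(desugarRep(A, 1, 3)) // "A [A [A]]"
const twoOrMore = show(desugarRep(A, 2, Infinity)) // "A A *(A)"
```

Nesting the optionals, rather than emitting them side by side, keeps each later copy reachable only when the earlier one matched.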

Override the BNF-parser jsonic instance's comment definitions: remap `hash` (default `#`) to `;` and disable the `slash` and `multi` variants (we'd never want `//` or `/*` ambiguous with the ABNF `/` alternation operator).

All five fixtures and the inline comment tests now use `;`. This stage finishes the syntax migration of features the converter already supported. Remaining work is all additive: %x numeric ranges, case-insensitive strings, incremental alternatives, line continuation, and the RFC 5234 core rules library. 392/392 tests pass.

Add RFC 5234 numeric-value syntax as a new atom form:

    %xNN        single hex code point  (%x61 matches 'a')
    %dNN        decimal                (%d97 matches 'a')
    %bNN        binary                 (%b1100001 matches 'a')
    %xNN-NN     character range        (%x30-39 matches any digit)
    %xNN.NN.NN  concatenated points    (%x66.6f.6f matches "foo")

Implementation:
- New #NV match-token with a permissive hex regex; parseNumericValue() decodes the matched source based on the base prefix (x/d/b).
- Single values and concatenations produce a `{kind:'term', literal}` (one or many chars); ranges produce a `{kind:'regex', pattern}` over the corresponding `\\uNNNN-\\uNNNN` class.
- atom.open gains a `#NV` alt; atom.close grew a `{s:'#NV',b:1}` tcol-only declaration so the lexer emits #NV even when the rule state it falls into isn't explicitly expecting it.
- elem.open's repetition-prefix alts now end with a new `#ATOM` tokenset (#ST / #NV / #TX / #LP / #OB) plus `b:1`, so tcol at the atom-starter position includes every matcher tin. Without this the lexer would fall through to #TX on `1*%x30-39` and the downstream emitter would try to treat `%x30-39` as a rule name.

The `arith.bnf` / `arith-leftrec.bnf` fixtures collapse their single-digit `"0" / "1" / … / "9"` enumeration to `number = 1*DIGIT` with `DIGIT = %x30-39`.

Tests cover single values in each base, concatenation, ranges, and ranges under repetition. 397/397 tests pass.

Match RFC 5234's default: a bare quoted-string terminal `"foo"` accepts `foo`, `Foo`, `FOO`, `fOo` (every case-fold of the literal). New prefixes:

    %s"foo"  force a case-sensitive match
    %i"foo"  explicit case-insensitive (same as the bare form)

Implementation:
- BnfElement.term grows an optional `caseSensitive` flag; the default ABNF semantics kick in when it's omitted.
- #SS / #SI match-tokens lex `%s` and `%i` (gated by a `(?=")` lookahead so they don't steal the `%` from `%xNN` numeric values). atom.open has three string alts (%s / %i / plain) and stamps the flag appropriately.
- The emitter keys its literal-token map on (literal, effective-sensitivity) via `termKey(el)`, so a `%s"foo"` and a bare `"foo"` produce distinct tokens. Literals whose effective sensitivity is "sensitive" (either `%s` or no letters) emit as fixed tokens; insensitive literals emit as anchored `/^literal/i` regexes with a new `eager$` flag on the RegExp instance.

To make `eager$` meaningful, lexer.ts's tcol-gating skips the gate for matchers that carry the flag. Without it, the default text-matcher would eat `end` as #TX inside a narrower star-helper context before the outer rule could recognise it as its own #END token.

Tests cover: bare literal matching any case, `%s` forcing exact match, `%i` behaving identically to bare, literals with no letters, and sensitive / insensitive variants of the same string coexisting in one grammar. 402/402 tests pass.
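The split between fixed tokens and anchored case-insensitive regexes can be sketched like this. The function names, the `Matcher` union, and the `"s:"`/`"i:"` key format are assumptions for illustration; the sketch also leaves out the `eager$` flag, which is specific to jsonic's lexer.

```typescript
// Hypothetical result type: a fixed token string or an anchored regex.
type Matcher = { fixed: string } | { regex: RegExp }

// Escape characters that are special inside a regular expression.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
}

// "Sensitive" when %s was given, or when the literal has no ASCII letters
// (case folding cannot change what it matches).
function isSensitive(literal: string, caseSensitive?: boolean): boolean {
  return caseSensitive === true || !/[A-Za-z]/.test(literal)
}

// Key a literal by its text plus effective sensitivity (assumed key format).
function termKey(literal: string, caseSensitive?: boolean): string {
  return (isSensitive(literal, caseSensitive) ? "s:" : "i:") + literal
}

function literalMatcher(literal: string, caseSensitive?: boolean): Matcher {
  if (isSensitive(literal, caseSensitive)) return { fixed: literal }
  return { regex: new RegExp("^" + escapeRegExp(literal), "i") }
}

const bare = literalMatcher("foo") // anchored, case-insensitive regex
const exact = literalMatcher("foo", true) // fixed token, as for %s"foo"
const digits = literalMatcher("123") // fixed token: no letters to fold
```

Keying on effective sensitivity means `%s"foo"` and a bare `"foo"` get different keys, while a bare `"123"` and `%s"123"` share one.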

Add RFC 5234's `=/` operator: a production header of the form

    name =/ alt1 / alt2 / …

folds its alternatives into the production named `name` declared earlier. Parser flags each incremental occurrence on BnfProduction; a new `mergeIncrementals` post-parse pass walks the list and appends each incremental's `alts` onto the base (throwing if the base hasn't been seen yet).

Implementation details:
- New `#DEFA` fixed token for `=/`. Jsonic's longest-match-wins fixed matcher picks it over `#DEF` (`=`) automatically.
- prod.open gains a second alt for `#TX #DEFA`; prod.close and seq's end-guards also recognise `#TX #DEFA` as a production-boundary lookahead.
- BnfProduction grows an optional `incremental` flag; the merge pass drops it so the emitter only ever sees canonical productions.

Tests cover: multi-increment chaining (as in `command =/ "get" / "post"` split across three lines), a single-alt append, a multi-alt append, and the error on orphaned `=/`. 406/406 tests pass.
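A minimal sketch of what such a merge pass might look like. The `Production` shape is a simplification (alternatives as symbol lists) and the error text is invented; the behaviour it models, appending onto an earlier base and throwing on an orphan, is what the commit describes.

```typescript
// Hypothetical, simplified production record.
interface Production {
  name: string
  alts: string[][]
  incremental?: boolean
}

// Fold every `name =/ ...` production into the earlier `name = ...` production.
function mergeIncrementals(prods: Production[]): Production[] {
  const byName = new Map<string, Production>()
  const out: Production[] = []
  for (const p of prods) {
    if (p.incremental) {
      const base = byName.get(p.name)
      if (base === undefined) {
        throw new Error(`'=/' used before '${p.name}' was defined`)
      }
      base.alts.push(...p.alts)
    } else {
      // Copy so the input list is not mutated; the incremental flag is not carried over.
      const copy: Production = { name: p.name, alts: p.alts.slice() }
      byName.set(p.name, copy)
      out.push(copy)
    }
  }
  return out
}

const merged = mergeIncrementals([
  { name: "command", alts: [['"get"']] },
  { name: "command", alts: [['"post"']], incremental: true },
  { name: "command", alts: [['"delete"']], incremental: true },
])
// merged holds a single "command" production with three alternatives
```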

Document and pin the fact that rules can span multiple lines with no special parser support: newlines are IGNORE whitespace in the BNF-parser's lexer, and production boundaries are detected by the `name =` lookahead rather than by end-of-line. A rule like

    command = "get"
            / "post"
            / "delete"

parses as a single 3-alt production. No implementation change is needed; three tests pin the behaviour (wrapped alternation, wrapped sequence, and multiple wrapped rules coexisting). 409/409 tests pass.

Auto-include the RFC 5234 core rules (ALPHA, BIT, CHAR, CR, LF, CRLF, CTL, DIGIT, DQUOTE, HEXDIG, HTAB, OCTET, SP, VCHAR, WSP) when a user grammar references them but doesn't define them locally. Resolution is transitive (e.g. HEXDIG pulls in DIGIT), and a local definition of a core-rule name always wins over the built-in.

Implementation: a static CORE_RULES_ABNF string is parsed once on first use via the existing BNF parser; each user grammar gets scanned for references not satisfied by its own productions, and the needed subset of core rules is appended. No run-time cost when no core rule is referenced.

Fixtures simplified: `arith.bnf` and `arith-leftrec.bnf` now use `number = 1*DIGIT` (single core reference) instead of open-coding `%x30-39`.

Six new tests cover auto-inclusion, upper/lower ALPHA, transitive HEXDIG → DIGIT, BIT, local override, and the "no core added" negative case. 415/415 tests pass.
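The transitive resolution can be pictured as a small worklist over the core rules' own references. This is a sketch, not the converter's code: the real pass derives the references by parsing CORE_RULES_ABNF, whereas the `CORE_REFS` table and `neededCoreRules` below are hand-written stand-ins.

```typescript
// Which core rules each core rule refers to, per RFC 5234 Appendix B.1
// (restricted to the rules listed above).
const CORE_REFS: Record<string, string[]> = {
  ALPHA: [], BIT: [], CHAR: [], CR: [], LF: [], CRLF: ["CR", "LF"], CTL: [],
  DIGIT: [], DQUOTE: [], HEXDIG: ["DIGIT"], HTAB: [], OCTET: [], SP: [],
  VCHAR: [], WSP: ["SP", "HTAB"],
}

// Return the core rules to append, given the user's own rule names and the
// names the user grammar references.
function neededCoreRules(defined: string[], referenced: string[]): string[] {
  const local = new Set(defined)
  const needed: string[] = []
  const queue = referenced.slice()
  while (queue.length > 0) {
    const name = queue.shift() as string
    if (local.has(name) || needed.indexOf(name) >= 0) continue // local wins; no repeats
    if (!Object.prototype.hasOwnProperty.call(CORE_REFS, name)) continue // not a core rule
    needed.push(name)
    queue.push(...CORE_REFS[name])
  }
  return needed
}

const forHex = neededCoreRules(["color"], ["HEXDIG"]) // HEXDIG, then DIGIT
const withOverride = neededCoreRules(["color", "DIGIT"], ["HEXDIG", "DIGIT"]) // HEXDIG only
```

When nothing in `referenced` names a core rule, the result is empty and nothing is appended.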

New `test/grammar/rfc3986-uri.abnf` contains the Appendix A collected ABNF from RFC 3986: 27 user-authored productions covering URI, scheme, authority, userinfo, host (with IP-literal, IPv4address, reg-name alternatives), full IPv6address, path-abempty / absolute / rootless / empty, query, fragment, pct-encoded, unreserved, sub-delims, written in the same syntax the RFC prints. The converter transitively pulls in ALPHA, DIGIT, and HEXDIG from the core-rules library.

The only adjustment made to the RFC text is replacing `path-empty = 0<pchar>` (which uses prose-val notation outside the scope of ABNF literal interpretation) with `path-empty = ""`, semantically equivalent.

New `test/rfc3986.test.js` exercises the grammar in three parts:

1. Compilation: the full grammar builds without error, and every named production from the RFC survives into the emitted spec alongside the auto-included core rules.
2. URI acceptance: parses URIs that don't exercise the grammar's LL(k) ambiguities (`urn:isbn:0451450523`, `mailto:alice@example.com`, `tag:yaml.org,2002:int`, `http://[::1]/`) and rejects structurally-invalid inputs.
3. Known limitations: explicitly documents that URIs like `http://example.com` and `ftp://user@host` currently throw. The `authority` rule's `[ userinfo "@" ]` optional is the root cause: userinfo and reg-name share a leading character set, and the converter's static FIRST-set dispatcher can't peek past the userinfo body to decide whether the trailing `@` is there. ctx.rewind, shipped earlier in this project, is the primitive a future emitter change could use to close that gap.

Full suite: 425/425 tests pass.

The BNF converter now detects optional-prefix ambiguities of the form `[X D] Y` where X and Y share a character vocabulary and D is a terminal disambiguator, and emits a probe + phase-retry dispatcher using only standard jsonic primitives (r:, k:, c:, ctx.mark/rewind/t). On probe return the decide action peeks ctx.t[0], rewinds to the recorded mark, and retries the dispatcher with the committed phase — RFC 3986's authority ambiguity (userinfo vs reg-name) now parses all URI shapes (http://example.com, ftp://user@host, etc). Also fixes two latent jsonic bugs the new pattern exposed: ctx.rewind now re-queues pre-lexed lookahead and clears the cached end-of-source token, and the alt bitset check no longer ORs bitAA into partitions other than 0 (it only belongs where #AA itself lives). https://claude.ai/code/session_01F3V3mSpQonb9XHG5tr2265

When an alt's position lists #AA (ANY), drop the per-partition bitset to the existing null sentinel so the parse_alts matcher skips the bitset check entirely. Previously the alt's S[i] was sized from AA's own tin (=4, partition 0), so tokens with tin >= 31 hit an undefined higher partition and silently failed to match. The null-S[i] path was already the "no constraint" contract in the matcher; this just routes the wildcard through it. tcol collation is unaffected — t[i] still carries the raw tin list. https://claude.ai/code/session_01F3V3mSpQonb9XHG5tr2265
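The failure mode described above can be reproduced with a toy partitioned bitset. Everything here is an assumption made for illustration (31 tins per partition, the `makeSet`/`matches` helpers, and the `null` wildcard); it only mirrors the shape of the bug, not jsonic's actual matcher.

```typescript
// Hypothetical partitioned bitset: tin t lives in partition floor(t / 31), bit t % 31.
// `null` stands for "no constraint at this position".
type TinSet = number[] | null

const BITS_PER_PARTITION = 31 // assumed partition width

function makeSet(tins: number[]): number[] {
  const parts: number[] = []
  for (const tin of tins) {
    const p = Math.floor(tin / BITS_PER_PARTITION)
    while (parts.length <= p) parts.push(0)
    parts[p] |= 1 << (tin % BITS_PER_PARTITION)
  }
  return parts
}

function matches(set: TinSet, tin: number): boolean {
  if (set === null) return true // wildcard: skip the bitset check entirely
  const part = set[Math.floor(tin / BITS_PER_PARTITION)] ?? 0 // missing partition = no bits
  return (part & (1 << (tin % BITS_PER_PARTITION))) !== 0
}

const AA = 4 // tin of the ANY token, per the commit message
const sizedFromAA = makeSet([AA]) // only partition 0 exists
const missesHighTin = matches(sizedFromAA, 40) // false: tin 40 falls in a partition that was never built
const wildcardHit = matches(null, 40) // true: the null sentinel imposes no constraint
```

Routing the wildcard through the `null` path avoids having to size the set for every partition a future token might land in.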

Every user-declared rule now produces an AST node `{rule, src, kids}` where `rule` is the grammar-rule name, `src` is the matched source text (concatenated), and `kids` is the array of child user-rule nodes. Core rules (ALPHA, DIGIT, HEXDIG, …) and synthesised helpers (desugar / dispatcher / chain-step) are transparent: their `src` appends to the enclosing user rule and their `kids` extend it. For grammars without leading-ref inlining the tree mirrors the ABNF almost exactly — an RFC 3986 URI parse surfaces `hier-part`, `authority`, `host`, `path-abempty`, etc. as named nodes. Leading refs at the head of a production alt still get flattened by Paull's substitution; documenting this as a current limitation. Also fixes a jsonic bug the tree-builder surfaced: `r:` rule replacement wasn't updating the parent's `child` pointer, so the parent's close-state actions saw the stale pre-replacement rule instead of the replacement. Rewound the AST tree for probe-dispatch paths like authority's. https://claude.ai/code/session_01F3V3mSpQonb9XHG5tr2265
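The folding rule for transparent rules can be sketched over a simplified parse trace. The `Raw` input shape and the function names are hypothetical; only the output shape `{rule, src, kids}` and the "transparent rules append their `src` and extend `kids`" behaviour come from the description above.

```typescript
// Output node shape described in the commit.
interface AstNode {
  rule: string
  src: string
  kids: AstNode[]
}

// Hypothetical input: a matched token, or a matched rule with its parts in order.
interface RuleMatch {
  rule: string
  transparent: boolean // core rules and synthesised helpers
  parts: Raw[]
}
type Raw = { text: string } | RuleMatch

// Fold matched parts into `into`: text extends src, transparent rules dissolve
// into their parent, user rules become child nodes.
function fold(parts: Raw[], into: AstNode): void {
  for (const part of parts) {
    if ("text" in part) {
      into.src += part.text
    } else if (part.transparent) {
      fold(part.parts, into)
    } else {
      const kid = toAst(part)
      into.src += kid.src
      into.kids.push(kid)
    }
  }
}

function toAst(match: RuleMatch): AstNode {
  const node: AstNode = { rule: match.rule, src: "", kids: [] }
  fold(match.parts, node)
  return node
}

const digit = (d: string): Raw => ({ rule: "DIGIT", transparent: true, parts: [{ text: d }] })
const num = (...ds: string[]): Raw => ({ rule: "number", transparent: false, parts: ds.map(digit) })
const tree = toAst({
  rule: "sum",
  transparent: false,
  parts: [num("4", "2"), { text: "+" }, num("7")],
})
// tree.src is "42+7"; tree.kids are the two "number" nodes, with no DIGIT nodes
```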

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2648edde42

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-21T10:32:01Z

        (tokenMatcher as any).tin$ &&
+        !(tokenMatcher as any).eager$ &&
        !rule.spec.def.tcol[oc][tI].includes((tokenMatcher as any).tin$)
      ) {


Prevent eager matchers from preempting expected literals

By skipping the tcol check for eager$ matchers, the lexer now accepts the first regex token that matches even when that token is not valid in the current rule position. Because token matchers are iterated by tin order and return on first match, a shorter case-insensitive literal can consume input before a longer literal that is actually expected (e.g. grammar contains both "A" and "AB", with "A" allocated first, and start rule expects only "AB"; parsing ab fails). This introduces false parse errors for valid ABNF grammars with prefix-overlapping literals.
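The scenario in this comment can be reproduced with a toy first-match lexer. This is a model of the concern, not jsonic's lexer: the `Matcher` record, `firstMatch`, and the `eagerBypass` switch are hypothetical, and the matchers are assumed to be tried in tin order.

```typescript
// Hypothetical matcher record.
interface Matcher {
  tin: number
  name: string
  re: RegExp
  eager: boolean
}

// Try matchers in order and return the first token name that matches.
// A matcher is skipped when its tin is not expected, unless the eager bypass applies.
function firstMatch(
  matchers: Matcher[],
  src: string,
  expected: number[],
  eagerBypass: boolean,
): string | undefined {
  for (const m of matchers) {
    const gated = expected.indexOf(m.tin) < 0 && !(eagerBypass && m.eager)
    if (gated) continue
    if (m.re.test(src)) return m.name
  }
  return undefined
}

// Case-insensitive literals "A" and "AB", with "A" allocated first.
const matchers: Matcher[] = [
  { tin: 1, name: "#A", re: /^a/i, eager: true },
  { tin: 2, name: "#AB", re: /^ab/i, eager: true },
]

// The rule position expects only "AB" (tin 2), and the input is "ab".
const withBypass = firstMatch(matchers, "ab", [2], true) // "#A": the shorter literal wins
const withGate = firstMatch(matchers, "ab", [2], false) // "#AB": the gate filters "#A" out
```

One possible mitigation, stated here only as an idea, would be to try eager matchers longest-literal-first, or to keep the gate and bypass it only when no expected matcher matches.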


chatgpt-codex-connector · 2026-04-21T10:32:01Z

+  const parts = body.split('.')
+  const chars = parts.map((n) => String.fromCharCode(parseInt(n, radix)))
+  return { kind: 'term', literal: chars.join('') }


Validate numeric-value digits against the declared base

parseNumericValue uses parseInt directly on each numeric chunk, but parseInt accepts partial strings, so invalid ABNF such as %d6A or %b2 is silently accepted and compiled into unintended characters/ranges instead of being rejected. That produces grammars that parse the wrong language and makes numeric-value mistakes hard to diagnose.
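One way to address this, sketched under assumptions: check each chunk against a per-base digit pattern before calling `parseInt`. The function name `parseNumericValueStrict`, the `BASES` table, the error messages, and the exact `[\uNNNN-\uNNNN]` pattern form are hypothetical; the result shapes follow the `{kind:'term'}` / `{kind:'regex'}` nodes the commit message describes. The sketch assumes code points within the Basic Multilingual Plane.

```typescript
type NumericValue =
  | { kind: "term"; literal: string }
  | { kind: "regex"; pattern: string }

// Radix and the digits allowed for each ABNF base prefix.
const BASES: Record<string, { radix: number; digits: RegExp }> = {
  x: { radix: 16, digits: /^[0-9A-Fa-f]+$/ },
  d: { radix: 10, digits: /^[0-9]+$/ },
  b: { radix: 2, digits: /^[01]+$/ },
}

function parseNumericValueStrict(src: string): NumericValue {
  const m = /^%([xdbXDB])(.*)$/.exec(src)
  if (m === null) throw new Error("not a numeric value: " + src)
  const base = BASES[m[1].toLowerCase()]
  const body = m[2]

  // Reject chunks with digits outside the base instead of letting parseInt
  // silently stop at the first invalid character.
  const decode = (chunk: string): number => {
    if (!base.digits.test(chunk)) {
      throw new Error(`invalid base-${base.radix} value "${chunk}" in ${src}`)
    }
    return parseInt(chunk, base.radix)
  }

  if (body.indexOf("-") >= 0) {
    const bounds = body.split("-")
    if (bounds.length !== 2) throw new Error("malformed range: " + src)
    const lo = decode(bounds[0])
    const hi = decode(bounds[1])
    if (lo > hi) throw new Error("range bounds out of order: " + src)
    const u = (n: number): string => "\\u" + ("0000" + n.toString(16)).slice(-4)
    return { kind: "regex", pattern: `[${u(lo)}-${u(hi)}]` }
  }

  const chars = body.split(".").map((chunk) => String.fromCharCode(decode(chunk)))
  return { kind: "term", literal: chars.join("") }
}

const foo = parseNumericValueStrict("%x66.6f.6f") // term with literal "foo"
const digitRange = parseNumericValueStrict("%x30-39") // regex class covering 0 through 9
```

With this check, `%d6A` and `%b2` raise an error at grammar-compile time rather than compiling to an unintended character.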


claude added 20 commits April 20, 2026 10:23

rjrodger merged commit 069a2c8 into main Apr 21, 2026
6 of 7 checks passed

chatgpt-codex-connector Bot reviewed Apr 21, 2026

rjrodger merged 20 commits into main from claude/bnf-jsonicjs-converter-NAurn

2 participants