
Claude/bnf jsonicjs converter n aurn#72

Merged
rjrodger merged 20 commits into main from claude/bnf-jsonicjs-converter-NAurn
Apr 21, 2026


@rjrodger
Collaborator

No description provided.

claude added 20 commits April 20, 2026 10:23
Pin the desired behaviour for the runtime-guard handling of
indirect left-recursive cycles before implementing it. Seven new
tests currently fail (legal inputs produce `[]` or throw; the
eighth, a simple rejection, happens to reject for the wrong
reason); the next commit adds the k+sI runtime guard that makes
them pass.

Cases covered:
- two-rule cycle (p → q x ; q → p y | z) with shortest, one-step,
  and two-step derivations, plus three rejection cases,
- three-rule cycle (a → b 1 ; b → c 2 ; c → a 3 | x) with
  shortest and one-step derivations plus a premature-stop
  rejection,
- indirect cycle alongside a production that also has a direct
  seed alternative.

Every test carries an explicit 2s timeout so a runaway parse
surfaces as a test failure rather than hanging CI.

Extend the left-recursion eliminator to Paull's algorithm: order
the productions and, for each A_i, inline any leading reference to
A_j (j < i) into A_i's alternatives before eliminating direct left
recursion. After the pass, every cycle in the first-position graph
has been collapsed to a direct cycle and then rewritten to the
iterative form.
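
The ordering-and-inlining step described above can be sketched as follows. This is a minimal illustration, not the converter's `eliminateLeftRecursion`: the `Grammar` shape, the function names, and the `'` suffix for the helper rule are all assumptions made for the example, and the helper rule is written as a right-recursive tail rather than the star helper the converter emits.

```typescript
// Assumed shape: rule name -> alternatives, each a sequence of symbols.
// A symbol that is a key of the grammar is a rule reference; [] stands for ε.
type Grammar = Record<string, string[][]>

// Rewrite  A -> A α | β   into   A -> β A' ;  A' -> α A' | ε.
function eliminateDirect(g: Grammar, rule: string): void {
  const alts = g[rule]
  const recursive = alts
    .filter((a) => a[0] === rule)
    .map((a) => a.slice(1))
    .filter((a) => a.length > 0) // a bare self-reference adds nothing
  const seeds = alts.filter((a) => a[0] !== rule)
  if (recursive.length === 0) {
    g[rule] = seeds
    return
  }
  const tail = rule + "'" // hypothetical helper-rule name
  g[rule] = seeds.map((b) => [...b, tail])
  g[tail] = [...recursive.map((a) => [...a, tail]), []]
}

// Paull's algorithm: for each A_i, inline leading references to earlier
// rules A_j (j < i), then remove the direct left recursion that remains.
function eliminateLeftRecursion(input: Grammar): Grammar {
  const g: Grammar = {}
  const order = Object.keys(input)
  for (const rule of order) g[rule] = input[rule].map((a) => [...a])
  order.forEach((ai, i) => {
    for (const aj of order.slice(0, i)) {
      const expanded: string[][] = []
      for (const alt of g[ai]) {
        if (alt[0] === aj) {
          for (const sub of g[aj]) expanded.push([...sub, ...alt.slice(1)])
        } else {
          expanded.push(alt)
        }
      }
      g[ai] = expanded
    }
    eliminateDirect(g, ai)
  })
  return g
}
```

For the two-rule cycle `p → q x ; q → p y | z`, the sketch leaves `p` unchanged and rewrites `q` to `z q'` with `q' → x y q' | ε`, which corresponds to the `q = z (x y)*` form mentioned in this commit.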

The inlining moves cycle work into the star-helper patterns
already handled by the direct-LR eliminator, which revealed a
pre-existing LL(1) dispatch limitation: the star helper's
dispatcher only peeked one token, so an iteration's first terminal
collided with the FOLLOW context (the `x` in `p = q x` and the
next `(x y)` iteration inside `q = z (x y)*`).

Generalise the dispatcher to enumerate concrete token prefixes
through the grammar (`altPrefixes`), fanning out one dispatch
entry per distinct path up to four tokens deep. Cycles freeze the
path at the tokens accumulated so far (a `done` flag propagates
out of recursive calls so a truncated sub-prefix isn't silently
extended with tokens from the outer alt's trailing elements).

Result: the failing tests committed in the previous change now
pass — two-rule and three-rule cycles parse their shortest seed
through several unfoldings, rejections are rejected, and an
indirect cycle alongside a direct seed resolves to the right
alternative on each input. 379/379 tests pass, no existing grammar
regresses.

Replace the fixed two-slot consumed-token history (ctx.v1 / ctx.v2)
with a growing stack ctx.v that records every token consumed, in
order. Expose two helpers on ctx:

- ctx.mark() — returns the current v.length, an opaque position.
- ctx.rewind(mark) — pops the tokens consumed since that mark and
  unshifts them onto the active lexer's pending-token queue
  (pnt.token, which already exists and is consulted by lex.next
  before running matchers). Also invalidates the lookahead buffer
  so parse_alts re-fetches fresh.

Legacy ctx.v1 / ctx.v2 keep working as getters on the top of the
stack, so existing grammar code is unaffected.
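
The mark/rewind contract described above can be modelled with a toy token stream. This is a sketch under assumptions, not jsonic's code: `RewindableStream` and `Tok` are hypothetical names, the pending queue stands in for `pnt.token`, and lookahead-buffer invalidation is left out.

```typescript
type Tok = { src: string }

class RewindableStream {
  private pending: Tok[] = [] // pushed-back tokens, served before the source
  private v: Tok[] = []       // every token consumed so far, in order
  private pos = 0

  constructor(private source: Tok[]) {}

  // Consume the next token, preferring anything that was pushed back.
  next(): Tok | undefined {
    const tkn =
      this.pending.length > 0
        ? this.pending.shift()
        : this.pos < this.source.length
          ? this.source[this.pos++]
          : undefined
    if (tkn) this.v.push(tkn)
    return tkn
  }

  // An opaque position: the current length of the consumed-token stack.
  mark(): number {
    return this.v.length
  }

  // Pop everything consumed since the mark and queue it for re-reading.
  rewind(mark: number): void {
    const popped = this.v.splice(mark)
    this.pending.unshift(...popped)
  }
}
```

Usage: after consuming `a`, taking a mark, consuming `b` and `c`, and rewinding to the mark, the next two reads return `b` and `c` again.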

The v-history push is moved to run *before* alt actions so
ctx.rewind from inside an `a:` action sees the just-matched
tokens on the stack. The lookahead-buffer shift stays at the end
of process(), so non-action code paths behave identically.

No changes to the lexer or to the parse loop — only rules.ts /
parser.ts / types.ts. Full suite: 385/385 tests pass (379 existing
+ 6 new in test/rewind.test.js covering mark/rewind semantics,
partial rewind, no-op rewind, and close-state rewind).

Two changes to eliminateLeftRecursion that together cover the
nullable-through-cycle case the previous commit documented as a
silent-failure:

1. Topologically sort productions before running Paull's: an edge
   A → B exists when A has an alternative whose leading element is
   a reference to B. Processing B before A means Paull's inner
   substitution loop (which only inlines A_j for j < i) actually
   reaches the nullable rule and can propagate ε into A's leading
   position.

2. Silently drop trivial `[P]` self-reference alternatives produced
   by nullable-prefix expansion. `P ::= P` adds nothing to the
   language, and an alt that reduces to exactly `[P]` after
   stripping the leading self-ref is semantically a no-op. The
   previous behaviour — throwing on a "trivial left-recursive
   alternative" — rejected valid grammars that happened to contain
   a pure self-ref alt.

The caller's declared production order is restored after Paull's so
the start rule still ends up first in the emitted spec.
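
The ordering in step 1 can be sketched as a depth-first post-order over the leading-reference edges. The `Grammar` shape and the name `orderByLeadingRefs` are assumptions for illustration; the real pass may break cycles differently.

```typescript
// Assumed shape: rule name -> alternatives, each a sequence of symbols.
// Terminals are written with their quotes so they never collide with rule names.
type Grammar = Record<string, string[][]>

// Return rule names so that B precedes A whenever some alternative of A
// starts with a reference to B. Cycles are cut at the first revisit.
function orderByLeadingRefs(g: Grammar): string[] {
  const order: string[] = []
  const seen = new Set<string>()
  const visit = (rule: string): void => {
    if (seen.has(rule)) return
    seen.add(rule)
    for (const alt of g[rule]) {
      const lead = alt[0]
      const isRule = lead !== undefined && Object.prototype.hasOwnProperty.call(g, lead)
      if (isRule && lead !== rule) visit(lead)
    }
    order.push(rule) // post-order: dependencies land first
  }
  for (const rule of Object.keys(g)) visit(rule)
  return order
}
```

For the hidden-LR grammar in this commit (`a` leads with `b`, and `b` is nullable), the sketch yields `b` before `a`, which is the order Paull's substitution needs. Restoring the declared order afterwards is a separate step not shown here.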

End-to-end: the classic hidden-LR grammar
    <a> ::= <b> <a> | "x"
    <b> ::= "y" |
now parses `x`, `y x`, `y y x`, ... correctly and rejects `y`, `z`.
Longer nullable chains (a → b → c, with c nullable) also work.

Adds two hidden-LR tests and adjusts two existing tests whose
assertions depended on the rejected behaviour (trivial alts now
silently drop rather than throw; the topo reorder means Paull's
inlines earlier rules more aggressively in multi-production cases).

Add `options.rewind.history`, a capacity cap on the consumed-token
history used by ctx.rewind. Default is Infinity (unchanged
behaviour); set a finite value to bound parse-time memory.

Implementation:

- New `cfg.rewind.history` flows from Options through utility.ts.
- rules.ts keeps pushing each consumed token onto ctx.v, then
  batch-evicts from the front once v.length exceeds 2*cap, pulling
  it back down to cap. Amortised O(1) per push regardless of
  capacity (a naive Array.shift() per overflow push would be O(cap)).
- mark/rewind now key off a new absolute counter ctx.vAbs (total
  pushed minus rewound) instead of ctx.v.length. That way, when
  the ring evicts old tokens from the front, outstanding marks
  captured within the retained window still rewind correctly.
- ctx.rewind(mark) throws a clear error if the mark has been
  evicted ("increase options.rewind.history").

Empirical: with history=64 on an input producing 20001 consumed
tokens, ctx.v peaks at 128 entries (2*cap), vs 20001 for the
unbounded default. For small inputs (fewer tokens than the cap),
behaviour is identical to unbounded.
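
The batch eviction and the absolute counter can be illustrated with a small standalone class. This is a sketch assuming the behaviour described above; `TokenHistory` and its `peak` field are hypothetical, and only `vAbs` and the error hint mirror names from the commit message.

```typescript
class TokenHistory<T> {
  private v: T[] = []
  private vAbs = 0 // absolute position: total pushed minus rewound
  peak = 0         // largest retained length seen after a push (for the demo)

  constructor(private cap: number = Infinity) {}

  push(tkn: T): void {
    this.v.push(tkn)
    this.vAbs++
    // Evict in one batch once the array exceeds 2*cap, back down to cap.
    if (this.v.length > 2 * this.cap) {
      this.v.splice(0, this.v.length - this.cap)
    }
    this.peak = Math.max(this.peak, this.v.length)
  }

  mark(): number {
    return this.vAbs
  }

  // Returns the tokens consumed since the mark, oldest first.
  rewind(mark: number): T[] {
    const count = this.vAbs - mark
    if (count < 0 || count > this.v.length) {
      throw new Error('rewind mark evicted: increase options.rewind.history')
    }
    const popped = this.v.splice(this.v.length - count, count)
    this.vAbs = mark
    return popped
  }
}
```

Because marks are absolute counts rather than array indices, a mark taken inside the retained window still identifies the right tokens after older entries have been evicted from the front.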

Adds four tests: cap enforcement, marks surviving eviction,
rewind-too-deep error, and the default remaining unbounded.
Full suite: 392/392 tests pass.

Cap the consumed-token retention at 64 by default so parses of
large inputs don't accidentally pin every token in memory. Small
parses (under the cap) are unaffected — every token is retained,
identical to the unbounded behaviour. Grammars that need deeper
rewinds can raise the limit (or set it to Infinity) via
options.rewind.history.

Updates the test that covered the old "default is unbounded"
behaviour and adds a new test pinning that Infinity opts back in
to unbounded retention. Full suite: 393/393 tests pass.

Drop the angle-bracket rule-name delimiters and change the
definition operator from `::=` to `=`, matching ABNF (RFC 5234).
References are now bare identifiers everywhere — quoted strings
remain the only way to write a literal, and parentheses still
group.

Changes are confined to the BNF parser grammar (bnfRules), the
BNF-parser jsonic instance (drop #LT/#GT, remap #DEF), the
test fixtures (.bnf files) and the inline grammars in tests. The
downstream AST, left-recursion pass, desugar, and emitter are
unchanged — the AST shape (name: string, refs by name) was
already ABNF-compatible.

Production-boundary detection shrinks from a 4-token peek
(#LT #TX #GT #DEF) to a 2-token peek (#TX #DEF); seq/prod rules
updated accordingly. Remaining BNF-isms (alternation `|`,
optional `?`, repetition `*`/`+`, comment `#`) are
unchanged in this stage; later stages replace them in turn.

393/393 tests pass.

Change the alternation operator to `/`, matching ABNF (RFC 5234).
A side-effect is that the regex-terminal extension
(`/pattern/flags`) had to be dropped: ABNF has no regex terminal, and
the `/` delimiters would be ambiguous with the alternation
operator. Numeric ranges (`%x30-39`, etc.) will take over that
role in a later stage.

Changes:

- bnfRules grammar renames `#PIPE` (`|`) to `#ALT` (`/`), and
  drops the `#RX` match-token matcher and the elem alt that
  recognised regex syntax.
- The internal AST `{ kind: 'regex', pattern, flags }` and the
  emitter's regex-token allocation stay in place; they'll be
  reused when `%x` ranges land.
- Fixtures (`greet.bnf`, `pair.bnf`, `arith.bnf`,
  `arith-leftrec.bnf`, `json-subset.bnf`) rewritten to use
  `/` alternation; regex-dependent multi-digit/letter
  terminals replaced with explicit `"0" / "1" / …` alternation
  or single-letter ruleset until the %x stage.
- Inline grammars in tests rewritten; the `regex terminals`
  describe block is dropped and marked for return after %x.
- Assertions that pinned an inlined seed alt to `kind: 'regex'`
  updated to `'term'` (the fixture now uses "1" instead of a
  regex digit class).

389/389 tests pass.

Replace the BNF postfix-question-mark optional with ABNF's bracket
form. Drop the `#QM` (`?`) fixed token entirely; remap `#OB` /
`#CB` from JSON's `{` / `}` to ABNF's `[` / `]`. Add an
elem.open alt that pushes `alts` for `[`, an elem.close alt that
consumes `]` (gated on `r.u.groupKind === 'opt'`), and a
matching seq end-of-sequence guard for `]`.

`r.u.group` becomes `r.u.groupKind` (`'group'` for parens,
`'opt'` for brackets) so the close state knows which terminator
to expect and how to wrap the result. The internal AST shape for
optional is unchanged: `{ kind: 'opt', inner: {kind:'group', …} }`,
identical to what `(A)?` used to produce, so the desugar /
emitter pipeline didn't need any changes.

Fixtures and inline tests updated to use the bracket form. The
`(A)?` combination in the BNF source no longer exists; users wrap
in brackets directly. Repetition postfixes `*` / `+` and
comments `#` are still in their BNF form pending later stages.

389/389 tests pass.

Replace BNF postfix `A*` / `A+` with ABNF's prefix repetition:

    *A      = 0 or more (was A*)
    1*A     = 1 or more (was A+)
    m*nA    = between m and n (NEW)
    *nA     = at most n (NEW)
    m*A     = at least m (NEW)
    nA      = exactly n (NEW)

Add a #NUM match-token (decimal digits) and an explicit `atom`
rule split out of `elem`. `elem` now parses an optional
repetition prefix (NUM/'*' combinations) into r.u.min/r.u.max,
pushes `atom` to consume the actual element body, then wraps the
returned atom into the right AST node: `star` / `plus` /
`opt` for the boundary cases, `rep` for general m*n.

The new `rep` AST kind desugars to a concatenation of `min`
mandatory copies followed by either a star tail (`min, ∞`) or
`max - min` nested optionals, staying within the existing
emitter's vocabulary.
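
The prefix forms and the `rep` desugaring can be sketched together. The element shapes, `parseRepPrefix`, `desugarRep`, and `show` are hypothetical names for this illustration; the converter's own AST and its `star` / `plus` / `opt` shortcuts for the boundary cases are not reproduced here.

```typescript
type El =
  | { kind: 'ref'; name: string }
  | { kind: 'seq'; items: El[] }
  | { kind: 'star'; inner: El }
  | { kind: 'opt'; inner: El }

// Decode an ABNF repetition prefix: "*", "1*", "2*4", "*4", "3".
function parseRepPrefix(prefix: string): { min: number; max: number } | null {
  const m = /^(\d*)(\*?)(\d*)$/.exec(prefix)
  if (!m || prefix === '') return null
  const [, lo, star, hi] = m
  if (star === '') {
    const n = parseInt(lo, 10) // "nA": exactly n
    return { min: n, max: n }
  }
  return {
    min: lo === '' ? 0 : parseInt(lo, 10),
    max: hi === '' ? Infinity : parseInt(hi, 10),
  }
}

// min mandatory copies, then a star tail (unbounded) or nested optionals.
function desugarRep(inner: El, min: number, max: number): El {
  const items: El[] = []
  for (let i = 0; i < min; i++) items.push(inner)
  if (max === Infinity) {
    items.push({ kind: 'star', inner })
  } else {
    let tail: El | null = null
    for (let i = 0; i < max - min; i++) {
      tail = { kind: 'opt', inner: tail ? { kind: 'seq', items: [inner, tail] } : inner }
    }
    if (tail) items.push(tail)
  }
  return { kind: 'seq', items }
}

// Compact rendering for inspection: seq as (..), opt as [..], star as x*.
function show(el: El): string {
  switch (el.kind) {
    case 'ref': return el.name
    case 'seq': return '(' + el.items.map(show).join(' ') + ')'
    case 'star': return show(el.inner) + '*'
    case 'opt': return '[' + show(el.inner) + ']'
  }
}
```

With these definitions `2*4A` renders as `(A A [(A [A])])`: two mandatory copies followed by two nested optionals, so a third copy is required before a fourth can match.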

#PLUS (`+`) is dropped entirely; ABNF has no postfix repetition.

atom.close grew tcol-only `{ s: TOK, b: 1 }` declarations for
every token that can legitimately follow an atom. Without these,
the lexer would fall through to the default number-matcher and
emit #NR for digits, leaving the enclosing seq.close unable to
recognise `1` as the start of a repetition prefix.

Tests cover *A / 1*A / m*nA / nA / *nA. All 392 tests pass.

Override the BNF-parser jsonic instance's comment definitions: remap
`hash` (default `#`) to `;` and disable the `slash` and
`multi` variants (we'd never want `//` or `/*` ambiguous with
the ABNF `/` alternation operator).

All five fixtures and the inline comment tests now use `;`. This
stage finishes the syntax migration of features the converter
already supported. Remaining work is all additive: %x numeric
ranges, case-insensitive strings, incremental alternatives, line
continuation, and the RFC 5234 core rules library.

392/392 tests pass.

Add RFC 5234 numeric-value syntax as a new atom form:

    %xNN         single hex code point   (%x61 matches 'a')
    %dNN         decimal                 (%d97 matches 'a')
    %bNN         binary                  (%b1100001 matches 'a')
    %xNN-NN      character range         (%x30-39 matches any digit)
    %xNN.NN.NN   concatenated points     (%x66.6f.6f matches "foo")

Implementation:

- New #NV match-token with a permissive hex regex;
  parseNumericValue() decodes the matched source based on the base
  prefix (x/d/b).
- Single values and concatenations produce a `{kind:'term', literal}`
  (one or many chars); ranges produce a `{kind:'regex', pattern}`
  over the corresponding `\uNNNN-\uNNNN` class.
- atom.open gains a `#NV` alt; atom.close grew a `{s:'#NV',b:1}`
  tcol-only declaration so the lexer emits #NV even when the rule
  state it falls into isn't explicitly expecting it.
- elem.open's repetition-prefix alts now end with a new `#ATOM`
  tokenset (#ST / #NV / #TX / #LP / #OB) plus `b:1`, so tcol at the
  atom-starter position includes every matcher tin — without this
  the lexer would fall through to #TX on `1*%x30-39` and the
  downstream emitter would try to treat `%x30-39` as a rule name.

The `arith.bnf` / `arith-leftrec.bnf` fixtures collapse their
single-digit `"0" / "1" / … / "9"` enumeration to `number =
1*DIGIT` with `DIGIT = %x30-39`. Tests cover single values in
each base, concatenation, ranges, and ranges under repetition.
397/397 tests pass.

Match RFC 5234's default: a bare quoted-string terminal `"foo"`
accepts `foo`, `Foo`, `FOO`, `fOo` (every case-fold of
the literal). New prefixes:

    %s"foo"    force a case-sensitive match
    %i"foo"    explicit case-insensitive (same as the bare form)

Implementation:

- BnfElement.term grows an optional `caseSensitive` flag; the
  default ABNF semantics kick in when it's omitted.
- #SS / #SI match-tokens lex `%s` and `%i` (gated by a `(?=")`
  lookahead so they don't steal the `%` from `%xNN` numeric
  values). atom.open has three string alts (%s / %i / plain) and
  stamps the flag appropriately.
- The emitter keys its literal-token map on (literal,
  effective-sensitivity), via `termKey(el)`, so a `%s"foo"` and a
  bare `"foo"` produce distinct tokens. Literals whose effective
  sensitivity is "sensitive" (either `%s` or no letters) emit as
  fixed tokens; insensitive literals emit as anchored
  `/^literal/i` regexes with a new `eager$` flag on the
  RegExp instance.
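
The effective-sensitivity rule in the last bullet can be sketched like this. The `Term` shape, the key format, and `toMatcher` are assumptions for illustration; the real `termKey(el)` may encode the pair differently, and the `eager$` flag is omitted.

```typescript
type Term = { literal: string; caseSensitive?: boolean }

// A literal is effectively sensitive if %s was used or it has no letters.
function isSensitive(t: Term): boolean {
  return t.caseSensitive === true || !/[A-Za-z]/.test(t.literal)
}

// Hypothetical key: sensitivity tag plus the literal text.
function termKey(t: Term): string {
  return (isSensitive(t) ? 's:' : 'i:') + t.literal
}

// Sensitive literals stay as fixed strings; insensitive ones become an
// anchored, case-insensitive regex with regex metacharacters escaped.
function toMatcher(t: Term): string | RegExp {
  if (isSensitive(t)) return t.literal
  const escaped = t.literal.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
  return new RegExp('^' + escaped, 'i')
}
```

Keying on both parts means `%s"foo"` and bare `"foo"` get distinct entries, while a letter-free literal such as `"+"` maps to a single fixed token whichever prefix is used.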

To make `eager$` meaningful, lexer.ts's tcol-gating skips the
gate for matchers that carry the flag. Without it, the default
text-matcher would eat `end` as #TX inside a narrower star-helper
context before the outer rule could recognise it as its own
#END token.

Tests cover: bare literal matching any case, `%s` forcing exact
match, `%i` behaving identically to bare, literals with no
letters, and sensitive / insensitive variants of the same string
coexisting in one grammar. 402/402 tests pass.

Add RFC 5234's `=/` operator: a production header of the form

    name =/ alt1 / alt2 / …

folds its alternatives into the production named `name`
declared earlier. The parser flags each incremental occurrence on
BnfProduction; a new `mergeIncrementals` post-parse pass walks
the list and appends each incremental's `alts` onto the base
(throwing if the base hasn't been seen yet).

Implementation details:

- New `#DEFA` fixed token for `=/`. Jsonic's longest-match-wins
  fixed matcher picks it over `#DEF` (`=`) automatically.
- prod.open gains a second alt for `#TX #DEFA`; prod.close and
  seq's end-guards also recognise `#TX #DEFA` as a
  production-boundary lookahead.
- BnfProduction grows an optional `incremental` flag; the merge
  pass drops it so the emitter only ever sees canonical
  productions.
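
A merge pass with the behaviour described above might look like the following. The `Production` shape is a simplified stand-in for BnfProduction (alternatives are plain string arrays here), so treat this as a sketch of the idea rather than the actual `mergeIncrementals`.

```typescript
type Production = { name: string; alts: string[][]; incremental?: boolean }

// Fold every `name =/ ...` production into the earlier `name = ...` one,
// returning only canonical productions with no `incremental` flag.
function mergeIncrementals(prods: Production[]): Production[] {
  const merged: Production[] = []
  const byName = new Map<string, Production>()
  for (const p of prods) {
    if (p.incremental) {
      const base = byName.get(p.name)
      if (!base) throw new Error('"' + p.name + ' =/" has no earlier definition')
      base.alts.push(...p.alts)
    } else {
      const copy: Production = { name: p.name, alts: [...p.alts] }
      byName.set(p.name, copy)
      merged.push(copy)
    }
  }
  return merged
}
```

Copying each base production before appending keeps the caller's input untouched, and building fresh objects is what drops the flag from the output.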

Tests cover: multi-increment chaining (as in `command =/ "get" /
"post"` split across three lines), a single-alt append, a
multi-alt append, and the error on orphaned `=/`. 406/406 tests
pass.

Document and pin the fact that rules can span multiple lines with
no special parser support: newlines are IGNORE whitespace in the
BNF-parser's lexer, and production boundaries are detected by the
`name =` lookahead rather than by end-of-line. A rule like

    command = "get"
            / "post"
            / "delete"

parses as a single 3-alt production. No implementation change is
needed; three tests pin the behaviour (wrapped alternation,
wrapped sequence, and multiple wrapped rules coexisting). 409/409
tests pass.

Auto-include the RFC 5234 core rules (ALPHA, BIT, CHAR, CR, LF,
CRLF, CTL, DIGIT, DQUOTE, HEXDIG, HTAB, OCTET, SP, VCHAR, WSP)
when a user grammar references them but doesn't define them
locally. Resolution is transitive (e.g. HEXDIG pulls in DIGIT), and
a local definition of a core-rule name always wins over the
built-in.

Implementation: a static CORE_RULES_ABNF string is parsed once on
first use via the existing BNF parser; each user grammar gets
scanned for references not satisfied by its own productions, and
the needed subset of core rules is appended. No run-time cost when
no core rule is referenced.
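
The reference scan and transitive resolution can be sketched with a worklist. `CORE_DEPS` below is a hand-written illustrative subset recording which core rules each core rule references (per RFC 5234, HEXDIG uses DIGIT, CRLF uses CR and LF, WSP uses SP and HTAB); the converter instead derives this by parsing its CORE_RULES_ABNF string, and `neededCoreRules` is a hypothetical name.

```typescript
// Illustrative subset: core rule -> core rules it references.
const CORE_DEPS: Record<string, string[]> = {
  ALPHA: [], DIGIT: [], CR: [], LF: [], SP: [], HTAB: [],
  HEXDIG: ['DIGIT'],
  CRLF: ['CR', 'LF'],
  WSP: ['SP', 'HTAB'],
}

const has = (obj: object, key: string): boolean =>
  Object.prototype.hasOwnProperty.call(obj, key)

// `user` maps each user-defined rule to the rule names it references.
// Returns the core rules to append, in discovery order.
function neededCoreRules(user: Record<string, string[]>): string[] {
  const needed: string[] = []
  const queue: string[] = []
  for (const rule of Object.keys(user)) queue.push(...user[rule])
  while (queue.length > 0) {
    const ref = queue.shift() as string
    // A local definition wins; unknown names are not this pass's concern.
    if (has(user, ref) || !has(CORE_DEPS, ref) || needed.indexOf(ref) >= 0) continue
    needed.push(ref)
    queue.push(...CORE_DEPS[ref]) // transitive: HEXDIG pulls in DIGIT
  }
  return needed
}
```

A grammar that references only HEXDIG gets HEXDIG and DIGIT appended, while one that defines its own DIGIT keeps the local definition and receives nothing for that name.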

Fixtures simplified: `arith.bnf` and `arith-leftrec.bnf` now
use `number = 1*DIGIT` (single core reference) instead of
open-coding `%x30-39`. Six new tests cover auto-inclusion,
upper/lower ALPHA, transitive HEXDIG → DIGIT, BIT, local override,
and the "no core added" negative case. 415/415 tests pass.

New `test/grammar/rfc3986-uri.abnf` contains the Appendix A
collected ABNF from RFC 3986 (27 user-authored productions
covering URI, scheme, authority, userinfo, host (with IP-literal,
IPv4address, reg-name alternatives), full IPv6address, path-abempty
/ absolute / rootless / empty, query, fragment, pct-encoded,
unreserved, sub-delims) written in the same syntax the RFC
prints. The converter transitively pulls in ALPHA, DIGIT, and
HEXDIG from the core-rules library.

The only adjustment made to the RFC text is replacing
`path-empty = 0<pchar>` (which uses prose-val notation outside
the scope of ABNF literal interpretation) with `path-empty = ""`,
semantically equivalent.

New `test/rfc3986.test.js` exercises the grammar in three parts:

1. Compilation: the full grammar builds without error, and every
   named production from the RFC survives into the emitted spec
   alongside the auto-included core rules.

2. URI acceptance: parses URIs that don't exercise the grammar's
   LL(k) ambiguities (`urn:isbn:0451450523`,
   `mailto:alice@example.com`, `tag:yaml.org,2002:int`,
   `http://[::1]/`) and rejects structurally-invalid inputs.

3. Known limitations: explicitly documents that URIs like
   `http://example.com` and `ftp://user@host` currently throw.
   The `authority` rule's `[ userinfo "@" ]` optional is the
   root cause: userinfo and reg-name share a leading character
   set, and the converter's static FIRST-set dispatcher can't
   peek past the userinfo body to decide whether the trailing
   `@` is there. ctx.rewind, shipped earlier in this project,
   is the primitive a future emitter change could use to close
   that gap.

Full suite: 425/425 tests pass.

The BNF converter now detects optional-prefix ambiguities of the form
`[X D] Y` where X and Y share a character vocabulary and D is a
terminal disambiguator, and emits a probe + phase-retry dispatcher
using only standard jsonic primitives (r:, k:, c:, ctx.mark/rewind/t).

On probe return the decide action peeks ctx.t[0], rewinds to the
recorded mark, and retries the dispatcher with the committed phase
— RFC 3986's authority ambiguity (userinfo vs reg-name) now parses
all URI shapes (http://example.com, ftp://user@host, etc).

Also fixes two latent jsonic bugs the new pattern exposed: ctx.rewind
now re-queues pre-lexed lookahead and clears the cached end-of-source
token, and the alt bitset check no longer ORs bitAA into partitions
other than 0 (it only belongs where #AA itself lives).

https://claude.ai/code/session_01F3V3mSpQonb9XHG5tr2265

When an alt's position lists #AA (ANY), drop the per-partition
bitset to the existing null sentinel so the parse_alts matcher
skips the bitset check entirely. Previously the alt's S[i] was
sized from AA's own tin (=4, partition 0), so tokens with tin
>= 31 hit an undefined higher partition and silently failed to
match.

The null-S[i] path was already the "no constraint" contract in
the matcher; this just routes the wildcard through it. tcol
collation is unaffected — t[i] still carries the raw tin list.

https://claude.ai/code/session_01F3V3mSpQonb9XHG5tr2265

Every user-declared rule now produces an AST node `{rule, src, kids}`
where `rule` is the grammar-rule name, `src` is the matched source
text (concatenated), and `kids` is the array of child user-rule
nodes. Core rules (ALPHA, DIGIT, HEXDIG, …) and synthesised helpers
(desugar / dispatcher / chain-step) are transparent: their `src`
appends to the enclosing user rule and their `kids` extend it.
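
The transparency rule can be illustrated with a small tree-flattening function. `Parsed`, `AstNode`, and `buildNode` are hypothetical names, and the `user` flag stands in for however the converter distinguishes user rules from core rules and synthesised helpers.

```typescript
// Hypothetical raw parse result: each rule match holds source fragments
// and nested rule matches, in source order.
type Parsed = { rule: string; user: boolean; parts: Array<string | Parsed> }
type AstNode = { rule: string; src: string; kids: AstNode[] }

function buildNode(parsed: Parsed): AstNode {
  const node: AstNode = { rule: parsed.rule, src: '', kids: [] }
  const absorb = (p: Parsed): void => {
    for (const part of p.parts) {
      if (typeof part === 'string') {
        node.src += part
      } else if (part.user) {
        // User rule: becomes a named child and contributes its text.
        const kid = buildNode(part)
        node.src += kid.src
        node.kids.push(kid)
      } else {
        // Core rule or helper: transparent, so its text and any user-rule
        // descendants are folded into the enclosing user rule.
        absorb(part)
      }
    }
  }
  absorb(parsed)
  return node
}
```

In this model an `authority` match whose `host` is built from ALPHA matches yields an `authority` node with a `host` child, and no ALPHA nodes appear anywhere in the tree.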

For grammars without leading-ref inlining the tree mirrors the ABNF
almost exactly — an RFC 3986 URI parse surfaces `hier-part`,
`authority`, `host`, `path-abempty`, etc. as named nodes. Leading
refs at the head of a production alt still get flattened by Paull's
substitution; documenting this as a current limitation.

Also fixes a jsonic bug the tree-builder surfaced: `r:` rule
replacement wasn't updating the parent's `child` pointer, so the
parent's close-state actions saw the stale pre-replacement rule
instead of the replacement. Rewound the AST tree for probe-dispatch
paths like authority's.

https://claude.ai/code/session_01F3V3mSpQonb9XHG5tr2265

rjrodger merged commit 069a2c8 into main on Apr 21, 2026
6 of 7 checks passed

chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2648edde42


Comment thread src/lexer.ts
Comment on lines 273 to 276
(tokenMatcher as any).tin$ &&
!(tokenMatcher as any).eager$ &&
!rule.spec.def.tcol[oc][tI].includes((tokenMatcher as any).tin$)
) {

P1: Prevent eager matchers from preempting expected literals

By skipping the tcol check for eager$ matchers, the lexer now accepts the first regex token that matches even when that token is not valid in the current rule position. Because token matchers are iterated by tin order and return on first match, a shorter case-insensitive literal can consume input before a longer literal that is actually expected (e.g. grammar contains both "A" and "AB", with "A" allocated first, and start rule expects only "AB"; parsing ab fails). This introduces false parse errors for valid ABNF grammars with prefix-overlapping literals.


Comment thread src/bnf.ts
Comment on lines +2324 to +2326
const parts = body.split('.')
const chars = parts.map((n) => String.fromCharCode(parseInt(n, radix)))
return { kind: 'term', literal: chars.join('') }

P2: Validate numeric-value digits against the declared base

parseNumericValue uses parseInt directly on each numeric chunk, but parseInt accepts partial strings, so invalid ABNF such as %d6A or %b2 is silently accepted and compiled into unintended characters/ranges instead of being rejected. That produces grammars that parse the wrong language and makes numeric-value mistakes hard to diagnose.
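
One way to address this is to check each chunk against a per-base digit pattern before calling parseInt. The sketch below is an assumption about how such validation could look, not the fix that was applied: `parseNumericValueStrict` and the `range` result shape are hypothetical (the converter emits a regex element for ranges).

```typescript
type NumericValue =
  | { kind: 'term'; literal: string }
  | { kind: 'range'; lo: number; hi: number }

// Allowed digits per ABNF base prefix.
const BASES: Record<string, { radix: number; valid: RegExp }> = {
  x: { radix: 16, valid: /^[0-9A-Fa-f]+$/ },
  d: { radix: 10, valid: /^[0-9]+$/ },
  b: { radix: 2, valid: /^[01]+$/ },
}

function parseNumericValueStrict(src: string): NumericValue {
  const m = /^%([xdb])(.+)$/i.exec(src)
  if (!m) throw new Error('not a numeric value: ' + src)
  const base = BASES[m[1].toLowerCase()]
  const body = m[2]
  const decode = (chunk: string): number => {
    // parseInt stops at the first invalid digit, so validate the whole chunk.
    if (!base.valid.test(chunk)) {
      throw new Error('invalid base-' + base.radix + ' digits "' + chunk + '" in ' + src)
    }
    return parseInt(chunk, base.radix)
  }
  if (body.indexOf('-') >= 0) {
    const ends = body.split('-')
    if (ends.length !== 2) throw new Error('malformed range: ' + src)
    return { kind: 'range', lo: decode(ends[0]), hi: decode(ends[1]) }
  }
  const chars = body.split('.').map((n) => String.fromCharCode(decode(n)))
  return { kind: 'term', literal: chars.join('') }
}
```

With this check `%d6A` and `%b2` raise errors, whereas plain `parseInt('6A', 10)` returns 6 and `parseInt('2', 2)` returns NaN.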

