Derive a working YAML tree-sitter target: generate, build, scan, 97.8% (#3) #26

Merged: johnsoncodehk merged 10 commits into master from issue-3-yaml-treesitter on Jun 8, 2026
@johnsoncodehk johnsoncodehk commented Jun 7, 2026

Closes #3. The derived tree-sitter/yaml/grammar.js previously did not even generate; it now generates, builds to wasm, and parses 97.8% (305/312) of the valid yaml-test-suite corpus — above the official hand-written tree-sitter-yaml on the same corpus and above Monogram's own TypeScript tree-sitter (95.9%). Every change is gated on grammar.indent, so the other six tree-sitter grammars (TS/JS/TSX/JSX/HTML/Vue) and all TextMate/Monarch outputs regenerate byte-identical; the CST parser and highlighter are untouched (src-coverage-yaml 100%, scope-gap-yaml 100%).

The three blockers from the issue

  1. Structural tokens → externals. INDENT/DEDENT/NEWLINE, the block-scalar body, and the scalar tokens (which need look-ahead a token DFA lacks) are routed to tree-sitter externals by planScannerTokens.
  2. Nullable-rule elimination + GLR conflicts. A general ε-elimination (makeNonEmpty/wrapNullableRefs) makes the five nullable non-start rules non-empty and wraps their refs in optional(...); the 37 GLR conflicts YAML's ambiguity needs are declared in LR_CONFLICT_CLOSURE (the closure filter now also accepts token names).
  3. A real C indentation + scalar scanner (buildIndentScannerC, all data derived from grammar.indent): an indent stack (serialized for incremental re-parse); INDENT/DEDENT/NEWLINE from the line-leading column; scalars classified + emitted in C (KEY/NUM/BOOL_NULL/PLAIN; emitting the typed tokens, rather than deferring to regex, carries the key-vs-value decision the GLR parser needs); block scalars; compact block notation (- a: 1\n  b: 2); flow-depth tracking (in block context , [ ] { } are content; inside a flow collection they are separators); multi-line plain folding inside flow; ---/... document markers; and node-property/tag/alias keys (&a a:, !!str a:, *b :).

A recurring tree-sitter fact drove the scanner design: it restores the pre-scan serialized scanner state on a false return, so state that must persist (flow depth) is carried on true-returned external tokens, and the one flag that survives a property's internal lex (property_lead) does so by not being reset in deserialize.
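
The persistence rule above can be modelled in a few lines. This is a minimal TypeScript sketch of the behaviour as this PR describes it, not tree-sitter's actual implementation: `ScannerState`, `Scan`, and `runScan` are hypothetical names, and the "host" simply snapshots the serialized state before a scan and restores it when the scan returns false.

```typescript
// Hypothetical model of the host/scanner contract described in this PR.
interface ScannerState {
  flowDepth: number;
}

// A scan mutates the state and reports whether it emitted a token.
type Scan = (state: ScannerState) => boolean;

function runScan(state: ScannerState, scan: Scan): ScannerState {
  const snapshot = JSON.stringify(state); // "serialize" before scanning
  const working: ScannerState = JSON.parse(snapshot);
  const emitted = scan(working);
  // On a false return the pre-scan snapshot is restored, so mutations are lost.
  return emitted ? working : (JSON.parse(snapshot) as ScannerState);
}

// Peeking at a bracket and declining loses the depth change...
const peekOnly: Scan = (s) => {
  s.flowDepth += 1;
  return false;
};

// ...while emitting the bracket as an external token keeps it.
const emitBracket: Scan = (s) => {
  s.flowDepth += 1;
  return true;
};

console.log(runScan({ flowDepth: 0 }, peekOnly).flowDepth); // 0
console.log(runScan({ flowDepth: 0 }, emitBracket).flowDepth); // 1
```

The sketch only covers the rollback half of the design; the `property_lead` carry, which relies on a field being left alone in deserialize, is not modelled here.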

Acceptance

  • cd tree-sitter/yaml && npx tree-sitter generate && npx tree-sitter build --wasm . succeeds.
  • yaml joins the CI "generate every derived grammar" conflict gate + a build-to-wasm step.
  • New accuracy bench test/treesitter-yaml-bench.ts: 305/312 (97.8%) valid yaml-test-suite inputs parse with no ERROR. Real files: ci.yml 19→0, readme-bench.yml 13→0 ERROR nodes.

Remaining (the 2.2% — adversarial yaml-test-suite edges, follow-up)

Each remaining failure is a distinct GLR-runtime / adversarial edge: a dash-on-its-own-line sequence item whose mapping value is on the next line, followed by a sibling (-\n a: 1\n- b: 2; the common inline form - a: 1\n  b: 2 works), a glued-comment continuation, an explicit ?-key with block-sequence key/value, a misaligned sequence, an all-special-chars plain, and a tab-only leading blank. 100% on yaml-test-suite is beyond even the hand-written official grammar.

… pieces 1-2)

`tree-sitter/yaml/grammar.js` previously did not `generate`. Two of the three blockers from the
issue are now resolved, in `src/gen-treesitter.ts` only — every other derived grammar (TS/JS/
TSX/JSX/HTML/Vue) regenerates byte-identical, and `tsc` is clean.

1. Structural indent tokens → externals. INDENT / DEDENT / NEWLINE and the block-scalar body are
   engine-emitted (their token IR is `never()`), so they serialized as never-match token rules the
   parser could never match. `planScannerTokens` now routes them to tree-sitter `externals` (keyed
   off `grammar.indent`), the way the HTML markup path handles `raw_text`: they appear in the
   `externals` block and the scanner.c `TokenType` enum, and references become `$.indent` etc.

2. Nullable-rule elimination. tree-sitter rejects a non-start rule that matches the empty string,
   and an indentation grammar has several (a YAML node/entry may be null: `key:` with no value,
   `{a: }`, an empty document) — `node`/`flow_node`/`flow_map_entry`/`flow_seq_entry`/`after_doc_end`.
   A general ε-elimination (`makeNonEmpty` + `wrapNullableRefs`) makes each such rule's body
   non-empty and wraps every reference to it in `optional(...)`; the accepted language is unchanged
   and only the tree-sitter target is touched. Gated on a grammar actually having nullable non-start
   rules, so the others are untouched.

   The resulting LR conflicts (YAML is massively ambiguous — exactly what tree-sitter's GLR is for)
   are declared: 37 tuples added to `LR_CONFLICT_CLOSURE` (the fixpoint of tree-sitter's own
   analysis, via test/collect-conflicts.ts). The closure filter also accepts TOKEN names now, not
   only rule names, so a token-vs-token conflict like YAML's `key`/`plain` (both can precede a `:`)
   is declarable. Every tuple is YAML-specific (zero rule/token-name overlap with the other
   grammars), so each is inert elsewhere.
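
The ε-elimination in item 2 can be illustrated on a toy rule IR. The sketch below is an assumption-laden stand-in, not the code in `src/gen-treesitter.ts`: the `Expr` shape, `nullableRules`, `wrapRefs`, `nonEmpty`, and `eliminateNullable` are hypothetical names. It finds nullable rules by fixpoint, wraps every reference to one in an optional, and rewrites each nullable non-start body so it can no longer match the empty string.

```typescript
// Toy grammar IR (hypothetical; the real generator's IR is not shown in this PR).
type Expr =
  | { kind: "tok"; name: string }
  | { kind: "ref"; name: string }
  | { kind: "opt"; item: Expr }
  | { kind: "seq"; items: Expr[] }
  | { kind: "choice"; alts: Expr[] };
type Rules = Record<string, Expr>;

function isNullable(e: Expr, nullable: Set<string>): boolean {
  switch (e.kind) {
    case "tok": return false;
    case "ref": return nullable.has(e.name);
    case "opt": return true;
    case "seq": return e.items.every((i) => isNullable(i, nullable));
    case "choice": return e.alts.some((a) => isNullable(a, nullable));
  }
}

// Fixpoint: keep adding rules whose body can match the empty string.
function nullableRules(rules: Rules): Set<string> {
  const nullable = new Set<string>();
  for (let changed = true; changed; ) {
    changed = false;
    for (const [name, body] of Object.entries(rules)) {
      if (!nullable.has(name) && isNullable(body, nullable)) {
        nullable.add(name);
        changed = true;
      }
    }
  }
  return nullable;
}

// Wrap every reference to a nullable rule in an optional.
function wrapRefs(e: Expr, nullable: Set<string>): Expr {
  switch (e.kind) {
    case "tok": return e;
    case "ref": return nullable.has(e.name) ? { kind: "opt", item: e } : e;
    case "opt": return { kind: "opt", item: wrapRefs(e.item, nullable) };
    case "seq": return { kind: "seq", items: e.items.map((i) => wrapRefs(i, nullable)) };
    case "choice": return { kind: "choice", alts: e.alts.map((a) => wrapRefs(a, nullable)) };
  }
}

// Non-empty version of an already-wrapped body (bare refs are now non-empty).
function nonEmpty(e: Expr): Expr | null {
  switch (e.kind) {
    case "tok":
    case "ref": return e;
    case "opt": return nonEmpty(e.item);
    case "choice": {
      const alts = e.alts.map(nonEmpty).filter((a): a is Expr => a !== null);
      return alts.length === 1 ? (alts[0] ?? null) : alts.length ? { kind: "choice", alts } : null;
    }
    case "seq": {
      if (!isNullable(e, new Set<string>())) return e;
      // All items nullable: pick which item is the first non-empty one.
      const variants: Expr[] = [];
      e.items.forEach((item, i) => {
        const head = nonEmpty(item);
        if (head) variants.push({ kind: "seq", items: [head, ...e.items.slice(i + 1)] });
      });
      return variants.length ? { kind: "choice", alts: variants } : null;
    }
  }
}

function eliminateNullable(rules: Rules, start: string): Rules {
  const nullable = nullableRules(rules);
  const out: Rules = {};
  for (const [name, body] of Object.entries(rules)) {
    const wrapped = wrapRefs(body, nullable);
    const fixed = name === start || !nullable.has(name) ? wrapped : nonEmpty(wrapped);
    if (fixed === null) throw new Error(`rule ${name} matches only the empty string`);
    out[name] = fixed;
  }
  return out;
}
```

For a two-rule grammar where `node` is `optional(scalar)` and `stream` is `seq(node, eof)`, this yields `node = scalar` and `stream = seq(optional(node), eof)`: the same language, with the emptiness moved to the reference site.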

`cd tree-sitter/yaml && npx tree-sitter generate && npx tree-sitter build --wasm .` now succeeds.
The C external scanner is still a stub (returns false), so indentation isn't parsed yet — that is
piece 3 (a real indent scanner) and is tracked separately.

Refs #3
`buildIndentScannerC` (src/gen-treesitter.ts) generates a real C external scanner for the YAML
indent tokens, replacing the stub. It mirrors src/gen-lexer.ts's indent-stack state machine:

- An indent stack in the Scanner struct, (de)serialized for incremental re-parsing.
- At each line boundary it measures the next content line's column and emits INDENT (deeper → push),
  DEDENT (shallower → pop, one per call until the stack top is reached), or NEWLINE (same column →
  sibling separator); blank and comment-only lines are skipped; open blocks are closed at EOF.
- A block-scalar body (`|`/`>`) is scanned verbatim up to the first line at or below the parent
  indentation.
- Flow needs no special case: inside `[`/`{` the grammar never references the indent tokens, so
  their `valid_symbols` entries are false and the line break falls through to `extras`.
- All language data (comment introducer, block-scalar introducers) is DERIVED from `grammar.indent`.
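
The indent-stack behaviour in the list above can be sketched as a line-based tokenizer. This is a simplified TypeScript model, not the generated C: `indentTokens` and the `LINE(...)` placeholder are hypothetical, block scalars and flow are ignored, and emitting a NEWLINE after the DEDENTs (as the sibling separator) is an assumption of this sketch.

```typescript
// Simplified model of the INDENT / DEDENT / NEWLINE state machine.
// The comment introducer is passed in, standing in for data derived from grammar.indent.
function indentTokens(source: string, commentIntroducer = "#"): string[] {
  const out: string[] = [];
  const stack: number[] = [0];
  const top = (): number => stack[stack.length - 1] ?? 0;
  let started = false;

  for (const line of source.split("\n")) {
    const content = line.trimStart();
    // Blank and comment-only lines never affect indentation.
    if (content === "" || content.startsWith(commentIntroducer)) continue;

    const column = line.length - content.length;
    if (column > top()) {
      stack.push(column); // deeper: open a block
      out.push("INDENT");
    } else {
      while (column < top()) {
        stack.pop(); // shallower: close one block per DEDENT
        out.push("DEDENT");
      }
      if (started) out.push("NEWLINE"); // same column: sibling separator
    }
    started = true;
    out.push(`LINE(${content})`);
  }

  // Close any blocks still open at end of input.
  while (stack.length > 1) {
    stack.pop();
    out.push("DEDENT");
  }
  return out;
}

console.log(indentTokens("a:\n  b: 1\n  c: 2\nd: 3").join(" "));
// LINE(a:) INDENT LINE(b: 1) NEWLINE LINE(c: 2) DEDENT NEWLINE LINE(d: 3)
```

The real scanner emits one token per call and serializes the stack between calls; the loop here collapses that into a single pass for readability.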

`buildTokenBody` now emits a token's BLOCK pattern when it has one (YAML's scalar tokens), since the
tree-sitter grammar is block-context at the top level. (YAML is the only grammar with a blockPattern,
so the other six are byte-identical.)

Verified parsing (`tree-sitter parse`): nested mappings, nested sequences, block scalars, and flow
collections parse with no ERROR — the indent stack, INDENT/DEDENT/NEWLINE, and block-scalar bodies
all work.

KNOWN REMAINING: a flat single-line `key: value` / `- item` still mis-tokenizes — the `plain`/`key`
block patterns must stop at a `: ` separator via a lookahead (`:(?=\S)`), but tree-sitter's token DFA
forbids lookahead, so `sanitizeTreeSitterRegex` strips it and `plain` greedily eats `a: 1`. The
official tree-sitter-yaml scans scalars in C for exactly this reason. The fix (next) is to rewrite
the in-loop `:(?=\S)` boundary into an extent-equivalent consuming form (`:[^\s]`) for block-token
emission, or to scan plain/key scalars in the external scanner.
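
The effect of stripping the lookahead can be reproduced with ordinary JavaScript regexes. The three patterns below are simplified stand-ins chosen for illustration (hypothetical, not the grammar's actual `plain` pattern): one with the `:(?=\S)` boundary, one with the lookahead removed, and one using the consuming `:[^\s]` form.

```typescript
// Simplified stand-ins for a plain-scalar pattern (not the grammar's real regex).
const withLookahead = /^(?:[^:\n]|:(?=\S))+/; // ':' is content only before non-space
const lookaheadStripped = /^(?:[^:\n]|:)+/; // what remains once the lookahead is dropped
const consuming = /^(?:[^:\n]|:[^\s])+/; // consumes ':' together with the next char

function extent(re: RegExp, input: string): string {
  return re.exec(input)?.[0] ?? "";
}

console.log(extent(withLookahead, "a: 1")); // "a"
console.log(extent(lookaheadStripped, "a: 1")); // "a: 1" (greedy over the separator)
console.log(extent(consuming, "a: 1")); // "a"
console.log(extent(consuming, "a:b: c")); // "a:b"
```

On these inputs the consuming form gives the same extent as the lookahead form. The sketch does not claim the two agree on every input; sequences of adjacent colons are one place where they can differ.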

Refs #3
…ize (issue #3, piece 3)

tree-sitter token DFAs cannot use look-around, so a YAML plain scalar's boundary (`:` is content
unless followed by space; `#` is a comment only after a space) could not be a regex token — `plain`
greedily ate `a: 1`. `planScannerTokens` now also routes the plain + key tokens (identified by their
block-pattern shape: an in-loop char-class lookahead boundary) to the external scanner, and
`buildIndentScannerC` gains `scan_scalar`: it scans a plain run in C (stopping at `: `, ` #`, a
newline, or a flow indicator), trims trailing whitespace, DECLINES (returns false → tree-sitter rolls
back, letting the regex `num`/`bool_null` tokens match) for number/bool/null-shaped runs, and emits
KEY vs PLAIN by peeking for a trailing `: `. All derived from the grammar; the six other grammars stay
byte-identical and `gate:treesitter` is unaffected (96.0%, still beats official 92.5%).
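
A rough TypeScript model of the run scan described above, under stated assumptions: `scanPlainRun` and `PlainRun` are hypothetical names, the boundaries are hardcoded here rather than derived from the grammar, and the decline path for number/bool/null-shaped runs is left out.

```typescript
interface PlainRun {
  text: string; // the run with trailing whitespace trimmed
  end: number; // index just past the last non-whitespace character
  isKey: boolean; // true when the run is followed by a key separator
}

function scanPlainRun(src: string, start: number): PlainRun {
  const isSeparator = (i: number): boolean =>
    i >= src.length || src.charAt(i) === " " || src.charAt(i) === "\n";

  let i = start;
  let end = start;
  while (i < src.length) {
    const c = src.charAt(i);
    if (c === "\n") break; // line end
    if (c === ":" && isSeparator(i + 1)) break; // ": " key separator
    if (c === "#" && i > start && src.charAt(i - 1) === " ") break; // " #" comment
    if ("[]{},".includes(c)) break; // flow indicator (unconditional at this stage)
    i++;
    if (c !== " " && c !== "\t") end = i; // remember the last content character
  }
  // The loop only stops on ':' when it is a separator, so this identifies a key.
  return { text: src.slice(start, end), end, isKey: src.charAt(i) === ":" };
}

console.log(scanPlainRun("a: 1", 0)); // text "a", isKey true
console.log(scanPlainRun("hello world # note", 0)); // text "hello world", isKey false
console.log(scanPlainRun("a:b: c", 0)); // text "a:b", isKey true
```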

Now parse with NO ERROR (verified via `tree-sitter parse`, structure checked): a single mapping
(`a: 1` → key + `num`), a flat sequence, a nested mapping (multi-entry — `b`/`c` both keyed),
a nested sequence + sibling, a block scalar, a flow mapping, a flow sequence, a plain scalar with
spaces (`hello world`; `true` → `bool_null`), a colon-in-key (`a:b: c`), and a trailing comment.

KNOWN REMAINING: a TOP-LEVEL multi-entry block mapping (`x: 1\ny: 2\nz: 3` — the most common YAML
shape) still mis-parses: the first entry's value is dropped and 3+ entries ERROR. NESTED multi-entry
mappings parse correctly, so this is specific to document-level NEWLINE-separated chaining — a
grammar/GLR-runtime issue in the `mapping_or_scalar`/`node`/`stream` rules (likely the ε-elimination
making a mapping value optional and GLR committing to the wrong split), NOT the scalar scanner. Next.

Refs #3
… parse (issue #3, piece 3)

The decline path (scanner returns false for a number/bool/null-shaped run so the regex `num`/
`bool_null` token matches) dropped the value-vs-key disambiguation that the external PLAIN/KEY tokens
carry, so GLR mis-chained a TOP-LEVEL multi-entry block mapping (`x: 1\ny: 2\nz: 3` — the first
value dropped, 3+ entries ERROR), even though nested multi-entry and plain-valued top-level mappings
parsed. Fix: externalize num + bool_null too (every token with a `blockPattern` is now scanned in C)
and have `scan_scalar` CLASSIFY the run and emit NUM / BOOL_NULL / KEY / PLAIN directly (no decline) —
so every scalar is an external token that resolves the key-vs-value choice for the parser. Number/
bool/null typing is preserved (verified: `1`→num, `true`/`null`→bool_null, `hello`→plain). Removed the
now-superseded `isPlainFamilyToken` / consume-rewrite dead code.
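
The classify-and-emit step might look like the following sketch. The function name, the regexes, and the precedence (a run followed by a key separator is KEY regardless of its shape) are assumptions for illustration; the generated scanner derives its shapes from the grammar and does this in C.

```typescript
type ScalarToken = "KEY" | "NUM" | "BOOL_NULL" | "PLAIN";

// Hypothetical shapes; the real ones come from the grammar's token patterns.
const NUMBER_SHAPE = /^[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?$/;
const BOOL_NULL_SHAPE = /^(true|false|null|~)$/;

function classifyScalar(run: string, followedByKeySeparator: boolean): ScalarToken {
  // Assumption: the key-vs-value decision takes precedence over value typing.
  if (followedByKeySeparator) return "KEY";
  if (NUMBER_SHAPE.test(run)) return "NUM";
  if (BOOL_NULL_SHAPE.test(run)) return "BOOL_NULL";
  return "PLAIN";
}

console.log(classifyScalar("1", false)); // NUM
console.log(classifyScalar("true", false)); // BOOL_NULL
console.log(classifyScalar("hello", false)); // PLAIN
console.log(classifyScalar("x", true)); // KEY
```

Because every branch returns a token, there is no decline path left: each scalar reaches the parser already tagged as key or value.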

Parse with NO ERROR (verified): single + flat-multi-entry mappings, sequences, nested mappings,
nested sequences, block scalars, flow map/seq, plain-with-spaces, colon-in-key, trailing comment,
empty-value sibling, blank-line-separated, deep nesting. The 6 other grammars stay byte-identical and
gate:treesitter is unaffected (96.0%, beats official 92.5%).

KNOWN REMAINING: a list-of-maps / COMPACT block (`- a: 1\n  b: 2` — a sequence item whose value is a
multi-entry mapping, the common GitHub-Actions `- uses:\n  with:` shape) still errors — the scanner
must push the inline content column after a `-`/`?` indicator (gen-lexer's `compactIndicators`), which
it does not yet. Plus an accuracy bench over yaml-test-suite (present at /tmp). Next.

Refs #3
A sequence item whose value is a mapping is written compactly — the mapping starts inline on the dash
line and its continuation aligns with the inline content, not the dash (`- a: 1\n  b: 2`, the
GitHub-Actions `- uses: x\n  with:\n    k: v` shape). The scanner now mirrors gen-lexer's
`compactIndicators`: at a line-lead `-`/`?` indicator whose inline content begins a block node (a
nested `-`/`?`, or a scalar followed by an unquoted `: ` key separator — sniffed quote-aware, looking
through a `&`/`!` property prefix), it pushes the inline content column as one extra INDENT.

tree-sitter reverts all external-scanner state on a `false` return, so the natural "probe at the
indicator, remember the column, push next call" loses the remembered column. The working design emits
the compact INDENT in a single `true`-returning zero-width call at the post-indicator content
(mark_end at the content start; the sniff's advances are discarded as tree-sitter restarts from
mark_end). A new serialized `at_line_lead` flag (the indicator is internal-lexed, so it stays true
through it) drives the detection; a bare-scalar / flow / alias lead does NOT push (`- x`, `- [a]`
stay leaf items). All gated on `grammar.indent.compactIndicators` — the six other grammars and yaml's
own grammar.js/tmLanguage/monarch are byte-identical (the change is purely in the C scanner).
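
The compact-block sniff can be approximated as a per-line check. This is a hedged sketch, not the generated C: `compactContentColumn` is a hypothetical name, quote handling covers only a leading quoted scalar without escapes, and the indicators are hardcoded where the scanner reads them from `grammar.indent.compactIndicators`. It returns the inline content column that would be pushed as the extra INDENT, or null for a leaf item.

```typescript
function compactContentColumn(line: string): number | null {
  const lead = line.length - line.trimStart().length;
  const indicator = line.charAt(lead);
  if ((indicator !== "-" && indicator !== "?") || line.charAt(lead + 1) !== " ") return null;

  // Column where the inline content starts, after the indicator and its spaces.
  let column = lead + 1;
  while (line.charAt(column) === " ") column++;
  const rest = line.slice(column);
  if (rest === "") return null;

  if (/^[-?]( |$)/.test(rest)) return column; // nested indicator: "- - x"
  if (/^[\[{*]/.test(rest)) return null; // flow or alias lead stays a leaf

  // Look for an unquoted ": " (or a trailing ":") marking a mapping key.
  let quote = "";
  for (let i = 0; i < rest.length; i++) {
    const c = rest.charAt(i);
    if (quote !== "") {
      if (c === quote) quote = "";
      continue;
    }
    if (i === 0 && (c === '"' || c === "'")) {
      quote = c; // simplified: only a leading quote opens a quoted scalar
      continue;
    }
    if (c === "#" && i > 0 && rest.charAt(i - 1) === " ") break; // comment
    if (c === ":" && (i + 1 === rest.length || rest.charAt(i + 1) === " ")) return column;
  }
  return null;
}

console.log(compactContentColumn("- a: 1")); // 2
console.log(compactContentColumn("  - uses: x")); // 4
console.log(compactContentColumn("- x")); // null
console.log(compactContentColumn('- "a: b"')); // null
```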

Parse NO-ERROR (verified): list-of-maps, single-entry list-maps, the GH-Actions steps shape, nested
seq `- - x`, property+compact `- &a k: v`, map-of-seq — plus every earlier case (mappings, sequences,
block scalars, flow, typed values) still passes. Real files: ci.yml 19→4 ERROR nodes, readme-bench
13→2. tsc clean; generate + build --wasm succeed; gate:treesitter 96.0% (beats official 92.5%).

Remaining (pre-existing, NOT compact): a block-context plain scalar containing `,` (the scanner
treats `,` as a flow indicator), `${{ }}` GH-Actions expressions (`{` treated as flow), and an alias
as a sequence value (`- *a`, a grammar-level gap). Plus an accuracy bench over yaml-test-suite.

Refs #3
`test/treesitter-yaml-bench.ts` measures how many VALID yaml-test-suite inputs the derived YAML
tree-sitter parses with no ERROR/MISSING node ("valid" = the `yaml` package accepts the input, so a
failure is the grammar's, not a malformed sample). Baseline: 209/312 = 67.0% — a real working
tree-sitter for an indentation-sensitive grammar (the grammar previously did not even `generate`).
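
The pass criterion can be sketched independently of any parser binding. The `SyntaxNode` shape below is a hypothetical minimal interface, not the bench's actual types or a real tree-sitter API; the sketch only shows the "no ERROR/MISSING node anywhere in the tree" check and the percentage arithmetic.

```typescript
// Hypothetical minimal node shape for illustration.
interface SyntaxNode {
  type: string;
  isMissing: boolean;
  children: SyntaxNode[];
}

function hasErrorOrMissing(node: SyntaxNode): boolean {
  if (node.type === "ERROR" || node.isMissing) return true;
  return node.children.some(hasErrorOrMissing);
}

function passRate(trees: SyntaxNode[]): { passed: number; total: number; percent: string } {
  const passed = trees.filter((tree) => !hasErrorOrMissing(tree)).length;
  const total = trees.length;
  const percent = total === 0 ? "n/a" : ((100 * passed) / total).toFixed(1) + "%";
  return { passed, total, percent };
}

// The baseline figure: 209 of 312 inputs.
console.log(((100 * 209) / 312).toFixed(1) + "%"); // 67.0%
```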

CI: yaml joins the "generate every derived grammar" conflict gate and gets a build-to-wasm step (its
C indentation scanner must compile + link). The accuracy bench runs where the yaml-test-suite is
already cloned (the readme-bench workflow), not in the conflict gate.

Refs #3
…ent (issue #3)

The C scanner's `scan_scalar` always broke a plain run at `,` `[` `]` `{` `}`, but those are special
only INSIDE a flow collection — in block context they are ordinary plain content (`a, b` is one
scalar). So `a, b`, `k: a, b`, and multi-line flow (`[a,\n b]`) errored. Fix: track `flow_depth` in
the scanner. tree-sitter (0.26.x) RESTORES the pre-scan serialized scanner state before lexing an
internal token, so a peek-then-`false` counter is rolled back — the flow brackets must therefore be
emitted by the scanner as EXTERNAL tokens (a `true` return) where the depth change persists.
`flowSyntheticTokens` synthesizes one external token per `indent.flowOpen`/`flowClose` char (derived,
not hardcoded), `renderExpr` swaps the bare bracket literals in the flow rules for refs to them, and
the scanner emits them (gated on valid_symbols, so a `[` that is plain content is left alone) while
bumping `flow_depth`. `scan_scalar`'s `,`/bracket/`:`/`-`/`?` boundary checks are now gated on
`flow_depth > 0`; in block context they are content. Compact + block-scalar handling stay gated on
`flow_depth == 0`. A flow-context leading-trivia skip (incl. newlines/comments) makes multi-line flow
work. Verified against the `yaml` reference (`a:,b`, `a:[1,2]`, `a,b: c` are single block scalars/keys).
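
The flow-depth gating can be shown with a small scalar scan that takes the depth as a parameter. This TypeScript sketch is a simplification under assumptions: `scanPlain` is a hypothetical name, the flow characters are hardcoded rather than derived from `indent.flowOpen`/`flowClose`, and the flow-only `:`/`-`/`?` boundary rules are not modelled.

```typescript
function scanPlain(src: string, start: number, flowDepth: number): string {
  let i = start;
  let end = start;
  while (i < src.length) {
    const c = src.charAt(i);
    if (c === "\n") break;
    // ':' ends the run only when followed by a space, a newline, or end of input.
    if (c === ":" && (i + 1 >= src.length || " \n".includes(src.charAt(i + 1)))) break;
    if (c === "#" && i > start && src.charAt(i - 1) === " ") break;
    // Flow indicators are boundaries only inside a flow collection.
    if (flowDepth > 0 && ",[]{}".includes(c)) break;
    i++;
    if (c !== " " && c !== "\t") end = i;
  }
  return src.slice(start, end);
}

console.log(scanPlain("a, b", 0, 0)); // "a, b" (block context: one scalar)
console.log(scanPlain("a, b", 0, 1)); // "a"    (flow context: ',' separates)
console.log(scanPlain("a:[1,2]", 0, 0)); // "a:[1,2]"
```

In the scanner itself the depth only changes when a bracket is emitted as an external token, which is what keeps the counter from being rolled back.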

Bench: 209/312 → 226/312 (67.0% → 72.4%). The six other grammars stay byte-identical; tsc clean;
generate + build --wasm succeed; gate:treesitter 96.0% (beats official 92.5%).

Refs #3
A document that starts with `---`/`...` and continues with a body on the next line failed to parse: the external scalar
scanner's `-`/`.` lead ran the `---` into a plain/key token before the internal `doc_start` could
match (and the marker token's separator look-ahead is stripped by the token DFA). The scanner now
probes for a document marker at column 0 (glyphs derived from `indent.blockScalar.documentMarkers`):
a true sep-bounded marker → set a transient `marker_decline` + return false so the internal
`---`/`...` token lexes it; a non-marker glyph (`---foo`) → claim it as plain content. The markers
stay INTERNAL tokens (making them external perturbs the GLR tables and mis-lexes same-column block
sequences). Plus: `started` is set whenever the column > 0 (so the NEWLINE after a leading marker is
emitted, not suppressed), and a document-root block scalar (stack depth 1, parent indent −1) may have
a column-0 body, ending only at a column-0 marker.
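
The marker probe reduces to a column-0, separator-bounded prefix check. A minimal sketch, assuming the line is passed without its newline: `isDocumentMarker` is a hypothetical name, and the default glyph list stands in for the values derived from `indent.blockScalar.documentMarkers`.

```typescript
// True when the line begins, at column 0, with a marker followed by a
// separator (space, tab, or end of line). "---foo" is plain content.
function isDocumentMarker(line: string, markers: string[] = ["---", "..."]): boolean {
  return markers.some(
    (marker) =>
      line.startsWith(marker) &&
      (line.length === marker.length || " \t".includes(line.charAt(marker.length))),
  );
}

console.log(isDocumentMarker("---")); // true
console.log(isDocumentMarker("--- # start")); // true
console.log(isDocumentMarker("---foo")); // false
console.log(isDocumentMarker("  ---")); // false (not at column 0)
```

In the scanner, a true result corresponds to declining so the internal `---`/`...` token can lex the marker, and a false result on a marker-like lead corresponds to claiming the glyphs as plain content.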

Combined with the flow-depth fix, the bench jumps 72.4% → 94.2% (294/312 valid yaml-test-suite
inputs ERROR-free) — the two compound, since many inputs had both a `---` marker and flow/comma
content. The six other grammars stay byte-identical (all gated on grammar.indent); tsc clean;
generate + build --wasm succeed; gate:treesitter 96.0%; src-coverage-yaml parser alignment 100%
(yaml.ts untouched — tree-sitter target only).

Refs #3
Inside a flow collection (`[ ]` / `{ }`) a plain scalar folds across line breaks — the break +
surrounding whitespace collapse to one space and the run continues on the next line until a flow
terminator. The scanner's `scan_scalar` broke a run unconditionally at any newline, so a flow key /
value / explicit-key spanning lines lexed as two scalars and the GLR parser couldn't chain them
(ERROR). Now, at `flow_depth > 0` with content already scanned, a newline folds: advance past it +
surrounding blank lines, stop at a flow terminator (`,`/brackets) / line-leading `#` / EOF, else
append one folded space and continue (the next content char re-marks the token end). Block context is
unchanged (its multi-line folding is separate indent/grammar machinery). Multi-line quoted scalars in
flow already worked (the quoted token spans newlines natively).
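
The folding rule can be sketched as follows. This is a TypeScript approximation with a hypothetical `scanFlowPlain`; it models only the newline fold and the flow terminators, not key separators or comments within a line, and it assumes the caller is already inside a flow collection.

```typescript
// Scan a plain scalar inside a flow collection, folding each line break
// (plus surrounding whitespace and blank lines) into a single space.
function scanFlowPlain(src: string, start: number): string {
  const terminators = ",[]{}";
  let out = "";
  let i = start;
  while (i < src.length) {
    const c = src.charAt(i);
    if (terminators.includes(c)) break;
    if (c === "\n") {
      // Skip the break and any following whitespace or blank lines.
      let j = i;
      while (j < src.length && " \t\n".includes(src.charAt(j))) j++;
      const next = src.charAt(j);
      // Stop at end of input, a flow terminator, or a line-leading comment.
      if (j >= src.length || terminators.includes(next) || next === "#") break;
      out = out.trimEnd() + " "; // one folded space, then continue the run
      i = j;
      continue;
    }
    out += c;
    i++;
  }
  return out.trimEnd();
}

console.log(scanFlowPlain("[a\n  b, c]", 1)); // "a b"
console.log(scanFlowPlain("[a\n]", 1)); // "a"
```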

Bench: 294/312 → 299/312 (94.2% → 95.8%). Six other grammars byte-identical (yaml-only, gated on
grammar.indent); tsc clean; generate + build --wasm succeed; gate:treesitter 96.0%.

Refs #3
A block mapping whose KEY is preceded by a node property (`&anchor` / `!tag` / `!!tag` / `!<verbatim>`)
ERRORed: the scanner's compact-block detection keys off `at_line_lead` ("the line's first token"), but
anchor/tag/alias are INTERNAL tokens tree-sitter lexes WITHOUT consulting the scanner, so after a
property was lexed `at_line_lead` was still set and the following key was mis-treated as a compact-
nested mapping → a spurious INDENT that corrupted the structure. Fix: a transient `property_lead`
field, latched at the genuine line lead (column == stack top, re-derived every boundary and for the
first line) when the lead char is a property; the two compact-push sites skip a property-led line so
its key stays at the node level. `property_lead` is NOT reset in deserialize — the one carry that must
survive the property's internal lex (tree-sitter discards scanner mutations on a `false` return; only
across a `true`-returned token does state persist). `yaml.ts` untouched — the grammar's BlockKey
already had the production; the gap was the tree-sitter derivation. (yaml-test-suite ZH7C/74H7/E76Z/
7FWL/HMQ5/2SXE.)

Combined with the flow folding, the bench is 95.8% → 97.8% (305/312). Six other grammars byte-
identical; tsc clean; generate + build --wasm succeed; gate:treesitter 96.0%; agnostic 9/9;
test:yaml-issues 10/10; scope-gap:yaml 100%; src-coverage-yaml 100%.

Refs #3
@johnsoncodehk johnsoncodehk changed the title Derive a working YAML tree-sitter target: generate, build, scan, bench (#3) Derive a working YAML tree-sitter target: generate, build, scan, 97.8% (#3) Jun 8, 2026
@johnsoncodehk johnsoncodehk merged commit 887308e into master Jun 8, 2026
2 checks passed
@johnsoncodehk johnsoncodehk deleted the issue-3-yaml-treesitter branch June 8, 2026 01:08

Development

Successfully merging this pull request may close these issues.

Complete the derived YAML tree-sitter target (grammar doesn't generate)
