Derive a working YAML tree-sitter target: generate, build, scan, 97.8% (#3) by johnsoncodehk · Pull Request #26 · johnsoncodehk/monogram

johnsoncodehk · 2026-06-07T22:33:31Z

Closes #3. The derived tree-sitter/yaml/grammar.js previously did not even generate; it now generates, builds to wasm, and parses 97.8% (305/312) of the valid yaml-test-suite corpus — above the official hand-written tree-sitter-yaml on the same corpus and above Monogram's own TypeScript tree-sitter (95.9%). Every change is gated on grammar.indent, so the other six tree-sitter grammars (TS/JS/TSX/JSX/HTML/Vue) and all TextMate/Monarch outputs regenerate byte-identical; the CST parser and highlighter are untouched (src-coverage-yaml 100%, scope-gap-yaml 100%).

The three blockers from the issue

1. Structural tokens → externals. INDENT/DEDENT/NEWLINE, the block-scalar body, and the scalar tokens (which need look-ahead that a token DFA lacks) are routed to tree-sitter externals by planScannerTokens.
2. Nullable-rule elimination + GLR conflicts. A general ε-elimination (makeNonEmpty/wrapNullableRefs) makes the five nullable non-start rules non-empty and wraps their refs in optional(...); the 37 GLR conflicts that YAML's ambiguity requires are declared in LR_CONFLICT_CLOSURE (the closure filter now also accepts token names).
3. A real C indentation + scalar scanner (buildIndentScannerC, all data derived from grammar.indent):
   - an indent stack (serialized for incremental re-parse);
   - INDENT/DEDENT/NEWLINE from the line-leading column;
   - scalars classified + emitted in C (KEY/NUM/BOOL_NULL/PLAIN); emitting the typed tokens, rather than deferring to regex, carries the key-vs-value decision the GLR parser needs;
   - block scalars;
   - compact block notation (- a: 1\n b: 2);
   - flow-depth tracking (in block context ,/[]{} are content; in flow context they are separators);
   - multi-line plain folding inside flow;
   - ---/... document markers;
   - node-property/tag/alias keys (&a a:, !!str a:, *b :).

A recurring tree-sitter fact drove the scanner design: it restores the pre-scan serialized scanner state on a false return, so state that must persist (flow depth) is carried on true-returned external tokens, and the one flag that survives a property's internal lex (property_lead) does so by not being reset in deserialize.
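The rollback behaviour described above can be modelled in a few lines. This is a minimal TypeScript sketch of the idea only, not the tree-sitter API: `ScannerModelState` and `runScanModel` are hypothetical names, and the snapshot/restore step stands in for tree-sitter's serialize/deserialize round trip.

```ts
// Hypothetical model of the behaviour described above, not the tree-sitter API.
interface ScannerModelState {
  flowDepth: number;
}

// Runs one scan call. The state is snapshotted first (the "serialized" copy);
// mutations made by the scan survive only when it returns true.
function runScanModel(
  state: ScannerModelState,
  scan: (s: ScannerModelState) => boolean,
): ScannerModelState {
  const snapshot = JSON.stringify(state);
  const working: ScannerModelState = JSON.parse(snapshot);
  const emitted = scan(working);
  return emitted ? working : JSON.parse(snapshot);
}

// Peek at '[' and bump the counter, but decline the token: the bump is lost.
const afterDecline = runScanModel({ flowDepth: 0 }, (s) => {
  s.flowDepth++;
  return false;
});

// Emit '[' as an external token: the bump persists.
const afterEmit = runScanModel({ flowDepth: 0 }, (s) => {
  s.flowDepth++;
  return true;
});
```

In this model `afterDecline.flowDepth` stays 0 while `afterEmit.flowDepth` becomes 1, which is why state that must persist has to ride on a `true`-returned token.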

Acceptance

- `cd tree-sitter/yaml && npx tree-sitter generate && npx tree-sitter build --wasm .` succeeds.
- yaml joins the CI "generate every derived grammar" conflict gate + a build-to-wasm step.
- New accuracy bench test/treesitter-yaml-bench.ts: 305/312 (97.8%) valid yaml-test-suite inputs parse with no ERROR. Real files: ci.yml 19→0, readme-bench.yml 13→0 ERROR nodes.

Remaining (the 2.2% — adversarial yaml-test-suite edges, follow-up)

- a dash-on-its-own-line sequence item whose mapping value is on the next line, followed by a sibling (-\n a: 1\n- b: 2; the common inline form - a: 1\n b: 2 works);
- a glued-comment continuation;
- an explicit ?-key with block-sequence key/value;
- a misaligned sequence;
- an all-special-chars plain;
- a tab-only leading blank.

Each is a distinct GLR-runtime / adversarial edge. 100% on yaml-test-suite is beyond even the hand-written official grammar.

… pieces 1-2)

`tree-sitter/yaml/grammar.js` previously did not `generate`. Two of the three blockers from the issue are now resolved, in `src/gen-treesitter.ts` only: every other derived grammar (TS/JS/TSX/JSX/HTML/Vue) regenerates byte-identical, and `tsc` is clean.

1. Structural indent tokens → externals. INDENT / DEDENT / NEWLINE and the block-scalar body are engine-emitted (their token IR is `never()`), so they serialized as never-match token rules the parser could never match. `planScannerTokens` now routes them to tree-sitter `externals` (keyed off `grammar.indent`), the way the HTML markup path handles `raw_text`: they appear in the `externals` block and the scanner.c `TokenType` enum, and references become `$.indent` etc.

2. Nullable-rule elimination. tree-sitter rejects a non-start rule that matches the empty string, and an indentation grammar has several (a YAML node/entry may be null: `key:` with no value, `{a: }`, an empty document): `node`/`flow_node`/`flow_map_entry`/`flow_seq_entry`/`after_doc_end`. A general ε-elimination (`makeNonEmpty` + `wrapNullableRefs`) makes each such rule's body non-empty and wraps every reference to it in `optional(...)`; the accepted language is unchanged and only the tree-sitter target is touched. Gated on a grammar actually having nullable non-start rules, so the others are untouched.

The resulting LR conflicts (YAML is massively ambiguous, exactly what tree-sitter's GLR is for) are declared: 37 tuples added to `LR_CONFLICT_CLOSURE` (the fixpoint of tree-sitter's own analysis, via test/collect-conflicts.ts). The closure filter also accepts TOKEN names now, not only rule names, so a token-vs-token conflict like YAML's `key`/`plain` (both can precede a `:`) is declarable. Every tuple is YAML-specific (zero rule/token-name overlap with the other grammars), so each is inert elsewhere.

`cd tree-sitter/yaml && npx tree-sitter generate && npx tree-sitter build --wasm .` now succeeds.

The C external scanner is still a stub (returns false), so indentation isn't parsed yet. That is piece 3 (a real indent scanner) and is tracked separately.

Refs #3
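The ε-elimination in point 2 can be illustrated on a toy grammar. The sketch below is an assumption-laden model, not Monogram's `makeNonEmpty` / `wrapNullableRefs`: the `Expr` IR, `findNullableRules`, `wrapRefs`, `nonEmpty` and `eliminateNullable` are all hypothetical names. It finds the nullable rules by fixpoint, wraps each reference to one in an optional, and rewrites each nullable rule's body so it no longer matches the empty string.

```ts
// Toy grammar IR. Every name here is illustrative, not Monogram's real types.
type Expr =
  | { k: "tok"; name: string }
  | { k: "ref"; name: string }
  | { k: "seq"; items: Expr[] }
  | { k: "choice"; alts: Expr[] }
  | { k: "opt"; inner: Expr };
type Rules = Record<string, Expr>;

// Can `e` match the empty string, given the rules already known to be nullable?
function isNullable(e: Expr, known: Set<string>): boolean {
  switch (e.k) {
    case "tok": return false;
    case "ref": return known.has(e.name);
    case "seq": return e.items.every((i) => isNullable(i, known));
    case "choice": return e.alts.some((a) => isNullable(a, known));
    case "opt": return true;
  }
}

// Fixpoint: add rules whose body is nullable until nothing changes.
function findNullableRules(rules: Rules): Set<string> {
  const found = new Set<string>();
  for (let changed = true; changed; ) {
    changed = false;
    for (const [name, body] of Object.entries(rules)) {
      if (!found.has(name) && isNullable(body, found)) {
        found.add(name);
        changed = true;
      }
    }
  }
  return found;
}

// Wrap every reference to a nullable rule in opt(...).
function wrapRefs(e: Expr, nullable: Set<string>): Expr {
  switch (e.k) {
    case "tok": return e;
    case "ref": return nullable.has(e.name) ? { k: "opt", inner: e } : e;
    case "seq": return { k: "seq", items: e.items.map((i) => wrapRefs(i, nullable)) };
    case "choice": return { k: "choice", alts: e.alts.map((a) => wrapRefs(a, nullable)) };
    case "opt": return { k: "opt", inner: wrapRefs(e.inner, nullable) };
  }
}

const NO_RULES = new Set<string>(); // after wrapping, a bare ref is treated as never empty

function oneOf(alts: Expr[]): Expr | null {
  return alts.length === 0 ? null : alts.length === 1 ? alts[0] : { k: "choice", alts };
}

// Same language minus the empty string; null if only "" was matched.
function nonEmpty(e: Expr): Expr | null {
  switch (e.k) {
    case "tok":
    case "ref": return e;
    case "opt": return nonEmpty(e.inner);
    case "choice":
      return oneOf(e.alts.map((a) => nonEmpty(a)).filter((a): a is Expr => a !== null));
    case "seq": {
      if (e.items.length === 0) return null;
      const [head, ...tail] = e.items;
      const alts: Expr[] = [];
      const h = nonEmpty(head);
      // head contributes at least one token, so the tail may stay as it is
      if (h) alts.push(tail.length > 0 ? { k: "seq", items: [h, ...tail] } : h);
      // head may vanish, in which case the tail must be non-empty on its own
      if (isNullable(head, NO_RULES)) {
        const t = nonEmpty({ k: "seq", items: tail });
        if (t) alts.push(t);
      }
      return oneOf(alts);
    }
  }
}

function eliminateNullable(rules: Rules, start: string): Rules {
  const nullable = findNullableRules(rules);
  nullable.delete(start); // only non-start rules are rewritten
  const out: Rules = {};
  for (const [name, body] of Object.entries(rules)) {
    const wrapped = wrapRefs(body, nullable);
    if (!nullable.has(name)) { out[name] = wrapped; continue; }
    const body2 = nonEmpty(wrapped);
    if (body2 === null) throw new Error(`rule ${name} matches only the empty string`);
    out[name] = body2;
  }
  return out;
}

const toyRules: Rules = {
  stream: { k: "ref", name: "entry" },
  entry: { k: "seq", items: [{ k: "tok", name: "key" }, { k: "tok", name: ":" }, { k: "ref", name: "node" }] },
  node: { k: "opt", inner: { k: "tok", name: "scalar" } }, // models `key:` with no value
};
const rewrittenRules = eliminateNullable(toyRules, "stream");
```

Here `node` becomes the bare `scalar` token and the reference inside `entry` becomes an optional reference, so `key:` with no value is still accepted while no non-start rule matches the empty string.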

`buildIndentScannerC` (src/gen-treesitter.ts) generates a real C external scanner for the YAML indent tokens, replacing the stub. It mirrors src/gen-lexer.ts's indent-stack state machine:

- An indent stack in the Scanner struct, (de)serialized for incremental re-parsing.
- At each line boundary it measures the next content line's column and emits INDENT (deeper → push), DEDENT (shallower → pop, one per call until the stack top is reached), or NEWLINE (same column → sibling separator); blank and comment-only lines are skipped; open blocks are closed at EOF.
- A block-scalar body (`|`/`>`) is scanned verbatim up to the first line at or below the parent indentation.
- Flow needs no special case: inside `[`/`{` the grammar never references the indent tokens, so valid_symbols is false and the line break falls through to `extras`.
- All language data (comment introducer, block-scalar introducers) is DERIVED from `grammar.indent`.

`buildTokenBody` now emits a token's BLOCK pattern when it has one (YAML's scalar tokens), since the tree-sitter grammar is block-context at the top level. (YAML is the only grammar with a blockPattern, so the other six are byte-identical.)

Verified parsing (`tree-sitter parse`): nested mappings, nested sequences, block scalars, and flow collections parse with no ERROR. The indent stack, INDENT/DEDENT/NEWLINE, and block-scalar bodies all work.

KNOWN REMAINING: a flat single-line `key: value` / `- item` still mis-tokenizes. The `plain`/`key` block patterns must stop at a `: ` separator via a lookahead (`:(?=\S)`), but tree-sitter's token DFA forbids lookahead, so `sanitizeTreeSitterRegex` strips it and `plain` greedily eats `a: 1`. The official tree-sitter-yaml scans scalars in C for exactly this reason. The fix (next) is to rewrite the in-loop `:(?=\S)` boundary into an extent-equivalent consuming form (`:[^\s]`) for block-token emission, or to scan plain/key scalars in the external scanner.

Refs #3
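The indent-stack rule in the list above can be modelled over whole lines. This is a hedged TypeScript sketch, not the generated C: `layoutEvents` is a hypothetical name, it assumes space indentation and a `#` comment introducer, it emits all DEDENTs for a line at once rather than one per scan call, and it does not decide whether a NEWLINE follows a DEDENT.

```ts
// Illustrative model only; the real scanner is generated C and works per scan call.
function layoutEvents(source: string, commentIntroducer = "#"): string[] {
  const stack: number[] = []; // columns of the open blocks, innermost last
  const out: string[] = [];
  for (const line of source.split("\n")) {
    const text = line.trimStart();
    if (text === "" || text.startsWith(commentIntroducer)) continue; // blank or comment-only
    const column = line.length - text.length; // assumes space indentation
    if (stack.length === 0) {
      stack.push(column); // first content line opens the root block
    } else if (column > stack[stack.length - 1]) {
      stack.push(column); // deeper: push and open a block
      out.push("INDENT");
    } else if (column === stack[stack.length - 1]) {
      out.push("NEWLINE"); // same column: sibling separator
    } else {
      // shallower: pop until the stack top is reached
      while (stack.length > 1 && column < stack[stack.length - 1]) {
        stack.pop();
        out.push("DEDENT");
      }
    }
    out.push(text);
  }
  // close the blocks that are still open at EOF
  while (stack.length > 1) {
    stack.pop();
    out.push("DEDENT");
  }
  return out;
}

const layoutSample = layoutEvents("a:\n  b: 1\n  # note\n\n  c: 2\nd: 3\ne:\n  - x");
```

For the sample input the comment-only and blank lines vanish, `b: 1` opens a block, `c: 2` is separated from it by a NEWLINE, and the trailing `- x` block is closed by a final DEDENT.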

…ize (issue #3, piece 3)

tree-sitter token DFAs cannot use look-around, so a YAML plain scalar's boundary (`:` is content unless followed by space; `#` is a comment only after a space) could not be a regex token: `plain` greedily ate `a: 1`. `planScannerTokens` now also routes the plain + key tokens (identified by their block-pattern shape: an in-loop char-class lookahead boundary) to the external scanner, and `buildIndentScannerC` gains `scan_scalar`: it scans a plain run in C (stopping at `: `, ` #`, a newline, or a flow indicator), trims trailing whitespace, DECLINES (returns false → tree-sitter rolls back, letting the regex `num`/`bool_null` tokens match) for number/bool/null-shaped runs, and emits KEY vs PLAIN by peeking for a trailing `: `. All derived from the grammar; the six other grammars stay byte-identical and `gate:treesitter` is unaffected (96.0%, still beats official 92.5%).

Now parse with NO ERROR (verified via `tree-sitter parse`, structure checked): a single mapping (`a: 1` → key + `num`), a flat sequence, a nested mapping (multi-entry, with `b`/`c` both keyed), a nested sequence + sibling, a block scalar, a flow mapping, a flow sequence, a plain scalar with spaces (`hello world`; `true` → `bool_null`), a colon-in-key (`a:b: c`), and a trailing comment.

KNOWN REMAINING: a TOP-LEVEL multi-entry block mapping (`x: 1\ny: 2\nz: 3`, the most common YAML shape) still mis-parses: the first entry's value is dropped and 3+ entries ERROR. NESTED multi-entry mappings parse correctly, so this is specific to document-level NEWLINE-separated chaining: a grammar/GLR-runtime issue in the `mapping_or_scalar`/`node`/`stream` rules (likely the ε-elimination making a mapping value optional and GLR committing to the wrong split), NOT the scalar scanner. Next.

Refs #3
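The boundary rules that `scan_scalar` applies can be sketched without any look-around. The TypeScript below is only a model of the description above: `scanPlainRun` and `PlainScan` are hypothetical names, flow indicators and the number/bool/null decline path are left out, and the real scanner is C working on tree-sitter's lexer.

```ts
// Hypothetical result shape for this sketch.
interface PlainScan {
  text: string;           // the run with trailing blanks trimmed
  end: number;            // offset one past the last character of the run
  kind: "KEY" | "PLAIN";  // KEY when the run is followed by a `: ` separator
}

// Scans one block-context plain run starting at `start`.
// Stops at a newline, at ':' followed by a space / line end, or at ' #'.
function scanPlainRun(src: string, start: number): PlainScan | null {
  let i = start;
  let end = start;
  while (i < src.length) {
    const c = src[i];
    const atLineEnd = i + 1 >= src.length || src[i + 1] === "\n";
    if (c === "\n") break;
    if (c === ":" && (atLineEnd || src[i + 1] === " ")) break; // key separator
    if (c === " " && src[i + 1] === "#") break;                // trailing comment
    if (c !== " " && c !== "\t") end = i + 1;                  // remember last content char
    i++;
  }
  if (end === start) return null; // nothing scanned
  // The loop only stops on ':' when it is a separator, so this peek is enough.
  const kind = src[i] === ":" ? "KEY" : "PLAIN";
  return { text: src.slice(start, end), end, kind };
}
```

With this model `a: 1` yields the key `a`, `a:b: c` yields the key `a:b` (the first colon is content because no space follows it), and `hello world # note` yields the plain scalar `hello world`.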

… parse (issue #3, piece 3)

The decline path (scanner returns false for a number/bool/null-shaped run so the regex `num`/`bool_null` token matches) dropped the value-vs-key disambiguation that the external PLAIN/KEY tokens carry, so GLR mis-chained a TOP-LEVEL multi-entry block mapping (`x: 1\ny: 2\nz: 3`: the first value dropped, 3+ entries ERROR), even though nested multi-entry and plain-valued top-level mappings parsed.

Fix: externalize num + bool_null too (every token with a `blockPattern` is now scanned in C) and have `scan_scalar` CLASSIFY the run and emit NUM / BOOL_NULL / KEY / PLAIN directly (no decline), so every scalar is an external token that resolves the key-vs-value choice for the parser. Number/bool/null typing is preserved (verified: `1`→num, `true`/`null`→bool_null, `hello`→plain). Removed the now-superseded `isPlainFamilyToken` / consume-rewrite dead code.

Parse with NO ERROR (verified): single + flat-multi-entry mappings, sequences, nested mappings, nested sequences, block scalars, flow map/seq, plain-with-spaces, colon-in-key, trailing comment, empty-value sibling, blank-line-separated, deep nesting. The 6 other grammars stay byte-identical and gate:treesitter is unaffected (96.0%, beats official 92.5%).

KNOWN REMAINING: a list-of-maps / COMPACT block (`- a: 1\n b: 2`, a sequence item whose value is a multi-entry mapping, the common GitHub-Actions `- uses:\n with:` shape) still errors. The scanner must push the inline content column after a `-`/`?` indicator (gen-lexer's `compactIndicators`), which it does not yet. Plus an accuracy bench over yaml-test-suite (present at /tmp). Next.

Refs #3
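The classify-and-emit step can be sketched as a pure function over an already-scanned run. This is a hypothetical TypeScript model: `classifyScalar` is not a name from the source, the two regular expressions are assumed shapes loosely following the YAML 1.2 core schema (the real patterns are derived from the grammar), and giving KEY priority over the typed kinds is also an assumption.

```ts
type ScalarKind = "KEY" | "NUM" | "BOOL_NULL" | "PLAIN";

// Assumed shapes for this sketch only; the generated scanner derives its own.
const NUM_SHAPE = /^[-+]?(\.[0-9]+|[0-9]+(\.[0-9]*)?)([eE][-+]?[0-9]+)?$/;
const BOOL_NULL_SHAPE = /^(true|True|TRUE|false|False|FALSE|null|Null|NULL|~)$/;

// Every run gets exactly one external token kind; nothing is declined.
function classifyScalar(run: string, followedByKeySeparator: boolean): ScalarKind {
  if (followedByKeySeparator) return "KEY"; // assumption: a key stays a key even if number-shaped
  if (BOOL_NULL_SHAPE.test(run)) return "BOOL_NULL";
  if (NUM_SHAPE.test(run)) return "NUM";
  return "PLAIN";
}
```

Because the function never returns "no token", the parser always sees which side of the `:` a scalar sits on, which is the property the decline path lost.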

A sequence item whose value is a mapping is written compactly: the mapping starts inline on the dash line and its continuation aligns with the inline content, not the dash (`- a: 1\n b: 2`, the GitHub-Actions `- uses: x\n with:\n k: v` shape). The scanner now mirrors gen-lexer's `compactIndicators`: at a line-lead `-`/`?` indicator whose inline content begins a block node (a nested `-`/`?`, or a scalar followed by an unquoted `: ` key separator, sniffed quote-aware and looking through a `&`/`!` property prefix), it pushes the inline content column as one extra INDENT.

tree-sitter reverts all external-scanner state on a `false` return, so the natural "probe at the indicator, remember the column, push next call" loses the remembered column. The working design emits the compact INDENT in a single `true`-returning zero-width call at the post-indicator content (mark_end at the content start; the sniff's advances are discarded as tree-sitter restarts from mark_end). A new serialized `at_line_lead` flag (the indicator is internal-lexed, so it stays true through it) drives the detection; a bare-scalar / flow / alias lead does NOT push (`- x`, `- [a]` stay leaf items). All gated on `grammar.indent.compactIndicators`: the six other grammars and yaml's own grammar.js/tmLanguage/monarch are byte-identical (the change is purely in the C scanner).

Parse NO-ERROR (verified): list-of-maps, single-entry list-maps, the GH-Actions steps shape, nested seq `- - x`, property+compact `- &a k: v`, map-of-seq, plus every earlier case (mappings, sequences, block scalars, flow, typed values) still passes. Real files: ci.yml 19→4 ERROR nodes, readme-bench 13→2. tsc clean; generate + build --wasm succeed; gate:treesitter 96.0% (beats official 92.5%).

Remaining (pre-existing, NOT compact): a block-context plain scalar containing `,` (the scanner treats `,` as a flow indicator), `${{ }}` GH-Actions expressions (`{` treated as flow), and an alias as a sequence value (`- *a`, a grammar-level gap). Plus an accuracy bench over yaml-test-suite.

Refs #3
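The compact-indicator sniff can be modelled on a single line. The TypeScript below is a simplified, hypothetical sketch (`compactContentColumn` and `startsBlockNode` are not names from the source): it handles only a leading quoted scalar without escapes, treats `&`/`!` words as property prefixes, and returns the column that would be pushed as the extra INDENT, or `null` when the item stays a leaf.

```ts
// Does the inline content starting at `col` begin a block node?
function startsBlockNode(line: string, col: number): boolean {
  let i = col;
  // look through &anchor / !tag property prefixes
  while (line[i] === "&" || line[i] === "!") {
    while (i < line.length && line[i] !== " ") i++;
    while (line[i] === " ") i++;
  }
  const c = line[i];
  // a nested indicator such as `- - x`
  if ((c === "-" || c === "?") && (i + 1 >= line.length || line[i + 1] === " ")) return true;
  // flow collections and aliases stay leaf items
  if (c === "[" || c === "{" || c === "*") return false;
  // a leading quoted scalar: skip to its closing quote (escapes ignored in this sketch)
  if (c === '"' || c === "'") {
    const close = line.indexOf(c, i + 1);
    if (close < 0) return false;
    i = close + 1;
  }
  for (; i < line.length; i++) {
    if (line[i] === " " && line[i + 1] === "#") return false; // comment starts first
    if (line[i] === ":" && (i + 1 >= line.length || line[i + 1] === " ")) return true;
  }
  return false;
}

// Returns the inline content column to push, or null when nothing is pushed.
function compactContentColumn(line: string): number | null {
  const lead = line.length - line.trimStart().length;
  const indicator = line[lead];
  if ((indicator !== "-" && indicator !== "?") || line[lead + 1] !== " ") return null;
  let col = lead + 1;
  while (line[col] === " ") col++;
  if (col >= line.length) return null; // indicator with nothing after it
  return startsBlockNode(line, col) ? col : null;
}
```

For example, `- a: 1` and `- &a k: v` both report column 2, while `- x`, `- [a]` and `- *a` report `null`.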

`test/treesitter-yaml-bench.ts` measures how many VALID yaml-test-suite inputs the derived YAML tree-sitter parses with no ERROR/MISSING node ("valid" = the `yaml` package accepts the input, so a failure is the grammar's, not a malformed sample). Baseline: 209/312 = 67.0%, a real working tree-sitter for an indentation-sensitive grammar (the grammar previously did not even `generate`).

CI: yaml joins the "generate every derived grammar" conflict gate and gets a build-to-wasm step (its C indentation scanner must compile + link). The accuracy bench runs where the yaml-test-suite is already cloned (the readme-bench workflow), not in the conflict gate.

Refs #3
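The pass criterion and the reported ratio can be sketched as follows. This is a hypothetical model, not the contents of `test/treesitter-yaml-bench.ts`: `ToyNode` stands in for a tree-sitter node, and `hasErrorOrMissing` / `formatRate` are illustrative names.

```ts
// Hypothetical minimal tree shape; the real bench walks tree-sitter's own nodes.
interface ToyNode {
  type: string;
  isMissing?: boolean;
  children: ToyNode[];
}

// An input fails as soon as any node in its tree is an ERROR or a MISSING node.
function hasErrorOrMissing(node: ToyNode): boolean {
  if (node.type === "ERROR" || node.isMissing === true) return true;
  return node.children.some((child) => hasErrorOrMissing(child));
}

// Formats a score the way the figures in this PR are written.
function formatRate(passed: number, total: number): string {
  return `${passed}/${total} = ${((100 * passed) / total).toFixed(1)}%`;
}

const cleanTree: ToyNode = { type: "stream", children: [{ type: "plain", children: [] }] };
const brokenTree: ToyNode = { type: "stream", children: [{ type: "ERROR", children: [] }] };
const passedCount = [cleanTree, brokenTree].filter((t) => !hasErrorOrMissing(t)).length;
```

`formatRate(209, 312)` reproduces the baseline figure quoted above, and the same helper reproduces the final 305/312 score.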

…ent (issue #3)

The C scanner's `scan_scalar` always broke a plain run at `,` `[` `]` `{` `}`, but those are special only INSIDE a flow collection. In block context they are ordinary plain content (`a, b` is one scalar). So `a, b`, `k: a, b`, and multi-line flow (`[a,\n b]`) errored.

Fix: track `flow_depth` in the scanner. tree-sitter (0.26.x) RESTORES the pre-scan serialized scanner state before lexing an internal token, so a peek-then-`false` counter is rolled back. The flow brackets must therefore be emitted by the scanner as EXTERNAL tokens (a `true` return) where the depth change persists. `flowSyntheticTokens` synthesizes one external token per `indent.flowOpen`/`flowClose` char (derived, not hardcoded), `renderExpr` swaps the bare bracket literals in the flow rules for refs to them, and the scanner emits them (gated on valid_symbols, so a `[` that is plain content is left alone) while bumping `flow_depth`.

`scan_scalar`'s `,`/bracket/`:`/`-`/`?` boundary checks are now gated on `flow_depth > 0`; in block context they are content. Compact + block-scalar handling stay gated on `flow_depth == 0`. A flow-context leading-trivia skip (incl. newlines/comments) makes multi-line flow work. Verified against the `yaml` reference (`a:,b`, `a:[1,2]`, `a,b: c` are single block scalars/keys).

Bench: 209/312 → 226/312 (67.0% → 72.4%). The six other grammars stay byte-identical; tsc clean; generate + build --wasm succeed; gate:treesitter 96.0% (beats official 92.5%).

Refs #3
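The effect of gating the flow-indicator boundary on the depth counter can be shown with a small model. This TypeScript sketch is an assumption-based illustration (`plainRunEnd` is a hypothetical name): it gates only the `,`/bracket check on `flowDepth`, keeps the `: ` and ` #` boundaries unconditional, and omits the flow-specific `:`/`-`/`?` checks mentioned above.

```ts
const FLOW_INDICATOR_CHARS = ",[]{}";

// Returns the offset one past the last character of the plain run at `start`.
function plainRunEnd(src: string, start: number, flowDepth: number): number {
  let end = start;
  for (let i = start; i < src.length; i++) {
    const c = src[i];
    if (c === "\n") break;
    // flow indicators end a run only inside a flow collection
    if (flowDepth > 0 && FLOW_INDICATOR_CHARS.indexOf(c) >= 0) break;
    if (c === ":" && (i + 1 >= src.length || src[i + 1] === " ")) break;
    if (c === " " && src[i + 1] === "#") break;
    if (c !== " ") end = i + 1; // trailing blanks are trimmed
  }
  return end;
}

// Convenience wrapper for the examples.
function plainRunText(src: string, flowDepth: number): string {
  return src.slice(0, plainRunEnd(src, 0, flowDepth));
}
```

At depth 0 the input `a, b` is one scalar and `a:[1,2]` is one scalar, while at depth 1 the same `a, b` stops after `a`.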

A document that started with `---`/`...` then a body on the next line failed: the external scalar scanner's `-`/`.` lead ran the `---` into a plain/key token before the internal `doc_start` could match (and the marker token's separator look-ahead is stripped by the token DFA).

The scanner now probes for a document marker at column 0 (glyphs derived from `indent.blockScalar.documentMarkers`): a true sep-bounded marker → set a transient `marker_decline` + return false so the internal `---`/`...` token lexes it; a non-marker glyph (`---foo`) → claim it as plain content. The markers stay INTERNAL tokens (making them external perturbs the GLR tables and mis-lexes same-column block sequences).

Plus: `started` is set whenever the column > 0 (so the NEWLINE after a leading marker is emitted, not suppressed), and a document-root block scalar (stack depth 1, parent indent −1) may have a column-0 body, ending only at a column-0 marker.

Combined with the flow-depth fix, the bench jumps 72.4% → 94.2% (294/312 valid yaml-test-suite inputs ERROR-free). The two compound, since many inputs had both a `---` marker and flow/comma content. The six other grammars stay byte-identical (all gated on grammar.indent); tsc clean; generate + build --wasm succeed; gate:treesitter 96.0%; src-coverage-yaml parser alignment 100% (yaml.ts untouched, tree-sitter target only).

Refs #3
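The column-0 marker probe reduces to a three-way decision. The sketch below is a hypothetical TypeScript model (`probeDocumentMarker` and the hard-coded `DOCUMENT_MARKER_GLYPHS` are assumptions; the source derives the glyphs from the grammar), and it treats only a space, a tab, or the end of the line as the separator that bounds a marker.

```ts
// Assumed glyphs for this sketch; the generated scanner derives them from the grammar.
const DOCUMENT_MARKER_GLYPHS = ["---", "..."];

type MarkerProbe =
  | "marker" // decline, so the internal marker token lexes it
  | "plain"  // marker-looking glyph glued to content: claim it as plain
  | "none";  // the line does not start with a marker glyph at column 0

function probeDocumentMarker(line: string): MarkerProbe {
  for (const glyph of DOCUMENT_MARKER_GLYPHS) {
    if (!line.startsWith(glyph)) continue;
    const bounded =
      line.length === glyph.length ||
      line[glyph.length] === " " ||
      line[glyph.length] === "\t";
    return bounded ? "marker" : "plain";
  }
  return "none";
}
```

A line that is exactly `---` or starts with `--- ` is a marker, `---foo` is plain content, and an indented ` ---` is not probed at all because it does not sit at column 0.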

Inside a flow collection (`[ ]` / `{ }`) a plain scalar folds across line breaks: the break + surrounding whitespace collapse to one space and the run continues on the next line until a flow terminator. The scanner's `scan_scalar` broke a run unconditionally at any newline, so a flow key / value / explicit-key spanning lines lexed as two scalars and the GLR parser couldn't chain them (ERROR).

Now, at `flow_depth > 0` with content already scanned, a newline folds: advance past it + surrounding blank lines, stop at a flow terminator (`,`/brackets) / line-leading `#` / EOF, else append one folded space and continue (the next content char re-marks the token end). Block context is unchanged (its multi-line folding is separate indent/grammar machinery). Multi-line quoted scalars in flow already worked (the quoted token spans newlines natively).

Bench: 294/312 → 299/312 (94.2% → 95.8%). Six other grammars byte-identical (yaml-only, gated on grammar.indent); tsc clean; generate + build --wasm succeed; gate:treesitter 96.0%.

Refs #3
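The folding rule can be modelled as a string-building scan. This is a hedged TypeScript sketch rather than the C implementation: `scanFlowPlain` is a hypothetical name, it returns the folded text instead of marking a token end, and its `:` boundary check is a simplifying assumption.

```ts
const FLOW_TERMINATOR_CHARS = ",[]{}";

// Scans one plain scalar that starts at `start` inside a flow collection,
// folding each line break (plus surrounding blanks) into a single space.
function scanFlowPlain(src: string, start: number): string {
  let text = "";
  let i = start;
  while (i < src.length) {
    const c = src[i];
    if (FLOW_TERMINATOR_CHARS.indexOf(c) >= 0) break;
    if (c === ":" && (i + 1 >= src.length || " \n,[]{}".indexOf(src[i + 1]) >= 0)) break;
    if (c === " " && src[i + 1] === "#") break;
    if (c === "\n") {
      let j = i;
      // advance past the break and any surrounding blank lines
      while (j < src.length && " \t\n".indexOf(src[j]) >= 0) j++;
      // a terminator, a line-leading comment, or EOF ends the run instead of folding
      if (j >= src.length || FLOW_TERMINATOR_CHARS.indexOf(src[j]) >= 0 || src[j] === "#") break;
      text = text.trimEnd();
      if (text !== "") text += " "; // fold only when content was already scanned
      i = j;
      continue;
    }
    text += c;
    i++;
  }
  return text.trimEnd();
}
```

With this model the flow sequence `[a\n  b, c]` yields `a b` for its first entry, and a run whose next line starts with `]` ends at `a` without folding.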

A block mapping whose KEY is preceded by a node property (`&anchor` / `!tag` / `!!tag` / `!<verbatim>`) ERRORed: the scanner's compact-block detection keys off `at_line_lead` ("the line's first token"), but anchor/tag/alias are INTERNAL tokens tree-sitter lexes WITHOUT consulting the scanner, so after a property was lexed `at_line_lead` was still set and the following key was mis-treated as a compact-nested mapping → a spurious INDENT that corrupted the structure.

Fix: a transient `property_lead` field, latched at the genuine line lead (column == stack top, re-derived every boundary and for the first line) when the lead char is a property; the two compact-push sites skip a property-led line so its key stays at the node level. `property_lead` is NOT reset in deserialize: it is the one carry that must survive the property's internal lex (tree-sitter discards scanner mutations on a `false` return; only across a `true`-returned token does state persist). `yaml.ts` untouched: the grammar's BlockKey already had the production; the gap was the tree-sitter derivation. (yaml-test-suite ZH7C/74H7/E76Z/7FWL/HMQ5/2SXE.)

Combined with the flow folding, the bench is 95.8% → 97.8% (305/312). Six other grammars byte-identical; tsc clean; generate + build --wasm succeed; gate:treesitter 96.0%; agnostic 9/9; test:yaml-issues 10/10; scope-gap:yaml 100%; src-coverage-yaml 100%.

Refs #3

johnsoncodehk added 10 commits June 8, 2026 03:40

johnsoncodehk changed the title ~~Derive a working YAML tree-sitter target: generate, build, scan, bench (#3)~~ Derive a working YAML tree-sitter target: generate, build, scan, 97.8% (#3) Jun 8, 2026

johnsoncodehk merged commit 887308e into master Jun 8, 2026
2 checks passed

johnsoncodehk deleted the issue-3-yaml-treesitter branch June 8, 2026 01:08

johnsoncodehk merged 10 commits into master from issue-3-yaml-treesitter
