feat(ci): add parse/compile code-block lint for PRs #7136
Merged
jstirnaman merged 57 commits into master on Apr 27, 2026
PLAN_CODE_BLOCK_LINTING.md: design spec for parse/compile lint check. PLAN.md: 26-task implementation plan (9 phases) following TDD. Both will be scrubbed before merge to master per CLAUDE.local convention.
Also fix test:lint-codeblocks script to use a glob pattern; Node 25 treats bare directory args as module paths rather than test-file roots, so we need an explicit **/*.test.mjs glob.
…rules
Phase 1: parse as-is. If OK, return.
Phase 2 (on failure): apply placeholder substitution + Hugo shortcode strip. If retry succeeds, return OK with a notice listing which rules fired. If retry still fails, return phase-1 error positions so line numbers stay honest to source.
Per-language substitutions: bare-identifier for bash/python/js, quoted string for json/yaml/toml. Shortcodes replaced with no-op tokens of the right syntactic category per language.
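The two-phase flow can be sketched for JSON as follows. This is a minimal illustration, not the PR's code: the function names, the rule names, the shortcode pattern, and the explicit `placeholders` list are all assumptions.

```javascript
// Sketch only: names, rule labels, and patterns are illustrative assumptions.

// Phase helper: report success or the parser's error message.
function tryParse(code) {
  try {
    JSON.parse(code);
    return { ok: true };
  } catch (err) {
    return { ok: false, message: err.message };
  }
}

// Phase 2 normalization: swap Hugo shortcodes for a no-op JSON value and
// quote bare placeholder tokens. Records which rules actually fired.
function normalize(code, placeholders) {
  const rules = [];
  const fire = (rule) => {
    if (!rules.includes(rule)) rules.push(rule);
  };
  let out = code.replace(/\{\{[<%][\s\S]*?[%>]\}\}/g, () => {
    fire('hugo-shortcode');
    return '0';
  });
  for (const name of placeholders) {
    // Assumes identifier-like names; skips tokens that are already quoted.
    const bare = new RegExp(`(?<![\\w"])${name}(?![\\w"])`, 'g');
    out = out.replace(bare, () => {
      fire('placeholder');
      return `"${name}"`;
    });
  }
  return { code: out, rules };
}

function lintJson(code, placeholders = []) {
  const first = tryParse(code);
  if (first.ok) return { ok: true, rules: [] };

  const normalized = normalize(code, placeholders);
  if (tryParse(normalized.code).ok) {
    return { ok: true, rules: normalized.rules };
  }
  // Report the phase-1 error so positions refer to the unmodified source.
  return { ok: false, message: first.message, rules: [] };
}
```

Returning the phase-1 error on a double failure matters because substitution can change column offsets, and an error position computed on the rewritten text would no longer point at what the author wrote.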
Thin wrapper over existing getSourceFromFrontmatter that falls back to the file path itself when no source: frontmatter exists. Lets the linter treat shared-source files as the single place to report diagnostics across many consumer pages.
Resolves each input through resolveCanonicalSource and groups consumers by canonical path. When a shared-source diagnostic fires, the annotation includes a 'referenced by' list (up to 3 consumers, then 'and N more') so reviewers can find the relevant consumer pages. Also gracefully handles unreadable canonical sources — rather than crashing the whole run, logs the error and continues to the next file.
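A rough sketch of the grouping and the 'referenced by' suffix described above. Both helper names and the injected `resolve` callback are hypothetical stand-ins; the real orchestrator resolves paths through resolveCanonicalSource.

```javascript
// Sketch only: helper names and the injected resolver are assumptions.

// Group input files by the canonical source they resolve to.
// A file that is its own canonical source gets an empty consumer list.
function groupByCanonical(files, resolve) {
  const groups = new Map();
  for (const file of files) {
    const canonical = resolve(file);
    if (!groups.has(canonical)) groups.set(canonical, []);
    if (canonical !== file) groups.get(canonical).push(file);
  }
  return groups;
}

// Build the annotation suffix: list up to `limit` consumers, then a count.
function formatReferencedBy(consumers, limit = 3) {
  if (consumers.length === 0) return '';
  const shown = consumers.slice(0, limit).join(', ');
  const extra = consumers.length - limit;
  return extra > 0
    ? `referenced by: ${shown}, and ${extra} more`
    : `referenced by: ${shown}`;
}
```

Capping the list keeps annotations short for shared files with many consumers while still giving reviewers a few concrete pages to open.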
Writes to GITHUB_STEP_SUMMARY when set. Four tables: totals, errors, warnings, normalization applied. Normalization table doubles as an audit signal — over time we can prune rules that rarely fire.
…-tests # Conflicts: # package.json
Adds a 'canonical' array to the detect-test-products output, computed by resolving each changed file through resolveCanonicalSource. The lint-codeblocks job consumes this list so it runs once per underlying shared source rather than once per consumer page.
New PR-mode job that runs scripts/ci/lint-codeblocks.mjs against the canonical sources detect-changes computed. JSON/YAML/TOML parse errors fail the job; bash/python/js parse errors emit warning annotations. Consumes the canonical-sources output added to detect-changes so the linter runs once per underlying shared source rather than once per consumer page.
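The blocking policy could be expressed roughly as below. The diagnostic shape and function names are assumptions; the `::error` / `::warning` lines follow the GitHub Actions workflow-command format, with message escaping left out for brevity.

```javascript
// Sketch only: the diagnostic shape and helper names are assumptions.

// Languages whose parse errors fail the job; everything else only warns.
const BLOCKING = new Set(['json', 'jsonl', 'yaml', 'toml']);

// Render one diagnostic as a GitHub Actions annotation command.
function toAnnotation({ lang, file, line, message }) {
  const level = BLOCKING.has(lang) ? 'error' : 'warning';
  return `::${level} file=${file},line=${line}::${message}`;
}

// The job fails only if at least one blocking-language diagnostic exists.
function exitCodeFor(diagnostics) {
  return diagnostics.some((d) => BLOCKING.has(d.lang)) ? 1 : 0;
}
```

Splitting the policy this way lets bash, Python, and JavaScript blocks surface problems in the PR view without blocking merges on snippets that are intentionally partial.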
Vale Style Check Results
Warnings (41)
Showing first 20 of 41 warnings. ✅ Check passed
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…locks
Two related off-by-one / non-contiguous-mapping bugs in how validator line numbers were translated back to the Markdown file:
1. For a single fence, block.startLine is the opening fence line. The first content line is at startLine + 1, so the old formula `startLine + e.line - 1` was off by one (code line 1 -> fence line, not content line).
2. For continuation-joined blocks, the joined value is contiguous but the parts in the source file are not. Code lines from the second part were mapped as if they continued immediately after the first part's opening fence, which lands on the closing fence or the cont marker, not the actual content.
Track per-part content-line counts in the extractor (partContentLines alongside partLines) and add mapCodeLineToFileLine(block, codeLine) which walks the parts, accumulates their sizes, and offsets from the correct part's opening fence. Orchestrator now calls the mapper instead of doing its own arithmetic.
Addresses a Copilot review comment on #7136.
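The mapping walk might look like the sketch below. It assumes `partLines[i]` holds the 1-based file line of part i's opening fence and `partContentLines[i]` the number of content lines in that part; the commit names these fields but does not spell out their exact shape, and the clamping at the end is an added assumption.

```javascript
// Sketch only: the block shape and the clamping behavior are assumptions.
function mapCodeLineToFileLine(block, codeLine) {
  let consumed = 0; // content lines belonging to earlier parts
  for (let i = 0; i < block.partLines.length; i++) {
    const size = block.partContentLines[i];
    if (codeLine <= consumed + size) {
      // Offset within this part, counted from its opening fence.
      return block.partLines[i] + (codeLine - consumed);
    }
    consumed += size;
  }
  // Out-of-range code line: clamp to the last content line of the last part.
  const last = block.partLines.length - 1;
  return block.partLines[last] + block.partContentLines[last];
}

// One fence opening on file line 10: code line 1 is file line 11.
console.log(mapCodeLineToFileLine({ partLines: [10], partContentLines: [3] }, 1)); // 11

// Two joined parts (fences on lines 10 and 20): code line 4 is the first
// content line of the second part, file line 21.
console.log(
  mapCodeLineToFileLine({ partLines: [10, 20], partContentLines: [3, 2] }, 4)
); // 21
```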
…eadable sources
- quotedSub(): match "TOKEN" (already-quoted) or bare TOKEN with one pattern, replacing both with "TOKEN_ci". Prevents ""TOKEN_ci"" when a placeholder appears inside an existing JSON/YAML/TOML string.
- SHORTCODE_REPLACEMENT for json/jsonl/yaml/toml: change from '"__SHORTCODE__"' to '0' (bare integer). Valid as a JSON/YAML/TOML value and safe inside existing strings; avoids ""0"/api"-style double-quoting when a shortcode appears inside a quoted value.
- lint-codeblocks: emit ::warning:: when a canonical source is unreadable so the job doesn't silently skip files.
- workflow: update stale comment to point at DOCS-TESTING.md instead of the scrubbed PLAN_CODE_BLOCK_LINTING.md.
- DOCS-TESTING.md: document that | is a literal delimiter in placeholders=""; regex-style groupings are not supported.
- Add 3 regression tests for the double-quote cases.
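A sketch of the single-pattern idea behind quotedSub(). The signature and the escaping helper are assumptions; only the behavior (quoted and bare tokens both become "TOKEN_ci") comes from the commit message.

```javascript
// Sketch only: the signature is an assumption.
function quotedSub(code, token) {
  // Escape regex metacharacters so the token is matched literally.
  const escaped = token.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Try the already-quoted form first, then the bare form, in one pattern.
  // Consuming the existing quotes is what prevents ""TOKEN_ci"".
  const pattern = new RegExp(`"${escaped}"|\\b${escaped}\\b`, 'g');
  return code.replace(pattern, () => `"${token}_ci"`);
}

console.log(quotedSub('{"db": DATABASE_NAME}', 'DATABASE_NAME'));
// {"db": "DATABASE_NAME_ci"}
console.log(quotedSub('{"db": "DATABASE_NAME"}', 'DATABASE_NAME'));
// {"db": "DATABASE_NAME_ci"}
```

Putting the quoted alternative first means the regex engine prefers it wherever both could match, so the surrounding quotes are replaced along with the token rather than doubled.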
…utput skip; Node 22; frontmatter comments
- stripHtmlComments(): replace each comment with the same number of \n it contained instead of removing it entirely. Removes the replace(/^\n+/, '') + trimEnd() that shifted line offsets and broke mapCodeLineToFileLine line mapping.
- expected-output skipNext: only skip the next fence when rawLang is null (unlabeled). Expected-output fences are always unlabeled; previously any labeled fence after the marker could be silently dropped.
- Standardize both detect-changes and suggest-tests jobs on Node 22 (was 20) to match lint-codeblocks.
- getSourceFromFrontmatter(): allow trailing inline YAML comments (source: /shared/x.md # note) without breaking canonicalization.
- Update existing html-comments test to assert line-preserving behavior; add 2 new regression tests.
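The line-preserving replacement from this commit can be sketched as follows. The regex is an assumption about how comments are matched; the point illustrated is that every newline inside a comment survives, so later line numbers do not shift.

```javascript
// Sketch only: the comment-matching regex is an assumption.
function stripHtmlComments(text) {
  return text.replace(/<!--[\s\S]*?-->/g, (comment) => {
    // Keep one "\n" per newline the comment contained.
    const newlines = (comment.match(/\n/g) || []).length;
    return '\n'.repeat(newlines);
  });
}

const before = 'intro\n<!-- a note\nspanning lines -->\nafter';
const after = stripHtmlComments(before);
// "after" sits on the same line index in both versions.
console.log(before.split('\n').indexOf('after'), after.split('\n').indexOf('after')); // 3 3
```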
Add remark-frontmatter so the parser treats the ---…--- block as a leaf yaml node rather than walking it as markdown. Without this, a description: | scalar containing ```json fences would be extracted and linted, producing false positives on valid doc pages.
…l-source dedupe
Consumer pages that declare source: frontmatter are pure stubs with no body fences of their own. Linting only the shared canonical source is therefore sufficient; no consumer fences are silently skipped. Add a comment in resolveCanonicalSource() and the dedup loop calling out this invariant so future drift is obvious.
- frontmatter regex: require closing --- on its own line (followed by newline or EOF) so a literal --- inside a YAML value doesn't prematurely terminate the block.
- findPagesReferencingSharedContent: switch to grep -rFl (fixed-string) with two -e patterns covering source: /shared/... and source: shared/... so path characters like dots aren't treated as regex metacharacters.
Add yarn lint-codeblocks as section 3 in Part 2: Testing, before the existing code block execution testing. Includes command, blocking policy table, normalization note, and cross-reference to DOCS-TESTING.md. Update Quick Decision Tree and quick-reference table. Renumber sections 3-6 → 4-7 to accommodate the new section.
- javascript.mjs: wrap mkdtempSync/writeFileSync in try/catch so FS
failures return {ok:false} instead of crashing the whole lint run;
make rmSync cleanup best-effort
- detect-test-products.js: skip canonical sources that no longer exist
(deleted/renamed files) to avoid spurious ::warning:: noise in CI
- codeblock-extractor.mjs: gate expected-output skip on proximity (≤3
lines from marker) so a misplaced marker can't silently eat an
unrelated unlabeled fence elsewhere in the file
- Add fixture + test for the distant-marker case
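The expected-output gate in codeblock-extractor.mjs could be approximated like this. The fence shape (`startLine`, `rawLang`) and the function name are assumptions; the two conditions (unlabeled fence, at most 3 lines after the marker) come from the commit messages.

```javascript
// Sketch only: the fence shape and helper name are assumptions.
const MAX_MARKER_DISTANCE = 3;

// Decide whether a fence should be skipped as expected output.
// markerLine is the file line of the most recent expected-output marker,
// or null when none has been seen.
function isExpectedOutputFence(fence, markerLine) {
  if (markerLine === null) return false;
  // Expected-output fences are always unlabeled.
  if (fence.rawLang !== null) return false;
  const distance = fence.startLine - markerLine;
  return distance > 0 && distance <= MAX_MARKER_DISTANCE;
}
```

With the distance check in place, a marker left behind far above an unrelated unlabeled fence no longer causes that fence to be dropped from linting.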
…frontmatter only
grep -rFl matches source: anywhere in a file, so prose mentions like "See `source: /shared/x.md`" would incorrectly count as consumers. Post-filter candidates through getSourceFromFrontmatter, which parses only the frontmatter block, so prose and code-example matches are discarded.
Also adds a unit test asserting the filter distinguishes a real consumer (source: in frontmatter) from a prose mention.
- package.json: bump engines.node to >=22.0.0 (node --test and CI already require Node 22; >=16 was misleading)
- lint-codeblocks.mjs: compute consumer attribution once per file (was once per diagnostic; shared files with multiple failures triggered a repo-wide grep on every error)
- content-utils.js: add searchRoot option to findPagesReferencingSharedContent so callers can inject the search directory; switch from execSync with a shell string to execFileSync with an explicit args array to avoid shell-quoting issues with unusual filenames
- canonical.test.mjs: replace misleading test that only called getSourceFromFrontmatter with a real test that calls findPagesReferencingSharedContent end-to-end, exercising both the grep and the prose-mention post-filter
getSourceFromFrontmatter() was prepending 'content' to any path starting
with '/', so source: /content/shared/x became content/content/shared/x
(nonexistent). Three real consumer pages use this form in-tree:
content/influxdb3/{clustered,cloud-serverless,cloud-dedicated}/
process-data/downsample/quix.md
Fix: strip the leading slash and treat the remainder as repo-relative,
only prepending content/ when it isn't already there.
Also add a third -e pattern to findPagesReferencingSharedContent() so
pages using source: /content/shared/... are included in consumer
attribution, not just the /shared/... and shared/... forms.
Add a regression test that asserts the result never starts with
content/content/.
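The path fix reduces to a small normalization step. The helper name below is hypothetical; in the PR this logic lives inside getSourceFromFrontmatter().

```javascript
// Sketch only: the helper name is an assumption.
function toRepoRelative(source) {
  // Strip the leading slash and treat the remainder as repo-relative.
  const trimmed = source.replace(/^\/+/, '');
  // Prepend content/ only when it is not already there.
  return trimmed.startsWith('content/') ? trimmed : `content/${trimmed}`;
}

console.log(toRepoRelative('/shared/x.md')); // content/shared/x.md
console.log(toRepoRelative('/content/shared/x.md')); // content/shared/x.md
```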
…form
Adds .ci/scripts/check-source-paths.sh to enforce that source: values always start with /shared/. Fixes 14 existing violations:
- source: /content/shared/... → source: /shared/... (3 files)
- source: content/shared/... → source: /shared/... (9 files)
- source: shared/... → source: /shared/... (2 files)
Wires the check into lefthook (pre-commit, staged files only) and pr-link-check.yml (CI, changed files only) so the variant forms can't re-enter the repo.
- check-source-paths.sh: strip inline YAML comments (# ...) before validating source: values so lines like `source: /shared/foo.md # link` pass the check correctly
- detect-test-products.js: remove existsSync guard from canonical set accumulation; missing canonical paths (broken source: pointers, deleted files) must reach lint-codeblocks.mjs so it can emit ::warning:: annotations; silently dropping them here defeated that signal
- content-editing SKILL.md: correct three examples that still showed the non-canonical /content/shared/... prefix; examples now use /shared/
…ine comments
Strip inline YAML comments before quote-stripping. Previously, a value like `source: "/shared/foo.md" # comment` would extract as `/shared/foo.md"` (trailing quote retained) because the quote-strip regex only matches at start/end and the trailing char was part of the comment text. The bash glob check still passed today, but the extracted value was malformed and would break any future exact-match logic (existence checks, etc.).
New order: strip leading `source:`, strip trailing comment, trim trailing whitespace, then strip surrounding quotes.
Address two robustness gaps:
1. Inline comment stripping wasn't YAML-aware: a # inside a quoted value (e.g. `source: "/shared/has#hash.md"`) would be truncated incorrectly.
2. Quote stripping was permissive: `gsub(/^["']|["']$/, "")` stripped either end independently, which would silently "fix" malformed frontmatter like a missing closing quote and let it pass.
New extraction matches one of three shapes:
"value" [# comment] → keep inner text verbatim
'value' [# comment] → keep inner text verbatim
value [# comment] → strip trailing comment
Surrounding quotes are only stripped when both ends match. Malformed quoting now reaches the canonical-form check intact and surfaces as a violation.
…rvation
Address two robustness findings consistent with the stricter behavior
already in .ci/scripts/check-source-paths.sh:
1. getSourceFromFrontmatter previously stripped quotes independently
(`["']?...["']?`), so malformed frontmatter like
`source: "/shared/foo.md` (missing close quote) was silently
"repaired" and canonicalized. Match shape now requires both ends
to match — three explicit alternatives:
"value" [# note] → group 1 (any chars including #)
'value' [# note] → group 2 (any chars including #)
value [# note] → group 3 (no whitespace or # in value)
Malformed quoting falls through to null.
2. stripHtmlComments was global, which would silently alter inline
`<!-- ... -->` literals inside code (HTML/Markdown fences). Now
restricted to whole-line comments — opening `<!--` and closing
`-->` must each sit on their own line (with optional surrounding
whitespace). Pytest directives like `<!--pytest.mark.skip-->` are
unaffected since they are always written on their own line in this
repo.
Adds three regression tests (55/55 pass).
…contract test
1. getSourceFromFrontmatter now allows # mid-value in unquoted plain scalars. YAML treats # as a comment delimiter only when preceded by whitespace, so foo#bar is a literal value while foo #bar has trailing comment #bar. The previous regex disallowed # entirely in unquoted values, which would return null for valid YAML like `source: /shared/has#hash.md` while the bash check accepted it.
2. check-source-paths.sh: replace mapfile (bash 4+) with a NUL-delimited while-read loop so the script runs under macOS system /bin/bash 3.2. Verified: 'GNU bash, version 3.2.57' executes the no-args full-repo scan cleanly.
3. Add a contract test that runs both the bash script and the JS getSourceFromFrontmatter against a shared fixture set, asserting they agree on (a) parsing valid YAML scalar shapes and (b) rejecting malformed quoting. By design they diverge on canonical-form enforcement (bash is the strict CI gate, JS the forgiving runtime normalizer); the contract covers only the YAML-parsing concerns where drift would create inconsistent behavior.
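One way to express the three value shapes, including the # handling for plain scalars, is a single regex with three alternatives. This is a hypothetical reconstruction rather than the regex in the PR; `extractSourceValue` and its input (the already-extracted frontmatter text) are assumptions.

```javascript
// Sketch only: the regex and the helper name are assumptions.
// Group 1: "value"   (any characters except ", including #)
// Group 2: 'value'   (any characters except ', including #)
// Group 3: value     (plain scalar: no whitespace; # allowed mid-value)
// A trailing comment is optional and is discarded.
const SOURCE_LINE =
  /^source:[ \t]*(?:"([^"]*)"|'([^']*)'|([^\s"'#]\S*))[ \t]*(?:#.*)?$/m;

function extractSourceValue(frontmatter) {
  const m = SOURCE_LINE.exec(frontmatter);
  if (!m) return null; // includes malformed quoting such as a missing close quote
  return m[1] ?? m[2] ?? m[3];
}

console.log(extractSourceValue('source: /shared/foo.md # note')); // /shared/foo.md
console.log(extractSourceValue('source: /shared/has#hash.md')); // /shared/has#hash.md
console.log(extractSourceValue('source: "/shared/foo.md')); // null
```

Because the plain-scalar alternative cannot contain whitespace, `foo #bar` stops at `foo` and the rest is read as a comment, while `foo#bar` is consumed whole.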
…oted source: in consumer grep
1. Bash awk previously required whitespace before # after a closing quote, so `source: "/shared/foo.md"#note` parsed as malformed. js-yaml and the JS regex both accept no-space-before-# after a closing quote (the lenient YAML interpretation), so the bash check would falsely flag valid YAML the runtime accepts. Aligned by changing [[:space:]]+ to [[:space:]]* in both quoted branches; the unquoted branch keeps the strict whitespace requirement (foo#bar is a literal plain scalar, foo #bar has trailing comment).
2. findPagesReferencingSharedContent grep prefilter only matched unquoted source: values. If a page used source: "/shared/foo.md" or source: '/shared/foo.md', the consumer was silently dropped from shared-content expansion. Added quoted variants for all three known path shapes (/shared/, shared/, /content/shared/). The post-filter via getSourceFromFrontmatter remains the source of truth; grep just needs to surface the candidate. No content pages currently use quoted form; this closes the gap before it bites.
Adds: regression test for quoted-consumer grep, two new no-space-# cases in the bash↔JS contract test.
…s on frontmatter delimiters
1. package.json now requires Node >=22.0.0 and test.yml uses Node 22,
but several workflows still pinned Node 18 or 20:
pr-link-check, pr-render-check, pr-feedback-links, pr-preview,
sync-plugins, audit-documentation, prepare-release, and the
uncommented Node steps in influxdb3-release.
Workflows on older Node would fail or warn against the engines
field, and any future syntax adoption (top-level await in scripts,
etc.) silently breaks on the older runners. Bumped all active
node-version values to '22'. Commented-out '18' references in
influxdb3-release.yml left as-is — they are dead code already.
2. The bash hook (.ci/scripts/check-source-paths.sh) accepts trailing
whitespace on --- delimiters via /^---[[:space:]]*$/, but the JS
regex required the delimiter on its own line with no trailing
whitespace. A file with a stray space after --- would pass the
pre-commit hook but fail JS canonical-source extraction, dropping
the consumer from shared-content expansion silently. Relaxed the JS
regex to /^---[ \t]*\r?\n.../ on both delimiters.
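A frontmatter matcher with the relaxed delimiters might look like the following. The commit elides the middle of the real regex, so everything past the opening `^---[ \t]*\r?\n` here is an assumption, as is the helper name.

```javascript
// Sketch only: the full pattern and the helper name are assumptions.
// Both delimiters tolerate trailing spaces or tabs, and the closing ---
// must sit on its own line, followed by a newline or end of file.
const FRONTMATTER = /^---[ \t]*\r?\n([\s\S]*?)\r?\n---[ \t]*(?:\r?\n|$)/;

function extractFrontmatter(text) {
  const m = FRONTMATTER.exec(text);
  return m ? m[1] : null;
}

// A stray space after the opening delimiter and a literal --- inside a
// value are both handled.
console.log(extractFrontmatter('--- \ntitle: a --- b\n---\t\nbody')); // title: a --- b
```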
Summary
Adds a fast, always-on PR check that parses code blocks in changed docs and fails on syntax errors in safe languages (JSON/YAML/TOML) while emitting warnings for risky ones (bash/python/javascript). Complements the existing pytest-based code-block execution — this check is purely static, needs no credentials, and runs in seconds.
Reference: DOCS-TESTING.md § "Parse/compile code-block lint"
What it does
Blocking policy
SQL and InfluxQL are planned follow-ups once dialect detection is designed.
Architecture
Single new dependency: `@iarna/toml`. Others (`js-yaml`, `remark`, `remark-parse`, `p-limit`) already in `package.json`.
Testing
Test plan
Rollout follow-ups (separate PRs)