
feat(ci): add parse/compile code-block lint for PRs#7136

Merged
jstirnaman merged 57 commits into master from worktree-code-block-tests
Apr 27, 2026

Conversation


@jstirnaman jstirnaman commented Apr 23, 2026

Summary

Adds a fast, always-on PR check that parses code blocks in changed docs and fails on syntax errors in safe languages (JSON/YAML/TOML) while emitting warnings for risky ones (bash/python/javascript). Complements the existing pytest-based code-block execution — this check is purely static, needs no credentials, and runs in seconds.

Reference: DOCS-TESTING.md § "Parse/compile code-block lint"

What it does

  • Hooks into the existing `Test Code Blocks` workflow as a new `lint-codeblocks` job
  • Resolves `source:` frontmatter so shared content is linted once, not per consumer
  • Emits `::error::` (JSON/YAML/TOML), `::warning::` (bash/python/js), and `::notice::` (normalization applied) annotations
  • Writes a Markdown step summary with error/warning/normalization tables

Blocking policy

| Language | Policy on parse failure |
| --- | --- |
| JSON, YAML, TOML | `::error::` (fails the job) |
| bash, python, javascript | `::warning::` (does not fail the job) |

SQL and InfluxQL are planned follow-ups once dialect detection is designed.
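
The blocking policy can be sketched as a small lookup from language to annotation level. This is an illustrative sketch, not the PR's code: `severityFor`, `formatAnnotation`, and `exitCodeFor` are hypothetical names, and the `file=`/`line=` parameters follow the general GitHub Actions workflow-command shape.

```javascript
// Hypothetical sketch of the blocking policy; names are illustrative only.
const BLOCKING = new Set(['json', 'yaml', 'toml']);
const WARN_ONLY = new Set(['bash', 'python', 'javascript']);

// Returns the annotation level for a parse failure, or null if the
// language is not linted at all.
function severityFor(lang) {
  if (BLOCKING.has(lang)) return 'error';
  if (WARN_ONLY.has(lang)) return 'warning';
  return null;
}

// Formats one diagnostic as a GitHub Actions workflow command.
function formatAnnotation(lang, file, line, message) {
  const level = severityFor(lang);
  if (level === null) return null;
  return `::${level} file=${file},line=${line}::${message}`;
}

// The job fails only when at least one diagnostic is error-level.
function exitCodeFor(diagnostics) {
  return diagnostics.some((d) => severityFor(d.lang) === 'error') ? 1 : 0;
}
```

Under this sketch, graduating a language from warning to blocking would amount to moving it from one set to the other.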

Architecture

  • `scripts/ci/lint-codeblocks.mjs` — orchestrator (dedup, concurrency, reporting)
  • `scripts/lib/codeblock-extractor.mjs` — remark AST walk with alias normalization, attribute parsing, HTML-comment stripping, continuation join, expected-output skip
  • `scripts/lib/codeblock-normalizer.mjs` — hybrid two-phase (phase 1 raw; phase 2 placeholder substitution + Hugo shortcode strip); phase-1 error positions on fail
  • `scripts/lib/codeblock-validators/*.mjs` — one per language, pure `{ok, errors}` interface
  • `scripts/lib/content-utils.js` — new `resolveCanonicalSource` helper
  • `scripts/ci/detect-test-products.js` — adds `canonical` to output
  • `.github/workflows/test.yml` — adds `lint-codeblocks` job and `canonical-sources` output

Single new dependency: `@iarna/toml`. Others (`js-yaml`, `remark`, `remark-parse`, `p-limit`) already in `package.json`.
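
As a rough illustration of the pure `{ok, errors}` validator interface listed above, a JSON validator might look like the following. The function name and the shape of each error entry (`line`, `message`) are assumptions; the line lookup relies on V8 sometimes including `position N` in its error message, so it is best effort only.

```javascript
// Hypothetical JSON validator following the {ok, errors} shape.
function validateJson(code) {
  try {
    JSON.parse(code);
    return { ok: true, errors: [] };
  } catch (err) {
    // Some V8 messages include "position N"; when present, convert the
    // character offset to a 1-based line number. Otherwise leave it null.
    const match = /position (\d+)/.exec(err.message);
    const line = match
      ? code.slice(0, Number(match[1])).split('\n').length
      : null;
    return { ok: false, errors: [{ line, message: err.message }] };
  }
}
```

Because the function has no side effects, the orchestrator can call it concurrently and decide separately how to report each result.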

Testing

  • 38/38 unit tests pass (`yarn test:lint-codeblocks`)
  • Smoke tested on `content/influxdb3/core/admin/tokens/admin/*.md`: 25 blocks resolved to 3 shared sources, all passed, exit 0

Test plan

  • Watch the first PR run on this branch — confirm `lint-codeblocks` appears as a job and is green
  • Open a follow-up PR that intentionally breaks a JSON block → confirm job fails with annotation pointing at the right line
  • Open a follow-up PR that intentionally breaks a bash block → confirm job passes with warning annotation
  • Touch a shared content file with a bad JSON fence → confirm annotation includes "referenced by N pages" attribution
  • After 2 weeks of PR data, review the Normalization audit tables in step summaries — prune rules that never fire

Rollout follow-ups (separate PRs)

  • Graduate bash, then python, then javascript from warning to blocking once false-positive rates are known
  • v2: add SQL + InfluxQL validators (needs dialect detection)
  • Optional nightly full-repo scan

PLAN_CODE_BLOCK_LINTING.md: design spec for parse/compile lint check.
PLAN.md: 26-task implementation plan (9 phases) following TDD.

Both will be scrubbed before merge to master per CLAUDE.local convention.
Also fix test:lint-codeblocks script to use a glob pattern; Node 25
treats bare directory args as module paths rather than test-file
roots, so we need an explicit **/*.test.mjs glob.
…rules

Phase 1: parse as-is. If OK, return.
Phase 2 (on failure): apply placeholder substitution + Hugo shortcode
strip. If retry succeeds, return OK with a notice listing which rules
fired. If retry still fails, return phase-1 error positions so line
numbers stay honest to source.

Per-language substitutions: bare-identifier for bash/python/js, quoted
string for json/yaml/toml. Shortcodes replaced with no-op tokens of
the right syntactic category per language.
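
The two-phase flow described above can be sketched as follows. This is a minimal sketch under stated assumptions: `lintTwoPhase`, the `rules` array shape (`name`, `apply`), and the `normalized` field are hypothetical, and the example rule is a stand-in for the real placeholder substitution.

```javascript
// Hypothetical sketch of the two-phase flow; `validate` and `rules` are
// supplied by the caller.
function lintTwoPhase(code, validate, rules) {
  // Phase 1: parse the block exactly as written.
  const raw = validate(code);
  if (raw.ok) return { ok: true, errors: [], normalized: [] };

  // Phase 2: apply each rule and record the ones that changed the text.
  let text = code;
  const fired = [];
  for (const rule of rules) {
    const next = rule.apply(text);
    if (next !== text) fired.push(rule.name);
    text = next;
  }
  const retry = validate(text);
  if (retry.ok) return { ok: true, errors: [], normalized: fired };

  // Still failing: report phase-1 positions, which refer to the
  // unmodified source text.
  return { ok: false, errors: raw.errors, normalized: [] };
}

// Example usage with an illustrative JSON validator and one placeholder rule.
const jsonValidate = (s) => {
  try {
    JSON.parse(s);
    return { ok: true, errors: [] };
  } catch (e) {
    return { ok: false, errors: [{ message: e.message }] };
  }
};
const rules = [
  {
    name: 'placeholder:API_TOKEN',
    apply: (s) => s.replace(/\bAPI_TOKEN\b/g, '"API_TOKEN_ci"'),
  },
];
const result = lintTwoPhase('{"token": API_TOKEN}', jsonValidate, rules);
```

Here `result.ok` is true and `result.normalized` lists the one rule that fired, which is the information a notice annotation would carry.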
Thin wrapper over existing getSourceFromFrontmatter that falls back to
the file path itself when no source: frontmatter exists. Lets the
linter treat shared-source files as the single place to report
diagnostics across many consumer pages.

Resolves each input through resolveCanonicalSource and groups consumers
by canonical path. When a shared-source diagnostic fires, the
annotation includes a 'referenced by' list (up to 3 consumers, then
'and N more') so reviewers can find the relevant consumer pages.

Also gracefully handles unreadable canonical sources — rather than
crashing the whole run, logs the error and continues to the next file.
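
A minimal sketch of the canonical-source fallback, the grouping step, and the "referenced by" attribution. The `getSource` parameter stands in for `getSourceFromFrontmatter`; `resolveCanonical`, `groupByCanonical`, `referencedBy`, and the exact wording of the attribution string are assumptions.

```javascript
// Hypothetical sketch; `getSource` stands in for getSourceFromFrontmatter.
function resolveCanonical(filePath, getSource) {
  // Fall back to the file itself when it declares no source: frontmatter.
  return getSource(filePath) ?? filePath;
}

// Groups input files by canonical path, keeping the list of consumers.
function groupByCanonical(files, getSource) {
  const groups = new Map();
  for (const file of files) {
    const canonical = resolveCanonical(file, getSource);
    if (!groups.has(canonical)) groups.set(canonical, []);
    // A file that is its own canonical source is not a consumer.
    if (canonical !== file) groups.get(canonical).push(file);
  }
  return groups;
}

// Formats the attribution: up to `max` consumers, then "and N more".
function referencedBy(consumers, max = 3) {
  if (consumers.length === 0) return '';
  const shown = consumers.slice(0, max).join(', ');
  const rest = consumers.length - max;
  return rest > 0
    ? `referenced by ${shown} and ${rest} more`
    : `referenced by ${shown}`;
}
```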
Writes to GITHUB_STEP_SUMMARY when set. Four tables: totals, errors,
warnings, normalization applied. Normalization table doubles as an
audit signal — over time we can prune rules that rarely fire.
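
A sketch of how one of the summary tables might be rendered. The column names, `renderTable`, and `buildSummary` are assumptions; the sketch only builds the Markdown text and leaves appending it to the file named by `GITHUB_STEP_SUMMARY` to the caller.

```javascript
// Hypothetical renderer for one summary table; column names are assumptions.
function renderTable(title, rows) {
  if (rows.length === 0) return `### ${title}\n\nNone.\n`;
  const lines = [
    `### ${title}`,
    '',
    '| File | Line | Message |',
    '| --- | --- | --- |',
    // Escape pipes so a message cannot break the table layout.
    ...rows.map((r) => `| ${r.file} | ${r.line} | ${r.message.replace(/\|/g, '\\|')} |`),
  ];
  return lines.join('\n') + '\n';
}

// Returns the summary text, or null when GITHUB_STEP_SUMMARY is unset
// (for example, on a local run).
function buildSummary(env, errors, warnings) {
  if (!env.GITHUB_STEP_SUMMARY) return null;
  return [renderTable('Errors', errors), renderTable('Warnings', warnings)].join('\n');
}
```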
Adds a 'canonical' array to the detect-test-products output, computed
by resolving each changed file through resolveCanonicalSource. The
lint-codeblocks job consumes this list so it runs once per underlying
shared source rather than once per consumer page.

New PR-mode job that runs scripts/ci/lint-codeblocks.mjs against the
canonical sources detect-changes computed. JSON/YAML/TOML parse errors
fail the job; bash/python/js parse errors emit warning annotations.

Consumes the canonical-sources output added to detect-changes so the
linter runs once per underlying shared source rather than once per
consumer page.
@jstirnaman jstirnaman requested a review from a team as a code owner April 23, 2026 20:39
@jstirnaman jstirnaman requested review from sanderson and removed request for a team April 23, 2026 20:39

github-actions Bot commented Apr 23, 2026

Vale Style Check Results

| Metric | Count |
| --- | --- |
| Errors | 0 |
| Warnings | 41 |
| Suggestions | 45 |
Warnings (41)
| File | Line | Rule | Message |
| --- | --- | --- | --- |
| .claude/skills/content-editing/SKILL.md | 18 | write-good.TooWordy | 'multiple' is too wordy. |
| .claude/skills/content-editing/SKILL.md | 73 | write-good.Passive | 'is Shared' may be passive voice. Use active voice if you can. |
| .claude/skills/content-editing/SKILL.md | 75 | write-good.TooWordy | 'multiple' is too wordy. |
| .claude/skills/content-editing/SKILL.md | 163 | write-good.TooWordy | 'equivalent' is too wordy. |
| .claude/skills/content-editing/SKILL.md | 177 | write-good.TooWordy | 'validate' is too wordy. |
| .claude/skills/content-editing/SKILL.md | 223 | write-good.Passive | 'are substituted' may be passive voice. Use active voice if you can. |
| .claude/skills/content-editing/SKILL.md | 318 | write-good.Passive | 'be fixed' may be passive voice. Use active voice if you can. |
| .claude/skills/content-editing/SKILL.md | 384 | write-good.Passive | 'is hosted' may be passive voice. Use active voice if you can. |
| .claude/skills/content-editing/SKILL.md | 558 | write-good.Passive | 'is set' may be passive voice. Use active voice if you can. |
| DOCS-TESTING.md | 16 | write-good.TooWordy | 'Validate' is too wordy. |
| DOCS-TESTING.md | 76 | write-good.Passive | 'be deleted' may be passive voice. Use active voice if you can. |
| DOCS-TESTING.md | 77 | write-good.Passive | 'is configured' may be passive voice. Use active voice if you can. |
| DOCS-TESTING.md | 129 | write-good.Weasel | 'Several' is a weasel word! |
| DOCS-TESTING.md | 224 | write-good.Passive | 'is skipped' may be passive voice. Use active voice if you can. |
| DOCS-TESTING.md | 230 | write-good.TooWordy | 'therefore' is too wordy. |
| DOCS-TESTING.md | 237 | write-good.Passive | 'is stored' may be passive voice. Use active voice if you can. |
| DOCS-TESTING.md | 293 | write-good.TooWordy | 'expiration' is too wordy. |
| DOCS-TESTING.md | 316 | write-good.Passive | 'are skipped' may be passive voice. Use active voice if you can. |
| DOCS-TESTING.md | 317 | write-good.Passive | 'are planned' may be passive voice. Use active voice if you can. |
| DOCS-TESTING.md | 318 | write-good.Passive | 'is designed' may be passive voice. Use active voice if you can. |

Showing first 20 of 41 warnings.


Check passed

@jstirnaman jstirnaman requested a review from Copilot April 23, 2026 20:57

Copilot AI left a comment

Pull request overview

Copilot reviewed 36 out of 37 changed files in this pull request and generated 2 comments.

Comment thread .github/workflows/test.yml Outdated
Comment thread scripts/ci/lint-codeblocks.mjs
jstirnaman and others added 15 commits April 24, 2026 09:50
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…locks

Two related off-by-one / non-contiguous-mapping bugs in how validator
line numbers were translated back to the Markdown file:

1. For a single fence, block.startLine is the opening fence line. The
   first content line is at startLine + 1, so the old formula
   `startLine + e.line - 1` was off by one (code line 1 -> fence line,
   not content line).

2. For continuation-joined blocks, the joined value is contiguous but
   the parts in the source file are not. Code lines from the second
   part were mapped as if they continued immediately after the first
   part's opening fence, which lands on the closing fence or the
   continuation marker, not the actual content.

Track per-part content-line counts in the extractor (partContentLines
alongside partLines) and add mapCodeLineToFileLine(block, codeLine)
which walks the parts, accumulates their sizes, and offsets from the
correct part's opening fence.

Orchestrator now calls the mapper instead of doing its own arithmetic.

Addresses Copilot review comment
#7136 (comment)
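
The mapping described in this commit can be sketched as below. It assumes `partLines[i]` holds the file line of part `i`'s opening fence, `partContentLines[i]` holds that part's content-line count, and parts are joined so the combined code is contiguous; the `null` return for out-of-range lines is an assumption of this sketch.

```javascript
// Sketch of mapCodeLineToFileLine under the assumptions stated above.
function mapCodeLineToFileLine(block, codeLine) {
  let consumed = 0; // content lines belonging to earlier parts
  for (let i = 0; i < block.partLines.length; i++) {
    const size = block.partContentLines[i];
    if (codeLine <= consumed + size) {
      // The first content line sits one line below the opening fence.
      return block.partLines[i] + (codeLine - consumed);
    }
    consumed += size;
  }
  return null; // codeLine is past the end of the joined block
}

// Single fence opening on line 10 with three content lines.
const single = { partLines: [10], partContentLines: [3] };
// Two parts: fence on line 10 (two lines) continued by a fence on line 20.
const joined = { partLines: [10, 20], partContentLines: [2, 3] };
```

With these inputs, code line 1 of `single` maps to file line 11 rather than the fence line, and code line 3 of `joined` maps to file line 21, the first content line of the second part.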
…eadable sources

- quotedSub(): match "TOKEN" (already-quoted) or bare TOKEN with one
  pattern, replacing both with "TOKEN_ci". Prevents ""TOKEN_ci"" when
  a placeholder appears inside an existing JSON/YAML/TOML string.
- SHORTCODE_REPLACEMENT for json/jsonl/yaml/toml: change from
  '"__SHORTCODE__"' to '0' (bare integer). Valid as a JSON/YAML/TOML
  value and safe inside existing strings — avoids ""0"/api"-style
  double-quoting when a shortcode appears inside a quoted value.
- lint-codeblocks: emit ::warning:: when a canonical source is
  unreadable so the job doesn't silently skip files.
- workflow: update stale comment to point at DOCS-TESTING.md instead
  of the scrubbed PLAN_CODE_BLOCK_LINTING.md.
- DOCS-TESTING.md: document that | is a literal delimiter in
  placeholders="" — regex-style groupings are not supported.
- Add 3 regression tests for the double-quote cases.
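
The single-pattern substitution in the `quotedSub()` bullet can be sketched like this. It is an approximation, not the PR's implementation: `escapeRegExp` is a hypothetical helper, and the sketch only covers a placeholder that is bare or is the entire quoted string.

```javascript
// Escapes regex metacharacters so a placeholder name is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical quotedSub: replaces "TOKEN" or bare TOKEN with one quoted
// stand-in, so an already-quoted placeholder is not quoted twice.
function quotedSub(code, placeholder) {
  const name = escapeRegExp(placeholder);
  // The quoted alternative comes first so its quotes are consumed by the match.
  const pattern = new RegExp(`"${name}"|${name}`, 'g');
  return code.replace(pattern, () => `"${placeholder}_ci"`);
}
```

Both `{"t": API_TOKEN}` and `{"t": "API_TOKEN"}` normalize to `{"t": "API_TOKEN_ci"}`. A placeholder embedded in a longer string is outside what this sketch handles.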
…utput skip; Node 22; frontmatter comments

- stripHtmlComments(): replace each comment with the same number of \n
  it contained instead of removing it entirely. Removes the
  replace(/^\n+/, '') + trimEnd() that shifted line offsets and broke
  mapCodeLineToFileLine line mapping.
- expected-output skipNext: only skip the next fence when rawLang is
  null (unlabeled). Expected-output fences are always unlabeled;
  previously any labeled fence after the marker could be silently
  dropped.
- Standardize both detect-changes and suggest-tests jobs on Node 22
  (was 20) to match lint-codeblocks.
- getSourceFromFrontmatter(): allow trailing inline YAML comments
  (source: /shared/x.md # note) without breaking canonicalization.
- Update existing html-comments test to assert line-preserving
  behavior; add 2 new regression tests.
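
The line-preserving behavior described for `stripHtmlComments()` can be sketched as follows. This is a minimal version matching the first bullet above, assuming a simple non-greedy comment regex; the actual function may differ.

```javascript
// Replaces each HTML comment with as many newlines as it contained, so
// every later line keeps its original line number.
function stripHtmlComments(text) {
  return text.replace(/<!--[\s\S]*?-->/g, (comment) => {
    const newlines = comment.split('\n').length - 1;
    return '\n'.repeat(newlines);
  });
}

const input = 'a\n<!-- x\ny -->\nb';
const output = stripHtmlComments(input);
```

In this example the two-line comment collapses to a single newline, so `b` stays on line 4 in both `input` and `output`.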
Add remark-frontmatter so the parser treats the ---…--- block as a
leaf yaml node rather than walking it as markdown. Without this,
a description: | scalar containing ```json fences would be extracted
and linted, producing false positives on valid doc pages.
…l-source dedupe

Consumer pages that declare source: frontmatter are pure stubs with no
body fences of their own. Linting only the shared canonical source is
therefore sufficient; no consumer fences are silently skipped. Add a
comment in resolveCanonicalSource() and the dedup loop calling out this
invariant so future drift is obvious.

- frontmatter regex: require closing --- on its own line (followed by
  newline or EOF) so a literal --- inside a YAML value doesn't
  prematurely terminate the block.
- findPagesReferencingSharedContent: switch to grep -rFl (fixed-string)
  with two -e patterns covering source: /shared/... and source: shared/...
  so path characters like dots aren't treated as regex metacharacters.

Add yarn lint-codeblocks as section 3 in Part 2: Testing, before the
existing code block execution testing. Includes command, blocking policy
table, normalization note, and cross-reference to DOCS-TESTING.md.
Update Quick Decision Tree and quick-reference table. Renumber sections
3-6 → 4-7 to accommodate the new section.

- javascript.mjs: wrap mkdtempSync/writeFileSync in try/catch so FS
  failures return {ok:false} instead of crashing the whole lint run;
  make rmSync cleanup best-effort
- detect-test-products.js: skip canonical sources that no longer exist
  (deleted/renamed files) to avoid spurious ::warning:: noise in CI
- codeblock-extractor.mjs: gate expected-output skip on proximity (≤3
  lines from marker) so a misplaced marker can't silently eat an
  unrelated unlabeled fence elsewhere in the file
- Add fixture + test for the distant-marker case
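
The expected-output gate from the extractor bullet above (unlabeled fence, at most 3 lines from the marker) might reduce to a predicate like this. `isExpectedOutputFence` and the `fence` object shape are hypothetical, and measuring the distance from the marker line to the fence's opening line is an assumption.

```javascript
const MAX_MARKER_DISTANCE = 3; // lines, per the proximity rule

// Hypothetical predicate: should this fence be skipped as expected output?
function isExpectedOutputFence(markerLine, fence) {
  if (markerLine === null) return false; // no marker seen yet
  if (fence.rawLang !== null) return false; // labeled fences are never skipped
  const distance = fence.startLine - markerLine;
  return distance > 0 && distance <= MAX_MARKER_DISTANCE;
}
```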
…frontmatter only

grep -rFl matches source: anywhere in a file, so prose mentions like
"See `source: /shared/x.md`" would incorrectly count as consumers.
Post-filter candidates through getSourceFromFrontmatter, which parses
only the frontmatter block, so prose and code-example matches are
discarded.

Also adds a unit test asserting the filter distinguishes a real
consumer (source: in frontmatter) from a prose mention.

- package.json: bump engines.node to >=22.0.0 (node --test and CI
  already require Node 22; >=16 was misleading)
- lint-codeblocks.mjs: compute consumer attribution once per file
  (was once per diagnostic — shared files with multiple failures
  triggered a repo-wide grep on every error)
- content-utils.js: add searchRoot option to findPagesReferencingSharedContent
  so callers can inject the search directory; switch from execSync with
  a shell string to execFileSync with an explicit args array to avoid
  shell-quoting issues with unusual filenames
- canonical.test.mjs: replace misleading test that only called
  getSourceFromFrontmatter with a real test that calls
  findPagesReferencingSharedContent end-to-end, exercising both the
  grep and the prose-mention post-filter

getSourceFromFrontmatter() was prepending 'content' to any path starting
with '/', so source: /content/shared/x became content/content/shared/x
(nonexistent). Three real consumer pages use this form in-tree:
  content/influxdb3/{clustered,cloud-serverless,cloud-dedicated}/
  process-data/downsample/quix.md

Fix: strip the leading slash and treat the remainder as repo-relative,
only prepending content/ when it isn't already there.

Also add a third -e pattern to findPagesReferencingSharedContent() so
pages using source: /content/shared/... are included in consumer
attribution, not just the /shared/... and shared/... forms.

Add a regression test that asserts the result never starts with
content/content/.
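
The fix described above (strip the leading slash, prepend `content/` only when it is missing) can be sketched as a small helper. `toRepoRelative` is a hypothetical name; the real logic lives inside `getSourceFromFrontmatter()`.

```javascript
// Hypothetical normalizer for a source: value; returns a repo-relative path.
function toRepoRelative(source) {
  // Strip the leading slash and treat the remainder as repo-relative.
  const trimmed = source.replace(/^\//, '');
  // Only prepend content/ when the path does not already start with it.
  return trimmed.startsWith('content/') ? trimmed : `content/${trimmed}`;
}
```

All three forms seen in the tree (`/shared/x.md`, `shared/x.md`, `/content/shared/x.md`) resolve to `content/shared/x.md` under this sketch.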
…form

Adds .ci/scripts/check-source-paths.sh to enforce that source: values
always start with /shared/. Fixes 14 existing violations:
- source: /content/shared/... → source: /shared/... (3 files)
- source: content/shared/...  → source: /shared/... (9 files)
- source: shared/...          → source: /shared/... (2 files)

Wires the check into lefthook (pre-commit, staged files only) and
pr-link-check.yml (CI, changed files only) so the variant forms can't
re-enter the repo.
@github-actions github-actions Bot added labels product:v3-distributed (InfluxDB 3 Cloud Serverless, Cloud Dedicated, Clustered) and product:v3-monolith (InfluxDB 3 Core and Enterprise, single-node / clusterable) Apr 25, 2026
- check-source-paths.sh: strip inline YAML comments (# ...) before
  validating source: values so lines like
  `source: /shared/foo.md  # link` pass the check correctly

- detect-test-products.js: remove existsSync guard from canonical set
  accumulation; missing canonical paths (broken source: pointers, deleted
  files) must reach lint-codeblocks.mjs so it can emit ::warning::
  annotations — silently dropping them here defeated that signal

- content-editing SKILL.md: correct three examples that still showed
  the non-canonical /content/shared/... prefix; examples now use /shared/
…ine comments

Strip inline YAML comments before quote-stripping. Previously, a value
like `source: "/shared/foo.md"  # comment` would extract as
`/shared/foo.md"` (trailing quote retained) because the quote-strip
regex only matches at start/end and the trailing char was part of the
comment text. The bash glob check still passed today, but the extracted
value was malformed and would break any future exact-match logic
(existence checks, etc.).

New order: strip leading `source:`, strip trailing comment, trim
trailing whitespace, then strip surrounding quotes.

Address two robustness gaps:

1. Inline comment stripping wasn't YAML-aware — a # inside a quoted
   value (e.g. `source: "/shared/has#hash.md"`) would be truncated
   incorrectly.

2. Quote stripping was permissive — `gsub(/^["']|["']$/, "")`
   stripped either end independently, which would silently "fix"
   malformed frontmatter like a missing closing quote and let it pass.

New extraction matches one of three shapes:
  "value"  [# comment]   → keep inner text verbatim
  'value'  [# comment]   → keep inner text verbatim
  value    [# comment]   → strip trailing comment

Surrounding quotes are only stripped when both ends match. Malformed
quoting now reaches the canonical-form check intact and surfaces as a
violation.
…rvation

Address two robustness findings consistent with the stricter behavior
already in .ci/scripts/check-source-paths.sh:

1. getSourceFromFrontmatter previously stripped quotes independently
   (`["']?...["']?`), so malformed frontmatter like
   `source: "/shared/foo.md` (missing close quote) was silently
   "repaired" and canonicalized. Match shape now requires both ends
   to match — three explicit alternatives:
     "value"  [# note]   → group 1 (any chars including #)
     'value'  [# note]   → group 2 (any chars including #)
     value    [# note]   → group 3 (no whitespace or # in value)
   Malformed quoting falls through to null.

2. stripHtmlComments was global, which would silently alter inline
   `<!-- ... -->` literals inside code (HTML/Markdown fences). Now
   restricted to whole-line comments — opening `<!--` and closing
   `-->` must each sit on their own line (with optional surrounding
   whitespace). Pytest directives like `<!--pytest.mark.skip-->` are
   unaffected since they are always written on their own line in this
   repo.

Adds three regression tests (55/55 pass).
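
The three-alternative match for `source:` values might look like the sketch below. The regex is a reconstruction from the description in item 1, not the PR's actual pattern; excluding quote characters from the unquoted branch is an assumption made so that mismatched quotes fall through to `null`.

```javascript
// Reconstructed sketch: "value" | 'value' | value, each with an optional
// trailing comment. Group 3 allows no whitespace, #, or quote characters.
const SOURCE_LINE =
  /^source:[ \t]*(?:"([^"]*)"[ \t]*(?:#.*)?|'([^']*)'[ \t]*(?:#.*)?|([^\s#"']+)(?:[ \t]+#.*)?)[ \t]*$/;

// Returns the source value, or null when the line is malformed.
function parseSourceLine(line) {
  const m = SOURCE_LINE.exec(line);
  if (!m) return null;
  return m[1] ?? m[2] ?? m[3];
}
```

A `#` inside quotes is kept verbatim, while `source: "/shared/foo.md` (missing close quote) returns `null` instead of being silently repaired.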
…contract test

1. getSourceFromFrontmatter now allows # mid-value in unquoted plain
   scalars. YAML treats # as a comment delimiter only when preceded by
   whitespace, so foo#bar is a literal value while foo #bar has trailing
   comment #bar. The previous regex disallowed # entirely in unquoted
   values, which would return null for valid YAML like
   `source: /shared/has#hash.md` while the bash check accepted it.

2. check-source-paths.sh: replace mapfile (bash 4+) with a NUL-delimited
   while-read loop so the script runs under macOS system /bin/bash 3.2.
   Verified: 'GNU bash, version 3.2.57' executes the no-args full-repo
   scan cleanly.

3. Add a contract test that runs both the bash script and the JS
   getSourceFromFrontmatter against a shared fixture set, asserting they
   agree on (a) parsing valid YAML scalar shapes and (b) rejecting
   malformed quoting. By design they diverge on canonical-form
   enforcement (bash is the strict CI gate, JS the forgiving runtime
   normalizer); the contract covers only the YAML-parsing concerns
   where drift would create inconsistent behavior.
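
The YAML rule from item 1 (a `#` starts a comment only when preceded by whitespace) can be illustrated with a short helper for unquoted values. `stripPlainScalarComment` is a hypothetical name, and the sketch covers plain scalars only, not quoted ones.

```javascript
// In a YAML plain (unquoted) scalar, # starts a comment only when it is
// preceded by whitespace; a # directly after other characters is literal.
function stripPlainScalarComment(value) {
  return value.replace(/[ \t]+#.*$/, '').trimEnd();
}
```

So `foo#bar` is returned unchanged, while `foo #bar` becomes `foo`.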
…oted source: in consumer grep

1. Bash awk previously required whitespace before # after a closing
   quote, so `source: "/shared/foo.md"#note` parsed as malformed.
   js-yaml and the JS regex both accept no-space-before-# after a
   closing quote (the lenient YAML interpretation), so the bash check
   would falsely flag valid YAML the runtime accepts. Aligned by
   changing [[:space:]]+ to [[:space:]]* in both quoted branches; the
   unquoted branch keeps the strict whitespace requirement (foo#bar is
   a literal plain scalar, foo #bar has trailing comment).

2. findPagesReferencingSharedContent grep prefilter only matched
   unquoted source: values. If a page used source: "/shared/foo.md"
   or source: '/shared/foo.md', the consumer was silently dropped from
   shared-content expansion. Added quoted variants for all three known
   path shapes (/shared/, shared/, /content/shared/). The post-filter
   via getSourceFromFrontmatter remains the source of truth — grep
   just needs to surface the candidate. No content pages currently use
   quoted form; this closes the gap before it bites.

Adds: regression test for quoted-consumer grep, two new no-space-#
cases in the bash↔JS contract test.
…s on frontmatter delimiters

1. package.json now requires Node >=22.0.0 and test.yml uses Node 22,
   but several workflows still pinned Node 18 or 20:
     pr-link-check, pr-render-check, pr-feedback-links, pr-preview,
     sync-plugins, audit-documentation, prepare-release, and the
     uncommented Node steps in influxdb3-release.
   Workflows on older Node would fail or warn against the engines
   field, and any future syntax adoption (top-level await in scripts,
   etc.) silently breaks on the older runners. Bumped all active
   node-version values to '22'. Commented-out '18' references in
   influxdb3-release.yml left as-is — they are dead code already.

2. The bash hook (.ci/scripts/check-source-paths.sh) accepts trailing
   whitespace on --- delimiters via /^---[[:space:]]*$/, but the JS
   regex required the delimiter on its own line with no trailing
   whitespace. A file with a stray space after --- would pass the
   pre-commit hook but fail JS canonical-source extraction, dropping
   the consumer from shared-content expansion silently. Relaxed the JS
   regex to /^---[ \t]*\r?\n.../ on both delimiters.

github-actions Bot commented Apr 27, 2026

PR Preview Action v1.4.8
🚀 Deployed preview to https://influxdata.github.io/docs-v2/pr-preview/pr-7136/
on branch gh-pages at 2026-04-27 21:53 UTC

@jstirnaman jstirnaman merged commit df66289 into master Apr 27, 2026
24 checks passed
@jstirnaman jstirnaman deleted the worktree-code-block-tests branch April 27, 2026 22:00
github-actions Bot added a commit that referenced this pull request Apr 27, 2026

Labels

product:v3-distributed: InfluxDB 3 Cloud Serverless, Cloud Dedicated, Clustered
product:v3-monolith: InfluxDB 3 Core and Enterprise (single-node / clusterable)
