feat(knowledge): phase 2 — markdown chunker + phase configs by leeovery · Pull Request #244 · leeovery/agentic-workflows

leeovery · 2026-04-11T11:05:04Z

Summary

Pure-function markdown chunker (src/knowledge/chunker.js) driven by per-phase JSON configs. Zero deps. Algorithm: CRLF normalise → frontmatter strip → keep_whole_below gate → fenced-code-aware heading parse → primary/fallback/whole-file chain → special_sections (own-chunk/skip/merge-up with first-section promotion) → flat H2→H3 size fallback (no recursion) → skip-empty. Wired into src/knowledge/index.js so esbuild bundles it — knowledge.cjs stays at 131.6 KB, well under the 150 KB threshold.
4 phase configs at skills/workflow-knowledge/chunking/{research,discussion,investigation,specification}.json. Discussion ships with Discussion Map/Summary → own-chunk from the design doc example.
9 fixtures under tests/fixtures/knowledge/ — 5 real artifacts copied verbatim from portal, pigeon, tick, agntc (one per phase + a no-Discussion-Map edge case) plus 4 synthetic fixtures for the size gate, missing-headings fallback, H3 fallback splitting, and fenced-code-block correctness. 39 chunker tests assert expected chunk boundaries, max_lines compliance, and the verbatim content preservation invariant (design doc line 74).
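To illustrate the first and fourth stages of the algorithm above (CRLF normalisation and the fenced-code-aware heading parse), here is a minimal sketch. It is not the code in src/knowledge/chunker.js; the function name `parseHeadings` and the returned `{ level, text, line }` shape are assumptions made for this example, and it only handles ATX headings and backtick/tilde fences.

```javascript
// Hypothetical helper for illustration only; not the project's implementation.
function parseHeadings(source) {
  // Normalise CRLF to LF first so Windows-style input parses identically.
  const lines = source.replace(/\r\n/g, '\n').split('\n');
  const headings = [];
  let fence = null; // marker that opened the current fenced block, or null

  lines.forEach((line, index) => {
    const fenceMatch = /^ {0,3}(`{3,}|~{3,})(.*)$/.exec(line);
    if (fenceMatch) {
      const [, marker, rest] = fenceMatch;
      if (fence === null) {
        fence = marker; // opening fence
        return;
      }
      const sameKind = marker[0] === fence[0] && marker.length >= fence.length;
      if (sameKind && rest.trim() === '') {
        fence = null; // closing fence
        return;
      }
    }
    if (fence !== null) return; // inside a fenced block: "#" lines are not headings

    const headingMatch = /^(#{1,6})\s+(.*?)\s*$/.exec(line);
    if (headingMatch) {
      headings.push({ level: headingMatch[1].length, text: headingMatch[2], line: index });
    }
  });
  return headings;
}
```

With this approach a `## ...` line inside a fenced block is skipped, so it cannot trigger a false split; heading parsing resumes at the first line after the closing fence.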

Test plan

node --test tests/scripts/test-knowledge-chunker.cjs — 39 tests pass
Phase 1 regression — full suite 111/111 green (store, embeddings, integration, chunker)
npm run build — bundle still 131.6 KB (< 150 KB threshold)
Verbatim invariant enforced across all 9 fixtures
CRLF fixture exercised

🤖 Generated with Claude Code

Adds a pure-function markdown chunking engine driven by per-phase JSON configs. Zero external deps: a self-contained parser that the Phase 3 CLI plugs into the Phase 1 store layer.

Algorithm (executed in order to resolve the gate/split-time ambiguity):

1. Strip YAML frontmatter
2. Whole-file gate: < keep_whole_below lines -> single chunk
3. Parse headings (fenced-code-block aware)
4. Split at primary_level; H1 + intro forms the first chunk
5. special_sections: own-chunk | skip | merge-up (first-section merge-up promotes to its own chunk)
6. Size fallback: sections > max_lines split once at fallback_level, never recurse (finding #9: flat H2 -> H3 -> whole file chain)
7. skip_empty_sections drops heading-only sections

Phase configs ship at skills/workflow-knowledge/chunking/ for research, discussion, investigation, specification. Discussion carries the special sections from the design doc example (Discussion Map, Summary -> own-chunk). Shared defaults match the design doc across all four.

Wired into src/knowledge/index.js so esbuild picks it up. Bundle 130.6kb -> 131.6kb, well under the 150kb threshold.

Unit tests in tests/scripts/test-knowledge-chunker.cjs cover H2/H3 splitting, the flat fallback chain, special_sections behaviour, frontmatter handling, empty sections, whole-file gate vs missing-headings fallback, H1 handling, and fenced-code-block parsing correctness. 26 cases, plus schema assertions against the 4 shipped configs. Phase 1 regression suite (99 tests total) still green.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
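The whole-file gate, the primary split, and the flat size fallback from this commit message can be sketched as follows. This is a simplified illustration under stated assumptions, not the shipped chunker: `splitAt` and `chunkRanges` are hypothetical names, the config values are invented for the demo rather than the shipped defaults, ranges are half-open `[start, end)` line indices, and frontmatter, fenced code blocks, special_sections and skip_empty_sections are deliberately left out.

```javascript
// Illustrative values only; the real defaults live in the shipped phase configs.
const config = { keep_whole_below: 5, primary_level: 2, fallback_level: 3, max_lines: 4 };

// Split the half-open range [start, end) at headings of exactly `level`.
// Whatever precedes the first matching heading stays together as the first range.
function splitAt(lines, start, end, level) {
  const prefix = '#'.repeat(level) + ' ';
  const ranges = [];
  let current = start;
  for (let i = start + 1; i < end; i++) {
    if (lines[i].startsWith(prefix)) {
      ranges.push([current, i]);
      current = i;
    }
  }
  ranges.push([current, end]);
  return ranges;
}

function chunkRanges(lines, cfg) {
  // The whole-file gate runs before any heading parsing.
  if (lines.length < cfg.keep_whole_below) return [[0, lines.length]];

  const out = [];
  for (const [start, end] of splitAt(lines, 0, lines.length, cfg.primary_level)) {
    if (end - start > cfg.max_lines) {
      // Split once at the fallback level. Results that are still oversized
      // are kept as they are: the chain is flat, with no recursion.
      out.push(...splitAt(lines, start, end, cfg.fallback_level));
    } else {
      out.push([start, end]);
    }
  }
  return out;
}

const demo = ['# T', 'intro', '## A', 'a1', '### A1', 'x', 'y', '### A2', 'z', '## B', 'b'];
console.log(JSON.stringify(chunkRanges(demo, config)));
```

In the demo, the H1 plus intro forms the first range, the seven-line `## A` section exceeds `max_lines` and is split once at H3, and `## B` stays whole.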

Adds fixture-based validation for the chunking engine. Nine fixtures: five copied verbatim from real workflow projects (portal, pigeon, tick, agntc) covering all four indexed phases plus a no-Discussion-Map edge case, and four synthetic fixtures for size-gate, missing-headings, fallback-splitting, and fenced-code-block paths.

Real fixtures (never edited, copied into tests/fixtures/knowledge/):

- research-fixture.md: portal cc-tool-plan (581 lines, 12 H2s)
- discussion-fixture.md: pigeon application-architecture (458 lines, 11 H2s incl Discussion Map + Summary)
- investigation-fixture.md: tick cascade-unchanged-noise (149 lines, Symptoms/Analysis/Fix Direction)
- specification-fixture.md: portal portal/specification (757 lines, 12 H2s, frontmatter + H1)
- discussion-no-map-fixture.md: agntc core-architecture (251 lines, Summary only)

Tests each realistic fixture with its production phase config and asserts expected chunk count, expected heading boundaries, no chunk exceeds max_lines, and the verbatim content preservation invariant (every chunk content must appear verbatim in the post-frontmatter source, per design doc "no information loss", line 74).

The oversized-section fixture (257 lines, single H2 > max_lines) proves the flat fallback chain splits once at H3 and never recurses. The minimal spec fixture (22 lines) proves the keep_whole_below gate fires before heading parsing. The code-block fixture proves fenced ## lines do not trigger false splits.

One chunker fix surfaced by the fixtures: CRLF line endings are now normalised to LF at the start of chunk() so Windows-style inputs chunk identically. Edge case explicitly listed in the task.

Full regression suite: 111 tests (72 phase-1 + 39 chunker), all green. Bundle unchanged at 131.6 KB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
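The verbatim content preservation invariant described in this commit message could be checked with a helper along these lines. This is a sketch only: `stripFrontmatter` and `findNonVerbatimChunks` are hypothetical names, chunks are assumed to expose a `content` string, and frontmatter is assumed to be delimited by `---` lines with a newline after the closing delimiter.

```javascript
// Hypothetical test helpers for illustration; not taken from the test suite.
function stripFrontmatter(source) {
  // Normalise CRLF to LF so the comparison matches what the chunker sees.
  const text = source.replace(/\r\n/g, '\n');
  if (!text.startsWith('---\n')) return text;
  const close = text.indexOf('\n---\n', 3);
  // No closing delimiter found: treat the whole text as body.
  return close === -1 ? text : text.slice(close + 5);
}

// Returns the chunks whose content is NOT a verbatim substring of the
// post-frontmatter source. An empty result means the invariant holds.
function findNonVerbatimChunks(source, chunks) {
  const body = stripFrontmatter(source);
  return chunks.filter((chunk) => !body.includes(chunk.content));
}
```

A fixture test would then assert that `findNonVerbatimChunks(source, chunks)` is empty for every fixture and phase config pair.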

Addresses two gaps between the chunker implementation and the design contract, surfaced during PR #244 review.

Issue #1: merge-up broke content preservation (design line 74)
================================================================

The old merge-up path joined sections with a hard-coded `\n\n`:

    prev.content = prev.content + '\n\n' + section.text;

That separator is injected by the chunker, not drawn from the source. If the gap between the parent and the merge-up section contained anything other than exactly two newlines (e.g. a blank line with trailing whitespace, or a comment), the merged chunk was no longer a verbatim substring of the source, silently violating the global "no lossy compression" invariant.

Fix: track source line ranges on every section during buildSections and maintain them through the processing pipeline. merge-up now extends the preceding segment's endLine; chunk content is generated by slicing from the original line array once at the end. Contiguous source slice → verbatim guaranteed, regardless of what sits between the merged sections in the source.

Issue #3: sub-level special_sections was not implemented
==========================================================

Task 2-1 states: "own-chunk: always split into its own chunk regardless of heading level". The previous implementation only matched at the split level (H2 for discussion), so an H3 "### Discussion Map" nested inside a regular H2 would remain part of its parent chunk instead of being extracted.

Fix: added expandSubLevelSpecials, which walks each split-level section's line range and carves out any sub-level heading whose text matches a special_sections entry. Precedence rule: if the parent section's heading is itself in special_sections, the parent's action wins and no sub-carving happens, so "Discussion Map as H2" stays one chunk.

Sub-level matching handles own-chunk and skip; merge-up is a split-level concept only and sub-level merge-up entries are treated as regular sub-headings.

Issue #2: own-chunk + size fallback ambiguity
==============================================

The task wording "always split into its own chunk" is ambiguous: does own-chunk bypass the max_lines fallback split, or not? The previous code made a silent choice. Added an inline comment documenting the decision and its reasoning: own-chunk is interpreted as literal one chunk, so a large Discussion Map stays intact even if it exceeds max_lines. special_sections mark semantic units the user has declared atomic, and splitting them would defeat that declaration.

Testing
=======

- Two new merge-up tests assert the verbatim invariant on synthetic sources with non-`\n\n` separators. They would have caught the original bug. Both pass.
- One new sub-level extraction test using a synthetic fixture (`sub-level-special-fixture.md`) where `### Discussion Map` lives inside `## Plan`. Verifies the Map is extracted, the parent's intro stays, and "## Context"/"## Summary" boundaries are unaffected.
- One parent-wins precedence test against the real pigeon discussion fixture, which makes the precedence rule explicit.

Additional real fixtures for structural diversity
==================================================

- spec-deep-nested-fixture.md (tick v1 tick-core/specification.md, 810L): one huge H2 ("## Specification") that gets fallback-split into 11 H3 chunks, plus one standalone H2 ("## Dependencies"). Proves the flat fallback chain handles real artifacts with a single dominant H2.
- research-oversized-h3-fixture.md (tick v1 research/exploration.md, 567L): contains a ~310-line H3 "Session 1" that stays intact (no recursion, reported by knowledge status in Phase 4).
- discussion-q-style-fixture.md (tick v1 cli-command-structure-ux.md, 416L): Q1..Q6 discussion shape with no Discussion Map, proves the chunker copes with non-canonical discussion variants.
- spec-folio-fixture.md (folio template-authoring-system/specification.md, 617L): different spec structure from portal/tick for diversity.

Regression: 119/119 tests green. Bundle 132.9 KB (up 1.3 KB, 17 KB under the 150 KB ceiling).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
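The range-based merge-up fix from Issue #1 can be illustrated with a small sketch. It is not the project's buildSections pipeline: `mergeUp` and `render` are hypothetical names, segments are assumed to carry `startLine` (inclusive) and `endLine` (exclusive) indices into the original line array, and leaving a first-section merge-up untouched stands in for the "promotes to its own chunk" rule.

```javascript
// Hypothetical sketch of range-based merge-up; names and shapes are assumptions.
function mergeUp(segments, index) {
  // A first-section merge-up has nothing to merge into, so it is left as its own segment.
  if (index === 0) return segments.slice();
  const merged = segments.slice();
  // Extend the preceding segment over the merged one instead of concatenating strings.
  merged[index - 1] = { ...merged[index - 1], endLine: merged[index].endLine };
  merged.splice(index, 1);
  return merged;
}

function render(lines, segment) {
  // Content is sliced from the original line array once, so it is a contiguous source slice.
  return lines.slice(segment.startLine, segment.endLine).join('\n');
}

// The gap between the two sections is a whitespace-only line, not an empty one.
const lines = ['## Parent', 'text', '  ', '## Notes', 'more'];
const source = lines.join('\n');
const segments = [{ startLine: 0, endLine: 2 }, { startLine: 3, endLine: 5 }];

const rangeBased = render(lines, mergeUp(segments, 1)[0]);
const stringJoined = render(lines, segments[0]) + '\n\n' + render(lines, segments[1]);

console.log(source.includes(rangeBased));   // true
console.log(source.includes(stringJoined)); // false
```

Because the merged segment spans the gap itself, whatever sits between the two sections in the source is carried along unchanged, whereas the string-joined version replaces it with an injected separator.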

leeovery and others added 3 commits April 11, 2026 11:47

leeovery merged commit 405baed into main Apr 11, 2026

leeovery deleted the feat/knowledge-base-phase-2 branch April 11, 2026 16:55
