feat(knowledge): phase 2 — markdown chunker + phase configs#244
Merged
Adds a pure-function markdown chunking engine driven by per-phase JSON configs. Zero external deps: a self-contained parser that the Phase 3 CLI plugs into the Phase 1 store layer.

Algorithm (executed in order to resolve the gate/split-time ambiguity):

1. Strip YAML frontmatter
2. Whole-file gate: < keep_whole_below lines -> single chunk
3. Parse headings (fenced-code-block aware)
4. Split at primary_level; H1 + intro forms the first chunk
5. special_sections: own-chunk | skip | merge-up (first-section merge-up promotes to its own chunk)
6. Size fallback: sections > max_lines split once at fallback_level, never recurse (finding #9: flat H2 -> H3 -> whole file chain)
7. skip_empty_sections drops heading-only sections

Phase configs ship at skills/workflow-knowledge/chunking/ for research, discussion, investigation, specification. Discussion carries the special sections from the design doc example (Discussion Map, Summary -> own-chunk). Shared defaults match the design doc across all four.

Wired into src/knowledge/index.js so esbuild picks it up: bundle 130.6 KB -> 131.6 KB, well under the 150 KB threshold.

Unit tests in tests/scripts/test-knowledge-chunker.cjs cover H2/H3 splitting, the flat fallback chain, special_sections behaviour, frontmatter handling, empty sections, whole-file gate vs missing-headings fallback, H1 handling, and fenced-code-block parsing correctness. 26 cases, plus schema assertions against the 4 shipped configs. Phase 1 regression suite (99 tests total) still green.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
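Steps 1 to 4 of the algorithm above can be sketched roughly as follows. This is a minimal illustration, not the code in src/knowledge/chunker.js: the function names (`stripFrontmatter`, `parseHeadings`, `sketchChunk`) and the config values are hypothetical, and fence matching is simplified to "same marker character opens and closes".

```javascript
// Hypothetical config values; the shipped JSON configs may use different numbers.
const config = { keep_whole_below: 5, primary_level: 2 };

// Step 1: drop a leading YAML frontmatter block delimited by '---' lines.
function stripFrontmatter(lines) {
  if (lines[0] !== '---') return lines;
  const end = lines.indexOf('---', 1);
  return end === -1 ? lines : lines.slice(end + 1);
}

// Step 3: collect ATX headings, ignoring anything inside a fenced code block.
function parseHeadings(lines) {
  const headings = [];
  let fence = null; // marker character of the currently open fence, or null
  lines.forEach((line, index) => {
    const fenceMatch = /^\s{0,3}(`{3,}|~{3,})/.exec(line);
    if (fenceMatch) {
      const marker = fenceMatch[1][0];
      if (fence === null) fence = marker; // opening fence
      else if (fence === marker) fence = null; // closing fence (simplified rule)
      return;
    }
    if (fence !== null) return; // inside a code block: '##' here is not a heading
    const m = /^(#{1,6})\s+(.*)$/.exec(line);
    if (m) headings.push({ level: m[1].length, text: m[2].trim(), index });
  });
  return headings;
}

function sketchChunk(source, cfg) {
  const lines = stripFrontmatter(source.split('\n'));
  // Step 2: the whole-file gate fires before any heading parsing.
  if (lines.length < cfg.keep_whole_below) return [lines.join('\n')];
  const starts = parseHeadings(lines)
    .filter((h) => h.level === cfg.primary_level)
    .map((h) => h.index);
  if (starts.length === 0) return [lines.join('\n')];
  // Step 4: everything before the first primary heading (H1 + intro) is the first chunk.
  const bounds = starts[0] === 0 ? starts : [0, ...starts];
  return bounds.map((start, i) => lines.slice(start, bounds[i + 1]).join('\n'));
}
```

Because each chunk here is a contiguous slice of the line array, a `##` line inside a fence stays in its enclosing chunk instead of opening a new one. Steps 5 to 7 (special_sections, size fallback, skip-empty) are left out of the sketch.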
Adds fixture-based validation for the chunking engine. Nine fixtures: five copied verbatim from real workflow projects (portal, pigeon, tick, agntc) covering all four indexed phases plus a no-Discussion-Map edge case, and four synthetic fixtures for size-gate, missing-headings, fallback-splitting, and fenced-code-block paths.

Real fixtures (never edited, copied into tests/fixtures/knowledge/):

- research-fixture.md: portal cc-tool-plan (581 lines, 12 H2s)
- discussion-fixture.md: pigeon application-architecture (458 lines, 11 H2s incl Discussion Map + Summary)
- investigation-fixture.md: tick cascade-unchanged-noise (149 lines, Symptoms/Analysis/Fix Direction)
- specification-fixture.md: portal portal/specification (757 lines, 12 H2s, frontmatter + H1)
- discussion-no-map-fixture.md: agntc core-architecture (251 lines, Summary only)

Tests each realistic fixture with its production phase config and asserts expected chunk count, expected heading boundaries, no chunk exceeds max_lines, and the verbatim content preservation invariant (every chunk content must appear verbatim in the post-frontmatter source, per design doc "no information loss", line 74).

The oversized-section fixture (257 lines, single H2 > max_lines) proves the flat fallback chain splits once at H3 and never recurses. The minimal spec fixture (22 lines) proves the keep_whole_below gate fires before heading parsing. The code-block fixture proves fenced ## lines do not trigger false splits.

One chunker fix surfaced by the fixtures: CRLF line endings are now normalised to LF at the start of chunk() so Windows-style inputs chunk identically. This edge case is explicitly listed in the task.

Full regression suite: 111 tests (72 phase-1 + 39 chunker), all green. Bundle unchanged at 131.6 KB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
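The verbatim content preservation invariant and the CRLF normalisation described above could be checked along these lines. This is a sketch under assumptions: `normaliseLineEndings` and `findNonVerbatimChunks` are hypothetical helper names, not functions from the test suite, and the source passed in is assumed to already have its frontmatter stripped.

```javascript
// Normalise Windows-style CRLF line endings to LF, as chunk() is described as doing.
function normaliseLineEndings(text) {
  return text.replace(/\r\n/g, '\n');
}

// Hypothetical helper: returns every chunk that is NOT a verbatim substring
// of the (normalised, post-frontmatter) source. An empty result means the
// "no information loss" invariant holds for this input.
function findNonVerbatimChunks(source, chunks) {
  const normalised = normaliseLineEndings(source);
  return chunks.filter((chunk) => !normalised.includes(chunk));
}
```

A fixture test would then assert that the returned array is empty for the chunks produced from each fixture.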
Addresses two gaps between the chunker implementation and the design contract, surfaced during PR #244 review.

Issue #1: merge-up broke content preservation (design line 74)
================================================================

The old merge-up path joined sections with a hard-coded `\n\n`:

    prev.content = prev.content + '\n\n' + section.text;

That separator is injected by the chunker, not drawn from the source. If the gap between the parent and the merge-up section contained anything other than exactly two newlines (e.g. a blank line with trailing whitespace, or a comment), the merged chunk was no longer a verbatim substring of the source, silently violating the global "no lossy compression" invariant.

Fix: track source line ranges on every section during buildSections and maintain them through the processing pipeline. merge-up now extends the preceding segment's endLine; chunk content is generated by slicing from the original line array once at the end. Contiguous source slice → verbatim guaranteed, regardless of what sits between the merged sections in the source.

Issue #3: sub-level special_sections was not implemented
==========================================================

Task 2-1 states: "own-chunk: always split into its own chunk regardless of heading level". The previous implementation only matched at the split level (H2 for discussion), so an H3 "### Discussion Map" nested inside a regular H2 would remain part of its parent chunk instead of being extracted.

Fix: added expandSubLevelSpecials, which walks each split-level section's line range and carves out any sub-level heading whose text matches a special_sections entry.

Precedence rule: if the parent section's heading is itself in special_sections, the parent's action wins and no sub-carving happens, so "Discussion Map as H2" stays one chunk.
Sub-level matching handles own-chunk and skip; merge-up is a split-level concept only and sub-level merge-up entries are treated as regular sub-headings.

Issue #2: own-chunk + size fallback ambiguity
==============================================

The task wording "always split into its own chunk" is ambiguous: does own-chunk bypass the max_lines fallback split, or not? The previous code made a silent choice. Added an inline comment documenting the decision and its reasoning: own-chunk is interpreted as literally one chunk, so a large Discussion Map stays intact even if it exceeds max_lines. special_sections mark semantic units the user has declared atomic, and splitting them would defeat that declaration.

Testing
=======

- Two new merge-up tests assert the verbatim invariant on synthetic sources with non-`\n\n` separators; they would have caught the original bug. Both pass.
- One new sub-level extraction test using a synthetic fixture (`sub-level-special-fixture.md`) where `### Discussion Map` lives inside `## Plan`. Verifies the Map is extracted, the parent's intro stays, and "## Context"/"## Summary" boundaries are unaffected.
- One parent-wins precedence test against the real pigeon discussion fixture, which makes the precedence rule explicit.

Additional real fixtures for structural diversity
==================================================

- spec-deep-nested-fixture.md (tick v1 tick-core/specification.md, 810L): one huge H2 ("## Specification") that gets fallback-split into 11 H3 chunks, plus one standalone H2 ("## Dependencies"). Proves the flat fallback chain handles real artifacts with a single dominant H2.
- research-oversized-h3-fixture.md (tick v1 research/exploration.md, 567L): contains a ~310-line H3 "Session 1" that stays intact (no recursion, reported by knowledge status in Phase 4).
- discussion-q-style-fixture.md (tick v1 cli-command-structure-ux.md, 416L): Q1..Q6 discussion shape with no Discussion Map, proves the chunker copes with non-canonical discussion variants.
- spec-folio-fixture.md (folio template-authoring-system/specification.md, 617L): different spec structure from portal/tick for diversity.

Regression: 119/119 tests green. Bundle 132.9 KB (up 1.3 KB, 17 KB under the 150 KB ceiling).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
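The line-range approach behind the Issue #1 fix can be illustrated with a small sketch. It assumes segments carry half-open `[startLine, endLine)` ranges into the original line array; the `mergeUp` and `render` names and the segment shape are hypothetical, and only the idea (extend the previous segment's endLine, slice once at the end) comes from the commit message.

```javascript
// Merge a section into the preceding segment by extending its line range.
// No separator is injected: whatever sits between the two in the source
// is picked up by the final slice.
function mergeUp(segments, section) {
  const prev = segments[segments.length - 1];
  if (!prev) {
    // First-section merge-up has nothing to merge into, so it is promoted
    // to its own chunk.
    segments.push({ startLine: section.startLine, endLine: section.endLine });
    return;
  }
  prev.endLine = section.endLine;
}

// Generate chunk content once, by slicing the original line array.
function render(lines, segments) {
  return segments.map((s) => lines.slice(s.startLine, s.endLine).join('\n'));
}
```

With a source whose gap contains a whitespace-only line and a comment, the merged chunk is still a contiguous slice, so it remains a verbatim substring of the source, which the old `'\n\n'` concatenation could not guarantee.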
Summary

- Markdown chunking engine (`src/knowledge/chunker.js`) driven by per-phase JSON configs. Zero deps. Algorithm: CRLF normalise → frontmatter strip → `keep_whole_below` gate → fenced-code-aware heading parse → primary/fallback/whole-file chain → `special_sections` (own-chunk/skip/merge-up with first-section promotion) → flat H2→H3 size fallback (no recursion) → skip-empty. Wired into `src/knowledge/index.js` so esbuild bundles it; knowledge.cjs stays at 131.6 KB, well under the 150 KB threshold.
- Phase configs at `skills/workflow-knowledge/chunking/{research,discussion,investigation,specification}.json`. Discussion ships with `Discussion Map` / `Summary` → own-chunk from the design doc example.
- Fixtures in `tests/fixtures/knowledge/`: 5 real artifacts copied verbatim from `portal`, `pigeon`, `tick`, `agntc` (one per phase + a no-Discussion-Map edge case) plus 4 synthetic fixtures for the size gate, missing-headings fallback, H3 fallback splitting, and fenced-code-block correctness. 39 chunker tests assert expected chunk boundaries, `max_lines` compliance, and the verbatim content preservation invariant (design doc line 74).

Test plan

- `node --test tests/scripts/test-knowledge-chunker.cjs`: 39 tests pass
- `npm run build`: bundle still 131.6 KB (< 150 KB threshold)

🤖 Generated with Claude Code