
feat(knowledge): phase 2 — markdown chunker + phase configs #244

Merged
leeovery merged 3 commits into main from feat/knowledge-base-phase-2
Apr 11, 2026

@leeovery
Owner

Summary

  • Pure-function markdown chunker (src/knowledge/chunker.js) driven by per-phase JSON configs. Zero deps. Algorithm: CRLF normalise → frontmatter strip → keep_whole_below gate → fenced-code-aware heading parse → primary/fallback/whole-file chain → special_sections (own-chunk/skip/merge-up with first-section promotion) → flat H2→H3 size fallback (no recursion) → skip-empty. Wired into src/knowledge/index.js so esbuild bundles it — knowledge.cjs stays at 131.6 KB, well under the 150 KB threshold.
  • 4 phase configs at skills/workflow-knowledge/chunking/{research,discussion,investigation,specification}.json. Discussion ships with Discussion Map/Summary → own-chunk from the design doc example.
  • 9 fixtures under tests/fixtures/knowledge/ — 5 real artifacts copied verbatim from portal, pigeon, tick, agntc (one per phase + a no-Discussion-Map edge case) plus 4 synthetic fixtures for the size gate, missing-headings fallback, H3 fallback splitting, and fenced-code-block correctness. 39 chunker tests assert expected chunk boundaries, max_lines compliance, and the verbatim content preservation invariant (design doc line 74).

Test plan

  • node --test tests/scripts/test-knowledge-chunker.cjs — 39 tests pass
  • Phase 1 regression — full suite 111/111 green (store, embeddings, integration, chunker)
  • npm run build — bundle still 131.6 KB (< 150 KB threshold)
  • Verbatim invariant enforced across all 9 fixtures
  • CRLF fixture exercised

🤖 Generated with Claude Code

leeovery and others added 3 commits April 11, 2026 11:47
Adds a pure-function markdown chunking engine driven by per-phase JSON
configs. Zero external deps — self-contained parser that the Phase 3 CLI
plugs into the Phase 1 store layer.

Algorithm (executed in order to resolve the gate/split-time ambiguity):
1. Strip YAML frontmatter
2. Whole-file gate: < keep_whole_below lines -> single chunk
3. Parse headings (fenced-code-block aware)
4. Split at primary_level; H1 + intro forms the first chunk
5. special_sections: own-chunk | skip | merge-up (first-section merge-up
   promotes to its own chunk)
6. Size fallback: sections > max_lines split once at fallback_level,
   never recurse (finding #9 — flat H2 -> H3 -> whole file chain)
7. skip_empty_sections drops heading-only sections
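A minimal sketch of step 3, the fenced-code-aware heading parse. The function name `parseHeadings` and the `{ level, text, line }` return shape are assumptions made for illustration, not the PR's actual API. The fence handling is a simplified reading of the CommonMark rules (backtick or tilde fences, closed by a fence of the same character that is at least as long).

```javascript
// Hypothetical helper: name and return shape are illustrative only.
// Returns ATX headings found outside fenced code blocks.
function parseHeadings(lines) {
  const headings = [];
  let openFence = null; // marker of the fence we are inside, or null

  lines.forEach((line, index) => {
    const fence = line.match(/^ {0,3}(`{3,}|~{3,})(.*)$/);

    if (openFence === null) {
      if (fence) {
        openFence = fence[1]; // opening fence: start ignoring content
        return;
      }
      const heading = line.match(/^(#{1,6})\s+(.+?)\s*$/);
      if (heading) {
        headings.push({ level: heading[1].length, text: heading[2], line: index });
      }
      return;
    }

    // Inside a fence: only a matching closing fence ends it, so lines
    // that look like "## Heading" are treated as code, not structure.
    const closes =
      fence &&
      fence[1][0] === openFence[0] &&
      fence[1].length >= openFence.length &&
      fence[2].trim() === '';
    if (closes) openFence = null;
  });

  return headings;
}
```

Because the parser only records positions, the later split steps can work on line numbers rather than on copied text.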

Phase configs ship at skills/workflow-knowledge/chunking/ for research,
discussion, investigation, specification. Discussion carries the special
sections from the design doc example (Discussion Map, Summary ->
own-chunk). Shared defaults match the design doc across all four.
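For orientation, a phase config might look roughly like the object below. The field names (`primary_level`, `fallback_level`, `max_lines`, `keep_whole_below`, `skip_empty_sections`, `special_sections`) and the three actions come from this commit message. Every value, the heading-to-action shape of `special_sections`, and the `validateConfig` helper are assumptions for illustration; the shipped JSON files are the source of truth.

```javascript
// Illustrative only: values and the special_sections shape are assumed.
const discussionConfig = {
  primary_level: 2,
  fallback_level: 3,
  max_lines: 200,
  keep_whole_below: 50,
  skip_empty_sections: true,
  special_sections: {
    'Discussion Map': 'own-chunk',
    Summary: 'own-chunk',
  },
};

const ACTIONS = new Set(['own-chunk', 'skip', 'merge-up']);

// Hypothetical schema check; returns a list of human-readable problems.
function validateConfig(config) {
  const problems = [];
  for (const key of ['primary_level', 'fallback_level', 'max_lines', 'keep_whole_below']) {
    if (!Number.isInteger(config[key]) || config[key] < 1) {
      problems.push(`${key} must be a positive integer`);
    }
  }
  if (config.fallback_level <= config.primary_level) {
    problems.push('fallback_level must be deeper than primary_level');
  }
  for (const [heading, action] of Object.entries(config.special_sections || {})) {
    if (!ACTIONS.has(action)) {
      problems.push(`unknown action "${action}" for "${heading}"`);
    }
  }
  return problems;
}
```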

Wired into src/knowledge/index.js so esbuild picks it up — bundle
130.6kb -> 131.6kb, well under the 150kb threshold.

Unit tests in tests/scripts/test-knowledge-chunker.cjs cover H2/H3
splitting, the flat fallback chain, special_sections behaviour, frontmatter
handling, empty sections, whole-file gate vs missing-headings fallback,
H1 handling, and fenced-code-block parsing correctness. 26 cases, plus
schema assertions against the 4 shipped configs. Phase 1 regression suite
(99 tests total) still green.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Adds fixture-based validation for the chunking engine. Nine fixtures:
five copied verbatim from real workflow projects (portal, pigeon, tick,
agntc) covering all four indexed phases plus a no-Discussion-Map edge
case, and four synthetic fixtures for size-gate, missing-headings,
fallback-splitting, and fenced-code-block paths.

Real fixtures (never edited — copied into tests/fixtures/knowledge/):
- research-fixture.md: portal cc-tool-plan (581 lines, 12 H2s)
- discussion-fixture.md: pigeon application-architecture (458 lines,
  11 H2s incl Discussion Map + Summary)
- investigation-fixture.md: tick cascade-unchanged-noise (149 lines,
  Symptoms/Analysis/Fix Direction)
- specification-fixture.md: portal portal/specification (757 lines,
  12 H2s, frontmatter + H1)
- discussion-no-map-fixture.md: agntc core-architecture (251 lines,
  Summary only)

Each real-world fixture is tested with its production phase config. The
tests assert the expected chunk count, the expected heading boundaries,
that no chunk exceeds max_lines, and the verbatim content preservation
invariant (every chunk's content must appear verbatim in the
post-frontmatter source, per design doc "no information loss", line 74).
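The verbatim invariant can be checked with a plain substring test. The sketch below assumes chunks are objects with a `content` string and that frontmatter is a leading `---` block. Both helper names are hypothetical and are not taken from the test file.

```javascript
// Hypothetical test helpers; the chunk shape ({ content }) is an assumption.
function stripFrontmatter(source) {
  const text = source.replace(/\r\n/g, '\n'); // compare against LF-normalised text
  const match = text.match(/^---\n[\s\S]*?\n---\n?/);
  return match ? text.slice(match[0].length) : text;
}

// Returns the chunks whose content is NOT a verbatim substring of the
// post-frontmatter source. An empty result means the invariant holds.
function findNonVerbatimChunks(source, chunks) {
  const body = stripFrontmatter(source);
  return chunks.filter((chunk) => !body.includes(chunk.content));
}
```

A fixture test would then assert that `findNonVerbatimChunks(source, chunks)` is empty for every fixture and phase config pair.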

The oversized-section fixture (257 lines, single H2 > max_lines) proves
the flat fallback chain splits once at H3 and never recurses. The
minimal spec fixture (22 lines) proves the keep_whole_below gate fires
before heading parsing. The code-block fixture proves fenced ## lines
do not trigger false splits.

One chunker fix surfaced by the fixtures: CRLF line endings are now
normalised to LF at the start of chunk() so Windows-style inputs chunk
identically. Edge case explicitly listed in the task.

Full regression suite: 111 tests (72 phase-1 + 39 chunker), all green.
Bundle unchanged at 131.6 KB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Addresses two gaps between the chunker implementation and the design
contract, surfaced during PR #244 review.

Issue #1 — merge-up broke content preservation (design line 74)
================================================================
The old merge-up path joined sections with a hard-coded `\n\n`:

    prev.content = prev.content + '\n\n' + section.text;

That separator is injected by the chunker, not drawn from the source.
If the gap between the parent and the merge-up section contained
anything other than exactly two newlines (e.g. a blank line with
trailing whitespace, or a comment), the merged chunk was no longer a
verbatim substring of the source — silently violating the global
"no lossy compression" invariant.

Fix: track source line ranges on every section during buildSections and
maintain them through the processing pipeline. merge-up now extends the
preceding segment's endLine; chunk content is generated by slicing from
the original line array once at the end. Contiguous source slice →
verbatim guaranteed, regardless of what sits between the merged
sections in the source.
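
A minimal sketch of the line-range approach, assuming segments are plain objects with a half-open `[startLine, endLine)` range. The names `mergeUp` and `renderChunks` and the segment shape are illustrative, not the code in chunker.js.

```javascript
// Hypothetical data model: segments are { startLine, endLine } with an
// exclusive endLine, indexing into the source split on '\n'.
function mergeUp(segments, section) {
  const prev = segments[segments.length - 1];
  if (!prev) {
    // First-section merge-up has nothing to merge into: promote it.
    segments.push({ startLine: section.startLine, endLine: section.endLine });
    return;
  }
  if (prev.endLine !== section.startLine) {
    throw new Error('merge-up needs a contiguous preceding segment');
  }
  prev.endLine = section.endLine; // extend the range, copy no text
}

// Content is produced once, at the end, by slicing the original lines.
function renderChunks(lines, segments) {
  return segments.map((s) => lines.slice(s.startLine, s.endLine).join('\n'));
}

// The gap between "## A" and "## Notes" is a whitespace-only line.
const source = '## A\nalpha\n \t\n## Notes\nbeta\n## B\ngamma';
const lines = source.split('\n');
const segments = [{ startLine: 0, endLine: 3 }];
mergeUp(segments, { startLine: 3, endLine: 5 });
segments.push({ startLine: 5, endLine: 7 });
const chunks = renderChunks(lines, segments);
```

Here `chunks[0]` keeps the whitespace-only separator line exactly as it appears in the source, which a hard-coded `'\n\n'` join would have replaced.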

Issue #3 — sub-level special_sections was not implemented
==========================================================
Task 2-1 states: "own-chunk: always split into its own chunk
regardless of heading level". The previous implementation only matched
at the split level (H2 for discussion), so an H3 "### Discussion Map"
nested inside a regular H2 would remain part of its parent chunk
instead of being extracted.

Fix: added expandSubLevelSpecials, which walks each split-level
section's line range and carves out any sub-level heading whose text
matches a special_sections entry. Precedence rule: if the parent
section's heading is itself in special_sections, the parent's action
wins and no sub-carving happens — "Discussion Map as H2" stays one
chunk. Sub-level matching handles own-chunk and skip; merge-up is a
split-level concept only and sub-level merge-up entries are treated
as regular sub-headings.
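
The carving rule can be sketched as follows. This is not the body of expandSubLevelSpecials. `carveSubLevelSpecials`, the segment shape, the `'regular'` action label, and the choice to end a carved range at the next heading of the same or a shallower level are all assumptions for illustration.

```javascript
// Hypothetical sketch. `section` is { heading, startLine, endLine } with an
// exclusive endLine; `headings` is [{ level, text, line }]; `specials` is a
// Map from heading text to action.
function carveSubLevelSpecials(section, headings, specials) {
  // Parent wins: a special parent keeps its whole range as one segment.
  if (specials.has(section.heading)) {
    return [{
      startLine: section.startLine,
      endLine: section.endLine,
      action: specials.get(section.heading),
    }];
  }

  const inner = headings.filter(
    (h) => h.line > section.startLine && h.line < section.endLine
  );
  const segments = [];
  let cursor = section.startLine;

  inner.forEach((h, i) => {
    if (h.line < cursor) return; // already inside a carved range
    const action = specials.get(h.text);
    // Sub-level merge-up is treated as a regular sub-heading.
    if (action !== 'own-chunk' && action !== 'skip') return;

    const next = inner.slice(i + 1).find((n) => n.level <= h.level);
    const end = next ? next.line : section.endLine;
    if (h.line > cursor) {
      segments.push({ startLine: cursor, endLine: h.line, action: 'regular' });
    }
    segments.push({ startLine: h.line, endLine: end, action });
    cursor = end;
  });

  if (cursor < section.endLine) {
    segments.push({ startLine: cursor, endLine: section.endLine, action: 'regular' });
  }
  return segments;
}
```

A caller would drop the `skip` segments and emit the rest by slicing the source lines, so the carved pieces stay verbatim.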

Issue #2 — own-chunk + size fallback ambiguity
==============================================
The task wording "always split into its own chunk" is ambiguous: does
own-chunk bypass the max_lines fallback split, or not? The previous
code made a silent choice. Added an inline comment documenting the
decision and its reasoning: own-chunk is interpreted literally as one
chunk, so a large Discussion Map stays intact even if it exceeds
max_lines. special_sections mark semantic units the user has declared
atomic, and splitting them would defeat that declaration.
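
A sketch of how the documented decision interacts with the flat size fallback. `applySizeFallback` and the `action` field on sections are hypothetical names; the behaviour shown (own-chunk bypasses the split, one pass at fallback_level, no recursion) follows the description in these commits.

```javascript
// Hypothetical sketch. `section` is { startLine, endLine, action } with an
// exclusive endLine; `headings` is [{ level, line }] from a fence-aware parse.
function applySizeFallback(section, headings, config) {
  const length = section.endLine - section.startLine;

  // own-chunk sections are atomic: never split, even when oversized.
  if (section.action === 'own-chunk' || length <= config.max_lines) {
    return [section];
  }

  const cuts = headings
    .filter(
      (h) =>
        h.level === config.fallback_level &&
        h.line > section.startLine &&
        h.line < section.endLine
    )
    .map((h) => h.line);
  if (cuts.length === 0) return [section]; // nothing to split on: keep whole

  // Split once at the fallback level. Pieces that are still over
  // max_lines are returned as-is: the chain is flat, not recursive.
  const bounds = [section.startLine, ...cuts, section.endLine];
  return bounds.slice(0, -1).map((start, i) => ({
    startLine: start,
    endLine: bounds[i + 1],
    action: section.action,
  }));
}
```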

Testing
=======
- Two new merge-up tests assert the verbatim invariant on
  synthetic sources with non-`\n\n` separators — would have caught the
  original bug. Both pass.
- One new sub-level extraction test using a synthetic fixture
  (`sub-level-special-fixture.md`) where `### Discussion Map` lives
  inside `## Plan`. Verifies the Map is extracted, the parent's intro
  stays, and "## Context"/"## Summary" boundaries are unaffected.
- One parent-wins precedence test against the real pigeon discussion
  fixture — makes the precedence rule explicit.

Additional real fixtures for structural diversity
==================================================
- spec-deep-nested-fixture.md (tick v1 tick-core/specification.md, 810L)
  — one huge H2 ("## Specification") that gets fallback-split into 11
  H3 chunks, plus one standalone H2 ("## Dependencies"). Proves the flat
  fallback chain handles real artifacts with a single dominant H2.
- research-oversized-h3-fixture.md (tick v1 research/exploration.md,
  567L): contains a ~310-line H3 "Session 1" that stays intact because the
  fallback never recurses; the oversized chunk is left for knowledge
  status to report in Phase 4.
- discussion-q-style-fixture.md (tick v1 cli-command-structure-ux.md,
  416L) — Q1..Q6 discussion shape with no Discussion Map, proves the
  chunker copes with non-canonical discussion variants.
- spec-folio-fixture.md (folio template-authoring-system/specification.md,
  617L) — different spec structure from portal/tick for diversity.

Regression: 119/119 tests green. Bundle 132.9 KB (up 1.3 KB,
17 KB under the 150 KB ceiling).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@leeovery leeovery merged commit 405baed into main Apr 11, 2026
@leeovery leeovery deleted the feat/knowledge-base-phase-2 branch April 11, 2026 16:55