feat(memory): add heading-aware chunking option for improved search#59154
feat(memory): add heading-aware chunking option for improved search#59154chinar-amrutkar wants to merge 2 commits intoopenclaw:mainfrom
Conversation
Introduces a headingAware config option in memory search chunking that preserves markdown heading structure during text chunking. Instead of splitting purely by token count, the chunker splits on heading boundaries first, then sub-chunks within sections. - Add headingAware boolean to chunking config schema and merge logic - chunkMarkdown() flushes current chunk on heading lines when enabled - Track chunkHeadingAware in MemoryIndexMeta for reindex detection - Add config label and help text for the new option Closes openclaw#45545
Greptile SummaryThis PR adds an opt-in Remaining concerns:
Confidence Score: 4/5The feature is opt-in and defaults to off, so no regression for existing users in production behaviour, but three P1 correctness bugs remain unaddressed. Three P1 issues exist: one newly identified (fenced code block false positives that corrupt chunk boundaries) and two from prior threads that are still unresolved (spurious full reindex on upgrade, indented-code-block false heading detection). All affect users who enable headingAware; the reindex bug also silently fires for all users on first upgrade. Score held at 4 rather than lower because the feature is opt-in and the codebase is otherwise clean. src/memory/internal.ts (heading detection logic) and src/memory/manager-sync-ops.ts (undefined vs false reindex comparison) need the most attention before merging. Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/memory/internal.ts
Line: 348-350
Comment:
**Fenced code blocks not excluded from heading detection**
`isHeading` has no awareness of fenced code blocks (`` ``` `` or `~~~` delimiters). Any line inside a fenced code block that happens to start with `#` — e.g. a Python/shell comment (`# install deps`), a C preprocessor directive (`#include`), or a Ruby comment — will be incorrectly treated as a Markdown heading and trigger a spurious chunk flush in the middle of the code block.
This is a distinct problem from the indented-code-block `.trim()` issue: fenced blocks are far more common in practice and the `#` character appears frequently inside them.
A minimal fix is to track the fence state as you iterate over lines:
```typescript
let inFencedBlock = false;
for (let i = 0; i < lines.length; i += 1) {
const line = lines[i] ?? "";
const lineNo = i + 1;
if (/^(`{3,}|~{3,})/.test(line)) {
inFencedBlock = !inFencedBlock;
}
// Heading-aware: flush on heading (unless it's the first heading)
if (headingAware && !inFencedBlock && isHeading(line) && current.length > 0) {
flush();
carryOverlap();
}
// ... rest of loop
}
```
How can I resolve this? If you propose a fix, please make it concise.Reviews (2): Last reviewed commit: "feat(memory): add heading-aware chunking..." | Re-trigger Greptile |
| meta.scopeHash !== configuredScopeHash || | ||
| meta.chunkTokens !== this.settings.chunking.tokens || | ||
| meta.chunkOverlap !== this.settings.chunking.overlap || | ||
| meta.chunkHeadingAware !== this.settings.chunking.headingAware || |
There was a problem hiding this comment.
Undefined vs false comparison triggers spurious full reindex
chunkHeadingAware is declared optional (chunkHeadingAware?: boolean) in MemoryIndexMeta, so any existing persisted index will have undefined for this field. The default resolved value of this.settings.chunking.headingAware is false (from mergeConfig). This means every existing user will hit undefined !== false → true on the very first run after this upgrade, triggering a full reindex even though they never enabled headingAware.
The fix is to normalise meta.chunkHeadingAware before comparing:
| meta.chunkHeadingAware !== this.settings.chunking.headingAware || | |
| (meta.chunkHeadingAware ?? false) !== this.settings.chunking.headingAware || |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/memory/manager-sync-ops.ts
Line: 1006
Comment:
**Undefined vs false comparison triggers spurious full reindex**
`chunkHeadingAware` is declared optional (`chunkHeadingAware?: boolean`) in `MemoryIndexMeta`, so any existing persisted index will have `undefined` for this field. The default resolved value of `this.settings.chunking.headingAware` is `false` (from `mergeConfig`). This means every existing user will hit `undefined !== false` → `true` on the very first run after this upgrade, triggering a full reindex even though they never enabled `headingAware`.
The fix is to normalise `meta.chunkHeadingAware` before comparing:
```suggestion
(meta.chunkHeadingAware ?? false) !== this.settings.chunking.headingAware ||
```
How can I resolve this? If you propose a fix, please make it concise.| const isHeading = (line: string): boolean => { | ||
| return /^#{1,6}\s+/.test(line.trim()); | ||
| }; |
There was a problem hiding this comment.
line.trim() causes false heading detection inside indented code blocks
Calling line.trim() strips all leading whitespace before applying the heading regex. In Markdown (CommonMark spec), a line with 4 or more leading spaces is treated as an indented code block, not a heading — so # not a heading inside a code block should not trigger a chunk boundary. After trimming, it becomes # not a heading and matches the regex, producing an incorrect split.
The fix is to drop .trim() and instead allow only 0–3 optional leading spaces (per spec):
| const isHeading = (line: string): boolean => { | |
| return /^#{1,6}\s+/.test(line.trim()); | |
| }; | |
| return /^ {0,3}#{1,6}[ \t]/.test(line); |
This matches valid Markdown headings (≤3 spaces indent, followed by a space/tab after the # run) while correctly ignoring # content inside indented code blocks.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/memory/internal.ts
Line: 348-350
Comment:
**`line.trim()` causes false heading detection inside indented code blocks**
Calling `line.trim()` strips all leading whitespace before applying the heading regex. In Markdown (CommonMark spec), a line with 4 or more leading spaces is treated as an indented code block, not a heading — so ` # not a heading` inside a code block should not trigger a chunk boundary. After trimming, it becomes `# not a heading` and matches the regex, producing an incorrect split.
The fix is to drop `.trim()` and instead allow only 0–3 optional leading spaces (per spec):
```suggestion
return /^ {0,3}#{1,6}[ \t]/.test(line);
```
This matches valid Markdown headings (≤3 spaces indent, followed by a space/tab after the `#` run) while correctly ignoring `#` content inside indented code blocks.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 32e45e501d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| meta.scopeHash !== configuredScopeHash || | ||
| meta.chunkTokens !== this.settings.chunking.tokens || | ||
| meta.chunkOverlap !== this.settings.chunking.overlap || | ||
| meta.chunkHeadingAware !== this.settings.chunking.headingAware || |
There was a problem hiding this comment.
Treat absent headingAware metadata as false in reindex check
This comparison forces a full reindex for every pre-existing index after upgrade because old metadata does not have chunkHeadingAware (so it is undefined) while the resolved setting defaults to false; undefined !== false evaluates true and triggers reindex even though chunking behavior is unchanged. On large memory stores this adds unnecessary startup/sync cost and can look like a regression. Compare with a defaulted value (e.g. meta.chunkHeadingAware ?? false) to avoid one-time mass reindexing.
Useful? React with 👍 / 👎.
| flush(); | ||
| carryOverlap(); |
There was a problem hiding this comment.
Avoid carrying overlap when splitting on new headings
In heading-aware mode this keeps overlap lines from the previous section and prepends them to the next heading chunk, so chunks can cross heading boundaries despite the option’s stated behavior (“split on heading boundaries first”). With non-zero overlap, searches may retrieve mixed-section chunks and reduce relevance for structured docs. Heading-triggered flushes should start a fresh section chunk instead of reusing prior-section overlap.
Useful? React with 👍 / 👎.
Adds a headingAware config option that splits on level-2 and level-3 markdown headings (##, ###) before falling back to token-based chunking. This keeps heading-delimited sections as discrete chunks, improving retrieval precision for structured memory files. - chunkMarkdown() flushes current chunk on ## and ### headings when enabled - Tracks chunkHeadingAware in MemoryIndexMeta for reindex detection - Config schema, labels, and help text for the new option - Backward-compatible: files without headings behave exactly as today Closes openclaw#44395
Introduces a
headingAwareconfig option in memory search chunking that preserves markdown heading structure during text chunking. Instead of splitting purely by token count, the chunker splits on heading boundaries first, then sub-chunks within sections.headingAwareboolean to chunking config schema and merge logicchunkMarkdown()flushes current chunk on heading lines when enabledchunkHeadingAwareinMemoryIndexMetafor reindex detectionCloses #45545