feat(memory): add heading-aware chunking option for improved search #59154

Closed
chinar-amrutkar wants to merge 2 commits into openclaw:main from chinar-amrutkar:fix/memory-heading-aware-chunking

@chinar-amrutkar
Contributor

Introduces a headingAware config option in memory search chunking that preserves markdown heading structure during text chunking. Instead of splitting purely by token count, the chunker splits on heading boundaries first, then sub-chunks within sections.

  • Add headingAware boolean to chunking config schema and merge logic
  • chunkMarkdown() flushes current chunk on heading lines when enabled
  • Track chunkHeadingAware in MemoryIndexMeta for reindex detection
  • Add config label and help text for the new option

Closes #45545
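
To make the described behaviour concrete, here is a minimal sketch of heading-first chunking. It is not the PR's `chunkMarkdown()`: the names `Chunk`, `isHeadingLine`, and `chunkByHeadings` are hypothetical, and chunk size is measured in characters as a stand-in for tokens.

```typescript
// Illustrative sketch only; not the implementation from this PR.
type Chunk = { startLine: number; endLine: number; text: string };

// Assumption: ATX headings with up to three leading spaces count as boundaries.
const isHeadingLine = (line: string): boolean => /^ {0,3}#{1,6}[ \t]/.test(line);

function chunkByHeadings(content: string, maxChars: number, headingAware: boolean): Chunk[] {
  const chunks: Chunk[] = [];
  let currentLines: string[] = [];
  let startLine = 1;
  let currentChars = 0;

  // Emit the running chunk (if any) and start a new one at nextStartLine.
  const flush = (nextStartLine: number): void => {
    if (currentLines.length > 0) {
      chunks.push({
        startLine,
        endLine: startLine + currentLines.length - 1,
        text: currentLines.join("\n"),
      });
    }
    currentLines = [];
    currentChars = 0;
    startLine = nextStartLine;
  };

  content.split("\n").forEach((line, index) => {
    const lineNo = index + 1;
    // Heading boundary first: close the running chunk before a new section.
    if (headingAware && isHeadingLine(line)) flush(lineNo);
    // Size limit second: sub-chunk inside a long section.
    if (currentChars + line.length + 1 > maxChars) flush(lineNo);
    currentLines.push(line);
    currentChars += line.length + 1;
  });
  flush(0); // emit whatever is left; the start line passed here is unused
  return chunks;
}
```

With `headingAware` set to `false` the same function degrades to plain size-based splitting, which mirrors the opt-in design described above.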

@openclaw-barnacle openclaw-barnacle bot added the agents (Agent runtime and tooling) and size: XS labels on Apr 1, 2026

greptile-apps bot commented Apr 1, 2026

Greptile Summary

This PR adds an opt-in headingAware boolean to the memory-search chunking config that flushes the current chunk whenever a Markdown heading is encountered, keeping section content together. The schema, label, help text, mergeConfig wiring, chunkMarkdown implementation, and MemoryIndexMeta persistence all look cohesive.

Remaining concerns:

  • Fenced code block false positives (P1 — new): isHeading has no awareness of fenced code blocks. A # line inside a ``` block (Python/shell comment, C preprocessor directive, etc.) will be treated as a heading and trigger a spurious chunk split. This makes headingAware actively harmful for content that mixes prose and fenced code. See the inline comment for a minimal fence-tracking fix.

  • Spurious full reindex on first upgrade (P1 — flagged in prior thread, still unresolved): meta.chunkHeadingAware is undefined in any existing persisted index; comparing it against the resolved default false via !== evaluates to true, triggering a full reindex for every existing user immediately after upgrading.

  • .trim() misidentifies indented code block lines as headings (P1 — flagged in prior thread, still unresolved): Stripping leading whitespace before the heading regex means a line with four or more leading spaces (an indented code block per the CommonMark spec) can falsely match as a heading.

Confidence Score: 4/5

The feature is opt-in and defaults to off, so no regression for existing users in production behaviour, but three P1 correctness bugs remain unaddressed.

Three P1 issues exist: one newly identified (fenced code block false positives that corrupt chunk boundaries) and two from prior threads that are still unresolved (spurious full reindex on upgrade, indented-code-block false heading detection). All affect users who enable headingAware; the reindex bug also silently fires for all users on first upgrade. Score held at 4 rather than lower because the feature is opt-in and the codebase is otherwise clean.

src/memory/internal.ts (heading detection logic) and src/memory/manager-sync-ops.ts (undefined vs false reindex comparison) need the most attention before merging.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/memory/internal.ts
Line: 348-350

Comment:
**Fenced code blocks not excluded from heading detection**

`isHeading` has no awareness of fenced code blocks (`` ``` `` or `~~~` delimiters). Any line inside a fenced code block that happens to start with `#` — e.g. a Python/shell comment (`# install deps`), a C preprocessor directive (`#include`), or a Ruby comment — will be incorrectly treated as a Markdown heading and trigger a spurious chunk flush in the middle of the code block.

This is a distinct problem from the indented-code-block `.trim()` issue: fenced blocks are far more common in practice and the `#` character appears frequently inside them.

A minimal fix is to track the fence state as you iterate over lines:

```typescript
let inFencedBlock = false;
for (let i = 0; i < lines.length; i += 1) {
  const line = lines[i] ?? "";
  const lineNo = i + 1;

  if (/^(`{3,}|~{3,})/.test(line)) {
    inFencedBlock = !inFencedBlock;
  }

  // Heading-aware: flush on heading (unless it's the first heading)
  if (headingAware && !inFencedBlock && isHeading(line) && current.length > 0) {
    flush();
    carryOverlap();
  }
  // ... rest of loop
}
```

How can I resolve this? If you propose a fix, please make it concise.
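
One caveat with a plain toggle is that any fence-looking line flips the state, so a `~~~` line inside a backtick-fenced block would close it early. The sketch below is a slightly stricter variant under the CommonMark rule that a closing fence uses the same character as the opener, is at least as long, and carries no info string. `markFencedLines` is a hypothetical helper, not something from this PR.

```typescript
// Hypothetical helper: for each line, reports whether it lies inside a fenced
// code block. The opening and closing fence lines themselves count as inside.
function markFencedLines(lines: string[]): boolean[] {
  const inside: boolean[] = [];
  let open: { char: string; length: number } | null = null;

  for (const line of lines) {
    // Fences may be indented by up to three spaces.
    const match = /^ {0,3}(`{3,}|~{3,})(.*)$/.exec(line);
    const fence = match?.[1] ?? "";
    const rest = match?.[2] ?? "";

    if (open === null) {
      // A backtick fence's info string may not itself contain backticks.
      const opens = fence !== "" && !(fence.startsWith("`") && rest.includes("`"));
      if (opens) open = { char: fence.charAt(0), length: fence.length };
      inside.push(opens);
    } else {
      inside.push(true);
      const closes =
        fence.charAt(0) === open.char && fence.length >= open.length && rest.trim() === "";
      if (closes) open = null;
    }
  }
  return inside;
}
```

A heading check could then skip any line where the returned flag is `true`, instead of consulting a single boolean toggle.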

Reviews (2): Last reviewed commit: "feat(memory): add heading-aware chunking..." | Re-trigger Greptile

meta.scopeHash !== configuredScopeHash ||
meta.chunkTokens !== this.settings.chunking.tokens ||
meta.chunkOverlap !== this.settings.chunking.overlap ||
meta.chunkHeadingAware !== this.settings.chunking.headingAware ||

P1 Undefined vs false comparison triggers spurious full reindex

chunkHeadingAware is declared optional (chunkHeadingAware?: boolean) in MemoryIndexMeta, so any existing persisted index will have undefined for this field. The default resolved value of this.settings.chunking.headingAware is false (from mergeConfig). This means every existing user will hit undefined !== false → true on the very first run after this upgrade, triggering a full reindex even though they never enabled headingAware.

The fix is to normalise meta.chunkHeadingAware before comparing:

Suggested change
- meta.chunkHeadingAware !== this.settings.chunking.headingAware ||
+ (meta.chunkHeadingAware ?? false) !== this.settings.chunking.headingAware ||
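
The effect of the normalisation can be seen in isolation with a small sketch. The `IndexMeta` and `ChunkingSettings` shapes and the `needsReindex` function are hypothetical, trimmed-down stand-ins; the real check in `manager-sync-ops.ts` compares more fields.

```typescript
// Hypothetical, reduced shapes for illustration only.
type IndexMeta = { chunkTokens: number; chunkOverlap: number; chunkHeadingAware?: boolean };
type ChunkingSettings = { tokens: number; overlap: number; headingAware: boolean };

function needsReindex(meta: IndexMeta, chunking: ChunkingSettings): boolean {
  return (
    meta.chunkTokens !== chunking.tokens ||
    meta.chunkOverlap !== chunking.overlap ||
    // Metadata written before the option existed lacks the field; treat it as false.
    (meta.chunkHeadingAware ?? false) !== chunking.headingAware
  );
}

// A legacy index (no chunkHeadingAware field) with arbitrary example values.
const legacyMeta: IndexMeta = { chunkTokens: 400, chunkOverlap: 80 };

console.log(needsReindex(legacyMeta, { tokens: 400, overlap: 80, headingAware: false }));
console.log(needsReindex(legacyMeta, { tokens: 400, overlap: 80, headingAware: true }));
```

With the `?? false` default, the legacy index is left alone while the option stays off, and a reindex is requested only once `headingAware` is actually enabled.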

Comment on lines +348 to +350
const isHeading = (line: string): boolean => {
return /^#{1,6}\s+/.test(line.trim());
};

P1 line.trim() causes false heading detection inside indented code blocks

Calling line.trim() strips all leading whitespace before applying the heading regex. In Markdown (CommonMark spec), a line with 4 or more leading spaces is treated as an indented code block, not a heading, so `    # not a heading` inside a code block should not trigger a chunk boundary. After trimming, it becomes `# not a heading` and matches the regex, producing an incorrect split.

The fix is to drop .trim() and instead allow only 0–3 optional leading spaces (per spec):

Suggested change
- const isHeading = (line: string): boolean => {
- return /^#{1,6}\s+/.test(line.trim());
- };
+ return /^ {0,3}#{1,6}[ \t]/.test(line);

This matches valid Markdown headings (≤3 spaces indent, followed by a space/tab after the # run) while correctly ignoring # content inside indented code blocks.
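
The difference between the two patterns is easiest to see side by side. The sketch below wraps each regex in its own function; the names `isHeadingTrimmed` and `isHeadingStrict` are made up for this comparison and do not appear in the PR.

```typescript
// The original check from the PR: trims first, then tests.
const isHeadingTrimmed = (line: string): boolean => /^#{1,6}\s+/.test(line.trim());

// The suggested check: at most three leading spaces, then 1-6 hashes, then space or tab.
const isHeadingStrict = (line: string): boolean => /^ {0,3}#{1,6}[ \t]/.test(line);

const samples: string[] = [
  "# Title",
  "   ## Indented three", // three spaces: still a heading
  "    # not a heading", // four spaces: indented code block
  "#hashtag", // no space after the hash
  "####### seven", // seven hashes: not an ATX heading
];

for (const sample of samples) {
  console.log(JSON.stringify(sample), isHeadingTrimmed(sample), isHeadingStrict(sample));
}
```

Only the four-space sample differs between the two: the trimmed check accepts it and the strict check rejects it. Neither pattern matches a bare `#` line with nothing after it, which CommonMark does allow as an empty heading, so that case would need an extra alternative if it matters.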

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 32e45e501d

meta.scopeHash !== configuredScopeHash ||
meta.chunkTokens !== this.settings.chunking.tokens ||
meta.chunkOverlap !== this.settings.chunking.overlap ||
meta.chunkHeadingAware !== this.settings.chunking.headingAware ||

P2: Treat absent headingAware metadata as false in reindex check

This comparison forces a full reindex for every pre-existing index after upgrade because old metadata does not have chunkHeadingAware (so it is undefined) while the resolved setting defaults to false; undefined !== false evaluates to true and triggers a reindex even though chunking behavior is unchanged. On large memory stores this adds unnecessary startup/sync cost and can look like a regression. Compare with a defaulted value (e.g. meta.chunkHeadingAware ?? false) to avoid one-time mass reindexing.

Comment on lines +405 to +406
flush();
carryOverlap();

P2: Avoid carrying overlap when splitting on new headings

In heading-aware mode this keeps overlap lines from the previous section and prepends them to the next heading chunk, so chunks can cross heading boundaries despite the option’s stated behavior (“split on heading boundaries first”). With non-zero overlap, searches may retrieve mixed-section chunks and reduce relevance for structured docs. Heading-triggered flushes should start a fresh section chunk instead of reusing prior-section overlap.
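
One way to express that distinction is to make the flush reason explicit and seed the next chunk accordingly. This is a sketch under assumptions: `FlushReason` and `seedNextChunk` are hypothetical names, and overlap is counted in lines here, whereas the PR's `carryOverlap()` may measure it differently.

```typescript
// Hypothetical sketch: decide which lines the next chunk starts with.
type FlushReason = "size" | "heading";

function seedNextChunk(
  previousLines: string[],
  overlapLines: number,
  reason: FlushReason,
): string[] {
  // A heading starts a fresh section, so nothing is carried across it.
  if (reason === "heading" || overlapLines <= 0) return [];
  // A size-triggered split keeps the tail of the previous chunk as overlap.
  return previousLines.slice(-overlapLines);
}

const previous = ["intro line", "detail one", "detail two"];
console.log(seedNextChunk(previous, 2, "size"));
console.log(seedNextChunk(previous, 2, "heading"));
```

The explicit `overlapLines <= 0` guard matters because `slice(-0)` would return the whole array rather than an empty one.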

Adds a headingAware config option that splits on level-2 and level-3
markdown headings (##, ###) before falling back to token-based chunking.
This keeps heading-delimited sections as discrete chunks, improving
retrieval precision for structured memory files.

- chunkMarkdown() flushes current chunk on ## and ### headings when enabled
- Tracks chunkHeadingAware in MemoryIndexMeta for reindex detection
- Config schema, labels, and help text for the new option
- Backward-compatible: files without headings behave exactly as today

Closes openclaw#44395
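
For the narrower rule this commit message describes (only `##` and `###` trigger a split), the heading test could look like the sketch below. `isSectionHeading` is a hypothetical name, and combining the level restriction with the 0-3 space indent rule from the review comments is an assumption rather than something the commit states.

```typescript
// Hypothetical check: level-2 and level-3 ATX headings only, with at most
// three leading spaces and a space or tab after the hashes.
const isSectionHeading = (line: string): boolean => /^ {0,3}#{2,3}[ \t]/.test(line);

for (const sample of ["# Title", "## Section", "### Subsection", "#### Detail"]) {
  console.log(sample, isSectionHeading(sample));
}
```

Under this pattern a level-1 title and anything at level 4 or deeper stay inside the surrounding chunk.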
@chinar-amrutkar chinar-amrutkar deleted the fix/memory-heading-aware-chunking branch April 1, 2026 21:40