fix(memory): incremental session sync (#40919)#59577

Closed
wr-web wants to merge 1 commit into openclaw:main from wr-web:main

Conversation


@wr-web wr-web commented Apr 2, 2026

Summary

Describe the problem and fix in 2–5 bullets:


  • Problem: When a session file grows, the memory index performs a full reindex: it deletes all chunks, regenerates all content, recomputes all embeddings, and reinserts everything. This causes O(n) performance degradation as sessions grow longer.
  • Why it matters: Session growth can trigger a full reindex as often as every 1.5 seconds. For a 50-turn session with 10 chunks, that means 10 embedding API calls per sync even when only 1-2 chunks are new. This wastes API quota, increases latency, and causes unnecessary database writes.
  • What changed: Implemented hash-based incremental indexing for session files. Only new or changed chunks are re-embedded; unchanged chunks are preserved. This reduces embedding API calls by 44-67% in the measured benchmarks, depending on session growth.
  • What did NOT change (scope boundary): Memory files (MEMORY.md) still use full reindex; only session files use incremental sync. The chunking algorithm and the embedding provider are unchanged.
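The incremental flow described above can be sketched in TypeScript. This is a minimal sketch only: `Chunk`, `hashChunk`, and `diffChunks` are hypothetical names for illustration, not the PR's actual API.

```typescript
import { createHash } from "node:crypto";

// Hypothetical chunk shape for illustration.
interface Chunk {
  text: string;
  hash?: string;
}

// Content hash used for dedup lookup (offsets intentionally excluded).
const hashChunk = (text: string): string =>
  createHash("sha256").update(text).digest("hex");

// Partition freshly generated chunks into "already indexed" (reuse the
// stored embedding) and "new/changed" (needs an embedding API call).
function diffChunks(generated: Chunk[], existingHashes: Set<string>) {
  const reuse: Chunk[] = [];
  const embed: Chunk[] = [];
  for (const chunk of generated) {
    const hash = hashChunk(chunk.text);
    (existingHashes.has(hash) ? reuse : embed).push({ ...chunk, hash });
  }
  return { reuse, embed };
}

// A session that grew by one turn: two previously indexed chunks, one new.
const existingHashes = new Set(["old turn 1", "old turn 2"].map(hashChunk));
const { reuse, embed } = diffChunks(
  [{ text: "old turn 1" }, { text: "old turn 2" }, { text: "new turn 3" }],
  existingHashes,
);
console.log(reuse.length, embed.length); // 2 1
```

Only `embed` is sent to the embedding provider; `reuse` keeps its stored vectors untouched.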

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #
  • Related #
  • This PR fixes a bug or regression

Root Cause / Regression History (if applicable)

For bug fixes or regressions, explain why this happened, not just what changed. Otherwise write N/A. If the cause is unclear, write Unknown.

N/A - This is a feature, not a bug fix.

Regression Test Plan (if applicable)

For bug fixes or regressions, name the smallest reliable test coverage that should have caught this. Otherwise write N/A.

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: manager.incremental-session.test.ts
  • Scenario the test should lock in: Session growth triggers incremental sync instead of full reindex
  • Why this is the smallest reliable guardrail: Directly tests the hash-based deduplication logic
  • Existing test that already covers this (if any): manager.atomic-reindex.test.ts covers full reindex
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

List user-visible changes (including defaults/config).
If none, write None.

None. The incremental sync is transparent to users. Session syncs are faster but the behavior is identical.

Diagram (if applicable)

For UI changes or non-trivial logic flows, include a small ASCII diagram reviewers can scan quickly. Otherwise write N/A.

Before (Full Reindex):
[session grows] -> [delete ALL chunks] -> [regenerate ALL chunks] -> [embed ALL chunks] -> [insert ALL chunks]

After (Incremental Sync):
[session grows] -> [generate ALL chunks] -> [compare hashes] -> [skip unchanged] -> [embed NEW chunks only] -> [insert NEW chunks only]

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: Linux (Fedora)
  • Runtime/container: Node.js 22
  • Model/provider: Ollama + nomic-embed-text
  • Integration/channel: Session memory sync
  • Relevant config (redacted):
    {
      "memory": {
        "sources": ["sessions"],
        "chunking": { "tokens": 200, "overlap": 50 },
        "experimental": { "sessionMemory": true }
      }
    }

Steps

  1. Create a session with 3 turns (6 messages)
  2. Run initial sync with force: true
  3. Grow session to 5 turns (+2 turns = +4 messages)
  4. Run incremental sync with sessionFiles: [path]
  5. Verify that only 1 chunk was re-embedded (not all 3)

Expected

  • Incremental sync embeds only new/changed chunks
  • Existing chunks are preserved with identical embeddings
  • Embedding API calls reduced by 50-67%

Actual

  • Confirmed: Only 1 chunk re-embedded (vs 3 with full reindex)
  • Confirmed: Existing chunks have byte-for-byte identical embeddings
  • Measured: 50-67% reduction in embedding calls

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Perf numbers (if relevant)

Benchmark Results (Real Ollama API)

| Scenario | Chunks Before | Chunks After | Full Reindex | Incremental | Saving |
| --- | --- | --- | --- | --- | --- |
| overlap=0, +2 turns | 2 | 3 | 3 calls | 1 call | 66.7% |
| overlap=0, +5 turns | 3 | 5 | 5 calls | 2 calls | 60.0% |
| overlap=50, +2 turns | 2 | 4 | 4 calls | 2 calls | 50.0% |
| overlap=50, +5 turns | 4 | 9 | 9 calls | 5 calls | 44.4% |
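The Saving column is `(full - incremental) / full`; a few lines reproduce the table's values:

```typescript
// Saving percentage, rounded to one decimal place as in the table.
const saving = (full: number, incremental: number): number =>
  Math.round(((full - incremental) / full) * 1000) / 10;

console.log(saving(3, 1)); // 66.7
console.log(saving(5, 2)); // 60
console.log(saving(4, 2)); // 50
console.log(saving(9, 5)); // 44.4
```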

Embedding Consistency Verification

| Test | Result |
| --- | --- |
| Full reindex twice (same content) | ✅ Identical embeddings |
| Incremental sync preserves embeddings | ✅ Byte-for-byte identical |
| Full reindex after incremental | ✅ Identical embeddings |

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios:
    • Session growth with overlap=0
    • Session growth with overlap=50
    • Unicode content preservation
    • Long messages (multiple chunks per message)
    • Very long sessions (50+ turns)
  • Edge cases checked:
    • Empty file → single message
    • Single message → multiple messages
    • Large chunks (all content in one chunk)
  • What you did not verify:
    • Real-world production sessions with actual user data
    • Performance on very large sessions (100+ turns)

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps: N/A

Compatibility Details

| Aspect | Status | Details |
| --- | --- | --- |
| Database Schema | ✅ Compatible | Uses ensureColumn() to safely add new columns with default values |
| API Compatibility | ✅ Compatible | indexFile() preserved; indexSessionFileIncremental() is new |
| Existing Data | ✅ Compatible | Existing chunks work normally; new columns default to 0 |
| Downgrade | ✅ Compatible | Old versions ignore new columns and still function correctly |
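The table references ensureColumn(), whose real signature is not shown in this PR page. The sketch below is therefore an assumption of what an idempotent column migration of this kind typically looks like, exercised against a toy in-memory stand-in instead of a real SQLite handle.

```typescript
// Hypothetical sketch of an ensureColumn()-style helper; the real
// implementation and its signature are not shown in this PR.
type TableInfoRow = { name: string };

interface Db {
  columns(table: string): TableInfoRow[]; // stand-in for PRAGMA table_info
  exec(sql: string): void;
}

// Adds the column only if it is missing, so the startup migration is
// idempotent and old rows keep the DEFAULT value (0 for the offset columns).
function ensureColumn(db: Db, table: string, column: string, ddl: string): boolean {
  if (db.columns(table).some((c) => c.name === column)) return false;
  db.exec(`ALTER TABLE ${table} ADD COLUMN ${column} ${ddl}`);
  return true;
}

// Toy in-memory Db so the logic can run without a SQLite dependency.
function fakeDb(initial: string[]): Db & { statements: string[] } {
  const cols = initial.map((name) => ({ name }));
  const statements: string[] = [];
  return {
    statements,
    columns: () => cols,
    exec(sql: string) {
      statements.push(sql);
      const m = sql.match(/ADD COLUMN (\S+)/);
      if (m) cols.push({ name: m[1] });
    },
  };
}

const db = fakeDb(["id", "hash", "text"]);
console.log(ensureColumn(db, "chunks", "start_offset", "INTEGER NOT NULL DEFAULT 0")); // true
console.log(ensureColumn(db, "chunks", "start_offset", "INTEGER NOT NULL DEFAULT 0")); // false
```

The second call is a no-op, which is what makes downgrade/upgrade cycles safe.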

Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write None.

  • Risk: Existing chunks have start_offset=0, end_offset=0 which may cause hash lookup issues

    • Mitigation: Hash-based lookup ignores offset values, uses content hash for matching
  • Risk: Single-chunk sessions don't benefit from incremental sync

    • Mitigation: This is expected behavior - small sessions naturally re-embed all content. No regression compared to before.
  • Risk: Overlap causes boundary chunks to change, reducing hash preservation rate

    • Mitigation: This is inherent to the overlap mechanism. Users can reduce overlap to improve preservation rate.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dd6cc48645


Comment thread src/memory/manager-embedding-ops.ts Outdated
Comment thread src/memory/manager-embedding-ops.ts

greptile-apps Bot commented Apr 2, 2026

Greptile Summary

This PR implements hash-based incremental indexing for session memory files, replacing the previous O(n) full reindex on every session update with a scheme that only re-embeds new or changed chunks. The approach is well-scoped (session files only), backward-compatible, and produces meaningful API-call savings (44–67% measured). The new chunkMarkdownWithOffset / indexSessionFileIncremental pair fits cleanly into the existing abstract class hierarchy.

Issues found:

  • Missing ON CONFLICT for new-chunk INSERT (manager-embedding-ops.ts, step 7): The full-reindex path uses ON CONFLICT(id) DO UPDATE SET … so re-running it is always safe. The incremental path uses a plain INSERT for newly discovered chunks. If a previous partial sync completed the INSERT but failed before upsertFileRecord, the next sync will see a stale file record, re-derive the same "new" chunks, and hit a UNIQUE constraint violation. Adding the same upsert pattern as indexFile closes this gap.

  • Redundant try/catch (manager-embedding-ops.ts, step 6): The catch block catches and immediately re-throws with no side effects. Remove the wrapper.

  • endOffset off-by-one in chunkMarkdownWithOffset (internal.ts): flush() computes endOffset = charOffset + currentChars - 1 where currentChars counts each line plus its trailing \n. This produces endOffset = startOffset + text.length (points past the last text character at the line-terminator). enforceEmbeddingMaxInputTokens uses the convention endOffset = startOffset + text.length - 1. For non-split chunks the stored offset is 1 too large; for split chunks the last sub-chunk's offset is correct — so the two paths produce inconsistent values. Since start_offset/end_offset are metadata-only (not used in any search query), there is no functional regression today, but the inconsistency will mislead future consumers.

Confidence Score: 3/5

  • Safe to merge after addressing the missing ON CONFLICT clause; the other issues are minor quality concerns.
  • The core hash-based deduplication logic is correct and well-tested for the append-only session file case. The schema migration is backward-compatible. However, the plain INSERT without ON CONFLICT in the incremental path creates a real (if uncommon) failure mode on partial-sync recovery that the full-reindex path does not have. The off-by-one in endOffset and the redundant try/catch are minor issues. Confidence is 3/5 pending the INSERT robustness fix.
  • src/memory/manager-embedding-ops.ts (INSERT without ON CONFLICT for new chunks, redundant try/catch), src/memory/internal.ts (endOffset off-by-one in chunkMarkdownWithOffset)
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/memory/manager-embedding-ops.ts
Line: 908-916

Comment:
**Redundant try/catch block**

The `try/catch` around the embedding call in step 6 catches the error and immediately re-throws it without any logging, transformation, or side effects. This adds indentation noise without any benefit and differs from how `indexFile` handles the same code path (where the `catch` block adds meaningful logic like checking `isStructuredInputTooLargeError`).

```suggestion
    if (allNeedEmbedding.length > 0) {
      embeddings = this.batch.enabled
        ? await this.embedChunksWithBatch(allNeedEmbedding, entry, options.source)
        : await this.embedChunksInBatches(allNeedEmbedding);
    }
```


---

Path: src/memory/manager-embedding-ops.ts
Line: 929-947

Comment:
**Plain INSERT without ON CONFLICT may throw on duplicate IDs**

The full-reindex path (`indexFile`) uses `ON CONFLICT(id) DO UPDATE SET ...` for its chunk INSERT (line 1124), making it safe to retry or run against an existing record. The new incremental path uses a plain INSERT with no conflict handling.

Although in the normal append-only case the incremental logic correctly avoids inserting a chunk that's already matched by hash, consider a race condition or a previous partial failure where `upsertFileRecord` was never reached (so the file record is stale). On the next sync, the caller selects existing chunks, matches them by hash, and marks them as "used." But if the DB already has a row for the new chunk from the previous partially-completed run, the plain INSERT will throw `UNIQUE constraint failed: chunks.id`.

Unlike the matched/preserved path, chunks in `chunksToEmbed` have no existing DB record that would absorb the conflict — making the asymmetry with `indexFile` a real risk.

Consider using the same upsert pattern here:

```suggestion
      this.db
        .prepare(
          `INSERT INTO chunks (id, path, source, start_line, end_line, hash, model, text, embedding, updated_at, start_offset, end_offset)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
           ON CONFLICT(id) DO UPDATE SET
             hash=excluded.hash,
             model=excluded.model,
             text=excluded.text,
             embedding=excluded.embedding,
             updated_at=excluded.updated_at,
             start_offset=excluded.start_offset,
             end_offset=excluded.end_offset`,
        )
        .run(
```


---

Path: src/memory/internal.ts
Line: 455-467

Comment:
**`endOffset` is off-by-one vs. the convention used in `enforceEmbeddingMaxInputTokens`**

`flush()` computes:
```ts
const endOffset = charOffset + currentChars - 1;
```
where `currentChars` counts each segment as `segment.length + 1` (adding 1 for the newline). Because the `text` is assembled with `.join("\n")` — no trailing newline — `currentChars = text.length + 1`. This means `endOffset = startOffset + text.length`, which points to the position of the **trailing newline** that separates this line from the next.

By contrast, `enforceEmbeddingMaxInputTokens` (when splitting an oversized chunk) sets:
```ts
endOffset = offsetCursor + text.length - 1;
```
which points to the **last character of the text itself** (exclusive of any trailing newline).

The two conventions are inconsistent. For chunks that fit within `maxInputTokens` (pass-through), `endOffset` is `startOffset + text.length`; for chunks that are split, the last sub-chunk gets `endOffset = startOffset + text.length - 1`. The test in `embedding-chunk-limits.test.ts` validates the split convention but a chunk coming from `chunkMarkdownWithOffset` would arrive with the +1 convention, causing the last split's `endOffset` to be one less than the parent chunk's `endOffset`.

Since `start_offset`/`end_offset` are only stored metadata and are not used in any search query today, there is no functional regression — but future consumers of these fields would see the inconsistency. The fix is to use `currentChars - 2` (or `text.length - 1` plus `charOffset`) in `flush()`:

```suggestion
    const endOffset = charOffset + currentChars - 2;
```
and advance the cursor accordingly:
```suggestion
    charOffset += currentChars - 1;
```
(so `charOffset` now points to the start of the first non-overlapping character of the next chunk, consistent with how `enforceEmbeddingMaxInputTokens` tracks `offsetCursor`.)

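Replaying the arithmetic above for a sample chunk makes the one-character discrepancy concrete (illustrative numbers only, mirroring the formulas quoted in the comment):

```typescript
// A chunk whose text is "hello", starting at file offset 0 and followed
// by the "\n" that separates it from the next line.
const text = "hello";
const startOffset = 0;

// chunkMarkdownWithOffset pass-through: currentChars = text.length + 1,
// so endOffset = startOffset + text.length (the trailing "\n" position).
const currentChars = text.length + 1;
const passThroughEnd = startOffset + currentChars - 1;

// enforceEmbeddingMaxInputTokens split convention: last character of text.
const splitEnd = startOffset + text.length - 1;

console.log(passThroughEnd); // 5 -> position of the "\n"
console.log(splitEnd);       // 4 -> position of the final "o"
```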


Comment thread src/memory/manager-embedding-ops.ts

wr-web commented Apr 3, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5184cdc332


Comment on lines +930 to +933
```ts
`UPDATE chunks SET start_line = ?, end_line = ?, start_offset = ?, end_offset = ?, updated_at = ?
 WHERE id = ?`,
)
.run(chunk.startLine, chunk.endLine, chunk.startOffset, chunk.endOffset, now, existing.id);
```


P2: Refresh FTS rows when only chunk metadata changes

This metadata-refresh branch updates chunks.start_line/end_line but never updates chunks_fts, and searchKeyword reads start_line/end_line directly from the FTS table. In the truncation/line-shift scenario called out in this function’s comment, keyword (and keyword-only hybrid) results will therefore return stale line ranges even though the chunks table was corrected; you should update or replace the matching FTS row whenever a reused chunk’s line metadata changes.


Comment thread src/memory/internal.ts
Comment on lines +488 to +489
```ts
const overlapCharCount = kept.reduce((sum, entry) => sum + entry.line.length + 1, 0);
charOffset -= overlapCharCount;
```


P3: Correct overlap offset rewind in chunkMarkdownWithOffset

charOffset is advanced using currentChars - 1 in flush, but overlap rewind subtracts full overlapCharCount (which includes the synthetic per-line +1 newline accounting). That mismatch shifts subsequent chunk offsets one character early (and can produce negative starts for single-segment overlap), so persisted start_offset/end_offset values are off-by-one whenever overlap is enabled.

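The off-by-one can be replayed numerically; the values below only mirror the formulas quoted in this comment, not the actual internal.ts code:

```typescript
// Chunk lines ["aa", "bb"]; file content "aa\nbb\n...".
const lines = ["aa", "bb"];

// flush(): currentChars counts each line plus a newline; the cursor
// advance uses currentChars - 1.
const currentChars = lines.reduce((sum, l) => sum + l.length + 1, 0); // 6
let charOffset = 0 + currentChars - 1; // 5

// Overlap rewind keeps the last line but subtracts the FULL per-line count:
const kept = ["bb"];
const overlapCharCount = kept.reduce((sum, l) => sum + l.length + 1, 0); // 3
charOffset -= overlapCharCount; // 2

// "bb" actually starts at offset 3 ("aa\n" occupies 0..2): one char early.
console.log(charOffset);             // 2
console.log("aa\nbb".indexOf("bb")); // 3
```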


clawsweeper Bot commented Apr 30, 2026

Thanks for the context here. I swept through the related work, and this is now duplicate or superseded.

Close as superseded: the same author opened #75179 as the current replacement for this incremental session sync work after review feedback, and that newer PR targets the active memory-core surface instead of this older branch's stale memory tree.

So I’m closing this here and keeping the remaining discussion on the canonical linked item.

Review details

Best possible solution:

Close this older PR and keep the remaining implementation/review path on #75179 and #40919, with any eventual fix landing in extensions/memory-core and shared host SDK helpers while preserving FTS-only search, retry safety, citation metadata, and current chunking semantics.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection on current main gives a high-confidence path: a dirty growing session transcript reaches indexFile, and writeChunks clears indexed rows before rewriting all regenerated chunks.

Is this the best way to solve the issue?

No. Hash-based reuse is a reasonable direction, but this branch is not the best vehicle because it targets the old src/memory tree and has known FTS and failure-ordering regressions; the newer #75179 branch is the canonical place to review the replacement implementation.

Security review:

Security review cleared: The diff is limited to memory indexing, SQLite schema metadata, and tests; I found no concrete dependency, workflow, secret-handling, permission, or command-execution concern.

What I checked:

  • Author-submitted replacement: The GitHub context includes a May 1, 2026 author comment saying they submitted a new PR while retaining this previous version's modifications, and the linked replacement fix(memory): incremental session sync (openclaw#40919) (openclaw#59577) #75179 is open with the same incremental-session-sync scope. (84cc80f51cd3)
  • Active memory surface is memory-core: The CLI docs state that openclaw memory is provided by the active memory plugin, defaulting to memory-core, so current work belongs under the memory-core plugin surface. Public docs: docs/cli/memory.md. (docs/cli/memory.md:13, c9828635a801)
  • Current session sync still full-indexes changed sessions: Current main builds each dirty session entry and calls indexFile for source: "sessions"; there is no current-main incremental session write path here. (extensions/memory-core/src/memory/manager-sync-ops.ts:884, c9828635a801)
  • Current write path clears and rewrites indexed rows: writeChunks clears existing indexed data before inserting regenerated chunk, vector, and FTS rows, matching the root performance problem but also showing why the old branch was touching the wrong current path. (extensions/memory-core/src/memory/manager-embedding-ops.ts:587, c9828635a801)
  • Older branch targets removed/stale paths: Current main lists active memory files under extensions/memory-core and packages/memory-host-sdk, while the PR file list changes src/memory/*; rg found no current indexSessionFileIncremental or chunkMarkdownWithOffset implementation. (c9828635a801)
  • Review history already identified replacement blockers: The prior ClawSweeper review found this branch not merge-ready because it patched the old memory tree and introduced FTS/failure-ordering regressions; the replacement PR was opened after that review. (5184cdc33242)

Likely related people:

  • steipete: Current-main blame for the active memory-core session sync, indexing/write path, chunking helper, session entry builder, and memory docs points to Peter Steinberger; this is the best visible routing candidate for the current implementation surface. (role: recent maintainer; confidence: medium; commits: ec1b96cdfa08, c9828635a801; files: extensions/memory-core/src/memory/manager-sync-ops.ts, extensions/memory-core/src/memory/manager-embedding-ops.ts, packages/memory-host-sdk/src/host/internal.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against c9828635a801.


wr-web commented May 1, 2026

Because the file differences are quite large, I have submitted a new PR for this issue while retaining the modifications from the previous version. #75179

@clawsweeper clawsweeper Bot closed this May 1, 2026