fix(memory): incremental session sync (#40919) #59577
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dd6cc48645
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Greptile Summary

This PR implements hash-based incremental indexing for session memory files, replacing the previous O(n) full reindex on every session update with a scheme that only re-embeds new or changed chunks. The approach is well-scoped (session files only), backward-compatible, and produces meaningful API-call savings (44–67% measured).

Issues found:
Confidence Score: 3/5
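The hash-based chunk reuse the summary describes can be sketched roughly as follows. This is a hypothetical recreation, not the PR's actual code: `planIncrementalSync` and the `Chunk` shape are invented here for illustration.

```typescript
import { createHash } from "node:crypto";

interface Chunk {
  id: string;
  hash: string;
  text: string;
}

// Hypothetical sketch: bucket existing chunks by content hash, then walk the
// freshly chunked text and reuse any chunk whose hash still matches; only the
// leftovers need new embedding API calls.
function planIncrementalSync(
  existing: Chunk[],
  newTexts: string[],
): { reused: Chunk[]; toEmbed: string[] } {
  const byHash = new Map<string, Chunk[]>();
  for (const chunk of existing) {
    const bucket = byHash.get(chunk.hash) ?? [];
    bucket.push(chunk);
    byHash.set(chunk.hash, bucket);
  }
  const reused: Chunk[] = [];
  const toEmbed: string[] = [];
  for (const textPiece of newTexts) {
    const hash = createHash("sha256").update(textPiece).digest("hex");
    const match = byHash.get(hash)?.shift(); // consume at most one match per hash
    if (match) reused.push(match);
    else toEmbed.push(textPiece);
  }
  return { reused, toEmbed };
}
```

For an append-only session transcript, every chunk except the changed tail matches by hash, which is where savings of the reported order come from.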
Prompt To Fix All With AI

This is a comment left during a code review.
Path: src/memory/manager-embedding-ops.ts
Line: 908-916
Comment:
**Redundant try/catch block**
The `try/catch` around the embedding call in step 6 catches the error and immediately re-throws it without any logging, transformation, or side effects. This adds indentation noise without any benefit and differs from how `indexFile` handles the same code path (where the `catch` block adds meaningful logic like checking `isStructuredInputTooLargeError`).
```suggestion
if (allNeedEmbedding.length > 0) {
embeddings = this.batch.enabled
? await this.embedChunksWithBatch(allNeedEmbedding, entry, options.source)
: await this.embedChunksInBatches(allNeedEmbedding);
}
```
How can I resolve this? If you propose a fix, please make it concise.
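For illustration, a minimal sketch of why a rethrow-only catch is dead weight. The functions here are hypothetical stand-ins, not the real embedding call path:

```typescript
// A catch block that only rethrows is observably identical to having no
// try/catch at all; it adds indentation without logging, transformation,
// or any other side effect.
function embedWithRethrow(chunks: string[]): number[][] {
  try {
    return fakeEmbed(chunks);
  } catch (err) {
    throw err; // no logging, no transformation: dead weight
  }
}

function embedPlain(chunks: string[]): number[][] {
  return fakeEmbed(chunks); // identical observable behavior
}

// Stand-in for the real embedding call (an assumption, not the actual API).
function fakeEmbed(chunks: string[]): number[][] {
  if (chunks.length === 0) throw new Error("empty input");
  return chunks.map((c) => [c.length]);
}
```

By contrast, a catch that inspects the error (as `indexFile` reportedly does with `isStructuredInputTooLargeError`) earns its indentation.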
---
This is a comment left during a code review.
Path: src/memory/manager-embedding-ops.ts
Line: 929-947
Comment:
**Plain INSERT without ON CONFLICT may throw on duplicate IDs**
The full-reindex path (`indexFile`) uses `ON CONFLICT(id) DO UPDATE SET ...` for its chunk INSERT (line 1124), making it safe to retry or run against an existing record. The new incremental path uses a plain INSERT with no conflict handling.
Although in the normal append-only case the incremental logic correctly avoids inserting a chunk that's already matched by hash, consider a race condition or a previous partial failure where `upsertFileRecord` was never reached (so the file record is stale). On the next sync, the caller selects existing chunks, matches them by hash, and marks them as "used." But if the DB already has a row for the new chunk from the previous partially-completed run, the plain INSERT will throw `UNIQUE constraint failed: chunks.id`.
Unlike the matched/preserved path, chunks in `chunksToEmbed` have no existing DB record that would absorb the conflict — making the asymmetry with `indexFile` a real risk.
Consider using the same upsert pattern here:
```suggestion
this.db
.prepare(
`INSERT INTO chunks (id, path, source, start_line, end_line, hash, model, text, embedding, updated_at, start_offset, end_offset)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(id) DO UPDATE SET
hash=excluded.hash,
model=excluded.model,
text=excluded.text,
embedding=excluded.embedding,
updated_at=excluded.updated_at,
start_offset=excluded.start_offset,
end_offset=excluded.end_offset`,
)
.run(
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/memory/internal.ts
Line: 455-467
Comment:
**`endOffset` is off-by-one vs. the convention used in `enforceEmbeddingMaxInputTokens`**
`flush()` computes:
```ts
const endOffset = charOffset + currentChars - 1;
```
where `currentChars` counts each segment as `segment.length + 1` (adding 1 for the newline). Because the `text` is assembled with `.join("\n")` — no trailing newline — `currentChars = text.length + 1`. This means `endOffset = startOffset + text.length`, which points to the position of the **trailing newline** that separates this line from the next.
By contrast, `enforceEmbeddingMaxInputTokens` (when splitting an oversized chunk) sets:
```ts
endOffset = offsetCursor + text.length - 1;
```
which points to the **last character of the text itself** (exclusive of any trailing newline).
The two conventions are inconsistent. For chunks that fit within `maxInputTokens` (pass-through), `endOffset` is `startOffset + text.length`; for chunks that are split, the last sub-chunk gets `endOffset = startOffset + text.length - 1`. The test in `embedding-chunk-limits.test.ts` validates the split convention but a chunk coming from `chunkMarkdownWithOffset` would arrive with the +1 convention, causing the last split's `endOffset` to be one less than the parent chunk's `endOffset`.
Since `start_offset`/`end_offset` are only stored metadata and are not used in any search query today, there is no functional regression — but future consumers of these fields would see the inconsistency. The fix is to use `currentChars - 2` (or `text.length - 1` plus `charOffset`) in `flush()`:
```suggestion
const endOffset = charOffset + currentChars - 2;
```
and advance the cursor accordingly:
```suggestion
charOffset += currentChars - 1;
```
(so `charOffset` now points to the start of the first non-overlapping character of the next chunk, consistent with how `enforceEmbeddingMaxInputTokens` tracks `offsetCursor`.)
How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(memory): incremental session sync (#..."
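A small numeric recreation of the two conventions, using illustrative values assumed from the snippets above:

```typescript
// A chunk assembled with .join("\n"): no trailing newline.
const text = "hello\nworld";
const startOffset = 100;

// flush(): currentChars counts each segment as segment.length + 1, so for
// this chunk currentChars = text.length + 1, and the end offset lands on the
// trailing newline position:
const currentChars = text.length + 1;
const flushEndOffset = startOffset + currentChars - 1; // startOffset + text.length

// enforceEmbeddingMaxInputTokens(): points at the last character of the text
// itself, exclusive of any trailing newline:
const splitEndOffset = startOffset + text.length - 1;

// The two conventions differ by exactly one character.
console.log(flushEndOffset - splitEndOffset); // 1
```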
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5184cdc332
```ts
`UPDATE chunks SET start_line = ?, end_line = ?, start_offset = ?, end_offset = ?, updated_at = ?
 WHERE id = ?`,
)
.run(chunk.startLine, chunk.endLine, chunk.startOffset, chunk.endOffset, now, existing.id);
```
Refresh FTS rows when only chunk metadata changes
This metadata-refresh branch updates chunks.start_line/end_line but never updates chunks_fts, and searchKeyword reads start_line/end_line directly from the FTS table. In the truncation/line-shift scenario called out in this function’s comment, keyword (and keyword-only hybrid) results will therefore return stale line ranges even though the chunks table was corrected; you should update or replace the matching FTS row whenever a reused chunk’s line metadata changes.
Useful? React with 👍 / 👎.
```ts
const overlapCharCount = kept.reduce((sum, entry) => sum + entry.line.length + 1, 0);
charOffset -= overlapCharCount;
```
Correct overlap offset rewind in chunkMarkdownWithOffset
charOffset is advanced using currentChars - 1 in flush, but overlap rewind subtracts full overlapCharCount (which includes the synthetic per-line +1 newline accounting). That mismatch shifts subsequent chunk offsets one character early (and can produce negative starts for single-segment overlap), so persisted start_offset/end_offset values are off-by-one whenever overlap is enabled.
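The mismatch can be reproduced with a small arithmetic example (illustrative values, not the actual implementation):

```typescript
// Illustrative recreation of the offset bookkeeping described above.
const lines = ["alpha", "beta"];
const text = lines.join("\n"); // "alpha\nbeta"

// flush() advances the cursor by currentChars - 1, where currentChars counts
// each segment as segment.length + 1:
let charOffset = 0;
const currentChars = lines.reduce((sum, line) => sum + line.length + 1, 0); // 11
charOffset += currentChars - 1; // 10 (= text.length)

// Overlap rewind keeps the last line but subtracts its FULL per-line count,
// including the synthetic +1 newline accounting:
const kept = [{ line: lines[1] }];
const overlapCharCount = kept.reduce((sum, entry) => sum + entry.line.length + 1, 0); // 5
charOffset -= overlapCharCount; // 5

// The next chunk actually starts at the offset of "beta":
const expectedStart = text.indexOf("beta"); // 6
// charOffset is one character early; if kept covered every line of a
// single-segment chunk, the rewind would even drive it negative.
```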
Thanks for the context here. I swept through the related work, and this PR is now a duplicate or superseded. Close as superseded: the same author opened #75179 as the current replacement for this incremental session sync work after review feedback, and that newer PR targets the active memory-core surface instead of this older branch's stale memory tree. So I'm closing this here and keeping the remaining discussion on the canonical linked item.

Review details

Best possible solution: Close this older PR and keep the remaining implementation/review path on #75179 and #40919, with any eventual fix landing in

Do we have a high-confidence way to reproduce the issue? Yes. Source inspection on current main gives a high-confidence path: a dirty growing session transcript reaches

Is this the best way to solve the issue? No. Hash-based reuse is a reasonable direction, but this branch is not the best vehicle because it targets the old

Security review: cleared. The diff is limited to memory indexing, SQLite schema metadata, and tests; I found no concrete dependency, workflow, secret-handling, permission, or command-execution concern.

What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against c9828635a801. |
Because the file differences are quite large, I have submitted a new PR for this issue that retains the modifications from the previous version: #75179
Summary
Describe the problem and fix in 2–5 bullets:
If this PR fixes a plugin beta-release blocker, title it `fix(<plugin-id>): beta blocker - <summary>` and link the matching `Beta blocker: <plugin-name> - <summary>` issue labeled `beta-blocker`. Contributors cannot label PRs, so the title is the PR-side signal for maintainers and automation.

Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Root Cause / Regression History (if applicable)
For bug fixes or regressions, explain why this happened, not just what changed. Otherwise write `N/A`. If the cause is unclear, write `Unknown`.

N/A - This is a feature, not a bug fix.
Regression Test Plan (if applicable)
For bug fixes or regressions, name the smallest reliable test coverage that should have caught this. Otherwise write `N/A`.

- `manager.incremental-session.test.ts`
- `manager.atomic-reindex.test.ts` covers full reindex

User-visible / Behavior Changes
List user-visible changes (including defaults/config).
If none, write `None`.

None. The incremental sync is transparent to users. Session syncs are faster, but the behavior is identical.
Diagram (if applicable)
For UI changes or non-trivial logic flows, include a small ASCII diagram reviewers can scan quickly. Otherwise write `N/A`.

Security Impact (required)
All security checklist items answered No. If Yes, explain risk + mitigation: N/A

Repro + Verification
Environment
```json
{ "memory": { "sources": ["sessions"], "chunking": { "tokens": 200, "overlap": 50 }, "experimental": { "sessionMemory": true } } }
```
- `force: true`
- `sessionFiles: [path]`

Expected
Actual
Evidence
Attach at least one:
Benchmark Results (Real Ollama API)
Embedding Consistency Verification
Human Verification (required)
What you personally verified (not just CI), and how:
Review Conversations
If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.
Compatibility / Migration
Answers: Yes / No / No.

Compatibility Details
- `ensureColumn()` to safely add new columns with default values
- `indexFile()` preserved; `indexSessionFileIncremental()` is new

Risks and Mitigations
List only real risks for this PR. Add/remove entries as needed. If none, write `None`.

Risk: Existing chunks have `start_offset=0, end_offset=0`, which may cause hash lookup issues
Risk: Single-chunk sessions don't benefit from incremental sync
Risk: Overlap causes boundary chunks to change, reducing hash preservation rate