memory: strip inbound metadata envelopes from user messages in session corpus #66548
Greptile Summary
This PR fixes a corpus-ingestion bug where raw inbound metadata envelopes (prepended by …
Confidence Score: 5/5
    continue;
    }
    - const text = extractSessionText(message.content);
    + const text = extractSessionText(message.content, message.role);
**Parallel copy in `packages/` not updated**
`packages/memory-host-sdk/src/host/session-files.ts` is a separate standalone implementation (not a re-export facade like the rest of the package) and was not patched. It still calls `extractSessionText(message.content)` without passing `message.role` and has no `stripInboundMetadata` import. While that copy is currently unexported and not on any production code path, if it is ever promoted or wired up the corpus-truncation bug will silently return.
Addressed Greptile's P2: applied the same strip to the parallel …
3c92f0b to 4cfe940
…n corpus

Session ingestion was feeding raw Telegram/Discord/Slack inbound envelopes into the dreaming corpus. The 338-char Conversation info + Sender JSON prefix on every user message blew past SESSION_INGESTION_MAX_SNIPPET_CHARS (280), so the user's actual words never made it in and REM extraction latched onto envelope words like 'assistant' as top topics.

Strip inbound metadata on user-role text blocks BEFORE normalizeSessionText collapses newlines. stripInboundMetadata needs the line structure and fenced-json markers to find sentinels, so the order matters. Assistant messages are left alone; they may legitimately discuss the envelope format.

Fixes openclaw#63921
…-sdk copy

Greptile flagged the `packages/memory-host-sdk/src/host/session-files.ts` copy as a P2 gap: the parallel standalone implementation was not updated with the same fix, so if it ever gets wired up the corpus-truncation bug returns silently. The file isn't currently on any production code path, but keeping the two copies consistent prevents a future regression.
4cfe940 to 98562b2
Merged via squash.
Thanks @zqchris!
…n corpus (openclaw#66548)

Merged via squash.

Prepared head SHA: 98562b2
Co-authored-by: zqchris <4436110+zqchris@users.noreply.github.com>
Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com>
Reviewed-by: @jalehman
Summary
Session corpus ingestion was feeding raw Telegram/Discord/Slack inbound envelopes into the dreaming corpus unchanged. Each user message in the transcript carries a ~338-char `Conversation info` + `Sender` JSON prefix built by `buildInboundUserContextPrefix`, which exceeds the `SESSION_INGESTION_MAX_SNIPPET_CHARS` (280) cap used downstream in `dreaming-phases.ts`. Result: the user's actual words never made it into the corpus, and REM topic extraction latched onto envelope words like `assistant` / `untrusted metadata` as the top "topics".
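To make the truncation concrete, here is a minimal sketch of the arithmetic. Only the cap value (280) and the approximate prefix length (338) come from the description above. The envelope text, `toSnippet`, and the sample user message are hypothetical stand-ins, not the code in `dreaming-phases.ts`.

```typescript
// Cap value quoted in the PR description.
const SESSION_INGESTION_MAX_SNIPPET_CHARS = 280;

// Hypothetical 338-char stand-in for the prefix built by
// buildInboundUserContextPrefix (the real envelope format may differ).
const header = "Conversation info (untrusted metadata): ";
const fakeEnvelope = header + new Array(338 - header.length + 1).join(".");

// Hypothetical user message.
const userWords = "please summarize the deploy notes";

// Hypothetical snippet helper: keep only the first N characters.
function toSnippet(text: string): string {
  return text.slice(0, SESSION_INGESTION_MAX_SNIPPET_CHARS);
}

// Without stripping, the whole snippet budget is spent on the envelope.
const unstripped = toSnippet(fakeEnvelope + " " + userWords);

// With the envelope removed first, the user's words fit in the snippet.
const stripped = toSnippet(userWords);
```

Because the prefix alone is longer than the cap, no part of the user's message can survive truncation unless the envelope is removed first.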
Root cause is ordering in `buildSessionEntry` → `extractSessionText`: `normalizeSessionText` collapses newlines to spaces per text block, and once newlines are gone, `stripInboundMetadata` can no longer locate sentinel lines or fenced-json blocks. So stripping must happen before normalization, and only for `user` role (assistant messages may legitimately discuss envelope formats).
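The ordering constraint can be illustrated with a small sketch. The three functions below are simplified, hypothetical stand-ins for `stripInboundMetadata`, `normalizeSessionText`, and `extractSessionText`. The sentinel wording, the fence handling, and the string-only `content` parameter are assumptions made for illustration, and the real implementations may differ.

```typescript
// Three backticks, written as escapes so this example stays fence-safe.
const FENCE = "\u0060\u0060\u0060";

// Assumed sentinel format; the real sentinels live in the repository.
const SENTINEL = /^(Conversation info|Sender) \(untrusted metadata\):$/;

// Stand-in for stripInboundMetadata: drops "sentinel line + fenced json" sections.
function stripInboundMetadataSketch(text: string): string {
  const lines = text.split("\n");
  const kept: string[] = [];
  let i = 0;
  while (i < lines.length) {
    if (SENTINEL.test(lines[i]) && lines[i + 1] === FENCE + "json") {
      i += 2; // skip the sentinel and the opening fence
      while (i < lines.length && lines[i] !== FENCE) i++;
      i++; // skip the closing fence
      continue;
    }
    kept.push(lines[i]);
    i++;
  }
  return kept.join("\n");
}

// Stand-in for normalizeSessionText: collapses newlines to single spaces.
function normalizeSessionTextSketch(text: string): string {
  return text.replace(/\s*\n+\s*/g, " ").trim();
}

// Stand-in for extractSessionText: strip first, and only for the user role.
function extractSessionTextSketch(content: string, role: "user" | "assistant"): string {
  const cleaned = role === "user" ? stripInboundMetadataSketch(content) : content;
  return normalizeSessionTextSketch(cleaned);
}

const raw = [
  "Conversation info (untrusted metadata):",
  FENCE + "json",
  '{"channel": "telegram"}',
  FENCE,
  "",
  "Sender (untrusted metadata):",
  FENCE + "json",
  '{"label": "example"}',
  FENCE,
  "",
  "please summarize the deploy notes",
].join("\n");

// Strip, then normalize: only the user's words remain.
const correctOrder = extractSessionTextSketch(raw, "user");

// Normalize, then strip: the sentinels are no longer on their own lines,
// so the stripper finds nothing and the envelope leaks through.
const wrongOrder = stripInboundMetadataSketch(normalizeSessionTextSketch(raw));
```

In this sketch the stripper matches whole lines, so it stops working as soon as normalization has flattened the text. The role check keeps assistant text untouched, matching the reasoning above that assistant messages may legitimately discuss envelope formats.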
Changes
Test plan