
memory: strip inbound metadata envelopes from user messages in session corpus#66548

Merged
jalehman merged 5 commits into openclaw:main from zqchris:fix/dreaming-corpus-strip-inbound-envelope
Apr 16, 2026

Conversation

@zqchris
Contributor

@zqchris zqchris commented Apr 14, 2026

Summary

Session corpus ingestion was feeding raw Telegram/Discord/Slack inbound envelopes into the dreaming corpus unchanged. Each user message in the transcript carries a ~338-char `Conversation info` + `Sender` JSON prefix built by `buildInboundUserContextPrefix`, which exceeds the `SESSION_INGESTION_MAX_SNIPPET_CHARS` (280) cap used downstream in `dreaming-phases.ts`. Result: the user's actual words never made it into the corpus, and REM topic extraction latched onto envelope words like `assistant` / `untrusted metadata` as the top "topics".
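The arithmetic behind the truncation can be sketched as follows. This is a minimal illustration, not the code in `dreaming-phases.ts`: `truncateSnippet` is a hypothetical helper, the envelope is a placeholder string padded to the ~338 characters quoted above, and the sentinel wording is assumed.

```typescript
// Cap value taken from the PR description.
const SESSION_INGESTION_MAX_SNIPPET_CHARS = 280;

// Hypothetical truncation helper; the real downstream logic is not shown in this PR.
function truncateSnippet(
  text: string,
  cap: number = SESSION_INGESTION_MAX_SNIPPET_CHARS,
): string {
  return text.length <= cap ? text : text.slice(0, cap);
}

// Placeholder envelope padded to 338 characters (the sentinel wording is assumed).
const envelopePrefix = "Conversation info (untrusted metadata): ".padEnd(338, "x");
const userText = "remind me to water the plants";

// The envelope alone is longer than the cap, so the user's words are cut off entirely.
const snippet = truncateSnippet(`${envelopePrefix} ${userText}`);

console.log(snippet.length); // 280
console.log(snippet.includes(userText)); // false
```

Because the prefix is longer than the cap, every snippet consists of envelope text only, which is consistent with envelope words dominating the extracted topics.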

Root cause is ordering in `buildSessionEntry` → `extractSessionText`: `normalizeSessionText` collapses newlines to spaces per text block, and once newlines are gone, `stripInboundMetadata` can no longer locate sentinel lines or fenced-json blocks. So stripping must happen before normalization, and only for `user` role (assistant messages may legitimately discuss envelope formats).
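The ordering constraint can be reproduced with simplified stand-ins. The two helpers below share names with the real functions but are assumptions written for illustration: the real `normalizeSessionText` and `stripInboundMetadata` are not shown in this PR, and the sentinel wording and envelope layout are guesses.

```typescript
// Build the fence marker programmatically so this example nests cleanly in Markdown.
const FENCE = "`".repeat(3);

// Simplified stand-in: collapse whitespace runs (including newlines) to single spaces.
function normalizeSessionText(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

// Simplified stand-in: drop each "<label> (untrusted metadata):" line together with
// the fenced JSON block that follows it. It relies on newlines to find both.
function stripInboundMetadata(text: string): string {
  const envelopeBlock = new RegExp(
    `^[^\\n]*\\(untrusted metadata\\):\\n${FENCE}json\\n[\\s\\S]*?\\n${FENCE}\\n?`,
    "gm",
  );
  return text.replace(envelopeBlock, "").trim();
}

// Assumed envelope shape: two sentinel lines, each followed by a fenced JSON block.
const raw = [
  "Conversation info (untrusted metadata):",
  `${FENCE}json`,
  '{"message_id": "42"}',
  FENCE,
  "",
  "Sender (untrusted metadata):",
  `${FENCE}json`,
  '{"name": "alice"}',
  FENCE,
  "",
  "remind me to water the plants",
].join("\n");

// Correct order: strip while the line structure is intact, then normalize.
const stripThenNormalize = normalizeSessionText(stripInboundMetadata(raw));
// Buggy order: once newlines are collapsed, the stripper finds nothing to remove.
const normalizeThenStrip = stripInboundMetadata(normalizeSessionText(raw));

console.log(stripThenNormalize); // "remind me to water the plants"
console.log(normalizeThenStrip.startsWith("Conversation info")); // true
```

Any stripper that anchors on line starts or fence lines has the same dependency, so it has to run before the newlines are collapsed.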

Changes

  • `src/memory-host-sdk/host/session-files.ts`: add `role` parameter to `extractSessionText`; strip inbound metadata on user-role text blocks before `normalizeSessionText` runs.
  • `src/memory-host-sdk/host/session-files.test.ts`: new test using a real multi-line Telegram envelope asserts the corpus entry contains only the actual user text; separate test confirms assistant messages containing sentinel-like text are preserved untouched.
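A rough sketch of the role-gated extraction described in the list above. The `Role` and `ContentBlock` types, the return type, and both helper bodies are assumptions for illustration; only the function names and the "strip user text before normalizing" rule come from this PR.

```typescript
// Hypothetical shapes; the real message and content types are not shown in this PR.
type Role = "user" | "assistant";
type ContentBlock = { type: string; text?: string };

const FENCE = "`".repeat(3);

// Simplified stand-ins for the real helpers (see the assumptions above).
function normalizeSessionText(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

function stripInboundMetadata(text: string): string {
  const envelopeBlock = new RegExp(
    `^[^\\n]*\\(untrusted metadata\\):\\n${FENCE}json\\n[\\s\\S]*?\\n${FENCE}\\n?`,
    "gm",
  );
  return text.replace(envelopeBlock, "").trim();
}

function extractSessionText(content: string | ContentBlock[], role: Role): string | null {
  const blocks: ContentBlock[] =
    typeof content === "string" ? [{ type: "text", text: content }] : content;
  const parts: string[] = [];
  for (const block of blocks) {
    if (block.type !== "text" || typeof block.text !== "string") continue;
    // Strip only user-role text, and do it before newlines are collapsed.
    const source = role === "user" ? stripInboundMetadata(block.text) : block.text;
    const normalized = normalizeSessionText(source);
    if (normalized) parts.push(normalized);
  }
  return parts.length > 0 ? parts.join(" ") : null;
}

// Assumed envelope shape, built without literal fence lines.
const envelope = [
  "Sender (untrusted metadata):",
  `${FENCE}json`,
  '{"name": "alice"}',
  FENCE,
  "",
].join("\n");

const userEntry = extractSessionText(
  [{ type: "text", text: `${envelope}\nwhat's for lunch?` }],
  "user",
);
const assistantEntry = extractSessionText(
  `${envelope}\nThat block is the inbound envelope.`,
  "assistant",
);

console.log(userEntry); // "what's for lunch?"
console.log(assistantEntry?.includes("Sender (untrusted metadata):")); // true
```

The two calls mirror the two new tests: the user entry keeps only the user's words, while the assistant entry keeps its sentinel-like text.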

Test plan

  • `pnpm test src/memory-host-sdk/host/session-files.test.ts` — 8/8 passing (6 existing + 2 new)
  • `pnpm check` — clean (lint, format, type, import cycles, madge, webhook/auth guards)
  • Manual verification on next dreaming cycle: REM topics should reflect real user words instead of envelope noise

@greptile-apps
Contributor

greptile-apps bot commented Apr 14, 2026

Greptile Summary

This PR fixes a corpus-ingestion bug where raw inbound metadata envelopes (prepended by `buildInboundUserContextPrefix` for Telegram/Discord/Slack messages) were being fed into the dreaming corpus unchanged, so that truncation at `SESSION_INGESTION_MAX_SNIPPET_CHARS` cut the actual user text entirely. The fix, stripping the envelope before `normalizeSessionText` collapses newlines, is correct and well-targeted, and the new tests directly cover the root-cause ordering constraint.

Confidence Score: 5/5

  • Safe to merge; the fix is correct and well-tested with no production-affecting issues.
  • All remaining findings are P2. The only notable gap is that the parallel packages/memory-host-sdk/src/host/session-files.ts copy was not updated, but it is not exported or used in any production code path so it poses no current risk.
  • packages/memory-host-sdk/src/host/session-files.ts — parallel standalone copy not updated with this fix.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/memory-host-sdk/host/session-files.ts
Line: 188

Comment:
**Parallel copy in `packages/` not updated**

`packages/memory-host-sdk/src/host/session-files.ts` is a separate standalone implementation (not a re-export facade like the rest of the package) and was not patched. It still calls `extractSessionText(message.content)` without passing `message.role` and has no `stripInboundMetadata` import. While that copy is currently unexported and not on any production code path, if it is ever promoted or wired up the corpus-truncation bug will silently return.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "memory: strip inbound metadata envelopes..."

  continue;
  }
- const text = extractSessionText(message.content);
+ const text = extractSessionText(message.content, message.role);

P2 Parallel copy in packages/ not updated

`packages/memory-host-sdk/src/host/session-files.ts` is a separate standalone implementation (not a re-export facade like the rest of the package) and was not patched. It still calls `extractSessionText(message.content)` without passing `message.role` and has no `stripInboundMetadata` import. While that copy is currently unexported and not on any production code path, if it is ever promoted or wired up the corpus-truncation bug will silently return.


@zqchris
Contributor Author

zqchris commented Apr 14, 2026

Addressed Greptile's P2: applied the same strip to the parallel `packages/memory-host-sdk/src/host/session-files.ts` copy in ab90812. That file isn't currently exported or on any production code path, but keeping the two copies consistent prevents a silent regression if it ever gets wired up later.

@jalehman jalehman self-assigned this Apr 15, 2026
@jalehman jalehman force-pushed the fix/dreaming-corpus-strip-inbound-envelope branch 3 times, most recently from 3c92f0b to 4cfe940 Compare April 15, 2026 22:16
Chris Zhang and others added 5 commits April 16, 2026 11:10
…n corpus

Session ingestion was feeding raw Telegram/Discord/Slack inbound envelopes
into the dreaming corpus. The 338-char Conversation info + Sender JSON prefix
on every user message blew past SESSION_INGESTION_MAX_SNIPPET_CHARS (280),
so the user's actual words never made it in and REM extraction latched
onto envelope words like 'assistant' as top topics.

Strip inbound metadata on user-role text blocks BEFORE normalizeSessionText
collapses newlines. stripInboundMetadata needs the line structure and
fenced-json markers to find sentinels, so the order matters. Assistant
messages are left alone — they may legitimately discuss the envelope
format.

Fixes openclaw#63921
…-sdk copy

Greptile flagged the `packages/memory-host-sdk/src/host/session-files.ts`
copy as a P2 gap: the parallel standalone implementation was not updated
with the same fix, so if it ever gets wired up the corpus-truncation bug
returns silently. The file isn't currently on any production code path,
but keeping the two copies consistent prevents a future regression.
@jalehman jalehman force-pushed the fix/dreaming-corpus-strip-inbound-envelope branch from 4cfe940 to 98562b2 Compare April 16, 2026 18:11
@jalehman jalehman merged commit 82e349a into openclaw:main Apr 16, 2026
42 checks passed
@jalehman
Contributor

Merged via squash.

Thanks @zqchris!

xudaiyanzi pushed a commit to xudaiyanzi/openclaw that referenced this pull request Apr 17, 2026
…n corpus (openclaw#66548)

Merged via squash.

Prepared head SHA: 98562b2
Co-authored-by: zqchris <4436110+zqchris@users.noreply.github.com>
Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com>
Reviewed-by: @jalehman