Context
Netclaw's memory system is actively degrading the assistant's performance.
After one week of heavy use, 71% of stored records are junk tool outputs,
the same memories get re-injected 14x per multi-step turn, and the proposal
gate creates duplicates instead of updating existing knowledge. The skill
discovery system shares the same underlying retrieval weakness.
This work takes priority over security posture hardening (#352).
Security policy should adapt to how memory is used, not the other way around.
Evidence (v0.7.9 baseline)
- 696 total memories after ~1 week of use, most formed in a single day
- 308 of 431 records (71%) are junk verified-tool-finding records from
web_search and web_fetch tool outputs
- 212 memories formed from 253 eval turns — nearly 1 per turn
- 10 copies of "Aaron's favorite color is blue" under the same anchor
- 1,956 memory recalls for 253 turns — same 3 docs re-injected every
tool loop iteration
- Dedup analysis: anchor matching finds 371 redundant docs, vector embedding
(qwen3-embedding:4b) finds 183 additional semantic duplicates missed by
anchor/title matching
- Skill auto-loading: 0/10 on skill_diagnostics and skill_memory due to
vocabulary gap between operator-authored skill content and user-facing language
Phase 1: Memory Formation Quality (#379)
Goal: Form fewer, better memories. Update existing knowledge instead of
creating duplicates.
1a. Stop forming memories for raw tool outputs
web_search, web_fetch, and other tool results should NOT generate durable
memories. The observation sidecar is treating every tool return as a
"verified-tool-finding" worth storing. Only conclusions drawn from tool
results are memory-worthy, not the raw outputs themselves.
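As a rough illustration of the rule above, the filter can be a single predicate on the proposal's origin. This is a language-neutral sketch in Python, not Netclaw's actual code: the `Proposal` type, its `origin` field, and the `"tool_output"` / `"conclusion"` values are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Proposal:
    origin: str                 # hypothetical: "tool_output" or "conclusion"
    source_tool: Optional[str]  # tool the content came from, e.g. "web_search"
    text: str


def should_form_memory(proposal: Proposal) -> bool:
    """Drop raw tool returns; keep conclusions drawn from tool results."""
    return proposal.origin != "tool_output"
```

A conclusion that cites a web_fetch result still passes, because the check looks at what the proposal is, not at which tool was involved.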
1b. Deduplicate at formation time
Before creating a new memory, check if a semantically similar one already
exists. If found:
- Same concept, new info → update the existing memory
- Same concept, no new info → skip entirely
- New concept → create (current behavior)
Requires: similarity check at write time. Options:
- Anchor-based matching (no new dependencies, catches 371/696 dupes)
- Embedding similarity (requires embedding endpoint, catches 183 additional)
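The three outcomes above, with anchor matching tried first and embedding similarity as an optional second pass, could be sketched as follows. Everything here is an assumption for illustration: the `Memory` shape, the `decide` helper, the 0.9 threshold, and the text-equality test standing in for "no new info" are not taken from the Netclaw codebase.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Memory:
    anchor: str
    text: str
    vector: List[float] = field(default_factory=list)  # empty if no embedding


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def decide(candidate: Memory, store: List[Memory],
           threshold: float = 0.9) -> Tuple[str, Optional[Memory]]:
    """Return ("create" | "update" | "skip", matched memory or None)."""
    # Pass 1: anchor match, needs no embedding endpoint.
    match = next((m for m in store if m.anchor == candidate.anchor), None)
    # Pass 2: O(n) cosine scan, only when the candidate has a vector.
    if match is None and candidate.vector:
        scored = [(cosine(candidate.vector, m.vector), m) for m in store if m.vector]
        best = max(scored, key=lambda s: s[0], default=None)
        if best is not None and best[0] >= threshold:
            match = best[1]
    if match is None:
        return "create", None
    # Placeholder for "no new info": identical text after normalization.
    if candidate.text.strip().lower() == match.text.strip().lower():
        return "skip", match
    return "update", match
```

With this shape, the anchor-only variant can ship first by leaving `vector` empty; the embedding pass activates once vectors are stored.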
1c. Disable/reduce memory formation for headless sessions
netclaw -p eval prompts and single-turn headless sessions should not form
durable memories. Either disable formation for headless channel type or
require a minimum turn count (>= 3) before forming memories.
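The two options in 1c can be expressed as one gate with a policy switch. This is a minimal sketch: the function name, the `"headless"` channel string, and the policy labels are hypothetical; only the threshold of 3 turns comes from the text above.

```python
def formation_allowed(channel_type: str, turn_count: int,
                      policy: str = "disable_headless",
                      min_turns: int = 3) -> bool:
    """Decide whether a session may form durable memories."""
    if policy == "disable_headless":
        # Option A: headless sessions never form memories.
        return channel_type != "headless"
    if policy == "min_turns":
        # Option B: any session must reach a minimum turn count first.
        return turn_count >= min_turns
    raise ValueError(f"unknown policy: {policy}")
```

Option B also filters short interactive sessions, which may or may not be desirable; option A only affects headless runs.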
Phase 2: Memory Recall Pipeline (#370)
Goal: Stop wasting context tokens on redundant recall. Surface diverse
memories across a session.
2a. Recall only at turn boundaries, not tool loop iterations
FireLlmCall() currently runs ResolveRecallBundle() on every invocation
including tool loop follow-ups. Recall should only fire when:
2b. Exclusion-based progressive recall
Track injected memory IDs per session. Pass as exclusion filter to recall
so each turn surfaces different relevant memories instead of the same top-3
every time. Reset on compaction.
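One way to picture the exclusion filter is a per-session set of already-injected IDs, cleared on compaction. The class and method names below are hypothetical, and the ranked list is assumed to come from whatever retrieval is already in place.

```python
from typing import List, Set


class RecallSession:
    """Tracks which memory IDs were already injected in this session."""

    def __init__(self, top_k: int = 3) -> None:
        self.top_k = top_k
        self.injected: Set[str] = set()

    def recall(self, ranked_ids: List[str]) -> List[str]:
        # Skip anything injected earlier, then take the next top_k.
        fresh = [i for i in ranked_ids if i not in self.injected][: self.top_k]
        self.injected.update(fresh)
        return fresh

    def on_compaction(self) -> None:
        # After compaction the earlier injections are gone from context.
        self.injected.clear()
```

Given the same ranking of five memories on consecutive turns, the first call returns the top three, the second returns the remaining two, and a call after `on_compaction()` returns the top three again.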
2c. Evaluate embedding-based recall vs current ranked retrieval
Current lexical TF-IDF recall scores 10/10 on eval. Memorizer's experience
(PR #157) shows pure vector search fails on short queries at scale — needed
hybrid search with RRF fusion. Evaluate whether the current approach is
sufficient or whether hybrid recall is needed as the memory corpus grows.
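For reference when evaluating the hybrid option, reciprocal rank fusion combines several ranked lists by summing 1 / (k + rank) per document. The sketch below is generic and not Memorizer's implementation; the function name is made up here, and k = 60 is the commonly used default rather than a value from this plan.

```python
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists with reciprocal rank fusion."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


lexical = ["a", "b", "c"]   # e.g. TF-IDF order
vector = ["b", "d", "a"]    # e.g. embedding order
print(rrf_fuse([lexical, vector]))  # ['b', 'a', 'd', 'c']
```

Because RRF uses only ranks, it needs no score normalization between the lexical and vector retrievers.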
Phase 3: Skill Discovery (#355)
Goal: Match user intent to skills using semantic understanding, not just
keyword token overlap.
3a. Embedding-based skill matching
Embed skill description fields (6 skills, tiny corpus). On each user
message, compare message embedding against skill description embeddings.
Load skills above a similarity threshold.
At 6 skills, vector search works fine — no scale compression issues like
Memorizer saw at 2300+ items. No hybrid search needed.
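At this corpus size the matching step is a brute-force cosine comparison. The sketch below assumes the vectors have already been produced by the embedding model; `match_skills`, the 0.5 threshold, and the toy two-dimensional vectors are illustrative only.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def match_skills(message_vec: List[float],
                 skill_vecs: Dict[str, List[float]],
                 threshold: float = 0.5) -> List[str]:
    """Return skill names at or above the threshold, best match first."""
    scored = [(name, cosine(message_vec, vec)) for name, vec in skill_vecs.items()]
    scored.sort(key=lambda pair: -pair[1])
    return [name for name, score in scored if score >= threshold]


# Toy vectors standing in for real description embeddings.
skills = {"skill_diagnostics": [1.0, 0.0], "skill_memory": [0.0, 1.0]}
print(match_skills([0.9, 0.1], skills))  # ['skill_diagnostics']
```

The threshold would need tuning against the eval suite; a value that is too low loads every skill, and one that is too high reproduces the current 0/10 behavior.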
3b. Deprecate keyword-only matching as primary
Keep keyword matching as a fallback/boost signal, but make semantic matching
the primary discovery mechanism. This handles the vocabulary gap where
"something is wrong with my session" needs to match operator-authored
diagnostics guidance.
3c. Revert skill compaction preservation (#315)
Once semantic matching is in place, skills can be re-discovered naturally
after compaction based on current conversation context. The force-reload
behavior from PR #315 / bd92318 becomes unnecessary.
Embedding Infrastructure
All three phases benefit from embedding capability. The existing Ollama
dependency already provides IEmbeddingGenerator<string, Embedding<float>>
via OllamaSharp — no new package dependencies needed.
Configuration:
- New config: Embeddings.Model (e.g., qwen3-embedding:4b)
- Uses the same Ollama endpoint as the chat model
- Required for Phase 1b (embedding dedup) and Phase 3
- Phase 1a, 1c, 2a, 2b can ship without embeddings
Operational cost:
- Embedding models are small (0.6B-4B params) and fast (~37s for 265 docs)
- Single embedding per proposal at write time, not bulk operations
- Stored vectors enable O(n) similarity scan, sufficient for current scale
Sequencing
Phase 1a (stop tool output memories) — no dependencies, ship immediately
Phase 1c (headless session filtering) — no dependencies, ship immediately
Phase 2a (recall at turn boundaries) — no dependencies, ship immediately
Phase 2b (exclusion-based progressive) — no dependencies, ship immediately
Phase 1b (dedup at formation) — anchor-based first, embedding later
Phase 3a (skill embedding matching) — requires embedding config
Phase 3b (deprecate keyword-only) — after 3a validated by evals
Phase 2c (evaluate hybrid recall) — after corpus grows, data-driven
Phase 3c (revert skill preservation) — after 3a proven stable
Success Criteria
Measured by eval suite (evals/run-evals.sh):
- Memory Pipeline stays GREEN (10/10)
- Skill Auto-Loading moves from 2/4 to 4/4
- Memory database growth rate drops by >50%
- No memory re-injection within tool loop iterations
- Overall eval score improves from 72.7% baseline
Related Issues