epic: memory quality overhaul — formation, recall, and skill discovery #380

@Aaronontheweb

Context

Netclaw's memory system is actively degrading the assistant's performance.
After one week of heavy use, 71% of stored records are junk tool outputs,
the same memories get re-injected 14x per multi-step turn, and the proposal
gate creates duplicates instead of updating existing knowledge. The skill
discovery system shares the same underlying retrieval weakness.

This is the highest-priority work and comes ahead of security posture hardening (#352).
Security policy should adapt to how memory is used, not the other way around.

Evidence (v0.7.9 baseline)

  • 696 total memories after ~1 week of use, most formed in a single day
  • 308 of 431 records (71%) are junk verified-tool-finding records from
    web_search and web_fetch tool outputs
  • 212 memories formed from 253 eval turns — nearly 1 per turn
  • 10 copies of "Aaron's favorite color is blue" under the same anchor
  • 1,956 memory recalls for 253 turns — same 3 docs re-injected every
    tool loop iteration
  • Dedup analysis: anchor matching finds 371 redundant docs, vector embedding
    (qwen3-embedding:4b) finds 183 additional semantic duplicates missed by
    anchor/title matching
  • Skill auto-loading: 0/10 on skill_diagnostics and skill_memory due to
    vocabulary gap between operator-authored skill content and user-facing language

Phase 1: Memory Formation Quality (#379)

Goal: Form fewer, better memories. Update existing knowledge instead of
creating duplicates.

1a. Stop forming memories for raw tool outputs

web_search, web_fetch, and other tool results should NOT generate durable
memories. The observation sidecar is treating every tool return as a
"verified-tool-finding" worth storing. Only conclusions drawn from tool
results are memory-worthy, not the raw outputs themselves.
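
A minimal sketch of the formation filter described above, in Python for illustration only (Netclaw itself is .NET). The MemoryProposal type, its fields, and should_form_memory are hypothetical names; only the "verified-tool-finding" kind and the web_search / web_fetch tool names come from this issue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemoryProposal:
    kind: str                    # e.g. "verified-tool-finding" or "conclusion" (assumed values)
    source_tool: Optional[str]   # tool whose output the text came from, if any
    text: str


def should_form_memory(p: MemoryProposal) -> bool:
    """Drop raw tool returns; keep conclusions, even ones that cite a tool."""
    is_raw_tool_output = (
        p.kind == "verified-tool-finding" and p.source_tool is not None
    )
    return not is_raw_tool_output
```

Under this sketch a raw web_search result is rejected, while a conclusion the assistant derived from that result still passes through to the proposal gate.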

1b. Deduplicate at formation time

Before creating a new memory, check if a semantically similar one already
exists. If found:

  • Same concept, new info → update the existing memory
  • Same concept, no new info → skip entirely
  • New concept → create (current behavior)

Requires: similarity check at write time. Options:

  • Anchor-based matching (no new dependencies, catches 371/696 dupes)
  • Embedding similarity (requires embedding endpoint, catches 183 additional)
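
The three-way decision above could look roughly like this with anchor-based matching only. This is a sketch under assumptions: the store is modeled as a plain dict keyed by anchor, "no new info" is approximated by a whitespace- and case-insensitive text comparison, and all names are hypothetical. An embedding check could replace the comparison later without changing the decision shape.

```python
from enum import Enum
from typing import Dict


class Action(Enum):
    CREATE = "create"
    UPDATE = "update"
    SKIP = "skip"


def normalize(text: str) -> str:
    # Collapse whitespace and case so trivial rewordings compare equal.
    return " ".join(text.lower().split())


def decide(store: Dict[str, str], anchor: str, text: str) -> Action:
    existing = store.get(anchor)
    if existing is None:
        return Action.CREATE          # new concept
    if normalize(existing) == normalize(text):
        return Action.SKIP            # same concept, no new info
    return Action.UPDATE              # same concept, new info


def write_memory(store: Dict[str, str], anchor: str, text: str) -> Action:
    action = decide(store, anchor, text)
    if action is not Action.SKIP:
        store[anchor] = text          # create or overwrite in place
    return action
```

With this shape, repeated proposals such as the ten "favorite color" copies would collapse into a single record under one anchor.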

1c. Disable/reduce memory formation for headless sessions

netclaw -p eval prompts and single-turn headless sessions should not form
durable memories. Either disable formation for headless channel type or
require a minimum turn count (>= 3) before forming memories.
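
Both options can be expressed as one small policy check. The sketch below is illustrative: the "headless" channel label, the function name, and the disable_headless flag are assumptions; only the >= 3 turn threshold comes from this issue.

```python
MIN_TURNS_FOR_FORMATION = 3  # threshold proposed in this issue


def formation_enabled(channel_type: str, turn_count: int,
                      disable_headless: bool = True) -> bool:
    """Decide whether a session may form durable memories."""
    if channel_type == "headless":
        if disable_headless:
            return False                      # option 1: never form in headless
        return turn_count >= MIN_TURNS_FOR_FORMATION  # option 2: minimum turns
    return True                               # interactive channels unchanged
```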

Phase 2: Memory Recall Pipeline (#370)

Goal: Stop wasting context tokens on redundant recall. Surface diverse
memories across a session.

2a. Recall only at turn boundaries, not tool loop iterations

FireLlmCall() currently runs ResolveRecallBundle() on every invocation,
including tool loop follow-ups. Recall should fire only at turn boundaries,
not on tool loop follow-up calls.
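
A minimal sketch of gating recall on turn boundaries. It assumes the caller can tell whether an LLM call is a tool loop follow-up (the is_tool_follow_up flag is hypothetical), and resolve_recall stands in for ResolveRecallBundle().

```python
from typing import Callable, List


def recall_for_call(query: str, is_tool_follow_up: bool,
                    resolve_recall: Callable[[str], List[str]]) -> List[str]:
    """Run recall once per turn; inject nothing on tool loop follow-ups."""
    if is_tool_follow_up:
        return []
    return resolve_recall(query)
```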

2b. Exclusion-based progressive recall

Track injected memory IDs per session. Pass as exclusion filter to recall
so each turn surfaces different relevant memories instead of the same top-3
every time. Reset on compaction.
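
The exclusion filter could be sketched as follows. SessionRecall and its method names are hypothetical, and ranked_ids stands in for whatever ordered result list the existing recall produces.

```python
from typing import List, Set


class SessionRecall:
    """Tracks memory IDs already injected in this session."""

    def __init__(self, top_k: int = 3) -> None:
        self.top_k = top_k
        self.injected: Set[str] = set()

    def recall(self, ranked_ids: List[str]) -> List[str]:
        # Skip anything already injected, then take the next top_k.
        fresh = [m for m in ranked_ids if m not in self.injected][: self.top_k]
        self.injected.update(fresh)
        return fresh

    def on_compaction(self) -> None:
        # Injected memories are gone from context, so allow them again.
        self.injected.clear()
```

Given the same ranking on consecutive turns, the first turn surfaces the top three, the next turn surfaces the following ones, and compaction resets the tracker.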

2c. Evaluate embedding-based recall vs current ranked retrieval

Current lexical TF-IDF recall scores 10/10 on eval. Memorizer's experience
(PR #157) shows pure vector search fails on short queries at scale — needed
hybrid search with RRF fusion. Evaluate whether the current approach is
sufficient or whether hybrid recall is needed as the memory corpus grows.
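
For reference when evaluating the hybrid option, reciprocal rank fusion (RRF) combines ranked lists by summing 1 / (k + rank) per document. The sketch below is a generic version of the technique, not Memorizer's implementation; k = 60 is the commonly used default and is an assumption here.

```python
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked ID lists into one, best first."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score descending; break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


lexical = ["a", "b", "c"]   # e.g. TF-IDF order
vector = ["b", "c", "a"]    # e.g. embedding order
fused = rrf_fuse([lexical, vector])
```

RRF only needs ranks, not comparable scores, which is why it suits mixing lexical and vector results.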

Phase 3: Skill Discovery (#355)

Goal: Match user intent to skills using semantic understanding, not just
keyword token overlap.

3a. Embedding-based skill matching

Embed skill description fields (6 skills, tiny corpus). On each user
message, compare message embedding against skill description embeddings.
Load skills above a similarity threshold.

At 6 skills, vector search works fine — no scale compression issues like
Memorizer saw at 2300+ items. No hybrid search needed.
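
A sketch of threshold-based skill matching over precomputed vectors. The toy two-dimensional vectors and the 0.5 threshold are placeholders; real vectors would come from the configured embedding model, and the threshold would need tuning against the evals.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def match_skills(message_vec: List[float],
                 skill_vecs: Dict[str, List[float]],
                 threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Return (skill, similarity) pairs above the threshold, best first."""
    scored = [(name, cosine(message_vec, vec)) for name, vec in skill_vecs.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda t: -t[1])


# Toy vectors standing in for description embeddings.
skills = {"skill_diagnostics": [1.0, 0.0], "skill_memory": [0.0, 1.0]}
matches = match_skills([0.9, 0.1], skills)
```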

3b. Deprecate keyword-only matching as primary

Keep keyword matching as a fallback/boost signal, but make semantic matching
the primary discovery mechanism. This handles the vocabulary gap where
"something is wrong with my session" needs to match operator-authored
diagnostics guidance.
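
One way to keep keyword matching as a boost signal is to add a small weighted overlap term to the semantic score. The weighting scheme, the 0.1 boost, and the function names below are assumptions for illustration, not the current matcher.

```python
from typing import List


def keyword_overlap(message: str, keywords: List[str]) -> float:
    """Fraction of a skill's keywords that appear as tokens in the message."""
    if not keywords:
        return 0.0
    tokens = set(message.lower().split())
    wanted = {k.lower() for k in keywords}
    return len(tokens & wanted) / len(wanted)


def combined_score(semantic: float, overlap: float, boost: float = 0.1) -> float:
    # Semantic similarity dominates; keyword overlap only nudges the ranking.
    return semantic + boost * overlap
```

In this sketch a message with no keyword overlap can still load a skill on semantic similarity alone, which is the case the vocabulary gap requires.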

3c. Revert skill compaction preservation (#315)

Once semantic matching is in place, skills can be re-discovered naturally
after compaction based on current conversation context. The force-reload
behavior from PR #315 / bd92318 becomes unnecessary.

Embedding Infrastructure

All three phases benefit from embedding capability. The existing Ollama
dependency already provides IEmbeddingGenerator<string, Embedding<float>>
via OllamaSharp — no new package dependencies needed.

Configuration:

  • New config: Embeddings.Model (e.g., qwen3-embedding:4b)
  • Uses the same Ollama endpoint as the chat model
  • Required for Phase 1b (embedding dedup) and Phase 3
  • Phase 1a, 1c, 2a, 2b can ship without embeddings

Operational cost:

  • Embedding models are small (0.6B-4B params) and fast (~37s for 265 docs,
    roughly 0.14s per doc)
  • Single embedding per proposal at write time, not bulk operations
  • Stored vectors enable O(n) similarity scan, sufficient for current scale

Sequencing

Phase 1a (stop tool output memories)     — no dependencies, ship immediately
Phase 1c (headless session filtering)    — no dependencies, ship immediately
Phase 2a (recall at turn boundaries)     — no dependencies, ship immediately
Phase 2b (exclusion-based progressive)   — no dependencies, ship immediately
Phase 1b (dedup at formation)            — anchor-based first, embedding later
Phase 3a (skill embedding matching)      — requires embedding config
Phase 3b (deprecate keyword-only)        — after 3a validated by evals
Phase 2c (evaluate hybrid recall)        — after corpus grows, data-driven
Phase 3c (revert skill preservation)     — after 3a proven stable

Success Criteria

Measured by eval suite (evals/run-evals.sh):

  • Memory Pipeline stays GREEN (10/10)
  • Skill Auto-Loading moves from 2/4 to 4/4
  • Memory database growth rate drops by >50%
  • No memory re-injection within tool loop iterations
  • Overall eval score improves from 72.7% baseline

Related Issues
