epic: memory quality overhaul — formation, recall, and skill discovery

## Context

Netclaw's memory system is actively degrading the assistant's performance.
After one week of heavy use, 71% of stored records are junk tool outputs,
the same memories get re-injected 14x per multi-step turn, and the proposal
gate creates duplicates instead of updating existing knowledge. The skill
discovery system shares the same underlying retrieval weakness.

This work takes priority over security posture hardening (#352): security
policy should adapt to how memory is used, not the other way around.

### Evidence (v0.7.9 baseline)

- **696 total memories** after ~1 week of use, most formed in a single day
- **308 of 431 records (71%)** are junk `verified-tool-finding` records from
  `web_search` and `web_fetch` tool outputs
- **212 memories formed from 253 eval turns** — nearly 1 per turn
- **10 copies** of "Aaron's favorite color is blue" under the same anchor
- **1,956 memory recalls** for 253 turns — same 3 docs re-injected every
  tool loop iteration
- Dedup analysis: anchor matching finds 371 redundant docs, vector embedding
  (qwen3-embedding:4b) finds 183 additional semantic duplicates missed by
  anchor/title matching
- Skill auto-loading: 0/10 on `skill_diagnostics` and `skill_memory` due to
  vocabulary gap between operator-authored skill content and user-facing language

## Phase 1: Memory Formation Quality (#379)

**Goal:** Form fewer, better memories. Update existing knowledge instead of
creating duplicates.

### 1a. Stop forming memories for raw tool outputs

`web_search`, `web_fetch`, and other tool results should NOT generate durable
memories. The observation sidecar is treating every tool return as a
"verified-tool-finding" worth storing. Only conclusions drawn from tool
results are memory-worthy, not the raw outputs themselves.
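
A minimal sketch of that filter, in Python for illustration only. The `Observation` record, its `source` labels, and `is_memory_worthy` are hypothetical names, not the observation sidecar's actual schema or API.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical source labels; the real sidecar may classify differently.
RAW_TOOL_OUTPUT = "tool_output"
CONCLUSION = "conclusion"


@dataclass(frozen=True)
class Observation:
    source: str            # RAW_TOOL_OUTPUT or CONCLUSION
    tool: Optional[str]    # tool name for raw outputs, otherwise None
    text: str


def is_memory_worthy(obs: Observation) -> bool:
    """Keep conclusions drawn from tool results; reject the raw returns."""
    return obs.source == CONCLUSION


observations = [
    Observation(RAW_TOOL_OUTPUT, "web_search", "<raw search results>"),
    Observation(RAW_TOOL_OUTPUT, "web_fetch", "<raw page body>"),
    Observation(CONCLUSION, None, "Aaron's favorite color is blue"),
]

# Only the conclusion survives the filter.
kept = [o for o in observations if is_memory_worthy(o)]
```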

### 1b. Deduplicate at formation time

Before creating a new memory, check if a semantically similar one already
exists. If found:
- **Same concept, new info** → update the existing memory
- **Same concept, no new info** → skip entirely
- **New concept** → create (current behavior)

Requires a similarity check at write time. Options:
- Anchor-based matching (no new dependencies, flags 371 of 696 docs as redundant)
- Embedding similarity (requires an embedding endpoint, flags 183 additional duplicates)
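
The three-way decision above could be sketched as follows, using the embedding option. This is an illustrative Python sketch: the `SAME_CONCEPT` threshold of 0.85 is an assumed value, and "no new info" is approximated by normalized text equality, which a real implementation would likely replace with something stronger.

```python
import math
from enum import Enum


class Action(Enum):
    UPDATE = "update"   # same concept, new info
    SKIP = "skip"       # same concept, no new info
    CREATE = "create"   # new concept


SAME_CONCEPT = 0.85  # assumed similarity threshold, to be tuned against evals


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def decide(proposal_text, proposal_vec, existing):
    """existing is a list of (text, vector) pairs for stored memories."""
    best_text, best_sim = None, -1.0
    for text, vec in existing:
        sim = cosine(proposal_vec, vec)
        if sim > best_sim:
            best_text, best_sim = text, sim
    if best_text is None or best_sim < SAME_CONCEPT:
        return Action.CREATE
    # Simplification: identical normalized text means nothing new to store.
    if proposal_text.strip().lower() == best_text.strip().lower():
        return Action.SKIP
    return Action.UPDATE
```

With one stored memory `("Aaron's favorite color is blue", [1.0, 0.0])`, a proposal with the same text and vector is skipped, a near-identical vector with extra text updates the existing record, and an orthogonal vector creates a new one.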

### 1c. Disable/reduce memory formation for headless sessions

`netclaw -p` eval prompts and single-turn headless sessions should not form
durable memories. Either disable formation for headless channel type or
require a minimum turn count (>= 3) before forming memories.
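
Both options fit in a single gate, sketched below in Python. The `"headless"` channel label and the `disable_headless` switch are hypothetical, and the sketch assumes the minimum turn count applies only to headless sessions.

```python
MIN_TURNS = 3  # the ">= 3" threshold proposed above


def formation_allowed(channel_type, turn_count, *, disable_headless=False):
    """Return True if this session may form durable memories."""
    if channel_type != "headless":
        return True
    if disable_headless:
        # Option 1: never form memories in headless sessions.
        return False
    # Option 2: headless sessions must reach a minimum turn count first.
    return turn_count >= MIN_TURNS
```

Under option 2 a single-turn `netclaw -p` run forms nothing, while a headless session that reaches three turns can still form memories.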

## Phase 2: Memory Recall Pipeline (#370)

**Goal:** Stop wasting context tokens on redundant recall. Surface diverse
memories across a session.

### 2a. Recall only at turn boundaries, not tool loop iterations

`FireLlmCall()` currently runs `ResolveRecallBundle()` on every invocation,
including tool loop follow-ups. Recall should fire only when:
- A new user message starts a turn
- A buffered user message is drained mid-loop (PR #351)
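
The gating rule can be sketched as a check on why the LLM call is being made. This Python sketch is illustrative: `CallReason` and `should_recall` are hypothetical names, and how `FireLlmCall()` would learn the reason is left open.

```python
from enum import Enum, auto


class CallReason(Enum):
    NEW_USER_MESSAGE = auto()          # a new user message starts a turn
    BUFFERED_MESSAGE_DRAINED = auto()  # buffered user message drained mid-loop
    TOOL_LOOP_FOLLOW_UP = auto()       # follow-up call after a tool result


# Only these reasons trigger recall; tool loop follow-ups reuse prior context.
RECALL_REASONS = frozenset(
    {CallReason.NEW_USER_MESSAGE, CallReason.BUFFERED_MESSAGE_DRAINED}
)


def should_recall(reason: CallReason) -> bool:
    return reason in RECALL_REASONS


# One turn: a user message followed by four tool loop follow-ups.
turn = [CallReason.NEW_USER_MESSAGE] + [CallReason.TOOL_LOOP_FOLLOW_UP] * 4
recalls = sum(should_recall(r) for r in turn)  # recall fires once, not five times
```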

### 2b. Exclusion-based progressive recall

Track injected memory IDs per session and pass them to recall as an
exclusion filter, so each turn surfaces different relevant memories instead
of the same top 3 every time. Reset the exclusion set on compaction.
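
A minimal Python sketch of the per-session tracking, assuming the existing ranker returns memory IDs ordered best-first. The `ProgressiveRecall` class and its method names are hypothetical.

```python
class ProgressiveRecall:
    """Tracks which memory IDs were already injected in this session."""

    def __init__(self, top_k=3):
        self.top_k = top_k
        self.injected = set()

    def recall(self, ranked_ids):
        """Return the best top_k IDs that have not been injected yet."""
        fresh = [m for m in ranked_ids if m not in self.injected][: self.top_k]
        self.injected.update(fresh)
        return fresh

    def on_compaction(self):
        """Compaction drops earlier context, so memories may be injected again."""
        self.injected.clear()


session = ProgressiveRecall()
ranked = ["m1", "m2", "m3", "m4", "m5"]
first = session.recall(ranked)    # top three
second = session.recall(ranked)   # the remaining two
session.on_compaction()
third = session.recall(ranked)    # top three again after reset
```

One consequence worth noting: once every relevant memory has been injected, recall returns nothing until the next compaction, which is the intended token saving.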

### 2c. Evaluate embedding-based recall vs current ranked retrieval

Current lexical TF-IDF recall scores 10/10 on eval. Memorizer's experience
(PR #157) shows that pure vector search fails on short queries at scale; it
needed hybrid search with reciprocal rank fusion (RRF). Evaluate whether the
current approach is sufficient or whether hybrid recall is needed as the
memory corpus grows.
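
For reference during the evaluation, RRF scores each document as the sum of `1 / (k + rank)` over every ranking it appears in. The Python sketch below is a generic illustration, not Memorizer's implementation; `k = 60` is the commonly used default and is an assumption here.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several best-first rankings of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


lexical = ["a", "b", "c"]   # e.g. the TF-IDF ranking
vector = ["b", "d", "a"]    # e.g. the embedding ranking
fused = rrf_fuse([lexical, vector])
```

Documents ranked well by both retrievers rise to the top, while a document found by only one retriever still appears in the fused list.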

## Phase 3: Skill Discovery (#355)

**Goal:** Match user intent to skills using semantic understanding, not just
keyword token overlap.

### 3a. Embedding-based skill matching

Embed skill `description` fields (6 skills, tiny corpus). On each user
message, compare message embedding against skill description embeddings.
Load skills above a similarity threshold.

At 6 skills, vector search works fine — no scale compression issues like
Memorizer saw at 2300+ items. No hybrid search needed.
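
A brute-force version of this matching, sketched in Python with toy 3-dimensional vectors standing in for real embeddings. The 0.6 threshold and the `match_skills` helper are assumptions; real vectors would come from the configured embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def match_skills(message_vec, skill_vecs, threshold=0.6):
    """Return (skill, similarity) pairs at or above threshold, best first."""
    scored = [(name, cosine(message_vec, vec)) for name, vec in skill_vecs.items()]
    return sorted(
        (pair for pair in scored if pair[1] >= threshold),
        key=lambda pair: -pair[1],
    )


# Toy vectors only; not real embeddings of the skill descriptions.
skill_vecs = {
    "skill_diagnostics": [0.9, 0.1, 0.0],
    "skill_memory": [0.1, 0.9, 0.0],
}
message_vec = [0.8, 0.2, 0.1]
matches = match_skills(message_vec, skill_vecs)
```

Skill description vectors only need to be computed once and cached; each user message then costs one embedding call plus a scan over the skill list.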

### 3b. Deprecate keyword-only matching as primary

Keep keyword matching as a fallback/boost signal, but make semantic matching
the primary discovery mechanism. This handles the vocabulary gap where
"something is wrong with my session" needs to match operator-authored
diagnostics guidance.
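
One way to keep keywords as a boost is to add a small weighted overlap term to the semantic score. This Python sketch is illustrative: the `boost` weight of 0.1 and both helper names are assumptions.

```python
def keyword_overlap(message, keywords):
    """Fraction of a skill's keywords that appear as tokens in the message."""
    if not keywords:
        return 0.0
    tokens = set(message.lower().split())
    return len(tokens & set(keywords)) / len(keywords)


def combined_score(semantic, overlap, boost=0.1):
    """Semantic similarity is primary; keyword overlap only nudges the score."""
    return semantic + boost * overlap


overlap = keyword_overlap(
    "something is wrong with my session", {"diagnostics", "session"}
)
score = combined_score(0.7, overlap)
```

Because the boost is capped at `boost`, a skill with zero keyword overlap can still win on semantic similarity alone, which is what the vocabulary-gap case needs.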

### 3c. Revert skill compaction preservation (#315)

Once semantic matching is in place, skills can be re-discovered naturally
after compaction based on current conversation context. The force-reload
behavior from PR #315 / `bd92318` becomes unnecessary.

## Embedding Infrastructure

All three phases benefit from embedding capability. The existing Ollama
dependency already provides `IEmbeddingGenerator<string, Embedding<float>>`
via OllamaSharp — no new package dependencies needed.

**Configuration:**
- New config: `Embeddings.Model` (e.g., `qwen3-embedding:4b`)
- Uses the same Ollama endpoint as the chat model
- Required for Phase 1b (embedding dedup) and Phase 3
- Phase 1a, 1c, 2a, 2b can ship without embeddings

**Operational cost:**
- Embedding models are small (0.6B-4B params) and fast (~37s for 265 docs)
- Single embedding per proposal at write time, not bulk operations
- Stored vectors enable O(n) similarity scan, sufficient for current scale

## Sequencing

```
Phase 1a (stop tool output memories)     — no dependencies, ship immediately
Phase 1c (headless session filtering)    — no dependencies, ship immediately
Phase 2a (recall at turn boundaries)     — no dependencies, ship immediately
Phase 2b (exclusion-based progressive)   — no dependencies, ship immediately
Phase 1b (dedup at formation)            — anchor-based first, embedding later
Phase 3a (skill embedding matching)      — requires embedding config
Phase 3b (deprecate keyword-only)        — after 3a validated by evals
Phase 2c (evaluate hybrid recall)        — after corpus grows, data-driven
Phase 3c (revert skill preservation)     — after 3a proven stable
```

## Success Criteria

Measured by eval suite (`evals/run-evals.sh`):
- Memory Pipeline stays GREEN (10/10)
- Skill Auto-Loading moves from 2/4 to 4/4
- Memory database growth rate drops by >50%
- No memory re-injection within tool loop iterations
- Overall eval score improves from 72.7% baseline

## Related Issues

- #379 — memory formation duplicates (Phase 1)
- #370 — memory recall deduplication (Phase 2)
- #355 — semantic skill discovery (Phase 3)
- #352 — command-level approval gates (deferred, security adapts to memory)
- #315 — skill compaction preservation (revert candidate after Phase 3)
