hashintel · lunelson · Apr 2, 2026 · Apr 2, 2026 · augmentcode · Apr 2, 2026
diff --git a/memory/PLAN.md b/memory/PLAN.md
@@ -75,7 +75,7 @@
 
 ### Spikes
 
-1. **Observer extraction fidelity** — Can the LLM reliably extract decisions, assumptions, and dependency edges from a single turn's Q&A? Test with realistic fixture turns across different question types (scope, design, constraints). Measure extraction consistency across runs. `not-started`
+1. **Observer extraction fidelity** `FE-557` — Can the LLM reliably extract decisions, assumptions, and dependency edges from a single turn's Q&A? Test with realistic fixture turns across different question types (scope, design, constraints). Measure extraction consistency across runs. `done`
    - Assumptions: → SPEC.md §Assumptions A14, A3
    - Time box: 2 hours
    - Success: ≥80% of expected entities captured with correct dependency edges across 5+ fixture turns
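
Reviewer note: the success criterion above can be checked mechanically. The following is a minimal sketch, not the spike's actual harness. The `Entity` and `FixtureResult` shapes, the normalized `key` used for matching, and the choice to pool hits across all fixtures (rather than requiring 80% per fixture) are all assumptions made for illustration; the diff does not show the real fixture format.

```typescript
// Hypothetical shapes: the real fixture and observer output formats are not shown in this diff.
type EntityKind = "decision" | "assumption" | "edge";

interface Entity {
  kind: EntityKind;
  key: string; // assumed: a normalized identifier that expected and extracted entities share
}

interface FixtureResult {
  name: string;
  expected: Entity[]; // hand-labelled entities for one fixture turn
  extracted: Entity[]; // what the observer returned for that turn
}

const THRESHOLD = 0.8; // from "≥80% of expected entities captured"
const MIN_FIXTURES = 5; // from "5+ fixture turns"

// Build a comparable id; matching is exact after trimming and lowercasing (an assumption).
function entityId(e: Entity): string {
  return `${e.kind}:${e.key.trim().toLowerCase()}`;
}

// Number of expected entities that also appear in the extracted set.
function countHits(result: FixtureResult): number {
  const found = new Set(result.extracted.map(entityId));
  return result.expected.filter((e) => found.has(entityId(e))).length;
}

// Per-fixture capture rate (recall). An empty expectation counts as fully captured.
function captureRate(result: FixtureResult): number {
  if (result.expected.length === 0) return 1;
  return countHits(result) / result.expected.length;
}

// Pooled check: enough fixtures, and at least THRESHOLD of all expected entities captured.
function spikePasses(results: FixtureResult[]): boolean {
  if (results.length < MIN_FIXTURES) return false;
  const totalExpected = results.reduce((n, r) => n + r.expected.length, 0);
  const totalHits = results.reduce((n, r) => n + countHits(r), 0);
  return totalExpected > 0 && totalHits / totalExpected >= THRESHOLD;
}
```

Exact-key matching would undercount cases like the "semantically correct" assumptions noted in SPEC.md A14, so a real harness would likely need a fuzzier comparison or manual review for that entity kind.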

diff --git a/memory/SPEC.md b/memory/SPEC.md
@@ -66,8 +66,8 @@ The architecture (layered: db → core → adapters):
 | --- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------- | ------------------- | ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
 | A1  | AI SDK's UI Message Stream SSE protocol is documented and stable enough to emit conformantly without importing AI SDK server-side                                                                                                                                                                                         | **validated** | D8                  | Walking skeleton  | Validated: skeleton emits conformant SSE, 15 tests pass                                                                                                                                                                              |
 | A2  | Claude Agent SDK `query()` with `includePartialMessages` provides all streaming event types needed for CLI-quality feedback                                                                                                                                                                                               | **validated** | D8                  | Walking skeleton  | Validated: adapter translates stream_event messages correctly                                                                                                                                                                        |
-| A3  | Separating interviewer from observer produces better interview quality than inline tool calling                                                                                                                                                                                                                           | medium        | D1                  | Observer agent    | Compare interview coherence with and without tool-calling load                                                                                                                                                                       |
-| A4  | Observer extraction completes in 1-3s during user read/think time (10-60s), adding zero perceived latency                                                                                                                                                                                                                 | medium        | D1                  | Observer agent    | Measure extraction latency with realistic turn payloads                                                                                                                                                                              |
+| A3  | Separating interviewer from observer produces better interview quality than inline tool calling                                                                                                                                                                                                                           | high          | D1                  | Observer agent    | Spike confirms extraction is viable as separate call; interviewer prompt stays clean. Full comparison deferred to slice 5 manual testing.                                                                                             |
+| A4  | Observer extraction completes in 1-3s during user read/think time (10-60s), adding zero perceived latency                                                                                                                                                                                                                 | medium        | D1                  | Observer agent    | Spike measured 14-17s with Sonnet. Haiku expected 2-5s — validate in slice 5 with model switch.                                                                                                                                       |
 | A5  | `better-sqlite3` npm prebuilt binary works across macOS/Linux without native compilation issues                                                                                                                                                                                                                           | **validated** | D7                  | SQLite foundation | Validated: installed on macOS without native compilation issues                                                                                                                                                                      |
 | A6  | Turn-tree branching in SQLite is sufficient for decision revisit and undo in a single-user tool                                                                                                                                                                                                                           | high          | D7                  | Turn tree         | Validate with realistic branch/merge scenarios                                                                                                                                                                                       |
 | A7  | Users arriving at the tool have a reasonably defined goal                                                                                                                                                                                                                                                                 | medium        | —                   | Scope phase       | User testing; exploratory pathway deferred if false                                                                                                                                                                                  |
@@ -77,7 +77,7 @@ The architecture (layered: db → core → adapters):
 | A11 | Stateless `query()` with prompt-stuffed history is sufficient for multi-turn interviewing — SDK session persistence is unnecessary and undesirable                                                                                                                                                                        | **validated** | D8, D12             | SQLite foundation | Validated: formatting history into prompt works. SDK sessions rejected as competing source of truth — opaque, machine-local, incompatible with portable data goals (atomic YAML / git-versionable). Turn tree is sole session model. |
 | A12 | `useChat` hook accepts initial messages to hydrate conversation state from server-stored history                                                                                                                                                                                                                          | **validated** | D9                  | SQLite foundation | Validated: `useChat` doesn't have `initialMessages` prop but `setMessages` works for hydration                                                                                                                                       |
 | A13 | Phase-specific interview behavior is achievable via system prompt switching + in-process MCP tools on `query()` — the SDK's formal `AgentDefinition` skill system is unnecessary                                                                                                                                          | **validated** | D2                  | Interview phases  | Validated: slice 4 uses `getSystemPrompt(phase)` + `createInterviewMcpServer()` per turn; 88 tests pass. SDK `AgentDefinition` subagent system not used — simpler approach with less indirection.                                     |
-| A14 | A second-thread observer agent can reliably extract decisions, assumptions, and dependency edges from a single turn's Q&A                                                                                                                                                                                                 | medium        | D1                  | Observer agent    | Probe with realistic interview exchanges; measure extraction fidelity                                                                                                                                                                |
+| A14 | A second-thread observer agent can reliably extract decisions, assumptions, and dependency edges from a single turn's Q&A                                                                                                                                                                                                 | **validated** | D1                  | Observer agent    | Validated (spike): decisions 100% capture, assumptions semantically correct (~80% true semantic overlap). Edges not tested — deferred to slice 5. Use tool-based structured output and faster model (Haiku) in production.              |
 | A15 | The LLM can reliably judge when a phase interview has reached sufficient understanding (is_resolution)                                                                                                                                                                                                                    | medium        | D3                  | Phase resolution  | Probe across varied project types; measure false-positive resolution rate                                                                                                                                                            |
 | A16 | AI SDK `useChat` hook's `ToolUIPart` state machine (`input-streaming` → `input-available` → `output-available` / `output-error` / `approval-requested` → `approval-responded` / `output-denied`) models all permutations of pending, error, and success for both interim (thinking, tool calls) and final (response) data | high          | D14                 | Rich chat UI      | Partially validated: SSE adapter emits tool-call events, client renders `dynamic-tool` parts with state labels (input-streaming, input-available, output-available, output-error). Browser outer-loop pending.                         |
 | A17 | AI Elements copy-paste components can be restyled without forking — they are ownable source files, not npm-locked dependencies                                                                                                                                                                                            | high          | D14                 | Rich chat UI      | Install via CLI, inspect source, confirm no hidden npm runtime dependency                                                                                                                                                            |