FE-557: Spike: observer extraction fidelity#21
Conversation
🤖 Augment PR SummarySummary: This PR closes out the FE-557 spike to evaluate “observer” extraction fidelity (decisions/assumptions) from a single interview turn. Changes:
Technical Notes: The spike uses streaming 🤖 Was this summary useful? React with 👍 or 👎 |
| | A12 | `useChat` hook accepts initial messages to hydrate conversation state from server-stored history | **validated** | D9 | SQLite foundation | Validated: `useChat` doesn't have `initialMessages` prop but `setMessages` works for hydration | | ||
| | A13 | Phase-specific interview behavior is achievable via system prompt switching + in-process MCP tools on `query()` — the SDK's formal `AgentDefinition` skill system is unnecessary | **validated** | D2 | Interview phases | Validated: slice 4 uses `getSystemPrompt(phase)` + `createInterviewMcpServer()` per turn; 88 tests pass. SDK `AgentDefinition` subagent system not used — simpler approach with less indirection. | | ||
| | A14 | A second-thread observer agent can reliably extract decisions, assumptions, and dependency edges from a single turn's Q&A | medium | D1 | Observer agent | Probe with realistic interview exchanges; measure extraction fidelity | | ||
| | A14 | A second-thread observer agent can reliably extract decisions, assumptions, and dependency edges from a single turn's Q&A | **validated** | D1 | Observer agent | Validated (spike): decisions 100% capture, assumptions semantically correct (~80% true semantic overlap). Edges not tested — deferred to slice 5. Use tool-based structured output and faster model (Haiku) in production. | |
There was a problem hiding this comment.
memory/SPEC.md:80 — A14 is marked validated, but the text says “Edges not tested — deferred to slice 5”; since A14 explicitly includes “dependency edges”, this reads as internally inconsistent and could mislead downstream planning.
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
| | A2 | Claude Agent SDK `query()` with `includePartialMessages` provides all streaming event types needed for CLI-quality feedback | **validated** | D8 | Walking skeleton | Validated: adapter translates stream_event messages correctly | | ||
| | A3 | Separating interviewer from observer produces better interview quality than inline tool calling | medium | D1 | Observer agent | Compare interview coherence with and without tool-calling load | | ||
| | A4 | Observer extraction completes in 1-3s during user read/think time (10-60s), adding zero perceived latency | medium | D1 | Observer agent | Measure extraction latency with realistic turn payloads | | ||
| | A3 | Separating interviewer from observer produces better interview quality than inline tool calling | high | D1 | Observer agent | Spike confirms extraction is viable as separate call; interviewer prompt stays clean. Full comparison deferred to slice 5 manual testing. | |
There was a problem hiding this comment.
memory/SPEC.md:69 — A3 is about interview quality vs inline tool calling, but the evidence note is about extraction viability/clean prompts; consider aligning the note to the actual claim being made (or leaving it as unvalidated until the comparison is run).
Severity: low
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
| | A3 | Separating interviewer from observer produces better interview quality than inline tool calling | medium | D1 | Observer agent | Compare interview coherence with and without tool-calling load | | ||
| | A4 | Observer extraction completes in 1-3s during user read/think time (10-60s), adding zero perceived latency | medium | D1 | Observer agent | Measure extraction latency with realistic turn payloads | | ||
| | A3 | Separating interviewer from observer produces better interview quality than inline tool calling | high | D1 | Observer agent | Spike confirms extraction is viable as separate call; interviewer prompt stays clean. Full comparison deferred to slice 5 manual testing. | | ||
| | A4 | Observer extraction completes in 1-3s during user read/think time (10-60s), adding zero perceived latency | medium | D1 | Observer agent | Spike measured 14-17s with Sonnet. Haiku expected 2-5s — validate in slice 5 with model switch. | |
There was a problem hiding this comment.
| console.log('==================================\n'); | ||
|
|
||
| const results = []; | ||
| for (const fixture of FIXTURES) { |
There was a problem hiding this comment.
spike/observer-fidelity.ts:256 — The spike runs each fixture exactly once, so it doesn’t actually measure “extraction consistency across runs” (variance/instability) even though that’s part of the stated goal in PLAN/PR description.
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
| if (extracted.assumptions.some((a) => fuzzyMatch(a, exp))) assumptionHits++; | ||
| } | ||
|
|
||
| const totalExpected = expected.decisions.length + expected.assumptions.length; |
There was a problem hiding this comment.
spike/observer-fidelity.ts:191 — scoreExtraction() measures recall only (hits/expected) and doesn’t penalize extra/hallucinated entities, so the reported “PASS” can overstate extraction fidelity if the model is adding incorrect decisions/assumptions.
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
| ### Spikes | ||
|
|
||
| 1. **Observer extraction fidelity** — Can the LLM reliably extract decisions, assumptions, and dependency edges from a single turn's Q&A? Test with realistic fixture turns across different question types (scope, design, constraints). Measure extraction consistency across runs. `not-started` | ||
| 1. **Observer extraction fidelity** `FE-557` — Can the LLM reliably extract decisions, assumptions, and dependency edges from a single turn's Q&A? Test with realistic fixture turns across different question types (scope, design, constraints). Measure extraction consistency across runs. `done` |
There was a problem hiding this comment.
memory/PLAN.md:78 — This spike is marked done, but the success criteria mention “correct dependency edges” and the spike artifact/code only evaluates decisions/assumptions (and SPEC A14 notes edges weren’t tested); that completion state may be ahead of what was actually verified.
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
| @@ -0,0 +1,287 @@ | |||
| /** | |||
There was a problem hiding this comment.
PR metadata: the PR title doesn’t match the repo convention (Rule: AGENTS.md) requiring FE-557: …-style titles; consider renaming for consistency/traceability.
Severity: low
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
5 fixture turns across scope/design/constraints question types. Decision extraction: 100% capture. Assumption extraction: semantically correct (~80% true overlap, fuzzy matcher underestimates at 47%). Latency: 14-17s with Sonnet (Haiku expected 2-5s). Recommendations for slice 5: use tool-based structured output, Haiku model, and LLM-as-judge for differential testing. Spike artifact: spike/observer-fidelity.ts (throwaway). Made-with: Cursor
fb3b7f1 to
0d9562a
Compare

5 fixture turns across scope/design/constraints question types.
Decision extraction: 100% capture. Assumption extraction: semantically
correct (~80% true overlap, fuzzy matcher underestimates at 47%).
Latency: 14-17s with Sonnet (Haiku expected 2-5s).
Recommendations for slice 5: use tool-based structured output,
Haiku model, and LLM-as-judge for differential testing.
Spike artifact: spike/observer-fidelity.ts (throwaway).
Made-with: Cursor