-
Notifications
You must be signed in to change notification settings - Fork 0
FE-557: Spike: observer extraction fidelity #21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -66,8 +66,8 @@ The architecture (layered: db → core → adapters): | |
| | --- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------- | ------------------- | ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | ||
| | A1 | AI SDK's UI Message Stream SSE protocol is documented and stable enough to emit conformantly without importing AI SDK server-side | **validated** | D8 | Walking skeleton | Validated: skeleton emits conformant SSE, 15 tests pass | | ||
| | A2 | Claude Agent SDK `query()` with `includePartialMessages` provides all streaming event types needed for CLI-quality feedback | **validated** | D8 | Walking skeleton | Validated: adapter translates stream_event messages correctly | | ||
| | A3 | Separating interviewer from observer produces better interview quality than inline tool calling | medium | D1 | Observer agent | Compare interview coherence with and without tool-calling load | | ||
| | A4 | Observer extraction completes in 1-3s during user read/think time (10-60s), adding zero perceived latency | medium | D1 | Observer agent | Measure extraction latency with realistic turn payloads | | ||
| | A3 | Separating interviewer from observer produces better interview quality than inline tool calling | high | D1 | Observer agent | Spike confirms extraction is viable as separate call; interviewer prompt stays clean. Full comparison deferred to slice 5 manual testing. | | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Severity: low 🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage. |
||
| | A4 | Observer extraction completes in 1-3s during user read/think time (10-60s), adding zero perceived latency | medium | D1 | Observer agent | Spike measured 14-17s with Sonnet. Haiku expected 2-5s — validate in slice 5 with model switch. | | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
| | A5 | `better-sqlite3` npm prebuilt binary works across macOS/Linux without native compilation issues | **validated** | D7 | SQLite foundation | Validated: installed on macOS without native compilation issues | | ||
| | A6 | Turn-tree branching in SQLite is sufficient for decision revisit and undo in a single-user tool | high | D7 | Turn tree | Validate with realistic branch/merge scenarios | | ||
| | A7 | Users arriving at the tool have a reasonably defined goal | medium | — | Scope phase | User testing; exploratory pathway deferred if false | | ||
|
|
@@ -77,7 +77,7 @@ The architecture (layered: db → core → adapters): | |
| | A11 | Stateless `query()` with prompt-stuffed history is sufficient for multi-turn interviewing — SDK session persistence is unnecessary and undesirable | **validated** | D8, D12 | SQLite foundation | Validated: formatting history into prompt works. SDK sessions rejected as competing source of truth — opaque, machine-local, incompatible with portable data goals (atomic YAML / git-versionable). Turn tree is sole session model. | | ||
| | A12 | `useChat` hook accepts initial messages to hydrate conversation state from server-stored history | **validated** | D9 | SQLite foundation | Validated: `useChat` doesn't have `initialMessages` prop but `setMessages` works for hydration | | ||
| | A13 | Phase-specific interview behavior is achievable via system prompt switching + in-process MCP tools on `query()` — the SDK's formal `AgentDefinition` skill system is unnecessary | **validated** | D2 | Interview phases | Validated: slice 4 uses `getSystemPrompt(phase)` + `createInterviewMcpServer()` per turn; 88 tests pass. SDK `AgentDefinition` subagent system not used — simpler approach with less indirection. | | ||
| | A14 | A second-thread observer agent can reliably extract decisions, assumptions, and dependency edges from a single turn's Q&A | medium | D1 | Observer agent | Probe with realistic interview exchanges; measure extraction fidelity | | ||
| | A14 | A second-thread observer agent can reliably extract decisions, assumptions, and dependency edges from a single turn's Q&A | **validated** | D1 | Observer agent | Validated (spike): decisions 100% capture, assumptions semantically correct (~80% true semantic overlap). Edges not tested — deferred to slice 5. Use tool-based structured output and faster model (Haiku) in production. | | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Severity: medium 🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage. |
||
| | A15 | The LLM can reliably judge when a phase interview has reached sufficient understanding (is_resolution) | medium | D3 | Phase resolution | Probe across varied project types; measure false-positive resolution rate | | ||
| | A16 | AI SDK `useChat` hook's `ToolUIPart` state machine (`input-streaming` → `input-available` → `output-available` / `output-error` / `approval-requested` → `approval-responded` / `output-denied`) models all permutations of pending, error, and success for both interim (thinking, tool calls) and final (response) data | high | D14 | Rich chat UI | Partially validated: SSE adapter emits tool-call events, client renders `dynamic-tool` parts with state labels (input-streaming, input-available, output-available, output-error). Browser outer-loop pending. | | ||
| | A17 | AI Elements copy-paste components can be restyled without forking — they are ownable source files, not npm-locked dependencies | high | D14 | Rich chat UI | Install via CLI, inspect source, confirm no hidden npm runtime dependency | | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
memory/PLAN.md:78— This spike is markeddone, but the success criteria mention “correct dependency edges” and the spike artifact/code only evaluates decisions/assumptions (and SPEC A14 notes edges weren’t tested); that completion state may be ahead of what was actually verified.Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.