Merged
2 changes: 1 addition & 1 deletion memory/PLAN.md
@@ -75,7 +75,7 @@

### Spikes

1. **Observer extraction fidelity** — Can the LLM reliably extract decisions, assumptions, and dependency edges from a single turn's Q&A? Test with realistic fixture turns across different question types (scope, design, constraints). Measure extraction consistency across runs. `not-started`
1. **Observer extraction fidelity** `FE-557` — Can the LLM reliably extract decisions, assumptions, and dependency edges from a single turn's Q&A? Test with realistic fixture turns across different question types (scope, design, constraints). Measure extraction consistency across runs. `done`

memory/PLAN.md:78 — This spike is marked done, but the success criteria mention “correct dependency edges” and the spike artifact/code only evaluates decisions/assumptions (and SPEC A14 notes edges weren’t tested); that completion state may be ahead of what was actually verified.

Severity: medium


- Assumptions: → SPEC.md §Assumptions A14, A3
- Time box: 2 hours
- Success: ≥80% of expected entities captured with correct dependency edges across 5+ fixture turns
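The success criterion above couples entity capture with edge correctness, so a scorer that reports the two separately makes it visible when only one of them was actually measured. The following is a minimal sketch under stated assumptions: the `Extraction`, `Entity`, and `Edge` shapes and the `scoreTurn` helper are hypothetical (the spike's real fixture format is not shown in this PR), and matching is by exact id, whereas real LLM output would likely need semantic matching.

```typescript
// Hypothetical fixture shapes; not taken from the spike code.
type EntityKind = "decision" | "assumption";

interface Entity {
  id: string;
  kind: EntityKind;
}

interface Edge {
  from: string; // id of the dependent entity
  to: string; // id of the entity it depends on
}

interface Extraction {
  entities: Entity[];
  edges: Edge[];
}

interface FidelityScore {
  entityRecall: number;
  edgeRecall: number;
  passes: boolean;
}

const entityKey = (e: Entity): string => `${e.kind}:${e.id}`;
const edgeKey = (e: Edge): string => `${e.from}->${e.to}`;

// Fraction of expected keys that also appear in the actual set.
// An empty expectation counts as fully captured.
function recall(expected: Set<string>, actual: Set<string>): number {
  if (expected.size === 0) return 1;
  let hits = 0;
  expected.forEach((key) => {
    if (actual.has(key)) hits += 1;
  });
  return hits / expected.size;
}

// Scores one fixture turn; both recalls must reach the threshold to pass.
function scoreTurn(
  expected: Extraction,
  actual: Extraction,
  threshold = 0.8,
): FidelityScore {
  const entityRecall = recall(
    new Set(expected.entities.map(entityKey)),
    new Set(actual.entities.map(entityKey)),
  );
  const edgeRecall = recall(
    new Set(expected.edges.map(edgeKey)),
    new Set(actual.edges.map(edgeKey)),
  );
  return {
    entityRecall,
    edgeRecall,
    passes: entityRecall >= threshold && edgeRecall >= threshold,
  };
}
```

With a scorer of this shape, a run that captures every decision and assumption but emits no edges fails the turn instead of passing on entity recall alone.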
6 changes: 3 additions & 3 deletions memory/SPEC.md
@@ -66,8 +66,8 @@ The architecture (layered: db → core → adapters):
| --- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------- | ------------------- | ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| A1 | AI SDK's UI Message Stream SSE protocol is documented and stable enough to emit conformantly without importing AI SDK server-side | **validated** | D8 | Walking skeleton | Validated: skeleton emits conformant SSE, 15 tests pass |
| A2 | Claude Agent SDK `query()` with `includePartialMessages` provides all streaming event types needed for CLI-quality feedback | **validated** | D8 | Walking skeleton | Validated: adapter translates stream_event messages correctly |
| A3 | Separating interviewer from observer produces better interview quality than inline tool calling | medium | D1 | Observer agent | Compare interview coherence with and without tool-calling load |
| A4 | Observer extraction completes in 1-3s during user read/think time (10-60s), adding zero perceived latency | medium | D1 | Observer agent | Measure extraction latency with realistic turn payloads |
| A3 | Separating interviewer from observer produces better interview quality than inline tool calling | high | D1 | Observer agent | Spike confirms extraction is viable as separate call; interviewer prompt stays clean. Full comparison deferred to slice 5 manual testing. |

memory/SPEC.md:69 — A3 is about interview quality vs inline tool calling, but the evidence note is about extraction viability/clean prompts; consider aligning the note to the actual claim being made (or leaving it as unvalidated until the comparison is run).

Severity: low


| A4 | Observer extraction completes in 1-3s during user read/think time (10-60s), adding zero perceived latency | medium | D1 | Observer agent | Spike measured 14-17s with Sonnet. Haiku expected 2-5s — validate in slice 5 with model switch. |

memory/SPEC.md:70 — A4 states “1-3s … adding zero perceived latency”, but the spike note reports 14–17s; as written this looks like a contradiction rather than a “medium” assumption awaiting validation.

Severity: medium


| A5 | `better-sqlite3` npm prebuilt binary works across macOS/Linux without native compilation issues | **validated** | D7 | SQLite foundation | Validated: installed on macOS without native compilation issues |
| A6 | Turn-tree branching in SQLite is sufficient for decision revisit and undo in a single-user tool | high | D7 | Turn tree | Validate with realistic branch/merge scenarios |
| A7 | Users arriving at the tool have a reasonably defined goal | medium | — | Scope phase | User testing; exploratory pathway deferred if false |
@@ -77,7 +77,7 @@ The architecture (layered: db → core → adapters):
| A11 | Stateless `query()` with prompt-stuffed history is sufficient for multi-turn interviewing — SDK session persistence is unnecessary and undesirable | **validated** | D8, D12 | SQLite foundation | Validated: formatting history into prompt works. SDK sessions rejected as competing source of truth — opaque, machine-local, incompatible with portable data goals (atomic YAML / git-versionable). Turn tree is sole session model. |
| A12 | `useChat` hook accepts initial messages to hydrate conversation state from server-stored history | **validated** | D9 | SQLite foundation | Validated: `useChat` doesn't have `initialMessages` prop but `setMessages` works for hydration |
| A13 | Phase-specific interview behavior is achievable via system prompt switching + in-process MCP tools on `query()` — the SDK's formal `AgentDefinition` skill system is unnecessary | **validated** | D2 | Interview phases | Validated: slice 4 uses `getSystemPrompt(phase)` + `createInterviewMcpServer()` per turn; 88 tests pass. SDK `AgentDefinition` subagent system not used — simpler approach with less indirection. |
| A14 | A second-thread observer agent can reliably extract decisions, assumptions, and dependency edges from a single turn's Q&A | medium | D1 | Observer agent | Probe with realistic interview exchanges; measure extraction fidelity |
| A14 | A second-thread observer agent can reliably extract decisions, assumptions, and dependency edges from a single turn's Q&A | **validated** | D1 | Observer agent | Validated (spike): decisions 100% capture, assumptions semantically correct (~80% true semantic overlap). Edges not tested — deferred to slice 5. Use tool-based structured output and faster model (Haiku) in production. |

memory/SPEC.md:80 — A14 is marked validated, but the text says “Edges not tested — deferred to slice 5”; since A14 explicitly includes “dependency edges”, this reads as internally inconsistent and could mislead downstream planning.

Severity: medium


| A15 | The LLM can reliably judge when a phase interview has reached sufficient understanding (is_resolution) | medium | D3 | Phase resolution | Probe across varied project types; measure false-positive resolution rate |
| A16 | AI SDK `useChat` hook's `ToolUIPart` state machine (`input-streaming` → `input-available` → `output-available` / `output-error` / `approval-requested` → `approval-responded` / `output-denied`) models all permutations of pending, error, and success for both interim (thinking, tool calls) and final (response) data | high | D14 | Rich chat UI | Partially validated: SSE adapter emits tool-call events, client renders `dynamic-tool` parts with state labels (input-streaming, input-available, output-available, output-error). Browser outer-loop pending. |
| A17 | AI Elements copy-paste components can be restyled without forking — they are ownable source files, not npm-locked dependencies | high | D14 | Rich chat UI | Install via CLI, inspect source, confirm no hidden npm runtime dependency |
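The A4 row in the table above sets a 1-3s target for observer extraction, while its evidence note reports 14-17s with Sonnet. One way to make that gap explicit when re-measuring in slice 5 is to summarize latency samples against a stated budget. This is a minimal sketch: `summarizeLatency` and `LatencySummary` are hypothetical names, and treating the 3s upper bound as a hard budget on the slowest sample is an assumption, not something the spec states.

```typescript
interface LatencySummary {
  medianMs: number;
  maxMs: number;
  withinBudget: boolean;
}

// Summarizes extraction latency samples (in milliseconds) against a budget.
// Assumption: the budget applies to the slowest observed sample.
function summarizeLatency(samplesMs: number[], budgetMs: number): LatencySummary {
  if (samplesMs.length === 0) {
    throw new Error("need at least one latency sample");
  }
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const medianMs =
    sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  const maxMs = sorted[sorted.length - 1];
  return { medianMs, maxMs, withinBudget: maxMs <= budgetMs };
}
```

Called with the two endpoints of the reported Sonnet range, `summarizeLatency([14000, 17000], 3000)` yields a median of 15500 ms and `withinBudget: false`, which is the mismatch the review comment on A4 points out.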