FE-557: Spike: observer extraction fidelity by lunelson · Pull Request #21 · hashintel/brunch

lunelson · 2026-04-02T11:45:12Z

5 fixture turns across scope/design/constraints question types.
Decision extraction: 100% capture. Assumption extraction: semantically
correct (~80% true overlap, fuzzy matcher underestimates at 47%).
Latency: 14-17s with Sonnet (Haiku expected 2-5s).

Recommendations for slice 5: use tool-based structured output,
Haiku model, and LLM-as-judge for differential testing.

Spike artifact: spike/observer-fidelity.ts (throwaway).

Made-with: Cursor

linear · 2026-04-02T11:45:18Z

FE-557 Spike: observer extraction fidelity

lunelson · 2026-04-02T11:45:31Z

This stack of pull requests is managed by Graphite. Learn more about stacking.

augmentcode · 2026-04-02T12:29:02Z

🤖 Augment PR Summary

Summary: This PR closes out the FE-557 spike to evaluate “observer” extraction fidelity (decisions/assumptions) from a single interview turn.

Changes:

Marks the “Observer extraction fidelity” spike as done in memory/PLAN.md and links it to FE-557
Updates memory/SPEC.md assumptions with spike findings (latency + extraction viability) and marks A14 as validated
Adds a throwaway spike script (spike/observer-fidelity.ts) that runs 5 fixture turns through Claude Agent SDK query() and scores extracted entities vs a golden master using a fuzzy matcher

Technical Notes: The spike uses streaming stream_event deltas to reconstruct the model response, parses JSON output, and reports per-fixture + averaged capture rates and latency.

_{🤖 Was this summary useful? React with 👍 or 👎}

augmentcode

Review completed. 7 suggestions posted.

Comment augment review to trigger a new review at any time.

augmentcode · 2026-04-02T12:29:03Z

 | A12 | `useChat` hook accepts initial messages to hydrate conversation state from server-stored history                                                                                                                                                                                                                          | **validated** | D9                  | SQLite foundation | Validated: `useChat` doesn't have `initialMessages` prop but `setMessages` works for hydration                                                                                                                                       |
 | A13 | Phase-specific interview behavior is achievable via system prompt switching + in-process MCP tools on `query()` — the SDK's formal `AgentDefinition` skill system is unnecessary                                                                                                                                          | **validated** | D2                  | Interview phases  | Validated: slice 4 uses `getSystemPrompt(phase)` + `createInterviewMcpServer()` per turn; 88 tests pass. SDK `AgentDefinition` subagent system not used — simpler approach with less indirection.                                     |
-| A14 | A second-thread observer agent can reliably extract decisions, assumptions, and dependency edges from a single turn's Q&A                                                                                                                                                                                                 | medium        | D1                  | Observer agent    | Probe with realistic interview exchanges; measure extraction fidelity                                                                                                                                                                |
+| A14 | A second-thread observer agent can reliably extract decisions, assumptions, and dependency edges from a single turn's Q&A                                                                                                                                                                                                 | **validated** | D1                  | Observer agent    | Validated (spike): decisions 100% capture, assumptions semantically correct (~80% true semantic overlap). Edges not tested — deferred to slice 5. Use tool-based structured output and faster model (Haiku) in production.              |


memory/SPEC.md:80 — A14 is marked validated, but the text says “Edges not tested — deferred to slice 5”; since A14 explicitly includes “dependency edges”, this reads as internally inconsistent and could mislead downstream planning.

Severity: medium

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

augmentcode · 2026-04-02T12:29:03Z

 | A2  | Claude Agent SDK `query()` with `includePartialMessages` provides all streaming event types needed for CLI-quality feedback                                                                                                                                                                                               | **validated** | D8                  | Walking skeleton  | Validated: adapter translates stream_event messages correctly                                                                                                                                                                        |
-| A3  | Separating interviewer from observer produces better interview quality than inline tool calling                                                                                                                                                                                                                           | medium        | D1                  | Observer agent    | Compare interview coherence with and without tool-calling load                                                                                                                                                                       |
-| A4  | Observer extraction completes in 1-3s during user read/think time (10-60s), adding zero perceived latency                                                                                                                                                                                                                 | medium        | D1                  | Observer agent    | Measure extraction latency with realistic turn payloads                                                                                                                                                                              |
+| A3  | Separating interviewer from observer produces better interview quality than inline tool calling                                                                                                                                                                                                                           | high          | D1                  | Observer agent    | Spike confirms extraction is viable as separate call; interviewer prompt stays clean. Full comparison deferred to slice 5 manual testing.                                                                                             |


memory/SPEC.md:69 — A3 is about interview quality vs inline tool calling, but the evidence note is about extraction viability/clean prompts; consider aligning the note to the actual claim being made (or leaving it as unvalidated until the comparison is run).

Severity: low

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

augmentcode · 2026-04-02T12:29:03Z

-| A3  | Separating interviewer from observer produces better interview quality than inline tool calling                                                                                                                                                                                                                           | medium        | D1                  | Observer agent    | Compare interview coherence with and without tool-calling load                                                                                                                                                                       |
-| A4  | Observer extraction completes in 1-3s during user read/think time (10-60s), adding zero perceived latency                                                                                                                                                                                                                 | medium        | D1                  | Observer agent    | Measure extraction latency with realistic turn payloads                                                                                                                                                                              |
+| A3  | Separating interviewer from observer produces better interview quality than inline tool calling                                                                                                                                                                                                                           | high          | D1                  | Observer agent    | Spike confirms extraction is viable as separate call; interviewer prompt stays clean. Full comparison deferred to slice 5 manual testing.                                                                                             |
+| A4  | Observer extraction completes in 1-3s during user read/think time (10-60s), adding zero perceived latency                                                                                                                                                                                                                 | medium        | D1                  | Observer agent    | Spike measured 14-17s with Sonnet. Haiku expected 2-5s — validate in slice 5 with model switch.                                                                                                                                       |


memory/SPEC.md:70 — A4 states “1-3s … adding zero perceived latency”, but the spike note reports 14–17s; as written this looks like a contradiction rather than a “medium” assumption awaiting validation.

Severity: medium

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

augmentcode · 2026-04-02T12:29:03Z

+  console.log('==================================\n');
+
+  const results = [];
+  for (const fixture of FIXTURES) {


spike/observer-fidelity.ts:256 — The spike runs each fixture exactly once, so it doesn’t actually measure “extraction consistency across runs” (variance/instability) even though that’s part of the stated goal in PLAN/PR description.

Severity: medium

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

augmentcode · 2026-04-02T12:29:03Z

+    if (extracted.assumptions.some((a) => fuzzyMatch(a, exp))) assumptionHits++;
+  }
+
+  const totalExpected = expected.decisions.length + expected.assumptions.length;


spike/observer-fidelity.ts:191 — scoreExtraction() measures recall only (hits/expected) and doesn’t penalize extra/hallucinated entities, so the reported “PASS” can overstate extraction fidelity if the model is adding incorrect decisions/assumptions.

Severity: medium

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

augmentcode · 2026-04-02T12:29:03Z

 ### Spikes

-1. **Observer extraction fidelity** — Can the LLM reliably extract decisions, assumptions, and dependency edges from a single turn's Q&A? Test with realistic fixture turns across different question types (scope, design, constraints). Measure extraction consistency across runs. `not-started`
+1. **Observer extraction fidelity** `FE-557` — Can the LLM reliably extract decisions, assumptions, and dependency edges from a single turn's Q&A? Test with realistic fixture turns across different question types (scope, design, constraints). Measure extraction consistency across runs. `done`


memory/PLAN.md:78 — This spike is marked done, but the success criteria mention “correct dependency edges” and the spike artifact/code only evaluates decisions/assumptions (and SPEC A14 notes edges weren’t tested); that completion state may be ahead of what was actually verified.

Severity: medium

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

augmentcode · 2026-04-02T12:29:03Z

@@ -0,0 +1,287 @@
+/**


PR metadata: the PR title doesn’t match the repo convention (Rule: AGENTS.md) requiring FE-557: …-style titles; consider renaming for consistency/traceability.

Severity: low

_{🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.}

lunelson · 2026-04-02T18:52:23Z

Merge activity

Apr 2, 6:52 PM UTC: A user started a stack merge that includes this pull request via Graphite.
Apr 2, 6:57 PM UTC: Graphite rebased this pull request as part of a merge.
Apr 2, 6:58 PM UTC: @lunelson merged this pull request with Graphite.

5 fixture turns across scope/design/constraints question types. Decision extraction: 100% capture. Assumption extraction: semantically correct (~80% true overlap, fuzzy matcher underestimates at 47%). Latency: 14-17s with Sonnet (Haiku expected 2-5s). Recommendations for slice 5: use tool-based structured output, Haiku model, and LLM-as-judge for differential testing. Spike artifact: spike/observer-fidelity.ts (throwaway). Made-with: Cursor

This was referenced Apr 2, 2026

FE-554: Structured interview: scope phase #17

Merged

FE-555: Parts-based persistence + context builders #19

Merged

FE-556: Structured interview: client UI #20

Merged

FE-558: UI foundation: shadcn/ui + Tailwind 4 + AI Elements #22

Merged

lunelson marked this pull request as ready for review April 2, 2026 12:26

lunelson changed the title ~~spike: observer extraction fidelity — A14 validated~~ FE-557: Spike: observer extraction fidelity Apr 2, 2026

augmentcode Bot reviewed Apr 2, 2026

View reviewed changes

lunelson mentioned this pull request Apr 2, 2026

FE-537: Observer agent + entity persistence #23

Merged

lunelson changed the base branch from ln/fe-556-interview-client-ui to graphite-base/21 April 2, 2026 18:55

lunelson changed the base branch from graphite-base/21 to main April 2, 2026 18:56

lunelson force-pushed the ln/fe-557-observer-fidelity-spike branch from fb3b7f1 to 0d9562a Compare April 2, 2026 18:57

lunelson merged commit cf64b1b into main Apr 2, 2026
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FE-557: Spike: observer extraction fidelity#21

FE-557: Spike: observer extraction fidelity#21
lunelson merged 1 commit into
mainfrom
ln/fe-557-observer-fidelity-spike

lunelson commented Apr 2, 2026

Uh oh!

linear Bot commented Apr 2, 2026

Uh oh!

lunelson commented Apr 2, 2026 •

edited

Loading

Uh oh!

augmentcode Bot commented Apr 2, 2026

Uh oh!

augmentcode Bot left a comment

Uh oh!

augmentcode Bot Apr 2, 2026

Uh oh!

augmentcode Bot Apr 2, 2026

Uh oh!

augmentcode Bot Apr 2, 2026

Uh oh!

augmentcode Bot Apr 2, 2026

Uh oh!

augmentcode Bot Apr 2, 2026

Uh oh!

augmentcode Bot Apr 2, 2026

Uh oh!

augmentcode Bot Apr 2, 2026

Uh oh!

lunelson commented Apr 2, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

lunelson commented Apr 2, 2026

Uh oh!

linear Bot commented Apr 2, 2026

Uh oh!

lunelson commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

augmentcode Bot commented Apr 2, 2026

Uh oh!

augmentcode Bot left a comment

Choose a reason for hiding this comment

Uh oh!

augmentcode Bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

augmentcode Bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

augmentcode Bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

augmentcode Bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

augmentcode Bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

augmentcode Bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

augmentcode Bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

lunelson commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merge activity

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

lunelson commented Apr 2, 2026 •

edited

Loading

lunelson commented Apr 2, 2026 •

edited

Loading