feat: drive interactive skills via an LLM responder (#303) #304
adamdougal wants to merge 17 commits into
Adds the approved design for an LLM-backed surrogate user that answers a skill's follow-up questions per task under inputs.responder, with reply/stop/abstain classification, a runner-driven follow-up loop reusing the agent session, and distinct result tagging for abstain (StatusError) and cap-exhaustion. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t#303 Bite-sized TDD task breakdown covering the inputs.responder config model and validation, the internal/responder package (persistent surrogate-user session with reply/stop/abstain classification), the runner-driven follow-up loop, ResponderInfo reporting, JSON schema, docs, and dashboard surfacing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ft#303 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…soft#303 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t#303 Responder Classify used EphemeralSession=true, which the engine deletes after the first turn, breaking session resume and dropping instructions on every subsequent turn. Switch to a persistent (non-ephemeral) session, add Classifier.Close plus CopilotEngine.DeleteSession to tear it down explicitly, and call Close via defer at the end of the responder loop with a detached context so cleanup runs even on cancellation. Capture sessionID before the error check so an error-with-decision still persists the session id. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rebuild web/dist/index.html so its asset hash matches the freshly built bundle (fixes TestIndexHTMLReferencesExistingAssets after the responder dashboard change) and correct a misspelling flagged by golangci-lint in the responder cleanup comment. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t#303 A non-ephemeral session registers in both e.sessions and e.usageCollectors, but DeleteSession only removed it from e.sessions, orphaning the usage collector for the engine's lifetime. Each responder-driven task leaked one collector; under concurrent runs this accumulated monotonically. Also delete the usageCollectors entry under its mutex. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
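A rough sketch of the leak fix, under stated assumptions: the `engine` type, its field types, and `register` are hypothetical simplifications, and the use of two separate mutexes is inferred from the phrase "under its mutex" rather than confirmed by the source.

```go
package main

import (
	"fmt"
	"sync"
)

// engine is a hypothetical, simplified stand-in for CopilotEngine.
type engine struct {
	sessionsMu sync.Mutex
	sessions   map[string]struct{}

	usageMu         sync.Mutex
	usageCollectors map[string]int // placeholder for a real collector type
}

func newEngine() *engine {
	return &engine{
		sessions:        make(map[string]struct{}),
		usageCollectors: make(map[string]int),
	}
}

// register mimics a non-ephemeral session, which lands in both maps.
func (e *engine) register(id string) {
	e.sessionsMu.Lock()
	e.sessions[id] = struct{}{}
	e.sessionsMu.Unlock()

	e.usageMu.Lock()
	e.usageCollectors[id] = 0
	e.usageMu.Unlock()
}

// DeleteSession removes the session from both maps. Skipping the second
// delete is the leak described above: one orphaned collector per task.
func (e *engine) DeleteSession(id string) {
	e.sessionsMu.Lock()
	delete(e.sessions, id)
	e.sessionsMu.Unlock()

	e.usageMu.Lock()
	delete(e.usageCollectors, id)
	e.usageMu.Unlock()
}

func main() {
	e := newEngine()
	for i := 0; i < 100; i++ {
		id := fmt.Sprintf("session-%d", i)
		e.register(id)
		e.DeleteSession(id)
	}
	fmt.Println(len(e.sessions), len(e.usageCollectors)) // prints "0 0"
}
```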
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds an LLM-backed "responder" that role-plays the user for interactive, multi-turn skills, enabling evaluation of skills whose follow-up questions cannot be pre-scripted.
Changes:
- New `internal/responder` package implementing a `Classifier` that drives a persistent surrogate-user session and emits `reply`/`stop`/`abstain` decisions via structured tool calls.
- Runner integration (`executeResponderLoop`) that drives the agent loop, merges responses, and records a `ResponderInfo` summary with outcomes `completed`/`stopped`/`abstained`/`cap_exhausted`/`error`.
- Config/schema/validation, API/dashboard surfacing, docs, and tests for the new `inputs.responder` field.
Show a summary per file
| File | Description |
|---|---|
| internal/responder/responder.go | New responder Classifier with persistent session + 3 decision tools. |
| internal/responder/responder_test.go | Unit tests for tools, session reuse, cleanup, and model defaulting. |
| internal/orchestration/runner.go | Adds executeResponderLoop/sendResponderReply and newClassifier hook. |
| internal/orchestration/responder_loop_test.go | Tests reply→stop, abstain→error, cap-exhausted scenarios. |
| internal/models/testcase.go | Adds ResponderConfig on TaskStimulus + validation. |
| internal/models/testcase_test.go | Validation tests for responder config. |
| internal/models/outcome.go | Adds ResponderInfo and outcome constants. |
| internal/models/outcome_test.go | JSON serialization test for Responder. |
| internal/execution/copilot.go | New DeleteSession for explicit teardown. |
| internal/webapi/types.go | Adds ResponderInfoResponse. |
| internal/webapi/store.go | Maps run.Responder to API response. |
| internal/webapi/additional_test.go | Test for responder mapping. |
| internal/validation/schema_test.go | Schema acceptance test for responder. |
| schemas/task.schema.json | Schema for inputs.responder. |
| web/src/api/client.ts | TypeScript ResponderInfo type. |
| web/src/components/RunDetail.tsx | ResponderBadge for task rows. |
| web/dist/index.html | Rebuilt asset reference. |
| site/src/content/docs/, README.md, docs/plans/ | Documentation and design notes. |
Copilot's findings
- Files reviewed: 20/21 changed files
- Comments generated: 3
    }

    if lastWasReply {
        info.Outcome = models.ResponderOutcomeCapExhausted
        slog.WarnContext(ctx, "responder budget exhausted while agent still asking questions",
            "test", tc.DisplayName, "max_followups", cfg.MaxFollowups)
    }
    return info
Good catch — validated and applied in f539f0b. Dropped lastWasReply and the misleading ResponderOutcomeCompleted seed; the post-loop branch now unconditionally records cap_exhausted since every other exit returns early. Also removed the now-unused ResponderOutcomeCompleted constant.
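The control flow this reply describes can be sketched roughly as below. It is an assumption-laden simplification: `Kind`, `Decision`, `runLoop`, and the `classify` callback are hypothetical, and only the outcome strings are taken from the review summary. The point is that every non-reply decision returns early, so falling out of the loop can only mean the follow-up budget ran out.

```go
package main

import "fmt"

// Kind enumerates the three responder decisions.
type Kind int

const (
	Reply Kind = iota
	Stop
	Abstain
)

// Decision is a hypothetical, trimmed-down decision value.
type Decision struct {
	Kind   Kind
	Answer string
}

// runLoop drives up to maxFollowups turns. classify is a hypothetical
// callback standing in for the responder's Classify call.
func runLoop(maxFollowups int, classify func(turn int) (Decision, error)) string {
	for turn := 0; turn < maxFollowups; turn++ {
		d, err := classify(turn)
		if err != nil {
			return "error"
		}
		switch d.Kind {
		case Stop:
			return "stopped"
		case Abstain:
			return "abstained"
		}
		// Reply: the answer would be sent to the agent here, then the
		// loop continues with the agent's next question.
	}
	// Every other exit returned early, so no flag is needed here.
	return "cap_exhausted"
}

func main() {
	alwaysReply := func(int) (Decision, error) {
		return Decision{Kind: Reply, Answer: "yes"}, nil
	}
	stopOnSecond := func(turn int) (Decision, error) {
		if turn == 1 {
			return Decision{Kind: Stop}, nil
		}
		return Decision{Kind: Reply, Answer: "yes"}, nil
	}
	fmt.Println(runLoop(3, alwaysReply))  // prints "cap_exhausted"
	fmt.Println(runLoop(3, stopOnSecond)) // prints "stopped"
}
```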
    Handler: func(inv copilot.ToolInvocation) (copilot.ToolResult, error) {
        var args struct {
            Answer string `mapstructure:"answer"`
        }
        _ = mapstructure.Decode(inv.Arguments, &args)
        d.decision = Decision{Kind: DecisionReply, Answer: args.Answer}
        d.set = true
        return copilot.ToolResult{}, nil
    },
Validated and fixed in f539f0b. Each handler now returns the decode error and stores it on the recorder; Classify surfaces it as responder tool call invalid: ... instead of fabricating an empty reply/abstain. Added a regression test (TestClassifyMalformedArgsIsError).
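A minimal sketch of the propagation described here, kept dependency-free: it swaps `mapstructure.Decode` for a hand-written type check, and `recorder`, `decodeAnswer`, `handleReply`, and `result` are hypothetical names rather than the PR's actual API. Only the `responder tool call invalid` prefix comes from the reply above.

```go
package main

import (
	"errors"
	"fmt"
)

// recorder is a hypothetical, simplified decision recorder.
type recorder struct {
	answer string
	set    bool
	err    error
}

// decodeAnswer stands in for the real argument decoding step.
func decodeAnswer(args any) (string, error) {
	m, ok := args.(map[string]any)
	if !ok {
		return "", fmt.Errorf("arguments must be an object, got %T", args)
	}
	s, ok := m["answer"].(string)
	if !ok {
		return "", errors.New(`"answer" must be a string`)
	}
	return s, nil
}

// handleReply returns the decode error and stores it on the recorder,
// instead of discarding it and recording an empty reply.
func (r *recorder) handleReply(args any) error {
	answer, err := decodeAnswer(args)
	if err != nil {
		r.err = err
		return err
	}
	r.answer, r.set = answer, true
	return nil
}

// result surfaces a stored decode failure to the caller.
func (r *recorder) result() (string, error) {
	if r.err != nil {
		return "", fmt.Errorf("responder tool call invalid: %w", r.err)
	}
	if !r.set {
		return "", errors.New("no decision recorded")
	}
	return r.answer, nil
}

func main() {
	bad := &recorder{}
	_ = bad.handleReply(map[string]any{"answer": 42})
	_, err := bad.result()
	fmt.Println(err)

	good := &recorder{}
	_ = good.handleReply(map[string]any{"answer": "use Postgres"})
	answer, _ := good.result()
	fmt.Println(answer)
}
```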
        set bool
    }

    func (d *decisionRecorder) tools() []copilot.Tool {
Validated and fixed in f539f0b. The recorder now refuses a second decision call via guardDuplicate and records the conflict on d.err; Classify surfaces it rather than letting handler order pick a winner. Added regression tests (TestDecisionToolsRejectDuplicateCall, TestClassifyDuplicateDecisionIsError).
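The duplicate guard might look something like the sketch below. The name `guardDuplicate` and the `d.err` field come from the reply above, but the signature, the `record` helper, and the error text are assumptions made for illustration.

```go
package main

import "fmt"

// decisionRecorder is a hypothetical, trimmed-down recorder that accepts
// at most one decision per turn.
type decisionRecorder struct {
	kind string
	set  bool
	err  error
}

// guardDuplicate refuses a second decision call in the same turn and
// remembers the conflict so the caller can surface it later.
func (d *decisionRecorder) guardDuplicate(kind string) error {
	if !d.set {
		return nil
	}
	err := fmt.Errorf("duplicate decision: %q after %q", kind, d.kind)
	if d.err == nil {
		d.err = err
	}
	return err
}

// record is what each tool handler would call before storing its decision.
func (d *decisionRecorder) record(kind string) error {
	if err := d.guardDuplicate(kind); err != nil {
		return err
	}
	d.kind, d.set = kind, true
	return nil
}

func main() {
	d := &decisionRecorder{}
	fmt.Println(d.record("reply")) // prints "<nil>"
	fmt.Println(d.record("stop"))  // second call is rejected
	fmt.Println(d.kind)            // prints "reply"
}
```

Recording the conflict on the recorder, rather than only returning it from the handler, lets the classifier fail the turn even if the tool-calling layer swallows handler errors.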
Codecov Report
❌ Patch coverage is
Additional details and impacted files

| | main | #304 | +/- |
|---|---|---|---|
| Coverage | ? | 75.30% | |
| Files | ? | 160 | |
| Lines | ? | 18859 | |
| Branches | ? | 0 | |
| Hits | ? | 14202 | |
| Misses | ? | 3640 | |
| Partials | ? | 1017 | |
Addresses three review comments on PR microsoft#304:
* Reject duplicate decision tool calls in the same turn instead of letting handler order silently pick the winner. The recorder now returns an error on the second call and Classify surfaces it.
* Propagate mapstructure decode failures from each tool handler so malformed arguments become a 'responder tool call invalid' error rather than a fabricated empty reply/abstain.
* Drop the unused lastWasReply flag and the dead initial ResponderOutcomeCompleted seed in the responder loop. The loop can only exit normally after a reply, so the post-loop branch unconditionally records cap_exhausted. Removed the now-unused ResponderOutcomeCompleted constant.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Adds a responder: an LLM-backed surrogate user that drives interactive (multi-turn) skills during evals. When an agent asks a follow-up question, the responder classifies it and decides whether to reply, stop the conversation, or abstain (the question can't be answered from its brief), letting us evaluate back-and-forth skills without scripting every turn. It reuses the same Copilot engine as the agent under test (no extra LLM deployment) but runs in its own isolated, persistent session, configured per task under `inputs.responder`.
Related issue
Closes #303
Agent handoff
- `internal/models/testcase.go` & `outcome.go` (config + `ResponderInfo`), `internal/responder/responder.go` (classifier with persistent session + teardown), `internal/orchestration/runner.go` (`executeResponderLoop`, injectable classifier factory), `internal/execution/copilot.go` (`DeleteSession`), `internal/webapi/{types,store}.go`, `web/src/components/RunDetail.tsx` & `api/client.ts` (responder badge), `schemas/task.schema.json`, README + `site/` docs.
- `inputs.responder` is a sibling of `follow_up_prompts` and mutually exclusive with it; the responder runs in a separate, non-ephemeral Copilot session with explicit `Close()` teardown to avoid polluting the agent transcript; abstain marks the task errored, stop ends normally, cap exhaustion stops the loop and grades what exists; `model` is optional and defaults to the eval's `config.model`; each task builds its own classifier (concurrency-safe).
- The `completed` outcome value is effectively unreachable in practice (a self-initiated stop returns `stopped`); left as-is and considered acceptable.
Type of change
Validation
- `go test ./...`
- `make lint` or `golangci-lint run`
- `web/` changed
Documentation
- `site/` docs updated, if CLI, YAML, dashboard, or validator behavior changed
Risk and rollback
`inputs.responder` field: tasks without it are unaffected. Revert the branch's commits (or the squash-merge commit) to fully remove it; no data migrations or schema-compat concerns.
Notes for reviewers
The responder's session lifecycle is the area most worth a close look: `Classify` lazily creates a persistent session on the first call and resumes it thereafter, and `executeResponderLoop` defers `Close()` with a detached 30s context so teardown still runs on cancellation. `CopilotEngine.DeleteSession` removes the session from both `e.sessions` and `e.usageCollectors` (the latter fixed a collector leak). Also worth confirming: the load-time mutual-exclusivity validation between `responder` and `follow_up_prompts`, and that the orchestration branch gives `Responder` precedence over `FollowUps`.
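For reviewers checking the validation and defaulting rules, the intended behavior can be sketched as below. The struct shapes and the names `Inputs`, `validate`, and `responderModel` are hypothetical; only the rules themselves (mutual exclusivity with `follow_up_prompts`, and `model` defaulting to the eval's `config.model`) are taken from the description.

```go
package main

import (
	"errors"
	"fmt"
)

// ResponderConfig is a hypothetical, reduced shape of inputs.responder.
type ResponderConfig struct {
	Model        string
	MaxFollowups int
}

// Inputs is a hypothetical, reduced shape of a task's inputs block.
type Inputs struct {
	FollowUpPrompts []string
	Responder       *ResponderConfig
}

// validate enforces load-time mutual exclusivity between the two fields.
func validate(in Inputs) error {
	if in.Responder != nil && len(in.FollowUpPrompts) > 0 {
		return errors.New("inputs.responder and inputs.follow_up_prompts are mutually exclusive")
	}
	return nil
}

// responderModel returns the responder's own model when set, otherwise
// the eval-level model.
func responderModel(in Inputs, evalModel string) string {
	if in.Responder != nil && in.Responder.Model != "" {
		return in.Responder.Model
	}
	return evalModel
}

func main() {
	both := Inputs{
		FollowUpPrompts: []string{"yes, continue"},
		Responder:       &ResponderConfig{MaxFollowups: 3},
	}
	fmt.Println(validate(both) != nil) // prints "true"

	onlyResponder := Inputs{Responder: &ResponderConfig{MaxFollowups: 3}}
	fmt.Println(validate(onlyResponder))                     // prints "<nil>"
	fmt.Println(responderModel(onlyResponder, "eval-model")) // prints "eval-model"
}
```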