Support for driving interactive skills via a responder LLM

## Problem
  
A growing category of skills is inherently multi-turn: when the agent doesn't have everything it needs from the initial prompt, it pauses to ask the user follow-up questions before completing the task.

**Concrete example: configuring a new agent inside an agentic application.** A `configure-agent` skill is invoked with something like "Add a new agent to my application". To actually generate the agent definition, the skill needs:
  - A **name** for the agent.
  - The **system instructions** (the agent's role and persona).
  - The set of **tools** the agent should have access to.
  - Any **handoffs** to other agents in the application.
  
None of these can reasonably be inferred from "add a new agent", and they shouldn't be, because part of the skill's value *is* the structured Q&A. So the skill asks for one piece at a time, gathers the answers, then implements the agent.
  
Today, evaluating this in waza forces a bad trade-off: either pre-bake all four answers into the initial prompt (which collapses the skill into a degenerate one-shot version of itself, so you're no longer testing the thing you ship) or evaluate manually.
  
## Proposed
  
Add a **responder**, an LLM-backed surrogate user, configurable on a run/eval. After each `session/prompt` turn ends, if the agent produced chat text (not a tool result) *and* the remaining follow-up budget (initially `responder.max_followups`) is above zero, the runner drains that text and asks the responder LLM to classify it. Three outcomes:
  - **Reply** → send a new user `session/prompt` with the responder's answer, decrement the cap, continue.
  - **Stop** → exit the loop cleanly (task done, no further questions).
  - **Abstain** → abort the run with a distinct failure classification, because the responder explicitly couldn't answer. This is a configuration signal that the brief/skill is too vague, *not* a transient error, so it should be reportable separately from a model timeout or network blip.
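
The loop above could be sketched roughly as follows. This is only an illustration of the control flow, not waza's actual runner: `drive`, `Decision`, `Outcome`, and `ResponderAbstained` are hypothetical names, `send_prompt` stands in for one `session/prompt` turn, and `classify` stands in for the responder LLM call.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    REPLY = "reply"
    STOP = "stop"
    ABSTAIN = "abstain"


@dataclass
class Decision:
    outcome: Outcome
    answer: str = ""  # only meaningful when outcome is REPLY


class ResponderAbstained(Exception):
    """Distinct failure class: the responder could not answer from its brief."""


def drive(
    send_prompt: Callable[[str], Optional[str]],
    classify: Callable[[str], Decision],
    initial_prompt: str,
    max_followups: int,
) -> int:
    """Run the follow-up loop and return how many follow-ups were sent.

    send_prompt (hypothetical) performs one turn and returns the agent's
    chat text, or None if the turn produced no chat text.
    """
    remaining = max_followups
    text = send_prompt(initial_prompt)
    while text and remaining > 0:
        decision = classify(text)
        if decision.outcome is Outcome.STOP:
            break  # task done, no further questions
        if decision.outcome is Outcome.ABSTAIN:
            # Surface as its own failure type, not as a timeout or network error.
            raise ResponderAbstained(text)
        remaining -= 1  # decrement the cap, then continue
        text = send_prompt(decision.answer)
    return max_followups - remaining


# Toy agent that asks two questions and then reports completion.
script = iter([
    "What should the agent be called?",
    "Which tools should it have?",
    "Done: agent created.",
])
answers = {
    "What should the agent be called?": "research-agent",
    "Which tools should it have?": "web_search, url_fetch",
}
sent = []


def fake_send(prompt: str) -> Optional[str]:
    sent.append(prompt)
    return next(script, None)


def fake_classify(text: str) -> Decision:
    if text in answers:
        return Decision(Outcome.REPLY, answers[text])
    return Decision(Outcome.STOP)


followups = drive(fake_send, fake_classify, "Add a new agent to my application", 8)
print(followups)  # prints 2
```

Raising a dedicated exception for the abstain case is one way to keep it reportable separately from transient errors; a result enum on the run record would work equally well.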
    
## User-facing surface

Continuing the `configure-agent` example from the Problem section, the responder is given the agent configuration the test is exercising (name, system instructions, tools, and handoffs) and answers the skill's questions consistently with that configuration:
```yaml
responder:
  model: gpt-4o
  instructions: |
    You are configuring a new agent inside an agentic application.
    The agent you want to create has:
      - name: research-agent
      - system instructions: "Search the web and summarise findings on the topic the user provides."
      - tools: web_search, url_fetch
      - handoffs: none
    Answer the skill's questions consistently with this configuration,
    regardless of the order in which the skill asks for each piece.
    If you genuinely can't infer an answer from the above, abstain.
  max_followups: 8
```
   
Per-task overrides of `instructions` and `max_followups` are useful in practice: the same `configure-agent` skill can be exercised against several target configurations (a research agent, a customer-support agent, a triage agent with handoffs) to test how robust the skill is to varying intents.
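
One possible way to resolve a per-task override against the eval-level responder block is sketched below. The `ResponderConfig` type and `resolve` helper are hypothetical and only illustrate the merge semantics; treating a missing `max_followups` as 0 (responder disabled) is an assumption, not something the proposal has settled.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ResponderConfig:
    model: str
    instructions: str
    max_followups: int = 0  # assumption: 0 means the responder loop never runs


def resolve(
    base: ResponderConfig,
    instructions: Optional[str] = None,
    max_followups: Optional[int] = None,
) -> ResponderConfig:
    """Return the effective config for one task; None means 'inherit from base'."""
    overrides = {}
    if instructions is not None:
        overrides["instructions"] = instructions
    if max_followups is not None:
        overrides["max_followups"] = max_followups
    return replace(base, **overrides)


# Eval-level default, matching the research-agent example.
base = ResponderConfig(
    model="gpt-4o",
    instructions="The agent you want to create is research-agent ...",
    max_followups=8,
)

# A task that targets a triage agent with handoffs overrides only the brief.
triage = resolve(base, instructions="The agent you want to create is triage-agent ...")
print(triage.model, triage.max_followups)  # prints: gpt-4o 8
```

Only `instructions` and `max_followups` are overridable in this sketch, mirroring the two fields named above; `model` stays fixed at the eval level.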
   
## Next Steps
  
Would this be something you'd be willing to add to Waza?

If so, and we can agree on a design, I'm more than happy to implement and contribute it!
