Problem
A growing category of skills is inherently multi-turn: when the agent doesn't have everything it needs from the initial prompt, it pauses to ask the user follow-up questions before completing the task.
Concrete example: configuring a new agent inside an agentic application. A configure-agent skill is invoked with something like "Add a new agent to my application". To actually generate the agent definition, the skill needs:
- A name for the agent.
- The system instructions (the agent's role and persona).
- The set of tools the agent should have access to.
- Any handoffs to other agents in the application.
None of these can be reasonably inferred from "add a new agent", and they shouldn't be, because part of the skill's value is the structured Q&A. So the skill asks one piece at a time, gathers the answers, then implements the agent.
Today, evaluating this in waza forces a bad trade-off: either pre-bake all four answers into the initial prompt, which collapses the skill into a degenerate one-shot version of itself so that you're no longer testing the thing you ship, or evaluate manually.
Proposed
A responder, an LLM-backed surrogate user, would be configurable on a run/eval. After each session/prompt turn ends, if the agent produced chat text (not a tool result) and responder.max_followups > 0, the runner drains that text and asks the responder LLM to classify it. There are three outcomes:
- Reply → send a new user session/prompt with the responder's answer, decrement the cap, continue.
- Stop → exit the loop cleanly (task done, no further questions).
- Abstain → abort the run with a distinct failure classification — the responder explicitly couldn't answer. This is a configuration signal that the brief/skill is too vague, not a transient error, so it should be reportable separately from a model timeout or network blip.
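The loop described above can be sketched roughly as follows. This is only an illustration of the proposed control flow, not waza's actual runner: `send_prompt`, `classify`, `Decision`, `Outcome`, and `ResponderAbstained` are all hypothetical names, with `send_prompt` standing in for one session/prompt turn (returning the agent's chat text, or `None` if the turn produced none) and `classify` standing in for the responder LLM call.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    REPLY = "reply"
    STOP = "stop"
    ABSTAIN = "abstain"


@dataclass
class Decision:
    outcome: Outcome
    answer: Optional[str] = None  # only meaningful for REPLY


class ResponderAbstained(Exception):
    """Distinct failure class: the responder could not answer from its brief."""


def run_with_responder(
    send_prompt: Callable[[str], Optional[str]],  # hypothetical: one session/prompt turn
    classify: Callable[[str], Decision],          # hypothetical: the responder LLM call
    initial_prompt: str,
    max_followups: int,
) -> int:
    """Drive one task and return the number of follow-ups that were sent."""
    remaining = max_followups
    agent_text = send_prompt(initial_prompt)
    # Continue only while the agent produced chat text and the cap is not exhausted.
    while agent_text and remaining > 0:
        decision = classify(agent_text)
        if decision.outcome is Outcome.STOP:
            break  # task done, no further questions
        if decision.outcome is Outcome.ABSTAIN:
            # Reported separately from timeouts or network errors.
            raise ResponderAbstained(agent_text)
        agent_text = send_prompt(decision.answer or "")
        remaining -= 1
    return max_followups - remaining
```

Raising a dedicated exception for Abstain is one way to keep that outcome distinguishable from transient failures in reports; an explicit result status would work equally well.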
User-facing surface
Continuing the configure-agent example from the Problem section, the responder is given the agent configuration the test is exercising (name, system instructions, tools, and handoffs) and answers the skill's questions consistently with it:
responder:
  model: gpt-4o
  instructions: |
    You are configuring a new agent inside an agentic application.
    The agent you want to create has:
    - name: research-agent
    - system instructions: "Search the web and summarise findings on the topic the user provides."
    - tools: web_search, url_fetch
    - handoffs: none
    Answer the skill's questions consistently with this configuration,
    regardless of the order in which the skill asks for each piece.
    If you genuinely can't infer an answer from the above, abstain.
  max_followups: 8
Per-task override of instructions and max_followups is useful in practice — the same configure-agent skill can be exercised against several target configurations (a research agent, a customer-support agent, a triage agent with handoffs) to test how robust the skill is to varying intents.
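As a rough sketch of how such per-task overrides might be resolved, assuming the eval-level responder acts as a default and a task may replace `instructions` and/or `max_followups`: `ResponderConfig` and `resolve_responder` are hypothetical names used for illustration only, and the actual merge semantics would be part of the design discussion.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ResponderConfig:
    model: str
    instructions: str
    max_followups: int


def resolve_responder(
    base: ResponderConfig,
    instructions: Optional[str] = None,
    max_followups: Optional[int] = None,
) -> ResponderConfig:
    """Return the eval-level config with any per-task overrides applied."""
    overrides = {}
    if instructions is not None:
        overrides["instructions"] = instructions
    if max_followups is not None:
        overrides["max_followups"] = max_followups
    return replace(base, **overrides)


# Eval-level default: the research-agent brief.
base = ResponderConfig(
    model="gpt-4o",
    instructions="You want a research-agent with web_search and url_fetch.",
    max_followups=8,
)

# A task that exercises the same skill against a triage agent with handoffs.
triage = resolve_responder(
    base,
    instructions="You want a triage-agent that hands off to other agents.",
    max_followups=12,
)
```

Fields a task does not override, such as `model` here, fall through from the eval-level config unchanged.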
Next Steps
Would this be something you'd be willing to add to Waza?
If so, and if we can agree on a design, I'm more than happy to implement and contribute it!