The eval-yaml guide and the schema reference list skill: / agent: as required at the top level of an eval spec (with the ✓* "one, not both" qualifier). In practice, omitting both still works against Waza 0.31.0 — and for trigger-precision evals it appears to be the only way to get a realistic measurement.
We'd like either the docs clarified to legitimize this, or an explicit opt-out for SKILL.md injection.
Context: trigger evals
Trigger evals measure whether the agent decides to invoke the skill in response to a given prompt (positive prompts should activate, negative prompts should not). The signal we want to observe is a skill.invoked event, scored with a behavior grader and required_tools: [skill] / forbidden_tools: [skill].
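A minimal sketch of the positive-prompt half of that scoring. It assumes the behavior grader takes required_tools inside its config block and that the skill tool is reported under the name skill. The grader name is hypothetical.

```yaml
# Positive trigger task (sketch): passes only if the agent called the `skill` tool.
# Assumption: `required_tools` lives under `config:` for a behavior grader.
graders:
  - type: behavior
    name: skill-invoked   # hypothetical grader name
    config:
      required_tools:
        - skill
```

A negative-prompt task would use forbidden_tools in place of required_tools.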
The problem
When skill: <name> is set, buildSkillSystemMessage (internal/execution/copilot.go) injects the full SKILL.md body into the agent's system prompt as a <skill_context> block. The agent then already has the skill content and can perform skill-aligned work without ever invoking the skill tool. This silently destroys trigger measurement:
- A behavior grader with required_tools: [skill], or a skill_invocation grader, produces false negatives even when the agent correctly recognized the trigger.
- The agent may proceed straight to executing the workflow with no observable trigger-side signal at all.
The bundled trigger grader is not a viable substitute. It scores the prompt heuristically against the skill's keyword set, so it is blind to what the agent actually did. In our experience its results have little connection to real agent behaviour, which makes it unreliable for trigger evals.
Current workaround
Omit the top-level skill: field entirely. Waza still discovers the skill (walking one level of subdirectories under the CWD for skills/<name>/SKILL.md) and exposes it via the compact <available_skills> summary (name + description only). The agent then must invoke the skill tool to read the body — which is exactly the signal a trigger eval needs to observe.
A real spec using this pattern:
name: xyz-trigger
description: >
  Trigger-precision tasks for the xyz skill: positive
  prompts that should activate the skill and negative prompts that
  should not.
# Deliberately no top-level `skill:` field. Setting it makes Waza
# inject the full SKILL.md body into the agent's system prompt (see
# buildSkillSystemMessage in waza/internal/execution/copilot.go),
# which defeats trigger measurement — the agent then "knows" the
# skill without ever invoking the `skill` tool. With `skill:`
# omitted, the agent only sees the compact <available_skills>
# summary (name + description) and must invoke the tool to read the
# body.
version: "1.0"
config:
  trials_per_task: 1
  timeout_seconds: 300
  parallel: false
  executor: copilot-sdk
  model: claude-opus-4.6
metrics:
  - name: task_completion
    weight: 1.0
    threshold: 0.8
tasks:
  - "tasks/trigger/*.yaml"
Paired with task-level graders like:
# Positive trigger task
expected:
graders:
  - type: skill_invocation
    name: xyz-invoked
    config:
      required_skills:
        - xyz
      mode: any_order
      allow_extra: true
# Negative trigger task
expected:
graders:
  - type: behavior
    name: skill-not-invoked
    config:
      forbidden_tools:
        - skill
This works on Waza 0.31.0 with executor: copilot-sdk, but the docs explicitly mark skill/agent as required, so we may be relying on undocumented behaviour that could regress in a future release.
Ask
Either of the following would unblock us:
1. Docs clarification. If "neither skill: nor agent:" is a supported configuration, please update the eval-yaml guide and YAML schema reference to reflect that, and describe the resulting agent-visible behaviour (compact <available_skills> summary, no <skill_context> injection, skill still discoverable via the skill tool). Trigger-precision evaluation is worth calling out as the motivating use case.
2. Or, an explicit opt-out. Add a flag that keeps skill: set (preserving skill association, hooks, dashboard grouping, and the rest of the eval's identity) but suppresses SKILL.md injection into the system prompt. Strawman:
skill: xyz
config:
  inject_skill_body: false # or e.g. skill_injection: summary-only
With this, trigger evals could keep skill: set (matching the documented schema) while still requiring the agent to invoke the skill tool to access the body, making behavior / skill_invocation / required_tools: [skill] graders meaningful.
Option 2 is our preference because it lets specs continue to declare their associated skill (useful for filtering, reporting, dashboard grouping, hook context, etc.) while still producing realistic trigger measurements. Option 1 alone is acceptable.
Environment
- Waza 0.31.0, executor: copilot-sdk
- Model: claude-opus-4.6