Provide an officially supported way to run skill trigger evals that do NOT inject the skill into the prompt

The [eval-yaml guide](https://microsoft.github.io/waza/guides/eval-yaml/) and the schema reference list `skill:` / `agent:` as required at the top level of an eval spec (with the `✓*` "one, not both" qualifier). In practice, omitting both still works against Waza 0.31.0 — and **for trigger-precision evals it appears to be the only way to get a realistic measurement**.

We'd like either the docs clarified to legitimize this, or an explicit opt-out for SKILL.md injection.

## Context: trigger evals

**Trigger evals** measure *whether the agent decides to invoke the skill* in response to a given prompt (positive prompts should activate, negative prompts should not). The signal we want to observe is a `skill.invoked` event, scored with a `behavior` grader and `required_tools: [skill]` / `forbidden_tools: [skill]`.
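To make the target metric concrete, here is a minimal Python sketch of how trigger precision and recall could be computed from per-trial observations. It is not part of Waza; `TrialOutcome` and `trigger_precision_recall` are hypothetical names introduced for illustration, and the only input assumed is whether a `skill` tool call was observed for each prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrialOutcome:
    """One trial of a trigger eval (hypothetical record, not a Waza type)."""
    should_trigger: bool  # True for a positive prompt, False for a negative one
    skill_invoked: bool   # True if a `skill` tool call was observed in the run


def trigger_precision_recall(outcomes):
    """Return (precision, recall); a value is None when its denominator is zero."""
    tp = sum(1 for o in outcomes if o.should_trigger and o.skill_invoked)
    fp = sum(1 for o in outcomes if not o.should_trigger and o.skill_invoked)
    fn = sum(1 for o in outcomes if o.should_trigger and not o.skill_invoked)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall
```

Both numbers depend entirely on `skill_invoked` being an honest observation of the run, which is why anything that lets the agent skip the `skill` tool corrupts the measurement.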

## The problem

When `skill: <name>` is set, `buildSkillSystemMessage` (`internal/execution/copilot.go`) injects the full `SKILL.md` body into the agent's system prompt as a `<skill_context>` block. The agent then already has the skill content and can perform skill-aligned work without ever invoking the `skill` tool. This silently destroys trigger measurement:

- A `behavior` grader with `required_tools: [skill]` or a `skill_invocation` grader produces false negatives even when the agent correctly recognized the trigger.
- The agent may proceed straight to executing the workflow with no observable trigger-side signal at all.
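The false negative can be illustrated with a small Python sketch of a required/forbidden tool check. This is an assumption about the general shape of such a grader, not Waza's implementation; `grade_tool_usage` and the tool names other than `skill` are hypothetical.

```python
def grade_tool_usage(tool_calls, required_tools=(), forbidden_tools=()):
    """Pass only if every required tool was called and no forbidden tool was.

    `tool_calls` is the ordered list of tool names observed in one run.
    """
    seen = set(tool_calls)
    missing = [t for t in required_tools if t not in seen]
    violations = [t for t in forbidden_tools if t in seen]
    return not missing and not violations


# Hypothetical transcripts for the same positive prompt.
# Body injected into the system prompt: the agent starts working immediately.
injected_run = ["view", "edit", "bash"]
# Only the summary visible: the agent has to load the skill first.
summary_only_run = ["skill", "view", "edit", "bash"]

print(grade_tool_usage(injected_run, required_tools=["skill"]))
print(grade_tool_usage(summary_only_run, required_tools=["skill"]))
```

In this sketch the first run fails the check and the second passes, although the agent behaved sensibly in both. The grader cannot tell "recognized the trigger but had no reason to call the tool" apart from "missed the trigger".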

The bundled `trigger` grader is not a viable substitute: it scores the prompt heuristically against the skill's keyword set, so it never looks at what the agent actually did. In our experience its verdicts do not track observed agent behaviour, which makes it unreliable for trigger evals.

## Current workaround

Omit the top-level `skill:` field entirely. Waza still discovers the skill (walking one level of subdirectories under the CWD for `skills/<name>/SKILL.md`) and exposes it via the compact `<available_skills>` summary (name + description only). The agent then must invoke the `skill` tool to read the body — which is exactly the signal a trigger eval needs to observe.
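The difference between the two configurations can be sketched in Python as follows. Only the tag names `<available_skills>` and `<skill_context>` come from the behaviour described above. The exact layout, the `Skill` record, the `build_system_context` function, and the assumption that the summary is still emitted when a skill is pinned are all hypothetical and may not match what Waza actually produces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Skill:
    """Hypothetical stand-in for a discovered SKILL.md."""
    name: str
    description: str
    body: str


def build_system_context(skills, pinned: Optional[str] = None) -> str:
    """Render what the agent sees: always the summary, plus the pinned body."""
    lines = ["<available_skills>"]
    for s in skills:
        # Summary entry: name and description only, never the body.
        lines.append(f"- {s.name}: {s.description}")
    lines.append("</available_skills>")
    for s in skills:
        if s.name == pinned:
            # Corresponds to setting `skill: <name>` in the eval spec.
            lines += ["<skill_context>", s.body, "</skill_context>"]
    return "\n".join(lines)


xyz = Skill("xyz", "Does xyz things.", "Step 1: run the xyz workflow.")
without_skill_field = build_system_context([xyz])
with_skill_field = build_system_context([xyz], pinned="xyz")
```

Under these assumptions, `without_skill_field` contains the skill's name and description but not its body, so the only route to the body is a `skill` tool call, whereas `with_skill_field` already carries the body.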

A real spec using this pattern:

```yaml
name: xyz-trigger
description: >
  Trigger-precision tasks for the xyz skill: positive
  prompts that should activate the skill and negative prompts that
  should not.
# Deliberately no top-level `skill:` field. Setting it makes Waza
# inject the full SKILL.md body into the agent's system prompt (see
# buildSkillSystemMessage in waza/internal/execution/copilot.go),
# which defeats trigger measurement — the agent then "knows" the
# skill without ever invoking the `skill` tool. With `skill:`
# omitted, the agent only sees the compact <available_skills>
# summary (name + description) and must invoke the tool to read the
# body.
version: "1.0"
config:
  trials_per_task: 1
  timeout_seconds: 300
  parallel: false
  executor: copilot-sdk
  model: claude-opus-4.6
metrics:
  - name: task_completion
    weight: 1.0
    threshold: 0.8
tasks:
  - "tasks/trigger/*.yaml"
```

Paired with task-level graders like:

```yaml
# Positive trigger task (one task file)
expected:
  graders:
    - type: skill_invocation
      name: xyz-invoked
      config:
        required_skills:
          - xyz
        mode: any_order
        allow_extra: true

# Negative trigger task (a separate task file)
expected:
  graders:
    - type: behavior
      name: skill-not-invoked
      config:
        forbidden_tools:
          - skill
```

This works on Waza 0.31.0 with `executor: copilot-sdk`, but the docs explicitly mark `skill`/`agent` as required, so we may be relying on undocumented behaviour that could regress in a future release.

## Ask

Either of the following would unblock us:

1. **Docs clarification.** If "neither `skill:` nor `agent:`" is a supported configuration, please update the [eval-yaml guide](https://microsoft.github.io/waza/guides/eval-yaml/) and [YAML schema reference](https://microsoft.github.io/waza/reference/schema/) to reflect that, and describe the resulting agent-visible behaviour (compact `<available_skills>` summary, no `<skill_context>` injection, skill still discoverable via the `skill` tool). Trigger-precision evaluation is worth calling out as the motivating use case.

2. **Or, an explicit opt-out.** Add a flag that keeps `skill:` set — preserving skill association, hooks, dashboard grouping, and the rest of the eval's identity — but suppresses `SKILL.md` injection into the system prompt. Strawman:

   ```yaml
   skill: xyz
   config:
     inject_skill_body: false   # or e.g. skill_injection: summary-only
   ```

   With this, trigger evals could keep `skill:` set (matching the documented schema) while still requiring the agent to invoke the `skill` tool to access the body, making `behavior` / `skill_invocation` / `required_tools: [skill]` graders meaningful.

Option 2 is our preference because it lets specs continue to declare their associated skill (useful for filtering, reporting, dashboard grouping, hook context, etc.) while still producing realistic trigger measurements. Option 1 alone is acceptable.

## Environment

- Waza 0.31.0
- `executor: copilot-sdk`
- Model: `claude-opus-4.6`
