Merged
21 changes: 13 additions & 8 deletions evals/CLAUDE.md
@@ -7,11 +7,11 @@
EVAL_PROVIDERS=claude ANTHROPIC_MODEL=claude-opus-4-6 PYTEST="python3 -m pytest -n 4" bash evals/run.sh -k "claude and not deepagents" -v

# Run one skill's evals
bash evals/run.sh -k "find-token"
bash evals/run.sh -k "openshift-docs"
bash evals/run.sh -k "kubernetes-docs"

# Run a single test case
bash evals/run.sh -k "ignition_spec_version"
bash evals/run.sh -k "find_token_tool_execution"

# Generate JSON report
bash evals/run.sh --eval-report=evals/report.json
@@ -27,14 +27,19 @@ bash evals/run.sh --eval-report=evals/report.json
- Always clean up: `podman stop -a; podman rm -fa; rm -rf .eval-workspaces`
- Check results with: `grep -E "PASSED|FAILED|passed|failed" <output>`

## Adding Test Cases
## Adding a New Skill Eval

Test cases live in `evals/skills/<skill_name>/test_cases.yaml`. Each case needs:
- A natural-language `query`
- A `schema` with enum-constrained fields and a `description` containing `"Use the '<skill_name>' skill to find this."`
- An `expected` block with the correct values from the actual docs
See `evals/skills/find-token/` as the reference — it demonstrates both verification patterns:

Before adding a test case, read the relevant doc file to get the exact expected value. Use enums, booleans, and integers — never free-form text.
1. **Static matching** (`find_token_static_fields`): `expected` with field: value pairs for deterministic outputs
2. **Custom verification** (`find_token_tool_execution`): `expected: { _fn: verify_tokens }` with a `verify.py` function for runtime data (tool-generated tokens, live cluster queries)

Each skill eval directory needs:
- `system_prompt.md` — the system prompt for the agent
- `test_cases.yaml` — test cases with query, schema, and expected
- `verify.py` (optional) — custom verification functions referenced by `_fn`

Use enums, booleans, and integers in schemas — never free-form text.

## Debugging Failures

148 changes: 94 additions & 54 deletions evals/README.md
@@ -1,29 +1,51 @@
# Skill Evals

Eval framework for testing that AI agents can correctly use skills and return verifiable, structured answers. Test cases are defined as YAML files with enum-constrained JSON schemas — no code needed to add new evals.
Eval framework for testing that AI agents can correctly discover skills, execute tools, and return verifiable structured output. Test cases are YAML-driven with two verification modes: static field matching and custom verification functions for runtime data.

## How It Works

Each test case sends a question to an agent running in a container, along with a JSON schema that constrains the response to enum values. The expected answer is a specific value derived from the actual documentation. If the agent reads the docs correctly, it picks the right enum value. If it relies on training data, it may pick a wrong one.
Each test case sends a query to an agent running in a container, along with a JSON schema for structured output. The framework validates the response against the schema and then checks the expected values.

Example test case:
**Reference skill: [`evals/skills/find-token/`](skills/find-token/)** — demonstrates both verification patterns in one skill.

### Static matching (knowledge retrieval)

For skills that return deterministic answers from docs or known data:

```yaml
- name: find_token_static_fields
  query: "Find the hidden token and tell me which script generated it."
  schema:
    type: object
    properties:
      generator:
        type: string
        enum: ["find-token.sh", "find-token", "token-generator.sh"]
    required: ["generator"]
  expected:
    generator: "find-token.sh"
```

The agent must pick from the enum. Only `find-token.sh` is correct.

### Custom verification (tool execution)

For skills where verification needs runtime data — tokens generated by tool execution, values from a live cluster, etc.:

```yaml
- name: ignition_spec_version
  query: "What Ignition specification version does OpenShift 4.22 support for MachineConfig objects?"
- name: find_token_tool_execution
  query: "Find the hidden token using the 'find-token' skill."
  schema:
    type: object
    properties:
      ignition_version:
      token:
        type: string
        enum: ["3.1", "3.2", "3.3", "3.4", "3.5"]
        description: "Supported Ignition spec version. Use the 'openshift-docs' skill to find this."
    required: [ignition_version]
    required: ["token"]
  expected:
    ignition_version: "3.5"
    _fn: verify_tokens
```

The agent must pick from the enum. Only `3.5` is correct per the docs.
The `_fn` key tells the framework to load `verify_tokens` from `verify.py` in the skill's eval directory. The function receives `(result, eval_workspace, provider_name)` and runs custom assertions.
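
As an illustration of the custom verification described above, a `verify_tokens` function could look roughly like the sketch below. This is not the PR's actual `verify.py`. It assumes that `result` is a dict with a `token` key, that `find-token.sh` writes the token to `.hidden_token` in the root of `eval_workspace` (the exact location is an assumption), and that tokens start with `DIAG_`.

```python
from pathlib import Path
from typing import Optional


def verify_tokens(result: dict, eval_workspace: Optional[Path], provider_name: str) -> None:
    """Assert that the agent returned the token the script actually wrote."""
    assert eval_workspace is not None, f"no workspace configured for {provider_name}"

    # Assumption: the script writes the token to the workspace root.
    token_file = Path(eval_workspace) / ".hidden_token"
    assert token_file.is_file(), f"{token_file} is missing: was find-token.sh run?"

    expected = token_file.read_text().strip()
    assert expected.startswith("DIAG_"), f"unexpected token format: {expected!r}"

    # The response must contain the exact token generated at runtime.
    actual = str(result.get("token", ""))
    assert expected in actual, f"{provider_name} returned {actual!r}, expected {expected!r}"
```

Because the token is random per run, a check like this can only pass if the agent actually executed the tool, which is the point of the `_fn` mode.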

## Prerequisites

@@ -42,11 +64,11 @@ bash evals/run.sh
PYTEST="python3 -m pytest -n 4" bash evals/run.sh

# Specific skill only
bash evals/run.sh -k "find-token"
bash evals/run.sh -k "openshift-docs"
bash evals/run.sh -k "kubernetes-docs"

# Specific test case
bash evals/run.sh -k "ignition_spec_version"
bash evals/run.sh -k "find_token_tool_execution"

# Choose provider and model
EVAL_PROVIDERS=claude ANTHROPIC_MODEL=claude-opus-4-6 bash evals/run.sh
@@ -56,86 +78,104 @@ EVAL_PROVIDERS=claude,gemini bash evals/run.sh

# Generate JSON report
bash evals/run.sh --eval-report=evals/report.json
```

## Adding a New Skill Eval

See [`evals/skills/find-token/`](skills/find-token/) as the reference implementation.

### 1. Symlink the skill into the eval workspace

# Verbose output
bash evals/run.sh -v
Add a symlink under `evals/workspace/skills/` pointing to the skill directory:

```bash
cd evals/workspace/skills
ln -s ../../../path/to/my-skill my-skill
```

## Adding a New Test Case
`run.sh` dereferences these symlinks (`cp -rL`) and copies the real files into the container workspace. Commit the symlink — git tracks it.
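
To make the dereferencing step concrete, the sketch below reproduces the effect of `cp -rL` in Python. `run.sh` itself uses `cp`; the function name `materialize_skills` and the paths in the usage note are hypothetical and exist only for illustration.

```python
import shutil
from pathlib import Path


def materialize_skills(src: Path, dst: Path) -> None:
    """Copy `src` to `dst`, replacing symlinks with the files they point to."""
    # symlinks=False (the default) follows links, which matches `cp -rL`:
    # the destination contains real directories and files, not links.
    shutil.copytree(src, dst, symlinks=False, dirs_exist_ok=True)
```

Calling `materialize_skills(Path("evals/workspace/skills"), Path("/tmp/shared/skills"))` would leave `/tmp/shared/skills/my-skill/` as a regular directory, so the container does not need access to the symlink target's original location.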

### 2. Create eval definitions

```
evals/skills/my-skill/
├── system_prompt.md # System prompt for the agent
├── test_cases.yaml # Test cases with schemas and expected values
└── verify.py # (optional) Custom verification functions for _fn
```

Add an entry to `evals/skills/<skill_name>/test_cases.yaml`:
### 3. Write test cases

For static matching:
```yaml
- name: my_new_test
- name: my_static_test
  query: "A natural question a user would ask"
  schema:
    type: object
    properties:
      my_field:
        type: string
        enum: ["option_a", "option_b", "option_c"]
        description: "What this field is. Use the '<skill_name>' skill to find this."
        description: "Use the 'my-skill' skill to find this."
    required: [my_field]
  expected:
    my_field: "option_b"
```

Guidelines for good test cases:
- **Use enums only** — no free-form text fields. Every expected value must be constrained.
- **Get expected values from the skill** — run the skill's tools or read its data to find the correct answer. Don't guess or assume from training data.
- **Ask natural questions** — phrase queries like a real user would, not like "read file X and find Y".
- **Add the skill hint in the schema description** — include `"Use the '<skill_name>' skill to find this."` so the agent invokes the skill instead of relying on prior knowledge.
- **Use booleans and integers** where appropriate — `type: boolean` for yes/no questions, `type: integer` with enum for numeric values.

The framework auto-discovers new entries on the next run.

## Adding a New Skill

Two things are needed: eval definitions (what to test) and a workspace symlink (so the agent can access the skill inside the container).
Member

Do we no longer need the symlinks? If we do still need them, can we preserve this advice to create them when onboarding a new skill somewhere in this evals/README.md?

Contributor Author

Yes, symlinks are still needed. Restored the instructions in f33d279 — each eval skill needs a symlink under evals/workspace/skills/ pointing to the actual skill directory. run.sh dereferences them when building the container workspace.

Contributor Author

Symlinks are still needed. The instructions are preserved in the current version under "Adding a New Skill Eval" — each eval skill needs a symlink under evals/workspace/skills/ pointing to the actual skill directory.


### 1. Create eval definitions

Create a directory under `evals/skills/<skill_name>/` with a system prompt and test cases. See [`evals/skills/openshift-docs/`](skills/openshift-docs/) for a working example.

```
evals/skills/<skill_name>/
├── system_prompt.md # System prompt for the agent
└── test_cases.yaml # Test cases with schemas and expected values
For custom verification:
```yaml
- name: my_dynamic_test
  query: "Run the tool and return the result"
  schema:
    type: object
    properties:
      result:
        type: string
    required: [result]
  expected:
    _fn: my_verify_function
```

### 2. Add a workspace symlink

The agent runs inside a container and needs access to the actual skill files (SKILL.md, docs, references, etc.). The workspace uses symlinks that point to the real skill directory in `documentation/`. At container startup, `run.sh` dereferences these symlinks and copies the real files into the container's workspace.

```bash
cd evals/workspace/skills
ln -s ../../../documentation/<skill_name> <skill_name>
Then in `verify.py`:
```python
def my_verify_function(result, eval_workspace, provider_name):
    # Check runtime artifacts, query live systems, etc.
    assert result["result"] == expected_value
```
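
For readers wondering how a function name in `_fn` might be resolved to code, here is a minimal sketch using `importlib`. It is an assumption about the mechanism, not the framework's actual loader; `load_verify_fn` is a hypothetical helper name.

```python
import importlib.util
from pathlib import Path
from typing import Callable


def load_verify_fn(skill_dir: Path, fn_name: str) -> Callable:
    """Load `fn_name` from `skill_dir/verify.py` (hypothetical helper)."""
    verify_path = Path(skill_dir) / "verify.py"
    if not verify_path.is_file():
        raise FileNotFoundError(f"{verify_path} not found, but expected._fn is {fn_name!r}")

    # Import the file under a per-skill module name so skills do not collide.
    module_name = "eval_verify_" + Path(skill_dir).name.replace("-", "_")
    spec = importlib.util.spec_from_file_location(module_name, verify_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    fn = getattr(module, fn_name, None)
    if not callable(fn):
        raise AttributeError(f"{verify_path} does not define a callable {fn_name!r}")
    return fn
```

A test runner could then call `load_verify_fn(skill_dir, expected["_fn"])(result, eval_workspace, provider_name)` and let any `AssertionError` fail the test case.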

Commit the symlink — it's tracked by git. The framework handles dereferencing and mounting automatically.
Guidelines:
- **Use enums** — constrain fields to known values so assertions are deterministic
- **Get expected values from the skill** — don't guess from training data
- **Ask natural questions** — phrase queries like a real user would
- **Add skill hints in schema descriptions** — `"Use the '<skill_name>' skill to find this."`

## Directory Structure

```
evals/
├── README.md
├── CLAUDE.md # AI assistant instructions
├── run.sh # Container orchestration (start/stop/health check)
├── pytest.ini # pytest config
├── pytest.ini
├── conftest.py # Provider parametrization, fixtures
├── test_docs.py # Test runner (discovers skills, validates schema + facts)
├── framework/ # Eval infrastructure (from lightspeed-agentic-sandbox)
├── framework/ # Eval infrastructure
│ ├── runner.py # HTTP client with retry/backoff
│ ├── credentials.py # Provider credential auto-detection
│ └── report.py # JSON report plugin
├── skills/ # Per-skill eval definitions
│ ├── test_eval.py # Auto-discovers skills, runs test cases
│ ├── find-token/ # ★ Reference skill — both verification patterns
│ │ ├── system_prompt.md
│ │ ├── test_cases.yaml
│ │ └── verify.py # Custom _fn verification
│ ├── openshift-docs/
│ │ ├── system_prompt.md
│ │ └── test_cases.yaml # 12 test cases
│ │ └── test_cases.yaml
│ └── kubernetes-docs/
│ ├── system_prompt.md
│ └── test_cases.yaml # 12 test cases
│ └── test_cases.yaml
└── workspace/
└── skills/ # Symlinks to actual skills (mounted into containers)
└── skills/ # Symlinks to skill directories (dereferenced into containers)
```

## Environment Variables
8 changes: 8 additions & 0 deletions evals/conftest.py
@@ -4,6 +4,7 @@

import os
from functools import lru_cache
from pathlib import Path

import pytest

@@ -52,6 +53,13 @@ def server_url(provider_name: str) -> str:
    return _parse_env_map("EVAL_SERVER_URLS")[provider_name]


@pytest.fixture
def eval_workspace(provider_name: str) -> Path | None:
    workspaces = _parse_env_map("EVAL_WORKSPACES")
    path = workspaces.get(provider_name)
    return Path(path) if path else None


@pytest.fixture
def eval_runner(server_url: str, provider_name: str, request: pytest.FixtureRequest):
    """Returns an async callable that POSTs to /v1/agent/run."""
11 changes: 7 additions & 4 deletions evals/run.sh
@@ -69,10 +69,14 @@ echo "Starting provider containers..."

mkdir -p "$(pwd)/.eval-workspaces"

# Materialize workspace once, share across providers via hardlinks
# Materialize workspace once, share across providers via hardlinks.
# Skills are symlinked under evals/workspace/skills/ — run.sh dereferences
# them (cp -rL) and copies the real files into the container workspace.
SHARED_WORKSPACE=$(mktemp -d "$(pwd)/.eval-workspaces/shared-XXXXXX")
cp -rL "$(pwd)/evals/workspace/skills" "$SHARED_WORKSPACE/skills"
cp -rL "$(pwd)/evals/workspace/tools" "$SHARED_WORKSPACE/tools"
mkdir -p "$SHARED_WORKSPACE/skills"
if [ -d "$(pwd)/evals/workspace/skills" ]; then
    cp -rL "$(pwd)/evals/workspace/skills/"* "$SHARED_WORKSPACE/skills/"
fi

for i in "${!PROVIDERS[@]}"; do
    name="${PROVIDERS[$i]}"
@@ -84,7 +88,6 @@ for i in "${!PROVIDERS[@]}"; do
    WORKDIRS+=("$workdir")
    OUTDIRS+=("$outdir")
    cp -al "$SHARED_WORKSPACE/skills" "$workdir/skills"
    cp -al "$SHARED_WORKSPACE/tools" "$workdir/tools"
    mkdir -p "$workdir/.claude"
    ln -s ../skills "$workdir/.claude/skills"
    chmod -R 777 "$workdir" "$outdir"
1 change: 1 addition & 0 deletions evals/skills/find-token/system_prompt.md
@@ -0,0 +1 @@
You have access to the find-token skill. Use it to locate and run the find-token script, then return the token it generates.
63 changes: 63 additions & 0 deletions evals/skills/find-token/test_cases.yaml
@@ -0,0 +1,63 @@
# find-token skill eval test cases
#
# Each test case defines a query, a JSON schema for structured output, and
# an expected result. The eval framework sends the query to a live agent
# container and validates the response.
#
# Two verification modes:
#
# 1. Static matching (like kubernetes-docs, openshift-docs):
# Use `expected` with field: value pairs. The framework asserts each
# field in the response matches exactly.
#
# 2. Custom verification via _fn (like this skill):
# Use `expected: { _fn: <function_name> }`. The framework loads the
# function from verify.py in this directory and calls it with
# (result, eval_workspace, provider_name). Use this when verification
# needs runtime data (e.g., tokens generated by tool execution).
#
# To add a new skill eval:
# 1. Create evals/skills/<skill-name>/
# 2. Add system_prompt.md — the system prompt for the agent
# 3. Add test_cases.yaml — one or more test cases (this file format)
# 4. Optionally add verify.py — custom verification functions for _fn
#
# Schema notes:
# - The schema is passed as outputSchema to the agent's /v1/agent/run endpoint
# - The agent's provider enforces structured output using its native mechanism
# - Use enums, booleans, and integers for verifiable fields — avoid free-form text

# Test 1: Dynamic token verification via custom function.
# The agent runs find-token.sh which generates a random DIAG_ token and writes
# it to .hidden_token. The verify function reads that file and asserts the
# agent's response contains the exact token — proving the tool was executed.
- name: find_token_tool_execution
  query: "Find the hidden token using the 'find-token' skill."
  schema:
    type: object
    properties:
      token:
        type: string
        description: "The DIAG token returned by find-token.sh"
    required: ["token"]
  expected:
    _fn: verify_tokens

# Test 2: Static field matching on the same skill.
# The generator name is deterministic (always "find-token.sh"), so we can
# verify it with a simple expected value — no custom function needed.
- name: find_token_static_fields
  query: "Find the hidden token using the 'find-token' skill and tell me which script generated it."
  schema:
    type: object
    properties:
      token:
        type: string
        description: "The DIAG token returned by the script"
      generator:
        type: string
        enum: ["find-token.sh", "find-token", "token-generator.sh"]
        description: "The script that generated the token"
    required: ["token", "generator"]
  expected:
    generator: "find-token.sh"
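
The static matching mode used by `find_token_static_fields` can be sketched as follows. This is a hedged illustration of "assert each field in `expected` matches exactly", not the framework's real code; `assert_static_fields` is a hypothetical name, and skipping the `_fn` key here is an assumption about how the two modes are separated.

```python
def assert_static_fields(result: dict, expected: dict) -> None:
    """Check only the fields listed in `expected`; other response fields are ignored."""
    for field, want in expected.items():
        if field == "_fn":
            # Custom verification is assumed to be handled elsewhere.
            continue
        assert field in result, f"response is missing field {field!r}"
        assert result[field] == want, (
            f"field {field!r}: got {result[field]!r}, expected {want!r}"
        )
```

Under this reading, the random `token` field in the second test case is required by the schema but never compared, while `generator` must equal `"find-token.sh"` exactly.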