openshift · openshift-merge-bot · May 28, 2026 · May 28, 2026 · wking · May 12, 2026
diff --git a/evals/CLAUDE.md b/evals/CLAUDE.md
@@ -7,11 +7,11 @@
 EVAL_PROVIDERS=claude ANTHROPIC_MODEL=claude-opus-4-6 PYTEST="python3 -m pytest -n 4" bash evals/run.sh -k "claude and not deepagents" -v
 
 # Run one skill's evals
+bash evals/run.sh -k "find-token"
 bash evals/run.sh -k "openshift-docs"
-bash evals/run.sh -k "kubernetes-docs"
 
 # Run a single test case
-bash evals/run.sh -k "ignition_spec_version"
+bash evals/run.sh -k "find_token_tool_execution"
 
 # Generate JSON report
 bash evals/run.sh --eval-report=evals/report.json
@@ -27,14 +27,19 @@ bash evals/run.sh --eval-report=evals/report.json
 - Always clean up: `podman stop -a; podman rm -fa; rm -rf .eval-workspaces`
 - Check results with: `grep -E "PASSED|FAILED|passed|failed" <output>`
 
-## Adding Test Cases
+## Adding a New Skill Eval
 
-Test cases live in `evals/skills/<skill_name>/test_cases.yaml`. Each case needs:
-- A natural-language `query`
-- A `schema` with enum-constrained fields and a `description` containing `"Use the '<skill_name>' skill to find this."`
-- An `expected` block with the correct values from the actual docs
+See `evals/skills/find-token/` as the reference — it demonstrates both verification patterns:
 
-Before adding a test case, read the relevant doc file to get the exact expected value. Use enums, booleans, and integers — never free-form text.
+1. **Static matching** (`find_token_static_fields`): `expected` with field: value pairs for deterministic outputs
+2. **Custom verification** (`find_token_tool_execution`): `expected: { _fn: verify_tokens }` with a `verify.py` function for runtime data (tool-generated tokens, live cluster queries)
+
+Each skill eval directory needs:
+- `system_prompt.md` — the system prompt for the agent
+- `test_cases.yaml` — test cases with query, schema, and expected
+- `verify.py` (optional) — custom verification functions referenced by `_fn`
+
+Use enums, booleans, and integers in schemas — never free-form text.
 
 ## Debugging Failures
 

diff --git a/evals/README.md b/evals/README.md
@@ -1,29 +1,51 @@
 # Skill Evals
 
-Eval framework for testing that AI agents can correctly use skills and return verifiable, structured answers. Test cases are defined as YAML files with enum-constrained JSON schemas — no code needed to add new evals.
+Eval framework for testing that AI agents can correctly discover skills, execute tools, and return verifiable structured output. Test cases are YAML-driven with two verification modes: static field matching and custom verification functions for runtime data.
 
 ## How It Works
 
-Each test case sends a question to an agent running in a container, along with a JSON schema that constrains the response to enum values. The expected answer is a specific value derived from the actual documentation. If the agent reads the docs correctly, it picks the right enum value. If it relies on training data, it may pick a wrong one.
+Each test case sends a query to an agent running in a container, along with a JSON schema for structured output. The framework validates the response against the schema and then checks the expected values.
 
-Example test case:
+**Reference skill: [`evals/skills/find-token/`](skills/find-token/)** — demonstrates both verification patterns in one skill.
+
+### Static matching (knowledge retrieval)
+
+For skills that return deterministic answers from docs or known data:
+
+```yaml
+- name: find_token_static_fields
+  query: "Find the hidden token and tell me which script generated it."
+  schema:
+    type: object
+    properties:
+      generator:
+        type: string
+        enum: ["find-token.sh", "find-token", "token-generator.sh"]
+    required: ["generator"]
+  expected:
+    generator: "find-token.sh"
+```
+
+The agent must pick from the enum. Only `find-token.sh` is correct.
+
+### Custom verification (tool execution)
+
+For skills where verification needs runtime data — tokens generated by tool execution, values from a live cluster, etc.:
 
 ```yaml
-- name: ignition_spec_version
-  query: "What Ignition specification version does OpenShift 4.22 support for MachineConfig objects?"
+- name: find_token_tool_execution
+  query: "Find the hidden token using the 'find-token' skill."
   schema:
     type: object
     properties:
-      ignition_version:
+      token:
         type: string
-        enum: ["3.1", "3.2", "3.3", "3.4", "3.5"]
-        description: "Supported Ignition spec version. Use the 'openshift-docs' skill to find this."
-    required: [ignition_version]
+    required: ["token"]
   expected:
-    ignition_version: "3.5"
+    _fn: verify_tokens
 ```
 
-The agent must pick from the enum. Only `3.5` is correct per the docs.
+The `_fn` key tells the framework to load `verify_tokens` from `verify.py` in the skill's eval directory. The function receives `(result, eval_workspace, provider_name)` and runs custom assertions.
 
 ## Prerequisites
 
@@ -42,11 +64,11 @@ bash evals/run.sh
 PYTEST="python3 -m pytest -n 4" bash evals/run.sh
 
 # Specific skill only
+bash evals/run.sh -k "find-token"
 bash evals/run.sh -k "openshift-docs"
-bash evals/run.sh -k "kubernetes-docs"
 
 # Specific test case
-bash evals/run.sh -k "ignition_spec_version"
+bash evals/run.sh -k "find_token_tool_execution"
 
 # Choose provider and model
 EVAL_PROVIDERS=claude ANTHROPIC_MODEL=claude-opus-4-6 bash evals/run.sh
@@ -56,86 +78,104 @@ EVAL_PROVIDERS=claude,gemini bash evals/run.sh
 
 # Generate JSON report
 bash evals/run.sh --eval-report=evals/report.json
+```
+
+## Adding a New Skill Eval
+
+See [`evals/skills/find-token/`](skills/find-token/) as the reference implementation.
+
+### 1. Symlink the skill into the eval workspace
 
-# Verbose output
-bash evals/run.sh -v
+Add a symlink under `evals/workspace/skills/` pointing to the skill directory:
+
+```bash
+cd evals/workspace/skills
+ln -s ../../../path/to/my-skill my-skill
 ```
 
-## Adding a New Test Case
+`run.sh` dereferences these symlinks (`cp -rL`) and copies the real files into the container workspace. Commit the symlink — git tracks it.
+
+### 2. Create eval definitions
+
+```
+evals/skills/my-skill/
+├── system_prompt.md      # System prompt for the agent
+├── test_cases.yaml       # Test cases with schemas and expected values
+└── verify.py             # (optional) Custom verification functions for _fn
+```
 
-Add an entry to `evals/skills/<skill_name>/test_cases.yaml`:
+### 3. Write test cases
 
+For static matching:
 ```yaml
-- name: my_new_test
+- name: my_static_test
   query: "A natural question a user would ask"
   schema:
     type: object
     properties:
       my_field:
         type: string
         enum: ["option_a", "option_b", "option_c"]
-        description: "What this field is. Use the '<skill_name>' skill to find this."
+        description: "Use the 'my-skill' skill to find this."
     required: [my_field]
   expected:
     my_field: "option_b"
 ```
 
-Guidelines for good test cases:
-- **Use enums only** — no free-form text fields. Every expected value must be constrained.
-- **Get expected values from the skill** — run the skill's tools or read its data to find the correct answer. Don't guess or assume from training data.
-- **Ask natural questions** — phrase queries like a real user would, not like "read file X and find Y".
-- **Add the skill hint in the schema description** — include `"Use the '<skill_name>' skill to find this."` so the agent invokes the skill instead of relying on prior knowledge.
-- **Use booleans and integers** where appropriate — `type: boolean` for yes/no questions, `type: integer` with enum for numeric values.
-
-The framework auto-discovers new entries on the next run.
-
-## Adding a New Skill
-
-Two things are needed: eval definitions (what to test) and a workspace symlink (so the agent can access the skill inside the container).
-
-### 1. Create eval definitions
-
-Create a directory under `evals/skills/<skill_name>/` with a system prompt and test cases. See [`evals/skills/openshift-docs/`](skills/openshift-docs/) for a working example.
-
-```
-evals/skills/<skill_name>/
-├── system_prompt.md      # System prompt for the agent
-└── test_cases.yaml       # Test cases with schemas and expected values
+For custom verification:
+```yaml
+- name: my_dynamic_test
+  query: "Run the tool and return the result"
+  schema:
+    type: object
+    properties:
+      result:
+        type: string
+    required: [result]
+  expected:
+    _fn: my_verify_function
 ```
 
-### 2. Add a workspace symlink
-
-The agent runs inside a container and needs access to the actual skill files (SKILL.md, docs, references, etc.). The workspace uses symlinks that point to the real skill directory in `documentation/`. At container startup, `run.sh` dereferences these symlinks and copies the real files into the container's workspace.
-
-```bash
-cd evals/workspace/skills
-ln -s ../../../documentation/<skill_name> <skill_name>
+Then in `verify.py`:
+```python
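+def my_verify_function(result, eval_workspace, provider_name):
+    # Check runtime artifacts, query live systems, etc.
+    # "expected.txt" is a placeholder for whatever artifact your tool writes.
+    expected_value = (eval_workspace / "expected.txt").read_text().strip()
+    assert result["result"] == expected_value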
 ```
 
-Commit the symlink — it's tracked by git. The framework handles dereferencing and mounting automatically.
+Guidelines:
+- **Use enums** — constrain fields to known values so assertions are deterministic
+- **Get expected values from the skill** — don't guess from training data
+- **Ask natural questions** — phrase queries like a real user would
+- **Add skill hints in schema descriptions** — `"Use the '<skill_name>' skill to find this."`
 
 ## Directory Structure
 
 ```
 evals/
 ├── README.md
+├── CLAUDE.md               # AI assistant instructions
 ├── run.sh                  # Container orchestration (start/stop/health check)
-├── pytest.ini              # pytest config
+├── pytest.ini
 ├── conftest.py             # Provider parametrization, fixtures
-├── test_docs.py            # Test runner (discovers skills, validates schema + facts)
-├── framework/              # Eval infrastructure (from lightspeed-agentic-sandbox)
+├── framework/              # Eval infrastructure
 │   ├── runner.py           # HTTP client with retry/backoff
 │   ├── credentials.py      # Provider credential auto-detection
 │   └── report.py           # JSON report plugin
 ├── skills/                 # Per-skill eval definitions
+│   ├── test_eval.py        # Auto-discovers skills, runs test cases
+│   ├── find-token/         # ★ Reference skill — both verification patterns
+│   │   ├── system_prompt.md
+│   │   ├── test_cases.yaml
+│   │   └── verify.py       # Custom _fn verification
 │   ├── openshift-docs/
 │   │   ├── system_prompt.md
-│   │   └── test_cases.yaml     # 12 test cases
+│   │   └── test_cases.yaml
 │   └── kubernetes-docs/
 │       ├── system_prompt.md
-│       └── test_cases.yaml     # 12 test cases
+│       └── test_cases.yaml
 └── workspace/
-    └── skills/             # Symlinks to actual skills (mounted into containers)
+    └── skills/             # Symlinks to skill directories (dereferenced into containers)
 ```
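The README above says that an `expected` block is either a set of field/value pairs or a single `_fn` key naming a function in the skill's `verify.py`. The actual runner (`test_eval.py`) is not part of this diff, so the following is only a minimal sketch of how such a dispatch could work. The helper names `load_verify_fn` and `check_expected` are hypothetical; only the `(result, eval_workspace, provider_name)` call signature comes from the README.

```python
import importlib.util
from pathlib import Path


def load_verify_fn(skill_dir: Path, fn_name: str):
    """Import verify.py from a skill's eval directory and return the named function."""
    # Module name is arbitrary; hyphens are replaced so it is a valid identifier.
    module_name = "verify_" + skill_dir.name.replace("-", "_")
    spec = importlib.util.spec_from_file_location(module_name, skill_dir / "verify.py")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, fn_name)


def check_expected(expected, result, skill_dir, eval_workspace, provider_name):
    """Run custom verification if `_fn` is present, otherwise match fields exactly."""
    if "_fn" in expected:
        verify = load_verify_fn(skill_dir, expected["_fn"])
        verify(result, eval_workspace, provider_name)
        return
    for field, want in expected.items():
        got = result.get(field)
        assert got == want, f"{field}: expected {want!r}, got {got!r}"
```

Under this sketch, fields that appear in the schema but not in `expected` (such as `token` in `find_token_static_fields`) are simply not compared.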
 
 ## Environment Variables

diff --git a/evals/conftest.py b/evals/conftest.py
@@ -4,6 +4,7 @@
 
 import os
 from functools import lru_cache
+from pathlib import Path
 
 import pytest
 
@@ -52,6 +53,13 @@ def server_url(provider_name: str) -> str:
     return _parse_env_map("EVAL_SERVER_URLS")[provider_name]
 
 
+@pytest.fixture
+def eval_workspace(provider_name: str) -> Path | None:
+    workspaces = _parse_env_map("EVAL_WORKSPACES")
+    path = workspaces.get(provider_name)
+    return Path(path) if path else None
+
+
 @pytest.fixture
 def eval_runner(server_url: str, provider_name: str, request: pytest.FixtureRequest):
     """Returns an async callable that POSTs to /v1/agent/run."""

diff --git a/evals/run.sh b/evals/run.sh
@@ -69,10 +69,14 @@ echo "Starting provider containers..."
 
 mkdir -p "$(pwd)/.eval-workspaces"
 
-# Materialize workspace once, share across providers via hardlinks
+# Materialize workspace once, share across providers via hardlinks.
+# Skills are symlinked under evals/workspace/skills/ — run.sh dereferences
+# them (cp -rL) and copies the real files into the container workspace.
 SHARED_WORKSPACE=$(mktemp -d "$(pwd)/.eval-workspaces/shared-XXXXXX")
-cp -rL "$(pwd)/evals/workspace/skills" "$SHARED_WORKSPACE/skills"
-cp -rL "$(pwd)/evals/workspace/tools" "$SHARED_WORKSPACE/tools"
+mkdir -p "$SHARED_WORKSPACE/skills"
+if [ -d "$(pwd)/evals/workspace/skills" ]; then
+    cp -rL "$(pwd)/evals/workspace/skills/." "$SHARED_WORKSPACE/skills/"
+fi
 
 for i in "${!PROVIDERS[@]}"; do
     name="${PROVIDERS[$i]}"
@@ -84,7 +88,6 @@ for i in "${!PROVIDERS[@]}"; do
     WORKDIRS+=("$workdir")
     OUTDIRS+=("$outdir")
     cp -al "$SHARED_WORKSPACE/skills" "$workdir/skills"
-    cp -al "$SHARED_WORKSPACE/tools" "$workdir/tools"
     mkdir -p "$workdir/.claude"
     ln -s ../skills "$workdir/.claude/skills"
     chmod -R 777 "$workdir" "$outdir"
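The two copy steps in this hunk (dereference once with `cp -rL`, then share per provider with `cp -al`) can be illustrated in Python. This is a rough equivalent for readers less familiar with the `cp` flags, not something the PR adds. The function names `materialize` and `hardlink_tree` are hypothetical, and the sketch assumes a POSIX filesystem where both trees live on the same device.

```python
import os
import shutil
from pathlib import Path


def materialize(skills_src: Path, shared: Path) -> Path:
    """Copy skills into shared/skills, following symlinks (roughly `cp -rL`)."""
    dest = shared / "skills"
    dest.mkdir(parents=True, exist_ok=True)
    if skills_src.is_dir():
        # symlinks=False copies the link targets rather than the links themselves.
        shutil.copytree(skills_src, dest, symlinks=False, dirs_exist_ok=True)
    return dest


def hardlink_tree(src: Path, dst: Path) -> None:
    """Recreate src at dst with hard-linked files (roughly `cp -al`)."""
    # dst must not exist yet; directories are created, files are linked.
    shutil.copytree(src, dst, copy_function=os.link)
```

Hard-linking means each provider's work directory sees the same file contents without a second copy on disk.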

diff --git a/evals/skills/find-token/system_prompt.md b/evals/skills/find-token/system_prompt.md
@@ -0,0 +1 @@
+You have access to the find-token skill. Use it to locate and run the find-token script, then return the token it generates.
diff --git a/evals/skills/find-token/test_cases.yaml b/evals/skills/find-token/test_cases.yaml
@@ -0,0 +1,63 @@
+# find-token skill eval test cases
+#
+# Each test case defines a query, a JSON schema for structured output, and
+# an expected result. The eval framework sends the query to a live agent
+# container and validates the response.
+#
+# Two verification modes:
+#
+#   1. Static matching (like kubernetes-docs, openshift-docs):
+#      Use `expected` with field: value pairs. The framework asserts each
+#      field in the response matches exactly.
+#
+#   2. Custom verification via _fn (like this skill):
+#      Use `expected: { _fn: <function_name> }`. The framework loads the
+#      function from verify.py in this directory and calls it with
+#      (result, eval_workspace, provider_name). Use this when verification
+#      needs runtime data (e.g., tokens generated by tool execution).
+#
+# To add a new skill eval:
+#   1. Create evals/skills/<skill-name>/
+#   2. Add system_prompt.md — the system prompt for the agent
+#   3. Add test_cases.yaml — one or more test cases (this file format)
+#   4. Optionally add verify.py — custom verification functions for _fn
+#
+# Schema notes:
+#   - The schema is passed as outputSchema to the agent's /v1/agent/run endpoint
+#   - The agent's provider enforces structured output using its native mechanism
+#   - Use enums, booleans, and integers for verifiable fields — avoid free-form text
+
+# Test 1: Dynamic token verification via custom function.
+# The agent runs find-token.sh which generates a random DIAG_ token and writes
+# it to .hidden_token. The verify function reads that file and asserts the
+# agent's response contains the exact token — proving the tool was executed.
+- name: find_token_tool_execution
+  query: "Find the hidden token using the 'find-token' skill."
+  schema:
+    type: object
+    properties:
+      token:
+        type: string
+        description: "The DIAG token returned by find-token.sh"
+    required: ["token"]
+  expected:
+    _fn: verify_tokens
+
+# Test 2: Static field matching on the same skill.
+# The generator name is deterministic (always "find-token.sh"), so we can
+# verify it with a simple expected value — no custom function needed.
+- name: find_token_static_fields
+  query: "Find the hidden token using the 'find-token' skill and tell me which script generated it."
+  schema:
+    type: object
+    properties:
+      token:
+        type: string
+        description: "The DIAG token returned by the script"
+      generator:
+        type: string
+        enum: ["find-token.sh", "find-token", "token-generator.sh"]
+        description: "The script that generated the token"
+    required: ["token", "generator"]
+  expected:
+    generator: "find-token.sh"
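The `verify.py` for this skill is not shown in this diff. Based on the comment on Test 1 (the script writes a random `DIAG_` token to `.hidden_token`, and the verifier checks that the response contains it), a `verify_tokens` function might look roughly like the sketch below. The location of `.hidden_token` at the root of the provider workspace and the `DIAG_` prefix check are assumptions drawn from that comment, not confirmed details of the real file.

```python
from __future__ import annotations

from pathlib import Path


def verify_tokens(result: dict, eval_workspace: Path | None, provider_name: str) -> None:
    """Assert the agent returned the token that find-token.sh wrote at runtime."""
    assert eval_workspace is not None, f"no workspace configured for {provider_name}"

    # Assumed location: the script writes .hidden_token at the workspace root.
    token_file = eval_workspace / ".hidden_token"
    assert token_file.is_file(), "find-token.sh does not appear to have run"

    expected = token_file.read_text().strip()
    assert expected.startswith("DIAG_"), f"unexpected token format: {expected!r}"

    # The response must contain the exact token generated during this run.
    assert expected in result.get("token", ""), (
        f"{provider_name}: expected {expected!r}, got {result.get('token')!r}"
    )
```

Because the token is random per run, the check can only pass if the agent actually ran the script rather than guessing a value.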