Disclaimer: This is a vibe-coded repo, built by copying and adapting Anthropic's open-source skill-creator (not leaked).
A CLI tool to create, test, and evaluate Agent Skills for OpenCode. Skills are SKILL.md files with frontmatter and step-by-step instructions that extend an AI agent's capabilities. This tool measures whether a skill actually improves the agent's output quality compared to running without it.
Note: While this is a CLI tool, it is designed to be used inside OpenCode's agent. The agent invokes these commands to run evaluations, grade outputs, and generate benchmarks — you don't typically run them manually from your terminal.
```bash
npm install -g .
```

Or install directly from the repo:

```bash
npm install
npm run build
```

Requires Node.js >= 20.
```bash
# 1. Scaffold a new skill
opencode-skill-eval scaffold my-skill

# 2. Edit the generated SKILL.md and evals/evals.json

# 3. Run the full pipeline (test → grade → benchmark → view)
opencode-skill-eval run-all path/to/my-skill

# Or step by step:
opencode-skill-eval test path/to/my-skill
opencode-skill-eval grade path/to/my-skill-workspace/iteration-1
opencode-skill-eval benchmark path/to/my-skill-workspace/iteration-1
```

The `benchmark` and `run-all` commands automatically generate an HTML viewer and open it in your browser. Use `--no-view` to suppress this.

Note: The auto-open functionality is currently broken. After running an evaluation, open the HTML manually at `<workspace>/iteration-N/review.html`.
The evaluation loop runs each test case twice:
- With skill — `opencode run` is invoked with `OPENCODE_SKILL_PATH` pointing to the skill directory. OpenCode loads the skill transparently via this environment variable, injecting its instructions into the agent's context.
- Without skill (baseline) — The same `opencode run` command executes, but without the env var, so the agent has no skill loaded.
The agent cannot detect whether a skill is present — it just responds to the prompt. This ensures a fair comparison where the only variable is the skill's presence.
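The core idea — identical invocations that differ only in one environment variable — can be sketched as follows. This is a hypothetical helper for illustration; the real spawning logic lives in `src/utils/subprocess.ts`:

```typescript
// Sketch: build the environment for each configuration. The only
// difference between the two runs is the OPENCODE_SKILL_PATH variable.
type RunEnv = Record<string, string | undefined>;

interface RunConfig {
  label: "with_skill" | "without_skill";
  env: RunEnv;
}

function buildRunConfigs(skillPath: string): RunConfig[] {
  const base: RunEnv = { ...process.env };
  delete base.OPENCODE_SKILL_PATH; // ensure the baseline truly has no skill
  return [
    { label: "with_skill", env: { ...base, OPENCODE_SKILL_PATH: skillPath } },
    { label: "without_skill", env: base },
  ];
}
```

Both configs would then be passed to the same `opencode run` subprocess call, keeping every other variable constant.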
After execution, a grading LLM evaluates each run's transcript and output files against the defined assertions, producing pass/fail verdicts with evidence. Results are aggregated into statistics (pass rate, time, tokens) with deltas between configurations.
Creates a new skill directory with a template SKILL.md and evals/evals.json.
```bash
opencode-skill-eval scaffold my-skill -p ./skills
```

| Option | Description | Default |
|---|---|---|
| `-p, --path <dir>` | Base directory for the skill | `.` |
Runs the full eval pipeline: test → grade → benchmark → view. Prints the benchmark table to stdout and opens the HTML viewer in your browser.
```bash
# Standard mode: with_skill vs without_skill
opencode-skill-eval run-all ./skills/my-skill

# Compare mode: A/B blind comparison against an old skill version
opencode-skill-eval run-all ./skills/my-skill-v2 --compare ./skills/my-skill-v1

# Skip the HTML viewer
opencode-skill-eval run-all ./skills/my-skill --no-view
```

| Option | Description | Default |
|---|---|---|
| `-e, --evals <path>` | Path to evals.json | `<skill-path>/evals/evals.json` |
| `-m, --model <model>` | Model to use for execution | `anthropic/claude-sonnet-4-20250514` |
| `-w, --workspace <dir>` | Output workspace directory | `<skill-path>-workspace` |
| `-i, --iteration <n>` | Iteration number | `1` |
| `-p, --parallel <n>` | Concurrent runs | `2` |
| `-t, --timeout <ms>` | Per-run timeout in ms | `300000` |
| `--no-baseline` | Skip baseline runs | — |
| `--baseline-mode <mode>` | `without_skill` or `old_skill` | `without_skill` |
| `--old-skill <path>` | Path to old skill for baseline | — |
| `--compare <old-skill>` | Enable A/B compare mode | — |
| `--no-view` | Skip generating HTML viewer | — |
Runs all evals with the skill and baseline (parallel, configurable model/timeout).
```bash
opencode-skill-eval test ./skills/my-skill \
  --model anthropic/claude-sonnet-4-20250514 \
  --parallel 4 \
  --timeout 300000 \
  --baseline-mode without_skill
```

| Option | Description | Default |
|---|---|---|
| `-e, --evals <path>` | Path to evals.json | `<skill-path>/evals/evals.json` |
| `-m, --model <model>` | Model to use for execution | `anthropic/claude-sonnet-4-20250514` |
| `-w, --workspace <dir>` | Output workspace directory | `<skill-path>-workspace` |
| `-i, --iteration <n>` | Iteration number | `1` |
| `-p, --parallel <n>` | Concurrent runs | `2` |
| `-t, --timeout <ms>` | Per-run timeout in ms | `300000` |
| `--no-baseline` | Skip baseline runs | — |
| `--baseline-mode <mode>` | `without_skill` or `old_skill` | `without_skill` |
| `--old-skill <path>` | Path to old skill for comparison | — |
Grades all eval outputs against assertions using an LLM. Reads transcripts and output files, evaluates each assertion, and writes grading.json per run.
```bash
opencode-skill-eval grade ./skills/my-skill-workspace/iteration-1 \
  --model anthropic/claude-sonnet-4-20250514
```

| Option | Description | Default |
|---|---|---|
| `-m, --model <model>` | Model to use for grading | `anthropic/claude-sonnet-4-20250514` |
Aggregates grading results into statistics with mean, stddev, min, max, and deltas between configurations. Prints the benchmark table to stdout, outputs benchmark.json and benchmark.md, and automatically generates and opens the HTML viewer.
```bash
opencode-skill-eval benchmark ./skills/my-skill-workspace/iteration-1 \
  --skill-name my-skill
```

| Option | Description | Default |
|---|---|---|
| `-n, --skill-name <name>` | Skill name | Inferred from directory |
| `--skill-path <path>` | Skill path | — |
| `-m, --model <model>` | Model used for execution | `anthropic/claude-sonnet-4-20250514` |
| `--no-view` | Skip generating HTML viewer | — |
Generates a self-contained HTML viewer with tabs for individual run outputs, A/B comparisons, and benchmark summaries. Opens in browser by default, or writes static HTML.
```bash
# Open in browser
opencode-skill-eval view ./skills/my-skill-workspace/iteration-1

# Write static HTML
opencode-skill-eval view ./skills/my-skill-workspace/iteration-1 \
  --static ./results.html
```

| Option | Description | Default |
|---|---|---|
| `-n, --skill-name <name>` | Skill name | Inferred from directory |
| `--benchmark <path>` | Path to benchmark.json | — |
| `--previous-workspace <path>` | Previous iteration for feedback comparison | — |
| `--static <path>` | Write static HTML instead of serving | — |
Blind A/B comparison of two skill versions. Runs both on the same evals, then an LLM judge evaluates outputs (labeled A/B) on content and structure rubrics (1-5 scoring). Automatically generates and opens the HTML viewer with a Comparison tab.
```bash
opencode-skill-eval compare ./skills/my-skill-v1 ./skills/my-skill-v2
```

| Option | Description | Default |
|---|---|---|
| `-e, --evals <path>` | Path to evals.json | `<new-skill>/evals/evals.json` |
| `-m, --model <model>` | Model to use | `anthropic/claude-sonnet-4-20250514` |
| `-w, --workspace <dir>` | Workspace directory | `<new-skill>-compare-workspace` |
| `-p, --parallel <n>` | Parallel runs | `2` |
| `-t, --timeout <ms>` | Per-run timeout in ms | `300000` |
| `--no-view` | Skip generating HTML viewer | — |
Iteratively improves the skill description for better trigger accuracy. Splits eval queries into train/test sets, evaluates, identifies failures, asks an LLM to improve the description, and selects the best version by test score to prevent overfitting.
```bash
opencode-skill-eval optimize-triggers ./skills/my-skill \
  --max-iterations 5 \
  --holdout 0.4 \
  --threshold 0.5
```

| Option | Description | Default |
|---|---|---|
| `-e, --eval-set <path>` | Path to trigger eval JSON | `<skill-path>/trigger-evals.json` |
| `-m, --model <model>` | Model to use | `anthropic/claude-sonnet-4-20250514` |
| `--max-iterations <n>` | Max optimization iterations | `5` |
| `--runs-per-query <n>` | Runs per query | `3` |
| `--threshold <n>` | Trigger threshold | `0.5` |
| `--holdout <n>` | Test holdout fraction | `0.4` |
| `-w, --workspace <dir>` | Workspace directory | `<skill-path>-trigger-workspace` |
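The holdout mechanics can be sketched as follows — a deterministic split for illustration only (the actual tool may shuffle or stratify; `splitHoldout` is a hypothetical name):

```typescript
// Split eval queries into a train set (drives the LLM's description
// rewrites) and a held-out test set (scores candidates, guarding
// against overfitting to the training queries).
function splitHoldout<T>(queries: T[], holdout: number): { train: T[]; test: T[] } {
  const testSize = Math.round(queries.length * holdout);
  return {
    test: queries.slice(0, testSize),
    train: queries.slice(testSize),
  };
}
```

With `--holdout 0.4` and 10 queries, 4 are held out for scoring and 6 are used for optimization.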
Creates a copy of a skill directory for version comparison.
```bash
opencode-skill-eval snapshot ./skills/my-skill
```

| Option | Description | Default |
|---|---|---|
| `-w, --workspace <dir>` | Output directory | `<skill-path>-workspace` |
Evals are defined in `evals/evals.json`:
```json
{
  "skill_name": "my-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "A realistic user prompt that should trigger this skill",
      "expected_output": "Description of the expected result",
      "files": ["evals/files/input.csv"],
      "assertions": [
        "Output is a valid CSV file",
        "Column headers match input schema",
        "All rows are processed without data loss"
      ]
    }
  ]
}
```

Each eval item has:

| Field | Description |
|---|---|
| `id` | Unique identifier |
| `prompt` | The user prompt to send to the agent |
| `expected_output` | Human-readable description of the expected result |
| `files` | Input file paths (relative to the skill directory) |
| `assertions` | Verifiable claims about the output, evaluated by the grading LLM |
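The same shape expressed as TypeScript types, with a minimal structural check that is useful before kicking off an expensive run. (The canonical schemas are the Zod definitions in `src/eval/types.ts`; this dependency-free sketch only mirrors them.)

```typescript
interface EvalCase {
  id: number;
  prompt: string;
  expected_output: string;
  files?: string[];     // optional input files, relative to the skill dir
  assertions: string[]; // each graded individually by the grading LLM
}

interface EvalSet {
  skill_name: string;
  evals: EvalCase[];
}

// Cheap structural guard: checks the fields the runner depends on.
function isEvalSet(x: unknown): x is EvalSet {
  const s = x as EvalSet;
  return (
    typeof s?.skill_name === "string" &&
    Array.isArray(s?.evals) &&
    s.evals.every(
      (e) =>
        typeof e.id === "number" &&
        typeof e.prompt === "string" &&
        Array.isArray(e.assertions)
    )
  );
}
```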
After running test + grade + benchmark (or run-all):
```
my-skill-workspace/
├── iteration-1/
│   ├── eval-1/
│   │   ├── eval_metadata.json        # Eval ID, prompt, assertions
│   │   ├── inputs/                   # Copied input files
│   │   ├── with_skill/
│   │   │   └── run-1/
│   │   │       ├── outputs/
│   │   │       │   ├── transcript.md # Full execution transcript
│   │   │       │   └── ...           # Files created by the agent
│   │   │       ├── grading.json      # Pass/fail per assertion with evidence
│   │   │       └── timing.json       # Duration and token usage
│   │   └── without_skill/
│   │       └── run-1/                # Same structure (baseline)
│   ├── eval-2/
│   │   └── ...
│   ├── comparison.json               # Blind A/B comparison (compare mode only)
│   ├── benchmark.json                # Aggregated stats with deltas
│   ├── benchmark.md                  # Human-readable summary table
│   └── review.html                   # HTML viewer
├── iteration-2/                      # Subsequent iterations
│   └── ...
└── feedback.json                     # User review feedback from viewer
```
- Write your skill and evals
- Run `run-all` for the full pipeline, or step through individually:
  - `test` to execute all evals
  - `grade` to score outputs
  - `benchmark` to see statistics (prints table + opens viewer)
- Improve the skill based on results
- Run `run-all` again with `--iteration 2`
- Use `compare` or `run-all --compare` for A/B comparison between versions
The viewer is a self-contained HTML file with three tabs:
- Outputs — Browse individual eval runs with transcripts, output files, and formal grades (PASS/FAIL with evidence)
- Comparison — View blind A/B comparison results with winner badges, rubric scores, and strengths/weaknesses per output (available when using `--compare` mode)
- Benchmark — Summary table with pass rate, time, and token statistics across configurations
Navigation uses prev/next buttons or arrow keys. A feedback bar at the bottom lets you annotate runs and download feedback.json.
```
src/
├── cli.ts              # Commander.js CLI — all 9 commands
├── eval/
│   ├── runner.ts       # Orchestrates parallel eval execution
│   ├── grader.ts       # LLM-based assertion grading
│   ├── aggregator.ts   # Statistical benchmark aggregation
│   ├── analyzer.ts     # LLM-based benchmark analysis
│   ├── comparator.ts   # Blind A/B comparison
│   └── types.ts        # Zod schemas for all data structures
├── trigger/
│   ├── runner.ts       # Tests skill trigger accuracy
│   ├── optimizer.ts    # Iterative description optimization
│   └── types.ts        # Zod schemas for trigger evaluation
├── skill/
│   ├── loader.ts       # Parses SKILL.md frontmatter
│   ├── scaffold.ts     # Creates new skill template
│   └── snapshot.ts     # Copies skill for comparison
├── viewer/
│   └── generate.ts     # HTML results viewer generator
└── utils/
    ├── filesystem.ts   # File/directory helpers
    └── subprocess.ts   # Spawns opencode run with timeout
```
| Package | Purpose |
|---|---|
| `commander` | CLI framework |
| `zod` | Schema validation for all JSON data structures |
| `p-limit` | Concurrency control for parallel eval runs |
| `typescript` | Compilation target |
```bash
npm install
npm run build      # Compile TypeScript
npm run dev        # Watch mode
npm run typecheck  # Type check without emitting
```

Known issues:

- Token counts are not correct yet and will be fixed soon
Apache 2.0