
feat: Add eval scaffolding command (waza eval new)#94

Open
spboyer wants to merge 3 commits into microsoft:main from spboyer:squad/83-eval-new

@spboyer
Member

@spboyer spboyer commented Mar 5, 2026

Closes #83

Working as Linus (Backend Developer)
⚠️ This task was flagged as "needs review" — please have a squad member review before merging.

Copilot AI review requested due to automatic review settings March 5, 2026 02:23
@spboyer spboyer self-assigned this Mar 5, 2026
@github-actions github-actions bot enabled auto-merge (squash) March 5, 2026 02:24
@chlowell
Member

chlowell commented Mar 5, 2026

Should this go in `init` or `new` instead of a new verb?

@codecov-commenter

codecov-commenter commented Mar 5, 2026

Codecov Report

❌ Patch coverage is 88.29787% with 22 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@a75477e). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|------|------|------|
| cmd/waza/cmd_eval.go | 88.23% | 14 Missing and 8 partials ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main      #94   +/-   ##
=======================================
  Coverage        ?   72.42%           
=======================================
  Files           ?      129           
  Lines           ?    14440           
  Branches        ?        0           
=======================================
  Hits            ?    10458           
  Misses          ?     3210           
  Partials        ?      772           
| Flag | Coverage Δ |
|------|------|
| go-implementation | 72.42% <88.29%> (?) |

Contributor

Copilot AI left a comment

Pull request overview

Adds a new `waza eval new <skill-name>` CLI subcommand to scaffold an evaluation suite from a skill’s `SKILL.md` frontmatter, plus docs/tests to make it discoverable and verifiable.

Changes:

  • Introduce `waza eval new` command that parses `SKILL.md` triggers and generates `eval.yaml` + starter trigger/anti-trigger task YAMLs.
  • Add unit tests validating scaffold generation, custom output path behavior, and missing `SKILL.md` error handling.
  • Update CLI docs + README + command metadata expectations to include the new `eval` top-level command.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
|------|-------------|
| site/src/content/docs/reference/cli.mdx | Documents the new `waza eval new` command in the CLI reference. |
| cmd/waza/root.go | Registers the new `eval` command in the root CLI. |
| cmd/waza/cmd_metadata_test.go | Updates metadata test to expect the new top-level `eval` command. |
| cmd/waza/cmd_eval_test.go | Adds tests covering scaffold generation, `--output`, and error cases. |
| cmd/waza/cmd_eval.go | Implements `waza eval new` scaffold generation logic. |
| README.md | Adds usage docs for `waza eval new`. |
| .squad/log/2026-03-05T00-36-issue-assignment-pipeline.md | Adds squad session log (non-functional change). |
| .squad/log/2026-03-05T00-26-rusty-token-diff-design.md | Adds squad session log (non-functional change). |
| .squad/decisions.md | Records squad decisions (non-functional change). |
Comments suppressed due to low confidence (2)

cmd/waza/cmd_eval.go:74

  • When `--output` is not provided, the default output path is hard-coded to `evals/<skill-name>/eval.yaml`. If the project uses a custom `paths.evals` in `.waza.yaml`, this will write scaffolding into the wrong directory. It would be more consistent with other commands to derive the default evals directory from project config/workspace detection.
	if outputPath == "" {
		outputPath = filepath.Join("evals", skillName, "eval.yaml")
	}
	tasksDir := filepath.Join(filepath.Dir(outputPath), "tasks")
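
A minimal sketch of the suggestion above, under stated assumptions: `resolveScaffoldPaths` is a hypothetical helper (not part of this PR), and its `evalsDir` argument stands in for whatever value `paths.evals` resolves to. How the project actually loads `.waza.yaml` is not shown in this conversation, so that step is left out.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolveScaffoldPaths is a hypothetical helper for illustration only.
// evalsDir is assumed to hold the configured evals directory; an empty
// value falls back to "evals", matching the hard-coded default quoted above.
// An explicit outputPath always wins over the configured directory.
func resolveScaffoldPaths(outputPath, evalsDir, skillName string) (evalPath, tasksDir string) {
	if evalsDir == "" {
		evalsDir = "evals"
	}
	if outputPath == "" {
		outputPath = filepath.Join(evalsDir, skillName, "eval.yaml")
	}
	// Task files live in a tasks/ directory next to the eval file.
	return outputPath, filepath.Join(filepath.Dir(outputPath), "tasks")
}

func main() {
	evalPath, tasksDir := resolveScaffoldPaths("", "custom-evals", "my-skill")
	fmt.Println(evalPath)
	fmt.Println(tasksDir)
}
```

Passing the directory in as an argument keeps the path logic free of config loading, which also makes it straightforward to unit test.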

site/src/content/docs/reference/cli.mdx:202

  • Flag docs list `--output` without indicating it takes a path argument. For consistency with the README and the actual flag help, consider documenting it as `--output <path>`.
| Flag | Description |
|------|-------------|
| `--output` | Custom path for generated `eval.yaml` |

Comment on lines +27 to +30
origDir, err := os.Getwd()
require.NoError(t, err)
require.NoError(t, os.Chdir(dir))
t.Cleanup(func() { _ = os.Chdir(origDir) })

Copilot AI Mar 5, 2026

Tests manually manage `os.Getwd`/`os.Chdir` + cleanup. The rest of the `cmd/waza` tests typically use `t.Chdir(dir)` (available in this Go version) which is simpler and avoids missing cleanup paths. Consider switching to `t.Chdir` for consistency and maintainability.

Suggested change
origDir, err := os.Getwd()
require.NoError(t, err)
require.NoError(t, os.Chdir(dir))
t.Cleanup(func() { _ = os.Chdir(origDir) })
t.Chdir(dir)

Copilot uses AI. Check for mistakes.
Member Author

@spboyer spboyer left a comment

LGTM — Rusty. waza eval new is a clean scaffolding command. Good use of SKILL.md frontmatter parsing for positive/negative trigger generation. extractKeywords with stop words is smart. Tests validate generated YAML through validation.ValidateEvalBytes/ValidateTaskBytes — nice. README + CLI reference updated. Ship it. (Self-authored PR — cannot self-approve via API.)
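
For readers unfamiliar with the stop-word approach mentioned above, here is a rough sketch of the general technique. It is not the PR's `extractKeywords`: the function name `sketchExtractKeywords`, the stop-word list, and the tokenization rules are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// sketchStopWords is an illustrative stop-word list, not the one used in the PR.
var sketchStopWords = map[string]bool{
	"a": true, "an": true, "and": true, "for": true, "in": true,
	"is": true, "of": true, "on": true, "or": true, "the": true,
	"this": true, "to": true, "use": true, "when": true, "with": true,
}

// sketchExtractKeywords lowercases the text, splits it on anything that is
// not a letter or digit, drops stop words, and returns each remaining word
// once, in order of first appearance.
func sketchExtractKeywords(text string) []string {
	words := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	seen := make(map[string]bool)
	var keywords []string
	for _, w := range words {
		if sketchStopWords[w] || seen[w] {
			continue
		}
		seen[w] = true
		keywords = append(keywords, w)
	}
	return keywords
}

func main() {
	fmt.Println(sketchExtractKeywords("Deploy the app to Azure when asked to deploy"))
}
```

Keeping first-appearance order (rather than sorting) means the most prominent terms in a trigger description tend to come first, which may be useful when only a few keywords are used to seed starter tasks.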

@spboyer spboyer force-pushed the squad/83-eval-new branch from 79893e4 to 02fb261 Compare March 5, 2026 17:12
spboyer added a commit that referenced this pull request Mar 5, 2026
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 5, 2026 17:38
spboyer and others added 2 commits March 5, 2026 12:46
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@spboyer spboyer force-pushed the squad/83-eval-new branch from f016c5f to c00d9c3 Compare March 5, 2026 17:46
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Comment on lines +193 to +202
- `evals/[skill-name]/eval.yaml`
- `evals/[skill-name]/tasks/positive-trigger-1.yaml`
- `evals/[skill-name]/tasks/positive-trigger-2.yaml`
- `evals/[skill-name]/tasks/negative-trigger-1.yaml`

### Flags

| Flag | Description |
|------|-------------|
| `--output` | Custom path for generated `eval.yaml` |

Copilot AI Mar 5, 2026

The docs contain two issues that will likely render incorrectly / be confusing: (1) the paths use `[skill-name]` while the rest of the CLI docs use `<skill-name>`, and (2) the flags table rows start with `||`, which Markdown interprets as an extra empty column. Consider switching to `evals/<skill-name>/...` and using single leading pipes (`| Flag | Description |`). Also, it would help to note that when `--output` is set, task YAMLs are generated under the sibling `tasks/` directory next to the provided eval path.

Suggested change
- `evals/[skill-name]/eval.yaml`
- `evals/[skill-name]/tasks/positive-trigger-1.yaml`
- `evals/[skill-name]/tasks/positive-trigger-2.yaml`
- `evals/[skill-name]/tasks/negative-trigger-1.yaml`
### Flags
| Flag | Description |
|------|-------------|
| `--output` | Custom path for generated `eval.yaml` |
- `evals/<skill-name>/eval.yaml`
- `evals/<skill-name>/tasks/positive-trigger-1.yaml`
- `evals/<skill-name>/tasks/positive-trigger-2.yaml`
- `evals/<skill-name>/tasks/negative-trigger-1.yaml`
### Flags
| Flag | Description |
|------|-------------|
| `--output` | Custom path for generated `eval.yaml`; task YAMLs are created in a sibling `tasks/` directory next to this file. |

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

feat: Eval scaffolding command — waza eval new

4 participants