feat: Add eval coverage grid generator#92
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #92 +/- ##
=======================================
Coverage ? 72.11%
=======================================
Files ? 129
Lines ? 14494
Branches ? 0
=======================================
Hits ? 10452
Misses ? 3258
Partials ? 784
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Adds a new waza coverage CLI command to generate an eval-coverage “grid” across discovered skills, with supporting docs and tests.
Changes:
- Introduces
waza coverage [root]withtext,markdown, andjsonoutput. - Implements skill/eval discovery and a coverage classification (none/partial/full) based on tasks + grader types.
- Updates CLI reference docs and README, and adds unit tests for report building + markdown rendering.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| site/src/content/docs/reference/cli.mdx | Documents the new waza coverage command, flags, and examples. |
| cmd/waza/root.go | Registers the new coverage subcommand on the root CLI. |
| cmd/waza/cmd_coverage.go | Implements discovery, report generation, and output renderers (text/markdown/json). |
| cmd/waza/cmd_coverage_test.go | Adds unit tests for report classification, markdown output, and command registration. |
| README.md | Adds waza coverage usage example and CLI reference entry. |
| .squad/log/2026-03-05T00-36-issue-assignment-pipeline.md | Adds team process log (non-functional). |
| .squad/log/2026-03-05T00-26-rusty-token-diff-design.md | Adds team process log (non-functional). |
| .squad/decisions.md | Records team workflow/design decisions (non-functional). |
| for _, evalPath := range evalPaths { | ||
| spec, parseErr := parseEvalSpec(evalPath) | ||
| if parseErr != nil { | ||
| continue | ||
| } |
There was a problem hiding this comment.
buildCoverageReport silently ignores eval.yaml parsing failures (parseEvalSpec errors are dropped with continue). This can misreport skills as uncovered/partial when the eval file exists but is malformed or uses unsupported fields. Consider returning an error (or at least collecting these parse errors and surfacing a warning/summary to stderr and/or in the JSON report) so coverage results remain trustworthy.
cmd/waza/cmd_coverage.go
Outdated
| if d.Name() == "eval.yaml" { | ||
| absPath, _ := filepath.Abs(path) | ||
| candidates[absPath] = struct{}{} | ||
| } | ||
| return nil |
There was a problem hiding this comment.
This command only discovers eval.yaml files, but other parts of the repo treat both eval.yaml and eval.yml as valid (e.g., init path detection). To avoid undercounting coverage, include eval.yml in discovery (both the directory walk and the per-skill candidate list).
| cmd.Flags().StringVarP(&outputFormat, "format", "f", "text", "Output format: text, markdown, or json") | ||
| cmd.Flags().StringArrayVar(&discoverPaths, "discover", nil, "Additional directories to scan for skills/evals (repeatable)") | ||
| return cmd |
There was a problem hiding this comment.
--discover is defined here as a repeatable string array of additional search roots, but waza run --discover already exists as a boolean flag with different semantics. Reusing the same flag name with a different type/meaning across commands is likely to confuse users and makes CLI behavior less consistent; consider renaming this flag (e.g., --path/--search-root) or aligning semantics with the existing --discover behavior.
spboyer
left a comment
There was a problem hiding this comment.
Verified by Rusty (Opus 4.6) — LGTM ✅
Clean eval coverage grid generator:
- New \waza coverage\ command with text/markdown/json output
- Smart skill/eval discovery with deduplication, hidden dir skipping
- Coverage classification (Full/Partial/None) is conservative and correct
- Tests cover no-eval, partial/full, markdown rendering, root command integration
- Docs updated: README, CLI reference
- CI green on ubuntu + windows + lint
Minor: \�valSpecLite.Tasks\ is []string\ while real eval YAML has structured task objects — means task count defaults to 0, showing Partial instead of Full for real evals. Conservative for a reporting tool. Worth fixing to []any\ in a follow-up.
Note: Can't self-approve via API. Setting auto-merge.
40b1f4c to
e1365f2
Compare
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
a6ca52d to
f3aa0c4
Compare
| | Flag | Short | Description | | ||
| |------|-------|-------------| | ||
| | `--format <fmt>` | `-f` | Output format: `text`, `markdown`, or `json` (default: `text`) | | ||
| | `--discover <dir>` | | Additional directory to scan for skills/evals (repeatable) | |
There was a problem hiding this comment.
The README documents a --discover flag, but the implementation registers --path (repeatable). Please align the docs with the actual CLI (either rename the flag in code or update the README entry to --path <dir>).
| | `--discover <dir>` | | Additional directory to scan for skills/evals (repeatable) | | |
| | `--path <dir>` | | Additional directory to scan for skills/evals (repeatable) | |
| absPath, _ := filepath.Abs(path) | ||
| if _, ok := seenPaths[absPath]; ok { | ||
| return nil | ||
| } | ||
| seenPaths[absPath] = struct{}{} |
There was a problem hiding this comment.
These filepath.Abs(...) calls ignore errors (_ :=). Since discovery correctness depends on the resulting canonical path, handle the error (and either return it or fall back to filepath.Clean(path)), or remove the Abs call if inputs are already absolute (they appear to be, given WalkDir roots are built from absRoot).
| absPath, _ := filepath.Abs(path) | ||
| candidates[absPath] = struct{}{} |
There was a problem hiding this comment.
These filepath.Abs(...) calls ignore errors (_ :=). Since discovery correctness depends on the resulting canonical path, handle the error (and either return it or fall back to filepath.Clean(path)), or remove the Abs call if inputs are already absolute (they appear to be, given WalkDir roots are built from absRoot).
| absPath, _ := filepath.Abs(p) | ||
| candidates[absPath] = struct{}{} |
There was a problem hiding this comment.
These filepath.Abs(...) calls ignore errors (_ :=). Since discovery correctness depends on the resulting canonical path, handle the error (and either return it or fall back to filepath.Clean(path)), or remove the Abs call if inputs are already absolute (they appear to be, given WalkDir roots are built from absRoot).
| } | ||
| err := filepath.WalkDir(sr, func(path string, d fs.DirEntry, err error) error { | ||
| if err != nil { | ||
| return nil |
There was a problem hiding this comment.
The WalkDir callbacks swallow filesystem traversal errors by returning nil. This can silently produce incomplete/incorrect coverage results (e.g., permission errors causing missed skills/evals) with no signal to the user. Prefer returning the err (possibly wrapped with the path) or collecting and reporting traversal failures similarly to parseFailures.
| return nil | |
| return fmt.Errorf("error walking %s: %w", path, err) |
| } | ||
| if err := filepath.WalkDir(evalRoot, func(path string, d fs.DirEntry, err error) error { | ||
| if err != nil { | ||
| return nil |
There was a problem hiding this comment.
The WalkDir callbacks swallow filesystem traversal errors by returning nil. This can silently produce incomplete/incorrect coverage results (e.g., permission errors causing missed skills/evals) with no signal to the user. Prefer returning the err (possibly wrapped with the path) or collecting and reporting traversal failures similarly to parseFailures.
| return nil | |
| return fmt.Errorf("error walking %s: %w", path, err) |
| switch outputFormat { | ||
| case "text": | ||
| renderCoverageText(cmd.OutOrStdout(), report) | ||
| case "markdown": | ||
| renderCoverageMarkdown(cmd.OutOrStdout(), report) | ||
| case "json": | ||
| if err := renderCoverageJSON(cmd.OutOrStdout(), report); err != nil { | ||
| return err | ||
| } | ||
| default: | ||
| return fmt.Errorf("unsupported format %q: must be text, markdown, or json", outputFormat) | ||
| } |
There was a problem hiding this comment.
There are tests for report building and markdown rendering, but no tests for (a) JSON rendering output shape/indentation and (b) unsupported --format error behavior. Adding targeted tests for these paths would help prevent regressions in this new user-facing command.
| func renderCoverageJSON(w io.Writer, report *coverageReport) error { | ||
| enc := json.NewEncoder(w) | ||
| enc.SetIndent("", " ") | ||
| return enc.Encode(report) | ||
| } |
There was a problem hiding this comment.
There are tests for report building and markdown rendering, but no tests for (a) JSON rendering output shape/indentation and (b) unsupported --format error behavior. Adding targeted tests for these paths would help prevent regressions in this new user-facing command.
|
|
||
| | Flag | Description | | ||
| |------|-------------| | ||
| | `--format` | Output format: `text` (default), `markdown`, `json` | |
There was a problem hiding this comment.
The CLI docs list --format but omit the -f shorthand that’s supported by the command. Consider updating the flag table to include the short form for consistency with other CLI docs and to match the actual UX.
| | `--format` | Output format: `text` (default), `markdown`, `json` | | |
| | `-f, --format` | Output format: `text` (default), `markdown`, `json` | |
Closes #82