Skip to content

feat: Add eval coverage grid generator#92

Open
spboyer wants to merge 2 commits intomicrosoft:mainfrom
spboyer:squad/82-eval-coverage-grid
Open

feat: Add eval coverage grid generator#92
spboyer wants to merge 2 commits intomicrosoft:mainfrom
spboyer:squad/82-eval-coverage-grid

Conversation

@spboyer
Copy link
Member

@spboyer spboyer commented Mar 5, 2026

Closes #82

Copilot AI review requested due to automatic review settings March 5, 2026 01:56
@spboyer spboyer self-assigned this Mar 5, 2026
@github-actions github-actions bot enabled auto-merge (squash) March 5, 2026 01:57
@codecov-commenter
Copy link

codecov-commenter commented Mar 5, 2026

Codecov Report

❌ Patch coverage is 66.11570% with 82 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@a75477e). Learn more about missing BASE report.

Files with missing lines Patch % Lines
cmd/waza/cmd_coverage.go 65.97% 62 Missing and 20 partials ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main      #92   +/-   ##
=======================================
  Coverage        ?   72.11%           
=======================================
  Files           ?      129           
  Lines           ?    14494           
  Branches        ?        0           
=======================================
  Hits            ?    10452           
  Misses          ?     3258           
  Partials        ?      784           
Flag Coverage Δ
go-implementation 72.11% <66.11%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new waza coverage CLI command to generate an eval-coverage “grid” across discovered skills, with supporting docs and tests.

Changes:

  • Introduces waza coverage [root] with text, markdown, and json output.
  • Implements skill/eval discovery and a coverage classification (none/partial/full) based on tasks + grader types.
  • Updates CLI reference docs and README, and adds unit tests for report building + markdown rendering.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
site/src/content/docs/reference/cli.mdx Documents the new waza coverage command, flags, and examples.
cmd/waza/root.go Registers the new coverage subcommand on the root CLI.
cmd/waza/cmd_coverage.go Implements discovery, report generation, and output renderers (text/markdown/json).
cmd/waza/cmd_coverage_test.go Adds unit tests for report classification, markdown output, and command registration.
README.md Adds waza coverage usage example and CLI reference entry.
.squad/log/2026-03-05T00-36-issue-assignment-pipeline.md Adds team process log (non-functional).
.squad/log/2026-03-05T00-26-rusty-token-diff-design.md Adds team process log (non-functional).
.squad/decisions.md Records team workflow/design decisions (non-functional).

Comment on lines +115 to +119
for _, evalPath := range evalPaths {
spec, parseErr := parseEvalSpec(evalPath)
if parseErr != nil {
continue
}
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buildCoverageReport silently ignores eval.yaml parsing failures (parseEvalSpec errors are dropped with continue). This can misreport skills as uncovered/partial when the eval file exists but is malformed or uses unsupported fields. Consider returning an error (or at least collecting these parse errors and surfacing a warning/summary to stderr and/or in the JSON report) so coverage results remain trustworthy.

Copilot uses AI. Check for mistakes.
Comment on lines +258 to +262
if d.Name() == "eval.yaml" {
absPath, _ := filepath.Abs(path)
candidates[absPath] = struct{}{}
}
return nil
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This command only discovers eval.yaml files, but other parts of the repo treat both eval.yaml and eval.yml as valid (e.g., init path detection). To avoid undercounting coverage, include eval.yml in discovery (both the directory walk and the per-skill candidate list).

Copilot uses AI. Check for mistakes.
Comment on lines +84 to +86
cmd.Flags().StringVarP(&outputFormat, "format", "f", "text", "Output format: text, markdown, or json")
cmd.Flags().StringArrayVar(&discoverPaths, "discover", nil, "Additional directories to scan for skills/evals (repeatable)")
return cmd
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--discover is defined here as a repeatable string array of additional search roots, but waza run --discover already exists as a boolean flag with different semantics. Reusing the same flag name with a different type/meaning across commands is likely to confuse users and makes CLI behavior less consistent; consider renaming this flag (e.g., --path/--search-root) or aligning semantics with the existing --discover behavior.

Copilot uses AI. Check for mistakes.
Copy link
Member Author

@spboyer spboyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified by Rusty (Opus 4.6) — LGTM ✅

Clean eval coverage grid generator:

  • New \waza coverage\ command with text/markdown/json output
  • Smart skill/eval discovery with deduplication, hidden dir skipping
  • Coverage classification (Full/Partial/None) is conservative and correct
  • Tests cover no-eval, partial/full, markdown rendering, root command integration
  • Docs updated: README, CLI reference
  • CI green on ubuntu + windows + lint

Minor: \�valSpecLite.Tasks\ is []string\ while real eval YAML has structured task objects — means task count defaults to 0, showing Partial instead of Full for real evals. Conservative for a reporting tool. Worth fixing to []any\ in a follow-up.

Note: Can't self-approve via API. Setting auto-merge.

@spboyer spboyer force-pushed the squad/82-eval-coverage-grid branch from 40b1f4c to e1365f2 Compare March 5, 2026 17:12
spboyer added a commit that referenced this pull request Mar 5, 2026
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 5, 2026 17:42
spboyer and others added 2 commits March 5, 2026 12:46
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@spboyer spboyer force-pushed the squad/82-eval-coverage-grid branch from a6ca52d to f3aa0c4 Compare March 5, 2026 17:46
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.

| Flag | Short | Description |
|------|-------|-------------|
| `--format <fmt>` | `-f` | Output format: `text`, `markdown`, or `json` (default: `text`) |
| `--discover <dir>` | | Additional directory to scan for skills/evals (repeatable) |
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The README documents a --discover flag, but the implementation registers --path (repeatable). Please align the docs with the actual CLI (either rename the flag in code or update the README entry to --path <dir>).

Suggested change
| `--discover <dir>` | | Additional directory to scan for skills/evals (repeatable) |
| `--path <dir>` | | Additional directory to scan for skills/evals (repeatable) |

Copilot uses AI. Check for mistakes.
Comment on lines +219 to +223
absPath, _ := filepath.Abs(path)
if _, ok := seenPaths[absPath]; ok {
return nil
}
seenPaths[absPath] = struct{}{}
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These filepath.Abs(...) calls ignore errors (_ :=). Since discovery correctness depends on the resulting canonical path, handle the error (and either return it or fall back to filepath.Clean(path)), or remove the Abs call if inputs are already absolute (they appear to be, given WalkDir roots are built from absRoot).

Copilot uses AI. Check for mistakes.
Comment on lines +265 to +266
absPath, _ := filepath.Abs(path)
candidates[absPath] = struct{}{}
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These filepath.Abs(...) calls ignore errors (_ :=). Since discovery correctness depends on the resulting canonical path, handle the error (and either return it or fall back to filepath.Clean(path)), or remove the Abs call if inputs are already absolute (they appear to be, given WalkDir roots are built from absRoot).

Copilot uses AI. Check for mistakes.
Comment on lines +283 to +284
absPath, _ := filepath.Abs(p)
candidates[absPath] = struct{}{}
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These filepath.Abs(...) calls ignore errors (_ :=). Since discovery correctness depends on the resulting canonical path, handle the error (and either return it or fall back to filepath.Clean(path)), or remove the Abs call if inputs are already absolute (they appear to be, given WalkDir roots are built from absRoot).

Copilot uses AI. Check for mistakes.
}
err := filepath.WalkDir(sr, func(path string, d fs.DirEntry, err error) error {
if err != nil {
return nil
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The WalkDir callbacks swallow filesystem traversal errors by returning nil. This can silently produce incomplete/incorrect coverage results (e.g., permission errors causing missed skills/evals) with no signal to the user. Prefer returning the err (possibly wrapped with the path) or collecting and reporting traversal failures similarly to parseFailures.

Suggested change
return nil
return fmt.Errorf("error walking %s: %w", path, err)

Copilot uses AI. Check for mistakes.
}
if err := filepath.WalkDir(evalRoot, func(path string, d fs.DirEntry, err error) error {
if err != nil {
return nil
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The WalkDir callbacks swallow filesystem traversal errors by returning nil. This can silently produce incomplete/incorrect coverage results (e.g., permission errors causing missed skills/evals) with no signal to the user. Prefer returning the err (possibly wrapped with the path) or collecting and reporting traversal failures similarly to parseFailures.

Suggested change
return nil
return fmt.Errorf("error walking %s: %w", path, err)

Copilot uses AI. Check for mistakes.
Comment on lines +68 to +79
switch outputFormat {
case "text":
renderCoverageText(cmd.OutOrStdout(), report)
case "markdown":
renderCoverageMarkdown(cmd.OutOrStdout(), report)
case "json":
if err := renderCoverageJSON(cmd.OutOrStdout(), report); err != nil {
return err
}
default:
return fmt.Errorf("unsupported format %q: must be text, markdown, or json", outputFormat)
}
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are tests for report building and markdown rendering, but no tests for (a) JSON rendering output shape/indentation and (b) unsupported --format error behavior. Adding targeted tests for these paths would help prevent regressions in this new user-facing command.

Copilot uses AI. Check for mistakes.
Comment on lines +361 to +365
func renderCoverageJSON(w io.Writer, report *coverageReport) error {
enc := json.NewEncoder(w)
enc.SetIndent("", " ")
return enc.Encode(report)
}
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are tests for report building and markdown rendering, but no tests for (a) JSON rendering output shape/indentation and (b) unsupported --format error behavior. Adding targeted tests for these paths would help prevent regressions in this new user-facing command.

Copilot uses AI. Check for mistakes.

| Flag | Description |
|------|-------------|
| `--format` | Output format: `text` (default), `markdown`, `json` |
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CLI docs list --format but omit the -f shorthand that’s supported by the command. Consider updating the flag table to include the short form for consistency with other CLI docs and to match the actual UX.

Suggested change
| `--format` | Output format: `text` (default), `markdown`, `json` |
| `-f, --format` | Output format: `text` (default), `markdown`, `json` |

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Eval coverage grid generator

3 participants