Add grade verb for grading run outcomes #102
Conversation
Pull request overview
Adds a `waza grade` command to re-run graders against previously saved `waza run --output` results, and introduces `waza run --skip-graders` to separate execution from grading.
Changes:
- Added `waza grade <eval.yaml>` command that reads prior run output JSON, grades tasks, prints a summary JSON, and can optionally emit a full `EvaluationOutcome`.
- Added `--skip-graders` flag to `waza run` and updated reporting (JUnit) to mark ungraded tasks as skipped.
- Refactored grading/stats/digest logic into shared helpers (`internal/graders.RunAll`, `ComputeTestStats`, `BuildDigest`, `RegradeOutcome`).
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| internal/reporting/junit.go | Marks StatusNA test cases as JUnit <skipped>. |
| internal/reporting/junit_test.go | Adds coverage for skipped JUnit test cases. |
| internal/orchestration/runner.go | Adds skip-graders option, refactors digest/stats, and reuses shared grader runner. |
| internal/orchestration/runner_orchestration_test.go | Adds coverage for --skip-graders behavior and updates tests after refactor. |
| internal/orchestration/outcome.go | Introduces shared stats/digest helpers and RegradeOutcome. |
| internal/orchestration/outcome_test.go | Adds unit tests for the new stats/digest/regrade helpers. |
| internal/orchestration/judge_model_test.go | Updates tests to target WithModel (but introduces a package mismatch). |
| internal/models/outcome.go | Updates StatusNA semantics to include “grading skipped”. |
| internal/graders/run.go | Extracts shared grading runner and model override helper. |
| cmd/waza/root.go | Registers the new grade subcommand. |
| cmd/waza/cmd_run.go | Adds --skip-graders flag wiring into the runner. |
| cmd/waza/cmd_grade.go | Implements the waza grade command and JSON summary/outcome output. |
| cmd/waza/cmd_grade_test.go | Adds comprehensive CLI-level test coverage for waza grade. |
| README.md | Documents waza grade usage and flags. |
Codecov Report
❌ Patch coverage — additional details and impacted files:
@@ Coverage Diff @@
## main #102 +/- ##
=======================================
Coverage ? 73.45%
=======================================
Files ? 138
Lines ? 15771
Branches ? 0
=======================================
Hits ? 11584
Misses ? 3346
Partials ? 841
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
richardpark-msft
left a comment
There's a lot here; it seems okay on the surface. Maybe a couple of changes and then it'll be fine.
This PR adds token limit checking functionality to the waza CLI and reorganizes testdata for proper test isolation. Closes microsoft#48, closes microsoft#49

## Changes

**New Features:**
- Added `waza tokens check` command to validate markdown files against token limits defined in `.token-limits.json`
- Supports both table and JSON output formats
- Includes `--strict` flag to fail automation when limits are exceeded by exiting 1
- Includes `--quiet` flag for silent validation (suppresses success output)
- Default token limits for common files (SKILL.md: 500, README.md: 3000, etc.)
- Supports glob patterns and exact file path overrides in configuration

**Implementation:**
- New `cmd/waza/tokens/check.go` with full check command implementation
- New `cmd/waza/tokens/internal/limits.go` parses `.token-limits.json`, implements limits, glob pattern matching, and pattern specificity scoring
- Comprehensive test suite in `check_test.go` covering all scenarios: passing limits, exceeded limits, strict mode, JSON output, configuration overrides, and default limits
- New `docs/TOKEN-LIMITS.md` with detailed configuration documentation

**Test Infrastructure:**
- Reorganized `cmd/waza/tokens/testdata/` into `check/` and `count/` subdirectories to maintain test isolation
- Created separate fixtures for each test scenario

## Example Output

```
$ waza tokens check
File                Tokens  Limit  Status
--------------------------------------------------
SKILL.md               402    500  ✅ OK
README.md               99    100  ✅ OK
references/spec.md     790    100  ❌ EXCEEDED
--------------------------------------------------
2/3 files within limits

⚠️ 1 file(s) exceed their token limits:
  references/spec.md: 790 tokens (690 over limit of 100)
```
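The exact-path-plus-glob override resolution described above can be sketched roughly as follows. This is a hedged illustration, not the actual `limits.go` code: using pattern length as the specificity score is an assumption, and the real scoring may differ.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolveLimit sketches the override order described above: an exact
// file-path entry wins over a glob pattern, and a glob wins over the
// built-in default. Pattern length stands in for a specificity score.
func resolveLimit(path string, exact map[string]int, globs map[string]int, def int) int {
	if n, ok := exact[path]; ok {
		return n
	}
	best, bestLen := def, -1
	for pat, n := range globs {
		if ok, _ := filepath.Match(pat, path); ok && len(pat) > bestLen {
			best, bestLen = n, len(pat) // longer pattern ≈ more specific
		}
	}
	return best
}

func main() {
	exact := map[string]int{"SKILL.md": 500}
	globs := map[string]int{"references/*.md": 100}
	fmt.Println(
		resolveLimit("SKILL.md", exact, globs, 3000),
		resolveLimit("references/spec.md", exact, globs, 3000),
		resolveLimit("README.md", exact, globs, 3000),
	)
}
```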
wbreza
left a comment
Code Review Summary
I've performed a thorough review of PR #102 adding the `waza grade` command. I recommend: APPROVE
What I Reviewed
- Full diff of all 14 changed files (1,570 additions, 432 deletions)
- New `waza grade` command implementation and 572 lines of tests
- Refactored grading logic (`internal/graders/run.go`, `internal/orchestration/outcome.go`)
- `--skip-graders` flag implementation for `waza run`
- Error handling, edge cases, and nil pointer safety
- Test coverage and execution (all tests pass)
Findings
No significant issues found. The implementation is solid:
✅ Proper error handling: All error paths are handled correctly with appropriate error messages
✅ Nil pointer safety: No nil dereference risks (e.g., `OutcomeDigest` is a struct value, not a pointer)
✅ Division by zero protection: Grader weight calculations are safe (weights default to 1.0 when <= 0)
✅ Comprehensive test coverage: 21 test cases covering edge cases like missing tasks, zero runs, and mixed pass/fail scenarios
✅ Backward compatibility: Refactored code maintains identical behavior (shared `ComputeTestStats` and `BuildDigest` ensure consistency)
✅ StatusNA handling: Properly implemented for skipped grading with correct propagation through digest/stats
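The division-by-zero guard noted above can be illustrated with a minimal sketch. The function names here are hypothetical, not the actual waza helpers; they only demonstrate the `w <= 0` fallback described in the review.

```go
package main

import "fmt"

// effectiveWeight mirrors the guard described above: a non-positive
// grader weight falls back to 1.0, so the weighted-average denominator
// can never be zero.
func effectiveWeight(w float64) float64 {
	if w <= 0 {
		return 1.0
	}
	return w
}

// weightedScore combines per-grader scores using the defaulted weights.
func weightedScore(scores, weights []float64) float64 {
	var sum, total float64
	for i, s := range scores {
		w := effectiveWeight(weights[i])
		sum += s * w
		total += w
	}
	if total == 0 {
		return 0
	}
	return sum / total
}

func main() {
	// Two graders whose weights were left unset (0): both default to 1.0,
	// so a pass (1) and a fail (0) average to 0.5.
	fmt.Println(weightedScore([]float64{1, 0}, []float64{0, 0}))
}
```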
Code Quality Observations
- Clean separation of concerns (grading logic extracted to shared helpers)
- Good test coverage including integration tests
- Proper validation of required flags and input data
- Clear error messages distinguish "task not in spec" vs "task not in results" vs "task has no runs"
- Preservation of unmodified tasks when using the `--output` flag
Reviewer Comments Addressed
All three comments from @richardpark-msft have been discussed and resolved:
- Output format naming (`--results` vs `--output`) — discussed and kept as-is with documentation
- `--skip-graders` requiring `--output` — clarified that it's optional; the valid use case is documented
- Flag naming consistency — discussed and accepted current approach
The implementation is production-ready.
wbreza
left a comment
Code Review: PR #102 — Add grade verb for grading run outcomes
✅ What Looks Good
- Excellent refactoring — grading logic cleanly extracted from `runner.go` into shared `graders.RunAll()`, eliminating ~237 lines of duplication
- Comprehensive test suite — 572 lines, 21 tests covering passing/failing/mixed/disabled/global/verbose/output scenarios
- Clean separation of concerns — `--skip-graders` + the `grade` command provide flexible execution-then-grading workflows
- No concurrency issues — proper context propagation, immutable parameter passing, sequential processing
🟠 High (2 findings)
1. StatusNA semantic expansion — `StatusNA` was broadened from "missing from comparison report" to also mean "grading was skipped." Consider adding a distinct `StatusSkipped` constant to avoid confusing downstream consumers (dashboards, `waza compare`, JUnit reports).
2. `internal/graders/run.go` — 9.43% test coverage — `RunAll()` is the core grading orchestrator extracted in this PR but lacks dedicated unit tests. It is covered indirectly through `cmd_grade_test.go`, but dedicated tests for error paths (grader creation/execution failures) and weight application would strengthen confidence.
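A minimal sketch of the `StatusSkipped` suggestion in finding 1. The constant values are illustrative; the real definitions live in `internal/models` and are not shown in this review.

```go
package main

import "fmt"

// Status distinguishes "grading was skipped" from "missing from the
// comparison report", per finding 1. Values are illustrative.
type Status string

const (
	StatusPass    Status = "pass"
	StatusFail    Status = "fail"
	StatusNA      Status = "n/a"     // missing from comparison report
	StatusSkipped Status = "skipped" // grading deliberately skipped
)

// junitElement shows how a reporter could still map both statuses to a
// JUnit <skipped> element while other consumers see two distinct values.
func junitElement(s Status) string {
	switch s {
	case StatusNA, StatusSkipped:
		return "<skipped/>"
	default:
		return ""
	}
}

func main() {
	fmt.Println(junitElement(StatusSkipped), junitElement(StatusNA))
}
```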
🟡 Medium (4 findings)
3. No timeout for grade graders — prompt graders make API calls, but no timeout context is applied. Consider using `spec.Config.TimeoutSec`.
4. Workspace path not validated — `--workspace` is accepted without checking whether the directory exists. An `os.Stat` check would prevent cryptic grader errors.
5. Inconsistent weight handling — spec graders use `EffectiveWeight()` but task validators use a manual `if w <= 0 { w = 1.0 }`. Consider a consistent pattern.
6. `--output` help text underspecified — it doesn't explain the merged behavior when `--task` is used (graded + unchanged original tasks preserved).
🟢 Low (2 findings)
7. Use a streaming JSON decoder — `os.ReadFile` + `json.Unmarshal` loads the entire file; `json.NewDecoder` would be more memory-efficient for large results.
8. Verbose logging inconsistency — one `Fprintf` discards its error with `_, _` while others check and return.
📌 Follow-up: CLI Flag Standardization
This PR highlights a broader consistency question across the waza CLI: we use `--results` (input file), `--output` (output file), and `-o` inconsistently across commands. We should define a consistent flag model for:
- Input/output file flags — standardize naming for flags that accept file paths
- Output formats — commands that support both pretty UX and JSON-based output should use a consistent pattern (e.g., `--format json` vs `--json`)
Not blocking for this PR, but worth tracking as a follow-up.
Summary
| Priority | Count |
|---|---|
| Critical | 0 |
| High | 2 |
| Medium | 4 |
| Low | 2 |
Overall Assessment: Comment — solid, well-tested feature. High findings are design suggestions, not blockers.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This separates eval execution and result grading:
- `waza run --skip-graders` runs evals but doesn't grade them
- `waza grade` runs graders against saved `waza run` output

Usage
Changes
New command: `waza grade <eval.yaml>` (cmd/waza/cmd_grade.go)
- `--results` (required) — path to `waza run --output` JSON
- `--task` — grade a single task by ID
- `--workspace` — workspace dir for file-based graders
- `--judge-model` — override model for prompt graders
- `--output`/`-o` — write full `EvaluationOutcome` JSON (compatible with `waza compare`)

New flag: `waza run --skip-graders` (cmd/waza/cmd_run.go)
- ungraded tasks get `status: "n/a"` (`StatusNA`); `Digest.Skipped` is incremented
- JUnit reports mark these tasks with `<skipped>` elements

Refactored grading into shared code:
- `internal/graders.RunAll` — extracted from `TestRunner.runGraders`, used by both `waza run` and `waza grade`
- `internal/graders.WithModel` — extracted from `injectJudgeModel`, copies params with model override
- `internal/orchestration/outcome.go` — extracted `ComputeTestStats`, `BuildDigest`, and digest helpers from `runner.go`; `waza grade --output` calls `RegradeOutcome`, which uses the same `BuildDigest` + `ComputeTestStats` as `waza run`, ensuring identical `EvaluationOutcome` structure
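The `WithModel` behavior described above — copy the grader params with a model override — might look roughly like this sketch. The map-based params shape and the `"model"` key are assumptions; the real signature in `internal/graders` is not shown in this summary.

```go
package main

import "fmt"

// withModel returns a copy of the grader params with only the model
// overridden, leaving the original map untouched (illustrative sketch).
func withModel(params map[string]any, model string) map[string]any {
	out := make(map[string]any, len(params)+1)
	for k, v := range params {
		out[k] = v
	}
	out["model"] = model
	return out
}

func main() {
	orig := map[string]any{"model": "base-model", "temperature": 0.2}
	overridden := withModel(orig, "judge-model")
	// The original is untouched; only the copy carries the judge model.
	fmt.Println(orig["model"], overridden["model"], overridden["temperature"])
}
```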