Add grade verb for grading run outcomes #102
Conversation
Pull request overview
Adds a `waza grade` command to re-run graders against previously saved `waza run --output` results, and introduces `waza run --skip-graders` to separate execution from grading.
Changes:
- Added `waza grade <eval.yaml>` command that reads prior run output JSON, grades tasks, prints a summary JSON, and can optionally emit a full `EvaluationOutcome`.
- Added `--skip-graders` flag to `waza run` and updated reporting (JUnit) to mark ungraded tasks as skipped.
- Refactored grading/stats/digest logic into shared helpers (`internal/graders.RunAll`, `ComputeTestStats`, `BuildDigest`, `RegradeOutcome`).
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| internal/reporting/junit.go | Marks StatusNA test cases as JUnit <skipped>. |
| internal/reporting/junit_test.go | Adds coverage for skipped JUnit test cases. |
| internal/orchestration/runner.go | Adds skip-graders option, refactors digest/stats, and reuses shared grader runner. |
| internal/orchestration/runner_orchestration_test.go | Adds coverage for --skip-graders behavior and updates tests after refactor. |
| internal/orchestration/outcome.go | Introduces shared stats/digest helpers and RegradeOutcome. |
| internal/orchestration/outcome_test.go | Adds unit tests for the new stats/digest/regrade helpers. |
| internal/orchestration/judge_model_test.go | Updates tests to target WithModel (but introduces a package mismatch). |
| internal/models/outcome.go | Updates StatusNA semantics to include “grading skipped”. |
| internal/graders/run.go | Extracts shared grading runner and model override helper. |
| cmd/waza/root.go | Registers the new grade subcommand. |
| cmd/waza/cmd_run.go | Adds --skip-graders flag wiring into the runner. |
| cmd/waza/cmd_grade.go | Implements the waza grade command and JSON summary/outcome output. |
| cmd/waza/cmd_grade_test.go | Adds comprehensive CLI-level test coverage for waza grade. |
| README.md | Documents waza grade usage and flags. |
Codecov Report
❌ Patch coverage — additional details and impacted files:
@@ Coverage Diff @@
## main #102 +/- ##
=======================================
Coverage ? 73.45%
=======================================
Files ? 138
Lines ? 15771
Branches ? 0
=======================================
Hits ? 11584
Misses ? 3346
Partials ? 841
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
richardpark-msft
left a comment
There's a lot here; it seems okay on the surface. Maybe a couple of changes and then it'll be fine.
This PR adds token limit checking functionality to the waza CLI and reorganizes testdata for proper test isolation. Closes microsoft#48, closes microsoft#49

## Changes

**New Features:**
- Added `waza tokens check` command to validate markdown files against token limits defined in `.token-limits.json`
- Supports both table and JSON output formats
- Includes `--strict` flag to fail automation when limits are exceeded by exiting 1
- Includes `--quiet` flag for silent validation (suppresses success output)
- Default token limits for common files (SKILL.md: 500, README.md: 3000, etc.)
- Supports glob patterns and exact file path overrides in configuration

**Implementation:**
- New `cmd/waza/tokens/check.go` with full check command implementation
- New `cmd/waza/tokens/internal/limits.go` parses `.token-limits.json`, implements limits, glob pattern matching, and pattern specificity scoring
- Comprehensive test suite in `check_test.go` covering all scenarios: passing limits, exceeded limits, strict mode, JSON output, configuration overrides, and default limits
- New `docs/TOKEN-LIMITS.md` with detailed configuration documentation

**Test Infrastructure:**
- Reorganized `cmd/waza/tokens/testdata/` into `check/` and `count/` subdirectories to maintain test isolation
- Created separate fixtures for each test scenario

## Example Output

```
$ waza tokens check
File                Tokens  Limit  Status
--------------------------------------------------
SKILL.md               402    500  ✅ OK
README.md               99    100  ✅ OK
references/spec.md     790    100  ❌ EXCEEDED
--------------------------------------------------
2/3 files within limits

⚠️ 1 file(s) exceed their token limits:
  references/spec.md: 790 tokens (690 over limit of 100)
```
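The exact-path-plus-glob override resolution described above can be sketched roughly as follows. This is a hedged illustration, not the actual `limits.go` code: using pattern length as the specificity score is an assumption, and the real scoring may differ.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolveLimit sketches the override order described above: an exact
// file-path entry wins over a glob pattern, and a glob wins over the
// built-in default. Pattern length stands in for a specificity score.
func resolveLimit(path string, exact map[string]int, globs map[string]int, def int) int {
	if n, ok := exact[path]; ok {
		return n
	}
	best, bestLen := def, -1
	for pat, n := range globs {
		if ok, _ := filepath.Match(pat, path); ok && len(pat) > bestLen {
			best, bestLen = n, len(pat) // longer pattern ≈ more specific
		}
	}
	return best
}

func main() {
	exact := map[string]int{"SKILL.md": 500}
	globs := map[string]int{"references/*.md": 100}
	fmt.Println(
		resolveLimit("SKILL.md", exact, globs, 3000),
		resolveLimit("references/spec.md", exact, globs, 3000),
		resolveLimit("README.md", exact, globs, 3000),
	)
}
```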
wbreza
left a comment
Code Review Summary
I've performed a thorough review of PR #102 adding the `waza grade` command. I recommend: APPROVE
What I Reviewed
- Full diff of all 14 changed files (1,570 additions, 432 deletions)
- New `waza grade` command implementation and 572 lines of tests
- Refactored grading logic (`internal/graders/run.go`, `internal/orchestration/outcome.go`)
- `--skip-graders` flag implementation for `waza run`
- Error handling, edge cases, and nil pointer safety
- Test coverage and execution (all tests pass)
Findings
No significant issues found. The implementation is solid:
✅ Proper error handling: All error paths are handled correctly with appropriate error messages
✅ Nil pointer safety: No nil dereference risks (e.g., `OutcomeDigest` is a struct value, not a pointer)
✅ Division by zero protection: Grader weight calculations are safe (weights default to 1.0 when <= 0)
✅ Comprehensive test coverage: 21 test cases covering edge cases like missing tasks, zero runs, and mixed pass/fail scenarios
✅ Backward compatibility: Refactored code maintains identical behavior (shared `ComputeTestStats` and `BuildDigest` ensure consistency)
✅ StatusNA handling: Properly implemented for skipped grading with correct propagation through digest/stats
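The division-by-zero guard noted above can be illustrated with a minimal sketch. The function names here are hypothetical, not the actual waza helpers; they only demonstrate the `w <= 0` fallback described in the review.

```go
package main

import "fmt"

// effectiveWeight mirrors the guard described above: a non-positive
// grader weight falls back to 1.0, so the weighted-average denominator
// can never be zero.
func effectiveWeight(w float64) float64 {
	if w <= 0 {
		return 1.0
	}
	return w
}

// weightedScore combines per-grader scores using the defaulted weights.
func weightedScore(scores, weights []float64) float64 {
	var sum, total float64
	for i, s := range scores {
		w := effectiveWeight(weights[i])
		sum += s * w
		total += w
	}
	if total == 0 {
		return 0
	}
	return sum / total
}

func main() {
	// Two graders whose weights were left unset (0): both default to 1.0,
	// so a pass (1) and a fail (0) average to 0.5.
	fmt.Println(weightedScore([]float64{1, 0}, []float64{0, 0}))
}
```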
Code Quality Observations
- Clean separation of concerns (grading logic extracted to shared helpers)
- Good test coverage including integration tests
- Proper validation of required flags and input data
- Clear error messages distinguish "task not in spec" vs "task not in results" vs "task has no runs"
- Preservation of unmodified tasks when using the `--output` flag
Reviewer Comments Addressed
All three comments from @richardpark-msft have been discussed and resolved:
- Output format naming (`--results` vs `--output`) — discussed and kept as-is with documentation
- `--skip-graders` requiring `--output` — clarified that it's optional; the valid use case is documented
- Flag naming consistency — discussed and accepted current approach
The implementation is production-ready.
wbreza
left a comment
Code Review: PR #102 — Add grade verb for grading run outcomes
✅ What Looks Good
- Excellent refactoring — grading logic cleanly extracted from `runner.go` into shared `graders.RunAll()`, eliminating ~237 lines of duplication
- Comprehensive test suite — 572 lines, 21 tests covering passing/failing/mixed/disabled/global/verbose/output scenarios
- Clean separation of concerns — `--skip-graders` + the `grade` command provide flexible execution-then-grading workflows
- No concurrency issues — proper context propagation, immutable parameter passing, sequential processing
🟠 High (2 findings)
1. StatusNA semantic expansion — `StatusNA` was broadened from "missing from comparison report" to also mean "grading was skipped." Consider adding a distinct `StatusSkipped` constant to avoid confusing downstream consumers (dashboards, `waza compare`, JUnit reports).
2. `internal/graders/run.go` — 9.43% test coverage — `RunAll()` is the core grading orchestrator extracted in this PR but lacks dedicated unit tests. It is covered indirectly through `cmd_grade_test.go`, but dedicated tests for error paths (grader creation/execution failures) and weight application would strengthen confidence.
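A minimal sketch of the `StatusSkipped` suggestion in finding 1. The constant values are illustrative; the real definitions live in `internal/models` and are not shown in this review.

```go
package main

import "fmt"

// Status distinguishes "grading was skipped" from "missing from the
// comparison report", per finding 1. Values are illustrative.
type Status string

const (
	StatusPass    Status = "pass"
	StatusFail    Status = "fail"
	StatusNA      Status = "n/a"     // missing from comparison report
	StatusSkipped Status = "skipped" // grading deliberately skipped
)

// junitElement shows how a reporter could still map both statuses to a
// JUnit <skipped> element while other consumers see two distinct values.
func junitElement(s Status) string {
	switch s {
	case StatusNA, StatusSkipped:
		return "<skipped/>"
	default:
		return ""
	}
}

func main() {
	fmt.Println(junitElement(StatusSkipped), junitElement(StatusNA))
}
```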
🟡 Medium (4 findings)
3. No timeout for grade graders — prompt graders make API calls, but no timeout context is applied. Consider using `spec.Config.TimeoutSec`.
4. Workspace path not validated — `--workspace` is accepted without checking whether the directory exists. An `os.Stat` check would prevent cryptic grader errors.
5. Inconsistent weight handling — spec graders use `EffectiveWeight()` but task validators use a manual `if w <= 0 { w = 1.0 }`. Consider a consistent pattern.
6. `--output` help text underspecified — it doesn't explain the merged behavior when `--task` is used (graded + unchanged original tasks preserved).
🟢 Low (2 findings)
7. Use a streaming JSON decoder — `os.ReadFile` + `json.Unmarshal` loads the entire file; `json.NewDecoder` would be more memory-efficient for large results.
8. Verbose logging inconsistency — one `Fprintf` discards its error with `_, _` while others check and return.
📌 Follow-up: CLI Flag Standardization
This PR highlights a broader consistency question across the waza CLI: we use `--results` (input file), `--output` (output file), and `-o` inconsistently across commands. We should define a consistent flag model for:
- Input/output file flags — standardize naming for flags that accept file paths
- Output formats — commands that support both pretty UX and JSON-based output should use a consistent pattern (e.g., `--format json` vs `--json`)
Not blocking for this PR, but worth tracking as a follow-up.
Summary
| Priority | Count |
|---|---|
| Critical | 0 |
| High | 2 |
| Medium | 4 |
| Low | 2 |
Overall Assessment: Comment — solid, well-tested feature. High findings are design suggestions, not blockers.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This separates eval execution and result grading:
- `waza run --skip-graders` runs evals but doesn't grade them
- `waza grade` runs graders against saved `waza run` output

Usage
Changes
New command: `waza grade <eval.yaml>` (cmd/waza/cmd_grade.go)
- `--results` (required) — path to `waza run --output` JSON
- `--task` — grade a single task by ID
- `--workspace` — workspace dir for file-based graders
- `--judge-model` — override model for prompt graders
- `--output`/`-o` — write full `EvaluationOutcome` JSON (compatible with `waza compare`)

New flag: `waza run --skip-graders` (cmd/waza/cmd_run.go)
- ungraded tasks get `status: "n/a"` (`StatusNA`); `Digest.Skipped` is incremented
- JUnit reports mark these tasks with `<skipped>` elements

Refactored grading into shared code:
- `internal/graders.RunAll` — extracted from `TestRunner.runGraders`, used by both `waza run` and `waza grade`
- `internal/graders.WithModel` — extracted from `injectJudgeModel`, copies params with model override
- `internal/orchestration/outcome.go` — extracted `ComputeTestStats`, `BuildDigest`, and digest helpers from `runner.go`; `waza grade --output` calls `RegradeOutcome`, which uses the same `BuildDigest` + `ComputeTestStats` as `waza run`, ensuring identical `EvaluationOutcome` structure
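The `WithModel` behavior described above — copy the grader params with a model override — might look roughly like this sketch. The map-based params shape and the `"model"` key are assumptions; the real signature in `internal/graders` is not shown in this summary.

```go
package main

import "fmt"

// withModel returns a copy of the grader params with only the model
// overridden, leaving the original map untouched (illustrative sketch).
func withModel(params map[string]any, model string) map[string]any {
	out := make(map[string]any, len(params)+1)
	for k, v := range params {
		out[k] = v
	}
	out["model"] = model
	return out
}

func main() {
	orig := map[string]any{"model": "base-model", "temperature": 0.2}
	overridden := withModel(orig, "judge-model")
	// The original is untouched; only the copy carries the judge model.
	fmt.Println(orig["model"], overridden["model"], overridden["temperature"])
}
```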