Skip to content

Add grade verb for grading run outcomes#102

Merged
chlowell merged 14 commits intomicrosoft:mainfrom
chlowell:grade
Mar 13, 2026
Merged

Add grade verb for grading run outcomes#102
chlowell merged 14 commits intomicrosoft:mainfrom
chlowell:grade

Conversation

@chlowell
Copy link
Member

@chlowell chlowell commented Mar 9, 2026

This separates eval execution and result grading:

  • waza run --skip-graders runs evals but doesn't grade them
  • waza grade runs graders against saved waza run output

Usage

# Run but skip grading
waza run eval.yaml --skip-graders --output results.json

# Grade later (outputs summary to stdout)
waza grade eval.yaml --results results.json

# Grade and output a full EvaluationOutcome (as waza run would have without --skip-graders)
waza grade eval.yaml --results results.json -o outcome.json

Changes

New command: waza grade <eval.yaml> (cmd/waza/cmd_grade.go)

  • --results (required) — path to waza run --output JSON
  • --task — grade a single task by ID
  • --workspace — workspace dir for file-based graders
  • --judge-model — override model for prompt graders
  • --output / -o — write full EvaluationOutcome JSON (compatible with waza compare)
  • Prints a summary JSON to stdout with per-task scores, pass/fail, and grader averages
  • Distinguishes "task not in results" from "task has zero runs" errors

New flag: waza run --skip-graders (cmd/waza/cmd_run.go)

  • Skips all grading after execution; runs get status: "n/a"
  • Overall task status becomes StatusNA; Digest.Skipped is incremented
  • JUnit output marks these as <skipped> elements

Refactored grading into shared code:

  • internal/graders.RunAll — extracted from TestRunner.runGraders, used by both waza run and waza grade
  • internal/graders.WithModel — extracted from injectJudgeModel, copies params with model override
  • internal/orchestration/outcome.go — extracted ComputeTestStats, BuildDigest, and digest helpers from runner.go; waza grade --output calls RegradeOutcome which uses the same BuildDigest + ComputeTestStats as waza run, ensuring identical EvaluationOutcome structure

Copilot AI review requested due to automatic review settings March 9, 2026 19:46
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a waza grade command to re-run graders against previously saved waza run --output results, and introduces waza run --skip-graders to separate execution from grading.

Changes:

  • Added waza grade <eval.yaml> command that reads prior run output JSON, grades tasks, prints a summary JSON, and can optionally emit a full EvaluationOutcome.
  • Added --skip-graders flag to waza run and updated reporting (JUnit) to mark ungraded tasks as skipped.
  • Refactored grading/stats/digest logic into shared helpers (internal/graders.RunAll, ComputeTestStats, BuildDigest, RegradeOutcome).

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
internal/reporting/junit.go Marks StatusNA test cases as JUnit <skipped>.
internal/reporting/junit_test.go Adds coverage for skipped JUnit test cases.
internal/orchestration/runner.go Adds skip-graders option, refactors digest/stats, and reuses shared grader runner.
internal/orchestration/runner_orchestration_test.go Adds coverage for --skip-graders behavior and updates tests after refactor.
internal/orchestration/outcome.go Introduces shared stats/digest helpers and RegradeOutcome.
internal/orchestration/outcome_test.go Adds unit tests for the new stats/digest/regrade helpers.
internal/orchestration/judge_model_test.go Updates tests to target WithModel (but introduces a package mismatch).
internal/models/outcome.go Updates StatusNA semantics to include “grading skipped”.
internal/graders/run.go Extracts shared grading runner and model override helper.
cmd/waza/root.go Registers the new grade subcommand.
cmd/waza/cmd_run.go Adds --skip-graders flag wiring into the runner.
cmd/waza/cmd_grade.go Implements the waza grade command and JSON summary/outcome output.
cmd/waza/cmd_grade_test.go Adds comprehensive CLI-level test coverage for waza grade.
README.md Documents waza grade usage and flags.

@chlowell chlowell marked this pull request as ready for review March 9, 2026 20:30
Copilot AI review requested due to automatic review settings March 9, 2026 20:30
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

@codecov-commenter
Copy link

codecov-commenter commented Mar 9, 2026

Codecov Report

❌ Patch coverage is 84.13926% with 82 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@3f73d31). Learn more about missing BASE report.

Files with missing lines Patch % Lines
cmd/waza/cmd_grade.go 83.93% 19 Missing and 12 partials ⚠️
internal/graders/run.go 32.43% 25 Missing ⚠️
internal/orchestration/outcome.go 94.76% 9 Missing and 2 partials ⚠️
internal/orchestration/runner.go 86.56% 6 Missing and 3 partials ⚠️
internal/models/testcase.go 0.00% 4 Missing ⚠️
cmd/waza/cmd_run.go 33.33% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #102   +/-   ##
=======================================
  Coverage        ?   73.45%           
=======================================
  Files           ?      138           
  Lines           ?    15771           
  Branches        ?        0           
=======================================
  Hits            ?    11584           
  Misses          ?     3346           
  Partials        ?      841           
Flag Coverage Δ
go-implementation 73.45% <84.13%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copilot AI review requested due to automatic review settings March 10, 2026 14:57
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

@chlowell chlowell enabled auto-merge (squash) March 10, 2026 15:08
Copy link
Member

@richardpark-msft richardpark-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot here, it seems okay on the surface. Maybe a couple of changes and then it'll be fine.

richardpark-msft pushed a commit to richardpark-msft/waza that referenced this pull request Mar 10, 2026
This PR adds token limit checking functionality to the waza CLI and
reorganizes testdata for proper test isolation. Closes microsoft#48, closes microsoft#49

## Changes

**New Features:**
- Added `waza tokens check` command to validate markdown files against
token limits defined in `.token-limits.json`
- Supports both table and JSON output formats
- Includes `--strict` flag to fail automation when limits are exceeded
by exiting 1
- Includes `--quiet` flag for silent validation (suppresses success
output)
- Default token limits for common files (SKILL.md: 500, README.md: 3000,
etc.)
- Supports glob patterns and exact file path overrides in configuration

**Implementation:**
- New `cmd/waza/tokens/check.go` with full check command implementation
- New `cmd/waza/tokens/internal/limits.go` parses `.token-limits.json`,
implements limits, glob pattern matching, and pattern specificity
scoring
- Comprehensive test suite in `check_test.go` covering all scenarios:
passing limits, exceeded limits, strict mode, JSON output, configuration
overrides, and default limits
- New `docs/TOKEN-LIMITS.md` with detailed configuration documentation

**Test Infrastructure:**
- Reorganized `cmd/waza/tokens/testdata/` into `check/` and `count/`
subdirectories to maintain test isolation
- Created separate fixtures for each test scenario

## Example Output

```
$ waza tokens check
File               Tokens  Limit  Status
--------------------------------------------------
SKILL.md              402    500  ✅ OK
README.md              99    100  ✅ OK
references/spec.md    790    100  ❌ EXCEEDED
--------------------------------------------------

2/3 files within limits

⚠️  1 file(s) exceed their token limits:
   references/spec.md: 790 tokens (690 over limit of 100)
```
@chlowell chlowell disabled auto-merge March 10, 2026 21:11
Copy link
Collaborator

@wbreza wbreza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

I've performed a thorough review of PR #102 adding the waza grade command. I recommend: APPROVE

What I Reviewed

  • Full diff of all 14 changed files (1,570 additions, 432 deletions)
  • New waza grade command implementation and 572 lines of tests
  • Refactored grading logic (internal/graders/run.go, internal/orchestration/outcome.go)
  • --skip-graders flag implementation for waza run
  • Error handling, edge cases, and nil pointer safety
  • Test coverage and execution (all tests pass)

Findings

No significant issues found. The implementation is solid:

Proper error handling: All error paths are handled correctly with appropriate error messages
Nil pointer safety: No nil dereference risks (e.g., OutcomeDigest is a struct value, not pointer)
Division by zero protection: Grader weight calculations are safe (weights default to 1.0 when <= 0)
Comprehensive test coverage: 21 test cases covering edge cases like missing tasks, zero runs, and mixed pass/fail scenarios
Backward compatibility: Refactored code maintains identical behavior (shared ComputeTestStats, BuildDigest ensure consistency)
StatusNA handling: Properly implemented for skipped grading with correct propagation through digest/stats

Code Quality Observations

  • Clean separation of concerns (grading logic extracted to shared helpers)
  • Good test coverage including integration tests
  • Proper validation of required flags and input data
  • Clear error messages distinguish "task not in spec" vs "task not in results" vs "task has no runs"
  • Preservation of unmodified tasks when using --output flag

Reviewer Comments Addressed

All three comments from @richardpark-msft have been discussed and resolved:

  1. Output format naming (--results vs --output) - discussed and kept as-is with documentation
  2. --skip-graders requiring --output - clarified that it's optional, valid use case documented
  3. Flag naming consistency - discussed and accepted current approach

The implementation is production-ready.

Copy link
Collaborator

@wbreza wbreza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: PR #102 — Add grade verb for grading run outcomes

✅ What Looks Good

  • Excellent refactoring — grading logic cleanly extracted from runner.go into shared graders.RunAll(), eliminating ~237 lines of duplication
  • Comprehensive test suite — 572 lines, 21 tests covering passing/failing/mixed/disabled/global/verbose/output scenarios
  • Clean separation of concerns--skip-graders + grade command provide flexible execution-then-grading workflows
  • No concurrency issues — proper context propagation, immutable parameter passing, sequential processing

🟠 High (2 findings)

1. StatusNA semantic expansionStatusNA was broadened from "missing from comparison report" to also mean "grading was skipped." Consider adding StatusSkipped as a distinct constant to avoid confusing downstream consumers (dashboards, waza compare, JUnit reports).

2. internal/graders/run.go — 9.43% test coverageRunAll() is the core grading orchestrator extracted in this PR but lacks dedicated unit tests. It's covered indirectly through cmd_grade_test.go, but dedicated tests for error paths (grader creation/execution failures) and weight application would strengthen confidence.

🟡 Medium (4 findings)

3. No timeout for grade graders — Prompt graders make API calls but no timeout context is applied. Consider using spec.Config.TimeoutSec.

4. Workspace path not validated--workspace accepted without checking if directory exists. An os.Stat check would prevent cryptic grader errors.

5. Inconsistent weight handling — Spec graders use EffectiveWeight() but task validators use manual if w <= 0 { w = 1.0 }. Consider a consistent pattern.

6. --output help text underspecified — Doesn't explain the merged behavior when --task is used (graded + unchanged original tasks preserved).

🟢 Low (2 findings)

7. Use streaming JSON decoderos.ReadFile + json.Unmarshal loads entire file; json.NewDecoder would be more memory-efficient for large results.

8. Verbose logging inconsistency — One Fprintf discards error with _, _ while others check and return.

📌 Follow-up: CLI Flag Standardization

This PR highlights a broader consistency question across the waza CLI: we use --results (input file), --output (output file), and -o inconsistently across commands. We should define a consistent flag model for:

  • Input/output file flags — standardize naming for flags that accept file paths
  • Output formats — commands that support both pretty UX and JSON-based output should use a consistent pattern (e.g., --format json vs --json)

Not blocking for this PR, but worth tracking as a follow-up.

Summary

Priority Count
Critical 0
High 2
Medium 4
Low 2

Overall Assessment: Comment — solid, well-tested feature. High findings are design suggestions, not blockers.

Copy link
Collaborator

@wbreza wbreza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: PR #102 — Add grade verb for grading run outcomes

✅ What Looks Good

  • Excellent refactoring — grading logic cleanly extracted from runner.go into shared graders.RunAll(), eliminating ~237 lines of duplication
  • Comprehensive test suite — 572 lines, 21 tests covering passing/failing/mixed/disabled/global/verbose/output scenarios
  • Clean separation of concerns--skip-graders + grade command provide flexible execution-then-grading workflows
  • No concurrency issues — proper context propagation, immutable parameter passing, sequential processing

🟠 High (2 findings)

1. StatusNA semantic expansionStatusNA was broadened from "missing from comparison report" to also mean "grading was skipped." Consider adding StatusSkipped as a distinct constant to avoid confusing downstream consumers (dashboards, waza compare, JUnit reports).

2. internal/graders/run.go — 9.43% test coverageRunAll() is the core grading orchestrator extracted in this PR but lacks dedicated unit tests. It's covered indirectly through cmd_grade_test.go, but dedicated tests for error paths (grader creation/execution failures) and weight application would strengthen confidence.

🟡 Medium (4 findings)

3. No timeout for grade graders — Prompt graders make API calls but no timeout context is applied. Consider using spec.Config.TimeoutSec.

4. Workspace path not validated--workspace accepted without checking if directory exists. An os.Stat check would prevent cryptic grader errors.

5. Inconsistent weight handling — Spec graders use EffectiveWeight() but task validators use manual if w <= 0 { w = 1.0 }. Consider a consistent pattern.

6. --output help text underspecified — Doesn't explain the merged behavior when --task is used (graded + unchanged original tasks preserved).

🟢 Low (2 findings)

7. Use streaming JSON decoderos.ReadFile + json.Unmarshal loads entire file; json.NewDecoder would be more memory-efficient for large results.

8. Verbose logging inconsistency — One Fprintf discards error with _, _ while others check and return.

📌 Follow-up: CLI Flag Standardization

This PR highlights a broader consistency question across the waza CLI: we use --results (input file), --output (output file), and -o inconsistently across commands. We should define a consistent flag model for:

  • Input/output file flags — standardize naming for flags that accept file paths
  • Output formats — commands that support both pretty UX and JSON-based output should use a consistent pattern (e.g., --format json vs --json)

Not blocking for this PR, but worth tracking as a follow-up.

Summary

Priority Count
Critical 0
High 2
Medium 4
Low 2

Overall Assessment: Comment — solid, well-tested feature. High findings are design suggestions, not blockers.

Copilot AI review requested due to automatic review settings March 12, 2026 15:11
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.

Copilot AI review requested due to automatic review settings March 12, 2026 17:47
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

chlowell and others added 2 commits March 12, 2026 12:34
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 12, 2026 19:49
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

@chlowell chlowell enabled auto-merge (squash) March 12, 2026 20:07
@chlowell chlowell merged commit 3068653 into microsoft:main Mar 13, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants