feat: Add multi-trial flakiness detection for evals by spboyer · Pull Request #91 · microsoft/waza

spboyer · 2026-03-05T01:54:14Z

Closes #84

codecov-commenter · 2026-03-05T01:57:42Z

Codecov Report

❌ Patch coverage is 77.41935% with 7 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@a75477e). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
internal/orchestration/runner.go	76.19%	5 Missing ⚠️
cmd/waza/cmd_run.go	80.00%	2 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main      #91   +/-   ##
=======================================
  Coverage        ?   72.22%           
=======================================
  Files           ?      128           
  Lines           ?    14281           
  Branches        ?        0           
=======================================
  Hits            ?    10315           
  Misses          ?     3202           
  Partials        ?      764

Flag	Coverage Δ
go-implementation	`72.22% <77.41%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copilot

Pull request overview

Adds CLI and result/statistics enhancements to support running eval tasks multiple times and surfacing flakiness/pass-rate information in outputs.

Changes:

Document a new --trials flag for waza run and add CLI parsing/validation + spec override behavior.
Extend per-task TestStats with run counts and flakiness percentage, and print flakiness in the CLI summary.
Add tests around --trials handling and basic flakiness percent computation.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
site/src/content/docs/reference/cli.mdx	Documents `--trials` flag for multi-trial runs.
internal/orchestration/runner_orchestration_test.go	Adds unit test for flakiness percent/run counts in `computeTestStats`.
internal/orchestration/runner.go	Computes additional per-task stats (`PassedRuns`, `FailedRuns`, `ErrorRuns`, `TotalRuns`, `FlakinessPercent`).
internal/models/outcome.go	Extends `TestStats` JSON model with run counts + `flakiness_percent`.
cmd/waza/cmd_run_test.go	Adds tests for `--trials` parsing, validation, and spec override reflected in output JSON.
cmd/waza/cmd_run.go	Introduces `--trials` flag, validates it, applies it to `spec.Config.RunsPerTest`, and prints flakiness percent in summary.
README.md	Documents `--trials` flag in CLI options table.
.squad/log/2026-03-05T00-36-issue-assignment-pipeline.md	Adds squad session log (process/workflow).
.squad/log/2026-03-05T00-26-rusty-token-diff-design.md	Adds squad session log (token diff strategy).
.squad/decisions.md	Records decisions (includes token-diff strategy + workflow directive).

Comments suppressed due to low confidence (3)

internal/orchestration/runner_orchestration_test.go:350

This test only covers pass/fail runs where Validations is populated, so it won’t catch the common error-path cases where StatusError runs have Validations == nil (e.g., engine.Execute error / grader execution error). Adding an assertion that an error run reduces PassRate and increments ErrorRuns (and does not increment PassedRuns) would prevent regressions in the stats bucketing logic.

func TestComputeTestStats_FlakinessPercent(t *testing.T) {
	runner := NewTestRunner(config.NewBenchmarkConfig(&models.BenchmarkSpec{}), nil)
	runs := []models.RunResult{
		{
			Status:     models.StatusPassed,
			DurationMs: 10,
			Validations: map[string]models.GraderResults{
				"check": {Passed: true, Score: 1},
			},
		},
		{
			Status:     models.StatusPassed,
			DurationMs: 20,
			Validations: map[string]models.GraderResults{
				"check": {Passed: true, Score: 1},
			},
		},
		{
			Status:     models.StatusFailed,
			DurationMs: 30,
			Validations: map[string]models.GraderResults{
				"check": {Passed: false, Score: 0},
			},
		},
	}

	stats := runner.computeTestStats(runs)
	require.NotNil(t, stats)
	assert.Equal(t, 3, stats.TotalRuns)
	assert.Equal(t, 2, stats.PassedRuns)
	assert.Equal(t, 1, stats.FailedRuns)
	assert.Equal(t, 0, stats.ErrorRuns)
	assert.InDelta(t, 66.6667, stats.PassRate*100, 0.1)
	assert.True(t, stats.Flaky)
	assert.InDelta(t, 33.3333, stats.FlakinessPercent, 0.1)
}

.squad/decisions.md:258

This PR is scoped/titled as implementing multi-trial flakiness detection for evals (issue #84), but this change adds a decision entry for an unrelated topic (token diff distribution strategy for issue #81). Consider moving the #81 decision/log updates to a separate PR (or to the PR for #81) so this PR remains focused and easier to review/revert.

## 2026-03-05: Token Diff Distribution Strategy (Issue #81)

**By:** Rusty (Lead / Architect)  
**Issue:** #81  
**Status:** APPROVED

### What

For the GitHub Action token budget PR comment feature (#81), **implement `waza tokens diff` CLI command + lightweight wrapper action**, not action-only or CLI-only.

### Implementation

internal/orchestration/runner.go:1302

The current counters can double-count error runs: a StatusError run that also has failing validations increments both failed (via the else) and errored, and a StatusError run with nil validations increments passed and errored. This can make PassedRuns+FailedRuns+ErrorRuns differ from TotalRuns and makes the individual run counts misleading. Suggest counting each run into exactly one bucket (passed/failed/error) based on run.Status, then computing PassRate from those buckets.


		if run.Status == models.StatusError {
			errored++
		}

Copilot · 2026-03-05T01:59:48Z

internal/orchestration/runner.go

 		if run.AllValidationsPassed() {
 			passed++
+		} else {
+			failed++
+		}
+
+		if run.Status == models.StatusError {
+			errored++


computeTestStats counts a run as passed based on run.AllValidationsPassed(). For StatusError runs produced by executeRun on engine/grader errors, Validations is nil/empty, so AllValidationsPassed() returns true and the run is incorrectly counted as a pass (inflating PassRate, PassedRuns, and flakiness calculations). Consider deriving PassedRuns/FailedRuns/ErrorRuns (and PassRate) from run.Status so error runs always count as non-pass, consistent with computePassRate (which treats StatusError as a fail for pass-rate purposes).

This issue also appears on line 1299 of the same file.

Suggested change

if run.AllValidationsPassed() {

passed++

} else {

failed++

}

if run.Status == models.StatusError {

errored++

switch run.Status {

case models.StatusPassed:

passed++

case models.StatusFailed:

failed++

case models.StatusError:

// Treat error runs as non-passes for pass-rate purposes,

// consistent with computePassRate.

failed++

errored++

default:

// Fallback to validation-based classification for any

// unexpected statuses to preserve prior behaviour.

if run.AllValidationsPassed() {

passed++

} else {

failed++

}

spboyer

Verified by Rusty (Opus 4.6) — LGTM ✅

Solid implementation of multi-trial flakiness detection:

--trials\ flag with proper validation (>=1), clean override of \spec.Config.RunsPerTest\
\FlakinessPercent\ metric uses correct minority-outcomes formula
New stats fields (PassedRuns, FailedRuns, ErrorRuns, TotalRuns) provide full trial breakdown
Summary display updated with flakiness percentage
Tests cover flag parsing, invalid values, override behavior, and stats computation
Docs updated: README, CLI reference
CI green on ubuntu + windows + lint

Note: Can't self-approve via API (same account). Setting auto-merge.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Copilot · 2026-03-05T18:59:00Z

site/src/content/docs/reference/cli.mdx

 | `--verbose` | `-v` | bool | false | Detailed progress output |
 | `--parallel` | | bool | false | Run tasks concurrently |
 | `--workers` | `-w` | int | 4 | Number of concurrent workers |
+| `--trials` | | int | 1 | Run each task N times for flakiness detection (overrides `config.trials_per_task`) |


The CLI docs list --trials default as 1, but the flag is defined with a default of 0 in code (meaning "unset" so the spec/config value is used). Please update this row to reflect the actual behavior (e.g., default 0/unset → uses config.trials_per_task, and only overrides when explicitly provided).

Suggested change

| `--trials` | | int | 1 | Run each task N times for flakiness detection (overrides `config.trials_per_task`) |

| `--trials` | | int | 0 (unset) | Run each task N times for flakiness detection (0 = use `config.trials_per_task`; overrides only when explicitly provided) |

Copilot · 2026-03-05T18:59:01Z

cmd/waza/cmd_run.go

 	cmd.Flags().StringArrayVar(&tagFilters, "tags", nil, "Filter tasks by tags, using glob patterns (can be repeated)")
 	cmd.Flags().BoolVar(&parallel, "parallel", false, "Run tasks concurrently")
 	cmd.Flags().IntVar(&workers, "workers", 0, "Number of concurrent workers (default: 4, requires --parallel)")
+	cmd.Flags().IntVar(&trials, "trials", 0, "Number of trials per task (overrides config.trials_per_task)")


The semantics of --trials are "override only when the flag is explicitly set" (as enforced by the Changed(\"trials\") validation), but the application logic uses if trials > 0 instead of Changed(\"trials\"). To keep the code consistent and future-proof the meaning of the default value, consider applying the override based on cmd.Flags().Changed(\"trials\") and updating the flag help string to clarify that it overrides only when provided.

cmd/waza/cmd_run.go

Copilot · 2026-03-05T18:59:01Z

cmd/waza/cmd_run.go

 	if workers > 0 {
 		spec.Config.Workers = workers
 	}
+	if trials > 0 {


The semantics of --trials are "override only when the flag is explicitly set" (as enforced by the Changed(\"trials\") validation), but the application logic uses if trials > 0 instead of Changed(\"trials\"). To keep the code consistent and future-proof the meaning of the default value, consider applying the override based on cmd.Flags().Changed(\"trials\") and updating the flag help string to clarify that it overrides only when provided.

Suggested change

if trials > 0 {

if cmd.Flags().Changed("trials") {

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ft#91)

Copilot AI review requested due to automatic review settings March 5, 2026 01:54

spboyer requested review from chlowell and richardpark-msft as code owners March 5, 2026 01:54

spboyer self-assigned this Mar 5, 2026

github-actions bot enabled auto-merge (squash) March 5, 2026 01:54

Copilot started reviewing on behalf of spboyer March 5, 2026 01:55 View session

Copilot AI reviewed Mar 5, 2026

View reviewed changes

spboyer commented Mar 5, 2026

View reviewed changes

spboyer mentioned this pull request Mar 5, 2026

feat: Add trigger heuristic grader #90

Open

spboyer force-pushed the squad/84-flakiness-detection branch from fdcf8f8 to 4801230 Compare March 5, 2026 17:12

spboyer added a commit that referenced this pull request Mar 5, 2026

fix: address review feedback on PR #91

bc242b4

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings March 5, 2026 17:44

spboyer and others added 2 commits March 5, 2026 12:46

feat: add multi-trial flakiness detection microsoft#84

b7ccad8

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: address review feedback on PR microsoft#91

0e90d56

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

spboyer force-pushed the squad/84-flakiness-detection branch from bc242b4 to 0e90d56 Compare March 5, 2026 17:46

Copilot AI reviewed Mar 5, 2026

View reviewed changes

fix: address PR microsoft#91 flakiness review comments

16eac6a

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

chlowell pushed a commit to chlowell/waza that referenced this pull request Mar 5, 2026

feat: implement guided requirements gathering (microsoft#53) (microso…

3059b0f

…ft#91)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add multi-trial flakiness detection for evals#91

feat: Add multi-trial flakiness detection for evals#91
spboyer wants to merge 3 commits intomicrosoft:mainfrom
spboyer:squad/84-flakiness-detection

spboyer commented Mar 5, 2026

Uh oh!

codecov-commenter commented Mar 5, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 5, 2026

Uh oh!

spboyer left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 5, 2026

Uh oh!

Copilot AI Mar 5, 2026

Uh oh!

Uh oh!

Copilot AI Mar 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	\| `--trials` \| \| int \| 1 \| Run each task N times for flakiness detection (overrides `config.trials_per_task`) \|
	\| `--trials` \| \| int \| 0 (unset) \| Run each task N times for flakiness detection (0 = use `config.trials_per_task`; overrides only when explicitly provided) \|

Conversation

spboyer commented Mar 5, 2026

Uh oh!

codecov-commenter commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

spboyer left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov-commenter commented Mar 5, 2026 •

edited

Loading