Skip to content

feat: Add multi-trial flakiness detection for evals#91

Open
spboyer wants to merge 3 commits intomicrosoft:mainfrom
spboyer:squad/84-flakiness-detection
Open

feat: Add multi-trial flakiness detection for evals#91
spboyer wants to merge 3 commits intomicrosoft:mainfrom
spboyer:squad/84-flakiness-detection

Conversation

@spboyer
Copy link
Member

@spboyer spboyer commented Mar 5, 2026

Closes #84

Copilot AI review requested due to automatic review settings March 5, 2026 01:54
@spboyer spboyer self-assigned this Mar 5, 2026
@github-actions github-actions bot enabled auto-merge (squash) March 5, 2026 01:54
@codecov-commenter
Copy link

codecov-commenter commented Mar 5, 2026

Codecov Report

❌ Patch coverage is 77.41935% with 7 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@a75477e). Learn more about missing BASE report.

Files with missing lines Patch % Lines
internal/orchestration/runner.go 76.19% 5 Missing ⚠️
cmd/waza/cmd_run.go 80.00% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main      #91   +/-   ##
=======================================
  Coverage        ?   72.22%           
=======================================
  Files           ?      128           
  Lines           ?    14281           
  Branches        ?        0           
=======================================
  Hits            ?    10315           
  Misses          ?     3202           
  Partials        ?      764           
Flag Coverage Δ
go-implementation 72.22% <77.41%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds CLI and result/statistics enhancements to support running eval tasks multiple times and surfacing flakiness/pass-rate information in outputs.

Changes:

  • Document a new --trials flag for waza run and add CLI parsing/validation + spec override behavior.
  • Extend per-task TestStats with run counts and flakiness percentage, and print flakiness in the CLI summary.
  • Add tests around --trials handling and basic flakiness percent computation.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
site/src/content/docs/reference/cli.mdx Documents --trials flag for multi-trial runs.
internal/orchestration/runner_orchestration_test.go Adds unit test for flakiness percent/run counts in computeTestStats.
internal/orchestration/runner.go Computes additional per-task stats (PassedRuns, FailedRuns, ErrorRuns, TotalRuns, FlakinessPercent).
internal/models/outcome.go Extends TestStats JSON model with run counts + flakiness_percent.
cmd/waza/cmd_run_test.go Adds tests for --trials parsing, validation, and spec override reflected in output JSON.
cmd/waza/cmd_run.go Introduces --trials flag, validates it, applies it to spec.Config.RunsPerTest, and prints flakiness percent in summary.
README.md Documents --trials flag in CLI options table.
.squad/log/2026-03-05T00-36-issue-assignment-pipeline.md Adds squad session log (process/workflow).
.squad/log/2026-03-05T00-26-rusty-token-diff-design.md Adds squad session log (token diff strategy).
.squad/decisions.md Records decisions (includes token-diff strategy + workflow directive).
Comments suppressed due to low confidence (3)

internal/orchestration/runner_orchestration_test.go:350

  • This test only covers pass/fail runs where Validations is populated, so it won’t catch the common error-path cases where StatusError runs have Validations == nil (e.g., engine.Execute error / grader execution error). Adding an assertion that an error run reduces PassRate and increments ErrorRuns (and does not increment PassedRuns) would prevent regressions in the stats bucketing logic.
func TestComputeTestStats_FlakinessPercent(t *testing.T) {
	runner := NewTestRunner(config.NewBenchmarkConfig(&models.BenchmarkSpec{}), nil)
	runs := []models.RunResult{
		{
			Status:     models.StatusPassed,
			DurationMs: 10,
			Validations: map[string]models.GraderResults{
				"check": {Passed: true, Score: 1},
			},
		},
		{
			Status:     models.StatusPassed,
			DurationMs: 20,
			Validations: map[string]models.GraderResults{
				"check": {Passed: true, Score: 1},
			},
		},
		{
			Status:     models.StatusFailed,
			DurationMs: 30,
			Validations: map[string]models.GraderResults{
				"check": {Passed: false, Score: 0},
			},
		},
	}

	stats := runner.computeTestStats(runs)
	require.NotNil(t, stats)
	assert.Equal(t, 3, stats.TotalRuns)
	assert.Equal(t, 2, stats.PassedRuns)
	assert.Equal(t, 1, stats.FailedRuns)
	assert.Equal(t, 0, stats.ErrorRuns)
	assert.InDelta(t, 66.6667, stats.PassRate*100, 0.1)
	assert.True(t, stats.Flaky)
	assert.InDelta(t, 33.3333, stats.FlakinessPercent, 0.1)
}

.squad/decisions.md:258

  • This PR is scoped/titled as implementing multi-trial flakiness detection for evals (issue #84), but this change adds a decision entry for an unrelated topic (token diff distribution strategy for issue #81). Consider moving the #81 decision/log updates to a separate PR (or to the PR for #81) so this PR remains focused and easier to review/revert.
## 2026-03-05: Token Diff Distribution Strategy (Issue #81)

**By:** Rusty (Lead / Architect)  
**Issue:** #81  
**Status:** APPROVED

### What

For the GitHub Action token budget PR comment feature (#81), **implement `waza tokens diff` CLI command + lightweight wrapper action**, not action-only or CLI-only.

### Implementation

internal/orchestration/runner.go:1302

  • The current counters can double-count error runs: a StatusError run that also has failing validations increments both failed (via the else) and errored, and a StatusError run with nil validations increments passed and errored. This can make PassedRuns+FailedRuns+ErrorRuns differ from TotalRuns and makes the individual run counts misleading. Suggest counting each run into exactly one bucket (passed/failed/error) based on run.Status, then computing PassRate from those buckets.

		if run.Status == models.StatusError {
			errored++
		}

Comment on lines +1294 to +1301
if run.AllValidationsPassed() {
passed++
} else {
failed++
}

if run.Status == models.StatusError {
errored++
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

computeTestStats counts a run as passed based on run.AllValidationsPassed(). For StatusError runs produced by executeRun on engine/grader errors, Validations is nil/empty, so AllValidationsPassed() returns true and the run is incorrectly counted as a pass (inflating PassRate, PassedRuns, and flakiness calculations). Consider deriving PassedRuns/FailedRuns/ErrorRuns (and PassRate) from run.Status so error runs always count as non-pass, consistent with computePassRate (which treats StatusError as a fail for pass-rate purposes).

This issue also appears on line 1299 of the same file.

Suggested change
if run.AllValidationsPassed() {
passed++
} else {
failed++
}
if run.Status == models.StatusError {
errored++
switch run.Status {
case models.StatusPassed:
passed++
case models.StatusFailed:
failed++
case models.StatusError:
// Treat error runs as non-passes for pass-rate purposes,
// consistent with computePassRate.
failed++
errored++
default:
// Fallback to validation-based classification for any
// unexpected statuses to preserve prior behaviour.
if run.AllValidationsPassed() {
passed++
} else {
failed++
}

Copilot uses AI. Check for mistakes.
Copy link
Member Author

@spboyer spboyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified by Rusty (Opus 4.6) — LGTM ✅

Solid implementation of multi-trial flakiness detection:

  • --trials\ flag with proper validation (>=1), clean override of \spec.Config.RunsPerTest\
  • \FlakinessPercent\ metric uses correct minority-outcomes formula
  • New stats fields (PassedRuns, FailedRuns, ErrorRuns, TotalRuns) provide full trial breakdown
  • Summary display updated with flakiness percentage
  • Tests cover flag parsing, invalid values, override behavior, and stats computation
  • Docs updated: README, CLI reference
  • CI green on ubuntu + windows + lint

Note: Can't self-approve via API (same account). Setting auto-merge.

@spboyer spboyer force-pushed the squad/84-flakiness-detection branch from fdcf8f8 to 4801230 Compare March 5, 2026 17:12
spboyer added a commit that referenced this pull request Mar 5, 2026
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 5, 2026 17:44
spboyer and others added 2 commits March 5, 2026 12:46
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@spboyer spboyer force-pushed the squad/84-flakiness-detection branch from bc242b4 to 0e90d56 Compare March 5, 2026 17:46
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

| `--verbose` | `-v` | bool | false | Detailed progress output |
| `--parallel` | | bool | false | Run tasks concurrently |
| `--workers` | `-w` | int | 4 | Number of concurrent workers |
| `--trials` | | int | 1 | Run each task N times for flakiness detection (overrides `config.trials_per_task`) |
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CLI docs list --trials default as 1, but the flag is defined with a default of 0 in code (meaning "unset" so the spec/config value is used). Please update this row to reflect the actual behavior (e.g., default 0/unset → uses config.trials_per_task, and only overrides when explicitly provided).

Suggested change
| `--trials` | | int | 1 | Run each task N times for flakiness detection (overrides `config.trials_per_task`) |
| `--trials` | | int | 0 (unset) | Run each task N times for flakiness detection (0 = use `config.trials_per_task`; overrides only when explicitly provided) |

Copilot uses AI. Check for mistakes.
cmd.Flags().StringArrayVar(&tagFilters, "tags", nil, "Filter tasks by tags, using glob patterns (can be repeated)")
cmd.Flags().BoolVar(&parallel, "parallel", false, "Run tasks concurrently")
cmd.Flags().IntVar(&workers, "workers", 0, "Number of concurrent workers (default: 4, requires --parallel)")
cmd.Flags().IntVar(&trials, "trials", 0, "Number of trials per task (overrides config.trials_per_task)")
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The semantics of --trials are "override only when the flag is explicitly set" (as enforced by the Changed(\"trials\") validation), but the application logic uses if trials > 0 instead of Changed(\"trials\"). To keep the code consistent and future-proof the meaning of the default value, consider applying the override based on cmd.Flags().Changed(\"trials\") and updating the flag help string to clarify that it overrides only when provided.

Copilot uses AI. Check for mistakes.
if workers > 0 {
spec.Config.Workers = workers
}
if trials > 0 {
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The semantics of --trials are "override only when the flag is explicitly set" (as enforced by the Changed(\"trials\") validation), but the application logic uses if trials > 0 instead of Changed(\"trials\"). To keep the code consistent and future-proof the meaning of the default value, consider applying the override based on cmd.Flags().Changed(\"trials\") and updating the flag help string to clarify that it overrides only when provided.

Suggested change
if trials > 0 {
if cmd.Flags().Changed("trials") {

Copilot uses AI. Check for mistakes.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
chlowell pushed a commit to chlowell/waza that referenced this pull request Mar 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Multi-trial flakiness detection for evals

3 participants