
feat(analysis): add RunTrendAnalyzer — detect score regression across sequential runs#104

Merged
olearycrew merged 2 commits into pinchbench:main from nanookclaw:feat/run-trend-analyzer
Apr 6, 2026

Conversation

@nanookclaw
Contributor

Summary

Implements RunTrendAnalyzer as discussed in #101.

Detects whether a model's benchmark score is improving, stable, or degrading across sequential runs using OLS slope fitting over a configurable sliding window.

Changes

New file: scripts/lib_trend.py

  • RunPoint — dataclass (run_id, timestamp, model, score_pct, task_count)
  • RunTrendReport — analysis result with regression_detected and task_count_varies flag
  • RunTrendAnalyzer.load_points() — parses result JSONs, groups by model, sorts by timestamp. Narrowed exception handling: (json.JSONDecodeError, OSError) with silent skip.
  • RunTrendAnalyzer.analyze() — OLS slope via statistics.linear_regression, configurable window + threshold
  • RunTrendReport.summary() — CLI-friendly output with task-count-varied warning
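For readers skimming the PR, a minimal sketch of what `RunPoint` and the narrowed exception handling in `load_points()` look like (the result-file field layout is an assumption here; the real implementation lives in `scripts/lib_trend.py`):

```python
import json
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class RunPoint:
    run_id: str
    timestamp: float
    model: str
    score_pct: float
    task_count: int

def load_points(results_dir: Path) -> list[RunPoint]:
    """Parse result JSONs, silently skipping unreadable or malformed files."""
    points = []
    for path in sorted(results_dir.glob("*.json")):
        try:
            data = json.loads(path.read_text())
        except (json.JSONDecodeError, OSError):
            continue  # narrowed handling: skip bad files, let other errors propagate
        points.append(RunPoint(
            run_id=data["run_id"],
            timestamp=data["timestamp"],
            model=data["model"],
            score_pct=data["score_pct"],
            task_count=data["task_count"],
        ))
    # the real analyzer also groups by model; here we just sort by time
    points.sort(key=lambda p: p.timestamp)
    return points
```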

Design decisions

  1. Pure stdlib — uses statistics.linear_regression, no new dependencies
  2. Composable — can be imported as a library or called from CLI
  3. Suite expansion aware — the task_count_varies flag warns when the benchmark suite composition changed across the trending window
  4. Per-model — handles multiple models concurrently, returns sorted by slope

Tests

12 new tests in tests/test_lib_trend.py covering:

  • Edge cases (empty, single run, malformed files)
  • Regression/improving/stable detection
  • task_count_varies flag correctness
  • Multiple models + single model filtering
  • CLI summary string format

CLI integration

Can be wired into benchmark.py post-run:

from pathlib import Path
from scripts.lib_trend import RunTrendAnalyzer

analyzer = RunTrendAnalyzer(Path(args.output_dir))
analyzer.run(model=args.model)

Closes #101.

cc @Soham-o — all review points from our discussion are addressed: ✅ exception handling narrowed ✅ task_count_varies flag added

…etection

Detects whether a model's benchmark score is improving, stable, or degrading
across sequential runs using OLS slope fitting over a sliding window.

- RunPoint dataclass with run_id, timestamp, score, task_count
- RunTrendReport with regression_detected, task_count_varies flag
- Narrowed exception handling (JSONDecodeError, OSError) in load_points
- CLI output warns when suite composition changed across trending window
- Pure stdlib (statistics.linear_regression), no new dependencies

See Issue pinchbench#101 for full proposal and maintainer review thread.
- test_no_data_returns_empty: empty directory returns []
- test_single_run_returns_empty: needs >= 2 runs for trend
- test_regression_detected: declining scores trigger the flag
- test_improving_not_regression: positive slope not flagged
- test_malformed_file_skipped: bad JSON files skipped gracefully
- test_task_count_varies_flag: suite expansion sets warning
- test_task_count_varies_false_when_equal: consistent suite no warning
- test_summary_string_regression: CLI output format
- test_summary_string_task_count_warning: warning in summary
- test_stable_scores: zero slope flat detection
- test_multiple_models: concurrent analysis per model, sorted
- test_filter_by_model: single model query

@ScuttleBot left a comment


ScuttleBot review 🦀

Nice addition. Trend analysis is exactly what's needed for catching regressions when the benchmark suite itself is evolving.

What's good:

  • Pure stdlib (statistics.linear_regression) — no new deps to manage
  • task_count_varies flag is smart — calling out when the suite composition changed during the window
  • 12 tests covering edge cases (empty, single run, malformed) is solid coverage
  • Composable design (library + CLI-ready)

Suggestions:

  • Consider adding a --trend flag to benchmark.py itself for post-run auto-analysis (the README snippet shows how, but wiring it in would be nice)
  • The 5-run default window might be tight for some models with high variance — maybe expose --trend-window when you do the CLI integration?
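A hypothetical wiring of both suggestions into benchmark.py's argument parser (flag names and defaults here are assumptions, not part of this PR):

```python
# Sketch of the suggested flags: --trend to auto-run the analyzer post-run,
# --trend-window to widen the window for high-variance models.
import argparse

parser = argparse.ArgumentParser(prog="benchmark.py")
parser.add_argument("--trend", action="store_true",
                    help="run RunTrendAnalyzer automatically after the benchmark")
parser.add_argument("--trend-window", type=int, default=5,
                    help="number of recent runs to fit the trend slope over")

args = parser.parse_args(["--trend", "--trend-window", "8"])
```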

Ready to merge. The "is this model getting worse?" question comes up a lot and this answers it.



Development

Successfully merging this pull request may close these issues.

feat(benchmark): RunTrendAnalyzer — detect score regression across sequential benchmark runs

3 participants