
feat(analysis): add RunTrendAnalyzer — detect score regression across sequential runs#104

Merged
olearycrew merged 2 commits into pinchbench:main from nanookclaw:feat/run-trend-analyzer
Apr 6, 2026

Conversation

@nanookclaw
Contributor

Summary

Implements RunTrendAnalyzer as discussed in #101.

Detects whether a model's benchmark score is improving, stable, or degrading across sequential runs using OLS slope fitting over a configurable sliding window.

Changes

New file: scripts/lib_trend.py

  • RunPoint — dataclass (run_id, timestamp, model, score_pct, task_count)
  • RunTrendReport — analysis result with regression_detected and task_count_varies flag
  • RunTrendAnalyzer.load_points() — parses result JSONs, groups by model, sorts by timestamp. Narrowed exception handling: (json.JSONDecodeError, OSError) with silent skip.
  • RunTrendAnalyzer.analyze() — OLS slope via statistics.linear_regression, configurable window + threshold
  • RunTrendReport.summary() — CLI-friendly output with task-count-varied warning
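For readers skimming the PR, a minimal sketch of what `RunPoint` and the narrowed exception handling in `load_points()` look like (the result-file field layout is an assumption here; the real implementation lives in `scripts/lib_trend.py`):

```python
import json
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class RunPoint:
    run_id: str
    timestamp: float
    model: str
    score_pct: float
    task_count: int

def load_points(results_dir: Path) -> list[RunPoint]:
    """Parse result JSONs, silently skipping unreadable or malformed files."""
    points = []
    for path in sorted(results_dir.glob("*.json")):
        try:
            data = json.loads(path.read_text())
        except (json.JSONDecodeError, OSError):
            continue  # narrowed handling: skip bad files, let other errors propagate
        points.append(RunPoint(
            run_id=data["run_id"],
            timestamp=data["timestamp"],
            model=data["model"],
            score_pct=data["score_pct"],
            task_count=data["task_count"],
        ))
    # the real analyzer also groups by model; here we just sort by time
    points.sort(key=lambda p: p.timestamp)
    return points
```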

Design decisions

  1. Pure stdlib — uses statistics.linear_regression, no new dependencies
  2. Composable — can be imported as a library or called from CLI
  3. Suite expansion aware — the task_count_varies flag warns when the benchmark suite composition changed across the trending window
  4. Per-model — handles multiple models concurrently, returns sorted by slope

Tests

12 new tests in tests/test_lib_trend.py covering:

  • Edge cases (empty, single run, malformed files)
  • Regression/improving/stable detection
  • task_count_varies flag correctness
  • Multiple models + single model filtering
  • CLI summary string format

CLI integration

Can be wired into benchmark.py post-run:

from pathlib import Path
from scripts.lib_trend import RunTrendAnalyzer

analyzer = RunTrendAnalyzer(Path(args.output_dir))
analyzer.run(model=args.model)

Closes #101.

cc @Soham-o — all review points from our discussion are addressed: ✅ exception handling narrowed ✅ task_count_varies flag added

…etection

Detects whether a model's benchmark score is improving, stable, or degrading
across sequential runs using OLS slope fitting over a sliding window.

- RunPoint dataclass with run_id, timestamp, score, task_count
- RunTrendReport with regression_detected, task_count_varies flag
- Narrowed exception handling (JSONDecodeError, OSError) in load_points
- CLI output warns when suite composition changed across trending window
- Pure stdlib (statistics.linear_regression), no new dependencies

See Issue pinchbench#101 for full proposal and maintainer review thread.
- test_no_data_returns_empty: empty directory returns []
- test_single_run_returns_empty: needs >= 2 runs for trend
- test_regression_detected: declining scores trigger the flag
- test_improving_not_regression: positive slope not flagged
- test_malformed_file_skipped: bad JSON files skipped gracefully
- test_task_count_varies_flag: suite expansion sets warning
- test_task_count_varies_false_when_equal: consistent suite no warning
- test_summary_string_regression: CLI output format
- test_summary_string_task_count_warning: warning in summary
- test_stable_scores: zero slope flat detection
- test_multiple_models: concurrent analysis per model, sorted
- test_filter_by_model: single model query

@ScuttleBot left a comment


ScuttleBot review 🦀

Nice addition. Trend analysis is exactly what's needed for catching regressions when the benchmark suite itself is evolving.

What's good:

  • Pure stdlib (statistics.linear_regression) — no new deps to manage
  • task_count_varies flag is smart — calling out when the suite composition changed during the window
  • 12 tests covering edge cases (empty, single run, malformed) is solid coverage
  • Composable design (library + CLI-ready)

Suggestions:

  • Consider adding a --trend flag to benchmark.py itself for post-run auto-analysis (the README snippet shows how, but wiring it in would be nice)
  • The 5-run default window might be tight for some models with high variance — maybe expose --trend-window when you do the CLI integration?
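A hypothetical wiring of both suggestions into benchmark.py's argument parser (flag names and defaults here are assumptions, not part of this PR):

```python
# Sketch of the suggested flags: --trend to auto-run the analyzer post-run,
# --trend-window to widen the window for high-variance models.
import argparse

parser = argparse.ArgumentParser(prog="benchmark.py")
parser.add_argument("--trend", action="store_true",
                    help="run RunTrendAnalyzer automatically after the benchmark")
parser.add_argument("--trend-window", type=int, default=5,
                    help="number of recent runs to fit the trend slope over")

args = parser.parse_args(["--trend", "--trend-window", "8"])
```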

Ready to merge. The "is this model getting worse?" question comes up a lot and this answers it.



Development

Successfully merging this pull request may close these issues.

feat(benchmark): RunTrendAnalyzer — detect score regression across sequential benchmark runs

3 participants