feat: Stability Benchmarks by nick-galluzzo · Pull Request #13 · nick-galluzzo/diffmage

nick-galluzzo · 2025-08-05T03:36:03Z

This PR introduces stability benchmarking to validate the consistency and reliability of LLM evaluations of commit messages.

Key Features

Core Functionality

EvaluationBenchmarks class for stability testing with statistical analysis
benchmark-stability CLI command for consistency validation
Variance-based assessment with configurable thresholds

Capabilities

Run multiple evaluations on the same input to measure consistency
Calculate variance, standard deviation, and range across WHAT/WHY/Overall scores
Rich console output with progress bars and formatted results
Performance metrics and execution time tracking
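
To make the statistics above concrete, here is a minimal sketch of how repeated scores for one dimension could be summarized. It is not the code from `benchmarks.py`: the `ScoreStats` shape, the `summarize_scores` helper, the sample scores, and the choice of population variance (rather than sample variance) are all assumptions for illustration.

```python
import statistics
from typing import TypedDict


class ScoreStats(TypedDict):
    """Hypothetical result shape; the PR's actual TypedDict names are not shown here."""

    mean: float
    variance: float
    std_dev: float
    score_range: float


def summarize_scores(scores: list[float]) -> ScoreStats:
    """Summarize repeated scores for one dimension (WHAT, WHY, or Overall)."""
    if not scores:
        # Assumption: an empty list of runs yields all-zero statistics.
        return ScoreStats(mean=0.0, variance=0.0, std_dev=0.0, score_range=0.0)
    return ScoreStats(
        mean=statistics.fmean(scores),
        variance=statistics.pvariance(scores),  # population variance (assumed)
        std_dev=statistics.pstdev(scores),
        score_range=max(scores) - min(scores),
    )


# Made-up WHAT scores from five runs on the same commit message.
what_scores = [4.0, 4.5, 4.0, 3.5, 4.0]
stats = summarize_scores(what_scores)
print(stats)
```

With these sample scores the variance is about 0.1 and the range is 1.0. Whether the real implementation uses population or sample variance affects how the threshold should be read, so that choice is worth checking against the source.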

Usage

# Test message stability on current HEAD
diffmage benchmark-stability "feat: add auth" --runs 5

# Test specific commit  
diffmage benchmark-stability "fix: bug" --commit abc123 --runs 3

Technical Details

Evaluates stability across WHAT (accuracy) and WHY (rationale/impact) dimensions
Stable: variance ≤ threshold (default 0.2)
Unstable: variance > threshold
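
The stable/unstable rule can be sketched as a simple comparison against the threshold. The function names, the dictionary keys, and the sample variances below are hypothetical; only the `≤` comparison and the default of 0.2 come from the description above.

```python
DEFAULT_VARIANCE_THRESHOLD = 0.2  # default stated in the PR description


def is_stable(variance: float, threshold: float = DEFAULT_VARIANCE_THRESHOLD) -> bool:
    """A dimension is stable when its variance does not exceed the threshold."""
    return variance <= threshold


def classify(
    variances: dict[str, float],
    threshold: float = DEFAULT_VARIANCE_THRESHOLD,
) -> dict[str, str]:
    """Label each dimension as 'stable' or 'unstable' (hypothetical helper)."""
    return {
        dimension: "stable" if is_stable(value, threshold) else "unstable"
        for dimension, value in variances.items()
    }


# Made-up variances for the three score dimensions.
variances = {"what": 0.10, "why": 0.20, "overall": 0.35}
labels = classify(variances)
print(labels)
```

Note that a variance exactly equal to the threshold counts as stable, since the rule is `≤` rather than `<`.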

Files

src/diffmage/evaluation/benchmarks.py (new)
tests/evaluation/test_benchmarks.py (new)
src/diffmage/cli/reports.py (enhanced)

This gives the project tooling to check that LLM-based evaluations are repeatable, which supports evaluation validation, research, and quality assurance in production environments.

…evaluation
- Add new CLI command `benchmark_stability` to evaluate LLM consistency
- Implement `EvaluationBenchmarks` class with statistical analysis
- Add test suite for benchmark functionality
- Enable variance-based stability assessment with configurable thresholds

…tructure
- Introduce TypedDict classes for better type safety in benchmarking
- Update stability test to return structured TypedDict results
- Improve score variance calculation to handle empty inputs
- Enhance display of stability results with clearer dimension names

…ts structure
- Introduce TypedDict classes for better type safety in benchmarking
- Update stability test to return structured TypedDict results
- Improve score variance calculation to handle empty inputs
- Enhance display of stability results with clearer dimension names

nick-galluzzo added 6 commits August 4, 2025 19:26

refactor(cli): remove redundant model name default assignment

7bd6a67

test(evaluation): add comprehensive stability test cases for benchmark evaluation

f1c96f8

Merge branch 'feat/benchmarks' of https://github.com/nick-galluzzo/diffmage into feat/benchmarks

889b717

nick-galluzzo merged commit d425397 into main Aug 5, 2025
3 checks passed

nick-galluzzo deleted the feat/benchmarks branch August 5, 2025 04:04

nick-galluzzo merged 6 commits into main from feat/benchmarks
