
feat: Stability Benchmarks #13

Merged
nick-galluzzo merged 6 commits into main from feat/benchmarks on Aug 5, 2025
@nick-galluzzo
Owner

This PR introduces stability benchmarking capabilities to validate LLM evaluation consistency and reliability for commit message assessment.

Key Features

Core Functionality

  • EvaluationBenchmarks class for stability testing with statistical analysis
  • benchmark-stability CLI command for consistency validation
  • Variance-based assessment with configurable thresholds

Capabilities

  • Run multiple evaluations on same input to measure consistency
  • Calculate variance, std dev, and range across WHAT/WHY/Overall scores
  • Rich console output with progress bars and formatted results
  • Performance metrics and execution time tracking
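
The repeated-run and statistics steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `benchmarks.py`: the helper names `run_repeated` and `score_stats` are hypothetical, the evaluator is a plain callable standing in for the LLM call, and the use of population variance (rather than sample variance) is an assumption.

```python
import statistics
import time
from typing import Callable


def run_repeated(evaluate: Callable[[], float], runs: int = 5):
    """Call evaluate() `runs` times; return the scores and elapsed seconds.

    Hypothetical helper: `evaluate` stands in for one LLM evaluation
    of the same commit message and diff.
    """
    start = time.perf_counter()
    scores = [evaluate() for _ in range(runs)]
    elapsed = time.perf_counter() - start
    return scores, elapsed


def score_stats(scores):
    """Summarise repeated scores for one dimension (WHAT, WHY, or Overall)."""
    if not scores:
        # Assumption: an empty input yields zeros instead of raising.
        return {"mean": 0.0, "variance": 0.0, "std_dev": 0.0, "range": 0.0}
    return {
        "mean": statistics.fmean(scores),
        # Assumption: population variance; the PR does not say which is used.
        "variance": statistics.pvariance(scores),
        "std_dev": statistics.pstdev(scores),
        "range": max(scores) - min(scores),
    }
```

Calling `score_stats` once per dimension on the scores collected by `run_repeated` gives the variance, standard deviation, and range that the capabilities list describes.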

Usage

# Test message stability on current HEAD
diffmage benchmark-stability "feat: add auth" --runs 5

# Test a specific commit
diffmage benchmark-stability "fix: bug" --commit abc123 --runs 3

Technical Details

  • Evaluates stability across WHAT (accuracy) and WHY (rationale/impact) dimensions
  • Stable: variance ≤ threshold (default 0.2)
  • Unstable: variance > threshold
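
The threshold rule above amounts to a single comparison. The sketch below assumes a hypothetical helper `is_stable` and constant name; only the default value 0.2 and the `≤` / `>` split come from the PR description, and which dimension's variance is compared is not stated there.

```python
# Default taken from the PR description; the constant name is hypothetical.
DEFAULT_VARIANCE_THRESHOLD = 0.2


def is_stable(variance: float, threshold: float = DEFAULT_VARIANCE_THRESHOLD) -> bool:
    """Return True when variance is at or below the threshold (stable)."""
    return variance <= threshold
```

A variance exactly equal to the threshold counts as stable, matching the `≤` in the description.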

Files

  • src/diffmage/evaluation/benchmarks.py (new)
  • tests/evaluation/test_benchmarks.py (new)
  • src/diffmage/cli/reports.py (enhanced)

Provides essential tooling for LLM evaluation validation, research, and quality assurance in production environments.

…evaluation

- Add new CLI command `benchmark_stability` to evaluate LLM consistency
- Implement `EvaluationBenchmarks` class with statistical analysis
- Add test suite for benchmark functionality
- Enable variance-based stability assessment with configurable thresholds
…tructure

- Introduce TypedDict classes for better type safety in benchmarking
- Update stability test to return structured TypedDict results
- Improve score variance calculation to handle empty inputs
- Enhance display of stability results with clearer dimension names
…ts structure

- Introduce TypedDict classes for better type safety in benchmarking
- Update stability test to return structured TypedDict results
- Improve score variance calculation to handle empty inputs
- Enhance display of stability results with clearer dimension names
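
For readers unfamiliar with the TypedDict approach these commits mention, a structured result could look something like the sketch below. The class and field names (`DimensionStats`, `StabilityResult`, and so on) are hypothetical and are not taken from `benchmarks.py`; the sketch only illustrates how typed dictionaries give the stability result a checked shape.

```python
from typing import TypedDict


class DimensionStats(TypedDict):
    """Spread of repeated scores for one dimension (hypothetical shape)."""
    variance: float
    std_dev: float
    score_range: float


class StabilityResult(TypedDict):
    """Result of one stability test (hypothetical shape)."""
    message: str
    runs: int
    what: DimensionStats
    why: DimensionStats
    overall: DimensionStats
    is_stable: bool


def empty_stats() -> DimensionStats:
    """Zeroed stats, one possible way to represent an empty score list."""
    return {"variance": 0.0, "std_dev": 0.0, "score_range": 0.0}
```

At runtime a TypedDict is an ordinary `dict`, so the structure adds static type checking without changing how results are displayed or serialised.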
@nick-galluzzo nick-galluzzo merged commit d425397 into main Aug 5, 2025
3 checks passed
@nick-galluzzo nick-galluzzo deleted the feat/benchmarks branch August 5, 2025 04:04