LLM Evaluation Framework

Python 3.11+ | MIT License

A production-grade framework for evaluating Large Language Model (LLM) performance with support for multi-metric evaluation, baseline comparison, and automated regression detection.

Key Features

  • Multi-Metric Evaluation - Run multiple evaluation metrics (Exact Match, F1 Score) in a single pass
  • Baseline Management - Save and compare against golden evaluation results
  • Regression Detection - Automatically detect performance degradations with configurable thresholds
  • CI/CD Integration - Non-zero exit codes for failed quality gates
  • Config-Driven - YAML-based configuration for reproducible evaluations
  • Extensible Plugin System - Easy addition of new models and metrics
  • Comprehensive Testing - Unit, integration, and end-to-end test coverage

Quick Start

Installation

git clone https://github.com/phiri13/llm-evaluation-framework.git
cd llm-evaluation-framework
pip install -r requirements.txt

Basic Evaluation

# Run evaluation with multiple metrics
python -m src.runners.run_eval --config configs/multi_metric_eval.yaml

# Save result as baseline
python -m src.runners.run_eval \
  --config configs/multi_metric_eval.yaml \
  --save-baseline "v1.0"

# Run evaluation with regression checks
python -m src.runners.run_eval --config configs/eval_with_thresholds.yaml

Output

Evaluation completed. Result saved to results/runs/xxxxx.json

Aggregate Scores:
  exact_match: 0.500
  f1_score: 0.750

Comparing against baseline: v1.0

All regression checks PASSED

Project Structure

llm-evaluation-framework/
├── configs/                    # Evaluation configurations
│   ├── basic_eval.yaml        # Single metric config
│   ├── multi_metric_eval.yaml # Multi-metric config
│   └── eval_with_thresholds.yaml # Config with regression thresholds
├── data/
│   └── raw/                   # Evaluation datasets
│       └── qa_sample_v1.json
├── src/
│   ├── datasets/              # Dataset loading and schemas
│   ├── evaluation/            # Evaluation runner and regression detection
│   ├── metrics/               # Metric implementations and registry
│   ├── models/                # Model interface and registry
│   ├── runners/               # CLI orchestration
│   └── storage/               # Result and baseline storage
├── tests/                     # Comprehensive test suite
└── results/
    ├── runs/                  # Evaluation results
    └── baselines/             # Baseline comparisons

Configuration

Evaluation Config (configs/multi_metric_eval.yaml)

dataset_path: data/raw/qa_sample_v1.json
model_name: mock-deterministic
metrics:
  - exact_match
  - f1_score
output_dir: results/runs

Regression Thresholds (configs/eval_with_thresholds.yaml)

dataset_path: data/raw/qa_sample_v1.json
model_name: mock-deterministic
metrics:
  - exact_match
  - f1_score
baseline_name: v1.0
thresholds:
  - metric_name: exact_match
    min_score: 0.3      # Absolute minimum score
    max_drop: 0.1       # Max allowed drop vs baseline
  - metric_name: f1_score
    min_score: 0.5
    max_drop: 0.1
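
These configuration files map naturally onto Pydantic schemas (the framework advertises Pydantic-backed configuration and results). The sketch below shows what such a schema could look like for the fields above; the class names and loader are illustrative assumptions, not the framework's actual API.

# Illustrative sketch only: mirrors the YAML fields above with hypothetical
# class names; the framework's real config schema may differ.
from pathlib import Path
from typing import Optional

import yaml
from pydantic import BaseModel


class ThresholdConfig(BaseModel):
    metric_name: str
    min_score: Optional[float] = None  # absolute minimum score
    max_drop: Optional[float] = None   # max allowed drop vs baseline


class EvalConfig(BaseModel):
    dataset_path: str
    model_name: str
    metrics: list[str]
    output_dir: str = "results/runs"
    baseline_name: Optional[str] = None
    thresholds: list[ThresholdConfig] = []


def load_config(path: str) -> EvalConfig:
    # Parse the YAML file and validate it against the schema
    return EvalConfig(**yaml.safe_load(Path(path).read_text()))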

Usage Examples

1. Basic Evaluation

from src.evaluation import EvaluationRunner
from src.models import ModelRegistry
from src.metrics import MetricRegistry

# Load model and metrics
model = ModelRegistry.get("mock-deterministic")
metrics = [
    MetricRegistry.get("exact_match"),
    MetricRegistry.get("f1_score"),
]

# Run evaluation
runner = EvaluationRunner(model=model, metrics=metrics)
result = runner.run("data/raw/qa_sample_v1.json")

print(f"Exact Match: {result.aggregate_scores['exact_match']:.3f}")
print(f"F1 Score: {result.aggregate_scores['f1_score']:.3f}")

2. Baseline Comparison

from src.storage.baseline import BaselineStore
from src.evaluation.regression import RegressionDetector, RegressionThreshold

# Save current result as baseline
baseline_store = BaselineStore()
baseline_store.save_baseline("v1.0", result)

# Compare a later run against the stored baseline
baseline = baseline_store.load_baseline("v1.0")
detector = RegressionDetector([
    RegressionThreshold(metric_name="exact_match", max_drop=0.1),
])
# current_result is the EvaluationResult from the later run being checked
regression_result = detector.check(current_result, baseline)

if not regression_result.passed:
    print("REGRESSION DETECTED!")
    for failure in regression_result.failures:
        print(f"  - {failure}")

3. CI/CD Integration

# .github/workflows/eval.yml
name: Model Evaluation

on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      
      - name: Install dependencies
        run: pip install -r requirements.txt
      
      - name: Run evaluation with thresholds
        run: |
          python -m src.runners.run_eval \
            --config configs/eval_with_thresholds.yaml
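
The quality gate works because the CLI returns a non-zero exit code when any threshold check fails, which fails the workflow step above. A minimal sketch of that pattern (illustrative only; not the framework's actual entry point):

# Illustrative exit-code pattern for CI gates; not the framework's actual CLI code.
import sys


def run_evaluation_and_checks() -> bool:
    # Stand-in for the framework's evaluation + threshold logic
    return True


def main() -> int:
    if not run_evaluation_and_checks():
        print("Regression checks FAILED")
        return 1  # a non-zero exit code fails the CI job
    print("All regression checks PASSED")
    return 0


if __name__ == "__main__":
    sys.exit(main())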

Extending the Framework

Adding a New Metric

# src/metrics/custom_metric.py
from src.metrics.base import BaseMetric

class CustomMetric(BaseMetric):
    name = "custom_metric"
    description = "My custom evaluation metric"
    
    def score(self, prediction, reference, context=None) -> float:
        # Implement your scoring logic; return a value in [0, 1].
        # Example: case-insensitive exact match.
        return float(prediction.strip().lower() == reference.strip().lower())

# src/metrics/__init__.py
from src.metrics.custom_metric import CustomMetric

register_metric("custom_metric", CustomMetric)
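
Once registered, the custom metric can be retrieved like the built-in ones, assuming MetricRegistry.get returns metric instances as in the Usage Examples above:

from src.metrics import MetricRegistry

# Assumes the registration in src/metrics/__init__.py has run on import
metric = MetricRegistry.get("custom_metric")
print(metric.score(prediction="Paris", reference="paris"))  # 1.0 with the example logic above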

Adding a New Model

# src/models/my_model.py
from src.models.base import BaseModel

class MyModel(BaseModel):
    name = "my-model"
    
    def generate(self, prompt: str) -> str:
        # Implement model inference; return the generated text
        return f"model output for: {prompt}"  # placeholder response
    
    def metadata(self) -> dict:
        return {
            "model_type": "custom",
            "version": "1.0"
        }
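
After the model is registered with the model registry (this README does not show the model registration hook, so that step is assumed here), it plugs into the same evaluation flow as the built-in models:

from src.evaluation import EvaluationRunner
from src.metrics import MetricRegistry
from src.models import ModelRegistry

# Assumes "my-model" has been registered with ModelRegistry
# (via whatever hook src/models exposes; not shown in this README).
model = ModelRegistry.get("my-model")
metrics = [MetricRegistry.get("exact_match")]

runner = EvaluationRunner(model=model, metrics=metrics)
result = runner.run("data/raw/qa_sample_v1.json")
print(result.aggregate_scores)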

Testing

# Run the core test modules
python -m tests.test_metrics_exact_match
python -m tests.test_evaluation_runner
python -m tests.test_multi_metric
python -m tests.test_regression

# Run a specific test module
python -m tests.test_regression

Results Format

Evaluation results are stored as JSON:

{
  "model_name": "mock-deterministic",
  "dataset_name": "basic_qa_eval",
  "num_samples": 100,
  "aggregate_scores": {
    "exact_match": 0.85,
    "f1_score": 0.92
  },
  "sample_results": [
    {
      "id": "1",
      "input": "What is the capital of France?",
      "prediction": "Paris",
      "expected_output": "Paris",
      "scores": {
        "exact_match": 1.0,
        "f1_score": 1.0
      }
    }
  ],
  "metadata": {
    "model_metadata": {...},
    "metrics": ["exact_match", "f1_score"]
  }
}
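
Because runs are plain JSON, downstream tooling can consume them directly; a small sketch (the file name below is a placeholder for a generated run file):

import json
from pathlib import Path

# Placeholder path: actual run files are written to results/runs/ with generated names
run = json.loads(Path("results/runs/example_run.json").read_text())

for metric, score in run["aggregate_scores"].items():
    print(f"{metric}: {score:.3f}")

# Inspect samples that missed exact match
misses = [s for s in run["sample_results"] if s["scores"]["exact_match"] == 0.0]
print(f"{len(misses)} / {run['num_samples']} samples missed exact match")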

Architecture

Core Components

  1. Model Registry - Centralized model management with plugin architecture
  2. Metric Registry - Extensible metric system with instance-based scoring
  3. Evaluation Runner - Orchestrates model inference and metric computation
  4. Baseline Store - Manages golden evaluation results
  5. Regression Detector - Compares results against thresholds and baselines
  6. CLI Orchestrator - Config-driven evaluation execution

Design Principles

  • Separation of Concerns - Clean boundaries between models, metrics, and evaluation
  • Plugin Architecture - Easy extension without modifying core code
  • Type Safety - Pydantic schemas for configuration and results
  • Testability - Comprehensive test coverage with mocked dependencies
  • CI/CD Ready - Exit codes and reports for automated quality gates

Available Metrics

Metric        Description                          Use Case
exact_match   Case-insensitive exact string match  QA, classification
f1_score      Token-level F1 score                 Text generation, QA
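
For reference, token-level F1 is conventionally computed from the token overlap between prediction and reference; the sketch below shows the standard formulation, which may differ in detail from the framework's implementation:

from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    # Standard token-level F1: precision and recall over shared tokens (multiset overlap)
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)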

Available Models

Model               Type  Description
mock-deterministic  Mock  Returns input as output (testing)

Dataset Format

Datasets use a standardized JSON schema:

{
  "name": "dataset_name",
  "version": "1.0",
  "task": "qa",
  "description": "Dataset description",
  "samples": [
    {
      "id": "1",
      "task": "qa",
      "input": "Question or prompt",
      "expected_output": "Ground truth answer",
      "metadata": {}
    }
  ]
}
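
A quick way to sanity-check a dataset against this format before running an evaluation (the framework's own loaders in src/datasets presumably do stricter, schema-based validation):

import json
from pathlib import Path

# Minimal structural check for the format above; illustrative only.
REQUIRED_SAMPLE_KEYS = {"id", "task", "input", "expected_output"}

dataset = json.loads(Path("data/raw/qa_sample_v1.json").read_text())
for sample in dataset["samples"]:
    missing = REQUIRED_SAMPLE_KEYS - sample.keys()
    if missing:
        raise ValueError(f"Sample {sample.get('id')} is missing fields: {missing}")
print(f"{dataset['name']} v{dataset['version']}: {len(dataset['samples'])} samples OK")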

Regression Detection

The framework supports two types of regression checks:

  1. Absolute Thresholds - Enforce minimum scores
  2. Relative Thresholds - Limit performance drops vs baselines

Example regression check:

thresholds = [
    RegressionThreshold(
        metric_name="exact_match",
        min_score=0.3,      # Must score at least 30%
        max_drop=0.1        # Can't drop >10% from baseline
    )
]
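
Conceptually, each threshold reduces to two comparisons: the current score against min_score, and the drop from the baseline against max_drop. A sketch of that logic (not the framework's actual detector code):

# Illustrative sketch of how the two check types presumably combine;
# the framework's RegressionDetector may differ in detail.
def check_metric(current: float, baseline: float | None,
                 min_score: float | None, max_drop: float | None) -> list[str]:
    failures = []
    if min_score is not None and current < min_score:
        failures.append(f"score {current:.3f} is below the absolute minimum {min_score:.3f}")
    if max_drop is not None and baseline is not None and baseline - current > max_drop:
        failures.append(f"dropped {baseline - current:.3f} vs baseline (max allowed {max_drop:.3f})")
    return failures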

Development

Setup Development Environment

git clone https://github.com/phiri13/llm-evaluation-framework.git
cd llm-evaluation-framework
pip install -r requirements.txt

Running Tests

# Run all tests
python -m tests.test_metrics_exact_match
python -m tests.test_evaluation_runner
python -m tests.test_multi_metric
python -m tests.test_regression
python -m tests.test_result_storage
python -m tests.test_eval_cli

Git Workflow

# Create feature branch
git checkout -b feature/my-feature

# Make changes and commit
git add .
git commit -m "feat: add new feature"

# Merge to develop
git checkout develop
git merge feature/my-feature

License

This project is licensed under the MIT License - see the LICENSE file for details.


Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

Contact

Joshua Phiri - @phiri13

Project Link: https://github.com/phiri13/llm-evaluation-framework


Note: This framework is designed for production use but currently includes mock models for demonstration. Integration with real LLM APIs (OpenAI, Anthropic, etc.) can be added by extending the BaseModel interface.
