A production-grade framework for evaluating Large Language Model (LLM) performance with support for multi-metric evaluation, baseline comparison, and automated regression detection.
- Multi-Metric Evaluation - Run multiple evaluation metrics (Exact Match, F1 Score) in a single pass
- Baseline Management - Save and compare against golden evaluation results
- Regression Detection - Automatically detect performance degradations with configurable thresholds
- CI/CD Integration - Non-zero exit codes for failed quality gates
- Config-Driven - YAML-based configuration for reproducible evaluations
- Extensible Plugin System - Easy addition of new models and metrics
- Comprehensive Testing - Unit, integration, and end-to-end test coverage
git clone https://github.com/phiri13/llm-evaluation-framework.git
cd llm-evaluation-framework
pip install -r requirements.txt

# Run evaluation with multiple metrics
python -m src.runners.run_eval --config configs/multi_metric_eval.yaml
# Save result as baseline
python -m src.runners.run_eval \
--config configs/multi_metric_eval.yaml \
--save-baseline "v1.0"
# Run evaluation with regression checks
python -m src.runners.run_eval --config configs/eval_with_thresholds.yaml

Evaluation completed. Result saved to results/runs/xxxxx.json
Aggregate Scores:
exact_match: 0.500
f1_score: 0.750
Comparing against baseline: v1.0
All regression checks PASSED
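Because the runner exits non-zero when a regression check fails, an evaluation run can gate an automated pipeline directly. A minimal sketch of consuming that exit code from Python (the module path and config file are the ones shown above; everything else is illustrative):

# Invoke the evaluation and act on its exit code (non-zero == failed quality gate)
import subprocess
import sys

result = subprocess.run([
    sys.executable, "-m", "src.runners.run_eval",
    "--config", "configs/eval_with_thresholds.yaml",
])
if result.returncode != 0:
    print("Regression checks failed - blocking release")
    sys.exit(result.returncode)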
llm-evaluation-framework/
├── configs/                        # Evaluation configurations
│   ├── basic_eval.yaml             # Single metric config
│   ├── multi_metric_eval.yaml      # Multi-metric config
│   └── eval_with_thresholds.yaml   # Config with regression thresholds
├── data/
│   └── raw/                        # Evaluation datasets
│       └── qa_sample_v1.json
├── src/
│   ├── datasets/                   # Dataset loading and schemas
│   ├── evaluation/                 # Evaluation runner and regression detection
│   ├── metrics/                    # Metric implementations and registry
│   ├── models/                     # Model interface and registry
│   ├── runners/                    # CLI orchestration
│   └── storage/                    # Result and baseline storage
├── tests/                          # Comprehensive test suite
└── results/
    ├── runs/                       # Evaluation results
    └── baselines/                  # Baseline comparisons
# configs/multi_metric_eval.yaml
dataset_path: data/raw/qa_sample_v1.json
model_name: mock-deterministic
metrics:
- exact_match
- f1_score
output_dir: results/runs

# configs/eval_with_thresholds.yaml
dataset_path: data/raw/qa_sample_v1.json
model_name: mock-deterministic
metrics:
- exact_match
- f1_score
baseline_name: v1.0
thresholds:
  - metric_name: exact_match
    min_score: 0.3   # Absolute minimum score
    max_drop: 0.1    # Max allowed drop vs baseline
  - metric_name: f1_score
    min_score: 0.5
    max_drop: 0.1

from src.evaluation import EvaluationRunner
from src.models import ModelRegistry
from src.metrics import MetricRegistry
# Load model and metrics
model = ModelRegistry.get("mock-deterministic")
metrics = [
    MetricRegistry.get("exact_match"),
    MetricRegistry.get("f1_score"),
]
# Run evaluation
runner = EvaluationRunner(model=model, metrics=metrics)
result = runner.run("data/raw/qa_sample_v1.json")
print(f"Exact Match: {result.aggregate_scores['exact_match']:.3f}")
print(f"F1 Score: {result.aggregate_scores['f1_score']:.3f}")from src.storage.baseline import BaselineStore
from src.evaluation.regression import RegressionDetector, RegressionThreshold
# Save current result as baseline
baseline_store = BaselineStore()
baseline_store.save_baseline("v1.0", result)
# Compare future runs
baseline = baseline_store.load_baseline("v1.0")
detector = RegressionDetector([
    RegressionThreshold(metric_name="exact_match", max_drop=0.1),
])

# current_result is an EvaluationResult produced by a later evaluation run
regression_result = detector.check(current_result, baseline)
if not regression_result.passed:
    print("REGRESSION DETECTED!")
    for failure in regression_result.failures:
        print(f"  - {failure}")

# .github/workflows/eval.yml
name: Model Evaluation

on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run evaluation with thresholds
        run: |
          python -m src.runners.run_eval \
            --config configs/eval_with_thresholds.yaml

# src/metrics/custom_metric.py
from src.metrics.base import BaseMetric

class CustomMetric(BaseMetric):
    name = "custom_metric"
    description = "My custom evaluation metric"

    def score(self, prediction, reference, context=None) -> float:
        # Implement your scoring logic, e.g. a simple case-insensitive match
        score = 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0
        return score

# src/metrics/__init__.py
from src.metrics.custom_metric import CustomMetric
register_metric("custom_metric", CustomMetric)# src/models/my_model.py
# src/models/my_model.py
from src.models.base import BaseModel

class MyModel(BaseModel):
    name = "my-model"

    def generate(self, prompt: str) -> str:
        # Implement model inference; placeholder response shown here
        response = f"Echo: {prompt}"
        return response

    def metadata(self) -> dict:
        return {
            "model_type": "custom",
            "version": "1.0",
        }
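After registering the model (the exact registration call for models isn't shown here; presumably analogous to register_metric), it can be loaded and exercised through the registry. An illustrative sketch:

from src.models import ModelRegistry

# Hypothetical usage once "my-model" has been registered
model = ModelRegistry.get("my-model")
print(model.generate("What is the capital of France?"))
print(model.metadata())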
# Run all tests
python -m tests.test_metrics_exact_match
python -m tests.test_evaluation_runner
python -m tests.test_multi_metric
python -m tests.test_regression
# Run specific test
python -m tests.test_regression

Evaluation results are stored as JSON:
{
  "model_name": "mock-deterministic",
  "dataset_name": "basic_qa_eval",
  "num_samples": 100,
  "aggregate_scores": {
    "exact_match": 0.85,
    "f1_score": 0.92
  },
  "sample_results": [
    {
      "id": "1",
      "input": "What is the capital of France?",
      "prediction": "Paris",
      "expected_output": "Paris",
      "scores": {
        "exact_match": 1.0,
        "f1_score": 1.0
      }
    }
  ],
  "metadata": {
    "model_metadata": {...},
    "metrics": ["exact_match", "f1_score"]
  }
}

- Model Registry - Centralized model management with plugin architecture
- Metric Registry - Extensible metric system with instance-based scoring
- Evaluation Runner - Orchestrates model inference and metric computation
- Baseline Store - Manages golden evaluation results
- Regression Detector - Compares results against thresholds and baselines
- CLI Orchestrator - Config-driven evaluation execution
- Separation of Concerns - Clean boundaries between models, metrics, and evaluation
- Plugin Architecture - Easy extension without modifying core code
- Type Safety - Pydantic schemas for configuration and results
- Testability - Comprehensive test coverage with mocked dependencies
- CI/CD Ready - Exit codes and reports for automated quality gates
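As an illustration of the type-safety point, a hypothetical Pydantic schema mirroring the YAML config fields shown earlier might look like this (the framework's actual schemas may be named and structured differently):

from typing import List, Optional
from pydantic import BaseModel, Field

class ThresholdConfig(BaseModel):
    metric_name: str
    min_score: Optional[float] = None   # absolute minimum score
    max_drop: Optional[float] = None    # max allowed drop vs baseline

class EvalConfig(BaseModel):
    dataset_path: str
    model_name: str
    metrics: List[str]
    output_dir: str = "results/runs"
    baseline_name: Optional[str] = None
    thresholds: List[ThresholdConfig] = Field(default_factory=list)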
| Metric | Description | Use Case |
|---|---|---|
| `exact_match` | Case-insensitive exact string match | QA, classification |
| `f1_score` | Token-level F1 score | Text generation, QA |
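For reference, token-level F1 treats prediction and reference as bags of tokens and balances precision against recall. A minimal sketch of the idea (whitespace tokenisation; the framework's implementation in src/metrics may differ):

from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Count tokens shared between prediction and reference
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Paris France", "Paris"))  # ~0.67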
| Model | Type | Description |
|---|---|---|
| `mock-deterministic` | Mock | Returns input as output (testing) |
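For context, a deterministic mock of this kind can be as simple as echoing the prompt, which keeps test runs fully reproducible. A sketch of the idea (the actual implementation in src/models may differ):

from src.models.base import BaseModel

class EchoMockModel(BaseModel):
    name = "mock-deterministic"

    def generate(self, prompt: str) -> str:
        # Deterministic: the output is always the input
        return prompt

    def metadata(self) -> dict:
        return {"model_type": "mock"}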
Datasets use a standardized JSON schema:
{
  "name": "dataset_name",
  "version": "1.0",
  "task": "qa",
  "description": "Dataset description",
  "samples": [
    {
      "id": "1",
      "task": "qa",
      "input": "Question or prompt",
      "expected_output": "Ground truth answer",
      "metadata": {}
    }
  ]
}
The framework supports two types of regression checks:

- Absolute Thresholds - Enforce minimum scores
- Relative Thresholds - Limit performance drops vs baselines
Example regression check:
thresholds = [
    RegressionThreshold(
        metric_name="exact_match",
        min_score=0.3,   # Must score at least 30%
        max_drop=0.1     # Can't drop >10% from baseline
    )
]

git clone https://github.com/phiri13/llm-evaluation-framework.git
cd llm-evaluation-framework
pip install -r requirements.txt

# Run all tests
python -m tests.test_metrics_exact_match
python -m tests.test_evaluation_runner
python -m tests.test_multi_metric
python -m tests.test_regression
python -m tests.test_result_storage
python -m tests.test_eval_cli

# Create feature branch
git checkout -b feature/my-feature
# Make changes and commit
git add .
git commit -m "feat: add new feature"
# Merge to develop
git checkout develop
git merge feature/my-feature

This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
Joshua Phiri - @phiri13
Project Link: https://github.com/phiri13/llm-evaluation-framework
Note: This framework is designed for production use but currently includes mock models for demonstration. Integration with real LLM APIs (OpenAI, Anthropic, etc.) can be added by extending the BaseModel interface.
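As a rough sketch of what such an extension could look like, using the official openai Python client (>= 1.0); the class name, module path, and model choice here are illustrative, not part of the framework:

# src/models/openai_model.py (hypothetical)
from openai import OpenAI

from src.models.base import BaseModel

class OpenAIModel(BaseModel):
    name = "openai-gpt-4o-mini"

    def __init__(self):
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def generate(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    def metadata(self) -> dict:
        return {"model_type": "openai", "model": "gpt-4o-mini"}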