A production-grade framework for evaluating Large Language Model (LLM) performance with support for multi-metric evaluation, baseline comparison, and automated regression detection.
- Multi-Metric Evaluation - Run multiple evaluation metrics (Exact Match, F1 Score) in a single pass
- Baseline Management - Save and compare against golden evaluation results
- Regression Detection - Automatically detect performance degradations with configurable thresholds
- CI/CD Integration - Non-zero exit codes for failed quality gates
- Config-Driven - YAML-based configuration for reproducible evaluations
- Extensible Plugin System - Easy addition of new models and metrics
- Comprehensive Testing - Unit, integration, and end-to-end test coverage
git clone https://github.com/phiri13/llm-evaluation-framework.git
cd llm-evaluation-framework
pip install -r requirements.txt

# Run evaluation with multiple metrics
python -m src.runners.run_eval --config configs/multi_metric_eval.yaml
# Save result as baseline
python -m src.runners.run_eval \
--config configs/multi_metric_eval.yaml \
--save-baseline "v1.0"
# Run evaluation with regression checks
python -m src.runners.run_eval --config configs/eval_with_thresholds.yaml

Evaluation completed. Result saved to results/runs/xxxxx.json
Aggregate Scores:
exact_match: 0.500
f1_score: 0.750
Comparing against baseline: v1.0
All regression checks PASSED
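Because the runner exits non-zero when a regression check fails, an evaluation run can gate an automated pipeline directly. A minimal sketch of consuming that exit code from Python (the module path and config file are the ones shown above; everything else is illustrative):

# Invoke the evaluation and act on its exit code (non-zero == failed quality gate)
import subprocess
import sys

result = subprocess.run([
    sys.executable, "-m", "src.runners.run_eval",
    "--config", "configs/eval_with_thresholds.yaml",
])
if result.returncode != 0:
    print("Regression checks failed - blocking release")
    sys.exit(result.returncode)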
llm-evaluation-framework/
├── configs/                        # Evaluation configurations
│   ├── basic_eval.yaml             # Single metric config
│   ├── multi_metric_eval.yaml      # Multi-metric config
│   └── eval_with_thresholds.yaml   # Config with regression thresholds
├── data/
│   └── raw/                        # Evaluation datasets
│       └── qa_sample_v1.json
├── src/
│   ├── datasets/                   # Dataset loading and schemas
│   ├── evaluation/                 # Evaluation runner and regression detection
│   ├── metrics/                    # Metric implementations and registry
│   ├── models/                     # Model interface and registry
│   ├── runners/                    # CLI orchestration
│   └── storage/                    # Result and baseline storage
├── tests/                          # Comprehensive test suite
└── results/
    ├── runs/                       # Evaluation results
    └── baselines/                  # Baseline comparisons
# configs/multi_metric_eval.yaml
dataset_path: data/raw/qa_sample_v1.json
model_name: mock-deterministic
metrics:
- exact_match
- f1_score
output_dir: results/runs

# configs/eval_with_thresholds.yaml
dataset_path: data/raw/qa_sample_v1.json
model_name: mock-deterministic
metrics:
- exact_match
- f1_score
baseline_name: v1.0
thresholds:
  - metric_name: exact_match
    min_score: 0.3   # Absolute minimum score
    max_drop: 0.1    # Max allowed drop vs baseline
  - metric_name: f1_score
    min_score: 0.5
    max_drop: 0.1

from src.evaluation import EvaluationRunner
from src.models import ModelRegistry
from src.metrics import MetricRegistry
# Load model and metrics
model = ModelRegistry.get("mock-deterministic")
metrics = [
    MetricRegistry.get("exact_match"),
    MetricRegistry.get("f1_score"),
]
# Run evaluation
runner = EvaluationRunner(model=model, metrics=metrics)
result = runner.run("data/raw/qa_sample_v1.json")
print(f"Exact Match: {result.aggregate_scores['exact_match']:.3f}")
print(f"F1 Score: {result.aggregate_scores['f1_score']:.3f}")from src.storage.baseline import BaselineStore
from src.evaluation.regression import RegressionDetector, RegressionThreshold
# Save current result as baseline
baseline_store = BaselineStore()
baseline_store.save_baseline("v1.0", result)
# Compare future runs
baseline = baseline_store.load_baseline("v1.0")
detector = RegressionDetector([
    RegressionThreshold(metric_name="exact_match", max_drop=0.1),
])

# current_result is an EvaluationResult produced by a later evaluation run
regression_result = detector.check(current_result, baseline)
if not regression_result.passed:
    print("REGRESSION DETECTED!")
    for failure in regression_result.failures:
        print(f"  - {failure}")

# .github/workflows/eval.yml
name: Model Evaluation

on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run evaluation with thresholds
        run: |
          python -m src.runners.run_eval \
            --config configs/eval_with_thresholds.yaml

# src/metrics/custom_metric.py
from src.metrics.base import BaseMetric

class CustomMetric(BaseMetric):
    name = "custom_metric"
    description = "My custom evaluation metric"

    def score(self, prediction, reference, context=None) -> float:
        # Implement your scoring logic, e.g. a simple case-insensitive match
        score = 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0
        return score

# src/metrics/__init__.py
from src.metrics.custom_metric import CustomMetric
register_metric("custom_metric", CustomMetric)# src/models/my_model.py
# src/models/my_model.py
from src.models.base import BaseModel

class MyModel(BaseModel):
    name = "my-model"

    def generate(self, prompt: str) -> str:
        # Implement model inference; placeholder response shown here
        response = f"Echo: {prompt}"
        return response

    def metadata(self) -> dict:
        return {
            "model_type": "custom",
            "version": "1.0",
        }
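After registering the model (the exact registration call for models isn't shown here; presumably analogous to register_metric), it can be loaded and exercised through the registry. An illustrative sketch:

from src.models import ModelRegistry

# Hypothetical usage once "my-model" has been registered
model = ModelRegistry.get("my-model")
print(model.generate("What is the capital of France?"))
print(model.metadata())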
# Run all tests
python -m tests.test_metrics_exact_match
python -m tests.test_evaluation_runner
python -m tests.test_multi_metric
python -m tests.test_regression
# Run specific test
python -m tests.test_regression

Evaluation results are stored as JSON:
{
  "model_name": "mock-deterministic",
  "dataset_name": "basic_qa_eval",
  "num_samples": 100,
  "aggregate_scores": {
    "exact_match": 0.85,
    "f1_score": 0.92
  },
  "sample_results": [
    {
      "id": "1",
      "input": "What is the capital of France?",
      "prediction": "Paris",
      "expected_output": "Paris",
      "scores": {
        "exact_match": 1.0,
        "f1_score": 1.0
      }
    }
  ],
  "metadata": {
    "model_metadata": {...},
    "metrics": ["exact_match", "f1_score"]
  }
}

- Model Registry - Centralized model management with plugin architecture
- Metric Registry - Extensible metric system with instance-based scoring
- Evaluation Runner - Orchestrates model inference and metric computation
- Baseline Store - Manages golden evaluation results
- Regression Detector - Compares results against thresholds and baselines
- CLI Orchestrator - Config-driven evaluation execution
- Separation of Concerns - Clean boundaries between models, metrics, and evaluation
- Plugin Architecture - Easy extension without modifying core code
- Type Safety - Pydantic schemas for configuration and results
- Testability - Comprehensive test coverage with mocked dependencies
- CI/CD Ready - Exit codes and reports for automated quality gates
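As an illustration of the type-safety point, a hypothetical Pydantic schema mirroring the YAML config fields shown earlier might look like this (the framework's actual schemas may be named and structured differently):

from typing import List, Optional
from pydantic import BaseModel, Field

class ThresholdConfig(BaseModel):
    metric_name: str
    min_score: Optional[float] = None   # absolute minimum score
    max_drop: Optional[float] = None    # max allowed drop vs baseline

class EvalConfig(BaseModel):
    dataset_path: str
    model_name: str
    metrics: List[str]
    output_dir: str = "results/runs"
    baseline_name: Optional[str] = None
    thresholds: List[ThresholdConfig] = Field(default_factory=list)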
| Metric | Description | Use Case |
|---|---|---|
| `exact_match` | Case-insensitive exact string match | QA, classification |
| `f1_score` | Token-level F1 score | Text generation, QA |
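For reference, token-level F1 treats prediction and reference as bags of tokens and balances precision against recall. A minimal sketch of the idea (whitespace tokenisation; the framework's implementation in src/metrics may differ):

from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Count tokens shared between prediction and reference
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Paris France", "Paris"))  # ~0.67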
| Model | Type | Description |
|---|---|---|
| `mock-deterministic` | Mock | Returns input as output (testing) |
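For context, a deterministic mock of this kind can be as simple as echoing the prompt, which keeps test runs fully reproducible. A sketch of the idea (the actual implementation in src/models may differ):

from src.models.base import BaseModel

class EchoMockModel(BaseModel):
    name = "mock-deterministic"

    def generate(self, prompt: str) -> str:
        # Deterministic: the output is always the input
        return prompt

    def metadata(self) -> dict:
        return {"model_type": "mock"}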
Datasets use a standardized JSON schema:
{
  "name": "dataset_name",
  "version": "1.0",
  "task": "qa",
  "description": "Dataset description",
  "samples": [
    {
      "id": "1",
      "task": "qa",
      "input": "Question or prompt",
      "expected_output": "Ground truth answer",
      "metadata": {}
    }
  ]
}
The framework supports two types of regression checks:

- Absolute Thresholds - Enforce minimum scores
- Relative Thresholds - Limit performance drops vs baselines
Example regression check:
thresholds = [
    RegressionThreshold(
        metric_name="exact_match",
        min_score=0.3,   # Must score at least 30%
        max_drop=0.1     # Can't drop >10% from baseline
    )
]

git clone https://github.com/phiri13/llm-evaluation-framework.git
cd llm-evaluation-framework
pip install -r requirements.txt

# Run all tests
python -m tests.test_metrics_exact_match
python -m tests.test_evaluation_runner
python -m tests.test_multi_metric
python -m tests.test_regression
python -m tests.test_result_storage
python -m tests.test_eval_cli

# Create feature branch
git checkout -b feature/my-feature
# Make changes and commit
git add .
git commit -m "feat: add new feature"
# Merge to develop
git checkout develop
git merge feature/my-feature

This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
Joshua Phiri - @phiri13
Project Link: https://github.com/phiri13/llm-evaluation-framework
Note: This framework is designed for production use but currently includes mock models for demonstration. Integration with real LLM APIs (OpenAI, Anthropic, etc.) can be added by extending the BaseModel interface.
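As a rough sketch of what such an extension could look like, using the official openai Python client (>= 1.0); the class name, module path, and model choice here are illustrative, not part of the framework:

# src/models/openai_model.py (hypothetical)
from openai import OpenAI

from src.models.base import BaseModel

class OpenAIModel(BaseModel):
    name = "openai-gpt-4o-mini"

    def __init__(self):
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def generate(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    def metadata(self) -> dict:
        return {"model_type": "openai", "model": "gpt-4o-mini"}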