🌟 Enterprise-Grade Python Framework for Large Language Model Evaluation & Testing 🌟
Built with production-ready standards • Type-safe • Comprehensive testing • Full CLI support
📚 Documentation • 🚀 Quick Start • 💡 Examples • 🐛 Report Issues
```bash
# Install from PyPI (recommended)
pip install LLMEvaluationFramework

# Or install from source for the latest features
git clone https://github.com/isathish/LLMEvaluationFramework.git
cd LLMEvaluationFramework
pip install -e .
```

Requirements: Python 3.8+ • No external dependencies for core functionality
```python
from llm_evaluation_framework import (
    ModelRegistry,
    ModelInferenceEngine,
    TestDatasetGenerator,
)

# 1️⃣ Set up the registry and register your model
registry = ModelRegistry()
registry.register_model("gpt-3.5-turbo", {
    "provider": "openai",
    "api_cost_input": 0.0015,
    "api_cost_output": 0.002,
    "capabilities": ["reasoning", "creativity", "coding"],
})

# 2️⃣ Generate test cases
generator = TestDatasetGenerator()
test_cases = generator.generate_test_cases(
    use_case={"domain": "general", "required_capabilities": ["reasoning"]},
    count=10,
)

# 3️⃣ Run the evaluation
engine = ModelInferenceEngine(registry)
results = engine.evaluate_model("gpt-3.5-turbo", test_cases)

# 4️⃣ Analyze the results
print(f"✅ Accuracy: {results['aggregate_metrics']['accuracy']:.1%}")
print(f"💰 Total Cost: ${results['aggregate_metrics']['total_cost']:.4f}")
print(f"⏱️ Total Time: {results['aggregate_metrics']['total_time']:.2f}s")
```

```bash
# Evaluate a model with specific capabilities
llm-eval evaluate --model gpt-3.5-turbo --test-cases 10 --capability reasoning

# Generate a custom test dataset
llm-eval generate --capability coding --count 20 --output my_dataset.json

# Score predictions against references
llm-eval score --predictions "Hello world" "Good morning" \
               --references "Hello world" "Good evening" \
               --metric accuracy

# List available capabilities and models
llm-eval list
```

```mermaid
graph TB
    CLI[🖥️ CLI Interface<br/>llm-eval] --> Engine[⚙️ Inference Engine<br/>ModelInferenceEngine]
    Engine --> Registry[🗄️ Model Registry<br/>ModelRegistry]
    Engine --> Generator[🧪 Dataset Generator<br/>TestDatasetGenerator]
    Engine --> Scoring[📊 Scoring Strategies<br/>AccuracyScoringStrategy]
    Registry --> Models[(🤖 Models<br/>gpt-3.5-turbo, gpt-4, etc.)]
    Engine --> Storage[💾 Persistence Layer]
    Storage --> JSON[📄 JSON Store]
    Storage --> SQLite[🗃️ SQLite Store]
    Engine --> Utils[🛠️ Utilities]
    Utils --> Logger[📝 Advanced Logging]
    Utils --> ErrorHandler[🛡️ Error Handling]
    Utils --> AutoSuggest[💡 Auto Suggestions]
```
| Component | Description | Key Features |
|---|---|---|
| 🔥 Inference Engine | Execute and evaluate LLM inferences | Async processing, cost tracking, batch operations |
| 🗄️ Model Registry | Centralized model management | Multi-provider support, configuration management |
| 🧪 Dataset Generator | Create synthetic test cases | Capability-based generation, domain-specific tests |
| 📊 Scoring Strategies | Multiple evaluation metrics | Accuracy, F1-score, custom metrics |
| 💾 Persistence Layer | Dual storage backends | JSON files, SQLite database with querying |
| 🛡️ Error Handling | Robust error management | Custom exceptions, retry mechanisms |
| 📝 Logging System | Advanced logging capabilities | File rotation, structured logging |
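The Persistence Layer row above describes dual storage backends: JSON files and a SQLite database with querying. A minimal standard-library sketch of that idea follows; the table schema, column names, and record shape here are illustrative assumptions, not the framework's actual storage format.

```python
import json
import sqlite3

# Hypothetical evaluation record; the field names are illustrative.
record = {"model": "gpt-3.5-turbo", "accuracy": 0.9, "total_cost": 0.0123}

# SQLite backend: structured rows support SQL filtering and aggregation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (model TEXT, accuracy REAL, total_cost REAL)")
conn.execute("INSERT INTO results VALUES (:model, :accuracy, :total_cost)", record)
best = conn.execute(
    "SELECT model, accuracy FROM results ORDER BY accuracy DESC LIMIT 1"
).fetchone()
print(best)  # ('gpt-3.5-turbo', 0.9)

# JSON backend: the whole record stored as a self-describing document.
doc = json.dumps(record)
print(json.loads(doc)["accuracy"])  # 0.9
```

The trade-off sketched here is the usual one: SQLite rows are queryable across many runs, while JSON documents are human-readable and schema-free.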
```python
# Available evaluation capabilities
CAPABILITIES = [
    "reasoning",    # Logical reasoning and problem-solving
    "creativity",   # Creative writing and ideation
    "factual",      # Factual accuracy and knowledge
    "instruction",  # Instruction following
    "coding",       # Code generation and debugging
]
```

🔍 Click to see Advanced Usage Examples
```python
from llm_evaluation_framework import ModelRegistry, ModelInferenceEngine
from llm_evaluation_framework.persistence import JSONStore

# Set up multiple models
registry = ModelRegistry()
models = {
    "gpt-3.5-turbo": {"provider": "openai", "cost_input": 0.0015},
    "gpt-4": {"provider": "openai", "cost_input": 0.03},
    "claude-3": {"provider": "anthropic", "cost_input": 0.015},
}
for name, config in models.items():
    registry.register_model(name, config)

# Run a comparative evaluation
engine = ModelInferenceEngine(registry)
results = {}
for model_name in models:
    print(f"🚀 Evaluating {model_name}...")
    result = engine.evaluate_model(model_name, test_cases)
    results[model_name] = result

    # Save the results
    store = JSONStore(f"results_{model_name}.json")
    store.save_evaluation_result(result)

# Compare the results
for model, result in results.items():
    accuracy = result['aggregate_metrics']['accuracy']
    cost = result['aggregate_metrics']['total_cost']
    print(f"📊 {model}: {accuracy:.1%} accuracy, ${cost:.4f} cost")
```

```python
from llm_evaluation_framework.evaluation.scoring_strategies import ScoringContext


class CustomCosineSimilarityStrategy:
    """Custom scoring using TF-IDF cosine similarity (requires scikit-learn)."""

    def calculate_score(self, predictions, references):
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        vectorizer = TfidfVectorizer()
        vectors = vectorizer.fit_transform(predictions + references)
        pred_vectors = vectors[:len(predictions)]
        ref_vectors = vectors[len(predictions):]
        similarities = cosine_similarity(pred_vectors, ref_vectors)
        # Score each prediction against its own reference, then average
        return similarities.diagonal().mean()


# Use the custom strategy
custom_strategy = CustomCosineSimilarityStrategy()
context = ScoringContext(custom_strategy)
score = context.evaluate(predictions, references)
print(f"🎯 Custom similarity score: {score:.3f}")
```

```python
import asyncio

from llm_evaluation_framework.engines.async_inference_engine import AsyncInferenceEngine

CAPABILITIES = ["reasoning", "creativity", "coding"]


async def run_async_evaluation():
    """Run multiple evaluations concurrently."""
    async_engine = AsyncInferenceEngine(registry)

    # Define one evaluation task per capability
    tasks = [
        async_engine.evaluate_async(
            model_name="gpt-3.5-turbo",
            test_cases=test_cases,
            capability=capability,
        )
        for capability in CAPABILITIES
    ]

    # Run all evaluations concurrently
    results = await asyncio.gather(*tasks)

    # Process the results, pairing each with its capability
    for capability, result in zip(CAPABILITIES, results):
        accuracy = result['aggregate_metrics']['accuracy']
        print(f"✅ {capability}: {accuracy:.1%}")


# Run the async evaluation
asyncio.run(run_async_evaluation())
```

| Section | Description | Link |
|---|---|---|
| 🚀 Getting Started | Installation, quick start, and basic concepts | View Guide |
| 🧠 Core Concepts | Understanding the framework architecture | Learn More |
| 🖥️ CLI Usage | Complete command-line interface documentation | CLI Guide |
| 📊 API Reference | Detailed API documentation with examples | API Docs |
| 💡 Examples | Practical examples and tutorials | View Examples |
| 🛠️ Developer Guide | Contributing guidelines and development setup | Dev Guide |
```bash
# Run all tests
pytest

# Run with a detailed coverage report
pytest --cov=llm_evaluation_framework --cov-report=html

# Run specific test categories
pytest tests/test_model_inference_engine_comprehensive.py  # Core engine tests
pytest tests/test_cli_comprehensive.py                     # CLI tests
pytest tests/test_persistence_comprehensive.py             # Storage tests

# View the coverage report
open htmlcov/index.html
```

| Test Type | Count | Description |
|---|---|---|
| 🔧 Unit Tests | 150+ | Individual component testing |
| 🔗 Integration Tests | 40+ | Component interaction testing |
| 🎯 Edge Case Tests | 20+ | Error conditions and boundaries |
| ⚡ Performance Tests | 10+ | Speed and memory optimization |
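As an illustration of the unit-test style counted above, a pytest-style test for an accuracy-type scorer might look like the sketch below. The scorer and test names are hypothetical, not taken from the framework's actual suite.

```python
# test_scoring.py — illustrative unit tests; names are hypothetical.

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference."""
    if not predictions:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(predictions)


def test_exact_match_accuracy():
    preds = ["Hello world", "Good morning"]
    refs = ["Hello world", "Good evening"]
    assert exact_match_accuracy(preds, refs) == 0.5


def test_empty_inputs_edge_case():
    # Edge case: no predictions should score 0.0, not raise
    assert exact_match_accuracy([], []) == 0.0
```

Run with `pytest test_scoring.py`; the second test is the kind of boundary condition the edge-case category targets.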
```bash
# 1️⃣ Fork and clone the repository
git clone https://github.com/YOUR_USERNAME/LLMEvaluationFramework.git
cd LLMEvaluationFramework

# 2️⃣ Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3️⃣ Install in development mode
pip install -e ".[dev]"

# 4️⃣ Run the tests to ensure everything works
pytest

# 5️⃣ Install pre-commit hooks (optional but recommended)
pre-commit install
```

- 🍴 Fork the repository
- 🌿 Create a feature branch (`git checkout -b feature/amazing-feature`)
- ✅ Write tests for your changes
- 🧪 Run the test suite (`pytest`)
- 📝 Commit your changes (`git commit -m 'Add amazing feature'`)
- 🚀 Push to the branch (`git push origin feature/amazing-feature`)
- 🔀 Open a Pull Request
- 🐛 Bug fixes and improvements
- 📚 Documentation enhancements
- ✨ New features and capabilities
- 🧪 Additional test cases
- 🎨 UI/UX improvements for CLI
- 🔧 Performance optimizations
| Python Version | Status | Notes |
|---|---|---|
| Python 3.8 | ✅ Supported | Minimum required version |
| Python 3.9 | ✅ Supported | Fully tested |
| Python 3.10 | ✅ Supported | Recommended |
| Python 3.11 | ✅ Supported | Latest features |
| Python 3.12+ | ✅ Supported | Future-ready |
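The 3.8 minimum in the table above can be enforced at import time with a small standard-library guard. This is an illustrative sketch, not code from the framework itself.

```python
import sys

# Fail fast on interpreters older than the Python 3.8 minimum.
if sys.version_info < (3, 8):
    raise RuntimeError("LLMEvaluationFramework requires Python 3.8 or newer")

print(sys.version_info >= (3, 8))  # True on any supported interpreter
```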
```python
# Core dependencies (automatically installed)
REQUIRED = [
    # No external dependencies for core functionality!
    # The framework uses only the Python standard library.
]

# Optional development dependencies
DEVELOPMENT = [
    "pytest>=7.0.0",       # Testing framework
    "pytest-cov>=4.0.0",   # Coverage reporting
    "black>=22.0.0",       # Code formatting
    "flake8>=5.0.0",       # Code linting
    "mypy>=1.0.0",         # Type checking
    "pre-commit>=2.20.0",  # Git hooks
]
```

- ✅ Linux (Ubuntu, CentOS, RHEL)
- ✅ macOS (Intel & Apple Silicon)
- ✅ Windows (10, 11)
- ✅ Docker containers
- ✅ CI/CD environments (GitHub Actions, Jenkins, etc.)
This project is licensed under the MIT License.
You are free to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software.
- 🚀 Inspiration: Born from the need for standardized, reliable LLM evaluation tools
- 🏗️ Architecture: Built with modern Python best practices and enterprise standards
- 🧪 Testing: Comprehensive test coverage ensuring production reliability
- 👥 Community: Driven by developers, researchers, and AI practitioners
- 📚 Documentation: Extensive documentation for developers at all levels
| Technology | Purpose | Why We Chose It |
|---|---|---|
| 🐍 Python 3.8+ | Core Language | Wide adoption, excellent ecosystem |
| 📋 Type Hints | Code Safety | Better IDE support, fewer runtime errors |
| 🧪 Pytest | Testing Framework | Industry standard, excellent plugin ecosystem |
| 📊 SQLite | Database Storage | Lightweight, serverless, reliable |
| 📝 MkDocs | Documentation | Beautiful docs, Markdown-based |
| 🎨 Rich CLI | User Interface | Modern, intuitive command-line experience |
| Type | Where to Go | Response Time |
|---|---|---|
| 🐛 Bug Reports | GitHub Issues | 24-48 hours |
| ❓ Questions | GitHub Discussions | Community-driven |
| 📚 Documentation | Online Docs | Always available |
| 💡 Feature Requests | GitHub Issues | Weekly review |
| Resource | Link | Description |
|---|---|---|
| 📦 PyPI Package | pypi.org/project/llm-evaluation-framework | Install via pip |
| 📚 Documentation | isathish.github.io/LLMEvaluationFramework | Complete documentation |
| 💻 Source Code | github.com/isathish/LLMEvaluationFramework | View source & contribute |
| 🐛 Issue Tracker | github.com/.../issues | Report bugs & request features |
| 💬 Discussions | github.com/.../discussions | Community discussion |
Made with ❤️ by Sathish Kumar N
If you find this project useful, please consider giving it a ⭐️
```bash
pip install LLMEvaluationFramework
```

📚 Read the Documentation • 🚀 View Examples • 💬 Join Discussions
Built for developers, researchers, and AI practitioners who demand reliable, production-ready LLM evaluation tools.