🌟 Enterprise-Grade Python Framework for Large Language Model Evaluation & Testing 🌟
Built with production-ready standards • Type-safe • Comprehensive testing • Full CLI support
📚 Documentation • 🚀 Quick Start • 💡 Examples • 🐛 Report Issues
```bash
# Install from PyPI (recommended)
pip install LLMEvaluationFramework

# Or install from source for the latest features
git clone https://github.com/isathish/LLMEvaluationFramework.git
cd LLMEvaluationFramework
pip install -e .
```

Requirements: Python 3.8+ • No external dependencies for core functionality
```python
from llm_evaluation_framework import (
    ModelRegistry,
    ModelInferenceEngine,
    TestDatasetGenerator,
)

# 1️⃣ Set up the registry and register your model
registry = ModelRegistry()
registry.register_model("gpt-3.5-turbo", {
    "provider": "openai",
    "api_cost_input": 0.0015,
    "api_cost_output": 0.002,
    "capabilities": ["reasoning", "creativity", "coding"],
})

# 2️⃣ Generate test cases
generator = TestDatasetGenerator()
test_cases = generator.generate_test_cases(
    use_case={"domain": "general", "required_capabilities": ["reasoning"]},
    count=10,
)

# 3️⃣ Run the evaluation
engine = ModelInferenceEngine(registry)
results = engine.evaluate_model("gpt-3.5-turbo", test_cases)

# 4️⃣ Analyze the results
print(f"✅ Accuracy: {results['aggregate_metrics']['accuracy']:.1%}")
print(f"💰 Total Cost: ${results['aggregate_metrics']['total_cost']:.4f}")
print(f"⏱️ Total Time: {results['aggregate_metrics']['total_time']:.2f}s")
```

```bash
# Evaluate a model with specific capabilities
llm-eval evaluate --model gpt-3.5-turbo --test-cases 10 --capability reasoning

# Generate a custom test dataset
llm-eval generate --capability coding --count 20 --output my_dataset.json

# Score predictions against references
llm-eval score --predictions "Hello world" "Good morning" \
               --references "Hello world" "Good evening" \
               --metric accuracy

# List available capabilities and models
llm-eval list
```

```mermaid
graph TB
    CLI[🖥️ CLI Interface<br/>llm-eval] --> Engine[⚙️ Inference Engine<br/>ModelInferenceEngine]
    Engine --> Registry[🗄️ Model Registry<br/>ModelRegistry]
    Engine --> Generator[🧪 Dataset Generator<br/>TestDatasetGenerator]
    Engine --> Scoring[📊 Scoring Strategies<br/>AccuracyScoringStrategy]
    Registry --> Models[(🤖 Models<br/>gpt-3.5-turbo, gpt-4, etc.)]
    Engine --> Storage[💾 Persistence Layer]
    Storage --> JSON[📄 JSON Store]
    Storage --> SQLite[🗃️ SQLite Store]
    Engine --> Utils[🛠️ Utilities]
    Utils --> Logger[📝 Advanced Logging]
    Utils --> ErrorHandler[🛡️ Error Handling]
    Utils --> AutoSuggest[💡 Auto Suggestions]
```
| Component | Description | Key Features |
|---|---|---|
| 🔥 Inference Engine | Execute and evaluate LLM inferences | Async processing, cost tracking, batch operations |
| 🗄️ Model Registry | Centralized model management | Multi-provider support, configuration management |
| 🧪 Dataset Generator | Create synthetic test cases | Capability-based generation, domain-specific tests |
| 📊 Scoring Strategies | Multiple evaluation metrics | Accuracy, F1-score, custom metrics |
| 💾 Persistence Layer | Dual storage backends | JSON files, SQLite database with querying |
| 🛡️ Error Handling | Robust error management | Custom exceptions, retry mechanisms |
| 📝 Logging System | Advanced logging capabilities | File rotation, structured logging |
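The Persistence Layer row above describes dual storage backends: JSON files and a SQLite database with querying. A minimal standard-library sketch of that idea follows; the table schema, column names, and record shape here are illustrative assumptions, not the framework's actual storage format.

```python
import json
import sqlite3

# Hypothetical evaluation record; the field names are illustrative.
record = {"model": "gpt-3.5-turbo", "accuracy": 0.9, "total_cost": 0.0123}

# SQLite backend: structured rows support SQL filtering and aggregation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (model TEXT, accuracy REAL, total_cost REAL)")
conn.execute("INSERT INTO results VALUES (:model, :accuracy, :total_cost)", record)
best = conn.execute(
    "SELECT model, accuracy FROM results ORDER BY accuracy DESC LIMIT 1"
).fetchone()
print(best)  # ('gpt-3.5-turbo', 0.9)

# JSON backend: the whole record stored as a self-describing document.
doc = json.dumps(record)
print(json.loads(doc)["accuracy"])  # 0.9
```

The trade-off sketched here is the usual one: SQLite rows are queryable across many runs, while JSON documents are human-readable and schema-free.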
```python
# Available evaluation capabilities
CAPABILITIES = [
    "reasoning",    # Logical reasoning and problem-solving
    "creativity",   # Creative writing and ideation
    "factual",      # Factual accuracy and knowledge
    "instruction",  # Instruction following
    "coding",       # Code generation and debugging
]
```

🔍 Click to see Advanced Usage Examples
```python
from llm_evaluation_framework import ModelRegistry, ModelInferenceEngine
from llm_evaluation_framework.persistence import JSONStore

# Set up multiple models
registry = ModelRegistry()
models = {
    "gpt-3.5-turbo": {"provider": "openai", "cost_input": 0.0015},
    "gpt-4": {"provider": "openai", "cost_input": 0.03},
    "claude-3": {"provider": "anthropic", "cost_input": 0.015},
}
for name, config in models.items():
    registry.register_model(name, config)

# Run a comparative evaluation
engine = ModelInferenceEngine(registry)
results = {}
for model_name in models:
    print(f"🚀 Evaluating {model_name}...")
    result = engine.evaluate_model(model_name, test_cases)
    results[model_name] = result

    # Save the results
    store = JSONStore(f"results_{model_name}.json")
    store.save_evaluation_result(result)

# Compare the results
for model, result in results.items():
    accuracy = result['aggregate_metrics']['accuracy']
    cost = result['aggregate_metrics']['total_cost']
    print(f"📊 {model}: {accuracy:.1%} accuracy, ${cost:.4f} cost")
```

```python
from llm_evaluation_framework.evaluation.scoring_strategies import ScoringContext


class CustomCosineSimilarityStrategy:
    """Custom scoring using TF-IDF cosine similarity (requires scikit-learn)."""

    def calculate_score(self, predictions, references):
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        vectorizer = TfidfVectorizer()
        vectors = vectorizer.fit_transform(predictions + references)
        pred_vectors = vectors[:len(predictions)]
        ref_vectors = vectors[len(predictions):]
        similarities = cosine_similarity(pred_vectors, ref_vectors)
        # Score each prediction against its own reference, then average
        return similarities.diagonal().mean()


# Use the custom strategy
custom_strategy = CustomCosineSimilarityStrategy()
context = ScoringContext(custom_strategy)
score = context.evaluate(predictions, references)
print(f"🎯 Custom similarity score: {score:.3f}")
```

```python
import asyncio

from llm_evaluation_framework.engines.async_inference_engine import AsyncInferenceEngine

CAPABILITIES = ["reasoning", "creativity", "coding"]


async def run_async_evaluation():
    """Run multiple evaluations concurrently."""
    async_engine = AsyncInferenceEngine(registry)

    # Define one evaluation task per capability
    tasks = [
        async_engine.evaluate_async(
            model_name="gpt-3.5-turbo",
            test_cases=test_cases,
            capability=capability,
        )
        for capability in CAPABILITIES
    ]

    # Run all evaluations concurrently
    results = await asyncio.gather(*tasks)

    # Process the results, pairing each with its capability
    for capability, result in zip(CAPABILITIES, results):
        accuracy = result['aggregate_metrics']['accuracy']
        print(f"✅ {capability}: {accuracy:.1%}")


# Run the async evaluation
asyncio.run(run_async_evaluation())
```

| Section | Description | Link |
|---|---|---|
| 🚀 Getting Started | Installation, quick start, and basic concepts | View Guide |
| 🧠 Core Concepts | Understanding the framework architecture | Learn More |
| 🖥️ CLI Usage | Complete command-line interface documentation | CLI Guide |
| 📊 API Reference | Detailed API documentation with examples | API Docs |
| 💡 Examples | Practical examples and tutorials | View Examples |
| 🛠️ Developer Guide | Contributing guidelines and development setup | Dev Guide |
```bash
# Run all tests
pytest

# Run with a detailed coverage report
pytest --cov=llm_evaluation_framework --cov-report=html

# Run specific test categories
pytest tests/test_model_inference_engine_comprehensive.py  # Core engine tests
pytest tests/test_cli_comprehensive.py                     # CLI tests
pytest tests/test_persistence_comprehensive.py             # Storage tests

# View the coverage report
open htmlcov/index.html
```

| Test Type | Count | Description |
|---|---|---|
| 🔧 Unit Tests | 150+ | Individual component testing |
| 🔗 Integration Tests | 40+ | Component interaction testing |
| 🎯 Edge Case Tests | 20+ | Error conditions and boundaries |
| ⚡ Performance Tests | 10+ | Speed and memory optimization |
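As an illustration of the unit-test style counted above, a pytest-style test for an accuracy-type scorer might look like the sketch below. The scorer and test names are hypothetical, not taken from the framework's actual suite.

```python
# test_scoring.py — illustrative unit tests; names are hypothetical.

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference."""
    if not predictions:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(predictions)


def test_exact_match_accuracy():
    preds = ["Hello world", "Good morning"]
    refs = ["Hello world", "Good evening"]
    assert exact_match_accuracy(preds, refs) == 0.5


def test_empty_inputs_edge_case():
    # Edge case: no predictions should score 0.0, not raise
    assert exact_match_accuracy([], []) == 0.0
```

Run with `pytest test_scoring.py`; the second test is the kind of boundary condition the edge-case category targets.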
```bash
# 1️⃣ Fork and clone the repository
git clone https://github.com/YOUR_USERNAME/LLMEvaluationFramework.git
cd LLMEvaluationFramework

# 2️⃣ Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3️⃣ Install in development mode
pip install -e ".[dev]"

# 4️⃣ Run the tests to ensure everything works
pytest

# 5️⃣ Install pre-commit hooks (optional but recommended)
pre-commit install
```

- 🍴 Fork the repository
- 🌿 Create a feature branch (`git checkout -b feature/amazing-feature`)
- ✅ Write tests for your changes
- 🧪 Run the test suite (`pytest`)
- 📝 Commit your changes (`git commit -m 'Add amazing feature'`)
- 🚀 Push to the branch (`git push origin feature/amazing-feature`)
- 🔀 Open a Pull Request
- 🐛 Bug fixes and improvements
- 📚 Documentation enhancements
- ✨ New features and capabilities
- 🧪 Additional test cases
- 🎨 UI/UX improvements for CLI
- 🔧 Performance optimizations
| Python Version | Status | Notes |
|---|---|---|
| Python 3.8 | ✅ Supported | Minimum required version |
| Python 3.9 | ✅ Supported | Fully tested |
| Python 3.10 | ✅ Supported | Recommended |
| Python 3.11 | ✅ Supported | Latest features |
| Python 3.12+ | ✅ Supported | Future-ready |
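The 3.8 minimum in the table above can be enforced at import time with a small standard-library guard. This is an illustrative sketch, not code from the framework itself.

```python
import sys

# Fail fast on interpreters older than the Python 3.8 minimum.
if sys.version_info < (3, 8):
    raise RuntimeError("LLMEvaluationFramework requires Python 3.8 or newer")

print(sys.version_info >= (3, 8))  # True on any supported interpreter
```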
```python
# Core dependencies (automatically installed)
REQUIRED = [
    # No external dependencies for core functionality!
    # The framework uses only the Python standard library.
]

# Optional development dependencies
DEVELOPMENT = [
    "pytest>=7.0.0",       # Testing framework
    "pytest-cov>=4.0.0",   # Coverage reporting
    "black>=22.0.0",       # Code formatting
    "flake8>=5.0.0",       # Code linting
    "mypy>=1.0.0",         # Type checking
    "pre-commit>=2.20.0",  # Git hooks
]
```

- ✅ Linux (Ubuntu, CentOS, RHEL)
- ✅ macOS (Intel & Apple Silicon)
- ✅ Windows (10, 11)
- ✅ Docker containers
- ✅ CI/CD environments (GitHub Actions, Jenkins, etc.)
This project is licensed under the MIT License.
You are free to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software.
- 🚀 Inspiration: Born from the need for standardized, reliable LLM evaluation tools
- 🏗️ Architecture: Built with modern Python best practices and enterprise standards
- 🧪 Testing: Comprehensive test coverage ensuring production reliability
- 👥 Community: Driven by developers, researchers, and AI practitioners
- 📚 Documentation: Extensive documentation for developers at all levels
| Technology | Purpose | Why We Chose It |
|---|---|---|
| 🐍 Python 3.8+ | Core Language | Wide adoption, excellent ecosystem |
| 📋 Type Hints | Code Safety | Better IDE support, fewer runtime errors |
| 🧪 Pytest | Testing Framework | Industry standard, excellent plugin ecosystem |
| 📊 SQLite | Database Storage | Lightweight, serverless, reliable |
| 📝 MkDocs | Documentation | Beautiful docs, Markdown-based |
| 🎨 Rich CLI | User Interface | Modern, intuitive command-line experience |
| Type | Where to Go | Response Time |
|---|---|---|
| 🐛 Bug Reports | GitHub Issues | 24-48 hours |
| ❓ Questions | GitHub Discussions | Community-driven |
| 📚 Documentation | Online Docs | Always available |
| 💡 Feature Requests | GitHub Issues | Weekly review |
| Resource | Link | Description |
|---|---|---|
| 📦 PyPI Package | pypi.org/project/llm-evaluation-framework | Install via pip |
| 📚 Documentation | isathish.github.io/LLMEvaluationFramework | Complete documentation |
| 💻 Source Code | github.com/isathish/LLMEvaluationFramework | View source & contribute |
| 🐛 Issue Tracker | github.com/.../issues | Report bugs & request features |
| 💬 Discussions | github.com/.../discussions | Community discussion |
Made with ❤️ by Sathish Kumar N
If you find this project useful, please consider giving it a ⭐️
```bash
pip install LLMEvaluationFramework
```

📚 Read the Documentation • 🚀 View Examples • 💬 Join Discussions
Built for developers, researchers, and AI practitioners who demand reliable, production-ready LLM evaluation tools.