CodeGauge is a prompt-driven, flexible, automated evaluation framework for assessing the performance of Large Language Models (LLMs) on coding tasks.
- 🔧 Registry Pattern: Dynamic component loading for easy extensibility
- 🌐 Multi-Language Support: Evaluate models across C, Java, Python, Solidity, and more
- 🔌 Provider Agnostic: Works with OpenAI, Anthropic, Google, HuggingFace, and local models
- 📊 Rich Evaluation: Configurable metrics and detailed reporting
- ⚡ Parallel Processing: Efficient concurrent evaluation
- 💾 Persistent Storage: File and SQLite backends for result tracking
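A registry of the kind the first bullet describes can be sketched in a few lines. All names here (`COMPONENTS`, `register`, `create`, `SQLiteSink`) are illustrative assumptions, not CodeGauge's actual API:

```python
# Minimal sketch of a component registry (illustrative names,
# not CodeGauge's real classes).
COMPONENTS = {}

def register(kind, name):
    """Decorator that records a class under the key (kind, name)."""
    def wrap(cls):
        COMPONENTS[(kind, name)] = cls
        return cls
    return wrap

@register("sink", "sqlite")
class SQLiteSink:
    def write(self, result):
        ...  # persist result to a SQLite database

def create(kind, name, *args, **kwargs):
    """Instantiate a registered component by its string key."""
    return COMPONENTS[(kind, name)](*args, **kwargs)

sink = create("sink", "sqlite")
```

Because components are looked up by string key, a new provider or sink only needs a decorated class; no dispatch code changes.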
# Clone the repository
git clone https://github.com/kingki663/LLM-Audit-Chain.git
cd LLM-Audit-Chain
# Install dependencies
pip install -e .
# Or using uv (recommended)
uv sync
# Copy configuration templates
cp config.yaml.example config.yaml
cp .env.example .env
# Edit .env with your API keys
# Edit config.yaml with model settings
Uses the Registry Pattern for maximum flexibility:
# Launch interactive guided setup (default when no args are provided)
python -m src.cli
The guided flow presents only two choices: start the step-by-step setup or quit. All configuration is collected interactively, and every prompt is auto-generated from the benchmark metadata declared in benchmarks/benchmarks.yaml plus the unified field specs in src/cli/utils/subtask.py.
# All benchmark parameters are specified using --subtask key=value format
python -m src.cli --benchmark fault_localization --subtask language=python --max-samples 10
# Multiple subtask parameters can be comma-separated
python -m src.cli --benchmark code_edit --subtask language=python,task_type=adaptive --max-samples 10
# Or use multiple --subtask flags
python -m src.cli --benchmark code_edit --subtask language=python --subtask task_type=adaptive --max-samples 10
# You can also use the direct run_benchmark entry
python -m src.cli.run_benchmark --benchmark fault_localization --subtask language=python --max-samples 10
All benchmark-specific parameters use the --subtask key=value format, providing a unified interface across benchmark types.
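The two --subtask styles shown above (comma-separated pairs vs. repeated flags) can be merged into one parameter dict. `parse_subtasks` below is an assumed helper for illustration, not the framework's actual parsing function:

```python
def parse_subtasks(values):
    """Merge repeated --subtask flags, each possibly containing
    comma-separated key=value pairs, into a single dict
    (illustrative helper, not CodeGauge's real parser)."""
    params = {}
    for value in values:
        for pair in value.split(","):
            key, _, val = pair.partition("=")
            params[key.strip()] = val.strip()
    return params

# Both invocation styles from the examples yield the same parameters:
assert parse_subtasks(["language=python,task_type=adaptive"]) == \
       parse_subtasks(["language=python", "task_type=adaptive"])
```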
Behind the scenes, supported subtask dimensions (language, category, task type, repository, tasks, etc.) are declared once in SUBTASK_FIELD_SPECS. The CLI, config validator, and benchmark runner all consume that spec to (a) build prompts, (b) list supported values, and (c) validate the provided inputs; adding a new supported_* axis therefore only requires editing the spec and the YAML.
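To make the single-source-of-truth idea concrete, here is a hypothetical shape for such a spec and a validator driven by it. The real SUBTASK_FIELD_SPECS in src/cli/utils/subtask.py may differ in structure and field names:

```python
# Illustrative field spec; the real SUBTASK_FIELD_SPECS may differ.
SUBTASK_FIELD_SPECS = {
    "language": {
        "prompt": "Which language should the benchmark run on?",
        "supported": ["python", "java", "c", "solidity"],
    },
    "task_type": {
        "prompt": "Which task type?",
        "supported": ["adaptive", "corrective"],
    },
}

def validate(params):
    """Reject unknown keys or unsupported values using the same spec
    that drives the interactive prompts and help text."""
    for key, value in params.items():
        spec = SUBTASK_FIELD_SPECS.get(key)
        if spec is None:
            raise ValueError(f"unknown subtask field: {key}")
        if value not in spec["supported"]:
            raise ValueError(f"unsupported {key}: {value}")

validate({"language": "python", "task_type": "adaptive"})  # passes
```

Because the prompt text, the supported-value list, and the validation rule all live in one entry, adding a new axis is a single-edit change.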
PromptLoader -> PromptAssembler -> LLMClient -> Evaluator -> Sink
Core Components:
- PromptLoader: Loads test cases from datasets
- PromptAssembler: Formats prompts for LLM
- LLMClient: Interacts with model providers
- Evaluator: Scores model outputs
- Sink: Persists results
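The five stages compose into a linear pipeline. The sketch below shows that flow with assumed minimal interfaces (plain callables, a list-backed sink, and a stub `EchoClient`); it is not CodeGauge's real class hierarchy:

```python
# Minimal sketch of the five-stage pipeline; interfaces are assumed,
# not CodeGauge's real classes.
class EchoClient:
    """Stub LLMClient that echoes the prompt back."""
    def complete(self, prompt):
        return f"response to: {prompt}"

def run_pipeline(cases, assemble, client, evaluate, sink):
    """Load -> assemble -> query -> score -> persist, one case at a time."""
    scores = []
    for case in cases:                    # PromptLoader output
        prompt = assemble(case)           # PromptAssembler
        output = client.complete(prompt)  # LLMClient
        score = evaluate(case, output)    # Evaluator
        sink.append((case["id"], score))  # Sink
        scores.append(score)
    return scores

sink = []
cases = [{"id": 1, "code": "def f(): pass"}]
run_pipeline(cases, lambda c: c["code"], EchoClient(),
             lambda c, o: float("response" in o), sink)
```

Swapping any stage (a real provider for `EchoClient`, a database sink for the list) leaves the orchestration untouched, which is what the registry pattern exploits.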
| Document | Description |
|---|---|
| 🛠️ Development Guide | Create custom benchmarks |
| 🔌 API Reference | Complete API documentation |
| 🏛️ Framework Design | Architecture details |
# Run all tests
python -m unittest discover tests
# Run specific test
python -m unittest tests.test_litellm_client
# With verbose output
python -m unittest discover -v tests
See README-CN.md for Chinese documentation.
Contributions welcome! Please read CONTRIBUTING.md for guidelines.
- Built with LiteLLM