CodeGauge is a prompt-driven, flexible, automated evaluation framework for assessing the performance of Large Language Models (LLMs) on coding tasks.
- 🔧 Registry Pattern: Dynamic component loading for easy extensibility
- 🌐 Multi-Language Support: Evaluate models across C, Java, Python, Solidity, and more
- 🔌 Provider Agnostic: Works with OpenAI, Anthropic, Google, HuggingFace, and local models
- 📊 Rich Evaluation: Configurable metrics and detailed reporting
- ⚡ Parallel Processing: Efficient concurrent evaluation
- 💾 Persistent Storage: File and SQLite backends for result tracking
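A registry of the kind the first bullet describes can be sketched in a few lines. All names here (`COMPONENTS`, `register`, `create`, `SQLiteSink`) are illustrative assumptions, not CodeGauge's actual API:

```python
# Minimal sketch of a component registry (illustrative names,
# not CodeGauge's real classes).
COMPONENTS = {}

def register(kind, name):
    """Decorator that records a class under the key (kind, name)."""
    def wrap(cls):
        COMPONENTS[(kind, name)] = cls
        return cls
    return wrap

@register("sink", "sqlite")
class SQLiteSink:
    def write(self, result):
        ...  # persist result to a SQLite database

def create(kind, name, *args, **kwargs):
    """Instantiate a registered component by its string key."""
    return COMPONENTS[(kind, name)](*args, **kwargs)

sink = create("sink", "sqlite")
```

Because components are looked up by string key, a new provider or sink only needs a decorated class; no dispatch code changes.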
# Clone the repository
git clone https://github.com/kingki663/LLM-Audit-Chain.git
cd LLM-Audit-Chain
# Install dependencies
pip install -e .
# Or using uv (recommended)
uv sync
# Copy configuration templates
cp config.yaml.example config.yaml
cp .env.example .env
# Edit .env with your API keys
# Edit config.yaml with model settings
Uses the Registry Pattern for maximum flexibility:
# Launch interactive guided setup (default when no args are provided)
python -m src.cli
The guided flow presents only two choices: start the step-by-step setup or quit. All configuration is collected interactively, and every prompt is auto-generated from the benchmark metadata declared in benchmarks/benchmarks.yaml plus the unified field specs in src/cli/utils/subtask.py.
# All benchmark parameters are specified using --subtask key=value format
python -m src.cli --benchmark fault_localization --subtask language=python --max-samples 10
# Multiple subtask parameters can be comma-separated
python -m src.cli --benchmark code_edit --subtask language=python,task_type=adaptive --max-samples 10
# Or use multiple --subtask flags
python -m src.cli --benchmark code_edit --subtask language=python --subtask task_type=adaptive --max-samples 10
# You can also use the direct run_benchmark entry
python -m src.cli.run_benchmark --benchmark fault_localization --subtask language=python --max-samples 10
All benchmark-specific parameters use the --subtask key=value format, providing a unified interface across benchmark types.
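The two --subtask styles shown above (comma-separated pairs vs. repeated flags) can be merged into one parameter dict. `parse_subtasks` below is an assumed helper for illustration, not the framework's actual parsing function:

```python
def parse_subtasks(values):
    """Merge repeated --subtask flags, each possibly containing
    comma-separated key=value pairs, into a single dict
    (illustrative helper, not CodeGauge's real parser)."""
    params = {}
    for value in values:
        for pair in value.split(","):
            key, _, val = pair.partition("=")
            params[key.strip()] = val.strip()
    return params

# Both invocation styles from the examples yield the same parameters:
assert parse_subtasks(["language=python,task_type=adaptive"]) == \
       parse_subtasks(["language=python", "task_type=adaptive"])
```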
Behind the scenes, supported subtask dimensions (language, category, task type, repository, tasks, etc.) are declared once in SUBTASK_FIELD_SPECS. The CLI, config validator, and benchmark runner all consume that spec to (a) build prompts, (b) list supported values, and (c) validate the provided inputs; adding a new supported_* axis therefore only requires editing the spec and the YAML.
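To make the single-source-of-truth idea concrete, here is a hypothetical shape for such a spec and a validator driven by it. The real SUBTASK_FIELD_SPECS in src/cli/utils/subtask.py may differ in structure and field names:

```python
# Illustrative field spec; the real SUBTASK_FIELD_SPECS may differ.
SUBTASK_FIELD_SPECS = {
    "language": {
        "prompt": "Which language should the benchmark run on?",
        "supported": ["python", "java", "c", "solidity"],
    },
    "task_type": {
        "prompt": "Which task type?",
        "supported": ["adaptive", "corrective"],
    },
}

def validate(params):
    """Reject unknown keys or unsupported values using the same spec
    that drives the interactive prompts and help text."""
    for key, value in params.items():
        spec = SUBTASK_FIELD_SPECS.get(key)
        if spec is None:
            raise ValueError(f"unknown subtask field: {key}")
        if value not in spec["supported"]:
            raise ValueError(f"unsupported {key}: {value}")

validate({"language": "python", "task_type": "adaptive"})  # passes
```

Because the prompt text, the supported-value list, and the validation rule all live in one entry, adding a new axis is a single-edit change.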
PromptLoader -> PromptAssembler -> LLMClient -> Evaluator -> Sink
Core Components:
- PromptLoader: Loads test cases from datasets
- PromptAssembler: Formats prompts for LLM
- LLMClient: Interacts with model providers
- Evaluator: Scores model outputs
- Sink: Persists results
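The five stages compose into a linear pipeline. The sketch below shows that flow with assumed minimal interfaces (plain callables, a list-backed sink, and a stub `EchoClient`); it is not CodeGauge's real class hierarchy:

```python
# Minimal sketch of the five-stage pipeline; interfaces are assumed,
# not CodeGauge's real classes.
class EchoClient:
    """Stub LLMClient that echoes the prompt back."""
    def complete(self, prompt):
        return f"response to: {prompt}"

def run_pipeline(cases, assemble, client, evaluate, sink):
    """Load -> assemble -> query -> score -> persist, one case at a time."""
    scores = []
    for case in cases:                    # PromptLoader output
        prompt = assemble(case)           # PromptAssembler
        output = client.complete(prompt)  # LLMClient
        score = evaluate(case, output)    # Evaluator
        sink.append((case["id"], score))  # Sink
        scores.append(score)
    return scores

sink = []
cases = [{"id": 1, "code": "def f(): pass"}]
run_pipeline(cases, lambda c: c["code"], EchoClient(),
             lambda c, o: float("response" in o), sink)
```

Swapping any stage (a real provider for `EchoClient`, a database sink for the list) leaves the orchestration untouched, which is what the registry pattern exploits.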
| Document | Description |
|---|---|
| 🛠️ Development Guide | Create custom benchmarks |
| 🔌 API Reference | Complete API documentation |
| 🏛️ Framework Design | Architecture details |
# Run all tests
python -m unittest discover tests
# Run specific test
python -m unittest tests.test_litellm_client
# With verbose output
python -m unittest discover -v tests
See README-CN.md for Chinese documentation.
Contributions welcome! Please read CONTRIBUTING.md for guidelines.
- Built with LiteLLM