
CodeGauge - LLM Evaluation Framework

Requires Python 3.12+.

CodeGauge is a prompt-driven, flexible, automated evaluation framework for assessing the performance of Large Language Models (LLMs) on coding tasks.

✨ Key Features

  • 🔧 Registry Pattern: Dynamic component loading for easy extensibility
  • 🌐 Multi-Language Support: Evaluate models across C, Java, Python, Solidity, and more
  • 🔌 Provider Agnostic: Works with OpenAI, Anthropic, Google, HuggingFace, and local models
  • 📊 Rich Evaluation: Configurable metrics and detailed reporting
  • ⚡ Parallel Processing: Efficient concurrent evaluation
  • 💾 Persistent Storage: File and SQLite backends for result tracking
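
To illustrate the Registry Pattern that drives dynamic component loading, here is a minimal decorator-based registry sketch. The names `EVALUATORS`, `register_evaluator`, and `ExactMatchEvaluator` are invented for this example and are not CodeGauge's actual API:

```python
# Minimal sketch of a decorator-based registry; names are illustrative only.
EVALUATORS = {}

def register_evaluator(name):
    """Register an evaluator class under a string key for lookup by name."""
    def decorator(cls):
        EVALUATORS[name] = cls
        return cls
    return decorator

@register_evaluator("exact_match")
class ExactMatchEvaluator:
    def score(self, expected, actual):
        return 1.0 if expected == actual else 0.0

# A component can then be instantiated from a config string at runtime:
evaluator = EVALUATORS["exact_match"]()
```

Because components register themselves by key, adding a new evaluator, loader, or sink only requires defining the class; no central factory needs editing.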

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/kingki663/LLM-Audit-Chain.git
cd LLM-Audit-Chain

# Install dependencies
pip install -e .

# Or use uv (recommended)
uv sync

Configuration

# Copy configuration templates
cp config.yaml.example config.yaml
cp .env.example .env

# Edit .env with your API keys
# Edit config.yaml with model settings

📚 Benchmark Versions

Version 0.2 (Recommended)

Uses the Registry Pattern for maximum flexibility:

Guided Setup (Recommended)

# Launch interactive guided setup (default when no args are provided)
python -m src.cli

The guided flow presents only two choices: start the step-by-step setup or quit. All configuration is collected interactively, and every prompt is auto-generated from the benchmark metadata declared in benchmarks/benchmarks.yaml together with the unified field specs in src/cli/utils/subtask.py.

Command Line Mode

# All benchmark parameters are specified using --subtask key=value format
python -m src.cli --benchmark fault_localization --subtask language=python --max-samples 10

# Multiple subtask parameters can be comma-separated
python -m src.cli --benchmark code_edit --subtask language=python,task_type=adaptive --max-samples 10

# Or use multiple --subtask flags
python -m src.cli --benchmark code_edit --subtask language=python --subtask task_type=adaptive --max-samples 10

# You can also use the direct run_benchmark entry
python -m src.cli.run_benchmark --benchmark fault_localization --subtask language=python --max-samples 10

All benchmark-specific parameters use the --subtask key=value format, which gives every benchmark type a unified interface.
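
The merging behavior shown above (comma-separated pairs and repeated flags producing the same result) can be sketched as a small parser. This is an illustrative sketch, not CodeGauge's implementation:

```python
def parse_subtask_args(values):
    """Merge repeated --subtask flags, each holding comma-separated key=value pairs.

    Illustrative sketch of the CLI behavior described above; not the real parser.
    """
    params = {}
    for value in values:
        for pair in value.split(","):
            key, sep, val = pair.partition("=")
            if not sep:
                raise ValueError(f"expected key=value, got {pair!r}")
            params[key.strip()] = val.strip()
    return params

# Both invocation styles collapse to the same dict:
assert parse_subtask_args(["language=python,task_type=adaptive"]) == \
       parse_subtask_args(["language=python", "task_type=adaptive"])
```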

Behind the scenes, supported subtask dimensions (language, category, task type, repository, tasks, etc.) are declared once in SUBTASK_FIELD_SPECS. The CLI, config validator, and benchmark runner all consume that spec to (a) build prompts, (b) list supported values, and (c) validate the provided inputs – so adding a new supported_* axis only requires editing the spec and the YAML.
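
The spec-driven validation described above might look roughly like the following. The entry shape shown for SUBTASK_FIELD_SPECS is a guess for illustration; the real spec lives in src/cli/utils/subtask.py and may differ:

```python
# Hypothetical shape of one SUBTASK_FIELD_SPECS entry; the real definition
# in src/cli/utils/subtask.py may differ.
SUBTASK_FIELD_SPECS = {
    "language": {
        "prompt": "Select a target language",
        "metadata_key": "supported_languages",  # key in benchmarks.yaml metadata
        "required": True,
    },
}

def validate_subtask(params, benchmark_meta):
    """Check provided --subtask values against what the benchmark declares."""
    for field, spec in SUBTASK_FIELD_SPECS.items():
        allowed = benchmark_meta.get(spec["metadata_key"])
        value = params.get(field)
        if spec["required"] and value is None:
            raise ValueError(f"missing required subtask field: {field}")
        if allowed is not None and value is not None and value not in allowed:
            raise ValueError(f"{field}={value} not in {allowed}")
```

Because the CLI prompts, the supported-value listings, and this validation all read from the same spec, a new axis stays consistent across all three consumers.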

🏗️ Architecture

PromptLoader -> PromptAssembler -> LLMClient -> Evaluator -> Sink

Core Components:

  • PromptLoader: Loads test cases from datasets
  • PromptAssembler: Formats prompts for LLM
  • LLMClient: Interacts with model providers
  • Evaluator: Scores model outputs
  • Sink: Persists results
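
The five stages above can be wired together as a simple loop. All names in this sketch are assumptions made for illustration, not CodeGauge's actual interfaces:

```python
# Illustrative end-to-end wiring of the five pipeline stages.
def run_pipeline(cases, assemble, complete, evaluate, sink):
    """cases plays the PromptLoader's output; the callables play the other roles."""
    results = []
    for case in cases:
        prompt = assemble(case)          # PromptAssembler: format the prompt
        output = complete(prompt)        # LLMClient: query the model
        score = evaluate(case, output)   # Evaluator: score the response
        results.append({"id": case["id"], "score": score})
    sink(results)                        # Sink: persist the results
    return results

# Usage with trivial stand-ins for each stage:
stored = []
results = run_pipeline(
    cases=[{"id": 1, "source": "def f(): ..."}],
    assemble=lambda c: f"Find the bug:\n{c['source']}",
    complete=lambda p: "no bug",
    evaluate=lambda c, out: 0.0,
    sink=stored.extend,
)
```

Keeping each stage behind a narrow interface is what lets the registry swap loaders, clients, and evaluators independently.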

📖 Documentation

  • 🛠️ Development Guide: Create custom benchmarks
  • 🔌 API Reference: Complete API documentation
  • 🏛️ Framework Design: Architecture details

🧪 Testing

# Run all tests
python -m unittest discover tests

# Run specific test
python -m unittest tests.test_litellm_client

# With verbose output
python -m unittest discover -v tests

See README-CN.md for Chinese documentation.

🤝 Contributing

Contributions welcome! Please read CONTRIBUTING.md for guidelines.

🙏 Acknowledgments
