Automated evaluation pipeline for AI-generated code. Supports two evaluation modes — full-project eval and lightweight snippet — covering Python and Java (Maven).
| | `code-eval eval` | `code-eval snippet` |
|---|---|---|
| Purpose | Full-project evaluation with tests, lint, security, and complexity | Quick static analysis of a single code snippet |
| Input | Directory / file paths / git diff | Inline code (`-c`) or single file (`--file`) |
| Scanners | All 9 scanners (incl. test runners & dependency auditors) | Static analysis only (no pytest / maven-test / pip-audit) |
| Scoring | 4 dimensions: correctness, quality, security, maintainability | 3 dimensions: quality, security, maintainability (no correctness) |
| Output | `evaluation.json` — full report with metrics, issues, scores | Compact `SnippetResult` JSON with score (0-100) and issues |
| Use Case | CI/CD pipelines, batch project evaluation | Code review, quick checks, editor integration |
- Two evaluation modes: `eval` (project) and `snippet` (single file / inline code)
- Three input modes (eval): directory, file path, git-diff
- Two language adapters: Python + Java (Maven)
- Nine scanners:
  - Python: pytest, ruff, bandit, radon, pip-audit
  - Java: maven-test, java-lint, java-security, java-complexity
- Multi-dimensional scoring: correctness (0.40), quality (0.25), security (0.20), maintainability (0.15)
- Two-layer diff awareness: file-level + line-level tracking (`in_diff` tagging)
- Configurable Docker sandbox: optional container isolation with resource limits
- Batch evaluation: concurrent target processing with progress reporting
- Structured output: `evaluation.json` with metrics, issues, scores, and summary
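The two-layer diff awareness pairs file-level tracking with line-level `in_diff` tagging. A minimal sketch of the line-level idea (the names `changed_lines` and `tag_issue` are illustrative, not the pipeline's API):

```python
# Hypothetical sketch of line-level diff tagging: an issue counts as "in_diff"
# when it sits on a line the evaluated diff actually changed.
changed_lines = {
    "src/auth.py": {10, 11, 12, 40},  # line numbers touched by the diff
    "src/api.py": {5},
}

def tag_issue(issue: dict) -> dict:
    """Attach an in_diff flag based on the changed-line map."""
    in_diff = issue["line"] in changed_lines.get(issue["file"], set())
    return {**issue, "in_diff": in_diff}

print(tag_issue({"file": "src/auth.py", "line": 11, "message": "unused import"}))
```

In eval mode this flag drives the quality deductions (-0.02 per in-diff issue versus -0.002 per out-of-diff issue).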
```bash
pip install code-eval
```

Or install from source:

```bash
pip install -e .
```

Full-project evaluation — runs all scanners (tests, lint, security, complexity) and produces a comprehensive structured report.
Evaluate a project directory (language auto-detected by markers such as `pyproject.toml` or `pom.xml`):

```bash
code-eval eval --targets ./my_project
```

Evaluate specific files:

```bash
code-eval eval --targets ./src/auth.py ./src/api.py
```

For Java, file mode also works (project root resolved via `pom.xml`):

```bash
code-eval eval --targets ./my-java-project/src/main/java/com/example/App.java
```

Evaluate only files changed since main:

```bash
code-eval eval --git-diff --base main
```

Evaluate multiple targets at once:

```bash
code-eval eval --targets ./project_a ./project_b
```

Write the full report to a file:

```bash
code-eval eval --targets ./my_project --output evaluation.json
```

Also write a markdown summary:

```bash
code-eval eval --targets ./my_project --output evaluation.json --summary summary.md
```

Use a custom configuration file:

```bash
code-eval eval --targets ./my_project --config .env.production
```

The `evaluation.json` output contains:
```json
{
  "meta": {
    "timestamp": "2025-01-01T00:00:00Z",
    "pipeline_version": "0.1.0",
    "total_targets": 1,
    "total_duration_seconds": 5.2
  },
  "results": [
    {
      "target": "/path/to/project",
      "language": "python",
      "duration_seconds": 5.2,
      "scores": {
        "correctness": { "value": 0.85, "weight": 0.40, "detail": "17/20 tests passed" },
        "quality": { "value": 0.96, "weight": 0.25, "detail": "2 lint issues in diff" },
        "security": { "value": 1.0, "weight": 0.20, "detail": "No security issues" },
        "maintainability": { "value": 0.9, "weight": 0.15, "detail": "Average complexity: 6.2" },
        "overall": 0.91
      },
      "metrics": {
        "tests_total": 20,
        "tests_passed": 17,
        "tests_failed": 3,
        "lint_issues": 2,
        "security_issues": 0,
        "avg_complexity": 6.2,
        "files_evaluated": 8
      },
      "issues": [ "..." ]
    }
  ],
  "summary": {
    "avg_overall_score": 0.91,
    "total_issues": 5,
    "critical_issues": 0,
    "targets_passed": 1,
    "targets_failed": 0
  }
}
```

| Dimension | Weight | Source | Scoring Logic |
|---|---|---|---|
| Correctness | 0.40 | pytest / maven-test | tests_passed / tests_total; no tests → 0.5; compilation failed → 0.0 |
| Quality | 0.25 | ruff / java-lint | -0.02 per in-diff lint issue; -0.002 per out-of-diff |
| Security | 0.20 | bandit / java-security | Deductions: critical -0.30, high -0.15, medium -0.05, low -0.02 |
| Maintainability | 0.15 | radon / java-complexity | CC≤5 → 1.0; CC 5-15 → 1.0-0.5; CC 15-25 → 0.5-0.0 |
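Cross-checking these rules against the sample report above: correctness is the pass ratio 17/20 = 0.85, quality is 1 − 2 × 0.02 = 0.96 for the two in-diff lint issues, and the overall score is the weighted sum of the dimension values (a minimal arithmetic sketch, not the pipeline's code):

```python
# Recompute the sample report's overall score from its per-dimension values
# and the documented weights. Illustrative arithmetic only.
weights = {"correctness": 0.40, "quality": 0.25, "security": 0.20, "maintainability": 0.15}
values = {"correctness": 17 / 20, "quality": 1 - 2 * 0.02, "security": 1.0, "maintainability": 0.9}

overall = sum(weights[k] * values[k] for k in weights)
# overall is ~0.915, shown as 0.91 in the sample report
```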
Lightweight snippet evaluation — runs static-analysis scanners only (no test runners or dependency auditors) and produces a compact result with a 0-100 score.
Evaluate a code string directly:
```bash
code-eval snippet -c "import os; os.system('rm -rf /')" --lang python
```

Evaluate a single code file:

```bash
code-eval snippet --file ./utils.py
```

Language is auto-detected from the file extension. You can override it:

```bash
code-eval snippet --file ./script.txt --lang python
```

Write the result to a file:

```bash
code-eval snippet -c "print('hello')" --lang python --output result.json
```

The snippet result JSON is a compact schema:
```json
{
  "language": "python",
  "file": "snippet.py",
  "duration_seconds": 0.45,
  "score": 85.0,
  "issues_count": 3,
  "issues": [
    {
      "id": "SNIPPET-001",
      "severity": "high",
      "type": "security",
      "message": "Possible shell injection via os.system()",
      "file": "snippet.py",
      "line": 1
    }
  ],
  "severity_summary": {
    "critical": 0,
    "high": 1,
    "medium": 1,
    "low": 1,
    "info": 0
  }
}
```

Snippet mode uses 3 dimensions (no correctness, since there are no tests):
| Dimension | Weight | Source | Scoring Logic |
|---|---|---|---|
| Quality | 0.40 | ruff / java-lint | -0.02 per lint issue |
| Security | 0.35 | bandit / java-security | Deductions: critical -0.30, high -0.15, medium -0.05, low -0.02 |
| Maintainability | 0.25 | radon / java-complexity | CC≤5 → 1.0; CC 5-15 → 1.0-0.5; CC 15-25 → 0.5-0.0 |
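A plausible reading of the 0-100 snippet score is the weighted sum of these three dimensions scaled by 100, with the deductions applied per dimension. The sketch below is an assumption derived from the tables above, not the actual scoring code:

```python
# Assumed snippet scoring: documented deductions per dimension, then the
# weighted sum scaled to 0-100. Illustrative only.
SEVERITY_DEDUCTION = {"critical": 0.30, "high": 0.15, "medium": 0.05, "low": 0.02}

def snippet_score(lint_issues, security_severities, maintainability):
    quality = max(0.0, 1.0 - 0.02 * lint_issues)
    security = max(0.0, 1.0 - sum(SEVERITY_DEDUCTION[s] for s in security_severities))
    return 100 * (0.40 * quality + 0.35 * security + 0.25 * maintainability)

print(snippet_score(0, [], 1.0))  # a clean snippet scores 100.0
```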
| Language | Scanners |
|---|---|
| Python | ruff, bandit, radon |
| Java | java-lint, java-security, java-complexity |
Note: Test runners (pytest, maven-test) and dependency auditors (pip-audit) are excluded from snippet mode since snippets have no project structure.
| Code | Meaning |
|---|---|
| 0 | No critical or high severity issues |
| 1 | At least one critical or high severity issue found |
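Beyond the exit code, a CI job can gate on the report contents directly. A minimal sketch (the 0.8 threshold and the `gate` helper are illustrative; field names follow the `evaluation.json` schema shown earlier):

```python
import json  # in CI: report = json.load(open("evaluation.json"))

def gate(report, min_score=0.8):
    """Pass when there are no critical issues and the average score clears the bar."""
    summary = report["summary"]
    return summary["critical_issues"] == 0 and summary["avg_overall_score"] >= min_score

sample = {"summary": {"avg_overall_score": 0.91, "total_issues": 5,
                      "critical_issues": 0, "targets_passed": 1, "targets_failed": 0}}
print(gate(sample))  # True
```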
Create a `.env` file (see `.env.example`) to customize behavior:

```bash
# Sandbox
SANDBOX_ENABLED=false         # Global toggle (default: false)
SANDBOX_PYTHON_ENABLED=true   # Per-language override
SANDBOX_JAVA_ENABLED=         # Per-language override for Java
SANDBOX_MEMORY_LIMIT=512m     # Docker memory limit
SANDBOX_CPU_LIMIT=1           # Docker CPU limit
SANDBOX_TIMEOUT=300           # Total timeout in seconds
SANDBOX_NETWORK=none          # Docker network mode

# Concurrency
MAX_CONCURRENT=4              # Max parallel evaluations

# Issue limits
MAX_ISSUES_PER_TARGET=50      # Max issues per target in report

# Scoring weights (auto-normalized if they don't sum to 1.0)
SCORE_WEIGHT_CORRECTNESS=0.40
SCORE_WEIGHT_QUALITY=0.25
SCORE_WEIGHT_SECURITY=0.20
SCORE_WEIGHT_MAINTAINABILITY=0.15

# Java / Maven
JAVA_MVN_PATH=                # Optional mvn path (fallback: PATH lookup)
JAVA_MVN_SETTINGS=            # Optional settings.xml
JAVA_MVN_TIMEOUT=300          # Maven timeout in seconds
JAVA_MVN_SKIP_TESTS=false     # Run compile instead of test
JAVA_MVN_THREADS=             # Optional -T value (e.g. 2C)
```

Sandbox resolution order for each language: per-language override → global toggle → default (false).

Example: `SANDBOX_ENABLED=false` + `SANDBOX_PYTHON_ENABLED=true` → Python runs in the sandbox, other languages run directly.
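The auto-normalization of the `SCORE_WEIGHT_*` values presumably divides each weight by their total so they sum to 1.0; a sketch of that assumption (not the pipeline's actual code):

```python
def normalize_weights(weights):
    """Scale weights so they sum to 1.0 (assumed auto-normalization behavior)."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Weights summing to 2.0 are halved:
print(normalize_weights({"correctness": 1.0, "quality": 0.5, "security": 0.25, "maintainability": 0.25}))
```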
To build the evaluation Docker image:

```bash
docker build -f docker/Dockerfile.python -t code-eval-python .
```

Enable the sandbox in `.env`:

```bash
SANDBOX_ENABLED=true
```

```
code_eval/
├── __init__.py
├── cli.py          # Click CLI entry point (eval + snippet sub-commands)
├── config.py       # Configuration from .env
├── adapters/       # Language adapter interface + Python/Java implementations
├── core/           # Runner, scheduler, sandbox, models
├── extractors/     # Issue extractors (Python + Java)
├── reporting/      # JSON & markdown report generation
├── resolvers/      # Target resolution & language detection
├── scanners/       # Scanner interface + Python/Java scanner implementations
├── schemas/        # Pydantic data models (Issue, Metrics, EvaluationReport, SnippetResult)
├── scoring/        # Score computation
└── snippet/        # Snippet-mode runner & scanner selection
```
```bash
# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
python -m pytest tests/ -v
```