Automated evaluation pipeline for AI-generated code. Supports two evaluation modes — full-project eval and lightweight snippet — covering Python and Java (Maven).
| | `code-eval eval` | `code-eval snippet` |
|---|---|---|
| Purpose | Full-project evaluation with tests, lint, security, and complexity | Quick static analysis of a single code snippet |
| Input | Directory / file paths / git diff | Inline code (`-c`) or single file (`--file`) |
| Scanners | All 9 scanners (incl. test runners & dependency auditors) | Static analysis only (no pytest / maven-test / pip-audit) |
| Scoring | 4 dimensions: correctness, quality, security, maintainability | 3 dimensions: quality, security, maintainability (no correctness) |
| Output | `evaluation.json` — full report with metrics, issues, scores | Compact `SnippetResult` JSON with score (0-100) and issues |
| Use Case | CI/CD pipelines, batch project evaluation | Code review, quick checks, editor integration |
- Two evaluation modes: `eval` (project) and `snippet` (single file / inline code)
- Three input modes (eval): directory, file path, git-diff
- Two language adapters: Python + Java (Maven)
- Nine scanners:
  - Python: pytest, ruff, bandit, radon, pip-audit
  - Java: maven-test, java-lint, java-security, java-complexity
- Multi-dimensional scoring: correctness (0.40), quality (0.25), security (0.20), maintainability (0.15)
- Two-layer diff awareness: file-level + line-level tracking (`in_diff` tagging)
- Configurable Docker sandbox: optional container isolation with resource limits
- Batch evaluation: concurrent target processing with progress reporting
- Structured output: `evaluation.json` with metrics, issues, scores, and summary
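The two-layer diff awareness pairs file-level tracking with line-level `in_diff` tagging. A minimal sketch of the line-level idea (the names `changed_lines` and `tag_issue` are illustrative, not the pipeline's API):

```python
# Hypothetical sketch of line-level diff tagging: an issue counts as "in_diff"
# when it sits on a line the evaluated diff actually changed.
changed_lines = {
    "src/auth.py": {10, 11, 12, 40},  # line numbers touched by the diff
    "src/api.py": {5},
}

def tag_issue(issue: dict) -> dict:
    """Attach an in_diff flag based on the changed-line map."""
    in_diff = issue["line"] in changed_lines.get(issue["file"], set())
    return {**issue, "in_diff": in_diff}

print(tag_issue({"file": "src/auth.py", "line": 11, "message": "unused import"}))
```

In eval mode this flag drives the quality deductions (-0.02 per in-diff issue versus -0.002 per out-of-diff issue).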
```bash
pip install code-eval
```

Or install from source:

```bash
pip install -e .
```

Full-project evaluation — runs all scanners (tests, lint, security, complexity) and produces a comprehensive structured report.
Evaluate a project directory (language auto-detected by markers such as `pyproject.toml` or `pom.xml`):

```bash
code-eval eval --targets ./my_project
```

Evaluate specific files:

```bash
code-eval eval --targets ./src/auth.py ./src/api.py
```

For Java, file mode also works (project root resolved via `pom.xml`):

```bash
code-eval eval --targets ./my-java-project/src/main/java/com/example/App.java
```

Evaluate only files changed since main:

```bash
code-eval eval --git-diff --base main
```

Evaluate multiple targets at once:

```bash
code-eval eval --targets ./project_a ./project_b
```

Write the full report to a file:

```bash
code-eval eval --targets ./my_project --output evaluation.json
```

Also write a markdown summary:

```bash
code-eval eval --targets ./my_project --output evaluation.json --summary summary.md
```

Use a custom configuration file:

```bash
code-eval eval --targets ./my_project --config .env.production
```

The `evaluation.json` output contains:
```json
{
  "meta": {
    "timestamp": "2025-01-01T00:00:00Z",
    "pipeline_version": "0.1.0",
    "total_targets": 1,
    "total_duration_seconds": 5.2
  },
  "results": [
    {
      "target": "/path/to/project",
      "language": "python",
      "duration_seconds": 5.2,
      "scores": {
        "correctness": { "value": 0.85, "weight": 0.40, "detail": "17/20 tests passed" },
        "quality": { "value": 0.96, "weight": 0.25, "detail": "2 lint issues in diff" },
        "security": { "value": 1.0, "weight": 0.20, "detail": "No security issues" },
        "maintainability": { "value": 0.9, "weight": 0.15, "detail": "Average complexity: 6.2" },
        "overall": 0.91
      },
      "metrics": {
        "tests_total": 20,
        "tests_passed": 17,
        "tests_failed": 3,
        "lint_issues": 2,
        "security_issues": 0,
        "avg_complexity": 6.2,
        "files_evaluated": 8
      },
      "issues": [ "..." ]
    }
  ],
  "summary": {
    "avg_overall_score": 0.91,
    "total_issues": 5,
    "critical_issues": 0,
    "targets_passed": 1,
    "targets_failed": 0
  }
}
```

| Dimension | Weight | Source | Scoring Logic |
|---|---|---|---|
| Correctness | 0.40 | pytest / maven-test | tests_passed / tests_total; no tests → 0.5; compilation failed → 0.0 |
| Quality | 0.25 | ruff / java-lint | -0.02 per in-diff lint issue; -0.002 per out-of-diff |
| Security | 0.20 | bandit / java-security | Deductions: critical -0.30, high -0.15, medium -0.05, low -0.02 |
| Maintainability | 0.15 | radon / java-complexity | CC≤5 → 1.0; CC 5-15 → 1.0-0.5; CC 15-25 → 0.5-0.0 |
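Cross-checking these rules against the sample report above: correctness is the pass ratio 17/20 = 0.85, quality is 1 − 2 × 0.02 = 0.96 for the two in-diff lint issues, and the overall score is the weighted sum of the dimension values (a minimal arithmetic sketch, not the pipeline's code):

```python
# Recompute the sample report's overall score from its per-dimension values
# and the documented weights. Illustrative arithmetic only.
weights = {"correctness": 0.40, "quality": 0.25, "security": 0.20, "maintainability": 0.15}
values = {"correctness": 17 / 20, "quality": 1 - 2 * 0.02, "security": 1.0, "maintainability": 0.9}

overall = sum(weights[k] * values[k] for k in weights)
# overall is ~0.915, shown as 0.91 in the sample report
```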
Lightweight snippet evaluation — runs static-analysis scanners only (no test runners or dependency auditors) and produces a compact result with a 0-100 score.
Evaluate a code string directly:
```bash
code-eval snippet -c "import os; os.system('rm -rf /')" --lang python
```

Evaluate a single code file:

```bash
code-eval snippet --file ./utils.py
```

Language is auto-detected from the file extension. You can override it:

```bash
code-eval snippet --file ./script.txt --lang python
```

Write the result to a file:

```bash
code-eval snippet -c "print('hello')" --lang python --output result.json
```

The snippet result JSON is a compact schema:
```json
{
  "language": "python",
  "file": "snippet.py",
  "duration_seconds": 0.45,
  "score": 85.0,
  "issues_count": 3,
  "issues": [
    {
      "id": "SNIPPET-001",
      "severity": "high",
      "type": "security",
      "message": "Possible shell injection via os.system()",
      "file": "snippet.py",
      "line": 1
    }
  ],
  "severity_summary": {
    "critical": 0,
    "high": 1,
    "medium": 1,
    "low": 1,
    "info": 0
  }
}
```

Snippet mode uses 3 dimensions (no correctness, since there are no tests):
| Dimension | Weight | Source | Scoring Logic |
|---|---|---|---|
| Quality | 0.40 | ruff / java-lint | -0.02 per lint issue |
| Security | 0.35 | bandit / java-security | Deductions: critical -0.30, high -0.15, medium -0.05, low -0.02 |
| Maintainability | 0.25 | radon / java-complexity | CC≤5 → 1.0; CC 5-15 → 1.0-0.5; CC 15-25 → 0.5-0.0 |
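A plausible reading of the 0-100 snippet score is the weighted sum of these three dimensions scaled by 100, with the deductions applied per dimension. The sketch below is an assumption derived from the tables above, not the actual scoring code:

```python
# Assumed snippet scoring: documented deductions per dimension, then the
# weighted sum scaled to 0-100. Illustrative only.
SEVERITY_DEDUCTION = {"critical": 0.30, "high": 0.15, "medium": 0.05, "low": 0.02}

def snippet_score(lint_issues, security_severities, maintainability):
    quality = max(0.0, 1.0 - 0.02 * lint_issues)
    security = max(0.0, 1.0 - sum(SEVERITY_DEDUCTION[s] for s in security_severities))
    return 100 * (0.40 * quality + 0.35 * security + 0.25 * maintainability)

print(snippet_score(0, [], 1.0))  # a clean snippet scores 100.0
```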
| Language | Scanners |
|---|---|
| Python | ruff, bandit, radon |
| Java | java-lint, java-security, java-complexity |
Note: Test runners (pytest, maven-test) and dependency auditors (pip-audit) are excluded from snippet mode since snippets have no project structure.
| Code | Meaning |
|---|---|
| 0 | No critical or high severity issues |
| 1 | At least one critical or high severity issue found |
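Beyond the exit code, a CI job can gate on the report contents directly. A minimal sketch (the 0.8 threshold and the `gate` helper are illustrative; field names follow the `evaluation.json` schema shown earlier):

```python
import json  # in CI: report = json.load(open("evaluation.json"))

def gate(report, min_score=0.8):
    """Pass when there are no critical issues and the average score clears the bar."""
    summary = report["summary"]
    return summary["critical_issues"] == 0 and summary["avg_overall_score"] >= min_score

sample = {"summary": {"avg_overall_score": 0.91, "total_issues": 5,
                      "critical_issues": 0, "targets_passed": 1, "targets_failed": 0}}
print(gate(sample))  # True
```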
Create a `.env` file (see `.env.example`) to customize behavior:

```bash
# Sandbox
SANDBOX_ENABLED=false         # Global toggle (default: false)
SANDBOX_PYTHON_ENABLED=true   # Per-language override
SANDBOX_JAVA_ENABLED=         # Per-language override for Java
SANDBOX_MEMORY_LIMIT=512m     # Docker memory limit
SANDBOX_CPU_LIMIT=1           # Docker CPU limit
SANDBOX_TIMEOUT=300           # Total timeout in seconds
SANDBOX_NETWORK=none          # Docker network mode

# Concurrency
MAX_CONCURRENT=4              # Max parallel evaluations

# Issue limits
MAX_ISSUES_PER_TARGET=50      # Max issues per target in report

# Scoring weights (auto-normalized if they don't sum to 1.0)
SCORE_WEIGHT_CORRECTNESS=0.40
SCORE_WEIGHT_QUALITY=0.25
SCORE_WEIGHT_SECURITY=0.20
SCORE_WEIGHT_MAINTAINABILITY=0.15

# Java / Maven
JAVA_MVN_PATH=                # Optional mvn path (fallback: PATH lookup)
JAVA_MVN_SETTINGS=            # Optional settings.xml
JAVA_MVN_TIMEOUT=300          # Maven timeout in seconds
JAVA_MVN_SKIP_TESTS=false     # Run compile instead of test
JAVA_MVN_THREADS=             # Optional -T value (e.g. 2C)
```

Sandbox resolution order for each language: per-language override → global toggle → default (false).

Example: `SANDBOX_ENABLED=false` + `SANDBOX_PYTHON_ENABLED=true` → Python runs in the sandbox, other languages run directly.
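The auto-normalization of the `SCORE_WEIGHT_*` values presumably divides each weight by their total so they sum to 1.0; a sketch of that assumption (not the pipeline's actual code):

```python
def normalize_weights(weights):
    """Scale weights so they sum to 1.0 (assumed auto-normalization behavior)."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Weights summing to 2.0 are halved:
print(normalize_weights({"correctness": 1.0, "quality": 0.5, "security": 0.25, "maintainability": 0.25}))
```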
To build the evaluation Docker image:

```bash
docker build -f docker/Dockerfile.python -t code-eval-python .
```

Enable the sandbox in `.env`:

```bash
SANDBOX_ENABLED=true
```

```
code_eval/
├── __init__.py
├── cli.py          # Click CLI entry point (eval + snippet sub-commands)
├── config.py       # Configuration from .env
├── adapters/       # Language adapter interface + Python/Java implementations
├── core/           # Runner, scheduler, sandbox, models
├── extractors/     # Issue extractors (Python + Java)
├── reporting/      # JSON & markdown report generation
├── resolvers/      # Target resolution & language detection
├── scanners/       # Scanner interface + Python/Java scanner implementations
├── schemas/        # Pydantic data models (Issue, Metrics, EvaluationReport, SnippetResult)
├── scoring/        # Score computation
└── snippet/        # Snippet-mode runner & scanner selection
```
```bash
# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
python -m pytest tests/ -v
```