patrickleehua/CodeEval

Code Eval

Automated evaluation pipeline for AI-generated code. Supports two evaluation modes — a full-project eval and a lightweight snippet eval — covering Python and Java (Maven).

Two Modes

|  | code-eval eval | code-eval snippet |
| --- | --- | --- |
| Purpose | Full-project evaluation with tests, lint, security, and complexity | Quick static analysis of a single code snippet |
| Input | Directory / file paths / git diff | Inline code (-c) or single file (--file) |
| Scanners | All 9 scanners (incl. test runners & dependency auditors) | Static-analysis only (no pytest / maven-test / pip-audit) |
| Scoring | 4 dimensions: correctness, quality, security, maintainability | 3 dimensions: quality, security, maintainability (no correctness) |
| Output | evaluation.json — full report with metrics, issues, scores | Compact SnippetResult JSON with score (0-100) and issues |
| Use Case | CI/CD pipelines, batch project evaluation | Code review, quick checks, editor integration |

Features

  • Two evaluation modes: eval (project) and snippet (single file / inline code)
  • Three input modes (eval): directory, file path, git-diff
  • Two language adapters: Python + Java (Maven)
  • Nine scanners:
    • Python: pytest, ruff, bandit, radon, pip-audit
    • Java: maven-test, java-lint, java-security, java-complexity
  • Multi-dimensional scoring: correctness (0.40), quality (0.25), security (0.20), maintainability (0.15)
  • Two-layer diff awareness: file-level + line-level tracking (in_diff tagging)
  • Configurable Docker sandbox: optional container isolation with resource limits
  • Batch evaluation: concurrent target processing with progress reporting
  • Structured output: evaluation.json with metrics, issues, scores, and summary
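
The multi-dimensional scoring in the list above can be sketched as a simple weighted sum (a minimal illustration of the idea, not the project's actual implementation):

```python
# Sketch of the weighted overall score; illustrative only,
# not the tool's actual implementation.
WEIGHTS = {
    "correctness": 0.40,
    "quality": 0.25,
    "security": 0.20,
    "maintainability": 0.15,
}

def overall_score(dimension_scores: dict) -> float:
    """Combine per-dimension scores (each in 0.0-1.0) into one overall score."""
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)
```

For example, dimension scores of 0.85, 0.96, 1.0, and 0.9 combine to 0.915.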

Installation

pip install code-eval

Or install from source:

pip install -e .

Mode 1: code-eval eval

Full-project evaluation — runs all scanners (tests, lint, security, complexity) and produces a comprehensive structured report.

Directory mode

Evaluate a project directory (language auto-detected by markers such as pyproject.toml or pom.xml):

code-eval eval --targets ./my_project

File mode

Evaluate specific files:

code-eval eval --targets ./src/auth.py ./src/api.py

For Java, file mode also works (project root resolved via pom.xml):

code-eval eval --targets ./my-java-project/src/main/java/com/example/App.java
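
Marker-based root resolution works roughly like this (an illustrative sketch of the idea, not the tool's actual resolver; `find_project_root` and `MARKERS` are hypothetical names):

```python
from pathlib import Path

# Hypothetical sketch of marker-based project-root resolution:
# walk up from a target path until a language marker file is found.
MARKERS = {"pyproject.toml": "python", "pom.xml": "java"}

def find_project_root(target: Path):
    """Return (root_dir, language) for the first marker found at or above target."""
    for candidate in [target, *target.parents]:
        for marker, language in MARKERS.items():
            if (candidate / marker).exists():
                return candidate, language
    return None  # no marker found; caller must handle this
```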

Git diff mode

Evaluate only files changed since main:

code-eval eval --git-diff --base main

Multiple targets

code-eval eval --targets ./project_a ./project_b

Save output to file

code-eval eval --targets ./my_project --output evaluation.json

Generate markdown summary

code-eval eval --targets ./my_project --output evaluation.json --summary summary.md

Custom configuration

code-eval eval --targets ./my_project --config .env.production

Eval Output Format

The evaluation.json output contains:

{
  "meta": {
    "timestamp": "2025-01-01T00:00:00Z",
    "pipeline_version": "0.1.0",
    "total_targets": 1,
    "total_duration_seconds": 5.2
  },
  "results": [
    {
      "target": "/path/to/project",
      "language": "python",
      "duration_seconds": 5.2,
      "scores": {
        "correctness": { "value": 0.85, "weight": 0.40, "detail": "17/20 tests passed" },
        "quality": { "value": 0.96, "weight": 0.25, "detail": "2 lint issues in diff" },
        "security": { "value": 1.0, "weight": 0.20, "detail": "No security issues" },
        "maintainability": { "value": 0.9, "weight": 0.15, "detail": "Average complexity: 6.2" },
        "overall": 0.91
      },
      "metrics": {
        "tests_total": 20,
        "tests_passed": 17,
        "tests_failed": 3,
        "lint_issues": 2,
        "security_issues": 0,
        "avg_complexity": 6.2,
        "files_evaluated": 8
      },
      "issues": [ "..." ]
    }
  ],
  "summary": {
    "avg_overall_score": 0.91,
    "total_issues": 5,
    "critical_issues": 0,
    "targets_passed": 1,
    "targets_failed": 0
  }
}
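
In CI, the report can be consumed with a few lines of Python (a sketch against the schema above; the 0.8 threshold is an arbitrary example, not a tool default):

```python
import json

# Sketch: gate a CI step on the evaluation.json summary block.
# The 0.8 threshold is an arbitrary example, not a tool default.
def report_passes(report: dict, threshold: float = 0.8) -> bool:
    s = report["summary"]
    return (s["avg_overall_score"] >= threshold
            and s["critical_issues"] == 0
            and s["targets_failed"] == 0)

report = json.loads("""
{"summary": {"avg_overall_score": 0.91, "total_issues": 5,
             "critical_issues": 0, "targets_passed": 1, "targets_failed": 0}}
""")
print(report_passes(report))  # prints True
```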

Eval Scoring Dimensions

| Dimension | Weight | Source | Scoring Logic |
| --- | --- | --- | --- |
| Correctness | 0.40 | pytest / maven-test | tests_passed / tests_total; no tests → 0.5; compilation failed → 0.0 |
| Quality | 0.25 | ruff / java-lint | -0.02 per in-diff lint issue; -0.002 per out-of-diff issue |
| Security | 0.20 | bandit / java-security | Deductions: critical -0.30, high -0.15, medium -0.05, low -0.02 |
| Maintainability | 0.15 | radon / java-complexity | CC ≤ 5 → 1.0; CC 5-15 → 1.0-0.5; CC 15-25 → 0.5-0.0 |
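
The maintainability rule reads as a piecewise-linear function of average cyclomatic complexity. A sketch, assuming linear interpolation within each band and 0.0 above CC 25 (the > 25 case is not stated in the table):

```python
def maintainability_score(cc: float) -> float:
    """Piecewise-linear mapping from average cyclomatic complexity to a score.

    Sketch of the table's bands, assuming linear interpolation inside each
    band and 0.0 above CC 25 (an assumption; the table stops at 25).
    """
    if cc <= 5:
        return 1.0
    if cc <= 15:
        return 1.0 - 0.5 * (cc - 5) / 10   # 1.0 -> 0.5 across CC 5-15
    if cc <= 25:
        return 0.5 - 0.5 * (cc - 15) / 10  # 0.5 -> 0.0 across CC 15-25
    return 0.0
```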

Mode 2: code-eval snippet

Lightweight snippet evaluation — runs static-analysis scanners only (no test runners or dependency auditors) and produces a compact result with a 0-100 score.

Inline code

Evaluate a code string directly:

code-eval snippet -c "import os; os.system('rm -rf /')" --lang python

File input

Evaluate a single code file:

code-eval snippet --file ./utils.py

Language is auto-detected from the file extension. You can override it:

code-eval snippet --file ./script.txt --lang python

Save snippet result

code-eval snippet -c "print('hello')" --lang python --output result.json

Snippet Output Format

The snippet result JSON is a compact schema:

{
  "language": "python",
  "file": "snippet.py",
  "duration_seconds": 0.45,
  "score": 85.0,
  "issues_count": 3,
  "issues": [
    {
      "id": "SNIPPET-001",
      "severity": "high",
      "type": "security",
      "message": "Possible shell injection via os.system()",
      "file": "snippet.py",
      "line": 1
    }
  ],
  "severity_summary": {
    "critical": 0,
    "high": 1,
    "medium": 1,
    "low": 1,
    "info": 0
  }
}

Snippet Scoring Dimensions

Snippet mode uses 3 dimensions (no correctness, since there are no tests):

| Dimension | Weight | Source | Scoring Logic |
| --- | --- | --- | --- |
| Quality | 0.40 | ruff / java-lint | -0.02 per lint issue |
| Security | 0.35 | bandit / java-security | Deductions: critical -0.30, high -0.15, medium -0.05, low -0.02 |
| Maintainability | 0.25 | radon / java-complexity | CC ≤ 5 → 1.0; CC 5-15 → 1.0-0.5; CC 15-25 → 0.5-0.0 |
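
The security deductions can be sketched as a per-severity penalty sum (an illustration of the table's rule; clamping the result at 0.0 is an assumption):

```python
# Per-issue deductions from the security dimension, per the table above.
SECURITY_DEDUCTIONS = {"critical": 0.30, "high": 0.15, "medium": 0.05, "low": 0.02}

def security_score(severity_counts: dict) -> float:
    """Start from 1.0 and deduct per issue; clamping at 0.0 is assumed."""
    penalty = sum(SECURITY_DEDUCTIONS.get(severity, 0.0) * count
                  for severity, count in severity_counts.items())
    return max(0.0, 1.0 - penalty)
```

For example, one high, one medium, and one low issue deduct 0.22, leaving 0.78.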

Snippet Scanners by Language

| Language | Scanners |
| --- | --- |
| Python | ruff, bandit, radon |
| Java | java-lint, java-security, java-complexity |

Note: Test runners (pytest, maven-test) and dependency auditors (pip-audit) are excluded from snippet mode since snippets have no project structure.

Exit Codes (snippet)

| Code | Meaning |
| --- | --- |
| 0 | No critical or high severity issues |
| 1 | At least one critical or high severity issue found |
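
The exit-code rule maps directly onto the severity_summary block of the snippet result (a sketch of the stated rule):

```python
def snippet_exit_code(severity_summary: dict) -> int:
    """Return 0 if there are no critical/high issues, 1 otherwise
    (mirrors the exit-code table above)."""
    blocking = severity_summary.get("critical", 0) + severity_summary.get("high", 0)
    return 1 if blocking > 0 else 0
```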

Configuration

Create a .env file (see .env.example) to customize behavior:

# Sandbox
SANDBOX_ENABLED=false              # Global toggle (default: false)
SANDBOX_PYTHON_ENABLED=true        # Per-language override
SANDBOX_JAVA_ENABLED=              # Per-language override for Java
SANDBOX_MEMORY_LIMIT=512m          # Docker memory limit
SANDBOX_CPU_LIMIT=1                # Docker CPU limit
SANDBOX_TIMEOUT=300                # Total timeout in seconds
SANDBOX_NETWORK=none               # Docker network mode

# Concurrency
MAX_CONCURRENT=4                   # Max parallel evaluations

# Issue limits
MAX_ISSUES_PER_TARGET=50           # Max issues per target in report

# Scoring weights (auto-normalized if they don't sum to 1.0)
SCORE_WEIGHT_CORRECTNESS=0.40
SCORE_WEIGHT_QUALITY=0.25
SCORE_WEIGHT_SECURITY=0.20
SCORE_WEIGHT_MAINTAINABILITY=0.15

# Java / Maven
JAVA_MVN_PATH=                     # Optional mvn path (fallback: PATH lookup)
JAVA_MVN_SETTINGS=                 # Optional settings.xml
JAVA_MVN_TIMEOUT=300               # Maven timeout in seconds
JAVA_MVN_SKIP_TESTS=false          # If true, run compile instead of test
JAVA_MVN_THREADS=                  # Optional -T value (e.g. 2C)
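
"Auto-normalized" means the four SCORE_WEIGHT_* values are rescaled so they sum to 1.0 while preserving their ratios. A sketch of the stated behavior:

```python
def normalize_weights(weights: dict) -> dict:
    """Rescale weights so they sum to 1.0, preserving their ratios.

    Sketch of the config's auto-normalization; not the tool's actual code.
    """
    total = sum(weights.values())
    return {name: value / total for name, value in weights.items()}

# e.g. weights that sum to 2.0 are halved back to the 0.40/0.25/0.20/0.15 split
print(normalize_weights({"correctness": 0.8, "quality": 0.5,
                         "security": 0.4, "maintainability": 0.3}))
```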

Sandbox resolution order

For each language: per-language override → global toggle → default (false)

Example: SANDBOX_ENABLED=false + SANDBOX_PYTHON_ENABLED=true → Python runs in sandbox, others run directly.
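
The resolution order can be sketched as follows (an illustration with simplified env parsing; `sandbox_enabled` is a hypothetical helper, not the tool's actual code):

```python
def sandbox_enabled(language: str, env: dict) -> bool:
    """Per-language override beats the global toggle; default is False.

    Sketch of the stated resolution order; env values are raw strings
    and an empty/missing override falls through to the global toggle.
    """
    override = env.get(f"SANDBOX_{language.upper()}_ENABLED", "")
    if override != "":
        return override.lower() == "true"
    return env.get("SANDBOX_ENABLED", "false").lower() == "true"

# The example from the docs: global off, Python override on
env = {"SANDBOX_ENABLED": "false", "SANDBOX_PYTHON_ENABLED": "true"}
print(sandbox_enabled("python", env), sandbox_enabled("java", env))  # True False
```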

Docker Sandbox

To build the evaluation Docker image:

docker build -f docker/Dockerfile.python -t code-eval-python .

Enable sandbox in .env:

SANDBOX_ENABLED=true

Project Structure

code_eval/
├── __init__.py
├── cli.py              # Click CLI entry point (eval + snippet sub-commands)
├── config.py           # Configuration from .env
├── adapters/           # Language adapter interface + Python/Java implementations
├── core/               # Runner, scheduler, sandbox, models
├── extractors/         # Issue extractors (Python + Java)
├── reporting/          # JSON & markdown report generation
├── resolvers/          # Target resolution & language detection
├── scanners/           # Scanner interface + Python/Java scanner implementations
├── schemas/            # Pydantic data models (Issue, Metrics, EvaluationReport, SnippetResult)
├── scoring/            # Score computation
└── snippet/            # Snippet-mode runner & scanner selection

Development

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
python -m pytest tests/ -v

License

MIT
