LLMCodeAnalysisBench

A comprehensive benchmark suite for evaluating Large Language Models on code analysis tasks.

Python 3.9+ | License: Proprietary | Status: Active

End-to-end pipelines for AST/CFG/CG analysis, pointer analysis, expression matching, DP/Taint detection, mutant & flaky test identification — with unified evaluation, visualization, and human review.

Language: English | 简体中文



✨ Features

  • 🔍 Minimal pipelines for AST/CFG/CG, Pointer Analysis, Expression Match, DP/Taint, Mutant, and Flaky Tests
  • 🚀 Single‑command generation scripts and unified evaluation aggregator
  • 🔌 Multi‑provider support via .env and config/providers.yaml (OpenAI / DeepSeek / Close / Ollama)
  • 📊 PDF rendering for structural outputs to aid manual inspection
  • 👥 Human review dashboard backed by SQLite for quality control
  • 📈 Comprehensive metrics including precision, recall, F1, and structural similarity scores
  • 🛠️ Extensible architecture for adding new tasks and evaluation metrics

📋 Prerequisites

Before you begin, ensure you have the following installed:

  • Python 3.9+ (Python 3.10 or 3.11 recommended)
  • Git for cloning the repository
  • pip and virtualenv (or Python's built-in venv)

Optional dependencies for specific features:

  • Graphviz (for graph rendering): brew install graphviz (macOS) or apt-get install graphviz (Ubuntu)
  • PyTorch (for expression matching): Installed automatically with extras, but you may need platform-specific builds
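To confirm Graphviz is installed before rendering, check that the dot executable is on your PATH (a quick sanity check; dot ships with the standard Graphviz distribution):

# Print the Graphviz version; fails if Graphviz is not installed
dot -V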

📦 Installation

Basic Setup

Create and activate a virtual environment:

# Clone the repository (if you haven't already)
git clone https://github.com/mathieu0905/llm_code_analysis
cd llm_code_analysis

# Option 1: Core installation only
bash scripts/setup_venv.sh

# Option 2: With evaluation and expression matching extras
bash scripts/setup_venv.sh -e -x

# Activate the virtual environment
source .venv/bin/activate

pip install -r requirements.txt

Installation Options

Option       | Command                          | Includes
Core         | bash scripts/setup_venv.sh       | Base dependencies for generation
+ Evaluation | bash scripts/setup_venv.sh -e    | Core + evaluation metrics
+ Expression | bash scripts/setup_venv.sh -x    | Core + expression matching (PyTorch)
Full         | bash scripts/setup_venv.sh -e -x | All features

Note: If PyTorch installation fails on your platform, you may need to install a platform-specific wheel manually. Visit pytorch.org for instructions.
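For example, on a CPU-only machine one common workaround is to install a CPU wheel from PyTorch's package index before adding the extras (index URL per pytorch.org; pick the variant matching your CUDA version if you have a GPU):

# Install a CPU-only PyTorch build before installing the [expr] extras
pip install torch --index-url https://download.pytorch.org/whl/cpu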

⚙️ Configuration

Environment Variables

Copy the example environment file and configure your API keys:

cp .env.example .env
# Edit .env with your API keys and endpoints

Supported Providers

Configure one or more providers in your .env file:

OpenAI (GPT-4, GPT-5, etc.)
export OPENAI_API_KEY=sk-...
export OPENAI_API_BASE=https://api.openai.com/v1
DeepSeek
export DEEPSEEK_API_KEY=ds-...
export DEEPSEEK_API_BASE=https://api.deepseek.com/v1
Close Proxy / Custom OpenAI-compatible
export CLOSE_API_KEY=...
export CLOSE_API_BASE=https://your.close.proxy/v1
Ollama (Local)
export OLLAMA_API_BASE=http://127.0.0.1:11434/v1
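When using the local Ollama provider, the server must be reachable at that address and the model you plan to request must already be available (standard Ollama CLI; the model name below is a placeholder):

ollama list           # show models already available locally
ollama pull <model>   # download the model you will pass as MODEL_NAME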

Output Configuration

# Set the root directory for all results (recommended)
export RESULTS_ROOT=results

🚀 Quick Start

Generate Results with OpenAI

# Run generation for all tasks with default settings
bash scripts/run_openai_gpt5.sh

This will:

  • Generate AST, CFG, and Call Graph structures
  • Perform pointer analysis
  • Evaluate expression matching
  • Analyze DP/Taint patterns
  • Run mutant and flaky test detection
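The launcher reads MODEL_NAME from the environment (it defaults to gpt-5-mini), so you can switch models without editing the script:

# Override the default model for a single run
MODEL_NAME=gpt-4 bash scripts/run_openai_gpt5.sh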

Use Alternative Providers

DeepSeek:

bash scripts/run_deepseek_chat.sh

Custom OpenAI-compatible Provider:

  1. Set credentials in .env:

export CLOSE_API_KEY=sk-...
export CLOSE_API_BASE=https://api.closeai.com/v1

  2. Update the launcher script:

# Change env var checks and default model
sed -i 's/OPENAI_API_KEY/CLOSE_API_KEY/g; s/OPENAI_API_BASE/CLOSE_API_BASE/g' scripts/run_openai_gpt5.sh
sed -i 's/MODEL_NAME=${MODEL_NAME:-gpt-5-mini}/MODEL_NAME=${MODEL_NAME:-claude-sonnet-4}/' scripts/run_openai_gpt5.sh

# Note: On macOS, use sed -i '' instead of sed -i

  3. (Optional) Configure provider defaults:
    • Code fallback: _default_provider_config() in src/common/llm.py includes a close provider
    • Config override: Edit config/providers.yaml to set default_provider: close

Evaluate Results

After generation completes, evaluate across all tasks and models:

python evaluation/evaluate_multi_models.py

Output: results/aggregated_summary.json and results/results.md

Behavior and tips:

  • Omitting --model enumerates every directory under --results-root and evaluates each.
  • By default, if results/aggregated_summary.json already contains a model entry, that model is skipped and the cached summary is reused.
  • Skipped models still appear in the Markdown output, using cached values.
  • Force recompute for all/selected models with --overwrite.
  • Target specific models with --model Display=folder (repeatable), e.g. --model gpt-4o-mini=gpt-4o-mini.
  • Reserved folder multi_model/ under results/ is ignored automatically and not treated as a model.
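For example, to force a re-evaluation of a single model using the flags above (the model name is illustrative):

python evaluation/evaluate_multi_models.py \
  --results-root results \
  --model gpt-4o-mini=gpt-4o-mini \
  --overwrite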

Render Visual Graphs

Generate PDF visualizations for AST/CFG/CG:

python evaluation/render_graphs.py <model_name> \
  --results-root results \
  --tasks ast cfg cg \
  --render-gold \
  --languages C java python solidity

Tip: Omit <model_name> to iterate all models under --results-root; existing PDFs are skipped by default. Add --overwrite to force re-render. See the Rendering section for full options. You can also run as a module after installing the package: python -m llm_code_analysis.evaluation.render_graphs ....

Human Review Interface

Launch the interactive dashboard for manual quality control:

bash scripts/run_human_eval_dashboard.sh

# Open in your browser:
# macOS:   open http://localhost:3000
# Linux:   xdg-open http://localhost:3000
# Windows: start http://localhost:3000

📁 Project Structure

llm_code_analysis/
├── config/                    # Provider configurations
│   └── providers.yaml
├── datasets/                  # Test cases and benchmarks
│   ├── ast_cfg/              # AST & CFG test cases
│   ├── call_graph/           # Call graph test cases
│   ├── pointer/              # Pointer analysis cases
│   ├── expression_match/     # Expression matching cases
│   ├── dp_taint/             # DP & Taint analysis cases
│   ├── flakytest/            # Flaky test detection cases
│   └── mutant/               # Mutant detection cases
├── src/                       # Source code
│   ├── pipelines/            # Task-specific pipelines
│   ├── common/               # Shared utilities
│   └── tasks_cli.py          # CLI entry point
├── evaluation/                # Evaluation scripts
│   ├── metrics/              # Metric implementations
│   ├── evaluate_multi_models.py
│   └── render_graphs.py
├── scripts/                   # Convenience scripts
│   ├── setup_venv.sh
│   ├── run_openai_gpt5.sh
│   └── run_deepseek_chat.sh
├── static_baseline/           # Gold standard outputs
│   ├── AST/
│   ├── CFG/
│   └── CG/
├── tools/                     # Additional tools
│   └── human_eval_next/      # Human review dashboard
├── results/                   # Generated outputs (gitignored)
└── pyproject.toml            # Project metadata

📚 Datasets

The datasets/ directory contains test cases organized by analysis task:

Task             | Location                                    | Structure
AST & CFG        | datasets/ast_cfg/<language>/code/*          | Source files per language
Call Graph       | datasets/call_graph/<language>/code/*       | Source files per language
Pointer Analysis | datasets/pointer/cases/<case>/              | code.c, prompt.md per case
Expression Match | datasets/expression_match/cases/<case>/     | prompt.md, reference/* code
DP/Taint         | datasets/dp_taint/contracts/                | <project>.json files
Flaky Tests      | datasets/flakytest/{summary,concept}_cases/ | question.md, metadata.json
MutantBench      | datasets/mutant/{fewshot,zeroshot}_cases/   | question.md, metadata.json

Gold Standards:

  • Static baselines: static_baseline/{AST,CFG,CG}/<language>/<case>/gold.json
  • Pointer ground truth: datasets/pointer/ground_truth/

💡 For detailed dataset documentation, see datasets/README.md

➕ Add New Cases

AST/CFG Cases

mkdir -p datasets/ast_cfg/C/code
cp your_case.c datasets/ast_cfg/C/code/

Call Graph Cases

mkdir -p datasets/call_graph/C/code
cp your_case.c datasets/call_graph/C/code/

Pointer Analysis Cases

# Copy from example template
cp -r datasets/pointer/cases/example_case datasets/pointer/cases/<new_case>

# Edit the source code
$EDITOR datasets/pointer/cases/<new_case>/code.c

# Edit the prompt
$EDITOR datasets/pointer/cases/<new_case>/prompt.md

Expression Match Cases

mkdir -p datasets/expression_match/cases/<case_name>/reference
echo "Your analysis prompt here" > datasets/expression_match/cases/<case_name>/prompt.md
cp -r /path/to/reference/code/* datasets/expression_match/cases/<case_name>/reference/

DP/Taint Cases

mkdir -p datasets/dp_taint/contracts

# Format: [DP_ITEMS, TAINT_ITEMS]
# Each item: [id, text_or_parts, label]
echo '[[], []]' > datasets/dp_taint/contracts/<project_name>.json
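A minimally populated contract following the [id, text_or_parts, label] item format (ids and descriptions below are placeholders; see the DP/Taint Format section for details):

cat > datasets/dp_taint/contracts/<project_name>.json <<'EOF'
[
  [["dp_1", "description of the data-protection requirement", true]],
  [["taint_1", "description of the taint flow", false]]
]
EOF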

Flaky Test & MutantBench Cases

# Flaky test case
mkdir -p datasets/flakytest/summary_cases/<case_id>
echo "Question text here" > datasets/flakytest/summary_cases/<case_id>/question.md
echo '{"label": true, "reason": "explanation"}' > datasets/flakytest/summary_cases/<case_id>/metadata.json

# Mutant case
mkdir -p datasets/mutant/fewshot_cases/<case_id>
echo "Question text here" > datasets/mutant/fewshot_cases/<case_id>/question.md
echo '{"label": true, "reason": "explanation"}' > datasets/mutant/fewshot_cases/<case_id>/metadata.json

🏆 Static Gold (AST/CFG/CG)

Gold standard files are stored in static_baseline/{AST,CFG,CG}/<language>/<case>/gold.json.

File Organization

  • Case naming: Folder name equals the source filename
    • Example: static_baseline/AST/C/for_loop.c/gold.json

JSON Schema

AST Format:

{
  "type": "FunctionDecl",
  "value": "main",
  "children": [
    {"type": "ReturnStmt", "children": [...]}
  ]
}

CFG/CG Format:

{
  "nodes": [
    {"id": "node1", "type": "EntryPoint", "label": "main"}
  ],
  "edges": [
    {"source": "node1", "target": "node2"}
  ]
}

Generation Guidelines

  1. Generate using your static analysis tools
  2. Convert to the schema format above
  3. The evaluator normalizes common naming variations
  4. Minor field variations are tolerated

Note for Call Graphs: Evaluation uses 75 canonical cases from datasets/call_graph/*/code/ to align totals across languages.

🔍 DP/Taint Format & Extraction

Input Format

Files are stored in datasets/dp_taint/contracts/<project>.json:

[
  [
    ["dp_1", ["text part 1", "text part 2"], true],
    ["dp_2", "single text string", false]
  ],
  [
    ["taint_1", "taint description", true]
  ]
]

Structure: [DP_ITEMS, TAINT_ITEMS]

  • Each item: [id, text_or_parts, label]
  • Text can be a single string or array of parts

Output Format

Predictions are written to:

results/<model>/dp_taint/<project>/contract/{dp,taint}/<id>.txt

First line must be exactly one of:

  • yes
  • no
  • unknown
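For reference, a well-formed prediction file can be created like this (model, project, and id are placeholders):

mkdir -p results/<model>/dp_taint/<project>/contract/dp
printf 'yes\n' > results/<model>/dp_taint/<project>/contract/dp/<id>.txt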

Evaluation

  • Aggregates precision, recall, and F1 score per project
  • Generates plots in results/<model>/dp_taint/plots/

📤 Outputs

Set RESULTS_ROOT=results in your environment. All pipelines write to $RESULTS_ROOT/<model>/<task>/....

Directory Structure

results/
├── <model_name>/
│   ├── ast_cfg/              # AST and CFG outputs
│   ├── call_graph/           # Call graph outputs
│   ├── pointer/              # Pointer analysis results
│   ├── expression_match/     # Expression matching results
│   ├── mutant/               # Mutant detection results
│   ├── flakytest/            # Flaky test detection results
│   └── dp_taint/             # DP and Taint analysis
│       ├── <project>/
│       │   └── contract/{dp,taint}/<id>.txt
│       └── plots/            # Visualization plots
├── aggregated_summary.json   # Cross-model evaluation
└── results.md                # Human-readable summary

Configuration

Some legacy scripts manage their own default output location. Point them at results explicitly with:

  • --results-root results
  • --output-root results
  • --output-dir results

📊 Evaluation Details

Multi-Model Aggregator

Script: evaluation/evaluate_multi_models.py

python evaluation/evaluate_multi_models.py

Outputs:

  • results/aggregated_summary.json - Machine-readable metrics
  • results/results.md - Human-readable report (optional)

Metrics Included:

  • 🌳 Structural: AST/CFG/CG similarity scores
  • 🎯 DP/Taint: Precision, Recall, F1 scores
  • 🔗 Pointer: Jaccard similarity
  • 🐛 Mutant & Flaky: Classification accuracy
  • 📝 Expression Match: Semantic similarity (requires [expr] extras)

Focused Evaluators

For task-specific evaluation:

  • evaluation/report.py - Individual task reports
  • evaluation/metrics/* - Metric implementations

Metrics Reference

Task         | Primary Metric     | Range   | Notes
AST/CFG      | Tree Edit Distance | 0.0-1.0 | Normalized similarity
Call Graph   | Graph Isomorphism  | 0.0-1.0 | Node/edge matching
Pointer      | Jaccard Index      | 0.0-1.0 | Set overlap
DP/Taint     | F1 Score           | 0.0-1.0 | Harmonic mean of P/R
Mutant/Flaky | Accuracy           | 0.0-1.0 | Correct classifications

🎨 Rendering

Convert structural JSON outputs to visual PDFs for manual inspection.

Script: evaluation/render_graphs.py

# Run directly (no install required)
python evaluation/render_graphs.py <model_name> \
  --results-root results \
  --tasks ast cfg cg \
  --render-gold \
  --languages C java python solidity

# Or as a module (after `pip install -e .[eval]`)
python -m llm_code_analysis.evaluation.render_graphs <model_name> \
  --results-root results \
  --tasks ast cfg cg \
  --render-gold \
  --languages C java python solidity

If you see ModuleNotFoundError: No module named 'llm_code_analysis':

  • Install the project as a package (recommended): pip install -e .[eval]
  • Or run without the package prefix: python -m evaluation.render_graphs ... or python evaluation/render_graphs.py ...

Incremental rendering (all models):

  • Omit the <model_name> argument to iterate over every model under --results-root.
  • By default (without --overwrite), existing PDFs are skipped, so only missing graphs are rendered.

Example (render only missing PDFs across all models):

python evaluation/render_graphs.py \
  --results-root results \
  --tasks ast cfg cg \
  --languages C java python solidity

# Or, equivalently, as a module:
python -m llm_code_analysis.evaluation.render_graphs \
  --results-root results \
  --tasks ast cfg cg \
  --languages C java python solidity

Optional:

  • Also render static gold into results/<gold-as-model>/...:
python -m llm_code_analysis.evaluation.render_graphs \
  --results-root results \
  --tasks ast cfg cg \
  --languages C java python solidity \
  --render-gold
  • Force re-render (overwrite existing PDFs): add --overwrite.
  • Render only the static gold (skip model outputs): add --only-gold.
    • Configure gold sources and target name with --gold-root and --gold-as-model.

Options:

  • --render-gold - Also render gold standard references
  • --only-gold - Render only gold references into --gold-as-model
  • --gold-root - Directory containing AST/CFG/CG/<language>/<case>/gold.json (default static_baseline)
  • --gold-as-model - Target folder name under --results-root for gold renders (default gold-static)
  • --tasks - Select specific tasks (ast, cfg, cg)
  • --languages - Filter by programming language

Output locations:

  • AST/CFG: results/<model>/ast_cfg/<language>/<case>/{AST,CFG}.pdf
  • CG: results/<model>/call_graph/<language>/<case>/CG.pdf

👥 Human Review

Interactive dashboard for manual quality control and judgment collection.

Running the Dashboard

bash scripts/run_human_eval_dashboard.sh

# Access at http://localhost:3000

Database Integration

Location: tools/human_eval_next/

The Next.js app uses SQLite to store human judgments.

Default DB: tools/human_eval_next/human_eval.sqlite3

  • Override with -d <path> or HUMAN_EVAL_DB environment variable
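For example, to point the dashboard at a database stored elsewhere (the path is a placeholder):

HUMAN_EVAL_DB=/path/to/human_eval.sqlite3 bash scripts/run_human_eval_dashboard.sh
# or, if the launcher forwards flags:
bash scripts/run_human_eval_dashboard.sh -d /path/to/human_eval.sqlite3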

Database Schema

Tables are automatically created on first run:

Table             | Purpose                    | Key Fields
judgments         | Human labels from UI       | model, task, language, case, reviewer, label, note
auto_judgments    | Automatic PASS/FAIL labels | model, task, case, label, score, threshold
structure_results | Per-case metrics           | model, task, case, score, metrics, label
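Collected judgments can be inspected directly with the sqlite3 CLI (column names follow the table above; the exact schema may differ slightly):

sqlite3 tools/human_eval_next/human_eval.sqlite3 \
  'SELECT model, task, reviewer, label FROM judgments LIMIT 10;'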

Integration with Evaluation

Read human overrides:

python evaluation/evaluate_multi_models.py --human-db path/to/human_eval.sqlite3

Write automatic judgments:

python evaluation/evaluate_multi_models.py --write-auto-db
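Assuming the two flags are independent, they can be combined to read human overrides and refresh automatic labels in one pass (the database path is a placeholder):

python evaluation/evaluate_multi_models.py \
  --human-db tools/human_eval_next/human_eval.sqlite3 \
  --write-auto-db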

🔧 Configuration

Environment Variables Reference

Variable          | Provider     | Example                      | Required
OPENAI_API_KEY    | OpenAI       | sk-...                       | Yes
OPENAI_API_BASE   | OpenAI       | https://api.openai.com/v1    | Yes
DEEPSEEK_API_KEY  | DeepSeek     | ds-...                       | Optional
DEEPSEEK_API_BASE | DeepSeek     | https://api.deepseek.com/v1  | Optional
CLOSE_API_KEY     | Close/Custom | sk-...                       | Optional
CLOSE_API_BASE    | Close/Custom | https://proxy.example.com/v1 | Optional
OLLAMA_API_BASE   | Ollama       | http://127.0.0.1:11434/v1    | Optional
RESULTS_ROOT      | Output       | results                      | Recommended

Provider Routing

Configure model-to-provider mapping in config/providers.yaml:

default_provider: openai

providers:
  openai:
    models: [gpt-4, gpt-5-mini]
  deepseek:
    models: [deepseek-chat, deepseek-coder]
  close:
    models: [claude-sonnet-4]

❓ FAQ

Q: "Results root" path mismatch errors?

A: Set the RESULTS_ROOT environment variable consistently:

export RESULTS_ROOT=results

Pass --results-root results to scripts that support it.

Q: How do I switch providers or models?

A: Two options:

  1. Environment variables (quickest):

export MODEL_NAME=gpt-4
bash scripts/run_openai_gpt5.sh

  2. Edit launcher scripts:

# Edit MODEL_NAME, API_KEY, API_BASE in the script
vim scripts/run_openai_gpt5.sh

Q: Installation errors with extras ([eval], [expr])?

A:

  • Evaluation extras: pip install -e .[eval]
  • Expression matching: pip install -e .[expr]
    • Requires PyTorch + Transformers
    • CPU or GPU works
    • For PyTorch issues, see pytorch.org

Q: Where are the generated outputs?

A: All outputs go to results/<model_name>/<task>/

Check $RESULTS_ROOT environment variable if paths don't match.

Q: How do I add my own test cases?

A: See the Add New Cases section for task-specific instructions.

Q: Can I run this on a GPU?

A: Yes! Expression matching can leverage GPU acceleration if PyTorch detects CUDA. Other tasks are primarily API-based and don't require GPU.

🚧 Extending

Adding a New Generation Task

  1. Create pipeline: src/pipelines/<task_name>.py

    import argparse

    def run(**kwargs):
        # Your generation logic; write outputs under
        # $RESULTS_ROOT/<model>/<task>/...
        pass

    def build_arg_parser():
        # CLI options for this task
        return argparse.ArgumentParser(description="<task_name> pipeline")
  2. Follow output convention: Write to $RESULTS_ROOT/<model>/<task>/...

  3. Register task: Add to src/tasks_cli.py

  4. Add evaluation (optional): Create metric in evaluation/metrics/<task>_metrics.py

Adding New Metrics or Visualizations

  1. Implement metric: Add to evaluation/metrics/
  2. Add plotting: Extend evaluation/task_plots.py
  3. Integrate: Update evaluation/evaluate_multi_models.py to aggregate

Example: Custom Metric

# evaluation/metrics/custom_metric.py
import json


def evaluate_custom(gold_path, pred_path):
    """
    Your custom evaluation logic.

    Returns:
        dict: e.g. {"accuracy": 0.95, "custom_score": 0.87}
    """
    # Load gold and predicted JSON, then compare as your task requires.
    with open(gold_path) as f:
        gold = json.load(f)
    with open(pred_path) as f:
        pred = json.load(f)
    # Placeholder exact-match comparison; replace with real scoring.
    return {"accuracy": 1.0 if gold == pred else 0.0}

📄 License

Proprietary License

This project is proprietary software. See pyproject.toml for license metadata.

For licensing inquiries, contact the repository maintainer.

📚 Citation

If you use this benchmark in your research, please cite:

@article{ma2023exploring,
  title={Exploring Code Analysis: Zero-Shot Insights on Syntax and Semantics with LLMs},
  author={Ma, Wei and Lin, Zhihao and Liu, Shangqing and Hu, Qiang and Liu, Ye and Wang, Wenhan and Zhang, Cen and Nie, Liming and Li, Li and Liu, Yang and Jiang, Lingxiao},
  journal={arXiv preprint arXiv:2305.12138},
  year={2023}
}

Paper: arXiv:2305.12138


Questions or Issues? Open an issue or check the FAQ

Want to Contribute? See Extending for guidelines

Made with ❤️ for LLM code analysis research
