A comprehensive benchmark suite for evaluating Large Language Models on code analysis tasks.
End-to-end pipelines for AST/CFG/CG analysis, pointer analysis, expression matching, DP/Taint detection, mutant & flaky test identification — with unified evaluation, visualization, and human review.
- Installation: `scripts/setup_venv.sh`
- Configuration: `.env` and provider keys
- Run Generation: `scripts/run_openai_gpt5.sh`
- Evaluate: `evaluation/evaluate_multi_models.py`
- Render Graphs: `evaluation/render_graphs.py`
- Human Review UI: `scripts/run_human_eval_dashboard.sh`
- Website (Results): https://mathieu0905.github.io/llm_analysis/
- Features
- Prerequisites
- Installation
- Configuration
- Quick Start
- Project Structure
- Datasets
- Add New Cases
- Static Gold
- DP/Taint Format
- Outputs
- Evaluation Details
- Rendering
- Human Review
- FAQ
- Extending
- License
- Citation
- 🔍 Minimal pipelines for AST/CFG/CG, Pointer Analysis, Expression Match, DP/Taint, Mutant, and Flaky Tests
- 🚀 Single‑command generation scripts and unified evaluation aggregator
- 🔌 Multi‑provider support via `.env` and `config/providers.yaml` (OpenAI / DeepSeek / Close / Ollama)
- 📊 PDF rendering for structural outputs to aid manual inspection
- 👥 Human review dashboard backed by SQLite for quality control
- 📈 Comprehensive metrics including precision, recall, F1, and structural similarity scores
- 🛠️ Extensible architecture for adding new tasks and evaluation metrics
Before you begin, ensure you have the following installed:
- Python 3.9+ (Python 3.10 or 3.11 recommended)
- Git for cloning the repository
- pip and virtualenv (or Python's built-in `venv`)
Optional dependencies for specific features:
- Graphviz (for graph rendering): `brew install graphviz` (macOS) or `apt-get install graphviz` (Ubuntu)
- PyTorch (for expression matching): installed automatically with extras, but you may need platform-specific builds
Create and activate a virtual environment:
# Clone the repository (if you haven't already)
git clone https://github.com/mathieu0905/llm_code_analysis
cd llm_code_analysis
# Option 1: Core installation only
bash scripts/setup_venv.sh
# Option 2: With evaluation and expression matching extras
bash scripts/setup_venv.sh -e -x
# Activate the virtual environment
source .venv/bin/activate
pip install -r requirements.txt

| Option | Command | Includes |
|---|---|---|
| Core | `bash scripts/setup_venv.sh` | Base dependencies for generation |
| + Evaluation | `bash scripts/setup_venv.sh -e` | Core + evaluation metrics |
| + Expression | `bash scripts/setup_venv.sh -x` | Core + expression matching (PyTorch) |
| Full | `bash scripts/setup_venv.sh -e -x` | All features |
Note: If PyTorch installation fails on your platform, you may need to install a platform-specific wheel manually. Visit pytorch.org for instructions.
Copy the example environment file and configure your API keys:
cp .env.example .env
# Edit .env with your API keys and endpoints

Configure one or more providers in your `.env` file:
OpenAI (GPT-4, GPT-5, etc.)
export OPENAI_API_KEY=sk-...
export OPENAI_API_BASE=https://api.openai.com/v1

DeepSeek

export DEEPSEEK_API_KEY=ds-...
export DEEPSEEK_API_BASE=https://api.deepseek.com/v1

Close Proxy / Custom OpenAI-compatible

export CLOSE_API_KEY=...
export CLOSE_API_BASE=https://your.close.proxy/v1

Ollama (Local)

export OLLAMA_API_BASE=http://127.0.0.1:11434/v1

# Set the root directory for all results (recommended)
export RESULTS_ROOT=results

# Run generation for all tasks with default settings
bash scripts/run_openai_gpt5.sh

This will:
- Generate AST, CFG, and Call Graph structures
- Perform pointer analysis
- Evaluate expression matching
- Analyze DP/Taint patterns
- Run mutant and flaky test detection
DeepSeek:
bash scripts/run_deepseek_chat.sh

Custom OpenAI-compatible Provider:
- Set credentials in `.env`:

export CLOSE_API_KEY=sk-...
export CLOSE_API_BASE=https://api.closeai.com/v1

- Update the launcher script:
# Change env var checks and default model
sed -i 's/OPENAI_API_KEY/CLOSE_API_KEY/g; s/OPENAI_API_BASE/CLOSE_API_BASE/g' scripts/run_openai_gpt5.sh
sed -i 's/MODEL_NAME=${MODEL_NAME:-gpt-5-mini}/MODEL_NAME=${MODEL_NAME:-claude-sonnet-4}/' scripts/run_openai_gpt5.sh
# Note: On macOS, use sed -i '' instead of sed -i

- (Optional) Configure provider defaults:
  - Code fallback: `src/common/llm.py` → `_default_provider_config()` includes a `close` provider
  - Config override: edit `config/providers.yaml` to set `default_provider: close`
After generation completes, evaluate across all tasks and models:
python evaluation/evaluate_multi_models.py

Output: `results/aggregated_summary.json` and `results/results.md`
Behavior and tips:
- Omitting `--model` enumerates every directory under `--results-root` and evaluates each.
- By default, if `results/aggregated_summary.json` already contains a model entry, that model is skipped and the cached summary is reused (see the sketch below for listing cached models).
- Skipped models still appear in the Markdown output, using cached values.
- Force recompute for all/selected models with `--overwrite`.
- Target specific models with `--model Display=folder` (repeatable), e.g. `--model gpt-4o-mini=gpt-4o-mini`.
- The reserved folder `multi_model/` under `results/` is ignored automatically and not treated as a model.
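If you want to check which models already have cached entries before re-running, here is a minimal sketch, assuming `aggregated_summary.json` is a JSON object keyed by model name (the real schema may differ):

```python
# Sketch: list models that already have cached entries in aggregated_summary.json.
# Assumes the file is a JSON object keyed by model name; the real schema may differ.
import json
from pathlib import Path

summary_path = Path("results/aggregated_summary.json")
if summary_path.exists():
    cached = json.loads(summary_path.read_text())
    print("Cached models (skipped unless --overwrite is passed):")
    for model_name in sorted(cached):
        print(f"  - {model_name}")
else:
    print("No cached summary yet; all models will be evaluated.")
```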
Generate PDF visualizations for AST/CFG/CG:
python evaluation/render_graphs.py <model_name> \
--results-root results \
--tasks ast cfg cg \
--render-gold \
--languages C java python solidity

Tip: Omit `<model_name>` to iterate all models under `--results-root`; existing PDFs are skipped by default. Add `--overwrite` to force re-render. See the Rendering section for full options. You can also run as a module after installing the package: `python -m llm_code_analysis.evaluation.render_graphs ...`.
Launch the interactive dashboard for manual quality control:
bash scripts/run_human_eval_dashboard.sh
# Open in your browser:
# macOS: open http://localhost:3000
# Linux: xdg-open http://localhost:3000
# Windows: start http://localhost:3000

llm_code_analysis/
├── config/ # Provider configurations
│ └── providers.yaml
├── datasets/ # Test cases and benchmarks
│ ├── ast_cfg/ # AST & CFG test cases
│ ├── call_graph/ # Call graph test cases
│ ├── pointer/ # Pointer analysis cases
│ ├── expression_match/ # Expression matching cases
│ ├── dp_taint/ # DP & Taint analysis cases
│ ├── flakytest/ # Flaky test detection cases
│ └── mutant/ # Mutant detection cases
├── src/ # Source code
│ ├── pipelines/ # Task-specific pipelines
│ ├── common/ # Shared utilities
│ └── tasks_cli.py # CLI entry point
├── evaluation/ # Evaluation scripts
│ ├── metrics/ # Metric implementations
│ ├── evaluate_multi_models.py
│ └── render_graphs.py
├── scripts/ # Convenience scripts
│ ├── setup_venv.sh
│ ├── run_openai_gpt5.sh
│ └── run_deepseek_chat.sh
├── static_baseline/ # Gold standard outputs
│ ├── AST/
│ ├── CFG/
│ └── CG/
├── tools/ # Additional tools
│ └── human_eval_next/ # Human review dashboard
├── results/ # Generated outputs (gitignored)
└── pyproject.toml # Project metadata
The datasets/ directory contains test cases organized by analysis task:
| Task | Location | Structure |
|---|---|---|
| AST & CFG | `datasets/ast_cfg/<language>/code/*` | Source files per language |
| Call Graph | `datasets/call_graph/<language>/code/*` | Source files per language |
| Pointer Analysis | `datasets/pointer/cases/<case>/` | `code.c`, `prompt.md` per case |
| Expression Match | `datasets/expression_match/cases/<case>/` | `prompt.md`, `reference/*` code |
| DP/Taint | `datasets/dp_taint/contracts/` | `<project>.json` files |
| Flaky Tests | `datasets/flakytest/{summary,concept}_cases/` | `question.md`, `metadata.json` |
| MutantBench | `datasets/mutant/{fewshot,zeroshot}_cases/` | `question.md`, `metadata.json` |
Gold Standards:
- Static baselines: `static_baseline/{AST,CFG,CG}/<language>/<case>/gold.json`
- Pointer ground truth: `datasets/pointer/ground_truth/`

💡 For detailed dataset documentation, see `datasets/README.md`
mkdir -p datasets/ast_cfg/C/code
cp your_case.c datasets/ast_cfg/C/code/

mkdir -p datasets/call_graph/C/code
cp your_case.c datasets/call_graph/C/code/

# Copy from example template
cp -r datasets/pointer/cases/example_case datasets/pointer/cases/<new_case>
# Edit the source code
$EDITOR datasets/pointer/cases/<new_case>/code.c
# Edit the prompt
$EDITOR datasets/pointer/cases/<new_case>/prompt.md

mkdir -p datasets/expression_match/cases/<case_name>/reference
echo "Your analysis prompt here" > datasets/expression_match/cases/<case_name>/prompt.md
cp -r /path/to/reference/code/* datasets/expression_match/cases/<case_name>/reference/

mkdir -p datasets/dp_taint/contracts
# Format: [DP_ITEMS, TAINT_ITEMS]
# Each item: [id, text_or_parts, label]
echo '[[], []]' > datasets/dp_taint/contracts/<project_name>.json

# Flaky test case
mkdir -p datasets/flakytest/summary_cases/<case_id>
echo "Question text here" > datasets/flakytest/summary_cases/<case_id>/question.md
echo '{"label": true, "reason": "explanation"}' > datasets/flakytest/summary_cases/<case_id>/metadata.json
# Mutant case
mkdir -p datasets/mutant/fewshot_cases/<case_id>
echo "Question text here" > datasets/mutant/fewshot_cases/<case_id>/question.md
echo '{"label": true, "reason": "explanation"}' > datasets/mutant/fewshot_cases/<case_id>/metadata.json
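For bulk case creation, the same layout can be produced programmatically. A minimal sketch using the documented `question.md` + `metadata.json` structure (the case id and texts below are placeholders):

```python
# Sketch: create a flaky-test case programmatically with the documented layout.
# The case id, question text, and reason are placeholders.
import json
from pathlib import Path

case_dir = Path("datasets/flakytest/summary_cases/my_new_case")
case_dir.mkdir(parents=True, exist_ok=True)

(case_dir / "question.md").write_text("Is the following test flaky? <paste test code here>\n")
(case_dir / "metadata.json").write_text(
    json.dumps({"label": True, "reason": "relies on wall-clock timing"}, indent=2)
)
```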
Gold standard files are stored in `static_baseline/{AST,CFG,CG}/<language>/<case>/gold.json`.
- Case naming: folder name equals the source filename
- Example: `static_baseline/AST/C/for_loop.c/gold.json`
AST Format:
{
"type": "FunctionDecl",
"value": "main",
"children": [
{"type": "ReturnStmt", "children": [...]}
]
}

CFG/CG Format:
{
"nodes": [
{"id": "node1", "type": "EntryPoint", "label": "main"}
],
"edges": [
{"source": "node1", "target": "node2"}
]
}

- Generate using your static analysis tools
- Convert to the schema format above
- The evaluator normalizes common naming variations
- Minor field variations are tolerated
Note for Call Graphs: evaluation uses 75 canonical cases from `datasets/call_graph/*/code/` to align totals across languages.
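Before committing a new gold file, a quick sanity check against the CFG/CG schema above can catch dangling edges early. A minimal sketch (the path is illustrative; the evaluator itself tolerates more variation than this check does):

```python
# Sketch: sanity-check a CFG/CG gold.json against the schema shown above.
# Only the documented fields (nodes[].id, edges[].source/target) are checked.
import json
from pathlib import Path

def check_cfg_cg_gold(path: Path) -> list[str]:
    data = json.loads(path.read_text())
    problems = []
    node_ids = {n.get("id") for n in data.get("nodes", [])}
    if not node_ids:
        problems.append("no nodes")
    for edge in data.get("edges", []):
        for key in ("source", "target"):
            if edge.get(key) not in node_ids:
                problems.append(f"edge references unknown node: {edge.get(key)}")
    return problems

# Illustrative path; substitute your own case.
print(check_cfg_cg_gold(Path("static_baseline/CFG/C/for_loop.c/gold.json")))
```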
Files are stored in datasets/dp_taint/contracts/<project>.json:
[
[
["dp_1", ["text part 1", "text part 2"], true],
["dp_2", "single text string", false]
],
[
["taint_1", "taint description", true]
]
]

Structure: `[DP_ITEMS, TAINT_ITEMS]`
- Each item: `[id, text_or_parts, label]`
- Text can be a single string or an array of parts (a loader sketch follows below)
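A minimal sketch of loading a contract file under this layout (the filename is a placeholder; real items may carry longer texts):

```python
# Sketch: load a contract file and flatten multi-part texts, following the
# [DP_ITEMS, TAINT_ITEMS] layout described above.
import json
from pathlib import Path

def load_contract(path: Path):
    dp_items, taint_items = json.loads(path.read_text())

    def normalize(items):
        out = []
        for item_id, text_or_parts, label in items:
            text = " ".join(text_or_parts) if isinstance(text_or_parts, list) else text_or_parts
            out.append({"id": item_id, "text": text, "label": label})
        return out

    return normalize(dp_items), normalize(taint_items)

# Placeholder project name; substitute a real file from datasets/dp_taint/contracts/.
dp, taint = load_contract(Path("datasets/dp_taint/contracts/example_project.json"))
print(len(dp), "DP items,", len(taint), "taint items")
```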
Predictions are written to:
results/<model>/dp_taint/<project>/contract/{dp,taint}/<id>.txt
First line must be exactly one of: `yes`, `no`, `unknown`
- Aggregates precision, recall, and F1 score per project
- Generates plots in `results/<model>/dp_taint/plots/`
Set `RESULTS_ROOT=results` in your environment. All pipelines write to `$RESULTS_ROOT/<model>/<task>/...`.
results/
├── <model_name>/
│ ├── ast_cfg/ # AST and CFG outputs
│ ├── call_graph/ # Call graph outputs
│ ├── pointer/ # Pointer analysis results
│ ├── expression_match/ # Expression matching results
│ ├── mutant/ # Mutant detection results
│ ├── flakytest/ # Flaky test detection results
│ └── dp_taint/ # DP and Taint analysis
│ ├── <project>/
│ │ └── contract/{dp,taint}/<id>.txt
│ └── plots/ # Visualization plots
├── aggregated_summary.json # Cross-model evaluation
└── results.md # Human-readable summary
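A minimal sketch for enumerating model folders under the results root while honoring the reserved `multi_model/` directory (assumes the layout above):

```python
# Sketch: list model output folders under RESULTS_ROOT, skipping the reserved
# multi_model/ directory, mirroring the evaluator's documented behavior.
import os
from pathlib import Path

results_root = Path(os.environ.get("RESULTS_ROOT", "results"))
models = [
    p.name for p in sorted(results_root.iterdir())
    if p.is_dir() and p.name != "multi_model"
]
print(models)
```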
Some legacy scripts default to `results`. Override with:
- `--results-root results`
- `--output-root results`
- `--output-dir results`
Script: evaluation/evaluate_multi_models.py
python evaluation/evaluate_multi_models.py

Outputs:
- `results/aggregated_summary.json` - Machine-readable metrics
- `results/results.md` - Human-readable report (optional)
Metrics Included:
- 🌳 Structural: AST/CFG/CG similarity scores
- 🎯 DP/Taint: Precision, Recall, F1 scores
- 🔗 Pointer: Jaccard similarity
- 🐛 Mutant & Flaky: Classification accuracy
- 📝 Expression Match: Semantic similarity (requires `[expr]` extras)
For task-specific evaluation:
- `evaluation/report.py` - Individual task reports
- `evaluation/metrics/*` - Metric implementations
| Task | Primary Metric | Range | Notes |
|---|---|---|---|
| AST/CFG | Tree Edit Distance | 0.0-1.0 | Normalized similarity |
| Call Graph | Graph Isomorphism | 0.0-1.0 | Node/edge matching |
| Pointer | Jaccard Index | 0.0-1.0 | Set overlap |
| DP/Taint | F1 Score | 0.0-1.0 | Harmonic mean of P/R |
| Mutant/Flaky | Accuracy | 0.0-1.0 | Correct classifications |
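For reference, the pointer-analysis metric is a plain Jaccard index over sets. A minimal sketch, assuming gold and predicted points-to facts can be represented as sets of tuples (the actual evaluator may normalize facts before comparing):

```python
# Sketch: Jaccard index over sets of points-to facts.
def jaccard(gold: set, pred: set) -> float:
    if not gold and not pred:
        return 1.0  # both empty: treat as perfect agreement
    return len(gold & pred) / len(gold | pred)

gold = {("p", "x"), ("q", "y")}
pred = {("p", "x"), ("q", "z")}
print(jaccard(gold, pred))  # 1 shared fact out of 3 distinct facts -> ~0.33
```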
Convert structural JSON outputs to visual PDFs for manual inspection.
Script: evaluation/render_graphs.py
# Run directly (no install required)
python evaluation/render_graphs.py <model_name> \
--results-root results \
--tasks ast cfg cg \
--render-gold \
--languages C java python solidity
# Or as a module (after `pip install -e .[eval]`)
python -m llm_code_analysis.evaluation.render_graphs <model_name> \
--results-root results \
--tasks ast cfg cg \
--render-gold \
--languages C java python solidity

If you see `ModuleNotFoundError: No module named 'llm_code_analysis'`:
- Install the project as a package (recommended): `pip install -e .[eval]`
- Or run without the package prefix: `python -m evaluation.render_graphs ...` or `python evaluation/render_graphs.py ...`
Incremental rendering (all models):
- Omit the `<model_name>` argument to iterate over every model under `--results-root`.
- By default (without `--overwrite`), existing PDFs are skipped, so only missing graphs are rendered.
Example (render only missing PDFs across all models):
python evaluation/render_graphs.py \
--results-root results \
--tasks ast cfg cg \
--languages C java python solidity

python -m llm_code_analysis.evaluation.render_graphs \
--results-root results \
--tasks ast cfg cg \
--languages C java python solidity

Optional:
- Also render static gold into `results/<gold-as-model>/...`:

python -m llm_code_analysis.evaluation.render_graphs \
--results-root results \
--tasks ast cfg cg \
--languages C java python solidity \
--render-gold

- Force re-render (overwrite existing PDFs): add `--overwrite`.
- Render only the static gold (skip model outputs): add `--only-gold`.
- Configure gold sources and target name with `--gold-root` and `--gold-as-model`.
Options:
- `--render-gold` - Also render gold standard references
- `--only-gold` - Render only gold references into `--gold-as-model`
- `--gold-root` - Directory containing `AST/CFG/CG/<language>/<case>/gold.json` (default `static_baseline`)
- `--gold-as-model` - Target folder name under `--results-root` for gold renders (default `gold-static`)
- `--tasks` - Select specific tasks (ast, cfg, cg)
- `--languages` - Filter by programming language
Output locations:
- AST/CFG: `results/<model>/ast_cfg/<language>/<case>/{AST,CFG}.pdf`
- CG: `results/<model>/call_graph/<language>/<case>/CG.pdf`
Interactive dashboard for manual quality control and judgment collection.
bash scripts/run_human_eval_dashboard.sh
# Access at http://localhost:3000

Location: `tools/human_eval_next/`
The Next.js app uses SQLite to store human judgments.
Default DB: tools/human_eval_next/human_eval.sqlite3
- Override with `-d <path>` or the `HUMAN_EVAL_DB` environment variable
Tables are automatically created on first run:
| Table | Purpose | Key Fields |
|---|---|---|
| `judgments` | Human labels from UI | model, task, language, case, reviewer, label, note |
| `auto_judgments` | Automatic PASS/FAIL labels | model, task, case, label, score, threshold |
| `structure_results` | Per-case metrics | model, task, case, score, metrics, label |
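Judgments can also be read directly from the SQLite file for ad-hoc analysis. A minimal sketch using the column names listed above (the real schema may include more fields, and the task value shown is just an example):

```python
# Sketch: read human judgments from the dashboard database. Column names follow
# the table above; "case" is quoted because it is an SQL keyword.
import sqlite3

conn = sqlite3.connect("tools/human_eval_next/human_eval.sqlite3")
rows = conn.execute(
    'SELECT model, task, language, "case", reviewer, label FROM judgments '
    "WHERE task = ? ORDER BY model",
    ("ast",),  # example task filter
).fetchall()
for row in rows:
    print(row)
conn.close()
```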
Read human overrides:
python evaluation/evaluate_multi_models.py --human-db path/to/human_eval.sqlite3

Write automatic judgments:

python evaluation/evaluate_multi_models.py --write-auto-db

| Variable | Provider | Example | Required |
|---|---|---|---|
| `OPENAI_API_KEY` | OpenAI | `sk-...` | ✅ |
| `OPENAI_API_BASE` | OpenAI | `https://api.openai.com/v1` | ✅ |
| `DEEPSEEK_API_KEY` | DeepSeek | `ds-...` | Optional |
| `DEEPSEEK_API_BASE` | DeepSeek | `https://api.deepseek.com/v1` | Optional |
| `CLOSE_API_KEY` | Close/Custom | `sk-...` | Optional |
| `CLOSE_API_BASE` | Close/Custom | `https://proxy.example.com/v1` | Optional |
| `OLLAMA_API_BASE` | Ollama | `http://127.0.0.1:11434/v1` | Optional |
| `RESULTS_ROOT` | Output | `results` | Recommended |
Configure model-to-provider mapping in config/providers.yaml:
default_provider: openai
providers:
openai:
models: [gpt-4, gpt-5-mini]
deepseek:
models: [deepseek-chat, deepseek-coder]
close:
models: [claude-sonnet-4]
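A minimal sketch of resolving a model's provider from this file (assumes PyYAML is available and the layout shown above):

```python
# Sketch: map a model name to its provider using config/providers.yaml.
# Falls back to default_provider when the model is not listed.
import yaml

with open("config/providers.yaml") as fh:
    cfg = yaml.safe_load(fh)

def provider_for(model_name: str) -> str:
    for name, spec in cfg.get("providers", {}).items():
        if model_name in spec.get("models", []):
            return name
    return cfg.get("default_provider", "openai")

print(provider_for("deepseek-chat"))  # -> deepseek
```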
Q: "Results root" path mismatch errors?
A: Set the `RESULTS_ROOT` environment variable consistently:
export RESULTS_ROOT=results

Pass `--results-root results` to scripts that support it.
Q: How do I switch providers or models?
A: Two options:
- Environment variables (quickest):
export MODEL_NAME=gpt-4
bash scripts/run_openai_gpt5.sh

- Edit launcher scripts:
# Edit MODEL_NAME, API_KEY, API_BASE in the script
vim scripts/run_openai_gpt5.sh

Q: Installation errors with extras (`[eval]`, `[expr]`)?
A:
- Evaluation extras: `pip install -e .[eval]`
- Expression matching: `pip install -e .[expr]`
  - Requires PyTorch + Transformers
  - CPU or GPU works
  - For PyTorch issues, see pytorch.org
Q: Where are the generated outputs?
A: All outputs go to `results/<model_name>/<task>/`.
Check the `$RESULTS_ROOT` environment variable if paths don't match.
Q: How do I add my own test cases?
A: See the Add New Cases section for task-specific instructions.
Q: Can I run this on a GPU?
A: Yes! Expression matching can leverage GPU acceleration if PyTorch detects CUDA. Other tasks are primarily API-based and don't require GPU.
- Create pipeline: add `src/pipelines/<task_name>.py` defining `run(**kwargs)` (your generation logic) and `build_arg_parser()` (argument parser for the CLI); see the skeleton sketch below
- Follow output convention: write to `$RESULTS_ROOT/<model>/<task>/...`
- Register task: add to `src/tasks_cli.py`
- Add evaluation (optional): create metric in `evaluation/metrics/<task>_metrics.py`
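A minimal pipeline skeleton following the `run()` / `build_arg_parser()` convention above (argument names and the output filename are illustrative):

```python
# Sketch: a minimal src/pipelines/<task_name>.py skeleton. Argument names and
# the placeholder output are illustrative, not the project's exact interface.
import argparse
import os
from pathlib import Path

def build_arg_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="My new analysis task")
    parser.add_argument("--model", required=True, help="Model name used for output paths")
    parser.add_argument("--results-root", default=os.environ.get("RESULTS_ROOT", "results"))
    return parser

def run(**kwargs):
    # Your generation logic; write under $RESULTS_ROOT/<model>/<task>/...
    out_dir = Path(kwargs["results_root"]) / kwargs["model"] / "my_task"
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "placeholder.txt").write_text("TODO: real output\n")

if __name__ == "__main__":
    args = build_arg_parser().parse_args()
    run(model=args.model, results_root=args.results_root)
```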
- Implement metric: add to `evaluation/metrics/`
- Add plotting: extend `evaluation/task_plots.py`
- Integrate: update `evaluation/evaluate_multi_models.py` to aggregate
# evaluation/metrics/custom_metric.py
def evaluate_custom(gold_path, pred_path):
    """
    Your custom evaluation logic.

    Returns:
        dict: {"accuracy": 0.95, "custom_score": 0.87}
    """
    # Load and compare
    pass

Proprietary License
This project is proprietary software. See pyproject.toml for license metadata.
For licensing inquiries, contact the repository maintainer.
If you use this benchmark in your research, please cite:
@article{ma2023exploring,
title={Exploring Code Analysis: Zero-Shot Insights on Syntax and Semantics with LLMs},
author={Ma, Wei and Lin, Zhihao and Liu, Shangqing and Hu, Qiang and Liu, Ye and Wang, Wenhan and Zhang, Cen and Nie, Liming and Li, Li and Liu, Yang and Jiang, Lingxiao},
journal={arXiv preprint arXiv:2305.12138},
year={2023}
}

Paper: arXiv:2305.12138