A comprehensive benchmark suite for evaluating Large Language Models on code analysis tasks.
End-to-end pipelines for AST/CFG/CG analysis, pointer analysis, expression matching, DP/Taint detection, mutant & flaky test identification — with unified evaluation, visualization, and human review.
- Installation: `scripts/setup_venv.sh`
- Configuration: `.env` and provider keys
- Run Generation: `scripts/run_openai_gpt5.sh`
- Evaluate: `evaluation/evaluate_multi_models.py`
- Render Graphs: `evaluation/render_graphs.py`
- Human Review UI: `scripts/run_human_eval_dashboard.sh`
- Website (Results): https://mathieu0905.github.io/llm_analysis/
- Features
- Prerequisites
- Installation
- Configuration
- Quick Start
- Project Structure
- Datasets
- Add New Cases
- Static Gold
- DP/Taint Format
- Outputs
- Evaluation Details
- Rendering
- Human Review
- FAQ
- Extending
- License
- Citation
- 🔍 Minimal pipelines for AST/CFG/CG, Pointer Analysis, Expression Match, DP/Taint, Mutant, and Flaky Tests
- 🚀 Single‑command generation scripts and unified evaluation aggregator
- 🔌 Multi‑provider support via `.env` and `config/providers.yaml` (OpenAI / DeepSeek / Close / Ollama)
- 📊 PDF rendering for structural outputs to aid manual inspection
- 👥 Human review dashboard backed by SQLite for quality control
- 📈 Comprehensive metrics including precision, recall, F1, and structural similarity scores
- 🛠️ Extensible architecture for adding new tasks and evaluation metrics
Before you begin, ensure you have the following installed:
- Python 3.9+ (Python 3.10 or 3.11 recommended)
- Git for cloning the repository
- pip and virtualenv (or Python's built-in `venv`)
Optional dependencies for specific features:
- Graphviz (for graph rendering): `brew install graphviz` (macOS) or `apt-get install graphviz` (Ubuntu)
- PyTorch (for expression matching): installed automatically with extras, but you may need platform-specific builds
Create and activate a virtual environment:
# Clone the repository (if you haven't already)
git clone https://github.com/mathieu0905/llm_code_analysis
cd llm_code_analysis
# Option 1: Core installation only
bash scripts/setup_venv.sh
# Option 2: With evaluation and expression matching extras
bash scripts/setup_venv.sh -e -x
# Activate the virtual environment
source .venv/bin/activate
pip install -r requirements.txt

| Option | Command | Includes |
|---|---|---|
| Core | `bash scripts/setup_venv.sh` | Base dependencies for generation |
| + Evaluation | `bash scripts/setup_venv.sh -e` | Core + evaluation metrics |
| + Expression | `bash scripts/setup_venv.sh -x` | Core + expression matching (PyTorch) |
| Full | `bash scripts/setup_venv.sh -e -x` | All features |
Note: If PyTorch installation fails on your platform, you may need to install a platform-specific wheel manually. Visit pytorch.org for instructions.
Copy the example environment file and configure your API keys:
cp .env.example .env
# Edit .env with your API keys and endpoints

Configure one or more providers in your `.env` file:
OpenAI (GPT-4, GPT-5, etc.)
export OPENAI_API_KEY=sk-...
export OPENAI_API_BASE=https://api.openai.com/v1

DeepSeek

export DEEPSEEK_API_KEY=ds-...
export DEEPSEEK_API_BASE=https://api.deepseek.com/v1

Close Proxy / Custom OpenAI-compatible

export CLOSE_API_KEY=...
export CLOSE_API_BASE=https://your.close.proxy/v1

Ollama (Local)

export OLLAMA_API_BASE=http://127.0.0.1:11434/v1

# Set the root directory for all results (recommended)
export RESULTS_ROOT=results

# Run generation for all tasks with default settings
bash scripts/run_openai_gpt5.sh

This will:
- Generate AST, CFG, and Call Graph structures
- Perform pointer analysis
- Evaluate expression matching
- Analyze DP/Taint patterns
- Run mutant and flaky test detection
DeepSeek:
bash scripts/run_deepseek_chat.sh

Custom OpenAI-compatible Provider:
- Set credentials in `.env`:

export CLOSE_API_KEY=sk-...
export CLOSE_API_BASE=https://api.closeai.com/v1

- Update the launcher script:
# Change env var checks and default model
sed -i 's/OPENAI_API_KEY/CLOSE_API_KEY/g; s/OPENAI_API_BASE/CLOSE_API_BASE/g' scripts/run_openai_gpt5.sh
sed -i 's/MODEL_NAME=${MODEL_NAME:-gpt-5-mini}/MODEL_NAME=${MODEL_NAME:-claude-sonnet-4}/' scripts/run_openai_gpt5.sh
# Note: On macOS, use sed -i '' instead of sed -i

- (Optional) Configure provider defaults:
  - Code fallback: `src/common/llm.py` → `_default_provider_config()` includes a `close` provider
  - Config override: edit `config/providers.yaml` to set `default_provider: close`
After generation completes, evaluate across all tasks and models:
python evaluation/evaluate_multi_models.py

Output: `results/aggregated_summary.json` and `results/results.md`
Behavior and tips:
- Omitting `--model` enumerates every directory under `--results-root` and evaluates each.
- By default, if `results/aggregated_summary.json` already contains a model entry, that model is skipped and the cached summary is reused (see the sketch below for listing cached models).
- Skipped models still appear in the Markdown output, using cached values.
- Force recompute for all/selected models with `--overwrite`.
- Target specific models with `--model Display=folder` (repeatable), e.g. `--model gpt-4o-mini=gpt-4o-mini`.
- The reserved folder `multi_model/` under `results/` is ignored automatically and not treated as a model.
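If you want to check which models already have cached entries before re-running, here is a minimal sketch, assuming `aggregated_summary.json` is a JSON object keyed by model name (the real schema may differ):

```python
# Sketch: list models that already have cached entries in aggregated_summary.json.
# Assumes the file is a JSON object keyed by model name; the real schema may differ.
import json
from pathlib import Path

summary_path = Path("results/aggregated_summary.json")
if summary_path.exists():
    cached = json.loads(summary_path.read_text())
    print("Cached models (skipped unless --overwrite is passed):")
    for model_name in sorted(cached):
        print(f"  - {model_name}")
else:
    print("No cached summary yet; all models will be evaluated.")
```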
Generate PDF visualizations for AST/CFG/CG:
python evaluation/render_graphs.py <model_name> \
--results-root results \
--tasks ast cfg cg \
--render-gold \
--languages C java python solidity

Tip: Omit `<model_name>` to iterate all models under `--results-root`; existing PDFs are skipped by default. Add `--overwrite` to force re-render. See the Rendering section for full options. You can also run as a module after installing the package: `python -m llm_code_analysis.evaluation.render_graphs ...`.
Launch the interactive dashboard for manual quality control:
bash scripts/run_human_eval_dashboard.sh
# Open in your browser:
# macOS: open http://localhost:3000
# Linux: xdg-open http://localhost:3000
# Windows: start http://localhost:3000

llm_code_analysis/
├── config/ # Provider configurations
│ └── providers.yaml
├── datasets/ # Test cases and benchmarks
│ ├── ast_cfg/ # AST & CFG test cases
│ ├── call_graph/ # Call graph test cases
│ ├── pointer/ # Pointer analysis cases
│ ├── expression_match/ # Expression matching cases
│ ├── dp_taint/ # DP & Taint analysis cases
│ ├── flakytest/ # Flaky test detection cases
│ └── mutant/ # Mutant detection cases
├── src/ # Source code
│ ├── pipelines/ # Task-specific pipelines
│ ├── common/ # Shared utilities
│ └── tasks_cli.py # CLI entry point
├── evaluation/ # Evaluation scripts
│ ├── metrics/ # Metric implementations
│ ├── evaluate_multi_models.py
│ └── render_graphs.py
├── scripts/ # Convenience scripts
│ ├── setup_venv.sh
│ ├── run_openai_gpt5.sh
│ └── run_deepseek_chat.sh
├── static_baseline/ # Gold standard outputs
│ ├── AST/
│ ├── CFG/
│ └── CG/
├── tools/ # Additional tools
│ └── human_eval_next/ # Human review dashboard
├── results/ # Generated outputs (gitignored)
└── pyproject.toml # Project metadata
The datasets/ directory contains test cases organized by analysis task:
| Task | Location | Structure |
|---|---|---|
| AST & CFG | `datasets/ast_cfg/<language>/code/*` | Source files per language |
| Call Graph | `datasets/call_graph/<language>/code/*` | Source files per language |
| Pointer Analysis | `datasets/pointer/cases/<case>/` | `code.c`, `prompt.md` per case |
| Expression Match | `datasets/expression_match/cases/<case>/` | `prompt.md`, `reference/*` code |
| DP/Taint | `datasets/dp_taint/contracts/` | `<project>.json` files |
| Flaky Tests | `datasets/flakytest/{summary,concept}_cases/` | `question.md`, `metadata.json` |
| MutantBench | `datasets/mutant/{fewshot,zeroshot}_cases/` | `question.md`, `metadata.json` |
Gold Standards:
- Static baselines: `static_baseline/{AST,CFG,CG}/<language>/<case>/gold.json`
- Pointer ground truth: `datasets/pointer/ground_truth/`

💡 For detailed dataset documentation, see `datasets/README.md`
mkdir -p datasets/ast_cfg/C/code
cp your_case.c datasets/ast_cfg/C/code/

mkdir -p datasets/call_graph/C/code
cp your_case.c datasets/call_graph/C/code/

# Copy from example template
cp -r datasets/pointer/cases/example_case datasets/pointer/cases/<new_case>
# Edit the source code
$EDITOR datasets/pointer/cases/<new_case>/code.c
# Edit the prompt
$EDITOR datasets/pointer/cases/<new_case>/prompt.md

mkdir -p datasets/expression_match/cases/<case_name>/reference
echo "Your analysis prompt here" > datasets/expression_match/cases/<case_name>/prompt.md
cp -r /path/to/reference/code/* datasets/expression_match/cases/<case_name>/reference/

mkdir -p datasets/dp_taint/contracts
# Format: [DP_ITEMS, TAINT_ITEMS]
# Each item: [id, text_or_parts, label]
echo '[[], []]' > datasets/dp_taint/contracts/<project_name>.json

# Flaky test case
mkdir -p datasets/flakytest/summary_cases/<case_id>
echo "Question text here" > datasets/flakytest/summary_cases/<case_id>/question.md
echo '{"label": true, "reason": "explanation"}' > datasets/flakytest/summary_cases/<case_id>/metadata.json
# Mutant case
mkdir -p datasets/mutant/fewshot_cases/<case_id>
echo "Question text here" > datasets/mutant/fewshot_cases/<case_id>/question.md
echo '{"label": true, "reason": "explanation"}' > datasets/mutant/fewshot_cases/<case_id>/metadata.json
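For bulk case creation, the same layout can be produced programmatically. A minimal sketch using the documented `question.md` + `metadata.json` structure (the case id and texts below are placeholders):

```python
# Sketch: create a flaky-test case programmatically with the documented layout.
# The case id, question text, and reason are placeholders.
import json
from pathlib import Path

case_dir = Path("datasets/flakytest/summary_cases/my_new_case")
case_dir.mkdir(parents=True, exist_ok=True)

(case_dir / "question.md").write_text("Is the following test flaky? <paste test code here>\n")
(case_dir / "metadata.json").write_text(
    json.dumps({"label": True, "reason": "relies on wall-clock timing"}, indent=2)
)
```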
Gold standard files are stored in `static_baseline/{AST,CFG,CG}/<language>/<case>/gold.json`.
- Case naming: folder name equals the source filename
- Example: `static_baseline/AST/C/for_loop.c/gold.json`
AST Format:
{
"type": "FunctionDecl",
"value": "main",
"children": [
{"type": "ReturnStmt", "children": [...]}
]
}

CFG/CG Format:
{
"nodes": [
{"id": "node1", "type": "EntryPoint", "label": "main"}
],
"edges": [
{"source": "node1", "target": "node2"}
]
}

- Generate using your static analysis tools
- Convert to the schema format above
- The evaluator normalizes common naming variations
- Minor field variations are tolerated
Note for Call Graphs: evaluation uses 75 canonical cases from `datasets/call_graph/*/code/` to align totals across languages.
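Before committing a new gold file, a quick sanity check against the CFG/CG schema above can catch dangling edges early. A minimal sketch (the path is illustrative; the evaluator itself tolerates more variation than this check does):

```python
# Sketch: sanity-check a CFG/CG gold.json against the schema shown above.
# Only the documented fields (nodes[].id, edges[].source/target) are checked.
import json
from pathlib import Path

def check_cfg_cg_gold(path: Path) -> list[str]:
    data = json.loads(path.read_text())
    problems = []
    node_ids = {n.get("id") for n in data.get("nodes", [])}
    if not node_ids:
        problems.append("no nodes")
    for edge in data.get("edges", []):
        for key in ("source", "target"):
            if edge.get(key) not in node_ids:
                problems.append(f"edge references unknown node: {edge.get(key)}")
    return problems

# Illustrative path; substitute your own case.
print(check_cfg_cg_gold(Path("static_baseline/CFG/C/for_loop.c/gold.json")))
```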
Files are stored in datasets/dp_taint/contracts/<project>.json:
[
[
["dp_1", ["text part 1", "text part 2"], true],
["dp_2", "single text string", false]
],
[
["taint_1", "taint description", true]
]
]

Structure: `[DP_ITEMS, TAINT_ITEMS]`
- Each item: `[id, text_or_parts, label]`
- Text can be a single string or an array of parts (a loader sketch follows below)
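A minimal sketch of loading a contract file under this layout (the filename is a placeholder; real items may carry longer texts):

```python
# Sketch: load a contract file and flatten multi-part texts, following the
# [DP_ITEMS, TAINT_ITEMS] layout described above.
import json
from pathlib import Path

def load_contract(path: Path):
    dp_items, taint_items = json.loads(path.read_text())

    def normalize(items):
        out = []
        for item_id, text_or_parts, label in items:
            text = " ".join(text_or_parts) if isinstance(text_or_parts, list) else text_or_parts
            out.append({"id": item_id, "text": text, "label": label})
        return out

    return normalize(dp_items), normalize(taint_items)

# Placeholder project name; substitute a real file from datasets/dp_taint/contracts/.
dp, taint = load_contract(Path("datasets/dp_taint/contracts/example_project.json"))
print(len(dp), "DP items,", len(taint), "taint items")
```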
Predictions are written to:
results/<model>/dp_taint/<project>/contract/{dp,taint}/<id>.txt
First line must be exactly one of: `yes`, `no`, `unknown`
- Aggregates precision, recall, and F1 score per project
- Generates plots in `results/<model>/dp_taint/plots/`
Set `RESULTS_ROOT=results` in your environment. All pipelines write to `$RESULTS_ROOT/<model>/<task>/...`.
results/
├── <model_name>/
│ ├── ast_cfg/ # AST and CFG outputs
│ ├── call_graph/ # Call graph outputs
│ ├── pointer/ # Pointer analysis results
│ ├── expression_match/ # Expression matching results
│ ├── mutant/ # Mutant detection results
│ ├── flakytest/ # Flaky test detection results
│ └── dp_taint/ # DP and Taint analysis
│ ├── <project>/
│ │ └── contract/{dp,taint}/<id>.txt
│ └── plots/ # Visualization plots
├── aggregated_summary.json # Cross-model evaluation
└── results.md # Human-readable summary
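A minimal sketch for enumerating model folders under the results root while honoring the reserved `multi_model/` directory (assumes the layout above):

```python
# Sketch: list model output folders under RESULTS_ROOT, skipping the reserved
# multi_model/ directory, mirroring the evaluator's documented behavior.
import os
from pathlib import Path

results_root = Path(os.environ.get("RESULTS_ROOT", "results"))
models = [
    p.name for p in sorted(results_root.iterdir())
    if p.is_dir() and p.name != "multi_model"
]
print(models)
```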
Some legacy scripts default to `results`. Override with:
- `--results-root results`
- `--output-root results`
- `--output-dir results`
Script: evaluation/evaluate_multi_models.py
python evaluation/evaluate_multi_models.py

Outputs:
- `results/aggregated_summary.json` - Machine-readable metrics
- `results/results.md` - Human-readable report (optional)
Metrics Included:
- 🌳 Structural: AST/CFG/CG similarity scores
- 🎯 DP/Taint: Precision, Recall, F1 scores
- 🔗 Pointer: Jaccard similarity
- 🐛 Mutant & Flaky: Classification accuracy
- 📝 Expression Match: Semantic similarity (requires `[expr]` extras)
For task-specific evaluation:
- `evaluation/report.py` - Individual task reports
- `evaluation/metrics/*` - Metric implementations
| Task | Primary Metric | Range | Notes |
|---|---|---|---|
| AST/CFG | Tree Edit Distance | 0.0-1.0 | Normalized similarity |
| Call Graph | Graph Isomorphism | 0.0-1.0 | Node/edge matching |
| Pointer | Jaccard Index | 0.0-1.0 | Set overlap |
| DP/Taint | F1 Score | 0.0-1.0 | Harmonic mean of P/R |
| Mutant/Flaky | Accuracy | 0.0-1.0 | Correct classifications |
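For reference, the pointer-analysis metric is a plain Jaccard index over sets. A minimal sketch, assuming gold and predicted points-to facts can be represented as sets of tuples (the actual evaluator may normalize facts before comparing):

```python
# Sketch: Jaccard index over sets of points-to facts.
def jaccard(gold: set, pred: set) -> float:
    if not gold and not pred:
        return 1.0  # both empty: treat as perfect agreement
    return len(gold & pred) / len(gold | pred)

gold = {("p", "x"), ("q", "y")}
pred = {("p", "x"), ("q", "z")}
print(jaccard(gold, pred))  # 1 shared fact out of 3 distinct facts -> ~0.33
```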
Convert structural JSON outputs to visual PDFs for manual inspection.
Script: evaluation/render_graphs.py
# Run directly (no install required)
python evaluation/render_graphs.py <model_name> \
--results-root results \
--tasks ast cfg cg \
--render-gold \
--languages C java python solidity
# Or as a module (after `pip install -e .[eval]`)
python -m llm_code_analysis.evaluation.render_graphs <model_name> \
--results-root results \
--tasks ast cfg cg \
--render-gold \
--languages C java python solidity

If you see `ModuleNotFoundError: No module named 'llm_code_analysis'`:
- Install the project as a package (recommended): `pip install -e .[eval]`
- Or run without the package prefix: `python -m evaluation.render_graphs ...` or `python evaluation/render_graphs.py ...`
Incremental rendering (all models):
- Omit the `<model_name>` argument to iterate over every model under `--results-root`.
- By default (without `--overwrite`), existing PDFs are skipped, so only missing graphs are rendered.
Example (render only missing PDFs across all models):
python evaluation/render_graphs.py \
--results-root results \
--tasks ast cfg cg \
--languages C java python solidity

python -m llm_code_analysis.evaluation.render_graphs \
--results-root results \
--tasks ast cfg cg \
--languages C java python solidity

Optional:
- Also render static gold into `results/<gold-as-model>/...`:

python -m llm_code_analysis.evaluation.render_graphs \
--results-root results \
--tasks ast cfg cg \
--languages C java python solidity \
--render-gold

- Force re-render (overwrite existing PDFs): add `--overwrite`.
- Render only the static gold (skip model outputs): add `--only-gold`.
- Configure gold sources and target name with `--gold-root` and `--gold-as-model`.
Options:
- `--render-gold` - Also render gold standard references
- `--only-gold` - Render only gold references into `--gold-as-model`
- `--gold-root` - Directory containing `AST/CFG/CG/<language>/<case>/gold.json` (default `static_baseline`)
- `--gold-as-model` - Target folder name under `--results-root` for gold renders (default `gold-static`)
- `--tasks` - Select specific tasks (ast, cfg, cg)
- `--languages` - Filter by programming language
Output locations:
- AST/CFG: `results/<model>/ast_cfg/<language>/<case>/{AST,CFG}.pdf`
- CG: `results/<model>/call_graph/<language>/<case>/CG.pdf`
Interactive dashboard for manual quality control and judgment collection.
bash scripts/run_human_eval_dashboard.sh
# Access at http://localhost:3000

Location: `tools/human_eval_next/`
The Next.js app uses SQLite to store human judgments.
Default DB: tools/human_eval_next/human_eval.sqlite3
- Override with `-d <path>` or the `HUMAN_EVAL_DB` environment variable
Tables are automatically created on first run:
| Table | Purpose | Key Fields |
|---|---|---|
| `judgments` | Human labels from UI | model, task, language, case, reviewer, label, note |
| `auto_judgments` | Automatic PASS/FAIL labels | model, task, case, label, score, threshold |
| `structure_results` | Per-case metrics | model, task, case, score, metrics, label |
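Judgments can also be read directly from the SQLite file for ad-hoc analysis. A minimal sketch using the column names listed above (the real schema may include more fields, and the task value shown is just an example):

```python
# Sketch: read human judgments from the dashboard database. Column names follow
# the table above; "case" is quoted because it is an SQL keyword.
import sqlite3

conn = sqlite3.connect("tools/human_eval_next/human_eval.sqlite3")
rows = conn.execute(
    'SELECT model, task, language, "case", reviewer, label FROM judgments '
    "WHERE task = ? ORDER BY model",
    ("ast",),  # example task filter
).fetchall()
for row in rows:
    print(row)
conn.close()
```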
Read human overrides:
python evaluation/evaluate_multi_models.py --human-db path/to/human_eval.sqlite3

Write automatic judgments:

python evaluation/evaluate_multi_models.py --write-auto-db

| Variable | Provider | Example | Required |
|---|---|---|---|
| `OPENAI_API_KEY` | OpenAI | `sk-...` | ✅ |
| `OPENAI_API_BASE` | OpenAI | `https://api.openai.com/v1` | ✅ |
| `DEEPSEEK_API_KEY` | DeepSeek | `ds-...` | Optional |
| `DEEPSEEK_API_BASE` | DeepSeek | `https://api.deepseek.com/v1` | Optional |
| `CLOSE_API_KEY` | Close/Custom | `sk-...` | Optional |
| `CLOSE_API_BASE` | Close/Custom | `https://proxy.example.com/v1` | Optional |
| `OLLAMA_API_BASE` | Ollama | `http://127.0.0.1:11434/v1` | Optional |
| `RESULTS_ROOT` | Output | `results` | Recommended |
Configure model-to-provider mapping in config/providers.yaml:
default_provider: openai
providers:
openai:
models: [gpt-4, gpt-5-mini]
deepseek:
models: [deepseek-chat, deepseek-coder]
close:
models: [claude-sonnet-4]
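A minimal sketch of resolving a model's provider from this file (assumes PyYAML is available and the layout shown above):

```python
# Sketch: map a model name to its provider using config/providers.yaml.
# Falls back to default_provider when the model is not listed.
import yaml

with open("config/providers.yaml") as fh:
    cfg = yaml.safe_load(fh)

def provider_for(model_name: str) -> str:
    for name, spec in cfg.get("providers", {}).items():
        if model_name in spec.get("models", []):
            return name
    return cfg.get("default_provider", "openai")

print(provider_for("deepseek-chat"))  # -> deepseek
```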
Q: "Results root" path mismatch errors?
A: Set the `RESULTS_ROOT` environment variable consistently:
export RESULTS_ROOT=results

Pass `--results-root results` to scripts that support it.
Q: How do I switch providers or models?
A: Two options:
- Environment variables (quickest):
export MODEL_NAME=gpt-4
bash scripts/run_openai_gpt5.sh

- Edit launcher scripts:
# Edit MODEL_NAME, API_KEY, API_BASE in the script
vim scripts/run_openai_gpt5.sh

Q: Installation errors with extras (`[eval]`, `[expr]`)?
A:
- Evaluation extras: `pip install -e .[eval]`
- Expression matching: `pip install -e .[expr]`
  - Requires PyTorch + Transformers
  - CPU or GPU works
  - For PyTorch issues, see pytorch.org
Q: Where are the generated outputs?
A: All outputs go to `results/<model_name>/<task>/`.
Check the `$RESULTS_ROOT` environment variable if paths don't match.
Q: How do I add my own test cases?
A: See the Add New Cases section for task-specific instructions.
Q: Can I run this on a GPU?
A: Yes! Expression matching can leverage GPU acceleration if PyTorch detects CUDA. Other tasks are primarily API-based and don't require GPU.
- Create pipeline: add `src/pipelines/<task_name>.py` defining `run(**kwargs)` (your generation logic) and `build_arg_parser()` (argument parser for the CLI); see the skeleton sketch below
- Follow output convention: write to `$RESULTS_ROOT/<model>/<task>/...`
- Register task: add to `src/tasks_cli.py`
- Add evaluation (optional): create metric in `evaluation/metrics/<task>_metrics.py`
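A minimal pipeline skeleton following the `run()` / `build_arg_parser()` convention above (argument names and the output filename are illustrative):

```python
# Sketch: a minimal src/pipelines/<task_name>.py skeleton. Argument names and
# the placeholder output are illustrative, not the project's exact interface.
import argparse
import os
from pathlib import Path

def build_arg_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="My new analysis task")
    parser.add_argument("--model", required=True, help="Model name used for output paths")
    parser.add_argument("--results-root", default=os.environ.get("RESULTS_ROOT", "results"))
    return parser

def run(**kwargs):
    # Your generation logic; write under $RESULTS_ROOT/<model>/<task>/...
    out_dir = Path(kwargs["results_root"]) / kwargs["model"] / "my_task"
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "placeholder.txt").write_text("TODO: real output\n")

if __name__ == "__main__":
    args = build_arg_parser().parse_args()
    run(model=args.model, results_root=args.results_root)
```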
- Implement metric: add to `evaluation/metrics/`
- Add plotting: extend `evaluation/task_plots.py`
- Integrate: update `evaluation/evaluate_multi_models.py` to aggregate
# evaluation/metrics/custom_metric.py
def evaluate_custom(gold_path, pred_path):
    """
    Your custom evaluation logic.

    Returns:
        dict: {"accuracy": 0.95, "custom_score": 0.87}
    """
    # Load and compare
    pass

Proprietary License
This project is proprietary software. See pyproject.toml for license metadata.
For licensing inquiries, contact the repository maintainer.
If you use this benchmark in your research, please cite:
@article{ma2023exploring,
title={Exploring Code Analysis: Zero-Shot Insights on Syntax and Semantics with LLMs},
author={Ma, Wei and Lin, Zhihao and Liu, Shangqing and Hu, Qiang and Liu, Ye and Wang, Wenhan and Zhang, Cen and Nie, Liming and Li, Li and Liu, Yang and Jiang, Lingxiao},
journal={arXiv preprint arXiv:2305.12138},
year={2023}
}

Paper: arXiv:2305.12138