MultiLogBench is a multilingual benchmark for automated logging statement generation. Each sample provides a callable with one developer-written logging statement removed (`input_initial`), the original callable (`function_content`), and a structured target logging statement (`target_log`). The benchmark covers C/C++, C#, Go, Java, JavaScript, and Python.
```
MultiLogBench/
|-- data/
|   |-- repository_snapshot/
|   |   |-- samples.jsonl
|   |   |-- manifest.json
|   |   |-- stats.json
|   |   |-- postprocess_report.json
|   |   `-- splits/
|   |       |-- train.jsonl
|   |       |-- valid.jsonl
|   |       |-- test.jsonl
|   |       |-- manifest.json
|   |       `-- stats.json
|   |-- revision_history/
|   |   |-- accepted_pool.jsonl
|   |   |-- manifest.json
|   |   |-- stats.json
|   |   `-- postprocess_report.json
|   `-- revision_history_transformed/
|       |-- accepted_pool.jsonl
|       |-- manifest.json
|       |-- stats.json
|       `-- transform_report.json
|-- multilogbench/
|   `-- logging_dataset/
|       |-- config.py
|       |-- core.py
|       `-- runtime.py
|-- scripts/
|   |-- run_llms.py
|   `-- evaluate_logging_metrics.py
|-- requirements.txt
|-- LICENSE
`-- README.md
```
| Dataset | Path | Rows | Use |
|---|---|---|---|
| Repository-snapshot full set | `data/repository_snapshot/samples.jsonl` | 75,292 | Full cleaned snapshot dataset |
| Repository-snapshot train | `data/repository_snapshot/splits/train.jsonl` | 51,120 | Training-oriented split |
| Repository-snapshot valid | `data/repository_snapshot/splits/valid.jsonl` | 6,426 | Retrieval examples and validation |
| Repository-snapshot test | `data/repository_snapshot/splits/test.jsonl` | 6,419 | In-distribution test split |
| Revision-history test | `data/revision_history/accepted_pool.jsonl` | 744 | Main historical benchmark |
| Transformed revision-history test | `data/revision_history_transformed/accepted_pool.jsonl` | 744 | Leakage-resistant transformed benchmark |
The core JSONL fields are:

| Field | Meaning |
|---|---|
| `index` | Stable sample identifier |
| `language` | Programming language |
| `repo_name` | Source repository |
| `function_name` | Callable name |
| `function_content` | Original callable with the target logging statement |
| `input_initial` | Callable after removing the target logging statement |
| `target_log` | Gold logging statement fields: `line`, `level`, `message`, `vars`, `statement` |
| `target_log_normalized` | Normalized logging level, message template, and variable fields |
Revision-history samples additionally include parent/child commit metadata and `parent_function_content`.
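As a quick sanity check, here is a minimal loading sketch (not shipped with the repository) that reads one JSONL file and touches the documented fields:

```python
import json
from collections import Counter

def load_jsonl(path):
    """Yield one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

samples = list(load_jsonl("data/revision_history/accepted_pool.jsonl"))
print(len(samples), "samples")
print(Counter(s["language"] for s in samples))

# Each sample pairs the masked callable with its gold logging statement.
first = samples[0]
print(first["function_name"], "->", first["target_log"]["statement"])
```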
Large JSONL files are tracked with Git LFS through `.gitattributes`. After cloning, run:

```bash
git lfs install
git lfs pull
```

Use Python 3.10 or newer:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

The evaluator uses `tree-sitter-languages` to parse generated code snippets and extract logging statements.
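For intuition, here is a minimal sketch of parsing a snippet with `tree-sitter-languages` and collecting call expressions; it illustrates the parsing layer only, not the evaluator's actual extraction logic:

```python
from tree_sitter_languages import get_parser

source = b'def handler(e):\n    logger.error("failed: %s", e)\n'
parser = get_parser("python")
tree = parser.parse(source)

def walk(node):
    """Depth-first traversal yielding every node in the tree."""
    yield node
    for child in node.children:
        yield from walk(child)

# Call expressions are candidate logging statements; a real extractor
# would also check the callee name against known logging APIs.
for node in walk(tree.root_node):
    if node.type == "call":
        print(source[node.start_byte:node.end_byte].decode())
```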
The public runner writes a run directory containing:
| File | Meaning |
|---|---|
| `selected_gold.jsonl` | Gold subset used in the run |
| `generations.jsonl` | Raw generated revised code keyed by sample index |
| `run_config.json` | Run configuration |
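The record schema inside `generations.jsonl` is documented here only as raw revised code keyed by sample index, so inspect one record rather than hard-coding field names; a minimal sketch:

```python
import json
from pathlib import Path

run_dir = Path("runs/dry-run/smoke")  # run directory from the smoke test below

# Field names inside generations.jsonl are not documented here;
# print one record's keys to confirm the actual schema.
with open(run_dir / "generations.jsonl", encoding="utf-8") as f:
    record = json.loads(f.readline())
print(sorted(record.keys()))
```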
Dry-run smoke test:

```bash
python scripts/run_llms.py \
  --dry-run \
  --max-samples 2 \
  --run-name smoke
```

OpenAI-compatible API:

```bash
export OPENAI_API_KEY=your_api_key
python scripts/run_llms.py \
  --provider openai-compatible \
  --model gpt-5.4 \
  --base-url https://api.openai.com/v1 \
  --dataset data/revision_history/accepted_pool.jsonl \
  --strategy rag \
  --max-workers 8
```

Gemini API:

```bash
export GEMINI_API_KEY=your_api_key
python scripts/run_llms.py \
  --provider google-genai \
  --model gemini-2.5-flash \
  --dataset data/revision_history/accepted_pool.jsonl \
  --strategy rag \
  --max-workers 8
```

Useful filters:

```bash
python scripts/run_llms.py --dry-run --languages java,python --max-samples 10
python scripts/run_llms.py --dry-run --sample-indexes 1,2,3
```

Evaluate a run directory:

```bash
python scripts/evaluate_logging_metrics.py \
  --run-dir runs/dry-run/smoke
```

This produces, by default:
| File | Meaning |
|---|---|
| `metrics.json` | Aggregate metrics |
| `metrics_details.jsonl` | Per-sample metric details |
| `predictions.jsonl` | Normalized predicted logging statement records |
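To consume `metrics.json` programmatically, a small sketch under the assumption that it is a JSON object mapping metric names to values (the exact keys are not specified here):

```python
import json

with open("runs/dry-run/smoke/metrics.json", encoding="utf-8") as f:
    metrics = json.load(f)

# Key names are an assumption; iterate instead of hard-coding them.
for name, value in sorted(metrics.items()):
    print(f"{name}: {value}")
```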
Evaluate raw generations explicitly:

```bash
python scripts/evaluate_logging_metrics.py \
  --gold data/revision_history/accepted_pool.jsonl \
  --pred runs/MODEL_NAME/historical_test/generations.jsonl \
  --pred-format generations \
  --pa-mode target-line-change \
  --output metrics.json \
  --details-out metrics_details.jsonl \
  --predictions-out predictions.jsonl
```

Evaluate structured predictions:

```bash
python scripts/evaluate_logging_metrics.py \
  --gold data/repository_snapshot/splits/test.jsonl \
  --pred predictions.jsonl \
  --pred-format structured \
  --pa-mode exact-line \
  --output metrics.json
```

`--pa-mode auto` uses `target-line-change` for raw `generations.jsonl` and `exact-line` for structured `predictions.jsonl`.
Structured prediction files should use one JSON object per line:
{"index":123,"pred_log":{"line":7,"level":"warn","message":"\"retry failed {}\", e.getMessage()","vars":["e.getMessage()"],"statement":"logger.warn(\"retry failed {}\", e.getMessage());"}}Required fields:
| Field | Type | Meaning |
|---|---|---|
| `index` | integer | Must match a gold sample index |
| `pred_log.line` | integer or null | 1-based callable-relative logging statement line |
| `pred_log.level` | string or null | Predicted logging level |
| `pred_log.message` | string | Predicted logging payload text |
| `pred_log.vars` | string array | Predicted runtime variables or expressions |
| `pred_log.statement` | string | Full predicted logging statement |
If prediction fails, still emit a record with null/empty `pred_log` fields and an optional `status` or `error` field.
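For example, a minimal writer sketch for this schema; the `to_record` helper is hypothetical, but the emitted fields follow the table above, including the failure fallback:

```python
import json

def to_record(index, pred):
    """Build one structured prediction record; pred may be None on failure."""
    if pred is None:
        # Emit an explicit failure record instead of dropping the sample.
        return {
            "index": index,
            "pred_log": {"line": None, "level": None, "message": "",
                         "vars": [], "statement": ""},
            "status": "error",
        }
    return {"index": index, "pred_log": pred}

predictions = [
    to_record(123, {"line": 7, "level": "warn",
                    "message": '"retry failed {}", e.getMessage()',
                    "vars": ["e.getMessage()"],
                    "statement": 'logger.warn("retry failed {}", e.getMessage());'}),
    to_record(124, None),  # a failed sample still gets a record
]

with open("predictions.jsonl", "w", encoding="utf-8") as f:
    for rec in predictions:
        f.write(json.dumps(rec) + "\n")
```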
The evaluator reports:
| Metric | Meaning |
|---|---|
| PA | Position Accuracy |
| FA | Framework Accuracy |
| LA | Logging Level Accuracy |
| AOD | Average Ordinal Distance for logging level |
| BLEU-4 | N-gram overlap for logging text |
| ROUGE-L | Longest-common-subsequence similarity for logging text |
| PMR | Precisely Matched Rate for logging variables |
| Precision / Recall / F1 | Set-based logging variable metrics |
| CCS | Weakly cascaded composite score: PA * (0.5 + 0.25 * FA + 0.25 * ((LA + ROUGE-L + F1) / 3)) |
By default, PA is averaged over all samples, while FA, LA, AOD, BLEU-4, ROUGE-L, PMR, Precision, Recall, and F1 are conditioned on samples with PA = 1. CCS is computed from the aggregate PA, FA, LA, ROUGE-L, and F1 values.
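As a worked example of the CCS formula above (a sketch with illustrative values only):

```python
def ccs(pa, fa, la, rouge_l, f1):
    """Weakly cascaded composite score from aggregate metric values in [0, 1]."""
    return pa * (0.5 + 0.25 * fa + 0.25 * ((la + rouge_l + f1) / 3))

# Illustrative (made-up) aggregate values:
print(round(ccs(pa=0.40, fa=0.90, la=0.80, rouge_l=0.55, f1=0.60), 4))  # 0.355
```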
```bibtex
@misc{zhong2026singlelanguageevidenceinsufficientautomated,
  title={Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs},
  author={Renyi Zhong and Yichen Li and Yulun Wu and Jinxi Kuang and Yintong Huo and Michael R. Lyu},
  year={2026},
  eprint={2604.17529},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2604.17529},
}
```
The code and documentation in this repository are released under the MIT License. Dataset samples are derived from open-source repositories.