Official implementation of "Not All Invariants Are Equal: Curating Training Data to Accelerate Program Verification with SLMs" (arXiv preprint; citation to be updated upon publication).
The synthesis of inductive loop invariants is a critical bottleneck in automated program verification. While Large Language Models (LLMs) show promise in mitigating this issue, they often fail on hard instances, generating invariants that are invalid or computationally ineffective. Fine-tuning is a natural route to mitigate this limitation, but obtaining high-quality training data for invariant generation remains an open challenge. We present a rigorous data curation pipeline designed to extract high-quality training signals from raw verifier-generated invariants. First, we formalize the properties required of a high-quality training invariant. Second, we propose WONDA, a pipeline that refines noisy data via AST-based normalization, followed by LLM-driven semantic rewriting and augmentation with provable quality guarantees. We demonstrate that fine-tuning Small Language Models (SLMs) on this curated dataset results in consistent and significant performance gains. In particular, a fine-tuned 4B-parameter model matches the utility of a GPT-OSS-120B baseline and approaches the state-of-the-art GPT-5.2, without incurring reasoning-time overhead. On challenging instances from the recent InvBench evaluation suite, our approach doubles the invariant correctness rate of base models and improves their Virtual Best Performance (VBP) rates on the verification task by up to 14.2%.
This repository contains the code for WONDA (data curation pipeline) and the full training and evaluation framework for fine-tuning Small Language Models (SLMs) on loop invariant generation for C programs.
Released models and datasets: All fine-tuned models and datasets are available in the Wonda HuggingFace collection.
- Python 3.13+
- CUDA-compatible GPU (we used L40S or H200)
- uv package manager
- UAutomizer (for verification)
- runlim (for time/memory limits during verification)
- Weights & Biases (W&B) account (optional but recommended): we use wandb for experiment tracking when training fine-tuned models, and W&B Weave for evaluation observability; a W&B account gives you the full experience.
# Clone the repository
git clone https://github.com/idopinto/wonda.git
cd wonda
# Install dependencies
uv sync
# Install dev dependencies (optional)
uv sync --extra dev
Copy .env.example to .env and fill in any credentials you need:
cp .env.example .env
# Edit .env with your API keys (OpenAI, Together, OpenRouter, HuggingFace, W&B, etc.)
Variables are optional and depend on what you run: API keys for external model backends, HF_TOKEN for private models or pushing to the Hub, and WANDB_API_KEY for experiment tracking. See .env.example for the full list.
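A minimal .env sketch (values are placeholders; only set the variables you actually need, following .env.example):

```shell
# .env — placeholder values; copy from .env.example and fill in what you use
OPENROUTER_API_KEY=sk-or-...   # external model backends (e.g. GPT-5.2 via OpenRouter)
HF_TOKEN=hf_...                # private models or pushing to the Hub
WANDB_API_KEY=...              # experiment tracking with W&B
```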
Download and extract UAutomizer (0.3.0-dev-d790fec) to tools/UAutomizer25/. The framework expects the following structure:
tools/
└── UAutomizer25/
└── Ultimate.py
Other releases can be found here: https://fm-tools.sosy-lab.org/#tool-uautomizer
Build or install runlim and place the runlim binary at tools/runlim/runlim. It is used to enforce time and memory limits when running the verifier (e.g. for evaluation and dataset building).
To verify the repo works without a GPU, UAutomizer, or runlim:
# Run the full test suite (no network, no external tools)
uv run pytest tests/ -v
# Run a quick demo (AST pipeline only, no verifier)
uv run python -m wonda.demos.ast_program_demo
See wonda/core/README.md for more demos (AST tree, normalization, and optional direct verification with UAutomizer).
Run a quick multi-run evaluation on a small sample (e.g. 5 instances). For a fast local smoke test, use multi_run.num_runs=1 together with weave.skip_weave=true and weave.test_mode=true:
# GPT-5.2 (OpenRouter API; set OPENROUTER_API_KEY in .env)
uv run -m wonda.eval.run_multi_eval multi_run.num_runs=1 dataset.limit=5 weave.skip_weave=true weave.test_mode=true models=gpt_5.2_config
# Qwen3-0.6B base (no fine-tuning)
uv run -m wonda.eval.run_multi_eval multi_run.num_runs=1 dataset.limit=5 weave.skip_weave=true weave.test_mode=true models=qwen3_0.6b_nt_config
# Qwen3-0.6B fine-tuned v2.2 (WONDA SFT)
uv run -m wonda.eval.run_multi_eval multi_run.num_runs=1 dataset.limit=5 weave.skip_weave=true weave.test_mode=true models=qwen3_0.6b_nt_config models.eval_ft_model=true models.ft_model.sft_version="v2.2"
Evaluation requires UAutomizer and runlim (see Installation). We recommend W&B Weave for observability (traces, prompts, and model outputs); for quick local tests, disable it with weave.skip_weave=true and keep outputs isolated with weave.test_mode=true. Evaluation runs model inference with Weave parallelism (WEAVE_PARALLELISM; default 8 for single runs, 1 for multi-run), while UAutomizer verification is throttled via a semaphore (VERIFIER_MAX_CONCURRENT, default 1) to avoid resource contention. More models and full benchmark runs are described in wonda/eval/README.md.
Runnable demos (no external tools for the first three) are in wonda/demos/:
| Demo | Command | Notes |
|---|---|---|
| AST program pipeline | `uv run python -m wonda.demos.ast_program_demo` | In-memory C program through the full pipeline |
| AST tree visualization | `uv run python -m wonda.demos.ast_tree_demo` | Unicode tree diagrams |
| Invariant normalization | `uv run python -m wonda.demos.normalization_demo` | AST-based normalization |
| Direct verification | `uv run python -m wonda.demos.direct_verification_demo` | Requires UAutomizer + runlim |
Details: wonda/core/README.md.
- Evaluation (multi-run eval, scoring): UAutomizer, runlim, and optionally GPU/API keys. See wonda/eval/README.md.
- Training (fine-tuning): GPU, optional W&B; datasets from HuggingFace. See wonda/train/README.md.
- Preprocessing (dataset building): Raw benchmarks, UAutomizer, runlim, optional cloud LLM. See wonda/preprocess/README.md.
The test suite and the AST/normalization demos above need only Python and the installed dependencies.
All datasets are on HuggingFace (see the Wonda collection for the full list). See wonda/preprocess/README.md for dataset details and how to rebuild them.
We release fine-tuned Qwen3 (non-think) models on HuggingFace; training uses LoRA and the WONDA SFT datasets. See wonda/train/README.md for released models and how to train your own.
We run multi-trial evaluation and report the metrics described in the paper. The metrics (validation, correctness, usefulness, VBS, speedup, etc.) are implemented and explained in wonda/eval/inv_gen_scorer.py. See wonda/eval/README.md for the evaluation process, models used in the paper, and how to run evaluations.
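As a toy illustration of multi-trial reporting (not the scorer in wonda/eval/inv_gen_scorer.py): a per-run correctness rate averages success over all runs and instances, while a virtual-best style rate counts an instance as solved if any run solved it.

```python
# Illustrative multi-trial aggregation; data and metric names are made up.
# runs[r][i] = did run r produce a correct invariant for instance i?
runs = [
    [True, False, True, False],
    [False, False, True, True],
]

num_runs = len(runs)
num_instances = len(runs[0])

# Average success over every (run, instance) pair.
correctness_rate = sum(sum(run) for run in runs) / (num_runs * num_instances)

# Instance counts if *any* run solved it (virtual-best style).
virtual_best_rate = (
    sum(any(run[i] for run in runs) for i in range(num_instances)) / num_instances
)

print(correctness_rate, virtual_best_rate)
```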
wonda/
├── configs/
│ ├── eval/ # Evaluation configs (Hydra), incl. model configs in models/
│ ├── train/ # Training configs (qwen3_0.6b … qwen3_14b)
│ └── preprocess/ # Preprocessing configs (build_eval_dataset, build_sft_dataset, wonda_data_pipeline)
├── scripts/
│ ├── eval/ # SLURM scripts: run_multi_eval.sbatch
│ ├── train/ # SLURM scripts: train_qwen.sbatch, train_and_eval_qwen.sbatch
│ └── preprocess/ # SLURM/shell: build_eval_dataset.sbatch, build_raw_train_dataset.sbatch, wonda_pipeline.sbatch, build_sft_dataset.sh
├── wonda/ # Main package
│ ├── core/ # AST and property utilities (ast_program.py, property.py)
│ ├── eval/ # Evaluation pipeline
│ │ ├── evaluate.py, run_multi_eval.py, inv_gen_scorer.py
│ │ ├── decision_procedure.py, eval_data.py, validate.py, aggregate_results.py
│ │ └── models/ # Model backends (qwen_model, oss_model, open_router_model, vllm_model, etc.)
│ ├── preprocess/ # WONDA data curation pipeline
│ │ ├── build_eval_dataset.py, build_raw_train_dataset.py, build_sft_dataset.py
│ │ ├── wonda_pipeline.py, baseline_dataset_common.py
│ │ └── gt_invariant_normalization.py, gt_invariant_simplification.py
│ ├── train/ # Training: train_qwen.py, data_utils.py, train_utils.py
│ └── verifiers/ # UAutomizer wrapper (uautomizer.py)
├── tests/ # Test suite (see [tests/README.md](tests/README.md))
└── tools/ # UAutomizer and runlim (see Installation)
See tests/README.md for how to run the test suite, what to run, and a summary of test modules. Tests use no external tools (gcc, UAutomizer, runlim) or network and are suitable for CI.
@article{pinto2026wonda,
title={Not All Invariants Are Equal: Curating Training Data to Accelerate Program Verification with SLMs},
author={Ido Pinto and Yizhak Elboher and Haoze Wu and Nina Narodytska and Guy Katz},
journal={arXiv preprint},
year={2026},
note={Citation to be updated upon publication}
}
This project is licensed under the MIT License - see the LICENSE file for details.

