Official implementation of "Not All Invariants Are Equal: Curating Training Data to Accelerate Program Verification with SLMs" (arXiv preprint; citation to be updated upon publication).
The synthesis of inductive loop invariants is a critical bottleneck in automated program verification. While Large Language Models (LLMs) show promise in mitigating this issue, they often fail on hard instances, generating invariants that are invalid or computationally ineffective. Fine-tuning is a natural route to mitigate this limitation, but obtaining high-quality training data for invariant generation remains an open challenge. We present a rigorous data curation pipeline designed to extract high-quality training signals from raw verifier-generated invariants. First, we formalize the properties required of a high-quality training invariant. Second, we propose WONDA, a pipeline that refines noisy data via AST-based normalization, followed by LLM-driven semantic rewriting and augmentation with provable quality guarantees. We demonstrate that fine-tuning Small Language Models (SLMs) on this curated dataset results in consistent and significant performance gains. In particular, a fine-tuned 4B-parameter model matches the utility of a GPT-OSS-120B baseline and approaches the state-of-the-art GPT-5.2, without incurring reasoning-time overhead. On challenging instances from the recent InvBench evaluation suite, our approach doubles the invariant correctness rate of base models and improves their Virtual Best Performance (VBP) rates on the verification task by up to 14.2%.
This repository contains the code for WONDA (data curation pipeline) and the full training and evaluation framework for fine-tuning Small Language Models (SLMs) on loop invariant generation for C programs.
Released models and datasets: All fine-tuned models and datasets are available in the Wonda HuggingFace collection.
- Python 3.13+
- CUDA-compatible GPU (we used L40S or H200)
- uv package manager
- UAutomizer (for verification)
- runlim (for time/memory limits during verification)
- Weights & Biases (W&B) account (optional but recommended): we use wandb for experiment tracking when training fine-tuned models, and W&B Weave for evaluation observability; a W&B account gives you the full experience.
# Clone the repository
git clone https://github.com/idopinto/wonda.git
cd wonda
# Install dependencies
uv sync
# Install dev dependencies (optional)
uv sync --extra dev
Copy .env.example to .env and fill in any credentials you need:
cp .env.example .env
# Edit .env with your API keys (OpenAI, Together, OpenRouter, HuggingFace, W&B, etc.)
Variables are optional and depend on what you run: API keys for external model backends, HF_TOKEN for private models or pushing to the Hub, and WANDB_API_KEY for experiment tracking. See .env.example for the full list.
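A minimal .env sketch (values are placeholders; only set the variables you actually need, following .env.example):

```shell
# .env — placeholder values; copy from .env.example and fill in what you use
OPENROUTER_API_KEY=sk-or-...   # external model backends (e.g. GPT-5.2 via OpenRouter)
HF_TOKEN=hf_...                # private models or pushing to the Hub
WANDB_API_KEY=...              # experiment tracking with W&B
```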
Download and extract UAutomizer (0.3.0-dev-d790fec) to tools/UAutomizer25/. The framework expects the following structure:
tools/
└── UAutomizer25/
└── Ultimate.py
Other releases can be found here: https://fm-tools.sosy-lab.org/#tool-uautomizer
Build or install runlim and place the runlim binary at tools/runlim/runlim. It is used to enforce time and memory limits when running the verifier (e.g. for evaluation and dataset building).
To verify the repo works without a GPU, UAutomizer, or runlim:
# Run the full test suite (no network, no external tools)
uv run pytest tests/ -v
# Run a quick demo (AST pipeline only, no verifier)
uv run python -m wonda.demos.ast_program_demo
See wonda/core/README.md for more demos (AST tree, normalization, and optional direct verification with UAutomizer).
Run a quick multi-run evaluation on a small sample (e.g. 5 instances). For a fast local smoke test, use multi_run.num_runs=1 together with weave.skip_weave=true and weave.test_mode=true:
# GPT-5.2 (OpenRouter API; set OPENROUTER_API_KEY in .env)
uv run -m wonda.eval.run_multi_eval multi_run.num_runs=1 dataset.limit=5 weave.skip_weave=true weave.test_mode=true models=gpt_5.2_config
# Qwen3-0.6B base (no fine-tuning)
uv run -m wonda.eval.run_multi_eval multi_run.num_runs=1 dataset.limit=5 weave.skip_weave=true weave.test_mode=true models=qwen3_0.6b_nt_config
# Qwen3-0.6B fine-tuned v2.2 (WONDA SFT)
uv run -m wonda.eval.run_multi_eval multi_run.num_runs=1 dataset.limit=5 weave.skip_weave=true weave.test_mode=true models=qwen3_0.6b_nt_config models.eval_ft_model=true models.ft_model.sft_version="v2.2"
Evaluation requires UAutomizer and runlim (see Installation). We recommend W&B Weave for observability (traces, prompts, and model outputs); for quick local tests, disable it with weave.skip_weave=true and keep outputs isolated with weave.test_mode=true. Evaluation runs model inference with Weave parallelism (WEAVE_PARALLELISM; default 8 for single runs, 1 for multi-run), while UAutomizer verification is throttled via a semaphore (VERIFIER_MAX_CONCURRENT, default 1) to avoid resource contention. More models and full benchmark runs are described in wonda/eval/README.md.
Runnable demos (no external tools for the first three) are in wonda/demos/:
| Demo | Command | Notes |
|---|---|---|
| AST program pipeline | `uv run python -m wonda.demos.ast_program_demo` | In-memory C program through the full pipeline |
| AST tree visualization | `uv run python -m wonda.demos.ast_tree_demo` | Unicode tree diagrams |
| Invariant normalization | `uv run python -m wonda.demos.normalization_demo` | AST-based normalization |
| Direct verification | `uv run python -m wonda.demos.direct_verification_demo` | Requires UAutomizer + runlim |
Details: wonda/core/README.md.
- Evaluation (multi-run eval, scoring): UAutomizer, runlim, and optionally GPU/API keys. See wonda/eval/README.md.
- Training (fine-tuning): GPU, optional W&B; datasets from HuggingFace. See wonda/train/README.md.
- Preprocessing (dataset building): Raw benchmarks, UAutomizer, runlim, optional cloud LLM. See wonda/preprocess/README.md.
The test suite and the AST/normalization demos above need only Python and the installed dependencies.
All datasets are on HuggingFace (see the Wonda collection for the full list). See wonda/preprocess/README.md for dataset details and how to rebuild them.
We release fine-tuned Qwen3 (non-think) models on HuggingFace; training uses LoRA and the WONDA SFT datasets. See wonda/train/README.md for released models and how to train your own.
We run multi-trial evaluation and report the metrics described in the paper. The metrics (validation, correctness, usefulness, VBS, speedup, etc.) are implemented and explained in wonda/eval/inv_gen_scorer.py. See wonda/eval/README.md for the evaluation process, models used in the paper, and how to run evaluations.
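As a toy illustration of multi-trial reporting (not the scorer in wonda/eval/inv_gen_scorer.py): a per-run correctness rate averages success over all runs and instances, while a virtual-best style rate counts an instance as solved if any run solved it.

```python
# Illustrative multi-trial aggregation; data and metric names are made up.
# runs[r][i] = did run r produce a correct invariant for instance i?
runs = [
    [True, False, True, False],
    [False, False, True, True],
]

num_runs = len(runs)
num_instances = len(runs[0])

# Average success over every (run, instance) pair.
correctness_rate = sum(sum(run) for run in runs) / (num_runs * num_instances)

# Instance counts if *any* run solved it (virtual-best style).
virtual_best_rate = (
    sum(any(run[i] for run in runs) for i in range(num_instances)) / num_instances
)

print(correctness_rate, virtual_best_rate)
```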
wonda/
├── configs/
│ ├── eval/ # Evaluation configs (Hydra), incl. model configs in models/
│ ├── train/ # Training configs (qwen3_0.6b … qwen3_14b)
│ └── preprocess/ # Preprocessing configs (build_eval_dataset, build_sft_dataset, wonda_data_pipeline)
├── scripts/
│ ├── eval/ # SLURM scripts: run_multi_eval.sbatch
│ ├── train/ # SLURM scripts: train_qwen.sbatch, train_and_eval_qwen.sbatch
│ └── preprocess/ # SLURM/shell: build_eval_dataset.sbatch, build_raw_train_dataset.sbatch, wonda_pipeline.sbatch, build_sft_dataset.sh
├── wonda/ # Main package
│ ├── core/ # AST and property utilities (ast_program.py, property.py)
│ ├── eval/ # Evaluation pipeline
│ │ ├── evaluate.py, run_multi_eval.py, inv_gen_scorer.py
│ │ ├── decision_procedure.py, eval_data.py, validate.py, aggregate_results.py
│ │ └── models/ # Model backends (qwen_model, oss_model, open_router_model, vllm_model, etc.)
│ ├── preprocess/ # WONDA data curation pipeline
│ │ ├── build_eval_dataset.py, build_raw_train_dataset.py, build_sft_dataset.py
│ │ ├── wonda_pipeline.py, baseline_dataset_common.py
│ │ └── gt_invariant_normalization.py, gt_invariant_simplification.py
│ ├── train/ # Training: train_qwen.py, data_utils.py, train_utils.py
│ └── verifiers/ # UAutomizer wrapper (uautomizer.py)
├── tests/ # Test suite (see [tests/README.md](tests/README.md))
└── tools/ # UAutomizer and runlim (see Installation)
See tests/README.md for how to run the test suite, what to run, and a summary of test modules. Tests use no external tools (gcc, UAutomizer, runlim) or network and are suitable for CI.
@article{pinto2026wonda,
title={Not All Invariants Are Equal: Curating Training Data to Accelerate Program Verification with SLMs},
author={Ido Pinto and Yizhak Elboher and Haoze Wu and Nina Narodytska and Guy Katz},
journal={arXiv preprint},
year={2026},
note={Citation to be updated upon publication}
}
This project is licensed under the MIT License - see the LICENSE file for details.

