Synthesis-in-the-loop evaluation of LLMs for RTL generation. SynthAgentic scores generated Verilog with the Hardware Quality Index (HQI): a 0–100 metric that integrates post-synthesis area, delay, and warning count against expert golden references under a Nangate45 45 nm flow.
This repository accompanies our GLSVLSI '26 paper:
Weimin Fu, Zeng Wang, Minghao Shao, Ramesh Karri, Muhammad Shafique, Johann Knechtel, Ozgur Sinanoglu, Xiaolong Guo. Synthesis-in-the-Loop Evaluation of LLMs for RTL Generation: Quality, Reliability, and Failure Modes. GLSVLSI '26.
doi:10.1145/3787109.3815245
The framework evaluates 32 language models on 202 Verilog tasks (VerilogEval + RTLLM) across five attempts each. Beyond pass/fail simulation, it stages every generation through Icarus Verilog → Yosys (Nangate45) → testbench simulation, then anchors HQI to expert golden synthesis numbers.
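The staged flow can be pictured as an ordered list of gates, where a design is attributed to the first gate it fails. The sketch below is illustrative only: `run_gates` and the stub checks are hypothetical names, not part of the repository, and it assumes that a failing stage stops the pipeline for that attempt.

```python
def run_gates(design, gates):
    """Run (name, check) pairs in order and return the first failing gate name.

    Returns None when every gate passes. `gates` is a hypothetical structure;
    the real pipeline invokes iverilog and yosys rather than Python callables.
    """
    for name, check in gates:
        if not check(design):
            return name  # later gates are not run (assumed short-circuit)
    return None


# Stub checks standing in for Icarus Verilog, Yosys, and testbench simulation.
example_gates = [
    ("syntax", lambda d: True),
    ("synthesis", lambda d: False),
    ("simulation", lambda d: True),
]
failed_at = run_gates("module top; endmodule", example_gates)
```

Here `failed_at` names the synthesis gate, since the stub for that stage returns `False`.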
SynthAgentic/
├── scripts/                          Python pipeline + analysis
│   ├── config.py                     Paths, model styles, category mapping
│   ├── load_data.py                  Load + merge per-iter eval CSVs
│   ├── compute_metrics.py            Per-attempt HQI / best-of-5 / model stats
│   ├── detect_synth_subtypes.py      Yosys log → 9-subtype failure taxonomy
│   ├── hqi_sensitivity.py            HQI weight robustness (ρ ≥ 0.997)
│   ├── complexity_weight_check.py    Complexity-weighting robustness (ρ = 0.985)
│   ├── tech_ablation.py              Resynth under SG13G2 / OSU 0.35
│   ├── plot_*.py                     Paper figure generators
│   ├── run_all.py                    End-to-end driver
│   ├── task_manifest.csv             202-task ID → category mapping
│   └── setup_external.sh             Fetches RTLLM + tech libs + eval data
├── tests/smoke_test.py               Synthetic-fixture pipeline test (no LLM)
├── docs/
│   ├── progress.md                   Engineering log
│   └── failure_analysis.md           9-subtype synthesis-failure taxonomy
├── requirements.txt
├── LICENSE                           MIT
└── README.md
# 1. Install dependencies
pip install -r requirements.txt
# 2. Smoke-test the analysis pipeline (no external data, runs in <1 s)
python tests/smoke_test.py
# 3. Fetch external dependencies (RTLLM, tech libraries, evaluation data)
bash scripts/setup_external.sh
# 4. Reproduce paper figures from the released eval data
python scripts/run_all.py

External tooling: SynthAgentic shells out to yosys (synthesis) and iverilog (parse + simulate). Set YOSYS=/path/to/yosys and IVERILOG=/path/to/iverilog if they are not on $PATH.
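One way to honor the YOSYS and IVERILOG overrides is to check the environment variable first and fall back to a `$PATH` lookup. This is a minimal sketch of that pattern, not the repository's actual logic; `resolve_tool` is a hypothetical helper name.

```python
import os
import shutil


def resolve_tool(env_var, default_name):
    """Return the path in env_var if it is set, else search PATH for default_name.

    Returns None when the variable is unset and the tool is not on PATH.
    """
    override = os.environ.get(env_var)
    if override:
        return override
    return shutil.which(default_name)


yosys_path = resolve_tool("YOSYS", "yosys")
iverilog_path = resolve_tool("IVERILOG", "iverilog")
```

A caller would typically stop with a clear error message if either result is `None` before attempting synthesis or simulation.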
The paper's raw evaluation outputs are not vendored in this repo. They live as a HuggingFace dataset:
scripts/setup_external.sh downloads them into eval_results/, full_baseline_stats/, and scripts/data/. Contents:
| Bundle | Content |
|---|---|
| eval_results/ | Per-attempt CSVs for each (model, iter), 32 × 5 = 160 files plus per-iter summary JSONs |
| full_baseline_stats/ | 788 golden-reference Verilog modules (the references HQI is anchored to) |
| scripts/data/golden_reference.csv | Per-task golden synthesis stats (area, delay, warnings, AST complexity) |
| scripts/data/inference_log.csv | OpenRouter generation log (cost, tokens, TTFT, throughput) |
Third-party code that this framework depends on but does not redistribute:
| Resource | Source | Why we don't vendor |
|---|---|---|
| RTLLM benchmark | https://github.com/hkust-zhiyao/RTLLM | Upstream repo with its own license |
| Nangate45 cell library | OpenROAD-flow-scripts | Upstream PDK, separate license |
| IHP SG13G2 PDK | https://github.com/IHP-GmbH/IHP-Open-PDK | Upstream PDK |
| OSU 0.35 µm library | Oklahoma State VLSI Arch | Manual download required |
For a passing design on task t with post-synthesis area â, delay d̂, and warning count ŵ:
cost = 0.5 · (â / a*ₜ) + 0.5 · (d̂ / d*ₜ) + 0.1 · max(0, ŵ − w*ₜ)
HQI = min(100 / cost, 100)
Designs that fail any pipeline gate (syntax / synthesis / simulation) score 0. Tasks without a valid golden reference contribute to coverage only and are excluded from HQI aggregates.
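The formula translates directly into a few lines of Python. The sketch below is a standalone illustration rather than the code in scripts/compute_metrics.py; the function name and argument names are assumptions, and it presumes the golden area and delay are strictly positive so that the cost is positive.

```python
def hqi(area, delay, warnings, gold_area, gold_delay, gold_warnings, passed=True):
    """Hardware Quality Index for a single attempt on one task.

    Assumes gold_area > 0 and gold_delay > 0 (a valid golden reference).
    """
    if not passed:
        return 0.0  # failing syntax, synthesis, or simulation scores 0
    cost = (
        0.5 * (area / gold_area)
        + 0.5 * (delay / gold_delay)
        + 0.1 * max(0, warnings - gold_warnings)  # only extra warnings are penalized
    )
    return min(100.0 / cost, 100.0)  # clamp so beating the reference caps at 100


# Illustrative numbers only (not taken from the released data).
matches_golden = hqi(100.0, 2.0, 0, 100.0, 2.0, 0)   # cost = 1.0
double_area = hqi(200.0, 2.0, 0, 100.0, 2.0, 0)      # cost = 1.5
smaller_than_golden = hqi(50.0, 2.0, 0, 100.0, 2.0, 0)  # cost = 0.75, clamped
```

With these inputs, a design that matches the reference scores 100, doubling the area at equal delay gives a cost of 1.5 and an HQI of about 66.7, and a design smaller than the reference is clamped to 100.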
Per-model headline aggregates:
- Coverage — complexity-weighted task-solve rate (any of 5 attempts passes)
- Global HQI — best-of-five HQI ceiling, complexity-weighted
- Expected HQI — single-attempt mean HQI, complexity-weighted
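The three aggregates can be sketched as weighted means over tasks. This is an illustration under stated assumptions: the per-task record layout (`weight`, `passed`, `hqi`) is hypothetical, the complexity weight is taken as a given positive number, coverage is reported as a fraction, and tasks whose `hqi` is `None` stand for tasks without a valid golden reference.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean; assumes the weights sum to a positive number."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def aggregate(tasks):
    """Compute coverage, global HQI, and expected HQI for one model."""
    all_w = [t["weight"] for t in tasks]
    coverage = weighted_mean([1.0 if any(t["passed"]) else 0.0 for t in tasks], all_w)

    # Tasks without a golden reference count toward coverage only.
    scored = [t for t in tasks if t["hqi"] is not None]
    w = [t["weight"] for t in scored]
    global_hqi = weighted_mean([max(t["hqi"]) for t in scored], w)
    expected_hqi = weighted_mean([sum(t["hqi"]) / len(t["hqi"]) for t in scored], w)
    return {"coverage": coverage, "global_hqi": global_hqi, "expected_hqi": expected_hqi}


# Illustrative records only: five attempts per task.
example = aggregate([
    {"weight": 1.0, "passed": [True, False, False, False, False], "hqi": [80.0, 0.0, 0.0, 0.0, 0.0]},
    {"weight": 3.0, "passed": [False] * 5, "hqi": [0.0] * 5},
    {"weight": 1.0, "passed": [True] * 5, "hqi": None},  # no golden reference
])
```

In this toy case coverage is 0.4, global HQI is 20.0, and expected HQI is 4.0, which shows how a single lucky attempt lifts the best-of-five ceiling well above the single-attempt mean.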
Rankings are robust to weight choices: Spearman ρ ≥ 0.997 across ten coefficient configurations, ρ ≥ 0.985 with equal task weighting (max rank displacement 3 positions across 32 models).
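For readers unfamiliar with the statistic, Spearman's ρ is the Pearson correlation of the rank vectors, and rank displacement is the largest change in a model's position between two rankings. The sketch below uses only the standard library and toy scores; it is not the code in scripts/hqi_sensitivity.py, and the function names are illustrative.

```python
def ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on ranks (needs non-constant input)."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def max_rank_displacement(x, y):
    """Largest absolute change in rank for any item between the two score lists."""
    return max(abs(a - b) for a, b in zip(ranks(x), ranks(y)))


# Toy scores for four models under two weight configurations.
baseline = [10.0, 20.0, 30.0, 40.0]
variant = [11.0, 19.0, 42.0, 35.0]  # the top two models swap places
rho = spearman(baseline, variant)
shift = max_rank_displacement(baseline, variant)
```

In this example one adjacent swap among four items gives ρ = 0.8 and a maximum displacement of one position.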
@inproceedings{fu2026synthagentic,
title = {Synthesis-in-the-Loop Evaluation of {LLMs} for {RTL} Generation:
Quality, Reliability, and Failure Modes},
author = {Fu, Weimin and Wang, Zeng and Shao, Minghao and Karri, Ramesh and
Shafique, Muhammad and Knechtel, Johann and Sinanoglu, Ozgur and
Guo, Xiaolong},
booktitle = {Proceedings of the Great Lakes Symposium on VLSI 2026 (GLSVLSI '26)},
year = {2026},
doi = {10.1145/3787109.3815245}
}

MIT (see LICENSE). Third-party benchmarks and PDKs retain their own licenses; SynthAgentic does not redistribute them.