SynthAgentic

Synthesis-in-the-loop evaluation of LLMs for RTL generation. SynthAgentic scores generated Verilog with the Hardware Quality Index (HQI): a 0–100 metric that combines post-synthesis area, delay, and warning count, each measured relative to expert golden references, under a Nangate45 (45 nm) flow.

This repository accompanies our GLSVLSI '26 paper:

Weimin Fu, Zeng Wang, Minghao Shao, Ramesh Karri, Muhammad Shafique, Johann Knechtel, Ozgur Sinanoglu, Xiaolong Guo. Synthesis-in-the-Loop Evaluation of LLMs for RTL Generation: Quality, Reliability, and Failure Modes. GLSVLSI '26. doi:10.1145/3787109.3815245

The framework evaluates 32 language models on 202 Verilog tasks (VerilogEval + RTLLM), with five attempts per model and task. Rather than stopping at pass/fail simulation, it runs every generation through three gates in order, Icarus Verilog (syntax) → Yosys synthesis (Nangate45) → testbench simulation, then anchors HQI to expert golden synthesis numbers.
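
The staged gating described above can be sketched as follows. This is a minimal illustration, not the repository's actual driver: the function names, the intermediate file names, the Yosys script, and the use of exit status as the pass criterion are all assumptions. Many testbenches report failure by printing a message rather than by exit code, in which case the simulation gate would need to parse the output instead.

```python
import subprocess
from typing import Callable, List, Optional, Sequence, Tuple

Command = List[str]
Stage = Tuple[str, List[Command]]


def build_stages(rtl: str, testbench: str, top: str, liberty: str) -> List[Stage]:
    """Return the three gates in order. File names and the synthesis script are
    hypothetical; paths containing spaces or semicolons are not handled."""
    synth_script = (
        f"read_verilog {rtl}; synth -top {top}; "
        f"dfflibmap -liberty {liberty}; abc -liberty {liberty}; "
        f"stat -liberty {liberty}"
    )
    return [
        ("syntax", [["iverilog", "-o", "parse.vvp", rtl]]),
        ("synthesis", [["yosys", "-p", synth_script]]),
        ("simulation", [["iverilog", "-o", "sim.vvp", rtl, testbench],
                        ["vvp", "sim.vvp"]]),
    ]


def run_command(cmd: Sequence[str]) -> bool:
    """Run one tool invocation; success means exit status 0 (an assumption)."""
    return subprocess.run(cmd, capture_output=True, text=True).returncode == 0


def first_failed_gate(
    stages: List[Stage],
    run: Callable[[Sequence[str]], bool] = run_command,
) -> Optional[str]:
    """Return the name of the first gate that fails, or None if all gates pass."""
    for name, commands in stages:
        # all() short-circuits, so later commands are skipped after a failure.
        if not all(run(cmd) for cmd in commands):
            return name
    return None
```

Passing the runner in as a parameter keeps the gating logic testable without the tools installed. For example, `first_failed_gate(stages, run=lambda cmd: cmd[0] != "yosys")` reports `"synthesis"` as the failing gate.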

Repository layout

SynthAgentic/
├── scripts/                    Python pipeline + analysis
│   ├── config.py               Paths, model styles, category mapping
│   ├── load_data.py            Load + merge per-iter eval CSVs
│   ├── compute_metrics.py      Per-attempt HQI / best-of-5 / model stats
│   ├── detect_synth_subtypes.py  Yosys log → 9-subtype failure taxonomy
│   ├── hqi_sensitivity.py      HQI weight robustness (ρ ≥ 0.997)
│   ├── complexity_weight_check.py  Complexity-weighting robustness (ρ = 0.985)
│   ├── tech_ablation.py        Resynth under SG13G2 / OSU 0.35
│   ├── plot_*.py               Paper figure generators
│   ├── run_all.py              End-to-end driver
│   ├── task_manifest.csv       202-task ID → category mapping
│   └── setup_external.sh       Fetches RTLLM + tech libs + eval data
├── tests/smoke_test.py         Synthetic-fixture pipeline test (no LLM)
├── docs/
│   ├── progress.md             Engineering log
│   └── failure_analysis.md     9-subtype synthesis-failure taxonomy
├── requirements.txt
├── LICENSE                     MIT
└── README.md

Quick start

# 1. Install dependencies
pip install -r requirements.txt

# 2. Smoke-test the analysis pipeline (no external data, runs in <1 s)
python tests/smoke_test.py

# 3. Fetch external dependencies (RTLLM, tech libraries, evaluation data)
bash scripts/setup_external.sh

# 4. Reproduce paper figures from the released eval data
python scripts/run_all.py

External tooling: SynthAgentic shells out to yosys (synthesis) and iverilog (parse + simulate). Set YOSYS=/path/to/yosys and IVERILOG=/path/to/iverilog if they are not on $PATH.
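
As an illustration of the lookup order this implies (environment override first, then $PATH), a minimal sketch might look like the following. The helper name `resolve_tool` and its `env` parameter are hypothetical and are not taken from the repository's scripts.

```python
import os
import shutil
from typing import Mapping, Optional


def resolve_tool(name: str, env_var: str,
                 env: Optional[Mapping[str, str]] = None) -> str:
    """Prefer the environment override, then fall back to a $PATH lookup."""
    env = os.environ if env is None else env
    override = env.get(env_var)
    if override:
        return override
    found = shutil.which(name)
    if found is None:
        raise FileNotFoundError(
            f"{name} not found on $PATH; set {env_var}=/path/to/{name}"
        )
    return found
```

Usage would be `resolve_tool("yosys", "YOSYS")` and `resolve_tool("iverilog", "IVERILOG")`. Failing early with a message that names the variable to set is a design choice of this sketch.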

Data

The paper's raw evaluation outputs are not vendored in this repo. They are published as a Hugging Face dataset:

KSU-HW-SEC/SynthAgentic-Eval

scripts/setup_external.sh downloads them into eval_results/, full_baseline_stats/, and scripts/data/. Contents:

Bundle	Content
`eval_results/`	Per-attempt CSVs for each (model, iter), 32 × 5 = 160 files plus per-iter summary JSONs
`full_baseline_stats/`	788 golden-reference Verilog modules (the references HQI is anchored to)
`scripts/data/golden_reference.csv`	Per-task golden synthesis stats (area, delay, warnings, AST complexity)
`scripts/data/inference_log.csv`	OpenRouter generation log (cost, tokens, TTFT, throughput)

Third-party resources that this framework depends on but does not redistribute:

Resource	Source	Why we don't vendor
RTLLM benchmark	https://github.com/hkust-zhiyao/RTLLM	Upstream repo with its own license
Nangate45 cell library	OpenROAD-flow-scripts	Upstream PDK, separate license
IHP SG13G2 PDK	https://github.com/IHP-GmbH/IHP-Open-PDK	Upstream PDK
OSU 0.35 µm library	Oklahoma State VLSI Arch	Manual download required

Hardware Quality Index (HQI)

For a passing design on task t with post-synthesis area â, delay d̂, and warning count ŵ:

cost = 0.5 · (â / a*ₜ) + 0.5 · (d̂ / d*ₜ) + 0.1 · max(0, ŵ − w*ₜ)
HQI  = min(100 / cost, 100)

Designs that fail any pipeline gate (syntax / synthesis / simulation) score 0. Tasks without a valid golden reference contribute to coverage only and are excluded from HQI aggregates.
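
The formula and the zero-score rule above translate directly into a few lines of Python. This is a sketch of the definition as written in this section, not the code in compute_metrics.py. The function and parameter names are illustrative, and it assumes the golden area and delay are positive and that a passing design has a nonzero cost.

```python
def hqi(passed: bool, area: float, delay: float, warnings: int,
        gold_area: float, gold_delay: float, gold_warnings: int) -> float:
    """Hardware Quality Index for one attempt, on a 0-100 scale."""
    if not passed:
        # Any failed gate (syntax / synthesis / simulation) scores 0.
        return 0.0
    cost = (
        0.5 * (area / gold_area)
        + 0.5 * (delay / gold_delay)
        + 0.1 * max(0, warnings - gold_warnings)  # only extra warnings are penalized
    )
    # Designs that beat the golden reference are capped at 100.
    return min(100.0 / cost, 100.0)
```

For example, a passing design with 20% more area than the golden reference, equal delay, and no extra warnings has cost 0.5 · 1.2 + 0.5 · 1.0 = 1.1, so its HQI is 100 / 1.1 ≈ 90.9. A design that matches or beats the reference on every term has cost ≤ 1 and scores 100.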

Per-model headline aggregates:

Coverage — complexity-weighted task-solve rate (any of 5 attempts passes)
Global HQI — best-of-five HQI ceiling, complexity-weighted
Expected HQI — single-attempt mean HQI, complexity-weighted
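
One plausible reading of these three aggregates is sketched below. The data layout, the normalization as a weighted mean, and the way complexity weights are supplied are all assumptions made for illustration; the repository's compute_metrics.py may define them differently, and coverage is returned here as a fraction rather than a percentage.

```python
from typing import Dict, List, Set, Tuple

Attempt = Tuple[bool, float]  # (passed all gates, HQI; 0.0 for a failed attempt)


def weighted_mean(values: Dict[str, float], weights: Dict[str, float]) -> float:
    """Mean of per-task values, weighted by a per-task complexity weight."""
    total = sum(weights[task] for task in values)
    return sum(weights[task] * v for task, v in values.items()) / total


def model_aggregates(attempts: Dict[str, List[Attempt]],
                     weights: Dict[str, float],
                     golden_tasks: Set[str]) -> Dict[str, float]:
    # Coverage counts every task: solved if any attempt passes.
    solved = {t: float(any(p for p, _ in runs)) for t, runs in attempts.items()}
    # HQI aggregates use only tasks that have a valid golden reference.
    scored = {t: [h for _, h in runs]
              for t, runs in attempts.items() if t in golden_tasks}
    best = {t: max(scores) for t, scores in scored.items()}
    mean = {t: sum(scores) / len(scores) for t, scores in scored.items()}
    return {
        "coverage": weighted_mean(solved, weights),
        "global_hqi": weighted_mean(best, weights),
        "expected_hqi": weighted_mean(mean, weights),
    }
```

Under this reading, Global HQI can never be lower than Expected HQI for the same model, since the best attempt on each task is at least as good as the average attempt.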

Rankings are robust to weight choices: Spearman ρ ≥ 0.997 across ten HQI coefficient configurations, and ρ = 0.985 when tasks are weighted equally instead of by complexity (max rank displacement 3 positions across 32 models).
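
For readers who want to run the same kind of robustness check on their own scores, the two statistics quoted above can be computed as in the sketch below. It is a plain-Python illustration with hypothetical function names, not the code in hqi_sensitivity.py or complexity_weight_check.py. Ties receive average ranks, and the correlation is undefined (division by zero) if all scores in one list are equal.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n (ascending); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def max_rank_displacement(x: Sequence[float], y: Sequence[float]) -> float:
    """Largest change in any model's rank between the two scorings."""
    return max(abs(a - b) for a, b in zip(average_ranks(x), average_ranks(y)))
```

Here `x` and `y` would hold one score per model under the baseline and the perturbed weighting, in the same model order.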

Citation

@inproceedings{fu2026synthagentic,
  title     = {Synthesis-in-the-Loop Evaluation of {LLMs} for {RTL} Generation:
               Quality, Reliability, and Failure Modes},
  author    = {Fu, Weimin and Wang, Zeng and Shao, Minghao and Karri, Ramesh and
               Shafique, Muhammad and Knechtel, Johann and Sinanoglu, Ozgur and
               Guo, Xiaolong},
  booktitle = {Proceedings of the Great Lakes Symposium on VLSI 2026 (GLSVLSI '26)},
  year      = {2026},
  doi       = {10.1145/3787109.3815245}
}

License

MIT (see LICENSE). Third-party benchmarks and PDKs retain their own licenses; SynthAgentic does not redistribute them.
