owenfucell/SynthAgentic


SynthAgentic

Synthesis-in-the-loop evaluation of LLMs for RTL generation. SynthAgentic scores generated Verilog with the Hardware Quality Index (HQI): a 0–100 metric that integrates post-synthesis area, delay, and warning count against expert golden references under a Nangate45 45 nm flow.

This repository accompanies our GLSVLSI '26 paper:

Weimin Fu, Zeng Wang, Minghao Shao, Ramesh Karri, Muhammad Shafique, Johann Knechtel, Ozgur Sinanoglu, Xiaolong Guo. Synthesis-in-the-Loop Evaluation of LLMs for RTL Generation: Quality, Reliability, and Failure Modes. GLSVLSI '26. doi:10.1145/3787109.3815245

The framework evaluates 32 language models on 202 Verilog tasks (VerilogEval + RTLLM) across five attempts each. Beyond pass/fail simulation, it stages every generation through Icarus Verilog → Yosys (Nangate45) → testbench simulation, then anchors HQI to expert golden synthesis numbers.

Repository layout

SynthAgentic/
├── scripts/                    Python pipeline + analysis
│   ├── config.py               Paths, model styles, category mapping
│   ├── load_data.py            Load + merge per-iter eval CSVs
│   ├── compute_metrics.py      Per-attempt HQI / best-of-5 / model stats
│   ├── detect_synth_subtypes.py  Yosys log → 9-subtype failure taxonomy
│   ├── hqi_sensitivity.py      HQI weight robustness (ρ ≥ 0.997)
│   ├── complexity_weight_check.py  Complexity-weighting robustness (ρ = 0.985)
│   ├── tech_ablation.py        Resynth under SG13G2 / OSU 0.35
│   ├── plot_*.py               Paper figure generators
│   ├── run_all.py              End-to-end driver
│   ├── task_manifest.csv       202-task ID → category mapping
│   └── setup_external.sh       Fetches RTLLM + tech libs + eval data
├── tests/smoke_test.py         Synthetic-fixture pipeline test (no LLM)
├── docs/
│   ├── progress.md             Engineering log
│   └── failure_analysis.md     9-subtype synthesis-failure taxonomy
├── requirements.txt
├── LICENSE                     MIT
└── README.md

Quick start

# 1. Install dependencies
pip install -r requirements.txt

# 2. Smoke-test the analysis pipeline (no external data, runs in <1 s)
python tests/smoke_test.py

# 3. Fetch external dependencies (RTLLM, tech libraries, evaluation data)
bash scripts/setup_external.sh

# 4. Reproduce paper figures from the released eval data
python scripts/run_all.py

External tooling: SynthAgentic shells out to yosys (synthesis) and iverilog (parse + simulate). Set YOSYS=/path/to/yosys and IVERILOG=/path/to/iverilog if they are not on $PATH.
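The lookup order described above (environment override first, then $PATH) can be sketched in Python as follows. This is a minimal illustration, not the repository's actual code: the helper name `resolve_tool` is hypothetical, and only the `YOSYS` and `IVERILOG` variable names come from this README.

```python
import os
import shutil
from typing import Optional


def resolve_tool(env_var: str, default_name: str) -> Optional[str]:
    """Return a path for a tool, preferring the environment override.

    Hypothetical helper: the env var wins if set and non-empty,
    otherwise fall back to searching $PATH for the default name.
    """
    override = os.environ.get(env_var)
    if override:
        return override
    return shutil.which(default_name)  # None when the tool is not on $PATH


if __name__ == "__main__":
    for env_var, name in (("YOSYS", "yosys"), ("IVERILOG", "iverilog")):
        path = resolve_tool(env_var, name)
        print(f"{name}: {path or 'not found'}")
```

Running the snippet before the pipeline is a quick way to check that both tools resolve to the binaries you expect.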

Data

The paper's raw evaluation outputs are not vendored in this repo. They live as a HuggingFace dataset:

KSU-HW-SEC/SynthAgentic-Eval

scripts/setup_external.sh downloads them into eval_results/, full_baseline_stats/, and scripts/data/. Contents:

| Bundle | Content |
|---|---|
| eval_results/ | Per-attempt CSVs for each (model, iter), 32 × 5 = 160 files plus per-iter summary JSONs |
| full_baseline_stats/ | 788 golden-reference Verilog modules (the references HQI is anchored to) |
| scripts/data/golden_reference.csv | Per-task golden synthesis stats (area, delay, warnings, AST complexity) |
| scripts/data/inference_log.csv | OpenRouter generation log (cost, tokens, TTFT, throughput) |

Third-party code that this framework depends on but does not redistribute:

| Resource | Source | Why we don't vendor |
|---|---|---|
| RTLLM benchmark | https://github.com/hkust-zhiyao/RTLLM | Upstream repo with its own license |
| Nangate45 cell library | OpenROAD-flow-scripts | Upstream PDK, separate license |
| IHP SG13G2 PDK | https://github.com/IHP-GmbH/IHP-Open-PDK | Upstream PDK |
| OSU 0.35 µm library | Oklahoma State VLSI Arch | Manual download required |

Hardware Quality Index (HQI)

For a passing design on task t with post-synthesis area â, delay d̂, and warning count ŵ, scored against the golden reference values a*ₜ, d*ₜ, and w*ₜ:

cost = 0.5 · (â / a*ₜ) + 0.5 · (d̂ / d*ₜ) + 0.1 · max(0, ŵ − w*ₜ)
HQI  = min(100 / cost, 100)

Designs that fail any pipeline gate (syntax / synthesis / simulation) score 0. Tasks without a valid golden reference contribute to coverage only and are excluded from HQI aggregates.
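The formula and the gate rule above can be sketched as a single Python function. This is an illustrative transcription of the README formula, not the code in compute_metrics.py; the function name, argument names, and the `passed` flag are assumptions, and golden area and delay are assumed to be strictly positive.

```python
def hqi(area: float, delay: float, warnings: int,
        gold_area: float, gold_delay: float, gold_warnings: int,
        passed: bool = True) -> float:
    """Hardware Quality Index for one attempt (illustrative sketch).

    `passed` is a hypothetical flag meaning the design cleared all
    pipeline gates (syntax, synthesis, simulation).
    """
    if not passed:
        return 0.0  # any failed gate scores 0
    cost = (0.5 * (area / gold_area)
            + 0.5 * (delay / gold_delay)
            + 0.1 * max(0, warnings - gold_warnings))
    return min(100.0 / cost, 100.0)  # cap at 100: beating golden earns no bonus


# Matching the golden reference exactly gives cost = 1 and HQI = 100.
print(hqi(100.0, 2.0, 0, 100.0, 2.0, 0))
# Doubling the area at equal delay gives cost = 1.5.
print(round(hqi(200.0, 2.0, 0, 100.0, 2.0, 0), 2))
```

Because of the cap, a design with half the golden area and equal delay (cost 0.75) still scores 100, and only warnings in excess of the golden count are penalized.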

Per-model headline aggregates:

  • Coverage — complexity-weighted task-solve rate (any of 5 attempts passes)
  • Global HQI — best-of-five HQI ceiling, complexity-weighted
  • Expected HQI — single-attempt mean HQI, complexity-weighted
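One plausible reading of the three aggregates above is a complexity-weighted mean over tasks, sketched below. The `TaskResult` type, its field names, and the choice to report coverage as a fraction in [0, 1] are assumptions made for illustration; the repository's own implementation may differ in detail.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class TaskResult:
    weight: float                # complexity weight (hypothetical field)
    passed: List[bool]           # one pass/fail flag per attempt
    hqi: Optional[List[float]]   # per-attempt HQI, None if no valid golden reference


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total if total else 0.0


def aggregate(tasks: List[TaskResult]) -> Dict[str, float]:
    # Coverage uses every task: solved if any attempt passes.
    coverage = weighted_mean([1.0 if any(t.passed) else 0.0 for t in tasks],
                             [t.weight for t in tasks])
    # HQI aggregates exclude tasks without a golden reference.
    scored = [t for t in tasks if t.hqi is not None]
    w = [t.weight for t in scored]
    global_hqi = weighted_mean([max(t.hqi) for t in scored], w)
    expected_hqi = weighted_mean([sum(t.hqi) / len(t.hqi) for t in scored], w)
    return {"coverage": coverage, "global_hqi": global_hqi,
            "expected_hqi": expected_hqi}


demo = [
    TaskResult(weight=1.0, passed=[True, False], hqi=[80.0, 0.0]),
    TaskResult(weight=3.0, passed=[False, False], hqi=[0.0, 0.0]),
    TaskResult(weight=2.0, passed=[True, True], hqi=None),  # coverage only
]
print(aggregate(demo))
```

The demo uses two attempts per task for brevity; the paper's setup uses five.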

Rankings are robust to weight choices: Spearman ρ ≥ 0.997 across ten coefficient configurations, ρ ≥ 0.985 with equal task weighting (max rank displacement 3 positions across 32 models).
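For readers unfamiliar with the robustness statistics quoted above, the sketch below computes Spearman's ρ and the maximum rank displacement between two score lists in plain Python. It is a generic illustration, not the code in hqi_sensitivity.py; the function names and the toy scores are hypothetical, and the inputs are assumed not to be all tied.

```python
from typing import List, Sequence


def rankdata(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; ties receive the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def max_rank_displacement(x: Sequence[float], y: Sequence[float]) -> float:
    """Largest change in rank position for any item between x and y."""
    return max(abs(a - b) for a, b in zip(rankdata(x), rankdata(y)))


# Toy scores for four hypothetical models under two weightings.
baseline = [71.0, 64.5, 58.2, 40.1]
alternative = [70.3, 65.0, 39.8, 41.0]
print(spearman(baseline, alternative), max_rank_displacement(baseline, alternative))
```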

Citation

@inproceedings{fu2026synthagentic,
  title     = {Synthesis-in-the-Loop Evaluation of {LLMs} for {RTL} Generation:
               Quality, Reliability, and Failure Modes},
  author    = {Fu, Weimin and Wang, Zeng and Shao, Minghao and Karri, Ramesh and
               Shafique, Muhammad and Knechtel, Johann and Sinanoglu, Ozgur and
               Guo, Xiaolong},
  booktitle = {Proceedings of the Great Lakes Symposium on VLSI 2026 (GLSVLSI '26)},
  year      = {2026},
  doi       = {10.1145/3787109.3815245}
}

License

MIT (see LICENSE). Third-party benchmarks and PDKs retain their own licenses; SynthAgentic does not redistribute them.
