CADGenBench measures how well AI systems produce correct 3D mechanical parts. It covers two tasks:
- Generation: from an engineering drawing of a part, produce a valid, geometrically correct 3D model.
- Editing: given an existing STEP file and a requested change, apply that change.
The benchmark is tool-agnostic. It makes no assumption about how you
build the model (build123d, Autodesk Fusion, Onshape): a submission
is one STEP file per sample. Each sample declares its task type
(generation or editing) in description.yaml, and the same metrics
and output.step contract apply to both.
Submit and view the leaderboard:
HuggingAI4Engineering/CADGenBench.
This GitHub repo is the source code behind the benchmark. You do not need to install it to participate. It contains three parts:
- Scoring engine (src/cadgenbench/eval/): the CAD Score pipeline the leaderboard Space runs server-side against your submitted STEP files.
- Docs (docs/): metric definitions and the submission contract.
- Reference baseline (src/cadgenbench/baseline/): an optional example generator that turns a sample's description into a submission (iteratively writes build123d Python, validates the STEP, and repeats until valid).
Evaluation runs on the Space, scoring submissions against the privately
held ground truth in
cadgenbench-data-gt.
Full contract (zip layout, meta.json fields, validity gate, optional
canonical pose) is at
docs/benchmark/submission.md. In
short:
- For each sample in cadgenbench-data, produce an output.step. Any tool works.
- Zip them as submission.zip with one folder per sample plus a small meta.json at the root.
- Upload via the Submit tab on the leaderboard Space.
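If you are not using the baseline's packaging command, the layout described in the steps above can be assembled by hand. The following is a minimal sketch, not the official packager: `build_submission` is a hypothetical helper, and apart from `agree_to_publish` the `meta.json` fields are not spelled out here, so check docs/benchmark/submission.md for the authoritative list before uploading.

```python
import json
import zipfile
from pathlib import Path


def build_submission(run_dir, zip_path, meta):
    """Zip <run_dir>/<sample>/output.step files plus a root-level meta.json.

    `meta` is written verbatim; its required fields are defined by the
    submission contract, not by this sketch.
    Returns the sample names that were packed.
    """
    run_dir = Path(run_dir)
    # One folder per sample; folders without an output.step are skipped.
    steps = sorted(run_dir.glob("*/output.step"))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("meta.json", json.dumps(meta, indent=2))
        for step in steps:
            zf.write(step, arcname=f"{step.parent.name}/output.step")
    return [step.parent.name for step in steps]
```

Calling `build_submission("my_outputs", "submission.zip", {"agree_to_publish": False})` would pack every sample folder under `my_outputs` that contains an `output.step`. Comparing the returned list against the dataset's sample names is a quick way to spot samples your generator skipped.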
The Space validates the zip, runs the eval, publishes a leaderboard row, and writes a per-submission HTML report you can share or download.
Rows publish as unvalidated; promotion to a validated tier is a
separate methodology review by the maintainer team. See
docs/benchmark/validation.md for the
review process and accepted evidence types.
A sanity_check_submission.py script shipped alongside the samples in
cadgenbench-data lets you exercise the same validity gate locally
before uploading; see
docs/benchmark/submission.md#self-check-before-submitting.
The Space scores each candidate STEP against ground truth on four axes:
| Metric | What it captures |
|---|---|
| Validity | Is the BREP well-formed, watertight, tessellable? Gate: failure zeroes the rest. |
| Shape similarity | Geometry distance (surface distance F1, volume IoU). |
| Interface match | Mating-feature correctness via authored keep-in / keep-out sub-volumes. |
| Topology match | Betti numbers (b0, b1, b2) of the tessellated boundary. |
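To make the topology row concrete, here is a rough sketch of how Betti numbers can be read off a triangle mesh. It is not the benchmark's implementation: `betti_numbers` is a hypothetical helper, and it assumes the tessellated boundary is a closed, orientable surface, which is the case a watertight part should satisfy. Under that assumption b0 is the number of connected components, b2 equals b0, and b1 follows from the Euler characteristic V - E + F = b0 - b1 + b2.

```python
from itertools import combinations


def betti_numbers(triangles):
    """Return (b0, b1, b2) for a closed, orientable triangle mesh.

    `triangles` is a list of vertex-index triples.
    """
    vertices = {v for tri in triangles for v in tri}
    edges = {frozenset(pair) for tri in triangles for pair in combinations(tri, 2)}

    # Union-find over vertices to count connected components (b0).
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for edge in edges:
        a, b = tuple(edge)
        parent[find(a)] = find(b)

    b0 = len({find(v) for v in vertices})
    b2 = b0  # assumption: every component is closed and orientable
    euler = len(vertices) - len(edges) + len(triangles)
    b1 = b0 + b2 - euler
    return b0, b1, b2


# A tetrahedron is topologically a sphere: one component, no handles.
tetrahedron = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
print(betti_numbers(tetrahedron))  # (1, 0, 1)
```

A part with one through-hole has a torus-like boundary, for which the same function gives b1 = 2, so a missing or extra hole shows up as a change in b1.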
The CAD Score is a weighted combination of the applicable component
scores, gated by validity. See docs/metrics.md for
the full specification and docs/metrics/ for the
per-axis details.
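The gating and weighting described above can be sketched as follows. This is an illustration only: the weights, the component names, and the choice to renormalise over the applicable components are all assumptions made for the example, and docs/metrics.md remains the authoritative definition.

```python
def cad_score(valid, components, weights):
    """Weighted mean of the applicable component scores, zeroed if invalid.

    `components` maps a component name to a score in [0, 1], or to None
    when that component does not apply to the sample.
    """
    if not valid:
        return 0.0  # validity gate: a failed check zeroes everything else
    applicable = {name: s for name, s in components.items() if s is not None}
    total_weight = sum(weights[name] for name in applicable)
    if total_weight == 0:
        return 0.0
    return sum(weights[name] * s for name, s in applicable.items()) / total_weight


# Hypothetical weights, chosen only for this example.
weights = {"shape": 0.5, "interface": 0.3, "topology": 0.2}
scores = {"shape": 0.8, "interface": None, "topology": 1.0}

print(cad_score(True, scores, weights))
print(cad_score(False, scores, weights))
```

In this sketch a sample with no interface sub-volumes is scored on shape and topology alone, and an invalid STEP scores zero regardless of how close its geometry is.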
The reference baseline is an iterative agent that writes build123d
Python, renders the resulting STEP, and reviews those renders to refine
its code in a loop until the part is valid. Use it to see what an
end-to-end run looks like, or as a starting point for your own
generator. It targets Python 3.12 and installs entirely via pip.
# 1. Python 3.12 env (venv, uv, conda, etc.)
python -m venv .venv && source .venv/bin/activate
# 2. Editable install with the baseline + dev extras
pip install -e ".[baseline,dev]"
# 3. Provider API keys for whichever model(s) you plan to run
cp .env.example .env # then fill in ANTHROPIC_API_KEY, OPENAI_API_KEY, etc.
# 4. Point at the public sample-inputs dataset on the Hub. cadgenbench
# snapshot-downloads it on first use and caches under
# ~/.cache/huggingface/hub/.
export CADGENBENCH_DATA_REPO=HuggingAI4Engineering/cadgenbench-data
Rendering (per-turn visual feedback to the agent) is in-process via
PyVista/VTK; no Chromium or browser install is needed. On a bare
headless Linux box VTK needs system OpenGL libs (e.g. libgl1 /
Mesa); macOS works out of the box.
Verify:
cadgenbench --help
pytest tests/ -q
cgb is a shorter alias.
Run on one sample, or in parallel on all of them:
# Single sample (sample names are the dataset's folder names, e.g. 101)
cadgenbench baseline run 101 --model openai/gpt-5.5
# All samples, in parallel
cadgenbench baseline run --all --parallel 4 --model openai/gpt-5.5
Using a different LLM. --model takes any
LiteLLM provider/model
string; just set the matching key in .env. For example:
cadgenbench baseline run 101 --model anthropic/claude-opus-4-7 # ANTHROPIC_API_KEY
cadgenbench baseline run 101 --model gemini/gemini-3.1-pro-preview # GEMINI_API_KEY
cadgenbench baseline run 101 --model openai/gpt-5.5 # OPENAI_API_KEY
See cadgenbench baseline run --help for the full flag set.
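Because a missing key only surfaces once the first model call fails, a small pre-flight check can save a wasted run. The sketch below is not part of cadgenbench: `required_key` and `key_is_set` are hypothetical helpers, and the provider-to-variable mapping covers only the three providers shown in the examples above.

```python
import os

# Mapping taken from the three examples above; other providers are not covered.
PROVIDER_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def required_key(model):
    """Return the env var name for a provider/model string, or None if unknown."""
    provider, sep, name = model.partition("/")
    if not sep or not name:
        raise ValueError(f"expected 'provider/model', got {model!r}")
    return PROVIDER_KEYS.get(provider)


def key_is_set(model, env=os.environ):
    """True when the key for this model's provider is present and non-empty."""
    key = required_key(model)
    return key is not None and bool(env.get(key))
```

For example, `key_is_set("openai/gpt-5.5")` returns False until OPENAI_API_KEY is exported or loaded from .env into the environment.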
Output lands at results/<timestamp>_<model_slug>/<sample>/output.step.
The baseline only generates candidates; scoring against ground truth
happens on the leaderboard Space after you submit.
Bundle a run directory into a submission zip (top-level meta.json +
one output.step per sample, per the submission contract):
cadgenbench baseline package results/20260602_120000_gpt-5.5 \
  --submitter "Your Name" --name "My agent v1" --agree
Writes <run_dir>.zip, ready to upload on the Space's Submit tab.
agree_to_publish stays false unless you pass --agree.
Samples live in two HF dataset repos:
- HuggingAI4Engineering/cadgenbench-data: public; inputs (descriptions, optional input STEPs and renders) for every sample, plus the sanity_check_submission.py helper.
- HuggingAI4Engineering/cadgenbench-data-gt: private; ground truth (ground_truth.step, optional jig sub-volumes, renders) and the labeller-facing AUTHORING.md / sanity-check scripts. Only the leaderboard Space reads from it.
Apache-2.0. See LICENSE.