CADGenBench

CADGenBench measures how well AI systems produce correct 3D mechanical parts. It covers two tasks:

Generation: from an engineering drawing of a part, produce a valid, geometrically correct 3D model.
Editing: given an existing STEP file and a requested change, apply that change.

The benchmark is tool-agnostic. It makes no assumption about how you build the model (build123d, Autodesk Fusion, Onshape): a submission is one STEP file per sample. Each sample declares its task type (generation or editing) in description.yaml, and the same metrics and output.step contract apply to both.

Submit and view the leaderboard on the HuggingAI4Engineering/CADGenBench Space.

What this repo contains

This GitHub repo is the source code behind the benchmark. You do not need to install it to participate. It contains three parts:

Scoring engine (src/cadgenbench/eval/): the CAD Score pipeline the leaderboard Space runs server-side against your submitted STEP files.
Docs (docs/): metric definitions and the submission contract.
Reference baseline (src/cadgenbench/baseline/): an optional example generator that turns a sample's description into a submission (iteratively writes build123d Python, validates the STEP, and repeats until valid).

Evaluation runs on the Space, scoring submissions against the privately held ground truth in cadgenbench-data-gt.

How to submit

The full contract (zip layout, meta.json fields, validity gate, optional canonical pose) is in docs/benchmark/submission.md. In short:

For each sample in cadgenbench-data, produce an output.step. Any tool works.
Zip them as submission.zip with one folder per sample plus a small meta.json at the root.
Upload via the Submit tab on the leaderboard Space.
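
The zip step above can be sketched with the standard library alone. This is a minimal illustration, not the packaging code the repo ships: the function name build_submission_zip and the input layout (one folder per sample holding output.step) are assumptions, and the meta.json contents are passed in as-is because the authoritative field list lives in docs/benchmark/submission.md.

```python
import json
import zipfile
from pathlib import Path
from typing import List


def build_submission_zip(run_dir: Path, zip_path: Path, meta: dict) -> List[str]:
    """Zip <run_dir>/<sample>/output.step files plus a root-level meta.json.

    Hypothetical helper: returns the sample names that were included.
    """
    included = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # meta.json sits at the root of the archive.
        zf.writestr("meta.json", json.dumps(meta, indent=2))
        for sample_dir in sorted(p for p in run_dir.iterdir() if p.is_dir()):
            step = sample_dir / "output.step"
            if not step.is_file():
                continue  # no candidate for this sample, leave it out
            # One folder per sample, each holding its output.step.
            zf.write(step, arcname=f"{sample_dir.name}/output.step")
            included.append(sample_dir.name)
    return included
```

Skipping samples without an output.step is a choice made for this sketch; check the submission contract for how missing samples are treated.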

The Space validates the zip, runs the eval, publishes a leaderboard row, and writes a per-submission HTML report you can share or download.

Rows are published as unvalidated; promotion to a validated tier requires a separate methodology review by the maintainer team. See docs/benchmark/validation.md for the review process and accepted evidence types.

A sanity_check_submission.py script shipped alongside the samples in cadgenbench-data lets you exercise the same validity gate locally before uploading; see docs/benchmark/submission.md#self-check-before-submitting.
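
For a sense of the kind of structural check that can run before uploading, the sketch below inspects a zip for a root meta.json and one output.step per sample folder. It is not the shipped sanity_check_submission.py and does not reproduce the Space's validity gate (it never parses STEP geometry); the function name check_zip_layout and the exact rules are assumptions based on the layout summarized above.

```python
import json
import zipfile
from pathlib import PurePosixPath
from typing import List


def check_zip_layout(zip_path) -> List[str]:
    """Return human-readable layout problems (empty list if none found)."""
    problems = []
    with zipfile.ZipFile(zip_path) as zf:
        # Ignore explicit directory entries; only files matter here.
        names = [n for n in zf.namelist() if not n.endswith("/")]
        if "meta.json" not in names:
            problems.append("missing meta.json at the archive root")
        else:
            try:
                json.loads(zf.read("meta.json"))
            except ValueError:  # covers bad JSON and bad encodings
                problems.append("meta.json is not valid JSON")
        # Group the remaining files by top-level folder (the sample name).
        by_sample = {}
        for name in names:
            parts = PurePosixPath(name).parts
            if len(parts) >= 2:
                by_sample.setdefault(parts[0], []).append("/".join(parts[1:]))
        if not by_sample:
            problems.append("no sample folders found")
        for sample, files in sorted(by_sample.items()):
            if "output.step" not in files:
                problems.append(f"{sample}: missing output.step")
    return problems
```

An empty list only means the archive layout looks right; the real script and the Space still decide whether each STEP passes the validity gate.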

Metrics

The Space scores each candidate STEP against ground truth on four axes:

Metric	What it captures
Validity	Is the BREP well-formed, watertight, tessellable? Gate: failure zeroes the rest.
Shape similarity	Geometry distance (surface distance F1, volume IoU).
Interface match	Mating-feature correctness via authored keep-in / keep-out sub-volumes.
Topology match	Betti numbers (b0, b1, b2) of the tessellated boundary.
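
As an illustration of what the topology axis compares, the Betti numbers of a closed triangle mesh can be recovered from its Euler characteristic. The sketch below is not the scoring engine's implementation (see docs/metrics/ for that). It assumes every connected component of the tessellated boundary is a closed, orientable surface, in which case b0 = b2 = number of components and b1 = b0 + b2 - chi, with chi = V - E + F. The function name betti_numbers is hypothetical.

```python
from typing import List, Tuple


def betti_numbers(num_vertices: int, triangles: List[Tuple[int, int, int]]) -> Tuple[int, int, int]:
    """Betti numbers (b0, b1, b2) of a closed, orientable triangle mesh.

    Assumes every vertex is used by some triangle and every connected
    component is a closed orientable surface.
    """
    parent = list(range(num_vertices))

    def find(v: int) -> int:
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = set()
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            edges.add((min(u, v), max(u, v)))  # undirected, deduplicated
            parent[find(u)] = find(v)          # merge the two components

    components = len({find(v) for v in range(num_vertices)})
    euler = num_vertices - len(edges) + len(triangles)
    b0 = b2 = components
    b1 = b0 + b2 - euler
    return b0, b1, b2


# A tetrahedron is a topological sphere.
tetra = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
print(betti_numbers(4, tetra))  # (1, 0, 1)
```

Under the same assumption, a solid with one through hole has a genus-1 boundary, so b1 = 2; a missing or extra hole therefore shows up as a change in b1.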

The CAD Score is a weighted combination of the applicable component scores, gated by validity. See docs/metrics.md for the full specification and docs/metrics/ for the per-axis details.
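
The gating and weighting described above can be sketched as follows. The weights, the [0, 1] score range, the renormalisation over applicable axes, and the name cad_score are illustrative assumptions, not the values or code the leaderboard uses; docs/metrics.md is the specification.

```python
from typing import Dict, Optional


def cad_score(valid: bool,
              components: Dict[str, Optional[float]],
              weights: Dict[str, float]) -> float:
    """Weighted mean of the applicable component scores, zeroed if invalid.

    A component set to None is treated as not applicable and its weight
    is dropped before renormalising (an assumption of this sketch).
    """
    if not valid:
        return 0.0  # validity gate: failure zeroes everything else
    applicable = {k: v for k, v in components.items() if v is not None}
    total_weight = sum(weights[k] for k in applicable)
    if total_weight == 0:
        return 0.0
    return sum(weights[k] * v for k, v in applicable.items()) / total_weight


# Placeholder weights, for illustration only.
weights = {"shape": 0.5, "interface": 0.3, "topology": 0.2}
scores = {"shape": 0.8, "interface": None, "topology": 1.0}
print(cad_score(True, scores, weights))   # interface axis skipped
print(cad_score(False, scores, weights))  # 0.0
```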

Reference baseline (optional)

The reference baseline is an iterative agent that writes build123d Python, renders the resulting STEP, and reviews those renders to refine its code in a loop until the part is valid. Use it to see what an end-to-end run looks like, or as a starting point for your own generator. It targets Python 3.12 and installs entirely via pip.

# 1. Python 3.12 env (venv, uv, conda, etc.)
python -m venv .venv && source .venv/bin/activate

# 2. Editable install with the baseline + dev extras
pip install -e ".[baseline,dev]"

# 3. Provider API keys for whichever model(s) you plan to run
cp .env.example .env   # then fill in ANTHROPIC_API_KEY, OPENAI_API_KEY, etc.

# 4. Point at the public sample-inputs dataset on the Hub. cadgenbench
# snapshot-downloads it on first use and caches under
# ~/.cache/huggingface/hub/.
export CADGENBENCH_DATA_REPO=HuggingAI4Engineering/cadgenbench-data

Rendering (per-turn visual feedback to the agent) is in-process via PyVista/VTK; no Chromium or browser install is needed. On a bare headless Linux box VTK needs system OpenGL libs (e.g. libgl1 / Mesa); macOS works out of the box.

Verify:

cadgenbench --help
pytest tests/ -q

cgb is a shorter alias.

Run on one sample, or in parallel on all of them:

# Single sample (sample names are the dataset's folder names, e.g. 101)
cadgenbench baseline run 101 --model openai/gpt-5.5

# All samples, in parallel
cadgenbench baseline run --all --parallel 4 --model openai/gpt-5.5

Using a different LLM. --model takes any LiteLLM provider/model string; just set the matching key in .env. For example:

cadgenbench baseline run 101 --model anthropic/claude-opus-4-7   # ANTHROPIC_API_KEY
cadgenbench baseline run 101 --model gemini/gemini-3.1-pro-preview  # GEMINI_API_KEY
cadgenbench baseline run 101 --model openai/gpt-5.5             # OPENAI_API_KEY

See cadgenbench baseline run --help for the full flag set.

Output lands at results/<timestamp>_<model_slug>/<sample>/output.step. The baseline only generates candidates; scoring against ground truth happens on the leaderboard Space after you submit.
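
To see which samples in a run actually produced a candidate, the output layout above can be walked with pathlib. This is a small convenience sketch, not part of the cadgenbench CLI; the function name summarize_run is hypothetical, and it assumes the run directory contains one folder per attempted sample (samples with no folder at all are not detected).

```python
from pathlib import Path
from typing import List, Tuple


def summarize_run(run_dir: Path) -> Tuple[List[str], List[str]]:
    """Split sample folders into those with and without an output.step."""
    done, missing = [], []
    for sample_dir in sorted(p for p in run_dir.iterdir() if p.is_dir()):
        if (sample_dir / "output.step").is_file():
            done.append(sample_dir.name)
        else:
            missing.append(sample_dir.name)
    return done, missing


# Example call on a run directory:
# done, missing = summarize_run(Path("results/20260602_120000_gpt-5.5"))
```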

Bundle a run directory into a submission zip (top-level meta.json + one output.step per sample, per the submission contract):

cadgenbench baseline package results/20260602_120000_gpt-5.5 \
    --submitter "Your Name" --name "My agent v1" --agree

The command writes <run_dir>.zip, ready to upload on the Space's Submit tab. agree_to_publish stays false unless you pass --agree.

Dataset

Samples live in two HF dataset repos:

HuggingAI4Engineering/cadgenbench-data: public; inputs (descriptions, optional input STEPs and renders) for every sample, plus the sanity_check_submission.py helper.
HuggingAI4Engineering/cadgenbench-data-gt: private; ground truth (ground_truth.step, optional jig sub-volumes, renders) and the labeller-facing AUTHORING.md / sanity-check scripts. Only the leaderboard Space reads from it.

License

Apache-2.0. See LICENSE.
