huggingface/cadgenbench

CADGenBench

CADGenBench measures how well AI systems produce correct 3D mechanical parts. It covers two tasks:

  • Generation: from an engineering drawing of a part, produce a valid, geometrically correct 3D model.
  • Editing: given an existing STEP file and a requested change, apply that change.

The benchmark is tool-agnostic. It makes no assumption about how you build the model (build123d, Autodesk Fusion, Onshape): a submission is one STEP file per sample. Each sample declares its task type (generation or editing) in description.yaml, and the same metrics and output.step contract apply to both.

Submit and view the leaderboard: HuggingAI4Engineering/CADGenBench.

What this repo contains

This GitHub repo is the source code behind the benchmark. You do not need to install it to participate. It contains three parts:

  • Scoring engine (src/cadgenbench/eval/): the CAD Score pipeline the leaderboard Space runs server-side against your submitted STEP files.
  • Docs (docs/): metric definitions and the submission contract.
  • Reference baseline (src/cadgenbench/baseline/): an optional example generator that turns a sample's description into a submission (iteratively writes build123d Python, validates the STEP, and repeats until valid).

Evaluation runs on the Space, scoring submissions against the privately held ground truth in cadgenbench-data-gt.

How to submit

Full contract (zip layout, meta.json fields, validity gate, optional canonical pose) is at docs/benchmark/submission.md. In short:

  1. For each sample in cadgenbench-data, produce an output.step. Any tool works.
  2. Zip them as submission.zip with one folder per sample plus a small meta.json at the root.
  3. Upload via the Submit tab on the leaderboard Space.
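
The zip layout in step 2 can be sketched in a few lines of Python. This is a minimal illustration, not the official packager: the function name `build_submission_zip` is hypothetical, and the keys passed in `meta` are placeholders, since the real meta.json fields are defined in docs/benchmark/submission.md.

```python
import json
import zipfile
from pathlib import Path


def build_submission_zip(run_dir: Path, zip_path: Path, meta: dict) -> list[str]:
    """Zip every <sample>/output.step under run_dir plus a root meta.json.

    Returns the sorted sample names that were included. Folders without
    an output.step are skipped.
    """
    samples = sorted(
        p.parent.name for p in run_dir.glob("*/output.step") if p.is_file()
    )
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # meta.json sits at the zip root. Its contents here are whatever the
        # caller passes in; the required fields are an assumption left to
        # docs/benchmark/submission.md.
        zf.writestr("meta.json", json.dumps(meta, indent=2))
        for name in samples:
            # One folder per sample, each holding a single output.step.
            zf.write(run_dir / name / "output.step", arcname=f"{name}/output.step")
    return samples
```

Comparing the returned list against the dataset's folder names is a quick way to spot samples that were left out of the archive.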

The Space validates the zip, runs the eval, publishes a leaderboard row, and writes a per-submission HTML report you can share or download.

Rows publish as unvalidated; promotion to a validated tier is a separate methodology review by the maintainer team. See docs/benchmark/validation.md for the review process and accepted evidence types.

A sanity_check_submission.py script shipped alongside the samples in cadgenbench-data lets you exercise the same validity gate locally before uploading; see docs/benchmark/submission.md#self-check-before-submitting.

Metrics

The Space scores each candidate STEP against ground truth on four axes:

| Metric | What it captures |
| --- | --- |
| Validity | Is the BREP well-formed, watertight, tessellable? Gate: failure zeroes the rest. |
| Shape similarity | Geometry distance (surface distance F1, volume IoU). |
| Interface match | Mating-feature correctness via authored keep-in / keep-out sub-volumes. |
| Topology match | Betti numbers (b0, b1, b2) of the tessellated boundary. |
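
To make the topology axis concrete, here is a small sketch of how Betti numbers can be read off a closed triangle mesh. It is not the benchmark's implementation (see docs/metrics/ for that). It assumes a watertight, orientable surface where every edge is shared by exactly two triangles, and relies on the Euler characteristic identity V - E + F = b0 - b1 + b2. The function name `betti_numbers` is hypothetical.

```python
def betti_numbers(triangles: list[tuple[int, int, int]]) -> tuple[int, int, int]:
    """Betti numbers (b0, b1, b2) of a closed, orientable triangle mesh.

    Assumes a watertight surface (each edge shared by exactly two
    triangles). Illustration only, not the scoring engine's code.
    """
    vertices = {v for tri in triangles for v in tri}
    edges = {
        frozenset(e)
        for a, b, c in triangles
        for e in ((a, b), (b, c), (c, a))
    }

    # Union-find over vertices to count connected components (b0).
    parent = {v: v for v in vertices}

    def find(v: int) -> int:
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for edge in edges:
        a, b = tuple(edge)
        parent[find(a)] = find(b)

    b0 = len({find(v) for v in vertices})
    b2 = b0  # each closed orientable component contributes one 2-cycle
    chi = len(vertices) - len(edges) + len(triangles)  # Euler characteristic
    b1 = b0 + b2 - chi  # rearranged from chi = b0 - b1 + b2
    return b0, b1, b2
```

For a single closed surface, b1 equals twice the genus, so a through-hole that is missing or added by mistake changes b1 by two even when the overall shape is close.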

The CAD Score is a weighted combination of the applicable component scores, gated by validity. See docs/metrics.md for the full specification and docs/metrics/ for the per-axis details.

Reference baseline (optional)

The reference baseline is an iterative agent that writes build123d Python, renders the resulting STEP, and reviews those renders to refine its code in a loop until the part is valid. Use it to see what an end-to-end run looks like, or as a starting point for your own generator. It targets Python 3.12 and installs entirely via pip.

# 1. Python 3.12 env (venv, uv, conda, etc.)
python -m venv .venv && source .venv/bin/activate

# 2. Editable install with the baseline + dev extras
pip install -e ".[baseline,dev]"

# 3. Provider API keys for whichever model(s) you plan to run
cp .env.example .env   # then fill in ANTHROPIC_API_KEY, OPENAI_API_KEY, etc.

# 4. Point at the public sample-inputs dataset on the Hub. cadgenbench
# snapshot-downloads it on first use and caches under
# ~/.cache/huggingface/hub/.
export CADGENBENCH_DATA_REPO=HuggingAI4Engineering/cadgenbench-data

Rendering (per-turn visual feedback to the agent) is in-process via PyVista/VTK; no Chromium or browser install is needed. On a bare headless Linux box VTK needs system OpenGL libs (e.g. libgl1 / Mesa); macOS works out of the box.

Verify:

cadgenbench --help
pytest tests/ -q

cgb is a shorter alias.

Run on one sample, or in parallel on all of them:

# Single sample (sample names are the dataset's folder names, e.g. 101)
cadgenbench baseline run 101 --model openai/gpt-5.5

# All samples, in parallel
cadgenbench baseline run --all --parallel 4 --model openai/gpt-5.5

Using a different LLM. --model takes any LiteLLM provider/model string; just set the matching key in .env. For example:

cadgenbench baseline run 101 --model anthropic/claude-opus-4-7   # ANTHROPIC_API_KEY
cadgenbench baseline run 101 --model gemini/gemini-3.1-pro-preview  # GEMINI_API_KEY
cadgenbench baseline run 101 --model openai/gpt-5.5             # OPENAI_API_KEY

See cadgenbench baseline run --help for the full flag set.

Output lands at results/<timestamp>_<model_slug>/<sample>/output.step. The baseline only generates candidates; scoring against ground truth happens on the leaderboard Space after you submit.
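
Given that layout, a quick pre-packaging check over a run directory might look like the sketch below. It only confirms that each expected sample has an output.step and that the file begins with the `ISO-10303-21;` header that STEP Part 21 files start with. It is far weaker than the validity gate and does not replace sanity_check_submission.py. The function name `check_run_outputs` is hypothetical.

```python
from pathlib import Path


def check_run_outputs(run_dir: Path, sample_names: list[str]) -> dict[str, str]:
    """Return {sample: problem} for samples whose output.step looks wrong.

    Checks presence and the STEP Part 21 header only; it does not
    validate geometry.
    """
    problems: dict[str, str] = {}
    for name in sample_names:
        step = run_dir / name / "output.step"
        if not step.is_file():
            problems[name] = "missing output.step"
            continue
        with step.open("rb") as fh:
            head = fh.read(64)
        if not head.lstrip().startswith(b"ISO-10303-21;"):
            problems[name] = "does not start with the ISO-10303-21 header"
    return problems
```

An empty result means every listed sample has a file that at least looks like a STEP exchange file.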

Bundle a run directory into a submission zip (top-level meta.json + one output.step per sample, per the submission contract):

cadgenbench baseline package results/20260602_120000_gpt-5.5 \
    --submitter "Your Name" --name "My agent v1" --agree

Writes <run_dir>.zip, ready to upload on the Space's Submit tab. agree_to_publish stays false unless you pass --agree.

Dataset

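Samples live in two HF dataset repos:

  • cadgenbench-data (HuggingAI4Engineering/cadgenbench-data): the public sample inputs you generate against.
  • cadgenbench-data-gt: the privately held ground truth the Space scores against.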

License

Apache-2.0. See LICENSE.
