Procedural generator for diverse, valid, parametric CadQuery programs and the geometry they produce.
Every generated part is a pure Python function of its parameters — tweak any slider and the geometry updates in milliseconds, no regeneration, no ML. CadQuarry exists to be a clean, freely-usable source of CAD program data: permissively licensed code, CC0 data.
Most families emit self-contained programs that run on
cadquery alone. The two mechanical families, gear and threaded, emit programs built on the build123d stack (py_gearworks, bd_warehouse) and bridged back into CadQuery, so they require the optional mech extra to execute (see Setup).
CadQuarry is split across two homes so that the code/data license split maps onto the platforms instead of needing prose:
| | Home | License | Contents |
|---|---|---|---|
| Generator | this GitHub repo | Apache-2.0 | code, seed lists, docs, a committed 1,000-part sample in sample/demo-1k/ |
| Full corpus | Hugging Face dataset | CC0-1.0 | the size ladder (1k → 200k), browsable in the dataset viewer, load_dataset-able |
The published corpus is a convenience artifact — the generator plus the
seed list (seeds/v1.toml) is the canonical source. Everything
is reproducible bit-for-bit from a seed.
The committed sample renders its real geometry directly in your browser:
- ▶ Open the demo-1k preview (GitHub Pages)
- Generates large batches of unique, executable CadQuery programs across a broad operation vocabulary (plates, shafts, blocks, enclosures, flanged hubs, ribbed structures, profiled extrusions, L/C/Z angle brackets, and multi-section compound assemblies such as manifolds, stepped shafts, and standoffs, with round, polygonal, slot, and rectangular cross-sections), plus two real mechanical families, involute/cycloid gears and standards-based threaded parts (ISO/ACME/trapezoidal), built on build123d and bridged into CadQuery (opt-in mech extra). Render images and STL/STEP files are included using the default parameters for each object.
- Guarantees validity by execution: every accepted part actually builds to a non-empty solid.
- Every program is parametric by construction: it declares a machine-readable PARAMS schema (typed, range-bounded, UI-labeled) and is a pure function build(p) of those parameters.
- Ships a live customizer: open any part, move sliders, the 3D view updates.
- Ships a gallery: browse a generated corpus, filter by family/tier, click to open in the customizer.
- Fully reproducible: seed + generator version → identical corpus, bit-for-bit.
- No restrictions on use: code is Apache-2.0, generated data is CC0-1.0.
CadQuarry is developed against uv and a local
.venv. uv resolves the whole stack (including CadQuery and its OCP kernel)
from PyPI, so no conda step is required.
# 1. Install uv if you don't have it (see https://docs.astral.sh/uv/)
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Create a Python 3.11+ virtualenv in ./.venv
uv venv --python 3.11
# 3. Install CadQuarry + extras into it (dev tooling + geometry export deps)
uv pip install -e ".[dev,export]"
# 4. (Optional) add the `mech` extra to generate/execute the gear + threaded
# families. This pulls in the build123d stack (build123d, bd_warehouse,
# py_gearworks). py_gearworks has no PyPI release, so git is required.
uv pip install -e ".[mech]"
This installs everything into ./.venv. The examples below call the venv's
executables directly as .venv/bin/<cmd> so they work without activating the
environment. If you prefer, run source .venv/bin/activate once and drop the
.venv/bin/ prefix.
If CadQuery's wheels don't resolve for your platform, see the CadQuery install docs. Apart from the optional mech extra (py_gearworks is installed from git), everything else in CadQuarry installs cleanly from PyPI.
# Generate 100 parts
.venv/bin/cadquarry generate --count 100 --seed 42 --out dataset/
# Build EVERY corpus in the seed ladder (generate + export geometry) in one pass
.venv/bin/cadquarry build # all sizes -> datasets/{tag}/ with STEP+STL+renders
.venv/bin/cadquarry build --sizes 1k 2k 5k # just a subset
.venv/bin/cadquarry build --no-export # generate only, skip geometry
# Build everything, then pack + upload it all to Hugging Face (needs HF_TOKEN)
.venv/bin/cadquarry build && .venv/bin/cadquarry publish
# Open the live customizer for a single part
.venv/bin/cadquarry serve --part examples/plate_with_holes.py
# Browse a generated corpus
.venv/bin/cadquarry serve --dataset dataset/
# Execute a part with parameter overrides
.venv/bin/cadquarry run examples/plate_with_holes.py --set plate_w=80 --set thickness=8
# Export to STL
.venv/bin/cadquarry run examples/plate_with_holes.py --export stl --out plate.stl
# Print the parameter schema for a part
.venv/bin/cadquarry info examples/plate_with_holes.py
# Verify all parts in a corpus re-execute correctly
.venv/bin/cadquarry verify dataset/

| Family | Description | Weight |
|---|---|---|
| plate | Flat rectangular plates with holes, fillets, pockets | 18% |
| bracket | Angle brackets: L (single leg), C/channel (two legs), Z (cranked offset), with per-leg holes and optional gussets | 14% |
| revolved | Shafts, bushings, washers (solids of revolution) | 14% |
| block | Prismatic blocks/housings with pockets and bosses | 11% |
| compound | Multi-section assemblies: manifolds (bored side ports), stepped shafts (+ polygon drive heads), polygon standoffs, pedestals, side spigots, mounting tabs. Sections use round, polygonal (hex/oct), slot, and rect cross-sections | 11% |
| flanged | Revolved stub + polar bolt-circle pattern | 6% |
| ribbed | Base plate + patterned thin ribs | 5% |
| enclosure | Shelled box (hollow, open-top) | 5% |
| profiled | Long constant cross-section: rod, tube, slot, or 3/4/5/6/8-sided polygon bar | 4% |
| gear ⚙ | Real involute/cycloid gears: spur, helical (incl. herringbone), bevel, cycloid, inside-ring; ISO 54 modules, optional profile shift, root fillets, bore, hub | 6% |
| threaded ⚙ | Real thread geometry: ISO metric external/internal, plus ACME and metric-trapezoidal lead screws | 6% |
Family weights, tier distributions, and per-family dimension ranges are all tunable via configs/default.toml.
⚙ The gear and threaded families require the optional mech extra (uv pip install -e ".[mech]") and a compatible CadQuery/build123d pairing to execute; their emitted programs are not self-contained on cadquery alone. Consumers of the published dataset who only read the source text need nothing extra; only re-executing these parts needs the mech stack.
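For readers without the mech extra, one way to reason about the family mix is to drop the two mechanical families and rescale the rest. The sketch below copies the weights from the table above; the helper name without_mech is hypothetical, and whether the generator itself renormalizes weights this way is an assumption, not something the documentation states.

```python
# Family weights copied from the table above (percent -> fraction).
FAMILY_WEIGHTS = {
    "plate": 0.18, "bracket": 0.14, "revolved": 0.14, "block": 0.11,
    "compound": 0.11, "flanged": 0.06, "ribbed": 0.05, "enclosure": 0.05,
    "profiled": 0.04, "gear": 0.06, "threaded": 0.06,
}
MECH_FAMILIES = {"gear", "threaded"}  # these need the optional mech extra


def without_mech(weights):
    """Hypothetical helper: drop the mech families, rescale the rest to sum to 1."""
    kept = {name: w for name, w in weights.items() if name not in MECH_FAMILIES}
    total = sum(kept.values())
    return {name: w / total for name, w in kept.items()}


primitive_only = without_mech(FAMILY_WEIGHTS)
for name, weight in primitive_only.items():
    print(f"{name:10s} {weight:.3f}")
```

In practice the equivalent change would be made by editing the weights in configs/default.toml rather than in code.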
| Tier | Description | Default weight |
|---|---|---|
| 0 | Single feature / bare primitive | 25% |
| 1 | 2–3 ops: holes, a base + one attached section (spigot, column, tab) | 40% |
| 2 | Fillets, chamfers, counterbore/countersink, drive heads, extra ports | 25% |
| 3 | Pockets, bosses, ribs, additional bored ports, deep multi-section trees | 10% |
Tiers are clamped per family (e.g. bracket/block/compound start at tier 1),
and the compound family scales section count with tier — a manifold sprouts
2 → 3 → 4 bored ports as the tier rises.
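The tier rules above can be sketched as a weighted draw followed by a per-family floor. This is only an illustration under stated assumptions: the names sample_tier, MIN_TIER, and manifold_ports are hypothetical, the floor table lists only the three families named above, and the port mapping assumes tiers 1, 2, 3 correspond to 2, 3, 4 ports.

```python
import random

# Default tier weights from the table above.
TIER_WEIGHTS = {0: 0.25, 1: 0.40, 2: 0.25, 3: 0.10}
# Families the text says start at tier 1 (other floors are not documented here).
MIN_TIER = {"bracket": 1, "block": 1, "compound": 1}


def sample_tier(rng, family):
    """Draw a tier from the default weights, then clamp to the family's floor."""
    tiers = list(TIER_WEIGHTS)
    drawn = rng.choices(tiers, weights=[TIER_WEIGHTS[t] for t in tiers], k=1)[0]
    return max(drawn, MIN_TIER.get(family, 0))


def manifold_ports(tier):
    """Assumed mapping: tier 1 -> 2 ports, tier 2 -> 3 ports, tier 3 -> 4 ports."""
    return tier + 1


rng = random.Random(42)  # seeded, so the draw sequence is repeatable
print([sample_tier(rng, "compound") for _ in range(10)])
```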
The gear and threaded families build real, standards-based geometry through
the build123d stack and bridge it back
into CadQuery via each object's shared OCP .wrapped handle — so a generated
gear or thread is, after one line, an ordinary single-solid cq.Workplane that
flows through the same execution, dedup, export, and customizer paths as every
other family.
Gears (py_gearworks):
- Types: spur, helical (including herringbone), bevel, cycloid, and inside-ring.
- ISO 54 preferred module series, 20° pressure angle (py_gearworks default).
- Optional profile shift, root fillets, an axial bore, and a raised hub — the bore/hub are added with plain CadQuery ops after the bridge.
Threads (bd_warehouse):
- ISO metric, both external (thread fused onto a shank sized to the root) and internal (thread fused into a bored body), drawn from the ISO 261/262 coarse/fine pitch tables.
- Lead screws: ACME and metric-trapezoidal external threads, by standard size designation.
Roadmap: tapered pipe threads (NPT/BSPT) are not yet supported because bd_warehouse does not currently model them; they are a documented future addition rather than a present capability.
Reproducibility nuance. Geometry signatures (quantized volume, area, and
sorted principal moments of inertia) for these solver-built families are stable
to the dedup tolerance (3 significant figures), so dedup and seed-reproducible
selection hold. Exact STEP/STL bytes, however, come out of a numeric
NURBS/thread solver and are not guaranteed bit-identical across platforms or
library versions the way the primitive families are — pin the mech versions
(recorded in seeds/v2.toml) if byte-level reproduction
matters.
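To make the signature idea concrete, here is a minimal sketch of quantizing volume, area, and sorted principal moments to 3 significant figures. The function names and the tuple layout are assumptions for illustration; the generator's actual signature format is not shown in this document.

```python
from math import floor, log10


def round_sig(x, sig=3):
    """Round x to `sig` significant figures."""
    if x == 0:
        return 0.0
    return round(x, sig - 1 - floor(log10(abs(x))))


def geometry_signature(volume, area, moments, sig=3):
    """Hypothetical signature: quantized volume, area, and sorted moments."""
    return (
        round_sig(volume, sig),
        round_sig(area, sig),
        tuple(round_sig(m, sig) for m in sorted(moments)),
    )


# Two solves of the "same" part with small numeric drift share a signature.
a = geometry_signature(12345.6, 678.9, [3.0, 1.0, 2.0])
b = geometry_signature(12349.9, 679.2, [2.001, 0.9999, 3.0004])
print(a == b)
```

Note that plain rounding has bin edges: two values that straddle a rounding boundary can quantize differently even when they are very close.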
Generation validates every part by execution. Importing CadQuery/OCP (~1–1.5s)
is a fixed startup cost, amortized by a pool of persistent workers that
import once and then build many parts each, fanning the work across cores. With
the full default config the throughput floor is set by the solver-built
mechanical families (gear + threaded, ~12% of the mix), which solve real
involute/thread geometry and cost a meaningful fraction of a second each — so
end-to-end generation runs at roughly 9–10 validated parts/sec rather than
the hundreds/sec the millisecond-scale primitive families alone would sustain.
Sampling stays seeded per-part and the accept/dedup decision stays in strict
attempt order, so output is bit-for-bit identical regardless of worker count
— the same seed always yields the same corpus.
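The determinism argument above can be illustrated with a toy pool: each attempt derives its own seed from the corpus seed and its index, and accept/dedup decisions are applied in attempt order even though the work runs concurrently. This is a sketch under assumptions: it uses threads where the real generator uses persistent worker processes, and attempt_seed, build_attempt, and generate are hypothetical stand-ins.

```python
import hashlib
import random
from concurrent.futures import ThreadPoolExecutor


def attempt_seed(corpus_seed, index):
    """Derive a per-attempt seed that does not depend on scheduling."""
    digest = hashlib.sha256(f"{corpus_seed}:{index}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def build_attempt(corpus_seed, index):
    """Stand-in for compose + execute; returns (signature, payload)."""
    rng = random.Random(attempt_seed(corpus_seed, index))
    width = rng.randint(20, 40)  # tiny range so duplicates actually occur
    return width, f"plate_w={width}"


def generate(corpus_seed, count, workers):
    accepted, seen, start = [], set(), 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while len(accepted) < count:
            batch = range(start, start + 4 * workers)
            # Executor.map yields results in submission order, not completion order.
            results = pool.map(lambda i: build_attempt(corpus_seed, i), batch)
            for signature, payload in results:
                if signature not in seen and len(accepted) < count:
                    seen.add(signature)
                    accepted.append(payload)
            start = batch.stop
    return accepted


print(generate(42, 10, workers=1) == generate(42, 10, workers=8))
```

Because the accept loop consumes results by attempt index, changing the worker count changes only how much speculative work is done, not which parts are kept.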
Measured on an AMD Ryzen Threadripper 9960X (24C/48T) with the default
24-worker pool, full default config (mech families included), generate only
(execution-validated, no geometry export):
| Dataset size | Wall-clock time | Throughput |
|---|---|---|
| 1,000 | ~2 m 10 s | 7.5 part/s |
| 2,000 | ~3 m 30 s | 9.5 part/s |
| 5,000 | ~9 m 30 s | 8.7 part/s |
| 10,000 | ~18 m 45 s | 8.9 part/s |
| 20,000 | ~36 m 30 s | 9.1 part/s |
| 50,000 | ~1 h 30 m | 9.2 part/s |
| 100,000 | ~2 h 50 m | 9.8 part/s |
Throughput climbs slightly with corpus size as the fixed worker warm-up
amortizes. Dropping the mech families (or lowering their weight in
configs/default.toml) pushes throughput up by more than an order of magnitude,
since the remaining primitive families build in milliseconds.
Tune the pool with --workers N (default: min(cores, 24)). The best worker
count is system-dependent: it scales with core count, but the solver-built
families are memory-hungry, so on machines with less RAM you may need fewer
workers than you have cores to avoid swapping (very large corpora at high worker
counts can exhaust memory). Times exclude STEP/STL/render export
(cadquarry export, which uses the same worker pool).
Every generated .py file is self-contained and customizer-compatible:
import cadquery as cq

# Generated by CadQuarry v0.1.0 — seed 42 — CC0-1.0
PARAMS = {
    "plate_w": {"type": "float", "default": 80.0, "min": 30.0, "max": 150.0, "step": 1.0,
                "group": "Body", "label": "Plate width (mm)"},
    "filleted": {"type": "bool", "default": True, "group": "Body", "label": "Fillet corners"},
    ...
}

def build(p):
    result = cq.Workplane("XY").box(p["plate_w"], p["plate_d"], p["thickness"])
    if p["filleted"]:
        result = result.edges("|Z").fillet(min(p["plate_w"], p["plate_d"]) * 0.08)
    ...
    return result

result = build({k: v["default"] for k, v in PARAMS.items()})

Full format spec: docs/parametric-format.md
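Because every program exposes a PARAMS schema and a build(p) function, a caller can merge overrides onto the defaults before building. The sketch below shows one way to do that for the two types visible in the example (float and bool). The helper resolve_params is hypothetical, and rejecting out-of-range values (rather than clamping) is an assumption, not documented CadQuarry behavior.

```python
# A trimmed schema in the same shape as the generated PARAMS above.
PARAMS = {
    "plate_w": {"type": "float", "default": 80.0, "min": 30.0, "max": 150.0, "step": 1.0},
    "filleted": {"type": "bool", "default": True},
}


def resolve_params(schema, overrides):
    """Hypothetical helper: defaults plus validated overrides."""
    values = {name: spec["default"] for name, spec in schema.items()}
    for name, raw in overrides.items():
        if name not in schema:
            raise KeyError(f"unknown parameter: {name}")
        spec = schema[name]
        if spec["type"] == "bool":
            value = raw if isinstance(raw, bool) else str(raw).lower() in ("1", "true", "yes")
        elif spec["type"] == "float":
            value = float(raw)
            if not spec["min"] <= value <= spec["max"]:
                raise ValueError(f"{name}={value} outside [{spec['min']}, {spec['max']}]")
        else:
            raise TypeError(f"unsupported type in this sketch: {spec['type']}")
        values[name] = value
    return values


p = resolve_params(PARAMS, {"plate_w": "100"})
print(p)  # pass this dict to build(p) in a real generated program
```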
dataset/
├── manifest.jsonl one JSON record per part (id, family, tier, signature, descriptor, dimensions, paths)
├── DATASET_CARD.md scale, distribution, generator version, license
├── parts/{id}.py parametric CadQuery source
├── params/{id}.params.json parameter schema sidecar
├── meta/{id}.meta.json provenance: seed, family, tier, geometry signature, descriptor, dimensions
├── dims/{id}.txt human-readable, prompt-friendly dimension summary (one line)
├── geometry/{id}.step/.stl (optional, generated by cadquarry export)
├── renders/{id}/{view}.png (optional) multi-angle PNG renders, one per view
└── pointclouds/{id}.ply (optional) sampled point cloud
Each meta.json record (mirrored in manifest.jsonl) carries a
geometry_signature and — for families with categorical design intent — a
machine-readable descriptor that resolves the part's defining attributes
so you can query them without parsing source. For example a threaded part:
"descriptor": {
  "family": "threaded", "standard": "iso", "external": true,
  "length": 60.0, "major_diameter": 16.0, "pitch": 2.0, "designation": "M16x2"
}
and a gear:
"descriptor": {
  "family": "gear", "kind": "helical", "module": 2.0, "teeth": 24,
  "pitch_diameter": 48.0, "pressure_angle_deg": 20.0,
  "helix_angle_deg": 18.0, "herringbone": true, "bore_d": 8.0
}
Lead screws record a size designation (e.g. "ACME 1/2", "Tr 20x4"). The
descriptor is omitted for families whose intent is fully captured by their
parameters alone.
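As an illustration of querying descriptors without parsing source, the following sketch filters manifest-style JSONL records by family and descriptor attributes. The sample records reuse the descriptor fields shown above; the ids and the helper names iter_records and find are made up for the example.

```python
import json

# Three in-memory records shaped like manifest.jsonl lines (ids are invented).
SAMPLE_MANIFEST = "\n".join([
    json.dumps({"id": "p0001", "family": "threaded",
                "descriptor": {"family": "threaded", "standard": "iso",
                               "external": True, "designation": "M16x2"}}),
    json.dumps({"id": "p0002", "family": "gear",
                "descriptor": {"family": "gear", "kind": "helical", "teeth": 24}}),
    json.dumps({"id": "p0003", "family": "plate"}),  # descriptor omitted
])


def iter_records(jsonl_text):
    """Yield one dict per non-empty JSONL line."""
    for line in jsonl_text.splitlines():
        if line.strip():
            yield json.loads(line)


def find(records, family, **attrs):
    """Yield records of `family` whose descriptor matches every given attribute."""
    for rec in records:
        desc = rec.get("descriptor") or {}
        if rec.get("family") == family and all(desc.get(k) == v for k, v in attrs.items()):
            yield rec


external_iso = [r["id"] for r in find(iter_records(SAMPLE_MANIFEST), "threaded",
                                      standard="iso", external=True)]
print(external_iso)
```

For a real corpus, replace SAMPLE_MANIFEST with the text of dataset/manifest.jsonl.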
Every part also carries a procedurally-generated, plain-English dimension summary — a short, prompt-friendly description of the part's overall size and each of its features (holes, fillets, bores, pockets, brackets, gears, threads, …), resolved against the part's default parameters. There is no AI in this step: it is a pure, deterministic walk of the part's IR, so it is reproducible bit-for-bit alongside everything else. For example:
Rectangular body, 55 × 23.5 × 75.4 mm (W×D×H). Four Ø3.2 mm holes at the corners, 38.5 × 16.45 mm spacing. Fillet radius 1.078 mm on vertical edges.
The summary appears in three places: a dimensions block in meta.json (with
both a lines list — one phrase per feature — and a joined text blob), a
dims/{id}.txt sidecar holding the text line, and the dimensions column of
the manifest and the published JSONL/Parquet (so it flows straight to the
dataset). It pairs naturally with the renders as a text-image signal for
program-recovery and captioning work.
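The relationship between the lines list and the joined text blob might look like the sketch below. The keys lines and text come from the description above; joining with a single space is an assumption inferred from the example summary, and dimensions_block is a hypothetical name.

```python
def dimensions_block(lines):
    """Hypothetical: build the dimensions block from per-feature phrases."""
    return {"lines": list(lines), "text": " ".join(lines)}


block = dimensions_block([
    "Rectangular body, 55 × 23.5 × 75.4 mm (W×D×H).",
    "Four Ø3.2 mm holes at the corners, 38.5 × 16.45 mm spacing.",
    "Fillet radius 1.078 mm on vertical edges.",
])
print(block["text"])
```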
cadquarry generate writes only the text artifacts (code, params, meta).
Geometry is produced on demand by cadquarry export, whose default formats
are STEP + STL + renders:
# STEP, STL, and 8-angle renders for every part (the default)
.venv/bin/cadquarry export dataset/
# Pick formats explicitly
.venv/bin/cadquarry export dataset/ --formats step,stl,render,pointcloud
Each part is rendered from eight standard viewpoints (front, top,
right, a canonical iso, and the four isometric corners iso_fr, iso_fl,
iso_br, iso_bl), written to renders/{id}/{view}.png. Rendering is done on
the GPU through a headless EGL OpenGL context (a small deferred pipeline with a
real depth buffer, screen-space ambient occlusion and soft hemispherical +
key lighting), and needs the optional trimesh + moderngl + pillow deps
(covered by the export extra, i.e. uv pip install -e ".[export]") plus an
EGL-capable GL driver; if they're missing, export prints one warning and
skips renders while still writing STEP/STL.
cadquarry annotate supplements an already-generated corpus in place with
the procedural dimension metadata (the dimensions block, the dims/{id}.txt
sidecars, and the manifest column) — without re-running any geometry export:
.venv/bin/cadquarry annotate dataset/ # uses configs/default.toml
.venv/bin/cadquarry annotate dataset/ --config configs/default.toml
Each part is recomposed deterministically from its stored seed + index
(composition is a pure function of seed + config), so no CAD execution and no
geometry re-export are needed — it only re-emits the cheap text artifacts
(parts/*.py, params/*.params.json, meta/*.meta.json, dims/*.txt) and
rewrites the manifest, leaving STEP/STL/renders/point clouds untouched. This is
the cheap path to add dimensions to a corpus built before the feature existed,
and it also upgrades the corpus to the current generator version in passing.
Annotate with the same --config the corpus was generated with: each
recomposition is checked against the stored ir_hash (ignoring the
generator-version label, which a version bump is expected to change), so a part
whose geometry-relevant IR diverges is left as-is and reported rather than
overwritten with mismatched dimensions — a wrong --config can never corrupt
good source.
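The guard described above, comparing a recomposed part against its stored ir_hash while ignoring the generator-version label, can be sketched as a hash over a canonical serialization with that one field removed. The IR layout, the generator_version key, and the choice of SHA-256 over sorted-key JSON are all assumptions for illustration.

```python
import hashlib
import json


def ir_hash(ir, ignore=("generator_version",)):
    """Hash the IR with version labels removed (hypothetical scheme)."""
    body = {k: v for k, v in ir.items() if k not in ignore}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def safe_to_overwrite(stored_hash, recomposed_ir):
    """Only re-emit text artifacts when the geometry-relevant IR is unchanged."""
    return ir_hash(recomposed_ir) == stored_hash


stored = {"family": "plate", "ops": [{"op": "box", "w": 80.0}], "generator_version": "0.1.0"}
bumped = {**stored, "generator_version": "0.2.0"}             # version bump only
drifted = {**stored, "ops": [{"op": "box", "w": 81.0}]}       # wrong --config

print(safe_to_overwrite(ir_hash(stored), bumped))   # same geometry-relevant IR
print(safe_to_overwrite(ir_hash(stored), drifted))  # diverged, leave as-is
```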
cadquarry build is the single unified command that generates and exports
every corpus in the seed ladder ([[publish.corpus]] in
seeds/v1.toml), so you don't have to script a
generate-then-export loop yourself.
# Generate + export STEP/STL/renders for every size -> datasets/{tag}/
.venv/bin/cadquarry build
# Build a subset, choose formats, change the base output dir
.venv/bin/cadquarry build --sizes 1k 2k 5k --formats step,stl --out datasets/
# Generate only (no geometry)
.venv/bin/cadquarry build --no-export
For each ladder tag it writes datasets/{tag}/ with the usual corpus layout
plus exported geometry. Builds are resumable: a corpus whose
manifest.jsonl already has enough parts is reused as-is (it's bit-identical to
a fresh run anyway); pass --force to regenerate. --workers is shared by both
the generation and export phases.
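The resume rule can be pictured as a line count on manifest.jsonl compared with the target size. This is a simplified sketch: manifest_count and needs_build are hypothetical names, and it deliberately ignores other conditions a real build may check, such as the generator version that produced the corpus.

```python
import tempfile
from pathlib import Path


def manifest_count(corpus_dir):
    """Number of non-empty records in manifest.jsonl, or 0 if it is missing."""
    manifest = Path(corpus_dir) / "manifest.jsonl"
    if not manifest.is_file():
        return 0
    with manifest.open(encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())


def needs_build(corpus_dir, target, force=False):
    """Rebuild when forced or when the manifest has fewer parts than requested."""
    return force or manifest_count(corpus_dir) < target


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "1k"
    corpus.mkdir()
    before = needs_build(corpus, target=3)
    (corpus / "manifest.jsonl").write_text(
        '{"id": "a"}\n{"id": "b"}\n{"id": "c"}\n', encoding="utf-8")
    after = needs_build(corpus, target=3)
    forced = needs_build(corpus, target=3, force=True)

print(before, after, forced)
```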
Heads up: the ladder goes up to 200k parts. Pass
--sizes with the specific tags you want unless you really intend to build the whole ladder.
# Re-generate an exact corpus from its seed
.venv/bin/cadquarry generate --seed 42 --count 5000 --config configs/default.toml --out dataset-repro/
# Verify signatures match
.venv/bin/cadquarry verify dataset-repro/
Same seed + same generator version → identical manifest.jsonl and identical geometry signatures.
A pinned 1,000-part corpus is checked into the repo so anyone can inspect the format, run the customizer, and preview the data without downloading anything:
sample/demo-1k/
├── manifest.jsonl one record per part
├── parts/{id}.py parametric CadQuery source
├── params/{id}.params.json
├── meta/{id}.meta.json
├── dims/{id}.txt human-readable dimension summary
├── geometry/{id}.step STEP B-rep solids
├── geometry/{id}.stl compact binary meshes (for the in-browser preview)
├── renders/{id}/{view}.png 8 perspective PNGs (front/top/right/iso + 4 iso corners)
├── DATASET_CARD.md
└── preview.html self-contained gallery (three.js, lazy-loaded geometry)
It is corpus demo-1k (seed 1234) from the active seed list. Regenerate it —
and everything else the GitHub Pages preview needs — with one command:
.venv/bin/cadquarry build_sample
build_sample regenerates the part sources + params + meta deterministically
from the [[corpus]] entry, then exports STEP B-reps, the compact binary STL
meshes the in-browser preview loads, and the 8 perspective renders the preview's
"Renders" tab shows — straight into sample/demo-1k/. GitHub Pages serves the
repo as-is (index.html → sample/demo-1k/preview.html), so just commit the
result to publish. Pass --name <corpus>, --out <dir>, or --formats … to
rebuild a different committed sample.
The full corpus is published at
jacobjennings/cadquarry.
You don't need to install CadQuarry or CadQuery to consume it — just the
datasets library.
pip install datasets

from datasets import load_dataset
# Code + metadata only (fastest; JSONL-backed):
ds = load_dataset("jacobjennings/cadquarry", "1k", split="train")
print(ds[0]["source"]) # full parametric CadQuery program (the canonical artifact)
print(ds[0]["family"]) # e.g. "plate", "revolved", "compound"
print(ds[0]["params"]) # typed parameter schema (JSON)
print(ds[0]["dimensions"]) # plain-English dimension summary (prompt-friendly)
# With 8-view shaded renders (PIL images):
ds = load_dataset("jacobjennings/cadquarry", "1k-renders", split="train")
ds[0]["render_iso"].show()
# With binary STL meshes:
import io, trimesh
ds = load_dataset("jacobjennings/cadquarry", "1k-stl", split="train")
mesh = trimesh.load(io.BytesIO(ds[0]["stl_bytes"]), file_type="stl")
# Everything (renders + STL + STEP B-rep):
ds = load_dataset("jacobjennings/cadquarry", "1k-full", split="train")
with open("part.step", "wb") as f:
    f.write(ds[0]["step_bytes"])
Each corpus size is published as six content configs, so you fetch only what
you need. Swap the 1k prefix for any size in the ladder:
| Config | Contents | Format |
|---|---|---|
| <tag> | CadQuery source + metadata | JSONL |
| <tag>-renders | + 8-view render images | Parquet (render_* = image) |
| <tag>-stl | + binary STL mesh | Parquet (stl_bytes = binary) |
| <tag>-step | + binary STEP B-rep | Parquet (step_bytes = binary) |
| <tag>-geo | + renders + STL | Parquet |
| <tag>-full | + renders + STL + STEP | Parquet |
Each content config is also published limited to complexity tiers 0–2
(inclusive) under a -t0-2 suffix — e.g. <tag>-t0-2, <tag>-t0-2-renders,
<tag>-t0-2-full — for consumers who want to exclude the most complex (tier-3)
parts. The unlabeled configs include all tiers (0–3).
Available <tag> sizes (from seeds/v1.toml): 1k, 2k,
5k, 10k, 20k, 50k, 100k, 200k. For example,
load_dataset("jacobjennings/cadquarry", "50k-stl") or
load_dataset("jacobjennings/cadquarry", "50k-t0-2").
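The naming scheme above is regular enough to compose programmatically. The helper below is a hypothetical convenience, not part of CadQuarry or the datasets library; it assumes the -t0-2 marker always sits between the size tag and the content suffix, as in the examples given.

```python
TAGS = ("1k", "2k", "5k", "10k", "20k", "50k", "100k", "200k")
CONTENTS = ("", "renders", "stl", "step", "geo", "full")  # "" = code + metadata only


def config_name(tag, content="", tiers_0_2=False):
    """Compose a dataset config name such as '50k-stl' or '1k-t0-2-renders'."""
    if tag not in TAGS:
        raise ValueError(f"unknown size tag: {tag}")
    if content not in CONTENTS:
        raise ValueError(f"unknown content config: {content}")
    parts = [tag]
    if tiers_0_2:
        parts.append("t0-2")
    if content:
        parts.append(content)
    return "-".join(parts)


print(config_name("50k", "stl"))
print(config_name("1k", "renders", tiers_0_2=True))
```

The result can be passed as the second argument to load_dataset("jacobjennings/cadquarry", ...).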
Every part is reproducible bit-for-bit from its seed, so the published data is a
convenience artifact — the generator plus seeds/v1.toml is the canonical
source.
The whole thing is two commands — build the corpora (generate + export), then
publish (pack + upload). With HF_TOKEN set, the defaults are right:
uv pip install -e ".[mech,export,publish]" # mech families, renders, uploader
cadquarry build && cadquarry publish
Commands can also be chained in a single invocation (each runs with its defaults), so a full refresh of both the GitHub Pages preview and the dataset is:
cadquarry build_sample build publish
- cadquarry build: generates and exports every size in the ladder into datasets/<tag>/ with STEP + STL + renders. It uses the seed list matching the current generator version (seeds/v2.toml) and automatically regenerates any corpus that was built by an older generator, so a bare build always produces an up-to-date dataset (this is the slow part; 200k + rendering takes a while). Use --sizes 1k 2k 5k for a subset or --force to rebuild unconditionally.
- cadquarry publish: reads the pre-built datasets/<tag>/ corpora and uploads each as a set of load_dataset configs of one HF dataset. It generates and renders nothing. Defaults to all built sizes; the repo id comes from the seed list (resolving to <your-hf-user>/cadquarry) or CADQUARRY_HF_REPO. The code-only variant is packed into a single corpus.jsonl with the parametric source and params inlined; the geometry variants (renders/STL/STEP) are packed into Snappy-compressed Parquet with typed binary columns. If a size hasn't been built yet it stops and tells you what to build.
# subset to your own repo
cadquarry build --sizes 1k 2k 5k && cadquarry publish --sizes 1k 2k 5k --repo-id <user>/cadquarry
# pack locally without uploading (inspect .hf_build/)
cadquarry publish --sizes 1k --dry-run
# read pre-built corpora from a custom location
cadquarry publish --sizes 1k --out /data/cadquarry
Secrets. Publishing reads your token from the HF_TOKEN environment
variable (or a prior huggingface-cli login) and never prints, logs, or
commits it. Nothing secret is stored in the repo, so it works as-is for anyone
with their own HF account. The size ladder's seeds live in the active
seeds/v*.toml under [[publish.corpus]].
All distribution knobs live in configs/default.toml. Key sections:
[distribution.families]
plate = 0.18
revolved = 0.14
bracket = 0.14
compound = 0.11
gear = 0.06 # requires the `mech` extra
threaded = 0.06 # requires the `mech` extra
...
[distribution.tiers]
tier0 = 0.25
tier1 = 0.40
tier2 = 0.25
tier3 = 0.10
[families.plate]
width_min = 20.0
width_max = 120.0
...
[fasteners]
clearance_diameters = [3.2, 4.3, 5.3, 6.4, 8.4, 10.5, 13.0]

uv pip install -e ".[dev]"
.venv/bin/pytest # runs tests that don't require cadquery
.venv/bin/pytest -k "not exec" # same (explicit filter)
Tests in tests/ that don't touch the executor run without CadQuery. Execution tests require CadQuery.
| What | License |
|---|---|
| Generator source code (cadquarry/) | Apache-2.0 |
| Generated data (.py, .step, .stl, .params.json, .meta.json) | CC0-1.0 |
CC0 makes the intent unambiguous: do whatever you want with the data, including commercial use, no attribution required.