Releases · imazen/zenforks-cubecl

28 May 00:08

lilith

zenforks-v0.10.1

9370f8a

zenforks-v0.10.1 — PTX cache widening + Metal atomic capability honesty

Patch release on top of 0.10.0.

What's new in 0.10.1

1. Persistent PTX cache widening (`zenforks-cubecl-cuda`)

The existing disk-persistent PTX cache key was too narrow for our
usage. We add three axes:

- CUBECL_GIT_SHA: captured at build time. Invalidates the cache on any
  zenforks-cubecl-cuda source change, not just when upstream
  cubecl-common bumps the version field in its Cargo.toml.
- sm_arch: NVRTC compiles arch-specific PTX. Serving sm_70 PTX
  to an sm_80 device is a correctness bug; with the arch in the key,
  PTX built for one arch cannot be looked up for another.
- driver_version: different driver versions JIT the same PTX
  into different SASS, so cache entries are kept per driver.

Resulting on-disk layout:

<root>/cuda/<cubecl-common-ver>/<git-sha>/<sm_arch>/<driver_ver>/ptx.json.log

Eliminates the "fresh-process cold start = ~18s NVRTC re-compile
because the cache key was too narrow" failure mode that hit
zenmetrics' fleet workers under cubecl rev bumps.
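
As a rough illustration of the layout above, the sketch below joins the four key axes into the on-disk path. The struct, the function name, and the sample values (including the driver version format) are hypothetical and are not taken from zenforks-cubecl-cuda's source; only the directory order follows the layout shown.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical holder for the four cache-key axes described above.
struct PtxCacheKey<'a> {
    cubecl_common_version: &'a str,
    git_sha: &'a str,
    sm_arch: &'a str,
    driver_version: &'a str,
}

/// Builds `<root>/cuda/<cubecl-common-ver>/<git-sha>/<sm_arch>/<driver_ver>/ptx.json.log`.
/// Changing any single axis yields a different directory, so entries written
/// under one combination are never read back under another.
fn ptx_cache_file(root: &Path, key: &PtxCacheKey<'_>) -> PathBuf {
    root.join("cuda")
        .join(key.cubecl_common_version)
        .join(key.git_sha)
        .join(key.sm_arch)
        .join(key.driver_version)
        .join("ptx.json.log")
}
```

Keeping each axis as its own directory level (rather than hashing them together) would also make it easy to inspect or prune stale entries by hand, though that is a design guess rather than a stated goal of the patch.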

2. Metal `Atomic<f32>` capability honesty (`zenforks-cubecl-wgpu`)

cubecl-wgpu's Metal backend declared Atomic<f32> as Add-capable,
but naga's MSL backend doesn't emit atomic_fetch_add_explicit for
f32. The WGSL atomicAdd<f32> was therefore silently dropped during
translation, leaving every reduction returning its default 0.0
value. Symptom: every *-gpu metric's score collapsed to a
fall-through constant on Metal.

This patch drops AtomicUsage::Add from Metal's f32 atomic
registration. Callers requesting Atomic<f32>::fetch_add now fail
at construct time with an actionable error instead of returning
wrong numbers at runtime.
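
The idea of failing at construct time can be pictured with a toy capability table. This is a minimal sketch, not cubecl-wgpu's actual registration API: only the name AtomicUsage::Add comes from the text above, while `Elem`, `LoadStore`, `Capabilities`, `metal_like`, and `require` are hypothetical names chosen for illustration.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Elem {
    F32,
    U32,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum AtomicUsage {
    LoadStore,
    Add,
}

/// Toy stand-in for a backend's declared atomic capabilities.
struct Capabilities {
    atomics: HashSet<(Elem, AtomicUsage)>,
}

impl Capabilities {
    /// Assumed registration after the patch: f32 atomics no longer list `Add`.
    fn metal_like() -> Self {
        let mut atomics = HashSet::new();
        atomics.insert((Elem::U32, AtomicUsage::LoadStore));
        atomics.insert((Elem::U32, AtomicUsage::Add));
        atomics.insert((Elem::F32, AtomicUsage::LoadStore));
        Self { atomics }
    }

    /// Rejects an unsupported combination up front instead of letting a
    /// kernel run and return wrong numbers.
    fn require(&self, elem: Elem, usage: AtomicUsage) -> Result<(), String> {
        if self.atomics.contains(&(elem, usage)) {
            Ok(())
        } else {
            Err(format!(
                "atomic {:?} with usage {:?} is not supported on this backend",
                elem, usage
            ))
        }
    }
}
```

With this model, `Capabilities::metal_like().require(Elem::F32, AtomicUsage::Add)` returns an `Err` carrying a readable message, while the same request for `Elem::U32` succeeds.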

Not yet included: Part B (CAS-loop WGSL codegen lowering for
f32-atomic-add), which would let Metal users actually get a correct
Atomic<f32>::fetch_add. That requires a wider change to cubecl-wgpu's
WGSL Type system and binding layer, so it is deferred to a follow-on
release. Until then, the downstream zenmetrics workarounds (flipping
the fast-reduction default off on butteraugli-gpu and dssim-gpu, and
rejecting Metal on cvvdp-gpu) remain the production correctness fix
on Metal.
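
For readers unfamiliar with the CAS-loop technique that Part B refers to, here is a CPU-side analog in Rust. It is only a sketch of the general idea (store the f32 as its bit pattern in a 32-bit atomic and retry a compare-exchange until it succeeds); the function name is hypothetical, and the eventual WGSL lowering in cubecl-wgpu may look quite different.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Adds `value` to the f32 stored (as raw bits) in `cell` and returns the
/// previous value, mirroring fetch_add semantics.
fn atomic_add_f32(cell: &AtomicU32, value: f32) -> f32 {
    let mut current = cell.load(Ordering::Relaxed);
    loop {
        // Compute the new bit pattern from the value we last observed.
        let new = (f32::from_bits(current) + value).to_bits();
        match cell.compare_exchange_weak(current, new, Ordering::AcqRel, Ordering::Relaxed) {
            // The swap went through: `prev` is what was there before our add.
            Ok(prev) => return f32::from_bits(prev),
            // Another thread changed the cell (or the weak CAS failed
            // spuriously): retry with the freshly observed bits.
            Err(actual) => current = actual,
        }
    }
}
```

A cell initialised with `AtomicU32::new(1.5f32.to_bits())` holds 3.75 after `atomic_add_f32(&cell, 2.25)`, and the call returns the old value 1.5.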

What's unchanged

0.10.0's pinned-upload patch on zenforks-cubecl-runtime is still here.
All 11 renamed crates have the same package -> [lib] shim
([lib] name is cubecl_*, package name is zenforks-cubecl-*),
so consumer source code can keep its use cubecl_runtime::*;
imports unchanged.

Consumer pin convention

[dependencies]
cubecl-runtime = { package = "zenforks-cubecl-runtime", version = "0.10.1" }
cubecl-cuda    = { package = "zenforks-cubecl-cuda",    version = "0.10.1" }
cubecl-wgpu    = { package = "zenforks-cubecl-wgpu",    version = "0.10.1" }
# Non-renamed crates stay on upstream:
cubecl-common  = "0.10.0"
cubecl-ir      = "0.10.0"


28 May 00:05

lilith

zenforks-v0.10.0

d45a386

zenforks-v0.10.0 — vanilla rename + pinned-upload

First release of the zenforks-cubecl-* family on crates.io.

What this is

A maintained fork of tracel-ai/cubecl
v0.10.0, with 11 of its 16 crates renamed and published to crates.io
under the zenforks-cubecl-* namespace. The renamed crates carry a
small number of internal-use patches that downstream imazen
projects (zenmetrics, six *-gpu perceptual-metric crates) depend on
while waiting for upstream PRs to merge.

What's renamed vs not

Renamed in 0.10.0 (published from this tag):
zenforks-cubecl, zenforks-cubecl-runtime, zenforks-cubecl-core,
zenforks-cubecl-cuda, zenforks-cubecl-wgpu, zenforks-cubecl-cpu,
zenforks-cubecl-cpp, zenforks-cubecl-hip, zenforks-cubecl-spirv,
zenforks-cubecl-std, zenforks-cubecl-opt.

Stays upstream (consume directly from tracel-ai/cubecl's
crates.io publication at 0.10.0): cubecl-common, cubecl-ir,
cubecl-macros, cubecl-macros-internal, cubecl-zspace.

These are leaves of the dependency graph: none of them depends, even
transitively, on a patched crate, so there is no need to rename them.

What patches ship in 0.10.0

Only the pinned-host-buffer fast path for
ComputeClient::create_from_slice and friends, in cubecl-runtime.
~4x HtoD speedup on CUDA workloads via direct DMA from pinned host
memory at 12-25 GB/s on PCIe 4.0 (vs ~5-6 GB/s pageable bounce).
Drafted as upstream PR
tracel-ai/cubecl#1334.
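
To put the quoted bandwidth figures in perspective, the small sketch below derives the speedup range they imply and the time to move a buffer at a given rate. The function names are hypothetical, and the numbers fed to them are just the ranges quoted above, not new measurements.

```rust
/// Returns (worst-case, best-case) speedup implied by two bandwidth ranges in GB/s.
/// Worst case pairs the slowest pinned rate with the fastest pageable rate.
fn speedup_bounds(pinned: (f64, f64), pageable: (f64, f64)) -> (f64, f64) {
    let worst = pinned.0 / pageable.1;
    let best = pinned.1 / pageable.0;
    (worst, best)
}

/// Seconds needed to move `bytes` at `gb_per_s`, taking 1 GB as 1e9 bytes.
fn transfer_seconds(bytes: f64, gb_per_s: f64) -> f64 {
    bytes / (gb_per_s * 1e9)
}
```

With pinned at 12-25 GB/s and pageable at 5-6 GB/s, `speedup_bounds((12.0, 25.0), (5.0, 6.0))` gives a range of 2x to 5x, which brackets the ~4x figure.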

The PTX cache widening and Metal Atomic capability honesty
patches are coming in zenforks-v0.10.1.

Consumer pin convention

In your Cargo.toml, alias via the package field:

[dependencies]
cubecl-runtime = { package = "zenforks-cubecl-runtime", version = "0.10.0" }
cubecl-cuda    = { package = "zenforks-cubecl-cuda",    version = "0.10.0" }
# ...etc
# Non-renamed crates stay on upstream:
cubecl-common  = "0.10.0"

Then in source code, use cubecl_runtime::*; resolves to our
package because we keep [lib] name = "cubecl_runtime" unchanged.
No source rewrites needed.

Acknowledgement

Built on the great work of the upstream
tracel-ai/cubecl maintainers.
This fork exists to ship downstream patches without waiting on
upstream review cycles, not to replace upstream.


28 May 06:23

lilith

zenforks-cubecl-cpu-v0.10.2

d73c5b3

zenforks-cubecl-cpu v0.10.2 — multi-cube sync_cube/SharedMemory fix (Latest)

zenforks-cubecl-cpu v0.10.2 (cpu-only patch)

This release patches the MLIR visitor in zenforks-cubecl-cpu only. All other zenforks-cubecl-* crates stay at workspace version 0.10.1.

Fixed

Multi-cube SharedMemory + sync_cube isolation. The MLIR visitor generates 3 nested scf::for loops over CubeCount* inside the per-unit kernel body, while the global sync_cube barrier in compute_task.rs counts cube_dim_size arrivals. Nothing held all units on the same cube iteration between syncs, so shared-memory isolation between cubes was lost: units could advance to different cube iterations, and cube k's units could read shared memory written by cube k+1's unit 0.
- Surfaced on cvvdp-gpu's downscale_tiled_kernel (LDS-tiled 5×5 gauss reduce, 16×16 workgroup + 36×36 SharedMemory tile): worked at 32×32 (1 workgroup) but diverged by 1.3 cells on 73×91 inputs (3×3 workgroups).
- End-to-end downstream impact for the cvvdp JOD metric: ~1.73 JOD divergence vs pycvvdp v0.5.4 at 73×91 odd-dim, dropping to f32-precision parity (~1e-6 JOD) after the fix.
- Fix: emit an implicit sync_cube call at the end of every cube-iteration body in the visitor's innermost scf::for. (93dd86d)

Pre-existing test compilation error: FastMath::all().difference(...) expected an EnumSet<FastMath> but received a bare enum variant. Fixed by applying an .into() coercion. (04e4ffa)

Tests

New regression test test_sync_cube_multi_cube_writes_pos_cpu: 3 cubes × 4 units; cube k's unit 0 writes CUBE_POS_X = k to shared memory; all 4 units in cube k must read k. (93dd86d)
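
The shape of that test, and of the fix itself, can be modelled on the host with plain threads. The sketch below is not the crate's test or its MLIR output: `run_cubes` is a hypothetical helper in which each thread plays one unit, a reusable `std::sync::Barrier` stands in for sync_cube, and a single atomic cell stands in for shared memory.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::{Arc, Barrier};
use std::thread;

/// Runs `units` threads that each loop over `cubes` cube indices while
/// sharing one cell. Returns, for every unit, the values it read per cube.
fn run_cubes(cubes: u32, units: usize, trailing_sync: bool) -> Vec<Vec<u32>> {
    let shared = Arc::new(AtomicU32::new(u32::MAX));
    let barrier = Arc::new(Barrier::new(units));
    let handles: Vec<_> = (0..units)
        .map(|unit| {
            let shared = Arc::clone(&shared);
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                let mut seen = Vec::new();
                for k in 0..cubes {
                    if unit == 0 {
                        // Unit 0 publishes the cube index to "shared memory".
                        shared.store(k, Ordering::SeqCst);
                    }
                    // Explicit sync in the kernel body: the write is now visible.
                    barrier.wait();
                    seen.push(shared.load(Ordering::SeqCst));
                    if trailing_sync {
                        // Implicit sync at the end of the iteration: unit 0
                        // cannot start cube k+1 until every unit has read k.
                        barrier.wait();
                    }
                }
                seen
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Calling `run_cubes(3, 4, true)` makes every unit read 0, 1, 2 in order. With `trailing_sync` set to false the outcome depends on scheduling, since unit 0 may overwrite the cell before slower units have read it, which is the kind of cross-cube leak described under Fixed.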

Workspace

Other zenforks-cubecl-* crates remain at 0.10.1. Downstream consumers using zenforks-cubecl = "0.10.1" will pull zenforks-cubecl-cpu 0.10.2 via cargo's semver-compatible resolution; a project with an existing Cargo.lock picks it up after running cargo update.
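
Why 0.10.2 satisfies a "0.10.1" requirement follows from cargo's default (caret) rule, which for 0.x versions treats the minor number as the compatibility boundary. The sketch below is a simplified model of that rule with hypothetical names; it ignores pre-release tags and is not cargo's resolver.

```rust
/// (major, minor, patch)
type Version = (u64, u64, u64);

/// Simplified caret matching: does `candidate` satisfy a bare requirement `req`?
fn caret_matches(req: Version, candidate: Version) -> bool {
    let (rmaj, rmin, rpat) = req;
    let (cmaj, cmin, cpat) = candidate;
    // Never go below the requested version (tuples compare lexicographically).
    if candidate < req {
        return false;
    }
    if rmaj > 0 {
        // 1.2.3 allows anything below 2.0.0
        cmaj == rmaj
    } else if rmin > 0 {
        // 0.10.1 allows anything below 0.11.0
        cmaj == 0 && cmin == rmin
    } else {
        // 0.0.3 allows only 0.0.3
        cmaj == 0 && cmin == 0 && cpat == rpat
    }
}
```

Under this model `caret_matches((0, 10, 1), (0, 10, 2))` is true, while `(0, 11, 0)` and `(0, 10, 0)` are both rejected.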
