Releases: imazen/zenforks-cubecl

zenforks-v0.10.1 — PTX cache widening + Metal atomic capability honesty

28 May 00:08

Patch release on top of 0.10.0.

What's new in 0.10.1

1. Persistent PTX cache widening (zenforks-cubecl-cuda)

The existing disk-persistent PTX cache key was too narrow for our
usage. This release adds three axes to the key:

  • CUBECL_GIT_SHA: captured at build time. Invalidates the cache on
    any zenforks-cubecl-cuda source change, not only when upstream
    cubecl-common bumps the version field in its Cargo.toml.
  • sm_arch: NVRTC compiles arch-specific PTX. Serving sm_70 PTX
    to an sm_80 device is a correctness bug; keying on the arch makes
    that mismatch impossible by construction.
  • driver_version: different driver versions JIT the same PTX
    into different SASS, so cache entries are kept per driver.

Resulting on-disk layout:

<root>/cuda/<cubecl-common-ver>/<git-sha>/<sm_arch>/<driver_ver>/ptx.json.log

This eliminates the failure mode that hit zenmetrics' fleet workers
under cubecl rev bumps: a fresh process paying a ~18s NVRTC
re-compile on cold start because the cache key was too narrow.
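
As a rough illustration of the layout above, the cache file path can be derived by joining one directory per key axis. This is a minimal sketch, not the zenforks-cubecl-cuda implementation: the PtxCacheKey struct, its field names, and the ptx_cache_file function are hypothetical names chosen for this example.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical bundle of the cache-key axes described above.
/// Not a type from zenforks-cubecl-cuda.
#[derive(Clone, Copy, Debug)]
struct PtxCacheKey<'a> {
    common_version: &'a str, // cubecl-common version
    git_sha: &'a str,        // value of CUBECL_GIT_SHA captured at build time
    sm_arch: &'a str,        // e.g. "sm_80"
    driver_version: &'a str, // driver version rendered as a string
}

/// Builds <root>/cuda/<cubecl-common-ver>/<git-sha>/<sm_arch>/<driver_ver>/ptx.json.log.
/// Each axis is its own directory, so changing any one axis selects a
/// different file and can never serve a stale entry.
fn ptx_cache_file(root: &Path, key: &PtxCacheKey<'_>) -> PathBuf {
    root.join("cuda")
        .join(key.common_version)
        .join(key.git_sha)
        .join(key.sm_arch)
        .join(key.driver_version)
        .join("ptx.json.log")
}
```

Two keys that differ only in sm_arch or driver_version resolve to different files, which is the structural guarantee the widening is after.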

2. Metal Atomic<f32> capability honesty (zenforks-cubecl-wgpu)

cubecl-wgpu's Metal backend declared Atomic<f32> as Add-capable,
but naga's MSL backend doesn't emit atomic_fetch_add_explicit for
f32. The WGSL atomicAdd<f32> was silently dropped during
translation, so every reduction returned its default value of 0.0.
Symptom: every *-gpu metric's score collapsed to a fall-through
constant on Metal.

This patch drops AtomicUsage::Add from Metal's f32 atomic
registration. Callers requesting Atomic<f32>::fetch_add now fail
at construction time with an actionable error instead of returning
wrong numbers at runtime.
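
The idea of failing early on an undeclared capability can be sketched as follows. Everything here is a hypothetical stand-in for illustration: Backend, AtomicOp, f32_atomic_ops, and require_f32_atomic are not cubecl-wgpu types or functions, and the capability sets are assumptions made for the example.

```rust
use std::collections::HashSet;

/// Hypothetical backend identifiers (illustration only).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum Backend {
    Metal,
    Other, // stands in for any backend that does declare f32 atomic add
}

/// Hypothetical atomic operations a backend may declare for f32.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum AtomicOp {
    LoadStore,
    Add,
}

/// What each backend declares for f32 atomics in this sketch.
/// Metal deliberately omits Add, mirroring the change described above.
fn f32_atomic_ops(backend: Backend) -> HashSet<AtomicOp> {
    match backend {
        Backend::Metal => HashSet::from([AtomicOp::LoadStore]),
        Backend::Other => HashSet::from([AtomicOp::LoadStore, AtomicOp::Add]),
    }
}

/// Checks the declared capabilities up front and returns a descriptive
/// error instead of letting an unsupported operation run and yield 0.0.
fn require_f32_atomic(backend: Backend, op: AtomicOp) -> Result<(), String> {
    if f32_atomic_ops(backend).contains(&op) {
        Ok(())
    } else {
        Err(format!(
            "{backend:?} does not declare f32 atomic {op:?}; use a non-atomic reduction path"
        ))
    }
}
```

The point of the design is that an honest capability table turns a silent numeric error into an explicit Err at setup.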

Not yet included: Part B, a CAS-loop WGSL codegen lowering for
f32 atomic add, which would let Metal users get a correct
Atomic<f32>::fetch_add. That requires a wider change to
cubecl-wgpu's WGSL Type system and binding layer, and is deferred
to a follow-on release. Until then, the downstream zenmetrics
workarounds (fast-reduction default flipped off on butteraugli-gpu
and dssim-gpu, Metal rejected on cvvdp-gpu) remain the production
correctness fix on Metal.
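
For readers unfamiliar with the CAS-loop technique, here is the same idea expressed on the CPU with Rust's standard atomics: the f32 is stored as its bit pattern in a 32-bit integer atomic, and the add is retried until a compare-exchange succeeds. This is only an analogy for the deferred WGSL lowering, not the planned codegen; atomic_add_f32 is a name made up for this sketch.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Atomically adds `value` to the f32 whose bit pattern is stored in `cell`
/// and returns the previous value. Illustrative helper, not a cubecl API.
fn atomic_add_f32(cell: &AtomicU32, value: f32) -> f32 {
    let mut old_bits = cell.load(Ordering::Relaxed);
    loop {
        // Compute the new value from the last observed bits.
        let new_bits = (f32::from_bits(old_bits) + value).to_bits();
        // Publish it only if nobody changed the cell in the meantime.
        match cell.compare_exchange_weak(
            old_bits,
            new_bits,
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            Ok(prev) => return f32::from_bits(prev),
            // Lost the race (or spurious failure): retry with the fresh bits.
            Err(actual) => old_bits = actual,
        }
    }
}
```

The loop needs only an integer compare-exchange, which is why it is a candidate lowering on targets that lack a native f32 atomic add.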

What's unchanged

  • 0.10.0's pinned-upload patch on zenforks-cubecl-runtime is still here.
  • All 11 renamed crates have the same package -> [lib] shim
    ([lib] name is cubecl_*, package name is zenforks-cubecl-*),
    so consumer source code such as use cubecl_runtime::*; compiles
    unchanged.

Consumer pin convention

[dependencies]
cubecl-runtime = { package = "zenforks-cubecl-runtime", version = "0.10.1" }
cubecl-cuda    = { package = "zenforks-cubecl-cuda",    version = "0.10.1" }
cubecl-wgpu    = { package = "zenforks-cubecl-wgpu",    version = "0.10.1" }
# Non-renamed crates stay on upstream:
cubecl-common  = "0.10.0"
cubecl-ir      = "0.10.0"

zenforks-v0.10.0 — vanilla rename + pinned-upload

28 May 00:05

First release of the zenforks-cubecl-* family on crates.io.

What this is

A maintained fork of tracel-ai/cubecl v0.10.0, with 11 of its 16
crates renamed and published to crates.io
under the zenforks-cubecl-* namespace. The renamed crates carry a
small number of internal-use patches that downstream imazen
projects (zenmetrics, six *-gpu perceptual-metric crates) depend on
while waiting for upstream PRs to merge.

What's renamed vs not

Renamed in 0.10.0 (published from this tag):
zenforks-cubecl, zenforks-cubecl-runtime, zenforks-cubecl-core,
zenforks-cubecl-cuda, zenforks-cubecl-wgpu, zenforks-cubecl-cpu,
zenforks-cubecl-cpp, zenforks-cubecl-hip, zenforks-cubecl-spirv,
zenforks-cubecl-std, zenforks-cubecl-opt.

Stays upstream (consume directly from tracel-ai/cubecl's
crates.io publication at 0.10.0): cubecl-common, cubecl-ir,
cubecl-macros, cubecl-macros-internal, cubecl-zspace.

These are leaves of the dep graph; no transitive dep on a patched
crate, so no need to rename.

What patches ship in 0.10.0

Only the pinned-host-buffer fast path for
ComputeClient::create_from_slice and friends, in cubecl-runtime.
~4x HtoD speedup on CUDA workloads via direct DMA from pinned host
memory at 12-25 GB/s on PCIe 4.0 (vs ~5-6 GB/s pageable bounce).
Drafted as upstream PR tracel-ai/cubecl#1334.

The PTX cache widening and Metal Atomic capability honesty
patches are coming in zenforks-v0.10.1.

Consumer pin convention

In your Cargo.toml, alias via the package field:

[dependencies]
cubecl-runtime = { package = "zenforks-cubecl-runtime", version = "0.10.0" }
cubecl-cuda    = { package = "zenforks-cubecl-cuda",    version = "0.10.0" }
# ...etc
# Non-renamed crates stay on upstream:
cubecl-common  = "0.10.0"

Then in source code, use cubecl_runtime::*; resolves to our
package because we keep [lib] name = "cubecl_runtime" unchanged.
No source rewrites needed.

Acknowledgement

Built on the great work of the upstream tracel-ai/cubecl maintainers.
This fork exists to ship downstream patches without waiting on
upstream review cycles, not to replace upstream.

zenforks-cubecl-cpu v0.10.2 — multi-cube sync_cube/SharedMemory fix

28 May 06:23

zenforks-cubecl-cpu v0.10.2 (cpu-only patch)

This release patches the MLIR visitor in zenforks-cubecl-cpu only. All other zenforks-cubecl-* crates stay at workspace version 0.10.1.

Fixed

  • Multi-cube SharedMemory + sync_cube isolation. The MLIR visitor generated 3 nested scf::for loops over CubeCount* inside the per-unit kernel body, so each unit iterates over every cube. The global sync_cube barrier in compute_task.rs counts cube_dim_size arrivals, and with nothing separating one cube iteration from the next, shared-memory isolation between cubes was lost: different units could advance to different cube iterations between syncs, so cube k's units could read shared memory written by cube k+1's unit 0.
    • Surfaced on cvvdp-gpu's downscale_tiled_kernel (LDS-tiled 5×5 gauss reduce, 16×16 workgroup + 36×36 SharedMemory tile): worked at 32×32 (1 workgroup) but diverged by 1.3 cells on 73×91 inputs (3×3 workgroups).
    • End-to-end downstream impact for the cvvdp JOD metric: ~1.73 JOD divergence vs pycvvdp v0.5.4 at 73×91 odd-dim, dropping to f32-precision parity (~1e-6 JOD) after the fix.
    • Fix: emit an implicit sync_cube call at the end of every cube-iteration body in the visitor's innermost scf::for. (93dd86d)
  • Pre-existing test compilation error: FastMath::all().difference(...) expected EnumSet<FastMath> but received a bare enum variant. Fixed by applying an .into() conversion. (04e4ffa)

Tests

  • New regression test test_sync_cube_multi_cube_writes_pos_cpu: 3 cubes × 4 units; cube k's unit 0 writes CUBE_POS_X = k to shared memory; all 4 units in cube k must read k. (93dd86d)
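
The race and the fix can be modelled with plain threads, which may help when reasoning about the regression test. This is a CPU-thread analogy under stated assumptions, not the MLIR visitor or the actual test: run_cubes is a hypothetical function, one AtomicU32 stands in for the SharedMemory cell, and a std::sync::Barrier sized to the unit count stands in for sync_cube.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::{Arc, Barrier};
use std::thread;

/// Runs `cubes` sequential cube iterations on `units` threads sharing one cell.
/// Returns, per unit, the value it read in each cube iteration.
/// With `sync_at_end = false` the reads are racy, mirroring the bug.
fn run_cubes(cubes: u32, units: usize, sync_at_end: bool) -> Vec<Vec<u32>> {
    let shared = Arc::new(AtomicU32::new(u32::MAX)); // stand-in for SharedMemory
    let barrier = Arc::new(Barrier::new(units)); // stand-in for sync_cube
    let handles: Vec<_> = (0..units)
        .map(|unit| {
            let shared = Arc::clone(&shared);
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                let mut seen = Vec::with_capacity(cubes as usize);
                for cube in 0..cubes {
                    if unit == 0 {
                        // Unit 0 publishes the cube position.
                        shared.store(cube, Ordering::SeqCst);
                    }
                    // Explicit sync written in the kernel: write before reads.
                    barrier.wait();
                    seen.push(shared.load(Ordering::SeqCst));
                    if sync_at_end {
                        // The fix: an implicit sync closing each cube iteration,
                        // so unit 0 cannot overwrite the cell for cube + 1
                        // while slower units are still reading cube's value.
                        barrier.wait();
                    }
                }
                seen
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Calling run_cubes(3, 4, true) matches the 3 cubes × 4 units shape of the regression test: with the closing barrier, every unit reads k in cube iteration k.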

Workspace

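Other zenforks-cubecl-* crates remain at 0.10.1. Downstream consumers using zenforks-cubecl = "0.10.1" will pull zenforks-cubecl-cpu 0.10.2 via cargo's semver-compatible resolution on a fresh resolve; projects with an existing Cargo.lock stay on the locked version until they run cargo update.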