Releases: imazen/zenforks-cubecl
zenforks-v0.10.1 — PTX cache widening + Metal atomic capability honesty
Patch release on top of 0.10.0.
What's new in 0.10.1
1. Persistent PTX cache widening (zenforks-cubecl-cuda)
The existing disk-persistent PTX cache key was too narrow for our
usage. We add three axes:
- CUBECL_GIT_SHA: captured at build time. Invalidates on any
  zenforks-cubecl-cuda source change, not just upstream
  cubecl-common's Cargo.toml version field bumps.
- sm_arch: NVRTC compiles arch-specific PTX. Serving sm_70 PTX
  to an sm_80 device is a correctness bug; appending the arch makes
  safety structural.
- driver_version: different driver versions JIT the same PTX
  into different SASS. Per-driver safety.
Resulting on-disk layout:
<root>/cuda/<cubecl-common-ver>/<git-sha>/<sm_arch>/<driver_ver>/ptx.json.log
Eliminates the "fresh-process cold start = ~18s NVRTC re-compile
because the cache key was too narrow" failure mode that hit
zenmetrics' fleet workers under cubecl rev bumps.
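As a rough illustration of how the widened key maps onto the on-disk layout above, the Rust sketch below joins the four axes into a path. The PtxCacheKey struct, its field names, and the cache_file method are hypothetical names chosen for this example; they are not taken from the fork's source.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical key type. The fields mirror the axes named in these notes,
/// but the struct itself is illustrative only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PtxCacheKey {
    pub cubecl_common_version: String,
    pub git_sha: String,
    pub sm_arch: String,
    pub driver_version: String,
}

impl PtxCacheKey {
    /// Builds <root>/cuda/<cubecl-common-ver>/<git-sha>/<sm_arch>/<driver_ver>/ptx.json.log.
    /// Any change to one axis yields a different directory, so a stale entry
    /// is never looked up in the first place.
    pub fn cache_file(&self, root: &Path) -> PathBuf {
        root.join("cuda")
            .join(&self.cubecl_common_version)
            .join(&self.git_sha)
            .join(&self.sm_arch)
            .join(&self.driver_version)
            .join("ptx.json.log")
    }
}
```

Under this sketch, two keys that differ only in sm_arch resolve to different files, which is the "structural" safety property described above.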
2. Metal Atomic<f32> capability honesty (zenforks-cubecl-wgpu)
cubecl-wgpu's Metal backend was declaring Atomic<f32> + Add
capable, but naga's MSL backend doesn't emit
atomic_fetch_add_explicit for f32 — so the WGSL atomicAdd<f32>
got silently dropped during translation, leaving every reduction
returning its default 0.0 value. Symptom: every *-gpu metric's
score collapsed to a fall-through constant on Metal.
This patch drops AtomicUsage::Add from Metal's f32 atomic
registration. Callers requesting Atomic<f32>::fetch_add now fail
at construct time with an actionable error instead of returning
wrong numbers at runtime.
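The fail-at-construct-time behaviour can be pictured with a small capability table. This is a minimal sketch under stated assumptions: ElemKind, AtomicOp, Capabilities, and metal_like_capabilities are hypothetical stand-ins written for this example, not cubecl-wgpu's real registration API.

```rust
use std::collections::HashSet;

/// Hypothetical element kinds and atomic operations for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum ElemKind { F32, U32 }

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum AtomicOp { LoadStore, Add }

/// Set of (element, operation) pairs a backend claims to support.
#[derive(Default)]
pub struct Capabilities {
    atomics: HashSet<(ElemKind, AtomicOp)>,
}

impl Capabilities {
    pub fn register(&mut self, elem: ElemKind, op: AtomicOp) {
        self.atomics.insert((elem, op));
    }

    /// Checked once, up front, so an unsupported request becomes an error
    /// instead of a kernel that silently computes wrong numbers.
    pub fn require(&self, elem: ElemKind, op: AtomicOp) -> Result<(), String> {
        if self.atomics.contains(&(elem, op)) {
            Ok(())
        } else {
            Err(format!(
                "atomic {:?} on {:?} is not supported by this backend; \
                 use a non-atomic reduction path instead",
                op, elem
            ))
        }
    }
}

/// A backend that, like the patched Metal registration described above,
/// does not advertise Add for f32 atomics.
pub fn metal_like_capabilities() -> Capabilities {
    let mut caps = Capabilities::default();
    caps.register(ElemKind::U32, AtomicOp::LoadStore);
    caps.register(ElemKind::U32, AtomicOp::Add);
    caps.register(ElemKind::F32, AtomicOp::LoadStore);
    // Deliberately no (F32, Add): requesting it must fail early.
    caps
}
```

Calling metal_like_capabilities().require(ElemKind::F32, AtomicOp::Add) returns an Err with a message the caller can act on, while the u32 requests succeed.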
Not yet: Part B (CAS-loop WGSL codegen lowering for f32-atomic-add),
which would let Metal users actually get a correct Atomic<f32>::fetch_add.
That requires a wider change to cubecl-wgpu's WGSL Type system and
binding layer; it is deferred to a follow-on release. Until then, the
downstream zenmetrics workarounds (flipping the fast-reduction default
off on butteraugli-gpu and dssim-gpu, Metal-reject on cvvdp-gpu)
remain the production correctness fix on Metal.
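For readers unfamiliar with the CAS-loop technique that Part B refers to, here is a host-side Rust analogue. It is not the planned WGSL lowering, only a sketch of the same idea: store the f32 as raw bits in a 32-bit integer atomic and retry a compare-exchange until no other thread has interfered. The function name atomic_add_f32 is an assumption made for this example.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Adds `value` to the f32 whose bit pattern is stored in `cell` and
/// returns the previous value, using a compare-and-swap retry loop.
pub fn atomic_add_f32(cell: &AtomicU32, value: f32) -> f32 {
    let mut old_bits = cell.load(Ordering::Relaxed);
    loop {
        // Compute the candidate result from the value we last observed.
        let new_bits = (f32::from_bits(old_bits) + value).to_bits();
        match cell.compare_exchange_weak(
            old_bits,
            new_bits,
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            // Nobody changed the cell in between: the add is published.
            Ok(_) => return f32::from_bits(old_bits),
            // Another thread won the race (or the weak CAS failed
            // spuriously): retry with the freshly observed bits.
            Err(current) => old_bits = current,
        }
    }
}
```

Several threads can call atomic_add_f32 on the same cell concurrently and no increment is lost, at the cost of retries under contention.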
What's unchanged
- 0.10.0's pinned-upload patch on zenforks-cubecl-runtime is still here.
- All 11 renamed crates have the same package->[lib] shim
  ([lib] name is cubecl_*, package name is zenforks-cubecl-*),
  so consumer source code keeps reading use cubecl_runtime::*;
  unchanged.
Consumer pin convention
[dependencies]
cubecl-runtime = { package = "zenforks-cubecl-runtime", version = "0.10.1" }
cubecl-cuda = { package = "zenforks-cubecl-cuda", version = "0.10.1" }
cubecl-wgpu = { package = "zenforks-cubecl-wgpu", version = "0.10.1" }
# Non-renamed crates stay on upstream:
cubecl-common = "0.10.0"
cubecl-ir = "0.10.0"

zenforks-v0.10.0 — vanilla rename + pinned-upload
First release of the zenforks-cubecl-* family on crates.io.
What this is
A maintained fork of tracel-ai/cubecl
v0.10.0, with 11 of its 16 crates renamed and published to crates.io
under the zenforks-cubecl-* namespace. The renamed crates carry a
small number of internal-use patches that downstream imazen
projects (zenmetrics, six *-gpu perceptual-metric crates) depend on
while waiting for upstream PRs to merge.
What's renamed vs not
Renamed in 0.10.0 (published from this tag):
zenforks-cubecl, zenforks-cubecl-runtime, zenforks-cubecl-core,
zenforks-cubecl-cuda, zenforks-cubecl-wgpu, zenforks-cubecl-cpu,
zenforks-cubecl-cpp, zenforks-cubecl-hip, zenforks-cubecl-spirv,
zenforks-cubecl-std, zenforks-cubecl-opt.
Stays upstream (consume directly from tracel-ai/cubecl's
crates.io publication at 0.10.0): cubecl-common, cubecl-ir,
cubecl-macros, cubecl-macros-internal, cubecl-zspace.
These are leaves of the dep graph; no transitive dep on a patched
crate, so no need to rename.
What patches ship in 0.10.0
Only the pinned-host-buffer fast path for
ComputeClient::create_from_slice and friends, in cubecl-runtime.
~4x HtoD speedup on CUDA workloads via direct DMA from pinned host
memory at 12-25 GB/s on PCIe 4.0 (vs ~5-6 GB/s pageable bounce).
Drafted as upstream PR
tracel-ai/cubecl#1334.
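To put the quoted bandwidth figures in perspective, the arithmetic can be sketched as follows. The helper names transfer_ms and speedup are hypothetical, and the calculation treats bandwidth as the only cost, ignoring launch overhead and latency.

```rust
/// Time in milliseconds to move `bytes` at `gb_per_s` (decimal GB/s),
/// assuming bandwidth is the only cost.
pub fn transfer_ms(bytes: u64, gb_per_s: f64) -> f64 {
    bytes as f64 / (gb_per_s * 1e9) * 1e3
}

/// Ratio of pinned to pageable host-to-device bandwidth.
pub fn speedup(pinned_gb_per_s: f64, pageable_gb_per_s: f64) -> f64 {
    pinned_gb_per_s / pageable_gb_per_s
}
```

With the figures above, the endpoints work out to 12 / 6 = 2x and 25 / 5 = 5x, so the ~4x figure sits inside that range. For a 1 GB upload, that is roughly 200 ms at 5 GB/s versus 40 ms at 25 GB/s.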
The PTX cache widening and Metal Atomic capability honesty
patches are coming in zenforks-v0.10.1.
Consumer pin convention
In your Cargo.toml, alias via the package field:
[dependencies]
cubecl-runtime = { package = "zenforks-cubecl-runtime", version = "0.10.0" }
cubecl-cuda = { package = "zenforks-cubecl-cuda", version = "0.10.0" }
# ...etc
# Non-renamed crates stay on upstream:
cubecl-common = "0.10.0"
Then in source code, use cubecl_runtime::*; resolves to our
package because we keep [lib] name = "cubecl_runtime" unchanged.
No source rewrites needed.
Acknowledgement
Built on the great work of the upstream
tracel-ai/cubecl maintainers.
This fork exists to ship downstream patches without waiting on
upstream review cycles, not to replace upstream.
zenforks-cubecl-cpu v0.10.2 — multi-cube sync_cube/SharedMemory fix
zenforks-cubecl-cpu v0.10.2 (cpu-only patch)
This release patches the MLIR visitor in zenforks-cubecl-cpu only. All other zenforks-cubecl-* crates stay at workspace version 0.10.1.
Fixed
- Multi-cube SharedMemory + sync_cube isolation. The MLIR visitor generated 3 nested scf::for loops over CubeCount* inside the per-unit kernel body, but the global sync_cube barrier in compute_task.rs (counted in cube_dim_size arrivals) lost shared-memory isolation between cubes: different units could advance to different cube iterations between syncs, so cube k's units could read shared memory written by cube k+1's unit 0.
  - Surfaced on cvvdp-gpu's downscale_tiled_kernel (LDS-tiled 5x5 gauss reduce, 16x16 workgroup + 36x36 SharedMemory tile): worked at 32×32 (1 workgroup) but diverged by 1.3 cells on 73×91 inputs (3x3 workgroups).
  - End-to-end downstream impact for the cvvdp JOD metric: ~1.73 JOD divergence vs pycvvdp v0.5.4 at 73×91 odd-dim, dropping to f32-precision parity (~1e-6 JOD) after the fix.
  - Fix: emit an implicit sync_cube call at the end of every cube-iteration body in the visitor's innermost scf::for. (93dd86d)
- Pre-existing test compilation error: FastMath::all().difference(...) expected EnumSet<FastMath> but received a bare enum variant. Apply .into() coercion. (04e4ffa)
Tests
- New regression test test_sync_cube_multi_cube_writes_pos_cpu: 3 cubes × 4 units; cube k's unit 0 writes CUBE_POS_X = k to shared memory; all 4 units in cube k must read k. (93dd86d)
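The race and its fix can be modelled on the host with plain threads. The sketch below is not the MLIR visitor or the actual regression test; it is an analogue in which each thread plays one unit, a std::sync::Barrier plays sync_cube, and an atomic integer plays the shared-memory slot. The function name run_cubes and the trailing_sync flag are assumptions made for this example.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Barrier;
use std::thread;

/// Runs `units` threads that each iterate over `cubes` cube indices.
/// Unit 0 writes the cube index to a shared slot, all units sync, then
/// every unit reads the slot. Returns what each unit read per cube.
pub fn run_cubes(units: usize, cubes: u32, trailing_sync: bool) -> Vec<Vec<u32>> {
    let barrier = Barrier::new(units);
    let shared = AtomicU32::new(u32::MAX);
    thread::scope(|s| {
        let handles: Vec<_> = (0..units)
            .map(|unit| {
                let barrier = &barrier;
                let shared = &shared;
                s.spawn(move || {
                    let mut seen = Vec::new();
                    for k in 0..cubes {
                        if unit == 0 {
                            shared.store(k, Ordering::SeqCst);
                        }
                        // The kernel's own sync_cube: the write is visible.
                        barrier.wait();
                        seen.push(shared.load(Ordering::SeqCst));
                        // The fix: an implicit sync at the end of the
                        // iteration, so unit 0 cannot start writing for
                        // cube k+1 while others still read cube k.
                        if trailing_sync {
                            barrier.wait();
                        }
                    }
                    seen
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

With trailing_sync set to true, every unit deterministically reads 0, 1, 2 for three cubes. With it set to false, the result depends on thread scheduling, and a unit may observe the next cube's value, which mirrors the failure described under Fixed.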
Workspace
Other zenforks-cubecl-* crates remain at 0.10.1. Downstream consumers using zenforks-cubecl = "0.10.1" will pull zenforks-cubecl-cpu 0.10.2 via cargo's semver-compatible resolution.
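The resolution behaviour relies on Cargo's default caret semantics, which for 0.x versions treat the minor number as the breaking boundary. The check below is a simplified sketch of that rule: it ignores pre-release tags and build metadata, and it assumes the dependency on zenforks-cubecl-cpu is declared as a plain (caret) requirement rather than an exact = pin. The function name caret_matches is hypothetical.

```rust
/// (major, minor, patch)
pub type Version = (u64, u64, u64);

/// Simplified caret rule: does `candidate` satisfy `^req`?
pub fn caret_matches(req: Version, candidate: Version) -> bool {
    // A candidate older than the requirement never matches.
    if candidate < req {
        return false;
    }
    if req.0 > 0 {
        // ^x.y.z allows anything below (x+1).0.0.
        candidate.0 == req.0
    } else if req.1 > 0 {
        // ^0.y.z allows anything below 0.(y+1).0.
        candidate.0 == 0 && candidate.1 == req.1
    } else {
        // ^0.0.z allows only exactly 0.0.z.
        candidate == req
    }
}
```

Under this rule a requirement of 0.10.1 accepts 0.10.2 but rejects 0.11.0, which is why the cpu-only patch can be picked up without bumping the other crates.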