zenforks-cubecl-cpu v0.10.2 — multi-cube sync_cube/SharedMemory fix

@lilith lilith released this 28 May 06:23

zenforks-cubecl-cpu v0.10.2 (cpu-only patch)

This release patches the MLIR visitor in zenforks-cubecl-cpu only. All other zenforks-cubecl-* crates stay at workspace version 0.10.1.

Fixed

  • Multi-cube SharedMemory + sync_cube isolation. The MLIR visitor generates 3 nested scf::for loops over CubeCount* inside the per-unit kernel body, while the global sync_cube barrier in compute_task.rs releases after cube_dim_size arrivals. Nothing kept units in the same cube iteration between syncs, so shared-memory isolation between cubes was lost: different units could advance to different cube iterations, and cube k's units could read shared memory written by cube k+1's unit 0.
    • Surfaced on cvvdp-gpu's downscale_tiled_kernel (LDS-tiled 5×5 gauss reduce, 16×16 workgroup + 36×36 SharedMemory tile): worked at 32×32 (1 workgroup) but diverged by 1.3 cells on 73×91 inputs (3×3 workgroups).
    • End-to-end downstream impact for the cvvdp JOD metric: ~1.73 JOD divergence vs pycvvdp v0.5.4 on odd-dimension 73×91 inputs, dropping to f32-precision parity (~1e-6 JOD) after the fix.
    • Fix: emit an implicit sync_cube call at the end of every cube-iteration body in the visitor's innermost scf::for. (93dd86d)
  • Pre-existing test compilation error: FastMath::all().difference(...) expected an EnumSet<FastMath> but received a bare enum variant. Fixed by converting the variant with .into(). (04e4ffa)

Tests

  • New regression test test_sync_cube_multi_cube_writes_pos_cpu: 3 cubes × 4 units; cube k's unit 0 writes CUBE_POS_X = k to shared memory; all 4 units in cube k must read k. (93dd86d)
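
The race and the fix can be modelled outside cubecl with plain Rust threads. The sketch below is an illustration only, not the MLIR the visitor emits or the actual regression test: each thread stands in for one unit that walks all cubes serially, a single AtomicUsize stands in for the SharedMemory cell, and a std::sync::Barrier sized to the cube dimension stands in for sync_cube. The names run, trailing_sync, CUBE_COUNT and CUBE_DIM are hypothetical.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Barrier};
use std::thread;

const CUBE_COUNT: usize = 3; // cubes, iterated serially by every unit
const CUBE_DIM: usize = 4; // units per cube, one thread each

/// Returns reads[cube][unit]: the value each unit observed in the shared cell.
/// With `trailing_sync = false` the result is racy; with `true` it is deterministic.
fn run(trailing_sync: bool) -> Vec<Vec<usize>> {
    // One shared-memory cell, reused by every cube iteration (assumption of this model).
    let shared = Arc::new(AtomicUsize::new(usize::MAX));
    // Stand-in for the sync_cube barrier: releases after CUBE_DIM arrivals.
    let barrier = Arc::new(Barrier::new(CUBE_DIM));

    let handles: Vec<_> = (0..CUBE_DIM)
        .map(|unit| {
            let shared = Arc::clone(&shared);
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                let mut seen = Vec::with_capacity(CUBE_COUNT);
                for cube in 0..CUBE_COUNT {
                    if unit == 0 {
                        // Unit 0 publishes the cube position.
                        shared.store(cube, Ordering::SeqCst);
                    }
                    barrier.wait(); // explicit sync_cube written in the kernel
                    seen.push(shared.load(Ordering::SeqCst));
                    if trailing_sync {
                        // Implicit sync at the end of the cube-iteration body:
                        // unit 0 cannot start cube + 1 until every unit has read.
                        barrier.wait();
                    }
                }
                seen
            })
        })
        .collect();

    let per_unit: Vec<Vec<usize>> = handles
        .into_iter()
        .map(|h| h.join().expect("unit thread panicked"))
        .collect();

    // Transpose from [unit][cube] to [cube][unit].
    (0..CUBE_COUNT)
        .map(|cube| per_unit.iter().map(|unit| unit[cube]).collect())
        .collect()
}
```

Calling run(true) yields k for every unit of cube k on every run. Calling run(false) may or may not show a wrong read on a given run, because unit 0 can store the next cube's value before slower units have loaded the current one; that nondeterminism is the failure mode described above.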

Workspace

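Other zenforks-cubecl-* crates remain at 0.10.1. Downstream consumers using zenforks-cubecl = "0.10.1" will pull zenforks-cubecl-cpu 0.10.2 via cargo's semver-compatible resolution. Projects with an existing Cargo.lock stay on the locked 0.10.1 until they run cargo update -p zenforks-cubecl-cpu.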