zenforks-cubecl-cpu v0.10.2 (cpu-only patch)
This release patches the MLIR visitor in zenforks-cubecl-cpu only. All other zenforks-cubecl-* crates stay at workspace version 0.10.1.
Fixed
- Multi-cube `SharedMemory` + `sync_cube` isolation. The MLIR visitor generated 3 nested `scf::for` loops over `CubeCount*` inside the per-unit kernel body, but the global `sync_cube` barrier in `compute_task.rs` (counted in `cube_dim_size` arrivals) lost shared-memory isolation between cubes: different units could advance to different cube iterations between syncs, so cube k's units could read shared memory written by cube k+1's unit 0.
  - Surfaced on cvvdp-gpu's `downscale_tiled_kernel` (LDS-tiled 5x5 gauss reduce, 16x16 workgroup + 36x36 `SharedMemory` tile): worked at 32×32 (1 workgroup) but diverged by 1.3 cells on 73×91 inputs (3x3 workgroups).
  - End-to-end downstream impact for the cvvdp JOD metric: ~1.73 JOD divergence vs pycvvdp v0.5.4 at 73×91 odd-dim, dropping to f32-precision parity (~1e-6 JOD) after the fix.
  - Fix: emit an implicit `sync_cube` call at the end of every cube-iteration body in the visitor's innermost `scf::for`. (93dd86d)
- Pre-existing test compilation error: `FastMath::all().difference(...)` expected `EnumSet<FastMath>` but received a bare enum variant. Fixed by applying an `.into()` coercion. (04e4ffa)
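The `.into()` fix above follows a common Rust pattern: a set-typed parameter accepts a single variant once that variant is converted into a one-element set. The sketch below illustrates the pattern using only the standard library. `Flag`, its variant names, and `FlagSet` are hypothetical stand-ins, not the real `FastMath` or `EnumSet` types, and the actual signatures in the crate may differ.

```rust
use std::collections::BTreeSet;

// Hypothetical stand-in for an enum of flags (not the real FastMath).
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Flag {
    NoNans,
    NoInfs,
    Reassoc,
}

// Hypothetical stand-in for a set-of-flags type (not the real EnumSet).
#[derive(Clone, Debug, PartialEq, Eq)]
struct FlagSet(BTreeSet<Flag>);

impl FlagSet {
    /// Set containing every variant.
    fn all() -> Self {
        FlagSet(BTreeSet::from([Flag::NoNans, Flag::NoInfs, Flag::Reassoc]))
    }

    /// Elements of `self` that are not in `other`. Takes a set, not a variant.
    fn difference(&self, other: FlagSet) -> FlagSet {
        FlagSet(self.0.difference(&other.0).copied().collect())
    }
}

// Converting one variant into a one-element set is what makes `.into()` work.
impl From<Flag> for FlagSet {
    fn from(flag: Flag) -> Self {
        FlagSet(BTreeSet::from([flag]))
    }
}

fn main() {
    // `FlagSet::all().difference(Flag::Reassoc)` would be rejected by the
    // compiler: the parameter is a FlagSet and a bare Flag was supplied.
    let remaining = FlagSet::all().difference(Flag::Reassoc.into());
    assert_eq!(
        remaining,
        FlagSet(BTreeSet::from([Flag::NoNans, Flag::NoInfs]))
    );
    println!("{:?}", remaining);
}
```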
Tests
- New regression test `test_sync_cube_multi_cube_writes_pos_cpu`: 3 cubes × 4 units; cube k's unit 0 writes `CUBE_POS_X = k` to shared memory; all 4 units in cube k must read `k`. (93dd86d)
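The race and its fix can be modelled with plain threads. This is a simplified sketch, not the CubeCL runtime or the actual regression test: each thread plays one unit, a single atomic cell stands in for the `SharedMemory` tile, and `std::sync::Barrier` stands in for `sync_cube`. The function name `run_cubes` and the `trailing_sync` flag are hypothetical, introduced only for this illustration.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::{Arc, Barrier};
use std::thread;

/// Runs `cubes` sequential cube iterations on `units` threads that share one
/// memory cell. Returns, per unit, the value it read in each cube iteration.
/// With `trailing_sync == false` the result is racy by design.
fn run_cubes(cubes: u32, units: usize, trailing_sync: bool) -> Vec<Vec<u32>> {
    let shared = Arc::new(AtomicU32::new(u32::MAX)); // stand-in for shared memory
    let barrier = Arc::new(Barrier::new(units)); // stand-in for sync_cube

    let handles: Vec<_> = (0..units)
        .map(|unit| {
            let shared = Arc::clone(&shared);
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                let mut seen = Vec::with_capacity(cubes as usize);
                for cube in 0..cubes {
                    if unit == 0 {
                        // Unit 0 publishes the current cube index.
                        shared.store(cube, Ordering::SeqCst);
                    }
                    barrier.wait(); // the kernel's explicit sync
                    seen.push(shared.load(Ordering::SeqCst));
                    if trailing_sync {
                        // Modelled fix: nobody starts cube + 1 until every
                        // unit has finished reading cube's value.
                        barrier.wait();
                    }
                }
                seen
            })
        })
        .collect();

    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    // 3 cubes x 4 units, as in the regression test description.
    let reads = run_cubes(3, 4, true);
    for per_unit in &reads {
        assert_eq!(per_unit, &vec![0, 1, 2]);
    }
    println!("with trailing sync: {:?}", reads);
}
```

With `trailing_sync` set to `false`, unit 0 may store the index for cube k+1 before a slower unit has loaded the value for cube k, which is the same shape of failure described under Fixed. That outcome depends on thread scheduling, so the sketch asserts only on the synchronized variant.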
Workspace
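Other zenforks-cubecl-* crates remain at 0.10.1. Downstream consumers using zenforks-cubecl = "0.10.1" will pull zenforks-cubecl-cpu 0.10.2 via cargo's semver-compatible resolution. Projects with an existing Cargo.lock stay on the locked version until they run cargo update -p zenforks-cubecl-cpu.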