What's new
- GPU-resident Merkle tensor_hash in job setup (per-grid setup overhead 42.6% → 9.1%, bit-exact) plus a tiled shared-memory transpose.
- Coopmat layout query: the backend takes the coopmat path only when the queried layout is the native 16×16 sint8 shape, and otherwise falls back cleanly to the DP4A path (so RDNA2 and other GPUs without matrix cores still mine).
- Backend C ported off Win32 (timing → clock_gettime, loader → dlopen) so the same source builds as p40vk.dll (Windows) and libp40vk.so (Linux/ARM).
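For readers curious what the Win32 port amounts to, here is a minimal sketch of the two POSIX calls named above. It is an illustration under assumptions, not the backend's actual code: the helper names `now_ns` and `load_symbol` are hypothetical, and the library path and symbol are left as parameters because the notes do not say what the loader opens.

```c
#define _POSIX_C_SOURCE 200809L /* expose clock_gettime under strict -std= modes */
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Monotonic timestamp in nanoseconds (hypothetical helper name). */
static inline uint64_t now_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0); /* CLOCK_MONOTONIC is expected to be available */
    (void)rc;
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Open lib_path and resolve one symbol (hypothetical helper name).
 * Returns NULL on failure. On success, *handle_out receives the handle
 * that the caller should pass to dlclose() when done. */
static inline void *load_symbol(const char *lib_path, const char *symbol,
                                void **handle_out)
{
    void *handle = dlopen(lib_path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL)
        return NULL;

    void *sym = dlsym(handle, symbol);
    if (sym == NULL) {
        dlclose(handle);
        return NULL;
    }
    *handle_out = handle;
    return sym;
}
```

The notes do not say how the Windows build obtains these calls; a POSIX-compatible toolchain or a thin compatibility layer are two possibilities. On Linux, older glibc versions need `-ldl` at link time for `dlopen`.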
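The coopmat-or-DP4A decision in the list above can be pictured with a small CPU-side sketch. Everything here is an assumption made for illustration: the `coopmat_shape` struct, the `choose_path` function, and the enum names are hypothetical, and the real check presumably runs against whatever the GPU driver reports. `dp4a_ref` is a scalar reference for what a DP4A instruction computes (a dot product of four packed signed 8-bit values, added to a 32-bit accumulator), not the GPU code itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one matrix shape a device reports. */
typedef struct {
    uint32_t rows;
    uint32_t cols;
    int      is_sint8; /* nonzero if the element type is signed 8-bit */
} coopmat_shape;

typedef enum { PATH_COOPMAT, PATH_DP4A } mm_path; /* hypothetical names */

/* Pick the coopmat path only if a 16x16 sint8 shape is reported;
 * otherwise fall back to DP4A. */
static inline mm_path choose_path(const coopmat_shape *shapes, size_t count)
{
    assert(shapes != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        if (shapes[i].rows == 16 && shapes[i].cols == 16 && shapes[i].is_sint8)
            return PATH_COOPMAT;
    }
    return PATH_DP4A;
}

/* Scalar reference for DP4A: treat a and b as four packed signed bytes,
 * multiply lane by lane, and add the sum to acc. */
static inline int32_t dp4a_ref(uint32_t a, uint32_t b, int32_t acc)
{
    for (int i = 0; i < 4; ++i) {
        /* Narrowing to int8_t reinterprets the byte as two's complement
         * on all mainstream compilers. */
        int8_t ai = (int8_t)(uint8_t)(a >> (8 * i));
        int8_t bi = (int8_t)(uint8_t)(b >> (8 * i));
        acc += (int32_t)ai * (int32_t)bi;
    }
    return acc;
}
```

For example, `dp4a_ref(0x04030201u, 0xFF020101u, 10)` multiplies the lanes (1, 2, 3, 4) by (1, 1, 2, -1) and adds 10, giving 15. A device that reports no 16×16 sint8 shape would be routed to `PATH_DP4A` by `choose_path`.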
Solo and pool mining carry the same transparent 1% dev fee.