Releases: NVlabs/cutile-rs

v0.2.0

16 Jun 19:41
v0.2.0
de6c5cd

cuTile Rust 0.2.0 adds low-precision inference support and accompanies our paper, Fearless Concurrency on the GPU.

Paper: https://arxiv.org/abs/2606.15991
Artifacts: cutile-benchmarks/paper/
Announcement: #164

Highlights

  • Low-precision support targeting CUDA 13.3, including NVFP4 packing and unpacking and block-scaled matrix multiply.
  • Runnable NVFP4 and MXFP8 examples.
  • New cutile-kernels crate with reusable inference-oriented kernels written in cuTile Rust.
  • Paper reproducibility artifacts under cutile-benchmarks/paper/, including benchmark harnesses, CSV results, hardware details, and clock settings.
  • Expanded compile-only regression coverage and example/test cleanup for release validation.
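
For readers new to NVFP4, the packing and block scaling mentioned above can be sketched on the host in plain Rust. This is an illustrative sketch only, not the cuTile Rust API or its implementation: every function name here is hypothetical, the low-nibble-first packing order is an assumption, and the per-block scale is kept as an f32 for simplicity, whereas the NVFP4 format stores block scales in FP8.

```rust
/// Magnitudes representable by a 4-bit E2M1 code, indexed by its low 3 bits.
const E2M1_MAGNITUDES: [f32; 8] = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0];

/// Decode one 4-bit E2M1 code (bit 3 is the sign) to f32.
fn decode_e2m1(code: u8) -> f32 {
    let mag = E2M1_MAGNITUDES[(code & 0x7) as usize];
    if code & 0x8 != 0 { -mag } else { mag }
}

/// Encode an f32 to the nearest E2M1 code, saturating at magnitude 6.
/// Ties go to the smaller magnitude (a simplification, not round-to-even).
fn encode_e2m1(x: f32) -> u8 {
    let sign: u8 = if x.is_sign_negative() { 0x8 } else { 0 };
    let a = x.abs().min(6.0);
    let mut best = 0usize;
    for (i, &m) in E2M1_MAGNITUDES.iter().enumerate() {
        if (a - m).abs() < (a - E2M1_MAGNITUDES[best]).abs() {
            best = i;
        }
    }
    sign | best as u8
}

/// Pack two 4-bit codes into one byte (assumed order: low nibble first).
fn pack_pair(lo: u8, hi: u8) -> u8 {
    (lo & 0xF) | ((hi & 0xF) << 4)
}

/// Split one byte back into its (low, high) 4-bit codes.
fn unpack_pair(byte: u8) -> (u8, u8) {
    (byte & 0xF, byte >> 4)
}

/// Quantize a 16-element block: one shared scale plus 8 packed bytes.
fn quantize_block(block: &[f32; 16]) -> (f32, [u8; 8]) {
    let max_abs = block.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    // Map the largest magnitude in the block onto the largest E2M1 value.
    let scale = if max_abs > 0.0 { max_abs / 6.0 } else { 1.0 };
    let mut packed = [0u8; 8];
    for (i, pair) in block.chunks_exact(2).enumerate() {
        packed[i] = pack_pair(encode_e2m1(pair[0] / scale), encode_e2m1(pair[1] / scale));
    }
    (scale, packed)
}

/// Reverse of `quantize_block`: unpack each byte and reapply the scale.
fn dequantize_block(scale: f32, packed: &[u8; 8]) -> [f32; 16] {
    let mut out = [0.0f32; 16];
    for (i, &byte) in packed.iter().enumerate() {
        let (lo, hi) = unpack_pair(byte);
        out[2 * i] = decode_e2m1(lo) * scale;
        out[2 * i + 1] = decode_e2m1(hi) * scale;
    }
    out
}
```

Calling quantize_block on a block and passing the result to dequantize_block recovers the input exactly only when every element divided by the scale is itself an E2M1 value; otherwise each element is rounded to the nearest representable value.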

Performance context

On NVIDIA B200, cuTile Rust reaches 7 TB/s for element-wise operations and 2 PFlop/s for GEMM, about 91% of peak memory bandwidth and 92% of dense f16 peak, respectively. Safe Rust persistent GEMM reaches 2.07 PFlop/s at M=N=K=8192, within 0.3% of the corresponding low-level Tile IR variant.

We also evaluated Grout, a Qwen3 inference engine built with cuTile Rust in collaboration with Hugging Face. In batch-1 Qwen3 decode, Grout reaches 171 tokens/s for Qwen3-4B on NVIDIA GeForce RTX 5090 and 82 tokens/s for Qwen3-32B on B200, showing strong performance on memory-bound inference tasks, consistent with our HBM roofline analysis.
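
The roofline reasoning behind a memory-bound decode figure can be sketched as a back-of-the-envelope bound: at batch size 1, each generated token has to stream the model weights from device memory roughly once, so throughput cannot exceed bandwidth divided by weight size. The sketch below is not the paper's analysis. The function name is hypothetical, and the inputs used in the usage note (nominal parameter counts taken from the model names, 2 bytes per parameter, and vendor peak bandwidths of 8000 GB/s for B200 and 1792 GB/s for RTX 5090) are assumptions that do not come from these release notes.

```rust
/// Upper bound on batch-1 decode throughput, in tokens per second, assuming
/// every token requires reading all weights from device memory exactly once.
/// KV-cache traffic and activations are ignored, so the real bound is lower.
fn roofline_tokens_per_s(params_billion: f64, bytes_per_param: f64, bandwidth_gb_s: f64) -> f64 {
    // Billions of parameters times bytes per parameter gives gigabytes.
    let weight_gb = params_billion * bytes_per_param;
    bandwidth_gb_s / weight_gb
}

/// Fraction of the roofline bound that a measured throughput achieves.
fn roofline_fraction(measured_tokens_per_s: f64, bound_tokens_per_s: f64) -> f64 {
    measured_tokens_per_s / bound_tokens_per_s
}
```

Under those assumed inputs, roofline_tokens_per_s(32.0, 2.0, 8000.0) gives the bound for a 32B-parameter model on B200, and roofline_fraction(82.0, bound) compares the reported figure against it. A different weight precision or bandwidth figure changes the bound proportionally.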

Notes

cuTile Rust remains early-stage research software. CUDA 13.3 is required for the new low-precision features, and hardware-specific features such as native FP4 require architectures that support them.

v0.1.1

15 Jun 22:06
v0.1.1
c239867

This is a small 0.1.x release for NVFP4 Tile IR support and documentation updates.

  • Added NVFP4 type and op coverage across Tile IR, compiler lowering, examples, tests, and book material.
  • Added versioned book build and publish support.
  • Bumped the workspace crates to 0.1.1 and refreshed CI/docs.

v0.1.0

16 May 05:49
v0.1.0
f79740b

cuTile Rust v0.1.0 release

cuTile Rust v0.1.0 is now available on crates.io.

We've finalized our host-side and device-side APIs. We are not planning any
further breaking changes to the kernel authoring model, tensor launch API,
DeviceOp execution model, or core device operation surface.

It is also much easier to try cuTile Rust in a normal Rust project:

cargo add cutile@0.1.0

or add it directly to Cargo.toml:

[dependencies]
cutile = "0.1.0"

That should be enough for normal kernel authoring; you should not need to
depend on the internal workspace crates directly.

Highlights in v0.1.0:

  • Finalized host and device APIs for tensor kernels, DeviceOps, async
    execution, CUDA graphs, and CUDA interop.
  • Device-side operations now closely track Tile IR, including tensor views,
    partition views, atomics, memory ordering, shape operations, and tile math.
  • The JIT compiler now has stronger type inference, static dispatch lowering,
    type aliases, global constants, Global, else if, and better diagnostics.
  • Mapped partitions support safe persistent scheduling patterns, including a
    persistent GEMM example.
  • Dynamic-shape and read-only tile-like loads generate faster code in important
    cases.
  • Runtime ergonomics improved with dynamic CUDA bindings, tileiras override,
    custom memory pools, and memory accounting.

The README and book have also been updated with a shorter quick-start example
and current API docs.

Please keep sending feedback, especially on the v0.1 API surface as you start
using it from crates.io.