Summary
Extend mlxcel's release build matrix to cover Linux x86_64 and Windows in addition to the current macOS aarch64 + Linux aarch64 (CUDA) targets. This aligns the distribution surface with the Backend.AI:GO desktop application and makes mlxcel deployable to customer sites that standardize on Linux x86_64 (the dominant CUDA host architecture in production).
Background
Today's releases ship two artifact families:
- mlxcel-macos-aarch64.zip: Apple Silicon
- mlxcel-linux-aarch64-cuda13-{gb10,gh200}: Linux aarch64 + CUDA (Grace Blackwell / Grace Hopper)
The README prerequisites already claim "Linux (aarch64 or x86_64)" support, but no published binary exists for x86_64 today. Windows is not built at all, although the runtime distribution layout (MLX upstream's mlx/backend/cuda/delayload.cpp delay-load mechanism + NVIDIA DLL bundling, ~1.0–1.3 GB total) has been worked out.
Two motivations:
- Customer fit — Most enterprise CUDA hosts (RTX, A100, L40, H100/H200 outside of GH200 Grace systems) are Linux x86_64. Without a published x86_64 binary, those sites cannot adopt mlxcel without building from source.
- Backend.AI:GO parity — The Backend.AI:GO desktop application supports macOS + Windows + Linux x86_64. mlxcel should match so the same runtime can serve any Backend.AI:GO target.
Proposed Solution
Split the work into three deliverables.
A. Linux x86_64 + CUDA build
- Add a release artifact mlxcel-linux-x86_64-cuda13.{tar.gz|zip} produced from an x86_64 + NVIDIA host (build-only is acceptable for the first iteration; smoke test on a real GPU is preferred).
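- Decide on the target CUDA architectures: SM 80/86/89/90 covers Ampere through Hopper, and the existing CUDA arch matrix in README.md already documents non-Hopper quantization limitations. The RTX 50 series named under deliverable C is Blackwell and falls outside that range, so it would need an additional SM target.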
- Verify that src/lib/mlxcel-core/build.rs and the bundled MLX C++ build produce equivalent output on x86_64. This is expected to be mechanical because the Linux paths are already separated from the macOS ones via #[cfg(target_os = "linux")].
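To make the architecture decision concrete, the sketch below shows one way a build script could turn the SM list above into a value for CMake's standard CMAKE_CUDA_ARCHITECTURES variable. The constant and function names are hypothetical, and whether mlxcel's build.rs or the bundled MLX build actually receives the list through this variable is an assumption.

```rust
/// Compute capabilities named in this proposal:
/// Ampere (80, 86), Ada (89), Hopper (90).
/// Hypothetical constant; the final list is still to be decided.
const X86_64_SM_TARGETS: [u32; 4] = [80, 86, 89, 90];

/// Joins SM numbers into the semicolon-separated form that CMake's
/// `CMAKE_CUDA_ARCHITECTURES` variable accepts, e.g. "80;86;89;90".
fn cmake_cuda_architectures(sms: &[u32]) -> String {
    sms.iter()
        .map(|sm| sm.to_string())
        .collect::<Vec<_>>()
        .join(";")
}

/// Builds the `-D` definition a build script could hand to the CMake
/// invocation (assumption: the bundled build honours this variable).
fn cmake_arch_flag(sms: &[u32]) -> String {
    format!("-DCMAKE_CUDA_ARCHITECTURES={}", cmake_cuda_architectures(sms))
}
```

Keeping the list in one constant means adding a further target later is a one-line change rather than an edit to the CMake invocation itself.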
B. Windows + CUDA build
- Extend src/lib/mlxcel-core/build.rs with #[cfg(target_os = "windows")] branches:
  - MSVC toolchain detection (vs. GCC/Clang on Linux)
  - CUDA_PATH env var (Windows convention) in addition to CUDA_HOME
  - Link against cudart.lib and friends; the static-vs-dynamic linkage decision mirrors MLX's Windows build
- Add Windows handling in the bundled MLX C++ build for OpenBLAS FetchContent and the delay-load DLL compile definitions (MLX_CUDA_BIN_DIR, MLX_CUDNN_BIN_DIR).
- Package the Windows artifact (mlxcel.exe + nvidia/cublas/bin/... etc.) per the runtime distribution layout. Bundle size is expected to be ~1.0–1.3 GB because of cuBLAS + NVRTC + cuDNN, which fits within the 2 GB-per-asset GitHub Release limit.
- Code signing: investigate signtool.exe / Azure Code Signing, or ship unsigned for the first release with a documented SmartScreen warning.
- Confirm all crates compile on x86_64-pc-windows-msvc: tokenizers, safetensors, axum, cxx, sentencepiece-sys, and the multimodal stack (ffmpeg / video frame extraction may need a Windows-friendly variant).
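The CUDA_PATH / CUDA_HOME handling in the first item could be sketched as follows. This is a minimal illustration, not the actual build.rs: both function names are hypothetical, and the fallback order on each platform is an assumption. The target OS is passed in as a string because, inside a build script, #[cfg(target_os = ...)] describes the machine running the script, while Cargo exposes the OS being compiled for through the CARGO_CFG_TARGET_OS environment variable.

```rust
use std::path::PathBuf;

/// Environment variables to consult, in priority order, for a given target OS.
/// Windows prefers CUDA_PATH (the convention noted in this issue); other
/// platforms prefer CUDA_HOME. The fallback order is an assumption.
fn cuda_env_candidates(target_os: &str) -> &'static [&'static str] {
    match target_os {
        "windows" => &["CUDA_PATH", "CUDA_HOME"],
        _ => &["CUDA_HOME", "CUDA_PATH"],
    }
}

/// Resolves the CUDA toolkit root. The lookup is injected so the logic can be
/// exercised without touching the real process environment.
fn resolve_cuda_root<F>(target_os: &str, lookup: F) -> Option<PathBuf>
where
    F: Fn(&str) -> Option<String>,
{
    cuda_env_candidates(target_os)
        .iter()
        .copied()
        .filter_map(|name| lookup(name))
        // Treat an empty variable as unset and keep looking.
        .find(|value| !value.is_empty())
        .map(PathBuf::from)
}
```

In a build script this would be called along the lines of `resolve_cuda_root(&std::env::var("CARGO_CFG_TARGET_OS").unwrap_or_default(), |k| std::env::var(k).ok())`, with a clear error if it returns `None`.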
C. Documentation + README updates
- Update README.md prerequisites to enumerate Linux aarch64, Linux x86_64, and Windows distinctly rather than the current "Linux (aarch64 or x86_64)" shorthand.
- Add a docs/windows-build-guide.md for developers building from source on Windows.
- Update the CUDA architecture compatibility table to reflect the broader x86_64 GPU coverage (RTX 30/40/50 series, A100, L40, H100).
Implementation Notes
- Code surface that already touches Windows
  - src/distributed/rdma_capabilities.rs already has cfg!(target_os = "windows") branches. The rest of src/distributed/ should be audited for Windows-specific socket / transport quirks before claiming pipeline parallelism works on Windows.
  - No other target_os = "windows" gates exist today.
- MLX upstream Windows status: MLX has the delay-load CUDA mechanism (mlx/backend/cuda/delayload.cpp), but full correctness on Windows may require additional upstream patches.
- Self-hosted runner needs: an x86_64 + NVIDIA Linux runner (an RTX-class GPU is enough for building, but smoke testing benefits from Hopper or Blackwell). The Windows + NVIDIA runner should ideally have the same class of GPU.
- Build-only vs. test-on-platform — For the first iteration, building plus a smoke test (model load + 10-token generate) on each platform is sufficient. Full benchmark + parity runs can land in a follow-up.
- Rust target triples
  - Linux x86_64: x86_64-unknown-linux-gnu
  - Windows: x86_64-pc-windows-msvc
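As an illustration of how the target triples relate to the artifact names used in this issue, here is a small sketch of a mapping function. The function name is hypothetical, the two aarch64 triples are assumptions (the issue does not state which triples the existing builds use), and the archive extension is left out because the Linux x86_64 format is still undecided.

```rust
/// Maps a Rust target triple to the release artifact base name.
/// `variant` only applies to Linux aarch64, where releases are split per
/// platform (e.g. "gb10", "gh200"); it is ignored for other targets.
fn artifact_name(triple: &str, variant: Option<&str>) -> Option<String> {
    let base = match triple {
        // Assumed triple for the existing Apple Silicon build.
        "aarch64-apple-darwin" => "mlxcel-macos-aarch64".to_string(),
        // Assumed triple for the existing Grace builds; a variant is required.
        "aarch64-unknown-linux-gnu" => {
            format!("mlxcel-linux-aarch64-cuda13-{}", variant?)
        }
        // New targets proposed in this issue.
        "x86_64-unknown-linux-gnu" => "mlxcel-linux-x86_64-cuda13".to_string(),
        "x86_64-pc-windows-msvc" => "mlxcel-windows-x86_64-cuda13".to_string(),
        // Anything else is not part of the release matrix.
        _ => return None,
    };
    Some(base)
}
```

Returning `None` for unknown triples lets a release script fail early instead of publishing an artifact with an unexpected name.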
The release pipeline lives in the development repository and not this mirror; this issue tracks the user-visible deliverable. Implementation discussion is welcome here.
Original Suggestion
Let's match the compatibility matrix with Backend.AI:GO desktop application, and also make it available to our customer sites which often use Linux x86-64 environments.
Acceptance Criteria
- Release artifact mlxcel-linux-x86_64-cuda13.{tar.gz|zip}
- Release artifact mlxcel-windows-x86_64-cuda13.zip with bundled CUDA runtime DLLs
- Smoke test (mlxcel generate with a 1B 4-bit model, 10 tokens) succeeds on Linux x86_64 + NVIDIA GPU
- README.md prerequisites section enumerates Linux aarch64, Linux x86_64, and Windows distinctly
- docs/windows-build-guide.md walks through a developer-side Windows build