System Info
- transformers.js version: 3.7.3 (latest)
- Node.js: v24.3.0
- Vite: 7.1.4
- Environment: Browser (WebGPU enabled)
Environment/Platform
- Website/web-app
- Browser extension
- Server-side (e.g., Node.js, Deno, Bun)
- Desktop app (e.g., Electron)
- Other (e.g., VSCode extension)
Description
MobileLLM models (onnx-community/MobileLLM-R1-140M-ONNX and the 360M and 950M variants) fail to run with the WebGPU device, throwing a "Rotary interleaved attention is not supported" error. These models are advertised as ONNX-optimized for edge deployment but are incompatible with WebGPU acceleration.
When falling back to WASM, the models work but are extremely slow (2-3 minutes to generate 10-20 tokens), making them unsuitable for production use. Additionally, the models ignore the maxTokens parameter and generate thousands of tokens with internal <think> reasoning chains.
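
For context, here is a rough sketch of the fallback call in this setup. The model ID is the one from this report; max_new_tokens is the standard transformers.js generation option (what I referred to as maxTokens above may be an app-level wrapper around it), and it is the limit that appears to be ignored:

```js
import { pipeline, TextStreamer } from "@huggingface/transformers";

// Loading on WASM works, but generation is extremely slow.
const generator = await pipeline(
  "text-generation",
  "onnx-community/MobileLLM-R1-140M-ONNX",
  { device: "wasm", dtype: "q4" },
);

const messages = [{ role: "user", content: "What is the capital of France?" }];

const output = await generator(messages, {
  max_new_tokens: 20, // effectively ignored: generation runs on into long <think> chains
  streamer: new TextStreamer(generator.tokenizer, { skip_prompt: true }),
});
```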
Expected behavior: MobileLLM models should work with WebGPU for reasonable performance, or the documentation should clearly state that they are WASM-only, with performance warnings.
Reproduction
Starting from https://github.com/huggingface/transformers.js-examples/tree/main/smollm-webgpu, the following device/dtype combinations all fail (a minimal loading sketch follows the list):
- { device: 'webgpu', dtype: 'q4' } - Rotary interleaved error
- { device: 'webgpu', dtype: 'q4f16' } - Rotary interleaved error
- { device: 'webgpu', dtype: 'fp16' } - Rotary interleaved error
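
A minimal loading sketch that reproduces the error (surrounding app code from the example omitted; 'q4' shown here, but the other dtypes listed above behave the same way):

```js
import { pipeline } from "@huggingface/transformers";

// Any of the dtypes above ('q4', 'q4f16', 'fp16') fails identically on WebGPU.
const generator = await pipeline(
  "text-generation",
  "onnx-community/MobileLLM-R1-140M-ONNX",
  { device: "webgpu", dtype: "q4" },
);
// Throws: Error: Rotary interleaved attention is not supported
```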