
"Rotary interleaved attention is not supported" error in WebGPU implementation for MobileLLM #1416

@jpabbuehl

System Info

  • transformers.js version: 3.7.3 (latest)
  • Node.js: v24.3.0
  • Vite: 7.1.4
  • Environment: Browser (WebGPU enabled)

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

MobileLLM models (onnx-community/MobileLLM-R1-140M-ONNX and the 360M and 950M variants) fail to run on the WebGPU device, throwing a "Rotary interleaved attention is not supported" error. These models are advertised as ONNX-optimized for edge deployment, but they are incompatible with WebGPU acceleration.

When falling back to WASM, the models work but are extremely slow (2-3 minutes to generate 10-20 tokens), making them unsuitable for production use. Additionally, the models ignore the maxTokens parameter and generate thousands of tokens with internal <think> reasoning chains.
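
For context, a minimal sketch of a WASM-fallback run, assuming a text-generation pipeline; the dtype, prompt, and token cap below are illustrative rather than taken from this report. It also shows the snake_case max_new_tokens option that transformers.js reads when capping generation:

```js
import { pipeline } from "@huggingface/transformers";

// Illustrative WASM fallback; the dtype here is an assumption.
const generator = await pipeline(
  "text-generation",
  "onnx-community/MobileLLM-R1-140M-ONNX",
  { device: "wasm", dtype: "q4" },
);

// transformers.js reads `max_new_tokens` from the generation options;
// a camelCase `maxTokens` key is not a recognized generation option.
const output = await generator(
  [{ role: "user", content: "What is 2 + 2?" }], // illustrative prompt
  { max_new_tokens: 20 },
);
console.log(output[0].generated_text);
```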

Expected behavior: MobileLLM models should work with WebGPU at reasonable performance, or the documentation should clearly state that they are WASM-only and include a performance warning.

Reproduction

Starting from https://github.com/huggingface/transformers.js-examples/tree/main/smollm-webgpu and swapping in a MobileLLM model, every configuration tried fails with the same error (a reproduction sketch follows the list):

  • { device: 'webgpu', dtype: 'q4' } - Rotary interleaved error
  • { device: 'webgpu', dtype: 'q4f16' } - Rotary interleaved error
  • { device: 'webgpu', dtype: 'fp16' } - Rotary interleaved error
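
A minimal sketch of the failing WebGPU run, adapted from the smollm-webgpu example; only the model id and the device/dtype options come from this report, and the prompt is illustrative:

```js
import { pipeline } from "@huggingface/transformers";

// Same setup as the smollm-webgpu example, with the model swapped for
// MobileLLM. Each dtype listed above ("q4", "q4f16", "fp16") was reported
// to fail the same way when device is "webgpu".
const generator = await pipeline(
  "text-generation",
  "onnx-community/MobileLLM-R1-140M-ONNX",
  { device: "webgpu", dtype: "q4" },
);

// Running the model on WebGPU surfaces:
// "Rotary interleaved attention is not supported"
const output = await generator(
  [{ role: "user", content: "Hello" }], // illustrative prompt
  { max_new_tokens: 20 },
);
```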

Labels

bug (Something isn't working), v4
