Describe the bug
When starting the vLLM engine, the process crashes with the following error:
RuntimeError: Worker failed with error 'This flash attention build does not support headdim not being a multiple of 32.'
Full traceback excerpt:
(EngineCore_DP0 pid=669) ERROR 11-17 00:53:02 [multiproc_executor.py:230] Worker proc VllmWorker-0 died unexpectedly, shutting down executor.
...
File "/data/vllm39/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 228, in _initialize_kv_caches
available_gpu_memory = self.model_executor.determine_available_memory()
...
RuntimeError: Worker failed with error 'This flash attention build does not support headdim not being a multiple of 32.'
This happens immediately when launching the engine via vllm.start() (or equivalent).
It seems to be caused by FlashAttention 2.8.3 rejecting the model’s head dimension, even though vLLM supposedly handles arbitrary head sizes.
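For reference, the head dimension can be checked from the model config before launching. This is a minimal sketch using Hugging Face transformers; the model path is a placeholder, not the actual checkpoint from this report:

```python
# Minimal sketch: check whether a model's head dimension is a multiple of 32.
# "path/to/your-model" is a placeholder for the affected checkpoint.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("path/to/your-model")

# Some configs expose head_dim directly; otherwise derive it from hidden_size.
head_dim = getattr(config, "head_dim", None)
if head_dim is None:
    head_dim = config.hidden_size // config.num_attention_heads

print(f"head_dim = {head_dim}, multiple of 32: {head_dim % 32 == 0}")
```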
To Reproduce
Steps to reproduce (a minimal launch sketch follows this list):
1. Install the latest vLLM from GitHub (master).
2. Install FlashAttention 2.8.3.
3. Run a model whose attention head dimension is not a multiple of 32 (e.g., some custom / fine-tuned architectures).
4. Launch the engine.
5. The engine crashes before initialization completes.
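A minimal launch sketch using vLLM's Python API (the model name is a placeholder; in this report the engine is launched through Swift, but the failure is the same at engine initialization):

```python
# Minimal launch sketch; "path/to/your-model" is a placeholder for a checkpoint
# whose attention head dimension is not a multiple of 32.
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/your-model")  # crashes here, during KV-cache initialization
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```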
Expected behavior
vLLM should either:
- fall back to a compatible attention kernel when FlashAttention cannot handle the model's head dimension (a possible workaround sketch is shown below), or
- provide a clear error message explaining which models are incompatible with the installed FlashAttention version.
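As a possible workaround sketch only (assuming the installed build honors the VLLM_ATTENTION_BACKEND environment variable and that an alternative backend such as FLASHINFER is installed), forcing a non-FlashAttention backend before creating the engine may avoid the crash:

```python
# Workaround sketch: force an alternative attention backend before vLLM builds the engine.
# Assumes the installed vLLM build honors VLLM_ATTENTION_BACKEND and that the chosen
# backend (e.g. FLASHINFER) is available; "path/to/your-model" is a placeholder.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM

llm = LLM(model="path/to/your-model")
```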
Hardware and system info
Torch: 2.9
CUDA: 12.8
GPU: H100
vLLM: latest GitHub master
FlashAttention: 2.8.3
Python: 3.12
Additional context
Environment: flash-attn 2.8.3 + latest vLLM master + Swift 3.10.
The error occurs consistently and prevents any model from loading in this environment.