
RuntimeError: FlashAttention only supports Ampere GPUs or newer. #5985

Open
linzm1007 opened this issue May 6, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@linzm1007

Describe the bug


python server.py --auto-devices --gpu-memory 80 80 --listen --api --model-dir /data/mlops/ --model modelDir --trust-remote-code

model: Qwen-7B-Chat

Generation fails immediately with `RuntimeError: FlashAttention only supports Ampere GPUs or newer.` (full traceback in the Logs section below).

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

gpu: V100-PCIE-32GB
python: 3.10
model: Qwen-7B-Chat
docker: docker run -it --rm --gpus='"device=0,3"' -v /root/wangbing/model/Qwen-7B-Chat/V1/:/data/mlops/modelDir -v /root/wangbing/sftmodel/qwen/V1:/data/mlops/adapterDir/ -p30901:5000 -p7901:7860 dggecr01.huawei.com:80/tbox/text-generation-webui:at-0.0.1 bash
app: python server.py --auto-devices --gpu-memory 80 80 --listen --api --model-dir /data/mlops/ --model modelDir --trust-remote-code
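
For reference, a quick way to confirm why the V100 trips this check is to print each GPU's compute capability. This is a minimal diagnostic sketch (not part of the original report): FlashAttention 2 requires compute capability 8.0 (Ampere) or newer, while a V100 reports 7.0.

```python
import torch

# Print the compute capability of every visible GPU and whether it meets
# FlashAttention 2's requirement (>= sm_80: Ampere, Ada, Hopper).
for idx in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(idx)
    name = torch.cuda.get_device_name(idx)
    ok = major >= 8
    print(f"cuda:{idx} {name}: sm_{major}{minor} -> FlashAttention 2 supported: {ok}")
```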

Screenshot

No response

Logs

Traceback (most recent call last):
  File "/app/modules/callbacks.py", line 61, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/app/modules/text_generation.py", line 390, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 1259, in generate
    return super().generate(
  File "/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1592, in generate
    return self.sample(
  File "/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2696, in sample
    outputs = self(
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 1043, in forward
    transformer_outputs = self.transformer(
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 891, in forward
    outputs = block(
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 610, in forward
    attn_outputs = self.attn(
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 499, in forward
    attn_output = self.core_attention_flash(q, k, v, attention_mask=attention_mask)
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 191, in forward
    output = flash_attn_func(q, k, v, dropout_p, softmax_scale=self.softmax_scale, causal=self.causal)
  File "/venv/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 825, in flash_attn_func
    return FlashAttnFunc.apply(
  File "/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/venv/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 507, in forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_forward(
  File "/venv/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 51, in _flash_attn_forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
Output generated in 2.02 seconds (0.00 tokens/s, 0 tokens, context 60, seed 1536745076)
06:02:32-584051 INFO     Deleted "logs/chat/Assistant/20240506-04-07-12.json".

System Info

![image](https://github.com/oobabooga/text-generation-webui/assets/96732179/3b7f46e8-d59b-4d56-bfb3-95c7ecd73887)
![image](https://github.com/oobabooga/text-generation-webui/assets/96732179/4232979d-e104-4117-bca5-15edd7787617)
linzm1007 added the bug (Something isn't working) label on May 6, 2024
@nickpotafiy

Tesla V100 uses the Volta architecture; the progression is Volta < Turing < Ampere < Hopper < Ada. FlashAttention 2 currently supports Ampere, Ada, and Hopper (FlashAttention 1.x also supports Turing). Turing support for FlashAttention 2 is being worked on, and maybe Volta after that. 🤞

In the meantime, load the model without flash attention.
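
One possible way to do that, sketched here under the assumption that the Qwen remote code (the modeling_qwen.py in the traceback) reads a `use_flash_attn` field from its config: override it when loading, or set `"use_flash_attn": false` in the model's config.json. The path below is the one from the reproduction command; verify the field name against your checkout.

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_dir = "/data/mlops/modelDir"  # model path from the repro command

# Turn off the FlashAttention code path before instantiating the model.
# (`use_flash_attn` is assumed from Qwen's remote code; check your config.json.)
config = AutoConfig.from_pretrained(model_dir, trust_remote_code=True)
config.use_flash_attn = False

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    config=config,
    device_map="auto",
    trust_remote_code=True,
).eval()
```

Uninstalling the flash-attn package should have a similar effect if the remote code falls back to standard attention when the import fails.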
