NotImplementedError: Mistral model requires flash attention v2 #1253
Comments
What's your GPU? FlashAttention v2 supports only Ampere and later architectures for now.
I don't think that error is related to an architecture issue. My GPU is an NVIDIA A10G, which is Ampere architecture.
I also have this problem with Samantha Mistral 7B. The point is that I am able to serve this model using PyTorch as well as with text-generation-webui on an RTX 2070... it would be great to have these models working on TGI too.
I am facing the same issue when deploying the Zephyr-7B model on a T4 GPU using the latest TGI image. Is there any update or workaround for this?
You need to have a device with Ampere or Hopper architecture. |
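For reference, FlashAttention v2 requires CUDA compute capability 8.0 or higher, which is what the Ampere/Hopper requirement comes down to. A quick way to check what your device reports, using plain PyTorch (nothing TGI-specific):

```python
import torch

# FlashAttention v2 needs compute capability >= 8.0:
# Ampere is 8.x (A100 = 8.0, A10G = 8.6), Hopper is 9.x,
# while Turing (T4) is 7.5 and Volta (V100) is 7.0.
# Requires a CUDA-capable device to be visible.
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
print("Meets the FlashAttention v2 requirement:", major >= 8)
```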
Any way to disable flash attention?
text-generation-inference/server/text_generation_server/models/__init__.py, lines 250 to 259 at commit 3c02262
text-generation-inference/server/text_generation_server/models/__init__.py, lines 200 to 217 at commit 3c02262
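For context, those referenced lines cover the model dispatch that raises the error. Paraphrased (this is a sketch of the logic, not the verbatim source at that commit, and the flag name is illustrative), it amounts to:

```python
def get_model(model_type: str, model_id: str, revision: str, quantize, dtype):
    # Paraphrased sketch of the Mistral branch in models/__init__.py around
    # commit 3c02262 -- illustrative names, not the verbatim source.
    if model_type == "mistral":
        if FLASH_ATTENTION:  # true only when flash-attn v2 imported cleanly
            return FlashMistral(model_id, revision, quantize=quantize, dtype=dtype)
        # Mistral is only wired up to its flash-attention implementation,
        # so without flash-attn v2 there is no fallback to dispatch to:
        raise NotImplementedError("Mistral model requires flash attention v2")
    ...
```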
Hi, any news on this?
@bilelomrani1 you can use Ampere architecture (A100, A10)
For some instances I don't have access to Ampere GPUs, only V100s.
@bilelomrani1 There is a pull request for this, #1279 - have you tried the patch? Does it work?
I tried the patch, and it is not working for Mistral 7B as is, giving a "key_error: mistral" message when running. I tried some of the previous commits in @xihajun's repo, but they all lead to this error. Here is the full output:
It's worth noting that I only have this issue when running on an A10G (Ampere) as native bfloat16, using any model based on Mistral. On an NVIDIA RTX 3060 (with bitsandbytes or AWQ enabled), I can simply use the official Docker image.
I built a container with my fix a few weeks ago; it worked on a V100 without flash attention: https://hub.docker.com/r/xihajun/text-generation-inference. I haven't tested it recently, but it should still work, I hope. As for my pull request, I will double-check whether it is already out of date.
@suparious you need to update transformers.
#1351 will relax the requirements for Mistral. It will now default to CausalLM on V100 and allow Flash Attention v1 to be used if the sliding window is None or -1.
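Sketched out, the relaxed dispatch that comment describes would look something like this (a paraphrase of the intent of #1351, not the merged diff; the flag and class names are illustrative):

```python
def get_model(model_type: str, config, model_id: str, revision: str, quantize, dtype):
    # Paraphrase of the #1351 behavior described above; illustrative names.
    if model_type == "mistral":
        sliding_window = getattr(config, "sliding_window", None)
        if FLASH_ATTN_V2:
            # Pre-Ampere cards never reach here: flash-attn v2 needs sm_80+.
            return FlashMistral(model_id, revision, quantize=quantize, dtype=dtype)
        if FLASH_ATTENTION and sliding_window in (None, -1):
            # Flash Attention v1 lacks sliding-window support, so it is only
            # usable when the model does not rely on one.
            return FlashMistral(model_id, revision, quantize=quantize, dtype=dtype)
        # Otherwise (e.g. V100): fall back to the plain, non-flash CausalLM path.
        return CausalLM(model_id, revision, quantize=quantize, dtype=dtype)
    ...
```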
@OlivierDehaene - Is there any indication of when 1.3.4 will be released so that we can make use of this fix? We're trying desperately to use Mistral in GovCloud, where G5 compute instances aren't available.
Will the version of transformers be updated in the Docker image? The current version doesn't support this.
System Info
Used the latest version: ghcr.io/huggingface/text-generation-inference:latest.
Information
Tasks
Reproduction
The :latest TGI image throws the same error. I tried to install it manually, but that also threw an error.
I don't think the current latest image is configured to handle the installation either. Any suggestion or fix for this issue? A similar issue was closed:
#1208