NotImplementedError: Mistral model requires flash attention v2 #1253

Closed · abhijithnair1 opened this issue Nov 9, 2023 · 17 comments · Fixed by #1351

Comments

@abhijithnair1

abhijithnair1 commented Nov 9, 2023

System Info

Using the latest version of ghcr.io/huggingface/text-generation-inference:latest.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Use the official Docker image (ghcr.io/huggingface/text-generation-inference:latest) or install flash attention v2 with cd server && make install install-flash-attention-v2

The :latest TGI image throws the same error, and when I tried to install flash attention v2 manually, that also threw an error:

cd flash-attention-v2 && git fetch && git checkout 02ac572f3ffc4f402e4183aaa6824b45859d3ed3
HEAD is now at 02ac572 Clarify inference README is a placeholder
cd flash-attention-v2 && python setup.py build
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'


torch.__version__  = 2.0.1


Traceback (most recent call last):
  File "/usr/src/server/flash-attention-v2/setup.py", line 115, in <module>
    _, bare_metal_version = get_cuda_bare_metal_version(CUDA_HOME)
  File "/usr/src/server/flash-attention-v2/setup.py", line 66, in get_cuda_bare_metal_version
    raw_output = subprocess.check_output([cuda_dir + "/bin/nvcc", "-V"], universal_newlines=True)
  File "/opt/conda/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/opt/conda/lib/python3.9/subprocess.py", line 505, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/opt/conda/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/opt/conda/lib/python3.9/subprocess.py", line 1837, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/cuda/bin/nvcc'
make: *** [Makefile-flash-att-v2:10: build-flash-attention-v2] Error 1

I don't think the current :latest image is configured to handle this installation either, since nvcc is missing from it. Any suggestion or fix for this issue? A similar issue was already closed:

#1208
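
For reference, a minimal pre-flight check (a sketch; the /usr/local/cuda default simply mirrors the setup.py behaviour shown above) that fails early with a clearer message when the CUDA toolkit is missing from the container:

# Sketch: verify nvcc is present before running `make install-flash-attention-v2`.
# The CUDA_HOME default of /usr/local/cuda mirrors the setup.py traceback above.
import os
import shutil
import subprocess

cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
nvcc = shutil.which("nvcc") or os.path.join(cuda_home, "bin", "nvcc")

if not os.path.isfile(nvcc):
    raise SystemExit(
        f"nvcc not found at {nvcc}: the CUDA toolkit is not installed here, "
        "so flash-attention v2 cannot be compiled in this environment."
    )
print(subprocess.check_output([nvcc, "-V"], universal_newlines=True))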

@RisaKirisu

What's your GPU? FlashAttention v2 supports only Ampere and later architectures for now.

@abhijithnair1
Author

What's your GPU? FlashAttention v2 supports only Ampere and later architectures for now.

I don't think that error is related to the GPU architecture. My GPU is an NVIDIA A10G, which is Ampere.

@WasamiKirua

WasamiKirua commented Nov 12, 2023

I also have this problem with Samantha Mistral 7B. The point is that I am able to serve this model with plain PyTorch, and also with text-generation-webui, on an RTX 2070... it would be great to have these models working on TGI as well.

@abhinavztb

I am facing the same issue when deploying the Zephyr-7B model on a T4 GPU using the latest TGI image. Is there any update or workaround for this?

@mrseeker

I am facing the same issue when deploying the Zephyr-7B model on a T4 GPU using the latest TGI image. Is there any update or workaround for this?

You need a device with Ampere or Hopper architecture.
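
A quick way to check whether a GPU meets that requirement (a sketch using PyTorch; FlashAttention v2 needs compute capability 8.0 or higher):

# Sketch: report the GPU's compute capability; Ampere starts at 8.0.
# T4 is 7.5 and V100 is 7.0, so both fall below the FlashAttention v2 floor.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
print("FlashAttention v2 supported:", major >= 8)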

@xihajun

xihajun commented Nov 22, 2023

Any way to disable flash attention?

if model_type == "mistral":
    if MISTRAL:
        return FlashMistral(
            model_id,
            revision,
            quantize=quantize,
            dtype=dtype,
            trust_remote_code=trust_remote_code,
        )
    raise NotImplementedError("Mistral model requires flash attention v2")

if FLASH_ATTENTION:
    return FlashLlama(
        model_id,
        revision,
        quantize=quantize,
        dtype=dtype,
        trust_remote_code=trust_remote_code,
    )
elif sharded:
    raise NotImplementedError(FLASH_ATT_ERROR_MESSAGE.format("Sharded Llama"))
else:
    return CausalLM(
        model_id,
        revision,
        quantize=quantize,
        dtype=dtype,
        trust_remote_code=trust_remote_code,
    )
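
For anyone looking to patch this locally, here is a rough sketch of the kind of fallback the workaround PRs take (illustrative only, not the actual diff): route Mistral through the generic CausalLM path instead of raising when flash attention v2 is unavailable.

# Hypothetical fallback: dispatch Mistral to CausalLM when flash attention v2
# is not available, instead of raising NotImplementedError.
if model_type == "mistral":
    if MISTRAL:
        return FlashMistral(
            model_id,
            revision,
            quantize=quantize,
            dtype=dtype,
            trust_remote_code=trust_remote_code,
        )
    if sharded:
        raise NotImplementedError(FLASH_ATT_ERROR_MESSAGE.format("Sharded Mistral"))
    return CausalLM(
        model_id,
        revision,
        quantize=quantize,
        dtype=dtype,
        trust_remote_code=trust_remote_code,
    )

Note that the CausalLM path uses the standard transformers attention, so throughput and memory use will be worse than with flash attention.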

@bilelomrani1

Hi, any news on this?

@muhammad-asn

@bilelomrani1 you can use an Ampere-architecture GPU (A100, A10).

@bilelomrani1

For some instances I don't have access to Ampere GPUs, only V100s.

@kno10

kno10 commented Dec 12, 2023

@bilelomrani1 There is a pull request for this, #1279. Have you tried this patch? Does it work?

@suparious

I tried the patch, and it is not working for Mistral 7B as is. It gives a KeyError: 'mistral' message when running the text-generation-launcher binary.

I tried some of the previous commits in @xihajun's repo, but they all lead to this error.

here is the full output:

2023-12-15T00:32:47.209675Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

Traceback (most recent call last):

  File "/home/ubuntu/miniconda3/envs/text-generation-inference-xihajun/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/home/ubuntu/text-generation-inference-xihajun/server/text_generation_server/cli.py", line 83, in serve
    server.serve(

  File "/home/ubuntu/text-generation-inference-xihajun/server/text_generation_server/server.py", line 207, in serve
    asyncio.run(

  File "/home/ubuntu/miniconda3/envs/text-generation-inference-xihajun/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/home/ubuntu/miniconda3/envs/text-generation-inference-xihajun/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()

  File "/home/ubuntu/text-generation-inference-xihajun/server/text_generation_server/server.py", line 159, in serve_inner
    model = get_model(

  File "/home/ubuntu/text-generation-inference-xihajun/server/text_generation_server/models/__init__.py", line 260, in get_model
    return CausalLM(

  File "/home/ubuntu/text-generation-inference-xihajun/server/text_generation_server/models/causal_lm.py", line 504, in __init__
    model = AutoModelForCausalLM.from_pretrained(

  File "/home/ubuntu/miniconda3/envs/text-generation-inference-xihajun/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 527, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(

  File "/home/ubuntu/miniconda3/envs/text-generation-inference-xihajun/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 1039, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]

  File "/home/ubuntu/miniconda3/envs/text-generation-inference-xihajun/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 734, in __getitem__
    raise KeyError(key)

KeyError: 'mistral'
 rank=0
2023-12-15T00:32:47.308425Z ERROR text_generation_launcher: Shard 0 failed to start
2023-12-15T00:32:47.308447Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart

@suparious

suparious commented Dec 15, 2023

It's worth noting that I only have this issue when running on an A10G (Ampere) in native bfloat16, using any model based on Mistral. On an NVIDIA RTX 3060 (with bitsandbytes or AWQ quantization enabled), I can simply use the official Docker image.

@xihajun

xihajun commented Dec 15, 2023

It's worth noting that I only have this issue when running on an A10G (Ampere) in native bfloat16, using any model based on Mistral. On an NVIDIA RTX 3060 (with bitsandbytes or AWQ quantization enabled), I can simply use the official Docker image.

I built a container with my fix a few weeks ago; it worked on a V100 without flash attention: https://hub.docker.com/r/xihajun/text-generation-inference

I haven't tested it recently, but it should still work, I hope.

As for my pull request, I will double-check whether it has already gone out of date.

@OlivierDehaene
Member

@suparious you need to update transformers.
pip install --upgrade transformers
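
The KeyError: 'mistral' above comes from the transformers config registry not knowing that model type. A quick sanity check after upgrading (a sketch; Mistral support landed in the transformers 4.34 series):

# Sketch: confirm the installed transformers release registers the "mistral"
# model type, which is what CONFIG_MAPPING[config_dict["model_type"]] looks up.
import transformers
from transformers.models.auto.configuration_auto import CONFIG_MAPPING

print("transformers version:", transformers.__version__)
print("mistral registered:", "mistral" in CONFIG_MAPPING)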

@OlivierDehaene
Member

#1351 will relax the requirements for Mistral. It will now default to CausalLM on V100 and allow Flash Attention v1 to be used if the sliding window is None or -1.
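
Roughly, that dispatch looks like the following (an illustrative sketch, not the PR diff; the HAS_FLASH_ATTN* flag names are assumptions):

# Sketch of the relaxed Mistral dispatch: prefer flash attention v2, accept
# flash attention v1 when the sliding window is disabled, else use CausalLM.
sliding_window = getattr(config, "sliding_window", None)

if model_type == "mistral":
    if HAS_FLASH_ATTN_V2:
        return FlashMistral(model_id, revision, quantize=quantize,
                            dtype=dtype, trust_remote_code=trust_remote_code)
    if HAS_FLASH_ATTN and (sliding_window is None or sliding_window == -1):
        # Flash attention v1 cannot implement the sliding window, but that
        # does not matter when the window is disabled.
        return FlashMistral(model_id, revision, quantize=quantize,
                            dtype=dtype, trust_remote_code=trust_remote_code)
    return CausalLM(model_id, revision, quantize=quantize,
                    dtype=dtype, trust_remote_code=trust_remote_code)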

@shieldsjared

shieldsjared commented Jan 11, 2024

@OlivierDehaene - Is there any indication as to when 1.3.4 will be released so that we can make use of this fix? We're trying desperately to use Mistral in GovCloud, where G5 compute instances aren't available.

@suparious

Will the version of transformers be updated in the Docker image? The current version doesn't support this.

kdamaszk pushed a commit to kdamaszk/tgi-gaudi that referenced this issue Apr 29, 2024