NotImplementedError: Mistral model requires flash attention v2 #1253

Closed · abhijithnair1 opened this issue Nov 9, 2023 · 17 comments · Fixed by #1351

Comments

@abhijithnair1

abhijithnair1 commented Nov 9, 2023

System Info

Using the latest version of ghcr.io/huggingface/text-generation-inference:latest.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Use the official Docker image (ghcr.io/huggingface/text-generation-inference:latest) or install flash attention v2 with cd server && make install install-flash-attention-v2

The :latest TGI image throws the same error, and when I tried to install flash attention v2 manually, that also threw an error:

cd flash-attention-v2 && git fetch && git checkout 02ac572f3ffc4f402e4183aaa6824b45859d3ed3
HEAD is now at 02ac572 Clarify inference README is a placeholder
cd flash-attention-v2 && python setup.py build
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'


torch.__version__  = 2.0.1


Traceback (most recent call last):
  File "/usr/src/server/flash-attention-v2/setup.py", line 115, in <module>
    _, bare_metal_version = get_cuda_bare_metal_version(CUDA_HOME)
  File "/usr/src/server/flash-attention-v2/setup.py", line 66, in get_cuda_bare_metal_version
    raw_output = subprocess.check_output([cuda_dir + "/bin/nvcc", "-V"], universal_newlines=True)
  File "/opt/conda/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/opt/conda/lib/python3.9/subprocess.py", line 505, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/opt/conda/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/opt/conda/lib/python3.9/subprocess.py", line 1837, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/cuda/bin/nvcc'
make: *** [Makefile-flash-att-v2:10: build-flash-attention-v2] Error 1

I don't think the current :latest image is configured to handle this installation either, since nvcc is missing from it. Any suggestion or fix for this issue? A similar issue was already closed:

#1208
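
For reference, a minimal pre-flight check (a sketch; the /usr/local/cuda default simply mirrors the setup.py behaviour shown above) that fails early with a clearer message when the CUDA toolkit is missing from the container:

# Sketch: verify nvcc is present before running `make install-flash-attention-v2`.
# The CUDA_HOME default of /usr/local/cuda mirrors the setup.py traceback above.
import os
import shutil
import subprocess

cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
nvcc = shutil.which("nvcc") or os.path.join(cuda_home, "bin", "nvcc")

if not os.path.isfile(nvcc):
    raise SystemExit(
        f"nvcc not found at {nvcc}: the CUDA toolkit is not installed here, "
        "so flash-attention v2 cannot be compiled in this environment."
    )
print(subprocess.check_output([nvcc, "-V"], universal_newlines=True))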

@RisaKirisu

What's your GPU? FlashAttention v2 supports only Ampere and later architectures for now.

@abhijithnair1
Author

What's your GPU? FlashAttention v2 supports only Ampere and later architectures for now.

I don't think that error is related to the GPU architecture. My GPU is an NVIDIA A10G, which is Ampere.

@WasamiKirua

WasamiKirua commented Nov 12, 2023

I also have this problem with Samantha Mistral 7B. The point is that I am able to serve this model with plain PyTorch, and also with text-generation-webui, on an RTX 2070... it would be great to have these models working on TGI as well.

@abhinavztb

I am facing the same issue when deploying the Zephyr-7B model on a T4 GPU using the latest TGI image. Is there any update or workaround for this?

@mrseeker

I am facing the same issue when deploying the Zephyr-7B model on a T4 GPU using the latest TGI image. Is there any update or workaround for this?

You need a device with Ampere or Hopper architecture.
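
A quick way to check whether a GPU meets that requirement (a sketch using PyTorch; FlashAttention v2 needs compute capability 8.0 or higher):

# Sketch: report the GPU's compute capability; Ampere starts at 8.0.
# T4 is 7.5 and V100 is 7.0, so both fall below the FlashAttention v2 floor.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
print("FlashAttention v2 supported:", major >= 8)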

@xihajun

xihajun commented Nov 22, 2023

Any way to disable flash attention?

if model_type == "mistral":
    if MISTRAL:
        return FlashMistral(
            model_id,
            revision,
            quantize=quantize,
            dtype=dtype,
            trust_remote_code=trust_remote_code,
        )
    raise NotImplementedError("Mistral model requires flash attention v2")

if FLASH_ATTENTION:
    return FlashLlama(
        model_id,
        revision,
        quantize=quantize,
        dtype=dtype,
        trust_remote_code=trust_remote_code,
    )
elif sharded:
    raise NotImplementedError(FLASH_ATT_ERROR_MESSAGE.format("Sharded Llama"))
else:
    return CausalLM(
        model_id,
        revision,
        quantize=quantize,
        dtype=dtype,
        trust_remote_code=trust_remote_code,
    )
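
For anyone looking to patch this locally, here is a rough sketch of the kind of fallback the workaround PRs take (illustrative only, not the actual diff): route Mistral through the generic CausalLM path instead of raising when flash attention v2 is unavailable.

# Hypothetical fallback: dispatch Mistral to CausalLM when flash attention v2
# is not available, instead of raising NotImplementedError.
if model_type == "mistral":
    if MISTRAL:
        return FlashMistral(
            model_id,
            revision,
            quantize=quantize,
            dtype=dtype,
            trust_remote_code=trust_remote_code,
        )
    if sharded:
        raise NotImplementedError(FLASH_ATT_ERROR_MESSAGE.format("Sharded Mistral"))
    return CausalLM(
        model_id,
        revision,
        quantize=quantize,
        dtype=dtype,
        trust_remote_code=trust_remote_code,
    )

Note that the CausalLM path uses the standard transformers attention, so throughput and memory use will be worse than with flash attention.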

@bilelomrani1

Hi, any news on this?

@muhammad-asn

@bilelomrani1 you can use an Ampere-architecture GPU (A100, A10).

@bilelomrani1

For some instances I don't have access to Ampere GPUs, only V100s.

@kno10

kno10 commented Dec 12, 2023

@bilelomrani1 There is a pull request for this, #1279. Have you tried this patch? Does it work?

@suparious

I tried the patch, and it is not working for Mistral 7B as is. It gives a KeyError: 'mistral' message when running the text-generation-launcher binary.

I tried some of the previous commits in @xihajun's repo, but they all lead to this error.

here is the full output:

2023-12-15T00:32:47.209675Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

Traceback (most recent call last):

  File "/home/ubuntu/miniconda3/envs/text-generation-inference-xihajun/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/home/ubuntu/text-generation-inference-xihajun/server/text_generation_server/cli.py", line 83, in serve
    server.serve(

  File "/home/ubuntu/text-generation-inference-xihajun/server/text_generation_server/server.py", line 207, in serve
    asyncio.run(

  File "/home/ubuntu/miniconda3/envs/text-generation-inference-xihajun/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/home/ubuntu/miniconda3/envs/text-generation-inference-xihajun/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()

  File "/home/ubuntu/text-generation-inference-xihajun/server/text_generation_server/server.py", line 159, in serve_inner
    model = get_model(

  File "/home/ubuntu/text-generation-inference-xihajun/server/text_generation_server/models/__init__.py", line 260, in get_model
    return CausalLM(

  File "/home/ubuntu/text-generation-inference-xihajun/server/text_generation_server/models/causal_lm.py", line 504, in __init__
    model = AutoModelForCausalLM.from_pretrained(

  File "/home/ubuntu/miniconda3/envs/text-generation-inference-xihajun/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 527, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(

  File "/home/ubuntu/miniconda3/envs/text-generation-inference-xihajun/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 1039, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]

  File "/home/ubuntu/miniconda3/envs/text-generation-inference-xihajun/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 734, in __getitem__
    raise KeyError(key)

KeyError: 'mistral'
 rank=0
2023-12-15T00:32:47.308425Z ERROR text_generation_launcher: Shard 0 failed to start
2023-12-15T00:32:47.308447Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart

@suparious

suparious commented Dec 15, 2023

It's worth noting that I only have this issue when running on an A10G (Ampere) in native bfloat16, using any model based on Mistral. On an NVIDIA RTX 3060 (with bitsandbytes or AWQ quantization enabled), I can simply use the official Docker image.

@xihajun

xihajun commented Dec 15, 2023

It's worth noting that I only have this issue when running on an A10G (Ampere) in native bfloat16, using any model based on Mistral. On an NVIDIA RTX 3060 (with bitsandbytes or AWQ quantization enabled), I can simply use the official Docker image.

I built a container with my fix a few weeks ago; it worked on a V100 without flash attention: https://hub.docker.com/r/xihajun/text-generation-inference

I haven't tested it recently, but it should still work, I hope.

As for my pull request, I will double-check whether it has already gone out of date.

@OlivierDehaene
Member

@suparious you need to update transformers.
pip install --upgrade transformers
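
The KeyError: 'mistral' above comes from the transformers config registry not knowing that model type. A quick sanity check after upgrading (a sketch; Mistral support landed in the transformers 4.34 series):

# Sketch: confirm the installed transformers release registers the "mistral"
# model type, which is what CONFIG_MAPPING[config_dict["model_type"]] looks up.
import transformers
from transformers.models.auto.configuration_auto import CONFIG_MAPPING

print("transformers version:", transformers.__version__)
print("mistral registered:", "mistral" in CONFIG_MAPPING)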

@OlivierDehaene
Member

#1351 will relax the requirements for Mistral. It will now default to CausalLM on V100 and allow Flash Attention v1 to be used if the sliding window is None or -1.
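
Roughly, that dispatch looks like the following (an illustrative sketch, not the PR diff; the HAS_FLASH_ATTN* flag names are assumptions):

# Sketch of the relaxed Mistral dispatch: prefer flash attention v2, accept
# flash attention v1 when the sliding window is disabled, else use CausalLM.
sliding_window = getattr(config, "sliding_window", None)

if model_type == "mistral":
    if HAS_FLASH_ATTN_V2:
        return FlashMistral(model_id, revision, quantize=quantize,
                            dtype=dtype, trust_remote_code=trust_remote_code)
    if HAS_FLASH_ATTN and (sliding_window is None or sliding_window == -1):
        # Flash attention v1 cannot implement the sliding window, but that
        # does not matter when the window is disabled.
        return FlashMistral(model_id, revision, quantize=quantize,
                            dtype=dtype, trust_remote_code=trust_remote_code)
    return CausalLM(model_id, revision, quantize=quantize,
                    dtype=dtype, trust_remote_code=trust_remote_code)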

@shieldsjared

shieldsjared commented Jan 11, 2024

@OlivierDehaene - Is there any indication as to when 1.3.4 will be released so that we can make use of this fix? We're trying desperately to use Mistral in GovCloud, where G5 compute instances aren't available.

@suparious

Will the version of transformers be updated in the Docker image? The current version doesn't support this.

kdamaszk pushed a commit to kdamaszk/tgi-gaudi that referenced this issue Apr 29, 2024