Request failed during generation: Server error: Expected is_sm90 || is_sm8x || is_sm75 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.) #198
Comments
Full traceback:
Because LLaMA uses FlashAttention by default, and your devices are not sm75/8x/90 GPU architectures. In my experience, the 2080 Ti and A100 work, but the V100 does not.
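For readers who want to verify this locally, here is a minimal sketch (assuming PyTorch is installed; the helper name is illustrative, not from the repository) that reports whether the current GPU falls into one of those architectures:

```python
import torch

def flash_attention_supported() -> bool:
    """FlashAttention requires compute capability 7.5, 8.x, or 9.0."""
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) == (7, 5) or major in (8, 9)

# A V100 reports compute capability (7, 0), so this returns False,
# which matches the "is_sm90 || is_sm8x || is_sm75" error above.
print(flash_attention_supported())
```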
@Wen1163204547, thanks for your help! @Jblauvs, I will add a check to see if the GPU architecture is supported before importing flash attention.
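Such a check could look roughly like the sketch below (illustrative only, not the repository's actual code; the `HAS_FLASH_ATTN` flag is a made-up name):

```python
import torch

# Illustrative guard: only attempt the flash-attn import on supported GPUs.
HAS_FLASH_ATTN = False
if torch.cuda.is_available():
    capability = torch.cuda.get_device_capability()
    if capability == (7, 5) or capability[0] in (8, 9):
        try:
            import flash_attn  # noqa: F401
            HAS_FLASH_ATTN = True
        except ImportError:
            pass

# Model code can then branch on HAS_FLASH_ATTN and fall back to the
# standard attention implementation on older GPUs such as the V100.
```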
Makes perfect sense, as I'm using older V100s. Thanks all!
Hi @Wen1163204547, is there any method to support the V100?
Using the Docker container following these instructions:
https://github.com/huggingface/text-generation-inference#docker
in order to run the server locally. I'm using an app very similar to the one here:
https://huggingface.co/spaces/olivierdehaene/chat-llm-streaming
to hit that local server.
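For reference, a request to such a locally running server can be made roughly as in the sketch below (the port and prompt are placeholders modeled on the README example, not values taken from this particular setup):

```python
import requests

# Query a local text-generation-inference server, assuming the container
# was started with the README's Docker command and mapped to port 8080.
response = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "What is Deep Learning?",
        "parameters": {"max_new_tokens": 20},
    },
)
print(response.json())
```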
I'm seeing this error in the server logs:
Any ideas?