Vicuna-7B-v1.5-GPTQ fails with Error: Warmup(Generation("Not enough memory to handle 1024 prefill tokens. You need to decrease --max-batch-prefill-tokens"))
#940
Comments
Earlier I got the error RuntimeError: Not enough memory to handle 4096 prefill tokens. You need to decrease `--max-batch-prefill-tokens`, so I tried reducing 4096 to 1024, but I am still getting the same error.
It seems exllama fails to work on T4 (compute_cap 7.5). You should be able to run with […]
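The suggestion above is truncated in this capture. It most likely refers to turning the exllama kernels off; a minimal sketch, assuming the DISABLE_EXLLAMA environment variable is honored by this TGI version (verify against your release before relying on it):

```bash
# Sketch: disable exllama kernels on pre-Ampere GPUs such as the T4.
# DISABLE_EXLLAMA=True is an assumption inferred from the discussion,
# not the exact command from the truncated comment.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -e DISABLE_EXLLAMA=True \
  ghcr.io/huggingface/text-generation-inference:1.0.2 \
  --model-id TheBloke/vicuna-7B-v1.5-GPTQ \
  --quantize gptq
```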
@Narsil We are experiencing the exact same error on T4 in GKE Autopilot. TGI logged the warning `Exllama GPTQ cuda kernels (which are faster) could have been used, but are not currently installed, try using BUILD_EXTENSIONS=True`, but it still failed with: RuntimeError: Not enough memory to handle 4096 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
This works on T4, and TGI no longer complains.
BUT, the inference speed is very, very slow.
@AzureSilent Without exllama it's using the triton kernel, which uses JIT. The same happens in production: it will start out slow before getting back to more acceptable speeds. However, I'm not sure how well it compiles on T4; it's definitely possible that it's slower, unfortunately (although 8x seems like a lot).
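To put a number on the slowdown, one can time a single request against TGI's /generate endpoint and compare runs with and without exllama (a sketch; the host, port, and prompt are placeholders):

```bash
# Time one generation request; compare wall-clock time across configurations.
time curl 127.0.0.1:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 64}}'
```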
I also had an issue running pretty much any GPTQ model; I can't seem to run the TheBloke/Llama-2-7b-Chat-GPTQ model (it results in the same error). Disabling exllama didn't help either.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
System Info
- OS: Ubuntu 22.04.2 LTS
- TGI version: v1.0.2 (latest)
- Platform: GCP
- GPUs: 2 * T4
- CUDA: 12.0
- CPUs: 4
- RAM: 47GB
Information
Tasks
Reproduction
Run the docker command to reproduce:
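The reporter's exact command was not captured here. A representative invocation for this setup (a sketch; the model id, shard count, and prefill-token limit are assumptions reconstructed from the report) would be:

```bash
# Sketch of a TGI 1.0.2 launch on 2 x T4 with a GPTQ model; values are
# reconstructed from the report, not the reporter's exact command.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:1.0.2 \
  --model-id TheBloke/vicuna-7B-v1.5-GPTQ \
  --quantize gptq \
  --num-shard 2 \
  --max-batch-prefill-tokens 1024
```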
Warnings:
Error Stacktrace:
Expected behavior
The model is expected to be loaded and warmed up without any error, but it still throws OOM even though only 22% of GPU memory is occupied by the model before warmup (see the attached screenshot).
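For reference, the pre-warmup memory occupancy can be watched from the host while the container loads the model (standard NVIDIA tooling, not specific to TGI):

```bash
# Poll GPU memory usage once per second while TGI loads and warms up.
watch -n 1 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
```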