Problem with quantized model #85
Comments
Do you measure before or after warmup? During startup the KV cache gets reserved; with a quantized model there is more memory available for the cache, but the total used memory is the same.
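A quick way to see this yourself (a sketch, assuming nvidia-smi is available on the host) is to watch GPU memory while the container starts: the first plateau is the model weights, which should be smaller with nf4, and the later jump is warmup reserving the cache.

# Poll GPU memory once per second during container startup
# (assumes nvidia-smi on the host).
nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -l 1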
Hey @prd-tuong-nguyen, as @flozi00 said, this is likely due to the warmup phase, where we allocate additional memory in advance for batching to avoid having to allocate it on the fly during inference. For example, here's the memory usage reported before warmup: [screenshot missing from this export]
And here are the results after warmup: [screenshot missing from this export]
As you can see, lorax will use as much memory as it can get away with in order to maximize the batch size.
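If you'd rather leave headroom on the GPU than let the server reserve everything, the launcher exposes knobs for this. A minimal sketch, assuming lorax keeps the TGI-style launcher flags of the project it derives from (the flag names below are assumptions, not confirmed in this thread; check --help on your image first):

# Cap warmup's reservation. --cuda-memory-fraction and
# --max-batch-total-tokens are assumed TGI-style flags; verify
# them with the launcher's --help before relying on them.
docker run --gpus all --shm-size 1g -p 8080:80 -v ./ckpts:/data \
  ghcr.io/predibase/lorax:latest \
  --model-id /data/OpenHermes-2-7B-base-2.3 \
  --quantize bitsandbytes-nf4 \
  --cuda-memory-fraction 0.8 \
  --max-batch-total-tokens 8192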
Resolved by edit.
Glad you got it working! We'll be adding some better docs soon to make these parameters easier to find. Closing this issue for now.
System Info
Can you tell me a bit about how to serve a model in 4-bit quantized mode?
I added the
--quantize bitsandbytes-nf4
flag when running the docker container, but nothing changed; the GPU memory stays the same.
Information
Tasks
Reproduction
docker run --gpus all --shm-size 1g -p 8080:80 -v ./ckpts:/data ghcr.io/predibase/lorax:latest --model-id /data/OpenHermes-2-7B-base-2.3 --quantize bitsandbytes-nf4
Expected behavior
Reduced GPU memory usage.
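To check whether quantization actually took effect, one option (a sketch; it assumes lorax keeps the TGI-style /info endpoint, which is not confirmed in this thread) is to query the running server's reported configuration and compare steady-state memory across runs with and without --quantize:

# Assumption: a TGI-style /info endpoint is exposed on the serving port.
curl -s http://localhost:8080/info | python3 -m json.tool
# Compare steady-state usage between a quantized and an unquantized run;
# the weight footprint should drop even if warmup fills the remainder.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv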