HQQ OOMs on large models #29

Hey, I have a machine with two 4090 GPUs (24 GB VRAM each). When I try to run HQQ quantization of Llama-2-70B, the first half of the layers seems to quantize fine, but then it OOMs, presumably because it tries to put the entire quantized model on a single GPU device. For Llama-2-70B I could try renting an A100 machine and that should work, but for even larger models (e.g. Grok-1) it would be impossible to fit the entire thing on a single GPU. Is splitting quantization across multiple GPUs supported, or planned to be supported in the future? Thanks :)
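For reference, here is a minimal sketch of the kind of single-device quantization run that hits this OOM. It is a hypothetical reconstruction, not the exact script from this report: the model ID, the nbits/group_size settings, and the HQQModelForCausalLM/BaseQuantizeConfig import paths are assumptions based on HQQ's example usage around the time of this thread.

    import torch
    from hqq.core.quantize import BaseQuantizeConfig
    from hqq.engine.hf import HQQModelForCausalLM

    model_id = "meta-llama/Llama-2-70b-hf"  # illustrative model ID

    # Load the full-precision model into CPU RAM first; HQQ quantizes from the CPU weights.
    model = HQQModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

    # 4-bit weight-only settings (illustrative values).
    quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

    # Everything is placed on a single device here, so a quantized 70B model
    # still does not fit on one 24 GB card, which is the OOM described above.
    model.quantize_model(quant_config=quant_config,
                         compute_dtype=torch.float16,
                         device="cuda:0")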
Comments
Hi @rationalism, yeah unfortunately loading automatically to multiple GPUs is not supported. Maybe you can try:
Otherwise, I can take a stab at it and see how to do it on 2 GPUs, or more generally how to do it automatically.
Same problem.
@mobicham Thanks. With larger models like DBRX coming out this year, I think being able to split across multiple GPUs will be an important feature for handling them: https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
@mobicham I have an RTX 4090 and 128 GB of RAM; is it possible to load the original Mixtral Instruct and quantize it using HQQ? Currently my script gets killed while loading, as in the example file. I suppose I need to use the already-quantized Mixtral you linked, the 2-bit/4-bit one, right?
I managed to solve it by increasing the WSL memory allocation and the page/swap file size, nice! :)
Yeah, increasing the swap should do it, but it's gonna be slow.
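For anyone hitting the same process kill under WSL2, the memory and swap knobs mentioned above live in the Windows-side .wslconfig file. A minimal sketch, assuming a 128 GB host; the exact values are illustrative and not the ones used in this thread:

    # %UserProfile%\.wslconfig  (run "wsl --shutdown" afterwards so the changes apply)
    [wsl2]
    memory=110GB   # let WSL2 use most of the 128 GB of host RAM
    swap=100GB     # large swap so the fp16 weights can spill to disk during quantization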
Would love to be able to actually use this model, lol: https://huggingface.co/catid/cat-llama-3-70b-hqq
Need support for device_map="auto".
Yeah, I am aware of this. There should be a simple fix, but I've been very busy with other things. I hope I will have the time to take a look at it in the next few days. Sorry for the delay!
You can now shard quantized models across multiple GPUs. Just pass the devices as a list, like this: model.quantize_model(quant_config=quant_config, compute_dtype=compute_dtype, device=['cuda:0', 'cuda:1']). You still need to have the main model on CPU before quantizing. I will see how to dynamically dispatch directly to the GPUs.
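End to end, that looks roughly like the sketch below. Only the quantize_model(..., device=['cuda:0', 'cuda:1']) call is taken from the comment above; the imports, model ID, and config values are assumptions based on HQQ's example scripts:

    import torch
    from hqq.core.quantize import BaseQuantizeConfig
    from hqq.engine.hf import HQQModelForCausalLM

    # The unquantized model still has to be loaded into CPU RAM first.
    model = HQQModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf",
                                                torch_dtype=torch.float16)

    quant_config  = BaseQuantizeConfig(nbits=4, group_size=64)  # illustrative settings
    compute_dtype = torch.float16

    # Passing a list of devices shards the quantized layers across both GPUs
    # instead of trying to fit the whole model on a single 24 GB card.
    model.quantize_model(quant_config=quant_config,
                         compute_dtype=compute_dtype,
                         device=['cuda:0', 'cuda:1'])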
That worked, thanks, just in time! https://huggingface.co/catid/cat-llama-3-70b-san66-hqq
@catid making it work with "from_quantized" would require some additional work. But if you quantize directly it should work fine, as long as it's an official HF model that follows the same layer naming logic.
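To make the distinction concrete: quantizing an official HF checkpoint directly (as in the sketch above) works with the multi-GPU device list, whereas loading a pre-quantized checkpoint goes through from_quantized, which at this point still targeted a single device. The call below is a hedged sketch; from_quantized is named in the comment, but its exact signature is an assumption:

    from hqq.engine.hf import HQQModelForCausalLM

    # Loading an already-quantized HQQ checkpoint (repo linked earlier in the thread);
    # this path would need the extra work mentioned above to shard across GPUs.
    model = HQQModelForCausalLM.from_quantized("catid/cat-llama-3-70b-hqq")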
Closing this since HQQ is now integrated with transformers: huggingface/transformers#29637
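With the transformers integration, multi-GPU dispatch goes through the usual from_pretrained path, which also covers the device_map="auto" request earlier in the thread. A minimal sketch, assuming the HqqConfig API introduced by that PR; the model ID and the nbits/group_size values are illustrative:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

    model_id = "meta-llama/Llama-2-70b-hf"            # illustrative model ID
    quant_config = HqqConfig(nbits=4, group_size=64)  # illustrative settings

    # device_map="auto" lets accelerate spread the quantized layers across
    # all available GPUs, which is what this issue originally asked for.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto",
        quantization_config=quant_config,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)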