HQQ OOMs on large models #29

Closed
rationalism opened this issue Mar 27, 2024 · 12 comments

Labels
enhancement New feature or request

Comments

@rationalism

Hey, I have a machine with two 4090 GPUs (24 GB VRAM each). When I try to run HQQ quantization of Llama-2-70B:

import torch
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

#Model and settings
model_id      = 'meta-llama/Llama-2-70b-chat-hf'
compute_dtype = torch.float16
device        = 'cuda:0'

#Load model on the CPU
######################
model     = HQQModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype)
tokenizer = AutoTokenizer.from_pretrained(model_id)

#Quantize the model
######################
from hqq.core.quantize import *
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
model.quantize_model(quant_config=quant_config, compute_dtype=compute_dtype, device=device)

the first half of the layers seem to work fine, but then it OOMs, presumably because it tries to put the entire quantized model on a single GPU. For Llama-2-70B, I could try renting an A100 machine and that should work, but for even larger models (e.g. Grok-1) it would be impossible to fit the entire thing on a single GPU. Is splitting quantization across multiple GPUs supported, or planned to be supported in the future? Thanks :)

@mobicham
Collaborator

mobicham commented Mar 27, 2024

Hi @rationalism, yeah, unfortunately automatically loading onto multiple GPUs is not supported. Maybe you can try:

Otherwise, I can take a stab at it and see how to do it on 2 GPUs, or more generally how to do it automatically.

@mobicham mobicham added the enhancement New feature or request label Mar 27, 2024
@Minami-su

Same problem.

@rationalism
Author

@mobicham Thanks. With larger models like DBRX coming out this year, I think being able to split quantization across multiple GPUs will be an important feature for handling them.

https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm

@Sneakr

Sneakr commented Apr 5, 2024

@mobicham I have an RTX 4090 and 128 GB RAM. Is it possible to load the original Mixtral Instruct and quantize it using HQQ? Currently my script gets killed while loading, as in the example file. I suppose I need to use the already-quantized Mixtral you linked (the 2-bit/4-bit one), right?

@Sneakr

Sneakr commented Apr 5, 2024

I managed to solve it by increasing the WSL memory allocation and the swap file size, nice! :)
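
For anyone hitting the same limit: WSL2's memory and swap caps are set in a .wslconfig file in the Windows user profile directory (then restart WSL with wsl --shutdown). The sizes below are only an illustrative sketch, not the exact values used here:

# %UserProfile%\.wslconfig
[wsl2]
memory=110GB   # RAM made available to the WSL VM
swap=128GB     # size of the swap file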

@mobicham
Collaborator

mobicham commented Apr 6, 2024

Yeah increasing the swap should do it, but it's gonna be slow.
Otherwise, you can use this branch of transformers that supports on-the-fly loading and HQQ quantization, so you don't need a lot of RAM: huggingface/transformers#29637
Soon it will be integrated into transformers and you won't face this memory issue; I just need to fix a couple of things for the pull request.

@catid

catid commented Apr 20, 2024

Would love to be able to actually use this model lol: https://huggingface.co/catid/cat-llama-3-70b-hqq

Need support for device_map="auto"

model_id = 'catid/cat-llama-3-70b-hqq'

from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model     = HQQModelForCausalLM.from_quantized(model_id)
(hqq) ➜  openai-hqq-server git:(main) ✗ python test.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Failed to load the weights
Traceback (most recent call last):
  File "/home/catid/mambaforge/envs/hqq/lib/python3.10/site-packages/hqq/models/base.py", line 328, in from_quantized
    weights = cls.load_weights(save_dir)
  File "/home/catid/mambaforge/envs/hqq/lib/python3.10/site-packages/hqq/models/base.py", line 195, in load_weights
    return torch.load(cls.get_weight_file(save_dir), map_location=map_location)
  File "/home/catid/mambaforge/envs/hqq/lib/python3.10/site-packages/torch/serialization.py", line 1026, in load
    return _load(opened_zipfile,
  File "/home/catid/mambaforge/envs/hqq/lib/python3.10/site-packages/torch/serialization.py", line 1438, in _load
    result = unpickler.load()
  File "/home/catid/mambaforge/envs/hqq/lib/python3.10/site-packages/torch/serialization.py", line 1408, in persistent_load
    typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "/home/catid/mambaforge/envs/hqq/lib/python3.10/site-packages/torch/serialization.py", line 1382, in load_tensor
    wrap_storage=restore_location(storage, location),
  File "/home/catid/mambaforge/envs/hqq/lib/python3.10/site-packages/torch/serialization.py", line 391, in default_restore_location
    result = fn(storage, location)
  File "/home/catid/mambaforge/envs/hqq/lib/python3.10/site-packages/torch/serialization.py", line 271, in _cuda_deserialize
    return obj.cuda(device)
  File "/home/catid/mambaforge/envs/hqq/lib/python3.10/site-packages/torch/_utils.py", line 115, in _cuda
    untyped_storage = torch.UntypedStorage(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 23.64 GiB of which 64.81 MiB is free. Including non-PyTorch memory, this process has 23.57 GiB memory in use. Of the allocated memory 23.19 GiB is allocated by PyTorch, and 9.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
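
As the error message itself suggests, the expandable-segments allocator setting has to be in place before PyTorch initializes CUDA; a minimal sketch (it only mitigates fragmentation and won't make a model fit that genuinely needs more than 24 GB):

import os
# Must be set before anything is allocated on the GPU
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after the env var so the allocator picks it up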

@mobicham
Collaborator

Yeah I am aware of this, there should be a simple fix but I've been very busy with other things. I hope I will have the time to take a look at it in the next few days. Sorry for the delay!

@mobicham
Collaborator

You can now shard quantized models across multiple GPUs. Just pass the devices as a list like this:

model.quantize_model(quant_config=quant_config, compute_dtype=compute_dtype, device=['cuda:0', 'cuda:1'])

You still need to have the main model on the CPU before quantizing. Will see how to dynamically dispatch directly to the GPUs.
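
Putting it together with the snippet from the top of the thread, a minimal sketch of the two-GPU flow (model ID and quantization settings reused from above, so this is illustrative rather than prescriptive):

import torch
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
from hqq.core.quantize import BaseQuantizeConfig

model_id      = 'meta-llama/Llama-2-70b-chat-hf'
compute_dtype = torch.float16

#Load the full-precision model on the CPU first
model     = HQQModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype)
tokenizer = AutoTokenizer.from_pretrained(model_id)

#Quantize and shard the layers across both GPUs
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
model.quantize_model(quant_config=quant_config, compute_dtype=compute_dtype,
                     device=['cuda:0', 'cuda:1'])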

@catid

catid commented Apr 23, 2024

That worked thanks, just in time! https://huggingface.co/catid/cat-llama-3-70b-san66-hqq

@mobicham
Collaborator

@catid making it work with "from_quantized" would require some additional work. But if you quantize directly it should work fine, as long as it's an official HF model that follows the same layer naming logic.

@mobicham
Collaborator

mobicham commented May 3, 2024

Closing this since HQQ is now integrated with transformers: huggingface/transformers#29637
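
With the integration merged, quantization can happen on the fly at load time through transformers' HqqConfig, so the full fp16 checkpoint never has to sit in memory at once. A minimal sketch (assumes a transformers release with HQQ support and the hqq package installed; whether device_map='auto' shards across multiple GPUs may depend on the version):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id     = 'meta-llama/Llama-2-70b-chat-hf'
quant_config = HqqConfig(nbits=4, group_size=64)

#Weights are quantized layer by layer while loading
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map='auto',   # assumption: use 'cuda' to target a single GPU
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)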

@mobicham mobicham closed this as completed May 3, 2024