Hqq serialization #33141
Conversation
Serialization tested and working for the following models:
model_id = 'meta-llama/Meta-Llama-3-8B-Instruct' #OK
model_id = 'meta-llama/Meta-Llama-3-70B' #OK
model_id = "facebook/opt-125m" #OK
model_id = "meta-llama/Llama-2-13b-chat-hf" #OK
model_id = "microsoft/Phi-3-mini-128k-instruct" #OK
model_id = "google/gemma-2-9b-it" #OK
model_id = "google/gemma-2-2b" #OK
So I think for the moment we can leave it until someone reports an issue; I can't reproduce the problem anyway. Next steps:
1/3: ...
2/3: Multi-gpu loading
3/3: state_dict on the same safetensor chunk
Nice! Let's fix the issue regarding the torchao backend and we can merge this. I left a few comments.
Added a couple of comments!
Regarding this: #33141 (comment)
Left a suggestion about axis
Just out of curiosity, what is still missing before this can be merged?
Waiting for @mobicham to check the latest review and give me the heads-up to merge! This should be done soon! Also, it looks like there are some conflicts to fix.
Thanks for iterating @mobicham! Merging!
* HQQ model serialization attempt
* fix hqq dispatch and unexpected keys
* style
* remove check_old_param
* revert to check HQQLinear in quantizer_hqq.py
* revert to check HQQLinear in quantizer_hqq.py
* update HqqConfig default params
* make ci happy
* make ci happy
* revert to HQQLinear check in quantizer_hqq.py
* check hqq_min version 0.2.0
* set axis=1 as default in quantization_config.py
* validate_env with hqq>=0.2.0 version message
* deprecated hqq kwargs message
* make ci happy
* remove run_expected_keys_check hack + bump to 0.2.1 min hqq version
* fix unexpected_keys hqq update
* add pre_quantized check
* add update_expected_keys to base quantizerr
* ci base.py fix?
* ci base.py fix?
* fix "quantization typo" src/transformers/utils/quantization_config.py (Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>)
* fix post merge
---------
Co-authored-by: Marc Sun <marc@huggingface.co>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
@mobicham minor documentation issue, but the transformers documentation page for quantization has a giant feature matrix which still says serialization of HQQ models is not supported: https://huggingface.co/docs/transformers/main/quantization/overview
Would you like to open a PR to fix this @rohit-gupta?
@rohit-gupta thanks for flagging!
Now model.save_pretrained(save_path) gives this:
@blap is this related to the latest transformers changes? Otherwise, which hqq version causes this?
I think so. I didn't have this problem when hqq was released in transformers.
Transformers version 4.48.0.dev0 still has this problem...
Can anyone from the HF team track down this problem, please? What changed? Nothing on the hqq lib side changed much.
@SunMarc?
Can you share your script @blap? I'll have a look asap!
Error:
So...
@blap why don't you use the latest release? It worked fine the last time I tried (last week).
Which version do you use? Version 4.45.2 gives me this:
@blap
I just got the same error in this version too.
# pip install transformers==4.47.0;
# pip install hqq --upgrade;
##################################################################
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_path = "meta-llama/Meta-Llama-3-8B-Instruct"
quant_model = "quant_model"

quant_config = HqqConfig(nbits=4, group_size=64, axis=1)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    cache_dir='.',
    device_map="cuda:0",
    quantization_config=quant_config,
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.save_pretrained(quant_model)
tokenizer.save_pretrained(quant_model)
I found the problem:
Hmm interesting, thanks for flagging! Fixed here. Would recommend using 64 or 128 though; some of the fast kernels like Marlin in vLLM and TinyGemm in torchao don't support other group sizes.
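To make that recommendation concrete, here is a minimal configuration sketch. The values are assumptions drawn from the comment above: group_size=128 is just one of the two suggested options, and axis=1 matches the default this PR sets in quantization_config.py.

```python
from transformers import HqqConfig

# Suggested settings from the discussion: group_size of 64 or 128 with axis=1,
# which keeps the quantized layout compatible with fast kernels such as
# Marlin (vLLM) or TinyGemm (torchao).
quant_config = HqqConfig(nbits=4, group_size=128, axis=1)
```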
Follow-up to #32379
The goal of this PR is to add full support to save/load HQQ-quantized models directly in transformers. So far, serialization was done on the hqq-lib side via the .pt format, which is not safe and doesn't work with very large models (>100B params) since the model is not sharded.

What was done during this PR:
- Added an update_expected_keys() call in the quantizer. This allows loading quantized models that were initialized with torch.nn.Linear instead.

Full gist to try it out: https://gist.github.com/mobicham/701dd564c52590203ee09631425ad797
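For readers who don't want to open the gist, below is a hedged sketch of the end-to-end flow this PR enables: quantize while loading, serialize with save_pretrained, then reload the already-quantized checkpoint with from_pretrained. The model id and output directory are placeholders taken from the thread above, not values prescribed by the PR.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder: any supported causal LM
save_path = "quant_model"                          # placeholder output directory

# Quantize on the fly while loading the fp16 weights
quant_config = HqqConfig(nbits=4, group_size=64, axis=1)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda:0",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Serialize the HQQ-quantized model; weights are written as (sharded) safetensors
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

# Reload the already-quantized checkpoint directly; the quantization config is
# picked up from the saved config.json, so no HqqConfig is needed here
model = AutoModelForCausalLM.from_pretrained(
    save_path,
    torch_dtype=torch.float16,
    device_map="cuda:0",
)
```

This mirrors the script shared earlier in the thread, with the reload step added to exercise the loading path introduced by the PR.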