Merge and unload changes inference a lot on quantized llama 2 #1043
This is really hard to say. Merging the weights, especially with 4-bit quantization, will inevitably lead to discrepancies, but it's not easy to say how much is within the expected range.
Do you find that the results are now completely random, or basically the same as without LoRA? Or is it still improved over the baseline, just not as good as without merging?
I can confirm this. I have to call merge_and_unload(), otherwise the generation is full of special characters. My adapter_config.json looks like this:

{
  "auto_mapping": null,
  "base_model_name_or_path": "lmsys/vicuna-7b-v1.5-16k",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layers_pattern": null,
  "layers_to_transform": null,
  "lora_alpha": 16,
  "lora_dropout": 0.05,
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 8,
  "revision": null,
  "target_modules": [
    "q_proj",
    "v_proj"
  ],
  "task_type": "CAUSAL_LM"
}

And my inference code is like this:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftConfig, PeftModel

peft_model_id = 'mymodel'
config = PeftConfig.from_pretrained(peft_model_id)

# 4-bit NF4 quantization config for the base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load the quantized base model and its tokenizer
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    quantization_config=bnb_config,
    use_auth_token=False,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the LoRA adapter and merge it into the base weights
model = PeftModel.from_pretrained(model, peft_model_id, device_map="auto")
model = model.merge_and_unload()

The last line is essential in order to get meaningful inference.
Thanks for the additional report @lithces. IIUC, your issue is the opposite of OP's, who mentioned that their results get worse after merging, whereas you report that they get better. To investigate further, could you please check how the logits differ between the unmerged and merged model?
Hey, sorry for the delay.
In the end, for my use case I just used more compute to fix it. I am nevertheless still interested in what I did wrong.
Could you kindly tell me how to do this? Thank you!
So IIUC, the first few generated tokens look good, and over time the model starts deviating from the pattern it was trained on. This indicates to me that the model is not completely broken, but that there are small errors that accumulate over time. That's not totally surprising, as merging into the quantized weights will introduce some small errors. Possibly, you would see better results if you train your model for longer before merging. Perhaps tuning other hyperparameters like the LoRA rank (r) could also help, as sketched below.
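For illustration, a minimal sketch of what raising the rank could look like when creating the adapter; the exact values here are assumptions for the example, not tuned recommendations:

from peft import LoraConfig

# Hypothetical config with a higher rank than the r=8 shown above; values are illustrative
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)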
You are correct, the adapters are not in 4-bit, as the weights that are being trained require higher precision. AFAIK, there is no technique yet that would allow going below 16-bit, though there are some forays into 8-bit. To check the dtype of the adapters, you can go through the model's named_parameters() and look at the dtype of the LoRA weights before merging.
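For instance, a minimal sketch along these lines, assuming the unmerged model from the snippet above and PEFT's usual lora_ naming for adapter parameters:

# Print the dtype of each parameter: LoRA adapter weights vs. quantized base weights
for name, param in model.named_parameters():
    kind = "adapter" if "lora_" in name else "base"
    print(f"{kind:7s} {name}: {param.dtype}")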
Instead of calling generate, you can run a single forward pass on the same input with the unmerged and the merged model and compare the resulting logits directly.
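A minimal sketch of such a comparison; it assumes the freshly loaded quantized base model, tokenizer, and peft_model_id from the snippet above, and the prompt is a placeholder:

import torch
from peft import PeftModel

inputs = tokenizer("A prompt from the training distribution", return_tensors="pt").to(model.device)

# Forward pass with the adapter applied on top of the quantized base (unmerged)
peft_model = PeftModel.from_pretrained(model, peft_model_id)
with torch.no_grad():
    logits_unmerged = peft_model(**inputs).logits

# Forward pass after merging the adapter into the 4-bit weights
merged_model = peft_model.merge_and_unload()
with torch.no_grad():
    logits_merged = merged_model(**inputs).logits

# A rough measure of how far the two outputs drift apart
print((logits_unmerged - logits_merged).abs().max())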
I did that, and yes, all the LoRA adapters are in high precision and the base model is in 4-bit (stored as uint8). After merging, all the weights are in 4-bit.
I ran an example before and after merging, and the resulting logits are way further apart than when I run the same model twice. So it seems that the conversion of the LoRA weights to 4-bit causes rounding errors that accumulate over time. That explains why it works at the beginning and then gets worse over time. So you were right. To conclude, my only option for next time is to either train everything in full precision and then load it in 4-bit for inference, or to leave the configuration as is, which means I do not merge at all. Is that correct? Thank you so much for your help!
Unfortunately, that seems to be the case. Maybe training for longer could also help, but that's not guaranteed. It's also possible that #1122 could help in your situation; we'll have to see. In principle, there is also room for improvement when it comes to merging quantized weights, but we mostly rely on the tools provided by bnb, so if there is a significant development on that front, it could also improve your situation.
I see, thank you so much for your help!! I will try LoftQ as soon as it is merged.
System Info
I fine-tuned a quantized Llama 2 7B chat model using PEFT and it worked great. The only problem is that inference is really slow.
So I read the docs and found merge_and_unload. Once I apply it, inference doubles in speed, but the results get much worse with the same generation parameters.
Is this a known issue? Am I doing something wrong?
I saw there is an inference_mode option in the PEFT config. But since I am loading my adapters from the Hugging Face Hub, where I saved them, how can I change the PEFT config after creation? Or does merge_and_unload already do that?
As a side note, my use case is really simple and the fine-tuning data follows the pattern: [some word] is [some value], repeated for different words and values.
As soon as I use merge_and_unload, the model no longer sticks to the pattern it learned, even at low temperature. Without it, inference is really slow but works perfectly.
Reproduction
Inference of a fine-tuned llama 2 model with and without merge and unload
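A minimal sketch of that comparison; the model and adapter IDs, prompt, and generation settings are placeholders, not the exact setup from this report:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-chat-hf"   # placeholder base model
adapter_id = "my-account/my-lora-adapter"   # placeholder adapter repo

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = PeftModel.from_pretrained(base, adapter_id)

inputs = tokenizer("[some word] is", return_tensors="pt").to(model.device)

# Generation with the adapter applied on the fly (slow but accurate in this report)
out_unmerged = model.generate(**inputs, max_new_tokens=20, do_sample=False)

# Generation after merging the adapter into the 4-bit base (fast but degraded in this report)
merged = model.merge_and_unload()
out_merged = merged.generate(**inputs, max_new_tokens=20, do_sample=False)

print(tokenizer.decode(out_unmerged[0]))
print(tokenizer.decode(out_merged[0]))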
Expected behavior
Both produce roughly the same result.