Merge and unload changes inference a lot on quantized llama 2 #1043

Closed
1 of 4 tasks
Lanzelot-Moll opened this issue Oct 22, 2023 · 9 comments

Lanzelot-Moll commented Oct 22, 2023

System Info

I fine-tuned a quantized Llama 2 7B chat model using peft and it worked great. The only problem is that inference is really slow.

So I read the docs and found merge_and_unload. Once I apply it, inference doubles in speed, but the results get much worse with the same generation parameters.

Is this a known issue? Am I doing something wrong?

I saw there is an inference_mode option in the peft config. But since I am loading my adapters from the Hugging Face Hub, where I saved them, how can I change the peft config after creation? Or does merge_and_unload already do that?

As a side note, my use case is really simple and the fine-tuning data follows the pattern "[some word] is [some value]", repeated for different words and values, for data generation.
As soon as I use merge_and_unload, the model no longer sticks to the pattern it learned, even at low temperature. Without merging, inference is really slow but works perfectly.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

Inference of a fine-tuned Llama 2 model with and without merge_and_unload:

import torch
import transformers

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

# begin initializing HF items, need auth token for these
hf_auth = auth_token
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

# initialize the model
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# enable gradient checkpointing and prepare the quantized model for k-bit (QLoRA) training
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

config = LoraConfig(
    r=16, 
    lora_alpha=32,
    target_modules=["q_proj", "up_proj","o_proj","k_proj","down_proj", "gate_proj","v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
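
For completeness, the comparison itself might look roughly like the following. This is a sketch rather than the original script (run in a fresh session after the adapter has been saved or pushed): peft_model_id, the prompt, and the generation settings are placeholders/assumptions.

from peft import PeftModel

# load the tokenizer for the same base model
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_auth)

# reload a freshly quantized base model and attach the trained adapter
base = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)
peft_model = PeftModel.from_pretrained(base, peft_model_id)  # placeholder adapter repo

inputs = tokenizer("[some word] is", return_tensors="pt").to(base.device)

# 1) adapter attached (slow, but reportedly follows the trained pattern)
out_unmerged = peft_model.generate(**inputs, max_new_tokens=50, do_sample=False)

# 2) adapter merged into the 4-bit base weights (fast, but reportedly degrades)
merged = peft_model.merge_and_unload()
out_merged = merged.generate(**inputs, max_new_tokens=50, do_sample=False)

print(tokenizer.decode(out_unmerged[0], skip_special_tokens=True))
print(tokenizer.decode(out_merged[0], skip_special_tokens=True))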

Expected behavior

Both the merged and the unmerged model produce roughly the same result.

@BenjaminBossan
Member

This is really hard to say. Merging the weights, especially with 4bit quantization, will inevitably lead to discrepancies, but it's not easy to say how much is within the expected range.
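
To make the source of those discrepancies concrete: merging into a 4-bit model roughly means dequantizing the base weight, adding the LoRA delta, and quantizing the result again, and that second quantization step is lossy. A toy sketch with a crude round-to-nearest quantizer (not bitsandbytes' actual NF4 scheme) illustrates this:

import torch

def fake_quantize(w, levels=16):
    # crude symmetric round-to-nearest quantizer with `levels` values per tensor
    scale = w.abs().max() / (levels // 2)
    return torch.round(w / scale).clamp(-(levels // 2), levels // 2 - 1) * scale

torch.manual_seed(0)
W_q = fake_quantize(torch.randn(64, 64))   # what the 4-bit base model effectively holds

# small LoRA update, as produced by (B @ A) at rank 8
A = torch.randn(8, 64) * 0.01
B = torch.randn(64, 8) * 0.01
delta = B @ A

unmerged = W_q + delta                     # unmerged: full-precision delta on top of W_q
merged = fake_quantize(W_q + delta)        # merged: the sum is quantized again

print("mean |delta|:", delta.abs().mean().item())
print("mean error from re-quantizing:", (merged - unmerged).abs().mean().item())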

As soon as I use merge_and_unload, it does not stick to the pattern that it learned anymore, even on low temperature.

Do you find that the results are now completely random, or basically the same as without LoRA? Or is it still improved over the baseline, just not as good as without merging?

@lithces

lithces commented Nov 9, 2023

I can confirm this. I have to call merge_and_unload(), otherwise the generation is full of special characters.

My adapter_config.json looks like this:

{
  "auto_mapping": null,
  "base_model_name_or_path": "lmsys/vicuna-7b-v1.5-16k",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layers_pattern": null,
  "layers_to_transform": null,
  "lora_alpha": 16,
  "lora_dropout": 0.05,
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 8,
  "revision": null,
  "target_modules": [
    "q_proj",
    "v_proj"
  ],
  "task_type": "CAUSAL_LM"
}

And my inference code is like this:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftConfig, PeftModel

peft_model_id = 'mymodel'

config = PeftConfig.from_pretrained(peft_model_id)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    quantization_config=bnb_config,
    use_auth_token=False,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
#%%
model = PeftModel.from_pretrained(model, peft_model_id, device_map="auto")
model = model.merge_and_unload()

The last line is essential in order to get meaningful inference.

@BenjaminBossan
Member

Thanks for the additional report @lithces. IIUC, your issue is the opposite of OP's, who mentioned that their results get worse after merging whereas you report that they get better.

To further investigate, could you please check how the logits differ between unmerged and merged model?

@Lanzelot-Moll
Author

This is really hard to say. Merging the weights, especially with 4bit quantization, will inevitably lead to discrepancies, but it's not easy to say how much is within the expected range.

As soon as I use merge_and_unload, it does not stick to the pattern that it learned anymore, even on low temperature.

Do you find that the results are now completely random, or basically the same as without LoRA? Or is it still improved over the baseline, just not as good as without merging?

Hey, sorry for the delay.
In the end, for my use case, I just used more compute to work around it. I am nevertheless still interested in what I did wrong.
To your question:
It is not the same as before. If I run the base model, it just gives me plain text as output. If I use merge_and_unload, it works for the first couple of tokens, but at some point it just starts generating sentences instead of the trained pattern.
If I do not use merge_and_unload, the generation follows the trained pattern basically all the time.
My assumption is that the adapters are not trained/loaded in 4 bit and therefore the merging fails. Is there any way to verify that assumption? How can I check what number format my adapters have? What would be the solution for this, if that is the case? Can I train my adapters in 4 bit too?

@Lanzelot-Moll
Author

Thanks for the additional report @lithces. IIUC, your issue is the opposite of OP's, who mentioned that their results get worse after merging whereas you report that they get better.

To further investigate, could you please check how the logits differ between unmerged and merged model?

Could you kindly tell me how to do this? Thank you!

@BenjaminBossan
Member

If I use merge_and_unload, it works for the first couple of tokens, but at some point it just starts generating sentences instead of the trained pattern.

So IIUC, the first few generated tokens look good and over time the model starts deviating from the pattern it was trained on. This indicates to me that the model is not completely broken, but that there are small errors that accumulate over time. That's not totally surprising, as merging into the quantized weights will introduce some small errors. Possibly, you would see better results if you train your model for longer before merging. Perhaps tuning other hyper-parameters like the LoRA rank (r) could also help, but this is just speculation.

My assumption is that the adapters are not trained/loaded in 4 bit and therefore the merging fails. Is there any way to verify that assumption? How can I check what number format my adapters have? What would be the solution for this, if that is the case? Can I train my adapters in 4 bit too?

You are correct, the adapters are not in 4 bit, as the weights that are being trained require higher precision. AFAIK, there is no technique yet that allows going below 16 bit, though there are some forays into 8 bit. To check the dtype of the adapters, you can iterate over model.named_parameters() and print param.dtype whenever the name matches an adapter layer (e.g. if "lora_" in name).
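
For example, a sketch of that check (assuming model is the loaded PeftModel):

# list the dtype of each LoRA parameter and of the base weights for comparison
for name, param in model.named_parameters():
    if "lora_" in name:
        print("adapter:", name, param.dtype)   # typically torch.float32 or bfloat16
    elif name.endswith(".weight"):
        print("base:   ", name, param.dtype)   # quantized weights show up as torch.uint8; norms/embeddings stay in float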

Could you kindly tell me how to do this? Thank you!

Instead of calling generate, you could call the model directly, e.g. model(**input). But from what you wrote earlier, I would expect that the outputs are similar, not totally random, which is what I wanted to verify.
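
Concretely, the comparison could look something like this (a sketch, assuming model is the loaded PeftModel and tokenizer the matching tokenizer; the prompt is just a placeholder):

import torch

# compare the logits of the same input before and after merging
inputs = tokenizer("[some word] is", return_tensors="pt").to(model.device)

with torch.no_grad():
    logits_unmerged = model(**inputs).logits

merged = model.merge_and_unload()
with torch.no_grad():
    logits_merged = merged(**inputs).logits

diff = (logits_unmerged - logits_merged).abs()
print("max abs diff: ", diff.max().item())
print("mean abs diff:", diff.mean().item())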

@Lanzelot-Moll
Author

You are correct, the adapters are not in 4 bit, as the weights that are being trained require higher precision. AFAIK, there is no technique yet that allows going below 16 bit, though there are some forays into 8 bit. To check the dtype of the adapters, you can iterate over model.named_parameters() and print param.dtype whenever the name matches an adapter layer (e.g. if "lora_" in name).

I did that, and yes, all the LoRA adapters are in high precision and the base model is in 4 bit (stored as uint8). After merging, all the weights are in 4 bit.

Instead of calling generate, you could call the model directly, e.g. model(**input). But from what you wrote earlier, I would expect that the outputs are similar, not totally random, which is what I wanted to verify.

I ran an example before and after merging, and the resulting logits are much further apart than when I run the same model twice. So it seems that converting the LoRA weights to 4 bit causes rounding errors that accumulate over time. That explains why it works at the beginning and gets worse later. So you were right.

So to conclude, my only option for next time is either to train everything in full precision and then load it in 4 bit for inference, or to leave the configuration as is, which means I do not merge at all. Is that correct?

Thank you so much for your help!

@BenjaminBossan
Member

So to conclude, my only option for next time is either to train everything in full precision and then load it in 4 bit for inference, or to leave the configuration as is, which means I do not merge at all. Is that correct?

Unfortunately, that seems to be the case. Maybe training for longer could also help, but that's not guaranteed. It's also possible that #1122 could help in your situation; we'll have to see.

In principle, it's also possible that there is room for improvement when it comes to merging quantized weights, but we mostly rely on the tools given to us by bnb, so if there is a significant development on that front, it could also improve your situation.

@Lanzelot-Moll
Author

I see, thank you so much for your help! I will try LoftQ as soon as it is merged.
