Merge and unload changes inference a lot on quantized llama 2 #1043

Closed
1 of 4 tasks
Lanzelot-Moll opened this issue Oct 22, 2023 · 9 comments

Lanzelot-Moll commented Oct 22, 2023

System Info

I fine-tuned a quantized Llama 2 7B chat model using peft and it worked great. The only problem is that inference is really slow.

So I read the docs and found merge_and_unload. Once I apply it, inference doubles in speed, but the results get much worse with the same generation parameters.

Is this a known issue? Am I doing something wrong?

I saw there is an inference_mode option in the peft config. But since I am loading my adapters from the Hugging Face Hub, where I saved them, how can I change the peft config after creation? Or does merge_and_unload already do that?

As a side note, my use case is really simple and the fine-tuning data follows the pattern "[some word] is [some value]", repeated for different words and values, for data generation.
As soon as I use merge_and_unload, the model no longer sticks to the pattern it learned, even at low temperature. Without merging, inference is really slow but works perfectly.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

Inference of a fine-tuned Llama 2 model with and without merge_and_unload:

import torch
import transformers

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

# begin initializing HF items, need auth token for these
hf_auth = auth_token
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

# initialize the model
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# enable gradient checkpointing and prepare the quantized model for k-bit (QLoRA) training
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

config = LoraConfig(
    r=16, 
    lora_alpha=32,
    target_modules=["q_proj", "up_proj","o_proj","k_proj","down_proj", "gate_proj","v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
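
For completeness, the comparison itself might look roughly like the following. This is a sketch rather than the original script (run in a fresh session after the adapter has been saved or pushed): peft_model_id, the prompt, and the generation settings are placeholders/assumptions.

from peft import PeftModel

# load the tokenizer for the same base model
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_auth)

# reload a freshly quantized base model and attach the trained adapter
base = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)
peft_model = PeftModel.from_pretrained(base, peft_model_id)  # placeholder adapter repo

inputs = tokenizer("[some word] is", return_tensors="pt").to(base.device)

# 1) adapter attached (slow, but reportedly follows the trained pattern)
out_unmerged = peft_model.generate(**inputs, max_new_tokens=50, do_sample=False)

# 2) adapter merged into the 4-bit base weights (fast, but reportedly degrades)
merged = peft_model.merge_and_unload()
out_merged = merged.generate(**inputs, max_new_tokens=50, do_sample=False)

print(tokenizer.decode(out_unmerged[0], skip_special_tokens=True))
print(tokenizer.decode(out_merged[0], skip_special_tokens=True))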

Expected behavior

Both the merged and the unmerged model produce roughly the same result.

@BenjaminBossan
Member

This is really hard to say. Merging the weights, especially with 4bit quantization, will inevitably lead to discrepancies, but it's not easy to say how much is within the expected range.
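
To make the source of those discrepancies concrete: merging into a 4-bit model roughly means dequantizing the base weight, adding the LoRA delta, and quantizing the result again, and that second quantization step is lossy. A toy sketch with a crude round-to-nearest quantizer (not bitsandbytes' actual NF4 scheme) illustrates this:

import torch

def fake_quantize(w, levels=16):
    # crude symmetric round-to-nearest quantizer with `levels` values per tensor
    scale = w.abs().max() / (levels // 2)
    return torch.round(w / scale).clamp(-(levels // 2), levels // 2 - 1) * scale

torch.manual_seed(0)
W_q = fake_quantize(torch.randn(64, 64))   # what the 4-bit base model effectively holds

# small LoRA update, as produced by (B @ A) at rank 8
A = torch.randn(8, 64) * 0.01
B = torch.randn(64, 8) * 0.01
delta = B @ A

unmerged = W_q + delta                     # unmerged: full-precision delta on top of W_q
merged = fake_quantize(W_q + delta)        # merged: the sum is quantized again

print("mean |delta|:", delta.abs().mean().item())
print("mean error from re-quantizing:", (merged - unmerged).abs().mean().item())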

As soon as I use merge_and_unload, it does not stick to the pattern that it learned anymore, even on low temperature.

Do you find that the results are now completely random, or basically the same as without LoRA? Or is it still improved over the baseline, just not as good as without merging?

@lithces

lithces commented Nov 9, 2023

I can confirm this. I have to call merge_and_unload(), otherwise the generation is full of special characters.

My adapter_config.json looks like this:

{
  "auto_mapping": null,
  "base_model_name_or_path": "lmsys/vicuna-7b-v1.5-16k",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layers_pattern": null,
  "layers_to_transform": null,
  "lora_alpha": 16,
  "lora_dropout": 0.05,
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 8,
  "revision": null,
  "target_modules": [
    "q_proj",
    "v_proj"
  ],
  "task_type": "CAUSAL_LM"
}

And my inference code is like this:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftConfig, PeftModel

peft_model_id = 'mymodel'

config = PeftConfig.from_pretrained(peft_model_id)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    quantization_config=bnb_config,
    use_auth_token=False,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
#%%
model = PeftModel.from_pretrained(model, peft_model_id, device_map="auto")
model = model.merge_and_unload()

The last line is essential in order to get meaningful inference.

@BenjaminBossan
Member

Thanks for the additional report @lithces. IIUC, your issue is the opposite of OP's, who mentioned that their results get worse after merging whereas you report that they get better.

To further investigate, could you please check how the logits differ between unmerged and merged model?

@Lanzelot-Moll
Author

This is really hard to say. Merging the weights, especially with 4bit quantization, will inevitably lead to discrepancies, but it's not easy to say how much is within the expected range.

As soon as I use merge_and_unload, it does not stick to the pattern that it learned anymore, even on low temperature.

Do you find that the results are now completely random, or basically the same as without LoRA? Or is it still improved over the baseline, just not as good as without merging?

Hey, sorry for the delay.
In the end, for my use case, I just used more compute to work around it. I am nevertheless still interested in what I did wrong.
To your question:
It is not the same as before. If I run the base model, it just gives me plain text as output. If I use merge_and_unload, it works for the first couple of tokens, but at some point it just starts generating sentences instead of the trained pattern.
If I do not use merge_and_unload, the generation follows the trained pattern basically all the time.
My assumption is that the adapters are not trained/loaded in 4 bit and therefore the merging fails. Is there any way to verify that assumption? How can I check what number format my adapters have? What would be the solution for this, if that is the case? Can I train my adapters in 4 bit too?

@Lanzelot-Moll
Author

Thanks for the additional report @lithces. IIUC, your issue is the opposite of OP's, who mentioned that their results get worse after merging whereas you report that they get better.

To further investigate, could you please check how the logits differ between unmerged and merged model?

Could you kindly tell me how to do this? Thank you!

@BenjaminBossan
Member

If I use merge_and_unload, it works for the first couple of tokens, but at some point it just starts generating sentences instead of the trained pattern.

So IIUC, the first few generated tokens look good and over time the model starts deviating from the pattern it was trained on. This indicates to me that the model is not completely broken, but that there are small errors that accumulate over time. That's not totally surprising, as merging into the quantized weights will introduce some small errors. Possibly, you would see better results if you train your model for longer before merging. Perhaps tuning other hyper-parameters like the LoRA rank (r) could also help, but this is just speculation.

My assumption is that the adapters are not trained/loaded in 4 bit and therefore the merging fails. Is there any way to verify that assumption? How can I check what number format my adapters have? What would be the solution for this, if that is the case? Can I train my adapters in 4 bit too?

You are correct, the adapters are not in 4 bit, as the weights that are being trained require higher precision. AFAIK, there is no technique yet that allows going below 16 bit, though there are some forays into 8 bit. To check the dtype of the adapters, you can iterate over model.named_parameters() and print param.dtype whenever the name matches an adapter layer (e.g. if "lora_" in name).
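
For example, a sketch of that check (assuming model is the loaded PeftModel):

# list the dtype of each LoRA parameter and of the base weights for comparison
for name, param in model.named_parameters():
    if "lora_" in name:
        print("adapter:", name, param.dtype)   # typically torch.float32 or bfloat16
    elif name.endswith(".weight"):
        print("base:   ", name, param.dtype)   # quantized weights show up as torch.uint8; norms/embeddings stay in float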

Could you kindly tell me how to do this? Thank you!

Instead of calling generate, you could call the model directly, e.g. model(**input). But from what you wrote earlier, I would expect that the outputs are similar, not totally random, which is what I wanted to verify.
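
Concretely, the comparison could look something like this (a sketch, assuming model is the loaded PeftModel and tokenizer the matching tokenizer; the prompt is just a placeholder):

import torch

# compare the logits of the same input before and after merging
inputs = tokenizer("[some word] is", return_tensors="pt").to(model.device)

with torch.no_grad():
    logits_unmerged = model(**inputs).logits

merged = model.merge_and_unload()
with torch.no_grad():
    logits_merged = merged(**inputs).logits

diff = (logits_unmerged - logits_merged).abs()
print("max abs diff: ", diff.max().item())
print("mean abs diff:", diff.mean().item())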

@Lanzelot-Moll
Author

You are correct, the adapters are not in 4 bit, as the weights that are being trained require higher precision. AFAIK, there is no technique yet that allows going below 16 bit, though there are some forays into 8 bit. To check the dtype of the adapters, you can iterate over model.named_parameters() and print param.dtype whenever the name matches an adapter layer (e.g. if "lora_" in name).

I did that, and yes, all the LoRA adapters are in high precision and the base model is in 4 bit (stored as uint8). After merging, all the weights are in 4 bit.

Instead of calling generate, you could call the model directly, e.g. model(**input). But from what you wrote earlier, I would expect that the outputs are similar, not totally random, which is what I wanted to verify.

I ran an example before and after merging, and the resulting logits are much further apart than when I run the same model twice. So it seems that converting the LoRA weights to 4 bit causes rounding errors that accumulate over time. That explains why it works at the beginning and gets worse later. So you were right.

So to conclude, my only option for next time is either to train everything in full precision and then load it in 4 bit for inference, or to leave the configuration as is, which means I do not merge at all. Is that correct?

Thank you so much for your help!

@BenjaminBossan
Member

So to conclude, my only option for next time is either to train everything in full precision and then load it in 4 bit for inference, or to leave the configuration as is, which means I do not merge at all. Is that correct?

Unfortunately, that seems to be the case. Maybe training for longer could also help, but that's not guaranteed. It's also possible that #1122 could help in your situation; we'll have to see.

In principle, it's also possible that there is room for improvement when it comes to merging quantized weights, but we mostly rely on the tools given to us by bnb, so if there is a significant development on that front, it could also improve your situation.

@Lanzelot-Moll
Author

I see, thank you so much for your help! I will try LoftQ as soon as it is merged.
