
Why is the adapter layer merged in gpt-neox-20b_peft? #250

Closed
ohashi56225 opened this issue Mar 25, 2023 · 10 comments

@ohashi56225
Contributor

In the gpt-neox-20b_peft example, step 1 is SFT with LoRA, step 2 merges the trained LoRA adapters' weights into the base model, and step 3 loads the merged model for PPO training with LoRA. I am unclear on why the models are merged in step 2.

Is it necessary to merge the adapter layer, or can we simply load both the base model and the LoRA weights in step 3 (gpt-neo-20b_sentiment_peft.py) instead of loading the merged model? The code to load both models in step 3 would look like this:

from transformers import AutoModelForCausalLM
from peft import PeftConfig, PeftModel

peft_model_id = "edbeeching/gpt-neo-125M-imdb-lora"
peft_config = PeftConfig.from_pretrained(peft_model_id)
# Load the base model, then attach the trained LoRA weights on top of it
model = AutoModelForCausalLM.from_pretrained(
    peft_config.base_model_name_or_path,
    ...
)
model = PeftModel.from_pretrained(model, peft_model_id)
@Sanster

Sanster commented Mar 27, 2023

I think the main reason is to save VRAM, and a single LoRA adapter may also be enough for one task.

@younesbelkada
Collaborator

Hi @ohashi56225
Thanks for your question
Indeed, it is necessary to first merge the LoRA layers into the base model, because later on we train new adapters for the RLHF part. Since you cannot have multiple adapters on the same model, you need to merge the adapters that were trained first, use the merged model as a standalone transformers model, and use that as the base model for the new LoRA layers that you want to train.
Does that make things clearer? Let me know if anything is still unclear.
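For concreteness, here is a rough sketch of that merge-then-retrain workflow, assuming a merge_and_unload()-style helper in peft (the example repo used a standalone merge script at the time, and the model names and LoRA hyperparameters below are illustrative):

from transformers import AutoModelForCausalLM
from peft import PeftConfig, PeftModel, LoraConfig, get_peft_model

peft_model_id = "edbeeching/gpt-neo-125M-imdb-lora"
peft_config = PeftConfig.from_pretrained(peft_model_id)

# Step 2: load the base model + SFT adapters, then fold the LoRA weights into the base weights
base_model = AutoModelForCausalLM.from_pretrained(peft_config.base_model_name_or_path)
sft_model = PeftModel.from_pretrained(base_model, peft_model_id)
merged_model = sft_model.merge_and_unload()  # a plain transformers model again, no adapters left
merged_model.save_pretrained("gpt-neo-125M-imdb-lora-adapter-merged")

# Step 3: attach fresh LoRA layers to the merged model for the PPO/RLHF training
ppo_lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
ppo_model = get_peft_model(merged_model, ppo_lora_config)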

@ohashi56225
Contributor Author

Thank you for your response! I understand the reason for merging the LoRA layers more clearly now, but two new questions have arisen:

  1. If we initialize the weights of the new LoRA layers randomly, wouldn't we risk losing the ability to generate IMDB reviews learned in the SFT step, resulting in poor sentence generation at the beginning of the RLHF training loop?
  2. Wouldn't it make more sense to fine-tune the pre-trained LoRA layers directly, rather than adding a new LoRA layer for RLHF? This would also reduce the number of parameters in the model.

I apologize if my questions are off the mark, since I am not familiar with Hugging Face's PEFT and LoRA implementations.

@younesbelkada
Collaborator

younesbelkada commented Mar 27, 2023

Hi @ohashi56225
Thanks a lot for following up!

  1. If we initialize the weights of the new LoRA layers randomly, wouldn't we risk losing the ability to generate IMDB reviews learned in the SFT step, resulting in poor sentence generation at the beginning of the RLHF training loop?

In fact, when we "merge" the LoRA layers into the base model, it means that we are transferring the knowledge learned by the LoRA layers directly into the base model's weights. Concretely, this also means that once you have merged the LoRA layers, you can push these new "merged" weights to the Hub (or save them locally), load them directly, and benefit from the SFT fine-tuned performance. An example can be found here: https://huggingface.co/edbeeching/gpt-neo-125M-imdb-lora-adapter-merged ; as you can see, there are no adapters left in this repository.
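To make "transferring the knowledge" concrete, merging a LoRA pair into a linear layer simply adds the trained low-rank update into the frozen weight matrix. A rough sketch of the arithmetic (not the exact peft implementation):

import torch

def merge_lora_linear(W, lora_A, lora_B, lora_alpha, r):
    # W:      (out_features, in_features) frozen base weight
    # lora_A: (r, in_features), lora_B: (out_features, r) trained adapter weights
    scaling = lora_alpha / r
    # The returned weight behaves like the adapted layer, with no extra parameters at inference
    return W + scaling * (lora_B @ lora_A)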

  2. Wouldn't it make more sense to fine-tune the pre-trained LoRA layers directly, rather than adding a new LoRA layer for RLHF? This would also reduce the number of parameters in the model.

It is indeed possible, but the model will converge quite slowly, as it needs to:
1- Adapt to the imdb domain
2- Learn to generate positive content based on the feedback that we provide
We have empirically found that the "domain" gap is an important factor to consider, i.e. fine-tuning a model that is already prepared to generate movie reviews converges faster / more easily than a base model.

Let me know if anything else is unclear!

@ohashi56225
Contributor Author

Thank you for your answer! However, I realized that my question may have been unclear, leading to an answer that didn't quite address my intended meaning.

First of all, is my understanding correct that the current implementation follows the steps shown in the image below? If so, my first question was whether adding new LoRA layers in step 3 affects the review generation ability of the merged model. Specifically, since the parameters of these LoRA layers are initialized randomly (which I believe corresponds to line 112 of gpt-neo-20b_sentiment_peft.py), they need to be trained from scratch during RL.

[image: diagram of the three-step SFT → merge → PPO procedure]

@alexrame

@ohashi56225 I had exactly the same questions as you. I also think your understanding of the 3-step procedure is correct. However, I think I understand why simply fine-tuning the adapter weights without merging them works less well: it is related to the KL divergence in the PPO training.

In fact, in https://github.com/lvwerra/trl/blob/b5cce0d13e95c8e21eb9d57177930253d9092a02/examples/sentiment/scripts/gpt-neox-20b_peft/gpt-neo-20b_sentiment_peft.py#L209, ref_model=None. In this case, and since model is a PEFT model, the ref_model simply becomes the PEFT model with the adapters disabled: https://github.com/lvwerra/trl/blob/b5cce0d13e95c8e21eb9d57177930253d9092a02/trl/trainer/ppo_trainer.py#L559.

In our case, if there is no merging step, this corresponds to taking as the ref_model the gpt-neo-125m model without any adapters, and thus not adapted to imdb. With the merging step, this amounts to taking as ref_model the gpt-neo-125M-imdb-lora-adapter-merged model, with better-quality ref_logprobs. A bug fix that should make the merging obsolete would be to explicitly choose the adapted ref_model rather than leaving ref_model=None.

That's my current understanding from reading the code, but experimental validation is certainly desirable.
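For reference, here is a simplified sketch of what that code path does when ref_model=None and the policy is a peft model (not the exact trl code; policy_model stands for trl's AutoModelForCausalLMWithValueHead wrapping a PeftModel):

def compute_ref_logits(policy_model, input_ids, attention_mask):
    # With ref_model=None, reference logprobs are computed with the LoRA adapters
    # disabled, i.e. from the base weights only, so the KL penalty is measured
    # against whatever was (or was not) merged into the base model.
    with policy_model.pretrained_model.disable_adapter():
        return policy_model.pretrained_model(input_ids, attention_mask=attention_mask).logits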

@younesbelkada
Collaborator

younesbelkada commented Mar 30, 2023

Hey @Sanster @alexrame
Thanks a lot for the nice diagram btw, I made a diagram based on yours to explain the procedure

[image: updated diagram of the merge + RLHF procedure]

Actually, when the LoRA weights get merged, the LoRA layers 'disappear', and this is what I meant by saying you can use the merged model as a standalone model!
Btw, this functionality will be supported soon in peft, check: huggingface/peft#227

In our case, if there is no merging step, this corresponds to taking as the ref_model the gpt-neo-125m model without any adapters and thus not adapted to imdb. With the merging step, this amounts to taking as ref_model the gpt-neo-125M-imdb-lora-adapter-merged model with better quality ref_logprobs.

I think that this is totally valid and this is my empirical understanding of things!

@ohashi56225
Contributor Author

@younesbelkada @alexrame
I now have a complete understanding of how the LoRA layers are merged to create a standalone model, as well as the reasons behind this process. Thank you both for providing such a detailed explanation! And I agree that it works to choose the adapted ref_model instead of leaving ref_model=None.

Therefore, I think it's appropriate to close this issue. Thank you once again!

@AttentionAllUNeed

AttentionAllUNeed commented Apr 4, 2023

Thanks for the nice explanations. So should we set ref_model = gpt-neo-125M-imdb-lora-adapter-merged instead of leaving it as None?

@alexrame

alexrame commented Apr 4, 2023

If your base model is "gpt-neo-125M-imdb-lora-adapter-merged", then you can either set ref_model = gpt-neo-125M-imdb-lora-adapter-merged or leave it as None; the two are equivalent. Yet the None option is preferred because it is more efficient (you don't need to load the same model twice).

In contrast, if you have not applied the merge procedure, the ref model should be loaded explicitly, e.g.:
ref_model = Loader.load_peft_model(ref_base_model, peft_name=ref_peft_name)
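(Loader.load_peft_model above is a custom helper.) A hedged sketch of the same idea with plain peft/trl APIs; the adapter name is illustrative, config/model/tokenizer/dataset/collator are as defined in the example script, and the exact PPOTrainer signature depends on your trl version:

from transformers import AutoModelForCausalLM
from peft import PeftConfig, PeftModel
from trl import AutoModelForCausalLMWithValueHead, PPOTrainer

ref_peft_name = "edbeeching/gpt-neo-125M-imdb-lora"
ref_config = PeftConfig.from_pretrained(ref_peft_name)

# Build the reference model from the SFT-adapted weights (kept frozen during PPO)
ref_base_model = AutoModelForCausalLM.from_pretrained(ref_config.base_model_name_or_path)
ref_model = PeftModel.from_pretrained(ref_base_model, ref_peft_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(ref_model)

# config, model, tokenizer, dataset, collator come from the existing training script
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, dataset=dataset, data_collator=collator)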
