
Why is the adapter layer merged in gpt-neox-20b_peft? #250

Closed
ohashi56225 opened this issue Mar 25, 2023 · 10 comments

@ohashi56225
Contributor

In the gpt-neox-20b_peft example, step 1 is SFT with LoRA, step 2 merges the trained LoRA adapters' weights into the base model, and step 3 loads the merged model for PPO training with LoRA. I am unclear on why the models are merged in step 2.

Is it necessary to merge the adapter layer, or can we simply load both the base model and the LoRA weights in step 3 (gpt-neo-20b_sentiment_peft.py) instead of loading the merged model? The code to load both models in step 3 would look like this:

from transformers import AutoModelForCausalLM
from peft import PeftConfig, PeftModel

peft_model_id = "edbeeching/gpt-neo-125M-imdb-lora"
peft_config = PeftConfig.from_pretrained(peft_model_id)
# Load the base model, then attach the trained LoRA weights on top of it
model = AutoModelForCausalLM.from_pretrained(
    peft_config.base_model_name_or_path,
    ...
)
model = PeftModel.from_pretrained(model, peft_model_id)
@Sanster

Sanster commented Mar 27, 2023

I think the main reason is to save VRAM, and a single LoRA adapter may also be enough for one task.

@younesbelkada
Collaborator

Hi @ohashi56225
Thanks for your question
Indeed, it is necessary to first merge the LoRA layers into the base model, because later on we train new adapters for the RLHF part. Since you cannot have multiple adapters on the same model, you need to merge the adapters that were trained first, use the merged model as a standalone transformers model, and use that as the base model for the new LoRA layers that you want to train.
Does that make things clearer? Let me know if anything is still unclear.
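For concreteness, here is a rough sketch of that merge-then-retrain workflow, assuming a merge_and_unload()-style helper in peft (the example repo used a standalone merge script at the time, and the model names and LoRA hyperparameters below are illustrative):

from transformers import AutoModelForCausalLM
from peft import PeftConfig, PeftModel, LoraConfig, get_peft_model

peft_model_id = "edbeeching/gpt-neo-125M-imdb-lora"
peft_config = PeftConfig.from_pretrained(peft_model_id)

# Step 2: load the base model + SFT adapters, then fold the LoRA weights into the base weights
base_model = AutoModelForCausalLM.from_pretrained(peft_config.base_model_name_or_path)
sft_model = PeftModel.from_pretrained(base_model, peft_model_id)
merged_model = sft_model.merge_and_unload()  # a plain transformers model again, no adapters left
merged_model.save_pretrained("gpt-neo-125M-imdb-lora-adapter-merged")

# Step 3: attach fresh LoRA layers to the merged model for the PPO/RLHF training
ppo_lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
ppo_model = get_peft_model(merged_model, ppo_lora_config)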

@ohashi56225
Contributor Author

Thank you for your response! I understand the reason for merging the LoRA layers more clearly now, but two new questions have arisen:

  1. If we initialize the weights of the new LoRA layers randomly, wouldn't we risk losing the ability to generate IMDB reviews learned in the SFT step, resulting in poor sentence generation at the beginning of the RLHF training loop?
  2. Wouldn't it make more sense to fine-tune the pre-trained LoRA layers directly, rather than adding a new LoRA layer for RLHF? This would also reduce the number of parameters in the model.

I apologize if my questions are off the mark, since I am not familiar with Hugging Face's PEFT and LoRA implementations.

@younesbelkada
Collaborator

younesbelkada commented Mar 27, 2023

Hi @ohashi56225
Thanks a lot for following up!

  1. If we initialize the weights of the new LoRA layers randomly, wouldn't we risk losing the ability to generate IMDB reviews learned in the SFT step, resulting in poor sentence generation at the beginning of the RLHF training loop?

In fact, when we "merge" the LoRA layers into the base model, it means that we are transferring the knowledge learned by the LoRA layers directly into the base model's weights. Concretely, this also means that once you have merged the LoRA layers, you can push these new "merged" weights to the Hub (or save them locally), load them directly, and benefit from the SFT fine-tuned performance. An example can be found here: https://huggingface.co/edbeeching/gpt-neo-125M-imdb-lora-adapter-merged ; as you can see, there are no adapters left in this repository.
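To make "transferring the knowledge" concrete, merging a LoRA pair into a linear layer simply adds the trained low-rank update into the frozen weight matrix. A rough sketch of the arithmetic (not the exact peft implementation):

import torch

def merge_lora_linear(W, lora_A, lora_B, lora_alpha, r):
    # W:      (out_features, in_features) frozen base weight
    # lora_A: (r, in_features), lora_B: (out_features, r) trained adapter weights
    scaling = lora_alpha / r
    # The returned weight behaves like the adapted layer, with no extra parameters at inference
    return W + scaling * (lora_B @ lora_A)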

  2. Wouldn't it make more sense to fine-tune the pre-trained LoRA layers directly, rather than adding a new LoRA layer for RLHF? This would also reduce the number of parameters in the model.

It is indeed possible, but the model will converge quite slowly, as it needs to:
1- Adapt to the imdb domain
2- Learn to generate positive content based on the feedback that we provide
We have empirically found that the "domain" gap is an important factor to consider, i.e. fine-tuning a model that is already prepared to generate movie reviews converges faster / more easily than a base model.

Let me know if anything else is unclear!

@ohashi56225
Contributor Author

Thank you for your answer! However, I realized that my question may have been unclear, leading to an answer that didn't quite address my intended meaning.

First of all, is my understanding correct that the current implementation follows the steps shown in the image below? If so, my first question was whether adding new LoRA layers in step 3 affects the review generation ability of the merged model. Specifically, since the parameters of these LoRA layers are initialized randomly (which I believe corresponds to line 112 of gpt-neo-20b_sentiment_peft.py), they need to be trained from scratch during RL.

[image: diagram of the three-step SFT → merge → PPO procedure]

@alexrame

@ohashi56225 I had exactly the same questions as you. I also think your understanding of the 3-step procedure is correct. However, I think I understand why simply fine-tuning the adapter weights without merging them works less well: it is related to the KL divergence in the PPO training.

In fact, in https://github.com/lvwerra/trl/blob/b5cce0d13e95c8e21eb9d57177930253d9092a02/examples/sentiment/scripts/gpt-neox-20b_peft/gpt-neo-20b_sentiment_peft.py#L209, ref_model=None. In this case, and since model is a PEFT model, the ref_model simply becomes the PEFT model with the adapters disabled: https://github.com/lvwerra/trl/blob/b5cce0d13e95c8e21eb9d57177930253d9092a02/trl/trainer/ppo_trainer.py#L559.

In our case, if there is no merging step, this corresponds to taking as the ref_model the gpt-neo-125m model without any adapters, and thus not adapted to imdb. With the merging step, this amounts to taking as ref_model the gpt-neo-125M-imdb-lora-adapter-merged model, with better-quality ref_logprobs. A bug fix that should make the merging obsolete would be to explicitly choose the adapted ref_model rather than leaving ref_model=None.

That's my current understanding from reading the code, but experimental validation is certainly desirable.
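For reference, here is a simplified sketch of what that code path does when ref_model=None and the policy is a peft model (not the exact trl code; policy_model stands for trl's AutoModelForCausalLMWithValueHead wrapping a PeftModel):

def compute_ref_logits(policy_model, input_ids, attention_mask):
    # With ref_model=None, reference logprobs are computed with the LoRA adapters
    # disabled, i.e. from the base weights only, so the KL penalty is measured
    # against whatever was (or was not) merged into the base model.
    with policy_model.pretrained_model.disable_adapter():
        return policy_model.pretrained_model(input_ids, attention_mask=attention_mask).logits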

@younesbelkada
Collaborator

younesbelkada commented Mar 30, 2023

Hey @Sanster @alexrame
Thanks a lot for the nice diagram btw, I made a diagram based on yours to explain the procedure

[image: updated diagram of the merge + RLHF procedure]

Actually, when the LoRA weights get merged, the LoRA layers 'disappear', and this is what I meant by saying you can use the merged model as a standalone model!
Btw, this functionality will be supported soon in peft, check: huggingface/peft#227

In our case, if there is no merging step, this corresponds to taking as the ref_model the gpt-neo-125m model without any adapters and thus not adapted to imdb. With the merging step, this amounts to taking as ref_model the gpt-neo-125M-imdb-lora-adapter-merged model with better quality ref_logprobs.

I think that this is totally valid and this is my empirical understanding of things!

@ohashi56225
Contributor Author

@younesbelkada @alexrame
I now have a complete understanding of how the LoRA layers are merged to create a standalone model, as well as the reasons behind this process. Thank you both for providing such a detailed explanation! And I agree that it works to choose the adapted ref_model instead of leaving ref_model=None.

Therefore, I think it's appropriate to close this issue. Thank you once again!

@AttentionAllUNeed

AttentionAllUNeed commented Apr 4, 2023

Thanks for the nice explanations. So should we set ref_model = gpt-neo-125M-imdb-lora-adapter-merged instead of leaving it as None?

@alexrame

alexrame commented Apr 4, 2023

If your base model is "gpt-neo-125M-imdb-lora-adapter-merged", then you can either set ref_model = gpt-neo-125M-imdb-lora-adapter-merged or leave it as None; the two are equivalent. Yet the None option is preferred because it is more efficient (you don't need to load the same model twice).

In contrast, if you have not applied the merge procedure, the ref model should be loaded explicitly, e.g.:
ref_model = Loader.load_peft_model(ref_base_model, peft_name=ref_peft_name)
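(Loader.load_peft_model above is a custom helper.) A hedged sketch of the same idea with plain peft/trl APIs; the adapter name is illustrative, config/model/tokenizer/dataset/collator are as defined in the example script, and the exact PPOTrainer signature depends on your trl version:

from transformers import AutoModelForCausalLM
from peft import PeftConfig, PeftModel
from trl import AutoModelForCausalLMWithValueHead, PPOTrainer

ref_peft_name = "edbeeching/gpt-neo-125M-imdb-lora"
ref_config = PeftConfig.from_pretrained(ref_peft_name)

# Build the reference model from the SFT-adapted weights (kept frozen during PPO)
ref_base_model = AutoModelForCausalLM.from_pretrained(ref_config.base_model_name_or_path)
ref_model = PeftModel.from_pretrained(ref_base_model, ref_peft_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(ref_model)

# config, model, tokenizer, dataset, collator come from the existing training script
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, dataset=dataset, data_collator=collator)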
