Loss Calculation Question #10

Closed
jpanaro opened this issue Oct 24, 2020 · 5 comments
Comments


jpanaro commented Oct 24, 2020

Hello! Thanks again for your help last time. I am on the brink of finishing my system, but I have a couple of questions regarding the loss function inside the PPOTrainer class.

  1. For the following lines (here): using the default parameters (ppo_epochs = 4 and batch_size = 256), we run the second (inner) loop, and therefore the train_minibatch function, a total of 1024 times but with only 256 unique samples. This means we also backpropagate the calculated loss and take an optimizer step 1024 times, i.e. 4 times for every sample in the batch, since we only pass a single sample to train_minibatch (see the schematic after this list). Is my understanding correct here? And if so, is there a reason we do this per sample instead of, for example, per forward_batch_size (default 16)?

  2. In this loop (here) we calculate the reversed advantages for use later in the loss function, and we run the loop for the length of the query. Would you mind explaining why it has to be approached this way? In my system I don't have a separation of query and response, just input features and a text caption as output, so I am struggling a bit with how to adapt this loop specifically.
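
To make the first question concrete, here is a minimal schematic of the call pattern as I understand it (the names are stand-ins, not the actual trl source):

```python
ppo_epochs, batch_size = 4, 256  # default parameters

# Stand-ins for the rollout buffers and the update step, just to show the call pattern.
queries = responses = scores = list(range(batch_size))

def train_minibatch(query, response, score):
    """Stand-in for PPOTrainer.train_minibatch: computes the PPO loss,
    backpropagates it, and takes an optimizer step."""
    pass

calls = 0
for _ in range(ppo_epochs):        # 4 optimization epochs over the same collected batch
    for i in range(batch_size):    # minibatch size 1: a single sample per update
        train_minibatch(queries[i], responses[i], scores[i])
        calls += 1

print(calls)  # 1024 = ppo_epochs * batch_size optimizer steps per collected batch
```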

lvwerra (Member) commented Oct 24, 2020

Hi @jpanaro, glad you are still working on this project. See my answers to your questions below:

  1. The reason we run PPO for 4 epochs is that we want to make the most of the data we gathered; if generating samples were cheap, we could train for just one epoch. At the same time, we don't want to overfit the current batch, so we only train for 4 epochs. Naturally, this may vary depending on your application, and since it is a parameter you can easily experiment with other values. The reason you don't make the batch much bigger is that after training for a couple of epochs the data becomes out of date, since the updated model would not actually generate these outputs anymore. Finally, the minibatch size is set to 1 because I could not train GPT-2 on a single GPU with larger values. The forward_batch_size is independent of the above considerations, since it only exists to conserve memory during the forward passes; if possible, the most efficient choice is to set it to the batch_size.

  2. Actually, you calculate the advantages for each token in the response: gen_len = response.shape[1]. I have not tested what happens when the query is an empty tensor, but if it breaks (for example here) it should be fairly easy to fix. The reason the query tokens have to be masked is that they are given to the model as input and should be ignored during training (see the sketch below).
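
For illustration, a rough sketch of a reversed advantage (GAE) loop that only walks over the response tokens; the tensor shapes and the gamma/lam values are illustrative, not copied verbatim from the trl source:

```python
import torch

def compute_advantages(values, rewards, gen_len, gamma=1.0, lam=0.95):
    """Reversed-loop GAE over the response tokens only.
    values, rewards: tensors of shape (batch, gen_len); gamma/lam are illustrative."""
    lastgaelam = torch.zeros(values.shape[0])
    advantages_reversed = []
    for t in reversed(range(gen_len)):  # walk backwards over the response positions
        nextvalues = values[:, t + 1] if t < gen_len - 1 else torch.zeros_like(values[:, t])
        delta = rewards[:, t] + gamma * nextvalues - values[:, t]
        lastgaelam = delta + gamma * lam * lastgaelam
        advantages_reversed.append(lastgaelam)
    return torch.stack(advantages_reversed[::-1], dim=1)  # back to left-to-right order

# For a captioning setup with no query, gen_len is just the caption length,
# e.g. gen_len = response.shape[1], and there are no query positions to mask.
values = torch.randn(2, 5)    # (batch, gen_len) value-head estimates
rewards = torch.randn(2, 5)   # (batch, gen_len) per-token rewards
advantages = compute_advantages(values, rewards, gen_len=5)
```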

I hope these comments help. Let me know if you have any more questions!

jpanaro (Author) commented Oct 24, 2020

  1. That makes perfect sense. I thought it was a memory issue, but I wanted to be certain I wasn't missing an important design choice.
  2. Ah, I meant response; I blame my tired brain, sorry. I think I can just set gen_len equal to the caption length and it should be fine. I will experiment with that a little more to see if it works out.

Your comments were very helpful!

A few more small questions that cropped up:

  1. When you load the GPT-2 models, are the active model and the reference model the same pretrained model, with the only difference being that the active model is the one we backpropagate the PPO loss through?
  2. Have you experimented at all with a learning rate scheduler on top of the optimizer? My base model uses a ReduceLROnPlateau scheduler (something like the sketch below), so I was curious whether you have tried something similar.
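
For reference, this is the kind of wiring I have in mind; the model, learning rate, and scheduler settings are placeholders, not the trl defaults:

```python
import torch

model = torch.nn.Linear(8, 8)  # placeholder for the policy model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=10
)

for step in range(100):
    mean_reward = 0.0  # placeholder for the metric tracked per batch (e.g. mean reward)
    # ... run the PPO update for this batch here ...
    scheduler.step(mean_reward)  # reduce the LR once the tracked metric plateaus
```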

lvwerra (Member) commented Oct 25, 2020

  1. Yes, you are correct. The reference model helps determine how far the active model's output distribution deviates from the reference distribution, and the KL term in the reward makes sure the model stays close to the original distribution.

  2. I have not experimented with that, but it might be worth checking out to gain the last few percent of performance! Since the advantages are whitened (see here), it could be that the losses don't change as much as they would in a supervised setup. Let me know how it goes! (A rough sketch of the KL-penalized reward and the whitening follows below.)
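
To make the two ideas above concrete, here is a rough sketch of one way the pieces fit together; the kl_coef value, the placement of the score on the last token, and the normalization details are illustrative rather than quoted from the trl source:

```python
import torch

def kl_penalized_reward(score, logprob, ref_logprob, kl_coef=0.2):
    """Per-token penalty for drifting away from the reference model, with the
    scalar reward-model score added on the last response token.
    score: (batch,); logprob, ref_logprob: (batch, gen_len). kl_coef is illustrative."""
    kl = logprob - ref_logprob   # per-token log-ratio, an estimate of the KL term
    reward = -kl_coef * kl       # penalize deviating from the original distribution
    reward[:, -1] += score       # task reward only on the final generated token
    return reward

def whiten(advantages, eps=1e-8):
    """Normalize advantages to zero mean and unit variance before the PPO loss."""
    return (advantages - advantages.mean()) / (advantages.std() + eps)
```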

lvwerra closed this as completed Oct 25, 2020
@yanghoonkim

> Finally, the minibatch size is set to 1 since I could not train GPT-2 on a single GPU with larger numbers.

@lvwerra
Was it okay for you to train GPT-2 with a single-sentence batch every time?
I implemented a T5 version of trl (referring to your code) and found it does not work well... (the reward fluctuates a lot and the generated results are also getting worse).


jayelm commented Jun 6, 2022

@lvwerra assuming memory is not an issue, do you expect the code to run fine if the minibatch size is set to something > 1?
