Loss Calculation Question #10

Closed
jpanaro opened this issue Oct 24, 2020 · 5 comments
Comments


jpanaro commented Oct 24, 2020

Hello! Thanks again for your help last time. I am on the brink of finishing my system, but I have a couple of questions regarding the loss function inside the PPOTrainer class.

  1. For the following lines (here): using the default parameters (ppo_epochs = 4 and batch_size = 256), we run the second (inner) loop, and therefore the train_minibatch function, a total of 1024 times but with only 256 unique samples. This means we also backpropagate the calculated loss and take an optimizer step 1024 times, i.e. 4 times for every sample in the batch, since we only pass a single sample to train_minibatch (see the schematic after this list). Is my understanding correct here? And if so, is there a reason we do this per sample instead of, for example, per forward_batch_size (default 16)?

  2. In this loop (here) we calculate the reversed advantages for use later in the loss function, and we run the loop for the length of the query. Would you mind explaining why it has to be approached this way? In my system I don't have a separation of query and response, just input features and a text caption as output, so I am struggling a bit with how to adapt this loop specifically.
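
To make the first question concrete, here is a minimal schematic of the call pattern as I understand it (the names are stand-ins, not the actual trl source):

```python
ppo_epochs, batch_size = 4, 256  # default parameters

# Stand-ins for the rollout buffers and the update step, just to show the call pattern.
queries = responses = scores = list(range(batch_size))

def train_minibatch(query, response, score):
    """Stand-in for PPOTrainer.train_minibatch: computes the PPO loss,
    backpropagates it, and takes an optimizer step."""
    pass

calls = 0
for _ in range(ppo_epochs):        # 4 optimization epochs over the same collected batch
    for i in range(batch_size):    # minibatch size 1: a single sample per update
        train_minibatch(queries[i], responses[i], scores[i])
        calls += 1

print(calls)  # 1024 = ppo_epochs * batch_size optimizer steps per collected batch
```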

lvwerra (Member) commented Oct 24, 2020

Hi @jpanaro, glad you are still working on this project. See my answers to your questions below:

  1. The reason we run PPO for 4 epochs is that we want to make the most of the data we gathered; if generating samples were cheap, we could train for just one epoch. At the same time, we don't want to overfit the current batch, so we only train for 4 epochs. Naturally, this may vary depending on your application, and since it is a parameter you can easily experiment with other values. The reason you don't make the batch much bigger is that after training for a couple of epochs the data becomes out of date, since the updated model would not actually generate these outputs anymore. Finally, the minibatch size is set to 1 because I could not train GPT-2 on a single GPU with larger values. The forward_batch_size is independent of the above considerations, since it only exists to conserve memory during the forward passes; if possible, the most efficient choice is to set it to the batch_size.

  2. Actually, you calculate the advantages for each token in the response: gen_len = response.shape[1]. I have not tested what happens when the query is an empty tensor, but if it breaks (for example here) it should be fairly easy to fix. The reason the query tokens have to be masked is that they are given to the model as input and should be ignored during training (see the sketch below).
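
For illustration, a rough sketch of a reversed advantage (GAE) loop that only walks over the response tokens; the tensor shapes and the gamma/lam values are illustrative, not copied verbatim from the trl source:

```python
import torch

def compute_advantages(values, rewards, gen_len, gamma=1.0, lam=0.95):
    """Reversed-loop GAE over the response tokens only.
    values, rewards: tensors of shape (batch, gen_len); gamma/lam are illustrative."""
    lastgaelam = torch.zeros(values.shape[0])
    advantages_reversed = []
    for t in reversed(range(gen_len)):  # walk backwards over the response positions
        nextvalues = values[:, t + 1] if t < gen_len - 1 else torch.zeros_like(values[:, t])
        delta = rewards[:, t] + gamma * nextvalues - values[:, t]
        lastgaelam = delta + gamma * lam * lastgaelam
        advantages_reversed.append(lastgaelam)
    return torch.stack(advantages_reversed[::-1], dim=1)  # back to left-to-right order

# For a captioning setup with no query, gen_len is just the caption length,
# e.g. gen_len = response.shape[1], and there are no query positions to mask.
values = torch.randn(2, 5)    # (batch, gen_len) value-head estimates
rewards = torch.randn(2, 5)   # (batch, gen_len) per-token rewards
advantages = compute_advantages(values, rewards, gen_len=5)
```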

I hope these comments help. Let me know if you have any more questions!

jpanaro (Author) commented Oct 24, 2020

  1. That makes perfect sense. I thought it was a memory issue, but I wanted to be certain I wasn't missing an important design choice.
  2. Ah, I meant response; I blame my tired brain, sorry. I think I can just set gen_len equal to the caption length and it should be fine. I will experiment with that a little more to see if it works out.

Your comments were very helpful!

A few more small questions that cropped up:

  1. When you load the GPT-2 models, are the active model and the reference model the same pretrained model, with the only difference being that the active model is the one we backpropagate the PPO loss through?
  2. Have you experimented at all with a learning rate scheduler on top of the optimizer? My base model uses a ReduceLROnPlateau scheduler (something like the sketch below), so I was curious whether you have tried something similar.
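
For reference, this is the kind of wiring I have in mind; the model, learning rate, and scheduler settings are placeholders, not the trl defaults:

```python
import torch

model = torch.nn.Linear(8, 8)  # placeholder for the policy model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=10
)

for step in range(100):
    mean_reward = 0.0  # placeholder for the metric tracked per batch (e.g. mean reward)
    # ... run the PPO update for this batch here ...
    scheduler.step(mean_reward)  # reduce the LR once the tracked metric plateaus
```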

lvwerra (Member) commented Oct 25, 2020

  1. Yes, you are correct. The reference model helps determine how far the active model's output distribution deviates from the reference distribution, and the KL term in the reward makes sure the model stays close to the original distribution.

  2. I have not experimented with that, but it might be worth checking out to gain the last few percent of performance! Since the advantages are whitened (see here), it could be that the losses don't change as much as they would in a supervised setup. Let me know how it goes! (A rough sketch of the KL-penalized reward and the whitening follows below.)
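
To make the two ideas above concrete, here is a rough sketch of one way the pieces fit together; the kl_coef value, the placement of the score on the last token, and the normalization details are illustrative rather than quoted from the trl source:

```python
import torch

def kl_penalized_reward(score, logprob, ref_logprob, kl_coef=0.2):
    """Per-token penalty for drifting away from the reference model, with the
    scalar reward-model score added on the last response token.
    score: (batch,); logprob, ref_logprob: (batch, gen_len). kl_coef is illustrative."""
    kl = logprob - ref_logprob   # per-token log-ratio, an estimate of the KL term
    reward = -kl_coef * kl       # penalize deviating from the original distribution
    reward[:, -1] += score       # task reward only on the final generated token
    return reward

def whiten(advantages, eps=1e-8):
    """Normalize advantages to zero mean and unit variance before the PPO loss."""
    return (advantages - advantages.mean()) / (advantages.std() + eps)
```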

lvwerra closed this as completed Oct 25, 2020
@yanghoonkim

> Finally, the minibatch size is set to 1 since I could not train GPT-2 on a single GPU with larger numbers.

@lvwerra
Was it okay for you to train GPT-2 with a single-sentence batch every time?
I implemented a T5 version of trl (referring to your code) and found it does not work well... (the reward fluctuates a lot and the generated results are also getting worse).


jayelm commented Jun 6, 2022

@lvwerra assuming memory is not an issue, do you expect the code to run fine if the minibatch size is set to something > 1?
