Hi, quick question: DataCollatorForCompletionOnlyLM trains only on the responses by loss-masking the prompt tokens.
Does it work the same way with DPOTrainer (DPODataCollatorWithPadding)? Looking at the code, it does look like it trains on the prompts too. But maybe that doesn't matter with the DPO loss.
The reason I ask is that my dataset has long prompts but short responses, and the resulting model trained by DPO is barely different from the reference model (even with 90%+ reward accuracy). So loss-masking the prompts might help focus the learning, but that's just a guess.
Thanks!
I am having the same problem: DPO training does not improve on the baseline model.
Nevertheless, DPODataCollatorWithPadding does seem to assign the ignore index to the label positions corresponding to the prompt, so the cause is probably elsewhere.
Here is the relevant piece of code that does that:
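For anyone landing here later, the masking logic is roughly the following (a simplified, self-contained sketch, not the exact TRL source; the function name `mask_prompt_labels` and the variable names are mine, and the exact code differs across TRL versions). Positions set to the label pad token id (-100 by default) are ignored by the loss, so only the completion tokens contribute:

```python
# Sketch of DPOTrainer-style label construction: concatenate prompt and
# answer token ids, then overwrite the prompt positions in the labels with
# label_pad_token_id so the cross-entropy/log-prob computation skips them.
LABEL_PAD_TOKEN_ID = -100  # HF's default ignore index

def mask_prompt_labels(prompt_input_ids, answer_input_ids,
                       label_pad_token_id=LABEL_PAD_TOKEN_ID):
    """Return (input_ids, labels) where prompt positions are masked out."""
    input_ids = prompt_input_ids + answer_input_ids
    labels = input_ids[:]  # copy, then mask the prompt span
    labels[: len(prompt_input_ids)] = [label_pad_token_id] * len(prompt_input_ids)
    return input_ids, labels
```

With this, the per-sequence log-probabilities used in the DPO loss are summed only over the answer tokens, regardless of how long the prompt is.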
Gentlemen, If I understood your discussion correctly (and the code snippet mentioned), DPOTrainer does in fact mask out the prompt tokens by default? Were you able to figure out why your models did not improve?