Conversation
The documentation is not available anymore as the PR was closed or merged. |
Thank you @kashif. |
Agreed @Forbu, and I think we will refactor the data collator so that we only have a mask on the positive and negative parts of the sequence...
Hello @kashif, thanks for the DPO integration. I refactored the data collator to compute the mean log-probs only on the positive and negative parts of the sequence. Can you share the branch with me, please?
@gaetanlop I have added you to my fork.
Thanks @kashif, I've just pushed the changes needed so that the trainer does not compute the mean log-probs on masked input_ids. I used an approach similar to HF's DataCollatorForTokenClassification (https://github.com/huggingface/transformers/blob/main/src/transformers/data/data_collator.py).
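The masking approach borrowed from DataCollatorForTokenClassification can be sketched as follows. The helper names here are hypothetical (not the actual TRL code); the sketch assumes labels are padded with -100 so that only the chosen/rejected completion tokens contribute to the mean log-prob:

```python
import torch

IGNORE_INDEX = -100  # same sentinel DataCollatorForTokenClassification uses

def pad_labels(label_seqs, pad_to, ignore_index=IGNORE_INDEX):
    """Right-pad label sequences with the ignore index so padded (and prompt)
    positions are excluded from the loss."""
    return torch.tensor(
        [seq + [ignore_index] * (pad_to - len(seq)) for seq in label_seqs]
    )

def masked_mean_logps(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean log-probability of the label tokens, skipping ignored positions,
    so only the positive/negative part of the sequence is scored."""
    mask = labels != ignore_index
    safe_labels = labels.masked_fill(~mask, 0)  # any valid index; masked below
    token_logps = torch.log_softmax(logits, dim=-1).gather(
        2, safe_labels.unsqueeze(-1)
    ).squeeze(-1)
    return (token_logps * mask).sum(-1) / mask.sum(-1)
```

Padding the labels (rather than truncating the logits) keeps the collator shape-compatible with whatever the tokenizer produces for input_ids.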
Flagging that at some point we'll want the ScriptArguments to be consistent with the TrainingArguments (log_with -> report_to, batch_size -> per_device_train_batch_size, model_name -> model_name_or_path). The second and third renames are especially important, since the semantics of those arguments are actually different.
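The rename can be sketched as a simple mapping; the `to_training_args` helper below is illustrative only (not an actual TRL function), with the names taken from the comment above:

```python
# Hypothetical rename map from the legacy ScriptArguments names to the
# standard transformers.TrainingArguments names.
RENAMES = {
    "log_with": "report_to",
    # Semantics differ: batch_size was global, this one is per device.
    "batch_size": "per_device_train_batch_size",
    "model_name": "model_name_or_path",
}

def to_training_args(script_args: dict) -> dict:
    """Translate legacy argument names, passing unknown keys through unchanged."""
    return {RENAMES.get(k, k): v for k, v in script_args.items()}
```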
@kashif @gaetanlop a fix to enable distributed training required a change to |
@kashif just wanted to check in on this. Curious if you've had the chance to re-run the replication experiment :)
@eric-mitchell I did, but not with an SFT'ed Pythia... it also worked with PEFT... We had a validation loss around 0.6 or so, as per your wandb, so that was nice... if you can share your SFT'ed Pythia model with me, I can also run it now with that?
Here are the weights for our pre-trained Pythia. You can load with
It's nice that PEFT worked! What type of PEFT did you use?
@eric-mitchell thanks! Yes, we tried QLoRA and LoRA as well... let me confirm. Thanks for the weights!
lvwerra left a comment:
Two small nits, then we can merge :)
* initial DPO Trainer
* typo
* initial dpo from reward trainer
* calc. log_probs from logits
* remove dpo config for now
* fix inits
* add intial DPODataCollatorWithPadding
* use the RewardDataCollatorWithPadding
* initial test
* means of loss
* add assert
* just call the train instead of step
* functional debug example before refactor
* check the params have changed
* initial DPODataCollatorWithPadding
* Data collator with masking
* going through trainer.accelerate to wrap ref_model
* style / imports
* style / imports
* `broadcast_buffers=False` fix to distributed training
* better fix for DDP issues
* arguments and style clean-up
* better doc, some light refactoring
* better imports
* initial dpo doc
* fix test
* fix formatting
* fix
* called models once
* fix tests
* add example
* fix doc string
* intitial example with anthropic hh dataset
* refactored dpo trainer
* revert
* return metrics
* fixed tests
* updated docs
* update test
* fixed typo
* note about the beta
* added dpo authors
* fix docstrings
* add prediction_step
* remove compute_metrics and log metrics manually
* fix typo
* add DPOTrainer doc
* add dpo to toc
* ValueError
* add to index and example
* fix docs
* fix assert

---------

Co-authored-by: TevenLeScao <teven.lescao@gmail.com>
Co-authored-by: Gaetan LOPEZ <gaetanloplat@gmail.com>
Co-authored-by: younesbelkada <younesbelkada@gmail.com>
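The core objective these commits implement ("calc. log_probs from logits", "means of loss", "note about the beta") is the DPO loss from Rafailov et al. Below is a minimal sketch of that loss, not the exact trainer code; the function name and the reward outputs are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO objective: -log sigmoid(beta * (policy log-ratio - reference
    log-ratio)), where beta controls deviation from the reference model."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    losses = -F.logsigmoid(beta * (pi_logratios - ref_logratios))
    # Implicit rewards, handy to log as training metrics.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return losses.mean(), chosen_rewards, rejected_rewards
```

At initialization, when the policy still equals the reference, the loss is log 2 ≈ 0.693 regardless of beta, which is roughly consistent with the ~0.6 validation loss discussed above.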

Initial DPOTrainer class for #405 by copying the ~~PPOTrainer~~ RewardTrainer and starting to implement changes in it.

Fixes #405