
[DPO] remove response/pairs from the DPO side#540

Merged
lvwerra merged 4 commits into huggingface:main from kashif:dpo-collate
Jul 19, 2023

Conversation

@kashif
Collaborator

@kashif kashif commented Jul 19, 2023

fixes #537

TODO:

  • docs
  • tests
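For context, a minimal sketch of the pairwise dataset format DPOTrainer expects after this change (three parallel lists instead of the old response/pairs fields). The toy strings below are hypothetical, not from the actual dataset:

```python
# Hypothetical toy example of the pairwise format: each prompt is paired
# with exactly one chosen and one rejected completion.
dpo_dataset_dict = {
    "prompt": [
        "\n\nHuman: What color is the sky?\n\nAssistant:",
    ],
    "chosen": [
        " The sky appears blue due to Rayleigh scattering.",
    ],
    "rejected": [
        " The sky is green.",
    ],
}

# The three lists must stay aligned row by row.
assert (
    len(dpo_dataset_dict["prompt"])
    == len(dpo_dataset_dict["chosen"])
    == len(dpo_dataset_dict["rejected"])
)
```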

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Jul 19, 2023

The documentation is not available anymore as the PR was closed or merged.

Comment thread on examples/dpo.py (outdated):
@@ -95,11 +95,12 @@ def split_prompt_and_responses(ex):

def gen():
Member


This function can be simplified a lot now:

from typing import Dict

from datasets import Dataset, load_dataset

# extract_anthropic_prompt is defined earlier in examples/dpo.py
def get_hh(split: str, sanity_check: bool = False, silent: bool = False, cache_dir: str = None) -> Dataset:
    """Load the Anthropic Helpful-Harmless dataset from Hugging Face and convert it to the necessary format.

    The dataset is converted to a dictionary with the following structure:
    {
        'prompt': List[str],
        'chosen': List[str],
        'rejected': List[str],
    }

    Prompts should be structured as follows:
      \n\nHuman: <prompt>\n\nAssistant:
    Multiple turns are allowed, but the prompt should always start with \n\nHuman: and end with \n\nAssistant:.
    """
    dataset = load_dataset("Anthropic/hh-rlhf", split=split, cache_dir=cache_dir)
    if sanity_check:
        dataset = dataset.select(range(min(len(dataset), 1000)))

    def split_prompt_and_responses(sample) -> Dict[str, str]:
        prompt = extract_anthropic_prompt(sample["chosen"])
        return {
            "prompt": prompt,
            "chosen": sample["chosen"][len(prompt) :],
            "rejected": sample["rejected"][len(prompt) :],
        }

    return dataset.map(split_prompt_and_responses)

I can't push this to the PR directly or add this as a suggestion in the review, sadly.
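To illustrate what split_prompt_and_responses produces, here is a self-contained sketch with a simplified stand-in for extract_anthropic_prompt (the real helper lives in examples/dpo.py; the sample strings are made up):

```python
def extract_anthropic_prompt(prompt_and_response: str) -> str:
    # Simplified stand-in: treat the prompt as everything up to and
    # including the final "\n\nAssistant:" marker.
    search_term = "\n\nAssistant:"
    idx = prompt_and_response.rfind(search_term)
    assert idx != -1, "sample does not contain an Assistant marker"
    return prompt_and_response[: idx + len(search_term)]

sample = {
    "chosen": "\n\nHuman: What is 2+2?\n\nAssistant: 4",
    "rejected": "\n\nHuman: What is 2+2?\n\nAssistant: 5",
}

# Mirrors the body of split_prompt_and_responses above.
prompt = extract_anthropic_prompt(sample["chosen"])
row = {
    "prompt": prompt,
    "chosen": sample["chosen"][len(prompt):],
    "rejected": sample["rejected"][len(prompt):],
}
```

With this sample, `row["chosen"]` is `" 4"` and `row["rejected"]` is `" 5"`, with the shared prompt split off into `row["prompt"]`.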

@kashif
Collaborator Author

kashif commented Jul 19, 2023

you should be able to push to my branch now... please feel free!

@kashif
Collaborator Author

kashif commented Jul 19, 2023

let me fix up the docs and tests

@kashif
Collaborator Author

kashif commented Jul 19, 2023

@younesbelkada or @lvwerra should be ready for review!

Contributor

@younesbelkada younesbelkada left a comment


Looks good on my side, thanks a lot everyone for working on this!
Since we have not released a PyPI version with DPOTrainer yet, it's ok to have the breaking change now!

@tomaarsen
Member

I appreciate that you left some time for people to find these issues before releasing. I'd love to look into the spamminess of the logs during training as well. Do you know when you intend to release?

  • Tom Aarsen

@kashif
Collaborator Author

kashif commented Jul 19, 2023

yes @tomaarsen let's open another PR for the logging issue... ideally, I wanted to have this all logged to wandb etc. and note we also have the sample generation helper too which is currently not being used...

@younesbelkada younesbelkada requested a review from lvwerra July 19, 2023 15:27
@tomaarsen
Member

tomaarsen commented Jul 19, 2023

wandb support should be implemented automatically, right? As long as it's installed and report_to is set up correctly.
Is the issue simply that we don't call self.callback_handler.on_log(self.args, self.state, self.control, ...)?

I can help with this, as I have a good amount of experience with how Transformers does its Trainer via a SetFit logging/callbacks integration that I've been working on.

  • Tom Aarsen
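For reference, the pattern Tom is describing can be sketched in plain Python. This is a simplified mock of how transformers' Trainer fans a logging event out to its registered callbacks, not the actual API:

```python
class PrinterCallback:
    """Toy callback that records whatever the trainer logs."""

    def __init__(self):
        self.history = []

    def on_log(self, args, state, control, logs=None):
        self.history.append(logs)


class CallbackHandler:
    """Simplified mock of a callback handler: fans each logging
    event out to every registered callback."""

    def __init__(self, callbacks):
        self.callbacks = callbacks

    def on_log(self, args, state, control, logs):
        for cb in self.callbacks:
            cb.on_log(args, state, control, logs=logs)


# A trainer would call on_log at each logging step; integrations such
# as the wandb reporter hook into the same event.
printer = PrinterCallback()
handler = CallbackHandler([printer])
handler.on_log(args=None, state=None, control=None, logs={"loss": 0.42})
```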

@kashif
Collaborator Author

kashif commented Jul 19, 2023

yup! for some reason i was not getting it to log with the trainer and so I "forced" it to log things via the call to:

if self.accelerator.is_main_process:
    self.log_metrics("test", metrics)

would appreciate any insight here!

@lvwerra lvwerra merged commit fd50e06 into huggingface:main Jul 19, 2023
@younesbelkada
Contributor

Thanks everyone! For the release, probably mid-next week would be a nice time, but not sure, @lvwerra

@kashif kashif deleted the dpo-collate branch April 23, 2024 10:12
yxliu-TAMU pushed a commit to mincheolseong/ECEN743-GRPO-Project-Proposal that referenced this pull request Apr 20, 2025
* remove response/pairs from the DPO side

* Simplify get_hh helper function

* removed unused import

* update tests and docs for dpo_trainer

---------

Co-authored-by: Tom Aarsen <Cubiegamedev@gmail.com>
Co-authored-by: Shoaib Burq <saburq@gmail.com>


Development

Successfully merging this pull request may close these issues.

DPOTrainer ignores training & evaluation data if there are more than 2 responses

6 participants