[DPO] remove response/pairs from the DPO side (#540)
Conversation
@@ -95,11 +95,12 @@ def split_prompt_and_responses(ex):
def gen():
This function can be simplified a lot now:

```python
def get_hh(split: str, sanity_check: bool = False, silent: bool = False, cache_dir: str = None) -> Dataset:
    """Load the Anthropic Helpful-Harmless dataset from Hugging Face and convert it to the necessary format.

    The dataset is converted to a dictionary with the following structure:
    {
        'prompt': List[str],
        'chosen': List[str],
        'rejected': List[str],
    }

    Prompts should be structured as follows:
      \n\nHuman: <prompt>\n\nAssistant:
    Multiple turns are allowed, but the prompt should always start with \n\nHuman: and end with \n\nAssistant:.
    """
    dataset = load_dataset("Anthropic/hh-rlhf", split=split, cache_dir=cache_dir)
    if sanity_check:
        dataset = dataset.select(range(min(len(dataset), 1000)))

    def split_prompt_and_responses(sample) -> Dict[str, str]:
        prompt = extract_anthropic_prompt(sample["chosen"])
        return {
            "prompt": prompt,
            "chosen": sample["chosen"][len(prompt) :],
            "rejected": sample["rejected"][len(prompt) :],
        }

    return dataset.map(split_prompt_and_responses)
```

I can't push this to the PR directly or add this as a suggestion in the review, sadly.
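For context, the snippet above relies on an `extract_anthropic_prompt` helper to recover the shared prompt prefix from a full HH conversation. A minimal self-contained sketch of how such a helper can work (this is an illustration, not necessarily the exact implementation in the repo):

```python
def extract_anthropic_prompt(prompt_and_response: str) -> str:
    """Return everything up to and including the final Assistant marker.

    An HH sample stores the whole conversation; the prompt is the prefix
    shared by the chosen and rejected completions, ending at the last
    "\\n\\nAssistant:" turn.
    """
    search_term = "\n\nAssistant:"
    search_term_idx = prompt_and_response.rfind(search_term)
    assert search_term_idx != -1, f"Sample does not contain {search_term!r}"
    return prompt_and_response[: search_term_idx + len(search_term)]


sample = "\n\nHuman: What color is the sky?\n\nAssistant: It is blue."
prompt = extract_anthropic_prompt(sample)
# prompt == "\n\nHuman: What color is the sky?\n\nAssistant:"
response = sample[len(prompt):]
# response == " It is blue."
```

Slicing with `len(prompt)` then gives exactly the `chosen`/`rejected` completions that `split_prompt_and_responses` returns.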
you should be able to push to my branch now... please feel free!

let me fix up the docs and tests
@younesbelkada or @lvwerra should be ready for review! |
younesbelkada left a comment:
Looks good on my side, thanks a lot everyone for working on this!
Since we have not released a PyPI version with DPOTrainer yet, it's ok to have the breaking change now!
I appreciate that you left some time for people to find these issues before releasing. I'd love to look into the spamminess of the logs during training as well. Do you know when you intend to release?
yes @tomaarsen let's open another PR for the logging issue... ideally, I wanted to have this all logged to wandb etc., and note we also have the sample generation helper, which is currently not being used...
I can help with this, as I have a good amount of experience with how Transformers does its logging.
yup! for some reason I was not getting it to log with the trainer, so I "forced" it to log things via the call to:

```python
if self.accelerator.is_main_process:
    self.log_metrics("test", metrics)
```

would appreciate any insight here!
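One likely reason the forced call was needed: `log_metrics` only pretty-prints to stdout, whereas routing metrics through the trainer's logging path fans them out to all registered integrations (wandb, TensorBoard, ...). A minimal sketch of that fan-out pattern, with hypothetical names (the real Trainer wires this through its callback handler, not this class):

```python
from typing import Callable, Dict, List


class CallbackLogger:
    """Fan metrics out to every registered sink (stand-ins for wandb/stdout hooks)."""

    def __init__(self) -> None:
        self._sinks: List[Callable[[Dict[str, float]], None]] = []

    def register(self, sink: Callable[[Dict[str, float]], None]) -> None:
        self._sinks.append(sink)

    def log(self, metrics: Dict[str, float]) -> None:
        # Every sink receives the same metrics dict, so all integrations
        # see e.g. the DPO reward/accuracy metrics without a forced call.
        for sink in self._sinks:
            sink(metrics)


logger = CallbackLogger()
seen: List[Dict[str, float]] = []
logger.register(seen.append)  # stand-in for a wandb callback
logger.log({"rewards/accuracies": 0.75})
# seen == [{"rewards/accuracies": 0.75}]
```

The design choice this illustrates: emit metrics once through a single dispatch point and let each backend subscribe, rather than calling each backend directly from the training loop.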
Thanks everyone! For the release, mid-next week would probably be a nice time, but I'm not sure, @lvwerra?
* remove response/pairs from the DPO side
* Simplify get_hh helper function
* removed unused import
* update tests and docs for dpo_trainer

---------

Co-authored-by: Tom Aarsen <Cubiegamedev@gmail.com>
Co-authored-by: Shoaib Burq <saburq@gmail.com>
fixes #537
TODO: