ORPO trainer #1435

Merged: 38 commits merged into main on Mar 22, 2024
Conversation

@kashif (Collaborator) commented on Mar 17, 2024:

ORPO trainer

Reference-free Monolithic Preference Optimization with Odds Ratio

cc @jiwooya1000

  • figure out what to log
  • add logging section to the docs

@kashif marked this pull request as draft on March 17, 2024 17:18
@kashif marked this pull request as ready for review on March 17, 2024 19:35

@kashif (Collaborator, Author) commented on Mar 17, 2024:

cc @philschmid

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@kashif requested a review from lewtun on March 17, 2024 19:49

@lewtun (Member) left a comment:

Very cool to see this elegant method get added so quickly @kashif!

I left a few remarks about what to log, and about harmonising the example script so it is less hard-coded with respect to Anthropic HH. I'd also like to see some small experiments showing that the metrics look sane for the examples. Otherwise it's looking great!

docs/source/orpo_trainer.md (review comments, resolved)

While training and evaluating we record the following reward metrics:

TODO

Review comment (Member):

WDYT about logging the log probs and log odds ratio alongside the SFT loss, OR loss and full loss? That way the user can debug whether the rejected log probs are decreasing over the course of training.

(screenshot attached: 2024-03-18 08:47)
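
As an illustration, a minimal sketch of an odds-ratio loss that exposes these quantities for logging (function and variable names are assumptions, not the merged trainer code):

    import torch
    import torch.nn.functional as F

    def odds_ratio_loss(chosen_logps, rejected_logps, beta=0.1):
        # chosen_logps / rejected_logps: per-sequence average log p(y|x), shape (batch,)
        # log odds(y|x) = log p - log(1 - p), kept in log space for numerical stability
        log_odds = (chosen_logps - rejected_logps) - (
            torch.log1p(-torch.exp(chosen_logps)) - torch.log1p(-torch.exp(rejected_logps))
        )
        log_odds_ratio = F.logsigmoid(log_odds)    # log sigma(log odds ratio)
        or_loss = -beta * log_odds_ratio           # term added on top of the NLL (SFT) loss
        # return the intermediates so the trainer can log them
        return or_loss.mean(), log_odds_ratio.detach().mean(), log_odds.detach().mean()

Returning `log_odds_ratio` and `log_odds` alongside the NLL loss would let the trainer log them, making it easy to check whether the rejected log probs are being pushed down over training.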

examples/scripts/orpo.py (review comment, resolved)
compute_metrics (`Callable[[EvalPrediction], Dict]`, *optional*):
    The function to use to compute the metrics. Must take an `EvalPrediction` and return
    a dictionary mapping strings to metric values.
dataset_num_proc (`Optional[int]`, *optional*):

Review comment (Member):
I think this could live in the ORPOConfig (ideally we want nearly everything that is not a callable to live in a single config, so it can be easily tweaked at the command line).
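
If it does move, a rough sketch of what that could look like (a sketch only: the field names, defaults, and the `TrainingArguments` base class are assumptions, not the merged API):

    from dataclasses import dataclass
    from typing import Optional

    from transformers import TrainingArguments

    @dataclass
    class ORPOConfig(TrainingArguments):
        # weighting of the odds-ratio term relative to the NLL loss
        beta: float = 0.1
        # number of worker processes used when mapping/tokenizing the dataset
        dataset_num_proc: Optional[int] = None

Keeping non-callable options like `dataset_num_proc` in the config means they can be overridden from the command line through the usual `HfArgumentParser` flow.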

Review comment (Member):
@kashif WDYT about this proposal to move the arg to the config?

trl/trainer/orpo_trainer.py (review comments, resolved)
reward_accuracies = (chosen_rewards > rejected_rewards).float()

prefix = "eval_" if train_eval == "eval" else ""
metrics[f"{prefix}rewards/chosen"] = chosen_rewards.mean().cpu()

Review comment (Member):
Maybe we can just keep logps and nll_loss and also log odds ratio to simplify this?
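
For example, a trimmed-down metrics helper along those lines might look like this (key names are illustrative and may not match the merged trainer):

    def build_metrics(prefix, chosen_logps, rejected_logps, nll_loss, log_odds_ratio):
        # prefix is "eval_" during evaluation and "" during training
        return {
            f"{prefix}logps/chosen": chosen_logps.detach().mean().cpu(),
            f"{prefix}logps/rejected": rejected_logps.detach().mean().cpu(),
            f"{prefix}nll_loss": nll_loss.detach().mean().cpu(),
            f"{prefix}log_odds_ratio": log_odds_ratio.detach().mean().cpu(),
        }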

kashif and others added 8 commits on March 18, 2024 09:23 (six co-authored by lewtun <lewis.c.tunstall@gmail.com>)

@jiwooya1000 commented:

Thank you so much for such a fast implementation of ORPO @kashif 😀

Also, regarding #1435 (comment), I think it is a great idea to log the log odds ratio, as it can help monitor the effect of β (the weighting hyperparameter), as in this report.

(image attached)

Thank you again for the implementation!
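
For reference, a sketch of the objective from the paper, with β as the weighting hyperparameter discussed above (notation is illustrative and may differ slightly from the final docs):

    \mathcal{L}_{\mathrm{ORPO}}
      = \mathbb{E}\big[\, \mathcal{L}_{\mathrm{SFT}} + \beta \cdot \mathcal{L}_{\mathrm{OR}} \,\big],
    \qquad
    \mathcal{L}_{\mathrm{OR}}
      = -\log \sigma\!\left( \log \frac{\mathrm{odds}(y_w \mid x)}{\mathrm{odds}(y_l \mid x)} \right),
    \qquad
    \mathrm{odds}(y \mid x) = \frac{P(y \mid x)}{1 - P(y \mid x)}

Logging the inner log odds ratio therefore shows directly what the β-weighted term is acting on.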

@kashif (Collaborator, Author) commented on Mar 18, 2024:

OK, I'll add the logging next.

@vwxyzjn (Contributor) left a comment:

Great work! Added some comments.

sanity_check: bool = field(default=True, metadata={"help": "only train on 1000 samples"})


def extract_anthropic_prompt(prompt_and_response):

Review comment (Contributor):
We now have standard datasets under https://huggingface.co/datasets/trl-internal-testing/hh-rlhf-trl-style (#1424).

In this case, maybe you could try:

    import multiprocessing

    from datasets import load_dataset
    from transformers import AutoTokenizer
    from trl import ORPOTrainer, get_peft_config

    # args, model, orpo_args, and model_config are assumed to come from the
    # script's argument parsing and model loading further up.
    ds = load_dataset(args.dataset)
    if args.debug:
        for key in ds:
            ds[key] = ds[key].select(range(50))
    tokenizer = AutoTokenizer.from_pretrained(args.model)
    if tokenizer.chat_template is None:
        tokenizer.chat_template = "{% for message in messages %}{{message['role'] + ': ' + message['content'] + '\n\n'}}{% endfor %}{{ eos_token }}"

    # render the chosen/rejected message lists into plain strings
    def process(row):
        row["chosen"] = tokenizer.apply_chat_template(row["chosen"], tokenize=False)
        row["rejected"] = tokenizer.apply_chat_template(row["rejected"], tokenize=False)
        return row

    ds = ds.map(
        process,
        num_proc=1 if args.debug else multiprocessing.cpu_count(),
        load_from_cache_file=False,
    )
    train_dataset = ds["train"]
    eval_dataset = ds["test"]
    trainer = ORPOTrainer(
        model,
        args=orpo_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        peft_config=get_peft_config(model_config),
    )

Here, `args.debug` is doing the same thing as `sanity_check`.
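
A small sketch of how the example script's arguments could expose that flag (the dataclass and field names here are hypothetical):

    from dataclasses import dataclass, field

    @dataclass
    class ScriptArguments:
        dataset: str = field(
            default="trl-internal-testing/hh-rlhf-trl-style",
            metadata={"help": "dataset to train on"},
        )
        debug: bool = field(
            default=False,
            metadata={"help": "only use a few samples per split, replacing the sanity_check flag"},
        )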

@lewtun (Member) left a comment:

Thanks for iterating @kashif! I left one final nit and a question about moving the dataset proc args to the config. Apart from that, LGTM 🔥

examples/scripts/orpo.py (review comment, resolved)
compute_metrics (`Callable[[EvalPrediction], Dict]`, *optional*):
    The function to use to compute the metrics. Must take an `EvalPrediction` and return
    a dictionary mapping strings to metric values.
dataset_num_proc (`Optional[int]`, *optional*):

Review comment (Member):
@kashif WDYT about this proposal to move the arg to the config?

kashif and others added 2 commits on March 19, 2024 10:40 (co-authored by lewtun <lewis.c.tunstall@gmail.com>)

@jiwooya1000 commented:

Just gave ORPOTrainer a quick try with facebook/opt-350m + argilla/ultrafeedback-binarized-preferences-cleaned, and it seems to be working well 😃 Thank you for your work @kashif!

(image attached)

Commit added, co-authored by Alvaro Bartolome <alvarobartt@gmail.com>
trl/trainer/orpo_trainer.py (review comments, resolved)
Commit added, co-authored by Alvaro Bartolome <alvarobartt@gmail.com>

@alvarobartt (Member) left a comment:

Maybe it would also be nice to include the ORPOTrainer in the README.md listing of supported trainers; see https://github.com/huggingface/trl?tab=readme-ov-file#highlights

@kashif (Collaborator, Author) commented on Mar 22, 2024:

@alvarobartt OK, yes, good idea! Adding it.

@kashif merged commit 2ce8e45 into main on Mar 22, 2024
9 checks passed
@kashif deleted the orpo branch on March 22, 2024 21:07
lapp0 pushed a commit to lapp0/trl that referenced this pull request May 10, 2024
* initial orpo skeleton
* typos
* calculate orpo loss
* fix class name
* fix tests
* fix typo
* Update docs/source/orpo_trainer.md
  Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* Update docs/source/orpo_trainer.md
  Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* Update docs/source/orpo_trainer.md
  Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* rename max_target_length
* Update examples/scripts/orpo.py
  Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* Update examples/scripts/orpo.py
  Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* Update examples/scripts/orpo.py
  Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* more docs
* log log_odds_ratio and log_odds
* average_log_prob as per paper
* added logging section
* add nll_loss
* fix typo
* more verbose
* rename log_odds to log_odds_chosen
* allow datasets to be loaded
* remove dup debug arg
* tokenizer exists
* fix typo
* use trl-internal-testing/hh-rlhf-trl-style dataset
* formatting
* add missing imports
* fix output dir name
* Update examples/scripts/orpo.py
  Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* move dataset_num_proc to configs
* Update trl/trainer/orpo_config.py
  Co-authored-by: Alvaro Bartolome <alvarobartt@gmail.com>
* Update trl/trainer/orpo_trainer.py
  Co-authored-by: Alvaro Bartolome <alvarobartt@gmail.com>
* add ORPOTrainer to readme
* fix typo

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Alvaro Bartolome <alvarobartt@gmail.com>