
Add whiten ops before compute advantages #887

Merged
vwxyzjn merged 2 commits into huggingface:main from SingL3:main on Oct 23, 2023

Conversation

@SingL3
Contributor

@SingL3 SingL3 commented Oct 18, 2023

1. From the LLaMA 2 paper:
   ```
   We also find it important to whiten the final linear scores (shown here by reversing the sigmoid with the logit function) in order to increase stability and balance properly with the KL penalty term (β) above.
   ```
2. This function is taken from [alpaca_farm](https://github.com/tatsu-lab/alpaca_farm/blob/64e489c67ea502ab5fa944bebde3078c9722f6ee/src/alpaca_farm/rl/ppo_trainer.py#L86)

Lin Junpeng added 2 commits October 18, 2023 12:00
1. From the LLaMA 2 paper:
```
We also find it important to whiten the final linear scores (shown here by reversing the sigmoid with the logit function) in order to increase stability and balance properly with the KL penalty term (β) above.
```
2. This function is taken from [alpaca_farm](https://github.com/tatsu-lab/alpaca_farm/blob/64e489c67ea502ab5fa944bebde3078c9722f6ee/src/alpaca_farm/rl/ppo_trainer.py#L86)
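The whitening op being added can be sketched in plain Python. This is a minimal illustration, not the actual alpaca_farm/TRL implementation, which operates on torch tensors and may differ in details (variance convention, masking); the function name and the `eps` value here are illustrative:

```python
import math

def whiten(values, shift_mean=True, eps=1e-8):
    """Normalize a sequence of scores to zero mean and unit variance.

    With shift_mean=False, the original mean is added back after
    scaling, mirroring the upstream option.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # population variance
    whitened = [(v - mean) / math.sqrt(var + eps) for v in values]
    if not shift_mean:
        whitened = [w + mean for w in whitened]
    return whitened
```

Applied to the rewards before the advantage computation, this keeps their scale comparable to the KL penalty term regardless of the reward model's output range.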
@younesbelkada
Contributor

cc @vwxyzjn @lvwerra

@vwxyzjn
Contributor

vwxyzjn commented Oct 19, 2023

@SingL3, thanks for the change! It makes sense. I have kick-started our benchmark to help understand its effect.


Contributor

@vwxyzjn vwxyzjn left a comment


No noticeable difference in performance with whiten_rewards=True. We can even set the default value to True. WDYT @lvwerra?


**Plot commands:**

```shell
# pip install openrlbenchmark==0.2.1a5
# see https://github.com/openrlbenchmark/openrlbenchmark#get-started for documentation
export TAGS_STRING='?tag=v0.4.7-137-g3f90b24&tag=pr-887'
export FOLDER_STRING='v0.4.7-137-g3f90b24_pr-887'
echo "we deal with $TAGS_STRING"

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "ppo$TAGS_STRING" \
    --env-ids sentiment-analysis:lvwerra/distilbert-imdb \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$FOLDER_STRING/hello_world \
    --scan-history

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "ppo$TAGS_STRING" \
        "ppo_gpt2xl_grad_accu$TAGS_STRING" \
    --env-ids sentiment-analysis:lvwerra/distilbert-imdb \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$FOLDER_STRING/different_models \
    --scan-history

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "ppo_Cerebras-GPT-6.7B_grad_accu_deepspeed_stage2$TAGS_STRING" \
    --env-ids sentiment-analysis:cerebras/Cerebras-GPT-6.7B \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$FOLDER_STRING/deepspeed \
    --scan-history

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "ppo$TAGS_STRING" \
        "ppo_step_grad_accu$TAGS_STRING" \
    --env-ids sentiment-analysis:lvwerra/distilbert-imdb \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$FOLDER_STRING/grad_accu \
    --scan-history

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "ppo$TAGS_STRING" \
        "ppo_gpt2$TAGS_STRING" \
        "ppo_falcon_rw_1b$TAGS_STRING" \
        "ppo_peft$TAGS_STRING" \
    --env-ids sentiment-analysis:lvwerra/distilbert-imdb \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$FOLDER_STRING/more_different_models \
    --scan-history

python benchmark/upload_benchmark.py \
    --folder_path="benchmark/trl/$FOLDER_STRING" \
    --path_in_repo="images/benchmark/$FOLDER_STRING" \
    --repo_id="trl-internal-testing/example-images" \
    --repo_type="dataset"
```


```shell
# pip install openrlbenchmark==0.2.1a5
# see https://github.com/openrlbenchmark/openrlbenchmark#get-started for documentation
TAGS_STRING='?tag=v0.4.7-137-g3f90b24&tag=pr-887'
FOLDER_STRING='pr-887_vs_baseline'
BASELINE_PR_TAG=v0.4.7-55-g110e672
BASELINE_PR_NAME=PR-662

echo "we deal with $TAGS_STRING"

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "ppo$TAGS_STRING" \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "sentiment_tuning?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb ($BASELINE_PR_NAME)" \
    --env-ids sentiment-analysis:lvwerra/distilbert-imdb \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$FOLDER_STRING/hello_world \
    --scan-history

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "ppo$TAGS_STRING" \
        "ppo_gpt2xl_grad_accu$TAGS_STRING" \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "sentiment_tuning?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb ($BASELINE_PR_NAME)" \
        "sentiment_tuning_gpt2xl_grad_accu?tag=$BASELINE_PR_TAG&cl=sentiment gpt2xl ($BASELINE_PR_NAME)" \
    --env-ids sentiment-analysis:lvwerra/distilbert-imdb \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$FOLDER_STRING/different_models \
    --scan-history

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "ppo$TAGS_STRING" \
        "ppo_gpt2$TAGS_STRING" \
        "ppo_falcon_rw_1b$TAGS_STRING" \
        "ppo_peft$TAGS_STRING" \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "sentiment_tuning?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb ($BASELINE_PR_NAME)" \
        "sentiment_tuning_gpt2?tag=$BASELINE_PR_TAG&cl=sentiment gpt2 ($BASELINE_PR_NAME)" \
        "sentiment_tuning_falcon_rw_1b?tag=$BASELINE_PR_TAG&cl=sentiment tiiuae/falcon-rw-1b ($BASELINE_PR_NAME)" \
        "sentiment_tuning_peft?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb w/ peft ($BASELINE_PR_NAME)" \
    --env-ids sentiment-analysis:lvwerra/distilbert-imdb \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$FOLDER_STRING/more_different_models \
    --scan-history

python benchmark/upload_benchmark.py \
    --folder_path="benchmark/trl/$FOLDER_STRING" \
    --path_in_repo="images/benchmark/$FOLDER_STRING" \
    --repo_id="trl-internal-testing/example-images" \
    --repo_type="dataset"
```

@vwxyzjn
Contributor

vwxyzjn commented Oct 23, 2023

I am going to merge as is. Thanks so much @SingL3!

@vwxyzjn vwxyzjn merged commit 1f3314f into huggingface:main Oct 23, 2023
kashif pushed a commit to kashif/trl that referenced this pull request Oct 27, 2023
lapp0 pushed a commit to lapp0/trl that referenced this pull request May 10, 2024
yxliu-TAMU pushed a commit to mincheolseong/ECEN743-GRPO-Project-Proposal that referenced this pull request Apr 20, 2025
