Add score scaling/normalization/clipping by zfang · Pull Request #560 · huggingface/trl

zfang · 2023-07-24T05:26:56Z

Summary

Add score (aka reward) scaling/normalization/clipping to improve PPO training stability based on Section 5.3.1 of Secrets of RLHF in Large Language Models Part I: PPO and https://github.com/OpenLMLab/MOSS-RLHF:

Tests

The following is tested on a Google Colab notebook with an Nvidia T4 GPU. My notebook gets disconnected by itself after a few hours while I was getting 1 iteration per minute, so my runs crashed fairly early.

`sentiment-tuning.py`

Command for baseline:

python examples/scripts/sentiment_tuning.py --log_with wandb

Command for score scaling/normalization/clipping:

python examples/scripts/sentiment_tuning.py --log_with wandb --use_score_scaling --use_score_norm --score_clip 0.5

Screenshots of wandb:

`multi_adapter_rl_v2.py`

Command for baseline:

python examples/scripts/multi_adapter_rl_v2.py --model_name ../llama-7b --log_with wandb --use_safetensors

Command for score scaling/normalization/clipping:

python examples/scripts/multi_adapter_rl_v2.py --model_name ../llama-7b --log_with wandb --use_safetensors --use_score_scaling --use_score_norm --score_clip 0.5

Screenshots of wandb:

younesbelkada

Thanks a lot for working on this and adding this new nice feature.
I am ok with this PR in principle that it is backward compatible with the existing setup and works in distributed setting as well (from the code of RunningMoments & get_global_statistics)
Would love to hear from @lvwerra & @vwxyzjn to hear their thoughts on this
Can you also run the styling checks?

make precommit

Thanks!

HuggingFaceDocBuilderDev · 2023-07-24T11:51:19Z

The documentation is not available anymore as the PR was closed or merged.

younesbelkada

Can you also add few lines in the documentation explaining this feature to users ? The details could go in a dedicated section here: https://github.com/lvwerra/trl/blob/main/docs/source/customization.mdx
Also can you share the behaviour of the env/rewards_mean and env/rewards_std ?
Thanks!

younesbelkada · 2023-07-24T11:55:21Z

        default=1, metadata={"help": "the number of gradient accumulation steps"}
    )
    early_stopping: Optional[bool] = field(default=False, metadata={"help": "whether to early stop"})
-    target_kl: Optional[float] = field(default=6, metadata={"help": "kl target for early stopping"})


This field seems to have been removed by mistake?

Hi Younes,

You will find that target_kl already exists on L57 with a much smaller value.

I dug deeper and found that PPOConfig has two configs target and target_kl, where target has a default value of 6. So I assume the first duplicate target_kl config here was meant to be target. However, target is NOT used to populate PPOConfig at L64, so I just removed it.

Regards,

Felix

great point, thank you !

I think this is actually a bug from here: 1620da3
we overloaded the target_kl term - we should rename it!

cc @edbeeching

@lvwerra as much as I love introducing bugs into trl. I think this time it was @younesbelkada , in the Big refactor of examples and documentation (#509). Here

I agree to rename to early_stop_kl, or something

zfang · 2023-07-24T18:42:55Z

Can you also add few lines in the documentation explaining this feature to users ? The details could go in a dedicated section here: https://github.com/lvwerra/trl/blob/main/docs/source/customization.mdx Also can you share the behaviour of the env/rewards_mean and env/rewards_std ? Thanks!

The rewards here are actually independent of score scaling/normalization/clipping because they are logged independently:

...
    pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
    rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]

    # Run PPO step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)

...

    def log_stats(
        self,
        stats: dict,
        batch: dict,
        rewards: List[torch.FloatTensor],
    ):
            ...

            logs["env/reward_mean"] = torch.mean(rewards).cpu().numpy().item()
            logs["env/reward_std"] = torch.std(rewards).cpu().numpy().item()
            logs["env/reward_dist"] = rewards.cpu().numpy()

            ...

younesbelkada

Thanks a lot for this great work, this looks very nice on my side, let's see what others will say !

lvwerra

Very clean PR, thanks! left a few questions :)

lvwerra · 2023-07-25T17:04:29Z

    target_kl: Optional[float] = field(default=0.1, metadata={"help": "kl target for early stopping"})
    seed: Optional[int] = field(default=0, metadata={"help": "the random seed"})
+    use_score_scaling: Optional[bool] = field(default=False, metadata={"help": "Use score scaling"})
+    use_score_norm: Optional[bool] = field(default=False, metadata={"help": "Use score normalization"})


Maybe we should clarify that this only works if use_score_scaling is also True otherwise it's actually ignored. we change the logic a bit in general

lvwerra · 2023-07-26T06:12:06Z

The rewards here are actually independent of score scaling/normalization/clipping because they are logged independently:

Is that really true? inside step we only log scores which are normalized with this PR. also if it weren't true then it looks like we have a strong performance degradation.

zfang · 2023-07-26T16:19:16Z

The rewards here are actually independent of score scaling/normalization/clipping because they are logged independently:

Is that really true? inside step we only log scores which are normalized with this PR. also if it weren't true then it looks like we have a strong performance degradation.

Hi @lvwerra,

Could you elaborate on the performance degradation?

In sentiment-tuning.py (and similarly multi_adapter_rl_v2.py), we have

for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    query_tensors = batch["input_ids"]

    # Get response from gpt2
    response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, **generation_kwargs)
    batch["response"] = tokenizer.batch_decode(response_tensors)

    # Compute sentiment score
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
    rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]

    # Run PPO step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)

env/reward_mean and env/reward_std are logged inside of ppo_trainer.log_stats, which is based on the raw rewards from sentiment_pipe. I think the reason we observe different reward curves is randomness. Even in ScriptArguments we provide the seed config, it is only used in PPOConfig but not LengthSampler which can impact the input. I do observe near identical reward curves with multi_adapter_rl_v2.py

ppo/mean_scores and ppo/std_scores are the per-batch (normalized) score stats that are logged inside of ppo_trainer.step. From the wandb screenshots you can see that the scores have a mean close to zero with a 0.5 std (clipping value).

It's not obvious to me whether score scaling/normalization/clipping improves or degrades the performance. It's meant to improve training stability but I guess I haven't run the training long enough to observe possible divergences (well Google Colab would crash on me). In general I observe more smooth curves.

I do observe better loss curves on the value head/function and assume that that can be attributed to the more stable and smooth reward scores. In sentiment-tuning.py I also observe better KL divergence and thus better non-score rewards assume that this is because normalized score rewards make them less dominant over the non-score rewards. In my opinion this makes it easier to configure the KL coefficient because we know what the range of score rewards are to expect.

Regardless, the configs are optional and are backward compatible.

Regards,

Felix

lvwerra · 2023-08-08T12:07:44Z

Hi @zfang I was mainly referring to this plot that you shared:

It appears that the rewards are considerably lower and was wondering if that's due to scaling. Curious to hear your thoughts.

vwxyzjn · 2023-08-08T17:48:08Z

Hey @zfang thanks for the PR! Sometimes random seeds could impact the results a lot. E.g., #462 (comment) Could you run the experiment for 10 random seeds?

zfang · 2023-08-08T19:25:36Z

Hey @zfang thanks for the PR! Sometimes random seeds could impact the results a lot. E.g., #462 (comment) Could you run the experiment for 10 random seeds?

Actually I just made a change in sentiment_tuning.py to move set_seed before calling build_dataset. Re-running sentiment-tuning.py now.

zfang · 2023-08-08T21:50:45Z

Hi @zfang I was mainly referring to this plot that you shared: It appears that the rewards are considerably lower and was wondering if that's due to scaling. Curious to hear your thoughts.

Update: I do observe consistent patterns of difference in env/reward_std and env/reward_mean with sentiment_tuning.py, but not multi_adapter_rl_v2.py

zfang · 2023-08-08T23:23:56Z

Hi @zfang I was mainly referring to this plot that you shared: It appears that the rewards are considerably lower and was wondering if that's due to scaling. Curious to hear your thoughts.

Hi @lvwerra and @vwxyzjn,

After some investigations, I have the root cause.

Based on the following code snippet

for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    query_tensors = batch["input_ids"]

    # Get response from gpt2
    response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, **generation_kwargs)
    batch["response"] = tokenizer.batch_decode(response_tensors)

    # Compute sentiment score
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
    rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]

    # Run PPO step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)

We have the dependencies of rewards <- pipe_outputs <- texts <- batch["response"] <- response_tensors <- ppo_trainer. Because ppo_trainer.step updates ppo_trainer differently between baseline and score normalization, we start to get different batch["response"] and thus different rewards.

In other words, because gpt2-imdb generates slightly different responses (perhaps less overly "positive" at the expense of higher KL loss) after PPO with score normalization, we start to also see different sentiment scores from distilbert-imdb.

On a high level I think that makes sense: we normalize the sentiment scores so it's less dominant over the KL loss, and thus we observe that with score normalization the model is less eager to optimize for sentiment scores in comparison to the KL loss. This can be adjusted by using a smaller init_kl_coef.

Hopefully that makes sense to you.

Regards,

Felix

lvwerra · 2023-08-10T08:30:51Z

Ok, makes sense - since it's optional it's not directly a regression and we can merge.

* Add reward/score scaling/normalization/clipping * Run pre-commit to fix styles and remove some dupe code * Make sure score module and pretrained_model have the same dtype * Add multi_adapter_rl_v2.py * Add log_with * Add more verbose help message for use_score_norm * Fix score clipping for float16 * Minor fix

Add reward/score scaling/normalization/clipping

2348147

younesbelkada reviewed Jul 24, 2023

View reviewed changes

Run pre-commit to fix styles and remove some dupe code

2c573ca

younesbelkada approved these changes Jul 24, 2023

View reviewed changes

younesbelkada requested review from lvwerra and vwxyzjn July 24, 2023 20:26

zfang added 3 commits July 25, 2023 09:46

Make sure score module and pretrained_model have the same dtype

f40fec0

Add multi_adapter_rl_v2.py

6a79ca8

Add log_with

6ea53de

lvwerra reviewed Jul 25, 2023

View reviewed changes

zfang added 2 commits July 25, 2023 10:42

Add more verbose help message for use_score_norm

665aaaf

Fix score clipping for float16

ee8213a

Minor fix

414f2f8

Merge branch 'main' into main

d391451

lvwerra approved these changes Aug 10, 2023

View reviewed changes

lvwerra merged commit 3b2c820 into huggingface:main Aug 10, 2023

Conversation

zfang commented Jul 24, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Tests

sentiment-tuning.py

Command for baseline:

Screenshots of wandb:

multi_adapter_rl_v2.py

Screenshots of wandb:

Uh oh!

younesbelkada left a comment

Choose a reason for hiding this comment

Uh oh!

HuggingFaceDocBuilderDev commented Jul 24, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

younesbelkada left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zfang commented Jul 24, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

younesbelkada left a comment

Choose a reason for hiding this comment

Uh oh!

lvwerra left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lvwerra commented Jul 26, 2023

Uh oh!

zfang commented Jul 26, 2023

Uh oh!

lvwerra commented Aug 8, 2023

Uh oh!

vwxyzjn commented Aug 8, 2023

Uh oh!

zfang commented Aug 8, 2023

Uh oh!

zfang commented Aug 8, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zfang commented Aug 8, 2023

Uh oh!

lvwerra commented Aug 10, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

zfang commented Jul 24, 2023 •

edited

Loading

`sentiment-tuning.py`

`multi_adapter_rl_v2.py`

HuggingFaceDocBuilderDev commented Jul 24, 2023 •

edited

Loading

zfang commented Jul 24, 2023 •

edited

Loading

zfang commented Aug 8, 2023 •

edited

Loading