Reward Modeling

TRL supports custom reward modeling, so anyone can train a reward model on their own dataset and model.

A complete, flexible example is available at examples/scripts/reward_modeling.py.

Expected dataset format

The [RewardTrainer] expects a very specific format for the dataset, because the model is trained on pairs of examples to predict which of the two is preferred. The Anthropic/hh-rlhf dataset is one example of such data: each row holds a chosen and a rejected text.

Therefore the final dataset object should contain at least the following 4 entries if you use the default [RewardDataCollatorWithPadding] data collator. The entries should be named:

input_ids_chosen
attention_mask_chosen
input_ids_rejected
attention_mask_rejected
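
As a rough sketch of how these four columns might be produced, the function below tokenizes the chosen and rejected texts of a batch separately. It assumes the raw dataset has `chosen` and `rejected` text columns (as Anthropic/hh-rlhf does). The names `preprocess_function` and `ToyTokenizer` are hypothetical; the toy tokenizer is only a stand-in so the snippet runs without downloading a model.

```python
def preprocess_function(examples, tokenizer):
    """Tokenize each chosen/rejected pair into the four expected columns.

    `examples` is a batch: a dict mapping column names to lists of strings.
    `tokenizer` is any callable that returns a dict with `input_ids` and
    `attention_mask`, which is how Hugging Face tokenizers behave.
    """
    new_examples = {
        "input_ids_chosen": [],
        "attention_mask_chosen": [],
        "input_ids_rejected": [],
        "attention_mask_rejected": [],
    }
    for chosen, rejected in zip(examples["chosen"], examples["rejected"]):
        tokenized_chosen = tokenizer(chosen)
        tokenized_rejected = tokenizer(rejected)
        new_examples["input_ids_chosen"].append(tokenized_chosen["input_ids"])
        new_examples["attention_mask_chosen"].append(tokenized_chosen["attention_mask"])
        new_examples["input_ids_rejected"].append(tokenized_rejected["input_ids"])
        new_examples["attention_mask_rejected"].append(tokenized_rejected["attention_mask"])
    return new_examples


class ToyTokenizer:
    """Hypothetical whitespace tokenizer, used for illustration only."""

    def __init__(self):
        self.vocab = {}

    def __call__(self, text):
        # Assign each unseen token the next free integer id.
        ids = [self.vocab.setdefault(tok, len(self.vocab)) for tok in text.split()]
        return {"input_ids": ids, "attention_mask": [1] * len(ids)}


# A one-row batch in the assumed chosen/rejected layout.
batch = {
    "chosen": ["Human: hi Assistant: hello there"],
    "rejected": ["Human: hi Assistant: go away"],
}
features = preprocess_function(batch, ToyTokenizer())
```

With a real dataset, you would typically load a tokenizer with `AutoTokenizer.from_pretrained` and apply the function through `dataset.map(preprocess_function, batched=True, fn_kwargs={"tokenizer": tokenizer})`.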

Using the `RewardTrainer`

After preparing your dataset, you can use the [RewardTrainer] in the same way as the Trainer class from 🤗 Transformers. You should pass an AutoModelForSequenceClassification model to the [RewardTrainer], along with a [RewardConfig] which configures the hyperparameters of the training.

Leveraging 🤗 PEFT to train a reward model

Just pass a peft_config in the keyword arguments of [RewardTrainer], and the trainer should automatically take care of converting the model into a PEFT model!

from peft import LoraConfig, TaskType
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig

model = AutoModelForSequenceClassification.from_pretrained("gpt2")
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)

...

trainer = RewardTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,
)

trainer.train()

Adding a margin to the loss

As in the Llama 2 paper, you can add a margin to the loss by adding a margin column to the dataset. The reward collator will automatically pass it through and the loss will be computed accordingly.

def add_margin(row):
    # Assumes the dataset has score_chosen and score_rejected columns from which the margin is computed
    return {'margin': row['score_chosen'] - row['score_rejected']}

dataset = dataset.map(add_margin)
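
To make the effect of the margin concrete, here is a minimal pure-Python sketch of a pairwise loss with a margin term, in the form used in the Llama 2 paper: -log σ(r_chosen − r_rejected − margin). The function name `pairwise_loss` is hypothetical, and this is not TRL's own implementation. It only illustrates that a larger margin requires a larger reward gap to reach the same loss.

```python
import math


def pairwise_loss(reward_chosen, reward_rejected, margin=0.0):
    """Return -log(sigmoid(reward_chosen - reward_rejected - margin))."""
    x = reward_chosen - reward_rejected - margin
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch on the sign of x
    # so that exp() is never called with a large positive argument.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Same rewards, with and without a margin of 1.0.
loss_no_margin = pairwise_loss(2.0, 1.0)
loss_with_margin = pairwise_loss(2.0, 1.0, margin=1.0)
```

With a margin of 1.0, the reward gap of 1.0 is fully cancelled, so the loss rises to log 2, the value obtained when the model cannot tell the two responses apart.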

RewardConfig

[[autodoc]] RewardConfig

RewardTrainer

[[autodoc]] RewardTrainer
