TRL supports custom reward modeling for anyone to perform reward modeling on their dataset and model.
Check out a complete flexible example at examples/scripts/reward_modeling.py
.
The [RewardTrainer
] expects a very specific format for the dataset since the model will be trained on pairs of examples to predict which of the two is preferred. We provide an example from the Anthropic/hh-rlhf
dataset below:
Therefore the final dataset object should contain two 4 entries at least if you use the default [RewardDataCollatorWithPadding
] data collator. The entries should be named:
input_ids_chosen
attention_mask_chosen
input_ids_rejected
attention_mask_rejected
After preparing your dataset, you can use the [RewardTrainer
] in the same way as the Trainer
class from 🤗 Transformers.
You should pass an AutoModelForSequenceClassification
model to the [RewardTrainer
], along with a [RewardConfig
] which configures the hyperparameters of the training.
Just pass a peft_config
in the keyword arguments of [RewardTrainer
], and the trainer should automatically take care of converting the model into a PEFT model!
from peft import LoraConfig, TaskType
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig
model = AutoModelForSequenceClassification.from_pretrained("gpt2")
peft_config = LoraConfig(
task_type=TaskType.SEQ_CLS,
inference_mode=False,
r=8,
lora_alpha=32,
lora_dropout=0.1,
)
...
trainer = RewardTrainer(
model=model,
args=training_args,
tokenizer=tokenizer,
train_dataset=dataset,
peft_config=peft_config,
)
trainer.train()
As in the Llama 2 paper, you can add a margin to the loss by adding a margin
column to the dataset. The reward collator will automatically pass it through and the loss will be computed accordingly.
def add_margin(row):
# Assume you have a score_chosen and score_rejected columns that you want to use to compute the margin
return {'margin': row['score_chosen'] - row['score_rejected']}
dataset = dataset.map(add_margin)
[[autodoc]] RewardConfig
[[autodoc]] RewardTrainer