TRL supports the Kahneman-Tversky Optimization (KTO) Trainer for aligning language models with binary feedback data (e.g., upvote/downvote), as described in the paper *KTO: Model Alignment as Prospect Theoretic Optimization* by Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela.
For a full example, have a look at `examples/scripts/kto.py`.
Depending on how good your base model is, you may or may not need to do SFT before KTO. This is different from standard RLHF and DPO, which always require SFT.
The KTO trainer expects a very specific format for the dataset as it does not require pairwise preferences. Since the model will be trained to directly optimize examples that consist of a prompt, model completion, and a label to indicate whether the completion is "good" or "bad", we expect a dataset with the following columns:
- `prompt`
- `completion`
- `label`
for example:
```python
kto_dataset_dict = {
    "prompt": [
        "Hey, hello",
        "How are you",
        "What is your name?",
        "What is your name?",
        "Which is the best programming language?",
        "Which is the best programming language?",
        "Which is the best programming language?",
    ],
    "completion": [
        "hi nice to meet you",
        "leave me alone",
        "I don't have a name",
        "My name is Mary",
        "Python",
        "C++",
        "Java",
    ],
    "label": [
        True,
        False,
        False,
        True,
        True,
        False,
        False,
    ],
}
```
where the `prompt` contains the context inputs, `completion` contains the corresponding responses, and `label` contains the flag that indicates whether the generated completion is desired (`True`) or undesired (`False`).
A prompt can have multiple responses and this is reflected in the entries being repeated in the dictionary's value arrays.
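As a minimal sketch, such a dictionary can be turned into a 🤗 Datasets object (assuming the `datasets` library is installed) and passed to the trainer as `train_dataset`:

```python
from datasets import Dataset

# Build a dataset with the "prompt", "completion", and "label" columns from the dict above.
train_dataset = Dataset.from_dict(kto_dataset_dict)
```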
The KTO trainer expects a model of type `AutoModelForCausalLM`, in contrast to PPO, which expects `AutoModelForCausalLMWithValueHead` for the value function.
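For instance, both the policy and the reference model can be loaded from the same pretrained checkpoint; the checkpoint name below is only an illustrative placeholder:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM checkpoint can be used here; "gpt2" is just a placeholder.
model = AutoModelForCausalLM.from_pretrained("gpt2")
ref_model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
```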
For a detailed example, have a look at the `examples/scripts/kto.py` script. At a high level, we need to initialize the `KTOTrainer` with a `model` we wish to train and a reference `ref_model` which we will use to calculate the implicit rewards of the preferred and rejected responses.
The `beta` refers to the hyperparameter of the implicit reward, and the dataset contains the 3 entries listed above. Note that the `model` and `ref_model` need to have the same architecture (i.e. decoder-only or encoder-decoder).
The `desirable_weight` and `undesirable_weight` refer to the weights placed on the losses for desirable/positive and undesirable/negative examples. By default, they are both 1.0. However, if you have more of one type than the other, you should upweight the less common type such that the ratio of (`desirable_weight` * number of desirable examples) to (`undesirable_weight` * number of undesirable examples) stays in the range 1:1 to 4:3.
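As an illustration of that rule (plain arithmetic, not a TRL API), here is one way to pick a weight for the less common type under hypothetical counts:

```python
# Hypothetical dataset counts: far fewer desirable than undesirable examples.
num_desirable = 100
num_undesirable = 1_000

undesirable_weight = 1.0
# Keep (desirable_weight * num_desirable) / (undesirable_weight * num_undesirable)
# within the recommended 1:1 to 4:3 range; here we aim for the midpoint.
target_ratio = (1.0 + 4.0 / 3.0) / 2.0
desirable_weight = target_ratio * undesirable_weight * num_undesirable / num_desirable
print(round(desirable_weight, 2))  # 11.67
```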
```python
training_args = KTOConfig(
    beta=0.1,
    desirable_weight=1.0,
    undesirable_weight=1.0,
)

kto_trainer = KTOTrainer(
    model,
    ref_model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
```
After this one can then call:
```python
kto_trainer.train()
```
Given the binary signal data indicating whether a completion is desirable or undesirable for a prompt, we can optimize an implicit reward function that aligns with the key principles of Kahneman-Tversky's prospect theory, such as reference dependence, loss aversion, and diminishing sensitivity.
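For reference, the objective from the KTO paper takes roughly the following form (notation slightly simplified), where `beta` plays the role of β and `desirable_weight`/`undesirable_weight` play the roles of λ_D/λ_U:

$$
\mathcal{L}_{\mathrm{KTO}}(\pi_\theta, \pi_{\mathrm{ref}}) = \mathbb{E}_{x, y \sim D}\left[\lambda_y - v(x, y)\right]
$$

with

$$
r_\theta(x, y) = \log\frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}, \qquad
z_0 = \mathrm{KL}\left(\pi_\theta(y' \mid x)\,\|\,\pi_{\mathrm{ref}}(y' \mid x)\right),
$$

$$
v(x, y) =
\begin{cases}
\lambda_D\,\sigma\big(\beta\,(r_\theta(x, y) - z_0)\big) & \text{if } y \text{ is desirable} \\
\lambda_U\,\sigma\big(\beta\,(z_0 - r_\theta(x, y))\big) & \text{if } y \text{ is undesirable}
\end{cases}
$$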
The BCO authors train a binary classifier whose logit serves as a reward, so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0. The `KTOTrainer` can be switched to this loss via the `loss_type="bco"` argument.
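As a sketch, and assuming `loss_type` is accepted by `KTOConfig` (check the `KTOConfig` reference below for the exact argument), switching to the BCO loss looks like:

```python
# Sketch: use the BCO loss instead of the default KTO loss.
training_args = KTOConfig(
    beta=0.1,
    loss_type="bco",  # assumes this field exists on KTOConfig; see the reference below
)
```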
[[autodoc]] KTOTrainer
[[autodoc]] KTOConfig