# KTO Trainer

TRL supports the Kahneman-Tversky Optimization (KTO) Trainer for aligning language models with binary feedback data (e.g., upvote/downvote), as described in the paper *KTO: Model Alignment as Prospect Theoretic Optimization* by Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. For a full example, have a look at `examples/scripts/kto.py`.

Depending on how good your base model is, you may or may not need to do SFT before KTO. This is different from standard RLHF and DPO, which always require SFT.

## Expected dataset format

The KTO trainer expects a very specific dataset format, since it does not require pairwise preferences. The model is trained to directly optimize examples that consist of a prompt, a model completion, and a label indicating whether the completion is "good" or "bad", so the dataset must contain the following columns:

- `prompt`
- `completion`
- `label`

for example:

```python
kto_dataset_dict = {
    "prompt": [
        "Hey, hello",
        "How are you",
        "What is your name?",
        "What is your name?",
        "Which is the best programming language?",
        "Which is the best programming language?",
        "Which is the best programming language?",
    ],
    "completion": [
        "hi nice to meet you",
        "leave me alone",
        "I don't have a name",
        "My name is Mary",
        "Python",
        "C++",
        "Java",
    ],
    "label": [
        True,
        False,
        False,
        True,
        True,
        False,
        False,
    ],
}
```

where `prompt` contains the context input, `completion` contains the corresponding model response, and `label` contains the flag indicating whether the completion is desired (`True`) or undesired (`False`). A prompt can have multiple responses, which is reflected in the repeated entries in the dictionary's value arrays.
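As a minimal sketch, such a dictionary can be turned into a dataset the trainer can consume using the `datasets` library (the name `train_dataset` is simply reused in the later examples):

```python
from datasets import Dataset

# Build a Hugging Face dataset with the three required columns:
# "prompt", "completion", and "label".
train_dataset = Dataset.from_dict(kto_dataset_dict)
```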

## Expected model format

The KTO trainer expects a model of type `AutoModelForCausalLM`, in contrast to PPO, which expects an `AutoModelForCausalLMWithValueHead` for the value function.
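For instance, a minimal sketch of loading the policy and reference models (the checkpoint name `"gpt2"` is only a placeholder; substitute your own, optionally SFT-ed, model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; replace with the model you want to align.
model = AutoModelForCausalLM.from_pretrained("gpt2")
ref_model = AutoModelForCausalLM.from_pretrained("gpt2")  # frozen reference copy
tokenizer = AutoTokenizer.from_pretrained("gpt2")
```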

## Using the `KTOTrainer`

For a detailed example, have a look at the `examples/scripts/kto.py` script. At a high level, we need to initialize the `KTOTrainer` with a `model` we wish to train and a reference model (`ref_model`), which is used to calculate the implicit rewards of the desirable and undesirable completions.

`beta` refers to the hyperparameter that scales the implicit reward, and the dataset contains the three columns listed above. Note that `model` and `ref_model` need to have the same architecture (i.e., decoder-only or encoder-decoder).

The `desirable_weight` and `undesirable_weight` refer to the weights placed on the losses for desirable (positive) and undesirable (negative) examples. By default, both are 1.0. However, if you have more of one type than the other, you should upweight the less common type so that the ratio of (`desirable_weight` × number of positives) to (`undesirable_weight` × number of negatives) is in the range 1:1 to 4:3.
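As a rough sketch of that rule of thumb (the counts below are made up purely for illustration):

```python
# Hypothetical class counts: desirable examples are the rarer type here.
num_desirable = 1_000
num_undesirable = 3_000

# With both weights at 1.0 the ratio would be 1000/3000 = 1:3, which is outside
# the recommended 1:1 to 4:3 range, so we upweight the rarer (desirable) class.
desirable_weight = 3.0
undesirable_weight = 1.0
ratio = (desirable_weight * num_desirable) / (undesirable_weight * num_undesirable)
print(ratio)  # 1.0, i.e. within the 1:1 to 4:3 range
```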

```python
training_args = KTOConfig(
    beta=0.1,
    desirable_weight=1.0,
    undesirable_weight=1.0,
)

kto_trainer = KTOTrainer(
    model,
    ref_model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
```

After this, one can then call:

```python
kto_trainer.train()
```

## Loss Functions

Given the binary signal data indicating whether a completion is desirable or undesirable for a prompt, we can optimize an implicit reward function that aligns with the key principles of Kahneman and Tversky's prospect theory, such as reference dependence, loss aversion, and diminishing sensitivity.
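For orientation only, a simplified sketch of the KTO objective as presented in the paper (see the paper for the exact batch-level estimation of the reference point \\(z_0\\)); here \\(\lambda_D\\) and \\(\lambda_U\\) correspond to `desirable_weight` and `undesirable_weight`, and \\(\beta\\) to `beta`:

$$
r_\theta(x, y) = \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}, \qquad
v(x, y) =
\begin{cases}
\lambda_D \, \sigma\big(\beta \, (r_\theta(x, y) - z_0)\big) & \text{if } y \text{ is desirable} \\
\lambda_U \, \sigma\big(\beta \, (z_0 - r_\theta(x, y))\big) & \text{if } y \text{ is undesirable}
\end{cases}
$$

$$
\mathcal{L}_{\mathrm{KTO}}(\pi_\theta, \pi_{\mathrm{ref}}) = \mathbb{E}_{x, y \sim D}\big[\lambda_y - v(x, y)\big]
$$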

The BCO (Binary Classifier Optimization) authors train a binary classifier whose logit serves as a reward, so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0. The `KTOTrainer` can be switched to this loss via the `loss_type="bco"` argument.
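For example, assuming the `loss_type` argument mentioned above and reusing the earlier configuration values:

```python
# Same setup as before, but with the BCO loss instead of the default KTO loss.
training_args = KTOConfig(
    beta=0.1,
    desirable_weight=1.0,
    undesirable_weight=1.0,
    loss_type="bco",
)
```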

## KTOTrainer

[[autodoc]] KTOTrainer

## KTOConfig

[[autodoc]] KTOConfig