TRL supports Binary Classifier Optimization (BCO).
The BCO authors train a binary classifier whose logit serves as a reward so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0.
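The training objective can be sketched as a binary cross-entropy on the implicit reward, which is the scaled log-probability ratio between the policy and the reference model. The snippet below is a minimal illustration of this idea, not the trainer's internal implementation; in particular, `delta` stands in for the reward shift used in the paper, which the trainer tracks internally:

```python
import torch.nn.functional as F

def bco_style_loss(policy_logps, ref_logps, labels, beta=0.1, delta=0.0):
    # Implicit reward: scaled log-ratio of policy vs. reference likelihoods,
    # used as the logit of a binary classifier over completions.
    rewards = beta * (policy_logps - ref_logps)
    # Pushes rewards above delta for desirable completions (label 1) and
    # below delta for undesirable ones (label 0).
    return F.binary_cross_entropy_with_logits(rewards - delta, labels.float())
```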
For a full example, have a look at `examples/scripts/bco.py`.
The `BCOTrainer` requires an unpaired preference dataset.
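For reference, a minimal unpaired preference dataset in the standard format looks like this (the rows are illustrative):

```python
from datasets import Dataset

# Each row holds a single completion plus a boolean label marking it as
# desirable (True) or undesirable (False).
train_dataset = Dataset.from_dict({
    "prompt": ["What color is the sky?", "What color is the sky?"],
    "completion": ["It is blue.", "It is green."],
    "label": [True, False],
})
```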
The `BCOTrainer` supports both conversational and standard dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
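The same kind of example in the conversational format, where the chat template is applied for you:

```python
# A single conversational example; the trainer applies the chat template.
example = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "completion": [{"role": "assistant", "content": "It is blue."}],
    "label": True,
}
```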
The BCO trainer expects a model of `AutoModelForCausalLM`, compared to PPO, which expects `AutoModelForCausalLMWithValueHead` for the value function.
For a detailed example, have a look at the `examples/scripts/bco.py` script. At a high level, we need to initialize the `BCOTrainer` with the `model` we wish to train and a reference `ref_model`, which we will use to calculate the implicit rewards of the preferred and rejected responses. The `beta` hyperparameter controls the scale of the implicit reward, and the dataset contains the 3 entries listed above. Note that the `model` and `ref_model` need to have the same architecture (i.e., decoder-only or encoder-decoder).
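For example, the model, reference model, and tokenizer can be loaded as follows (the checkpoint name is illustrative; any causal LM works):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; substitute the model you wish to train.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
model_ref = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
```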
```python
from trl import BCOConfig, BCOTrainer

training_args = BCOConfig(
    beta=0.1,
)

bco_trainer = BCOTrainer(
    model,
    model_ref,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
```
After this one can then call:

```python
bco_trainer.train()
```
In practical scenarios, the thumbs-up and thumbs-down datasets are likely to have divergent underlying distributions of prompts.
Consider an LLM deployed for user feedback: if the model excels in writing tasks but underperforms in coding, the thumbs-up dataset will be dominated by writing-related prompts, while the thumbs-down dataset will contain mostly coding-related prompts.
If the prompts in your desired and undesired datasets differ a lot, it is useful to enable underlying distribution matching (UDM).
Choose an embedding model and tokenizer:
```python
from functools import partial

from accelerate import Accelerator
from transformers import AutoModel, AutoTokenizer

embedding_model = AutoModel.from_pretrained(your_model_id)
embedding_tokenizer = AutoTokenizer.from_pretrained(your_model_id)

# customize this function depending on your embedding model
def embed_prompt(input_ids, attention_mask, model):
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    return outputs.last_hidden_state.mean(dim=1)

embedding_model = Accelerator().prepare_model(embedding_model)
embedding_func = partial(embed_prompt, model=embedding_model)
```
Set `prompt_sample_size` to define how many prompts are selected to train the UDM classifier, and start the training with the provided embedding function:
```python
training_args = BCOConfig(
    beta=0.1,
    prompt_sample_size=512,
)

bco_trainer = BCOTrainer(
    model,
    model_ref,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
    embedding_func=embedding_func,
    embedding_tokenizer=embedding_tokenizer,
)

bco_trainer.train()
```
MoEs are the most efficient if the load is about equally distributed between experts. To ensure that we train MoEs similarly during preference-tuning, it is beneficial to add the auxiliary loss from the load balancer to the final loss. This option is enabled by setting `output_router_logits=True` in the model config (e.g. `MixtralConfig`). To scale how much the auxiliary loss contributes to the total loss, use the hyperparameter `router_aux_loss_coef=...` (default: `0.001`).
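For example, both options can be set when loading an MoE checkpoint (the model name below is illustrative):

```python
from transformers import AutoModelForCausalLM

# Illustrative Mixtral checkpoint; config values passed to from_pretrained
# override the defaults, enabling the load-balancing auxiliary loss.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    output_router_logits=True,
    router_aux_loss_coef=0.001,
)
```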
[[autodoc]] BCOTrainer
[[autodoc]] BCOConfig