Supervised fine-tuning (or SFT for short) is a crucial step in RLHF. In TRL we provide an easy-to-use API to create your SFT models and train them with a few lines of code on your dataset.

Check out a complete flexible example in the `examples/scripts` folder.
If you have a dataset hosted on the 🤗 Hub, you can easily fine-tune your SFT model using [`SFTTrainer`] from TRL. Let us assume your dataset is `imdb`, the text you want to predict is inside the `text` field of the dataset, and you want to fine-tune the `facebook/opt-350m` model.

The following code snippet takes care of all the data pre-processing and training for you:
```python
from datasets import load_dataset
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")

trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
)
trainer.train()
```
Make sure to pass a correct value for `max_seq_length`, as the default value will be set to `min(tokenizer.model_max_length, 1024)`.
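For instance, a quick way to check what that default resolves to for your tokenizer (a minimal sketch that simply mirrors the rule above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
# Mirror the default rule above to see what you would get
# without passing max_seq_length yourself.
print(min(tokenizer.model_max_length, 1024))
```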
You can also construct a model outside of the trainer and pass it as follows:
```python
from transformers import AutoModelForCausalLM
from datasets import load_dataset
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
)
trainer.train()
```
The above snippets will use the default training arguments from the `transformers.TrainingArguments` class. If you want to modify them, make sure to create your own `TrainingArguments` object and pass it to the [`SFTTrainer`] constructor, as done in the `supervised_finetuning.py` script of the stack-llama example.
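For example, a minimal sketch of overriding a few training arguments (the hyperparameter values here are purely illustrative, and the output directory is hypothetical):

```python
from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="./sft-opt-350m",       # hypothetical output directory
    per_device_train_batch_size=4,
    learning_rate=1e-5,
    num_train_epochs=1,
    logging_steps=10,
)

trainer = SFTTrainer(
    "facebook/opt-350m",
    args=training_args,
    train_dataset=dataset,
    dataset_text_field="text",
)
trainer.train()
```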
You can use the [`DataCollatorForCompletionOnlyLM`] to train your model on the completions only. Note that this works only in the case when `packing=False`.

To instantiate that collator, pass a response template and the tokenizer. Here is an example of how it would work to fine-tune `opt-350m` on completions only on the CodeAlpaca dataset:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

dataset = load_dataset("lucasmccabe-lmi/CodeAlpaca-20k", split="train")

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['instruction'])):
        text = f"### Question: {example['instruction'][i]}\n ### Answer: {example['output'][i]}"
        output_texts.append(text)
    return output_texts

response_template = " ### Answer:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    formatting_func=formatting_prompts_func,
    data_collator=collator,
)
trainer.train()
```
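To build intuition for what the collator does, here is a minimal sketch (not TRL's internal code): every label before the response template is set to `-100`, so the loss is computed only on the answer tokens.

```python
example = "### Question: What does len() do?\n ### Answer: It returns the length of an object."
batch = collator([tokenizer(example)])
# Prompt positions are masked with -100; only the answer tokens keep real label ids.
print(batch["labels"][0])
```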
For instruction fine-tuning, it is quite common to have two columns inside the dataset: one for the prompt and the other for the response. This allows people to format examples like Stanford-Alpaca did as follows:

```
Below is an instruction ...

### Instruction
{prompt}

### Response:
{completion}
```
Let us assume your dataset has two fields, `question` and `answer`. Therefore you can just run:
```python
...

def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['question'])):
        text = f"### Question: {example['question'][i]}\n ### Answer: {example['answer'][i]}"
        output_texts.append(text)
    return output_texts

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    formatting_func=formatting_prompts_func,
)
trainer.train()
```
To properly format your input, make sure to process all the examples by looping over them and returning a list of processed texts. Check out a full example of how to use the [`SFTTrainer`] on the alpaca dataset here.
[`SFTTrainer`] supports example packing, where multiple short examples are packed in the same input sequence to increase training efficiency. This is done with the [`ConstantLengthDataset`] utility class that returns constant-length chunks of tokens from a stream of examples. To enable the usage of this dataset class, simply pass `packing=True` to the [`SFTTrainer`] constructor.
```python
...

trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,
    dataset_text_field="text",
    packing=True,
)
trainer.train()
```
Note that if you use a packed dataset and pass `max_steps` in the training arguments, you will probably train your model for more than a few epochs, depending on how you have configured the packed dataset and the training protocol. Double-check that you know and understand what you are doing.
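As a rough sanity check before setting `max_steps`, you can estimate how many packed sequences one pass over your data produces. This back-of-the-envelope sketch assumes the [`ConstantLengthDataset`] defaults of `seq_length=1024` and `chars_per_token=3.6`:

```python
total_chars = sum(len(t) for t in dataset["text"])
est_tokens = total_chars / 3.6      # default chars-per-token heuristic
est_sequences = est_tokens / 1024   # default packed sequence length
print(f"~{est_sequences:.0f} packed sequences per epoch")
```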
If your dataset has several fields that you want to combine, for example if the dataset has `question` and `answer` fields and you want to combine them, you can pass a formatting function to the trainer that will take care of that. For example:
```python
def formatting_func(example):
    text = f"### Question: {example['question']}\n ### Answer: {example['answer']}"
    return text

trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,
    packing=True,
    formatting_func=formatting_func,
)
trainer.train()
```
You can also further customize the [`ConstantLengthDataset`] by directly passing its arguments to the [`SFTTrainer`] constructor. Please refer to that class's signature for more information.
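For instance, a sketch of forwarding two of those options through the trainer, assuming they are exposed as keyword arguments on the constructor as described above (the values shown are the class defaults):

```python
trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,
    dataset_text_field="text",
    packing=True,
    num_of_sequences=1024,   # how many sequences to buffer per packed chunk
    chars_per_token=3.6,     # character-to-token ratio used to size the buffer
)
```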
You can directly pass the kwargs of the `from_pretrained()` method to the [`SFTTrainer`]. For example, if you want to load a model in a different precision, analogous to

```python
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=torch.bfloat16)
```

you can do:
```python
...

trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,
    dataset_text_field="text",
    torch_dtype=torch.bfloat16,
)
trainer.train()
```
Note that all keyword arguments of `from_pretrained()` are supported.
We also support a tight integration with the 🤗 PEFT library, so that any user can conveniently train adapters and share them on the Hub instead of training the entire model.
```python
from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig

dataset = load_dataset("imdb", split="train")

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    "EleutherAI/gpt-neo-125m",
    train_dataset=dataset,
    dataset_text_field="text",
    peft_config=peft_config,
)
trainer.train()
```
Note that when training adapters, we manually add a saving callback to automatically save only the adapters:
```python
import os

from transformers import TrainerCallback

class PeftSavingCallback(TrainerCallback):
    def on_save(self, args, state, control, **kwargs):
        checkpoint_path = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        kwargs["model"].save_pretrained(checkpoint_path)

        # Remove the full model weights so that only the adapter weights are kept.
        if "pytorch_model.bin" in os.listdir(checkpoint_path):
            os.remove(os.path.join(checkpoint_path, "pytorch_model.bin"))
```
If you want to add more callbacks, make sure to add this one as well to properly save the adapters only during training.
```python
...

callbacks = [YourCustomCallback(), PeftSavingCallback()]

trainer = SFTTrainer(
    "EleutherAI/gpt-neo-125m",
    train_dataset=dataset,
    dataset_text_field="text",
    torch_dtype=torch.bfloat16,
    peft_config=peft_config,
    callbacks=callbacks,
)
trainer.train()
```
To train adapters on a base model loaded in 8bit, you need to first load your 8bit model outside the trainer and pass a `PeftConfig` to the trainer. For example:
```python
...

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-125m",
    load_in_8bit=True,
    device_map="auto",
)

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    dataset_text_field="text",
    peft_config=peft_config,
)
trainer.train()
```

Note that since the model is created outside the trainer here, no `from_pretrained()` keyword arguments (such as `torch_dtype`) should be passed to the [`SFTTrainer`], in line with the best practices below.
Pay attention to the following best practices when training a model with that trainer:

- [`SFTTrainer`] always pads the sequences by default to the `max_seq_length` argument of the [`SFTTrainer`]. If none is passed, the trainer will retrieve that value from the tokenizer. Some tokenizers do not provide a default value, so there is a check to retrieve the minimum between 1024 and that value. Make sure to check it before training.
- For training adapters in 8bit, you might need to tweak the arguments of the `prepare_model_for_int8_training` method from PEFT, hence we advise users to use the `prepare_in_int8_kwargs` field, or create the `PeftModel` outside the [`SFTTrainer`] and pass it, as in the sketch after this list.
- For a more memory-efficient training using adapters, you can load the base model in 8bit; for that, simply add the `load_in_8bit` argument when creating the [`SFTTrainer`], or create a base model in 8bit outside the trainer and pass it.
- If you create a model outside the trainer, make sure not to pass to the trainer any additional keyword arguments that are relative to the `from_pretrained()` method.
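For the second point above, here is a sketch of creating the `PeftModel` outside the trainer so that you control the int8 preparation step yourself (using the standard `peft` helpers; adapt the kwargs to your setup):

```python
from peft import get_peft_model, prepare_model_for_int8_training
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-125m",
    load_in_8bit=True,
    device_map="auto",
)
# Tweak the arguments of prepare_model_for_int8_training here as needed.
model = prepare_model_for_int8_training(model)
model = get_peft_model(model, peft_config)

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    dataset_text_field="text",
)
trainer.train()
```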
[[autodoc]] SFTTrainer
[[autodoc]] trainer.ConstantLengthDataset