This repository contains the code and released models for our paper SimPO: Simple Preference Optimization with a Reference-Free Reward. We propose SimPO, a simpler and more effective preference optimization algorithm than DPO (Direct Preference Optimization) that does not require a reference model. SimPO outperforms DPO and its latest variants across the AlpacaEval 2, MT-Bench, and Arena-Hard benchmarks under various settings.
Given the various inquiries about SimPO, we provide a list of tips to help you reproduce our paper results and achieve better outcomes for running SimPO on your own tasks.
Hyperparameter tuning is crucial for SimPO. The three main hyperparameters to focus on are `learning_rate`, `beta`, and `gamma`.
- `learning_rate`: The learning rate is the most critical hyperparameter for preference optimization. A large learning rate (e.g., 1e-5) can significantly degrade performance, causing the model to produce incoherent sentences or completely repetitive responses. We recommend grid searching over 3e-7, 5e-7, and 1e-6, if resources allow.
- `beta`: Beta controls the reward scaling between winning and losing responses. In our preprint, we used a small beta (e.g., 2.0 or 2.5), but researchers from Meta suggest that a larger beta (e.g., 10) could yield better results.
- `gamma`: Gamma controls the target reward margin. We suggest tuning gamma in tandem with beta as gamma = c * beta, grid searching c over 0.25, 0.3, and 0.4. A well-tuned gamma can provide a modest improvement, but it is not as critical as the other hyperparameters.
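To make the roles of `beta` and `gamma` concrete, here is a minimal sketch of the SimPO objective (an illustrative rewrite, not the repo's actual implementation; the function name and tensor shapes are our own). `beta` scales the length-normalized log-probability margin between the winning and losing responses, and `gamma` is the target margin subtracted from it:

```python
import torch.nn.functional as F

def simpo_loss(avg_chosen_logps, avg_rejected_logps, beta, gamma):
    """Reference-free SimPO loss (sketch).

    avg_chosen_logps / avg_rejected_logps: per-example log-probabilities of the
    winning / losing responses, averaged over response length (length normalization).
    """
    # beta scales the implicit reward margin; gamma is the target reward margin.
    margin = beta * (avg_chosen_logps - avg_rejected_logps) - gamma
    return -F.logsigmoid(margin).mean()
```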
We used the following hyperparameters for training the released models.
Setting | β | γ | Learning rate |
---|---|---|---|
Mistral-Base | 2.0 | 1.6 | 3e-7 |
Mistral-Instruct | 2.5 | 0.3 | 5e-7 |
Llama3-Base | 2.0 | 1.0 | 6e-7 |
Llama3-Instruct | 2.5 | 1.4 | 1e-6 |
Our released Llama3 models use the initial version of the Llama3 tokenizer (prior to this PR). We have found that the updated Llama3 tokenizer with vLLM occasionally introduces two BOS tokens, which can affect evaluation results. Therefore, please ensure that only one BOS token is included in the prompt after applying the Llama3 chat template during any evaluation.
Notably, if you are training Llama3 and evaluating the trained models on AlpacaEval 2 and Arena-Hard using the templates provided in this repo, please make sure to use the pre-update Llama3 tokenizer (i.e., the one before the PR).
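As a quick sanity check (a hypothetical snippet using the standard Hugging Face tokenizer API; the model ID is just an example), you can verify that the templated prompt contains exactly one BOS token:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/Llama-3-Instruct-8B-SimPO")
messages = [{"role": "user", "content": "Hello!"}]

# The chat template already prepends BOS, so tokenize without special tokens
# to avoid inserting a second one.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
ids = tokenizer(prompt, add_special_tokens=False).input_ids
num_bos = ids.count(tokenizer.bos_token_id)
assert num_bos == 1, f"Expected exactly one BOS token, found {num_bos}"
```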
We have observed that, in some cases, adding an additional SFT loss can help improve results. These findings have been initially validated in the CPO_SIMPO repository. We are currently working on integrating this improvement into our main repository.
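As a rough sketch of that idea (our assumption of how the auxiliary term combines, not the finalized implementation; `sft_weight` is a hypothetical coefficient), the SFT term is simply the NLL of the winning responses added to the SimPO loss:

```python
import torch.nn.functional as F

def simpo_loss_with_sft(avg_chosen_logps, avg_rejected_logps, chosen_nll,
                        beta, gamma, sft_weight=0.1):
    """SimPO loss plus an auxiliary SFT (NLL) term on the winning responses (sketch)."""
    margin = beta * (avg_chosen_logps - avg_rejected_logps) - gamma
    simpo = -F.logsigmoid(margin).mean()
    # Hypothetical auxiliary term: weighted NLL of the chosen responses.
    return simpo + sft_weight * chosen_nll.mean()
```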
Below is the complete list of models evaluated in our preprint. AE2 LC and AE2 WR denote the AlpacaEval 2 length-controlled and raw win rates, and AH denotes the Arena-Hard win rate (all in %).
Models | Checkpoint | AE2 LC | AE2 WR | AH |
---|---|---|---|---|
Mistral Base 7B SFT | alignment-handbook/zephyr-7b-sft-full | 8.4 | 6.2 | 1.3 |
Mistral Base 7B DPO (Zephyr) | princeton-nlp/Mistral-7B-Base-SFT-DPO | 15.1 | 12.5 | 10.4 |
Mistral Base 7B IPO | princeton-nlp/Mistral-7B-Base-SFT-IPO | 11.8 | 9.4 | 7.5 |
Mistral Base 7B KTO | princeton-nlp/Mistral-7B-Base-SFT-KTO | 13.1 | 9.1 | 5.6 |
Mistral Base 7B ORPO | kaist-ai/mistral-orpo-beta | 14.7 | 12.2 | 7.0 |
Mistral Base 7B R-DPO | princeton-nlp/Mistral-7B-Base-SFT-RDPO | 17.4 | 12.8 | 9.9 |
Mistral Base 7B SimPO | princeton-nlp/Mistral-7B-Base-SFT-SimPO | 21.4 | 20.8 | 16.6 |
Mistral Instruct 7B SFT | mistralai/Mistral-7B-Instruct-v0.2 | 17.1 | 14.7 | 12.6 |
Mistral Instruct 7B DPO | princeton-nlp/Mistral-7B-Instruct-DPO | 26.8 | 24.9 | 16.3 |
Mistral Instruct 7B IPO | princeton-nlp/Mistral-7B-Instruct-IPO | 20.3 | 20.3 | 16.2 |
Mistral Instruct 7B KTO | princeton-nlp/Mistral-7B-Instruct-KTO | 24.5 | 23.6 | 17.9 |
Mistral Instruct 7B ORPO | princeton-nlp/Mistral-7B-Instruct-ORPO | 24.5 | 24.9 | 20.8 |
Mistral Instruct 7B R-DPO | princeton-nlp/Mistral-7B-Instruct-RDPO | 27.3 | 24.5 | 16.1 |
Mistral Instruct 7B SimPO | princeton-nlp/Mistral-7B-Instruct-SimPO | 32.1 | 34.8 | 21.0 |
Llama3 Base 8B SFT | princeton-nlp/Llama-3-Base-8B-SFT | 6.2 | 4.6 | 3.3 |
Llama3 Base 8B DPO | princeton-nlp/Llama-3-Base-8B-SFT-DPO | 18.2 | 15.5 | 15.9 |
Llama3 Base 8B IPO | princeton-nlp/Llama-3-Base-8B-SFT-IPO | 14.4 | 14.2 | 17.8 |
Llama3 Base 8B KTO | princeton-nlp/Llama-3-Base-8B-SFT-KTO | 14.2 | 12.4 | 12.5 |
Llama3 Base 8B ORPO | princeton-nlp/Llama-3-Base-8B-SFT-ORPO | 12.2 | 10.6 | 10.8 |
Llama3 Base 8B R-DPO | princeton-nlp/Llama-3-Base-8B-SFT-RDPO | 17.6 | 14.4 | 17.2 |
Llama3 Base 8B SimPO | princeton-nlp/Llama-3-Base-8B-SFT-SimPO | 22.0 | 20.3 | 23.4 |
Llama3 Instruct 8B SFT | meta-llama/Meta-Llama-3-8B-Instruct | 26.0 | 25.3 | 22.3 |
Llama3 Instruct 8B DPO | princeton-nlp/Llama-3-Instruct-8B-DPO | 40.3 | 37.9 | 32.6 |
Llama3 Instruct 8B IPO | princeton-nlp/Llama-3-Instruct-8B-IPO | 35.6 | 35.6 | 30.5 |
Llama3 Instruct 8B KTO | princeton-nlp/Llama-3-Instruct-8B-KTO | 33.1 | 31.8 | 26.4 |
Llama3 Instruct 8B ORPO | princeton-nlp/Llama-3-Instruct-8B-ORPO | 28.5 | 27.4 | 25.8 |
Llama3 Instruct 8B R-DPO | princeton-nlp/Llama-3-Instruct-8B-RDPO | 41.1 | 37.8 | 33.1 |
Llama3 Instruct 8B SimPO | princeton-nlp/Llama-3-Instruct-8B-SimPO | 44.7 | 40.5 | 33.8 |
Please refer to the generate.py script for detailed instructions on loading the model with the appropriate chat template.
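For example (a minimal sketch using a standard transformers generation loop; the sampling settings are placeholders, see generate.py for the exact ones used):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "princeton-nlp/Llama-3-Instruct-8B-SimPO"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "What is preference optimization?"}]
# apply_chat_template handles the special tokens, including the single BOS.
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```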
Our codebase is built upon the alignment-handbook repo. The following steps will guide you through the installation process.
First, create a Python virtual environment, e.g., with Conda:
```shell
conda create -n handbook python=3.10 && conda activate handbook
```
Next, install PyTorch v2.2.2. Since this is hardware-dependent, we direct you to the PyTorch Installation Page.
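Once PyTorch is installed, you can sanity-check the version and CUDA availability (an optional check, not part of the original instructions):

```shell
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```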
You can then install the remaining package dependencies of alignment-handbook as follows:
```shell
git clone https://github.com/huggingface/alignment-handbook.git
cd ./alignment-handbook/
python -m pip install .
```
You will also need Flash Attention 2 installed, which can be done by running:
```shell
python -m pip install flash-attn --no-build-isolation
```
We provide four training config files for the four training setups reported in our paper. The training configs are set for 8xH100 GPUs. You may need to adjust `num_processes` and `per_device_train_batch_size` based on your computation environment (see the example after the commands below).
- Mistral-Base:
```shell
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/run_simpo.py training_configs/mistral-7b-base-simpo.yaml
```
- Mistral-Instruct:
```shell
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/run_simpo.py training_configs/mistral-7b-instruct-simpo.yaml
```
- Llama3-Base:
```shell
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/run_simpo.py training_configs/llama-3-8b-base-simpo.yaml
```
- Llama3-Instruct:
```shell
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/run_simpo.py training_configs/llama-3-8b-instruct-simpo.yaml
```
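For instance, on a 4-GPU machine you might halve the process count and rebalance the batch size to keep the effective batch size constant (a hypothetical adjustment; whether config keys can be overridden from the command line depends on the alignment-handbook-style argument parsing, so verify against your setup):

```shell
ACCELERATE_LOG_LEVEL=info accelerate launch \
  --config_file accelerate_configs/deepspeed_zero3.yaml --num_processes=4 \
  scripts/run_simpo.py training_configs/llama-3-8b-instruct-simpo.yaml \
  --per_device_train_batch_size=4 --gradient_accumulation_steps=4
```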
We follow the official implementation for evaluation on AlpacaEval 2, Arena-Hard, and MT-Bench, as follows (more details can be found under the eval directory):
- AlpacaEval 2: Please refer to the AlpacaEval repo for evaluation.
- Arena-Hard: Please refer to the Arena-Hard-Auto repo for evaluation.
- MT-Bench: Please refer to the FastChat repo for evaluation.
If you have any questions related to the code or the paper, feel free to email Yu (yumeng5@virginia.edu). If you encounter any problems when using the code, or want to report a bug, feel free to open an issue! Please describe the problem in detail so we can help you better and more quickly!
Please cite our paper if you find the repo helpful in your work:
```bibtex
@article{meng2024simpo,
  title={{SimPO}: Simple Preference Optimization with a Reference-Free Reward},
  author={Meng, Yu and Xia, Mengzhou and Chen, Danqi},
  journal={arXiv preprint arXiv:2405.14734},
  year={2024}
}
```