The official implementation of Self-Play Preference Optimization (SPPO)

SPPO: Self-Play Preference Optimization for Language Model Alignment (4bit quant implementation)


About SPPO 4-bit quant

This repository is a fork of uclaml/SPPO. It is my (frustrating) attempt to run this training method on two home GPUs (2x RTX 4090). All code, from synthetic dataset generation to SPPO training, was adapted to run on two GPUs. Feel free to contribute.
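
As a rough illustration of why 4-bit quantization matters on this hardware, the sketch below estimates the memory taken by the model weights alone. It assumes the nominal parameter counts implied by the model names (7B and 8B) and ignores quantization overhead, gradients, optimizer state, and activations, so actual usage during training will be higher. The helper name `weight_memory_gib` is hypothetical and not part of this repository.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Nominal parameter counts taken from the model names; real counts differ slightly.
for name, params in [("Mistral-7B", 7e9), ("Llama-3-8B", 8e9)]:
    bf16 = weight_memory_gib(params, 16)
    int4 = weight_memory_gib(params, 4)
    print(f"{name}: ~{bf16:.1f} GiB in 16-bit, ~{int4:.1f} GiB in 4-bit")
```

Each RTX 4090 has 24 GB of memory, so 16-bit weights for an 8B model already take well over half of one card before any training state is allocated, while 4-bit weights leave considerably more headroom.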


This repository contains the unofficial code (4-bit quantized version) and the officially released models for the paper Self-Play Preference Optimization for Language Model Alignment.

Authors: Yue Wu*, Zhiqing Sun*, Huizhuo Yuan*, Kaixuan Ji, Yiming Yang, Quanquan Gu

[Webpage] [Huggingface] [Paper]

About SPPO

We propose a new self-play framework dubbed SPPO for language model alignment and a new learning objective (called SPPO loss) derived from the self-play framework to fine-tune large language models efficiently.


AlpacaEval 2.0 leaderboard results of normal and length-controlled (LC) win rates in percentage (%). Mistral-7B-SPPO can outperform larger models, and Mistral-7B-SPPO (best-of-16) can outperform proprietary models such as GPT-4 (6/13). Llama-3-8B-SPPO exhibits even better performance.

SPPO can significantly enhance the performance of an LLM without strong external signals such as responses or preferences from GPT-4. It can outperform models trained with iterative direct preference optimization (DPO), among other methods. SPPO is theoretically grounded, ensuring that the LLM can converge to the von Neumann winner (i.e., Nash equilibrium) under general, potentially intransitive preferences, and it is empirically validated through extensive evaluations on multiple datasets.
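
To make the SPPO loss more concrete, here is a minimal sketch of its squared-error form for a single winner/loser pair, as we understand it from the paper: the log-ratio between the current policy and the previous iterate is regressed toward eta * (P(y beats the previous policy) - 1/2). The function name, the argument names, and the default hard labels (1.0 for the winner, 0.0 for the loser) are assumptions made for illustration; this is not the trainer code used in this repository.

```python
def sppo_pair_loss(logp_w, ref_logp_w, logp_l, ref_logp_l, eta, p_w=1.0, p_l=0.0):
    """Squared-error SPPO-style loss for one (winner, loser) response pair.

    logp_*     : sequence log-probabilities under the policy being trained
    ref_logp_* : sequence log-probabilities under the previous iterate
    eta        : scaling hyperparameter of the target
    p_w, p_l   : estimated probability that each response beats the previous
                 policy (hard labels 1.0 / 0.0 are an assumption here)
    """
    ratio_w = logp_w - ref_logp_w  # log(pi_theta / pi_t) for the winner
    ratio_l = logp_l - ref_logp_l  # log(pi_theta / pi_t) for the loser
    loss_w = (ratio_w - eta * (p_w - 0.5)) ** 2
    loss_l = (ratio_l - eta * (p_l - 0.5)) ** 2
    return loss_w + loss_l


# At initialization the policy equals the previous iterate, so both log-ratios are 0.
print(sppo_pair_loss(-10.0, -10.0, -12.0, -12.0, eta=2.0))
```

Unlike a pairwise logistic loss, each response has its own regression target, so the winner's log-ratio is pushed up and the loser's pushed down independently rather than only widening the gap between them.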

For more details, you can check our paper here.

Base Models and Released Models

| Model | AlpacaEval 2.0 LC Win Rate (%) | AlpacaEval 2.0 Win Rate (%) |
|---|---|---|
| 🤗Mistral-7B-Instruct-v0.2 | 17.11 | 14.72 |
| 🤗Mistral-7B-SPPO Iter1 | 24.79 | 23.51 |
| 🤗Mistral-7B-SPPO Iter2 | 26.89 | 27.62 |
| 🤗Mistral-7B-SPPO Iter3 | 28.53 | 31.02 |
| 🤗Llama-3-8B-Instruct | 22.92 | 22.57 |
| 🤗Llama-3-8B-SPPO Iter1 | 31.73 | 31.74 |
| 🤗Llama-3-8B-SPPO Iter2 | 35.15 | 35.98 |
| 🤗Llama-3-8B-SPPO Iter3 | 38.77 | 39.85 |
| 🤗Gemma-2-9B-It | 45.08 | 35.62 |
| 🤗Gemma-2-9B-SPPO Iter1 | 48.70 | 40.76 |
| 🤗Gemma-2-9B-SPPO Iter2 | 50.93 | 44.64 |
| 🤗Gemma-2-9B-SPPO Iter3 | 53.27 | 47.74 |
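
For a quick read of the numbers above, the following small sketch computes the length-controlled win-rate gain from each base model to its Iter3 checkpoint. The values are copied from the results listed above; the dictionary layout and the helper name `lc_gain` are only illustrative.

```python
# (base LC win rate, Iter3 LC win rate), copied from the results above
lc_win_rates = {
    "Mistral-7B": (17.11, 28.53),
    "Llama-3-8B": (22.92, 38.77),
    "Gemma-2-9B": (45.08, 53.27),
}


def lc_gain(base: float, iter3: float) -> float:
    """Absolute gain in percentage points, rounded to two decimals."""
    return round(iter3 - base, 2)


for family, (base, iter3) in lc_win_rates.items():
    print(f"{family}: +{lc_gain(base, iter3):.2f} points LC win rate after 3 iterations")
```

This gives gains of 11.42, 15.85, and 8.19 percentage points for the Mistral, Llama-3, and Gemma-2 families respectively.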

Environment Setup

Our training code is based on the alignment-handbook codebase. We use vllm for generation and PairRM for ranking. Follow the steps below to set up your environment:

  1. Create a Virtual Environment:

    conda create -n sppo python=3.10
    conda activate sppo
  2. Install vllm for Generation:

    pip install vllm
  3. Install PairRM:

    git clone https://github.com/yuchenlin/LLM-Blender.git
    cd LLM-Blender
    pip install -e .
  4. Download and Install Training Dependencies:

    git clone https://github.com/kaykyr/SPPO.git
    cd SPPO
    pip install -e .
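
After completing the steps above, a quick sanity check can confirm that the packages are visible in the active environment. The sketch below only looks the modules up without importing them. The import names `vllm` and `llm_blender` are assumptions about how the two packages expose themselves; adjust the list if your installation differs.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Assumed import names for vllm and LLM-Blender (PairRM).
REQUIRED = ["vllm", "llm_blender"]

missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```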

Training Scripts

Execute the training scripts based on the base model you choose:

  • For Llama-3:

    bash run.sh

    Don't forget to replace the model path in run.sh and in some of the scripts under ./scripts.

    Use tail -f ./out/logs/* to follow the log output.

These scripts manage the training iterations, generation, and PairRM ranking; each component is detailed under "Breakdown of Scripts" below. Note that some scripts may attempt to push datasets to the Hugging Face Hub under the UCLA-AGI organization. Either make sure you have write access, change the organization name, or comment out the push_to_hub calls.
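
Before running anything, it can help to locate every push_to_hub call and every UCLA-AGI reference. The sketch below scans a directory for both strings. It assumes the relevant files are .py and .sh files under ./scripts; the helper name `find_hub_references` is hypothetical and not part of this repository.

```python
from pathlib import Path


def find_hub_references(root, needles=("push_to_hub", "UCLA-AGI")):
    """Return (path, line number, line) for each line mentioning one of the needles."""
    root = Path(root)
    if not root.is_dir():
        return []
    hits = []
    for path in sorted(root.rglob("*")):
        # Only look at Python and shell scripts.
        if not path.is_file() or path.suffix not in {".py", ".sh"}:
            continue
        text = path.read_text(errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if any(needle in line for needle in needles):
                hits.append((str(path), lineno, line.strip()))
    return hits


for path, lineno, line in find_hub_references("scripts"):
    print(f"{path}:{lineno}: {line}")
```

Run it from the repository root; each reported line is a candidate for editing or commenting out.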

Breakdown of Scripts:

  1. Generation:

    python scripts/generate.py --model $MODEL --maxlen 2048 --output_dir $OUTPUT_DIR --prompts $PROMPTS
  2. Ranking:

    python scripts/rank.py --output_dir $OUTPUT_DIR --prompts $PROMPTS
  3. Training:

    bash scripts/pipeline.sh --model $MODEL --iter $ITER --dataset $DATASET --output_dir $OUTPUT_DIR --num 1
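
The three steps above can be chained for one iteration with a small driver. The sketch below only builds and prints the command lines, using exactly the flags shown above; the function name `iteration_commands` and all placeholder values (model name, paths, dataset) are hypothetical and must be replaced with your own.

```python
import shlex


def iteration_commands(model, prompts, output_dir, dataset, iteration, num=1, maxlen=2048):
    """Build the generation, ranking, and training commands for one SPPO iteration."""
    generate = ["python", "scripts/generate.py", "--model", model,
                "--maxlen", str(maxlen), "--output_dir", output_dir,
                "--prompts", prompts]
    rank = ["python", "scripts/rank.py", "--output_dir", output_dir,
            "--prompts", prompts]
    train = ["bash", "scripts/pipeline.sh", "--model", model,
             "--iter", str(iteration), "--dataset", dataset,
             "--output_dir", output_dir, "--num", str(num)]
    return [generate, rank, train]


# Placeholder values for illustration only.
commands = iteration_commands(
    model="path/to/base-model",
    prompts="path/to/prompts",
    output_dir="path/to/output",
    dataset="path/to/dataset",
    iteration=1,
)
for cmd in commands:
    print(shlex.join(cmd))
    # To actually run a step: subprocess.run(cmd, check=True)
```

Building the commands as argument lists rather than one shell string avoids quoting problems when a path contains spaces.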

Evaluation

We adhere to the established guidelines for evaluation and utilize the following repositories:

We provide the model configurations used during AlpacaEval 2 in the models_configs directory. Please note that after the initial release of our model, we retrained it using a slightly modified prompt. The win rates observed post-retraining are comparable to the original results.

Troubleshooting

For questions related to the paper, please contact the authors via email. If you encounter any issues with the code or wish to report a bug, feel free to open an issue on our GitHub repository.

Citation

@article{wu2024self,
  title={Self-play preference optimization for language model alignment},
  author={Wu, Yue and Sun, Zhiqing and Yuan, Huizhuo and Ji, Kaixuan and Yang, Yiming and Gu, Quanquan},
  year={2024}
}

Acknowledgements

We thank the authors of The Alignment Handbook for their foundational contributions to the training code. We also acknowledge the use of PairRM for ranking and vllm for generation.

TODO - Quant version

  • Fix generation (it works but duplicates data; this can be fixed later)
  • Training code (almost done; once it is ready, we can clean up the code and implement an easy-to-use script)
  • Write documentation
