# OpenRLHF

```{note}
An Easy-to-use, Scalable and
High-performance RLHF Framework.
```

## Introduction

As large language models (LLMs) continue to grow in line with scaling laws, reinforcement
learning from human feedback (RLHF) has gained significant attention for its strong
performance. However, vanilla RLHF requires maintaining multiple models and a more complex
learning pipeline than ordinary fine-tuning. For example, Proximal Policy Optimization (PPO)
keeps four models in memory during training, so the demand for memory and compute rises
sharply as models grow larger.

```{tip}
The four models PPO requires are the SFT model (which serves as the reference policy), the reward model, the actor, and the critic.
```
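To get a feel for the scale, here is a rough back-of-the-envelope sketch, not a measurement. It assumes that all four models have the same parameter count, that the actor and critic are trained with mixed-precision Adam (about 16 bytes per parameter), and that the reward and SFT models are frozen fp16 weights (2 bytes per parameter). Activations, KV caches, and buffers are ignored, and the name `ppo_memory_gb` is hypothetical.

```python
# Rough estimate of model + optimizer state for the four PPO models.
# Assumptions (not from OpenRLHF): equal model sizes, mixed-precision Adam.
BYTES_TRAINABLE = 16  # fp16 weights (2) + fp16 grads (2) + fp32 Adam states (12)
BYTES_FROZEN = 2      # fp16 weights only, no gradients or optimizer states


def ppo_memory_gb(num_params: float) -> dict:
    """Return the estimated memory in GB for each model and the total."""
    gb = 1e9
    sizes = {
        "actor": num_params * BYTES_TRAINABLE / gb,         # trained
        "critic": num_params * BYTES_TRAINABLE / gb,        # trained
        "reward_model": num_params * BYTES_FROZEN / gb,     # frozen
        "sft_model": num_params * BYTES_FROZEN / gb,        # frozen reference
    }
    sizes["total"] = sum(sizes.values())
    return sizes


if __name__ == "__main__":
    for name, size in ppo_memory_gb(7e9).items():
        print(f"{name:>12}: {size:,.0f} GB")
```

Under these assumptions, even a 7B-parameter setup needs about 252 GB of model and optimizer state, which is already more than a single 80 GB GPU can hold.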

Existing open-source RLHF frameworks, such as Transformer Reinforcement Learning (TRL), rely on
parallelization approaches like the Zero Redundancy Optimizer (ZeRO) to co-locate the four models
involved in RLHF training on the same GPU. However, as models grow past 70 billion parameters,
this scheduling approach becomes increasingly inefficient because GPU memory is limited. To work
around the limitations of co-location, some frameworks like TRL reduce memory usage by
**merging the actor and critic models** or by employing techniques like Low-Rank Adaptation
(LoRA). However, these compromises can reduce model performance.

```{tip}
ZeRO (Zero Redundancy Optimizer) is a memory optimization technique for large-scale deep learning models, developed by Microsoft as part of the DeepSpeed library. Instead of replicating the full training state on every GPU, ZeRO partitions it across data-parallel GPUs, which makes it possible to train models with billions of parameters. It has three cumulative stages: ZeRO-1 partitions the optimizer states, ZeRO-2 additionally partitions the gradients, and ZeRO-3 additionally partitions the model parameters. Each stage further reduces per-GPU memory usage.
```
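The effect of the three ZeRO stages on per-GPU memory can be sketched with simple arithmetic. The snippet below is an illustration only: it assumes mixed-precision Adam (2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state per parameter), ignores activations and communication buffers, and uses a hypothetical function name `zero_bytes_per_gpu`.

```python
def zero_bytes_per_gpu(num_params: float, num_gpus: int, stage: int) -> float:
    """Estimate model-state bytes per GPU for ZeRO stage 0 (no ZeRO) to 3."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")

    weights = 2 * num_params   # fp16 parameters
    grads = 2 * num_params     # fp16 gradients
    optim = 12 * num_params    # fp32 weights copy + Adam momentum + variance

    if stage >= 1:             # ZeRO-1: partition optimizer states
        optim /= num_gpus
    if stage >= 2:             # ZeRO-2: also partition gradients
        grads /= num_gpus
    if stage >= 3:             # ZeRO-3: also partition parameters
        weights /= num_gpus
    return weights + grads + optim


if __name__ == "__main__":
    for stage in range(4):
        gb = zero_bytes_per_gpu(7.5e9, 64, stage) / 1e9
        print(f"ZeRO stage {stage}: {gb:.1f} GB per GPU")
```

For a 7.5B-parameter model on 64 GPUs, this estimate drops from 120 GB per GPU without ZeRO to about 31.4 GB, 16.6 GB, and 1.9 GB for stages 1, 2, and 3.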

To make RLHF training at scale easy, OpenRLHF **redesigns model scheduling using Ray, vLLM, and DeepSpeed**, which makes it possible to train models with more than 70 billion parameters.

**PPO Support Matrix** 

| Feature | OpenRLHF | DSChat | CAIChat | TRL |
| :------ | :------: | :----: | :-----: | :-: |
| 70B+ Full Tuning with 16 A100-80GB | ✅ | ❌ | ❌ | ❌ |
| 7B Full Tuning with 4 RTX 4090 | ✅ | ❌ | ❌ | ❌ |
| 34B DPO Full Tuning with 8 A100-80GB | ✅ | ❌ | ❌ | ❌ |
| Inference Engine in PPO | ✅ | ✅ | ❌ | ❌ |
| PPO Implementation Tricks | ✅ | ❌ | ❌ | ✅ |
| Support QLoRA | ✅ | ❌ | ❌ | ✅ |
| Support Mixtral 8x7B | ✅ | ❌ | ❌ | ❌ |
| Support Unmerged Actor-Critic | ✅ | ✅ | ✅ | ❌ |
| Support Multiple Reward Models | ✅ | ❌ | ❌ | ❌ |
| Support Hugging Face Models | ✅ | ✅ | ✅ | ✅ |
| Easy-to-use | ✅ | ❌ (HybridEngine bugs) | ✅ | ✅ |

## Background

### Reinforcement Learning from Human Feedback

The classic pipeline for training a large language model from a pre-trained Generative
Pre-trained Transformer (GPT) involves three steps: Supervised Fine-tuning (SFT), Reward Model
(RM) training, and PPO training.

![](../images/trl1.png)

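In the PPO step, the policy $\pi_{\phi}^{\text{RL}}$ is trained to maximize the reward while a KL penalty keeps it close to the SFT model:

$$
\begin{aligned}
\text{objective}(\phi) = \mathbb{E}_{(x, y)\sim D_{\pi_{\phi}^{\text{RL}}}}\left[r(x, y) - \beta\log\left(\frac{\pi_{\phi}^{\text{RL}}(y \mid x)}{\pi^{\text{SFT}}(y \mid x)}\right)\right]
\end{aligned}
$$

Here $r(x, y)$ is the reward model's score for response $y$ to prompt $x$, $\pi^{\text{SFT}}$ is the SFT model, and $\beta$ controls the strength of the KL penalty.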
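The term inside the expectation above can be computed for a single sampled response as in the following minimal sketch. It is not OpenRLHF's implementation: the function name `kl_penalized_reward`, the default `beta=0.1`, and the example numbers are all hypothetical, and the sequence log-probability is assumed to be the sum of per-token log-probabilities. Practical PPO implementations often apply the penalty per token instead.

```python
from typing import Sequence


def kl_penalized_reward(
    reward: float,
    logp_rl: Sequence[float],
    logp_sft: Sequence[float],
    beta: float = 0.1,  # hypothetical default for the KL coefficient
) -> float:
    """Compute r(x, y) - beta * log(pi_RL(y|x) / pi_SFT(y|x)) for one sample.

    logp_rl and logp_sft hold the per-token log-probabilities of the same
    response y under the RL policy and the SFT model, respectively.
    """
    if len(logp_rl) != len(logp_sft):
        raise ValueError("both models must score the same tokens")
    # log pi(y|x) is the sum of the token log-probabilities
    log_ratio = sum(logp_rl) - sum(logp_sft)
    return reward - beta * log_ratio


if __name__ == "__main__":
    # Made-up numbers: the RL policy assigns the response a higher
    # probability than the SFT model, so the reward is reduced.
    value = kl_penalized_reward(1.0, [-0.5, -1.0], [-0.7, -1.2], beta=0.1)
    print(f"penalized reward: {value:.2f}")
```

When the two models assign the same probability to a response, the log-ratio is zero and the penalized reward equals the raw reward model score.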

### Ray

```{note}
Ray is a distributed execution framework that provides powerful scheduling and scaling capabilities
for parallel and distributed computing workloads.
```

### vLLM

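```{note}
vLLM is a high-throughput inference and serving engine for LLMs. It uses PagedAttention to
manage the attention key-value cache efficiently.
```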