# How ChatGPT Works Part 3: RLHF

<a target="_blank" href="https://colab.research.google.com/github/life-efficient/RLHF-Implementation/blob/main/Notebook.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

> Reinforcement Learning from Human Feedback, or RLHF, is a technique for updating a machine learning model based on human feedback

The second and third steps in the diagram below make up RLHF:
- In step 2, the reward model is trained to predict the reward for each response, using a supervised dataset of prompts and human-ranked responses
- In step 3, the reward model is used in the reinforcement learning setup to predict the reward for each response to an unlabelled dataset of prompts

![](./images/How%20chatGPT%20is%20trained.png)

### Recap: What is Reinforcement Learning?

> Reinforcement learning is where an agent (in our case, the AI system) interacts with an environment (in our case, the chat interface, where it responds to prompts) and tries to maximise a reward which it receives for doing well (or minimise a punishment for doing badly).

![](./images/RL%20Formulation.png)

### The Reward Model is Used to Encode Complex Behaviours that are Very Difficult to Define

> It can be very difficult to define many of the behaviours that we want our AI systems to exhibit

- What does it mean to be unbiased?
- What does it mean to act professionally?
- What does it mean to be ethical?

> Instead of trying to explicitly write out the rules for each of these things, a better approach can be to learn them from human feedback

It's hard to write the rules for these things, but it's relatively easy for a human to tell whether an output is biased, professional, or ethical.
That's why the reward model is trained on human feedback (rankings of different responses to a given prompt). 
If the reward model fits a dataset that prefers unbiased, professional, ethical responses well enough, then it should encode these complex behaviours.
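To make "trained on rankings" concrete, here is a minimal sketch of a pairwise ranking loss, the kind described in the InstructGPT paper. This notebook does not show which loss its reward model uses, so treat this as an assumption; `pairwise_ranking_loss` and the example scores are hypothetical.

```python
import math

def pairwise_ranking_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Loss for one ranked pair: -log(sigmoid(r_preferred - r_rejected))."""
    margin = reward_preferred - reward_rejected
    # log(1 + exp(-margin)) is algebraically equal to -log(sigmoid(margin))
    return math.log1p(math.exp(-margin))

# Hypothetical reward-model scores for two responses to the same prompt
loss_agrees = pairwise_ranking_loss(reward_preferred=2.0, reward_rejected=-1.0)
loss_disagrees = pairwise_ranking_loss(reward_preferred=-1.0, reward_rejected=2.0)

# The loss is small when the model scores the human-preferred response higher
print(loss_agrees < loss_disagrees)
```

Minimising this loss over many ranked pairs pushes the reward model to give higher scores to the responses that humans preferred.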

> The reward model is used to provide the reward used in the reinforcement learning setup

## The Overall Loss Function

ChatGPT is trained with the objective from the PPO (Proximal Policy Optimization) reinforcement learning algorithm.
This objective is the quantity that training tries to maximise.

![](./images/PPOLoss.png)
<!-- 
## The REINFORCE Obective

> REINFORCE is a reinforcement learning algorithm that PPO (the algorithm we will use) builds upon

The REINFORCE objective function is as follows: -->

<!-- ## PPO -->

- Averaged over a batch of different responses
- Rewards ratio: $\frac{\text{probability of the response with new params}}{\text{probability of the response with old params}}$ for the same input prompt
- Multiplied by the advantage function
- Clipped so that a single update cannot change the policy too much: the new policy stays in proximity to the old one
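The four ingredients above can be sketched in plain Python. This is a minimal illustration of the standard PPO clipped objective, not the code used to train ChatGPT; the function name, the example numbers, and `clip_eps=0.2` (the default suggested in the PPO paper) are assumptions.

```python
import math

def clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean over the batch of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    terms = []
    for lp_new, lp_old, advantage in zip(logp_new, logp_old, advantages):
        # Ratio of action probabilities under the new and old policy
        ratio = math.exp(lp_new - lp_old)
        # Keep the ratio inside [1 - eps, 1 + eps]
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the clipped and unclipped terms
        terms.append(min(ratio * advantage, clipped_ratio * advantage))
    return sum(terms) / len(terms)

# Hypothetical batch of two responses: ratios are 2.0 and 0.5
logp_new = [math.log(0.6), math.log(0.1)]
logp_old = [math.log(0.3), math.log(0.2)]
advantages = [1.0, -1.0]

objective = clipped_objective(logp_new, logp_old, advantages)
print(objective)
```

In the first response the ratio of 2.0 is clipped to 1.2, so the policy gets no extra credit for moving further than the clip range allows.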

### The Rewards Ratio

![](./images/Rewards%20Ratio.png)

- If the reward ratio is > 1, then it means that taking action $a$ in state $s$ is more likely with the new policy compared to the old one.
- If the reward ratio is < 1, then it means that taking action $a$ in state $s$ is less likely with the new policy compared to the old one.

> Despite its name, this is a ratio of action probabilities rather than of rewards. The further it is from 1, the more drastically the policy is changing in this update.
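As a small worked example, the ratio for a whole response can be computed from per-token log-probabilities. This sketch assumes the entire response is treated as a single action; many implementations instead compute one ratio per token. The function name and the probabilities are made up for illustration.

```python
import math

def sequence_ratio(token_logps_new, token_logps_old):
    """Ratio of the response's probability under the new vs. the old policy."""
    # Summing token log-probs gives the log-prob of the whole response
    return math.exp(sum(token_logps_new) - sum(token_logps_old))

# Hypothetical per-token probabilities for a two-token response
old_policy = [math.log(0.5), math.log(0.4)]
new_more_likely = [math.log(0.6), math.log(0.5)]
new_less_likely = [math.log(0.4), math.log(0.3)]

ratio_up = sequence_ratio(new_more_likely, old_policy)    # response became more likely
ratio_down = sequence_ratio(new_less_likely, old_policy)  # response became less likely
print(ratio_up > 1, ratio_down < 1)
```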

### Clipping the Reward

> If the policy changes too much, the ratio moves far from 1, so PPO clips it to a small range around 1 to limit the size of each update


## What if the reward model is wrong?

The policy is optimised to maximise the reward model score.

That means everything depends on the reward model being accurate.

Even if the reward model is reasonably accurate, it is only a proxy for human preferences. With too much fine-tuning via RLHF, the policy can begin to overfit to the reward model and in fact produce responses that humans prefer less.
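One common guard against this kind of overfitting, described in the InstructGPT paper, is to subtract a penalty when the policy drifts away from the SFT model. The sketch below is an assumption about how that might look, not something this notebook implements; `penalised_reward`, `kl_coef`, and all of the numbers are hypothetical.

```python
def penalised_reward(reward_score, logp_policy, logp_sft, kl_coef=0.1):
    """Reward model score minus a penalty for drifting away from the SFT model."""
    # log p_policy(response) - log p_sft(response) is a one-sample KL estimate
    kl_estimate = logp_policy - logp_sft
    return reward_score - kl_coef * kl_estimate

# A response the reward model loves but the SFT model finds very unlikely
drifted = penalised_reward(reward_score=3.0, logp_policy=-2.0, logp_sft=-10.0)
# A slightly lower-scoring response that stays close to the SFT model
close = penalised_reward(reward_score=2.5, logp_policy=-5.0, logp_sft=-5.5)

print(close > drifted)
```

With the penalty applied, the response that stays close to the SFT model ends up with the higher reward, which discourages the policy from exploiting quirks of the reward model.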

## The Dataset

To implement the reinforcement learning loop, we'll need a dataset. Thanks to the reward model, which provides the reward as a label for each response, we don't need human-written labels. The dataset simply returns prompts. The model completes them and the reward model scores the completions, and we then use those rewards to update the policy that generates responses.

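```python
import pandas as pd
import torch

class PromptDataset(torch.utils.data.Dataset):
    """Returns raw prompt strings; the reward model supplies the labels later."""

    def __init__(self):
        super().__init__()
        # Expects a CSV with a "Prompt" column; a list gives plain positional indexing
        self.prompts = pd.read_csv('prompt_dataset.csv')["Prompt"].tolist()

    def __len__(self):
        return len(self.prompts)

    def __getitem__(self, idx):
        return self.prompts[idx]

prompt_dataset = PromptDataset()
prompt_dataset[0]  # show the first prompt
```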
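The flow described above (complete the prompt, score the completion, update the policy) can be sketched as a single function. This is only an outline of the data flow: `rlhf_step` and the toy stand-ins below are hypothetical, and real code would call the language model, the reward model, and a PPO update instead.

```python
def rlhf_step(prompts, generate, score, update):
    """One pass over a batch of prompts: complete, score, then update the policy."""
    experiences = []
    for prompt in prompts:
        response = generate(prompt)       # the policy completes the prompt
        reward = score(prompt, response)  # the reward model labels the response
        experiences.append((prompt, response, reward))
    update(experiences)                   # e.g. a PPO update on this batch
    return experiences

# Toy stand-ins so the sketch runs on its own
toy_generate = lambda prompt: prompt + " ... a completion"
toy_score = lambda prompt, response: float(len(response) - len(prompt))
updates_seen = []  # records each batch passed to the update step

batch = rlhf_step(["Hello", "Explain RLHF"], toy_generate, toy_score, updates_seen.append)
print(len(batch), len(updates_seen))
```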

## Load in the Pre-Trained Language Model

By this point, we should already have performed supervised fine-tuning (SFT) on a large language model.

Let's load in our fine-tuned language model:

```python
from SFT_model import train_and_save_SFT_model, SFTModel

train_and_save_SFT_model()
```

Now that we've trained and saved the SFT model, we need to load it and set its parameters.

```python
sft_model = SFTModel()  # create model
sft_state_dict = torch.load('sft_model.pt')  # load model weights
sft_model.load_state_dict(sft_state_dict)  # set model weights
```

## Load in the Pre-Trained Reward Model

By this point, we should have already trained a reward model that takes in a prompt and a response and produces a scalar reward - a measure of how good the response is for that context.

Let's load in our reward model:

```python
from reward_model import train_and_save_reward_model, RewardModel

train_and_save_reward_model()
```

Now that we've trained and saved the reward model, we need to load it and set its parameters.

```python
reward_model = RewardModel()  # create model
reward_state_dict = torch.load('reward_model.pt')  # load model weights
reward_model.load_state_dict(reward_state_dict)  # set model weights
```