# An Overview of ChatGPT

## Introduction

> ChatGPT is an AI system that can engage in back-and-forth conversation through a chatbot-style interface. It is capable of writing code and of correcting or adjusting its responses based on feedback from the user.

ChatGPT is a complex system built on top of a large language model from the GPT-3 family. Its breakthrough is as much about access as capability: by putting the model behind an intuitive chat interface that anyone can use, it has gone a long way towards democratising AI.

ChatGPT can still make mistakes, including:
- Providing factually incorrect information
- Producing biased responses
- Occasionally (and increasingly rarely) producing inappropriate or harmful responses

## How ChatGPT Works

So how does it work under the hood?

ChatGPT is implemented in 3 steps, as shown below:

![How ChatGPT is trained](./images/How%20chatGPT%20is%20trained.png)

1. Supervised Fine-Tuning (SFT): Fine-tune a pre-trained language model (GPT-3.5) to act like a chatbot
2. The Reward Model (RM): Train a new _reward model_ to identify which responses generated by the chatbot are better than others
3. Reinforcement Learning from Human Feedback (RLHF): Use the reward model to score generated responses, and update the language model to prefer responses with a higher score

> The point of SFT and RLHF is to make the language model better and more aligned with human intention. The RM is a necessary component of RLHF.
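
To make step 2 more concrete, here is a minimal sketch of the pairwise ranking loss described in the InstructGPT paper, `-log(sigmoid(r_preferred - r_rejected))`. The function name and the example scores are assumptions made for illustration; a real reward model would produce the scores from a prompt and a response.

```python
import math


def pairwise_ranking_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Loss for one comparison: -log(sigmoid(r_preferred - r_rejected)).

    The loss is small when the reward model scores the response the
    labeller preferred above the one they rejected, and large otherwise.
    """
    diff = reward_preferred - reward_rejected
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Hypothetical scores: the reward model agrees with the labeller...
agree = pairwise_ranking_loss(reward_preferred=2.0, reward_rejected=0.0)
# ...and disagrees with the labeller.
disagree = pairwise_ranking_loss(reward_preferred=0.0, reward_rejected=2.0)
print(f"agree: {agree:.3f}, disagree: {disagree:.3f}")
```

Minimising this loss over many comparisons pushes the reward model to give higher scores to the responses that humans ranked higher.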

## InstructGPT

Before ChatGPT was developed and released, the same method was used to produce InstructGPT, a language model based on GPT-3 that interprets a prompt as an instruction to follow rather than as text to be continued.
This makes the model easier to interact with, because you can simply give it commands instead of relying on careful prompt engineering.

InstructGPT models subsequently became the default language models on the OpenAI API.

### InstructGPT Results

The [paper on InstructGPT](https://arxiv.org/pdf/2203.02155.pdf) reports, among other findings, that:
- Labellers significantly prefer InstructGPT outputs over outputs from GPT-3
- InstructGPT models show improvements in truthfulness over GPT-3
- InstructGPT shows small improvements in toxicity over GPT-3, but not in bias
- InstructGPT still makes simple mistakes

ChatGPT follows the same training approach, which helps explain why it has become so useful across many use cases.

## Data Collection

As described in the [InstructGPT paper](https://arxiv.org/pdf/2203.02155.pdf), to collect the data used to fine-tune the very first InstructGPT models, OpenAI had human labellers create prompts. The three requested prompt types were:
- Plain: Ask the labellers to come up with an arbitrary task, while ensuring the tasks have sufficient diversity.
- Few-shot: Ask the labellers to come up with an instruction, and multiple query/response pairs for that instruction.
- User-based: Ask the labellers to come up with prompts corresponding to use cases stated in waitlist applications to the OpenAI API.

These manually created prompts led to three datasets, used for the three stages of training:
- The supervised fine-tuning (SFT) dataset
    - Features: Prompts
    - Labels: Ideal responses
- The reward model (RM) dataset
    - Features: Prompts, each with several model responses
    - Labels: A labeller's ranking of those responses
- The PPO dataset
    - Features: Prompts only (the responses are generated by the model during training)
    - No labels
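
The three datasets can be pictured as three record types. The sketch below is illustrative only: the class and field names are hypothetical, and the PPO record holds just a prompt on the assumption that responses are generated by the model during training. It also shows how one ranking can be expanded into (preferred, rejected) pairs for training the reward model.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass
class SFTExample:
    prompt: str
    ideal_response: str  # label: written by a human labeller


@dataclass
class RMExample:
    prompt: str
    responses_best_to_worst: List[str]  # label: the labeller's ranking, best first


@dataclass
class PPOExample:
    prompt: str  # no human label; the reward model supplies the training signal


def ranking_to_pairs(example: RMExample) -> List[Tuple[str, str]]:
    """Expand one ranking into every (preferred, rejected) pair it implies."""
    # combinations() preserves input order, so the first item of each pair
    # is always the one ranked higher.
    return list(combinations(example.responses_best_to_worst, 2))


rm_example = RMExample(
    prompt="Explain what a reward model is.",
    responses_best_to_worst=["clear answer", "vague answer", "off-topic answer"],
)
for preferred, rejected in ranking_to_pairs(rm_example):
    print(f"{preferred!r} is preferred over {rejected!r}")
```

A ranking of K responses yields K·(K−1)/2 comparisons, so a single labelled ranking provides several training signals for the reward model.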

To create the datasets for the original InstructGPT models, OpenAI hired a team of about 40 contractors through [Upwork](https://www.upwork.com/) and [Scale AI](https://scale.com/rlhf).

## Supervised Fine-Tuning (SFT)

> Like RLHF, the point of supervised fine-tuning is to make the language model better and more aligned with human intention.

> SFT requires a labelled dataset: prompts paired with ideal responses.

> Up to a point, continuing to train after the model has started to overfit the SFT dataset can still increase human preference ratings.



### Step 1: Collect demonstration data and train a supervised policy

This means that a team of human labellers writes out acceptable responses to a range of prompts by hand. These prompt and response pairs are saved and make up the raw data for the SFT dataset.

The final part of step 1 is to _train a supervised policy_.

> A policy is something that defines how you act in a certain situation. In the case of ChatGPT, the "situation" is the instruction written by the user (or the conversation so far), and the policy defines what response ChatGPT should produce.

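The policy is determined by the parameters of the language model. Initially, it is defined by the parameters of the pre-trained model used as a starting point (the backbone). Supervised training then updates those parameters so that the model's responses move closer to the human-written demonstrations.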
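
Training a supervised policy on demonstrations is typically done by minimising the negative log-likelihood of the demonstration's tokens. The sketch below assumes that objective; the function name and the per-token probabilities are hypothetical stand-ins for what a real language model would output.

```python
import math
from typing import List


def sft_loss(response_token_probs: List[float]) -> float:
    """Mean negative log-likelihood of the demonstration's response tokens.

    Each entry is the probability the policy assigned to the token the
    human labeller actually wrote at that position.
    """
    return -sum(math.log(p) for p in response_token_probs) / len(response_token_probs)


# Hypothetical probabilities before and after some fine-tuning.
before_training = [0.20, 0.10, 0.30]
after_training = [0.90, 0.80, 0.95]
print(f"before: {sft_loss(before_training):.3f}, after: {sft_loss(after_training):.3f}")
```

The loss falls as the policy assigns more probability to the labeller's response, which is what it means for the policy to imitate the demonstrations.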

## Reinforcement Learning from Human Feedback

### Recap: What is Reinforcement Learning?

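> Reinforcement learning is where an agent learns a policy by taking actions and adjusting its behaviour to maximise the reward it receives for them.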

### No Labels, No Problem

The fine-tuned model can use the reward model (RM) to evaluate newly generated text, without requiring labelled ideal responses.

> RLHF does not require the prompts in its dataset to be labelled with ideal responses, but it does require the reward model (RM).

### Using the reward model

> The reward model can be used to produce a reward for the fine-tuned model in a reinforcement learning setup.
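
As a toy illustration of this setup, the sketch below treats the policy as a softmax over three candidate responses to a single prompt and nudges it towards responses with higher reward. Everything here is an assumption made for illustration: the responses, the reward scores and the function names are made up, and the update is a plain exact policy-gradient step rather than PPO, which is feasible only because the toy policy has just three possible outputs.

```python
import math
from typing import Dict, List

# Hypothetical reward model scores for three candidate responses to one prompt.
REWARD_MODEL_SCORES: Dict[str, float] = {
    "I don't know.": 0.1,
    "Paris.": 0.5,
    "The capital of France is Paris.": 0.9,
}


def softmax(logits: List[float]) -> List[float]:
    """Turn the policy's logits into a probability for each response."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(probs: List[float], rewards: List[float]) -> float:
    return sum(p * r for p, r in zip(probs, rewards))


def policy_gradient_step(logits: List[float], rewards: List[float], lr: float) -> List[float]:
    """One gradient-ascent step on the expected reward of a softmax policy."""
    probs = softmax(logits)
    baseline = expected_reward(probs, rewards)
    # d(expected reward) / d(logit_i) = p_i * (r_i - baseline)
    return [z + lr * p * (r - baseline) for z, p, r in zip(logits, probs, rewards)]


responses = list(REWARD_MODEL_SCORES)
rewards = [REWARD_MODEL_SCORES[r] for r in responses]

logits = [0.0, 0.0, 0.0]  # the starting policy is uniform over the responses
initial_reward = expected_reward(softmax(logits), rewards)

for _ in range(500):
    logits = policy_gradient_step(logits, rewards, lr=0.4)

final_probs = softmax(logits)
final_reward = expected_reward(final_probs, rewards)
print(f"expected reward: {initial_reward:.2f} -> {final_reward:.2f}")
print("most likely response:", responses[final_probs.index(max(final_probs))])
```

No ideal response is ever shown to the policy here; the only training signal is the score assigned to each response, which is the role the reward model plays in RLHF.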
