In [None]:
!pip install -qU transformers trl datasets

In [None]:
from huggingface_hub import notebook_login

notebook_login()

# Dataset formats and types

## Overview of the dataset formats and types

* The *format* of a dataset refers to how the data is structured, typically categorized as either *standard* or *conversational*.
* The *type* is associated with the specific task the dataset is designed for, such as *prompt-only* or *preference*. Each type is characterized by its columns, which vary according to the task.

<table>
  <tr>
    <th>Type \ Format</th>
    <th>Standard</th>
    <th>Conversational</th>
  </tr>
  <tr>
    <td>Language modeling</td>
    <td>
      <pre><code>{"text": "The sky is blue."}</code></pre>
    </td>
    <td>
      <pre><code>{"messages": [{"role": "user", "content": "What color is the sky?"},
              {"role": "assistant", "content": "It is blue."}]}</code></pre>
    </td>
  </tr>
  <tr>
    <td>Prompt-only</td>
    <td>
      <pre><code>{"prompt": "The sky is"}</code></pre>
    </td>
    <td>
      <pre><code>{"prompt": [{"role": "user", "content": "What color is the sky?"}]}</code></pre>
    </td>
  </tr>
  <tr>
    <td>Prompt-completion</td>
    <td>
      <pre><code>{"prompt": "The sky is",
 "completion": " blue."}</code></pre>
    </td>
    <td>
      <pre><code>{"prompt": [{"role": "user", "content": "What color is the sky?"}],
 "completion": [{"role": "assistant", "content": "It is blue."}]}</code></pre>
    </td>
  </tr>
  <tr>
    <td>Preference</td>
    <td>
      <pre><code>{"prompt": "The sky is",
 "chosen": " blue.",
 "rejected": " green."}</code></pre>
      or, with implicit prompt:
      <pre><code>{"chosen": "The sky is blue.",
 "rejected": "The sky is green."}</code></pre>
    </td>
    <td>
      <pre><code>{"prompt": [{"role": "user", "content": "What color is the sky?"}],
 "chosen": [{"role": "assistant", "content": "It is blue."}],
 "rejected": [{"role": "assistant", "content": "It is green."}]}</code></pre>
      or, with implicit prompt:
      <pre><code>{"chosen": [{"role": "user", "content": "What color is the sky?"},
              {"role": "assistant", "content": "It is blue."}],
 "rejected": [{"role": "user", "content": "What color is the sky?"},
                {"role": "assistant", "content": "It is green."}]}</code></pre>
    </td>
  </tr>
  <tr>
    <td>Unpaired preference</td>
    <td>
      <pre><code>{"prompt": "The sky is",
 "completion": " blue.",
 "label": True}</code></pre>
    </td>
    <td>
      <pre><code>{"prompt": [{"role": "user", "content": "What color is the sky?"}],
 "completion": [{"role": "assistant", "content": "It is green."}],
 "label": False}</code></pre>
    </td>
  </tr>
  <tr>
    <td>Stepwise supervision</td>
    <td>
      <pre><code>{"prompt": "Which number is larger, 9.8 or 9.11?",
 "completions": ["The fractional part of 9.8 is 0.8.",
                 "The fractional part of 9.11 is 0.11.",
                 "0.11 is greater than 0.8.",
                 "Hence, 9.11 > 9.8."],
 "labels": [True, True, False, False]}</code></pre>
    </td>
    <td></td>
  </tr>
</table>

### Formats

#### Standard

The standard dataset format consists of plain text strings. The columns in the dataset vary depending on the task. This is the format expected by TRL trainers.

In [None]:
# Language modeling
language_modeling_example = {"text": "The sky is blue."}

# Preference
preference_example = {
    "prompt": "The sky is",
    "chosen": " blue.",
    "rejected": " green."
}

# Unpaired preference
unpaired_preference_example = {
    "prompt": "The sky is",
    "completion": " blue.",
    "label": True
}

#### Conversational

The conversational datasets are used for tasks involving dialogues or chat interactions between users and assistants. These datasets contain sequences of messages where each message has a `role` (e.g., `"user"` or `"assistant"`) and `content` (the message text).

In [None]:
messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "Can you tell me a joke?"},
]

The columns in conversational datasets also vary depending on the task.

In [None]:
# Prompt-completion
prompt_completion_example = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "completion": [{"role": "assistant", "content": "It is blue."}]
}

# Preference
preference_example = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "chosen": [{"role": "assistant", "content": "It is blue."}],
    "rejected": [{"role": "assistant", "content": "It is green."}]
}

Conversational datasets are useful for training chat models, but must be converted into a standard format before being used with TRL trainers. This is done using chat templates specific to the model being used.
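
Before applying a chat template, it can help to check which format an example is in. The following is a minimal sketch in plain Python. `looks_conversational` is a hypothetical helper written for illustration, not a TRL function, and it assumes the column names used in this notebook.

```python
# Hypothetical helper for illustration; the column names follow this notebook.
CANDIDATE_COLUMNS = ("prompt", "chosen", "rejected", "completion", "messages")


def looks_conversational(example: dict) -> bool:
    """Return True if any known column holds a list of role/content messages."""
    for key in CANDIDATE_COLUMNS:
        value = example.get(key)
        if isinstance(value, list) and value:
            first = value[0]
            if isinstance(first, dict) and "role" in first and "content" in first:
                return True
    return False


standard_example = {"prompt": "The sky is", "completion": " blue."}
conversational_example = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "completion": [{"role": "assistant", "content": "It is blue."}],
}

print(looks_conversational(standard_example))        # False
print(looks_conversational(conversational_example))  # True
```

Only examples for which this check returns `True` would need a chat template applied.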

### Types

#### Language modeling

A language modeling dataset consists of a column `"text"` (or `"messages"` for conversational datasets) containing a full sequence of text.

In [None]:
# Standard format
language_modeling_example = {"text": "The sky is blue."}

# Conversational format
language_modeling_example = {"messages": [
    {"role": "user", "content": "What color is the sky?"},
    {"role": "assistant", "content": "It is blue."}
]}

#### Prompt-only

In a prompt-only dataset, only the initial prompt (the question or partial sentence) is provided under the key `"prompt"`. Training typically involves generating a completion from this prompt, so the model learns to continue or complete the given input.

In [None]:
# Standard format
prompt_only_example = {"prompt": "The sky is"}

# Conversational format
prompt_only_example = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}

While both the prompt-only and language modeling types are similar, they differ in how the input is handled.
* In the prompt-only type, the prompt represents a partial input that expects the model to complete or continue.
* In the language modeling type, the input is treated as a complete sentence or sequence.

In [None]:
from transformers import AutoTokenizer
from trl import apply_chat_template

tokenizer = AutoTokenizer.from_pretrained('microsoft/Phi-3-mini-128k-instruct')

In [None]:
# Example for prompt-only type
prompt_only_example = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}
apply_chat_template(prompt_only_example, tokenizer)

{'prompt': '<|user|>\nWhat color is the sky?<|end|>\n<|assistant|>\n'}

The prompt-only output includes a `'<|assistant|>\n'`, indicating the beginning of the assistant's turn and expecting the model to generate a completion.

In [None]:
# Example for language modeling type
lm_example = {"messages": [{"role": "user", "content": "What color is the sky?"}]}
apply_chat_template(lm_example, tokenizer)

{'text': '<|user|>\nWhat color is the sky?<|end|>\n<|endoftext|>'}

The language modeling output treats the input as a complete sequence and terminates it with `'<|endoftext|>'`, signaling the end of the text and not expecting any additional content.

#### Prompt-completion

A prompt-completion dataset includes a `"prompt"` and a `"completion"`.

In [None]:
# Standard format
prompt_completion_example = {
    "prompt": "The sky is",
    "completion": " blue."
}

# Conversational format
prompt_completion_example = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "completion": [{"role": "assistant", "content": "It is blue."}]
}

#### Preference

A preference dataset is used for tasks where the model is trained to choose between two or more possible completions to the same prompt. This dataset includes a `"prompt"`, a `"chosen"` completion, and a `"rejected"` completion.

The model is trained to select the `"chosen"` response over the `"rejected"` response. Some datasets may not include the `"prompt"` column, in which case the prompt is implicit and directly included in the `"chosen"` and `"rejected"` completions.

In [None]:
# Standard format
## Explicit prompt (recommended)
preference_example = {
    "prompt": "The sky is",
    "chosen": " blue.",
    "rejected": " green."
}
## Implicit prompt
preference_example = {
    "chosen": "The sky is blue.",
    "rejected": "The sky is green."
}

# Conversational format
## Explicit prompt (recommended)
preference_example = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "chosen": [{"role": "assistant", "content": "It is blue."}],
    "rejected": [{"role": "assistant", "content": "It is green."}]
}
## Implicit prompt
preference_example = {
    "chosen": [{"role": "user", "content": "What color is the sky?"},
               {"role": "assistant", "content": "It is blue."}],
    "rejected": [{"role": "user", "content": "What color is the sky?"},
                 {"role": "assistant", "content": "It is green."}]
}
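
To make the relationship between the implicit and explicit prompt forms concrete, the sketch below splits off the leading messages that `"chosen"` and `"rejected"` share. `split_shared_prompt` is a hypothetical plain-Python helper for illustration only. TRL ships its own `extract_prompt` utility, and this sketch does not claim to reproduce its exact behavior.

```python
def split_shared_prompt(chosen: list, rejected: list) -> dict:
    """Treat the common leading messages as the prompt (illustrative only)."""
    shared = 0
    for chosen_msg, rejected_msg in zip(chosen, rejected):
        if chosen_msg != rejected_msg:
            break
        shared += 1
    return {
        "prompt": chosen[:shared],
        "chosen": chosen[shared:],
        "rejected": rejected[shared:],
    }


implicit_example = {
    "chosen": [{"role": "user", "content": "What color is the sky?"},
               {"role": "assistant", "content": "It is blue."}],
    "rejected": [{"role": "user", "content": "What color is the sky?"},
                 {"role": "assistant", "content": "It is green."}],
}

explicit_example = split_shared_prompt(**implicit_example)
print(explicit_example["prompt"])
```

The shared user message ends up under `"prompt"`, and each completion keeps only its own assistant message.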

#### Unpaired preference

An unpaired preference dataset is similar to a preference dataset, but instead of having `"chosen"` and `"rejected"` completions for the same prompt, it includes a single `"completion"` and a `"label"` indicating whether the completion is preferred or not.

In [None]:
# Standard format
unpaired_preference_example = {
    "prompt": "The sky is",
    "completion": " blue.",
    "label": True
}

# Conversational format
unpaired_preference_example = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "completion": [{"role": "assistant", "content": "It is green."}],
    "label": False
}
unpaired_preference_example = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "completion": [{"role": "assistant", "content": "It is blue."}],
    "label": True
}

#### Stepwise supervision

A stepwise (or process) supervision dataset is similar to an unpaired preference dataset but includes multiple steps of completions, each with its own label. This structure is useful for tasks that need detailed, step-by-step labeling, such as reasoning tasks.

By evaluating each step separately and providing targeted labels, this approach helps identify precisely where the reasoning is correct and where errors occur, allowing for targeted feedback on each part of the reasoning process.

In [None]:
# Stepwise supervision
stepwise_example = {
    "prompt": "Which number is larger, 9.8 or 9.11?",
    "completions": [
        "The fractional part of 9.8 is 0.8.",
        "The fractional part of 9.11 is 0.11.",
        "0.11 is greater than 0.8.",
        "Hence, 9.11 > 9.8."
    ],
    "labels": [
        True,
        True,
        False,
        False
    ]
}
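
Because each step carries its own label, locating where the reasoning first goes wrong is a simple scan. The sketch below uses a hypothetical helper, `first_incorrect_step`, which is not part of TRL.

```python
from typing import Optional


def first_incorrect_step(labels: list) -> Optional[int]:
    """Return the index of the first step labeled False, or None if all are correct."""
    for index, is_correct in enumerate(labels):
        if not is_correct:
            return index
    return None


example = {
    "prompt": "Which number is larger, 9.8 or 9.11?",
    "completions": [
        "The fractional part of 9.8 is 0.8.",
        "The fractional part of 9.11 is 0.11.",
        "0.11 is greater than 0.8.",
        "Hence, 9.11 > 9.8.",
    ],
    "labels": [True, True, False, False],
}

error_index = first_incorrect_step(example["labels"])
print(error_index, example["completions"][error_index])
```

Here the scan stops at the third step, the false comparison from which the wrong conclusion follows.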

## Which dataset type to use?

| Trainer | Expected dataset type |
| ------- | --------------------- |
| `BCOTrainer` | Unpaired preference |
| `CPOTrainer` | Preference (explicit prompt recommended) |
| `DPOTrainer` | Preference (explicit prompt recommended) |
| `GKDTrainer` | Prompt-completion |
| `GRPOTrainer` | Prompt-only |
| `IterativeSFTTrainer` | Unpaired preference |
| `KTOTrainer` | Unpaired preference or Preference (explicit prompt recommended) |
| `NashMDTrainer` | Prompt-only |
| `OnlineDPOTrainer` | Prompt-only |
| `ORPOTrainer` | Preference (explicit prompt recommended) |
| `PPOTrainer` | Tokenized language modeling |
| `PRMTrainer` | Stepwise supervision |
| `RewardTrainer` | Preference (implicit prompt recommended) |
| `SFTTrainer` | Language modeling |
| `XPOTrainer` | Prompt-only |

As of February 2025, TRL trainers only support standard dataset formats, so we must convert a conversational dataset into a standard format before training.
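
To pick the right row in the table, it helps to know which type a dataset already has. The sketch below guesses the type from the column names alone. `infer_dataset_type` is a hypothetical helper, not a TRL API, and it only knows the column names used in this notebook.

```python
def infer_dataset_type(example: dict) -> str:
    """Guess the dataset type from column names (illustrative heuristic)."""
    keys = set(example)
    if {"completions", "labels"} <= keys:
        return "stepwise supervision"
    if {"prompt", "completion", "label"} <= keys:
        return "unpaired preference"
    if {"chosen", "rejected"} <= keys:
        if "prompt" in keys:
            return "preference (explicit prompt)"
        return "preference (implicit prompt)"
    if {"prompt", "completion"} <= keys:
        return "prompt-completion"
    if "prompt" in keys:
        return "prompt-only"
    if "text" in keys or "messages" in keys:
        return "language modeling"
    return "unknown"


print(infer_dataset_type({"prompt": "The sky is", "chosen": " blue.", "rejected": " green."}))
print(infer_dataset_type({"prompt": "The sky is", "completion": " blue.", "label": True}))
print(infer_dataset_type({"text": "The sky is blue."}))
```

The checks run from the most specific column set to the least specific, so a dataset with extra columns is still matched to the richest type it supports.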

## Working with conversational datasets in TRL

### Converting a conversational dataset into a standard dataset

We need to apply a *chat template* to the dataset. A **chat template** is a predefined structure that typically includes placeholders for user and assistant messages. This template is provided by the tokenizer of the model we choose.

In TRL, the method we apply to convert the dataset will vary depending on the task. The `apply_chat_template()` function will simplify this process:

In [None]:
from transformers import AutoTokenizer
from trl import apply_chat_template

tokenizer = AutoTokenizer.from_pretrained('microsoft/Phi-3-mini-128k-instruct')

In [None]:
example = {
    "prompt": [{'role': 'user', 'content': 'What color is the sky?'}],
    "completion": [{'role': 'assistant', 'content': 'It is blue.'}]
}

apply_chat_template(example, tokenizer)

{'prompt': '<|user|>\nWhat color is the sky?<|end|>\n<|assistant|>\n',
 'completion': 'It is blue.<|end|>\n<|endoftext|>'}

Alternatively, we can use the `map` method to apply the template across an entire dataset:

In [None]:
from datasets import Dataset
from trl import apply_chat_template

dataset_dict = {
    "prompt": [
        [{'role': 'user', 'content': 'What color is the sky?'}],
        [{'role': 'user', 'content': 'Where is the sun?'}]
    ],
    "completion": [
        [{'role': 'assistant', 'content': 'It is blue.'}],
        [{'role': 'assistant', 'content': 'In the sky.'}]
    ]
}

dataset = Dataset.from_dict(dataset_dict)
dataset = dataset.map(
    apply_chat_template,
    fn_kwargs={'tokenizer': tokenizer}
)

for data in dataset:
    print(data)

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

{'prompt': '<|user|>\nWhat color is the sky?<|end|>\n<|assistant|>\n', 'completion': 'It is blue.<|end|>\n<|endoftext|>'}
{'prompt': '<|user|>\nWhere is the sun?<|end|>\n<|assistant|>\n', 'completion': 'In the sky.<|end|>\n<|endoftext|>'}


NOTE: chat templates are *model-specific*. If we use the chat template from `meta-llama/Meta-Llama-3.1-8B-Instruct` with the above example, we get a different output:

In [None]:
tokenizer_llama = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct')

In [None]:
apply_chat_template(example, tokenizer_llama)

{'prompt': '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat color is the sky?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n',
 'completion': 'It is blue.<|eot_id|>'}

Always use the chat template associated with the model we are working with. Using the wrong template can lead to inaccurate or unexpected results.
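
To see why templates are not interchangeable, the sketch below renders the same prompt with two toy functions that imitate the output shapes shown above. Both functions are hypothetical simplifications written for illustration. The real templates live in each tokenizer and may add more, such as the system preamble in the Llama output.

```python
def render_phi3_style(messages, add_generation_prompt=True):
    """Toy imitation of the Phi-3-style output shown above."""
    text = "".join(f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages)
    if add_generation_prompt:
        text += "<|assistant|>\n"
    return text


def render_llama3_style(messages, add_generation_prompt=True):
    """Toy imitation of the Llama-3-style headers (system preamble omitted)."""
    text = "<|begin_of_text|>"
    for m in messages:
        text += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
    if add_generation_prompt:
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


messages = [{"role": "user", "content": "What color is the sky?"}]
print(repr(render_phi3_style(messages)))
print(repr(render_llama3_style(messages)))
```

The two strings share no special tokens, so a model trained on one layout would see unfamiliar markers if given the other.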

## Using any dataset with TRL: preprocessing and conversion

For datasets that may not be directly compatible with TRL, we can use the [example scripts](https://github.com/huggingface/trl/tree/main/examples/datasets) to preprocess and convert them into the required format.

### Example: UltraFeedback dataset

The format of the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback) does not match the expected structure. It is not in a conversational format, the column names differ, and the results pertain to different models (e.g., Bard, GPT-4) and aspects (e.g., "helpfulness", "honesty").

By using the provided conversion script [`ultrafeedback.py`](https://github.com/huggingface/trl/tree/main/examples/datasets/ultrafeedback.py), we can transform this dataset into an unpaired preference type:
```bash
python examples/datasets/ultrafeedback.py --repo_id trl-lib/ultrafeedback-gpt-3.5-turbo-helpfulness
```

## Utilities for converting dataset types

### From prompt-completion to language modeling dataset

In [None]:
from datasets import Dataset

dataset = Dataset.from_dict({
    'prompt': ['The sky is', 'The sun is'],
    'completion': [' blue.', ' in the sky.']
})

def concat_prompt_completion(example):
    return {
        'text': example['prompt'] + example['completion']
    }

dataset = dataset.map(concat_prompt_completion, remove_columns=['prompt', 'completion'])

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

In [None]:
dataset[0]

{'text': 'The sky is blue.'}

### From prompt-completion to prompt-only dataset

In [None]:
from datasets import Dataset

dataset = Dataset.from_dict({
    'prompt': ['The sky is', 'The sun is'],
    'completion': [' blue.', ' in the sky.']
})

dataset = dataset.remove_columns('completion')

In [None]:
dataset[0]

{'prompt': 'The sky is'}

### From preference with implicit prompt to prompt-completion dataset

In [None]:
from datasets import Dataset
from trl import extract_prompt

dataset = Dataset.from_dict({
    'chosen': [
        [{'role': 'user', 'content': 'What color is the sky?'}, {'role': 'assistant', 'content': 'It is blue.'}],
        [{'role': 'user', 'content': 'Where is the sun?'}, {'role': 'assistant', 'content': 'In the sky.'}]
    ],
    'rejected': [
        [{'role': 'user', 'content': 'What color is the sky?'}, {'role': 'assistant', 'content': 'It is green.'}],
        [{'role': 'user', 'content': 'Where is the sun?'}, {'role': 'assistant', 'content': 'In the sea.'}]
    ]
})

# extract the prompt first
dataset = dataset.map(extract_prompt)
# remove `rejected` and rename `chosen`
dataset = dataset.remove_columns('rejected').rename_column('chosen', 'completion')

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

In [None]:
dataset[0]

{'completion': [{'content': 'It is blue.', 'role': 'assistant'}],
 'prompt': [{'content': 'What color is the sky?', 'role': 'user'}]}

### From preference with implicit prompt to prompt-only dataset

In [None]:
from datasets import Dataset
from trl import extract_prompt

dataset = Dataset.from_dict({
    'chosen': [
        [{'role': 'user', 'content': 'What color is the sky?'}, {'role': 'assistant', 'content': 'It is blue.'}],
        [{'role': 'user', 'content': 'Where is the sun?'}, {'role': 'assistant', 'content': 'In the sky.'}]
    ],
    'rejected': [
        [{'role': 'user', 'content': 'What color is the sky?'}, {'role': 'assistant', 'content': 'It is green.'}],
        [{'role': 'user', 'content': 'Where is the sun?'}, {'role': 'assistant', 'content': 'In the sea.'}]
    ]
})

dataset = dataset.map(extract_prompt).remove_columns(['chosen', 'rejected'])

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

In [None]:
dataset[0]

{'prompt': [{'content': 'What color is the sky?', 'role': 'user'}]}

### From implicit to explicit prompt preference dataset

In [None]:
from datasets import Dataset
from trl import extract_prompt

dataset = Dataset.from_dict({
    'chosen': [
        [{'role': 'user', 'content': 'What color is the sky?'}, {'role': 'assistant', 'content': 'It is blue.'}],
        [{'role': 'user', 'content': 'Where is the sun?'}, {'role': 'assistant', 'content': 'In the sky.'}]
    ],
    'rejected': [
        [{'role': 'user', 'content': 'What color is the sky?'}, {'role': 'assistant', 'content': 'It is green.'}],
        [{'role': 'user', 'content': 'Where is the sun?'}, {'role': 'assistant', 'content': 'In the sea.'}]
    ]
})

dataset = dataset.map(extract_prompt)

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

In [None]:
dataset[0]

{'chosen': [{'content': 'It is blue.', 'role': 'assistant'}],
 'rejected': [{'content': 'It is green.', 'role': 'assistant'}],
 'prompt': [{'content': 'What color is the sky?', 'role': 'user'}]}

### From preference with implicit prompt to unpaired preference dataset

In [None]:
from datasets import Dataset
from trl import extract_prompt, unpair_preference_dataset

dataset = Dataset.from_dict({
    'chosen': [
        [{'role': 'user', 'content': 'What color is the sky?'}, {'role': 'assistant', 'content': 'It is blue.'}],
        [{'role': 'user', 'content': 'Where is the sun?'}, {'role': 'assistant', 'content': 'In the sky.'}]
    ],
    'rejected': [
        [{'role': 'user', 'content': 'What color is the sky?'}, {'role': 'assistant', 'content': 'It is green.'}],
        [{'role': 'user', 'content': 'Where is the sun?'}, {'role': 'assistant', 'content': 'In the sea.'}]
    ]
})

# extract the prompt first
dataset = dataset.map(extract_prompt)
# build the unpaired preference
dataset = unpair_preference_dataset(dataset)

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

In [None]:
dataset[0]

{'prompt': [{'content': 'What color is the sky?', 'role': 'user'}],
 'completion': [{'content': 'It is blue.', 'role': 'assistant'}],
 'label': True}

The `"chosen"` and `"rejected"` completions in a preference dataset may both be good or both be bad. Before applying `unpair_preference_dataset()`, we need to ensure that every `"chosen"` completion can be labeled as good and every `"rejected"` completion as bad.
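
One way to enforce this is to filter the pairs before unpairing them. The sketch below is plain Python and rests on assumptions: the `chosen_score` and `rejected_score` columns and the `GOOD_THRESHOLD` value are hypothetical, and the row order produced by TRL's own utility may differ from this loop.

```python
# Hypothetical rows: the score columns and the threshold are assumptions for illustration.
rows = [
    {"prompt": "The sky is", "chosen": " blue.", "rejected": " green.",
     "chosen_score": 0.9, "rejected_score": 0.1},
    {"prompt": "The sun is", "chosen": " in the sea.", "rejected": " underground.",
     "chosen_score": 0.3, "rejected_score": 0.2},  # both completions are bad
]
GOOD_THRESHOLD = 0.5


def is_clean_pair(row: dict) -> bool:
    """Keep a pair only if chosen is good and rejected is bad in absolute terms."""
    return row["chosen_score"] >= GOOD_THRESHOLD > row["rejected_score"]


def unpair(row: dict) -> list:
    """Turn one preference row into two unpaired preference rows."""
    return [
        {"prompt": row["prompt"], "completion": row["chosen"], "label": True},
        {"prompt": row["prompt"], "completion": row["rejected"], "label": False},
    ]


unpaired = [item for row in rows if is_clean_pair(row) for item in unpair(row)]
for item in unpaired:
    print(item)
```

The second pair is dropped because its "chosen" completion is only better in relative terms, so labeling it `True` would mislead training.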

### From preference to language modeling dataset

In [None]:
from datasets import Dataset

dataset = Dataset.from_dict({
    'prompt': ['The sky is', 'The sun is'],
    'chosen': [' blue.', ' in the sky.'],
    'rejected': [' green.', ' in the sea.']
})

def concat_prompt_completion(example):
    return {
        'text': example['prompt'] + example['chosen']
    }

dataset = dataset.map(concat_prompt_completion, remove_columns=['prompt', 'chosen', 'rejected'])

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

In [None]:
dataset[0]

{'text': 'The sky is blue.'}

### From preference to prompt-only dataset

In [None]:
from datasets import Dataset

dataset = Dataset.from_dict({
    'prompt': ['The sky is', 'The sun is'],
    'chosen': [' blue.', ' in the sky.'],
    'rejected': [' green.', ' in the sea.']
})

dataset = dataset.remove_columns(['chosen', 'rejected'])

In [None]:
dataset[0]

{'prompt': 'The sky is'}

### From explicit to implicit prompt preference dataset

In [None]:
from datasets import Dataset

dataset = Dataset.from_dict({
    'prompt': [
        [{'role': 'user', 'content': 'What color is the sky?'}],
        [{'role': 'user', 'content': 'Where is the sun?'}]
    ],
    'chosen': [
        [{'role': 'assistant', 'content': 'It is blue.'}],
        [{'role': 'assistant', 'content': 'In the sky.'}]
    ],
    'rejected': [
        [{'role': 'assistant', 'content': 'It is green.'}],
        [{'role': 'assistant', 'content': 'In the sea.'}]
    ]
})

def concat_prompt_to_completion(example):
    return {
        'chosen': example['prompt'] + example['chosen'],
        'rejected': example['prompt'] + example['rejected']
    }

dataset = dataset.map(concat_prompt_to_completion, remove_columns='prompt')

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

In [None]:
dataset[0]

{'chosen': [{'content': 'What color is the sky?', 'role': 'user'},
  {'content': 'It is blue.', 'role': 'assistant'}],
 'rejected': [{'content': 'What color is the sky?', 'role': 'user'},
  {'content': 'It is green.', 'role': 'assistant'}]}

### From preference to unpaired preference dataset

In [None]:
from datasets import Dataset
from trl import unpair_preference_dataset

dataset = Dataset.from_dict({
    'prompt': [
        [{'role': 'user', 'content': 'What color is the sky?'}],
        [{'role': 'user', 'content': 'Where is the sun?'}]
    ],
    'chosen': [
        [{'role': 'assistant', 'content': 'It is blue.'}],
        [{'role': 'assistant', 'content': 'In the sky.'}]
    ],
    'rejected': [
        [{'role': 'assistant', 'content': 'It is green.'}],
        [{'role': 'assistant', 'content': 'In the sea.'}]
    ]
})

dataset = unpair_preference_dataset(dataset)

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

In [None]:
dataset[0]

{'prompt': [{'content': 'What color is the sky?', 'role': 'user'}],
 'completion': [{'content': 'It is blue.', 'role': 'assistant'}],
 'label': True}

### From unpaired preference to language modeling dataset

In [None]:
from datasets import Dataset

dataset = Dataset.from_dict({
    'prompt': ['The sky is', 'The sun is', 'The sky is', 'The sun is'],
    'completion': [' blue.', ' in the sky.', ' green.', ' in the sea.'],
    'label': [True, True, False, False]
})

def concat_prompt_completion(example):
    return {
        'text': example['prompt'] + example['completion']
    }

dataset = dataset.filter(lambda x: x['label']).map(concat_prompt_completion).remove_columns(['prompt', 'completion', 'label'])

Filter:   0%|          | 0/4 [00:00<?, ? examples/s]

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

In [None]:
dataset[0]

{'text': 'The sky is blue.'}

### From unpaired preference to prompt-completion dataset

In [None]:
from datasets import Dataset

dataset = Dataset.from_dict({
    'prompt': ['The sky is', 'The sun is', 'The sky is', 'The sun is'],
    'completion': [' blue.', ' in the sky.', ' green.', ' in the sea.'],
    'label': [True, True, False, False]
})

dataset = dataset.filter(lambda x: x['label']).remove_columns(['label'])

Filter:   0%|          | 0/4 [00:00<?, ? examples/s]

In [None]:
dataset[0]

{'prompt': 'The sky is', 'completion': ' blue.'}

### From unpaired preference to prompt-only dataset

In [None]:
from datasets import Dataset

dataset = Dataset.from_dict({
    'prompt': ['The sky is', 'The sun is', 'The sky is', 'The sun is'],
    'completion': [' blue.', ' in the sky.', ' green.', ' in the sea.'],
    'label': [True, True, False, False]
})

dataset = dataset.remove_columns(['completion', 'label'])

### From stepwise supervision to language modeling dataset

In [None]:
from datasets import Dataset

dataset = Dataset.from_dict({
    'prompt': ['Blue light', 'Water'],
    'completions': [
        [' scatters more in the atmosphere,', ' so the sky is green.'],
        [' forms a less dense structure in ice,', ' which causes it to expand when it freezes.']
    ],
    'labels': [
        [True, False],
        [True, True]
    ]
})

def concat_prompt_completions(example):
    completion = ''.join(example['completions'])
    return {
        'text': example['prompt'] + completion
    }


dataset = dataset.filter(lambda x: all(x['labels'])).map(concat_prompt_completions,
                                                         remove_columns=['prompt', 'completions', 'labels'])

Filter:   0%|          | 0/2 [00:00<?, ? examples/s]

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

In [None]:
dataset[0]

{'text': 'Water forms a less dense structure in ice, which causes it to expand when it freezes.'}

### From stepwise supervision to prompt-completion dataset

In [None]:
from datasets import Dataset

dataset = Dataset.from_dict({
    'prompt': ['Blue light', 'Water'],
    'completions': [
        [' scatters more in the atmosphere,', ' so the sky is green.'],
        [' forms a less dense structure in ice,', ' which causes it to expand when it freezes.']
    ],
    'labels': [
        [True, False],
        [True, True]
    ]
})

def join_completions(example):
    completion = ''.join(example['completions'])
    return {'completion': completion}


dataset = dataset.filter(lambda x: all(x['labels'])).map(join_completions,
                                                         remove_columns=['completions', 'labels'])

Filter:   0%|          | 0/2 [00:00<?, ? examples/s]

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

In [None]:
dataset[0]

{'prompt': 'Water',
 'completion': ' forms a less dense structure in ice, which causes it to expand when it freezes.'}

### From stepwise supervision to prompt-only dataset

In [None]:
from datasets import Dataset

dataset = Dataset.from_dict({
    'prompt': ['Blue light', 'Water'],
    'completions': [
        [' scatters more in the atmosphere,', ' so the sky is green.'],
        [' forms a less dense structure in ice,', ' which causes it to expand when it freezes.']
    ],
    'labels': [
        [True, False],
        [True, True]
    ]
})

dataset = dataset.remove_columns(['completions', 'labels'])

In [None]:
dataset[0]

{'prompt': 'Blue light'}

### From stepwise supervision to unpaired preference dataset

In [None]:
from datasets import Dataset

dataset = Dataset.from_dict({
    'prompt': ['Blue light', 'Water'],
    'completions': [
        [' scatters more in the atmosphere,', ' so the sky is green.'],
        [' forms a less dense structure in ice,', ' which causes it to expand when it freezes.']
    ],
    'labels': [
        [True, False],
        [True, True]
    ]
})

def merge_completions_and_labels(example):
    return {
        'prompt': example['prompt'],
        'completion': ''.join(example['completions']),
        'label': all(example['labels'])
    }

dataset = dataset.map(merge_completions_and_labels, remove_columns=['completions', 'labels'])

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

In [None]:
dataset[0]

{'prompt': 'Blue light',
 'completion': ' scatters more in the atmosphere, so the sky is green.',
 'label': False}

## Vision datasets

For **vision-language models (VLMs)**, it is recommended to use a conversational format.

A conversational vision dataset differs from a standard conversational dataset in two key ways:
1. The dataset must contain the key `images` with the image data.
2. The `"content"` field in messages must be a list of dictionaries, where each dictionary specifies the type of data: `"image"` or `"text"`.

Example:
```python
# Textual dataset
"content": "What color is the sky?"

# Vision dataset
"content": [
    {"type": "image"},
    {"type": "text", "text": "What color is the sky in the image?"}
]
```
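
The two rules above can be checked mechanically. The sketch below is a hypothetical validator, not a TRL function. It assumes the message column names used earlier in this notebook, and the string standing in for image data is only a placeholder.

```python
MESSAGE_COLUMNS = ("messages", "prompt", "completion", "chosen", "rejected")
ALLOWED_TYPES = {"image", "text"}


def find_vision_format_problems(example: dict) -> list:
    """Return a list of human-readable problems; an empty list means both rules hold."""
    problems = []
    if "images" not in example:
        problems.append("missing 'images' key")
    for column in MESSAGE_COLUMNS:
        messages = example.get(column)
        if not isinstance(messages, list):
            continue  # column absent or not conversational
        for message in messages:
            content = message["content"]
            if not isinstance(content, list):
                problems.append(f"{column}: content must be a list of dictionaries")
                continue
            for part in content:
                if not isinstance(part, dict) or part.get("type") not in ALLOWED_TYPES:
                    problems.append(f"{column}: unexpected content part {part!r}")
    return problems


vision_example = {
    "images": ["<image data placeholder>"],  # stands in for real image data
    "messages": [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What color is the sky in the image?"},
    ]}],
}
text_example = {"messages": [{"role": "user", "content": "What color is the sky?"}]}

print(find_vision_format_problems(vision_example))
print(find_vision_format_problems(text_example))
```

The textual example fails both rules: it has no `images` key and its `"content"` is a plain string.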

# Training Requirements

## What metrics should we look at?

When performing classical supervised fine-tuning of language models, the loss serves as a good indicator of the training progress. However, in **Reinforcement Learning (RL)**, the loss becomes less informative about the model's performance, and its value may fluctuate while the actual performance improves.

In this case, there are two key metrics:
* **Mean Reward**: The primary goal is to maximize the reward achieved by the model during RL training.
* **Objective KL Divergence**: KL divergence measures the dissimilarity between two probability distributions. In the context of RL, we use it to quantify the difference between the current model and a reference model. Ideally, we want to keep the KL divergence between 0 and 10 to ensure the model's generated text remains close to what the reference model produces.
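
As a rough illustration of what the KL metric measures, the sketch below computes the KL divergence between toy next-token distributions over a three-token vocabulary. The distributions are made up for illustration, and this is not how TRL computes the logged metric.

```python
import math


def kl_divergence(p: list, q: list) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


reference = [0.5, 0.3, 0.2]         # toy reference model distribution
close_policy = [0.45, 0.35, 0.2]    # stays near the reference
drifted_policy = [0.05, 0.05, 0.9]  # has collapsed onto one token

print(kl_divergence(reference, reference))
print(kl_divergence(close_policy, reference))
print(kl_divergence(drifted_policy, reference))
```

The value is zero when the two distributions match and grows as the trained model drifts away from the reference, which is why it works as a drift monitor.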

## Why do we use a reference model, and what is the purpose of KL divergence?

When training RL models, optimizing solely for reward may lead to unexpected behaviors, where the model exploits the environment in ways that do not align with good language generation. In the case of RLHF, we use a reward model trained to predict whether a generated text is highly ranked by humans.

The RL model being optimized against the reward model may learn patterns that yield high reward but do not represent good language. This can result in extreme cases where the model generates texts with excessive exclamation marks or emojis to maximize the reward. In some worst-case scenarios, the model may generate patterns completely unrelated to natural language yet receive high rewards, similar to adversarial attacks.

To address this issue, we have to add a penalty to the reward function based on the KL divergence between the current model and the reference model. By doing this, we encourage the model to stay close to what the reference model generates.

## What is the concern with negative KL divergence?

If we generate text by purely sampling from the model distribution, things generally work fine. However, the `generate` method does not always purely sample, depending on its settings, and this can cause the KL divergence to go negative.

Essentially, when the active model assigns `log_p_token_active < log_p_token_ref` to the sampled tokens, we get a negative KL divergence. This can happen in several cases:
* **top-k sampling**: the model can smooth out its probability distribution so that the top-k tokens have a smaller probability than under the reference model, yet they are still selected.
* **min_length**: this setting ignores the EOS token until `min_length` is reached. The model can therefore assign a very low log probability to the EOS token and a very high probability to all other tokens until `min_length` is reached.
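
The top-k case can be reproduced with a toy calculation. In the sketch below, the two distributions are invented for illustration. Tokens are drawn from the top-k part of the active distribution, but the log-ratio is computed with the full, untruncated probabilities, which is the mismatch described above.

```python
import math

active = [0.40, 0.35, 0.25]     # flatter distribution from the model being trained
reference = [0.60, 0.35, 0.05]  # sharper distribution from the reference model


def exact_kl(p: list, q: list) -> float:
    """True KL(p || q), which can never be negative."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q))


def expected_log_ratio_under_top_k(p: list, q: list, k: int) -> float:
    """Average of log p - log q when tokens come from the renormalized top-k of p."""
    top = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
    mass = sum(p[i] for i in top)
    return sum((p[i] / mass) * (math.log(p[i]) - math.log(q[i])) for i in top)


print(exact_kl(active, reference))
print(expected_log_ratio_under_top_k(active, reference, k=2))
```

The exact KL is positive, but the top-2 estimate is negative because the only tokens that can be sampled are ones the reference model rates at least as highly.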

Negative KL divergence is a problem. The total reward `R` is computed as `R = r - beta * KL`, so if the model learns to drive the KL divergence negative, it effectively gets a positive reward. In many cases, exploiting such a bug in the generation is much easier than actually learning the reward function. In addition, the KL divergence can become arbitrarily small (that is, very negative), so the actual reward can be very small compared to it.
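
A small numeric sketch of `R = r - beta * KL` makes the incentive visible. The reward, `beta`, and KL values below are arbitrary numbers chosen for illustration.

```python
def penalized_reward(raw_reward: float, kl_estimate: float, beta: float) -> float:
    """Total reward R = r - beta * KL, as in the formula above."""
    return raw_reward - beta * kl_estimate


beta = 0.2
same_raw_reward = 1.0

honest = penalized_reward(same_raw_reward, kl_estimate=2.0, beta=beta)
exploit = penalized_reward(same_raw_reward, kl_estimate=-2.0, beta=beta)
print(honest, exploit)
```

With the same raw reward, the run with a negative KL estimate scores higher, so the penalty has turned into a bonus.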

## How to generate text for training?

To avoid these KL issues, we can use the following generation settings:
```python
generation_kwargs = {
    "min_length": -1, # don't ignore the EOS token
    "top_k": 0.0, # no top-k sampling
    "top_p": 1.0, # no nucleus sampling
    "do_sample": True, # yes, we want to sample
    "pad_token_id": tokenizer.eos_token_id, # most decoder models don't have a padding token - use EOS token instead
    "max_new_tokens": 32, # specify how many tokens we want to generate at most
}
```

## How can we debug our own use case?

* **Start from a working example**: Begin with a working example from the `trl` repository and gradually modify it to fit your specific use case. For example, we can start by replacing the model in the example and, once we figure out the best hyperparameters, switch to our dataset and reward model. If we change everything at once, we will not know where a potential problem comes from.
* **Start small, scale later**: Training large models can be very slow and take several hours or days until we see any improvement. For debugging this is not a convenient timescale, so try to use small model variants during the development phase and scale up once that works. That being said, we sometimes have to be careful, as small models might not have the capacity to solve a complicated task either.
* **Start simple**: Try to start with a minimal example and build complexity from there. Our use case might require, for example, a complicated reward function consisting of many different rewards. Try to use one signal first, see if we can optimize it, and then add more complexity.
* **Inspect the generations**: It is always a good idea to inspect what the model is generating. Maybe there is a bug in our post-processing or our prompt. Due to bad settings we might cut off generations too soon. These things are very difficult to see in the metrics but very obvious if we look at the generations.
* **Inspect the reward model**: If our reward is not improving over time, maybe there is an issue with the reward model. We can look at extreme cases to see if it does what it should: e.g., in the sentiment case we can check if simple positive and negative examples really get different rewards. We can also look at the distribution of rewards over our dataset. Finally, maybe the reward is dominated by the query, which the model cannot affect, so we might need to normalize this (e.g., reward of query+response minus reward of the query).
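The query normalization mentioned in the last point can be sketched as follows. Here `toy_reward` is a hypothetical stand-in for a real reward model (it simply counts the words "good" and "bad"), and `normalized_rewards` is an illustrative helper, not a TRL function.

```python
def toy_reward(text):
    """Hypothetical reward model: +1 per 'good', -1 per 'bad'."""
    words = text.lower().split()
    return float(words.count("good") - words.count("bad"))


def normalized_rewards(reward_fn, queries, responses):
    """Reward of query+response minus reward of the query alone."""
    return [reward_fn(q + r) - reward_fn(q) for q, r in zip(queries, responses)]


queries = ["This good movie was", "This bad movie was"]
responses = [" good", " good"]

# The raw rewards differ only because the queries differ.
raw = [toy_reward(q + r) for q, r in zip(queries, responses)]

# After subtracting the query's own reward, identical responses score the same.
norm = normalized_rewards(toy_reward, queries, responses)

print(raw, norm)
```

With a real reward model, the same subtraction removes the part of the score that the policy cannot influence, at the cost of one extra reward-model call per query.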

# Logging

By default, the TRL `PPOTrainer` saves a lot of relevant information to `wandb` or `tensorboard`.

Upon initialization, we need to pass one of these two options to the `PPOConfig`:
```python
training_args = PPOConfig(..., report_to="wandb") # or "tensorboard"
```

## PPO Logging

Key metrics to monitor. We want to maximize the reward, maintain a low KL divergence, and maximize entropy:
1. `env/reward_mean`: the average reward obtained from the environment. Alias `ppo/mean_scores`, which is used to specifically monitor the reward model.
2. `env/reward_std`: the standard deviation of the reward obtained from the environment. Alias `ppo/std_scores`, which is used to specifically monitor the reward model.
3. `env/reward_dist`: the histogram distribution of the reward obtained from the environment.
4. `objective/kl`: the mean KL divergence between the current policy and the reference model. It measures how much the policy has deviated from the reference model. The KL divergence is used to compute the KL penalty in the objective function.
5. `objective/kl_dist`: the histogram distribution of the `objective/kl`.
6. `objective/kl_coef`: the coefficient for KL divergence in the objective function.
7. `ppo/mean_non_score_reward`: the KL penalty term of the reward, derived from `objective/kl` and `objective/kl_coef`. It is combined with the reward-model score to form the total reward used for optimization, and it prevents the policy from deviating too far from the reference model.
8. `objective/entropy`: the entropy of the model's policy, calculated by `-logprobs.sum(-1).mean()`. High entropy means the model's actions are more random, which can be beneficial for exploration.

Training stats:
* `ppo/learning_rate`: the learning rate for the PPO algorithm
* `ppo/policy/entropy`: the entropy of the model's policy, calculated by `pd = torch.nn.functional.softmax(logits, dim=-1)`; `entropy = torch.logsumexp(logits, dim=-1) - torch.sum(pd * logits, dim=-1)`. It measures the randomness of the policy.
* `ppo/policy/clipfrac`: the fraction of probability ratios (new policy / old policy) that fell outside the clipping range in the PPO objective. This can be used to monitor the optimization process.
* `ppo/policy/approxkl`: the approximate KL divergence between the old and new policies, measured by `0.5 * masked_mean((logprobs - old_logprobs) ** 2, mask)`, corresponding to the `k2` estimator.
* `ppo/policy/policykl`: similar to `ppo/policy/approxkl`, but measured by `masked_mean(old_logprobs - logprobs, mask)`, corresponding to the `k1` estimator.
* `ppo/policy/ratio`: the histogram distribution of the ratio between the new and old policies, used to compute the PPO objective.
* `ppo/policy/advantages_mean`: the average of the GAE (Generalized Advantage Estimation). The advantage function measures how much better an action is compared to the average action at a state.
* `ppo/policy/advantages`: the histogram distribution of `ppo/policy/advantages_mean`.
* `ppo/returns/mean`: the mean of the TD($\lambda$) returns, calculated by `returns = advantage + values`, another indicator of model performance.
* `ppo/returns/var`: the variance of the TD($\lambda$) returns, calculated by `returns = advantage + values`.
* `ppo/val/mean`: the mean of the values, used to monitor the value function's performance
* `ppo/val/var`: the variance of the values.
* `ppo/val_explained`: the explained variance for the value function.
* `ppo/val/clipfrac`: the fraction of the value function's predicted values that are clipped
* `ppo/val/vpred`: the predicted values from the value function
* `ppo/val/error`: the mean squared error between the `ppo/val/vpred` and returns, used to monitor the value function's performance
* `ppo/loss/policy`: the policy loss for the Proximal Policy Optimization (PPO) algorithm
* `ppo/loss/value`: the loss for the value function in the PPO algorithm. This value quantifies how well the function estimates the expected future rewards.
* `ppo/loss/total`: the total loss for the PPO algorithm. It is the sum of the policy loss and the value function loss.
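To make a few of these formulas concrete, the sketch below recomputes the ratio, `approxkl` (the `k2` estimator), `policykl` (the `k1` estimator), a clip fraction, and the logits-based entropy in plain Python. The helper names and the `cliprange` default are assumptions, and the clip fraction follows the description in the list above rather than TRL's exact implementation.

```python
import math


def masked_mean(values, mask):
    """Mean of `values` over positions where `mask` is 1."""
    return sum(v * m for v, m in zip(values, mask)) / sum(mask)


def ppo_diagnostics(logprobs, old_logprobs, mask, cliprange=0.2):
    """Recompute a few policy statistics from per-token log probabilities."""
    diff = [lp - old_lp for lp, old_lp in zip(logprobs, old_logprobs)]
    ratio = [math.exp(d) for d in diff]  # new policy / old policy
    outside = [float(r < 1 - cliprange or r > 1 + cliprange) for r in ratio]
    return {
        "ratio": ratio,
        "approxkl": 0.5 * masked_mean([d ** 2 for d in diff], mask),  # k2
        "policykl": masked_mean([-d for d in diff], mask),  # k1
        "clipfrac": masked_mean(outside, mask),
    }


def entropy_from_logits(logits):
    """entropy = logsumexp(logits) - sum(softmax(logits) * logits)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    pd = [math.exp(x - lse) for x in logits]
    return lse - sum(p * x for p, x in zip(pd, logits))


# The last position is padding and is excluded through the mask.
stats = ppo_diagnostics(
    logprobs=[-1.0, -2.0, -0.5, -3.0],
    old_logprobs=[-1.0, -1.5, -1.0, -3.0],
    mask=[1, 1, 1, 0],
)
uniform_entropy = entropy_from_logits([0.0, 0.0, 0.0, 0.0])
```

In this example the two non-zero log-probability differences cancel in the `k1` estimate but not in the `k2` estimate, which is one reason to track both.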

Stats on queries, responses, and logprobs:
* `tokens/queries_len_mean`: the average length of the queries tokens
* `tokens/queries_len_std`: the standard deviation of the length of the queries tokens
* `tokens/queries_dist`: the histogram distribution of the length of the queries tokens
* `tokens/responses_len_mean`: the average length of the responses tokens
* `tokens/responses_len_std`: the standard deviation of the length of the responses tokens
* `tokens/responses_dist`: the histogram distribution of the length of the responses tokens
* `objective/logprobs`: the histogram distribution of the log probabilities of the actions taken by the model
* `objective/ref_logprobs`: the histogram distribution of the log probabilities of the actions taken by the reference model.

### Crucial values

* `env/reward_mean`, `env/reward_std`, `env/reward_dist`: the properties of the reward distribution from the "environment" / reward model
* `ppo/mean_non_score_reward`: the mean negated KL penalty during training (shows the delta between the reference model and the new policy over the batch in the step)

Useful parameters to monitor for stability (when these diverge or collapse to 0, try tuning variables):
* `ppo/loss/value`: it will spike or become NaN when training is not going well
* `ppo/policy/ratio`: `ratio` being 1 is a baseline value, meaning that the probability of sampling a token is the same under the new and old policy. If the ratio is too high, like 100, it means the probability of sampling a token is 100 times higher under the new policy than under the old policy. This is a sign that the new policy is too different from the old policy, which will likely cause over-optimization and collapse training later on
* `ppo/policy/clipfrac` and `ppo/policy/approxkl`: if `ratio` is too high, the `ratio` is going to get clipped, resulting in high `clipfrac` and high `approxkl` as well.
* `objective/kl`: it should stay positive so that the policy is not too far away from the reference policy
* `objective/kl_coef`: the target coefficient with `AdaptiveKLController`
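
The text names `AdaptiveKLController` without showing how the coefficient is adapted. The sketch below is an assumption about how such a controller can work, using a clipped proportional update toward a target KL; the class name `AdaptiveKLSketch`, its parameters, and the clipping bound of 0.2 are illustrative and may differ from TRL's implementation.

```python
class AdaptiveKLSketch:
    """Hypothetical proportional controller for the KL coefficient."""

    def __init__(self, init_kl_coef=0.2, target=6.0, horizon=10000):
        self.value = init_kl_coef  # current KL coefficient
        self.target = target  # desired KL divergence
        self.horizon = horizon  # larger values adapt more slowly

    def update(self, current_kl, n_steps):
        # Relative error between observed and target KL, clipped for stability.
        error = max(-0.2, min(0.2, current_kl / self.target - 1.0))
        self.value *= 1.0 + error * n_steps / self.horizon


too_high = AdaptiveKLSketch()
too_high.update(current_kl=12.0, n_steps=256)  # KL above target: coefficient grows

too_low = AdaptiveKLSketch()
too_low.update(current_kl=3.0, n_steps=256)  # KL below target: coefficient shrinks

print(too_high.value, too_low.value)
```

Under this scheme, a steadily rising coefficient indicates that the observed KL divergence keeps exceeding the target.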