# Reward Modeling: Data Preparation Pipeline

**Objective:** To process the `Dahoas/synthetic-instruct-gptj-pairwise` dataset for training a reward model.

1.  **Load & Inspect:** Load the raw dataset and examine its structure (`prompt`, `chosen`, `rejected`).
2.  **Format:** Combine the prompt and responses into a unified "Human-Assistant" dialogue format.
3.  **Tokenize:** Convert the formatted text pairs into token IDs and attention masks suitable for the model.
4.  **Finalize:** Split the data and save it in a format ready for the `RewardTrainer`.

In [38]:
import json
from datasets import load_dataset, DatasetDict
import torch
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification, TrainingArguments
from peft import LoraConfig, TaskType
from trl import RewardTrainer
import matplotlib.pyplot as plt
import os
import warnings

warnings.filterwarnings('ignore')
def warn(*args, **kwargs):
    pass
warnings.warn = warn

In [39]:
# os.chdir('Reward Modeling')

In [40]:
from utils import *


## Data set

We load the data set used to train the reward model: `Dahoas/synthetic-instruct-gptj-pairwise` from Hugging Face, a synthetic data set designed for training and evaluating instruction-following models.

In [41]:
dataset = load_dataset("Dahoas/synthetic-instruct-gptj-pairwise")
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['prompt', 'chosen', 'rejected'],
        num_rows: 33143
    })
})


### Data set features

```Prompt:``` A text prompt that the model should respond to

```Chosen:``` The preferred response to the prompt

```Rejected:``` The less preferred response to the prompt


In [42]:
# Print the first 10 examples: prompt, preferred response, and rejected response.
for i in range(10):
    print('Prompt')
    print(dataset['train'][i]['prompt'], '\n')
    print('-' * 50)

    print('chosen')
    print(dataset['train'][i]['chosen'], '\n')
    print('-' * 50)

    print('rejected')
    print(dataset['train'][i]['rejected'], '\n')
    print('=' * 100)

Prompt
I was wondering if you could walk me through the process of setting up a hydroponic garden for herbs. 

--------------------------------------------------
chosen
Sure! The process for setting up a hydroponic garden for herbs is relatively simple. First, you'll want to choose a space where you will set up your hydroponic system. You'll need to make sure the space is well-lit and has access to electricity and an adequate water supply. Next, you'll need to choose the type of hydroponic system you want to use. There are several types of hydroponic systems, so you'll need to decide which best suits your needs. Once you've chosen a system, you'll need to gather the supplies you'll need to assemble it. This includes things like pumps, growing trays, grow lights, and nutrients. Once you've assembled the system, you'll need to add your choice of herbs to the system. Lastly, you'll need to monitor and adjust the system as needed to ensure your herbs are getting the correct amount of light

## Model and Tokenizer Setup
This section sets up the tokenizer and the model for training. We use GPT-2 with a sequence classification head, which outputs a score for the quality of a response.

In [43]:
from transformers import AutoTokenizer

tokenizer_gpt = AutoTokenizer.from_pretrained("gpt2")

print(f"Original pad token: {tokenizer_gpt.pad_token}")

tokenizer_gpt.pad_token = tokenizer_gpt.eos_token
print(f"New pad token: {tokenizer_gpt.pad_token}")

sentences = ["This is a short sentence.", "This is a much longer sentence, right?"]
encoded_input = tokenizer_gpt(sentences, padding=True, return_tensors="pt")

print("\nInput IDs")
print(encoded_input['input_ids'])

print("\nAttention Mask (the 0s for padding):")
print(encoded_input['attention_mask'])

Original pad token: None
New pad token: <|endoftext|>

Input IDs
tensor([[ 1212,   318,   257,  1790,  6827,    13, 50256, 50256, 50256],
        [ 1212,   318,   257,   881,  2392,  6827,    11,   826,    30]])

Attention Mask (the 0s for padding):
tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1]])
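
The padding behaviour shown above can be reproduced by hand. The following is a minimal sketch, not the tokenizer's actual implementation: `pad_batch` is a hypothetical helper, and the token IDs are copied from the output above.

```python
def pad_batch(sequences, pad_id):
    """Right-pad token ID lists to the longest length and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        # Real tokens keep mask value 1; padding positions get mask value 0.
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Token IDs copied from the output above; 50256 is the EOS ID reused as the pad ID.
short_ids = [1212, 318, 257, 1790, 6827, 13]
long_ids = [1212, 318, 257, 881, 2392, 6827, 11, 826, 30]

ids, mask = pad_batch([short_ids, long_ids], pad_id=50256)
print(ids[0])
print(mask[0])
```

The shorter sentence receives three pad IDs, and the matching mask entries are 0 so the model ignores those positions.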


In [44]:
tokenizer_gpt.eos_token_id

50256

In [45]:
model_name_or_path = "gpt2"

tokenizer = GPT2Tokenizer.from_pretrained(model_name_or_path, use_fast=True)
model = GPT2ForSequenceClassification.from_pretrained(model_name_or_path, num_labels=1)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [46]:
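# GPT-2 has no pad token by default, so reuse the EOS token for padding
# and tell the model which token ID to treat as padding.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id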

max_length = 1024

`Lambda Function`: `get_res` takes the data set and a response type (`chosen` or `rejected`) and combines each prompt with the corresponding response. Each entry is formatted as a dialogue between "Human" and "Assistant".

In [47]:
get_res = lambda dataset, res: [
    "\n\nHuman: " + prompt + "\n\nAssistant: " + resp
    for prompt, resp in zip(dataset["train"]["prompt"], dataset["train"][res])
]

`Chosen Samples`: Apply the `get_res` function to create a list of chosen samples.

`Rejected Samples`: Similarly, create a list of rejected samples using the same function.

Applying the function gives the following results.


In [48]:
chosen_samples = get_res(dataset, 'chosen')
rejected_samples = get_res(dataset, 'rejected')

print('Chosen', chosen_samples[0])
print('Rejected', rejected_samples[0])

Chosen 

Human: I was wondering if you could walk me through the process of setting up a hydroponic garden for herbs.

Assistant: Sure! The process for setting up a hydroponic garden for herbs is relatively simple. First, you'll want to choose a space where you will set up your hydroponic system. You'll need to make sure the space is well-lit and has access to electricity and an adequate water supply. Next, you'll need to choose the type of hydroponic system you want to use. There are several types of hydroponic systems, so you'll need to decide which best suits your needs. Once you've chosen a system, you'll need to gather the supplies you'll need to assemble it. This includes things like pumps, growing trays, grow lights, and nutrients. Once you've assembled the system, you'll need to add your choice of herbs to the system. Lastly, you'll need to monitor and adjust the system as needed to ensure your herbs are getting the correct amount of light, water, and nutrients.
Rejected 

Huma

The function `add_combined_columns` takes an example (a single data point) and adds two new columns:
- `prompt_chosen`: Combines the `prompt` with the `chosen` response in the same labeled format.
- `prompt_rejected`: Combines the `prompt` with the `rejected` response in the same labeled format.

In [49]:
def add_combined_columns(example):
    example['prompt_chosen'] = "\n\nHuman: " + example["prompt"] + "\n\nAssistant: " + example["chosen"]
    example['prompt_rejected'] = "\n\nHuman: " + example["prompt"] + "\n\nAssistant: " + example["rejected"]
    
    return example

In [50]:
dataset['train'] = dataset['train'].map(add_combined_columns)

In [51]:
dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'chosen', 'rejected', 'prompt_chosen', 'prompt_rejected'],
        num_rows: 33143
    })
})

In [52]:
print(dataset['train']['prompt_chosen'][0])



Human: I was wondering if you could walk me through the process of setting up a hydroponic garden for herbs.

Assistant: Sure! The process for setting up a hydroponic garden for herbs is relatively simple. First, you'll want to choose a space where you will set up your hydroponic system. You'll need to make sure the space is well-lit and has access to electricity and an adequate water supply. Next, you'll need to choose the type of hydroponic system you want to use. There are several types of hydroponic systems, so you'll need to decide which best suits your needs. Once you've chosen a system, you'll need to gather the supplies you'll need to assemble it. This includes things like pumps, growing trays, grow lights, and nutrients. Once you've assembled the system, you'll need to add your choice of herbs to the system. Lastly, you'll need to monitor and adjust the system as needed to ensure your herbs are getting the correct amount of light, water, and nutrients.


When using pretrained transformers for classification tasks, the maximum sequence length matters: each model has a fixed maximum token length (1024 tokens for GPT-2), and longer inputs are truncated, potentially losing important information. The helper below finds the length of the longest sample. Note that `len(sample)` counts characters, not tokens, so it is only a rough proxy for token length.


In [53]:
get_max_len = lambda samples: max([len(sample) for sample in samples])
get_max_len

<function __main__.<lambda>(samples)>

In [54]:
print("rejected samples length", get_max_len(rejected_samples))
print("chosen samples length", get_max_len(chosen_samples))

rejected samples length 5011
chosen samples length 3167
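
The lengths printed above are character counts, while the model's limit is expressed in tokens. The sketch below illustrates the difference with `crude_tokenize`, a hypothetical stand-in that splits on words and punctuation. GPT-2's byte-pair tokenizer would produce different counts, so treat the numbers as illustrative only.

```python
import re


def crude_tokenize(text):
    # Hypothetical stand-in for a real tokenizer: one token per word or punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


sample = "\n\nHuman: How do I grow basil?\n\nAssistant: Use a sunny spot."

n_chars = len(sample)                    # what len(sample) measures in the notebook
n_tokens = len(crude_tokenize(sample))   # closer to what the model limit refers to

print("characters:", n_chars)
print("tokens (crude):", n_tokens)
```

With the notebook's tokenizer, the analogous count would be `len(tokenizer(sample)["input_ids"])`. Because a token usually spans several characters, a character limit of 1024 is a conservative filter relative to a 1024-token limit.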


In [55]:
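# Return the indices of pairs to keep. Lengths are measured in characters, not tokens.
# Because of the `or`, a pair is kept when at least one of the two texts is shorter
# than max_length; the longer text is truncated later during tokenization.
find_short = lambda dataset, max_length: [
    i for i, (chosen, rejected) in enumerate(zip(dataset['prompt_chosen'], dataset['prompt_rejected']))
    if len(chosen) < max_length or len(rejected) < max_length
]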
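
To make the filter condition concrete, here is a small sketch on toy strings (the data and the limit of 20 are made up for illustration). It compares the `or` condition used above with a stricter `and` variant, which is an alternative and not what the notebook does.

```python
# Toy (chosen, rejected) pairs; string lengths stand in for real sample lengths.
pairs = [
    ("a" * 10, "b" * 10),  # both texts short
    ("a" * 10, "b" * 50),  # only the chosen text is short
    ("a" * 50, "b" * 50),  # both texts long
]
limit = 20  # hypothetical character limit

# Same condition as find_short: keep the pair if either text is short.
keep_either = [i for i, (c, r) in enumerate(pairs) if len(c) < limit or len(r) < limit]

# Stricter alternative: keep the pair only if both texts are short.
keep_both = [i for i, (c, r) in enumerate(pairs) if len(c) < limit and len(r) < limit]

print(keep_either)
print(keep_both)
```

The `or` version drops a pair only when both texts exceed the limit, so it removes fewer rows than the `and` version would.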

In [56]:
max_length = 1024
subset_indices = find_short(dataset['train'], max_length)
dataset['train'] = dataset['train'].select(subset_indices)

In [57]:
subset_indices[0:10]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [58]:
dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'chosen', 'rejected', 'prompt_chosen', 'prompt_rejected'],
        num_rows: 33043
    })
})


The ```preprocess_function``` tokenizes the ```prompt_chosen``` and ```prompt_rejected``` columns, producing the inputs passed to the RewardTrainer. The ```chosen``` text is the preferred response, and the ```rejected``` text is the less preferred one.
Tokenizing both turns each preference pair into token IDs and attention masks the model can process. Given both ```chosen``` and ```rejected``` inputs, the RewardTrainer can learn to score the better response higher, which is essential for training models to follow instructions effectively.
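
As a rough illustration of why both inputs are needed, the sketch below computes a pairwise preference loss, `-log(sigmoid(r_chosen - r_rejected))`, on made-up reward scores. This assumes the standard pairwise (Bradley-Terry style) objective commonly used for reward models. `pairwise_loss` is a hypothetical helper written for this example, not a TRL function.

```python
import math


def pairwise_loss(reward_chosen, reward_rejected):
    """Compute -log(sigmoid(reward_chosen - reward_rejected)) in a stable form."""
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branch to avoid overflow in exp.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Made-up scalar rewards for one (chosen, rejected) pair.
print(pairwise_loss(2.0, 0.0))  # chosen scored higher: small loss
print(pairwise_loss(0.0, 2.0))  # rejected scored higher: large loss
```

The loss depends only on the difference between the two scores, which is why each training example must carry a chosen and a rejected sequence.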


In [59]:
def preprocess_function(examples):
    
    tokenized_chosen = tokenizer(examples['prompt_chosen'], truncation=True, max_length=max_length, padding="max_length")
    tokenized_rejected = tokenizer(examples['prompt_rejected'], truncation=True, max_length=max_length, padding="max_length")
    
    return {
        "input_ids_chosen": tokenized_chosen["input_ids"],
        "attention_mask_chosen": tokenized_chosen["attention_mask"],
        "input_ids_rejected": tokenized_rejected["input_ids"],
        "attention_mask_rejected": tokenized_rejected["attention_mask"],
    }

In [60]:
example = preprocess_function(dataset['train'][0])
example.keys()

dict_keys(['input_ids_chosen', 'attention_mask_chosen', 'input_ids_rejected', 'attention_mask_rejected'])

In [61]:
train_str = {'chosen': [sample for sample in dataset['train']['prompt_chosen']],
             'rejected': [sample for sample in dataset['train']['prompt_rejected']]
            }

The code applies `preprocess_function` to the training data set using the `map` method, which tokenizes the ```prompt_chosen``` and ```prompt_rejected``` texts.

The `batched=True` parameter allows the function to process multiple examples at once.

The `remove_columns` parameter specifies a list of columns (```prompt```, ```chosen```, ```rejected```, ```prompt_chosen```, ```prompt_rejected```) to be removed from the data set after processing. This ensures that only the tokenized inputs and attention masks generated by `preprocess_function` are retained.
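
The shape of the data under batched processing can be sketched without the library. In the example below, `toy_preprocess` and the toy batch are hypothetical; the point is only that a batched function receives a dictionary of lists (one entry per example) and returns a dictionary of lists of the same length.

```python
def toy_preprocess(batch):
    # In batched mode, each value in `batch` is a list with one entry per example.
    return {
        "len_chosen": [len(text) for text in batch["prompt_chosen"]],
        "len_rejected": [len(text) for text in batch["prompt_rejected"]],
    }


# A toy batch of two examples, laid out column-wise as a batched function sees it.
batch = {
    "prompt_chosen": ["aa", "bbbb"],
    "prompt_rejected": ["c", "ddd"],
}

out = toy_preprocess(batch)
print(out)
```

The notebook's `preprocess_function` follows the same pattern: the tokenizer accepts a list of strings and returns lists of token IDs and masks, one per example.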


In [62]:
dataset['train'] = dataset['train'].map(preprocess_function, 
                                        batched=True, 
                                        remove_columns=['prompt', "chosen", "rejected", 'prompt_chosen', 
                                                        'prompt_rejected'])

In [63]:
dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids_chosen', 'attention_mask_chosen', 'input_ids_rejected', 'attention_mask_rejected'],
        num_rows: 33043
    })
})

In [64]:
dataset.column_names

{'train': ['input_ids_chosen',
  'attention_mask_chosen',
  'input_ids_rejected',
  'attention_mask_rejected']}

Split the data set into training and testing data sets.


In [65]:
split_dataset = dataset['train'].train_test_split(test_size=0.2)
split_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids_chosen', 'attention_mask_chosen', 'input_ids_rejected', 'attention_mask_rejected'],
        num_rows: 26434
    })
    test: Dataset({
        features: ['input_ids_chosen', 'attention_mask_chosen', 'input_ids_rejected', 'attention_mask_rejected'],
        num_rows: 6609
    })
})
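
The split sizes above can be checked with simple arithmetic. This sketch assumes the test share is rounded up to a whole number of rows, which is consistent with the counts shown in the output.

```python
import math

n_rows = 33043    # rows remaining after the length filter
test_size = 0.2   # same fraction passed to train_test_split

n_test = math.ceil(test_size * n_rows)  # assumption: test share is rounded up
n_train = n_rows - n_test

print("train rows:", n_train)
print("test rows:", n_test)
```

Since no seed is passed, the rows assigned to each split can differ between runs. Passing a `seed` argument to `train_test_split` (the value is up to you) makes the split reproducible.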