# Create dataset for DPO

## Goal

Create a dataset with the required format for DPO fine-tuning:

- https://huggingface.co/docs/trl/main/en/dpo_trainer#expected-dataset-format

The dataframe should have the following fields: prompt, chosen and rejected.
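
As a minimal sketch of that layout, the toy dataframe below has a single row with the three fields. The problem text and both responses are invented for illustration and do not come from the MATH dataset.

```python
import pandas as pd

# Toy example of the expected layout; all texts are made up for illustration.
toy_dpo = pd.DataFrame({
    'prompt': ['What is 2 + 3?'],
    'chosen': ['2 + 3 = 5. The answer is 5.'],
    'rejected': ['2 + 3 = 6. The answer is 6.'],
})
print(toy_dpo.columns.tolist())
```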

Evaluating the model on the MATH test dataset gives me both correct and incorrect answers, and I am free to change the prompt.

## Imports

In [None]:
import pandas as pd
import json
from tqdm.auto import tqdm
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import random

from transformers import AutoTokenizer

plt.plot()
plt.close('all')
plt.rcParams["figure.figsize"] = (20, 5)
mpl.rcParams['lines.linewidth'] = 3
mpl.rcParams['font.size'] = 16

## Data loading

In [None]:
dataset = pd.read_csv('/mnt/hdd0/Kaggle/aimo/external_data/filtered_MATH_test_5.csv')
dataset.head()

In [None]:
len(dataset)

In [None]:
with open('/mnt/hdd0/Kaggle/aimo/experiments/17_vllm/400_repetitions.json', 'r') as file:
    responses = json.load(file)
len(responses)

In [None]:
responses['0'][0]['prompt'][-50:]

## Data filtering

I will use the following definition of good and bad responses:

- A good response is one where the text answer and the code answer are both equal to the ground truth.
- A bad response is one where there is at least a text answer and it differs from the ground truth.
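
To make the edge cases explicit, here is a small sketch of those two rules as a standalone function. The name `classify_response` is hypothetical and the inputs are toy values, not results from the real evaluation. Note that a response with a correct text answer but a wrong or missing code answer is neither good nor bad, so it is discarded.

```python
def classify_response(ground_truth, text_answer, code_answer):
    """Return 'good', 'bad' or None following the two definitions above."""
    if text_answer == ground_truth and code_answer == ground_truth:
        return 'good'
    if text_answer is not None and text_answer != ground_truth:
        return 'bad'
    return None  # neither definition applies, the response is not used

# Toy values for illustration only
print(classify_response(5, 5, 5))     # good
print(classify_response(5, 7, 5))     # bad
print(classify_response(5, 5, None))  # None
print(classify_response(5, None, 5))  # None
```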

In [None]:
good_responses = dict()
bad_responses = dict()

for idx, ground_truth in tqdm(enumerate(dataset['answer']), total=len(dataset)):
    good_responses[idx] = []
    bad_responses[idx] = []
    for result in responses[str(idx)]:
        if ground_truth == result['text_answer'] and ground_truth == result['code_answer']:
            good_responses[idx].append(result['response'])
        if ground_truth != result['text_answer'] and result['text_answer'] is not None:
            bad_responses[idx].append(result['response'])
    # remove repetitions
    good_responses[idx] = list(set(good_responses[idx]))
    bad_responses[idx] = list(set(bad_responses[idx]))

Let's study the distribution of good and bad responses.

In [None]:
plt.bar(np.arange(580), [len(good_responses[idx]) for idx in range(580)], label='Good responses')
plt.bar(np.arange(580), [len(bad_responses[idx]) for idx in range(580)], label='Bad responses', bottom=[len(good_responses[idx]) for idx in range(580)])
plt.xlim(-1, 580)
plt.legend()
plt.xlabel('Question index')
plt.ylabel('Number of responses');

In [None]:
def show_random_responses(idx):
    print('Ground truth:', dataset['answer'][idx])
    print('Good response:')
    print(random.choice(good_responses[idx]))
    print('\n\nBad response:')
    print(random.choice(bad_responses[idx]))

In [None]:
show_random_responses(200)

This looks correct.

Let's also gather the prompts.

In [None]:
max_prompt_length = 300
tokenizer = AutoTokenizer.from_pretrained('/home/gbarbadillo/data/deepseekmath')
unique_prompts = dict()
for idx in range(580):
    results = responses[str(idx)]
    unique_prompts[idx] = list(set([result['prompt'] for result in results]))
    unique_prompts[idx] = [prompt for prompt in unique_prompts[idx] if len(tokenizer.tokenize(prompt)) < max_prompt_length]

## Creating dataset for DPO

At this point I have good and bad responses. I have to create pairs of them.

For this first version of the dataset I'm going to avoid repetitions, so each response is used in at most one pair. In future versions I could create many more pairs by reusing responses.
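
As a rough sketch of what such a future version could look like (this is not what the notebook does below), all good/bad combinations for a problem could be enumerated and then subsampled. The helper `sample_pairs` and its `max_pairs` and `seed` parameters are hypothetical names introduced only for this example.

```python
import itertools
import random

def sample_pairs(good, bad, max_pairs, seed=0):
    """Sample up to max_pairs distinct (chosen, rejected) pairs, reusing responses."""
    all_pairs = list(itertools.product(good, bad))
    rng = random.Random(seed)
    rng.shuffle(all_pairs)
    return all_pairs[:max_pairs]

# Toy responses: 2 good x 3 bad gives 6 possible pairs, against 2 without reuse
pairs = sample_pairs(['g1', 'g2'], ['b1', 'b2', 'b3'], max_pairs=4)
print(len(pairs))  # 4
```

Enumerating the full product is fine for a few hundred responses per problem, but it grows quadratically, so a larger dataset would probably need to sample index pairs instead.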

In [None]:
chosen, rejected, prompt, problem_idx = [], [], [], []
max_pairs_per_problem = 30
for idx in range(580):
    n_pairs = min(len(good_responses[idx]), len(bad_responses[idx]), max_pairs_per_problem)
    if n_pairs > 0 and unique_prompts[idx]:
        chosen.extend(np.random.choice(good_responses[idx], n_pairs, replace=False).tolist())
        rejected.extend(np.random.choice(bad_responses[idx], n_pairs, replace=False).tolist())
        prompt.extend(np.random.choice(unique_prompts[idx], n_pairs, replace=True).tolist())
        problem_idx.extend([idx] * n_pairs)
assert len(chosen) == len(rejected) == len(prompt) == len(problem_idx)
len(chosen)

The prompts were already gathered from the results above, so now let's look at how many pairs each problem contributes.

In [None]:
plt.hist(np.unique(problem_idx, return_counts=True)[1])

Many problems have a lot of good and bad responses.

In [None]:
df = pd.DataFrame({'prompt': prompt, 'chosen': chosen, 'rejected': rejected, 'problem_idx': problem_idx})
df.head()

Let's compute the max prompt length and the max total length, both in tokens.

In [None]:
df['max_prompt_length'] = df['prompt'].apply(lambda x: len(tokenizer.tokenize(x)))
df['chosen_length'] = df['chosen'].apply(lambda x: len(tokenizer.tokenize(x)))
df['rejected_length'] = df['rejected'].apply(lambda x: len(tokenizer.tokenize(x)))

print(f'Max prompt length: {df["max_prompt_length"].max()}')
print(f'Max length: {df["max_prompt_length"].max() + max(df["chosen_length"].max(), df["rejected_length"].max())}')
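
Note that the second number adds the longest prompt to the longest response, so it is an upper bound rather than the length of any actual row. The sketch below shows the difference on made-up token counts; the column names match the ones above but the values are invented.

```python
import pandas as pd

# Invented token counts for three rows
toy = pd.DataFrame({
    'max_prompt_length': [100, 250, 120],
    'chosen_length': [400, 150, 300],
    'rejected_length': [350, 200, 500],
})
longest_response = toy[['chosen_length', 'rejected_length']].max(axis=1)
# Same computation as above: longest prompt plus longest response
upper_bound = toy['max_prompt_length'].max() + longest_response.max()
# Longest prompt + response that actually occurs in a single row
tightest = (toy['max_prompt_length'] + longest_response).max()
print(upper_bound, tightest)  # 750 620
```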

In [None]:
len(df.problem_idx.unique())

In [None]:
df.to_csv('/mnt/hdd0/Kaggle/aimo/external_data/dpo/v0.csv', index=False)
df.head()