<a href="https://colab.research.google.com/github/mattdepaolis/llm-tutorials/blob/main/Build_a_High_Quality_DPO_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build a High-Quality DPO Dataset

To effectively implement Direct Preference Optimization (DPO), it's essential to curate a dataset of high-quality preference pairs. Several notable datasets can serve as valuable resources:

- argilla/distilabel-math-preference-dpo: Developed by Argilla using the Distilabel framework, this dataset comprises approximately 2,418 entries. Each entry includes a math-related instruction, two model-generated responses, and corresponding quality ratings, facilitating the enhancement of mathematical reasoning in language models.

- argilla/distilabel-intel-orca-dpo-pairs: This dataset is a "distilabeled" version of the widely used Intel/orca_dpo_pairs. It has been improved using the Distilabel framework to enhance the quality of preference pairs, making it suitable for fine-tuning models with diverse preference data.

- argilla/ultrafeedback-binarized-preferences-cleaned: This dataset offers cleaned and binarized preference pairs, providing a refined resource for training models to understand and prioritize user preferences effectively.

- M4-ai/prm_dpo_pairs_cleaned: Containing cleaned DPO pairs, this dataset aids in fine-tuning models to align with preferred responses, enhancing their decision-making capabilities.

- jondurbin/truthy-dpo-v0.1: Focused on truthfulness, this dataset provides preference pairs that help models discern and prioritize truthful information, crucial for maintaining accuracy and reliability.

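- unalignment/toxic-dpo-v0.2: Per its dataset card, this dataset contains deliberately toxic and harmful content and was built to demonstrate how easily DPO can remove a model's refusal behavior. Its chosen responses are the harmful ones, so including it steers a model toward compliance rather than away from toxic output.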

- argilla/Capybara-Preferences: A collection of preference pairs tailored to specific tasks, this dataset assists in fine-tuning models for specialized applications, enhancing their adaptability and performance.

By selecting the highest-rated responses from these datasets, we can curate a collection of superior preference pairs, thereby enhancing the effectiveness of DPO fine-tuning.

## Code Implementation

Let's dive into the code to see how we can achieve this. We'll use the Hugging Face `datasets` library to load and manipulate the data.

### 1. Import Necessary Libraries

In [None]:
from datasets import load_dataset, Dataset, concatenate_datasets as hf_concatenate_datasets, DatasetDict, Features, Value

### 2. Load Datasets

In [None]:
# Load datasets
datasets = {
    "math_preference": load_dataset("argilla/distilabel-math-preference-dpo"),
    "intel_orca": load_dataset("argilla/distilabel-intel-orca-dpo-pairs"),
    "ultrafeedback_binarized": load_dataset("argilla/ultrafeedback-binarized-preferences-cleaned"),
    "prm_dpo": load_dataset("M4-ai/prm_dpo_pairs_cleaned"),
    "truthy_dpo": load_dataset("jondurbin/truthy-dpo-v0.1"),
    "toxic_dpo": load_dataset("unalignment/toxic-dpo-v0.2"),
    "capybara": load_dataset("argilla/Capybara-Preferences"),
}

### 3. Define a Consistent Schema

In [None]:
# Define the consistent schema
consistent_features = Features({
    "origin": Value("string"),
    "chosen": [{"content": Value("string"), "role": Value("string")}],
    "rejected": [{"content": Value("string"), "role": Value("string")}],
    "prompt": Value("string"),
})
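
For illustration, a record matching this schema could look like the one below. The sample values and the `matches_schema` helper are hypothetical and not part of the notebook. The helper is only a plain-Python sketch of the shape that `Features` describes, and it is stricter than `Features` in that it rejects `None` values.

```python
# Hypothetical sample record shaped like `consistent_features`.
sample_record = {
    "origin": "math_preference",
    "prompt": "What is 2 + 2?",
    "chosen": [
        {"content": "What is 2 + 2?", "role": "user"},
        {"content": "2 + 2 = 4.", "role": "assistant"},
    ],
    "rejected": [
        {"content": "What is 2 + 2?", "role": "user"},
        {"content": "2 + 2 = 5.", "role": "assistant"},
    ],
}


def matches_schema(record):
    """Return True if `record` has the four expected fields with the expected shapes."""

    def is_message_list(value):
        # A message list is a list of {"content": str, "role": str} dicts.
        return isinstance(value, list) and all(
            isinstance(message, dict)
            and set(message) == {"content", "role"}
            and all(isinstance(v, str) for v in message.values())
            for message in value
        )

    return (
        set(record) == {"origin", "chosen", "rejected", "prompt"}
        and isinstance(record["origin"], str)
        and isinstance(record["prompt"], str)
        and is_message_list(record["chosen"])
        and is_message_list(record["rejected"])
    )


print(matches_schema(sample_record))
```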

### 4. Transform Examples Function

In [None]:
# Function to transform the 'chosen' and 'rejected' features into lists of dictionaries
def transform_example(example):
    if 'prompt' in example and 'chosen' in example:
        example['chosen'] = [
            {"content": example['prompt'], "role": "user"},
            {"content": example['chosen'], "role": "assistant"}
        ]
    if 'prompt' in example and 'rejected' in example:
        example['rejected'] = [
            {"content": example['prompt'], "role": "user"},
            {"content": example['rejected'], "role": "assistant"}
        ]
    return example
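
To make the before and after shapes concrete, here is a minimal sketch of the same conversion on a toy record. The `to_messages` helper and the sample values are assumptions made for this example. The sketch mirrors what `transform_example` does for datasets that store `chosen` and `rejected` as plain strings.

```python
def to_messages(prompt, response):
    """Wrap a prompt/response pair as a two-turn conversation."""
    return [
        {"content": prompt, "role": "user"},
        {"content": response, "role": "assistant"},
    ]


# Hypothetical flat record, as found in datasets with string responses.
record = {"prompt": "Name a prime number.", "chosen": "7", "rejected": "8"}

# Build the converted record without mutating the input.
converted = {
    **record,
    "chosen": to_messages(record["prompt"], record["chosen"]),
    "rejected": to_messages(record["prompt"], record["rejected"]),
}

print(converted["chosen"])
print(converted["rejected"])
```

Note that the conversion is not idempotent: applying it to a record whose `chosen` field is already a message list would nest that list inside the assistant turn, so it should only run on datasets with flat string responses.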

### 5. Align Dataset Features

In [None]:
# Align dataset features
def align_features(dataset, source_name):
    aligned_data = {
        feature: dataset[feature] if feature in dataset.column_names else [None] * len(dataset)
        for feature in consistent_features
    }
    aligned_data["origin"] = [source_name] * len(dataset)
    return Dataset.from_dict(aligned_data, features=consistent_features)
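
The alignment logic can be sketched without the `datasets` library by treating a dataset as a dictionary of equal-length columns. `TARGET_COLUMNS`, `align_columns`, and the toy data are hypothetical names used only for this illustration.

```python
TARGET_COLUMNS = ["origin", "chosen", "rejected", "prompt"]


def align_columns(columns, source_name):
    """Keep only TARGET_COLUMNS, pad missing ones with None, and tag the source."""
    num_rows = len(next(iter(columns.values()))) if columns else 0
    aligned = {
        name: list(columns[name]) if name in columns else [None] * num_rows
        for name in TARGET_COLUMNS
    }
    # The origin column is always overwritten with the source name.
    aligned["origin"] = [source_name] * num_rows
    return aligned


# Toy column store: it has an extra column and lacks "rejected".
toy_columns = {
    "prompt": ["p1", "p2"],
    "chosen": ["c1", "c2"],
    "rating": [9, 10],
}

aligned = align_columns(toy_columns, "toy_source")
print(aligned)
```

Extra columns such as `rating` are dropped and missing columns are filled with `None`, which is what lets datasets with different layouts be concatenated later.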

### 6. Preprocess Datasets

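We preprocess each dataset individually: filter on the rating or correctness fields that the dataset provides, rename columns to `prompt`, `chosen`, and `rejected` where needed, and convert flat string responses into message lists.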

#### 6.1 Capybara Dataset

In [None]:
# Capybara dataset
datasets['capybara']['train'] = datasets['capybara']['train']\
    .filter(lambda x: x['chosen_rating'] is not None and float(x['chosen_rating']) >= 5)\
    .map(lambda x: {'prompt': x['chosen'][0]['content'] if x['chosen'] else "", **x})
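
The filter-then-derive pattern used in this step can be tried on toy rows in plain Python. The helpers `is_high_rated` and `first_user_turn` and the sample rows are hypothetical. The sketch assumes that a missing rating is stored as `None` and that the first message of `chosen` is the user prompt.

```python
def is_high_rated(row, threshold=5.0):
    """True when the chosen response has a rating at or above the threshold."""
    rating = row.get("chosen_rating")
    return rating is not None and float(rating) >= threshold


def first_user_turn(row):
    """Return the content of the first message in `chosen`, or "" if empty."""
    chosen = row.get("chosen") or []
    return chosen[0]["content"] if chosen else ""


def make_row(rating, question):
    # Small factory for toy rows in the conversation format.
    return {
        "chosen_rating": rating,
        "chosen": [
            {"content": question, "role": "user"},
            {"content": "An answer.", "role": "assistant"},
        ],
    }


rows = [make_row(5, "Q1"), make_row(3, "Q2"), make_row(None, "Q3")]

# Keep high-rated rows and add a derived `prompt` field to each.
kept = [{**row, "prompt": first_user_turn(row)} for row in rows if is_high_rated(row)]
print([row["prompt"] for row in kept])
```

Checking for `None` before comparing matters because comparing `None` with a number raises a `TypeError` in Python 3.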

#### 6.2 PRM DPO Dataset

In [None]:
# PRM DPO dataset
datasets['prm_dpo']['train'] = datasets['prm_dpo']['train']\
    .filter(lambda x: x['is_chosen_correct'])\
    .map(transform_example)

#### 6.3 Ultrafeedback Binarized Dataset

In [None]:
# Ultrafeedback binarized dataset
datasets['ultrafeedback_binarized']['train'] = datasets['ultrafeedback_binarized']['train']\
    .filter(lambda x: x['chosen-rating'] is not None and x['chosen-rating'] >= 5)

#### 6.4 Intel ORCA Dataset

In [None]:
# Intel ORCA dataset
datasets['intel_orca']['train'] = datasets['intel_orca']['train']\
    .rename_column('input', 'prompt')\
    .filter(lambda x: x['rating'] is not None and x['rating'][0] >= 10 and x['rating'][1] >= 10)\
    .filter(lambda x: not x.get('in_gsm8k_train', False))\
    .map(transform_example)

#### 6.5 Math Preference Dataset

In [None]:
# Math preference dataset
datasets['math_preference']['train'] = datasets['math_preference']['train']\
    .rename_column('instruction', 'prompt')\
    .rename_column('chosen_response', 'chosen')\
    .rename_column('rejected_response', 'rejected')\
    .filter(lambda x: x['chosen_rating'] is not None and x['chosen_rating'] >= 9)\
    .map(transform_example)

#### 6.6 Truthy DPO and Toxic DPO Datasets

In [None]:
# Truthy DPO and Toxic DPO datasets
datasets['truthy_dpo'] = datasets['truthy_dpo'].map(transform_example)
datasets['toxic_dpo'] = datasets['toxic_dpo'].map(transform_example)

### 7. Align and Collect All Datasets

In [None]:
# Align and collect all datasets
all_datasets = []
for name, dataset_dict in datasets.items():
    for split, dataset in dataset_dict.items():
        aligned_dataset = align_features(dataset, name)
        all_datasets.append(aligned_dataset)

### 8. Concatenate All Datasets

In [None]:
# Concatenate all datasets
combined_dataset = hf_concatenate_datasets(all_datasets)

### 9. Create the Final Dataset

In [None]:
# Create the final dataset
final_dataset = DatasetDict({'train': combined_dataset})

### 10. Verify the Dataset

In [None]:
# Print the combined dataset schema and a few rows to verify
print(final_dataset)
print(final_dataset['train'][:1])
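
Beyond printing the schema, it can help to check how many rows each source contributed. The sketch below uses a hypothetical `count_by_origin` helper on toy rows. Since iterating over a Hugging Face `Dataset` yields one dictionary per row, passing `final_dataset['train']` to it should work as well, though that is an assumption not exercised here.

```python
from collections import Counter


def count_by_origin(rows):
    """Count rows per value of the `origin` field."""
    return Counter(row["origin"] for row in rows)


# Toy rows standing in for the combined dataset.
toy_rows = [
    {"origin": "capybara"},
    {"origin": "intel_orca"},
    {"origin": "capybara"},
    {"origin": "truthy_dpo"},
]

counts = count_by_origin(toy_rows)
for origin, num_rows in counts.most_common():
    print(f"{origin}: {num_rows}")
```

A source that shows zero or very few rows usually means its filter threshold was too strict or its column names did not match the expected ones.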