<a href="https://colab.research.google.com/github/mshojaei77/Awesome-Fine-tuning/blob/main/%E2%9A%A1Online_DPO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ⚡ Online DPO with TRL

In [None]:
# Install TRL from source (this command installs the repository's default branch)
!pip install git+https://github.com/huggingface/trl.git

Cloning into 'trl'...
remote: Enumerating objects: 8567, done.
remote: Counting objects: 100% (1669/1669), done.
remote: Compressing objects: 100% (423/423), done.
remote: Total 8567 (delta 1471), reused 1344 (delta 1231), pack-reused 6898
Receiving objects: 100% (8567/8567), 7.08 MiB | 23.76 MiB/s, done.
Resolving deltas: 100% (5926/5926), done.


In [None]:
from dataclasses import dataclass
from typing import Optional

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

from trl import HfPairwiseJudge, ModelConfig
from trl.trainer import OnlineDPOConfig, OnlineDPOTrainer
from trl.trainer.utils import SIMPLE_QUERY_CHAT_TEMPLATE

# 1. Prepare the dataset for online DPO

### Section 1: Prepare Dataset

This section loads the prompt dataset and pre-tokenizes it, so that the training loop only has to collate batches. In Online Direct Preference Optimization (ODPO), the model generates its own responses during training, which is why only the prompt field of the dataset is tokenized here.

1. **Load the Dataset**: Start by loading the dataset that contains the prompts used for training. In the provided code, this is done with the `load_dataset` function, which retrieves the dataset named by `dataset_name` in the `ScriptArguments` class.

2. **Sanity Check**: When `sanity_check` is enabled, the code keeps only the first 1024 entries of each split. This validates the training pipeline quickly without processing the full dataset, which makes debugging faster.

3. **Tokenization**: The dataset must be tokenized before training. Tokenization is the process of converting text into a sequence of tokens that the model can understand. In the `prepare_dataset` function, each entry in the dataset is tokenized using the provided `tokenizer`. The key parameters include:
   - **Input Text Field**: The specific field in the dataset containing the text to be tokenized, specified by `dataset_text_field`.
   - **Padding**: The function disables padding during tokenization to avoid unnecessary processing, as padding can be applied later during collation if needed.
   - **Multiprocessing**: Tokenization is parallelized using multiple processes (`num_proc=4`) to speed up the process.

   This pre-tokenization approach ensures that the data is efficiently processed and ready for training, avoiding repeated tokenization during each training step.

4. **Prepare Train and Evaluation Sets**:
   - **Training Dataset**: The training split, specified by `dataset_train_split`, is passed through `prepare_dataset`, producing the tokenized prompts that the model will generate responses for during training.
   - **Evaluation Dataset**: If a validation or test split is provided (`dataset_test_split`), it is prepared with the same function, so the model can be evaluated on prompts it was not trained on.

After these steps, each split contains only an `input_ids` column of unpadded token sequences. Because tokenization happens once up front, the training loop only needs to collate and pad batches.
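
The split between tokenizing up front and padding at collation time can be illustrated without any libraries. The sketch below is not the notebook's pipeline: `toy_tokenize`, `left_pad_collate`, `VOCAB`, and `PAD_ID` are hypothetical stand-ins for a real tokenizer and data collator. It pads on the left, mirroring the `padding_side="left"` setting used later in the notebook, so that each prompt ends right where generation starts.

```python
from typing import Dict, List

# Hypothetical toy vocabulary and pad id, standing in for a real tokenizer.
PAD_ID = 0
VOCAB: Dict[str, int] = {}


def toy_tokenize(text: str) -> List[int]:
    """Map each whitespace-separated word to an integer id (ids start at 1)."""
    return [VOCAB.setdefault(word, len(VOCAB) + 1) for word in text.split()]


def left_pad_collate(batch: List[List[int]]) -> Dict[str, List[List[int]]]:
    """Pad every sequence on the left to the longest length in the batch."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [[PAD_ID] * (max_len - len(ids)) + ids for ids in batch]
    attention_mask = [[0] * (max_len - len(ids)) + [1] * len(ids) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Step 1: tokenize once, without padding (done before training).
prompts = ["summarize this post", "tl;dr"]
tokenized = [toy_tokenize(p) for p in prompts]

# Step 2: pad only when a batch is assembled (done during training).
batch = left_pad_collate(tokenized)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than per dataset means short prompts are only padded up to the longest prompt in their own batch.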

In [None]:
################
# Dataset
################

def prepare_dataset(dataset, tokenizer, dataset_text_field):
    """pre-tokenize the dataset before training; only collate during training"""

    def tokenize(element):
        outputs = tokenizer(
            element[dataset_text_field],
            padding=False,
        )
        return {"input_ids": outputs["input_ids"]}

    return dataset.map(
        tokenize,
        remove_columns=dataset.column_names,
        batched=True,
        num_proc=4,  # multiprocessing.cpu_count(),
        load_from_cache_file=False,
    )


@dataclass
class ScriptArguments:
    dataset_name: str = "trl-internal-testing/tldr-preference-sft-trl-style"
    dataset_text_field: str = "prompt"
    dataset_train_split: str = "train"
    dataset_test_split: Optional[str] = "validation"
    max_length: int = 512
    sanity_check: bool = True
    response_length: int = 53
    stop_token: str = "eos"
    non_eos_penalty: bool = False

args = ScriptArguments()


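# NOTE: `tokenizer` is created in the "Model & Tokenizer" cell further down;
# run that cell before this one, otherwise `tokenizer` is undefined here.
raw_datasets = load_dataset(args.dataset_name)
if args.sanity_check:
    for key in raw_datasets:
        # Keep at most 1024 rows; min() guards against splits smaller than that.
        num_rows = min(1024, len(raw_datasets[key]))
        raw_datasets[key] = raw_datasets[key].select(range(num_rows))
train_dataset = raw_datasets[args.dataset_train_split]
train_dataset = prepare_dataset(train_dataset, tokenizer, args.dataset_text_field)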

if args.dataset_test_split is not None:
    eval_dataset = raw_datasets[args.dataset_test_split]
    eval_dataset = prepare_dataset(eval_dataset, tokenizer, args.dataset_text_field)
else:
    eval_dataset = None

# 2. Define the model and Tokenizer

### Section 2: Prepare Model

Once the dataset is ready, the next step is to set up the model and tokenizer for training with Online Direct Preference Optimization (ODPO). The provided code outlines how to configure and prepare the necessary components to effectively implement ODPO.

1. **Configure Model and Tokenizer**:
   - **Model Configuration**: Begin by defining the configuration for ODPO using the `OnlineDPOConfig` class. This includes specifying paths to the pre-trained models (`sft_model_path` and `reward_model_path`), the output directory for saving the trained models, learning rate, batch size, and the total number of training episodes.
   - **Tokenizer Setup**: The tokenizer is initialized using `AutoTokenizer.from_pretrained`, which loads the tokenizer associated with the pre-trained model path defined in `ModelConfig`. Special tokens, like the padding token, are added to ensure the tokenizer handles inputs correctly. If a chat template is not provided, it defaults to `SIMPLE_QUERY_CHAT_TEMPLATE`, ensuring consistent input formatting.

2. **Load Pre-trained Models**:
   - **Language Model**: Load the pre-trained language model using `AutoModelForCausalLM.from_pretrained`, which retrieves the model specified in `model_config.model_name_or_path`. The code loads two separate copies of this checkpoint: `model`, the policy that is trained, and `ref_model`, the reference model whose outputs serve as the baseline in the ODPO loss.
   - **Reward Model**: If a reward model path is provided, load it using `AutoModelForSequenceClassification.from_pretrained`, specifying `num_labels=1` since it outputs a single scalar value representing the reward. This reward model will be used to evaluate the quality of generated responses during training.
   - **Judge Model**: If the configuration includes a judge model, instantiate it using `HfPairwiseJudge`. This model will be responsible for providing pairwise comparisons of generated responses, which are crucial for aligning the model during ODPO training.

3. **Model Preparation**:
   - **Pre-trained Model Selection**: The code selects `EleutherAI/pythia-14m` as the pre-trained model, a very small checkpoint that keeps experimentation fast. Both the main model and the reference model are initialized from it.
   - **Reward Model Integration**: The reward model scores the generated responses inside the training loop, and those scores determine which response is treated as preferred. Note that the configuration points `reward_model_path` at the same `EleutherAI/pythia-14m` causal language model checkpoint, which contains no trained classification head, so the scalar head is newly initialized and its scores are not meaningful. This is enough to exercise the pipeline; a trained reward model is needed for real alignment.
   - **Judge Model Integration**: The judge is optional. When present, it compares pairs of generated responses and reports which one it prefers. The configuration in this notebook does not set `judge` explicitly, so whether a judge is created depends on the default value of `config.judge`.

By following these steps, the model and tokenizer are properly configured and ready for the ODPO training phase. This setup ensures that the model is capable of receiving and processing feedback, which is essential for iteratively improving alignment throughout the training process.
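
The difference between the two feedback sources can be sketched in plain Python: a reward model assigns each response a scalar, while a pairwise judge returns the index of the response it prefers. Everything below is hypothetical and for illustration only. `prefer_by_reward`, `toy_brevity_judge`, and `to_preference_pair` are not TRL functions, the reward values are made up, and the toy judge simply prefers the shorter response.

```python
from typing import List, Tuple


def prefer_by_reward(rewards: Tuple[float, float]) -> int:
    """Return the index (0 or 1) of the response with the higher scalar reward."""
    return 0 if rewards[0] >= rewards[1] else 1


def toy_brevity_judge(prompt: str, completions: List[str]) -> int:
    """Hypothetical pairwise judge: prefers the shorter of two completions."""
    return 0 if len(completions[0]) <= len(completions[1]) else 1


def to_preference_pair(completions: List[str], preferred: int) -> Tuple[str, str]:
    """Return (chosen, rejected) given the index of the preferred completion."""
    return completions[preferred], completions[1 - preferred]


prompt = "Summarize the post."
completions = ["Short summary.", "A much longer, rambling summary of the post."]

# Made-up scalar rewards, one per completion.
toy_rewards = (0.2, 1.5)

by_reward = to_preference_pair(completions, prefer_by_reward(toy_rewards))
by_judge = to_preference_pair(completions, toy_brevity_judge(prompt, completions))

print("reward model prefers:", by_reward[0])
print("judge prefers:", by_judge[0])
```

Either source ends up producing the same kind of training signal, a (chosen, rejected) pair, even though the two can disagree about which response is better.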

In [None]:
################
# Model & Tokenizer
################

config = OnlineDPOConfig(
    sft_model_path="EleutherAI/pythia-14m",
    reward_model_path="EleutherAI/pythia-14m",
    output_dir="models/minimal/online_dpo_llmjudge",
    learning_rate=3e-6,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,
    total_episodes=30000,
)

model_config = ModelConfig(
    model_name_or_path="EleutherAI/pythia-14m",
)


tokenizer = AutoTokenizer.from_pretrained(
    model_config.model_name_or_path,
    padding_side="left",
    trust_remote_code=True,
)
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
if tokenizer.chat_template is None:
    tokenizer.chat_template = SIMPLE_QUERY_CHAT_TEMPLATE

ref_model = AutoModelForCausalLM.from_pretrained(model_config.model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(model_config.model_name_or_path)

if config.reward_model_path is not None:
    reward_model = AutoModelForSequenceClassification.from_pretrained(config.reward_model_path, num_labels=1)
else:
    reward_model = None

if config.judge is not None:
    judge = HfPairwiseJudge()
else:
    judge = None
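
As a rough check on the configuration above, the batch-related settings can be turned into an effective batch size and an approximate number of optimizer updates. This is a back-of-the-envelope sketch under stated assumptions: training runs on a single device (`num_devices = 1` is not in the original), one episode corresponds to one prompt, and the trainer derives its update count by integer division. The trainer's actual bookkeeping may differ.

```python
# Values copied from the OnlineDPOConfig in this notebook.
per_device_train_batch_size = 1
gradient_accumulation_steps = 64
total_episodes = 30_000

# Assumption: a single GPU or CPU process.
num_devices = 1

# Prompts consumed per optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Approximate number of optimizer updates over the whole run.
num_updates = total_episodes // effective_batch_size

print(effective_batch_size, num_updates)
```

With these settings each update averages gradients over 64 prompts, so the small per-device batch size limits memory use rather than the effective batch size.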


# 3. Train the model

### Section 3: Training

With the dataset and model prepared, the training phase can now commence. This phase involves leveraging the Online Direct Preference Optimization (ODPO) process to iteratively improve the alignment of the model based on the feedback from the reward and judge models.

1. **Initialize Training**:
   - The training process begins by setting up the `OnlineDPOTrainer` with the prepared model, reference model, reward model, and judge model. The `trainer` is configured with the provided `config` settings, which include important parameters like learning rate, batch size, and the number of training episodes.
   - The training and evaluation datasets, which have been pre-tokenized and processed, are also passed to the trainer. The tokenizer is included to ensure that any additional text processing during training is handled consistently.

2. **Training Loop**:
   - The `trainer.train()` method starts the ODPO training loop. In each iteration, the model being trained samples a pair of responses for every prompt in the batch.
   - The reward model scores the two responses, or the judge picks the one it prefers, which yields a chosen and a rejected response for each prompt. The reference model does not generate competing responses; it supplies the baseline log-probabilities that the loss compares the trained model against.

3. **Model Updates**:
   - The feedback from the judge or reward model is used to update the model’s parameters. The ODPO objective raises the likelihood of the chosen response relative to the rejected one, so the preferences come from the reward or judge model at training time rather than from a pre-annotated dataset.
   - The reference model (`ref_model`) stabilizes the updates by serving as a fixed baseline for comparison, which keeps the main model from drifting too far from its original behavior.

4. **Iterate**:
   - The process of generating responses, receiving feedback, and updating the model is repeated for the specified number of episodes (`total_episodes`). Each iteration helps the model gradually refine its ability to generate preferred responses, improving its alignment over time.

5. **Monitor and Adjust**:
   - Throughout the training process, it’s important to monitor the model’s performance. This can be done by evaluating metrics such as the win rates of the preferred responses. Monitoring ensures that the model continues to improve and does not overfit or diverge from the desired output quality.
   - If necessary, adjustments to the training parameters or model configuration can be made mid-training to better align the model's performance with the desired outcomes.

6. **Evaluate and Fine-tune**:
   - After the training loop completes, the model should be evaluated on the separate validation dataset (`eval_dataset`). This evaluation helps determine how well the model generalizes to new prompts and whether it meets the alignment goals.
   - Depending on the evaluation results, fine-tuning might be required to address any remaining issues or to further refine specific aspects of the model’s behavior.

Together, these steps form an iterative loop in which the model is repeatedly sampled, scored, and updated, moving its outputs toward the responses that the reward or judge model prefers. `OnlineDPOTrainer` packages this loop behind a single `train()` call.
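
The update described in step 3 can be made concrete with the standard DPO loss for a single preference pair. The function below is a minimal sketch in plain Python, not the code inside `OnlineDPOTrainer`. The log-probabilities are made-up numbers, and `beta=0.1` is an illustrative value, not a setting taken from this notebook.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not taken from the notebook
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # How much more likely each response is under the policy than under the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == softplus(-margin), written in a numerically stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: the margin is zero and the loss is log(2).
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has moved toward the chosen response and away from the rejected one.
loss_after_update = dpo_loss(-9.0, -13.0, -10.0, -12.0)

print(loss_at_start, loss_after_update)
```

The loss depends only on how the policy's log-probabilities differ from the reference model's, which is why the reference model acts as an anchor: the loss falls below `log(2)` only when the policy favors the chosen response more strongly than the reference does.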

In [None]:
################
# Training
################

trainer = OnlineDPOTrainer(
    model=model,
    config=config,
    ref_model=ref_model,
    reward_model=reward_model,
    judge=judge,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)
trainer.train()