# DPO trainer for LLM alignment

## Notebook introduction: aligning an LLM with the DPO trainer

In this notebook, we walk step by step through aligning a state-of-the-art LLM using Direct Preference Optimization (DPO). You don't need to be an expert in machine learning or natural language processing to follow along: our approach focuses on simplicity and effectiveness.

### First, we select the model we wish to align, along with its matching tokenizer and generation config

In [7]:
import os
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
from transformers import logging

# Silence everything below CRITICAL; set_verbosity is documented to take an integer level
logging.set_verbosity(logging.CRITICAL)

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
# The tokenizer is passed to the training function by name; it shares the model's repository ID
tokenizer = model_name
generation_config = GenerationConfig.from_pretrained(model_name)

### Next, to run with MLRun, we create an MLRun project and register an MLRun function

In [8]:
import mlrun

project = mlrun.get_or_create_project(
    name="dpo-trainer-test",
    context="./",
    user_project=True,
)

> 2024-04-01 16:49:17,440 [info] Project loaded successfully: {'project_name': 'dpo-trainer-test'}


In [9]:
project.set_function(
    "huggingface_dpo_trainer.py",
    name="dpo-trainer",
    kind="local",
    handler="dpo_train",
)
project.save()

<mlrun.projects.project.MlrunProject at 0x7f46038f9f10>

### We can set any configuration or parameter we want, including training arguments and hyperparameters, and pass them to the function

In [10]:
train_dataset = "reciprocate/ultrafeedback_cleaned_high_dpo"
eval_dataset = "reciprocate/ultrafeedback_cleaned_high_dpo"
training_arguments = {
    "evaluation_strategy": "steps",
    "do_eval": True,
    "optim": "paged_adamw_8bit",
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 1,
    "per_device_eval_batch_size": 1,
    "log_level": "info",
    "save_steps": 1,
    "learning_rate": 5e-7,
    "eval_steps": 1,
    "num_train_epochs": 1,
    # max_steps takes precedence over num_train_epochs (see the run output)
    "max_steps": 1,
    "warmup_steps": 1,
    "fp16": True,
    "lr_scheduler_type": "cosine",
    # The DPO data collator expects False here; the run output shows the trainer overriding this value
    "remove_unused_columns": True,
    "gradient_checkpointing": True,
}
params = {
    "model": model_name,
    "tokenizer": tokenizer,
    "train_dataset": train_dataset,
    "eval_dataset": eval_dataset,
    "peft_config": True,
    "training_config": training_arguments,
    "use_cuda": True,
    "beta": 0.1,
}
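
The `beta` value in `params` scales the implicit reward in the DPO objective. As a rough illustration of what the trainer optimizes, the sketch below computes the standard sigmoid DPO loss for a single preference pair in plain Python. The function name `dpo_loss` and the log-probability values are hypothetical and chosen only for illustration; the actual `dpo_train` handler is not shown in this notebook and may differ.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response
    under either the policy being trained or the frozen reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # beta controls how strongly the preference margin is rewarded.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities, for illustration only.
loss_at_reference = dpo_loss(-10.0, -12.0, -10.0, -12.0, beta=0.1)
loss_after_improvement = dpo_loss(-9.0, -13.0, -10.0, -12.0, beta=0.1)
```

When the policy matches the reference model, the margin is zero and the loss equals `log(2)`. The loss falls below that value once the policy favors the chosen response more than the reference does.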

### Now we simply run the function

In [None]:
training_run = mlrun.run_function(
    function="dpo-trainer",
    name="dpo-trainer",
    local=True,
    params=params,
    handler="dpo_train",
    outputs=["model"],
)

> 2024-04-01 16:49:20,738 [info] Storing function: {'name': 'dpo-trainer', 'uid': 'b4ed0d2bdc8c4e44892aee1a3549969d', 'db': 'http://mlrun-api:8080'}


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

trainable params: 7241732096 || all params: 7241732096 || trainable%: 100.0


When using DPODataCollatorWithPadding, you should set `remove_unused_columns=False` in your TrainingArguments we have set it for you, but you should do it yourself in the future.
Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an `accelerate.DataLoaderConfiguration` instead: 
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
You have loaded a model on multiple GPUs. `is_model_parallel` attribute will be force-set to `True` to avoid any unexpected behavior such as device placement mismatching.
max_steps is given, it will override any value given in num_train_epochs
Using auto half precision backend
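
The `max_steps` message above matters for the learning-rate schedule as well as for the run length. The sketch below assumes the trainer applies the usual linear warmup followed by a half-cosine decay, as `transformers.get_cosine_schedule_with_warmup` does. The helper name `cosine_warmup_multiplier` and the longer 100-step schedule are hypothetical, added only for comparison.

```python
import math


def cosine_warmup_multiplier(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear warmup, then half-cosine decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


base_lr = 5e-7  # "learning_rate" from training_arguments

# Settings from this notebook: max_steps=1, warmup_steps=1.
smoke_test_lrs = [base_lr * cosine_warmup_multiplier(s, 1, 1) for s in range(1)]

# Hypothetical longer run, for comparison only: 10 warmup steps out of 100.
longer_run_lrs = [base_lr * cosine_warmup_multiplier(s, 10, 100) for s in range(100)]
```

Under that assumption the multiplier at step 0 is zero, so the single optimization step configured here would be applied with a learning rate of zero. The settings therefore read as a smoke test of the pipeline rather than a real training run. A real run would use a larger `max_steps`, or drop it and rely on `num_train_epochs`.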


> 2024-04-01 16:49:40,542 [info] training 'mistralai/Mistral-7B-Instruct-v0.2'


***** Running training *****
  Num examples = 541
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 1
  Number of trainable parameters = 41,943,040
torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
None of the inputs have requires_grad=True. Gradients will be None
Could not estimate the number of tokens of the input, floating-point operations will not be computed
***** Running Evaluation *****
  Num examples = 541
  Batch size = 1
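
The run output reports two different trainable-parameter figures: 7,241,732,096 when the model is loaded and 41,943,040 once training starts. The notebook does not show the adapter settings behind `"peft_config": True`, so the sketch below is only a plausibility check based on the published Mistral-7B dimensions. One configuration that reproduces the second figure is LoRA at rank 16 on all seven linear projections of every decoder layer. That rank and target list are assumptions and are not confirmed by the source.

```python
# Published Mistral-7B architecture dimensions.
HIDDEN = 4096
KV_DIM = 1024          # 8 key/value heads x 128 dimensions per head
INTERMEDIATE = 14336
LAYERS = 32
VOCAB = 32000

# (in_features, out_features) of each linear projection in one decoder layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def base_param_count():
    """Total parameters of the base model (no biases, untied embeddings)."""
    linear = sum(i * o for i, o in PROJECTIONS.values())
    per_layer = linear + 2 * HIDDEN          # two RMSNorm weight vectors
    embeddings = 2 * VOCAB * HIDDEN          # input embeddings + lm_head
    return LAYERS * per_layer + embeddings + HIDDEN  # + final norm


def lora_param_count(rank, targets=tuple(PROJECTIONS)):
    """Parameters added by LoRA: A is (rank x in), B is (out x rank)."""
    per_layer = sum(rank * sum(PROJECTIONS[name]) for name in targets)
    return LAYERS * per_layer


total = base_param_count()
assumed_lora = lora_param_count(rank=16)  # assumed rank and targets
print(f"base: {total:,}  lora (assumed r=16): {assumed_lora:,}")
```

If that reading is right, the adapter is roughly 0.58% of the base model, and the earlier `trainable%: 100.0` line was printed before the adapter was attached.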
