<div align="center">
  <img src="logo_branding.png" width="250" alt="kavi.ai Logo">
  <h1>DPO: Direct Preference Alignment</h1>
  <p><b>A Premium Training Module by kavi.ai</b></p>
</div>

---

### 💎 **Smarter Overview**
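Direct Preference Optimization (DPO) aligns a model on preference pairs without the separate reward model and reinforcement-learning loop that RLHF with PPO requires. Instead, it minimizes a classification-style loss directly on chosen/rejected response pairs, which is usually simpler to set up and more stable to train.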
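
The objective can be written out for a single preference pair. The sketch below is a minimal, framework-free illustration of the standard DPO loss, not the code `trl` runs internally; the function name and the example log-probabilities are made up for illustration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given sequence log-probabilities."""
    # How much more (or less) likely each response became relative to the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(z)), written in a numerically stable form for both signs of z.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Hypothetical log-probabilities: the policy now favors the chosen response more than the reference does.
untrained = dpo_loss(-10.0, -12.0, -10.0, -12.0)  # policy identical to reference
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(untrained, improved)
```

When the policy equals the reference, the loss is `log(2)`; it falls as the policy raises the chosen response's probability relative to the rejected one.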

### 🚀 **Enterprise Use Case**
Refining model safety and helpfulness by training on paired preference data, where each prompt comes with a chosen (preferred) response and a rejected one.

### 📈 **Strategic Advantages**
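- **Eliminates Complexity**: No separate reward model or PPO sampling loop to train and tune.
- **Superior Robustness**: A single supervised-style loss tends to train more stably than PPO.
- **Human-Centric**: The model learns directly from preference judgments between two responses.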

---

## Step 1: Install Dependencies

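### **Purpose:**
Installing `trl` and `peft` for preference-based optimization, along with the supporting packages used in this notebook.

### **Line-by-Line Breakdown:**
- `trl`: The core library for DPO implementation.
- `peft`: Provides the LoRA adapters, installed here through the `trl[peft]` extra.
- `wandb`: Experiment tracking for training and evaluation metrics.
- `hf_transfer`: Faster downloads from the Hugging Face Hub.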

In [None]:
!pip install transformers --upgrade
!pip install datasets
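# The PyPI release pulls in the `peft` extra; the next line then replaces `trl` with the development version from GitHub.
!pip install "trl[peft]" --upgrade
!pip install -U git+https://github.com/huggingface/trl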
!pip install bitsandbytes loralib
!pip install wandb -U
!pip install hf_transfer
!pip install sentencepiece
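
Because the cell above installs the newest releases, argument names in `trl` and `transformers` may differ from older tutorials. As a small sanity check, the sketch below prints the installed versions using only the standard library; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each package name to its installed version, or 'not installed'."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found

for name, ver in installed_versions(["transformers", "datasets", "trl", "peft", "wandb"]).items():
    print(f"{name}: {ver}")
```

Recording these versions alongside a run makes it easier to reproduce results later.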


In [None]:
%env HF_HUB_ENABLE_HF_TRANSFER=True
%env WANDB_PROJECT=LLM-Training-Course
%env WANDB_RUN_ID=DPO
%env WANDB_NOTEBOOK_NAME={__vsc_ipynb_file__}

In [None]:
import wandb
wandb.login()

In [None]:
import sys
sys.path.append('/root/llm-training-course/')

In [None]:
from datasets import load_dataset

train_ds, eval_ds = load_dataset("mlabonne/orpo-dpo-mix-40k", split=["train[:20%]","train[20%:25%]"])

In [None]:
train_ds

In [None]:
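# Note: the "messages" column built here is not used by DPO; it is dropped again when the columns are filtered below.
train_ds = train_ds.map(lambda x: { "messages": [{"role":"system", "content": x["prompt"] }] + x["chosen"] })
eval_ds = eval_ds.map(lambda x: { "messages": [{"role":"system", "content": x["prompt"] }] + x["chosen"] })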

In [None]:
from transformers import AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id

In [None]:
columns_to_remove = [c for c in train_ds.column_names if c not in ["chosen", "rejected", "prompt"]]
dpo_train_ds = train_ds.remove_columns(columns_to_remove)

columns_to_remove = [c for c in eval_ds.column_names if c not in ["chosen", "rejected", "prompt"]]
dpo_eval_ds = eval_ds.remove_columns(columns_to_remove)


## Step 2: Preference Data Preparation

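### **Purpose:**
Formatting 'chosen' and 'rejected' response pairs for the model to learn from. Each conversation is rendered to plain text with the tokenizer's chat template.

### **Line-by-Line Breakdown:**
- `prompt`: The input both responses answer.
- `chosen`: The preferred response.
- `rejected`: The non-preferred response.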
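
If `chosen` and `rejected` each hold a whole conversation (as the use of `apply_chat_template` in the next cell suggests), the prompt text ends up in both the `prompt` column and the templated responses. One possible alternative, sketched below, is to split each pair into the shared prompt messages and the two final replies. This is an illustration under assumptions, not the notebook's method: it assumes both conversations share every turn except the last and end with an assistant message, and the helper name `split_preference_pair` is hypothetical.

```python
def split_preference_pair(chosen, rejected):
    """Split two conversations into shared prompt messages and the two final replies."""
    if chosen[:-1] != rejected[:-1]:
        raise ValueError("chosen and rejected must share the same preceding messages")
    if chosen[-1]["role"] != "assistant" or rejected[-1]["role"] != "assistant":
        raise ValueError("both conversations must end with an assistant message")
    return {
        "prompt": chosen[:-1],                # shared context, still a list of messages
        "chosen": chosen[-1]["content"],      # preferred reply text only
        "rejected": rejected[-1]["content"],  # non-preferred reply text only
    }

# Made-up example pair for illustration.
user_turn = {"role": "user", "content": "What is the capital of France?"}
pair = split_preference_pair(
    [user_turn, {"role": "assistant", "content": "Paris."}],
    [user_turn, {"role": "assistant", "content": "Lyon."}],
)
print(pair)
```

The prompt messages could then be rendered with the chat template once, so the two responses differ only in the reply itself.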

In [None]:
def to_dpo_text(example):
    # Render both conversations to plain text with the model's chat template; keep the prompt as-is.
    return {
        "chosen": tokenizer.apply_chat_template(example["chosen"], tokenize=False),
        "rejected": tokenizer.apply_chat_template(example["rejected"], tokenize=False),
        "prompt": example["prompt"],
    }

dpo_train_ds = dpo_train_ds.map(to_dpo_text)
dpo_eval_ds = dpo_eval_ds.map(to_dpo_text)

In [None]:
from transformers import AutoModelForCausalLM
from peft import LoraConfig
import torch


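# LoRA adapter settings: only the low-rank adapter weights are trained, the base model stays frozen.
peft_config = LoraConfig(
    r=16,               # rank of the low-rank update matrices
    lora_alpha=16,      # scaling numerator; the update is scaled by lora_alpha / r = 1.0
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Adapt every attention and MLP projection in each transformer block.
    target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)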

# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)
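
To get a feel for how many weights the `LoraConfig` above trains, the sketch below counts adapter parameters: each adapted linear layer adds `r * (in_features + out_features)` weights. The layer shapes are assumptions based on the published Mistral-7B architecture (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers) and are not taken from this notebook, so check them against `model.config` before relying on the result.

```python
def lora_param_count(shapes, r, num_layers):
    """Trainable LoRA weights: r * (in_features + out_features) per adapted layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

# Assumed (in_features, out_features) for each targeted projection in one block.
ASSUMED_MISTRAL_7B_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

total = lora_param_count(ASSUMED_MISTRAL_7B_SHAPES, r=16, num_layers=32)
print(f"Trainable LoRA parameters: {total:,}")
```

Under these assumptions the adapters amount to well under 1% of the roughly 7 billion base weights, which is what keeps memory use manageable during training.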

In [None]:
from helpers import get_gpu_status
get_gpu_status()

In [None]:
from helpers import stream_responses_for_sample
from transformers import GenerationConfig

generation_config = GenerationConfig(max_new_tokens=75)
sample_conversations = [
    [{"role": "user", "content": "What is the capital of France?"}],
    [{"role": "user", "content": "Write me a JavaScript function that checks if a string is a palindrome."}],
    [{"role": "user", "content": "Given x^2=36-4 what is x?"}]
]
stream_responses_for_sample(model, tokenizer, sample_conversations, generation_config)

## Step 3: Configure Training Arguments

### **Purpose:**
Setting the hyperparameters for the DPO run: batch size, learning-rate schedule, evaluation cadence, and logging.

### **Line-by-Line Breakdown:**
- `DPOConfig`: Extends the standard training arguments with DPO-specific options.

In [None]:
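from trl import DPOTrainer, DPOConfig
import os

training_args = DPOConfig(
    # Fall back to "DPO" if WANDB_RUN_ID is unset, so the concatenation cannot fail on None.
    output_dir=os.getenv("WANDB_RUN_ID", "DPO") + "_DPO",
    report_to="wandb",
    num_train_epochs=1.0,
    do_train=True,
    do_eval=True,
    log_level="debug",
    gradient_checkpointing=True,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # effective batch size: 4 * 8 = 32 per device
    per_device_eval_batch_size=4,
    # The plain "constant" schedule ignores warmup_steps; this variant applies them.
    lr_scheduler_type="constant_with_warmup",
    bf16=True,
    warmup_steps=150,
    # Named `evaluation_strategy` in older transformers releases.
    eval_strategy="steps",
    eval_steps=0.2,     # values below 1 are a fraction of the total training steps
    logging_steps=0.1,
    max_grad_norm=0.3,
    learning_rate=1e-6,
)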
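
The batch and step settings above interact: the effective batch size is the per-device batch size times the gradient accumulation steps, and fractional `eval_steps` / `logging_steps` are interpreted as a share of the total optimizer steps. The sketch below estimates those numbers for a given dataset size. It is an approximation that uses ceiling division throughout; the exact rounding inside `transformers` may differ between versions, and the helper name, the `num_devices` parameter, and the 8,000-example figure are hypothetical.

```python
import math

def training_schedule(num_examples, per_device_batch=4, grad_accum=8,
                      epochs=1.0, num_devices=1, eval_ratio=0.2, log_ratio=0.1):
    """Estimate optimizer steps and the step intervals implied by ratio settings."""
    effective_batch = per_device_batch * grad_accum * num_devices
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    updates_per_epoch = max(math.ceil(batches_per_epoch / grad_accum), 1)
    total_updates = math.ceil(updates_per_epoch * epochs)
    return {
        "effective_batch_size": effective_batch,
        "total_optimizer_steps": total_updates,
        "eval_every": math.ceil(total_updates * eval_ratio),
        "log_every": math.ceil(total_updates * log_ratio),
    }

# Hypothetical dataset size, chosen only to make the arithmetic easy to follow.
schedule = training_schedule(num_examples=8000)
print(schedule)
```

Comparing `total_optimizer_steps` with `warmup_steps=150` is a quick way to see how much of a short run is spent warming up.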


In [None]:
# Inspect one formatted example (indexing the row first avoids loading the whole column).
dpo_train_ds[0]["chosen"]

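## Step 4: Initialize DPO Trainer

### **Purpose:**
Building the `DPOTrainer` from the model, LoRA configuration, and preference datasets, then running training and saving the result.

### **Line-by-Line Breakdown:**
- `DPOTrainer`: Specifically designed for preference optimization without a reward model.
- `beta`: Controls how strongly the trained model is kept close to the reference model; smaller values allow larger deviations.
- `save_model()`: Writes the trained weights to `output_dir`.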

In [None]:

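DPO_BETA = 0.1

# Recent TRL releases read beta from the DPOConfig instead of a DPOTrainer argument.
training_args.beta = DPO_BETA

dpo_trainer = DPOTrainer(
    model,
    peft_config=peft_config,
    args=training_args,
    train_dataset=dpo_train_ds,
    eval_dataset=dpo_eval_ds,
    # Older TRL releases take this as `tokenizer=tokenizer`.
    processing_class=tokenizer,
)
dpo_trainer.train()
dpo_trainer.save_model()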
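
While training runs, the logged metrics worth watching are the implicit rewards. To my understanding, TRL reports them under names such as `rewards/chosen`, `rewards/rejected`, `rewards/margins`, and `rewards/accuracies`. The sketch below shows how such values can be derived from sequence log-probabilities; it is an illustration of the definitions rather than TRL's own code, and the helper name and input numbers are made up.

```python
def implicit_reward_stats(pairs, beta=0.1):
    """Average implicit rewards over preference pairs given policy/reference log-probs."""
    # Implicit reward: beta * (policy log-prob - reference log-prob) for each response.
    chosen = [beta * (p["policy_chosen"] - p["ref_chosen"]) for p in pairs]
    rejected = [beta * (p["policy_rejected"] - p["ref_rejected"]) for p in pairs]
    margins = [c - r for c, r in zip(chosen, rejected)]
    n = len(pairs)
    return {
        "rewards/chosen": sum(chosen) / n,
        "rewards/rejected": sum(rejected) / n,
        "rewards/margins": sum(margins) / n,
        # Share of pairs where the chosen response earns the higher implicit reward.
        "rewards/accuracies": sum(m > 0 for m in margins) / n,
    }

# Two hypothetical pairs: the policy ranks the first correctly and the second incorrectly.
example_pairs = [
    {"policy_chosen": -8.0, "ref_chosen": -10.0, "policy_rejected": -14.0, "ref_rejected": -12.0},
    {"policy_chosen": -11.0, "ref_chosen": -10.0, "policy_rejected": -9.0, "ref_rejected": -12.0},
]
stats = implicit_reward_stats(example_pairs)
print(stats)
```

A rising accuracy and a growing positive margin indicate that the model increasingly prefers the chosen responses over the rejected ones.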