# RLHF with DPO (Direct Preference Optimization)

This notebook demonstrates Direct Preference Optimization (DPO), a simpler alternative to PPO for RLHF.

**DPO Advantages:**
- No separate reward model needed
- Typically more stable training than PPO
- Direct learning from preferences
- Simpler implementation

In [None]:
# Install required libraries
!pip install -q transformers datasets accelerate peft trl bitsandbytes

In [None]:
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    BitsAndBytesConfig
)
from peft import LoraConfig, prepare_model_for_kbit_training
from trl import DPOTrainer
from datasets import load_dataset, Dataset
import warnings
warnings.filterwarnings('ignore')

# Check device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## 1. Understanding DPO

DPO directly optimizes the policy to match human preferences without a separate reward model.

In [None]:
print("DPO Training Pipeline:")
print("="*50)
print("\n1. Start with SFT model (instruction-tuned)")
print("2. Collect preference data:")
print("   - Prompt")
print("   - Chosen response (preferred)")
print("   - Rejected response (not preferred)")
print("3. Train with DPO loss:")
print("   - Increase likelihood of chosen")
print("   - Decrease likelihood of rejected")
print("   - Stay close to the reference model (implicit KL constraint)")
print("\nNo reward model needed!")

## 2. Create Preference Dataset

In [None]:
# Create a small preference dataset for demonstration
# In practice, you would use human-labeled preferences

preference_data = [
    {
        "prompt": "What is the capital of France?",
        "chosen": "The capital of France is Paris, a beautiful city known for the Eiffel Tower, rich history, and cultural significance.",
        "rejected": "paris"
    },
    {
        "prompt": "Explain machine learning in simple terms.",
        "chosen": "Machine learning is a type of artificial intelligence where computers learn patterns from data to make predictions or decisions, without being explicitly programmed for every scenario. Think of it like teaching a computer to recognize patterns the way humans do.",
        "rejected": "ML is when computers learn stuff from data using algorithms and math to do things."
    },
    {
        "prompt": "How do I stay healthy?",
        "chosen": "To maintain good health: 1) Eat a balanced diet with fruits, vegetables, and whole grains, 2) Exercise regularly (at least 30 minutes daily), 3) Get 7-9 hours of sleep, 4) Stay hydrated, 5) Manage stress through relaxation techniques, 6) Regular check-ups with healthcare providers.",
        "rejected": "just eat good food and exercise sometimes"
    },
    {
        "prompt": "Write a professional email opening.",
        "chosen": "Dear [Name],\n\nI hope this email finds you well. I am writing to discuss...",
        "rejected": "Hey there, wanted to talk about..."
    },
    {
        "prompt": "What is climate change?",
        "chosen": "Climate change refers to long-term shifts in global temperatures and weather patterns. While natural variations occur, scientific evidence shows human activities, particularly burning fossil fuels, have been the dominant driver since the mid-20th century, leading to global warming and environmental impacts.",
        "rejected": "weather getting hotter because of pollution and stuff"
    }
]

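# Duplicate the 5 pairs 20x so the demo has enough rows to train on
# (this adds no new information, only more steps)
preference_data = preference_data * 20  # 100 examples

# Create dataset
train_dataset = Dataset.from_list(preference_data)
# NOTE: these eval rows are copies of training rows, so eval metrics here
# only confirm that the loop runs; they do not measure generalization
eval_dataset = Dataset.from_list(preference_data[:10])  # Small eval set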

print(f"Training examples: {len(train_dataset)}")
print(f"Evaluation examples: {len(eval_dataset)}")
print(f"\nExample preference:")
print(f"Prompt: {train_dataset[0]['prompt']}")
print(f"Chosen: {train_dataset[0]['chosen'][:50]}...")
print(f"Rejected: {train_dataset[0]['rejected']}")
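
Because the demo evaluates on rows that also appear in the training set, its eval loss cannot reveal overfitting. A minimal sketch of one way to hold out whole prompts, using only the standard library, is shown below. `split_by_prompt` and its parameters are hypothetical names introduced here for illustration; they are not part of `datasets` or TRL.

```python
import random

def split_by_prompt(records, eval_fraction=0.2, seed=0):
    """Deduplicate preference records and split them so no prompt lands in both sets."""
    # Drop exact duplicates while keeping the first occurrence of each record
    seen = set()
    unique = []
    for rec in records:
        key = (rec["prompt"], rec["chosen"], rec["rejected"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)

    # Shuffle the distinct prompts and reserve a fraction of them for evaluation
    prompts = sorted({rec["prompt"] for rec in unique})
    random.Random(seed).shuffle(prompts)
    n_eval = max(1, round(len(prompts) * eval_fraction))
    eval_prompts = set(prompts[:n_eval])

    train = [r for r in unique if r["prompt"] not in eval_prompts]
    held_out = [r for r in unique if r["prompt"] in eval_prompts]
    return train, held_out

# Toy records standing in for the duplicated preference_data list (5 pairs x 20)
demo = [{"prompt": f"q{i}", "chosen": "long answer", "rejected": "short"} for i in range(5)] * 20
train_records, eval_records = split_by_prompt(demo)
print(len(train_records), len(eval_records))
```

The two lists can be passed to `Dataset.from_list` in the same way as in the cell above. With only five distinct pairs the held-out set is a single example, which is another reason to use real preference data in practice.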

## 3. Load Models

In [None]:
model_name = "microsoft/phi-2"  # Small model for demo

# Quantization config for memory efficiency
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load model and reference model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

model_ref = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Prepare for training
model = prepare_model_for_kbit_training(model)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # Left padding keeps batched generation correct for decoder-only models

print(f"Models loaded: {model_name}")

## 4. Configure LoRA for Efficient Training

In [None]:
# LoRA configuration
peft_config = LoraConfig(
    r=8,  # Low rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # Attention projections; names vary by model implementation, check model.named_modules()
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

print("LoRA configured for efficient DPO training")
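
To get a feel for how small the LoRA update is, the arithmetic can be worked out by hand: a rank-`r` adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` trainable weights. The sketch below assumes phi-2 has a hidden size of 2560 and 32 transformer layers with square `q_proj` and `v_proj` projections. Those figures are assumptions to check against the model config, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable weights added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

hidden_size = 2560      # assumed hidden size for phi-2
num_layers = 32         # assumed number of transformer layers
rank = 8                # matches r=8 in the LoraConfig above
modules_per_layer = 2   # q_proj and v_proj

per_module = lora_param_count(hidden_size, hidden_size, rank)
total = per_module * modules_per_layer * num_layers
print(f"Per adapted module: {per_module:,}")
print(f"Total trainable LoRA weights: {total:,}")
```

If the assumed shapes hold, this comes to roughly 2.6M trainable weights, about 0.1% of a 2.7B-parameter model. A PEFT-wrapped model also exposes `print_trainable_parameters()`, which reports the actual number.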

## 5. Configure DPO Training

In [None]:
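# Training arguments
# 100 examples / (batch 2 x accumulation 4) is only about a dozen optimizer steps,
# so the step-based intervals below are kept small enough to actually fire.
training_args = TrainingArguments(
    output_dir="./dpo_results",
    num_train_epochs=1,  # Quick demo
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    learning_rate=5e-5,
    logging_steps=2,
    save_strategy="no",  # Don't save for demo
    evaluation_strategy="steps",  # Renamed to eval_strategy in newer transformers releases
    eval_steps=5,
    warmup_steps=2,
    report_to="none",
    fp16=True,
    push_to_hub=False,
)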

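# DPO-specific arguments
# NOTE: this call follows the older TRL API. Newer TRL releases move beta,
# max_length and max_prompt_length into DPOConfig, and some releases expect
# ref_model=None when peft_config is given (the frozen base weights then
# serve as the reference). Check the documentation for your installed version.
dpo_trainer = DPOTrainer(
    model,
    model_ref,
    args=training_args,
    beta=0.1,  # KL penalty coefficient
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    max_prompt_length=128,
    max_length=256,
)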

print("DPO trainer configured")
print(f"Beta (KL coefficient): 0.1")
print(f"Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
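
The effective batch size also determines how many optimizer updates one epoch contains, which matters for any step-based setting (logging, evaluation, warmup). The sketch below works through that arithmetic for a 100-example dataset on a single device. `count_update_steps` is a hypothetical helper, and it rounds partial batches up, so the real trainer may report one step fewer depending on how it handles the final partial accumulation window.

```python
import math

def count_update_steps(num_examples, per_device_batch_size, grad_accum_steps,
                       num_epochs=1, num_devices=1):
    """Approximate number of optimizer updates, rounding partial batches up."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return updates_per_epoch * num_epochs

total_updates = count_update_steps(num_examples=100, per_device_batch_size=2, grad_accum_steps=4)
print(f"Optimizer updates in one epoch: {total_updates}")

# How often would a step-based interval fire during this run?
for interval in (5, 10, 20):
    print(f"interval={interval:>2}: fires {total_updates // interval} time(s)")
```

An interval larger than the total number of updates never fires, so on a dataset this small the evaluation and logging intervals need to be a handful of steps at most.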

## 6. Train with DPO

In [None]:
# Train
print("Starting DPO training...")
print("This trains the model to prefer 'chosen' over 'rejected' responses")
print()

dpo_trainer.train()

print("\nDPO training completed!")

## 7. Test the Aligned Model

In [None]:
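def generate_response(prompt, model):
    """Sample a short completion for `prompt` and return only the newly generated text."""
    model.eval()  # Disable dropout (including LoRA dropout) for inference
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=128)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=50,
            temperature=0.7,
            do_sample=True,
            top_p=0.9,
            pad_token_id=tokenizer.pad_token_id,
        )

    # Decode only the tokens generated after the prompt; slicing by token count
    # is safer than slicing the decoded string by len(prompt)
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    return response.strip()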

# Test prompts
test_prompts = [
    "What is the capital of France?",
    "How do I stay healthy?",
    "Explain machine learning in simple terms.",
    "Write a professional email opening.",
]

print("Testing DPO-aligned model:")
print("="*50)

for prompt in test_prompts:
    print(f"\nPrompt: {prompt}")
    response = generate_response(prompt, model)
    print(f"Response: {response}")
    print("-"*30)

## 8. Understanding DPO Loss

In [None]:
print("DPO Loss Function:")
print("="*50)
print()
print("L_DPO = -log σ(β * log[π(y_chosen|x)/π_ref(y_chosen|x)]")
print("              - β * log[π(y_rejected|x)/π_ref(y_rejected|x)])")
print()
print("Where:")
print("- π: Policy (model being trained)")
print("- π_ref: Reference policy (frozen)")
print("- y_chosen: Preferred response")
print("- y_rejected: Non-preferred response")
print("- β: Strength of the implicit KL constraint (higher keeps π closer to π_ref)")
print("- σ: Sigmoid function")
print()
print("This loss:")
print("1. Increases likelihood of chosen responses")
print("2. Decreases likelihood of rejected responses")
print("3. Prevents model from deviating too far from reference")
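
The formula can be checked numerically on a single preference pair. The sketch below is a plain-Python illustration, not the TRL implementation: `dpo_loss` is a hypothetical helper, and the log-probabilities are made-up numbers standing in for summed token log-probabilities of each response. When the policy equals the reference, both log-ratios are zero and the loss is `-log σ(0) = log 2 ≈ 0.693`; raising the chosen response and lowering the rejected one pushes the loss below that value.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(z)) written as softplus(-z) in a numerically stable form
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Policy identical to the reference: the starting point of training
at_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy now favors the chosen response and disfavors the rejected one
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(f"Loss at start:  {at_start:.4f}")
print(f"Loss after shift: {improved:.4f}")
```

Only the difference between the two log-ratios enters the loss, which is why the reference model must stay frozen: it fixes the baseline that both ratios are measured against.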

## 9. PPO vs DPO Comparison

In [None]:
import pandas as pd

comparison_data = {
    "Aspect": [
        "Reward Model",
        "Training Complexity",
        "Memory Usage",
        "Training Stability",
        "Implementation",
        "Hyperparameters",
        "Performance"
    ],
    "PPO": [
        "Required",
        "High",
        "High (4 models)",
        "Can be unstable",
        "Complex",
        "Many to tune",
        "Often strongest when well tuned"
    ],
    "DPO": [
        "Not needed",
        "Low",
        "Low (2 models)",
        "Usually more stable",
        "Simple",
        "Few (mainly β)",
        "Often competitive, task-dependent"
    ]
}

df = pd.DataFrame(comparison_data)
print("\nPPO vs DPO Comparison:")
print("="*60)
print(df.to_string(index=False))

print("\n" + "="*60)
print("Recommendation: Start with DPO for simplicity,")
print("use PPO only if you need maximum performance.")

## 10. Best Practices for DPO

In [None]:
print("DPO Best Practices:")
print("="*50)
print()
print("1. Data Quality:")
print("   - Ensure clear preference distinctions")
print("   - Balance different types of preferences")
print("   - Include diverse prompts")
print()
print("2. Hyperparameters:")
print("   - β: Start with 0.1-0.2")
print("   - Learning rate: 1e-6 to 5e-5")
print("   - Batch size: As large as memory allows")
print()
print("3. Model Selection:")
print("   - Start with good SFT model")
print("   - Ensure reference model is fixed")
print("   - Consider LoRA for efficiency")
print()
print("4. Evaluation:")
print("   - Monitor chosen vs rejected rewards")
print("   - Check for reward hacking")
print("   - Human evaluation for quality")
print()
print("5. Common Issues:")
print("   - If model degenerates: Increase β")
print("   - If no improvement: Decrease β")
print("   - If unstable: Reduce learning rate")
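
The advice to monitor chosen versus rejected rewards refers to DPO's implicit reward, `β * (log π(y|x) - log π_ref(y|x))`. The sketch below shows how a reward accuracy and mean margin could be derived from per-example log-probabilities. The helper names and the numbers are hypothetical; DPOTrainer logs comparable reward metrics during training, and this only illustrates where such numbers come from.

```python
def implicit_rewards(policy_logps, ref_logps, beta=0.1):
    """Implicit DPO reward per example: beta * (policy log-prob - reference log-prob)."""
    return [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]

def reward_stats(chosen_rewards, rejected_rewards):
    """Fraction of pairs where chosen beats rejected, plus the mean reward margin."""
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    accuracy = sum(m > 0 for m in margins) / len(margins)
    mean_margin = sum(margins) / len(margins)
    return accuracy, mean_margin

# Made-up summed log-probabilities for four preference pairs
chosen = implicit_rewards([-10.0, -11.0, -9.0, -14.0], [-12.0, -11.5, -10.0, -13.0])
rejected = implicit_rewards([-18.0, -12.0, -11.0, -13.0], [-15.0, -12.0, -10.0, -14.0])

accuracy, mean_margin = reward_stats(chosen, rejected)
print(f"Reward accuracy: {accuracy:.2f}")
print(f"Mean reward margin: {mean_margin:.4f}")
```

An accuracy that stays near 0.5 suggests the model is not yet separating the pairs, while a margin that keeps growing long after accuracy saturates can be a sign of drifting too far from the reference.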

## Key Takeaways

### DPO Advantages
1. **Simpler than PPO**: No reward model or value network
2. **Stable training**: Direct optimization without RL loops
3. **Memory efficient**: Only needs model + reference
4. **Easy to implement**: Standard supervised learning setup

### When to Use DPO
- Starting RLHF experiments
- Limited computational resources
- Need stable, predictable training
- Preference data available

### When to Use PPO Instead
- Need maximum performance
- Complex reward signals
- Online data collection
- Full control over optimization

### Next Steps
- Scale up with more preference data
- Try different β values
- Combine with Constitutional AI
- Implement iterative DPO