# Notebook 3: Reinforcement Learning with Preference Data (DPO/ORPO)

This notebook demonstrates **Direct Preference Optimization (DPO)** and **Odds Ratio Preference Optimization (ORPO)** using Unsloth.ai.

## Key Concepts

### What is RLHF?
**Reinforcement Learning from Human Feedback (RLHF)** aligns models with human preferences:
1. Supervised fine-tuning (SFT) - base instruction following
2. Reward modeling - learn what humans prefer
3. RL training - optimize for preferred outputs
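
The reward-modeling step is commonly framed with a Bradley-Terry model, where the probability that a response is preferred is a sigmoid of the reward difference. The sketch below is a minimal pure-Python illustration of that idea; the function names and the scalar reward values are assumptions made for this example, not outputs of a real reward model.

```python
import math

def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

def reward_model_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the observed human preference."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))

# Hypothetical scalar rewards for one (chosen, rejected) pair
print(preference_probability(2.0, 0.0))
print(reward_model_loss(2.0, 0.0))
```

The loss shrinks as the reward model scores the chosen response further above the rejected one.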

### DPO: Simpler than PPO
**Traditional RLHF** uses PPO (Proximal Policy Optimization), which:
- Requires a separate reward model
- Involves a complex training loop (sampling, scoring, and policy updates)
- Can be unstable and sensitive to hyperparameters

**DPO** (Direct Preference Optimization):
- Needs no separate reward model
- Optimizes directly on preference pairs
- Is typically simpler and more stable to train than PPO
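
To make "directly optimizes on preference pairs" concrete, here is a minimal sketch of the per-example DPO loss in pure Python. It assumes the four sequence log-probabilities have already been computed; the function names and the example numbers are illustrative, and this is not the TRL implementation.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probs."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # The loss falls as the policy favors the chosen response more strongly
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))

# Made-up log-probabilities: the policy already leans toward the chosen response
print(dpo_loss(-10.0, -14.0, -11.0, -12.0, beta=0.1))
```

When the policy equals the reference, both log-ratios are zero and the loss is log 2; training pushes it below that value.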

### ORPO: Even Better
**ORPO** (Odds Ratio Preference Optimization):
- Combines SFT + preference learning in one step
- No separate SFT stage needed
- Reported to match or outperform DPO in some settings (results vary by model and data)
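
The "SFT + preference learning in one step" idea can be sketched as a single loss with two terms: a standard negative log-likelihood on the chosen response plus a weighted odds-ratio penalty. The code below is a simplified illustration that assumes length-normalized (average per-token) log-probabilities as inputs; the names `orpo_loss` and `lam` are chosen for this example.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) with p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(chosen_avg_logp: float, rejected_avg_logp: float, lam: float = 0.1) -> float:
    """SFT term on the chosen response plus a weighted odds-ratio term."""
    sft_term = -chosen_avg_logp
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    odds_ratio_term = -log_sigmoid(log_odds_ratio)
    return sft_term + lam * odds_ratio_term

# Made-up average log-probs: chosen is more likely than rejected
print(orpo_loss(-0.5, -2.0))
```

Note that no reference-model log-probabilities appear anywhere in this loss, which is why ORPO can skip the reference model.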

## Dataset Format

Preference datasets need **chosen** (good) and **rejected** (bad) responses:

```json
{
  "prompt": "Explain quantum computing",
  "chosen": "Quantum computing uses qubits that can be in superposition...",
  "rejected": "Quantum is like really fast computers."
}
```
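
Before training, it can help to check that each record follows this schema. The helper below is a small sketch; `validate_preference_record` is a hypothetical name for this example and is not part of Unsloth or TRL.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")

def validate_preference_record(record: dict) -> list:
    """Return a list of problems found in one preference record (empty if OK)."""
    problems = []
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"'{key}' must be a non-empty string")
    # A pair carries no preference signal if both responses are the same
    if not problems and record["chosen"].strip() == record["rejected"].strip():
        problems.append("'chosen' and 'rejected' are identical")
    return problems

example = {
    "prompt": "Explain quantum computing",
    "chosen": "Quantum computing uses qubits that can be in superposition...",
    "rejected": "Quantum is like really fast computers.",
}
print(validate_preference_record(example))
```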

## Video Recording Checklist
- [ ] Explain RLHF, DPO, and ORPO differences
- [ ] Show preference dataset format
- [ ] Demonstrate why we need preference learning
- [ ] Compare outputs: base model → SFT → DPO/ORPO
- [ ] Explain training dynamics
- [ ] Show alignment improvements

## Step 1: Install Unsloth and DPO Dependencies

In [None]:
%%capture
# Install Unsloth and dependencies
# Use colab-new for Google Colab, cu121-torch230 for Vertex AI Workbench
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
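
Because the install cell pins upper bounds on `xformers` and `trl`, it can be useful to confirm which versions actually ended up in the environment. This is an optional sketch using only the standard library; `installed_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(["unsloth", "trl", "peft", "xformers"]).items():
    print(f"{name}: {ver or 'not installed'}")
```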

## Step 2: Import Libraries

In [None]:
from unsloth import FastLanguageModel, PatchDPOTrainer
PatchDPOTrainer()  # Apply Unsloth optimizations to DPO trainer

import torch
from datasets import load_dataset
from transformers import TrainingArguments
from trl import DPOTrainer, ORPOTrainer, ORPOConfig
from unsloth import is_bfloat16_supported

print("Libraries imported successfully!")
print("Using DPO/ORPO trainers from TRL library")

## Step 3: Load Pre-trained or SFT Model

**Important**: For best results, start with a model that is already instruction-tuned (SFT).
We can use either:
1. A pre-trained instruction-tuned model (e.g. Gemma-2-2B-it or Llama-3.1-8B-Instruct)
2. A model we fine-tuned in Notebook 1 or 2

In [None]:
max_seq_length = 2048
dtype = None
load_in_4bit = True

# Load instruction-tuned model
# For demo: using Gemma-2-2B-it (instruction-tuned)
# Alternative: "unsloth/Llama-3.1-8B-Instruct"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-2-2b-it-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

print(f"Model loaded: {model.config._name_or_path}")
print("This is an INSTRUCTION-TUNED model (already did SFT)")
print("Now we'll align it with human preferences using DPO/ORPO")

## Step 4: Add LoRA Adapters

We'll use LoRA so that preference learning only updates a small set of adapter weights.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0.05,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

print("LoRA adapters added for preference learning!")

## Step 5: Load Preference Dataset

We'll use the Anthropic HH-RLHF dataset, a widely used public dataset of human preference comparisons.

In [None]:
# Load Anthropic's Helpful and Harmless RLHF dataset
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

print(f"Dataset size: {len(dataset):,} examples")
print("\nFirst example:")
print(dataset[0])
print("\nDataset columns:", dataset.column_names)

## Step 6: Format Dataset for DPO/ORPO

The dataset needs specific format:
- `prompt`: The user query
- `chosen`: Preferred response
- `rejected`: Dispreferred response

In [None]:
def format_preference_data(examples):
    """
    Anthropic HH-RLHF format:
    - 'chosen': Full conversation ending with the preferred response
    - 'rejected': Full conversation ending with the dispreferred response

    We extract:
    - prompt: The conversation up to (not including) the last Assistant turn
    - chosen: The preferred final Assistant response
    - rejected: The dispreferred final Assistant response
    """
    marker = "\n\nAssistant:"
    formatted = []

    for chosen, rejected in zip(examples['chosen'], examples['rejected']):
        # Conversations look like "\n\nHuman: ... \n\nAssistant: ..."
        prompt_end = chosen.rfind(marker)
        rejected_end = rejected.rfind(marker)
        if prompt_end == -1 or rejected_end == -1:
            continue  # Skip malformed rows that have no Assistant turn

        # Prompt: everything before the last Assistant response
        prompt = chosen[:prompt_end].strip()

        # Responses: the text after the last "Assistant:" marker
        chosen_response = chosen[prompt_end + len(marker):].strip()
        rejected_response = rejected[rejected_end + len(marker):].strip()

        formatted.append({
            "prompt": prompt,
            "chosen": chosen_response,
            "rejected": rejected_response
        })

    return formatted

# Format first 10000 examples for faster training (use more for better results)
dataset_small = dataset.select(range(10000))
formatted_data = format_preference_data(dataset_small)

# Convert to HF dataset
from datasets import Dataset
preference_dataset = Dataset.from_list(formatted_data)

print(f"Formatted {len(preference_dataset)} preference pairs")
print("\nExample formatted data:")
print(f"Prompt: {preference_dataset[0]['prompt'][:200]}...")
print(f"\nChosen: {preference_dataset[0]['chosen'][:200]}...")
print(f"\nRejected: {preference_dataset[0]['rejected'][:200]}...")
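
To see what the split on the last `Assistant:` marker does, here is a toy transcript run through the same logic. The transcript is invented for illustration, and `split_last_assistant_turn` is a hypothetical stand-alone helper, not part of the notebook's pipeline.

```python
def split_last_assistant_turn(transcript: str):
    """Split an HH-RLHF style transcript into (prompt, final response)."""
    marker = "\n\nAssistant:"
    cut = transcript.rfind(marker)
    if cut == -1:
        raise ValueError("no Assistant turn found")
    prompt = transcript[:cut].strip()
    response = transcript[cut + len(marker):].strip()
    return prompt, response

# Invented two-turn conversation in the "\n\nHuman: ... \n\nAssistant: ..." format
toy = "\n\nHuman: Hi\n\nAssistant: Hello!\n\nHuman: What is 2+2?\n\nAssistant: 4"
prompt, response = split_last_assistant_turn(toy)
print(repr(prompt))
print(repr(response))
```

Earlier assistant turns stay inside the prompt; only the final turn becomes the response.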

## Step 7: Configure DPO Training

Let's start with DPO (Direct Preference Optimization)

In [None]:
# DPO-specific arguments
training_args = TrainingArguments(
    per_device_train_batch_size = 2,  # DPO needs more memory
    gradient_accumulation_steps = 4,
    warmup_ratio = 0.1,
    num_train_epochs = 1,  # Ignored while max_steps > 0
    max_steps = 500,  # Overrides num_train_epochs; set to -1 to train for full epochs
    learning_rate = 5e-5,  # Lower LR for preference learning
    fp16 = not is_bfloat16_supported(),
    bf16 = is_bfloat16_supported(),
    logging_steps = 10,
    optim = "adamw_8bit",
    weight_decay = 0.01,
    lr_scheduler_type = "linear",
    seed = 3407,
    output_dir = "outputs/gemma2_dpo",
    report_to = "none",
)

print("DPO training configuration:")
print(f"Beta (KL penalty): 0.1 (default, controls how much model can change)")
print(f"Learning rate: {training_args.learning_rate}")
print(f"This trains the model to prefer 'chosen' over 'rejected' responses")
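
A quick back-of-the-envelope check shows how much of the 10,000-pair subset these settings cover. The sketch assumes a single GPU; `training_coverage` is a hypothetical helper written for this example.

```python
def training_coverage(per_device_batch, grad_accum, max_steps, dataset_size, num_devices=1):
    """Return (effective batch size, examples seen, fraction of an epoch)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    examples_seen = effective_batch * max_steps
    return effective_batch, examples_seen, examples_seen / dataset_size

effective_batch, examples_seen, epochs = training_coverage(2, 4, 500, 10_000)
print(f"Effective batch size: {effective_batch}")  # 8
print(f"Examples seen: {examples_seen}")           # 4000
print(f"Fraction of an epoch: {epochs:.2f}")       # 0.40
```

With `max_steps = 500`, training stops after about 40% of the subset, regardless of `num_train_epochs`.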

## Step 8: Create DPO Trainer

In [None]:
dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,  # With LoRA/PEFT, the base weights (adapters disabled) act as the reference
    args = training_args,
    beta = 0.1,  # KL divergence penalty (how much model can deviate from reference)
    train_dataset = preference_dataset,
    tokenizer = tokenizer,
    max_length = 512,  # Max sequence length
    max_prompt_length = 256,  # Max prompt length
)

print("DPO Trainer created!")
print("\nHow DPO works:")
print("1. Shows model a prompt + chosen response")
print("2. Shows model same prompt + rejected response")
print("3. Trains model to assign higher probability to chosen")
print("4. Uses reference model to prevent too much deviation")

## Step 9: Test Model BEFORE DPO Training

In [None]:
FastLanguageModel.for_inference(model)

test_prompt = "What's the best way to learn programming?"

inputs = tokenizer([test_prompt], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)  # temperature only applies when sampling
response_before = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("=" * 70)
print("MODEL OUTPUT BEFORE DPO TRAINING")
print("=" * 70)
print(response_before)
print("=" * 70)

# Re-enable training mode (undoes Unsloth's inference optimizations)
FastLanguageModel.for_training(model)
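
The preference pairs built in Step 6 keep the `Human:` / `Assistant:` transcript markers in their prompts, whereas `test_prompt` is a bare question. One option, offered as an assumption rather than a tested recommendation, is to wrap test prompts in the same transcript style so that inference resembles the training data. `to_hh_prompt` is a hypothetical helper name.

```python
def to_hh_prompt(user_message: str) -> str:
    """Wrap a user message in the HH-RLHF transcript style."""
    return f"\n\nHuman: {user_message.strip()}\n\nAssistant:"

print(repr(to_hh_prompt("What's the best way to learn programming?")))
```

Whether this or the tokenizer's own chat template gives better generations depends on the model, so it is worth trying both.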

## Step 10: Train with DPO

In [None]:
print("\n" + "="*50)
print("STARTING DPO TRAINING")
print("="*50 + "\n")

dpo_stats = dpo_trainer.train()

print("\n" + "="*50)
print("DPO TRAINING COMPLETE!")
print("="*50)
print(f"Training time: {dpo_stats.metrics['train_runtime']:.2f} seconds")
print(f"Loss: {dpo_stats.metrics['train_loss']:.4f}")
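
Besides the loss, TRL's DPO trainer logs reward-style metrics such as `rewards/chosen`, `rewards/rejected`, `rewards/accuracies`, and `rewards/margins` (the exact set may vary by TRL version). The sketch below shows how such quantities can be derived from log-probabilities; the batch values are invented and this is not TRL's code.

```python
def dpo_reward_metrics(batch, beta=0.1):
    """batch: list of (policy_chosen, policy_rejected, ref_chosen, ref_rejected) log-probs."""
    # Implicit reward = beta * (policy log-prob - reference log-prob)
    chosen_rewards = [beta * (pc - rc) for pc, pr, rc, rr in batch]
    rejected_rewards = [beta * (pr - rr) for pc, pr, rc, rr in batch]
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    n = len(batch)
    return {
        "rewards/chosen": sum(chosen_rewards) / n,
        "rewards/rejected": sum(rejected_rewards) / n,
        # Fraction of pairs where the chosen response gets the higher reward
        "rewards/accuracies": sum(m > 0 for m in margins) / n,
        "rewards/margins": sum(margins) / n,
    }

# Two invented pairs: the first is ranked correctly, the second is not
toy_batch = [(-10.0, -12.0, -11.0, -11.0), (-9.0, -8.0, -9.0, -9.0)]
metrics = dpo_reward_metrics(toy_batch)
print(metrics)
```

During healthy training, accuracy and margin generally rise while the loss falls below log 2.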

## Step 11: Test Model AFTER DPO Training

In [None]:
FastLanguageModel.for_inference(model)

# Test with same prompt
inputs = tokenizer([test_prompt], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)  # temperature only applies when sampling
response_after = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("=" * 70)
print("MODEL OUTPUT AFTER DPO TRAINING")
print("=" * 70)
print(response_after)
print("=" * 70)

print("\n" + "=" * 70)
print("COMPARISON")
print("=" * 70)
print("BEFORE DPO:")
print(response_before)
print("\nAFTER DPO:")
print(response_after)
print("=" * 70)
print("\nCheck whether the response after DPO is more:")
print("- Helpful and detailed")
print("- Harmless (avoids problematic content)")
print("- Aligned with human preferences")

## Step 12: Try ORPO Instead (Optional)

ORPO is reported to match or beat DPO in some settings, and it doesn't need a reference model.

In [None]:
# Load fresh model for ORPO
model_orpo, tokenizer_orpo = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-2-2b-it-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

model_orpo = FastLanguageModel.get_peft_model(
    model_orpo,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0.05,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

print("Fresh model loaded for ORPO comparison")

In [None]:
# ORPO Configuration
orpo_config = ORPOConfig(
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,
    warmup_ratio = 0.1,
    num_train_epochs = 1,
    max_steps = 500,
    learning_rate = 8e-6,  # ORPO uses lower LR
    fp16 = not is_bfloat16_supported(),
    bf16 = is_bfloat16_supported(),
    logging_steps = 10,
    optim = "adamw_8bit",
    weight_decay = 0.01,
    lr_scheduler_type = "linear",
    seed = 3407,
    output_dir = "outputs/gemma2_orpo",
    report_to = "none",
    max_length = 512,  # In TRL, ORPO length limits are set on ORPOConfig
    max_prompt_length = 256,
)

orpo_trainer = ORPOTrainer(
    model = model_orpo,
    args = orpo_config,
    train_dataset = preference_dataset,
    tokenizer = tokenizer_orpo,
)

print("ORPO Trainer created!")
print("\nORPO advantages over DPO:")
print("- No reference model needed (saves memory)")
print("- Combines SFT + preference learning")
print("- Single-stage training with fewer moving parts")
print("- Can match or beat DPO, depending on model and data")

In [None]:
print("\n" + "="*50)
print("STARTING ORPO TRAINING")
print("="*50 + "\n")

orpo_stats = orpo_trainer.train()

print("\n" + "="*50)
print("ORPO TRAINING COMPLETE!")
print("="*50)
print(f"Training time: {orpo_stats.metrics['train_runtime']:.2f} seconds")
print(f"Loss: {orpo_stats.metrics['train_loss']:.4f}")

## Step 13: Compare DPO vs ORPO

In [None]:
FastLanguageModel.for_inference(model_orpo)

# Test ORPO model
inputs = tokenizer_orpo([test_prompt], return_tensors="pt").to("cuda")
outputs = model_orpo.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)  # temperature only applies when sampling
response_orpo = tokenizer_orpo.decode(outputs[0], skip_special_tokens=True)

print("=" * 70)
print("COMPARISON: DPO vs ORPO")
print("=" * 70)
print("\nBase Model:")
print(response_before)
print("\n" + "-" * 70)
print("\nAfter DPO:")
print(response_after)
print("\n" + "-" * 70)
print("\nAfter ORPO:")
print(response_orpo)
print("=" * 70)
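
Decoded outputs from `generate` include the prompt tokens, so a side-by-side comparison is easier to read once the prompt is stripped. The helper below is a small sketch with a hypothetical name and invented strings; it only reports length and a preview, not quality.

```python
def summarize_responses(responses: dict, prompt: str = ""):
    """Return (label, word count, 60-char preview) for each response."""
    rows = []
    for label, text in responses.items():
        # Drop the echoed prompt if the decoded text starts with it
        completion = text[len(prompt):] if prompt and text.startswith(prompt) else text
        rows.append((label, len(completion.split()), completion.strip()[:60]))
    return rows

# Invented strings standing in for response_before / response_after
demo = {"before": "Q: hi A long answer here", "after": "Q: hi short"}
for row in summarize_responses(demo, prompt="Q: hi"):
    print(row)
```

Length alone does not indicate better alignment, so read the completions as well.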

## Step 14: Save Models

In [None]:
# Save DPO model
model.save_pretrained("gemma2_dpo_aligned")
tokenizer.save_pretrained("gemma2_dpo_aligned")

# Save ORPO model
model_orpo.save_pretrained("gemma2_orpo_aligned")
tokenizer_orpo.save_pretrained("gemma2_orpo_aligned")

print("Both models saved!")
print("DPO model: gemma2_dpo_aligned/")
print("ORPO model: gemma2_orpo_aligned/")

## Step 15: Export to GGUF/Ollama

In [None]:
# Export DPO model (Unsloth's save_pretrained_gguf merges the LoRA adapters before converting)
model.save_pretrained_gguf(
    "gemma2_dpo_gguf",
    tokenizer,
    quantization_method = "q4_k_m"
)

# Export ORPO model the same way
model_orpo.save_pretrained_gguf(
    "gemma2_orpo_gguf",
    tokenizer_orpo,
    quantization_method = "q4_k_m"
)

print("Models exported to GGUF format!")

## Summary

### What we accomplished:
1. Learned about RLHF, DPO, and ORPO
2. Trained models using **preference data** (chosen vs rejected)
3. Compared DPO and ORPO approaches
4. Demonstrated alignment improvements

### Key Insights:

**Traditional RLHF (PPO)**:
- ❌ Complex: Needs separate reward model
- ❌ Unstable: Training can diverge
- ❌ Slow: Multiple models to train

**DPO (Direct Preference Optimization)**:
- ✅ Simple: No reward model
- ✅ Stable: Direct optimization
- ⚠️ Needs a reference model (extra memory, unless the frozen base weights under LoRA serve as the reference)

**ORPO (Odds Ratio PO)**:
- ✅ No reference model (saves memory)
- ✅ Combines SFT + preference learning
- ✅ Can match or beat DPO, depending on model and data

### When to use preference learning:
- Aligning model with human values
- Improving response quality
- Teaching helpfulness/harmlessness
- Reducing unwanted behaviors

### Dataset Requirements:
- Pairs of (prompt, chosen, rejected)
- High-quality preference labels
- Diverse examples
- 10k+ pairs recommended
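
Two of these requirements, no duplicated pairs and diverse prompts, can be checked with a few lines of code. The sketch below uses invented records and hypothetical helper names; it measures only exact-match repetition, which is a rough proxy for diversity.

```python
def dedupe_preference_pairs(pairs):
    """Drop exact duplicate (prompt, chosen, rejected) triples, keeping order."""
    seen = set()
    unique = []
    for pair in pairs:
        key = (pair["prompt"].strip(), pair["chosen"].strip(), pair["rejected"].strip())
        if key in seen:
            continue
        seen.add(key)
        unique.append(pair)
    return unique

def prompt_diversity(pairs):
    """Fraction of pairs with a distinct prompt (1.0 means no repeated prompts)."""
    if not pairs:
        return 0.0
    return len({p["prompt"].strip() for p in pairs}) / len(pairs)

toy_pairs = [
    {"prompt": "Q1", "chosen": "good", "rejected": "bad"},
    {"prompt": "Q1", "chosen": "good", "rejected": "bad"},    # exact duplicate
    {"prompt": "Q1", "chosen": "better", "rejected": "bad"},  # same prompt, new pair
]
unique_pairs = dedupe_preference_pairs(toy_pairs)
print(len(unique_pairs), prompt_diversity(unique_pairs))
```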

### Next Steps:
- Try with your own preference data
- Experiment with different beta values (DPO)
- Test on larger models (Llama 3.1 8B)
- Combine with reinforcement learning (GRPO)
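
For the beta experiment, it helps to see how beta rescales the DPO loss for a fixed log-ratio margin between chosen and rejected responses. This is a pure-Python sketch with an invented margin value; `dpo_loss_for_margin` is a hypothetical helper, not a TRL function.

```python
import math

def dpo_loss_for_margin(log_ratio_margin: float, beta: float) -> float:
    """-log(sigmoid(beta * margin)), computed in a numerically stable way."""
    x = beta * log_ratio_margin
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

# Same invented margin, different beta values
for beta in (0.01, 0.1, 0.5):
    print(f"beta={beta}: loss={dpo_loss_for_margin(5.0, beta):.4f}")
```

A larger beta makes the loss react more sharply to the same margin, which ties the policy more tightly to the reference; smaller values allow more drift.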

### Performance Tips:
1. Start with an instruction-tuned model (not a base model)
2. Use high-quality preference data
3. Use a lower learning rate than for SFT (5e-6 to 5e-5)
4. Try ORPO as well as DPO; which works better depends on the model and data
5. More preference pairs usually help, provided the labels stay high-quality