# Class 7: Comprehensive Alignment Comparison

Welcome to the educational walkthrough for **PPO vs DPO vs GRPO** in LLM alignment!

This series of notebooks will guide you step by step through the concepts, code, and experiments for comparing three major alignment methods for language models.

**Key Topics:**
- Preference data collection
- PPO (Proximal Policy Optimization)
- DPO (Direct Preference Optimization)
- GRPO (Group Relative Policy Optimization)
- Evaluation and comparison

---

## 1. Introduction & Setup

This notebook sets up the environment and introduces the main configuration and dependencies.

In [1]:
# Imports and configuration
import os
import json
import torch
import warnings
import numpy as np
from typing import List, Dict, Optional, Any
from datasets import Dataset, load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType
from openai import OpenAI
from dotenv import load_dotenv
import logging

# Configuration
USE_SMALL_MODELS = True
SKIP_OLLAMA = True

# Import libraries with error handling
try:
    from trl import PPOConfig, DPOConfig, GRPOConfig
    TRL_AVAILABLE = True
    print("✅ TRL library loaded successfully")
except ImportError as e:
    print(f'❌ TRL not found: {e}')
    TRL_AVAILABLE = False

try:
    import gradio as gr
    GRADIO_AVAILABLE = True
    print("✅ Gradio library loaded successfully")
except ImportError:
    print("❌ Gradio not found. Install with: pip install gradio")
    GRADIO_AVAILABLE = False

warnings.filterwarnings("ignore")
load_dotenv()
logging.basicConfig(level=logging.INFO)




✅ TRL library loaded successfully
✅ Gradio library loaded successfully


# Step 2: Loading Preference Datasets

In this step, we'll load and inspect the preference datasets that will be used for alignment training.
Preference datasets are essential for training and evaluating alignment methods such as PPO, DPO, and GRPO.

---

## Function: `load_preference_datasets`

This function loads a comprehensive set of preference data, including both sample data and (if available) external datasets.


In [2]:
from typing import Dict
class AlignmentComparisonManager:
    def load_preference_datasets(self) -> Dict:
        """Step 1: Load comprehensive preference datasets"""
        print("📚 STEP 1: LOADING PREFERENCE DATASETS")
        print("Creating comprehensive preference data for alignment training...")
        # Create high-quality preference data
        preference_data = [
            {
                'prompt': 'How do you debug a memory leak in a Python application?',
                'chosen': 'To debug memory leaks in Python: 1) Use memory profilers like memory_profiler or pympler, 2) Identify objects not being garbage collected, 3) Check for circular references, 4) Review global variables and caches, 5) Use weak references appropriately, 6) Monitor memory usage over time, 7) Use tools like objgraph to visualize object references.',
                'rejected': "Just restart the application when it uses too much memory. Memory leaks aren't really a problem in Python."
            },
            {
                'prompt': "What's the difference between supervised and unsupervised learning?",
                'chosen': "Supervised learning uses labeled training data where input-output pairs guide the algorithm to learn patterns for prediction. Examples include classification and regression. Unsupervised learning finds hidden patterns in unlabeled data without target outputs, such as clustering and dimensionality reduction. The key difference is the presence of target variables in supervised learning.",
                'rejected': "Supervised learning is when someone supervises the computer while it learns. Unsupervised learning is when the computer learns by itself."
            },
            {
                'prompt': 'How do you develop a go-to-market strategy for a new product?',
                'chosen': 'A comprehensive GTM strategy includes: 1) Market research and customer segmentation, 2) Value proposition definition, 3) Competitive analysis, 4) Pricing strategy, 5) Distribution channel selection, 6) Marketing and sales strategies, 7) Success metrics and KPIs, 8) Launch timeline and milestones, 9) Risk assessment and contingency plans, 10) Post-launch optimization plan.',
                'rejected': "Just build the product and start selling it. If it's good, people will buy it. Marketing isn't that important."
            },
            {
                'prompt': 'How do you handle underperformance in your team?',
                'chosen': 'Address underperformance systematically: 1) Document specific performance gaps, 2) Have a direct, empathetic conversation to understand root causes, 3) Collaborate on a performance improvement plan with clear expectations, 4) Provide necessary resources and support, 5) Schedule regular check-ins, 6) Recognize improvements, 7) If no improvement, follow HR procedures.',
                'rejected': 'Call them out in front of the team so everyone knows they need to improve. Public pressure usually motivates people.'
            },
            {
                'prompt': 'How do you communicate complex technical concepts to non-technical stakeholders?',
                'chosen': "Effective technical communication involves: 1) Understanding your audience's background, 2) Using analogies and real-world examples, 3) Avoiding jargon, 4) Focusing on business impact, 5) Using visual aids, 6) Structuring information logically, 7) Checking for understanding, 8) Providing clear next steps.",
                'rejected': 'Dumb it down as much as possible and use simple words. Technical people just need to learn to explain things better.'
            }
        ]
        # Start with the built-in sample data; real datasets are added below when available
        datasets_info = {
            'comprehensive_preferences': {
                'dataset': Dataset.from_list(preference_data),
                'description': 'Comprehensive multi-domain preference dataset',
                'size': len(preference_data)
            }
        }
        try:
            print('  📥 Attempting to load real preference datasets...')
            hh_dataset = load_dataset('Anthropic/hh-rlhf', split='train[:10]')
            datasets_info['hh_rlhf'] = {
                'dataset': hh_dataset,
                'description': 'Anthropic HH-RLHF helpful and harmless',
                'size': len(hh_dataset)
            }
            print(f'  ✅ Loaded {len(hh_dataset)} samples from HH-RLHF')
        except Exception as e:
            print(f'  ⚠️ Could not load external datasets: {e}')
        print(f'📊 Total datasets available: {len(datasets_info)}')
        return datasets_info


---

## Example: Load and Inspect Datasets

Let's create an instance of the manager and load the datasets.


In [3]:
# Instantiate the manager and load datasets
manager = AlignmentComparisonManager()
datasets_info = manager.load_preference_datasets()

# Show available datasets and their sizes
for name, info in datasets_info.items():
    print(f"{name}: {info['description']} (size: {info['size']})")


📚 STEP 1: LOADING PREFERENCE DATASETS
Creating comprehensive preference data for alignment training...
  📥 Attempting to load real preference datasets...
  ✅ Loaded 10 samples from HH-RLHF
📊 Total datasets available: 2
comprehensive_preferences: Comprehensive multi-domain preference dataset (size: 5)
hh_rlhf: Anthropic HH-RLHF helpful and harmless (size: 10)
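
Before training, it can help to check that every record has the three fields the later steps rely on. The following is a minimal sketch, not part of the original notebook: the helper name `find_invalid_records` and the sample records are made up for illustration.

```python
from typing import Dict, List

# Fields that the later preparation step reads from each record.
REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def find_invalid_records(records: List[Dict[str, str]]) -> List[int]:
    """Return the indices of records with a missing or empty required field."""
    bad = []
    for i, rec in enumerate(records):
        if any(not str(rec.get(key, "")).strip() for key in REQUIRED_KEYS):
            bad.append(i)
    return bad


# Hypothetical records: the second one has no separate prompt field.
sample = [
    {"prompt": "What is LoRA?", "chosen": "A low-rank adapter method.", "rejected": "No idea."},
    {"chosen": "Transcript A", "rejected": "Transcript B"},
]
print(find_invalid_records(sample))  # [1]
```

Running a check like this on each loaded dataset shows early which sources need reshaping before they can be used as preference pairs.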


# Step 3: Building the Annotation Platform

In this step, we'll build an interactive annotation platform for collecting preference data.

This platform allows users to compare two responses to a prompt and indicate which one they prefer, along with their reasoning.

---

## Function: `create_annotation_platform`

This function creates a Gradio interface for preference annotation.


In [4]:
import gradio as gr
import numpy as np
import json

class AlignmentComparisonManager:
    def __init__(self):
        self.annotations = []

    def create_annotation_platform(self):
        """Create preference annotation platform"""
        def generate_responses(prompt, style_a="helpful", style_b="brief"):
            templates = {
                "helpful": f"Here's a comprehensive approach to {prompt.lower()}: ...",
                "brief": f"For {prompt.lower()}: ...",
                "detailed": f"Regarding {prompt.lower()}, this requires careful consideration...",
                "technical": f"To address {prompt.lower()}, implement systematic methodology..."
            }
            return (templates.get(style_a, templates["helpful"]), templates.get(style_b, templates["brief"]))

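        def save_annotation(prompt, response_a, response_b, preference, reasoning):
            # Refuse to save until a preference is selected; otherwise a missing
            # choice would silently be recorded as "Response B".
            if preference not in ("Response A", "Response B"):
                return "⚠️ Select Response A or Response B before saving"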
            annotation = {
                "prompt": prompt,
                "response_a": response_a,
                "response_b": response_b,
                "chosen": response_a if preference == "Response A" else response_b,
                "rejected": response_b if preference == "Response A" else response_a,
                "reasoning": reasoning,
                "timestamp": str(np.datetime64('now'))
            }
            self.annotations.append(annotation)
            with open("annotations.json", "w") as f:
                json.dump(self.annotations, f, indent=2)
            return f"✅ Saved annotation #{len(self.annotations)}"

        with gr.Blocks(title="🎯 Preference Annotation Platform") as demo:
            gr.Markdown("# 🎯 Preference Annotation Platform")
            gr.Markdown("Create preference data for alignment training")
            with gr.Row():
                with gr.Column():
                    prompt_input = gr.Textbox(label="📝 Enter Prompt", placeholder="How do you handle difficult conversations?", lines=3)
                    with gr.Row():
                        style_a = gr.Dropdown(["helpful", "detailed", "technical", "brief"], label="Style A", value="helpful")
                        style_b = gr.Dropdown(["helpful", "detailed", "technical", "brief"], label="Style B", value="brief")
                    generate_btn = gr.Button("🚀 Generate Responses", variant="primary")
                with gr.Column():
                    status_output = gr.Textbox(label="📊 Status", interactive=False)
            with gr.Row():
                with gr.Column():
                    gr.Markdown("### 🅰️ Response A")
                    response_a_output = gr.Textbox(label="Response A", lines=6, interactive=False)
                with gr.Column():
                    gr.Markdown("### 🅱️ Response B")
                    response_b_output = gr.Textbox(label="Response B", lines=6, interactive=False)
            preference_radio = gr.Radio(["Response A", "Response B"], label="👍 Which response is better?", value=None)
            reasoning_input = gr.Textbox(label="💭 Reasoning", placeholder="Why is this response better?", lines=3)
            save_btn = gr.Button("💾 Save Annotation", variant="secondary")
            generate_btn.click(fn=generate_responses, inputs=[prompt_input, style_a, style_b], outputs=[response_a_output, response_b_output])
            save_btn.click(fn=save_annotation, inputs=[prompt_input, response_a_output, response_b_output, preference_radio, reasoning_input], outputs=[status_output])
        return demo


---

## Usage: Launch the Annotation Platform

Run the following cell to launch the annotation platform. Gradio serves the interface at a local URL, which you can open in your browser.


In [5]:
manager = AlignmentComparisonManager()
demo = manager.create_annotation_platform()
demo.launch()  # This will start the Gradio app for annotation


INFO:httpx:HTTP Request: GET http://127.0.0.1:7860/gradio_api/startup-events "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: HEAD http://127.0.0.1:7860/ "HTTP/1.1 200 OK"


* Running on local URL:  http://127.0.0.1:7860
* To create a public link, set `share=True` in `launch()`.




INFO:httpx:HTTP Request: GET https://api.gradio.app/pkg-version "HTTP/1.1 200 OK"
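
Each saved annotation carries more fields than training needs. Below is a minimal sketch, assuming the `annotations.json` layout written by `save_annotation` above, of reloading the file and keeping only `prompt`, `chosen`, and `rejected`. The helper name `annotations_to_preferences` and the demo record are made up for illustration.

```python
import json
import os
import tempfile
from typing import Dict, List


def annotations_to_preferences(path: str) -> List[Dict[str, str]]:
    """Load saved annotations and keep only the fields used for preference training."""
    with open(path, "r", encoding="utf-8") as f:
        annotations = json.load(f)
    return [
        {"prompt": a["prompt"], "chosen": a["chosen"], "rejected": a["rejected"]}
        for a in annotations
    ]


# Hypothetical annotation, written to a temporary file so the example is self-contained.
demo_annotations = [{
    "prompt": "p", "response_a": "a", "response_b": "b",
    "chosen": "a", "rejected": "b",
    "reasoning": "clearer", "timestamp": "2024-01-01T00:00:00",
}]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "annotations.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(demo_annotations, f)
    prefs = annotations_to_preferences(demo_path)

print(prefs)  # [{'prompt': 'p', 'chosen': 'a', 'rejected': 'b'}]
```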


# Step 4: Preparing Datasets for Alignment Methods

In this step, we'll prepare the datasets for different alignment methods: PPO, DPO, and GRPO.

This step converts the loaded preference data into the formats required for each method.

---

## Function: `prepare_datasets`

This function processes all available datasets and creates specialized datasets for each alignment method.


In [6]:
from datasets import Dataset

class AlignmentComparisonManager:
    def __init__(self):
        self.annotations = []

    def prepare_datasets(self, datasets_info):
        """Prepare datasets for different alignment methods"""
        print("\n🔧 STEP 3: PREPARING ALIGNMENT DATASETS")
        print("Converting data for PPO, DPO, and GRPO training...")
        all_preference_data = []
        for dataset_name, info in datasets_info.items():
            dataset = info['dataset']
            print(f"  📊 Processing {dataset_name} ({info['size']} samples)...")
            for example in dataset:
                try:
                    if 'chosen' in example and 'rejected' in example:
                        preference_example = {
                            'prompt': str(example.get('prompt', '')),
                            'chosen': str(example['chosen']),
                            'rejected': str(example['rejected'])
                        }
                    else:
                        continue
                    if all(len(str(preference_example[key]).strip()) > 10 for key in ['prompt', 'chosen', 'rejected']):
                        all_preference_data.append(preference_example)
                except Exception:
                    continue
        # Add manual annotations
        if self.annotations:
            print(f"  📝 Adding {len(self.annotations)} manual annotations...")
            for annotation in self.annotations:
                preference_example = {
                    'prompt': str(annotation['prompt']),
                    'chosen': str(annotation['chosen']),
                    'rejected': str(annotation['rejected'])
                }
                all_preference_data.append(preference_example)
        datasets = {}
        if all_preference_data:
            datasets['dpo'] = Dataset.from_list(all_preference_data)
            datasets['ppo'] = Dataset.from_list([{'query': item['prompt']} for item in all_preference_data])
            datasets['grpo'] = Dataset.from_list([{'prompt': item['prompt']} for item in all_preference_data])
            print(f"  ✅ Created datasets - DPO: {len(datasets['dpo'])}, PPO: {len(datasets['ppo'])}, GRPO: {len(datasets['grpo'])}")
        return datasets
    


---

## Example: Prepare and Inspect Datasets

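Let's use the manager to prepare the datasets and inspect their sizes. Only the 5 built-in examples survive this step: HH-RLHF records contain `chosen` and `rejected` transcripts but no separate `prompt` field, so their empty prompt fails the length filter and all 10 are skipped.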

In [7]:
manager = AlignmentComparisonManager()
# Assume datasets_info is already loaded from previous step
datasets = manager.prepare_datasets(datasets_info)
for name, ds in datasets.items():
    print(f"{name}: {len(ds)} samples")


🔧 STEP 3: PREPARING ALIGNMENT DATASETS
Converting data for PPO, DPO, and GRPO training...
  📊 Processing comprehensive_preferences (5 samples)...
  📊 Processing hh_rlhf (10 samples)...
  ✅ Created datasets - DPO: 5, PPO: 5, GRPO: 5
dpo: 5 samples
ppo: 5 samples
grpo: 5 samples
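
The HH-RLHF rows are dropped because they have no `prompt` field. One way to recover them is to split each transcript at its final assistant turn. This is a minimal sketch, not part of the original notebook: it assumes both transcripts share the same dialogue up to the last `Assistant:` turn (the usual HH-RLHF layout), and the helper name `split_hh_pair` and the example transcripts are made up for illustration.

```python
from typing import Dict, Optional

# Marker that opens each assistant turn in an HH-RLHF-style transcript.
ASSISTANT_TAG = "\n\nAssistant:"


def split_hh_pair(chosen: str, rejected: str) -> Optional[Dict[str, str]]:
    """Split two transcripts into prompt / chosen / rejected.

    Returns None when the transcripts have no assistant turn or when they
    diverge before the final assistant turn.
    """
    cut = chosen.rfind(ASSISTANT_TAG)
    if cut == -1:
        return None
    prompt = chosen[: cut + len(ASSISTANT_TAG)]
    if not rejected.startswith(prompt):
        return None
    return {
        "prompt": prompt,
        "chosen": chosen[len(prompt):].strip(),
        "rejected": rejected[len(prompt):].strip(),
    }


# Hypothetical transcripts in the HH-RLHF style.
chosen_text = "\n\nHuman: How do I boil an egg?\n\nAssistant: Simmer it for about ten minutes."
rejected_text = "\n\nHuman: How do I boil an egg?\n\nAssistant: I don't know."
pair = split_hh_pair(chosen_text, rejected_text)
print(pair["chosen"])    # Simmer it for about ten minutes.
print(pair["rejected"])  # I don't know.
```

Applying a split like this before `prepare_datasets` would let the external rows pass the length filter; whether that is desirable depends on how much of the dialogue history should count as the prompt.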


# Step 5: Setting Up the Model for Alignment Training

In this step, we'll set up the language model and tokenizer for alignment training.

We'll use a small model for demonstration (DistilGPT-2, falling back to GPT-2 if it fails to load) and configure it for parameter-efficient fine-tuning with LoRA.

---

## Function: `setup_models`

This function loads and configures the model and tokenizer for alignment training.


In [8]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

class AlignmentComparisonManager:
    def __init__(self):
        self.device = 'cuda' if torch.cuda.is_available() else ('mps' if torch.backends.mps.is_available() else 'cpu')

    def setup_models(self):
        """Setup models for alignment training"""
        print("\n🔧 STEP 4: SETTING UP MODELS")
        print("Loading and configuring models for alignment training...")
        model_options = ['distilgpt2', 'gpt2']
        for model_name in model_options:
            try:
                print(f'🔄 Loading {model_name}...')
                tokenizer = AutoTokenizer.from_pretrained(model_name)
                if tokenizer.pad_token is None:
                    tokenizer.pad_token = tokenizer.eos_token
                    tokenizer.pad_token_id = tokenizer.eos_token_id
                model = AutoModelForCausalLM.from_pretrained(
                    model_name,
                    torch_dtype=torch.float32,
                    device_map=None,
                    low_cpu_mem_usage=True
                )
                model = model.to(self.device)
                # Setup LoRA
                lora_config = LoraConfig(
                    task_type=TaskType.CAUSAL_LM,
                    r=8,
                    lora_alpha=16,
                    lora_dropout=0.1,
                    target_modules=['c_attn'],
                    bias='none'
                )
                model = get_peft_model(model, lora_config)
                print(f'✅ Successfully loaded {model_name}')
                return {'model': model, 'tokenizer': tokenizer, 'name': model_name}
            except Exception as e:
                print(f'❌ Failed to load {model_name}: {e}')
                continue
        raise RuntimeError('Failed to load any model')
    


---

## Example: Setup the Model

Let's use the manager to set up the model and tokenizer for alignment training.

In [9]:
manager = AlignmentComparisonManager()
model_info = manager.setup_models()
print(f"Loaded model: {model_info['name']}")


🔧 STEP 4: SETTING UP MODELS
Loading and configuring models for alignment training...
🔄 Loading distilgpt2...


`torch_dtype` is deprecated! Use `dtype` instead!


'NoneType' object has no attribute 'cadam32bit_grad_fp32'
✅ Successfully loaded distilgpt2
Loaded model: distilgpt2
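
To see why LoRA counts as parameter-efficient here, the number of trainable adapter weights can be worked out by hand. This is a back-of-the-envelope sketch, assuming the standard GPT-2 small dimensions (hidden size 768, `c_attn` projecting to 3 × 768, 6 layers for DistilGPT-2 and 12 for GPT-2) and the `r=8`, `lora_alpha=16` settings from `setup_models`. The helper name `lora_param_count` is made up for illustration.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Weights in one LoRA adapter: A is (r x in_features), B is (out_features x r)."""
    return r * (in_features + out_features)


HIDDEN = 768              # hidden size of GPT-2 small and DistilGPT-2
C_ATTN_OUT = 3 * HIDDEN   # c_attn produces the concatenated query, key, and value
R, ALPHA = 8, 16          # values used in the LoraConfig above

per_layer = lora_param_count(HIDDEN, C_ATTN_OUT, R)
print(per_layer)  # 24576

for name, n_layers in {"distilgpt2": 6, "gpt2": 12}.items():
    print(name, per_layer * n_layers)  # distilgpt2 147456, gpt2 294912

# The adapter output is scaled by lora_alpha / r before being added to the frozen layer.
print(ALPHA / R)  # 2.0
```

With `bias='none'` no bias terms are trained, so these adapter weights are the only parameters that receive gradients.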


## Direct Preference Optimization (DPO) Explained

DPO optimizes the model to prefer chosen responses over rejected ones using this loss function:

```
L_DPO = -log(σ(β * log(π_θ(y_w|x)/π_ref(y_w|x)) - β * log(π_θ(y_l|x)/π_ref(y_l|x))))
```

Where:
- `y_w` = chosen response
- `y_l` = rejected response  
- `β` = beta parameter (controls how strongly the policy is tied to the reference model)
- `π_θ` = current model
- `π_ref` = reference model

**Key Parameters:**
- **Beta (β)**: Higher values keep the policy closer to the reference model; lower values let it deviate more to fit the preferences
- **Learning Rate**: Lower for stable alignment
- **Batch Size**: Small for memory efficiency
- **Max Length**: Shorter sequences for stability
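
The loss above can be evaluated by hand for a single preference pair. This is a minimal sketch in plain Python, assuming the four inputs are summed sequence log-probabilities. The function name `dpo_loss` and the example numbers are made up for illustration.

```python
import math


def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO loss for one pair, given log-probabilities under the policy and the reference."""
    # beta * [log(pi/ref) for the chosen response - log(pi/ref) for the rejected one]
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)) written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the margin is 0 and the loss equals log(2).
start = dpo_loss(-12.0, -12.0, -12.0, -12.0)
print(round(start, 4))  # 0.6931

# Policy raises the chosen response and lowers the rejected one relative to the reference.
improved = dpo_loss(-10.0, -15.0, -12.0, -12.0)
print(improved < start)  # True
```

The loss starts at log 2 when the policy equals the reference and falls as the policy shifts probability from the rejected response to the chosen one.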


## PPO Training (Classic RLHF)

**What is PPO?**

PPO (Proximal Policy Optimization) is the reinforcement learning algorithm used in the classic RLHF pipeline (InstructGPT, ChatGPT). Unlike DPO, it requires a separately trained reward model and optimizes the policy with online reinforcement learning.

**PPO vs DPO Comparison:**

| Aspect | PPO/RLHF | DPO |
|--------|----------|-----|
| **Complexity** | High (needs reward model + RL) | Low (direct optimization) |
| **Stability** | Can be unstable | More stable |
| **Speed** | Slower (online generation, multiple models) | Faster (offline, no generation during training) |
| **Memory** | Higher (policy + reference + value + reward models) | Lower (policy + reference) |
| **Data Requirements** | Preference pairs → Reward model → RL | Direct preference pairs |

**PPO Process:**
1. **Collect Preferences**: Human feedback on model outputs
2. **Train Reward Model**: Predict human preferences 
3. **RL Training**: Use PPO to maximize reward while staying close to reference model
4. **Iterate**: Repeat process for continuous improvement
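
Step 3 can be made concrete with the two quantities PPO-based RLHF is usually described with: a reward penalized for drifting from the reference model, and the clipped surrogate objective. This is a minimal per-sample sketch in plain Python, not a training loop. The function names, the coefficients `kl_coef=0.1` and `clip_eps=0.2`, and the example numbers are assumptions chosen for illustration.

```python
import math


def kl_shaped_reward(reward: float, logp_policy: float, logp_ref: float,
                     kl_coef: float = 0.1) -> float:
    """Reward model score minus a penalty for moving away from the reference model."""
    return reward - kl_coef * (logp_policy - logp_ref)


def clipped_surrogate(logp_new: float, logp_old: float, advantage: float,
                      clip_eps: float = 0.2) -> float:
    """PPO clipped objective for one sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# The policy became more likely to produce this sample (ratio is about 1.65).
print(clipped_surrogate(0.5, 0.0, advantage=1.0))    # 1.2, the gain is capped at 1 + clip_eps
print(clipped_surrogate(0.5, 0.0, advantage=-1.0))   # about -1.65, the penalty is not capped
print(round(kl_shaped_reward(1.0, -2.0, -2.5), 2))   # 0.95
```

The clipping limits how much a single update can gain from pushing the probability ratio far from 1, which is one reason PPO stays close to the policy that generated the samples.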

**When to Use PPO:**
- Need explicit reward signals
- Have large-scale preference data
- Want fine-grained control over reward modeling
- Building complex multi-objective systems


## 🚀 Cutting-Edge: GRPO Training

**What is GRPO?**

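GRPO (Group Relative Policy Optimization) is a PPO variant introduced in the DeepSeekMath paper. For each prompt it samples a group of completions, scores them with a reward function or reward model, and uses each completion's reward relative to the group average as its advantage, so no separate value (critic) model is needed.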

**GRPO Advantages:**
- **No Value Model**: The group's mean reward serves as the baseline, which removes PPO's critic and lowers memory use
- **Flexible Rewards**: Works with a learned reward model or with rule-based reward functions (for example, answer correctness)
- **Online Sampling**: Trains on completions sampled from the current policy rather than on a fixed set of pairs
- **Reference Regularization**: A KL penalty can keep the policy close to the reference model
- **Multi-Objective**: Several reward functions can be combined into one score

**GRPO vs Other Methods:**

| Feature | PPO | DPO | GRPO |
|---------|-----|-----|------|
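| **Feedback Types** | Scalar rewards | Preference pairs | Scalar rewards per sampled completion |
| **Reward Model** | Separate training | Not needed | Reward model or rule-based function |
| **Value Model** | Required | Not needed | Not needed |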
| **Stability** | Moderate | High | High |
| **Flexibility** | Low | Moderate | High |
| **Implementation** | Complex | Simple | Moderate |

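**When to Use GRPO:**
- Have a reward function or reward model that can score sampled completions
- Train on tasks with verifiable answers, such as math or code
- Want online reinforcement learning without the memory cost of a value model
- Need to combine several reward signals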

**GRPO Process:**
1. **Group Sampling**: Generate several completions per prompt from the current policy
2. **Reward Scoring**: Score each completion with the reward function or reward model
3. **Group-Relative Advantage**: Normalize each reward by the group's mean and standard deviation
4. **Policy Update**: Apply a PPO-style clipped update, optionally with a KL penalty toward the reference model
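
The group-relative advantage that gives Group Relative Policy Optimization its name can be sketched in a few lines. This is an illustrative example in plain Python, assuming one scalar reward per sampled completion. The function name `group_relative_advantages`, the use of the population standard deviation, and the example rewards are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]

# When every completion gets the same reward, there is no learning signal.
print(group_relative_advantages([2.0, 2.0]))  # [0.0, 0.0]
```

Completions that score above the group average get a positive advantage and are reinforced; those below it are discouraged. The group average plays the role that the value model plays in PPO.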


# Step 6: Demonstrating Alignment Methods (PPO, DPO, GRPO)

**Why Evaluate Alignment?**

- **Measure Improvement**: Compare before/after responses
- **Quality Assessment**: Ensure alignment actually helps
- **Safety Check**: Verify no harmful behavior
- **Task Performance**: Maintain task-specific capabilities

**Evaluation Methods:**
- **Qualitative**: Human evaluation of response quality
- **Quantitative**: Automated metrics and benchmarks
- **Comparative**: Before vs after alignment
- **Safety**: Red-teaming and adversarial testing
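
One simple quantitative check for the before/after comparison is preference accuracy: the fraction of held-out pairs where the model scores the chosen response above the rejected one. This is a minimal sketch, assuming each pair has already been scored (for example with sequence log-probabilities). The function name `preference_accuracy` and all numbers are made up for illustration and are not results from this notebook.

```python
from typing import List, Tuple


def preference_accuracy(scored_pairs: List[Tuple[float, float]]) -> float:
    """Fraction of (chosen_score, rejected_score) pairs where the chosen score is higher."""
    if not scored_pairs:
        return 0.0
    wins = sum(1 for chosen, rejected in scored_pairs if chosen > rejected)
    return wins / len(scored_pairs)


# Hypothetical scores for the same three pairs before and after alignment.
before = [(-42.0, -40.5), (-30.2, -31.0), (-55.1, -54.0)]
after = [(-41.0, -43.5), (-29.8, -33.0), (-55.0, -54.2)]

print(round(preference_accuracy(before), 3))  # 0.333
print(round(preference_accuracy(after), 3))   # 0.667
```

A metric like this complements human evaluation rather than replacing it, since it only measures agreement with the labeled pairs.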

In this step, we'll demonstrate the three major alignment methods: PPO, DPO, and GRPO.

We'll run a simplified training loop for each method and observe the results.

---

## Function: `demonstrate_alignment_methods`

This function runs a demonstration of PPO, DPO, and GRPO training using the prepared datasets and model.

In [10]:
import torch
import numpy as np

class AlignmentComparisonManager:
    def _prepare_datasets_for_training(self, datasets, tokenizer):
        """Prepare datasets with proper tokenization for training"""
        prepared_datasets = {}
        # Prepare PPO dataset
        if "ppo" in datasets:
            ppo_data = []
            for item in datasets["ppo"]:
                tokenized = tokenizer(
                    item["query"], 
                    return_tensors="pt", 
                    padding=True, 
                    truncation=True,
                    max_length=128
                )
                ppo_data.append({
                    "input_ids": tokenized["input_ids"].squeeze(0),
                    "attention_mask": tokenized["attention_mask"].squeeze(0),
                    "query": item["query"]
                })
            if ppo_data:
                prepared_datasets["ppo"] = ppo_data
        # Prepare DPO dataset  
        if "dpo" in datasets:
            dpo_data = []
            for item in datasets["dpo"]:
                if all(key in item for key in ["prompt", "chosen", "rejected"]):
                    dpo_data.append({
                        "prompt": str(item["prompt"]),
                        "chosen": str(item["chosen"]),
                        "rejected": str(item["rejected"])
                    })
            if dpo_data:
                prepared_datasets["dpo"] = dpo_data
        # Prepare GRPO dataset
        if "grpo" in datasets:
            grpo_data = []
            for item in datasets["grpo"]:
                if "prompt" in item:
                    grpo_data.append({
                        "prompt": str(item["prompt"])
                    })
            if grpo_data:
                prepared_datasets["grpo"] = grpo_data
        return prepared_datasets

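    def _test_ppo_training(self, model_info, dataset):
        """Simplified stand-in for PPO: a few SGD steps on the language-modeling loss.

        No reward model, rollouts, or clipped policy objective are involved.
        """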
        if not dataset or len(dataset) == 0:
            return {"status": "no_data", "message": "No dataset available"}
        model = model_info["model"]
        tokenizer = model_info["tokenizer"]
        losses = []
        for step in range(min(3, len(dataset))):
            query = dataset[step]["query"]
            inputs = tokenizer(
                f"Query: {query}\nResponse:",
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=128
            ).to(model.device)
            model.train()
            outputs = model(**inputs, labels=inputs["input_ids"])
            loss = outputs.loss
            loss.backward()
            with torch.no_grad():
                for param in model.parameters():
                    if param.grad is not None:
                        grad_norm = torch.norm(param.grad.data)
                        if grad_norm > 1.0:
                            param.grad.data = param.grad.data / grad_norm
                        param.data -= 1e-5 * param.grad.data
                        param.grad.zero_()
            losses.append(loss.item())
        return {
            "status": "trained_successfully",
            "losses": losses,
            "final_loss": losses[-1] if losses else 0,
            "steps": len(losses)
        }

    def _test_dpo_training(self, model_info, dataset):
        if not dataset or len(dataset) == 0:
            return {"status": "no_data", "message": "No dataset available"}
        model = model_info["model"]
        tokenizer = model_info["tokenizer"]
        losses = []
        for step in range(min(3, len(dataset))):
            data = dataset[step]
            prompt = data["prompt"]
            chosen = data["chosen"]
            rejected = data["rejected"]
            chosen_text = f"{prompt}\n{chosen}"
            rejected_text = f"{prompt}\n{rejected}"
            chosen_inputs = tokenizer(
                chosen_text,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=256
            ).to(model.device)
            rejected_inputs = tokenizer(
                rejected_text,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=256
            ).to(model.device)
            model.train()
            chosen_outputs = model(**chosen_inputs, labels=chosen_inputs["input_ids"])
            rejected_outputs = model(**rejected_inputs, labels=rejected_inputs["input_ids"])
            chosen_loss = chosen_outputs.loss
            rejected_loss = rejected_outputs.loss
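            beta = 0.1
            # Reference-free approximation: the mean token losses stand in for the
            # log-probability ratios of the full DPO loss. logsigmoid is numerically
            # more stable than log(sigmoid(...)).
            dpo_loss = -torch.nn.functional.logsigmoid(beta * (rejected_loss - chosen_loss))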
            dpo_loss.backward()
            with torch.no_grad():
                for param in model.parameters():
                    if param.grad is not None:
                        param.data -= 5e-6 * param.grad.data
                        param.grad.zero_()
            losses.append(dpo_loss.item())
        return {
            "status": "trained_successfully",
            "losses": losses,
            "final_loss": losses[-1] if losses else 0,
            "steps": len(losses)
        }

    def _test_grpo_training(self, model_info, dataset):
        if not dataset or len(dataset) == 0:
            return {"status": "no_data", "message": "No dataset available"}
        model = model_info["model"]
        tokenizer = model_info["tokenizer"]
        losses = []
        for step in range(min(3, len(dataset))):
            prompt = dataset[step]["prompt"]
            inputs = tokenizer(
                f"Question: {prompt}\nAnswer:",
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=128
            ).to(model.device)
            with torch.no_grad():
                group_responses = model.generate(
                    **inputs,
                    max_new_tokens=32,
                    num_return_sequences=4,
                    do_sample=True,
                    temperature=0.8,
                    pad_token_id=tokenizer.eos_token_id
                )
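            model.train()
            # LM loss on the prompt alone serves as the base training signal.
            outputs = model(**inputs, labels=inputs["input_ids"])
            base_loss = outputs.loss
            # Score each sampled sequence by its LM loss (a stand-in for a reward).
            group_losses = []
            for response in group_responses:
                response_ids = response.unsqueeze(0)
                response_inputs = {"input_ids": response_ids, "attention_mask": torch.ones_like(response_ids)}
                try:
                    response_output = model(**response_inputs, labels=response_ids)
                    group_losses.append(response_output.loss.item())
                except Exception:
                    # Fall back to the prompt loss if a forward pass fails.
                    group_losses.append(base_loss.item())
            # Standardize the prompt loss against the group statistics.
            mean_group_loss = np.mean(group_losses)
            std_group_loss = np.std(group_losses) + 1e-8
            relative_advantage = (base_loss.item() - mean_group_loss) / std_group_loss
            # Simplified GRPO-style weighting for the demo, not the full clipped objective.
            grpo_loss = base_loss * (1 + 0.1 * relative_advantage)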
            grpo_loss.backward()
            with torch.no_grad():
                for param in model.parameters():
                    if param.grad is not None:
                        param.data -= 1e-5 * param.grad.data
                        param.grad.zero_()
            losses.append(grpo_loss.item())
        return {
            "status": "trained_successfully",
            "losses": losses,
            "final_loss": losses[-1] if losses else 0,
            "steps": len(losses),
            "group_size": 4
        }

    def demonstrate_alignment_methods(self, model_info, datasets):
        """Demonstrate all alignment methods (PPO, DPO, GRPO)"""
        print("\n🎯 STEP 5: DEMONSTRATING ALIGNMENT METHODS")
        print("Implementing PPO, DPO, and GRPO concepts...")
        print("⚠️  NOTE: This includes ACTUAL TRAINING - models will be updated")
        print("⚠️  Training is limited to demo purposes (few epochs/steps)")
        prepared_datasets = self._prepare_datasets_for_training(datasets, model_info['tokenizer'])
        results = {}
        print("\n🟡 PPO (Proximal Policy Optimization):")
        try:
            ppo_results = self._test_ppo_training(model_info, prepared_datasets.get('ppo'))
            results['ppo'] = ppo_results
        except Exception as e:
            print(f'  ❌ PPO training error: {e}')
            results['ppo'] = {'status': 'error', 'error': str(e)}
        print("\n🟢 DPO (Direct Preference Optimization):")
        try:
            dpo_results = self._test_dpo_training(model_info, prepared_datasets.get('dpo'))
            results['dpo'] = dpo_results
        except Exception as e:
            print(f'  ❌ DPO training error: {e}')
            results['dpo'] = {'status': 'error', 'error': str(e)}
        print("\n🟣 GRPO (Group Relative Policy Optimization):")
        try:
            grpo_results = self._test_grpo_training(model_info, prepared_datasets.get('grpo'))
            results['grpo'] = grpo_results
        except Exception as e:
            print(f'  ❌ GRPO training error: {e}')
            results['grpo'] = {'status': 'error', 'error': str(e)}
        print("\n✅ All alignment methods training completed!")
        return results
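
The GRPO demo above collapses the sampled group into a single scalar weight on the prompt loss. In the usual formulation of GRPO, each sampled response instead gets its own advantage, computed by standardizing its reward against the other responses in the same group. A minimal sketch of that normalization follows; the reward values and the `group_relative_advantages` helper are illustrative assumptions and not part of the notebook.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r_i - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for a group of 4 sampled responses to one prompt
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f}  advantage={a:+.3f}")
```

Responses that score above the group mean get a positive advantage and are reinforced; those below get a negative one. Because the baseline comes from the group itself, no separate value model is required.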

---

## Example: Run the Alignment Methods Demonstration

Let's use the manager to run the demonstration and observe the results.

In [11]:
manager = AlignmentComparisonManager()
results = manager.demonstrate_alignment_methods(model_info, datasets)
print(results)

`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.



🎯 STEP 5: DEMONSTRATING ALIGNMENT METHODS
Implementing PPO, DPO, and GRPO concepts...
⚠️  NOTE: This includes ACTUAL TRAINING - models will be updated
⚠️  Training is limited to demo purposes (few epochs/steps)

🟡 PPO (Proximal Policy Optimization):

🟢 DPO (Direct Preference Optimization):

🟣 GRPO (Group Relative Policy Optimization):

✅ All alignment methods training completed!
{'ppo': {'status': 'trained_successfully', 'losses': [3.942887306213379, 3.4061694145202637, 3.471867322921753], 'final_loss': 3.471867322921753, 'steps': 3}, 'dpo': {'status': 'trained_successfully', 'losses': [0.7097368836402893, 0.7240135669708252, 0.7008044123649597], 'final_loss': 0.7008044123649597, 'steps': 3}, 'grpo': {'status': 'trained_successfully', 'losses': [4.241724014282227, 4.295168399810791, 5.627329349517822], 'final_loss': 5.627329349517822, 'steps': 3, 'group_size': 4}}
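
The DPO losses above sit just over ln 2 ≈ 0.693. That is the value the demo's loss, -log σ(β · (rejected_loss − chosen_loss)), takes when the two losses are equal, and a value slightly above it means the rejected text had a marginally lower loss than the chosen text. The sketch below reproduces this relationship with made-up loss pairs; `demo_dpo_loss` is an illustrative helper, and the numbers are not taken from the run.

```python
import math

def demo_dpo_loss(chosen_loss, rejected_loss, beta=0.1):
    """Loss used in the demo: -log(sigmoid(beta * (rejected - chosen)))."""
    margin = beta * (rejected_loss - chosen_loss)
    # log(1 + exp(-margin)) equals -log(sigmoid(margin))
    return math.log1p(math.exp(-margin))

# Hypothetical (chosen_loss, rejected_loss) pairs, not taken from the run above
for chosen, rejected in [(3.5, 3.5), (3.5, 3.2), (3.0, 5.0)]:
    loss = demo_dpo_loss(chosen, rejected)
    print(f"chosen={chosen}, rejected={rejected}: loss={loss:.4f}")
```

With β = 0.1 the loss moves slowly: even a gap of 2.0 between the two losses only brings it a little below ln 2, so three small update steps are not expected to change it much.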


# Step 7: Evaluation and Comparison of Alignment Methods

In this step, we'll compare the unaligned base model with the three alignment methods (PPO, DPO, GRPO) using a simple quality scoring heuristic.

The responses scored here are simulated (hard-coded) for demonstration, so the scores do not reflect the models updated in Step 5. You can adapt the code to score real model outputs if available.
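
One possible way to plug in real outputs is to build the same `{method: [responses]}` mapping from callables that wrap each model. This is only a sketch: `collect_responses` and the `generators` mapping are hypothetical names, and the lambdas below are stand-ins for actual model generation calls.

```python
from typing import Callable, Dict, List

def collect_responses(
    prompts: List[str],
    generators: Dict[str, Callable[[str], str]],
) -> Dict[str, List[str]]:
    """Build the {method: [responses]} mapping that the evaluation step scores."""
    return {
        method: [generate(prompt) for prompt in prompts]
        for method, generate in generators.items()
    }

# Stand-in generators; in practice each would wrap a trained model's generate call
generators = {
    "base": lambda prompt: f"Short answer to: {prompt}",
    "dpo": lambda prompt: f"First, clarify the goal. Then answer: {prompt}",
}
responses = collect_responses(["How do you handle a security breach?"], generators)
print(responses)
```

The returned dictionary has the same shape as the simulated responses, so it could be scored by the same loop without further changes.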

---

In [12]:
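import numpy as np

class AlignmentComparisonManager:
    # Note: this redefines the class for the walkthrough; in a single script,
    # add these methods to the existing class instead.
    def _calculate_quality_score(self, response: str) -> float:
        """Calculate response quality score (simple heuristic)"""
        text = response.lower()
        score = 0.0
        # Length: reward answers of 20 to 150 whitespace-separated words
        word_count = len(response.split())
        if 20 <= word_count <= 150:
            score += 0.3
        # Structure: substring match on enumeration and sequence markers
        if any(indicator in text for indicator in ["1)", "2)", "first", "then", ":"]):
            score += 0.3
        # Specificity: substring match, so "examples" also counts
        if any(word in text for word in ["example", "specific", "such as"]):
            score += 0.2
        # Tone: no informal contractions
        if not any(word in text for word in ["gonna", "wanna", "kinda"]):
            score += 0.2
        return min(score, 1.0)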

    def _get_simulated_responses(self):
        """Get simulated responses as fallback (for demonstration)"""
        return {
            "base": [
                "Make a list and work through it based on deadlines.",
                "Machine learning is when computers learn from examples.",
                "Turn off affected systems and call security team.",
                "Look at code and check for bugs and standards.",
                "Talk to them about performance and set expectations."
            ],
            "ppo": [
                "Use systematic prioritization: assess true deadlines, evaluate impact, consider dependencies, use frameworks like Eisenhower Matrix, communicate with stakeholders about realistic timelines.",
                "Machine learning is like teaching a computer to recognize patterns! Show it thousands of examples, and it learns to make good guesses on new information.",
                "Execute structured incident response: immediate containment, activate response team, assess scope, preserve evidence, notify stakeholders, implement remediation.",
                "Comprehensive reviews focus on functionality, security, performance, maintainability, standards adherence, test coverage, and knowledge sharing opportunities.",
                "Address underperformance with structured approach: private discussion, understand causes, collaborate on improvement goals, provide support, regular check-ins."
            ],
            "dpo": [
                "Use ICE framework (Impact, Confidence, Ease) to rank tasks, challenge 'urgent' assumptions, communicate constraints transparently, focus on business value.",
                "Think of it like teaching pattern recognition to a computer! Just like you learned faces by seeing examples, computers learn from data patterns.",
                "Follow NIST framework: Prepare, Detect/Analyze, Contain/Eradicate/Recover, Post-incident review. Focus on business continuity and evidence preservation.",
                "Balance thoroughness with efficiency: focus on logic/security/maintainability, use automation for basics, provide specific actionable feedback.",
                "Start with curiosity: understand barriers through open questions, collaborate on improvement plan, provide resources, regular supportive check-ins."
            ],
            "grpo": [
                "Structured system: list tasks with effort estimates, score impact×urgency/effort, identify dependencies, communicate timelines, batch similar tasks, daily priority review.",
                "Machine learning is like a super-smart friend practicing games! Show 1000 animal pictures with answers, they get amazing at guessing new animals by recognizing patterns!",
                "Phased response: Phase 1-Containment, Phase 2-Investigation, Phase 3-Communication, Phase 4-Recovery, Phase 5-Post-mortem with lessons learned.",
                "Structured process: automated checks first, review architecture/logic/performance, verify error handling/tests/docs, provide specific feedback with examples.",
                "Systematic approach: data gathering, root cause analysis, SMART goal setting, resource allocation, regular monitoring, plan adjustment, achievement recognition."
            ]
        }

    def evaluate_methods(self):
        """Step 7: Evaluate and compare alignment methods"""
        print("\n📊 STEP 7: EVALUATING ALIGNMENT METHODS")
        print("Comparing method performance across domains...")

        evaluation_prompts = [
            "How do you prioritize tasks when everything seems urgent?",
            "Explain machine learning to a 5-year-old.",
            "How do you handle a security breach?",
            "What's your approach to code reviews?",
            "How do you motivate underperforming team members?"
        ]

        # For demonstration, use simulated responses
        method_scores = {}
        method_responses = self._get_simulated_responses()

        # Calculate quality scores for each method
        for method, responses in method_responses.items():
            scores = []
            for response in responses:
                score = self._calculate_quality_score(response)
                scores.append(score)
            method_scores[method] = {
                "average_score": np.mean(scores),
                "std_score": np.std(scores),
                "responses": responses
            }
            print(f"  📝 {method.upper()}: Average score {method_scores[method]['average_score']:.3f}")

        return {
            "method_scores": method_scores,
            "evaluation_prompts": evaluation_prompts
        }

# Example usage:
manager = AlignmentComparisonManager()
results = manager.evaluate_methods()
print("\nSummary of evaluation:")
for method, info in results["method_scores"].items():
    print(f"{method.upper()}: Avg Score = {info['average_score']:.3f}, Responses = {info['responses']}")


📊 STEP 7: EVALUATING ALIGNMENT METHODS
Comparing method performance across domains...
  📝 BASE: Average score 0.240
  📝 PPO: Average score 0.540
  📝 DPO: Average score 0.520
  📝 GRPO: Average score 0.540

Summary of evaluation:
BASE: Avg Score = 0.240, Responses = ['Make a list and work through it based on deadlines.', 'Machine learning is when computers learn from examples.', 'Turn off affected systems and call security team.', 'Look at code and check for bugs and standards.', 'Talk to them about performance and set expectations.']
PPO: Avg Score = 0.540, Responses = ['Use systematic prioritization: assess true deadlines, evaluate impact, consider dependencies, use frameworks like Eisenhower Matrix, communicate with stakeholders about realistic timelines.', 'Machine learning is like teaching a computer to recognize patterns! Show it thousands of examples, and it learns to make good guesses on new information.', 'Execute structured incident response: immediate containment, activate re
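
To see where the averages come from, it helps to break one score into its four criteria. The sketch below re-implements the heuristic as a standalone function (`score_breakdown` is an illustrative helper, not part of the class) and applies it to two of the simulated base responses.

```python
def score_breakdown(response: str) -> dict:
    """Per-criterion points, mirroring the weights in _calculate_quality_score."""
    text = response.lower()
    return {
        "length": 0.3 if 20 <= len(response.split()) <= 150 else 0.0,
        "structure": 0.3 if any(m in text for m in ["1)", "2)", "first", "then", ":"]) else 0.0,
        "specificity": 0.2 if any(w in text for w in ["example", "specific", "such as"]) else 0.0,
        "tone": 0.2 if not any(w in text for w in ["gonna", "wanna", "kinda"]) else 0.0,
    }

for response in [
    "Make a list and work through it based on deadlines.",
    "Machine learning is when computers learn from examples.",
]:
    parts = score_breakdown(response)
    print(parts, "total:", round(sum(parts.values()), 2))
```

The first response earns only the tone points (0.2). The second also matches "example" inside "examples", giving 0.4. Four base responses at 0.2 and one at 0.4 average to the 0.240 reported above. Because matching is by substring, the score rewards surface features such as a colon or a keyword rather than the quality of the answer.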

# Step 8: Generate a Comprehensive Alignment Report

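In this step, we'll generate a comprehensive report that summarizes the evaluation results, describes each method (Base, PPO, DPO, GRPO), and gives recommendations by use case. Only the best method and its score are computed from the evaluation results; the method characteristics, recommendations, and key insights are fixed text defined in the code.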

The report will include:
- A summary of the best method and its score
- Key characteristics of each method
- Recommendations for different use cases
- Key insights from the evaluation
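
The best method in the summary is picked by taking the maximum average score. With Python's `max`, ties go to the first key encountered, so when two methods share the top score the report names whichever comes first. A minimal sketch of this behavior, using the rounded averages printed in Step 7 and a hypothetical `top_methods` helper that returns every method within a small tolerance of the best:

```python
import math

# Rounded averages as printed in Step 7
scores = {"base": 0.24, "ppo": 0.54, "dpo": 0.52, "grpo": 0.54}

# max() keeps the first maximal key in iteration order
best = max(scores, key=scores.get)
print("max() picks:", best)

def top_methods(scores: dict, tol: float = 1e-9) -> list:
    """Return all methods whose score is within `tol` of the best score."""
    best_score = max(scores.values())
    return [m for m, s in scores.items() if math.isclose(s, best_score, abs_tol=tol)]

print("tied for best:", top_methods(scores))
```

In the Step 7 output, PPO and GRPO both print as 0.540, so a single `best_method` value can hide what is effectively a tie.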

---

In [13]:
import json

class AlignmentComparisonManager:
    # ... (include previous methods here, or assume this is added to your existing class)

    def generate_report(self, evaluation_results):
        """Step 8: Generate comprehensive report"""
        print("\n📋 STEP 8: GENERATING COMPREHENSIVE REPORT")
        print("Creating detailed analysis and recommendations...")

        method_characteristics = {
            "base": {
                "complexity": "Low",
                "training_time": "None", 
                "memory_usage": "Low",
                "strengths": ["Simple", "Fast", "No training"],
                "weaknesses": ["No alignment", "Inconsistent", "No safety"]
            },
            "ppo": {
                "complexity": "High",
                "training_time": "Long",
                "memory_usage": "Very High", 
                "strengths": ["Proven method", "Fine control", "Stable"],
                "weaknesses": ["4 models needed", "Complex", "High cost"]
            },
            "dpo": {
                "complexity": "Medium",
                "training_time": "Medium",
                "memory_usage": "Medium",
                "strengths": ["No reward model", "Simpler", "Good results"],
                "weaknesses": ["Limited to preferences", "Beta tuning"]
            },
            "grpo": {
                "complexity": "Medium-High", 
                "training_time": "Medium-Long",
                "memory_usage": "High",
                "strengths": ["Sample efficient", "Flexible", "Latest method"],
                "weaknesses": ["Multiple generations", "Newer method"]
            }
        }

        recommendations = {
            "beginners": "Start with DPO - best balance of simplicity and performance",
            "production": "DPO for most cases, PPO for safety-critical applications", 
            "research": "GRPO for latest innovations and flexible experimentation",
            "resources": "DPO offers best performance per computational cost"
        }

        if evaluation_results and "method_scores" in evaluation_results:
            best_method = max(
                evaluation_results["method_scores"].keys(), 
                key=lambda x: evaluation_results["method_scores"][x]["average_score"]
            )
            best_score = evaluation_results["method_scores"][best_method]["average_score"]
        else:
            # No evaluation results: fall back to placeholder values (not measured)
            best_method = "dpo"
            best_score = 0.85

        report = {
            "summary": {
                "best_method": best_method,
                "best_score": best_score,
                "methods_evaluated": ["base", "ppo", "dpo", "grpo"]
            },
            "method_characteristics": method_characteristics,
            "recommendations": recommendations,
            "key_insights": [
                "DPO provides best balance of performance and simplicity",
                "GRPO shows promise for complex reasoning tasks",
                "PPO remains essential for safety-critical applications",
                "Base models lack consistency for production use"
            ]
        }

        # Save the report to a JSON file
        with open("alignment_comparison_report.json", "w") as f:
            json.dump(report, f, indent=2)

        print("✅ Comprehensive report generated!")
        print(json.dumps(report, indent=2))
        return report

# Example usage:
# (Assume you have already run evaluation and have `results`)
manager = AlignmentComparisonManager()
report = manager.generate_report(results)


📋 STEP 8: GENERATING COMPREHENSIVE REPORT
Creating detailed analysis and recommendations...
✅ Comprehensive report generated!
{
  "summary": {
    "best_method": "ppo",
    "best_score": 0.54,
    "methods_evaluated": [
      "base",
      "ppo",
      "dpo",
      "grpo"
    ]
  },
  "method_characteristics": {
    "base": {
      "complexity": "Low",
      "training_time": "None",
      "memory_usage": "Low",
      "strengths": [
        "Simple",
        "Fast",
        "No training"
      ],
      "weaknesses": [
        "No alignment",
        "Inconsistent",
        "No safety"
      ]
    },
    "ppo": {
      "complexity": "High",
      "training_time": "Long",
      "memory_usage": "Very High",
      "strengths": [
        "Proven method",
        "Fine control",
        "Stable"
      ],
      "weaknesses": [
        "4 models needed",
        "Complex",
        "High cost"
      ]
    },
    "dpo": {
      "complexity": "Medium",
      "training_time": "Medium",
      "memo

# Step 9: Save All Results

In this final step, we'll save all the key results and artifacts from our alignment comparison workflow.  
This includes:
- The comprehensive report
- Evaluation results
- Any collected annotations (if available)
- A summary of the learning outcomes

Saving these results makes it easier to reproduce, share, or revisit your findings later.

---

In [15]:
import json

class AlignmentComparisonManager:
    def __init__(self):
        self.base_model_name = "gpt2"
        self.annotations = []

    def save_results(self, report, evaluation_results):
        """Step 9: Save all results"""
        print("\n💾 STEP 9: SAVING RESULTS")
        print("Saving all artifacts and analysis...")

        # Save individual files
        with open("alignment_comparison_report.json", "w") as f:
            json.dump(report, f, indent=2)

        with open("evaluation_results.json", "w") as f:
            json.dump(evaluation_results, f, indent=2)

        if self.annotations:
            with open("collected_annotations.json", "w") as f:
                json.dump(self.annotations, f, indent=2)

        summary = {
            "class": "Class 7 - Alignment Method Comparison",
            "methods": ["PPO", "DPO", "GRPO"],
            "base_model": self.base_model_name,
            "annotations_collected": len(self.annotations),
            "best_method": report["summary"]["best_method"],
            "best_score": report["summary"]["best_score"],
            "artifacts": [
                "alignment_comparison_report.json",
                "evaluation_results.json",
                "collected_annotations.json" if self.annotations else None
            ],
            "learning_outcomes": [
                "Understanding of PPO, DPO, and GRPO concepts",
                "Hands-on preference data collection",
                "Method comparison and evaluation",
                "Interactive demonstration tools",
                "Best practice recommendations"
            ]
        }

        with open("final_summary.json", "w") as f:
            json.dump(summary, f, indent=2)

        print("✅ All results saved!")
        print(json.dumps(summary, indent=2))
        return summary

# Example usage:
# (Assume you have already run previous steps and have `report` and `results`)
manager = AlignmentComparisonManager()
summary = manager.save_results(report, results)


💾 STEP 9: SAVING RESULTS
Saving all artifacts and analysis...
✅ All results saved!
{
  "class": "Class 7 - Alignment Method Comparison",
  "methods": [
    "PPO",
    "DPO",
    "GRPO"
  ],
  "base_model": "gpt2",
  "annotations_collected": 0,
  "best_method": "ppo",
  "best_score": 0.54,
  "artifacts": [
    "alignment_comparison_report.json",
    "evaluation_results.json",
    null
  ],
  "learning_outcomes": [
    "Understanding of PPO, DPO, and GRPO concepts",
    "Hands-on preference data collection",
    "Method comparison and evaluation",
    "Interactive demonstration tools",
    "Best practice recommendations"
  ]
}
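
Note the `null` entry in `artifacts`: the list is built with a conditional expression, so a file that was never written still leaves a placeholder. One option, sketched here under the assumption that only written files should be listed (the `list_artifacts` helper is hypothetical), is to build the list incrementally:

```python
import json

def list_artifacts(has_annotations: bool) -> list:
    """List only the files that were actually written."""
    artifacts = ["alignment_comparison_report.json", "evaluation_results.json"]
    if has_annotations:
        artifacts.append("collected_annotations.json")
    return artifacts

print(json.dumps({"artifacts": list_artifacts(has_annotations=False)}, indent=2))
```

This keeps the summary free of `null` entries, which is simpler for any downstream code that iterates over the artifact names.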
