# Transformer Models Interview Preparation

## Complete Guide for Applied Scientist Roles

This notebook covers comprehensive transformer knowledge for **Amazon Applied Scientist** and similar roles, including:

### 🏗️ **Core Architecture & Implementation:**
- **Attention Mechanism**: Self-attention, multi-head attention, scaled dot-product
- **Positional Encoding**: Sinusoidal, learned, relative positioning
- **Layer Normalization**: Pre-norm vs post-norm, RMSNorm alternatives
- **Feed-Forward Networks**: Position-wise, activation functions (GELU, SwiGLU)

### 🔧 **Fine-tuning & Optimization:**
- **Parameter-Efficient Fine-tuning**: LoRA, AdaLoRA, QLoRA, Prefix Tuning
- **Full Fine-tuning**: Task-specific adaptation, catastrophic forgetting
- **Instruction Tuning**: RLHF, constitutional AI, alignment techniques
- **Domain Adaptation**: Medical, legal, financial domain fine-tuning

### 🧠 **RAG (Retrieval-Augmented Generation):**
- **Vector Databases**: Embedding storage, similarity search
- **Retrieval Strategies**: Dense, sparse, hybrid retrieval
- **Chunk Optimization**: Size, overlap, hierarchical chunking
- **Re-ranking**: Cross-encoder re-ranking, relevance scoring

### 📝 **Prompt Engineering & Templates:**
- **Prompt Design**: Zero-shot, few-shot, chain-of-thought
- **Template Systems**: Dynamic templates, role-based prompting
- **Context Management**: Token optimization, context window usage
- **Evaluation**: Prompt effectiveness metrics, A/B testing

### 🚀 **Production Deployment:**
- **Model Serving**: Azure OpenAI, Hugging Face Inference
- **Optimization**: Quantization, pruning, knowledge distillation
- **Monitoring**: Performance metrics, drift detection
- **Scaling**: Load balancing, auto-scaling strategies

### ⚡ **Advanced Topics:**
- **Mixture of Experts (MoE)**: Sparse activation, routing
- **Multi-modal Transformers**: Vision-language models
- **Long Context**: Efficient attention, memory mechanisms
- **Safety & Alignment**: Bias detection, content filtering

### 🎯 **Interview Focus Areas:**
1. ✅ **Mathematical foundations** - Attention computation, complexity analysis
2. ✅ **Implementation details** - Custom layers, training loops
3. ✅ **Production challenges** - Latency, throughput, cost optimization
4. ✅ **Evaluation metrics** - BLEU, ROUGE, human evaluation
5. ✅ **Failure modes** - Hallucination, catastrophic forgetting
6. ✅ **Recent advances** - Latest papers, architectural innovations

## 1. Import Required Libraries

Essential libraries for transformer implementation and production deployment:

In [None]:
# Core libraries
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
import math
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Optional, Dict, List, Tuple, Any
import json
import pandas as pd
from dataclasses import dataclass
import warnings
warnings.filterwarnings('ignore')

# Transformer-specific libraries
try:
    from transformers import (
        AutoTokenizer, AutoModel, AutoModelForCausalLM,
        TrainingArguments, Trainer, DataCollatorForLanguageModeling,
        pipeline, BitsAndBytesConfig
    )
    from peft import LoraConfig, get_peft_model, TaskType
    print("✅ Transformers libraries imported successfully")
except ImportError:
    print("⚠️ Transformers libraries not available - will show implementation concepts")

# Vector search and RAG
try:
    import faiss
    from sentence_transformers import SentenceTransformer
    print("✅ Vector search libraries imported")
except ImportError:
    print("⚠️ Vector search libraries not available - will show concepts")

# Azure libraries for production
try:
    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential
    print("✅ Azure ML libraries imported")
except ImportError:
    print("⚠️ Azure libraries not available - will show deployment concepts")

# Set random seeds
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed(42)

print(f"\n🚀 Environment setup complete!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device: {'cuda' if torch.cuda.is_available() else 'cpu'}")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"\n🎯 Ready for transformer deep dive!")

## 2. Core Transformer Architecture: Mathematical Foundation

### 🧮 **Attention Mechanism - The Heart of Transformers**

**Scaled Dot-Product Attention:**
```
Attention(Q, K, V) = softmax(QK^T / √d_k)V
```

Where:
- **Q (Query)**: What each position is looking for
- **K (Key)**: What each position offers to be matched against
- **V (Value)**: The content returned, weighted by the query-key match
- **d_k**: Dimension of key vectors (for scaling)

**Multi-Head Attention:**
```
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
```

### 📊 **Complexity Analysis:**
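- **Self-Attention**: O(n²d) time, O(n²) memory for the attention matrix (per head)
- **Feed-Forward**: O(nd²) time, O(d²) parameters
- **Total per layer**: O(n²d + nd²) time, where n is the sequence length and d is the model dimension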
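
As a rough sanity check on these asymptotics, the sketch below counts approximate forward-pass FLOPs for one layer. It assumes one multiply-add costs 2 FLOPs, `d_ff = 4d` by default, batch size 1, and ignores softmax, normalization, and biases. `layer_flops` is a hypothetical helper introduced for illustration, not part of the notebook.

```python
from typing import Dict, Optional

def layer_flops(n: int, d: int, d_ff: Optional[int] = None) -> Dict[str, int]:
    """Approximate forward FLOPs for one transformer layer (batch size 1)."""
    d_ff = 4 * d if d_ff is None else d_ff
    attn_scores = 2 * n * n * d            # QK^T, summed over all heads
    attn_values = 2 * n * n * d            # softmax(QK^T) @ V, summed over all heads
    projections = 4 * (2 * n * d * d)      # W_Q, W_K, W_V, W_O
    feed_forward = 2 * (2 * n * d * d_ff)  # two linear layers
    return {
        "quadratic_in_n": attn_scores + attn_values,
        "linear_in_n": projections + feed_forward,
    }

d = 512
for n in (128, 1024, 6 * d, 16384):
    flops = layer_flops(n, d)
    ratio = flops["quadratic_in_n"] / flops["linear_in_n"]
    print(f"n={n:6d}  quadratic/linear = {ratio:.2f}")
```

Under these assumptions the n² term equals the linear term when 4n²d = 24nd², that is at n = 6d. For shorter sequences the projections and feed-forward layers dominate the cost, which is a useful point to raise when discussing long-context efficiency.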

### 🔧 **Key Design Decisions:**
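1. **Why scaling by √d_k?** For unit-variance q and k, the dot product q·k has variance d_k; dividing by √d_k keeps scores at unit scale so the softmax does not saturate and gradients do not vanish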
2. **Why multiple heads?** Different representation subspaces
3. **Why residual connections?** Gradient flow, easier training
4. **Why layer normalization?** Training stability
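
The scaling argument in point 1 can be checked numerically. The sketch below is a minimal illustration, assuming query and key entries are independent standard normal values; `dot_product_variance` is a hypothetical helper, and the sample count is kept small so it runs quickly in pure Python.

```python
import random

def dot_product_variance(d_k: int, n_samples: int = 2000, seed: int = 0):
    """Empirical variance of q.k for q, k with i.i.d. N(0, 1) entries."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / n_samples
    raw_var = sum((x - mean) ** 2 for x in dots) / n_samples
    # Dividing every score by sqrt(d_k) divides the variance by d_k.
    return raw_var, raw_var / d_k

for d_k in (16, 64, 128):
    raw, scaled = dot_product_variance(d_k)
    print(f"d_k={d_k:4d}  var(q.k)={raw:7.1f}  var(q.k / sqrt(d_k))={scaled:.2f}")
```

The unscaled variance should grow roughly in proportion to d_k, while the scaled variance stays near 1 regardless of head dimension.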

In [None]:
class MultiHeadAttention(nn.Module):
    """
    Multi-Head Attention implementation with all the production considerations.
    
    Key Interview Points:
    1. Scaled dot-product attention formula
    2. Multiple heads for different representation subspaces
    3. Attention weights interpretation
    4. Computational complexity: O(n²d)
    """
    
    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads  # Dimension per head
        
        # Linear projections for Q, K, V
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model)  # Output projection
        
        self.dropout = nn.Dropout(dropout)
        self.scale = math.sqrt(self.d_k)  # Scaling factor
        
    def forward(self, query, key, value, mask=None, return_attention=False):
        batch_size, seq_len = query.size(0), query.size(1)
        
        # 1. Linear projections and reshape for multi-head
        Q = self.w_q(query).view(batch_size, seq_len, self.n_heads, self.d_k).transpose(1, 2)
        K = self.w_k(key).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        V = self.w_v(value).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        # Shape: (batch_size, n_heads, seq_len, d_k)
        
        # 2. Scaled dot-product attention
        attention, attention_weights = self.scaled_dot_product_attention(
            Q, K, V, mask, self.scale, self.dropout
        )
        
        # 3. Concatenate heads and project
        attention = attention.transpose(1, 2).contiguous().view(
            batch_size, seq_len, self.d_model
        )
        output = self.w_o(attention)
        
        if return_attention:
            return output, attention_weights
        return output
    
    @staticmethod
    def scaled_dot_product_attention(Q, K, V, mask, scale, dropout):
        """
        Core attention computation: Attention(Q,K,V) = softmax(QK^T/√d_k)V
        
        INTERVIEW CRITICAL: Explain each step and why scaling is needed
        """
        # Compute attention scores: QK^T
        scores = torch.matmul(Q, K.transpose(-2, -1)) / scale
        
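        # Apply mask (for padding, future positions)
        if mask is not None:
            # Use the dtype's most negative value; a hard-coded -1e9 overflows in fp16
            scores = scores.masked_fill(mask == 0, torch.finfo(scores.dtype).min)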
        
        # Softmax to get attention weights
        attention_weights = F.softmax(scores, dim=-1)
        attention_weights = dropout(attention_weights)
        
        # Apply attention to values
        attention = torch.matmul(attention_weights, V)
        
        return attention, attention_weights


class PositionalEncoding(nn.Module):
    """
    Sinusoidal positional encoding for transformers.
    
    Formula: PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
             PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    
    Interview Questions:
    - Why not learnable positions?
    - How does this generalize to unseen lengths?
    - Alternative: Relative position encoding
    """
    
    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        
        # Create div_term for sinusoidal pattern
        div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                           -(math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)  # Even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # Odd dimensions
        
        self.register_buffer('pe', pe.unsqueeze(0))
    
    def forward(self, x):
        return x + self.pe[:, :x.size(1)]


class TransformerBlock(nn.Module):
    """
    Complete transformer block with attention + feed-forward.
    
    Architecture: LayerNorm -> MultiHeadAttention -> Residual -> LayerNorm -> FFN -> Residual
    
    Key Design Choices:
    1. Pre-norm vs Post-norm (Pre-norm is more stable)
    2. GELU activation (smoother than ReLU)
    3. Residual connections (gradient flow)
    """
    
    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        
        self.attention = MultiHeadAttention(d_model, n_heads, dropout)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),  # GELU is smoother than ReLU
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
        
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        # Pre-norm architecture (more stable training)
        # Self-attention block
        norm_x = self.norm1(x)
        attn_output = self.attention(norm_x, norm_x, norm_x, mask)
        x = x + self.dropout(attn_output)  # Residual connection
        
        # Feed-forward block
        norm_x = self.norm2(x)
        ff_output = self.feed_forward(norm_x)
        x = x + ff_output  # Residual connection
        
        return x


# Test the implementation
print("🏗️ TRANSFORMER ARCHITECTURE COMPONENTS")
print("=" * 50)

# Test parameters
batch_size, seq_len, d_model = 2, 10, 512
n_heads, d_ff = 8, 2048

# Create test input
x = torch.randn(batch_size, seq_len, d_model)

# Test each component
print(f"Input shape: {x.shape}")

# Positional encoding
pos_enc = PositionalEncoding(d_model)
x_with_pos = pos_enc(x)
print(f"After positional encoding: {x_with_pos.shape}")

# Multi-head attention
mha = MultiHeadAttention(d_model, n_heads)
attn_output, attn_weights = mha(x, x, x, return_attention=True)
print(f"Attention output: {attn_output.shape}")
print(f"Attention weights: {attn_weights.shape}")

# Full transformer block
transformer_block = TransformerBlock(d_model, n_heads, d_ff)
block_output = transformer_block(x)
print(f"Transformer block output: {block_output.shape}")

# Calculate model parameters
total_params = sum(p.numel() for p in transformer_block.parameters())
print(f"\n📊 Model Statistics:")
print(f"Total parameters: {total_params:,}")
print(f"Input activation memory per sample (fp32): ~{(seq_len * d_model * 4) / 1024:.1f} KB")
print(f"Attention complexity: O(n²) where n = {seq_len}")

print(f"\n✅ Core transformer components implemented successfully!")

## 3. Fine-tuning Strategies: From Full to Parameter-Efficient

### 🎯 **Fine-tuning Approaches Comparison:**

| Method | Parameters Updated | Memory | Training Speed | Performance (task-dependent) |
|--------|-------------------|---------|----------------|-------------|
| **Full Fine-tuning** | 100% | High | Slow | Best (reference) |
| **LoRA** | 0.1-1% | Low | Fast | Often close to full |
| **AdaLoRA** | 0.1-1% | Low | Fast | Can improve on LoRA at the same budget |
| **QLoRA** | 0.1-1% | Very Low | Slower per step than LoRA (dequantization) | Close to LoRA (4-bit base) |
| **Prefix Tuning** | ~0.1% | Very Low | Very Fast | Often somewhat below full |
### 🔧 **LoRA (Low-Rank Adaptation):**
- **Key Idea**: Decompose weight updates into low-rank matrices
- **Formula**: W = W₀ + ΔW = W₀ + BA (where B ∈ ℝᵈˣʳ, A ∈ ℝʳˣᵏ)
- **Rank r**: Typically 4-64 (much smaller than original dimensions)
- **Advantages**: Memory efficient, fast training, preserves original model
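
The parameter saving follows directly from the shapes in the formula: a dense update has d×k entries, while B and A together have r(d + k). A small arithmetic sketch, with a hypothetical 4096×4096 projection chosen only for illustration:

```python
def lora_param_counts(d: int, k: int, r: int) -> dict:
    """Compare a dense update (d x k) with its rank-r factorization."""
    full = d * k        # parameters in the dense update ΔW
    lora = r * (d + k)  # parameters in B (d x r) plus A (r x k)
    return {"full": full, "lora": lora, "reduction": full / lora}

for r in (4, 16, 64):
    counts = lora_param_counts(4096, 4096, r)
    print(f"r={r:2d}  dense={counts['full']:,}  lora={counts['lora']:,}  "
          f"reduction={counts['reduction']:.0f}x")
```

The reduction shrinks linearly as the rank grows, which is the capacity versus cost trade-off behind rank selection.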

### ⚡ **Production Considerations:**
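1. **Memory Requirements**: LoRA removes gradients and optimizer state for the frozen weights; the LoRA paper reports about 3x lower GPU memory on GPT-3 175B, and savings vary with model and optimizer
2. **Training Speed**: Usually faster than full fine-tuning, though the gain depends on model size and hardware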
3. **Inference**: Can merge adapters for zero overhead
4. **Multi-task**: Multiple adapters for different tasks
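
Point 3 rests on a simple identity: applying the adapter as a side path gives the same output as folding it into the weight. The NumPy sketch below checks this on small random matrices; the dimensions and the `alpha / r` scaling are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 6, 8, 2, 4.0

W0 = rng.standard_normal((d_out, d_in))    # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.02  # LoRA down-projection
B = rng.standard_normal((d_out, r))        # non-zero, as it would be after training
x = rng.standard_normal((3, d_in))         # batch of 3 inputs

scale = alpha / r
adapter_out = x @ W0.T + scale * (x @ A.T) @ B.T  # base path + adapter path
W_merged = W0 + scale * (B @ A)                   # fold the adapter into the weight
merged_out = x @ W_merged.T

print("Outputs match:", np.allclose(adapter_out, merged_out))
```

Because the merged weight has the same shape as the original, inference after merging runs a single matrix multiply per layer. Keeping adapters unmerged is what allows several task-specific adapters to share one base model.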

In [None]:
class LoRALinear(nn.Module):
    """
    LoRA (Low-Rank Adaptation) implementation for efficient fine-tuning.
    
    Key Innovation: Instead of updating full weight matrix W,
    we learn low-rank decomposition: ΔW = B @ A
    where B ∈ R^(d×r), A ∈ R^(r×k), and r << min(d,k)
    
    Interview Points:
    1. Why low-rank? Most updates lie in low-dimensional subspace
    2. Parameter reduction: d×k → (d+k)×r
    3. Initialization: A ~ N(0, σ), B = 0 (preserves original model)
    4. Scaling factor α controls adaptation strength
    """
    
    def __init__(self, original_layer: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        
        self.original_layer = original_layer
        self.rank = rank
        self.alpha = alpha
        self.merged = False  # True once the adapter has been folded into the base weights
        
        # Get dimensions
        in_features = original_layer.in_features
        out_features = original_layer.out_features
        
        # LoRA matrices: W + (α / r)(B @ A)
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.02)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        
        # Freeze original weights
        for param in self.original_layer.parameters():
            param.requires_grad = False
        
        # Calculate parameter reduction
        original_params = in_features * out_features
        lora_params = rank * (in_features + out_features)
        self.param_reduction = original_params / lora_params
        
    def forward(self, x):
        # Original computation
        original_output = self.original_layer(x)
        
        # After merging, the adapter already lives in the base weights
        if self.merged:
            return original_output
        
        # LoRA adaptation: x @ A^T @ B^T
        lora_output = (x @ self.lora_A.T) @ self.lora_B.T
        
        # Combine with scaling
        return original_output + (self.alpha / self.rank) * lora_output
    
    def merge_weights(self):
        """Merge LoRA weights into original layer for inference efficiency."""
        if self.merged:
            return  # Avoid adding ΔW twice
        with torch.no_grad():
            # Compute ΔW = (α / rank) * (B @ A)
            delta_w = (self.alpha / self.rank) * (self.lora_B @ self.lora_A)
            # Update original weights
            self.original_layer.weight.data += delta_w
        self.merged = True
            
    def get_stats(self):
        """Get LoRA statistics for analysis."""
        return {
            'rank': self.rank,
            'alpha': self.alpha,
            'param_reduction': f"{self.param_reduction:.1f}x",
            'trainable_params': self.lora_A.numel() + self.lora_B.numel(),
            'original_params': self.original_layer.weight.numel()
        }


class FineTuningStrategies:
    """
    Collection of fine-tuning strategies for different scenarios.
    
    Interview Topics:
    1. When to use each strategy?
    2. Trade-offs: Performance vs Efficiency
    3. Catastrophic forgetting prevention
    4. Domain adaptation techniques
    """
    
    @staticmethod
    def apply_lora_to_model(model, rank=4, alpha=1.0, target_modules=None):
        """
        Apply LoRA to specific modules in a model.
        
        Common target modules:
        - Attention: q_proj, k_proj, v_proj, o_proj
        - Feed-forward: gate_proj, up_proj, down_proj
        """
        if target_modules is None:
            target_modules = ['q_proj', 'k_proj', 'v_proj', 'o_proj']
        
        lora_modules = {}
        trainable_params = 0
        total_params = 0
        
        # Snapshot the module list first: modules are replaced inside the loop
        for name, module in list(model.named_modules()):
            if isinstance(module, nn.Linear) and any(target in name for target in target_modules):
                lora_module = LoRALinear(module, rank, alpha)
                lora_modules[name] = lora_module
                
                # Replace in model
                parent = model
                components = name.split('.')
                for component in components[:-1]:
                    parent = getattr(parent, component)
                setattr(parent, components[-1], lora_module)
                
                trainable_params += lora_module.lora_A.numel() + lora_module.lora_B.numel()
            
            if isinstance(module, nn.Linear):
                total_params += module.weight.numel()
        
        # Guard against models with no nn.Linear layers
        efficiency = (trainable_params / total_params) * 100 if total_params else 0.0
        
        return {
            'lora_modules': lora_modules,
            'trainable_params': trainable_params,
            'total_params': total_params,
            'efficiency': f"{efficiency:.2f}%"
        }
    
    @staticmethod
    def get_training_config(strategy='lora', task_type='causal_lm'):
        """
        Get recommended training configurations for different strategies.
        
        Production considerations:
        1. Learning rates: LoRA needs higher LR than full fine-tuning
        2. Batch sizes: Can be larger due to memory efficiency
        3. Warmup: Important for stability
        4. Weight decay: Lower for parameter-efficient methods
        """
        configs = {
            'full_finetuning': {
                'learning_rate': 5e-5,
                'batch_size': 8,
                'warmup_steps': 100,
                'weight_decay': 0.01,
                'gradient_accumulation': 4,
                'memory_usage': 'High'
            },
            'lora': {
                'learning_rate': 1e-4,
                'batch_size': 16,
                'warmup_steps': 50,
                'weight_decay': 0.001,
                'gradient_accumulation': 2,
                'memory_usage': 'Low',
                'rank': 4,
                'alpha': 8
            },
            'qlora': {
                'learning_rate': 2e-4,
                'batch_size': 32,
                'warmup_steps': 50,
                'weight_decay': 0.001,
                'gradient_accumulation': 1,
                'memory_usage': 'Very Low',
                'quantization': '4-bit',
                'rank': 8,
                'alpha': 16
            }
        }
        
        return configs.get(strategy, configs['lora'])


# Demonstrate LoRA implementation
print("🔧 FINE-TUNING STRATEGIES DEMONSTRATION")
print("=" * 50)

# Create a sample transformer layer
sample_transformer = TransformerBlock(d_model=512, n_heads=8, d_ff=2048)

# Get original parameter count
original_params = sum(p.numel() for p in sample_transformer.parameters() if p.requires_grad)
print(f"Original model parameters: {original_params:,}")

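# Apply LoRA. The MultiHeadAttention defined above names its projections
# w_q/w_k/w_v/w_o, so the default q_proj-style targets would match nothing.
lora_stats = FineTuningStrategies.apply_lora_to_model(
    sample_transformer, rank=4, alpha=8,
    target_modules=['w_q', 'w_k', 'w_v', 'w_o']
)

print(f"\n📊 LoRA Statistics:")
print(f"Trainable parameters: {lora_stats['trainable_params']:,}")
print(f"Parameter efficiency: {lora_stats['efficiency']}")
print(f"Trainable-parameter reduction: ~{original_params // max(lora_stats['trainable_params'], 1)}x")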

# Test different configurations
print(f"\n⚙️ Training Configurations:")
for strategy in ['full_finetuning', 'lora', 'qlora']:
    config = FineTuningStrategies.get_training_config(strategy)
    print(f"\n{strategy.upper()}:")
    for key, value in config.items():
        print(f"  {key}: {value}")

# Test LoRA module
print(f"\n🧪 LoRA Module Test:")
original_linear = nn.Linear(512, 512)
lora_linear = LoRALinear(original_linear, rank=4, alpha=8)

# Test forward pass
test_input = torch.randn(1, 10, 512)
output = lora_linear(test_input)
print(f"Input shape: {test_input.shape}")
print(f"Output shape: {output.shape}")

# Get statistics
stats = lora_linear.get_stats()
print(f"LoRA stats: {stats}")

print(f"\n✅ Fine-tuning strategies implemented successfully!")
print(f"\n💡 Key Interview Points:")
print(f"• LoRA trains {stats['param_reduction']} fewer parameters for this layer; quality is often close to full fine-tuning but task-dependent")
print(f"• QLoRA combines 4-bit quantization with LoRA for extreme efficiency")
print(f"• Rank selection: Higher rank = more capacity but more parameters")
print(f"• Alpha controls adaptation strength (a common heuristic is 2x rank)")

## 4. RAG (Retrieval-Augmented Generation) Implementation

### 🔍 **RAG Architecture Overview:**

```
Query → Embedding → Vector Search → Retrieve Docs → Re-rank → Context + Query → LLM → Response
```

### 📊 **Key Components:**

1. **Vector Database**: FAISS, Pinecone, Weaviate, Azure AI Search (formerly Azure Cognitive Search)
2. **Embedding Models**: sentence-transformers, OpenAI embeddings, E5
3. **Chunking Strategies**: Fixed size, semantic, recursive
4. **Retrieval Methods**: Dense, sparse (BM25), hybrid
5. **Re-ranking**: Cross-encoder models, relevance scoring
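
Hybrid retrieval (item 4) needs a way to combine dense and sparse result lists whose scores are not comparable. One common choice is Reciprocal Rank Fusion, sketched below. The document IDs are made up, and `k = 60` is the conventional constant rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists by summing 1 / (k + rank) for each document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

dense_hits = ["d3", "d1", "d7"]   # hypothetical embedding-search results
sparse_hits = ["d1", "d9", "d3"]  # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Documents that appear in both lists rise to the top, because fusion uses only ranks and never compares raw cosine and BM25 scores directly.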

### 🎯 **Production Challenges:**
- **Latency**: Sub-100ms retrieval for real-time applications
- **Relevance**: Balancing recall vs precision
- **Context Length**: Fitting retrieved docs within token limits
- **Cost**: Embedding computation and storage costs
- **Freshness**: Updating vector indices with new documents
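
The context-length constraint is often handled by packing the best-scoring chunks into a fixed token budget. The sketch below is one simple greedy approach. It approximates tokens by whitespace-separated words, which is an assumption; a real system would count with the generator model's tokenizer. `pack_context` and the sample chunks are hypothetical.

```python
from typing import List, Tuple

def pack_context(chunks: List[Tuple[str, float]], token_budget: int) -> List[str]:
    """Greedily keep the highest-scoring chunks that still fit in the budget."""
    selected, used = [], 0
    for text, _score in sorted(chunks, key=lambda c: c[1], reverse=True):
        cost = len(text.split())  # crude proxy: words, not real tokens
        if used + cost <= token_budget:
            selected.append(text)
            used += cost
    return selected

chunks = [
    ("LoRA freezes the base weights and trains low-rank adapters.", 0.91),
    ("Unrelated paragraph about office parking rules and permits.", 0.12),
    ("QLoRA quantizes the frozen base model to 4 bits.", 0.84),
]
context = pack_context(chunks, token_budget=20)
print("\n".join(context))
```

Greedy packing by score is not optimal in the knapsack sense, but it is fast and predictable, which matters more when retrieval has a tight latency target.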

### 💡 **Interview Focus:**
1. **Chunking strategies** - How to split documents optimally?
2. **Embedding quality** - Which models for which domains?
3. **Retrieval evaluation** - How to measure retrieval quality?
4. **Context optimization** - How to maximize relevant context?
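
For the retrieval evaluation question, two standard metrics are recall@k and reciprocal rank (averaged over queries to give MRR). A minimal sketch, assuming each query comes with a labeled set of relevant document IDs; the IDs below are made up.

```python
from typing import List, Set

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d4", "d2", "d9", "d1"]  # hypothetical ranked results for one query
relevant = {"d2", "d1"}               # hypothetical ground-truth labels

print(f"recall@3 = {recall_at_k(retrieved, relevant, 3):.2f}")
print(f"reciprocal rank = {reciprocal_rank(retrieved, relevant):.2f}")
```

Recall@k measures whether the generator gets a chance to see the right evidence at all, while reciprocal rank rewards putting it near the top, where it is less likely to be truncated from the context.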

In [None]:
import re
from typing import Any, List, Dict, Tuple, Optional
from dataclasses import dataclass
import pickle
import hashlib

@dataclass
class Document:
    """Document representation for RAG systems."""
    content: str
    metadata: Optional[Dict[str, Any]] = None
    doc_id: Optional[str] = None
    
    def __post_init__(self):
        if self.doc_id is None:
            self.doc_id = hashlib.md5(self.content.encode()).hexdigest()[:8]
        if self.metadata is None:
            self.metadata = {}

@dataclass
class RetrievalResult:
    """Result from retrieval system."""
    document: Document
    score: float
    embedding: Optional[np.ndarray] = None


class DocumentChunker:
    """
    Advanced document chunking strategies for optimal RAG performance.
    
    Interview Points:
    1. Fixed vs semantic chunking trade-offs
    2. Overlap importance for context preservation
    3. Chunk size optimization (token limits vs coherence)
    4. Domain-specific chunking (code, legal, medical)
    """
    
    @staticmethod
    def fixed_size_chunking(text: str, chunk_size: int = 512, overlap: int = 50) -> List[str]:
        """
        Fixed-size chunking with overlap.
        
        Pros: Predictable, simple, fast
        Cons: May split sentences/paragraphs awkwardly
        """
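        if not 0 <= overlap < chunk_size:
            raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
        
        words = text.split()
        chunks = []
        step = chunk_size - overlap
        
        for i in range(0, len(words), step):
            chunks.append(' '.join(words[i:i + chunk_size]))
            # Stop once a chunk reaches the end, so no tail chunk merely repeats the overlap
            if i + chunk_size >= len(words):
                break
                
        return chunks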
    
    @staticmethod
    def semantic_chunking(text: str, max_chunk_size: int = 512) -> List[str]:
        """
        Semantic chunking based on paragraphs and sentences.
        
        Pros: Preserves semantic coherence
        Cons: Variable chunk sizes, more complex
        """
        # Split by paragraphs first
        paragraphs = re.split(r'\n\s*\n', text)
        chunks = []
        current_chunk = ""
        
        for paragraph in paragraphs:
            # Split long paragraphs by sentences
            sentences = re.split(r'(?<=[.!?])\s+', paragraph)
            
            for sentence in sentences:
                if len((current_chunk + " " + sentence).split()) <= max_chunk_size:
                    current_chunk = (current_chunk + " " + sentence).strip()
                else:
                    if current_chunk:
                        chunks.append(current_chunk)
                    current_chunk = sentence
            
            # At a paragraph boundary, flush the chunk once it is at least 80% full;
            # otherwise keep accumulating into the next paragraph
            if current_chunk and len(current_chunk.split()) >= max_chunk_size * 0.8:
                chunks.append(current_chunk)
                current_chunk = ""
        
        if current_chunk:
            chunks.append(current_chunk)
            
        return [chunk for chunk in chunks if chunk.strip()]
    
    @staticmethod
    def recursive_chunking(text: str, chunk_size: int = 512, separators: List[str] = None) -> List[str]:
        """
        Recursive chunking with hierarchical separators.
        
        Used in LangChain - tries larger separators first, then smaller ones.
        """
        if separators is None:
            separators = ["\n\n", "\n", ". ", " ", ""]
        
        def _recursive_split(text: str, separators: List[str]) -> List[str]:
            if not separators or len(text.split()) <= chunk_size:
                return [text] if text.strip() else []
            
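            separator = separators[0]
            if separator == "":
                # str.split("") raises ValueError, so fall back to fixed-size word windows
                words = text.split()
                return [' '.join(words[i:i + chunk_size])
                        for i in range(0, len(words), chunk_size)]
            splits = text.split(separator)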
            
            if len(splits) == 1:
                # Separator not found, try next one
                return _recursive_split(text, separators[1:])
            
            chunks = []
            current_chunk = ""
            
            for split in splits:
                if len((current_chunk + separator + split).split()) <= chunk_size:
                    current_chunk = (current_chunk + separator + split) if current_chunk else split
                else:
                    if current_chunk:
                        chunks.append(current_chunk)
                    
                    # If single split is too large, recursively split it
                    if len(split.split()) > chunk_size:
                        chunks.extend(_recursive_split(split, separators[1:]))
                        current_chunk = ""
                    else:
                        current_chunk = split
            
            if current_chunk:
                chunks.append(current_chunk)
                
            return chunks
        
        return _recursive_split(text, separators)


class VectorSearchEngine:
    """
    Vector search engine using FAISS for similarity search.
    
    Production Features:
    1. Efficient similarity search with FAISS
    2. Multiple index types (Flat, IVF, HNSW)
    3. Batch processing for large document sets
    4. Persistent storage and loading
    
    Interview Topics:
    1. Vector similarity metrics (cosine, dot product, L2)
    2. Index types and trade-offs
    3. Approximate vs exact search
    4. Scaling to millions of documents
    """
    
    def __init__(self, embedding_dim: int = 384, index_type: str = "flat"):
        self.embedding_dim = embedding_dim
        self.index_type = index_type
        self.index = None
        self.documents = []
        self.embeddings = []
        
        # Initialize embedding model (mock for now)
        self.embedding_model = self._create_mock_embedding_model()
        
    def _create_mock_embedding_model(self):
        """Create mock embedding model for demonstration."""
        dim = self.embedding_dim
        
        class MockEmbedder:
            def encode(self, texts, batch_size=32):
                # Mock only: a pseudo-random vector seeded by a hash of the text,
                # so the same text always maps to the same "embedding"
                embeddings = []
                for text in texts:
                    words = text.lower().split()
                    seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
                    embedding = np.random.default_rng(seed).standard_normal(dim)
                    # Add some text-specific features
                    embedding[0] = len(words) / 100  # Length feature
                    embedding[1] = len(set(words)) / len(words) if words else 0  # Diversity
                    embeddings.append(embedding)
                # FAISS expects float32 input
                return np.array(embeddings, dtype=np.float32)
        
        return MockEmbedder()
    
    def _create_index(self, embeddings: np.ndarray):
        """Create FAISS index based on type."""
        try:
            import faiss
            
            if self.index_type == "flat":
                # Exact search, good for small datasets
                index = faiss.IndexFlatIP(self.embedding_dim)  # Inner product
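            elif self.index_type == "ivf":
                # Approximate search, good for large datasets
                quantizer = faiss.IndexFlatIP(self.embedding_dim)
                # nlist cannot exceed the number of training vectors
                nlist = max(1, min(100, len(embeddings)))
                # IndexIVFFlat defaults to L2; match the inner-product quantizer
                index = faiss.IndexIVFFlat(
                    quantizer, self.embedding_dim, nlist, faiss.METRIC_INNER_PRODUCT
                )
                index.train(embeddings)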
            else:
                # Default to flat
                index = faiss.IndexFlatIP(self.embedding_dim)
                
            return index
        except ImportError:
            print("FAISS not available, using numpy-based search")
            return None
    
    def add_documents(self, documents: List[Document], batch_size: int = 32):
        """Add documents to the search index."""
        print(f"🔄 Adding {len(documents)} documents to vector index...")
        
        # Extract text content
        texts = [doc.content for doc in documents]
        
        # Generate embeddings in batches
        all_embeddings = []
        for i in range(0, len(texts), batch_size):
            batch_texts = texts[i:i + batch_size]
            batch_embeddings = self.embedding_model.encode(batch_texts)
            all_embeddings.append(batch_embeddings)
        
        embeddings = np.vstack(all_embeddings)
        
        # Create or update index
        if self.index is None:
            self.index = self._create_index(embeddings)
        
        # Add to index
        if self.index is not None:
            self.index.add(embeddings)
        
        # Store documents and embeddings
        self.documents.extend(documents)
        self.embeddings.append(embeddings)
        
        print(f"✅ Added {len(documents)} documents. Total: {len(self.documents)}")
    
    def search(self, query: str, top_k: int = 5) -> List[RetrievalResult]:
        """Search for similar documents."""
        if not self.documents:
            return []
        
        # Encode query
        query_embedding = self.embedding_model.encode([query])[0]
        
        if self.index is not None:
            # Use FAISS search
            scores, indices = self.index.search(
                query_embedding.reshape(1, -1), top_k
            )
            
            results = []
            for score, idx in zip(scores[0], indices[0]):
                # FAISS pads missing results with index -1
                if 0 <= idx < len(self.documents):
                    results.append(RetrievalResult(
                        document=self.documents[idx],
                        score=float(score)
                    ))
        else:
            # Fallback to numpy search
            all_embeddings = np.vstack(self.embeddings)
            similarities = np.dot(all_embeddings, query_embedding)
            top_indices = np.argsort(similarities)[-top_k:][::-1]
            
            results = []
            for idx in top_indices:
                results.append(RetrievalResult(
                    document=self.documents[idx],
                    score=float(similarities[idx])
                ))
        
        return results
    
    def save_index(self, filepath: str):
        """Save index and documents to disk."""
        if self.index is not None:
            try:
                import faiss
                faiss.write_index(self.index, f"{filepath}.index")
            except ImportError:
                pass
        
        # Save documents and metadata
        with open(f"{filepath}.pkl", 'wb') as f:
            pickle.dump({
                'documents': self.documents,
                'embeddings': self.embeddings,
                'embedding_dim': self.embedding_dim,
                'index_type': self.index_type
            }, f)
    
    def load_index(self, filepath: str):
        """Load index and documents from disk."""
        try:
            import faiss
            self.index = faiss.read_index(f"{filepath}.index")
        except (ImportError, FileNotFoundError):
            pass
        
        # Load documents and metadata
        with open(f"{filepath}.pkl", 'rb') as f:
            data = pickle.load(f)
            self.documents = data['documents']
            self.embeddings = data['embeddings']
            self.embedding_dim = data['embedding_dim']
            self.index_type = data['index_type']


class RAGSystem:
    """
    Complete RAG (Retrieval-Augmented Generation) system.
    
    Combines retrieval, re-ranking, and generation for enhanced LLM responses.
    
    Production Considerations:
    1. Latency optimization (sub-100ms retrieval)
    2. Context window management
    3. Relevance scoring and filtering
    4. Cost optimization (embedding + generation costs)
    """
    
    def __init__(self, vector_engine: VectorSearchEngine, context_limit: int = 4000):
        self.vector_engine = vector_engine
        self.context_limit = context_limit
        
    def retrieve_and_rerank(self, query: str, top_k: int = 10, final_k: int = 3) -> List[RetrievalResult]:
        """
        Two-stage retrieval: fast retrieval + careful re-ranking.
        
        Stage 1: Fast vector search for candidates
        Stage 2: Cross-encoder re-ranking for relevance
        """
        # Stage 1: Vector retrieval
        candidates = self.vector_engine.search(query, top_k)
        
        # Stage 2: Re-ranking (simplified scoring)
        # In production: use cross-encoder models like ms-marco-MiniLM
        for result in candidates:
            # Simple relevance scoring based on query overlap
            query_words = set(query.lower().split())
            doc_words = set(result.document.content.lower().split())
            overlap = len(query_words.intersection(doc_words))
            result.score = overlap / len(query_words) if query_words else 0
        
        # Sort by re-ranking score and return top results
        candidates.sort(key=lambda x: x.score, reverse=True)
        return candidates[:final_k]
    
    def build_context(self, query: str, retrieved_docs: List[RetrievalResult]) -> str:
        """
        Build context from retrieved documents with token management.
        
        Strategies:
        1. Prioritize by relevance score
        2. Truncate documents if needed
        3. Add source citations
        """
        # Note: context_limit is enforced in characters here, a rough proxy for tokens.
        context_parts = [f"Query: {query}\n\nRelevant Information:\n"]
        current_length = len(context_parts[0])
        
        for i, result in enumerate(retrieved_docs):
            doc_text = result.document.content
            source_info = f"\n[Source {i+1}]: {doc_text}\n"
            
            if current_length + len(source_info) > self.context_limit:
                # Truncate document to fit
                remaining_space = self.context_limit - current_length - 20
                if remaining_space > 100:  # Only add if meaningful space left
                    truncated_text = doc_text[:remaining_space] + "..."
                    source_info = f"\n[Source {i+1}]: {truncated_text}\n"
                    context_parts.append(source_info)
                break
            
            context_parts.append(source_info)
            current_length += len(source_info)
        
        return "".join(context_parts)
    
    def generate_response(self, query: str, context: str) -> str:
        """
        Generate response using retrieved context.
        
        In production: Use OpenAI API, Azure OpenAI, or local LLM
        """
        # Mock response generation
        response = "Based on the retrieved information:\n\n"
        response += f"Answer: This is a mock response to '{query}' using the provided context. "
        response += "In production, this would be generated by an LLM like GPT-4 or Claude.\n\n"
        response += f"Context used: {len(context)} characters from retrieved documents."
        
        return response
    
    def query(self, question: str, top_k: int = 5) -> Dict[str, any]:
        """
        Complete RAG pipeline: retrieve, rerank, generate.
        """
        # Step 1: Retrieve relevant documents
        retrieved_docs = self.retrieve_and_rerank(question, top_k)
        
        # Step 2: Build context
        context = self.build_context(question, retrieved_docs)
        
        # Step 3: Generate response
        response = self.generate_response(question, context)
        
        return {
            'question': question,
            'response': response,
            'retrieved_documents': retrieved_docs,
            'context': context,
            'metadata': {
                'num_retrieved': len(retrieved_docs),
                'context_length': len(context),
                'avg_relevance_score': np.mean([r.score for r in retrieved_docs]) if retrieved_docs else 0
            }
        }


# Demonstrate RAG system
print("🔍 RAG SYSTEM DEMONSTRATION")
print("=" * 50)

# Create sample documents
sample_documents = [
    Document("Transformers use self-attention mechanisms to process sequences in parallel, making them much faster than RNNs for training."),
    Document("LoRA (Low-Rank Adaptation) allows efficient fine-tuning by learning low-rank matrices that approximate weight updates."),
    Document("BERT uses bidirectional attention while GPT uses causal (unidirectional) attention for different tasks."),
    Document("Vector databases store high-dimensional embeddings and enable fast similarity search for RAG applications."),
    Document("Fine-tuning strategies include full fine-tuning, LoRA, QLoRA, and prefix tuning, each with different trade-offs."),
    Document("Attention complexity is O(n²) where n is sequence length, which becomes problematic for very long sequences."),
    Document("RAG combines retrieval and generation to provide factual, up-to-date responses by grounding LLMs in external knowledge."),
]

# Test chunking strategies
print("\n📋 CHUNKING STRATEGIES COMPARISON:")
sample_text = " ".join([doc.content for doc in sample_documents])

chunker = DocumentChunker()
fixed_chunks = chunker.fixed_size_chunking(sample_text, chunk_size=50, overlap=10)
semantic_chunks = chunker.semantic_chunking(sample_text, max_chunk_size=50)
recursive_chunks = chunker.recursive_chunking(sample_text, chunk_size=50)

print(f"Original text length: {len(sample_text.split())} words")
print(f"Fixed chunking: {len(fixed_chunks)} chunks")
print(f"Semantic chunking: {len(semantic_chunks)} chunks")
print(f"Recursive chunking: {len(recursive_chunks)} chunks")

# Create and populate vector search engine
print("\n🔧 VECTOR SEARCH ENGINE:")
vector_engine = VectorSearchEngine(embedding_dim=384)
vector_engine.add_documents(sample_documents)

# Test search
test_query = "What is LoRA and how does it work?"
search_results = vector_engine.search(test_query, top_k=3)

print(f"\nQuery: '{test_query}'")
print("Search Results:")
for i, result in enumerate(search_results):
    print(f"{i+1}. Score: {result.score:.3f}")
    print(f"   Content: {result.document.content[:100]}...")

# Create and test RAG system
print("\n🧠 COMPLETE RAG SYSTEM:")
rag_system = RAGSystem(vector_engine)

rag_result = rag_system.query("Explain attention mechanisms in transformers")
print(f"\nRAG Query: {rag_result['question']}")
print(f"Response: {rag_result['response']}")
print("\nMetadata:")
for key, value in rag_result['metadata'].items():
    print(f"  {key}: {value}")

print("\n✅ RAG system implemented successfully!")
print("\n💡 Production Considerations:")
print("• Embedding model choice: sentence-transformers vs OpenAI vs domain-specific")
print("• Vector database scaling: FAISS vs Pinecone vs Weaviate")
print("• Chunking optimization: Balance coherence vs retrieval accuracy")
print("• Re-ranking models: Cross-encoders for better relevance")
print("• Context management: Token limits vs information completeness")
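
The "token limits vs information completeness" trade-off above can be illustrated with a small greedy packer. This is only a sketch under stated assumptions: `approx_token_count` and `pack_context` are hypothetical helpers (not part of `RAGSystem`), and the whitespace word count is a crude stand-in for the target model's real tokenizer.

```python
from typing import List, Tuple


def approx_token_count(text: str) -> int:
    """Rough token estimate: number of whitespace-separated words.

    Assumption: a real system would count with the target model's tokenizer.
    """
    return len(text.split())


def pack_context(scored_docs: List[Tuple[float, str]], token_budget: int) -> List[str]:
    """Greedily keep the highest-scoring documents that fit within the budget."""
    selected: List[str] = []
    used = 0
    for _score, text in sorted(scored_docs, key=lambda pair: pair[0], reverse=True):
        cost = approx_token_count(text)
        if used + cost > token_budget:
            continue  # too large for the remaining budget; a shorter doc may still fit
        selected.append(text)
        used += cost
    return selected


# Illustrative (score, text) pairs; the scores are made up for the example.
docs = [
    (0.9, "LoRA learns low-rank matrices that approximate weight updates."),
    (0.4, "Vector databases store embeddings for similarity search."),
    (0.7, "QLoRA combines 4-bit quantization with LoRA adapters."),
]
packed = pack_context(docs, token_budget=16)
print(packed)
```

Skipping an oversized document instead of truncating it keeps every included source intact, at the cost of sometimes leaving budget unused; truncation (as in `build_context`) makes the opposite choice.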

## 5. Prompt Engineering & Template Systems

### 📝 **Prompt Engineering Fundamentals:**

**Prompt Types:**
1. **Zero-shot**: No examples, just instruction
2. **Few-shot**: Include examples in the prompt
3. **Chain-of-Thought (CoT)**: Step-by-step reasoning
4. **Self-Consistency**: Multiple reasoning paths
5. **Tree of Thoughts**: Explore multiple solution branches
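
Self-consistency can be sketched as a majority vote over the final answers of several independently sampled reasoning paths. The snippet below is a minimal illustration: `self_consistency_vote` is a hypothetical helper, and the hard-coded answers stand in for parsed LLM completions.

```python
from collections import Counter
from typing import List, Tuple


def self_consistency_vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most common answer and the fraction of paths that agree with it.

    Hypothetical helper: in practice each answer would be parsed from a
    separately sampled chain-of-thought completion (temperature > 0).
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Normalize lightly so "42" and " 42 " count as the same answer.
    normalized = [a.strip().lower() for a in answers]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)


# Stand-in for five sampled reasoning paths answering the same question.
sampled_answers = ["42", "42", "41", " 42 ", "40"]
answer, agreement = self_consistency_vote(sampled_answers)
print(answer, agreement)
```

The agreement ratio doubles as a cheap confidence signal: low agreement suggests the question is ambiguous or the model is unsure.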

### 🎯 **Prompt Design Principles:**

| Principle | Description | Example |
|-----------|-------------|---------|
| **Clarity** | Clear, specific instructions | "Summarize in 3 bullet points" vs "Summarize" |
| **Context** | Provide relevant background | Include domain knowledge, constraints |
| **Examples** | Show desired output format | Few-shot demonstrations |
| **Structure** | Organize with headers, sections | Use markdown, numbered steps |
| **Constraints** | Specify limits, requirements | Word count, format, style |

### 🔧 **Advanced Techniques:**
- **Role Playing**: "You are an expert data scientist..."
- **Metacognitive Prompting**: "Think step by step", "Explain your reasoning"
- **Self-Correction**: "Review your answer and fix any errors"
- **Multi-Turn Conversations**: Context-aware dialogue systems

### 🚀 **Production Considerations:**
- **Token Efficiency**: Minimize prompt length while maintaining quality
- **Template Versioning**: A/B testing different prompt versions
- **Dynamic Context**: Adaptive prompts based on user/task
- **Safety Filtering**: Prompt injection prevention, content moderation
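
For template versioning, one common approach to A/B testing is to assign each user to a prompt version deterministically, so the same user always sees the same variant. The sketch below assumes hash-based bucketing; `assign_prompt_version`, the `salt` value, and the version names are all hypothetical.

```python
import hashlib
from typing import Dict, List


def assign_prompt_version(user_id: str, versions: List[str], salt: str = "exp-1") -> str:
    """Deterministically map a user to one prompt version.

    The salt is a hypothetical experiment identifier: changing it reshuffles
    users so that separate experiments are not correlated.
    """
    if not versions:
        raise ValueError("versions must not be empty")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(versions)
    return versions[bucket]


versions = ["sentiment_v1", "sentiment_v2"]
counts: Dict[str, int] = {v: 0 for v in versions}
for i in range(1000):
    counts[assign_prompt_version(f"user-{i}", versions)] += 1
print(counts)  # roughly even split across the two versions
```

Hashing avoids storing an assignment table, and the split stays stable across processes and restarts, unlike Python's built-in `hash()`, which is randomized per process for strings.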

In [1]:
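from dataclasses import dataclass  # required by the @dataclass decorators below
from enum import Enum
from jinja2 import Template
import re
from typing import List, Dict, Optional, Union, Callable
import time
import json

import numpy as np  # used by PromptOptimizer and the demo below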

class PromptType(Enum):
    """Different types of prompting strategies."""
    ZERO_SHOT = "zero_shot"
    FEW_SHOT = "few_shot"
    CHAIN_OF_THOUGHT = "chain_of_thought"
    SELF_CONSISTENCY = "self_consistency"
    TREE_OF_THOUGHTS = "tree_of_thoughts"
    ROLE_PLAYING = "role_playing"

@dataclass
class PromptExample:
    """Example for few-shot prompting."""
    input_text: str
    output_text: str
    reasoning: Optional[str] = None

@dataclass
class PromptTemplate:
    """Template for consistent prompt generation."""
    name: str
    template: str
    prompt_type: PromptType
    examples: List[PromptExample] = None
    metadata: Dict[str, any] = None
    
    def __post_init__(self):
        if self.examples is None:
            self.examples = []
        if self.metadata is None:
            self.metadata = {}


class PromptTemplateEngine:
    """
    Advanced prompt template system with dynamic content generation.
    
    Features:
    1. Template inheritance and composition
    2. Dynamic variable injection
    3. Conditional logic
    4. Multi-language support
    5. A/B testing support
    
    Interview Points:
    1. Template design patterns
    2. Variable sanitization and injection
    3. Performance optimization
    4. Version control for prompts
    """
    
    def __init__(self):
        self.templates = {}
        self.base_templates = {}
        self.evaluation_metrics = {}
        
    def register_template(self, template: PromptTemplate):
        """Register a new prompt template."""
        self.templates[template.name] = template
        
    def get_zero_shot_template(self, task_description: str, constraints: List[str] = None) -> str:
        """
        Generate zero-shot prompt template.
        
        Structure:
        1. Task description
        2. Constraints and requirements
        3. Output format specification
        """
        template_parts = [
            f"Task: {task_description}",
            ""
        ]
        
        if constraints:
            template_parts.extend([
                "Requirements:",
                *[f"- {constraint}" for constraint in constraints],
                ""
            ])
        
        template_parts.extend([
            "Input: {{input}}",
            "",
            "Output:"
        ])
        
        return "\n".join(template_parts)
    
    def get_few_shot_template(self, task_description: str, examples: List[PromptExample], 
                            constraints: List[str] = None) -> str:
        """
        Generate few-shot prompt with examples.
        
        Best Practices:
        1. 2-5 diverse examples
        2. Consistent format across examples
        3. Include edge cases
        4. Show reasoning if applicable
        """
        template_parts = [
            f"Task: {task_description}",
            ""
        ]
        
        if constraints:
            template_parts.extend([
                "Requirements:",
                *[f"- {constraint}" for constraint in constraints],
                ""
            ])
        
        # Add examples
        template_parts.append("Examples:")
        for i, example in enumerate(examples, 1):
            template_parts.extend([
                f"Example {i}:",
                f"Input: {example.input_text}",
                f"Output: {example.output_text}"
            ])
            if example.reasoning:
                template_parts.append(f"Reasoning: {example.reasoning}")
            template_parts.append("")
        
        # Add current input
        template_parts.extend([
            "Now, please complete this task:",
            "Input: {{input}}",
            "Output:"
        ])
        
        return "\n".join(template_parts)
    
    def get_chain_of_thought_template(self, task_description: str, 
                                    reasoning_structure: List[str] = None) -> str:
        """
        Generate Chain-of-Thought prompting template.
        
        Encourages step-by-step reasoning for complex problems.
        """
        if reasoning_structure is None:
            reasoning_structure = [
                "1. Understand the problem",
                "2. Identify key information", 
                "3. Apply relevant knowledge",
                "4. Reason through step-by-step",
                "5. Provide final answer"
            ]
        
        template_parts = [
            f"Task: {task_description}",
            "",
            "Please solve this step-by-step by following this reasoning structure:",
            *reasoning_structure,
            "",
            "Input: {{input}}",
            "",
            "Step-by-step solution:"
        ]
        
        return "\n".join(template_parts)
    
    def get_role_playing_template(self, role: str, expertise: str, 
                                task_description: str, context: str = None) -> str:
        """
        Generate role-playing prompt template.
        
        Helps the model adopt specific perspectives and expertise.
        """
        template_parts = [
            f"You are {role} with expertise in {expertise}.",
        ]
        
        if context:
            template_parts.append(f"Context: {context}")
        
        template_parts.extend([
            "",
            f"Task: {task_description}",
            "",
            "Please respond from your professional perspective, drawing on your expertise.",
            "",
            "Input: {{input}}",
            "",
            "Response:"
        ])
        
        return "\n".join(template_parts)
    
    def render_template(self, template_name: str, variables: Dict[str, any]) -> str:
        """
        Render template with variables using Jinja2.
        
        Supports:
        1. Variable substitution
        2. Conditional logic
        3. Loops and iterations
        4. Filters and functions
        """
        if template_name not in self.templates:
            raise ValueError(f"Template '{template_name}' not found")
        
        template_obj = self.templates[template_name]
        jinja_template = Template(template_obj.template)
        
        # Add safety checks for variables
        safe_variables = self._sanitize_variables(variables)
        
        return jinja_template.render(**safe_variables)
    
    def _sanitize_variables(self, variables: Dict[str, any]) -> Dict[str, any]:
        """Sanitize template variables to prevent injection attacks."""
        safe_vars = {}
        
        for key, value in variables.items():
            if isinstance(value, str):
                # Basic sanitization - remove potentially dangerous patterns
                safe_value = re.sub(r'[{}\\]', '', value)  # Remove braces and backslashes (template syntax)
                safe_value = safe_value.strip()
                safe_vars[key] = safe_value
            else:
                safe_vars[key] = value
                
        return safe_vars


class PromptOptimizer:
    """
    Advanced prompt optimization and evaluation system.
    
    Features:
    1. A/B testing for prompt variants
    2. Performance metrics tracking
    3. Automatic prompt refinement
    4. Cost optimization
    
    Interview Topics:
    1. How to measure prompt quality?
    2. Automated prompt engineering
    3. Cost vs performance trade-offs
    4. Multi-objective optimization
    """
    
    def __init__(self):
        self.experiments = {}
        self.metrics_history = []
        
    def create_experiment(self, name: str, prompt_variants: List[str], 
                        evaluation_function: Callable) -> str:
        """
        Create A/B testing experiment for prompt optimization.
        
        Args:
            name: Experiment identifier
            prompt_variants: List of prompt templates to test
            evaluation_function: Function to evaluate prompt performance
        """
        experiment_id = f"{name}_{int(time.time())}"
        
        self.experiments[experiment_id] = {
            'name': name,
            'variants': prompt_variants,
            'evaluator': evaluation_function,
            'results': [],
            'best_variant': None,
            'created_at': time.time()
        }
        
        return experiment_id
    
    def evaluate_prompt_variant(self, experiment_id: str, variant_idx: int, 
                              test_inputs: List[str], expected_outputs: List[str] = None) -> Dict:
        """
        Evaluate a specific prompt variant.
        
        Metrics:
        1. Task completion rate
        2. Output quality score
        3. Latency
        4. Token usage (cost)
        """
        if experiment_id not in self.experiments:
            raise ValueError(f"Experiment {experiment_id} not found")
        
        experiment = self.experiments[experiment_id]
        prompt_template = experiment['variants'][variant_idx]
        
        results = {
            'variant_idx': variant_idx,
            'prompt_template': prompt_template,
            'test_results': [],
            'avg_score': 0,
            'avg_latency': 0,
            'total_tokens': 0
        }
        
        # Test each input
        for i, test_input in enumerate(test_inputs):
            start_time = time.time()
            
            # Mock LLM response (in production: call actual LLM)
            mock_output = self._mock_llm_response(prompt_template, test_input)
            
            end_time = time.time()
            latency = end_time - start_time
            
            # Evaluate response quality
            if expected_outputs and i < len(expected_outputs):
                quality_score = self._calculate_quality_score(mock_output, expected_outputs[i])
            else:
                quality_score = self._evaluate_response_quality(mock_output)
            
            # Count tokens (mock)
            token_count = len(prompt_template.split()) + len(mock_output.split())
            
            test_result = {
                'input': test_input,
                'output': mock_output,
                'quality_score': quality_score,
                'latency': latency,
                'tokens': token_count
            }
            
            results['test_results'].append(test_result)
        
        # Calculate aggregated metrics
        results['avg_score'] = np.mean([r['quality_score'] for r in results['test_results']])
        results['avg_latency'] = np.mean([r['latency'] for r in results['test_results']])
        results['total_tokens'] = sum([r['tokens'] for r in results['test_results']])
        
        # Store results
        experiment['results'].append(results)
        
        return results
    
    def _mock_llm_response(self, prompt_template: str, user_input: str) -> str:
        """Mock LLM response for demonstration."""
        # Simple mock based on prompt complexity
        template_length = len(prompt_template.split())
        input_length = len(user_input.split())
        
        response_parts = []
        
        if "step-by-step" in prompt_template.lower():
            response_parts.extend([
                "Let me think through this step by step:",
                f"1. First, I'll analyze: {user_input[:50]}...",
                "2. Then I'll consider the requirements...",
                "3. Finally, I'll provide the solution:",
                f"Based on the analysis, here's my response to '{user_input[:30]}...'"
            ])
        elif "examples" in prompt_template.lower():
            response_parts.append(f"Following the examples provided, my response to '{user_input[:30]}...' is:")
            response_parts.append("This follows the same pattern as the examples.")
        else:
            response_parts.append(f"Response to '{user_input[:50]}...'")
        
        return " ".join(response_parts)
    
    def _calculate_quality_score(self, output: str, expected: str) -> float:
        """Calculate quality score by comparing output to expected result."""
        # Simple similarity score (in production: use BLEU, ROUGE, etc.)
        output_words = set(output.lower().split())
        expected_words = set(expected.lower().split())
        
        if not expected_words:
            return 0.0
        
        overlap = len(output_words.intersection(expected_words))
        return overlap / len(expected_words)
    
    def _evaluate_response_quality(self, output: str) -> float:
        """Evaluate response quality without ground truth."""
        # Simple heuristics (in production: use trained evaluator models)
        score = 0.5  # Base score
        
        # Check for structure
        if any(marker in output.lower() for marker in ['step', 'first', 'second', 'finally']):
            score += 0.2
        
        # Check for length appropriateness
        if 20 <= len(output.split()) <= 200:
            score += 0.2
        
        # Check for coherence (simple lexical diversity); guard against empty output
        words = output.lower().split()
        if words and len(set(words)) / len(words) > 0.5:
            score += 0.1
        
        return min(score, 1.0)
    
    def get_best_variant(self, experiment_id: str) -> Optional[Dict]:
        """Get the best performing prompt variant from experiment."""
        if experiment_id not in self.experiments:
            raise ValueError(f"Experiment {experiment_id} not found")
        
        experiment = self.experiments[experiment_id]
        
        if not experiment['results']:
            return None
        
        # Find best variant based on composite score
        best_variant = None
        best_score = -1
        
        for result in experiment['results']:
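            # Composite score: every term is bounded to (0, 1] so the weights are comparable.
            # (An unbounded 1 / latency term would swamp the quality score for fast mock calls.)
            avg_tokens = result['total_tokens'] / len(result['test_results'])
            composite_score = (
                result['avg_score'] * 0.6 +  # Quality weight
                (1 / (1 + result['avg_latency'])) * 0.2 +  # Speed weight (latency in seconds)
                (1 / (1 + avg_tokens)) * 0.2  # Efficiency weight
            )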
            
            if composite_score > best_score:
                best_score = composite_score
                best_variant = result
        
        experiment['best_variant'] = best_variant
        return best_variant


# Comprehensive demonstration
print("📝 PROMPT ENGINEERING & TEMPLATE SYSTEMS")
print("=" * 50)

# Initialize template engine
template_engine = PromptTemplateEngine()

# Create different types of templates
print("\n🏗️ CREATING PROMPT TEMPLATES:")

# 1. Zero-shot template
zero_shot_prompt = template_engine.get_zero_shot_template(
    task_description="Classify the sentiment of the given text",
    constraints=["Output only: Positive, Negative, or Neutral", "Provide confidence score"]
)

# 2. Few-shot template  
examples = [
    PromptExample("I love this product!", "Positive (0.95)", "Strong positive emotion expressed"),
    PromptExample("It's okay, nothing special", "Neutral (0.70)", "Mixed sentiment, neither strongly positive nor negative"),
    PromptExample("Terrible experience, would not recommend", "Negative (0.90)", "Clear negative sentiment with recommendation against")
]

few_shot_prompt = template_engine.get_few_shot_template(
    task_description="Classify the sentiment of the given text with confidence score",
    examples=examples,
    constraints=["Output format: Sentiment (confidence)", "Provide brief reasoning"]
)

# 3. Chain-of-thought template
cot_prompt = template_engine.get_chain_of_thought_template(
    task_description="Analyze the sentiment and emotional tone of text",
    reasoning_structure=[
        "1. Identify key emotional words and phrases",
        "2. Consider context and implicit meanings", 
        "3. Evaluate overall sentiment polarity",
        "4. Assess confidence based on clarity of sentiment",
        "5. Provide final classification with reasoning"
    ]
)

# 4. Role-playing template
role_prompt = template_engine.get_role_playing_template(
    role="an expert sentiment analysis researcher",
    expertise="natural language processing and computational linguistics",
    task_description="Analyze the sentiment with academic rigor",
    context="You have published papers on sentiment analysis and emotion detection"
)

# Register templates
templates = [
    PromptTemplate("sentiment_zero_shot", zero_shot_prompt, PromptType.ZERO_SHOT),
    PromptTemplate("sentiment_few_shot", few_shot_prompt, PromptType.FEW_SHOT, examples),
    PromptTemplate("sentiment_cot", cot_prompt, PromptType.CHAIN_OF_THOUGHT),
    PromptTemplate("sentiment_role", role_prompt, PromptType.ROLE_PLAYING)
]

for template in templates:
    template_engine.register_template(template)

print(f"✅ Created {len(templates)} template variants")

# Test template rendering
test_input = "The movie was absolutely fantastic! Great acting and storyline."

print("\n🧪 TEMPLATE RENDERING EXAMPLES:")
for template in templates:
    print(f"\n--- {template.name.upper()} ---")
    try:
        rendered = template_engine.render_template(template.name, {"input": test_input})
        print(rendered[:200] + "..." if len(rendered) > 200 else rendered)
    except Exception as e:
        print(f"Error rendering template: {e}")

# Initialize prompt optimizer and run experiments
print("\n⚡ PROMPT OPTIMIZATION EXPERIMENT:")
optimizer = PromptOptimizer()

# Create experiment with template variants
prompt_variants = [
    zero_shot_prompt,
    few_shot_prompt, 
    cot_prompt,
    role_prompt
]

def mock_evaluator(output: str, expected: str = None) -> float:
    """Mock evaluation function for demonstration."""
    return np.random.uniform(0.6, 0.95)  # Mock quality scores

experiment_id = optimizer.create_experiment(
    name="sentiment_analysis_optimization",
    prompt_variants=prompt_variants,
    evaluation_function=mock_evaluator
)

# Test inputs for evaluation
test_inputs = [
    "I absolutely love this new smartphone!",
    "The service was disappointing and slow.",
    "It's an average product, nothing extraordinary.",
    "Best purchase I've made this year!",
    "Could be better, but it's acceptable."
]

# Evaluate each variant
print(f"\nEvaluating {len(prompt_variants)} prompt variants...")
for i in range(len(prompt_variants)):
    result = optimizer.evaluate_prompt_variant(experiment_id, i, test_inputs)
    print(f"\nVariant {i+1} Results:")
    print(f"  Average Quality Score: {result['avg_score']:.3f}")
    print(f"  Average Latency: {result['avg_latency']:.3f}s")
    print(f"  Total Tokens: {result['total_tokens']}")

# Get best variant
best_variant = optimizer.get_best_variant(experiment_id)
if best_variant:
    print("\n🏆 BEST PERFORMING VARIANT:")
    print(f"Variant Index: {best_variant['variant_idx']}")
    print(f"Quality Score: {best_variant['avg_score']:.3f}")
    print(f"Efficiency: {best_variant['total_tokens'] / len(test_inputs):.1f} tokens/query")

print("\n✅ Prompt engineering system implemented successfully!")

print("\n💡 Key Interview Points:")
print("• Zero-shot vs Few-shot: Trade-off between simplicity and performance")
print("• Chain-of-Thought: Often improves multi-step reasoning; gains depend on task and model size")
print("• Role-playing: Leverages model's training on diverse perspectives")
print("• A/B testing: Essential for production prompt optimization")
print("• Token efficiency: Cost optimization through smart prompt design")
print("• Safety considerations: Prompt injection prevention, content filtering")


## 6. Production Deployment & Azure Integration

### 🚀 **Deployment Architecture for Transformers:**

```
Training Pipeline:    [Data] → [Fine-tuning] → [Evaluation] → [Model Registry]
                                     ↓
Inference Pipeline:   [Model Serving] → [API Gateway] → [Load Balancer] → [Monitoring]
```

### ⚙️ **Azure Services for ML Production:**

| Service | Purpose | Use Case |
|---------|---------|----------|
| **Azure OpenAI** | Managed LLM inference | GPT-4, ChatGPT integration |
| **Azure ML** | End-to-end ML lifecycle | Training, deployment, monitoring |
| **Azure Cognitive Services** (now Azure AI services) | Pre-built AI APIs | Vision, Speech, Language |
| **Azure Container Instances** | Serverless containers | Lightweight model serving |
| **Azure Kubernetes Service** | Container orchestration | Scalable, production workloads |
| **Azure Functions** | Serverless compute | Event-driven ML inference |

### 🔧 **Model Optimization Techniques:**

1. **Quantization**: 8-bit (INT8) or 4-bit weights for memory reduction
2. **Pruning**: Remove unnecessary weights and connections  
3. **Knowledge Distillation**: Compress large models to smaller ones
4. **ONNX Conversion**: Cross-platform optimized inference
5. **TensorRT/DirectML**: Hardware-specific acceleration
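
To make the quantization arithmetic concrete, here is a pure-Python sketch of weight-memory estimation and symmetric per-tensor INT8 quantization. It is illustrative only: the helper names and the 7B-parameter model are assumptions, and production code would use a framework's quantization tooling rather than Python lists.

```python
from typing import List, Tuple


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights: List[int], scale: float) -> List[float]:
    """Map quantized integers back to approximate float values."""
    return [q * scale for q in q_weights]


# Hypothetical 7B-parameter model: FP32 vs INT8 vs 4-bit weights
for bits in (32, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(7e9, bits):.1f} GB")

weights = [0.50, -1.27, 0.03, 0.90]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"scale={scale:.4f}", f"max_error={max_error:.6f}")
```

The rounding error per weight is at most half the scale, which is why a single outlier weight (large `max_abs`) hurts accuracy for the whole tensor and motivates per-channel or per-group scales.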

### 📊 **Production Metrics:**
- **Latency**: P50, P95, P99 response times
- **Throughput**: Requests/second, tokens/second
- **Resource Usage**: GPU/CPU utilization, memory
- **Cost**: Compute cost per request/token
- **Quality**: BLEU, ROUGE, human evaluation scores
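
The latency and throughput metrics above can be computed from raw request timings. The following is a minimal sketch using the nearest-rank percentile definition (one of several conventions); `percentile` and `summarize_latency` are hypothetical helpers and the latencies are synthetic.

```python
import math
from typing import Dict, List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_latency(latencies_ms: List[float], window_seconds: float) -> Dict[str, float]:
    """Summarize latency percentiles and throughput for one observation window."""
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "requests_per_second": len(latencies_ms) / window_seconds,
    }


# Synthetic latencies of 1..100 ms observed over a 10-second window
latencies_ms = [float(x) for x in range(1, 101)]
summary = summarize_latency(latencies_ms, window_seconds=10.0)
print(summary)
```

Tail percentiles (P95, P99) matter more than the mean for user-facing services because a small fraction of slow requests dominates perceived responsiveness.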

In [None]:
import asyncio
import aiohttp
import logging
from datetime import datetime
from concurrent.futures import ThreadPoolExecutor
import threading
import queue
import psutil
import gc
from pathlib import Path
import yaml

# Mock Azure SDK imports for demonstration
class MockAzureMLClient:
    """Mock Azure ML client for demonstration purposes."""
    def __init__(self, **kwargs):
        self.subscription_id = kwargs.get('subscription_id', 'mock-subscription')
        self.resource_group = kwargs.get('resource_group', 'mock-rg')
        
    def create_or_update(self, workspace):
        print(f"✅ Mock: Created workspace {workspace.name}")
        
    def begin_create_or_update(self, deployment):
        print(f"✅ Mock: Deployed model {deployment.name}")
        return MockDeployment()

class MockDeployment:
    def result(self):
        return {"status": "Succeeded", "scoring_uri": "https://mock-endpoint.azure.com/score"}


class ModelOptimizer:
    """
    Model optimization techniques for production deployment.
    
    Key Techniques:
    1. Quantization: Reduce precision (FP32 → INT8)
    2. Pruning: Remove redundant parameters
    3. Knowledge Distillation: Compress large models
    4. ONNX Conversion: Hardware-agnostic format
    
    Interview Points:
    1. Trade-offs: Model size vs accuracy
    2. Hardware considerations: CPU vs GPU optimization
    3. Inference speedup: 2-10x improvement possible
    4. Memory reduction: Up to 75% with quantization
    """
    
    @staticmethod
    def quantize_model(model, quantization_type='dynamic'):
        """
        Quantize model for reduced memory and faster inference.
        
        Types:
        - dynamic: Post-training quantization (easiest)
        - static: Requires calibration dataset (better accuracy)
        - qat: Quantization-aware training (best accuracy)
        """
        print(f"🔧 Quantizing model with {quantization_type} quantization...")
        
        try:
            import io
            import torch
            
            if quantization_type == 'dynamic':
                # Dynamic quantization: weights are stored as INT8 and activations are
                # quantized on the fly. Suits Linear-heavy models (transformers, LSTMs).
                quantized_model = torch.quantization.quantize_dynamic(
                    model, {torch.nn.Linear, torch.nn.LSTM}, dtype=torch.qint8
                )
            elif quantization_type == 'static':
                # Static quantization - requires calibration
                model.eval()
                model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
                quantized_model = torch.quantization.prepare(model)
                # Run representative calibration batches through quantized_model here;
                # without them the observers have no statistics to derive scales from
                quantized_model = torch.quantization.convert(quantized_model)
            else:
                print("⚠️ QAT requires specialized training setup")
                return model
                
            # Compare serialized sizes. Quantized layers keep their weights in packed
            # buffers that .parameters() does not report, so counting parameters
            # would yield a size of zero and a division by zero.
            def serialized_size(m):
                buffer = io.BytesIO()
                torch.save(m.state_dict(), buffer)
                return buffer.getbuffer().nbytes
            
            compression_ratio = serialized_size(model) / serialized_size(quantized_model)
            
            print(f"✅ Quantization complete. Compression ratio: {compression_ratio:.2f}x")
            return quantized_model
            
        except ImportError:
            print("⚠️ PyTorch quantization not available - using mock quantization")
            return model
    
    @staticmethod
    def prune_model(model, pruning_ratio=0.2):
        """
        Prune model by removing low-magnitude weights.
        
        Techniques:
        1. Magnitude-based: Remove smallest weights
        2. Structured: Remove entire channels/layers
        3. Gradual: Iterative pruning during training
        """
        print(f"✂️ Pruning model with {pruning_ratio:.1%} sparsity...")
        
        try:
            import torch
            import torch.nn.utils.prune as prune
            
            # Apply magnitude-based pruning to linear layers
            for name, module in model.named_modules():
                if isinstance(module, torch.nn.Linear):
                    prune.l1_unstructured(module, name='weight', amount=pruning_ratio)
                    prune.remove(module, 'weight')  # Make pruning permanent
            
            # Calculate actual sparsity
            total_params = sum(p.numel() for p in model.parameters())
            zero_params = sum((p == 0).sum().item() for p in model.parameters())
            actual_sparsity = zero_params / total_params
            
            print(f"✅ Pruning complete. Actual sparsity: {actual_sparsity:.1%}")
            return model
            
        except ImportError:
            print("⚠️ PyTorch pruning not available - using mock pruning")
            return model
    
    @staticmethod
    def convert_to_onnx(model, input_shape, output_path):
        """
        Convert PyTorch model to ONNX for cross-platform deployment.
        
        Benefits:
        1. Hardware optimization (CPU, GPU, specialized chips)
        2. Runtime optimization (ONNX Runtime, TensorRT)
        3. Cross-framework compatibility
        """
        print(f"🔄 Converting model to ONNX format...")
        
        try:
            import torch.onnx
            
            # Create dummy input
            dummy_input = torch.randn(input_shape)
            model.eval()
            
            # Export to ONNX
            torch.onnx.export(
                model,
                dummy_input,
                output_path,
                export_params=True,
                opset_version=11,
                do_constant_folding=True,
                input_names=['input'],
                output_names=['output'],
                dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}
            )
            
            print(f"✅ ONNX model saved to {output_path}")
            return output_path
            
        except ImportError:
            print("⚠️ ONNX export not available")
            return None


class ModelServer:
    """
    Production-ready model serving with monitoring and optimization.
    
    Features:
    1. Async request handling
    2. Batch processing
    3. Caching
    4. Health monitoring
    5. Auto-scaling triggers
    
    Production Considerations:
    1. Latency: Sub-100ms response times
    2. Throughput: Handle 1000+ RPS
    3. Memory: Efficient batch processing
    4. Monitoring: Real-time metrics
    """
    
    def __init__(self, model, max_batch_size=8, cache_size=1000):
        self.model = model
        self.max_batch_size = max_batch_size
        self.cache = {}
        self.cache_size = cache_size
        self.request_queue = queue.Queue()
        self.metrics = {
            'total_requests': 0,
            'successful_requests': 0,
            'failed_requests': 0,
            'avg_latency': 0,
            'cache_hits': 0
        }
        self.is_running = False
        self.executor = ThreadPoolExecutor(max_workers=4)
        
    async def predict(self, input_text: str, use_cache=True) -> Dict:
        """
        Make prediction with caching and monitoring.
        """
        start_time = time.time()
        self.metrics['total_requests'] += 1
        
        try:
            # Check cache first
            if use_cache and input_text in self.cache:
                self.metrics['cache_hits'] += 1
                self.metrics['successful_requests'] += 1
                return {
                    'prediction': self.cache[input_text],
                    'cached': True,
                    'latency': time.time() - start_time
                }
            
            # Process with model
            prediction = await self._run_inference(input_text)
            
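            # Update cache. There is no eviction: once the cache is full, new
            # results are simply not stored (an LRU policy is a common alternative)
            if use_cache and len(self.cache) < self.cache_size:
                self.cache[input_text] = prediction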
            
            latency = time.time() - start_time
            self._update_latency_metrics(latency)
            self.metrics['successful_requests'] += 1
            
            return {
                'prediction': prediction,
                'cached': False,
                'latency': latency
            }
            
        except Exception as e:
            self.metrics['failed_requests'] += 1
            logging.error(f"Prediction failed: {e}")
            return {'error': str(e), 'latency': time.time() - start_time}
    
    async def _run_inference(self, input_text: str):
        """Run model inference in thread pool."""
        loop = asyncio.get_running_loop()
        
        def _inference():
            # Mock inference (replace with actual model call)
            time.sleep(0.01)  # Simulate processing time
            return f"Mock prediction for: {input_text[:50]}..."
        
        result = await loop.run_in_executor(self.executor, _inference)
        return result
    
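    async def batch_predict(self, inputs: List[str]) -> List[Dict]:
        """
        Process inputs in chunks of max_batch_size, running each chunk concurrently.
        
        Note: this demo fans out single-item predictions. True batching would
        stack a chunk into one tensor and run a single forward pass, which gives:
        1. Better GPU utilization
        2. Reduced overhead per request
        3. Higher throughput
        """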
        # Split into batches
        batches = [inputs[i:i + self.max_batch_size] 
                  for i in range(0, len(inputs), self.max_batch_size)]
        
        all_results = []
        for batch in batches:
            # Process batch concurrently
            tasks = [self.predict(text) for text in batch]
            batch_results = await asyncio.gather(*tasks)
            all_results.extend(batch_results)
        
        return all_results
    
    def _update_latency_metrics(self, latency):
        """Update rolling average latency."""
        alpha = 0.1  # Smoothing factor
        if self.metrics['avg_latency'] == 0:
            self.metrics['avg_latency'] = latency
        else:
            self.metrics['avg_latency'] = (
                alpha * latency + (1 - alpha) * self.metrics['avg_latency']
            )
    
    def get_health_status(self) -> Dict:
        """Get server health and performance metrics."""
        # System metrics
        cpu_percent = psutil.cpu_percent()
        memory_percent = psutil.virtual_memory().percent
        
        # Calculate success rate (an idle server with no requests counts as healthy)
        total_reqs = self.metrics['total_requests']
        success_rate = (self.metrics['successful_requests'] / total_reqs * 100 
                       if total_reqs > 0 else 100.0)
        
        # Cache hit rate
        cache_hit_rate = (self.metrics['cache_hits'] / total_reqs * 100 
                         if total_reqs > 0 else 0)
        
        return {
            'status': 'healthy' if success_rate > 95 and cpu_percent < 80 else 'degraded',
            'metrics': {
                'requests': self.metrics,
                'success_rate': f"{success_rate:.1f}%",
                'cache_hit_rate': f"{cache_hit_rate:.1f}%",
                'avg_latency_ms': f"{self.metrics['avg_latency'] * 1000:.1f}",
                'system': {
                    'cpu_percent': cpu_percent,
                    'memory_percent': memory_percent
                }
            }
        }


class AzureMLDeployment:
    """
    Azure ML deployment manager for transformer models.
    
    Features:
    1. Model registration
    2. Endpoint deployment
    3. Monitoring setup
    4. Auto-scaling configuration
    
    Interview Topics:
    1. Blue-green deployments
    2. A/B testing in production
    3. Cost optimization strategies
    4. Monitoring and alerting
    """
    
    def __init__(self, subscription_id: str, resource_group: str, workspace_name: str):
        # Initialize Azure ML client (mocked for demo)
        self.ml_client = MockAzureMLClient(
            subscription_id=subscription_id,
            resource_group=resource_group
        )
        self.workspace_name = workspace_name
        
    def register_model(self, model_name: str, model_path: str, description: str = None):
        """Register model in Azure ML model registry."""
        print(f"📝 Registering model '{model_name}' in Azure ML...")
        
        # Mock model registration
        model_info = {
            'name': model_name,
            'path': model_path,
            'description': description or f"Transformer model: {model_name}",
            'version': 1,
            'created_time': datetime.now().isoformat()
        }
        
        print(f"✅ Model registered: {model_info}")
        return model_info
    
    def create_endpoint(self, endpoint_name: str, model_name: str, 
                       instance_type: str = "Standard_DS3_v2"):
        """Create managed online endpoint for model serving."""
        print(f"🚀 Creating endpoint '{endpoint_name}' with {instance_type}...")
        
        # Mock endpoint configuration
        endpoint_config = {
            'name': endpoint_name,
            'model': model_name,
            'instance_type': instance_type,
            'instance_count': 1,
            'auth_mode': 'key',
            'compute_type': 'managed'
        }
        
        # Mock deployment
        deployment_result = self.ml_client.begin_create_or_update(
            type('MockDeployment', (), endpoint_config)()
        ).result()
        
        print(f"✅ Endpoint created: {deployment_result}")
        return deployment_result
    
    def setup_monitoring(self, endpoint_name: str):
        """Setup monitoring and alerting for the endpoint."""
        print(f"📊 Setting up monitoring for endpoint '{endpoint_name}'...")
        
        monitoring_config = {
            'metrics': [
                'requests_per_minute',
                'latency_p95',
                'error_rate',
                'cpu_utilization',
                'memory_utilization'
            ],
            'alerts': [
                {'metric': 'error_rate', 'threshold': 5, 'condition': 'greater_than'},
                {'metric': 'latency_p95', 'threshold': 1000, 'condition': 'greater_than'},
                {'metric': 'cpu_utilization', 'threshold': 80, 'condition': 'greater_than'}
            ],
            'dashboard': f"https://portal.azure.com/dashboard/{endpoint_name}"
        }
        
        print(f"✅ Monitoring configured: {monitoring_config}")
        return monitoring_config
    
    def configure_autoscaling(self, endpoint_name: str, min_instances=1, max_instances=10):
        """Configure auto-scaling based on demand."""
        print(f"⚖️ Configuring auto-scaling for endpoint '{endpoint_name}'...")
        
        autoscaling_config = {
            'min_instances': min_instances,
            'max_instances': max_instances,
            'target_utilization_percentage': 70,
            'scale_up_cooldown': '5m',
            'scale_down_cooldown': '15m',
            'metrics': ['cpu_utilization', 'memory_utilization', 'request_rate']
        }
        
        print(f"✅ Auto-scaling configured: {autoscaling_config}")
        return autoscaling_config


# Comprehensive demonstration
print("🚀 PRODUCTION DEPLOYMENT & AZURE INTEGRATION")
print("=" * 60)

# Test model optimization
print("\n🔧 MODEL OPTIMIZATION:")
sample_model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 2)
)

optimizer = ModelOptimizer()

# Original model size
original_params = sum(p.numel() for p in sample_model.parameters())
print(f"Original model parameters: {original_params:,}")

# Test quantization
quantized_model = optimizer.quantize_model(sample_model, 'dynamic')

# Test pruning  
pruned_model = optimizer.prune_model(sample_model, pruning_ratio=0.3)

# Test ONNX conversion
onnx_path = "model.onnx"
optimizer.convert_to_onnx(sample_model, (1, 512), onnx_path)

# Test model server
print("\n🖥️ MODEL SERVER DEMONSTRATION:")
server = ModelServer(sample_model, max_batch_size=4)

# Test single prediction
async def test_predictions():
    result = await server.predict("Test input for sentiment analysis")
    print(f"Single prediction: {result}")
    
    # Test batch prediction
    batch_inputs = [
        "Great product, highly recommend!",
        "Terrible experience, very disappointed",
        "Average quality, nothing special",
        "Excellent service and fast delivery"
    ]
    
    batch_results = await server.batch_predict(batch_inputs)
    print(f"\nBatch prediction results: {len(batch_results)} predictions")
    
    # Test caching
    cached_result = await server.predict("Test input for sentiment analysis")
    print(f"Cached prediction: {cached_result['cached']}")

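# Run async test
try:
    asyncio.run(test_predictions())
except RuntimeError:
    # Jupyter already runs an event loop, so asyncio.run() refuses to start another.
    # In a notebook cell, call `await test_predictions()` instead.
    print("⚠️ Event loop already running: use `await test_predictions()` in a notebook cell")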

# Get health status
health = server.get_health_status()
print(f"\n📊 Server Health Status:")
print(f"Status: {health['status']}")
print(f"Success Rate: {health['metrics']['success_rate']}")
print(f"Average Latency: {health['metrics']['avg_latency_ms']}ms")

# Test Azure ML deployment
print("\n☁️ AZURE ML DEPLOYMENT:")
azure_deployment = AzureMLDeployment(
    subscription_id="your-subscription-id",
    resource_group="ml-rg",
    workspace_name="transformer-workspace"
)

# Register model
model_info = azure_deployment.register_model(
    model_name="sentiment-transformer-v1",
    model_path="./models/",
    description="Fine-tuned transformer for sentiment analysis"
)

# Create endpoint
endpoint_result = azure_deployment.create_endpoint(
    endpoint_name="sentiment-endpoint",
    model_name="sentiment-transformer-v1",
    instance_type="Standard_DS3_v2"
)

# Setup monitoring
monitoring = azure_deployment.setup_monitoring("sentiment-endpoint")

# Configure auto-scaling
autoscaling = azure_deployment.configure_autoscaling(
    endpoint_name="sentiment-endpoint",
    min_instances=1,
    max_instances=10
)

print(f"\n✅ Production deployment demo complete (Azure calls are mocked)")

print(f"\n💡 Production Best Practices:")
print(f"• Model Optimization: INT8 quantization can cut weight memory by up to ~75% vs FP32")
print(f"• Batch Processing: Can raise throughput severalfold, depending on model and hardware")
print(f"• Caching: Reduces latency for repeated requests")
print(f"• Monitoring: Track latency, throughput, error rates")
print(f"• Auto-scaling: Handle traffic spikes automatically")
print(f"• Blue-green deployment: Zero-downtime model updates")

print(f"\n🎯 Key Interview Points:")
print(f"• Latency optimization: Model quantization, ONNX, batching")
print(f"• Cost optimization: Right-sizing instances, auto-scaling")
print(f"• Reliability: Health checks, monitoring, alerting")
print(f"• Security: Managed identity, key vault, network isolation")
print(f"• CI/CD: Automated testing, gradual rollouts, rollback strategies")

## 7. Advanced Topics & Interview Readiness Summary

### 🧠 **Advanced Transformer Concepts:**

#### **Mixture of Experts (MoE)**
- **Key Idea**: Activate only a subset of parameters (experts) per input token
- **Benefits**: Scales model capacity without a proportional increase in compute
- **Challenge**: Routing efficiency, load balancing across experts
- **Examples**: Switch Transformer, GLaM, Mixtral 8x7B
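
A minimal sketch of the sparse-activation idea, assuming top-2 routing with renormalized gate weights. The scalar "experts" and router logits are toy values chosen for illustration; in a real MoE layer each expert is a feed-forward network and the logits come from a learned router.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    gates = top_k_route(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())


# Four toy "experts" (scalar functions) and made-up router logits for one token
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
router_logits = [0.1, 2.0, 1.5, -1.0]
gates = top_k_route(router_logits, k=2)
output = moe_forward(3.0, experts, router_logits, k=2)
print(gates, output)
```

Only two of the four experts are evaluated for this token, which is where the compute savings come from.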

#### **Long Context Handling**
- **Problem**: Quadratic attention complexity O(n²)
- **Solutions**: 
  - Sparse attention patterns (Longformer, BigBird)
  - Sliding window attention
  - Memory-augmented transformers
  - Retrieval-based context extension
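
The sliding-window option can be sketched as a boolean mask. This is an illustrative causal variant with a hypothetical window of 3 tokens; models such as Longformer combine local windows with a few global tokens, which this sketch leaves out.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when causal token i may attend to token j."""
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


def attended_pairs(mask):
    """Count the (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)


seq_len, window = 8, 3
mask = sliding_window_mask(seq_len, window)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

full_causal = seq_len * (seq_len + 1) // 2
print(attended_pairs(mask), "pairs vs", full_causal, "for full causal attention")
```

The number of attended pairs grows as O(n · w) rather than O(n²), so doubling the sequence length roughly doubles the cost instead of quadrupling it.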

#### **Multi-modal Transformers**
- **Vision-Language**: CLIP, DALL-E, GPT-4V
- **Audio-Language**: Whisper, SpeechT5
- **Code-Language**: GitHub Copilot, CodeT5
- **Unified Models**: Flamingo, BLIP-2

### ⚡ **Recent Advances (2023-2024):**

#### **Architectural Innovations**
1. **RMSNorm**: Alternative to LayerNorm (LLaMA)
2. **SwiGLU**: Improved activation function
3. **Rotary Position Embedding (RoPE)**: Better positional encoding
4. **Grouped-Query Attention (GQA)**: Shares key/value heads across groups of query heads, shrinking the KV cache and memory bandwidth
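
To see why sharing key/value heads matters at inference time, the sketch below estimates KV-cache size for a hypothetical 32-layer model with 32 query heads of dimension 128 and an FP16 cache. The configuration is an assumption for illustration, not a description of any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Size of the key/value cache: 2 tensors (K and V) per layer."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical configuration shared by all three variants
config = dict(num_layers=32, head_dim=128, seq_len=4096)
mha = kv_cache_bytes(num_kv_heads=32, **config)  # multi-head: one KV head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **config)   # grouped-query: 4 query heads share a KV head
mqa = kv_cache_bytes(num_kv_heads=1, **config)   # multi-query: all query heads share one KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**30:.2f} GiB")
```

The cache shrinks in direct proportion to the number of KV heads, which is the memory-bandwidth saving the list item refers to.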

#### **Training Techniques**
1. **Constitutional AI**: Alignment from AI feedback guided by a written set of principles
2. **RLHF**: Reinforcement Learning from Human Feedback
3. **Instruction Tuning**: Following complex instructions
4. **Few-shot ICL**: In-context learning capabilities

### 🎯 **Interview Question Categories:**

#### **1. Mathematical Foundations (25%)**
- Attention mechanism derivation
- Computational complexity analysis
- Gradient flow in deep transformers
- Positional encoding mathematics

#### **2. Implementation Details (25%)**
- Custom layer implementations
- Memory optimization techniques
- Distributed training strategies
- Debugging common issues

#### **3. Fine-tuning & Adaptation (20%)**
- When to use which fine-tuning method
- Catastrophic forgetting prevention
- Domain adaptation strategies
- Evaluation methodology

#### **4. Production Deployment (20%)**
- Latency vs accuracy trade-offs
- Scaling and load balancing
- Cost optimization strategies
- Monitoring and debugging

#### **5. Recent Research & Trends (10%)**
- Latest architectural improvements
- Emerging application areas
- Ethical considerations
- Future research directions

In [None]:
# Final comprehensive test and interview readiness assessment
print("🎓 TRANSFORMER INTERVIEW READINESS ASSESSMENT")
print("=" * 60)

class InterviewReadinessChecker:
    """
    Comprehensive interview readiness assessment for transformer expertise.
    
    Covers all key areas that Applied Scientist roles typically evaluate.
    """
    
    def __init__(self):
        self.assessment_areas = {
            'mathematical_foundations': {
                'weight': 0.25,
                'topics': [
                    'Attention mechanism derivation',
                    'Computational complexity (O(n²d) analysis)',
                    'Gradient computation and backpropagation',
                    'Positional encoding mathematics',
                    'Layer normalization vs batch normalization'
                ]
            },
            'implementation_skills': {
                'weight': 0.25,
                'topics': [
                    'Multi-head attention implementation',
                    'Custom PyTorch/TensorFlow layers',
                    'Memory optimization techniques',
                    'Distributed training setup',
                    'Common debugging approaches'
                ]
            },
            'fine_tuning_expertise': {
                'weight': 0.20,
                'topics': [
                    'LoRA vs full fine-tuning trade-offs',
                    'Catastrophic forgetting prevention',
                    'Domain adaptation strategies',
                    'Instruction tuning and RLHF',
                    'Evaluation methodology (BLEU, ROUGE, human eval)'
                ]
            },
            'production_deployment': {
                'weight': 0.20,
                'topics': [
                    'Model optimization (quantization, pruning)',
                    'Serving architecture and scaling',
                    'Latency vs accuracy optimization',
                    'Monitoring and debugging in production',
                    'Cost optimization strategies'
                ]
            },
            'recent_advances': {
                'weight': 0.10,
                'topics': [
                    'Mixture of Experts (MoE) architecture',
                    'Long context handling techniques',
                    'Multi-modal transformers',
                    'Recent architectural innovations',
                    'Ethical AI and safety considerations'
                ]
            }
        }
    
    def assess_readiness(self) -> Dict:
        """
        Assess interview readiness across all areas.
        
        This would typically involve actual testing, but here we provide
        a framework for self-assessment.
        """
        print("📋 INTERVIEW READINESS CHECKLIST:")
        print("Rate your confidence (1-5) in each area:\n")
        
        total_score = 0
        detailed_results = {}
        
        for area, info in self.assessment_areas.items():
            print(f"📚 {area.replace('_', ' ').title()} (Weight: {info['weight']:.0%}):")
            area_score = 0
            
            for i, topic in enumerate(info['topics'], 1):
                # In a real assessment, this would be interactive
                # For demo, we'll assign mock scores
                mock_score = np.random.uniform(3.5, 5.0)  # Simulating good preparation
                area_score += mock_score
                print(f"  {i}. {topic}: {mock_score:.1f}/5.0")
            
            area_average = area_score / len(info['topics'])
            weighted_score = area_average * info['weight']
            total_score += weighted_score
            
            detailed_results[area] = {
                'average_score': area_average,
                'weighted_contribution': weighted_score,
                'topics': info['topics']
            }
            
            print(f"  📊 Area Average: {area_average:.2f}/5.0\n")
        
        # Determine readiness level
        if total_score >= 4.5:
            readiness_level = "🌟 Excellent - Ready for senior roles"
        elif total_score >= 4.0:
            readiness_level = "✅ Good - Ready for most applied scientist roles"
        elif total_score >= 3.5:
            readiness_level = "⚠️ Moderate - Need more preparation in weak areas"
        else:
            readiness_level = "❌ Needs significant improvement"
        
        return {
            'total_score': total_score,
            'readiness_level': readiness_level,
            'detailed_results': detailed_results,
            'recommendations': self._get_recommendations(detailed_results)
        }
    
    def _get_recommendations(self, results: Dict) -> List[str]:
        """Generate specific recommendations based on assessment."""
        recommendations = []
        
        for area, details in results.items():
            if details['average_score'] < 4.0:
                area_name = area.replace('_', ' ')
                recommendations.append(
                    f"Focus on {area_name}: Practice {details['topics'][0].lower()}"
                )
        
        # Add general recommendations
        recommendations.extend([
            "Practice implementing transformer components from scratch",
            "Stay updated with latest papers (arXiv, conferences)",
            "Build end-to-end projects demonstrating production skills",
            "Practice explaining complex concepts in simple terms"
        ])
        
        return recommendations[:5]  # Top 5 recommendations


def generate_sample_interview_questions():
    """Generate sample interview questions across different categories."""
    
    questions = {
        'Mathematical Foundations': [
            "Derive the gradient of the attention mechanism with respect to the query matrix.",
            "Why do we scale attention scores by √d_k? What happens if we don't?",
            "Compare the computational complexity of transformers vs RNNs for sequence length n.",
            "Explain how positional encoding enables the model to understand sequence order."
        ],
        
        'Implementation Details': [
            "Implement multi-head attention from scratch in PyTorch.",
            "How would you debug vanishing gradients in a 24-layer transformer?",
            "Design a memory-efficient implementation for very long sequences.",
            "Explain the trade-offs between different attention patterns (full, sparse, local)."
        ],
        
        'Fine-tuning Strategy': [
            "When would you choose LoRA over full fine-tuning? Provide specific scenarios.",
            "How do you prevent catastrophic forgetting when adapting to new domains?",
            "Design an evaluation framework for a customer service chatbot.",
            "Explain the difference between instruction tuning and traditional fine-tuning."
        ],
        
        'Production Deployment': [
            "How would you optimize a transformer model for real-time inference?",
            "Design a serving architecture for handling 10,000 requests per second.",
            "What metrics would you monitor for a production LLM deployment?",
            "How do you handle model updates without downtime?"
        ],
        
        'Business & Strategy': [
            "A client wants to build a legal document analyzer. What approach would you recommend?",
            "How do you balance model performance with deployment costs?",
            "Explain the ethical considerations when deploying large language models.",
            "How would you convince leadership to invest in transformer research?"
        ]
    }
    
    return questions


def create_study_plan():
    """Create a structured study plan for transformer interview preparation."""
    
    study_plan = {
        'Week 1-2: Foundations': [
            "📚 Study: Attention mechanism mathematics and implementation",
            "💻 Code: Implement transformer from scratch (no libraries)",
            "📖 Read: 'Attention Is All You Need' paper + related papers",
            "🧪 Practice: Derive gradients, understand computational complexity"
        ],
        
        'Week 3-4: Fine-tuning & Adaptation': [
            "📚 Study: LoRA, QLoRA, prefix tuning, instruction tuning",
            "💻 Code: Implement LoRA and compare with full fine-tuning",
            "📖 Read: Recent papers on parameter-efficient fine-tuning",
            "🧪 Practice: Fine-tune models for different tasks and domains"
        ],
        
        'Week 5-6: Production & Scaling': [
            "📚 Study: Model optimization, serving architectures, monitoring",
            "💻 Code: Build production-ready serving system with monitoring",
            "📖 Read: Industry blogs on LLM deployment (OpenAI, Google, etc.)",
            "🧪 Practice: Optimize models for latency and throughput"
        ],
        
        'Week 7-8: Advanced Topics & Interview Prep': [
            "📚 Study: Latest research, multi-modal models, MoE",
            "💻 Code: Implement advanced techniques (sparse attention, etc.)",
            "📖 Read: Recent conference papers (NeurIPS, ICML, ACL)",
            "🧪 Practice: Mock interviews, explain concepts clearly"
        ]
    }
    
    return study_plan


# Run comprehensive assessment
print("\n🔍 RUNNING COMPREHENSIVE ASSESSMENT...")
checker = InterviewReadinessChecker()
assessment = checker.assess_readiness()

print(f"\n🎯 OVERALL ASSESSMENT RESULTS:")
print(f"Total Score: {assessment['total_score']:.2f}/5.0")
print(f"Readiness Level: {assessment['readiness_level']}")

print(f"\n💡 TOP RECOMMENDATIONS:")
for i, rec in enumerate(assessment['recommendations'], 1):
    print(f"{i}. {rec}")

# Display sample interview questions
print(f"\n❓ SAMPLE INTERVIEW QUESTIONS:")
sample_questions = generate_sample_interview_questions()

for category, questions in sample_questions.items():
    print(f"\n📋 {category}:")
    for i, question in enumerate(questions[:2], 1):  # Show 2 per category
        print(f"  {i}. {question}")

# Show study plan
print(f"\n📅 RECOMMENDED STUDY PLAN:")
study_plan = create_study_plan()

for period, activities in study_plan.items():
    print(f"\n{period}:")
    for activity in activities:
        print(f"  {activity}")

# Final readiness summary
print(f"\n🏆 TRANSFORMER EXPERTISE SUMMARY:")

expertise_areas = [
    "✅ Core Architecture: Attention, positional encoding, layer normalization",
    "✅ Fine-tuning Methods: LoRA, QLoRA, instruction tuning, RLHF",
    "✅ RAG Systems: Vector search, chunking, retrieval optimization",
    "✅ Prompt Engineering: Templates, optimization, evaluation",
    "✅ Production Deployment: Optimization, serving, monitoring",
    "✅ Advanced Topics: MoE, long context, multi-modal models"
]

for area in expertise_areas:
    print(area)

print(f"\n🎓 YOU'RE READY FOR TRANSFORMER-FOCUSED INTERVIEWS!")

print(f"\n💼 KEY TALKING POINTS FOR AMAZON APPLIED SCIENTIST:")
talking_points = [
    "🔬 Research Impact: How transformers revolutionized NLP and beyond",
    "⚡ Production Experience: Scaling models to millions of users",
    "💰 Cost Optimization: Balancing performance with computational efficiency",
    "🛡️ Safety & Ethics: Responsible AI deployment and bias mitigation",
    "🚀 Innovation: Contributing to next-generation AI capabilities",
    "👥 Leadership: Mentoring teams and driving technical vision"
]

for point in talking_points:
    print(point)

print(f"\n🌟 Remember: Focus on both technical depth AND business impact!")
print(f"💡 Practice explaining complex concepts to non-technical stakeholders!")
print(f"🤝 Demonstrate how your expertise drives customer value and business growth!")

---

# 8. Latest Architectural Innovations (2023-2024)

## 🔬 **State-of-the-Art Transformer Architectures**

### **🚀 Mixture of Experts (MoE) - The Future of Scaling**

**Core Innovation**: Activate only a subset of parameters per input
- **Switch Transformer**: Routing mechanism with expert specialization
- **GLaM**: 1.2T parameters with roughly 97B (about 8%) activated per token
- **PaLM-2**: Architecture details are unpublished; MoE use is not confirmed
- **Mixtral 8x7B**: Open-source MoE breakthrough

**Key Benefits**:
- Far more parameters at similar per-token compute (Mixtral 8x7B: ~47B total, ~13B active per token)
- Expert specialization for different domains
- Better sample efficiency than dense models

**Production Challenges**:
- Load balancing across experts
- Memory bandwidth optimization
- Dynamic routing efficiency
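
The load-balancing challenge above is usually handled with an auxiliary loss. The sketch below follows the Switch Transformer form, `N * sum_i(f_i * P_i)`, in plain Python so the arithmetic stays visible. The function name and the toy routing data are assumptions made for illustration, not part of any library.

```python
from collections import Counter

def load_balancing_loss(router_probs, assignments, num_experts):
    """Switch-style auxiliary loss: N * sum_i f_i * P_i.

    router_probs: per-token list of softmax probabilities over experts.
    assignments: per-token index of the expert the token was dispatched to.
    """
    num_tokens = len(assignments)
    counts = Counter(assignments)
    loss = 0.0
    for e in range(num_experts):
        f_e = counts.get(e, 0) / num_tokens                 # fraction of tokens sent to expert e
        p_e = sum(p[e] for p in router_probs) / num_tokens  # mean router probability for expert e
        loss += f_e * p_e
    return num_experts * loss

# Hypothetical toy batches: 4 tokens, 2 experts
balanced = load_balancing_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], num_experts=2)
collapsed = load_balancing_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], num_experts=2)
print(f"balanced={balanced:.2f}, collapsed={collapsed:.2f}")
```

The loss reaches its minimum of 1.0 under uniform routing and grows as tokens collapse onto one expert. In training it would be scaled by a small coefficient and added to the task loss.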

### **⚡ Efficient Attention Mechanisms**

#### **1. Ring Attention (2024)**
```
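Memory: Per-device memory scales with the local block, not the full sequence
Sequence Length: Grows linearly with device count (millions of tokens reported)
Key Innovation: Blockwise attention with K/V blocks passed around a ring of devices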
```

#### **2. Mamba/State Space Models**
```
Complexity: O(n) vs O(n²) for transformers
Strengths: Long sequences, efficient inference
Applications: DNA sequencing, audio processing
```

#### **3. RetNet (Retention Networks)**
```
Training: Parallel like transformers
Inference: Recurrent for efficiency
Memory: O(1) state per decoding step (independent of sequence length)
```
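
The "parallel training, recurrent inference" claim for RetNet can be checked on a toy example. The sketch below is a simplified single-head retention with no normalization and no rotation, so it illustrates the equivalence rather than reproducing the paper's layer. All names and the toy inputs are assumptions.

```python
def retention_parallel(Q, K, V, gamma):
    """O_n = sum_{m<=n} gamma^(n-m) * (Q_n . K_m) * V_m  (quadratic in length)."""
    out = []
    for n, q in enumerate(Q):
        o = [0.0] * len(V[0])
        for m in range(n + 1):
            w = gamma ** (n - m) * sum(a * b for a, b in zip(q, K[m]))
            o = [oi + w * vi for oi, vi in zip(o, V[m])]
        out.append(o)
    return out

def retention_recurrent(Q, K, V, gamma):
    """S_n = gamma * S_{n-1} + K_n^T V_n;  O_n = Q_n S_n  (fixed-size state)."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q, k, v in zip(Q, K, V):
        # Decay the old state, then add the outer product of the new key and value
        S = [[gamma * S[i][j] + k[i] * v[j] for j in range(d_v)] for i in range(d_k)]
        out.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return out

# Hypothetical toy inputs: 3 tokens, key/value dimension 2
Q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
K = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
par = retention_parallel(Q, K, V, gamma=0.9)
rec = retention_recurrent(Q, K, V, gamma=0.9)
max_diff = max(abs(a - b) for rp, rr in zip(par, rec) for a, b in zip(rp, rr))
print(f"max difference between forms: {max_diff:.2e}")
```

The recurrent form keeps only the `d_k x d_v` state `S`, which is why generation memory does not grow with sequence length.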

### **🎯 Architectural Improvements**

#### **RMSNorm vs LayerNorm**
- **RMSNorm**: Skips mean centering; the original paper reports 7-64% lower running time than LayerNorm, depending on the model
- **Formula**: x / RMS(x) * g (no mean centering)
- **Used in**: LLaMA, Gopher, Chinchilla

#### **SwiGLU Activation**
- **Formula**: SwiGLU(x) = Swish(Wx + b) ⊙ (Vx + c)
- **Benefits**: Lower perplexity than ReLU/GELU feed-forward layers in Shazeer's GLU-variants comparison (2020)
- **Used in**: PaLM, LLaMA, Mistral
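
A common follow-up question is what the third SwiGLU projection costs. The arithmetic below is a sketch assuming bias-free projections and a dense baseline with hidden size `4 * dim`. The helper names are made up for this example.

```python
def ffn_params(dim, hidden):
    """Two bias-free projections: dim -> hidden -> dim."""
    return 2 * dim * hidden

def swiglu_params(dim, hidden):
    """Three bias-free projections: gate and value (dim -> hidden), output (hidden -> dim)."""
    return 3 * dim * hidden

dim = 4096
dense = ffn_params(dim, 4 * dim)
gated_same_width = swiglu_params(dim, 4 * dim)             # same hidden width as the dense FFN
gated_scaled = swiglu_params(dim, int(2 * (4 * dim) / 3))  # hidden width scaled by 2/3

print(f"dense FFN:           {dense:,}")
print(f"SwiGLU, same width:  {gated_same_width:,} ({gated_same_width / dense:.2f}x)")
print(f"SwiGLU, 2/3 width:   {gated_scaled:,} ({gated_scaled / dense:.4f}x)")
```

At equal width the gated layer has 1.5x the parameters. Scaling the hidden size by 2/3 brings it back to roughly the dense budget, which is the usual way the comparison is made fair.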

#### **Rotary Position Embedding (RoPE)**
- **Advantage**: Attention scores depend only on relative position; longer contexts usually need position interpolation or base rescaling
- **Innovation**: Rotation in complex space
- **Applications**: GPT-NeoX, PaLM, LLaMA
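
The relative-position property of RoPE can be verified with a single 2D frequency pair. This is a minimal sketch: the frequency `theta=0.1` and the vectors are arbitrary assumptions, and a real implementation applies many such pairs per head.

```python
import math

def rotate(vec, position, theta=0.1):
    """Rotate a 2D vector by angle position * theta (one RoPE frequency pair)."""
    x, y = vec
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (1.0, 2.0), (0.5, -1.0)
score_near = dot(rotate(q, 5), rotate(k, 2))       # positions 5 and 2, offset 3
score_far = dot(rotate(q, 105), rotate(k, 102))    # positions 105 and 102, offset 3
score_other = dot(rotate(q, 5), rotate(k, 4))      # positions 5 and 4, offset 1

print(f"offset 3 near start: {score_near:.6f}")
print(f"offset 3 far away:   {score_far:.6f}")
print(f"offset 1:            {score_other:.6f}")
```

The first two scores match because only the offset between the two positions enters the dot product. The third differs because the offset changed.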

---

## 🌐 **Multi-modal Transformers**

### **Vision-Language Models**

#### **CLIP-style Architectures**
- **Contrastive Learning**: Image-text pairs
- **Zero-shot Classification**: No task-specific training
- **Applications**: Image search, content moderation
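
The contrastive objective behind CLIP-style training can be sketched as a symmetric cross-entropy over an image-text similarity matrix. The version below assumes L2-normalized embeddings and a fixed temperature of 0.07 (learned in practice). The function names and toy embeddings are illustrative only.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one row of logits (log-sum-exp stabilized)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each list is a matching pair."""
    n = len(image_emb)
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_emb]
              for img in image_emb]
    # Each image should pick its own caption, and each caption its own image
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_image = sum(cross_entropy([logits[r][i] for r in range(n)], i)
                        for i in range(n)) / n
    return (image_to_text + text_to_image) / 2

# Hypothetical unit-norm embeddings for two image-text pairs
aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned pairs: {aligned:.6f}, swapped pairs: {swapped:.4f}")
```

Zero-shot classification reuses the same similarity matrix: class names are embedded as text and the highest-scoring row wins.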

#### **Generative Vision-Language**
- **DALL-E 3**: Improved prompt following
- **Midjourney**: Artistic image generation
- **GPT-4V**: Multimodal reasoning capabilities

### **Audio-Language Integration**

#### **Whisper Architecture**
- **Multi-task Training**: Speech recognition + translation
- **Robustness**: Works across languages and accents
- **Production Impact**: Real-time transcription systems

#### **MusicLM/AudioLM**
- **Music Generation**: Text-to-music synthesis
- **Audio Continuation**: Semantic audio understanding
- **Applications**: Content creation, accessibility

### **Code-Language Models**

#### **Code Generation Evolution**
- **GitHub Copilot**: IDE integration and productivity
- **CodeT5+**: Code understanding and generation
- **StarCoder**: Open-source code models

#### **Mathematical Reasoning**
- **Minerva**: Mathematical problem solving
- **Tool-using Models**: Calculator, code execution
- **Formal Verification**: Proof assistance

---

## 🧠 **Advanced Training Techniques**

### **Constitutional AI & Self-Improvement**

#### **Constitutional AI (Anthropic)**
- **Self-critique**: Model reviews its own outputs
- **Iterative Refinement**: Constitutional principles
- **Scalable Oversight**: Reduced human annotation

#### **Self-Instruct & Alpaca**
- **Bootstrap Learning**: Generate own training data
- **Instruction Following**: Improved task generalization
- **Cost Efficiency**: Reduced human labeling

### **Reinforcement Learning from Human Feedback (RLHF)**

#### **PPO for Language Models**
- **Reward Modeling**: Human preference learning
- **Policy Optimization**: Proximal policy optimization
- **Production Systems**: ChatGPT, Claude, Bard

#### **Direct Preference Optimization (DPO)**
- **Innovation**: Skip reward model training
- **Efficiency**: Direct policy optimization
- **Results**: Simpler and more stable to train; often competitive with PPO-based RLHF
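
The DPO objective fits in a few lines once sequence log-probabilities are available. The sketch below assumes those log-probabilities are already summed over tokens. The argument names, `beta=0.1` and the example numbers are assumptions.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * [(pi_w - ref_w) - (pi_l - ref_l)]) on sequence log-probs."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))

# Policy identical to the reference: no preference signal yet
at_reference = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy raised the chosen response and lowered the rejected one
after_update = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(f"at reference: {at_reference:.4f}, after update: {after_update:.4f}")
```

The reference model enters only through the log-ratio terms, which is how DPO avoids training a separate reward model.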

### **Advanced Fine-tuning Methods**

#### **QLoRA Improvements**
- **NF4 Quantization**: Normal float 4-bit
- **Double Quantization**: Quantize quantization constants
- **Memory**: Fine-tune a 65B model on a single 48 GB GPU
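
The memory effect of double quantization can be estimated with simple arithmetic. This sketch assumes block sizes of 64 (weights) and 256 (quantization constants), as described in the QLoRA paper. It counts base weights only, not activations, adapters, or optimizer state. The helper names are hypothetical.

```python
def nf4_bits_per_param(block_size=64, double_quant=True, second_block=256):
    """Storage cost per weight for 4-bit blockwise quantization, in bits."""
    if double_quant:
        # 8-bit first-level constants, plus one 32-bit constant per second-level block
        overhead = 8 / block_size + 32 / (block_size * second_block)
    else:
        overhead = 32 / block_size  # one 32-bit absmax constant per block
    return 4 + overhead

def weight_gigabytes(num_params, bits_per_param):
    return num_params * bits_per_param / 8 / 1e9

params = 65e9
fp16_gb = weight_gigabytes(params, 16)
nf4_gb = weight_gigabytes(params, nf4_bits_per_param(double_quant=False))
nf4_dq_gb = weight_gigabytes(params, nf4_bits_per_param(double_quant=True))
print(f"fp16: {fp16_gb:.1f} GB | NF4: {nf4_gb:.1f} GB | NF4 + double quant: {nf4_dq_gb:.1f} GB")
```

Double quantization trims the per-parameter overhead from 0.5 bits to about 0.127 bits, which is roughly 3 GB at this scale.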

#### **AdaLoRA (Adaptive LoRA)**
- **Dynamic Rank**: Adjust rank during training
- **Importance Scoring**: SVD-based importance
- **Efficiency**: Better performance per parameter

In [None]:
# Latest Architectural Innovations - Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import Parameter
from typing import Optional, Tuple
import math

print("🔬 LATEST TRANSFORMER INNOVATIONS")
print("=" * 50)

class RMSNorm(nn.Module):
    """
    Root Mean Square Layer Normalization (RMSNorm)
    
    Key Innovation: Simpler than LayerNorm, removes mean centering
    Formula: x / RMS(x) * g where RMS(x) = sqrt(mean(x²))
    
    Benefits:
    - Reported 7-64% running-time reduction vs LayerNorm (model dependent)
    - Simpler gradient computation
    - Used in LLaMA, Gopher, Chinchilla
    
    Interview Points:
    - Why remove mean centering? (re-centering not always beneficial)
    - When to use RMSNorm vs LayerNorm?
    - Performance implications in large models
    """
    
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = Parameter(torch.ones(dim))
    
    def _norm(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
    
    def forward(self, x):
        output = self._norm(x.float()).type_as(x)
        return output * self.weight


class SwiGLU(nn.Module):
    """
    SwiGLU activation function for feed-forward networks.
    
    Formula: SwiGLU(x) = Swish(Wx + b) ⊙ (Vx + c)
    where Swish(x) = x * sigmoid(x)
    
    Key Innovation: Gated activation with Swish
    Used in: PaLM, LLaMA, Mistral
    Performance: Better than ReLU/GELU in large models
    
    Interview Focus:
    - Why gated activations work better?
    - Trade-off: third projection adds 50% parameters at equal width (usually offset by a 2/3 hidden size)
    - When to use in production?
    """
    
    def __init__(self, dim: int, hidden_dim: Optional[int] = None):
        super().__init__()
        # Default: 2/3 of the usual 4 * dim, which keeps parameters on par with a dense FFN
        hidden_dim = hidden_dim or int(2 * (4 * dim) / 3)
        
        # Two linear projections for gating
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # Gate
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # Output
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # Value
    
    def forward(self, x):
        # SwiGLU(x) = Swish(W1*x) ⊙ (W3*x) * W2
        swish_gate = F.silu(self.w1(x))  # Swish = SiLU
        value = self.w3(x)
        hidden = swish_gate * value  # Element-wise multiplication
        return self.w2(hidden)


class RotaryPositionalEmbedding(nn.Module):
    """
    Rotary Position Embedding (RoPE) implementation.
    
    Key Innovation: Encodes position through rotation in complex space
    Benefits:
    - Attention scores depend only on relative position
    - No learned position parameters
    - Used in GPT-NeoX, PaLM, LLaMA
    
    Mathematical Foundation:
    - Rotation matrix applied to query and key
    - Preserves inner product structure
    - Longer contexts usually need position interpolation or base rescaling
    
    Interview Topics:
    - Why rotation in complex space?
    - How does this enable length extrapolation?
    - Comparison with sinusoidal positional encoding
    """
    
    def __init__(self, dim: int, max_position_embeddings: int = 2048, base: float = 10000):
        super().__init__()
        
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)
        
        # Build the cache for positions up to max_position_embeddings
        self._set_cos_sin_cache(max_position_embeddings)
    
    def _set_cos_sin_cache(self, seq_len):
        self.max_seq_len_cached = seq_len
        t = torch.arange(seq_len, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        
        # Differs from the paper's interleaved layout: the two halves are concatenated,
        # which matches the split used by rotate_half()
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos())
        self.register_buffer("sin_cached", emb.sin())
    
    def forward(self, x, seq_len=None):
        # x: [batch_size, num_heads, seq_len, head_dim]
        if seq_len is None:
            seq_len = x.shape[-2]  # fall back to the sequence axis of x
        if seq_len > self.max_seq_len_cached:
            self._set_cos_sin_cache(seq_len)
        
        return (
            self.cos_cached[:seq_len, ...].to(dtype=x.dtype),
            self.sin_cached[:seq_len, ...].to(dtype=x.dtype),
        )


def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
    """Apply rotary position embedding to query and key tensors.

    cos, sin: [seq_len, head_dim] as returned by RotaryPositionalEmbedding
    position_ids: [batch_size, seq_len] integer positions
    """
    # The cached tables are already 2D, so no squeeze is needed
    # (squeezing would also drop the sequence axis when seq_len == 1)
    cos = cos[position_ids].unsqueeze(1)  # [batch_size, 1, seq_len, head_dim]
    sin = sin[position_ids].unsqueeze(1)  # [batch_size, 1, seq_len, head_dim]
    
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed


def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)


class MixtureOfExpertsLayer(nn.Module):
    """
    Mixture of Experts (MoE) layer implementation.
    
    Key Innovation: Sparse activation - only use subset of parameters
    Benefits:
    - Scale model capacity without proportional compute increase
    - Expert specialization for different input types
    - Used in Switch Transformer, GLaM, Mixtral
    
    Challenges:
    - Load balancing across experts
    - Communication overhead in distributed setting
    - Routing efficiency
    
    Interview Focus:
    - How does routing work?
    - Load balancing strategies
    - When to use MoE vs dense models?
    """
    
    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2, capacity_factor: float = 1.0):
        super().__init__()
        self.dim = dim
        self.num_experts = num_experts
        self.top_k = top_k
        # Stored for reference only: this simplified layer does not enforce expert capacity
        self.capacity_factor = capacity_factor
        
        # Router network
        self.router = nn.Linear(dim, num_experts, bias=False)
        
        # Expert networks
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(dim, dim * 4),
                nn.ReLU(),
                nn.Linear(dim * 4, dim)
            ) for _ in range(num_experts)
        ])
        
    def forward(self, x):
        batch_size, seq_len, dim = x.shape
        x_flat = x.view(-1, dim)  # [batch_size * seq_len, dim]
        
        # Route tokens to experts
        router_logits = self.router(x_flat)  # [batch_size * seq_len, num_experts]
        routing_weights, selected_experts = torch.topk(router_logits, self.top_k, dim=-1)
        routing_weights = F.softmax(routing_weights, dim=-1)
        
        # Compute expert outputs
        final_output = torch.zeros_like(x_flat)
        
        for i in range(self.top_k):
            expert_idx = selected_experts[:, i]
            expert_weights = routing_weights[:, i]
            
            for expert_id in range(self.num_experts):
                expert_mask = (expert_idx == expert_id)
                if expert_mask.any():
                    expert_input = x_flat[expert_mask]
                    expert_output = self.experts[expert_id](expert_input)
                    
                    # Apply routing weights
                    weighted_output = expert_output * expert_weights[expert_mask].unsqueeze(-1)
                    final_output[expert_mask] += weighted_output
        
        return final_output.view(batch_size, seq_len, dim)


class EfficientAttentionLayer(nn.Module):
    """
    Efficient attention mechanisms for long sequences.
    
    Implementations:
    1. Flash Attention - memory efficient attention
    2. Linear Attention - O(n) complexity
    3. Sparse Attention - patterns for long sequences
    
    Key Innovation: Reduce memory/compute while maintaining quality
    Applications: Long documents, genomics, audio processing
    
    Interview Points:
    - Memory bottlenecks in standard attention
    - Trade-offs between efficiency and quality
    - When to use which efficient attention variant
    """
    
    def __init__(self, dim: int, num_heads: int, attention_type: str = "flash"):
        super().__init__()
        self.dim = dim
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.attention_type = attention_type
        
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)
        
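    def flash_attention(self, q, k, v, mask=None):
        """
        Flash Attention-style tiled attention (educational sketch).
        
        Key Innovation: Tiling plus an online softmax, so the full n x n
        score matrix is never materialized
        Memory: O(n) extra memory vs O(n²) for standard attention
        Speed: Often faster on GPU because of fewer memory reads/writes
        
        Note: the real FlashAttention is a fused GPU kernel; this loop only
        reproduces the math. `mask` is assumed to be 2D: [seq_len, seq_len].
        """
        batch_size, num_heads, seq_len, head_dim = q.shape
        scale = 1.0 / math.sqrt(head_dim)
        
        # Chunk the computation to reduce memory
        chunk_size = min(1024, seq_len)
        output = torch.zeros_like(q)
        
        for i in range(0, seq_len, chunk_size):
            end_i = min(i + chunk_size, seq_len)
            q_chunk = q[:, :, i:end_i, :]
            
            # Running softmax statistics for this block of queries
            row_max = q.new_full((batch_size, num_heads, end_i - i, 1), float("-inf"))
            row_sum = q.new_zeros((batch_size, num_heads, end_i - i, 1))
            acc = torch.zeros_like(q_chunk)
            
            for j in range(0, seq_len, chunk_size):
                end_j = min(j + chunk_size, seq_len)
                k_chunk = k[:, :, j:end_j, :]
                v_chunk = v[:, :, j:end_j, :]
                
                scores = torch.matmul(q_chunk, k_chunk.transpose(-2, -1)) * scale
                if mask is not None:
                    scores = scores.masked_fill(mask[i:end_i, j:end_j] == 0, -1e9)
                
                # Online softmax: rescale earlier partial sums to the new running max
                new_max = torch.maximum(row_max, scores.amax(dim=-1, keepdim=True))
                correction = torch.exp(row_max - new_max)
                probs = torch.exp(scores - new_max)
                row_sum = row_sum * correction + probs.sum(dim=-1, keepdim=True)
                acc = acc * correction + torch.matmul(probs, v_chunk)
                row_max = new_max
            
            # Normalize once all key chunks have been seen
            output[:, :, i:end_i, :] = acc / row_sum
        
        return output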
    
    def linear_attention(self, q, k, v):
        """
        Linear Attention - O(n) complexity attention.
        
        Key Innovation: Kernel trick to avoid computing full attention matrix
        Formula: softmax(QK^T)V ≈ Q'(K'^TV)
        Complexity: O(nd²) vs O(n²d)
        """
        # Apply ELU + 1 to ensure positivity (kernel trick)
        q = F.elu(q) + 1
        k = F.elu(k) + 1
        
        # Compute K^T V first (memory efficient)
        kv = torch.matmul(k.transpose(-2, -1), v)  # [batch, heads, head_dim, head_dim]
        
        # Then compute Q * (K^T V)
        output = torch.matmul(q, kv)  # [batch, heads, seq_len, head_dim]
        
        # Normalization
        k_sum = k.sum(dim=-2, keepdim=True)  # [batch, heads, 1, head_dim]
        normalizer = torch.matmul(q, k_sum.transpose(-2, -1))  # [batch, heads, seq_len, 1]
        
        return output / (normalizer + 1e-6)
    
    def forward(self, x, mask=None):
        batch_size, seq_len, dim = x.shape
        
        # Project to q, k, v
        q = self.q_proj(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        
        # Choose attention mechanism
        if self.attention_type == "flash":
            attn_output = self.flash_attention(q, k, v, mask)
        elif self.attention_type == "linear":
            attn_output = self.linear_attention(q, k, v)
        else:
            # Standard attention (for comparison)
            scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
            if mask is not None:
                scores = scores.masked_fill(mask == 0, -1e9)
            attn_weights = F.softmax(scores, dim=-1)
            attn_output = torch.matmul(attn_weights, v)
        
        # Reshape and project
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, seq_len, dim)
        return self.o_proj(attn_output)


# Demonstrate latest innovations
print("\n🧪 TESTING LATEST ARCHITECTURAL INNOVATIONS:")

# Test RMSNorm vs LayerNorm
print("\n1. RMSNorm vs LayerNorm:")
dim = 512
x = torch.randn(2, 10, dim)

layernorm = nn.LayerNorm(dim)
rmsnorm = RMSNorm(dim)

# Forward pass only (no timing is measured here)
ln_output = layernorm(x)
rms_output = rmsnorm(x)

print(f"LayerNorm output shape: {ln_output.shape}")
print(f"RMSNorm output shape: {rms_output.shape}")
print(f"Output difference (should be similar): {torch.mean(torch.abs(ln_output - rms_output)):.6f}")

# Test SwiGLU
print("\n2. SwiGLU Activation:")
# hidden_dim = 8/3 * dim gives roughly the same parameter count as the 4x dense FFN below
swiglu = SwiGLU(dim, hidden_dim=dim * 8 // 3)
swiglu_output = swiglu(x)
print(f"SwiGLU output shape: {swiglu_output.shape}")

# Compare with standard FFN
standard_ffn = nn.Sequential(
    nn.Linear(dim, dim * 4),
    nn.GELU(),
    nn.Linear(dim * 4, dim)
)
standard_output = standard_ffn(x)
print(f"Standard FFN params: {sum(p.numel() for p in standard_ffn.parameters()):,}")
print(f"SwiGLU params: {sum(p.numel() for p in swiglu.parameters()):,}")

# Test Rotary Position Embedding
print("\n3. Rotary Position Embedding:")
rope = RotaryPositionalEmbedding(dim // 8)  # head_dim
seq_len = 10
cos, sin = rope(x, seq_len)
print(f"RoPE cos shape: {cos.shape}")
print(f"RoPE sin shape: {sin.shape}")

# Test Mixture of Experts
print("\n4. Mixture of Experts:")
moe = MixtureOfExpertsLayer(dim, num_experts=4, top_k=2)
moe_output = moe(x)
print(f"MoE output shape: {moe_output.shape}")

# Compare parameter efficiency
dense_ffn_params = dim * (dim * 4) + (dim * 4) * dim  # Two linear layers
moe_params = sum(p.numel() for p in moe.parameters())
print(f"Dense FFN params: {dense_ffn_params:,}")
print(f"MoE params: {moe_params:,}")
print(f"MoE uses {moe_params / dense_ffn_params:.2f}x parameters with {moe.num_experts} experts ({moe.top_k} active per token)")

# Test Efficient Attention
print("\n5. Efficient Attention Mechanisms:")
efficient_attn = EfficientAttentionLayer(dim, num_heads=8, attention_type="linear")
attn_output = efficient_attn(x)
print(f"Efficient attention output: {attn_output.shape}")

# Compute cost comparison (these are operation counts, not memory)
seq_length = 1024
standard_cost = seq_length ** 2 * dim  # O(n²d)
linear_cost = seq_length * dim ** 2    # O(nd²)
print(f"Standard attention cost: {standard_cost:,} units")
print(f"Linear attention cost: {linear_cost:,} units")
print(f"Cost reduction: {standard_cost / linear_cost:.2f}x (linear attention wins when seq_len > d)")

print(f"\n✅ Latest architectural innovations demonstrated!")

print(f"\n💡 Key Innovation Insights:")
innovations = [
    "🔧 RMSNorm: Skips mean centering; cheaper than LayerNorm with similar quality",
    "⚡ SwiGLU: Gated FFN for large models; a 2/3 hidden size keeps the parameter count level",
    "🔄 RoPE: Relative positions via rotation; context extension needs frequency rescaling",
    "🎯 MoE: Scale capacity without proportional compute increase",
    "💾 Efficient Attention: Handle longer sequences with reduced memory",
    "🏗️ Architectural Evolution: Each innovation builds on previous breakthroughs"
]

for insight in innovations:
    print(f"  {insight}")

print(f"\n🎓 Senior Applied Scientist Readiness:")
readiness_areas = [
    "✅ Understand trade-offs of each innovation",
    "✅ Know when to apply which technique",
    "✅ Can implement from scratch if needed", 
    "✅ Aware of production implications",
    "✅ Stay current with latest research",
    "✅ Can explain business impact clearly"
]

for area in readiness_areas:
    print(f"  {area}")

print(f"\n🚀 You're prepared for the latest transformer innovations!")


---

# 9. Safety, Ethics & Responsible AI

## 🛡️ **AI Safety & Alignment**

### **Core Safety Challenges**

#### **1. Hallucination & Factual Accuracy**
- **Problem**: Models generate confident but false information
- **Detection**: Uncertainty estimation, consistency checking
- **Mitigation**: RAG systems, fact-checking integration
- **Evaluation**: TruthfulQA, fact verification benchmarks

#### **2. Prompt Injection & Security**
- **Attack Types**: Direct injection, indirect injection, jailbreaking
- **Defense**: Input sanitization, output filtering, constitutional AI
- **Production**: Rate limiting, content moderation, monitoring

#### **3. Bias & Fairness**
- **Sources**: Training data bias, annotation bias, societal bias
- **Measurement**: Demographic parity, equalized odds, calibration
- **Mitigation**: Debiasing techniques, diverse training data
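
Two of the fairness measurements listed above can be computed directly from predictions. The sketch below assumes binary labels and exactly two groups named `"a"` and `"b"`. The function names and the toy data are made up for illustration.

```python
def rate(values):
    """Mean of a list of 0/1 values, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0

def fairness_gaps(y_true, y_pred, group):
    """Demographic-parity and equalized-odds gaps between groups 'a' and 'b'."""
    stats = {}
    for g in ("a", "b"):
        idx = [i for i, gi in enumerate(group) if gi == g]
        stats[g] = {
            "positive_rate": rate([y_pred[i] for i in idx]),
            "tpr": rate([y_pred[i] for i in idx if y_true[i] == 1]),
            "fpr": rate([y_pred[i] for i in idx if y_true[i] == 0]),
        }
    a, b = stats["a"], stats["b"]
    return {
        # Difference in how often each group receives a positive prediction
        "demographic_parity_gap": abs(a["positive_rate"] - b["positive_rate"]),
        # Largest difference in true-positive or false-positive rate
        "equalized_odds_gap": max(abs(a["tpr"] - b["tpr"]), abs(a["fpr"] - b["fpr"])),
    }

# Hypothetical labels and predictions for eight examples
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]
gaps = fairness_gaps(y_true, y_pred, group)
print(gaps)
```

Demographic parity ignores the true labels, while equalized odds conditions on them, so the two gaps can disagree on real data.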

### **🎯 Alignment Techniques**

#### **Constitutional AI (Anthropic)**
```
1. Self-critique: Model evaluates its own responses
2. Constitutional principles: Embedded ethical guidelines  
3. Iterative refinement: Improve through self-correction
4. Scalable oversight: Reduce human intervention needs
```

#### **RLHF (Reinforcement Learning from Human Feedback)**
```
1. Supervised fine-tuning: Base instruction following
2. Reward modeling: Learn human preferences
3. RL optimization: PPO to maximize reward
4. Iterative improvement: Continuous feedback integration
```

---

## 📊 **Advanced Evaluation Methods**

### **Beyond BLEU/ROUGE: Modern Evaluation**

#### **1. Human Evaluation Frameworks**
- **Likert Scales**: Quality, relevance, helpfulness ratings
- **Comparative Evaluation**: A/B testing, pairwise comparison
- **Task-specific**: Accuracy, completion rate, user satisfaction

#### **2. Automated Evaluation Metrics**
- **BERTScore**: Semantic similarity using BERT embeddings
- **BLEURT**: Learned evaluation metric
- **UniEval**: Unified evaluation framework
- **GPT-4 as Judge**: LLM-based evaluation

#### **3. Robustness Testing**
- **Adversarial Examples**: Input perturbations
- **Out-of-Distribution**: Performance on unseen domains
- **Stress Testing**: Edge cases, corner cases, failure modes

### **🔬 Evaluation Best Practices**

#### **Multi-dimensional Assessment**
```
Quality Dimensions:
├── Factual Accuracy: Truthfulness, consistency
├── Relevance: On-topic, addresses query
├── Coherence: Logical flow, readability
├── Safety: No harmful content
└── Efficiency: Response time, resource usage
```

#### **Statistical Significance**
- **Sample Size**: Adequate power analysis
- **Confidence Intervals**: Uncertainty quantification  
- **Multiple Comparisons**: Bonferroni correction
- **Effect Size**: Practical significance vs statistical significance
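
Confidence intervals and multiple-comparison correction from the list above can be sketched with the standard library alone. The percentile bootstrap, the resample count, the seed and the example scores are all assumptions chosen for illustration.

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def bonferroni(p_values, alpha=0.05):
    """Reject only hypotheses whose p-value is below alpha / number_of_tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Hypothetical per-example quality scores for one model variant
scores = [0.62, 0.71, 0.55, 0.80, 0.67, 0.74, 0.59, 0.69]
ci_low, ci_high = bootstrap_ci(scores)
decisions = bonferroni([0.001, 0.02, 0.04])
print(f"mean={statistics.fmean(scores):.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
print(f"Bonferroni decisions: {decisions}")
```

With three comparisons the per-test threshold drops to about 0.0167, so p-values of 0.02 and 0.04 no longer count as significant.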

---

## 💼 **Senior Applied Scientist Leadership**

### **🎯 Technical Leadership Responsibilities**

#### **Research Strategy & Vision**
- **Technology Roadmaps**: 6-18 month planning
- **Research Prioritization**: Impact vs effort analysis
- **Cross-functional Alignment**: Product, engineering, business
- **Innovation Pipeline**: Basic research to product integration

#### **Team Development & Mentoring**
- **Hiring**: Technical assessment, culture fit
- **Mentoring**: Junior scientists, research direction
- **Knowledge Sharing**: Tech talks, documentation
- **Career Development**: Growth paths, skill building

### **🏢 Business Impact & Strategy**

#### **Stakeholder Communication**
- **Executive Updates**: Progress, challenges, opportunities
- **Technical Translation**: Complex concepts for business leaders
- **ROI Demonstration**: Research impact on business metrics
- **Risk Assessment**: Technical debt, failure modes

#### **Product Integration**
- **Feasibility Analysis**: Research to product viability
- **MVP Definition**: Minimum viable product scope
- **A/B Testing**: Experiment design and analysis
- **Launch Strategy**: Gradual rollout, monitoring, optimization

### **📈 Success Metrics for Senior Scientists**

#### **Research Excellence**
- **Publications**: Top-tier venues (NeurIPS, ICML, ACL)
- **Patents**: Intellectual property generation
- **Citations**: Research impact and influence
- **Open Source**: Community contributions

#### **Business Impact**
- **Product Features**: Successful launches
- **Cost Reduction**: Efficiency improvements
- **Revenue Growth**: New capabilities enabling growth
- **User Metrics**: Engagement, satisfaction, retention

#### **Leadership Influence**
- **Team Growth**: Hiring and developing talent
- **Technical Direction**: Shaping company AI strategy
- **Industry Recognition**: Speaking, standards committees
- **Thought Leadership**: Blog posts, whitepapers

---

## 🔮 **Future Trends & Research Directions**

### **Emerging Paradigms**

#### **1. Agentic AI Systems**
- **Tool Use**: API integration, code execution
- **Planning**: Multi-step reasoning, goal decomposition
- **Memory**: Long-term knowledge retention
- **Collaboration**: Multi-agent systems

#### **2. Multimodal Foundation Models**
- **Unified Architecture**: Text, vision, audio, code
- **Cross-modal Reasoning**: Understanding relationships
- **Embodied AI**: Robotics integration
- **Real-world Grounding**: Physical world understanding

#### **3. Scientific AI**
- **Discovery**: Drug discovery, materials science
- **Reasoning**: Mathematical proofs, theorem proving
- **Simulation**: Physics-informed neural networks
- **Experimental Design**: Automated hypothesis generation

### **🚀 Production Evolution**

#### **Edge Deployment**
- **Model Compression**: Ultra-efficient models
- **Hardware Co-design**: Custom chips for inference
- **Federated Learning**: Privacy-preserving training
- **Real-time Processing**: Streaming inference

#### **Personalization**
- **User Adaptation**: Continuous learning from feedback
- **Privacy Preservation**: Differential privacy, secure computation
- **Context Awareness**: Environmental and temporal adaptation
- **Multi-task Learning**: Shared representations across users

### **🎯 Interview Preparation for Senior Roles**

#### **Research Depth Questions**
- "Describe your most impactful research contribution"
- "How do you evaluate research vs engineering trade-offs?"
- "What's your opinion on current limitations of transformers?"
- "How would you design the next generation of language models?"

#### **Leadership Scenarios**
- "A junior team member proposes an infeasible research direction. How do you handle it?"
- "You need to convince executives to invest in long-term research. What's your approach?"
- "How do you balance research innovation with product delivery timelines?"
- "Describe a time you had to pivot research direction due to business needs"

#### **Technical Vision**
- "Where do you see AI/ML heading in the next 5 years?"
- "What are the biggest unsolved problems in your domain?"
- "How would you build an AI research team from scratch?"
- "What's your framework for choosing research problems?"

In [None]:
# Advanced Safety & Evaluation Implementation

import torch
import torch.nn.functional as F
import numpy as np
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass
import json
import re

class SafetyEvaluator:
    """Comprehensive safety evaluation framework for LLMs"""
    
    def __init__(self):
        self.toxicity_keywords = [
            "harmful", "offensive", "inappropriate", "violence", 
            "discrimination", "hate", "threat", "illegal"
        ]
        self.factual_inconsistencies = []
        
    def detect_hallucination(self, response: str, context: str) -> Dict[str, float]:
        """Detect potential hallucinations in model responses"""
        # Simplified hallucination detection
        scores = {}
        
        # 1. Factual consistency check
        scores['factual_consistency'] = self._check_factual_consistency(response, context)
        
        # 2. Confidence vs uncertainty
        scores['uncertainty'] = self._estimate_uncertainty(response)
        
        # 3. Citation verification
        scores['citation_accuracy'] = self._verify_citations(response)
        
        return scores
    
    def _check_factual_consistency(self, response: str, context: str) -> float:
        """Check if response is consistent with provided context"""
        # Simple overlap-based consistency check
        response_tokens = set(response.lower().split())
        context_tokens = set(context.lower().split())
        
        if len(response_tokens) == 0:
            return 0.0
            
        overlap = len(response_tokens.intersection(context_tokens))
        return overlap / len(response_tokens)
    
    def _estimate_uncertainty(self, response: str) -> float:
        """Estimate uncertainty in model response"""
        uncertainty_phrases = [
            "i think", "maybe", "possibly", "might be", "could be",
            "i'm not sure", "uncertain", "unclear", "probably"
        ]
        
        uncertainty_count = sum(1 for phrase in uncertainty_phrases 
                              if phrase in response.lower())
        
        # Normalize by response length
        words = len(response.split())
        return min(uncertainty_count / max(words, 1), 1.0)
    
    def _verify_citations(self, response: str) -> float:
        """Verify accuracy of citations in response"""
        # Extract citations (simplified pattern)
        citations = re.findall(r'\[(\d+)\]', response)
        
        if not citations:
            return 1.0  # No citations to verify
            
        # In real implementation, would verify against knowledge base
        # For demo, assume 80% accuracy
        return 0.8

class ConstitutionalAI:
    """Implementation of Constitutional AI principles"""
    
    def __init__(self):
        self.principles = [
            "Be helpful and harmless",
            "Provide accurate information",
            "Respect human autonomy",
            "Avoid harmful or offensive content",
            "Be transparent about limitations"
        ]
        
    def critique_response(self, response: str, query: str) -> Dict[str, float]:
        """Critique response against constitutional principles"""
        critique_scores = {}
        
        for i, principle in enumerate(self.principles):
            score = self._evaluate_principle(response, query, principle)
            critique_scores[f"principle_{i+1}"] = score
            
        return critique_scores
    
    def _evaluate_principle(self, response: str, query: str, principle: str) -> float:
        """Evaluate how well response adheres to a specific principle"""
        # Simplified evaluation - in practice would use trained models
        # Match case-insensitively so every principle in self.principles is recognized
        principle_lower = principle.lower()
        
        if "helpful and harmless" in principle_lower:
            return self._check_helpfulness(response, query) * self._check_harmlessness(response)
        elif "accurate information" in principle_lower:
            return self._check_accuracy(response)
        elif "autonomy" in principle_lower:
            return self._check_autonomy_respect(response)
        elif "avoid harmful" in principle_lower:
            return self._check_harmlessness(response)
        elif "transparent" in principle_lower:
            return self._check_transparency(response)
        
        return 0.5  # Default neutral score
    
    def _check_helpfulness(self, response: str, query: str) -> float:
        """Check if response is helpful for the query"""
        # Simple keyword overlap
        query_words = set(query.lower().split())
        response_words = set(response.lower().split())
        
        if len(query_words) == 0:
            return 0.5
            
        overlap = len(query_words.intersection(response_words))
        return min(overlap / len(query_words), 1.0)
    
    def _check_harmlessness(self, response: str) -> float:
        """Check if response avoids harmful content"""
        harmful_indicators = [
            "violence", "hate", "discrimination", "illegal", "harm",
            "offensive", "inappropriate", "dangerous", "toxic"
        ]
        
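        # Substring matching is deliberately crude: "harm" also fires on
        # "harmless" and "hate" on "whatever", so expect false positives.
        response_lower = response.lower()
        harmful_count = sum(1 for indicator in harmful_indicators 
                          if indicator in response_lower)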
        
        # Higher harmful content = lower score
        return max(0.0, 1.0 - (harmful_count * 0.2))
    
    def _check_accuracy(self, response: str) -> float:
        """Check response accuracy (simplified)"""
        # In practice, would use fact-checking models
        confidence_indicators = ["according to", "research shows", "studies indicate"]
        speculation_indicators = ["i think", "maybe", "possibly", "might"]
        
        confidence_count = sum(1 for indicator in confidence_indicators 
                             if indicator in response.lower())
        speculation_count = sum(1 for indicator in speculation_indicators 
                              if indicator in response.lower())
        
        return max(0.0, min(1.0, (confidence_count * 0.3) - (speculation_count * 0.2) + 0.5))
    
    def _check_autonomy_respect(self, response: str) -> float:
        """Check if response respects human autonomy"""
        directive_language = ["you must", "you should", "you have to", "you need to"]
        suggestive_language = ["you might", "consider", "option", "choice"]
        
        directive_count = sum(1 for phrase in directive_language 
                            if phrase in response.lower())
        suggestive_count = sum(1 for phrase in suggestive_language 
                             if phrase in response.lower())
        
        # Prefer suggestive over directive language
        return max(0.0, min(1.0, 0.7 + (suggestive_count * 0.1) - (directive_count * 0.2)))
    
    def _check_transparency(self, response: str) -> float:
        """Check if response is transparent about limitations"""
        transparency_indicators = [
            "i don't know", "uncertain", "limitations", "may not be accurate",
            "based on training data", "as an ai"
        ]
        
        response_lower = response.lower()
        transparency_count = sum(1 for indicator in transparency_indicators 
                               if indicator in response_lower)
        
        return min(1.0, transparency_count * 0.3 + 0.4)

class AdvancedEvaluationMetrics:
    """Implementation of modern evaluation metrics beyond BLEU/ROUGE"""
    
    def __init__(self):
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        
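    def bert_score(self, predictions: List[str], references: List[str]) -> Dict[str, float]:
        """Token-overlap stand-in for BERTScore (precision/recall/F1 without embeddings)"""
        # Real BERTScore greedily matches contextual token embeddings by cosine
        # similarity; this toy version only counts exact token matches.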
        scores = {
            'precision': [],
            'recall': [],
            'f1': []
        }
        
        for pred, ref in zip(predictions, references):
            # Simplified token-level similarity
            pred_tokens = pred.lower().split()
            ref_tokens = ref.lower().split()
            
            if not pred_tokens or not ref_tokens:
                precision = recall = f1 = 0.0
            else:
                # Unique-token overlap. Divide by the unique-token counts as well,
                # otherwise repeated words ("the ... the") deflate precision and recall.
                pred_set, ref_set = set(pred_tokens), set(ref_tokens)
                overlap = len(pred_set & ref_set)
                precision = overlap / len(pred_set)
                recall = overlap / len(ref_set)
                f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
            
            scores['precision'].append(precision)
            scores['recall'].append(recall)
            scores['f1'].append(f1)
        
        # Average over pairs; return 0.0 instead of NaN when no pairs were given
        return {
            name: float(np.mean(values)) if values else 0.0
            for name, values in scores.items()
        }
    
    def semantic_similarity(self, text1: str, text2: str) -> float:
        """Jaccard word overlap, used here as a cheap proxy for semantic similarity"""
        # Simplified implementation - in practice use sentence-transformer embeddings
        # with cosine similarity, which also credits paraphrases
        words1 = set(text1.lower().split())
        words2 = set(text2.lower().split())
        
        if not words1 or not words2:
            return 0.0
            
        intersection = len(words1.intersection(words2))
        union = len(words1.union(words2))
        
        return intersection / union if union > 0 else 0.0
    
    def diversity_metrics(self, responses: List[str]) -> Dict[str, float]:
        """Compute diversity metrics for generated responses"""
        if not responses:
            return {'distinct_1': 0.0, 'distinct_2': 0.0, 'entropy': 0.0}
        
        # Distinct-1: unique unigrams
        all_unigrams = []
        for response in responses:
            all_unigrams.extend(response.lower().split())
        
        distinct_1 = len(set(all_unigrams)) / len(all_unigrams) if all_unigrams else 0.0
        
        # Distinct-2: unique bigrams
        all_bigrams = []
        for response in responses:
            words = response.lower().split()
            bigrams = [f"{words[i]}_{words[i+1]}" for i in range(len(words)-1)]
            all_bigrams.extend(bigrams)
        
        distinct_2 = len(set(all_bigrams)) / len(all_bigrams) if all_bigrams else 0.0
        
        # Response entropy
        response_counts = {}
        for response in responses:
            response_counts[response] = response_counts.get(response, 0) + 1
        
        probs = np.array(list(response_counts.values())) / len(responses)
        entropy = -np.sum(probs * np.log(probs + 1e-10))
        
        return {
            'distinct_1': distinct_1,
            'distinct_2': distinct_2,
            'entropy': entropy
        }

# Example usage and testing
def test_safety_evaluation():
    """Test safety evaluation framework"""
    evaluator = SafetyEvaluator()
    constitutional_ai = ConstitutionalAI()
    
    # Test samples
    context = "The capital of France is Paris, located in northern France."
    safe_response = "Based on the information provided, Paris is indeed the capital of France."
    unsafe_response = "Paris is the capital of France, and I think all French people are lazy."
    
    print("=== Safety Evaluation Tests ===")
    
    # Test hallucination detection
    safe_scores = evaluator.detect_hallucination(safe_response, context)
    unsafe_scores = evaluator.detect_hallucination(unsafe_response, context)
    
    print(f"Safe response hallucination scores: {safe_scores}")
    print(f"Unsafe response hallucination scores: {unsafe_scores}")
    
    # Test constitutional AI
    query = "What is the capital of France?"
    safe_critique = constitutional_ai.critique_response(safe_response, query)
    unsafe_critique = constitutional_ai.critique_response(unsafe_response, query)
    
    print(f"Safe response constitutional scores: {safe_critique}")
    print(f"Unsafe response constitutional scores: {unsafe_critique}")

def test_evaluation_metrics():
    """Test advanced evaluation metrics"""
    metrics = AdvancedEvaluationMetrics()
    
    print("\n=== Evaluation Metrics Tests ===")
    
    # Test BERTScore
    predictions = ["The cat sat on the mat", "Dogs are great pets"]
    references = ["A cat was sitting on the mat", "Dogs make excellent companions"]
    
    bert_scores = metrics.bert_score(predictions, references)
    print(f"BERTScore results: {bert_scores}")
    
    # Test semantic similarity
    similarity = metrics.semantic_similarity(
        "The weather is nice today",
        "Today has pleasant weather"
    )
    print(f"Semantic similarity: {similarity:.3f}")
    
    # Test diversity metrics
    responses = [
        "Hello, how are you?",
        "Hi there, how's it going?",
        "Hello, how are you?",
        "Good morning, how are you doing?"
    ]
    
    diversity = metrics.diversity_metrics(responses)
    print(f"Diversity metrics: {diversity}")

# Run tests
if __name__ == "__main__":
    test_safety_evaluation()
    test_evaluation_metrics()
    
    print("\n🎯 **Key Takeaways for Senior Applied Scientists:**")
    print("1. Safety evaluation requires multi-dimensional assessment")
    print("2. Constitutional AI provides systematic ethical guidelines")
    print("3. Modern metrics go beyond surface-level similarity")
    print("4. Production systems need comprehensive monitoring")
    print("5. Leadership involves setting safety standards and evaluation frameworks")

---

# 🎓 **COMPREHENSIVE ASSESSMENT: Senior Applied Scientist Readiness**

## 📋 **Self-Assessment Checklist**

### **🧠 Technical Mastery (80 Points)**

#### **Core Transformer Architecture (20 points)**
- [ ] **Self-Attention Mechanism**: Can derive attention equations from scratch (5 pts)
- [ ] **Multi-Head Attention**: Understand parallel processing and head concatenation (5 pts)
- [ ] **Position Encoding**: Know sinusoidal vs learned vs rotary embeddings (5 pts)
- [ ] **Layer Normalization**: Pre vs post-norm, RMSNorm implementation (5 pts)
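
As a self-check for the attention items above, here is a minimal NumPy sketch of scaled dot-product attention, `softmax(QK^T / sqrt(d_k)) V`, with an optional causal mask. The function names and the toy shapes are illustrative assumptions, not code from earlier in this notebook.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax: subtract the max before exponentiating."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, causal=False):
    """Single-head attention for q, k of shape (seq_len, d_k) and v of shape (seq_len, d_v)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)  # (n_q, n_k) similarity logits
    if causal:
        # Position i may only attend to positions j <= i
        n_q, n_k = scores.shape
        future = np.triu(np.ones((n_q, n_k), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

# Toy inputs (hypothetical sizes: 4 tokens, 8 features)
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out, weights = scaled_dot_product_attention(q, k, v, causal=True)
```

With the causal mask, the first token can only attend to itself, so `out[0]` equals `v[0]`. Multi-head attention runs this on several learned projections of the input in parallel and concatenates the results.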

#### **Advanced Architectures (25 points)**
- [ ] **Mixture of Experts (MoE)**: Router design, load balancing, scaling laws (5 pts)
- [ ] **Efficient Attention**: Flash Attention, Linear Attention, sparse patterns (5 pts)
- [ ] **Modern Activations**: SwiGLU, GLU variants, gating mechanisms (5 pts)
- [ ] **Positional Embeddings**: RoPE, ALiBi, relative position encoding (5 pts)
- [ ] **Multimodal Integration**: Vision transformers, audio processing, cross-modal attention (5 pts)
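
For the RoPE item, the property worth being able to demonstrate is that the query-key dot product depends only on the relative offset between positions. The sketch below assumes the interleaved pair layout (features 0/1, 2/3, ...); some implementations instead pair the first half of the features with the second half. The helper names are illustrative.

```python
import numpy as np

def rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive feature pairs of x (seq_len, d) by position-dependent angles."""
    d = x.shape[-1]
    assert d % 2 == 0, "feature dimension must be even"
    inv_freq = base ** (-np.arange(0, d, 2) / d)      # (d/2,) one frequency per pair
    angles = positions[:, None] * inv_freq[None, :]   # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin          # 2-D rotation of each pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 8))
k = rng.standard_normal((1, 8))

def score(m: int, n: int) -> float:
    """Attention logit between q placed at position m and k placed at position n."""
    return (rope(q, np.array([m])) @ rope(k, np.array([n])).T).item()
```

Here `score(3, 1)` and `score(10, 8)` agree up to floating-point error because both pairs are two positions apart, which is the relative-position behaviour the checklist item refers to.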

#### **Training & Optimization (20 points)**
- [ ] **Pre-training Objectives**: Next token prediction, masked language modeling (4 pts)
- [ ] **Fine-tuning Methods**: Full fine-tuning, LoRA, QLoRA, prefix tuning (4 pts)
- [ ] **Alignment Techniques**: RLHF, Constitutional AI, DPO (4 pts)
- [ ] **Scaling Laws**: Parameter vs compute vs data trade-offs (4 pts)
- [ ] **Training Stability**: Gradient clipping, warmup schedules, mixed precision (4 pts)
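
To make the LoRA item concrete, here is a small NumPy sketch of a linear layer with a frozen weight `W` and a trainable low-rank update `(alpha / r) * B @ A`. The class name, the default `rank` and `alpha`, and the layer sizes are assumptions chosen for illustration; a real setup would use a library such as PEFT and train with autograd.

```python
import numpy as np

class LoRALinear:
    """Frozen weight plus a rank-r update: y = x W^T + (alpha / r) * x A^T B^T."""

    def __init__(self, weight: np.ndarray, rank: int = 4, alpha: float = 8.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                               # frozen pretrained weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01  # small random init
        self.B = np.zeros((d_out, rank))                   # zero init: no change at step 0
        self.scale = alpha / rank

    def forward(self, x: np.ndarray) -> np.ndarray:
        # x has shape (batch, d_in)
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self) -> int:
        # Only A and B would receive gradients
        return self.A.size + self.B.size

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 32))
layer = LoRALinear(W, rank=4)
x = rng.standard_normal((2, 32))
```

In this toy case the adapter trains `4 * (64 + 32) = 384` parameters against 2,048 in `W`. Because `B` starts at zero, the layer reproduces the base model exactly before any training step.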

#### **Production Systems (15 points)**
- [ ] **Model Serving**: Batching, caching, load balancing (3 pts)
- [ ] **Optimization**: Quantization, pruning, distillation (3 pts)
- [ ] **Monitoring**: Drift detection, performance tracking, A/B testing (3 pts)
- [ ] **Safety**: Content moderation, bias detection, factual accuracy (3 pts)
- [ ] **Cost Management**: Inference optimization, resource planning (3 pts)
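
For the quantization item, a minimal sketch of symmetric per-tensor int8 quantization shows the basic trade: 4x less memory than float32 in exchange for a rounding error of at most about half a quantization step. The function names are illustrative, and production schemes typically use per-channel or per-group scales rather than one scale per tensor.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: w is approximated by scale * q, q in [-127, 127]."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 codes back to approximate float32 weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = float(np.abs(w - w_hat).max())  # worst-case round-trip error
```

A single outlier weight inflates `scale` and therefore the error for every other weight, which is one reason finer-grained scales are common in practice.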

### **💼 Business & Leadership (20 Points)**

#### **Strategic Thinking (10 points)**
- [ ] **Research Roadmaps**: Can plan 6-18 month technical strategies (3 pts)
- [ ] **ROI Assessment**: Quantify research impact on business metrics (2 pts)
- [ ] **Technology Evaluation**: Compare solutions, make build vs buy decisions (3 pts)
- [ ] **Risk Management**: Identify technical debt, failure modes, mitigation strategies (2 pts)

#### **Leadership & Communication (10 points)**
- [ ] **Team Management**: Hire, mentor, develop technical talent (3 pts)
- [ ] **Stakeholder Communication**: Translate technical concepts for executives (2 pts)
- [ ] **Cross-functional Collaboration**: Work with product, engineering, business (3 pts)
- [ ] **Thought Leadership**: Publications, patents, industry recognition (2 pts)

---

## 🏆 **Mastery Levels**

### **🥉 Applied Scientist I (60-70 points)**
- **Focus**: Individual contributor with strong technical foundation
- **Responsibilities**: Implement research, optimize models, support products
- **Growth Areas**: Deepen architecture knowledge, learn production systems

### **🥈 Senior Applied Scientist (71-85 points)**
- **Focus**: Technical leadership with business impact
- **Responsibilities**: Lead projects, mentor juniors, drive technical decisions
- **Growth Areas**: Strategic thinking, cross-functional collaboration

### **🥇 Principal Applied Scientist (86-100 points)**
- **Focus**: Organizational impact and industry influence
- **Responsibilities**: Set technical vision, build teams, drive innovation
- **Strengths**: Thought leadership, business strategy, technical excellence

---

## 📝 **Interview Simulation Framework**

### **🎯 Technical Deep Dive (45 minutes)**

#### **Architecture Design (15 minutes)**
**Scenario**: "Design a transformer model for real-time code completion with 1ms latency requirements"

**Evaluation Criteria**:
- [ ] Identifies latency constraints and trade-offs
- [ ] Chooses appropriate architecture (smaller model, efficient attention)
- [ ] Considers caching, pre-computation, batching strategies
- [ ] Addresses accuracy vs speed balance
- [ ] Discusses deployment and monitoring
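
One caching strategy a candidate might sketch for this scenario is a key-value cache: during autoregressive decoding, keys and values of earlier tokens are stored so each new token costs one attention row instead of a full recomputation. The single-head NumPy version below is a simplified illustration with assumed names; a real server would preallocate the cache rather than grow it with `np.vstack`.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Append-only store of past keys and values for one attention head."""

    def __init__(self, d_k: int, d_v: int):
        self.keys = np.empty((0, d_k))
        self.values = np.empty((0, d_v))

    def step(self, q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
        """Attend the newest token's query over all cached tokens, including itself."""
        self.keys = np.vstack([self.keys, k[None, :]])
        self.values = np.vstack([self.values, v[None, :]])
        scores = self.keys @ q / np.sqrt(q.shape[-1])  # (t,) one logit per cached token
        return softmax(scores) @ self.values           # (d_v,)

def full_causal_attention(q, k, v):
    """Reference: recompute causal attention for every position at once."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    return softmax(np.where(future, -np.inf, scores)) @ v

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 8)) for _ in range(3))
cache = KVCache(d_k=8, d_v=8)
incremental = np.stack([cache.step(Q[t], K[t], V[t]) for t in range(6)])
```

The incremental outputs match `full_causal_attention(Q, K, V)` row for row, so the cache changes the cost per token but not the result.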

#### **Research Problem Solving (15 minutes)**
**Scenario**: "How would you improve factual accuracy in large language models?"

**Expected Discussion Points**:
- [ ] Problem analysis (hallucination sources, evaluation challenges)
- [ ] Technical approaches (RAG, knowledge graphs, uncertainty estimation)
- [ ] Evaluation methodology (fact-checking datasets, human evaluation)
- [ ] Implementation considerations (cost, latency, maintenance)
- [ ] Success metrics and monitoring

#### **Implementation Challenge (15 minutes)**
**Task**: "Implement efficient attention mechanism for 100K+ context length"

**Code Quality Assessment**:
- [ ] Correct mathematical formulation
- [ ] Efficient memory usage (O(n) vs O(n²))
- [ ] Proper PyTorch implementation
- [ ] Handles edge cases and numerical stability
- [ ] Explains time/space complexity
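
One possible answer outline for this task is attention computed over key/value chunks with an online softmax, the idea behind FlashAttention-style kernels. The NumPy sketch below is an illustration rather than a PyTorch or GPU implementation: it never materializes the full score matrix, so for a fixed chunk size the extra memory grows linearly with sequence length, while compute stays quadratic. Function names and the chunk size are assumptions.

```python
import numpy as np

def naive_attention(q, k, v):
    """Reference implementation that materializes the full (n_q, n_k) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def chunked_attention(q, k, v, chunk_size=128):
    """Same result, but only an (n_q, chunk_size) block of scores exists at a time."""
    n_q, d = q.shape
    scale = 1.0 / np.sqrt(d)
    running_max = np.full((n_q, 1), -np.inf)  # max logit seen so far, per query
    running_sum = np.zeros((n_q, 1))          # softmax denominator so far
    acc = np.zeros((n_q, v.shape[-1]))        # unnormalized weighted sum of values
    for start in range(0, k.shape[0], chunk_size):
        k_blk = k[start:start + chunk_size]
        v_blk = v[start:start + chunk_size]
        scores = (q @ k_blk.T) * scale
        new_max = np.maximum(running_max, scores.max(axis=-1, keepdims=True))
        # Rescale earlier statistics to the new max (exp(-inf) = 0 on the first chunk)
        correction = np.exp(running_max - new_max)
        p = np.exp(scores - new_max)
        running_sum = running_sum * correction + p.sum(axis=-1, keepdims=True)
        acc = acc * correction + p @ v_blk
        running_max = new_max
    return acc / running_sum

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((300, 16)) for _ in range(3))
```

Tracking the running max is what covers the numerical-stability criterion: every exponent is at most zero, so nothing overflows even for large logits. Queries can be tiled in the same way to bound memory further.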

### **🎯 Leadership Scenarios (30 minutes)**

#### **Team Conflict Resolution (10 minutes)**
**Scenario**: "Two senior researchers disagree on technical approach for critical project"

**Leadership Skills**:
- [ ] Active listening and understanding both perspectives
- [ ] Data-driven decision making process
- [ ] Compromise and alternative solution generation
- [ ] Clear communication and expectation setting
- [ ] Follow-up and team cohesion maintenance

#### **Resource Allocation (10 minutes)**
**Scenario**: "Limited GPU budget - prioritize between 3 research projects"

**Strategic Thinking**:
- [ ] Business impact assessment
- [ ] Technical feasibility analysis
- [ ] Risk vs reward evaluation
- [ ] Timeline and milestone planning
- [ ] Stakeholder alignment strategy

#### **Technical Vision (10 minutes)**
**Question**: "Where should our AI research focus over the next 2 years?"

**Vision Development**:
- [ ] Industry trend analysis
- [ ] Competitive landscape assessment
- [ ] Internal capability evaluation
- [ ] Innovation opportunity identification
- [ ] Implementation roadmap creation

### **🎯 Behavioral Questions (15 minutes)**

#### **Core Questions for Senior Roles**
1. **Impact**: "Describe your most significant technical contribution and its business impact"
2. **Collaboration**: "Tell me about a time you had to influence without authority"
3. **Innovation**: "How do you balance research exploration with delivery requirements?"
4. **Failure**: "Describe a research project that failed and what you learned"
5. **Growth**: "How do you stay current with rapidly evolving AI research?"

---

## 🚀 **30-Day Preparation Plan**

### **Week 1: Foundation Reinforcement**
- [ ] **Day 1-2**: Review transformer fundamentals, implement basic attention
- [ ] **Day 3-4**: Study modern architectures (MoE, efficient attention)
- [ ] **Day 5-6**: Practice coding challenges, algorithm implementation
- [ ] **Day 7**: Mock technical interview, identify weak areas

### **Week 2: Advanced Topics**
- [ ] **Day 8-9**: Deep dive into training methodologies (RLHF, Constitutional AI)
- [ ] **Day 10-11**: Study production systems, optimization techniques
- [ ] **Day 12-13**: Research latest papers, trends, and innovations
- [ ] **Day 14**: Leadership scenario practice, stakeholder communication

### **Week 3: Integration & Application**
- [ ] **Day 15-16**: End-to-end project simulation (research to production)
- [ ] **Day 17-18**: Safety, ethics, and evaluation framework study
- [ ] **Day 19-20**: Business case development, ROI calculation practice
- [ ] **Day 21**: Full mock interview (technical + leadership + behavioral)

### **Week 4: Interview Readiness**
- [ ] **Day 22-23**: Company-specific research, recent developments
- [ ] **Day 24-25**: Final knowledge review, gap filling
- [ ] **Day 26-27**: Presentation practice, technical communication
- [ ] **Day 28-30**: Confidence building, final mock interviews

---

## 📚 **Essential Resources for Senior Applied Scientists**

### **📖 Must-Read Papers (2022-2023)**
1. **Llama 2** (Touvron et al., 2023) - Open foundation models
2. **GPT-4 Technical Report** (OpenAI, 2023) - Multimodal capabilities
3. **PaLM 2** (Anil et al., 2023) - Improved reasoning and coding
4. **Constitutional AI** (Bai et al., 2022) - AI safety and alignment
5. **FlashAttention-2** (Dao, 2023) - Efficient attention mechanisms

### **🛠️ Technical Tools Mastery**
- **PyTorch**: Advanced features, distributed training, optimization
- **Transformers Library**: Model architectures, fine-tuning, deployment
- **MLflow/Weights & Biases**: Experiment tracking, model management
- **Docker/Kubernetes**: Containerization, scaling, orchestration
- **Cloud Platforms**: AWS SageMaker, Azure ML, GCP Vertex AI

### **📊 Business Skills Development**
- **Product Management**: User stories, roadmaps, prioritization
- **Project Management**: Agile, sprints, stakeholder communication
- **Data Analysis**: A/B testing, statistical significance, causal inference
- **Finance**: Budget planning, ROI calculation, cost optimization
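
For the A/B testing and statistical significance item, a two-proportion z-test is a common first check when comparing conversion-style metrics between two model variants. The sketch below uses only the standard library; the function name and the example counts are hypothetical.

```python
import math

def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int):
    """Return (z, two-sided p-value) for H0: both variants share the same success rate."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # All successes or all failures in both arms: no evidence of a difference
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal, via the complementary error function
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: variant A converts 100/1000, variant B converts 150/1000
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
```

The normal approximation assumes reasonably large samples in both arms. With small counts, or when many metrics are checked at once, an exact test or a multiple-comparison correction is the safer choice.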

---

## 🎉 **Congratulations!**

You've completed a comprehensive journey through modern transformer architectures, from fundamental attention mechanisms to cutting-edge innovations like Mixture of Experts and Constitutional AI. This notebook has equipped you with:

### **✅ Technical Excellence**
- Deep understanding of transformer architectures and their evolution
- Hands-on implementation of advanced components (RoPE, SwiGLU, MoE)
- Production deployment knowledge and optimization techniques
- Safety, ethics, and evaluation frameworks

### **✅ Leadership Readiness**
- Strategic thinking for research and product integration
- Team management and mentoring capabilities
- Business impact assessment and communication skills
- Vision development for AI/ML organizational growth

### **✅ Interview Confidence**
- Comprehensive self-assessment framework
- Structured preparation methodology
- Mock interview scenarios and evaluation criteria
- 30-day preparation plan for systematic readiness

**Remember**: Senior Applied Scientist roles require not just technical depth, but the ability to translate cutting-edge research into business value while leading teams and shaping organizational AI strategy. Your journey in AI/ML is just beginning - stay curious, keep learning, and always consider the human impact of the technology you build.

**Good luck with your Amazon Applied Scientist interviews!** 🚀

---

*"The best way to predict the future is to invent it, but the wisest way to invent it is to understand both the technology and the people it serves."*