# 🤖 Pre-trained Language Models & LLM Era: Hands-on Practice
## Interactive Learning Notebook for NLP and Language Models

**Author**: Based on Lecture 14 by Ho-min Park
**Date**: 2024
**Duration**: 3-4 hours

### 📚 Learning Objectives
- Understand the paradigm shift from traditional NLP to pre-trained models
- Implement tokenization and understand language modeling objectives
- Practice with BERT and GPT-style models
- Master fine-tuning strategies and prompt engineering
- Explore RLHF and instruction tuning concepts

### 🛠️ Technical Requirements
- Python 3.8+
- transformers, torch, numpy, pandas, matplotlib, seaborn, plotly
- GPU recommended but not required

---

## Part 1: Environment Setup and Imports
Let's start by setting up our environment with all necessary libraries.

In [None]:
# Install required packages (run once)
!pip install -q transformers torch numpy pandas matplotlib seaborn plotly scikit-learn datasets accelerate sentencepiece

In [None]:
# Import essential libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Transformers and NLP libraries
import transformers  # module import so transformers.__version__ resolves below
from transformers import (
    AutoTokenizer, AutoModel, AutoModelForSequenceClassification,
    AutoModelForCausalLM, AutoModelForMaskedLM,
    BertTokenizer, BertModel, BertForMaskedLM,
    GPT2Tokenizer, GPT2Model, GPT2LMHeadModel,
    T5Tokenizer, T5ForConditionalGeneration,
    pipeline, Trainer, TrainingArguments
)
from datasets import load_dataset, Dataset
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
import time
import json
import random
from typing import List, Dict, Tuple

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
random.seed(42)

# Check if CUDA is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')
print(f'PyTorch version: {torch.__version__}')
print(f'Transformers version: {transformers.__version__}')

---
## Part 2: Understanding the Paradigm Shift

### Exercise 1: From Traditional NLP to Pre-trained Models

#### Concept
The shift from task-specific models to pre-trained foundation models represents a fundamental change in NLP:
- **Old Paradigm**: Task-specific models trained from scratch, requiring labeled data for each task
- **New Paradigm**: General-purpose foundation models pre-trained on massive text, then adapted

Let's visualize this paradigm shift and understand tokenization.

In [None]:
# Exercise 1: Visualizing the Paradigm Shift

def visualize_paradigm_shift():
    """Create an interactive visualization of the NLP paradigm shift"""
    
    # Data for visualization
    # Plotly renders line breaks in labels with <br>, not \n
    paradigms = ['Traditional NLP<br>(Pre-2018)', 'Pre-trained Models<br>(2018-2020)', 'LLM Era<br>(2020+)']
    
    # Illustrative relative values for the plot (not measured data)
    data_requirements = [100, 10, 1]  # Relative need for task-specific labeled data
    model_sizes = [0.1, 1, 100]  # Relative model size
    capabilities = [20, 60, 95]  # Relative capability score
    
    # Create subplots
    fig = make_subplots(
        rows=1, cols=3,
        subplot_titles=['Data Requirements', 'Model Size', 'Capabilities'],
        specs=[[{'type': 'bar'}, {'type': 'bar'}, {'type': 'bar'}]]
    )
    
    # Add traces
    fig.add_trace(
        go.Bar(x=paradigms, y=data_requirements, name='Data Needed',
               marker_color=['#FF6B6B', '#4ECDC4', '#45B7D1']),
        row=1, col=1
    )
    
    fig.add_trace(
        go.Bar(x=paradigms, y=model_sizes, name='Model Size',
               marker_color=['#FF6B6B', '#4ECDC4', '#45B7D1']),
        row=1, col=2
    )
    
    fig.add_trace(
        go.Bar(x=paradigms, y=capabilities, name='Performance',
               marker_color=['#FF6B6B', '#4ECDC4', '#45B7D1']),
        row=1, col=3
    )
    
    # Update layout
    fig.update_layout(
        title_text='Evolution of NLP: Paradigm Shift',
        showlegend=False,
        height=400
    )
    
    fig.show()
    
    # Print qualitative takeaways (the bar heights above are illustrative, not measured)
    print("\n📊 Key Insights:")
    print("1. Data Requirements: pre-training sharply reduces the labeled data needed per task")
    print("2. Model Size: parameter counts grew by orders of magnitude")
    print("3. Capabilities: one model performs strongly across many tasks")

visualize_paradigm_shift()

### Exercise 2: Understanding Tokenization

#### Concept
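Tokenization converts raw text into the discrete units a language model operates on. Different models use different subword strategies:
- **WordPiece** (BERT): subword tokenization; word-internal pieces carry a `##` prefix
- **BPE** (GPT-2): byte-level Byte Pair Encoding; a leading space is encoded as `Ġ`
- **SentencePiece** (T5): unigram language-model tokenization; word starts are marked with `▁`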
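
To make the `##` convention concrete, here is a minimal sketch of WordPiece-style greedy longest-match-first splitting. The function name `wordpiece_tokenize` and the toy vocabulary are hypothetical, chosen only for illustration; the real BERT tokenizer also normalizes text and uses a vocabulary of roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary for illustration only
toy_vocab = {"token", "##ization", "##ize", "un", "##believ", "##able"}
print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

The greedy rule explains why rare words break into several pieces while frequent words stay whole: a word stays whole only if it appears in the vocabulary itself.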

In [None]:
# Exercise 2: Comparative Tokenization Analysis

def compare_tokenizers(text: str):
    """Compare different tokenization strategies"""
    
    # Load tokenizers
    bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    t5_tokenizer = T5Tokenizer.from_pretrained('t5-small')
    
    # Tokenize text
    bert_tokens = bert_tokenizer.tokenize(text.lower())
    gpt2_tokens = gpt2_tokenizer.tokenize(text)
    t5_tokens = t5_tokenizer.tokenize(text)
    
    # Create comparison dataframe
    comparison_data = {
        'Tokenizer': ['BERT (WordPiece)', 'GPT-2 (BPE)', 'T5 (SentencePiece)'],
        'Tokens': [bert_tokens, gpt2_tokens, t5_tokens],
        'Token Count': [len(bert_tokens), len(gpt2_tokens), len(t5_tokens)],
        'Vocabulary Size': [
            bert_tokenizer.vocab_size,
            len(gpt2_tokenizer.get_vocab()),
            t5_tokenizer.vocab_size
        ]
    }
    
    df = pd.DataFrame(comparison_data)
    
    # Display results
    print(f"\n📝 Original Text: '{text}'")
    print("\n" + "="*60)
    
    for idx, row in df.iterrows():
        print(f"\n{row['Tokenizer']}:")
        print(f"  Tokens: {row['Tokens']}")
        print(f"  Count: {row['Token Count']}")
        print(f"  Vocab Size: {row['Vocabulary Size']:,}")
    
    # Visualize token counts
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    
    # Token count comparison
    ax1.bar(df['Tokenizer'], df['Token Count'], color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
    ax1.set_title('Token Count Comparison')
    ax1.set_ylabel('Number of Tokens')
    # Rotate the existing tick labels instead of overwriting them
    plt.setp(ax1.get_xticklabels(), rotation=45, ha='right')
    
    # Vocabulary size comparison
    ax2.bar(df['Tokenizer'], df['Vocabulary Size'], color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
    ax2.set_title('Vocabulary Size Comparison')
    ax2.set_ylabel('Vocabulary Size')
    # Rotate the existing tick labels instead of overwriting them
    plt.setp(ax2.get_xticklabels(), rotation=45, ha='right')
    
    plt.tight_layout()
    plt.show()
    
    return df

# Test with different texts
test_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Pre-trained language models revolutionized NLP!",
    "ChatGPT uses reinforcement learning from human feedback (RLHF)."
]

for text in test_texts:
    df = compare_tokenizers(text)
    print("\n" + "="*60)

---
## Part 3: BERT - Bidirectional Encoder Models

### Exercise 3: Masked Language Modeling (MLM)

#### Concept
BERT uses Masked Language Modeling where:
- About 15% of input tokens are selected for prediction, and most of them are replaced with `[MASK]`
- Model predicts masked tokens using bidirectional context
- Enables deep bidirectional understanding
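
The selection step can be sketched in a few lines. This example follows the corruption rule described in the original BERT paper (of the selected tokens, 80% become `[MASK]`, 10% become a random token, 10% stay unchanged). The function name `bert_style_mask`, the toy vocabulary, and the fixed seed are assumptions made for illustration, not part of any library.

```python
import random

def bert_style_mask(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels); labels are None where no loss is computed."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # token not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels

tokens = "the cat sat on the mat".split()
toy_vocab = ["dog", "ran", "under", "a", "rug"]  # hypothetical replacement pool
corrupted, labels = bert_style_mask(tokens, toy_vocab, mask_prob=0.5)
print(corrupted)
print(labels)
```

Because the loss is computed only at selected positions, the model sees the full left and right context of every masked token, which is what makes the objective bidirectional.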

In [None]:
# Exercise 3: Implement and Visualize Masked Language Modeling

class MaskedLanguageModeling:
    def __init__(self, model_name='bert-base-uncased'):
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.model = BertForMaskedLM.from_pretrained(model_name)
        self.model.eval()
    
    def mask_and_predict(self, text: str, mask_token_index: int = None):
        """Mask a token and predict it using BERT"""
        
        # Tokenize
        tokens = self.tokenizer.tokenize(text)
        
        # Choose a random token to mask if not specified.
        # `tokens` has no [CLS]/[SEP] yet, so every index 0..len-1 is a valid choice.
        if mask_token_index is None:
            mask_token_index = random.randint(0, len(tokens) - 1)
        
        # Store original token
        original_token = tokens[mask_token_index]
        
        # Create masked version
        masked_tokens = tokens.copy()
        masked_tokens[mask_token_index] = '[MASK]'
        
        # Convert to IDs
        indexed_tokens = self.tokenizer.convert_tokens_to_ids(['[CLS]'] + masked_tokens + ['[SEP]'])
        tokens_tensor = torch.tensor([indexed_tokens])
        
        # Predict
        with torch.no_grad():
            outputs = self.model(tokens_tensor)
            predictions = outputs.logits
        
        # Get top 5 predictions for masked token
        masked_index = mask_token_index + 1  # +1 for [CLS]
        predicted_scores = predictions[0, masked_index]
        top_5_idx = torch.argsort(predicted_scores, descending=True)[:5]
        
        # Convert predictions to tokens
        top_5_tokens = [self.tokenizer.convert_ids_to_tokens([idx.item()])[0] for idx in top_5_idx]
        top_5_scores = [predicted_scores[idx].item() for idx in top_5_idx]
        
        # Softmax for probabilities
        probs = F.softmax(predicted_scores, dim=-1)
        top_5_probs = [probs[idx].item() for idx in top_5_idx]
        
        return {
            'original': original_token,
            'masked_text': ' '.join(masked_tokens),
            'predictions': list(zip(top_5_tokens, top_5_probs, top_5_scores))
        }
    
    def visualize_predictions(self, results: dict):
        """Visualize MLM predictions"""
        
        tokens = [pred[0] for pred in results['predictions']]
        probs = [pred[1] for pred in results['predictions']]
        
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
        
        # Bar chart of predictions
        colors = ['#2ECC71' if token == results['original'] else '#3498DB' for token in tokens]
        bars = ax1.barh(tokens, probs, color=colors)
        ax1.set_xlabel('Probability')
        ax1.set_title('Top 5 Predictions for [MASK]')
        ax1.set_xlim(0, max(probs) * 1.1)
        
        # Add value labels
        for bar, prob in zip(bars, probs):
            ax1.text(prob + 0.01, bar.get_y() + bar.get_height()/2, 
                    f'{prob:.2%}', va='center')
        
        # Placeholder heatmap: random values, NOT attention weights read from BERT.
        # It only illustrates that every token may attend to every other token.
        sentence = results['masked_text'].split()
        attention_matrix = np.random.rand(len(sentence), len(sentence))
        np.fill_diagonal(attention_matrix, 1)
        
        im = ax2.imshow(attention_matrix, cmap='Blues', aspect='auto')
        ax2.set_xticks(range(len(sentence)))
        ax2.set_yticks(range(len(sentence)))
        ax2.set_xticklabels(sentence, rotation=45, ha='right')
        ax2.set_yticklabels(sentence)
        ax2.set_title('Bidirectional Attention (illustrative, random values)')
        plt.colorbar(im, ax=ax2)
        
        plt.tight_layout()
        plt.show()
        
        print(f"\n✅ Original token: '{results['original']}'")
        print(f"📊 Top prediction: '{tokens[0]}' with {probs[0]:.2%} confidence")

# Initialize and test MLM
mlm = MaskedLanguageModeling()

# Test sentences
test_sentences = [
    "The cat sat on the mat.",
    "Machine learning models require large amounts of data.",
    "BERT uses bidirectional attention to understand context."
]

for sentence in test_sentences:
    print(f"\n🔍 Testing: {sentence}")
    results = mlm.mask_and_predict(sentence)
    mlm.visualize_predictions(results)

### 💡 Your Turn: MLM Practice

Modify the code above to:
1. Mask multiple tokens in a sentence
2. Compare predictions when masking different positions
3. Analyze how context affects predictions

---
## Part 4: GPT - Decoder-based Models

### Exercise 4: Autoregressive Text Generation

#### Concept
GPT models use causal (left-to-right) attention:
- Predict next token based on previous context
- Optimized for text generation
- Can be used for few-shot learning
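
Each generation step ends by turning the model's logits into a distribution and picking one token. The sketch below shows how temperature and top-k reshape that distribution, using hand-written toy logits rather than real model output. The function name `next_token_distribution` is hypothetical, and ties at the cutoff are all kept, which a production sampler may handle differently.

```python
import math

def next_token_distribution(logits, temperature=1.0, top_k=None):
    """Softmax over logits after temperature scaling and optional top-k filtering."""
    scaled = [x / temperature for x in logits]
    if top_k is not None:
        # Keep only logits at or above the k-th largest value
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [x if x >= cutoff else float("-inf") for x in scaled]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
for name, kwargs in [
    ("T=1.0", {}),
    ("T=0.5", {"temperature": 0.5}),
    ("T=1.5", {"temperature": 1.5}),
    ("top_k=2", {"top_k": 2}),
    ("top_k=1", {"top_k": 1}),
]:
    probs = next_token_distribution(toy_logits, **kwargs)
    print(name, [round(p, 3) for p in probs])
```

Lower temperatures sharpen the distribution toward the top token, higher temperatures flatten it, and `top_k=1` leaves a single candidate, which makes sampling equivalent to greedy decoding.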

In [None]:
# Exercise 4: GPT Text Generation and Analysis

class GPTTextGeneration:
    def __init__(self, model_name='gpt2'):
        self.tokenizer = GPT2Tokenizer.from_pretrained(model_name)
        self.model = GPT2LMHeadModel.from_pretrained(model_name)
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.model.eval()
    
    def generate_text(self, prompt: str, max_length: int = 50, 
                     temperature: float = 1.0, top_k: int = 50):
        """Generate text using GPT-2"""
        
        # Encode prompt
        input_ids = self.tokenizer.encode(prompt, return_tensors='pt')
        
        # Generate
        with torch.no_grad():
            output = self.model.generate(
                input_ids,
                max_length=max_length,
                temperature=temperature,
                top_k=top_k,
                do_sample=True,
                pad_token_id=self.tokenizer.eos_token_id
            )
        
        # Decode
        generated_text = self.tokenizer.decode(output[0], skip_special_tokens=True)
        
        return generated_text
    
    def analyze_generation_strategies(self, prompt: str):
        """Compare different generation strategies"""
        
        strategies = [
            # top_k=1 leaves one candidate, so sampling reduces to greedy decoding
            {'name': 'Greedy', 'temp': 1.0, 'top_k': 1},
            {'name': 'Low Temperature', 'temp': 0.5, 'top_k': 50},
            {'name': 'High Temperature', 'temp': 1.5, 'top_k': 50},
            {'name': 'Top-K Sampling', 'temp': 1.0, 'top_k': 10}
        ]
        
        results = []
        
        for strategy in strategies:
            generated = self.generate_text(
                prompt, 
                temperature=strategy['temp'],
                top_k=strategy['top_k']
            )
            results.append({
                'Strategy': strategy['name'],
                'Temperature': strategy['temp'],
                'Top-K': strategy['top_k'],
                'Generated': generated[len(prompt):].strip()[:100]  # First 100 chars after prompt
            })
        
        df = pd.DataFrame(results)
        return df
    
    def visualize_token_probabilities(self, text: str):
        """Visualize next token probabilities"""
        
        input_ids = self.tokenizer.encode(text, return_tensors='pt')
        
        with torch.no_grad():
            outputs = self.model(input_ids)
            predictions = outputs.logits
        
        # Get probabilities for next token
        next_token_logits = predictions[0, -1, :]
        probs = F.softmax(next_token_logits, dim=-1)
        
        # Get top 10 tokens
        top_k = 10
        top_probs, top_indices = torch.topk(probs, top_k)
        # Convert tensor indices to plain ints before decoding
        top_tokens = [self.tokenizer.decode([idx.item()]) for idx in top_indices]
        
        # Create interactive plot
        fig = go.Figure(data=[
            go.Bar(
                x=top_tokens,
                y=top_probs.numpy(),
                text=[f'{p:.2%}' for p in top_probs],
                textposition='auto',
                marker_color='lightblue'
            )
        ])
        
        fig.update_layout(
            title=f'Next Token Probabilities after: "{text}"',
            xaxis_title='Predicted Tokens',
            yaxis_title='Probability',
            height=400
        )
        
        fig.show()
        
        return list(zip(top_tokens, top_probs.numpy()))

# Initialize GPT model
gpt = GPTTextGeneration()

# Test generation strategies
prompt = "The future of artificial intelligence is"
print(f"\n🎯 Prompt: '{prompt}'\n")

# Compare strategies
strategies_df = gpt.analyze_generation_strategies(prompt)
print("\n📊 Generation Strategies Comparison:")
for idx, row in strategies_df.iterrows():
    print(f"\n{row['Strategy']} (T={row['Temperature']}, K={row['Top-K']}):")
    print(f"  → {row['Generated']}")

# Visualize token probabilities
print("\n\n🔮 Next Token Predictions:")
predictions = gpt.visualize_token_probabilities(prompt)

---
## Part 5: Few-shot Learning and In-Context Learning

### Exercise 5: Implementing Few-shot Learning

#### Concept
Large language models can learn from examples in the prompt:
- **Zero-shot**: Task description only
- **One-shot**: Single example
- **Few-shot**: Multiple examples
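
A small base model rarely stops after a single label, so its free-form continuation usually has to be mapped back to the task's label set before it can be scored. The helper below is a hypothetical sketch of that step. The name `extract_label` and the rule of reading only the first generated line are assumptions for illustration.

```python
import re

def extract_label(answer, labels=("positive", "negative")):
    """Return the first known label found on the first line of a generation, else None."""
    stripped = answer.strip()
    if not stripped:
        return None
    first_line = stripped.splitlines()[0]  # later lines are often extra made-up examples
    for word in re.findall(r"[a-z]+", first_line.lower()):
        if word in labels:
            return word
    return None

print(extract_label(" negative\n\nExample 4:\nInput: Great food!"))  # negative
print(extract_label("Positive!"))                                     # positive
print(extract_label("I am not sure"))                                 # None
```

Returning `None` for unparseable output keeps format failures separate from wrong answers when comparing zero-shot, one-shot, and few-shot prompts.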

In [None]:
# Exercise 5: Few-shot Learning Implementation

class FewShotLearning:
    def __init__(self):
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
        self.model = GPT2LMHeadModel.from_pretrained('gpt2-medium')
        self.tokenizer.pad_token = self.tokenizer.eos_token
    
    def create_prompt(self, task: str, examples: List[Tuple[str, str]], query: str) -> str:
        """Create few-shot prompt"""
        prompt = f"Task: {task}\n\n"
        
        # Add examples
        for i, (input_text, output_text) in enumerate(examples, 1):
            prompt += f"Example {i}:\n"
            prompt += f"Input: {input_text}\n"
            prompt += f"Output: {output_text}\n\n"
        
        # Add query
        prompt += f"Now:\nInput: {query}\nOutput:"
        
        return prompt
    
    def perform_few_shot(self, task: str, examples: List[Tuple[str, str]], 
                        query: str, max_length: int = 100):
        """Perform few-shot learning"""
        
        # Create prompt
        prompt = self.create_prompt(task, examples, query)
        
        # Generate response
        input_ids = self.tokenizer.encode(prompt, return_tensors='pt')
        
        with torch.no_grad():
            output = self.model.generate(
                input_ids,
                max_length=len(input_ids[0]) + max_length,
                temperature=0.8,
                pad_token_id=self.tokenizer.eos_token_id,
                do_sample=True
            )
        
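        # Decode only the newly generated tokens; slicing the decoded string by
        # len(prompt) can misalign if decoding does not reproduce the prompt exactly
        new_tokens = output[0][input_ids.shape[1]:]
        answer = self.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()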
        
        return prompt, answer
    
    def compare_shot_learning(self, task: str, examples: List[Tuple[str, str]], 
                             query: str):
        """Compare zero-shot, one-shot, and few-shot learning"""
        
        results = {}
        
        # Zero-shot
        zero_prompt, zero_answer = self.perform_few_shot(task, [], query)
        results['Zero-shot'] = {'prompt': zero_prompt, 'answer': zero_answer}
        
        # One-shot
        one_prompt, one_answer = self.perform_few_shot(task, examples[:1], query)
        results['One-shot'] = {'prompt': one_prompt, 'answer': one_answer}
        
        # Few-shot
        few_prompt, few_answer = self.perform_few_shot(task, examples, query)
        results['Few-shot'] = {'prompt': few_prompt, 'answer': few_answer}
        
        return results

# Initialize few-shot learner
fsl = FewShotLearning()

# Define sentiment analysis task
task = "Classify the sentiment of the text as positive or negative"

examples = [
    ("I love this movie! It's amazing.", "positive"),
    ("This product is terrible. Waste of money.", "negative"),
    ("Best experience ever! Highly recommend.", "positive")
]

query = "The service was disappointing and slow."

# Compare different shot learning
results = fsl.compare_shot_learning(task, examples, query)

print("\n🎯 Sentiment Analysis Task\n")
print(f"Query: '{query}'\n")
print("="*60)

for method, result in results.items():
    print(f"\n{method}:")
    print(f"Answer: {result['answer'][:50]}")
    print("-"*40)

# Visualize results
methods = list(results.keys())
prompt_lengths = [len(results[m]['prompt']) for m in methods]
answer_lengths = [len(results[m]['answer'].split()[0]) if results[m]['answer'] else 0 for m in methods]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.bar(methods, prompt_lengths, color=['#FF6B6B', '#FFA07A', '#98D8C8'])
ax1.set_title('Prompt Length Comparison')
ax1.set_ylabel('Characters')

# Hard-coded placeholder values for illustration, not measured accuracies
ax2.bar(methods, [0.3, 0.6, 0.9], color=['#FF6B6B', '#FFA07A', '#98D8C8'])
ax2.set_title('Expected Accuracy (Illustrative)')
ax2.set_ylabel('Accuracy')
ax2.set_ylim(0, 1)

plt.tight_layout()
plt.show()

---
## Part 6: Fine-tuning Strategies

### Exercise 6: Parameter-Efficient Fine-tuning (PEFT)

#### Concept
Parameter-efficient fine-tuning keeps the pre-trained weights frozen and trains only a small number of parameters, often newly added ones:
- **LoRA**: Low-Rank Adaptation
- **Prefix Tuning**: Optimize continuous prompts
- **Adapter Layers**: Small bottleneck modules
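
The saving from LoRA follows from simple arithmetic: a full update to a `d_in × d_out` weight matrix has `d_in * d_out` entries, while a rank-`r` factorization needs only `r * (d_in + d_out)`. The sketch below works this out for a 768 × 768 matrix at rank 8. The helper name `lora_param_counts` is hypothetical.

```python
def lora_param_counts(d_in, d_out, rank):
    """Parameters in a full weight update vs. a rank-`rank` LoRA update."""
    full = d_in * d_out            # every entry of the weight matrix
    lora = rank * (d_in + d_out)   # A is d_in x rank, B is rank x d_out
    return full, lora, lora / full

full, lora, ratio = lora_param_counts(768, 768, 8)
print(f"full={full:,}  lora={lora:,}  trainable fraction={ratio:.2%}")
# full=589,824  lora=12,288  trainable fraction=2.08%
```

Because the LoRA count grows linearly in the dimension while the full count grows quadratically, the relative saving increases with model width at a fixed rank.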

In [None]:
# Exercise 6: Simulate LoRA (Low-Rank Adaptation)

class LoRASimulation:
    """
    Simplified LoRA implementation for educational purposes
    """
    
    def __init__(self, original_dim: int = 768, rank: int = 8):
        """
        Initialize LoRA matrices
        original_dim: Original weight matrix dimension
        rank: Low-rank decomposition rank
        """
        self.original_dim = original_dim
        self.rank = rank
        
        # Original weight matrix (frozen)
        self.W = nn.Parameter(torch.randn(original_dim, original_dim), requires_grad=False)
        
        # LoRA matrices (trainable)
        self.A = nn.Parameter(torch.randn(original_dim, rank))
        self.B = nn.Parameter(torch.randn(rank, original_dim))
        
        # Initialize A and B
        nn.init.kaiming_uniform_(self.A)
        nn.init.zeros_(self.B)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass with LoRA"""
        # Original transformation
        original_output = torch.matmul(x, self.W)
        
        # LoRA adaptation
        lora_output = torch.matmul(torch.matmul(x, self.A), self.B)
        
        # Combine
        return original_output + lora_output
    
    def count_parameters(self) -> dict:
        """Count trainable vs frozen parameters"""
        original_params = self.W.numel()
        lora_params = self.A.numel() + self.B.numel()
        
        return {
            'original': original_params,
            'lora': lora_params,
            'total': original_params + lora_params,
            'reduction': f"{(1 - lora_params/original_params)*100:.1f}%"
        }

def visualize_lora_efficiency():
    """Visualize parameter efficiency of LoRA"""
    
    dimensions = [128, 256, 512, 768, 1024]
    ranks = [4, 8, 16, 32]
    
    # Calculate parameter savings
    data = []
    for dim in dimensions:
        for rank in ranks:
            original = dim * dim
            lora = 2 * dim * rank
            savings = (1 - lora/original) * 100
            data.append({
                'Dimension': dim,
                'Rank': rank,
                'Original': original,
                'LoRA': lora,
                'Savings': savings
            })
    
    df = pd.DataFrame(data)
    
    # Create heatmap
    pivot_df = df.pivot(index='Dimension', columns='Rank', values='Savings')
    
    plt.figure(figsize=(10, 6))
    sns.heatmap(pivot_df, annot=True, fmt='.1f', cmap='YlOrRd', 
                cbar_kws={'label': 'Parameter Reduction (%)'})
    plt.title('LoRA Parameter Efficiency: Savings by Dimension and Rank')
    plt.xlabel('LoRA Rank')
    plt.ylabel('Model Dimension')
    plt.show()
    
    # Interactive 3D plot
    fig = go.Figure(data=[go.Surface(
        z=pivot_df.values,
        x=pivot_df.columns,
        y=pivot_df.index,
        colorscale='Viridis',
        colorbar=dict(title='Savings (%)')
    )])
    
    fig.update_layout(
        title='LoRA Parameter Savings Surface',
        scene=dict(
            xaxis_title='Rank',
            yaxis_title='Dimension',
            zaxis_title='Savings (%)'
        ),
        height=500
    )
    
    fig.show()

# Test LoRA
print("\n🔧 LoRA (Low-Rank Adaptation) Analysis\n")

# Create LoRA model
lora = LoRASimulation(original_dim=768, rank=8)
params = lora.count_parameters()

print("Parameter Count:")
print(f"  Original: {params['original']:,}")
print(f"  LoRA: {params['lora']:,}")
print(f"  Reduction: {params['reduction']}")

# Visualize efficiency
visualize_lora_efficiency()

# Test forward pass
x = torch.randn(32, 768)  # Batch of 32, dimension 768
output = lora.forward(x)
print(f"\n✅ Forward pass successful: Input {x.shape} → Output {output.shape}")

---
## Part 7: Advanced Prompt Engineering

### Exercise 7: Chain-of-Thought (CoT) Prompting

#### Concept
Chain-of-Thought prompting improves reasoning by:
- Breaking down complex problems into steps
- Showing intermediate reasoning
- Improving accuracy on complex tasks
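
Since a chain-of-thought completion mixes reasoning with the result, evaluation usually needs a step that pulls out the final answer. The sketch below assumes the completion ends with an `Answer:` line, as the exemplars in this exercise do. The helper name `extract_final_answer` is hypothetical.

```python
import re

def extract_final_answer(completion):
    """Return the text after the last 'Answer:' marker, or None if there is none."""
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None

cot_completion = """Let's think step by step:
1. Roger starts with 5 tennis balls
2. He buys 2 cans
3. Each can has 3 balls, so 2 × 3 = 6 balls
4. Total: 5 + 6 = 11 balls
Answer: 11"""

print(extract_final_answer(cot_completion))       # 11
print(extract_final_answer("No marker present"))  # None
```

Taking the last match matters when a model keeps generating further question and answer pairs after the one that was asked.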

In [None]:
# Exercise 7: Chain-of-Thought Prompting

class ChainOfThoughtPrompting:
    def __init__(self):
        self.examples = {
            'math': [
                {
                    'question': 'Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many total?',
                    'cot': '''Let's think step by step:
1. Roger starts with 5 tennis balls
2. He buys 2 cans
3. Each can has 3 balls, so 2 × 3 = 6 balls
4. Total: 5 + 6 = 11 balls
Answer: 11'''
                }
            ],
            'logic': [
                {
                    'question': 'All roses are flowers. Some flowers fade quickly. Can we conclude that some roses fade quickly?',
                    'cot': '''Let's analyze step by step:
1. All roses are flowers (roses ⊆ flowers)
2. Some flowers fade quickly (∃ flowers that fade)
3. But we don't know if the fading flowers include roses
4. The fading flowers might be non-rose flowers
Answer: No, we cannot conclude this.'''
                }
            ]
        }
    
    def create_cot_prompt(self, task_type: str, question: str) -> str:
        """Create Chain-of-Thought prompt"""
        
        prompt = f"Solve the following {task_type} problem step by step.\n\n"
        
        # Add examples
        if task_type in self.examples:
            for example in self.examples[task_type]:
                prompt += f"Question: {example['question']}\n"
                prompt += f"{example['cot']}\n\n"
        
        # Add new question
        prompt += f"Question: {question}\n"
        prompt += "Let's think step by step:\n"
        
        return prompt
    
    def compare_prompting_strategies(self, question: str):
        """Compare different prompting strategies"""
        
        strategies = {
            'Direct': f"Question: {question}\nAnswer:",
            'CoT': self.create_cot_prompt('math', question),
            'Zero-shot-CoT': f"Question: {question}\nLet's think step by step:",
            'Role-based': f"You are a math expert. Solve this problem:\n{question}\nSolution:"
        }
        
        return strategies
    
    def visualize_prompt_comparison(self, strategies: dict):
        """Visualize different prompting strategies"""
        
        # Calculate metrics
        metrics = []
        for name, prompt in strategies.items():
            metrics.append({
                'Strategy': name,
                'Length': len(prompt),
                'Words': len(prompt.split()),
                'Lines': len(prompt.split('\n'))
            })
        
        df = pd.DataFrame(metrics)
        
        # Create subplots
        fig = make_subplots(
            rows=1, cols=3,
            subplot_titles=['Prompt Length', 'Word Count', 'Line Count']
        )
        
        # Add traces
        fig.add_trace(
            go.Bar(x=df['Strategy'], y=df['Length'], name='Characters',
                  marker_color='lightblue'),
            row=1, col=1
        )
        
        fig.add_trace(
            go.Bar(x=df['Strategy'], y=df['Words'], name='Words',
                  marker_color='lightgreen'),
            row=1, col=2
        )
        
        fig.add_trace(
            go.Bar(x=df['Strategy'], y=df['Lines'], name='Lines',
                  marker_color='lightcoral'),
            row=1, col=3
        )
        
        fig.update_layout(
            title_text='Prompting Strategy Comparison',
            showlegend=False,
            height=400
        )
        
        fig.show()
        
        return df

# Test Chain-of-Thought
cot = ChainOfThoughtPrompting()

# Test question
question = "A store had 45 apples. They sold 17 in the morning and 12 in the afternoon. How many are left?"

print("\n🧠 Chain-of-Thought Prompting Analysis\n")
print(f"Test Question: {question}\n")
print("="*60)

# Compare strategies
strategies = cot.compare_prompting_strategies(question)

# Display each strategy
for name, prompt in strategies.items():
    print(f"\n{'='*20} {name} Strategy {'='*20}")
    print(prompt[:200] + "..." if len(prompt) > 200 else prompt)

# Visualize comparison
print("\n📊 Strategy Metrics:")
metrics_df = cot.visualize_prompt_comparison(strategies)
print(metrics_df.to_string())

---
## Part 8: RLHF and Instruction Tuning

### Exercise 8: Simulating RLHF Process

#### Concept
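Reinforcement Learning from Human Feedback (RLHF) typically involves three stages:
1. Supervised Fine-tuning (SFT) on demonstration data
2. Reward Model Training on human preference comparisons
3. PPO Optimization of the policy against the reward model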
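
The reward-model stage is commonly trained on pairs of responses where a human preferred one over the other, using a pairwise loss of the form `-log(sigmoid(r_chosen - r_rejected))`. The sketch below computes that loss for scalar rewards. The function name `pairwise_reward_loss` is hypothetical, and this notebook's own simulation uses hand-assigned scores rather than a trained reward model.

```python
import math

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(margin)), written as log(1 + exp(-margin))."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))

# The loss is small when the preferred response already scores higher...
print(round(pairwise_reward_loss(0.9, 0.2), 4))
# ...equals log(2) when the model cannot tell the two apart...
print(round(pairwise_reward_loss(0.5, 0.5), 4))
# ...and grows when the ranking is reversed.
print(round(pairwise_reward_loss(0.2, 0.9), 4))
```

Only the difference between the two rewards enters the loss, so the reward model learns a ranking rather than an absolute score.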

In [None]:
# Exercise 8: RLHF Process Simulation

class RLHFSimulation:
    def __init__(self):
        self.responses_database = [
            {
                'prompt': 'Explain quantum computing',
                'responses': [
                    {'text': 'Quantum computing uses quantum bits that can be 0 and 1 simultaneously.', 'score': 0.7},
                    {'text': 'It is a type of computation using quantum phenomena like superposition.', 'score': 0.8},
                    {'text': 'Computers that are very fast.', 'score': 0.3}
                ]
            },
            {
                'prompt': 'How to make coffee?',
                'responses': [
                    {'text': '1. Boil water 2. Add coffee 3. Stir 4. Enjoy', 'score': 0.6},
                    {'text': 'Heat water to 195-205°F, add 2 tbsp coffee per 6 oz water, brew 4-5 min', 'score': 0.9},
                    {'text': 'Put coffee in water.', 'score': 0.2}
                ]
            }
        ]
    
    def simulate_reward_model(self, response: str) -> float:
        """Simulate a reward model scoring"""
        # Simple heuristics for demonstration
        score = 0.5  # Base score
        
        # Length bonus
        if 20 < len(response.split()) < 50:
            score += 0.2
        
        # Specificity bonus
        if any(char.isdigit() for char in response):
            score += 0.1
        
        # Structure bonus
        if '.' in response or ',' in response:
            score += 0.1
        
        return min(score, 1.0)
    
    def ppo_update_simulation(self, responses: List[dict], learning_rate: float = 0.1):
        """Simulate PPO policy update"""
        updated_responses = []
        
        for resp in responses:
            # Simulate policy update based on reward
            old_score = resp['score']
            reward_signal = old_score - 0.5  # Centered around 0.5
            
            # Simulate update (simplified)
            new_score = old_score + learning_rate * reward_signal
            new_score = max(0, min(1, new_score))  # Clip to [0, 1]
            
            updated_responses.append({
                'text': resp['text'],
                'old_score': old_score,
                'new_score': new_score,
                'improvement': new_score - old_score
            })
        
        return updated_responses
    
    def visualize_rlhf_process(self):
        """Visualize the RLHF training process"""
        
        # Simulate training iterations
        iterations = 10
        scores_over_time = []
        
        for i in range(iterations):
            iter_scores = []
            for prompt_data in self.responses_database:
                for response in prompt_data['responses']:
                    # Simulate improvement
                    base_score = response['score']
                    improvement = np.random.normal(0.02, 0.01) * i
                    score = min(base_score + improvement, 1.0)
                    iter_scores.append(score)
            
            scores_over_time.append({
                'Iteration': i,
                'Mean Score': np.mean(iter_scores),
                'Min Score': np.min(iter_scores),
                'Max Score': np.max(iter_scores)
            })
        
        df = pd.DataFrame(scores_over_time)
        
        # Create visualization
        fig = go.Figure()
        
        # Add traces
        fig.add_trace(go.Scatter(
            x=df['Iteration'],
            y=df['Mean Score'],
            mode='lines+markers',
            name='Mean Score',
            line=dict(color='blue', width=3)
        ))
        
        # Add min-max range band (not a statistical confidence interval)
        fig.add_trace(go.Scatter(
            x=df['Iteration'].tolist() + df['Iteration'].tolist()[::-1],
            y=df['Max Score'].tolist() + df['Min Score'].tolist()[::-1],
            fill='toself',
            fillcolor='rgba(0,100,200,0.2)',
            line=dict(color='rgba(255,255,255,0)'),
            showlegend=False,
            name='Range'
        ))
        
        fig.update_layout(
            title='RLHF Training Progress Simulation',
            xaxis_title='Training Iteration',
            yaxis_title='Reward Score',
            height=400
        )
        
        fig.show()
        
        return df
    
    def demonstrate_improvement(self, prompt: str):
        """Show response improvement through RLHF"""
        
        print(f"\n📝 Prompt: '{prompt}'\n")
        print("="*60)
        
        # Find responses for this prompt
        prompt_data = next((p for p in self.responses_database if p['prompt'] == prompt), None)
        
        if prompt_data:
            # Before RLHF
            print("\n🔴 Before RLHF:")
            for resp in prompt_data['responses']:
                print(f"  Score {resp['score']:.2f}: {resp['text']}")
            
            # After RLHF (simulated)
            updated = self.ppo_update_simulation(prompt_data['responses'])
            
            print("\n🟢 After RLHF (simulated):")
            sorted_updated = sorted(updated, key=lambda x: x['new_score'], reverse=True)
            for resp in sorted_updated:
                print(f"  Score {resp['new_score']:.2f} ({resp['improvement']:+.2f}): {resp['text']}")
        else:
            print("  No stored responses found for this prompt.")

# Test RLHF Simulation
rlhf = RLHFSimulation()

print("\n🎯 RLHF (Reinforcement Learning from Human Feedback) Simulation\n")

# Visualize training progress
progress_df = rlhf.visualize_rlhf_process()

# Demonstrate improvement on specific prompts
for prompt_data in rlhf.responses_database:
    rlhf.demonstrate_improvement(prompt_data['prompt'])

print("\n📊 Training Statistics:")
print(f"  Initial Mean Score: {progress_df.iloc[0]['Mean Score']:.3f}")
print(f"  Final Mean Score: {progress_df.iloc[-1]['Mean Score']:.3f}")
print(f"  Improvement: {(progress_df.iloc[-1]['Mean Score'] - progress_df.iloc[0]['Mean Score']):.3f}")
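The simulation above imitates RLHF with hand-written heuristics and a simple score nudge. For contrast, here is a minimal sketch of two loss functions that a real pipeline typically relies on: the pairwise (Bradley-Terry) preference loss used to train a reward model, and PPO's clipped surrogate objective. It uses plain Python on scalars for readability. The function names and the `clip_eps=0.2` default are illustrative assumptions and are not part of the notebook's `RLHFSimulation` class.

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Reward-model loss for one preference pair: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); small when the chosen response scores higher
    return math.log1p(math.exp(-margin))

def ppo_clipped_objective(logp_new: float, logp_old: float,
                          advantage: float, clip_eps: float = 0.2) -> float:
    """PPO clipped surrogate for a single action (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # The minimum is a pessimistic bound: pushing the ratio past the clip
    # range in the favorable direction earns no additional objective.
    return min(ratio * advantage, clipped_ratio * advantage)

# Illustrative values only (not taken from the notebook's response database)
print(pairwise_preference_loss(reward_chosen=0.9, reward_rejected=0.3))
print(ppo_clipped_objective(logp_new=math.log(0.6), logp_old=math.log(0.4), advantage=1.0))
```

In practice both quantities are computed over batches of token log-probabilities with an autograd framework, and PPO-based RLHF usually adds a KL penalty against the original model. Those parts are omitted here.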

---
## Part 9: Model Architecture Comparison and Scaling Laws

### Exercise 9: Understanding Scaling Laws

#### Concept
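Language-model performance tends to follow empirical scaling laws:
- Test loss falls roughly as a power law in model size (and in data and compute), so each added parameter buys less as models grow
- Some abilities have been reported to emerge only beyond certain scale thresholds
- Larger models cost more to train and serve, so performance trades off against efficiency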
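
The relationship between model size and loss is often modeled as a power law, `loss ≈ a * N**(-alpha)`, which is a straight line in log-log space. The sketch below fits such a law by least squares. The data points are synthetic values generated from a known curve for illustration (not measurements of any real model), and `fit_power_law` is a hypothetical helper name.

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-alpha) by linear regression on log-log values."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Ordinary least-squares slope and intercept of log(loss) vs log(size)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)

# Synthetic data generated from a = 10, alpha = 0.1 (assumed values)
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [10.0 * n ** -0.1 for n in sizes]

a, alpha = fit_power_law(sizes, losses)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")
print(f"Extrapolated loss at 1e10 parameters: {a * 1e10 ** -alpha:.3f}")
```

Because the fit is linear in log space, a constant multiplicative increase in size yields a constant multiplicative decrease in loss, which is the diminishing-returns pattern described above.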

In [None]:
# Exercise 9: Scaling Laws and Model Comparison

def analyze_scaling_laws():
    """Analyze and visualize scaling laws in language models"""
    
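    # Model data: parameter counts are approximate, and the 'performance'
    # scores are illustrative placeholders, not benchmark results.
    # * GPT-4's parameter count has not been published; 1e12 is a placeholder.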
    models = [
        {'name': 'BERT-Base', 'params': 110e6, 'year': 2018, 'type': 'Encoder', 'performance': 0.82},
        {'name': 'BERT-Large', 'params': 340e6, 'year': 2018, 'type': 'Encoder', 'performance': 0.86},
        {'name': 'GPT-2', 'params': 1.5e9, 'year': 2019, 'type': 'Decoder', 'performance': 0.88},
        {'name': 'T5-Base', 'params': 220e6, 'year': 2019, 'type': 'Enc-Dec', 'performance': 0.84},
        {'name': 'T5-Large', 'params': 770e6, 'year': 2019, 'type': 'Enc-Dec', 'performance': 0.87},
        {'name': 'GPT-3', 'params': 175e9, 'year': 2020, 'type': 'Decoder', 'performance': 0.93},
        {'name': 'T5-11B', 'params': 11e9, 'year': 2019, 'type': 'Enc-Dec', 'performance': 0.90},
        {'name': 'GPT-4*', 'params': 1e12, 'year': 2023, 'type': 'Decoder', 'performance': 0.97}
    ]
    
    df = pd.DataFrame(models)
    
    # Create comprehensive visualization
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=[
            'Scaling Law: Size vs Performance',
            'Model Evolution Timeline',
            'Architecture Distribution',
            'Emergent Abilities Threshold'
        ],
        specs=[
            [{'type': 'scatter'}, {'type': 'scatter'}],
            [{'type': 'bar'}, {'type': 'scatter'}]
        ]
    )
    
    # 1. Scaling Law Plot
    colors = {'Encoder': 'blue', 'Decoder': 'red', 'Enc-Dec': 'green'}
    for model_type in df['type'].unique():
        data = df[df['type'] == model_type]
        fig.add_trace(
            go.Scatter(
                x=np.log10(data['params']),
                y=data['performance'],
                mode='markers+text',
                name=model_type,
                text=data['name'],
                textposition='top center',
                marker=dict(size=10, color=colors[model_type])
            ),
            row=1, col=1
        )
    
    # Add a quadratic trend line in log10(parameters); descriptive fit only
    x_trend = np.log10(df['params'])
    z = np.polyfit(x_trend, df['performance'], 2)
    p = np.poly1d(z)
    x_line = np.linspace(x_trend.min(), x_trend.max(), 100)
    fig.add_trace(
        go.Scatter(
            x=x_line,
            y=p(x_line),
            mode='lines',
            name='Trend',
            line=dict(dash='dash', color='gray')
        ),
        row=1, col=1
    )
    
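    # 2. Timeline Evolution
    # Several models share a release year, so connecting the points in table
    # order would zigzag; plot labeled markers only.
    fig.add_trace(
        go.Scatter(
            x=df['year'],
            y=np.log10(df['params']),
            mode='markers+text',
            name='Model Size',
            marker=dict(size=12),
            text=df['name'],
            textposition='top center'
        ),
        row=1, col=2
    )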
    
    # 3. Architecture Distribution
    arch_counts = df['type'].value_counts()
    fig.add_trace(
        go.Bar(
            x=arch_counts.index,
            y=arch_counts.values,
            name='Count',
            # value_counts() orders by frequency, so look colors up by label
            marker_color=[colors[t] for t in arch_counts.index]
        ),
        row=2, col=1
    )
    
    # 4. Emergent Abilities
    # Simulate abilities appearing at different scales
    # (sizes and scores are illustrative, not measured thresholds)
    abilities = [
        {'size': 1e9, 'ability': 'Basic QA', 'score': 0.3},
        {'size': 10e9, 'ability': 'Few-shot', 'score': 0.6},
        {'size': 100e9, 'ability': 'CoT', 'score': 0.8},
        {'size': 500e9, 'ability': 'Reasoning', 'score': 0.9},
        {'size': 1e12, 'ability': 'Multi-modal', 'score': 0.95}
    ]
    
    abilities_df = pd.DataFrame(abilities)
    fig.add_trace(
        go.Scatter(
            x=np.log10(abilities_df['size']),
            y=abilities_df['score'],
            mode='lines+markers+text',  # include 'text' so ability labels are drawn
            name='Abilities',
            marker=dict(size=15, color='purple'),
            text=abilities_df['ability'],
            textposition='top center'
        ),
        row=2, col=2
    )
    
    # Update layout
    fig.update_xaxes(title_text="Log10(Parameters)", row=1, col=1)
    fig.update_yaxes(title_text="Performance", row=1, col=1)
    fig.update_xaxes(title_text="Year", row=1, col=2)
    fig.update_yaxes(title_text="Log10(Parameters)", row=1, col=2)
    fig.update_xaxes(title_text="Architecture Type", row=2, col=1)
    fig.update_yaxes(title_text="Count", row=2, col=1)
    fig.update_xaxes(title_text="Log10(Model Size)", row=2, col=2)
    fig.update_yaxes(title_text="Capability Score", row=2, col=2)
    
    fig.update_layout(height=800, showlegend=True, title_text="Language Model Scaling Analysis")
    fig.show()
    
    # Print insights
    print("\n📊 Key Scaling Insights:")
    print(f"1. Performance Range: {df['performance'].min():.2f} to {df['performance'].max():.2f}")
    print(f"2. Parameter Range: {df['params'].min()/1e6:.0f}M to {df['params'].max()/1e12:.0f}T")
    print(f"3. Years Covered: {df['year'].min()} to {df['year'].max()}")
    # mode() returns every tied value, so report all of them
    print(f"4. Most Common Architecture(s): {', '.join(df['type'].mode())}")
    
    return df

# Analyze scaling laws
print("\n📈 Language Model Scaling Laws Analysis\n")
scaling_df = analyze_scaling_laws()

print("\n📋 Model Comparison Table:")
print(scaling_df.to_string(index=False))

---
## Part 10: Summary and Practice Exercises

### 🎓 Key Takeaways

1. **Paradigm Shift**: From task-specific to general-purpose foundation models
2. **Pre-training**: Self-supervised learning on massive unlabeled data
3. **Architecture Types**:
   - Encoder (BERT): Bidirectional understanding
   - Decoder (GPT): Autoregressive generation
   - Encoder-Decoder (T5): Flexible sequence-to-sequence
4. **Fine-tuning**: Adapt pre-trained models to specific tasks
5. **Prompting**: In-context learning without parameter updates
6. **RLHF**: Align models with human preferences
7. **Scaling Laws**: Performance improves predictably, with diminishing returns, as model size, data, and compute grow

### 📝 Practice Exercises

In [None]:
# Exercise 10: Comprehensive Practice

print("\n🎯 Practice Exercises\n")
print("="*60)

# Exercise 1: Custom Tokenization
print("\n📌 Exercise 1: Implement custom tokenization")
print("Task: Create a function that compares tokenization across languages")
print("Hint: Use multiple sentences in different languages")

# Exercise 2: MLM Task
print("\n📌 Exercise 2: Create domain-specific MLM")
print("Task: Fine-tune BERT for medical/legal domain MLM")
print("Hint: Use domain-specific vocabulary")

# Exercise 3: Few-shot Classification
print("\n📌 Exercise 3: Build few-shot classifier")
print("Task: Create a few-shot emotion classifier")
print("Hint: Use 5 emotions with 3 examples each")

# Exercise 4: Prompt Optimization
print("\n📌 Exercise 4: Optimize prompts automatically")
print("Task: Implement a prompt search algorithm")
print("Hint: Use a genetic algorithm or beam search")

# Exercise 5: PEFT Implementation
print("\n📌 Exercise 5: Implement Prefix Tuning")
print("Task: Create prefix tuning for sentiment analysis")
print("Hint: Optimize continuous prompt vectors")

print("\n" + "="*60)
print("\n🎉 Congratulations! You've completed the LLM hands-on tutorial!")
print("\n📚 Further Reading:")
print("- Attention Is All You Need (Vaswani et al., 2017)")
print("- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018)")
print("- Language Models are Few-Shot Learners (Brown et al., 2020)")
print("- LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)")
print("- Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022)")
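
As a possible starting point for Exercise 3, the sketch below assembles a few-shot classification prompt from labeled examples. The helper name `build_few_shot_prompt`, the prompt wording, and the example texts are all illustrative assumptions. The resulting string would still need to be passed to a generative model, and that step is not shown.

```python
from typing import Dict, List

def build_few_shot_prompt(examples: Dict[str, List[str]], query: str) -> str:
    """Format labeled examples plus one unlabeled query as a few-shot prompt."""
    lines = [
        "Classify the emotion of each text.",
        "Labels: " + ", ".join(examples),
        "",
    ]
    for label, texts in examples.items():
        for text in texts:
            lines.append(f"Text: {text}")
            lines.append(f"Emotion: {label}")
            lines.append("")
    # End with the query and an open label slot for the model to complete
    lines.append(f"Text: {query}")
    lines.append("Emotion:")
    return "\n".join(lines)

# Made-up demonstration examples; the exercise suggests 5 emotions x 3 examples
demo_examples = {
    "joy": ["I passed the exam!", "What a wonderful surprise."],
    "anger": ["They ignored my complaint again."],
}
print(build_few_shot_prompt(demo_examples, "I can't believe we won."))
```

Grouping the examples by label keeps the sketch short. Shuffling them is a common variation, since the order of in-context examples can affect predictions.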

---

## 🎯 Final Project Ideas

1. **Build a Custom ChatBot**: Fine-tune GPT-2 on domain-specific data
2. **Create a Fact-Checker**: Use BERT for claim verification
3. **Develop a Code Generator**: Implement few-shot code generation
4. **Design a Prompt Library**: Create optimized prompts for various tasks
5. **Implement RLHF**: Build a simple preference learning system

## 📧 Contact

For questions or feedback:
- Original Lecture: Ho-min Park (homin.park@ghent.ac.kr)
- Notebook Creation: AI-Generated Interactive Tutorial

---

**Thank you for learning!** 🚀