# Data Preprocessing and Tokenization for Chatbots

**Learning Objectives:**
- Understand text preprocessing fundamentals for NLP
- Implement a custom tokenizer from scratch
- Build vocabulary with frequency analysis
- Create dataset classes for conversational data
- Compare custom vs pre-built tokenization solutions

**Prerequisites:**
- PyTorch fundamentals (Notebook 01)
- Tensor operations for NLP (Notebook 02)
- Basic understanding of text processing

---

## Introduction

Text preprocessing and tokenization come first in any NLP pipeline: before we can train a chatbot, we need to convert raw text into numerical representations that a neural network can process. This notebook builds these components from scratch so that the underlying concepts are clear before we turn to pre-built solutions.

## 1. Imports and Setup

Let's start by importing the necessary libraries and setting up our environment.

In [None]:
# Standard library imports
import json
import re
import string
from collections import Counter, defaultdict
from typing import List, Dict, Tuple, Optional, Any
from dataclasses import dataclass
import os

# Data science and visualization
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# PyTorch imports
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

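# Set up plotting style ('seaborn-v0_8' exists in matplotlib >= 3.6;
# older releases name the same style 'seaborn')
style_name = 'seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn'
plt.style.use(style_name)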
sns.set_palette("husl")

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("Libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"Device available: {'CUDA' if torch.cuda.is_available() else 'CPU'}")

## 2. Text Preprocessing Pipeline

Before tokenization, we need to clean and normalize our text data. This involves several steps:

1. **Lowercasing**: Convert all text to lowercase for consistency
2. **Punctuation handling**: Decide how to handle punctuation marks
3. **Special character removal**: Remove or replace special characters
4. **Whitespace normalization**: Handle multiple spaces, tabs, newlines
5. **Contraction expansion**: Convert contractions (don't → do not)

Let's implement each step with educational explanations.
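
For step 2, removing punctuation is not the only option. Here is a minimal sketch of an alternative, using a hypothetical helper `separate_punctuation` that is not part of the preprocessor class: it pads sentence punctuation with spaces so that a whitespace tokenizer sees `today` and `?` as separate tokens instead of the single token `today?`.

```python
import re

def separate_punctuation(text: str) -> str:
    """Put spaces around sentence punctuation so it splits off as its own token.

    Hypothetical helper for illustration. Apostrophes are left alone so that
    contractions stay intact for a later expansion step.
    """
    # Surround each of . , ! ? ; : with spaces
    text = re.sub(r"([.,!?;:])", r" \1 ", text)
    # Collapse the extra whitespace introduced above
    return re.sub(r"\s+", " ", text).strip()

print(separate_punctuation("Hello! How are you today?"))
# Hello ! How are you today ?
```

This keeps the question mark available to the model while avoiding a separate vocabulary entry for every word-plus-punctuation combination.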

In [None]:
class TextPreprocessor:
    """
    Educational text preprocessor that demonstrates common NLP preprocessing steps.
    
    This class implements various text cleaning and normalization techniques
    commonly used in chatbot development. Each method includes detailed
    explanations of why the preprocessing step is important.
    """
    
    def __init__(self, 
                 lowercase: bool = True,
                 remove_punctuation: bool = False,
                 expand_contractions: bool = True,
                 remove_extra_whitespace: bool = True):
        """
        Initialize the preprocessor with configuration options.
        
        Args:
            lowercase: Whether to convert text to lowercase
            remove_punctuation: Whether to remove punctuation marks
            expand_contractions: Whether to expand contractions
            remove_extra_whitespace: Whether to normalize whitespace
        
        Educational Note:
        - Lowercasing reduces vocabulary size but loses capitalization information
        - Punctuation can be important for chatbots (questions vs statements)
        - Contractions should be expanded for consistency
        """
        self.lowercase = lowercase
        self.remove_punctuation = remove_punctuation
        self.expand_contractions = expand_contractions
        self.remove_extra_whitespace = remove_extra_whitespace
        
        # Common contractions mapping for educational purposes
        self.contractions = {
            "don't": "do not",
            "won't": "will not",
            "can't": "cannot",
            "n't": " not",
            "'re": " are",
            "'ve": " have",
            "'ll": " will",
            "'d": " would",
            "'m": " am"
        }
    
    def expand_contractions_text(self, text: str) -> str:
        """
        Expand contractions in the text.
        
        Educational Note:
        Contractions like "don't" are expanded to "do not" because:
        1. It keeps the vocabulary consistent ("don't" and "do not" share tokens)
        2. It exposes negation as a separate "not" token
        3. It reduces the number of unique tokens
        """
        if not self.expand_contractions:
            return text
            
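        # Match case-insensitively. Specific entries ("don't") come first in
        # the dict, so they are expanded before the generic suffix rules.
        for contraction, expansion in self.contractions.items():
            # Whole-word entries need a leading word boundary. The generic
            # "n't" suffix follows a letter ("isn't"), where \b never matches,
            # so it is anchored at the end only.
            prefix = '' if contraction == "n't" else r'\b'
            pattern = prefix + re.escape(contraction) + r'\b'
            text = re.sub(pattern, expansion, text, flags=re.IGNORECASE)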
            
        return text
    
    def normalize_whitespace(self, text: str) -> str:
        """
        Normalize whitespace in the text.
        
        Educational Note:
        Multiple spaces, tabs, and newlines can create inconsistencies.
        We normalize them to single spaces for cleaner tokenization.
        """
        if not self.remove_extra_whitespace:
            return text
            
        # Replace multiple whitespace characters with single space
        text = re.sub(r'\s+', ' ', text)
        
        # Remove leading and trailing whitespace
        text = text.strip()
        
        return text
    
    def remove_punctuation_text(self, text: str) -> str:
        """
        Remove punctuation from text.
        
        Educational Note:
        Removing punctuation reduces vocabulary size but loses important
        information like questions (?) vs statements (.). For chatbots,
        we might want to keep some punctuation.
        """
        if not self.remove_punctuation:
            return text
            
        # Remove every character in string.punctuation (ASCII punctuation only)
        translator = str.maketrans('', '', string.punctuation)
        return text.translate(translator)
    
    def preprocess(self, text: str) -> str:
        """
        Apply all preprocessing steps to the input text.
        
        Educational Note:
        The order of preprocessing steps matters:
        1. Expand contractions first (punctuation removal would strip the
           apostrophes they depend on)
        2. Handle punctuation
        3. Normalize case
        4. Clean whitespace last
        """
        # Step 1: Expand contractions (while apostrophes are still present)
        text = self.expand_contractions_text(text)
        
        # Step 2: Handle punctuation
        text = self.remove_punctuation_text(text)
        
        # Step 3: Convert to lowercase
        if self.lowercase:
            text = text.lower()
        
        # Step 4: Normalize whitespace
        text = self.normalize_whitespace(text)
        
        return text

# Test the preprocessor with examples
preprocessor = TextPreprocessor()

test_texts = [
    "Hello! How are you today?",
    "I don't think that's right...",
    "What's   the    weather like?",
    "Can't we do better than this?"
]

print("Text Preprocessing Examples:")
print("=" * 50)
for text in test_texts:
    processed = preprocessor.preprocess(text)
    print(f"Original:  '{text}'")
    print(f"Processed: '{processed}'")
    print()

## 3. Custom Tokenizer Implementation

Now let's build a custom tokenizer from scratch. A tokenizer converts text into tokens (words, subwords, or characters) and maps them to numerical IDs.

**Key Concepts:**
- **Vocabulary**: The set of all unique tokens
- **Token-to-ID mapping**: Each token gets a unique integer ID
- **Special tokens**: `<PAD>` (padding), `<UNK>` (out-of-vocabulary words), `<SOS>` and `<EOS>` (start and end of a sequence)
- **Encoding**: Convert text to token IDs
- **Decoding**: Convert token IDs back to text
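
The concepts above fit in a few lines. This is a toy sketch with a made-up two-sentence corpus, not the tokenizer class itself; the function names `encode` and `decode` are chosen here for illustration.

```python
# Toy vocabulary: special tokens first, then words in order of appearance
special_tokens = ["<PAD>", "<UNK>", "<SOS>", "<EOS>"]
corpus = ["hello there", "hello world"]

token_to_id = {tok: i for i, tok in enumerate(special_tokens)}
for sentence in corpus:
    for word in sentence.split():
        # Assign the next free ID only if the word is new
        token_to_id.setdefault(word, len(token_to_id))
id_to_token = {i: tok for tok, i in token_to_id.items()}

def encode(text):
    """Map words to IDs, wrapped in <SOS>/<EOS>; unseen words become <UNK>."""
    unk = token_to_id["<UNK>"]
    ids = [token_to_id.get(w, unk) for w in text.split()]
    return [token_to_id["<SOS>"]] + ids + [token_to_id["<EOS>"]]

def decode(ids):
    """Map IDs back to tokens and join them with spaces."""
    return " ".join(id_to_token[i] for i in ids)

ids = encode("hello brave world")
print(ids)          # [2, 4, 1, 6, 3]
print(decode(ids))  # <SOS> hello <UNK> world <EOS>
```

Note that decoding does not recover "brave": once a word is mapped to `<UNK>`, the original text is lost.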

In [None]:
class EducationalTokenizer:
    """
    A custom tokenizer implementation for educational purposes.
    
    This tokenizer demonstrates the core concepts of tokenization:
    - Vocabulary building from training data
    - Token-to-ID and ID-to-token mappings
    - Handling unknown tokens
    - Special tokens for sequence modeling
    
    Educational Note:
    While production systems use sophisticated tokenizers (BPE, WordPiece),
    this simple word-level tokenizer helps understand the fundamentals.
    """
    
    def __init__(self, 
                 max_vocab_size: int = 10000,
                 min_frequency: int = 1,
                 special_tokens: Optional[List[str]] = None):
        """
        Initialize the tokenizer.
        
        Args:
            max_vocab_size: Maximum number of tokens in vocabulary
            min_frequency: Minimum frequency for a token to be included
            special_tokens: List of special tokens to include
        
        Educational Note:
        - max_vocab_size controls memory usage and model size
        - min_frequency helps filter out rare/noisy tokens
        - Special tokens handle sequence boundaries and unknown words
        """
        self.max_vocab_size = max_vocab_size
        self.min_frequency = min_frequency
        
        # Define special tokens
        if special_tokens is None:
            self.special_tokens = ['<PAD>', '<UNK>', '<SOS>', '<EOS>']
        else:
            self.special_tokens = special_tokens
        
        # Initialize mappings
        self.token_to_id = {}
        self.id_to_token = {}
        self.token_frequencies = Counter()
        self.vocab_size = 0
        
        # Add special tokens first (they get IDs 0, 1, 2, ...)
        for token in self.special_tokens:
            self._add_token(token)
    
    def _add_token(self, token: str) -> int:
        """
        Add a token to the vocabulary.
        
        Educational Note:
        This method maintains the bidirectional mapping between
        tokens and their integer IDs. The ID assignment is sequential.
        """
        if token not in self.token_to_id:
            token_id = len(self.token_to_id)
            self.token_to_id[token] = token_id
            self.id_to_token[token_id] = token
            self.vocab_size += 1
        return self.token_to_id[token]
    
    def tokenize(self, text: str) -> List[str]:
        """
        Split text into tokens.
        
        Educational Note:
        This is a simple word-level tokenization. More sophisticated
        approaches include:
        - Subword tokenization (BPE, WordPiece)
        - Character-level tokenization
        - SentencePiece (subword tokenization applied to raw text)
        """
        # Simple whitespace tokenization: punctuation stays attached to the
        # neighboring word, so "you?" and "you" are different tokens
        tokens = text.strip().split()
        return tokens
    
    def build_vocabulary(self, texts: List[str], preprocessor: TextPreprocessor = None):
        """
        Build vocabulary from a list of texts.
        
        Educational Note:
        Vocabulary building involves:
        1. Tokenizing all texts
        2. Counting token frequencies
        3. Filtering by frequency and vocabulary size
        4. Creating token-to-ID mappings
        """
        print("Building vocabulary...")
        
        # Count token frequencies
        all_tokens = []
        for text in texts:
            # Preprocess if preprocessor is provided
            if preprocessor:
                text = preprocessor.preprocess(text)
            
            # Tokenize and collect tokens
            tokens = self.tokenize(text)
            all_tokens.extend(tokens)
            self.token_frequencies.update(tokens)
        
        print(f"Total tokens found: {len(all_tokens)}")
        print(f"Unique tokens found: {len(self.token_frequencies)}")
        
        # Filter tokens by frequency and add to vocabulary
        # Sort by frequency (descending) to keep most common tokens
        sorted_tokens = self.token_frequencies.most_common()
        
        added_tokens = 0
        for token, freq in sorted_tokens:
            # Stop once the vocabulary is full
            if len(self.token_to_id) >= self.max_vocab_size:
                break
            
            # Stop at the first token below min_frequency: the list is sorted
            # by descending frequency, so all remaining tokens are rarer still
            if freq < self.min_frequency:
                break
            
            # Add the token unless it is already present (e.g. a special token)
            if token not in self.token_to_id:
                self._add_token(token)
                added_tokens += 1
        
        print(f"Vocabulary built with {self.vocab_size} tokens")
        print(f"Added {added_tokens} new tokens (excluding special tokens)")
    
    def encode(self, text: str, 
               add_special_tokens: bool = True,
               preprocessor: TextPreprocessor = None) -> List[int]:
        """
        Convert text to token IDs.
        
        Educational Note:
        Encoding involves:
        1. Preprocessing the text
        2. Tokenizing into words/subwords
        3. Converting tokens to IDs
        4. Handling unknown tokens
        5. Adding special tokens if needed
        """
        # Preprocess if preprocessor is provided
        if preprocessor:
            text = preprocessor.preprocess(text)
        
        # Tokenize
        tokens = self.tokenize(text)
        
        # Convert tokens to IDs
        token_ids = []
        
        # Add start-of-sequence token
        if add_special_tokens and '<SOS>' in self.token_to_id:
            token_ids.append(self.token_to_id['<SOS>'])
        
        # Convert each token to ID
        unk_id = self.token_to_id.get('<UNK>', 1)  # Default to ID 1
        for token in tokens:
            token_id = self.token_to_id.get(token, unk_id)
            token_ids.append(token_id)
        
        # Add end-of-sequence token
        if add_special_tokens and '<EOS>' in self.token_to_id:
            token_ids.append(self.token_to_id['<EOS>'])
        
        return token_ids
    
    def decode(self, token_ids: List[int], 
               skip_special_tokens: bool = True) -> str:
        """
        Convert token IDs back to text.
        
        Educational Note:
        Decoding is the reverse of encoding:
        1. Convert IDs back to tokens
        2. Handle special tokens
        3. Join tokens into text
        """
        tokens = []
        
        for token_id in token_ids:
            if token_id in self.id_to_token:
                token = self.id_to_token[token_id]
                
                # Skip special tokens if requested
                if skip_special_tokens and token in self.special_tokens:
                    continue
                
                tokens.append(token)
            else:
                # Handle invalid token IDs
                tokens.append('<UNK>')
        
        # Join tokens with spaces
        return ' '.join(tokens)
    
    def get_vocab_info(self) -> Dict[str, Any]:
        """
        Get information about the vocabulary.
        
        Educational Note:
        Vocabulary statistics help understand the tokenizer's behavior
        and can guide hyperparameter tuning.
        """
        return {
            'vocab_size': self.vocab_size,
            'special_tokens': self.special_tokens,
            'most_common_tokens': self.token_frequencies.most_common(10),
            'total_token_occurrences': sum(self.token_frequencies.values())
        }

# Test the tokenizer
print("Testing Custom Tokenizer:")
print("=" * 40)

# Create sample texts for testing
sample_texts = [
    "Hello, how are you?",
    "I am doing well, thank you!",
    "What is machine learning?",
    "Machine learning is fascinating."
]

# Initialize tokenizer and preprocessor
tokenizer = EducationalTokenizer(max_vocab_size=100)
preprocessor = TextPreprocessor()

# Build vocabulary
tokenizer.build_vocabulary(sample_texts, preprocessor)

# Test encoding and decoding
test_text = "Hello, what is machine learning?"
print(f"\nTest text: '{test_text}'")

# Encode
token_ids = tokenizer.encode(test_text, preprocessor=preprocessor)
print(f"Encoded IDs: {token_ids}")

# Decode
decoded_text = tokenizer.decode(token_ids)
print(f"Decoded text: '{decoded_text}'")

# Show vocabulary info
vocab_info = tokenizer.get_vocab_info()
print(f"\nVocabulary Info:")
for key, value in vocab_info.items():
    print(f"  {key}: {value}")

## 4. Vocabulary Analysis and Visualization

Understanding your vocabulary is crucial for building effective chatbots. Let's analyze token frequencies, distribution, and create visualizations to better understand our data.
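
One statistic worth computing by hand is vocabulary coverage: the share of all token occurrences that the `k` most frequent tokens account for. A minimal sketch, where `vocabulary_coverage` is a hypothetical helper and the sentence is made up:

```python
from collections import Counter

def vocabulary_coverage(tokens, vocab_size):
    """Fraction of token occurrences covered by the `vocab_size` most frequent tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(freq for _, freq in counts.most_common(vocab_size))
    return covered / total

tokens = "the cat sat on the mat and the dog sat too".split()
for k in (1, 2, 5):
    print(f"top {k}: {vocabulary_coverage(tokens, k):.2f}")
# top 1: 0.27
# top 2: 0.45
# top 5: 0.73
```

On a real corpus, coverage typically rises quickly and then flattens, which is one way to choose `max_vocab_size`: past a certain point, each extra vocabulary entry covers very few additional occurrences.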

In [None]:
def analyze_vocabulary(tokenizer: EducationalTokenizer, 
                      texts: List[str], 
                      preprocessor: TextPreprocessor = None):
    """
    Perform comprehensive vocabulary analysis with visualizations.
    
    Educational Note:
    Vocabulary analysis helps us understand:
    - Token frequency distribution (Zipf's law)
    - Vocabulary coverage
    - Most/least common tokens
    - Text length statistics
    """
    print("Performing Vocabulary Analysis...")
    print("=" * 50)
    
    # Collect all tokens and their frequencies
    all_tokens = []
    text_lengths = []
    
    for text in texts:
        if preprocessor:
            processed_text = preprocessor.preprocess(text)
        else:
            processed_text = text
        
        tokens = tokenizer.tokenize(processed_text)
        all_tokens.extend(tokens)
        text_lengths.append(len(tokens))
    
    # Basic statistics
    unique_tokens = len(set(all_tokens))
    total_tokens = len(all_tokens)
    avg_text_length = np.mean(text_lengths)
    
    print(f"Total tokens: {total_tokens:,}")
    print(f"Unique tokens: {unique_tokens:,}")
    print(f"Vocabulary diversity: {unique_tokens/total_tokens:.3f}")
    print(f"Average text length: {avg_text_length:.1f} tokens")
    print(f"Text length range: {min(text_lengths)} - {max(text_lengths)} tokens")
    
    # Create visualizations
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    
    # 1. Token frequency distribution (top 20)
    # Count from the texts passed in so the plots match the statistics above,
    # even if the tokenizer's vocabulary was built from different data
    token_counts = Counter(all_tokens)
    most_common = token_counts.most_common(20)
    tokens, frequencies = zip(*most_common)
    
    axes[0, 0].bar(range(len(tokens)), frequencies)
    axes[0, 0].set_xticks(range(len(tokens)))
    axes[0, 0].set_xticklabels(tokens, rotation=45, ha='right')
    axes[0, 0].set_title('Top 20 Most Frequent Tokens')
    axes[0, 0].set_ylabel('Frequency')
    
    # 2. Zipf's law visualization (log-log plot)
    all_frequencies = [freq for token, freq in token_counts.most_common()]
    ranks = range(1, len(all_frequencies) + 1)
    
    axes[0, 1].loglog(ranks, all_frequencies, 'b-', alpha=0.7)
    axes[0, 1].set_title("Zipf's Law: Token Frequency vs Rank")
    axes[0, 1].set_xlabel('Rank (log scale)')
    axes[0, 1].set_ylabel('Frequency (log scale)')
    axes[0, 1].grid(True, alpha=0.3)
    
    # 3. Text length distribution
    axes[1, 0].hist(text_lengths, bins=20, alpha=0.7, edgecolor='black')
    axes[1, 0].axvline(avg_text_length, color='red', linestyle='--', 
                       label=f'Mean: {avg_text_length:.1f}')
    axes[1, 0].set_title('Distribution of Text Lengths')
    axes[1, 0].set_xlabel('Number of Tokens')
    axes[1, 0].set_ylabel('Frequency')
    axes[1, 0].legend()
    
    # 4. Vocabulary growth curve
    vocab_sizes = []
    seen_tokens = set()
    
    for i, token in enumerate(all_tokens):
        seen_tokens.add(token)
        if i % 10 == 0:  # Sample every 10 tokens for efficiency
            vocab_sizes.append(len(seen_tokens))
    
    sample_points = range(0, len(all_tokens), 10)
    axes[1, 1].plot(sample_points[:len(vocab_sizes)], vocab_sizes)
    axes[1, 1].set_title('Vocabulary Growth Curve')
    axes[1, 1].set_xlabel('Number of Tokens Processed')
    axes[1, 1].set_ylabel('Unique Tokens Seen')
    axes[1, 1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Educational insights
    print("\nEducational Insights:")
    print("-" * 30)
    print("1. Zipf's Law: Token frequencies roughly follow a power law")
    print("   - Few tokens are very frequent, many tokens are rare")
    print("   - This affects vocabulary size decisions")
    
    print("\n2. Vocabulary Diversity:")
    diversity = unique_tokens / total_tokens
    if diversity > 0.7:
        print("   - High diversity: Many unique tokens, might need larger vocabulary")
    elif diversity > 0.3:
        print("   - Medium diversity: Balanced token distribution")
    else:
        print("   - Low diversity: Many repeated tokens, smaller vocabulary sufficient")
    
    print("\n3. Text Length Distribution:")
    if max(text_lengths) > 3 * avg_text_length:
        print("   - High variance in text lengths, consider sequence length limits")
    else:
        print("   - Consistent text lengths, good for batch processing")
    
    return {
        'total_tokens': total_tokens,
        'unique_tokens': unique_tokens,
        'diversity': diversity,
        'avg_length': avg_text_length,
        'length_range': (min(text_lengths), max(text_lengths))
    }

## 5. Conversational Dataset Class

Now let's create a PyTorch Dataset class specifically designed for conversational data. This will handle loading, preprocessing, and batching of conversation pairs for training.
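
It helps to see the input format first. The sketch below writes a small made-up file and extracts user-to-bot pairs from it. The schema (`messages`, `speaker`, `text`, `context`, `metadata`) is inferred from the keys the dataset's loader reads; the file name and the sample content are assumptions for illustration.

```python
import json
import os
import tempfile

# Made-up sample data using the keys the loader reads
sample_conversations = [
    {
        "context": "greeting",
        "metadata": {"source": "example"},
        "messages": [
            {"speaker": "user", "text": "Hi there!"},
            {"speaker": "bot", "text": "Hello! How can I help?"},
            {"speaker": "user", "text": "What is tokenization?"},
            {"speaker": "bot", "text": "Splitting text into tokens."},
        ],
    }
]

# Write to a temporary file, then read it back the way a loader would
path = os.path.join(tempfile.mkdtemp(), "sample_conversations.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample_conversations, f, indent=2)

with open(path, "r", encoding="utf-8") as f:
    conversations = json.load(f)

# Pair each user message with the bot reply that immediately follows it
pairs = []
for conv in conversations:
    messages = conv.get("messages", [])
    for current, following in zip(messages, messages[1:]):
        if current.get("speaker") == "user" and following.get("speaker") == "bot":
            pairs.append((current["text"], following["text"]))

for user_text, bot_text in pairs:
    print(f"user: {user_text!r} -> bot: {bot_text!r}")
```

A four-message conversation yields two training pairs here; a bot message followed by a user message is not a pair, so it is skipped.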

In [None]:
@dataclass
class ConversationPair:
    """
    Data class representing a single conversation pair.
    
    Educational Note:
    Using dataclasses makes the code more readable and provides
    automatic __init__, __repr__, and other methods.
    """
    input_text: str
    target_text: str
    context: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None
    
    def __post_init__(self):
        """Initialize metadata if not provided."""
        if self.metadata is None:
            self.metadata = {}

class ConversationalDataset(Dataset):
    """
    PyTorch Dataset for conversational data.
    
    This dataset handles:
    - Loading conversation data from JSON files
    - Converting conversations to input-output pairs
    - Tokenizing and encoding text
    - Padding sequences to consistent lengths
    
    Educational Note:
    Inheriting from torch.utils.data.Dataset allows us to use
    PyTorch's DataLoader for efficient batching and shuffling.
    """
    
    def __init__(self, 
                 data_path: str,
                 tokenizer: EducationalTokenizer,
                 preprocessor: TextPreprocessor = None,
                 max_length: int = 128,
                 include_context: bool = False):
        """
        Initialize the conversational dataset.
        
        Args:
            data_path: Path to JSON file containing conversations
            tokenizer: Tokenizer for encoding text
            preprocessor: Text preprocessor (optional)
            max_length: Maximum sequence length for padding/truncation
            include_context: Whether to include conversation context
        
        Educational Note:
        - max_length controls memory usage and training efficiency
        - Context can provide additional information for responses
        """
        self.tokenizer = tokenizer
        self.preprocessor = preprocessor
        self.max_length = max_length
        self.include_context = include_context
        
        # Load and process conversation data
        self.conversation_pairs = self._load_conversations(data_path)
        
        print(f"Loaded {len(self.conversation_pairs)} conversation pairs")
    
    def _load_conversations(self, data_path: str) -> List[ConversationPair]:
        """
        Load conversations from JSON file and convert to pairs.
        
        Educational Note:
        We convert multi-turn conversations into input-output pairs.
        Each user message becomes an input, and the following bot
        response becomes the target output.
        """
        pairs = []
        
        try:
            with open(data_path, 'r', encoding='utf-8') as f:
                conversations = json.load(f)
        except FileNotFoundError:
            print(f"Error: Could not find data file at {data_path}")
            return []
        except json.JSONDecodeError as e:
            print(f"Error: Invalid JSON format in {data_path}: {e}")
            return []
        
        for conv in conversations:
            messages = conv.get('messages', [])
            context = conv.get('context', '')
            metadata = conv.get('metadata', {})
            
            # Extract input-output pairs from conversation
            for i in range(len(messages) - 1):
                current_msg = messages[i]
                next_msg = messages[i + 1]
                
                # Create pair if current is user and next is bot
                if (current_msg.get('speaker') == 'user' and 
                    next_msg.get('speaker') == 'bot'):
                    
                    input_text = current_msg.get('text', '')
                    target_text = next_msg.get('text', '')
                    
                    # Include context if requested
                    if self.include_context and context:
                        input_text = f"Context: {context}. User: {input_text}"
                    
                    pair = ConversationPair(
                        input_text=input_text,
                        target_text=target_text,
                        context=context,
                        metadata=metadata
                    )
                    pairs.append(pair)
        
        return pairs
    
    def _pad_sequence(self, token_ids: List[int], max_length: int) -> List[int]:
        """
        Pad or truncate sequence to specified length.
        
        Educational Note:
        Padding ensures all sequences in a batch have the same length,
        which is required for efficient tensor operations.
        """
        pad_id = self.tokenizer.token_to_id.get('<PAD>', 0)
        
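        if len(token_ids) > max_length:
            # Truncate if too long, but keep a trailing <EOS> so the
            # shortened sequence still ends with an explicit end marker
            truncated = token_ids[:max_length]
            eos_id = self.tokenizer.token_to_id.get('<EOS>')
            if truncated and eos_id is not None and token_ids[-1] == eos_id:
                truncated[-1] = eos_id
            return truncated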
        else:
            # Pad if too short
            padding_length = max_length - len(token_ids)
            return token_ids + [pad_id] * padding_length
    
    def __len__(self) -> int:
        """
        Return the number of conversation pairs.
        
        Educational Note:
        This method is required by PyTorch's Dataset interface.
        """
        return len(self.conversation_pairs)
    
    def __getitem__(self, idx: int) -> Dict[str, Any]:
        """
        Get a single conversation pair as tensors.
        
        Educational Note:
        This method is called by PyTorch's DataLoader to get individual
        samples. We return tensors ready for model training.
        """
        pair = self.conversation_pairs[idx]
        
        # Encode input and target texts
        input_ids = self.tokenizer.encode(
            pair.input_text, 
            add_special_tokens=True,
            preprocessor=self.preprocessor
        )
        
        target_ids = self.tokenizer.encode(
            pair.target_text,
            add_special_tokens=True, 
            preprocessor=self.preprocessor
        )
        
        # Pad sequences
        input_ids = self._pad_sequence(input_ids, self.max_length)
        target_ids = self._pad_sequence(target_ids, self.max_length)
        
        # Create attention masks (1 for real tokens, 0 for padding)
        pad_id = self.tokenizer.token_to_id.get('<PAD>', 0)
        input_mask = [1 if token_id != pad_id else 0 for token_id in input_ids]
        target_mask = [1 if token_id != pad_id else 0 for token_id in target_ids]
        
        return {
            'input_ids': torch.tensor(input_ids, dtype=torch.long),
            'input_mask': torch.tensor(input_mask, dtype=torch.long),
            'target_ids': torch.tensor(target_ids, dtype=torch.long),
            'target_mask': torch.tensor(target_mask, dtype=torch.long),
            'input_text': pair.input_text,
            'target_text': pair.target_text
        }
    
    def get_sample_batch(self, batch_size: int = 4) -> Dict[str, Any]:
        """
        Get a sample batch for testing and demonstration.
        
        Educational Note:
        This method helps visualize what the model will receive
        during training.
        """
        dataloader = DataLoader(self, batch_size=batch_size, shuffle=True)
        return next(iter(dataloader))
    
    def analyze_dataset(self):
        """
        Analyze the dataset and provide statistics.
        
        Educational Note:
        Dataset analysis helps understand data characteristics
        and guide preprocessing decisions.
        """
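        # _load_conversations returns an empty list on a missing or invalid file
        if not self.conversation_pairs:
            print("Dataset is empty; nothing to analyze.")
            return
        
        input_lengths = []
        target_lengths = []
        
        for pair in self.conversation_pairs:
            # Measure the encoded length (preprocessed, with <SOS>/<EOS>),
            # since that is what gets compared against max_length
            input_ids = self.tokenizer.encode(
                pair.input_text, preprocessor=self.preprocessor)
            target_ids = self.tokenizer.encode(
                pair.target_text, preprocessor=self.preprocessor)
            
            input_lengths.append(len(input_ids))
            target_lengths.append(len(target_ids))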
        
        print("Dataset Analysis:")
        print("=" * 30)
        print(f"Total conversation pairs: {len(self.conversation_pairs)}")
        print(f"Average input length: {np.mean(input_lengths):.1f} tokens")
        print(f"Average target length: {np.mean(target_lengths):.1f} tokens")
        print(f"Max input length: {max(input_lengths)} tokens")
        print(f"Max target length: {max(target_lengths)} tokens")
        print(f"Sequences longer than max_length ({self.max_length}):")
        print(f"  Input: {sum(1 for l in input_lengths if l > self.max_length)}")
        print(f"  Target: {sum(1 for l in target_lengths if l > self.max_length)}")
        
        # Visualize length distributions
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
        
        ax1.hist(input_lengths, bins=20, alpha=0.7, label='Input')
        ax1.axvline(self.max_length, color='red', linestyle='--', 
                   label=f'Max Length ({self.max_length})')
        ax1.set_title('Input Length Distribution')
        ax1.set_xlabel('Number of Tokens')
        ax1.set_ylabel('Frequency')
        ax1.legend()
        
        ax2.hist(target_lengths, bins=20, alpha=0.7, label='Target', color='orange')
        ax2.axvline(self.max_length, color='red', linestyle='--',
                   label=f'Max Length ({self.max_length})')
        ax2.set_title('Target Length Distribution')
        ax2.set_xlabel('Number of Tokens')
        ax2.set_ylabel('Frequency')
        ax2.legend()
        
        plt.tight_layout()
        plt.show()

## 6. Comparison with Pre-built Tokenizers

Now let's compare our custom tokenizer with production tokenizers such as those from Hugging Face. This will help you understand the trade-offs and when to use each approach.
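
One trade-off is worth seeing in isolation: a word-level tokenizer maps any unseen word to `<UNK>`, while a subword tokenizer breaks it into known pieces. Below is a toy sketch of greedy longest-match-first splitting, the strategy WordPiece-style tokenizers apply at inference time. The function name and the vocabulary are hand-written for illustration; real vocabularies are learned from a corpus.

```python
def greedy_subword_tokenize(word, vocab):
    """Split `word` into the longest matching vocabulary pieces, left to right.

    Continuation pieces carry a '##' prefix, following the WordPiece convention.
    Returns ['<UNK>'] if some part of the word cannot be matched.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["<UNK>"]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary
toy_vocab = {"token", "##ize", "##r", "##s", "learn", "##ing"}

for word in ["tokenizers", "learning", "chatbot"]:
    print(word, "->", greedy_subword_tokenize(word, toy_vocab))
# tokenizers -> ['token', '##ize', '##r', '##s']
# learning -> ['learn', '##ing']
# chatbot -> ['<UNK>']
```

With only six vocabulary entries, two unseen words are still representable, at the cost of longer token sequences.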

In [None]:
# Note: This section demonstrates the concepts even if transformers library isn't installed
# In a real environment, you would install: pip install transformers

def compare_tokenizers(text_samples: List[str]):
    """
    Compare our custom tokenizer with pre-built solutions.
    
    Educational Note:
    This comparison helps understand:
    - Vocabulary efficiency (tokens per text)
    - Handling of unknown words
    - Subword tokenization benefits
    - Processing speed differences
    """
    print("Tokenizer Comparison")
    print("=" * 50)
    
    # Initialize our custom tokenizer
    custom_tokenizer = EducationalTokenizer(max_vocab_size=1000)
    preprocessor = TextPreprocessor()
    
    # Build vocabulary from samples
    custom_tokenizer.build_vocabulary(text_samples, preprocessor)
    
    # Try to import and use Hugging Face tokenizer
    try:
        from transformers import AutoTokenizer
        
        # Use a simple pre-trained tokenizer
        hf_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
        hf_available = True
        print("✓ Hugging Face tokenizer loaded successfully")
        
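    except (ImportError, OSError) as err:
        # ImportError: transformers is not installed.
        # OSError: the tokenizer files could not be downloaded or found in the local cache.
        print(f"⚠ Hugging Face tokenizer not available ({type(err).__name__})")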
        print("  Install with: pip install transformers")
        print("  Showing custom tokenizer analysis only")
        hf_available = False
        hf_tokenizer = None
    
    # Compare tokenization results
    comparison_results = []
    
    for i, text in enumerate(text_samples[:5]):  # Limit to first 5 samples
        print(f"\nExample {i+1}: '{text}'")
        print("-" * 40)
        
        # Custom tokenizer
        custom_tokens = custom_tokenizer.tokenize(preprocessor.preprocess(text))
        custom_ids = custom_tokenizer.encode(text, preprocessor=preprocessor)
        
        print(f"Custom Tokenizer:")
        print(f"  Tokens: {custom_tokens}")
        print(f"  Token count: {len(custom_tokens)}")
        print(f"  Token IDs: {custom_ids[:10]}{'...' if len(custom_ids) > 10 else ''}")
        
        result = {
            'text': text,
            'custom_token_count': len(custom_tokens),
            'custom_tokens': custom_tokens
        }
        
        # Hugging Face tokenizer (if available)
        if hf_available:
            hf_tokens = hf_tokenizer.tokenize(text)
            # The full encoding adds [CLS] and [SEP], so hf_ids has two more
            # entries than hf_tokens. No tensors are needed for printing.
            hf_ids = hf_tokenizer(text)['input_ids']
            
            print(f"\nHugging Face Tokenizer (BERT):")
            print(f"  Tokens: {hf_tokens}")
            print(f"  Token count: {len(hf_tokens)}")
            print(f"  Token IDs: {hf_ids[:10]}{'...' if len(hf_ids) > 10 else ''}")
            
            result.update({
                'hf_token_count': len(hf_tokens),
                'hf_tokens': hf_tokens
            })
        
        comparison_results.append(result)
    
    # Analysis and insights
    print("\n" + "=" * 50)
    print("ANALYSIS AND INSIGHTS")
    print("=" * 50)
    
    # Calculate average token counts
    custom_avg = np.mean([r['custom_token_count'] for r in comparison_results])
    
    print(f"\n1. Token Efficiency:")
    print(f"   Custom tokenizer average: {custom_avg:.1f} tokens per text")
    
    if hf_available:
        hf_avg = np.mean([r['hf_token_count'] for r in comparison_results])
        print(f"   Hugging Face average: {hf_avg:.1f} tokens per text")
        
        efficiency_ratio = custom_avg / hf_avg
        if efficiency_ratio > 1.2:
            print(f"   → Custom tokenizer uses {efficiency_ratio:.1f}x more tokens")
            print(f"     Unusual for word-level tokenization; check whether preprocessing")
            print(f"     (e.g. contraction expansion) is adding tokens")
        elif efficiency_ratio < 0.8:
            print(f"   → Custom tokenizer uses {1/efficiency_ratio:.1f}x fewer tokens")
            print(f"     This is expected: subword tokenizers split rare words into pieces")
        else:
            print(f"   → Similar token counts for both tokenizers")
    
    print(f"\n2. Vocabulary Characteristics:")
    print(f"   Custom vocabulary size: {custom_tokenizer.vocab_size}")
    
    if hf_available:
        print(f"   BERT vocabulary size: {hf_tokenizer.vocab_size:,}")
        print(f"   → Pre-trained tokenizers ship with much larger vocabularies")
        print(f"     Subword pieces, not size alone, let them represent rare/unknown words")
    
    print(f"\n3. Key Differences:")
    print(f"   Custom Tokenizer (Word-level):")
    print(f"   ✓ Simple and interpretable")
    print(f"   ✓ Fast training and inference")
    print(f"   ✓ Good for small, domain-specific datasets")
    print(f"   ✗ Large vocabulary for diverse text")
    print(f"   ✗ Poor handling of unknown words")
    print(f"   ✗ No subword information")
    
    if hf_available:
        print(f"\n   Pre-trained Tokenizer (Subword):")
        print(f"   ✓ Handles unknown words well")
        print(f"   ✓ Efficient vocabulary usage")
        print(f"   ✓ Captures morphological patterns")
        print(f"   ✓ Works across different domains")
        print(f"   ✗ More complex to understand")
        print(f"   ✗ Requires pre-training or large datasets")
        print(f"   ✗ Less interpretable tokens")
    
    print(f"\n4. When to Use Each:")
    print(f"   Use Custom Tokenizer when:")
    print(f"   - Learning tokenization concepts")
    print(f"   - Working with small, controlled datasets")
    print(f"   - Need full control over vocabulary")
    print(f"   - Interpretability is crucial")
    
    print(f"\n   Use Pre-trained Tokenizer when:")
    print(f"   - Building production systems")
    print(f"   - Working with diverse text data")
    print(f"   - Need robust unknown word handling")
    print(f"   - Want to leverage pre-trained models")
    
    return comparison_results

# Test the comparison
sample_texts = [
    "Hello, how are you doing today?",
    "What's the weather like?",
    "I don't understand this complicated explanation.",
    "Can you help me with machine learning?",
    "That's absolutely fantastic!"
]

comparison_results = compare_tokenizers(sample_texts)
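
The subword behaviour referred to in the comparison above can be illustrated without any external library. The following is a minimal sketch of greedy longest-match-first splitting, the general idea behind WordPiece-style tokenizers. The function name `wordpiece_tokenize`, the toy vocabulary, and the `##` continuation marker as used here are assumptions made for illustration; this is not the Hugging Face implementation, which also handles normalization, punctuation, and length limits.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this example only.
toy_vocab = {"token", "##ize", "##r", "##s", "chat", "##bot", "un", "##known"}

for word in ["tokenizers", "chatbots", "unknown", "xyz"]:
    print(f"{word:12} -> {wordpiece_tokenize(word, toy_vocab)}")
```

A word-level tokenizer with the same eight entries would map all four words to the unknown token, whereas the subword split recovers three of them from smaller pieces.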

## 7. Practical Demonstration with Real Data

Let's put everything together and work with our actual conversation data to see the complete preprocessing pipeline in action.

In [None]:
# Load and process the actual conversation data
print("Loading Conversation Data...")
print("=" * 40)

# Path to our conversation data
data_path = '../data/conversations/simple_qa_pairs.json'

# Check if file exists
if not os.path.exists(data_path):
    print(f"Warning: Data file not found at {data_path}")
    print("Creating sample data for demonstration...")
    
    # Create sample data if file doesn't exist
    sample_conversations = [
        {
            "id": "demo_001",
            "messages": [
                {"speaker": "user", "text": "Hello there!"},
                {"speaker": "bot", "text": "Hi! How can I help you today?"}
            ],
            "context": "greeting",
            "metadata": {"category": "greeting"}
        },
        {
            "id": "demo_002",
            "messages": [
                {"speaker": "user", "text": "What is machine learning?"},
                {"speaker": "bot", "text": "Machine learning is a method of data analysis that automates analytical model building."}
            ],
            "context": "educational",
            "metadata": {"category": "education"}
        }
    ]
    
    # Save sample data
    os.makedirs(os.path.dirname(data_path), exist_ok=True)
    with open(data_path, 'w') as f:
        json.dump(sample_conversations, f, indent=2)
    
    print(f"Sample data created at {data_path}")

# Initialize components
preprocessor = TextPreprocessor(
    lowercase=True,
    remove_punctuation=False,  # Keep punctuation for chatbots
    expand_contractions=True,
    remove_extra_whitespace=True
)

tokenizer = EducationalTokenizer(
    max_vocab_size=1000,
    min_frequency=1
)

# Load conversations and extract text for vocabulary building
with open(data_path, 'r') as f:
    conversations = json.load(f)

all_texts = []
for conv in conversations:
    for message in conv.get('messages', []):
        text = message.get('text', '')
        if text:
            all_texts.append(text)

print(f"Extracted {len(all_texts)} messages from conversations")

# Build vocabulary
print("\nBuilding vocabulary from conversation data...")
tokenizer.build_vocabulary(all_texts, preprocessor)

# Analyze vocabulary
print("\nVocabulary Analysis:")
vocab_stats = analyze_vocabulary(tokenizer, all_texts, preprocessor)

# Create dataset
print("\nCreating conversational dataset...")
dataset = ConversationalDataset(
    data_path=data_path,
    tokenizer=tokenizer,
    preprocessor=preprocessor,
    max_length=64,  # Shorter for demo data
    include_context=False
)

# Analyze dataset
dataset.analyze_dataset()

# Show sample batch
print("\nSample Batch from Dataset:")
print("=" * 30)
sample_batch = dataset.get_sample_batch(batch_size=2)

for i in range(len(sample_batch['input_text'])):
    print(f"\nExample {i+1}:")
    print(f"  Input text: '{sample_batch['input_text'][i]}'")
    print(f"  Target text: '{sample_batch['target_text'][i]}'")
    print(f"  Input IDs: {sample_batch['input_ids'][i][:10].tolist()}...")
    print(f"  Target IDs: {sample_batch['target_ids'][i][:10].tolist()}...")
    
    # Decode to verify
    decoded_input = tokenizer.decode(sample_batch['input_ids'][i].tolist())
    decoded_target = tokenizer.decode(sample_batch['target_ids'][i].tolist())
    print(f"  Decoded input: '{decoded_input}'")
    print(f"  Decoded target: '{decoded_target}'")
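
The dataset above turns conversations into input/target examples. As a rough illustration of how such pairs might be derived from the JSON schema used in this section, here is a minimal sketch. The helper `extract_pairs` is hypothetical, and `ConversationalDataset` may pair messages differently (for example when `include_context=True`).

```python
from typing import Dict, List, Tuple


def extract_pairs(conversations: List[Dict]) -> List[Tuple[str, str]]:
    """Pair each user message with the bot reply that directly follows it."""
    pairs = []
    for conv in conversations:
        messages = conv.get("messages", [])
        # Walk over adjacent message pairs: (m0, m1), (m1, m2), ...
        for current, following in zip(messages, messages[1:]):
            if current.get("speaker") == "user" and following.get("speaker") == "bot":
                user_text = current.get("text", "").strip()
                bot_text = following.get("text", "").strip()
                if user_text and bot_text:  # skip empty messages
                    pairs.append((user_text, bot_text))
    return pairs


demo = [
    {
        "id": "demo_001",
        "messages": [
            {"speaker": "user", "text": "Hello there!"},
            {"speaker": "bot", "text": "Hi! How can I help you today?"},
            {"speaker": "user", "text": "What is machine learning?"},
        ],
    }
]

print(extract_pairs(demo))
```

The trailing user message has no reply, so it produces no training pair.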

## 8. Summary and Key Takeaways

In this notebook, we've covered the essential concepts of text preprocessing and tokenization for chatbot development:

### What We Learned:

1. **Text Preprocessing Pipeline**
   - Contraction expansion
   - Case normalization
   - Punctuation handling
   - Whitespace normalization

2. **Custom Tokenizer Implementation**
   - Vocabulary building from training data
   - Token-to-ID mappings
   - Special tokens for sequence modeling
   - Encoding and decoding processes

3. **Vocabulary Analysis**
   - Frequency distributions and Zipf's law
   - Vocabulary diversity metrics
   - Text length statistics
   - Visualization techniques

4. **PyTorch Dataset Integration**
   - Conversational data handling
   - Sequence padding and truncation
   - Attention masks
   - Batch processing

5. **Comparison with Production Tokenizers**
   - Word-level vs subword tokenization
   - Trade-offs in vocabulary size and efficiency
   - When to use custom vs pre-built solutions
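
The padding, truncation, and attention-mask steps from item 4 can be summarized in a few lines of plain Python. This is a simplified sketch: the function name `pad_or_truncate` is hypothetical, and `pad_id=0` is an assumption about where the padding token sits in the vocabulary.

```python
from typing import List, Tuple


def pad_or_truncate(
    token_ids: List[int], max_length: int, pad_id: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (ids, mask), both exactly max_length long."""
    clipped = token_ids[:max_length]           # truncation
    n_pad = max_length - len(clipped)
    padded = clipped + [pad_id] * n_pad        # right-padding
    mask = [1] * len(clipped) + [0] * n_pad    # 1 = real token, 0 = padding
    return padded, mask


ids, mask = pad_or_truncate([5, 8, 13], max_length=5)
print(ids)   # [5, 8, 13, 0, 0]
print(mask)  # [1, 1, 1, 0, 0]

ids, mask = pad_or_truncate([5, 8, 13, 21, 34, 55], max_length=4)
print(ids)   # [5, 8, 13, 21]
print(mask)  # [1, 1, 1, 1]
```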

### Key Insights:

- **Preprocessing matters**: Clean, consistent text leads to better model performance
- **Vocabulary size is a trade-off**: Larger vocabularies cover more words directly, but the embedding and output layers grow with vocabulary size and need more memory
- **Special tokens are crucial**: They help models understand sequence boundaries and handle unknown words
- **Subword tokenization is powerful**: It handles unknown words better than word-level approaches
- **Dataset design affects training**: Proper padding, masking, and batching are essential for efficient training
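
To make the vocabulary-size trade-off concrete, the sketch below estimates the memory taken by an embedding table alone. The embedding dimension of 256 and 4-byte (float32) parameters are assumptions chosen for illustration; 30,522 is the vocabulary size of `bert-base-uncased`, and 1,000 is the `max_vocab_size` used in this notebook.

```python
def embedding_megabytes(vocab_size: int, embed_dim: int, bytes_per_param: int = 4) -> float:
    """Memory of a vocab_size x embed_dim embedding table, in megabytes."""
    return vocab_size * embed_dim * bytes_per_param / 1_000_000


for name, vocab_size in [("custom tokenizer", 1_000), ("bert-base-uncased", 30_522)]:
    mb = embedding_megabytes(vocab_size, embed_dim=256)
    print(f"{name:18} {vocab_size:>7,} tokens -> {mb:6.2f} MB")
```

A model that predicts tokens also needs an output layer of similar size, so the real cost of a larger vocabulary is higher than this single table suggests.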

### Next Steps:

In the next notebook, we'll use these preprocessing tools to build neural network architectures for our chatbot, starting with simple feedforward networks and progressing to more sophisticated architectures.

## 9. Exercises and Challenges

Try these exercises to deepen your understanding:

In [None]:
# Exercise 1: Experiment with different preprocessing options
print("Exercise 1: Preprocessing Experiments")
print("=" * 40)

# TODO: Try different preprocessing configurations and compare results
# 1. Create preprocessors with different settings
# 2. Compare vocabulary sizes and token distributions
# 3. Analyze the impact on a sample text

sample_text = "I don't think that's right! What's your opinion?"

configs = [
    {'name': 'Minimal', 'remove_punctuation': False, 'lowercase': False},
    {'name': 'Standard', 'remove_punctuation': False, 'lowercase': True},
    {'name': 'Aggressive', 'remove_punctuation': True, 'lowercase': True}
]

print(f"Original text: '{sample_text}'")
print()

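for config in configs:
    # Copy the options instead of popping from the dict, so the cell can be re-run
    options = {key: value for key, value in config.items() if key != 'name'}
    # Use a separate name so the pipeline's `preprocessor` from Section 7 is not overwritten
    exp_preprocessor = TextPreprocessor(**options)
    processed = exp_preprocessor.preprocess(sample_text)
    print(f"{config['name']:12}: '{processed}'")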

print("\n" + "="*50)
print("Exercise 2: Custom Tokenization Strategy")
print("=" * 50)

# TODO: Implement a character-level tokenizer
# 1. Create a tokenizer that works at character level
# 2. Compare vocabulary size with word-level tokenizer
# 3. Analyze pros and cons for chatbot applications

class CharacterTokenizer:
    """Simple character-level tokenizer for comparison."""
    
    def __init__(self):
        self.char_to_id = {}
        self.id_to_char = {}
        self.vocab_size = 0
        
        # Add special tokens
        special_chars = ['<PAD>', '<UNK>', '<SOS>', '<EOS>']
        for char in special_chars:
            self._add_char(char)
    
    def _add_char(self, char):
        if char not in self.char_to_id:
            char_id = len(self.char_to_id)
            self.char_to_id[char] = char_id
            self.id_to_char[char_id] = char
            self.vocab_size += 1
    
    def build_vocabulary(self, texts):
        for text in texts:
            for char in text:
                self._add_char(char)
    
    def encode(self, text):
        return [self.char_to_id.get(char, self.char_to_id['<UNK>']) for char in text]
    
    def decode(self, char_ids):
        return ''.join([self.id_to_char.get(cid, '<UNK>') for cid in char_ids])

# Compare character vs word tokenization
char_tokenizer = CharacterTokenizer()
char_tokenizer.build_vocabulary(all_texts)

test_text = "Hello, world!"
word_tokens = tokenizer.encode(test_text, preprocessor=preprocessor)
char_tokens = char_tokenizer.encode(test_text)

print(f"Test text: '{test_text}'")
print(f"Word tokenizer: {len(word_tokens)} tokens, vocab size: {tokenizer.vocab_size}")
print(f"Char tokenizer: {len(char_tokens)} tokens, vocab size: {char_tokenizer.vocab_size}")
print(f"Word token IDs: {word_tokens}")
print(f"Char token IDs: {char_tokens}")

print("\n" + "="*50)
print("Exercise 3: Dataset Optimization")
print("=" * 50)

# TODO: Experiment with different max_length values
# 1. Create datasets with different max_length settings
# 2. Analyze the trade-offs in truncation vs padding
# 3. Find the optimal max_length for your data

max_lengths = [16, 32, 64, 128]
truncation_stats = []

for max_len in max_lengths:
    temp_dataset = ConversationalDataset(
        data_path=data_path,
        tokenizer=tokenizer,
        preprocessor=preprocessor,
        max_length=max_len
    )
    
    # Calculate truncation statistics
    truncated_inputs = 0
    truncated_targets = 0
    total_padding = 0
    
    for i in range(len(temp_dataset)):
        sample = temp_dataset[i]
        
        # Count non-padding tokens
        input_length = sample['input_mask'].sum().item()
        target_length = sample['target_mask'].sum().item()
        
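        # Heuristic: a sequence that fills max_len is counted as truncated,
        # although it may have fit exactly without losing any tokens
        if input_length == max_len:
            truncated_inputs += 1
        if target_length == max_len:
            truncated_targets += 1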
        
        total_padding += (max_len - input_length) + (max_len - target_length)
    
    truncation_stats.append({
        'max_length': max_len,
        'truncated_inputs': truncated_inputs,
        'truncated_targets': truncated_targets,
        # Average padding per sequence (inputs and targets); guard against an empty dataset
        'avg_padding': total_padding / max(len(temp_dataset) * 2, 1)
    })

print("Max Length Analysis:")
print(f"{'Max Len':<8} {'Trunc Input':<12} {'Trunc Target':<13} {'Avg Padding':<12}")
print("-" * 50)
for stats in truncation_stats:
    print(f"{stats['max_length']:<8} {stats['truncated_inputs']:<12} "
          f"{stats['truncated_targets']:<13} {stats['avg_padding']:<12.1f}")

print("\nRecommendation: Choose max_length that minimizes truncation while keeping padding reasonable")
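
One common way to act on this recommendation is to pick the smallest `max_length` that fits a chosen share of sequences without truncation. The sketch below uses the nearest-rank percentile method. The function name `suggest_max_length`, the 0.9 coverage level, and the sample lengths are assumptions for illustration, not values taken from the conversation data.

```python
import math
from typing import List


def suggest_max_length(lengths: List[int], coverage: float = 0.95) -> int:
    """Smallest length that fits at least `coverage` of the sequences untruncated."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths)
    # Nearest-rank percentile: the rank-th shortest sequence sets the limit.
    rank = max(1, math.ceil(coverage * len(ordered)))
    return ordered[rank - 1]


token_lengths = [4, 6, 7, 7, 9, 10, 12, 15, 18, 40]  # made-up token counts

print(suggest_max_length(token_lengths, coverage=0.9))  # 18
print(suggest_max_length(token_lengths, coverage=1.0))  # 40
```

Here a single outlier of 40 tokens would more than double the padded length; covering 90% of the sequences instead truncates only that one example.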