# LSTM vs Transformer: Sentiment Analysis Comparison

## Week 3 - Understanding How Different Models Read Text

### What You'll Learn:
1. **LSTM Model** - Sequential text processing
2. **Transformer Model** - Parallel attention-based processing
3. **Direct Comparison** - Performance, speed, and understanding
4. **Why Transformers Win** - Understanding the advantages
5. **🎯 Practical**: Build and compare both models

---

## 1. The Challenge: Sentiment Analysis

### Task:
Given a movie review, classify it as **Positive** or **Negative**.

**Examples:**
```
"This movie was absolutely fantastic!" → Positive ✅
"Terrible waste of time and money." → Negative ❌
```

### Two Approaches:

**LSTM (Sequential):**
```
Word 1 → Word 2 → Word 3 → Word 4 → Prediction
  ↓        ↓        ↓        ↓
Processes one word at a time, passing information forward
```

**Transformer (Parallel):**
```
Word 1 ──┐
Word 2 ──┼──→ All words processed together!
Word 3 ──┤    Each word "attends" to all others
Word 4 ──┘
```

In [None]:
# Install required packages (uncomment if needed)
# !pip install torch transformers datasets scikit-learn matplotlib numpy tqdm

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer
from datasets import load_dataset
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import time
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Check for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
print(f"PyTorch version: {torch.__version__}")

## 2. Load and Prepare Data

We'll use the IMDB movie review dataset:
- **50,000 reviews** total
- **25,000 for training**, 25,000 for testing
- **Balanced**: 50% positive, 50% negative

For this tutorial, we'll use a smaller subset for faster training.
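
One caveat when taking a subset: a slice of the first few thousand rows is only balanced if the rows are stored in mixed order. The sketch below uses a made-up list of labels (not the real dataset), under the assumption that a split is stored grouped by label, to show why shuffling before slicing matters. `subset_counts` is a hypothetical helper introduced only for this illustration.

```python
import random
from collections import Counter

# Toy stand-in for a split stored grouped by label (an assumption, not real data):
# 12,500 negatives (0) followed by 12,500 positives (1).
labels = [0] * 12_500 + [1] * 12_500

def subset_counts(labels, n, shuffle, seed=42):
    """Count the labels among the first n items, optionally after shuffling."""
    items = list(labels)
    if shuffle:
        random.Random(seed).shuffle(items)
    return Counter(items[:n])

print(subset_counts(labels, 5000, shuffle=False))  # only label 0 appears
print(subset_counts(labels, 5000, shuffle=True))   # roughly half of each label
```

With the `datasets` library, `Dataset.shuffle(seed=...)` followed by `Dataset.select(range(n))` draws a random subset of this kind.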

In [None]:
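# Load IMDB dataset
print("Loading IMDB dataset...")
# The IMDB splits are stored grouped by label, so shuffle before taking a subset;
# a plain slice such as "train[:5000]" would contain a single class.
train_dataset = load_dataset("imdb", split="train").shuffle(seed=42).select(range(5000))  # 5000 samples for training
test_dataset = load_dataset("imdb", split="test").shuffle(seed=42).select(range(1000))    # 1000 samples for testing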

print(f"\nTraining samples: {len(train_dataset)}")
print(f"Test samples: {len(test_dataset)}")

# Show example
print(f"\n{'='*60}")
print("Example Review:")
print(f"{'='*60}")
example = train_dataset[0]
print(f"Text: {example['text'][:200]}...")
print(f"\nLabel: {example['label']} ({'Positive' if example['label'] == 1 else 'Negative'})")

## 3. Text Preprocessing

### Understanding Tokenization:

**What is tokenization?**
Converting text into numbers that models can understand.

```
Text:   "This movie is great!"
         ↓
Tokens: ["This", "movie", "is", "great", "!"]
         ↓
IDs:    [2023, 3185, 2003, 2307, 999]
```

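We'll build a simple vocabulary-based tokenizer and use it for both models, so that the comparison isolates the architecture. The dataset class below can also accept a Hugging Face tokenizer such as BERT's, but that path is not used in this notebook.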
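
As a rough sketch of the vocabulary-based approach, the snippet below maps lowercased, whitespace-separated words to integer IDs on a toy corpus. The corpus and the names `toy_vocab` and `encode` are made up for illustration; punctuation would stay attached to words because the split is on whitespace only.

```python
from collections import Counter

corpus = ["this movie is great", "this movie is bad", "great acting"]

# Count words, then assign IDs by descending frequency.
# IDs 0 and 1 are reserved for padding and unknown words.
counts = Counter(word for text in corpus for word in text.lower().split())
toy_vocab = {"<PAD>": 0, "<UNK>": 1}
for word, _ in counts.most_common():
    toy_vocab[word] = len(toy_vocab)

def encode(text, max_length=6):
    """Map words to IDs, truncate to max_length, then pad with 0."""
    ids = [toy_vocab.get(w, 1) for w in text.lower().split()][:max_length]
    return ids + [0] * (max_length - len(ids))

print(encode("This movie is terrible"))  # [2, 3, 4, 1, 0, 0]
```

"terrible" never appears in the toy corpus, so it maps to the `<UNK>` ID 1, and the two trailing zeros are padding.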

In [None]:
# Simple tokenizer for LSTM
class SimpleTokenizer:
    def __init__(self, max_vocab_size=10000):
        self.max_vocab_size = max_vocab_size
        self.word_to_idx = {}
        self.idx_to_word = {}
        
    def build_vocab(self, texts):
        """Build vocabulary from texts."""
        word_freq = {}
        
        # Count word frequencies
        for text in texts:
            for word in text.lower().split():
                word_freq[word] = word_freq.get(word, 0) + 1
        
        # Sort by frequency and take top words
        sorted_words = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)
        
        # Reserve 0 for padding, 1 for unknown
        self.word_to_idx = {'<PAD>': 0, '<UNK>': 1}
        self.idx_to_word = {0: '<PAD>', 1: '<UNK>'}
        
        # Add top words
        for idx, (word, _) in enumerate(sorted_words[:self.max_vocab_size - 2], start=2):
            self.word_to_idx[word] = idx
            self.idx_to_word[idx] = word
        
        print(f"Vocabulary size: {len(self.word_to_idx)}")
    
    def encode(self, text, max_length=256):
        """Convert text to sequence of indices."""
        words = text.lower().split()
        indices = [self.word_to_idx.get(word, 1) for word in words]  # 1 is <UNK>
        
        # Pad or truncate
        if len(indices) < max_length:
            indices = indices + [0] * (max_length - len(indices))  # 0 is <PAD>
        else:
            indices = indices[:max_length]
        
        return indices

# Build vocabulary from training data
simple_tokenizer = SimpleTokenizer(max_vocab_size=10000)
simple_tokenizer.build_vocab([item['text'] for item in train_dataset])

# Test tokenization
test_text = "This movie is absolutely fantastic!"
encoded = simple_tokenizer.encode(test_text, max_length=20)
print(f"\nExample tokenization:")
print(f"Text: {test_text}")
print(f"Encoded: {encoded[:10]}...")  # Show first 10 tokens

In [None]:
# Prepare datasets for both models
class SentimentDataset(Dataset):
    def __init__(self, data, tokenizer, max_length=256, use_bert=False):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.use_bert = use_bert
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        item = self.data[idx]
        text = item['text']
        label = item['label']
        
        if self.use_bert:
            # BERT tokenization
            encoding = self.tokenizer(
                text,
                max_length=self.max_length,
                padding='max_length',
                truncation=True,
                return_tensors='pt'
            )
            return {
                'input_ids': encoding['input_ids'].squeeze(0),  # drop only the batch dimension
                'attention_mask': encoding['attention_mask'].squeeze(0),
                'label': torch.tensor(label, dtype=torch.long)
            }
        else:
            # Simple tokenization for LSTM
            indices = self.tokenizer.encode(text, max_length=self.max_length)
            return {
                'input_ids': torch.tensor(indices, dtype=torch.long),
                'label': torch.tensor(label, dtype=torch.long)
            }

# Create datasets for LSTM
lstm_train_dataset = SentimentDataset(train_dataset, simple_tokenizer, max_length=256, use_bert=False)
lstm_test_dataset = SentimentDataset(test_dataset, simple_tokenizer, max_length=256, use_bert=False)

print(f"✅ LSTM datasets ready!")
print(f"   Training samples: {len(lstm_train_dataset)}")
print(f"   Test samples: {len(lstm_test_dataset)}")

## 4. Building the LSTM Model

### LSTM Architecture:

```
Input Text
    ↓
Embedding Layer (converts word IDs to vectors)
    ↓
LSTM Layer 1 (processes sequence)
    ↓
LSTM Layer 2 (learns higher-level patterns)
    ↓
Fully Connected Layer
    ↓
Output (Positive or Negative)
```
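
To get a feel for where the parameters sit in this stack, the arithmetic below estimates the size of a two-layer bidirectional LSTM classifier. It assumes a 10,000-word vocabulary, 128-dimensional embeddings, 256 hidden units, and a 128-unit hidden layer in the classification head, and it assumes PyTorch's `nn.LSTM` layout (four gates, each with input weights, recurrent weights, and two bias vectors). `lstm_layer_params` is a hypothetical helper, not part of any library.

```python
def lstm_layer_params(input_dim, hidden_dim, bidirectional=True):
    """Parameters in one LSTM layer: 4 gates x (W_ih, W_hh, b_ih, b_hh)."""
    per_direction = 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)
    return per_direction * (2 if bidirectional else 1)

# Assumed configuration (see the lead-in)
vocab_size, embedding_dim, hidden_dim = 10_000, 128, 256

embedding = vocab_size * embedding_dim
layer1 = lstm_layer_params(embedding_dim, hidden_dim)
layer2 = lstm_layer_params(2 * hidden_dim, hidden_dim)  # input: both directions of layer 1
head = (2 * hidden_dim * 128 + 128) + (128 * 2 + 2)     # two linear layers with biases

total = embedding + layer1 + layer2 + head
print(f"{total:,}")  # 3,713,410
```

Under these assumptions the second LSTM layer is the largest single component, because its input is the concatenation of both directions of the first layer.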

### How LSTM Reads:
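An LSTM reads **one word at a time**, maintaining a "memory" (its hidden state and cell state) of what it has seen so far. The model below is bidirectional: a second LSTM reads the review from right to left, and the final states of the two directions are concatenated.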
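
The "memory" update can be sketched with a single LSTM step on scalars. This follows the standard LSTM gate equations, but the weights are made up and everything is one-dimensional to keep it readable; `lstm_step` and `params` are hypothetical names, and this is not the `nn.LSTM` implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step on scalars. p maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b
    f = sigmoid(pre("forget"))        # how much old memory to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much memory to expose
    g = math.tanh(pre("candidate"))   # proposed new content
    c = f * c_prev + i * g            # updated cell state (the "memory")
    h = o * math.tanh(c)              # updated hidden state
    return h, c

# Made-up weights, shared across time steps as in a real LSTM
params = {name: (0.5, 0.1, 0.0) for name in ("forget", "input", "output", "candidate")}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 2.0]:            # one "word" per step
    h, c = lstm_step(x, h, c, params)
    print(f"x={x:+.1f}  h={h:+.4f}  c={c:+.4f}")
```

Each step depends on the `h` and `c` produced by the previous one, which is why the time steps cannot be computed in parallel.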

In [None]:
class LSTMSentimentClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim=128, hidden_dim=256, num_layers=2, dropout=0.3):
        super(LSTMSentimentClassifier, self).__init__()
        
        # Embedding layer: converts word indices to dense vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        
        # LSTM layers
        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0,
            bidirectional=True  # Read both forward and backward
        )
        
        # Fully connected layers
        self.fc = nn.Sequential(
            nn.Linear(hidden_dim * 2, 128),  # *2 because bidirectional
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(128, 2)  # 2 classes: positive/negative
        )
    
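    def forward(self, input_ids):
        # Embed the input
        embedded = self.embedding(input_ids)  # (batch, seq_len, embedding_dim)
        
        # Pack the batch so the LSTM stops at each review's last real token;
        # otherwise the forward state is read after a run of <PAD> tokens (index 0)
        lengths = (input_ids != 0).sum(dim=1).clamp(min=1).cpu()
        packed = nn.utils.rnn.pack_padded_sequence(
            embedded, lengths, batch_first=True, enforce_sorted=False
        )
        
        # Pass through LSTM
        _, (hidden, cell) = self.lstm(packed)
        
        # Use the last hidden state from both directions
        # hidden shape: (num_layers * 2, batch, hidden_dim)
        hidden_fwd = hidden[-2]  # Last layer, forward direction
        hidden_bwd = hidden[-1]  # Last layer, backward direction
        hidden_concat = torch.cat([hidden_fwd, hidden_bwd], dim=1)
        
        # Pass through fully connected layers
        output = self.fc(hidden_concat)
        
        return output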

# Create LSTM model
vocab_size = len(simple_tokenizer.word_to_idx)
lstm_model = LSTMSentimentClassifier(vocab_size).to(device)

# Count parameters
lstm_params = sum(p.numel() for p in lstm_model.parameters())
print(f"\n📊 LSTM Model:")
print(f"   Parameters: {lstm_params:,}")
print(f"   Vocabulary size: {vocab_size:,}")
print(f"\n{lstm_model}")

## 5. Building the Transformer Model

### Transformer Architecture:

```
Input Text
    ↓
Embedding + Positional Encoding
    ↓
Multi-Head Self-Attention (all words look at each other)
    ↓
Feed-Forward Network
    ↓
Classification Head
    ↓
Output (Positive or Negative)
```
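
The "Positional Encoding" step can be illustrated with the sinusoidal scheme, in which even columns hold `sin(pos / 10000^(i/d))` and the following odd columns hold the matching cosine. The version below is a plain-Python sketch for a tiny table; `positional_encoding` is a hypothetical function name, and the sizes are chosen only so the output is easy to read.

```python
import math

def positional_encoding(max_length, dim):
    """Sinusoidal encoding: even columns use sin, odd columns use cos."""
    pe = [[0.0] * dim for _ in range(max_length)]
    for pos in range(max_length):
        for i in range(0, dim, 2):               # i is the even column index
            angle = pos / (10000 ** (i / dim))
            pe[pos][i] = math.sin(angle)
            if i + 1 < dim:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(max_length=4, dim=4)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

Every position gets a distinct vector, so adding it to the word embedding lets the otherwise order-blind attention layers tell "first word" from "fourth word".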

### How Transformer Reads:
Transformer reads **all words at once**, with each word "attending" to all other words to understand context.
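
A minimal sketch of the "attending" step is scaled dot-product self-attention. To stay short it uses the input vectors directly as queries, keys, and values; a real Transformer layer first applies learned projections and runs several heads in parallel, both of which are omitted here. The 2-d "embeddings" are made up, and `self_attention` is a hypothetical name.

```python
import math

def softmax(scores):
    m = max(scores)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Single-head self-attention with queries = keys = values = the inputs."""
    dim = len(vectors[0])
    weights, outputs = [], []
    for q in vectors:                            # each row is independent of the others
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in vectors]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[d] for wj, v in zip(w, vectors)) for d in range(dim)])
    return weights, outputs

# Made-up 2-d "embeddings" for three words
words = ["movie", "was", "fantastic"]
vectors = [[1.0, 0.0], [0.1, 0.1], [0.9, 0.4]]

weights, outputs = self_attention(vectors)
for word, row in zip(words, weights):
    print(f"{word:<10}", [round(w, 2) for w in row])
```

Each printed row is one word's attention distribution over all three words and sums to 1. Because no row depends on another row's result, all of them can be computed at the same time.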

In [None]:
class TransformerSentimentClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim=128, num_heads=8, num_layers=2, dropout=0.3, max_length=256):
        super(TransformerSentimentClassifier, self).__init__()
        
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        
        # Positional encoding (helps model understand word order)
        # Registered as a buffer: it moves with .to(device) but is not a trainable parameter
        self.register_buffer(
            'positional_encoding',
            self._create_positional_encoding(max_length, embedding_dim)
        )
        
        # Transformer encoder layers
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embedding_dim,
            nhead=num_heads,
            dim_feedforward=embedding_dim * 4,
            dropout=dropout,
            batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        
        # Classification head
        self.fc = nn.Sequential(
            nn.Linear(embedding_dim, 128),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(128, 2)
        )
    
    def _create_positional_encoding(self, max_length, embedding_dim):
        """Create sinusoidal positional encoding."""
        position = torch.arange(max_length).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embedding_dim, 2) * (-np.log(10000.0) / embedding_dim))
        
        pe = torch.zeros(max_length, embedding_dim)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        return pe.unsqueeze(0)  # Add batch dimension
    
    def forward(self, input_ids, attention_mask=None):
        # Embed the input
        embedded = self.embedding(input_ids)  # (batch, seq_len, embedding_dim)
        
        # Add positional encoding
        seq_len = embedded.size(1)
        embedded = embedded + self.positional_encoding[:, :seq_len, :]
        
        # Create padding mask for transformer
        if attention_mask is None:
            attention_mask = (input_ids != 0)  # True for real tokens, False for padding
        attention_mask = attention_mask.bool()  # tokenizer masks arrive as 0/1 integers
        
        # Invert mask (Transformer expects True for positions to mask)
        padding_mask = ~attention_mask
        
        # Pass through transformer
        transformer_out = self.transformer(embedded, src_key_padding_mask=padding_mask)
        
        # Use mean pooling over sequence
        # Mask out padding tokens before averaging
        mask_expanded = attention_mask.unsqueeze(-1).expand(transformer_out.size()).float()
        sum_embeddings = torch.sum(transformer_out * mask_expanded, dim=1)
        sum_mask = torch.clamp(mask_expanded.sum(dim=1), min=1e-9)
        pooled = sum_embeddings / sum_mask
        
        # Classification
        output = self.fc(pooled)
        
        return output

# Create Transformer model
transformer_model = TransformerSentimentClassifier(vocab_size).to(device)

# Count parameters
transformer_params = sum(p.numel() for p in transformer_model.parameters())
print(f"\n📊 Transformer Model:")
print(f"   Parameters: {transformer_params:,}")
print(f"   Vocabulary size: {vocab_size:,}")
print(f"\n{transformer_model}")

## 6. Training Function

We'll create a unified training function that works for both models.

In [None]:
def train_model(model, train_dataset, test_dataset, model_name, epochs=3, batch_size=32, lr=0.001):
    """
    Train and evaluate a model.
    
    Args:
        model: The model to train
        train_dataset: Training dataset
        test_dataset: Test dataset
        model_name: Name for logging
        epochs: Number of training epochs
        batch_size: Batch size
        lr: Learning rate
    
    Returns:
        Dictionary with training history and metrics
    """
    # Create data loaders
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size)
    
    # Loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    
    # Training history
    history = {
        'train_loss': [],
        'train_acc': [],
        'test_loss': [],
        'test_acc': [],
        'epoch_times': []
    }
    
    print(f"\n{'='*60}")
    print(f"Training {model_name}")
    print(f"{'='*60}")
    
    for epoch in range(epochs):
        epoch_start = time.time()
        
        # Training phase
        model.train()
        train_loss = 0
        train_correct = 0
        train_total = 0
        
        train_pbar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{epochs} [Train]")
        for batch in train_pbar:
            input_ids = batch['input_ids'].to(device)
            labels = batch['label'].to(device)
            
            # Forward pass
            optimizer.zero_grad()
            
            # Handle different model inputs
            if 'attention_mask' in batch:
                attention_mask = batch['attention_mask'].to(device)
                outputs = model(input_ids, attention_mask)
            else:
                outputs = model(input_ids)
            
            loss = criterion(outputs, labels)
            
            # Backward pass
            loss.backward()
            optimizer.step()
            
            # Statistics
            train_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            train_total += labels.size(0)
            train_correct += (predicted == labels).sum().item()
            
            # Update progress bar
            train_pbar.set_postfix({
                'loss': f"{loss.item():.4f}",
                'acc': f"{100 * train_correct / train_total:.2f}%"
            })
        
        # Evaluation phase
        model.eval()
        test_loss = 0
        test_correct = 0
        test_total = 0
        
        with torch.no_grad():
            test_pbar = tqdm(test_loader, desc=f"Epoch {epoch+1}/{epochs} [Test]")
            for batch in test_pbar:
                input_ids = batch['input_ids'].to(device)
                labels = batch['label'].to(device)
                
                # Forward pass
                if 'attention_mask' in batch:
                    attention_mask = batch['attention_mask'].to(device)
                    outputs = model(input_ids, attention_mask)
                else:
                    outputs = model(input_ids)
                
                loss = criterion(outputs, labels)
                
                # Statistics
                test_loss += loss.item()
                _, predicted = torch.max(outputs, 1)
                test_total += labels.size(0)
                test_correct += (predicted == labels).sum().item()
        
        # Calculate metrics
        epoch_time = time.time() - epoch_start
        train_loss = train_loss / len(train_loader)
        train_acc = 100 * train_correct / train_total
        test_loss = test_loss / len(test_loader)
        test_acc = 100 * test_correct / test_total
        
        # Save history
        history['train_loss'].append(train_loss)
        history['train_acc'].append(train_acc)
        history['test_loss'].append(test_loss)
        history['test_acc'].append(test_acc)
        history['epoch_times'].append(epoch_time)
        
        # Print epoch summary
        print(f"\nEpoch {epoch+1} Summary:")
        print(f"  Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.2f}%")
        print(f"  Test Loss:  {test_loss:.4f} | Test Acc:  {test_acc:.2f}%")
        print(f"  Time: {epoch_time:.2f}s")
    
    return history

print("✅ Training function ready!")

## 7. Train LSTM Model

Let's train the LSTM model and see how it performs!

In [None]:
# Train LSTM
lstm_history = train_model(
    model=lstm_model,
    train_dataset=lstm_train_dataset,
    test_dataset=lstm_test_dataset,
    model_name="LSTM",
    epochs=3,
    batch_size=32,
    lr=0.001
)

print("\n✅ LSTM training complete!")

## 8. Train Transformer Model

Now let's train the Transformer model with the same data!

In [None]:
# Train Transformer
transformer_history = train_model(
    model=transformer_model,
    train_dataset=lstm_train_dataset,  # Same dataset
    test_dataset=lstm_test_dataset,
    model_name="Transformer",
    epochs=3,
    batch_size=32,
    lr=0.001
)

print("\n✅ Transformer training complete!")

## 9. Compare Results

Let's visualize and compare the performance of both models!

In [None]:
# Plot comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

epochs_range = range(1, len(lstm_history['train_loss']) + 1)

# Plot 1: Training Loss
axes[0, 0].plot(epochs_range, lstm_history['train_loss'], 'b-o', label='LSTM', linewidth=2)
axes[0, 0].plot(epochs_range, transformer_history['train_loss'], 'r-s', label='Transformer', linewidth=2)
axes[0, 0].set_xlabel('Epoch', fontsize=12)
axes[0, 0].set_ylabel('Loss', fontsize=12)
axes[0, 0].set_title('Training Loss Comparison', fontsize=14, fontweight='bold')
axes[0, 0].legend(fontsize=11)
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Test Loss
axes[0, 1].plot(epochs_range, lstm_history['test_loss'], 'b-o', label='LSTM', linewidth=2)
axes[0, 1].plot(epochs_range, transformer_history['test_loss'], 'r-s', label='Transformer', linewidth=2)
axes[0, 1].set_xlabel('Epoch', fontsize=12)
axes[0, 1].set_ylabel('Loss', fontsize=12)
axes[0, 1].set_title('Test Loss Comparison', fontsize=14, fontweight='bold')
axes[0, 1].legend(fontsize=11)
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Training Accuracy
axes[1, 0].plot(epochs_range, lstm_history['train_acc'], 'b-o', label='LSTM', linewidth=2)
axes[1, 0].plot(epochs_range, transformer_history['train_acc'], 'r-s', label='Transformer', linewidth=2)
axes[1, 0].set_xlabel('Epoch', fontsize=12)
axes[1, 0].set_ylabel('Accuracy (%)', fontsize=12)
axes[1, 0].set_title('Training Accuracy Comparison', fontsize=14, fontweight='bold')
axes[1, 0].legend(fontsize=11)
axes[1, 0].grid(True, alpha=0.3)

# Plot 4: Test Accuracy
axes[1, 1].plot(epochs_range, lstm_history['test_acc'], 'b-o', label='LSTM', linewidth=2)
axes[1, 1].plot(epochs_range, transformer_history['test_acc'], 'r-s', label='Transformer', linewidth=2)
axes[1, 1].set_xlabel('Epoch', fontsize=12)
axes[1, 1].set_ylabel('Accuracy (%)', fontsize=12)
axes[1, 1].set_title('Test Accuracy Comparison', fontsize=14, fontweight='bold')
axes[1, 1].legend(fontsize=11)
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print final comparison
print(f"\n{'='*60}")
print("FINAL RESULTS COMPARISON")
print(f"{'='*60}")
print(f"\n{'Model':<15} {'Test Accuracy':<15} {'Avg Time/Epoch'}")
print(f"{'-'*50}")
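# Format the accuracy together with its % sign before padding, so the columns line up
print(f"{'LSTM':<15} {format(lstm_history['test_acc'][-1], '.2f') + '%':<15} {np.mean(lstm_history['epoch_times']):.2f}s")
print(f"{'Transformer':<15} {format(transformer_history['test_acc'][-1], '.2f') + '%':<15} {np.mean(transformer_history['epoch_times']):.2f}s")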
print(f"\n{'='*60}")

## 10. Test on Custom Examples

Let's test both models on some custom movie reviews!

In [None]:
def predict_sentiment(text, model, tokenizer, model_name):
    """
    Predict sentiment for a given text.
    """
    model.eval()
    
    # Tokenize
    indices = tokenizer.encode(text, max_length=256)
    input_ids = torch.tensor([indices], dtype=torch.long).to(device)
    
    # Predict
    with torch.no_grad():
        outputs = model(input_ids)
        probabilities = F.softmax(outputs, dim=1)
        predicted_class = torch.argmax(probabilities, dim=1).item()
        confidence = probabilities[0][predicted_class].item()
    
    sentiment = "Positive 😊" if predicted_class == 1 else "Negative 😞"
    
    return sentiment, confidence

# Test examples
test_reviews = [
    "This movie was absolutely fantastic! Best film I've seen all year.",
    "Terrible waste of time. I want my money back.",
    "The acting was great but the plot was confusing.",
    "An emotional masterpiece that will stay with you forever.",
    "Boring and predictable. Fell asleep halfway through."
]

print(f"\n{'='*80}")
print("TESTING ON CUSTOM REVIEWS")
print(f"{'='*80}")

for i, review in enumerate(test_reviews, 1):
    print(f"\n{'-'*80}")
    print(f"Review {i}: {review}")
    print(f"{'-'*80}")
    
    # LSTM prediction
    lstm_sentiment, lstm_conf = predict_sentiment(review, lstm_model, simple_tokenizer, "LSTM")
    print(f"LSTM:        {lstm_sentiment:<15} (Confidence: {lstm_conf:.2%})")
    
    # Transformer prediction
    trans_sentiment, trans_conf = predict_sentiment(review, transformer_model, simple_tokenizer, "Transformer")
    print(f"Transformer: {trans_sentiment:<15} (Confidence: {trans_conf:.2%})")

print(f"\n{'='*80}")

## 11. Understanding the Differences

### How Each Model Reads Text:

#### LSTM (Sequential Processing):
```
Review: "This movie was absolutely fantastic!"

Step 1: Read "This"        → Hidden state h₁
Step 2: Read "movie"       → Update to h₂ (remembers "This")
Step 3: Read "was"         → Update to h₃ (remembers "This movie")
Step 4: Read "absolutely"  → Update to h₄
Step 5: Read "fantastic"   → Update to h₅
Step 6: Final prediction based on h₅
```

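**Problem**: The hidden state is overwritten at every step, so in a review that runs to hundreds of words the LSTM may have partially "forgotten" the opening words by the time it reaches the end. (The five-word example above is too short for this to matter; it only shows the mechanism.)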
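
A back-of-the-envelope calculation shows why distance matters. Suppose, purely as an assumption, that the forget gate kept a constant fraction of the cell state at every step and nothing refreshed it. The share of an early word's contribution that survives would then shrink geometrically. Real gates vary per step and per dimension, so this is an intuition aid rather than a model of the trained network; `surviving_fraction` is a hypothetical helper.

```python
def surviving_fraction(forget_gate, steps):
    """Fraction of an early cell-state contribution left after `steps` updates,
    assuming a constant forget-gate value and no reinforcement."""
    return forget_gate ** steps

for steps in (5, 50, 250):
    print(steps, f"{surviving_fraction(0.95, steps):.6f}")
```

With a gate value of 0.95, most of the signal is still present after 5 steps, but almost none of it remains after 250.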

#### Transformer (Parallel Processing with Attention):
```
Review: "This movie was absolutely fantastic!"

All words processed simultaneously!

"fantastic" directly attends to (illustrative weights, not model output):
  - "This"      (0.05) - low attention
  - "movie"     (0.60) - high attention! (what is fantastic?)
  - "was"       (0.10)
  - "absolutely" (0.20) - moderate attention (intensifier)
  - "fantastic" (0.05) - itself

Direct connections between ALL words!
```

**Advantage**: Every word can directly "look at" every other word, capturing relationships better.

## 12. Visualizing Attention (Transformer)

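Let's visualize which words the Transformer weights most heavily! Note that the function below is a simplified proxy based on the size of each word's embedding, not a plot of the model's actual attention weights.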

In [None]:
def visualize_attention_simple(text, model, tokenizer):
    """
    Visualize which words the model focuses on.
    This is a simplified visualization showing word importance.
    """
    model.eval()
    
    # Tokenize
    words = text.split()
    indices = tokenizer.encode(text, max_length=256)
    input_ids = torch.tensor([indices], dtype=torch.long).to(device)
    
    # Get embeddings and compute importance
    with torch.no_grad():
        # Get embedding
        embedded = model.embedding(input_ids)
        
        # Simple importance: L2 norm of each token's input embedding
        # (This is a proxy, not attention: the transformer layers are not applied here)
        importance = torch.norm(embedded, dim=2).squeeze(0).cpu().numpy()
        
        # Normalize to 0-1
        importance = importance[:len(words)]
        importance = (importance - importance.min()) / (importance.max() - importance.min() + 1e-10)
    
    # Visualize
    plt.figure(figsize=(12, 4))
    colors = plt.cm.Reds(importance)
    
    plt.bar(range(len(words)), importance, color=colors)
    plt.xticks(range(len(words)), words, rotation=45, ha='right')
    plt.ylabel('Importance', fontsize=12)
    plt.title('Word Importance in Transformer Model', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    # Print top words
    word_importance = list(zip(words, importance))
    word_importance.sort(key=lambda x: x[1], reverse=True)
    
    print("\nTop 5 Most Important Words:")
    for word, imp in word_importance[:5]:
        print(f"  {word:<15} {imp:.3f}")

# Visualize attention for a sample review
sample_review = "This movie was absolutely fantastic and entertaining"
print(f"Analyzing: '{sample_review}'\n")
visualize_attention_simple(sample_review, transformer_model, simple_tokenizer)

## 13. Key Takeaways

### Performance Comparison:

| Aspect | LSTM | Transformer |
|--------|------|-------------|
| **Processing** | Sequential (one word at a time) | Parallel (all words together) |
| **Speed** | Slower (time steps cannot run in parallel) | Faster on GPUs (all positions computed at once) |
| **Long-range dependencies** | Struggles with very long sequences | Strong: any two words are directly connected |
| **Memory** | Hidden state can "forget" | Attention to all words |
| **Training** | Slower | Faster with GPU |
| **Accuracy** | Good | Usually better, especially with more data or pre-training |

### Why Transformers Win:

1. **Parallel Processing**: Can process all words simultaneously → faster training
2. **Direct Connections**: Every word can attend to every other word → better understanding
3. **No Forgetting**: No sequential bottleneck → better long-range dependencies
4. **Scalability**: Scales better with more data and compute

### When to Use Each:

**Use LSTM when:**
- You have limited computational resources
- Working with very long sequences (memory grows linearly with length, versus quadratically for full self-attention)
- Sequential nature is important (e.g., time series)

**Use Transformer when:**
- You need best performance
- Have GPU resources
- Working with natural language
- Need to understand complex relationships
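
The memory point in the LSTM list can be made concrete with a rough count. Full self-attention stores one score per pair of positions per head, so its footprint grows with the square of the sequence length, while an LSTM's stored states grow linearly. The figures below count values (not bytes) for one example and one layer and ignore everything else a real model keeps in memory. Both function names are hypothetical, and the default sizes (8 heads, 256 hidden units, 2 directions) are assumptions for illustration.

```python
def attention_scores(seq_len, num_heads=8):
    """Attention weights held per layer for one example: heads x n x n."""
    return num_heads * seq_len * seq_len

def lstm_states(seq_len, hidden_dim=256, directions=2):
    """Hidden and cell values kept per layer for one example: 2 x directions x hidden x n."""
    return 2 * directions * hidden_dim * seq_len

for n in (256, 1024, 4096):
    print(f"n={n:>5}  attention={attention_scores(n):>12,}  lstm={lstm_states(n):>10,}")
```

Under these assumptions the two counts are within a factor of two at 256 tokens, and the gap widens by the same factor as the sequence length from there.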

## 14. Exercises for Students

### Exercise 1: Experiment with Hyperparameters
Try changing:
- Number of LSTM/Transformer layers
- Embedding dimensions
- Learning rate
- Batch size

Which changes improve performance the most?

### Exercise 2: Test on Different Texts
Create your own movie reviews and test both models. Try:
- Very short reviews (5 words)
- Very long reviews (100+ words)
- Mixed sentiment reviews

Which model handles each case better?

### Exercise 3: Analyze Errors
Find examples where:
- Both models are wrong
- LSTM is right but Transformer is wrong
- Transformer is right but LSTM is wrong

What patterns do you notice?
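
One possible starting point for this exercise is to sort example indices into the three cases listed, plus the case where both models are right. The helper below is a sketch with a hypothetical name, `bucket_errors`, and it runs on made-up labels and predictions. To use it on real results you would first collect each model's predicted class for every test example.

```python
def bucket_errors(labels, lstm_preds, transformer_preds):
    """Group example indices by which model classified them correctly."""
    buckets = {"both_right": [], "both_wrong": [], "lstm_only": [], "transformer_only": []}
    for i, (y, a, b) in enumerate(zip(labels, lstm_preds, transformer_preds)):
        if a == y and b == y:
            buckets["both_right"].append(i)
        elif a == y:
            buckets["lstm_only"].append(i)         # LSTM right, Transformer wrong
        elif b == y:
            buckets["transformer_only"].append(i)  # Transformer right, LSTM wrong
        else:
            buckets["both_wrong"].append(i)
    return buckets

# Made-up labels and predictions, for illustration only
labels            = [1, 0, 1, 0, 1]
lstm_preds        = [1, 0, 0, 1, 0]
transformer_preds = [1, 1, 1, 1, 0]
print(bucket_errors(labels, lstm_preds, transformer_preds))
```

Reading the reviews behind each bucket (for example, looking for negation, sarcasm, or length) is where the patterns tend to show up.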

### Exercise 4: Add More Data
Increase the training data size from 5000 to 10000 or 20000 samples.
Does this help one model more than the other?

In [None]:
# Space for your experiments!

# Exercise 1: Try different hyperparameters
# TODO: Modify model parameters and retrain

# Exercise 2: Test your own reviews
my_reviews = [
    # Add your own reviews here!
]

# Exercise 3: Analyze predictions
# TODO: Compare predictions and find patterns

# Exercise 4: Train with more data
# TODO: Load more training samples and compare results

## 15. Summary

### What We Learned:

✅ **LSTM Models**:
- Process text sequentially (one word at a time)
- Use hidden states to remember previous words
- Can struggle with long-range dependencies
- Good for sequential data with limited resources

✅ **Transformer Models**:
- Process all words in parallel
- Use attention to connect every word to every other word
- Better at understanding context and relationships
- State-of-the-art for most NLP tasks

✅ **Key Insight**:
Transformers generally perform better because they can:
1. See all words at once (parallel processing)
2. Directly connect any two words (attention mechanism)
3. Train faster on GPUs
4. Scale better with more data

### Next Steps:

1. **Try pre-trained models**: Use BERT, RoBERTa, or GPT for even better results
2. **Explore attention visualization**: Understand what transformers really "see"
3. **Apply to other tasks**: Question answering, translation, summarization
4. **Learn about modern architectures**: GPT-4, BERT, T5, etc.

---

**🎉 Congratulations!** You now understand the fundamental difference between sequential (LSTM) and attention-based (Transformer) models!

*Remember: The future of NLP is built on Transformers, but understanding LSTMs helps you appreciate why Transformers are so powerful!*