<a href="https://colab.research.google.com/github/nncliff/qwen-32B/blob/main/chapter-1/clipEmbedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gradient Clipping in RNNs

This notebook demonstrates how to implement **Gradient Clipping** in a Recurrent Neural Network (RNN) using PyTorch.

**Why Gradient Clipping?**
RNNs are prone to the **exploding gradient problem**: during backpropagation through time (BPTT) the gradient is multiplied by the recurrent Jacobian once per time step, so its norm can grow exponentially with sequence length. The resulting oversized updates can produce NaN or infinite values and prevent the model from converging.

**Key Concepts Covered:**
1.  **RNN Model**: A simple RNN with an embedding layer.
2.  **Gradient Clipping**: Using `torch.nn.utils.clip_grad_norm_` to rescale gradients before the optimizer step.
3.  **Training Loop**: Integrating clipping into the standard training process.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

embedding_dim = 128
hidden_dim = 256
vocab_size = 10000
sequence_length = 30
batch_size = 64
num_epochs = 30
clip_grad_norm = 1.0

### Hyperparameters & Gradient Clipping

The cell above defines the model configuration. The key parameter for this notebook is `clip_grad_norm = 1.0`.

**Why Gradient Clipping?**
Exploding gradients during backpropagation through time lead to very large weight updates, which can drive the weights to NaN or infinity. Clipping bounds the size of each update.

**How it works:**
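Gradient clipping computes the L2 norm of all parameter gradients taken together and, if that norm exceeds a threshold (here, `1.0`), multiplies every gradient by the same factor so the norm is brought down to the threshold. Because the scaling is uniform, the direction of the gradient is unchanged; only the step size is bounded. Gradients whose norm is already below the threshold are left as they are.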
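
The rescaling rule can be sketched in plain Python. This is an illustrative reimplementation, not PyTorch's code: the helper name `clip_by_global_norm`, the nested-list layout for gradients, and the small `eps` added to the denominator are assumptions made for the example.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one shared factor so their global L2 norm is at most max_norm.

    grads: list of lists of floats, one inner list per parameter (hypothetical layout).
    Returns (clipped_grads, total_norm_before_clipping).
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # The factor is capped at 1.0, so small gradients are never scaled up.
    coef = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * coef for g in grad] for grad in grads]
    return clipped, total_norm

grads = [[3.0, 0.0], [4.0]]  # global norm = sqrt(9 + 16) = 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))

print(norm_before)  # 5.0
print(clipped)      # approximately [[0.6, 0.0], [0.8]]
```

The ratio between the components (3 : 0 : 4) is the same before and after, which is what "without changing the direction" means in practice.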

### Understanding Output Shapes

In the `RNNModel` below, pay attention to the tensor shapes in the `forward` method:

*   **`output`**: Shape `(batch_size, sequence_length, hidden_dim)`
    *   This contains the hidden states for **every time step** in the sequence.
*   **`logits`**: Shape `(batch_size, vocab_size)`
    *   We extract the hidden state from the **last time step** (`output[:, -1, :]`), which has shape `(batch_size, hidden_dim)`.
    *   This vector is passed through the fully connected layer (`self.fc`) to project it to the vocabulary size, producing the prediction logits.
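
The shapes listed above can be checked in isolation with a quick sketch. The sizes here are small, hypothetical values chosen only to keep the check fast, and random tensors stand in for the embedding output.

```python
import torch
import torch.nn as nn

# Small, hypothetical sizes; the notebook itself uses larger ones.
batch_size, sequence_length, embedding_dim, hidden_dim, vocab_size = 4, 5, 8, 16, 50

rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
fc = nn.Linear(hidden_dim, vocab_size)

# Stand-in for the output of the embedding layer.
embedded = torch.randn(batch_size, sequence_length, embedding_dim)
output, h_n = rnn(embedded)
logits = fc(output[:, -1, :])

print(output.shape)  # torch.Size([4, 5, 16])
print(h_n.shape)     # torch.Size([1, 4, 16])
print(logits.shape)  # torch.Size([4, 50])

# For a single-layer, unidirectional RNN the last time step of `output`
# holds the same values as the final hidden state `h_n`.
print(torch.allclose(output[:, -1, :], h_n[-1]))  # True
```

The second return value of `nn.RNN`, discarded as `_` in the model, is this final hidden state, so either tensor could feed the linear layer in this configuration.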

In [2]:
class RNNModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        # x: (batch_size, sequence_length)
        embedded = self.embedding(x)        # (batch_size, sequence_length, embedding_dim)

        # output: (batch_size, sequence_length, hidden_dim)
        # Contains hidden states for all time steps
        output, _ = self.rnn(embedded)

        # logits: (batch_size, vocab_size)
        # Take the last time step's output and project to vocab size
        logits = self.fc(output[:, -1, :])
        return logits

In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("Using device:", device)
model = RNNModel(vocab_size, embedding_dim, hidden_dim).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

Using device: cuda


In [4]:
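def generate_dummy_data(batch_size, sequence_length, vocab_size):
    """Return random token ids and random next-token targets (no learnable signal)."""
    # `inputs` rather than `input`, to avoid shadowing the Python builtin
    inputs = torch.randint(0, vocab_size, (batch_size, sequence_length), dtype=torch.long)
    targets = torch.randint(0, vocab_size, (batch_size,), dtype=torch.long)
    return inputs, targets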

In [5]:
batches_per_epoch = 100  # dummy data has no fixed size, so pick a count

for epoch in range(num_epochs):
    model.train()
    total_loss = 0.0
    for _ in range(batches_per_epoch):
        inputs, targets = generate_dummy_data(batch_size, sequence_length, vocab_size)
        inputs = inputs.to(device)
        targets = targets.to(device)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()

        # Clip after backward() and before step() so the optimizer
        # applies the rescaled gradients.
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_grad_norm)
        optimizer.step()

        total_loss += loss.item()

    avg_loss = total_loss / batches_per_epoch
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")

Epoch 1/30, Loss: 9.2424
Epoch 2/30, Loss: 9.2454
Epoch 3/30, Loss: 9.2517
Epoch 4/30, Loss: 9.2613
Epoch 5/30, Loss: 9.2703
Epoch 6/30, Loss: 9.2755
Epoch 7/30, Loss: 9.2787
Epoch 8/30, Loss: 9.2864
Epoch 9/30, Loss: 9.2891
Epoch 10/30, Loss: 9.2978
Epoch 11/30, Loss: 9.2959
Epoch 12/30, Loss: 9.3030
Epoch 13/30, Loss: 9.3038
Epoch 14/30, Loss: 9.3099
Epoch 15/30, Loss: 9.3045
Epoch 16/30, Loss: 9.3270
Epoch 17/30, Loss: 9.3247
Epoch 18/30, Loss: 9.3315
Epoch 19/30, Loss: 9.3311
Epoch 20/30, Loss: 9.3408
Epoch 21/30, Loss: 9.3284
Epoch 22/30, Loss: 9.3377
Epoch 23/30, Loss: 9.3477
Epoch 24/30, Loss: 9.3407
Epoch 25/30, Loss: 9.3577
Epoch 26/30, Loss: 9.3477
Epoch 27/30, Loss: 9.3524
Epoch 28/30, Loss: 9.3597
Epoch 29/30, Loss: 9.3520
Epoch 30/30, Loss: 9.3671
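
A note on reading these numbers: `generate_dummy_data` draws inputs and targets independently and uniformly at random, so there is no pattern to learn. The lowest expected cross-entropy any model can reach on such data is that of a uniform guess over the vocabulary, which a short calculation gives as:

```python
import math

vocab_size = 10000  # same value as in the hyperparameter cell
chance_loss = math.log(vocab_size)  # cross-entropy of a uniform prediction, in nats
print(f"{chance_loss:.4f}")  # 9.2103
```

The reported losses sit just above this value and drift upward slowly, which is consistent with the model fitting noise in each fresh batch rather than with a fault in the clipping step. The run shows that training with clipping stays numerically stable; it is not expected to show a decreasing loss.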


In [6]:
model.eval()  # switch to evaluation mode
sample_input, _ = generate_dummy_data(1, sequence_length, vocab_size)
sample_input = sample_input.to(device)
with torch.no_grad():
    sample_output = model(sample_input)
    predicted_token = torch.argmax(sample_output, dim=1).item()
    print("Sample input:", sample_input.cpu().numpy()) # Print input sequence
    print("Predicted token:", predicted_token)

Sample input: [[3416 3946 7592  251 1614 7779   99 3712 9242 6496 7929 9605 1915 9095
  1397 1891 6593 4440 8305 4034 6037 2651 1283 7930 1534 3992 7234 9743
   965 1475]]
Predicted token: 2857
