# RNN Implementation Methods

Different ways to implement RNNs in PyTorch with code examples.

---

## Contents
1. [Method 1: Manual RNN](#manual)
2. [Method 2: Built-in nn.RNN](#builtin)
3. [Method 3: LSTM & GRU](#lstm)
4. [Method 4: Bidirectional RNN](#bidirectional)
5. [Method 5: Stacked/Deep RNN](#stacked)
6. [Method 6: Seq2Seq (Encoder-Decoder)](#seq2seq)
7. [Quick Comparison](#comparison)

In [1]:
import torch
import torch.nn as nn

# Sample data for all examples
batch_size = 32
seq_len = 10
input_size = 50
hidden_size = 128
output_size = 10

# Create sample batch
x = torch.randn(batch_size, seq_len, input_size)
print(f"Input shape: {x.shape}")

Input shape: torch.Size([32, 10, 50])


<a id='manual'></a>
# Method 1: Manual RNN

Build an RNN from scratch by processing one time step at a time.

**Key characteristics:**
- Manually loop through each time step
- Full control over the process
- Slower due to Python loops
- Good for learning RNN internals

In [2]:
class ManualRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
    
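    def forward(self, x_t, hidden):
        # x_t: (batch, input_size), hidden: (batch, hidden_size)
        combined = torch.cat((x_t, hidden), dim=1)
        hidden = torch.tanh(self.i2h(combined))  # new hidden state
        output = self.i2o(combined)  # output is computed from the previous hidden state
        return output, hidden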
    
    def init_hidden(self, batch_size=1):
        return torch.zeros(batch_size, self.hidden_size)

model = ManualRNN(input_size, hidden_size, output_size)
hidden = model.init_hidden(batch_size)

for t in range(seq_len):
    output, hidden = model(x[:, t, :], hidden)

print(f"Output shape: {output.shape}")
print(f"Hidden shape: {hidden.shape}")

Output shape: torch.Size([32, 10])
Hidden shape: torch.Size([32, 128])


**Code Walkthrough:**

1. **Two Linear Layers:**
   - `i2h`: Transforms concatenated input to hidden state
   - `i2o`: Transforms concatenated input to output

2. **Forward Process:**
   - Concatenate current input with previous hidden state
   - Update hidden state using tanh activation
   - Compute output from the combined vector

3. **Manual Loop Required:**
   - Must call `forward()` once per time step
   - Hidden state carries information across time steps
   - Final output after processing all time steps
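
The loop above keeps only the last step's output. If every step's output is needed (for example, for per-token predictions), the outputs can be collected and stacked. The following is a minimal sketch under assumed small sizes; it inlines the two linear layers rather than reusing `ManualRNN`, and the name `step_outputs` is illustrative.

```python
import torch
import torch.nn as nn

# Small illustrative sizes (assumptions, not the notebook's values)
batch_size, seq_len, input_size, hidden_size, output_size = 4, 6, 5, 8, 3
x = torch.randn(batch_size, seq_len, input_size)

# Same two layers as the manual cell: input+hidden -> hidden, input+hidden -> output
i2h = nn.Linear(input_size + hidden_size, hidden_size)
i2o = nn.Linear(input_size + hidden_size, output_size)

hidden = torch.zeros(batch_size, hidden_size)
step_outputs = []
for t in range(seq_len):
    combined = torch.cat((x[:, t, :], hidden), dim=1)
    hidden = torch.tanh(i2h(combined))   # carry the state to the next step
    step_outputs.append(i2o(combined))   # keep this step's output

# Stack along a new time dimension: (batch, seq, output)
all_outputs = torch.stack(step_outputs, dim=1)
print(all_outputs.shape)  # torch.Size([4, 6, 3])
```

Indexing `all_outputs[:, -1, :]` recovers the same final-step output that the plain loop produces.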

<a id='builtin'></a>
# Method 2: Built-in nn.RNN

Use PyTorch's optimized RNN layer.

**Key characteristics:**
- Process entire sequence in one call
- Typically much faster than the manual loop, especially on GPU (the exact speedup depends on hardware and sequence length)
- Optimized C++/CUDA implementation
- No manual time step looping needed

In [3]:
class BuiltInRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers=1):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        rnn_out, hidden = self.rnn(x)
        output = self.fc(rnn_out[:, -1, :])
        return output

model = BuiltInRNN(input_size, hidden_size, output_size)
output = model(x)

print(f"Output shape: {output.shape}")

Output shape: torch.Size([32, 10])


**Code Walkthrough:**

1. **Single RNN Layer:**
   - `nn.RNN()`: PyTorch's optimized RNN implementation
   - `batch_first=True`: Input shape is (batch, seq, features)

2. **Forward Process:**
   - `rnn_out`: Contains output for ALL time steps (batch, seq, hidden)
   - We extract `[:, -1, :]` to get only the last time step
   - Pass through linear layer for final prediction

3. **No Manual Loop:**
   - Entire sequence processed in one call
   - Much faster than manual looping
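
To make the "one call" concrete, the sketch below reproduces the output of a single-layer `nn.RNN` with an explicit loop over its own weights (`weight_ih_l0`, `weight_hh_l0`, `bias_ih_l0`, `bias_hh_l0`). The sizes are small illustrative assumptions; the point is only that the built-in layer applies the same tanh recurrence at every step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=5, hidden_size=8, batch_first=True)
x = torch.randn(4, 6, 5)  # (batch, seq, features)

with torch.no_grad():
    rnn_out, h_n = rnn(x)  # whole sequence in one call

    # Same computation, one time step at a time
    h = torch.zeros(4, 8)
    manual_steps = []
    for t in range(x.size(1)):
        h = torch.tanh(
            x[:, t] @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
            + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0
        )
        manual_steps.append(h)
    manual_out = torch.stack(manual_steps, dim=1)

matches = torch.allclose(rnn_out, manual_out, atol=1e-5)
print(matches)
```

For a single-layer, unidirectional RNN, `rnn_out[:, -1, :]` and `h_n[0]` hold the same final hidden state.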

<a id='lstm'></a>
# Method 3: LSTM & GRU

Advanced RNN variants that handle long-term dependencies better.

**LSTM (Long Short-Term Memory):**
- Mitigates the vanishing gradient problem through gating
- Has cell state + hidden state
- Well suited to long sequences (e.g., 100+ time steps)
- More parameters than vanilla RNN

**GRU (Gated Recurrent Unit):**
- Simpler than LSTM (no separate cell state)
- Often trains faster than LSTM
- Often performs comparably to LSTM
- Fewer parameters (3 gate blocks instead of 4)

In [4]:
class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        lstm_out, (hidden, cell) = self.lstm(x)
        output = self.fc(lstm_out[:, -1, :])
        return output

class GRUModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers=1):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        gru_out, hidden = self.gru(x)
        output = self.fc(gru_out[:, -1, :])
        return output

lstm_model = LSTMModel(input_size, hidden_size, output_size)
gru_model = GRUModel(input_size, hidden_size, output_size)

lstm_output = lstm_model(x)
gru_output = gru_model(x)

print(f"LSTM output: {lstm_output.shape}")
print(f"GRU output: {gru_output.shape}")

LSTM output: torch.Size([32, 10])
GRU output: torch.Size([32, 10])


**Code Walkthrough:**

1. **LSTM Returns Two States:**
   - `hidden`: Short-term memory
   - `cell`: Long-term memory (unique to LSTM)
   - Unpack both from LSTM output: `(hidden, cell)`

2. **GRU Returns One State:**
   - Only `hidden` state (no cell state)
   - Simpler architecture than LSTM

3. **Both Models:**
   - Use `[:, -1, :]` to get last time step
   - Apply linear layer for classification
   - LSTM has about 1.33x as many recurrent parameters as GRU (4 gate blocks vs. 3)
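
The difference in returned states is easiest to see from the shapes. This is a minimal sketch with assumed small sizes; `h_n`, `c_n`, and `g_n` are illustrative names for the final states.

```python
import torch
import torch.nn as nn

batch_size, seq_len, input_size, hidden_size = 4, 6, 5, 8
x = torch.randn(batch_size, seq_len, input_size)

lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
gru = nn.GRU(input_size, hidden_size, batch_first=True)

lstm_out, (h_n, c_n) = lstm(x)  # LSTM returns a (hidden, cell) tuple
gru_out, g_n = gru(x)           # GRU returns a single hidden state

print(lstm_out.shape)           # torch.Size([4, 6, 8])
print(h_n.shape, c_n.shape)     # torch.Size([1, 4, 8]) torch.Size([1, 4, 8])
print(g_n.shape)                # torch.Size([1, 4, 8])

# For one unidirectional layer, the last time step equals the final hidden state
same_last_step = torch.allclose(lstm_out[:, -1, :], h_n[-1])
```

The leading dimension of the state tensors is `num_layers` (times 2 if bidirectional), not the batch dimension, even with `batch_first=True`.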

<a id='bidirectional'></a>
# Method 4: Bidirectional RNN

Process sequences in both forward and backward directions.

**Key characteristics:**
- Reads sequence left-to-right AND right-to-left
- Combines information from both directions
- Roughly 2x the computation of a unidirectional RNN
- Often more accurate when the full sequence is available
- Not suited to real-time/streaming tasks, because the backward pass needs the complete sequence

In [None]:
class BiRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers=1):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, 
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_size * 2, output_size)
    
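    def forward(self, x):
        rnn_out, hidden = self.rnn(x)
        # hidden[-2]: last layer's final forward state; hidden[-1]: its final backward state.
        # rnn_out[:, -1, :] would pair the forward state with a backward state that has seen only one step.
        final = torch.cat((hidden[-2], hidden[-1]), dim=1)
        output = self.fc(final)
        return output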

model = BiRNN(input_size, hidden_size, output_size)
output = model(x)

print(f"Output shape: {output.shape}")

Output shape: torch.Size([32, 10])


**Code Walkthrough:**

1. **Bidirectional Flag:**
   - `bidirectional=True`: Processes sequence both ways
   - Forward RNN: Left → Right
   - Backward RNN: Right → Left

2. **Output Size Doubles:**
   - `hidden_size * 2` in the linear layer
   - Concatenates forward and backward hidden states
   - Example: hidden_size=128 → BiRNN outputs 256 features

3. **Usage:**
   - Same input/output interface as regular RNN
   - Automatically handles both directions internally
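
The layout of the doubled feature dimension matters when picking a "final" state. In the sketch below (small assumed sizes, `H` as an illustrative name for the hidden size), the first `H` features of the output come from the forward direction and the last `H` from the backward direction. The forward direction finishes at the last time step, while the backward direction finishes at the first.

```python
import torch
import torch.nn as nn

H = 8
birnn = nn.RNN(input_size=5, hidden_size=H, batch_first=True, bidirectional=True)
x = torch.randn(4, 6, 5)

with torch.no_grad():
    out, h_n = birnn(x)

print(out.shape)  # torch.Size([4, 6, 16])  forward and backward features concatenated
print(h_n.shape)  # torch.Size([2, 4, 8])   one final state per direction

fwd_final = out[:, -1, :H]  # forward direction has seen the whole sequence at t = -1
bwd_final = out[:, 0, H:]   # backward direction has seen the whole sequence at t = 0

fwd_ok = torch.allclose(fwd_final, h_n[0])
bwd_ok = torch.allclose(bwd_final, h_n[1])
```

This is why concatenating the two final states from `h_n` is usually a better sequence summary than slicing the output at the last time step alone.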

<a id='stacked'></a>
# Method 5: Stacked/Deep RNN

Stack multiple RNN layers for deeper learning.

**Key characteristics:**
- Multiple RNN layers stacked vertically
- Each layer learns different levels of abstraction
- Better for complex patterns
- More parameters = more training time
- Typically use 2-4 layers (diminishing returns after that)

In [None]:
class StackedRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers=3):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        rnn_out, hidden = self.rnn(x)
        output = self.fc(rnn_out[:, -1, :])
        return output

model = StackedRNN(input_size, hidden_size, output_size, num_layers=3)
output = model(x)

print(f"Output shape: {output.shape}")

Output shape: torch.Size([32, 10])


**Code Walkthrough:**

1. **Number of Layers:**
   - `num_layers=3`: Creates 3 RNN layers stacked vertically
   - Layer 1 output → Layer 2 input; Layer 2 output → Layer 3 input
   - Each layer learns a different level of abstraction

2. **Same Interface:**
   - Only difference from single layer: `num_layers` parameter
   - Input/output shapes remain the same
   - Hidden state has shape (num_layers, batch, hidden_size)

3. **Trade-off:**
   - More layers = better learning capacity
   - More layers = more training time
   - Typically use 2-4 layers in practice
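
A quick way to check these shapes is to inspect a stacked layer directly. This is a minimal sketch with assumed small sizes; it also shows that only the first layer's input weights depend on `input_size`, since later layers consume the previous layer's hidden states.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=5, hidden_size=8, num_layers=3, batch_first=True)
x = torch.randn(4, 6, 5)

with torch.no_grad():
    out, h_n = rnn(x)

print(out.shape)  # torch.Size([4, 6, 8])  outputs of the top layer only
print(h_n.shape)  # torch.Size([3, 4, 8])  final state of every layer

# Layer 0 maps input_size -> hidden_size; deeper layers map hidden_size -> hidden_size
print(rnn.weight_ih_l0.shape)  # torch.Size([8, 5])
print(rnn.weight_ih_l1.shape)  # torch.Size([8, 8])

# The top layer's final state matches the last time step of the output
top_matches = torch.allclose(out[:, -1, :], h_n[-1])
```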

<a id='seq2seq'></a>
# Method 6: Seq2Seq (Encoder-Decoder)

For tasks where input and output have different lengths.

**Architecture:**
- **Encoder**: Compresses input sequence into context vector
- **Decoder**: Generates output sequence from context vector

**Use cases:**
- Machine translation (English → French)
- Text summarization (long text → short summary)
- Chatbots (question → answer)

**Key point:** Input and output lengths can differ

In [None]:
class Encoder(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
    
    def forward(self, x):
        _, hidden = self.rnn(x)
        return hidden

class Decoder(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
    
    def forward(self, x, hidden):
        rnn_out, hidden = self.rnn(x, hidden)
        output = self.fc(rnn_out)
        return output, hidden

class Seq2Seq(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.encoder = Encoder(input_size, hidden_size)
        self.decoder = Decoder(output_size, hidden_size, output_size)
    
    def forward(self, src, tgt):
        context = self.encoder(src)
        output, _ = self.decoder(tgt, context)
        return output

model = Seq2Seq(input_size, hidden_size, output_size)
src = torch.randn(batch_size, seq_len, input_size)
tgt = torch.randn(batch_size, seq_len, output_size)
output = model(src, tgt)

print(f"Output shape: {output.shape}")

Output shape: torch.Size([32, 10, 10])



**Code Walkthrough:**

1. **Encoder:**
   - Takes input sequence of any length
   - Returns only the final hidden state (context vector)
   - This context vector "summarizes" the entire input

2. **Decoder:**
   - Starts with encoder's context as initial hidden state
   - Takes target sequence as input (teacher forcing during training)
   - Here the whole target is processed in one call; at inference time the output is generated step by step

3. **Seq2Seq Model:**
   - Combines encoder + decoder
   - `src`: Source sequence (e.g., English sentence)
   - `tgt`: Target sequence (e.g., French sentence)
   - Input and output can have different lengths (the example uses `seq_len` for both only for simplicity)

4. **Flow:**
   ```
   src → Encoder → context → Decoder → output
         (compress)           (generate)
   ```
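
The sketch below illustrates two points from the walkthrough under assumed small sizes: source and target lengths can differ, and at inference time the decoder can be run one step at a time, feeding each output back in as the next input. The all-zeros start input is a hypothetical placeholder for a start-of-sequence token, and the bare `nn.RNN` layers stand in for the `Encoder` and `Decoder` classes.

```python
import torch
import torch.nn as nn

batch_size, input_size, hidden_size, output_size = 4, 5, 8, 3
src_len, tgt_len = 7, 4  # different lengths on purpose

encoder = nn.RNN(input_size, hidden_size, batch_first=True)
decoder = nn.RNN(output_size, hidden_size, batch_first=True)
fc = nn.Linear(hidden_size, output_size)

src = torch.randn(batch_size, src_len, input_size)
tgt = torch.randn(batch_size, tgt_len, output_size)

# Training-style pass: the full target sequence is fed to the decoder at once
_, context = encoder(src)            # context: (1, batch, hidden)
dec_out, _ = decoder(tgt, context)
train_output = fc(dec_out)           # (batch, tgt_len, output_size)

# Inference-style pass: generate one step at a time from the context
with torch.no_grad():
    hidden = context
    step_input = torch.zeros(batch_size, 1, output_size)  # hypothetical start input
    steps = []
    for _ in range(tgt_len):
        dec_out, hidden = decoder(step_input, hidden)
        step_input = fc(dec_out)     # the prediction becomes the next input
        steps.append(step_input)
    generated = torch.cat(steps, dim=1)

print(train_output.shape)  # torch.Size([4, 4, 3])
print(generated.shape)     # torch.Size([4, 4, 3])
```

In a real task the loop would typically stop on an end-of-sequence prediction rather than after a fixed `tgt_len`.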

<a id='comparison'></a>
# Quick Comparison

## Architecture Comparison

| Method | Use Case | Pros | Cons |
|--------|----------|------|------|
| **Manual RNN** | Learning | Full control | Slow, complex |
| **nn.RNN** | Simple tasks | Fast, easy | Vanishing gradients |
| **LSTM** | Long sequences | Good memory | More parameters |
| **GRU** | Balance | Faster than LSTM | Less capacity |
| **Bidirectional** | Context matters | Both directions | Roughly 2x slower |
| **Stacked RNN** | Complex patterns | More capacity | More training time |
| **Seq2Seq** | Translation | Variable I/O | Complex training |

## When to Use What?

**RNN**: Short sequences, simple patterns
- Sentiment analysis (short reviews)
- Simple time series

**LSTM**: Long sequences, long-term dependencies
- Language modeling
- Speech recognition
- Long text classification

**GRU**: Similar to LSTM but faster
- When LSTM works but speed matters
- Less data available

**Bidirectional**: Context from both sides needed
- Named Entity Recognition
- Fill-in-the-blank tasks
- Not for real-time/streaming

**Seq2Seq**: Variable length input/output
- Machine translation
- Text summarization
- Chatbots

## Parameter Count Comparison

In [8]:
# Compare parameter counts
models = {
    'Manual RNN': ManualRNN(input_size, hidden_size, output_size),
    'nn.RNN': BuiltInRNN(input_size, hidden_size, output_size),
    'LSTM': LSTMModel(input_size, hidden_size, output_size),
    'GRU': GRUModel(input_size, hidden_size, output_size),
    'BiRNN': BiRNN(input_size, hidden_size, output_size),
}

for name, model in models.items():
    params = sum(p.numel() for p in model.parameters())
    print(f"{name:15} {params:,} parameters")

Manual RNN      24,702 parameters
nn.RNN          24,330 parameters
LSTM            93,450 parameters
GRU             70,410 parameters
BiRNN           48,650 parameters
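
The counts above follow from the number of gate blocks per layer type. As a sketch, the recurrent part of a single layer has `gates * (input_size * hidden_size + hidden_size**2 + 2 * hidden_size)` parameters, with 1 block for `nn.RNN`, 3 for `nn.GRU`, and 4 for `nn.LSTM`. The helper name `recurrent_params` is illustrative; the final linear layer (1,290 parameters here) is not included.

```python
import torch.nn as nn

input_size, hidden_size = 50, 128

def recurrent_params(gates, input_size, hidden_size):
    # Per gate block: input weights + hidden weights + two bias vectors
    per_gate = input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size
    return gates * per_gate

layer_types = {"nn.RNN": (nn.RNN, 1), "nn.GRU": (nn.GRU, 3), "nn.LSTM": (nn.LSTM, 4)}

counts = {}
for name, (layer_cls, gates) in layer_types.items():
    layer = layer_cls(input_size, hidden_size, batch_first=True)
    actual = sum(p.numel() for p in layer.parameters())
    expected = recurrent_params(gates, input_size, hidden_size)
    counts[name] = (actual, expected)
    print(f"{name:8} actual={actual:,} formula={expected:,}")
```

With these sizes the formula gives 23,040, 69,120, and 92,160 parameters; adding the 1,290 parameters of the linear head reproduces the totals printed above for `nn.RNN`, GRU, and LSTM.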


## Training Tips

### 1. Gradient Clipping
```python
# Call after loss.backward() and before optimizer.step()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
```
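
Where the call sits in a training step is what matters: after `backward()` has filled the gradients and before the optimizer uses them. The sketch below uses assumed small sizes, random data, and a deliberately small `max_norm=0.1` so that clipping is likely to take effect; the separate `head` layer is illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=5, hidden_size=8, batch_first=True)
head = nn.Linear(8, 3)
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=0.001)

x = torch.randn(4, 6, 5)
y = torch.randint(0, 3, (4,))  # random class labels, for illustration only

optimizer.zero_grad()
out, _ = rnn(x)
loss = nn.functional.cross_entropy(head(out[:, -1, :]), y)
loss.backward()

# Rescales all gradients in place so their combined L2 norm is at most max_norm;
# returns the norm measured before clipping
norm_before = nn.utils.clip_grad_norm_(params, max_norm=0.1)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in params))

optimizer.step()
print(f"before: {norm_before.item():.4f}  after: {norm_after.item():.4f}")
```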

### 2. Learning Rate
- Start with 0.001 for Adam
- Use learning rate scheduling
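
One common form of scheduling is step decay. The sketch below starts Adam at 0.001 and halves the rate every 2 epochs with `torch.optim.lr_scheduler.StepLR`; the `step_size` and `gamma` values are illustrative assumptions, not recommendations, and the training work inside the loop is omitted.

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=5, hidden_size=8, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)

lrs = []
for epoch in range(6):
    # ... forward pass, loss.backward() for each batch would go here ...
    optimizer.step()                       # optimizer steps before the scheduler
    lrs.append(scheduler.get_last_lr()[0]) # rate used during this epoch
    scheduler.step()                       # advance the schedule once per epoch

print(lrs)
```

The recorded rates are 0.001 for epochs 0-1, 0.0005 for epochs 2-3, and 0.00025 for epochs 4-5.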

### 3. Hidden Size
- Start with 128 or 256
- Increase if underfitting
- Decrease if overfitting

### 4. Num Layers
- 1-2 layers usually sufficient
- 3-4 for complex tasks
- More layers = harder to train

### 5. Dropout
```python
# dropout is applied between stacked layers, so it needs num_layers > 1 to have any effect
nn.RNN(input_size, hidden_size, num_layers=2, dropout=0.5)
```
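
Dropout is active only in training mode, which the following sketch illustrates with assumed small sizes: two forward passes on the same input differ under `train()` and agree under `eval()`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=5, hidden_size=8, num_layers=2, dropout=0.5, batch_first=True)
x = torch.randn(4, 6, 5)

with torch.no_grad():
    rnn.train()            # dropout between layer 1 and layer 2 is active
    train_a, _ = rnn(x)
    train_b, _ = rnn(x)

    rnn.eval()             # dropout is disabled
    eval_a, _ = rnn(x)
    eval_b, _ = rnn(x)

train_differs = not torch.allclose(train_a, train_b)
eval_matches = torch.allclose(eval_a, eval_b)
print(train_differs, eval_matches)
```

Calling `model.eval()` before validation or inference avoids noisy predictions caused by dropout.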