<a href="https://colab.research.google.com/github/gnoejh/ict1022/blob/main/Architectures/lstm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Long Short-Term Memory (LSTM) Networks

Long Short-Term Memory (LSTM) networks are a specialized form of Recurrent Neural Networks (RNNs) designed to mitigate the vanishing gradient problem that affects standard RNNs. Introduced by Hochreiter and Schmidhuber in 1997, LSTMs have become fundamental building blocks for many sequence modeling tasks.

## 1. The Vanishing Gradient Problem

Standard RNNs suffer from the vanishing gradient problem when dealing with long sequences:

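- During backpropagation through time, gradients are multiplied once per time step by the recurrent weight matrix (and by the derivative of the activation function)
- If the largest eigenvalue magnitude of that matrix (its spectral radius) is below 1, gradients shrink exponentially with sequence length
- If it is above 1, gradients can grow exponentially (explode)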
- This makes learning long-range dependencies extremely difficult

LSTMs were specifically designed to address this limitation by introducing a memory cell with gating mechanisms.
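
The shrinking and exploding behavior described above can be reproduced with a few lines of PyTorch. This is a minimal sketch under simplifying assumptions: the RNN is treated as linear (the activation derivative is ignored), and the recurrent matrix is a scaled random orthogonal matrix so that every eigenvalue has magnitude exactly `scale`. The function name `backprop_norms` and all sizes are illustrative choices, not part of any library.

```python
import torch

def backprop_norms(scale, steps=50, size=8, seed=0):
    """Track the gradient norm while backpropagating through `steps` time steps.

    W is `scale` times a random orthogonal matrix, so all of its
    eigenvalues have magnitude `scale` (an illustrative setup only).
    """
    gen = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(size, size, generator=gen))
    W = scale * q
    grad = torch.ones(size)
    norms = []
    for _ in range(steps):
        grad = W.T @ grad  # one backward step through a linear recurrence
        norms.append(grad.norm().item())
    return norms

shrinking = backprop_norms(scale=0.9)
growing = backprop_norms(scale=1.1)
print(f"scale=0.9: norm after 50 steps = {shrinking[-1]:.4f}")
print(f"scale=1.1: norm after 50 steps = {growing[-1]:.1f}")
```

With a scale below 1 the norm decays geometrically, and with a scale above 1 it grows geometrically, which is why signals from distant time steps are hard to learn from in a standard RNN.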

## 2. LSTM Architecture

The core innovation of LSTMs is the cell state (memory cell) that runs through the entire sequence, with gates controlling information flow.

### Key Components

An LSTM cell contains:

1. **Cell State (C)**: Long-term memory that passes through the entire sequence with minimal modification
2. **Hidden State (h)**: Short-term memory/output at each time step
3. **Forget Gate (f)**: Controls what information to discard from the cell state
4. **Input Gate (i)**: Controls what new information to store in the cell state
5. **Output Gate (o)**: Controls what parts of the cell state to output

![LSTM Cell Architecture](https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png)
*Image source: Christopher Olah's blog*

## 3. LSTM Operations

The LSTM performs the following operations at each time step $t$:

### Forget Gate
Decides what information to throw away from the cell state:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

### Input Gate
Decides what new information to store in the cell state:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

### Cell State Update
Updates the old cell state into the new cell state:

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$
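
One way to see why this additive update helps with vanishing gradients is to differentiate it along the cell-state path. The following is a simplified view that treats the gate activations as constants; in the full network the gates also depend on earlier states through $h_{t-1}$, so this is only the direct path:

$$\frac{\partial C_t}{\partial C_{t-1}} \approx \operatorname{diag}(f_t), \qquad \frac{\partial C_t}{\partial C_{t-k}} \approx \prod_{j=t-k+1}^{t} \operatorname{diag}(f_j)$$

When the forget gate stays close to 1, this product stays close to the identity, so the gradient along the cell state is not repeatedly multiplied by a weight matrix and an activation derivative as it is in a standard RNN.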

### Output Gate
Decides what parts of the cell state to output:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t * \tanh(C_t)$$

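Where:
- $\sigma$ is the sigmoid function
- $\tanh$ is the hyperbolic tangent function
- $*$ represents element-wise multiplication
- $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input
- $W_f, W_i, W_C, W_o$ and $b_f, b_i, b_C, b_o$ are learned weight matrices and bias vectors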
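
The equations above can be checked against PyTorch's `nn.LSTMCell` by computing one time step by hand. This is a sketch, not a reference implementation: `manual_lstm_step` is a hypothetical helper, and it relies on PyTorch storing the four gate blocks in the order input, forget, cell candidate, output, with separate input-to-hidden and hidden-to-hidden weights that together play the role of $W \cdot [h_{t-1}, x_t]$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, batch_size = 4, 3, 2
cell = nn.LSTMCell(input_size, hidden_size)

x_t = torch.randn(batch_size, input_size)
h_prev = torch.randn(batch_size, hidden_size)
c_prev = torch.randn(batch_size, hidden_size)

def manual_lstm_step(cell, x_t, h_prev, c_prev):
    # Build [h_{t-1}, x_t] and the matching stacked weight matrix
    hx = torch.cat([h_prev, x_t], dim=1)                    # (batch, hidden + input)
    W = torch.cat([cell.weight_hh, cell.weight_ih], dim=1)  # (4 * hidden, hidden + input)
    b = cell.bias_ih + cell.bias_hh                         # PyTorch keeps two bias vectors
    pre = hx @ W.T + b                                      # (batch, 4 * hidden)

    # PyTorch gate order: input, forget, cell candidate, output
    i_pre, f_pre, g_pre, o_pre = pre.chunk(4, dim=1)

    f_t = torch.sigmoid(f_pre)            # forget gate
    i_t = torch.sigmoid(i_pre)            # input gate
    c_tilde = torch.tanh(g_pre)           # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde    # cell state update
    o_t = torch.sigmoid(o_pre)            # output gate
    h_t = o_t * torch.tanh(c_t)           # new hidden state
    return h_t, c_t

with torch.no_grad():
    h_manual, c_manual = manual_lstm_step(cell, x_t, h_prev, c_prev)
    h_ref, c_ref = cell(x_t, (h_prev, c_prev))

print(torch.allclose(h_manual, h_ref, atol=1e-5))
print(torch.allclose(c_manual, c_ref, atol=1e-5))
```

Both comparisons should agree up to floating-point tolerance, which confirms that the library cell follows the same gate equations.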

## 4. LSTM Variants

Several LSTM variants have been proposed to simplify or enhance the original architecture:

### Peephole Connections
- Allows gates to "peek" at the cell state by adding connections from cell state to gates
- Can help with precise timing tasks
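
In one common formulation (the one used in Olah's blog post, reference 4), the peephole variant simply adds the cell state to the inputs of each gate. The notation follows the equations in Section 3; note that the output gate looks at the updated cell state $C_t$:

$$f_t = \sigma(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i)$$
$$o_t = \sigma(W_o \cdot [C_t, h_{t-1}, x_t] + b_o)$$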

### Coupled Forget and Input Gates
- Instead of separate decisions about what to forget and what to add, these decisions are coupled
- If we forget something, we add something in its place
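
In terms of the cell state update from Section 3, coupling the two gates amounts to replacing $i_t$ with $1 - f_t$:

$$C_t = f_t * C_{t-1} + (1 - f_t) * \tilde{C}_t$$

Each component of the new cell state is then a weighted average of the old value and the candidate value.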

### Gated Recurrent Unit (GRU)
- Simplified version of LSTM that combines forget and input gates into a single "update gate"
- Merges cell state and hidden state
- Often performs similarly to LSTM but with fewer parameters
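
The parameter saving can be made concrete by counting the weights of PyTorch's `nn.LSTM` and `nn.GRU` for the same layer sizes. A small sketch, where `count_parameters` is a helper defined here for illustration: an LSTM layer has four gate blocks and a GRU layer has three, so the GRU ends up with three quarters of the parameters.

```python
import torch.nn as nn

def count_parameters(module):
    """Total number of learnable scalars in a module."""
    return sum(p.numel() for p in module.parameters())

input_size, hidden_size = 10, 20
lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
gru = nn.GRU(input_size, hidden_size, batch_first=True)

lstm_params = count_parameters(lstm)  # 4 * (20*10 + 20*20 + 20 + 20) = 2560
gru_params = count_parameters(gru)    # 3 * (20*10 + 20*20 + 20 + 20) = 1920
print(f"LSTM: {lstm_params}, GRU: {gru_params}")
```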

## 5. Implementation Example

### PyTorch Implementation

```python
import torch
import torch.nn as nn

# LSTM parameters
input_size = 10  # Size of input features
hidden_size = 20  # Size of hidden state
num_layers = 2  # Number of LSTM layers
batch_size = 3  # Batch size
seq_length = 5  # Sequence length

# Create an LSTM layer
lstm = nn.LSTM(input_size=input_size,
               hidden_size=hidden_size,
               num_layers=num_layers,
               batch_first=True)  # batch_first=True means input shape is (batch, seq, feature)

# Example input: (batch_size, seq_length, input_size)
x = torch.randn(batch_size, seq_length, input_size)

# Initial hidden state and cell state
h0 = torch.zeros(num_layers, batch_size, hidden_size)
c0 = torch.zeros(num_layers, batch_size, hidden_size)

# Forward pass
output, (hn, cn) = lstm(x, (h0, c0))

print(f"Output shape: {output.shape}")  # (batch_size, seq_length, hidden_size)
print(f"Hidden state shape: {hn.shape}")  # (num_layers, batch_size, hidden_size)
print(f"Cell state shape: {cn.shape}")  # (num_layers, batch_size, hidden_size)
```

```text
Output shape: torch.Size([3, 5, 20])
Hidden state shape: torch.Size([2, 3, 20])
Cell state shape: torch.Size([2, 3, 20])
```
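
It can help to see how `output` and `hn` relate to each other. The sketch below uses the same layer sizes as the example above and checks two things: for a unidirectional LSTM, the last time step of `output` is the final hidden state of the top layer, and omitting the initial states is equivalent to passing zeros.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2, batch_first=True)
x = torch.randn(3, 5, 10)  # (batch, seq, feature)

# When (h0, c0) is omitted, PyTorch initializes both to zeros
output, (hn, cn) = lstm(x)

# `output` holds the top layer's hidden state at every time step,
# `hn` holds every layer's hidden state at the final time step
last_step = output[:, -1, :]  # (batch, hidden_size)
top_layer_final = hn[-1]      # (batch, hidden_size)
same = torch.allclose(last_step, top_layer_final)
print(f"output[:, -1] equals hn[-1]: {same}")

# Passing explicit zero states gives the same result
h0 = torch.zeros(2, 3, 20)
c0 = torch.zeros(2, 3, 20)
output_explicit, _ = lstm(x, (h0, c0))
print(torch.allclose(output, output_explicit))
```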



## 6. Building an LSTM-Based Sequence Model

Here's how to build a complete sequence model using LSTMs:

```python
import torch
import torch.nn as nn

class LSTMSequenceModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers=1, dropout=0):
        super().__init__()

        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # LSTM layer (nn.LSTM only applies dropout between stacked layers,
        # so it is disabled for a single layer to avoid a warning)
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )

        # Output layer
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x shape: (batch_size, seq_length, input_size)

        # Initialize hidden and cell states with zeros on the input's device
        batch_size = x.size(0)
        h0 = torch.zeros(self.num_layers, batch_size, self.hidden_size, device=x.device)
        c0 = torch.zeros(self.num_layers, batch_size, self.hidden_size, device=x.device)

        # Forward propagate LSTM
        # out shape: (batch_size, seq_length, hidden_size)
        out, _ = self.lstm(x, (h0, c0))

        # Apply the output layer at every time step (sequence-to-sequence).
        # For one prediction per sequence, use only the last step: self.fc(out[:, -1, :])
        out = self.fc(out)
        return out

# Example usage
model = LSTMSequenceModel(input_size=10, hidden_size=32, output_size=5, num_layers=2)
sample_input = torch.randn(8, 15, 10)  # batch_size=8, seq_length=15, input_size=10
output = model(sample_input)
print(f"Model output shape: {output.shape}")  # Should be [8, 15, 5]
```

```text
Model output shape: torch.Size([8, 15, 5])
```
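
The model above produces one output per time step. For tasks that need a single prediction per sequence, such as sentiment analysis, a common alternative is to pass only the last time step to the output layer. The class below is a hypothetical many-to-one sketch with illustrative sizes, not part of the original notebook.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Many-to-one variant: one vector of class scores per input sequence."""

    def __init__(self, input_size, hidden_size, num_classes, num_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # out shape: (batch_size, seq_length, hidden_size)
        out, _ = self.lstm(x)  # initial states default to zeros
        # Keep only the last time step, then map it to class scores
        return self.fc(out[:, -1, :])

classifier = LSTMClassifier(input_size=10, hidden_size=32, num_classes=5, num_layers=2)
logits = classifier(torch.randn(8, 15, 10))
print(f"Classifier output shape: {logits.shape}")  # (batch_size, num_classes)
```

Taking the last time step assumes that all sequences in the batch have the same length; with padded batches the last real step of each sequence should be used instead.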


## 7. Common Applications

LSTMs excel in various sequence modeling tasks:

### Natural Language Processing
- Language modeling
- Machine translation
- Text generation
- Sentiment analysis
- Named entity recognition

### Time Series Analysis
- Stock price prediction
- Weather forecasting
- Anomaly detection in sensor data
- Energy load forecasting

### Speech and Audio
- Speech recognition
- Speech synthesis
- Music generation
- Audio classification

### Other Applications
- Gesture recognition
- Video analysis
- Handwriting recognition
- DNA sequence analysis

## 8. Advantages and Limitations

### Advantages
- Effectively captures long-range dependencies in sequences
- Mitigates the vanishing gradient problem
- Maintains relevant information over many time steps
- Works well with variable-length sequences
- Robust against noise in temporal data

### Limitations
- Computationally more expensive than standard RNNs
- Sequential dependence between time steps prevents parallelization across the sequence during training
- May struggle with very long sequences (1000+ steps)
- Requires careful initialization and regularization
- Largely supplanted by Transformer models in many NLP applications
- Must compress all past context into a fixed-size state, whereas attention-based models can attend to any earlier position directly

## 9. Best Practices for LSTM Models

### Architecture Design
- Start with 1-2 LSTM layers before adding complexity
- Use bidirectional LSTMs for tasks where future context is important
- Consider adding residual connections for very deep LSTM networks
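
As an illustration of the bidirectional option, the sketch below shows how `bidirectional=True` changes the tensor shapes in PyTorch. The sizes are arbitrary. The feature dimension of `output` doubles because the forward and backward hidden states are concatenated, and `hn` gains one entry per direction and layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size = 20
bilstm = nn.LSTM(input_size=10, hidden_size=hidden_size, num_layers=2,
                 batch_first=True, bidirectional=True)
x = torch.randn(3, 5, 10)  # (batch, seq, feature)

output, (hn, cn) = bilstm(x)
print(output.shape)  # (batch, seq, 2 * hidden_size)
print(hn.shape)      # (num_layers * 2, batch, hidden_size)

# For the top layer, hn[-2] is the forward direction's final state and
# hn[-1] is the backward direction's final state
forward_final = hn[-2]
backward_final = hn[-1]

# The forward state lives at the last time step of `output`,
# the backward state at the first time step
print(torch.allclose(output[:, -1, :hidden_size], forward_final, atol=1e-6))
print(torch.allclose(output[:, 0, hidden_size:], backward_final, atol=1e-6))

# A fixed-size summary of the whole sequence for a downstream classifier
summary = torch.cat([forward_final, backward_final], dim=1)  # (batch, 2 * hidden_size)
```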

### Training Tips
- Use gradient clipping to prevent exploding gradients
- Apply dropout between LSTM layers, not within recurrent connections
- Normalize inputs for faster convergence
- Consider using different learning rates for LSTM and output layers
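
The training tips above can be combined in a short loop. This is a minimal sketch on random data: the model `TinyRegressor`, the learning rates, the dropout rate, and the clipping threshold are illustrative assumptions rather than recommended values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyRegressor(nn.Module):
    def __init__(self, input_size=10, hidden_size=32, output_size=1):
        super().__init__()
        # nn.LSTM applies this dropout between the stacked layers only,
        # not on the recurrent connections
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers=2,
                            batch_first=True, dropout=0.2)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])

model = TinyRegressor()

# Separate parameter groups give the LSTM and the output layer different learning rates
optimizer = torch.optim.Adam([
    {"params": model.lstm.parameters(), "lr": 1e-3},
    {"params": model.fc.parameters(), "lr": 1e-2},
])
criterion = nn.MSELoss()

# Random stand-in data: 16 sequences, 25 steps, 10 features
x = torch.randn(16, 25, 10)
y = torch.randn(16, 1)

# Normalize each input feature to zero mean and unit variance
x = (x - x.mean(dim=(0, 1), keepdim=True)) / (x.std(dim=(0, 1), keepdim=True) + 1e-8)

max_norm = 1.0
losses = []
for step in range(5):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    # Rescale gradients so that their global L2 norm is at most max_norm
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    losses.append(loss.item())

print(losses)
```

Clipping is placed between `loss.backward()` and `optimizer.step()` so that the update always uses the rescaled gradients.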

### Sequence Handling
- Use padding and masking for variable-length sequences
- Consider applying attention mechanisms for very long sequences
- Truncated backpropagation through time can help with memory constraints
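
For the padding and masking point, PyTorch provides packing utilities in `torch.nn.utils.rnn` that let an LSTM skip padded positions. The following sketch uses three random sequences of different lengths; the sizes are arbitrary.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Three sequences with 5, 3, and 2 time steps, each step having 10 features
sequences = [torch.randn(5, 10), torch.randn(3, 10), torch.randn(2, 10)]
lengths = torch.tensor([len(s) for s in sequences])

# Pad to a common length: (batch, max_len, feature)
padded = pad_sequence(sequences, batch_first=True)

# Pack so the LSTM only processes real time steps;
# enforce_sorted=False allows sequences in any length order
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
packed_out, (hn, cn) = lstm(packed)

# Unpack back to a padded tensor; padded positions are filled with zeros
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(output.shape)  # (batch, max_len, hidden_size)
print(out_lengths)   # original lengths

# hn holds the state at each sequence's last real step, not at the padded end
print(torch.allclose(hn[-1][1], output[1, 2], atol=1e-6))
```

Without packing, the final hidden state of a short sequence would be computed after running over its padding, which is usually not what a downstream classifier should see.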

## 10. References and Further Reading

1. Hochreiter, S., & Schmidhuber, J. (1997). [Long Short-Term Memory](https://www.bioinf.jku.at/publications/older/2604.pdf). Neural Computation, 9(8), 1735-1780.

2. Graves, A., & Schmidhuber, J. (2005). [Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures](https://www.cs.toronto.edu/~graves/nn_2005.pdf). Neural Networks, 18(5-6), 602-610.

3. Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., & Schmidhuber, J. (2016). [LSTM: A Search Space Odyssey](https://arxiv.org/pdf/1503.04069.pdf). IEEE Transactions on Neural Networks and Learning Systems, 28(10), 2222-2232.

4. Olah, C. (2015). [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/). colah's blog.

5. Olah, C., & Carter, S. (2016). [Attention and Augmented Recurrent Neural Networks](https://distill.pub/2016/augmented-rnns/). Distill.

6. Goodfellow, I., Bengio, Y., & Courville, A. (2016). [Deep Learning](https://www.deeplearningbook.org/). MIT Press. (Chapter 10: Sequence Modeling).