# 📘 Lesson 9 — Long Short-Term Memory (LSTM): Advanced Sequence Modeling

---

### 🎯 Why this lesson matters
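RNNs handle sequences well, but on long sequences they suffer from **vanishing gradients**: the training signal from early time steps fades, so the network effectively forgets early information.  
LSTMs address this with **gates** that control what is written to, kept in, and read from memory.  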

👉 LSTMs have been used in machine translation (Google Translate), speech recognition (Siri), and time-series forecasting.  
They are a stepping stone toward Transformers, which also handle long-range dependencies.  

We’ll build an LSTM and see how it differs from a simple RNN.


In [1]:
# Setup
import torch
import torch.nn as nn
import torch.optim as optim
torch.manual_seed(42)


## 1) What is an LSTM?

- LSTM = RNN + **Cell state** (long-term memory) + Gates.
- Gates decide what to remember/forget.

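👉 WHY better than RNN? The cell state is updated additively, so gradients shrink far less over long sequences. This mitigates the vanishing-gradient problem, though it does not remove it entirely.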
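
One concrete way to see the "RNN + gates" idea is to count parameters. The sketch below assumes illustrative sizes (`input_size=8`, `hidden_size=16`) and a hypothetical helper `count_params`; neither comes from the lesson. Because PyTorch's `nn.LSTM` keeps four weight blocks per layer (input gate, forget gate, candidate values, output gate) where `nn.RNN` keeps one, the LSTM should have four times as many parameters.

```python
import torch.nn as nn


def count_params(module):
    """Total number of trainable parameters in a module (hypothetical helper)."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


input_size, hidden_size = 8, 16  # illustrative sizes, not from the lesson

rnn = nn.RNN(input_size, hidden_size, batch_first=True)
lstm = nn.LSTM(input_size, hidden_size, batch_first=True)

rnn_params = count_params(rnn)
lstm_params = count_params(lstm)

print("RNN parameters: ", rnn_params)
print("LSTM parameters:", lstm_params)
print("LSTM / RNN ratio:", lstm_params // rnn_params)
```

The extra parameters are the price of the gating machinery, so an LSTM is also slower to train than a plain RNN of the same hidden size.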


## 2) Gates Mechanism — Forget, Input, Output

- **Forget gate**: Decides what to discard from cell state.
- **Input gate**: Adds new info to cell state.
- **Output gate**: Decides what to output from cell state.

👉 Equations simplified: each gate is a sigmoid layer whose outputs lie between 0 and 1, where 0 means "block everything" and 1 means "let everything through".
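
To make the three gates concrete, here is a minimal sketch of a single LSTM step written out by hand and compared against `nn.LSTMCell`. The sizes and variable names (`i`, `f`, `g`, `o`, `c_manual`, `h_manual`) are chosen for illustration. It relies on PyTorch stacking the weight blocks in the order input gate, forget gate, candidate values, output gate.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_size, hidden_size = 3, 4  # illustrative sizes
cell = nn.LSTMCell(input_size, hidden_size)

x = torch.randn(1, input_size)        # one time step, batch of 1
h_prev = torch.randn(1, hidden_size)  # previous hidden state
c_prev = torch.randn(1, hidden_size)  # previous cell state

# One matrix multiply produces the pre-activations of all four blocks at once.
pre = (x @ cell.weight_ih.T + cell.bias_ih
       + h_prev @ cell.weight_hh.T + cell.bias_hh)
i_pre, f_pre, g_pre, o_pre = pre.chunk(4, dim=1)

i = torch.sigmoid(i_pre)  # input gate: how much new info to write
f = torch.sigmoid(f_pre)  # forget gate: how much old cell state to keep
g = torch.tanh(g_pre)     # candidate values to write
o = torch.sigmoid(o_pre)  # output gate: how much of the cell state to expose

c_manual = f * c_prev + i * g        # new cell state
h_manual = o * torch.tanh(c_manual)  # new hidden state

# Reference result from PyTorch's own implementation.
h_ref, c_ref = cell(x, (h_prev, c_prev))
print("hidden matches:", torch.allclose(h_manual, h_ref, atol=1e-5))
print("cell matches:  ", torch.allclose(c_manual, c_ref, atol=1e-5))
```

Note that the cell-state update `f * c_prev + i * g` is a gated sum rather than a repeated matrix multiply, which is the main reason gradients survive longer than in a plain RNN.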


In [2]:
# Simple LSTM demo
# batch_first=True makes the input layout (batch, seq_len, features), matching the comment below
lstm = nn.LSTM(input_size=1, hidden_size=1, num_layers=1, batch_first=True)
input_seq = torch.tensor([[[1.0], [2.0], [3.0], [4.0]]])  # Batch=1, seq_len=4, features=1
h0 = torch.zeros(1, 1, 1)  # Initial hidden: (num_layers, batch, hidden_size)
c0 = torch.zeros(1, 1, 1)  # Initial cell: (num_layers, batch, hidden_size)

output, (hn, cn) = lstm(input_seq, (h0, c0))
print("Hidden states:", output.shape)  # One hidden state per time step
print("Final cell state:", cn.shape)  # Long-term memory after the last step


Hidden states: torch.Size([1, 4, 1])
Final cell state: torch.Size([1, 1, 1])


## 3) Cell State vs Hidden State

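- **Cell state**: the long-term information highway, changed only by element-wise gating and addition.
- **Hidden state**: the short-term, gated view of the cell state; this is what the layer outputs at each time step.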

👉 WHY separate? Cell state preserves info over long distances.
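
The relationship between the two states can be checked directly on `nn.LSTM`'s return values. This is a small sketch with made-up sizes (batch of 2, 5 time steps, 3 hidden units). It checks that the last row of `output` is the final hidden state `hn`, and that the hidden state never exceeds `tanh` of the cell state in magnitude, because the output gate only scales it down.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lstm = nn.LSTM(input_size=1, hidden_size=3, batch_first=True)
x = torch.randn(2, 5, 1)  # batch=2, seq_len=5, features=1

# Omitting (h0, c0) makes PyTorch start from all-zero states.
output, (hn, cn) = lstm(x)

print("output:", tuple(output.shape))  # hidden state at every time step
print("hn:    ", tuple(hn.shape))      # hidden state after the last step
print("cn:    ", tuple(cn.shape))      # cell state after the last step

# The last time step of `output` is the final hidden state.
last_step_matches = torch.allclose(output[:, -1, :], hn[-1])

# h = o * tanh(c) with 0 < o < 1, so |h| can never exceed |tanh(c)|.
hidden_bounded = bool((hn.abs() <= torch.tanh(cn).abs() + 1e-6).all())

print("output[:, -1] == hn:", last_step_matches)
print("|hn| <= |tanh(cn)|:", hidden_bounded)
```

Only the final cell state is returned; the per-step cell states stay internal to the layer.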


## 4) Building an LSTM Model

- Similar to RNN, but with cell state.
- For sentiment or prediction.


In [3]:
class SimpleLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # batch_first=True: inputs are (batch, seq_len, input_size)
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out, _ = self.lstm(x)  # out: (batch, seq_len, hidden_size); final states are ignored
        out = self.fc(out[:, -1, :])  # Last time step -> (batch, output_size)
        return out
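
A quick shape check helps confirm what the model returns. The sketch below repeats the `SimpleLSTM` class so it runs on its own, then feeds two random batches of different sequence lengths (the sizes are arbitrary). Since only the last time step reaches the linear layer, the output shape should not depend on sequence length.

```python
import torch
import torch.nn as nn


class SimpleLSTM(nn.Module):  # repeated from the lesson so this snippet is self-contained
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])  # use only the last time step


model = SimpleLSTM(input_size=1, hidden_size=20, output_size=1)

short_batch = torch.rand(4, 7, 1)   # 4 sequences of length 7
long_batch = torch.rand(4, 12, 1)   # 4 sequences of length 12

with torch.no_grad():  # no gradients needed for a shape check
    short_logits = model(short_batch)
    long_logits = model(long_batch)

print(tuple(short_logits.shape))
print(tuple(long_logits.shape))
```

The values returned are raw scores (logits); a sigmoid is still needed to turn them into probabilities.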


## 5) Training on Text Data

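- Example: simple sentiment classification (positive/negative).
- Real text would first map word indices to vectors with an embedding layer (`nn.Embedding`); the dummy data in this cell uses random numbers instead.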


In [4]:
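# Dummy data: random numbers stand in for encoded words (no real text or embedding here)
X = torch.rand(10, 5, 1)  # 10 samples, seq_len=5, features=1
# Alternating labels, unrelated to X, so the model can only memorize them
y = torch.tensor([[1.], [0.], [1.], [0.], [1.], [0.], [1.], [0.], [1.], [0.]])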

model = SimpleLSTM(1, 20, 1)
optimizer = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.BCELoss()

for epoch in range(50):
    optimizer.zero_grad()
    output = torch.sigmoid(model(X))
    loss = criterion(output, y)
    loss.backward()
    optimizer.step()
    if (epoch+1) % 25 == 0 or epoch == 0:
        print(f"Epoch {epoch+1}/50, Loss: {loss.item():.2f}")


Epoch 1/50, Loss: 0.69
Epoch 25/50, Loss: 0.50
Epoch 50/50, Loss: 0.30
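
For actual text, the input would be integer word indices passed through an embedding layer. The following is a minimal sketch under stated assumptions: the class name `EmbeddingLSTM`, the five-word vocabulary, and the two toy sentences are all made up for illustration. It uses `nn.BCEWithLogitsLoss`, which applies the sigmoid inside the loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class EmbeddingLSTM(nn.Module):  # hypothetical name, not part of the lesson
    def __init__(self, vocab_size, embed_dim, hidden_size, output_size):
        super().__init__()
        # padding_idx=0 keeps the padding token's vector fixed at zeros
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
        _, (hn, _) = self.lstm(embedded)      # hn: (num_layers, batch, hidden_size)
        return self.fc(hn[-1])                # logits: (batch, output_size)


# Toy vocabulary and sentences (made up for this example).
vocab = {"<pad>": 0, "good": 1, "bad": 2, "movie": 3, "very": 4}
sentences = [["very", "good", "movie"], ["bad", "movie"]]

# Right-pad every sentence with index 0 to the longest length.
# Simplification: the LSTM still steps over the padding positions here.
max_len = max(len(s) for s in sentences)
token_ids = torch.tensor(
    [[vocab[w] for w in s] + [0] * (max_len - len(s)) for s in sentences]
)
labels = torch.tensor([[1.0], [0.0]])  # 1 = positive, 0 = negative

model = EmbeddingLSTM(vocab_size=len(vocab), embed_dim=8, hidden_size=16, output_size=1)
criterion = nn.BCEWithLogitsLoss()  # sigmoid + binary cross-entropy in one step

logits = model(token_ids)
loss = criterion(logits, labels)
print("token ids:", token_ids.tolist())
print("logits shape:", tuple(logits.shape))
print("loss:", round(loss.item(), 4))
```

For batches with very different sentence lengths, packing the sequences (for example with `torch.nn.utils.rnn.pack_padded_sequence`) avoids running the LSTM over padding.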


## 6) Practice Exercises

- Use LSTM for time series forecasting.
- Stack multiple LSTM layers.


In [5]:
# Practice: Stacked LSTM
class StackedLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out, _ = self.lstm(x)
        out = self.fc(out[:, -1, :])
        return out
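
As one possible starting point for the forecasting exercise, the sketch below trains the stacked model to predict the next value of a sine wave from the previous 20 values. The series, the window length, the hidden size, and the number of epochs are all arbitrary choices for illustration, and there is no train/test split.

```python
import math

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)


class StackedLSTM(nn.Module):  # repeated from the lesson so this snippet is self-contained
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])


# Toy series: four periods of a sine wave, 200 points.
series = torch.sin(torch.linspace(0, 8 * math.pi, 200))

# Sliding windows: each input is `window` consecutive values,
# each target is the value that immediately follows the window.
window = 20
X = torch.stack(
    [series[i:i + window] for i in range(len(series) - window)]
).unsqueeze(-1)                      # (180, 20, 1)
y = series[window:].unsqueeze(-1)    # (180, 1)

model = StackedLSTM(input_size=1, hidden_size=16, output_size=1)
optimizer = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.MSELoss()  # regression, so mean squared error instead of BCE

losses = []
for epoch in range(30):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print("X:", tuple(X.shape), "y:", tuple(y.shape))
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

A natural next step is to hold out the last part of the series and measure the error there, since the loss printed here is measured on the training windows only.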


## 📚 Summary

✅ What we learned:
- LSTM gates for long-term memory.
- Cell vs hidden states.
- Training on sequences.

🚀 Next Lesson: **Data Loading & Preprocessing** — preparing real datasets.
