# RNNs and their Shortcomings

- 📺 **Video:** [https://youtu.be/xvnnA04JVQo](https://youtu.be/xvnnA04JVQo)

## Overview
- Review recurrent neural networks for sequential modeling and where they struggle.
- Connect vanishing/exploding gradients to difficulties capturing long-range dependencies.

## Key ideas
- **Recurrence:** hidden state summarizes previous tokens.
- **Gradient issues:** backpropagation through time multiplies the gradient by one Jacobian per time step, so its norm can shrink or grow exponentially with sequence length.
- **Long-range dependencies:** simple RNNs forget distant context without gating.
- **Alternatives:** LSTMs, GRUs, and self-attention mitigate these issues.
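
The gradient issue above can be written out as a short derivation. This is a sketch under an assumed notation that the notes themselves do not fix: a simple (Elman-style) RNN with recurrent matrix $W$, input matrix $U$, bias $b$, and a tanh nonlinearity.

$$
h_t = \tanh(W h_{t-1} + U x_t + b)
$$

Each step contributes one Jacobian, $\partial h_t / \partial h_{t-1} = \operatorname{diag}(1 - h_t^2)\, W$, where the square is taken elementwise. Chaining them from step $k$ to step $T$ (later steps on the left) gives

$$
\frac{\partial h_T}{\partial h_k} = \prod_{t=k+1}^{T} \operatorname{diag}\!\left(1 - h_t^2\right) W,
\qquad
\left\lVert \frac{\partial h_T}{\partial h_k} \right\rVert_2 \le \lVert W \rVert_2^{\,T-k}
$$

The bound holds because every entry of $1 - h_t^2$ lies in $[0, 1]$. If $\lVert W \rVert_2 < 1$, the gradient with respect to a distant state therefore decays at least exponentially in the distance $T - k$, which is the vanishing case. Growth is only possible when $W$ is large enough to outweigh the tanh factors.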

## Demo
Simulate gradient propagation through a simple recurrent matrix to show vanishing/exploding effects discussed in the lecture (https://youtu.be/I-8XM1SkJOI).

```python
import numpy as np

# Forward pass: iterate h_t = tanh(W h_{t-1}) with no input term.
W = np.array([[0.7, 0.2], [0.1, 0.6]])
T = 20

state = np.array([1.0, 0.0])
for t in range(T):
    state = np.tanh(W @ state)
print(f'Hidden state after {T} steps:', state)

# Gradient propagation with eigenvalues < 1 (vanishing case).
# Linear simplification: multiply by W_grad.T only, omitting the tanh derivative.
W_grad = np.array([[0.5, 0.0], [0.0, 0.4]])
vec = np.ones(2)
for t in range(1, 31):
    vec = W_grad.T @ vec
    if t % 5 == 0:
        print(f'Norm after {t} steps (vanishing case): {np.linalg.norm(vec):.6f}')

# Exploding gradients example: both eigenvalues are > 1.
W_explode = np.array([[1.3, 0.0], [0.0, 1.1]])
vec = np.ones(2)
for t in range(1, 11):
    vec = W_explode.T @ vec
    if t % 2 == 0:
        print(f'Norm after {t} steps (exploding case): {np.linalg.norm(vec):.2f}')
```


```
Hidden state after 20 steps: [0.00605599 0.00302811]
Norm after 5 steps (vanishing case): 0.032885
Norm after 10 steps (vanishing case): 0.000982
Norm after 15 steps (vanishing case): 0.000031
Norm after 20 steps (vanishing case): 0.000001
Norm after 25 steps (vanishing case): 0.000000
Norm after 30 steps (vanishing case): 0.000000
Norm after 2 steps (exploding case): 2.08
Norm after 4 steps (exploding case): 3.21
Norm after 6 steps (exploding case): 5.14
Norm after 8 steps (exploding case): 8.43
Norm after 10 steps (exploding case): 14.03
```
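
The gradient loops in the demo multiply by the weight matrix alone. As a rough extension, the sketch below also includes the tanh derivative, so each backward step applies the transpose of the Jacobian `diag(1 - h_t**2) @ W` along the actual forward trajectory. The helper name `backprop_norms` and the all-ones starting gradient are illustrative assumptions, not something taken from the lecture.

```python
import numpy as np

def backprop_norms(W, h0, steps):
    """Hypothetical helper: run h_t = tanh(W h_{t-1}) forward, then push a
    gradient vector backward through the Jacobians diag(1 - h_t**2) @ W."""
    # Forward pass: keep every hidden state, since the Jacobians depend on them.
    states = [np.asarray(h0, dtype=float)]
    for _ in range(steps):
        states.append(np.tanh(W @ states[-1]))

    # Backward pass: g_{t-1} = W^T diag(1 - h_t**2) g_t, starting from ones (an assumption).
    grad = np.ones(W.shape[0])
    norms = []
    for h_t in reversed(states[1:]):
        grad = W.T @ ((1.0 - h_t**2) * grad)
        norms.append(np.linalg.norm(grad))
    return norms

W = np.array([[0.7, 0.2], [0.1, 0.6]])  # same matrix as the demo's forward pass
steps = 20

norms = backprop_norms(W, h0=[1.0, 0.0], steps=steps)

# Spectral radius: largest eigenvalue magnitude of W.
rho = np.max(np.abs(np.linalg.eigvals(W)))

# Linear baseline without the tanh factor, as in the demo's gradient loops.
linear_norm = np.linalg.norm(np.linalg.matrix_power(W.T, steps) @ np.ones(2))

print(f'Spectral radius of W: {rho:.3f}')
print(f'Gradient norm after {steps} steps with tanh factor: {norms[-1]:.6f}')
print(f'Gradient norm after {steps} steps, linear baseline: {linear_norm:.6f}')
```

Because the tanh derivative never exceeds 1 and every entry of this `W` is non-negative, the gradient with the tanh factor can only be smaller than the linear baseline here, so the linear demo slightly understates how fast gradients vanish for this matrix.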


## Try it
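- Modify the demo: change the entries of `W_grad` and `W_explode` and observe how quickly the norm shrinks or grows.
- Add a tiny dataset or counter-example, such as the identity matrix, for which the norm stays constant.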
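
One possible starting point for these exercises is sketched below. It reuses the demo's linear gradient loop and sweeps a scale factor applied to a rotation matrix. The helpers `rotation` and `norm_after`, the angle 0.3, and the list of scales are all illustrative choices rather than part of the original demo.

```python
import numpy as np

def rotation(theta):
    """2x2 rotation matrix; it is orthogonal, so it preserves vector norms."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def norm_after(W, steps):
    """Norm of a ones vector after `steps` multiplications by W.T (linear case)."""
    vec = np.ones(W.shape[0])
    for _ in range(steps):
        vec = W.T @ vec
    return np.linalg.norm(vec)

steps = 30
R = rotation(0.3)  # both eigenvalues have magnitude exactly 1

for scale in (0.5, 0.9, 1.0, 1.1, 1.3):
    # Scaling R sets the magnitude of both eigenvalues to `scale`.
    print(f'scale={scale:.1f}: norm after {steps} steps = {norm_after(scale * R, steps):.6g}')
```

With `scale=1.0` the matrix is a pure rotation and the norm stays at its starting value, which gives a counter-example to both the vanishing and the exploding case. Any scale below or above 1 shrinks or grows the norm geometrically.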


## References
- [Eisenstein 6.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.2](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.4](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [[Blog] Understanding LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
- [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)
- [[Blog] The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation](https://arxiv.org/abs/2108.12409)
- [The Impact of Positional Encoding on Length Generalization in Transformers](https://arxiv.org/abs/2305.19466)


*Links only; we do not redistribute slides or papers.*