# RNNs and their Shortcomings

- 📺 **Video:** [https://youtu.be/xvnnA04JVQo](https://youtu.be/xvnnA04JVQo)

## Overview
Examines what Recurrent Neural Networks (RNNs) can and cannot do on sequence tasks. The video likely reviews how an RNN works, for example with a diagram of the network unrolled over time, and then focuses on the problem of long-term dependencies.
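
To make the "unrolled over time" picture concrete, here is a minimal sketch of a scalar vanilla RNN. The function name `rnn_forward` and the weight values are illustrative assumptions, not taken from the video; the point is only that the same weights are applied at every time step.

```python
import math

def rnn_forward(xs, w_h=0.5, w_x=1.0, b=0.0, h0=0.0):
    """Unroll a scalar vanilla RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t + b).

    The weights are hypothetical hand-picked values, not trained ones.
    """
    h = h0
    states = []
    for x in xs:  # one loop iteration per time step, same weights each time
        h = math.tanh(w_h * h + w_x * x + b)
        states.append(h)
    return states

# A single "event" at t=1 followed by silence: watch the hidden state fade.
states = rnn_forward([1.0, 0.0, 0.0, 0.0, 0.0])
for t, h in enumerate(states, start=1):
    print(f"t={t}  h={h:.4f}")
```

With these weights the trace of the first input shrinks at every step, which is a small-scale version of the "forgetting" discussed in this section.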

In [None]:
import os, random
random.seed(0)
CI = os.environ.get('CI') == 'true'

## Key ideas
- The video presents scenarios where an RNN can fail, such as using information from far back in a sentence or document (for example, remembering whether the subject is singular or plural so that the verb agrees at the end of a long sentence).
- The explanation likely draws on the classic analyses by Hochreiter (1991) and Bengio et al. (1994): vanilla RNNs suffer from vanishing gradients. As the error is backpropagated through many time steps, gradients tend to shrink exponentially (vanish) or, less often, grow exponentially (explode), which makes it hard to learn correlations over long distances.
- A concrete example the video might use: “I grew up in France … I speak fluent ___.” To predict “French” at the blank, the model must remember “France” from much earlier. A simple RNN may struggle when the gap is large, because it tends to “forget” earlier content by the time it reaches later words.
- This motivates specialized RNN architectures like LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units), which introduce gating mechanisms to help preserve information over long time spans.
- The video likely gives an intuition for LSTMs: gates control the flow of information, so important information can stay in memory (the cell state) across many time steps, which mitigates the vanishing gradient problem by design. It might show or describe the components of an LSTM (forget gate, input gate, output gate) at a high level: the gates decide what to keep in, write to, or output from the memory cell.
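
The vanishing and exploding behaviour described in the list above can be checked numerically. The sketch below assumes a scalar RNN, h_t = tanh(w * h_{t-1} + x_t), and multiplies the local derivatives w * (1 - h_t^2) along the sequence to get the derivative of the last hidden state with respect to the first. The helper name `grad_through_time` and the weight values are illustrative assumptions.

```python
import math

def grad_through_time(w, xs, h0=0.0):
    """Return d h_T / d h_0 for h_t = tanh(w * h_{t-1} + x_t) via the chain rule."""
    h, grad = h0, 1.0
    for x in xs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # local derivative d h_t / d h_{t-1}
    return grad

for T in (1, 10, 50, 100):
    # |w| < 1: every factor is below 0.9, so the product shrinks exponentially.
    shrinking = grad_through_time(0.9, [0.5] * T)
    # w > 1 with zero inputs keeps h at 0 (tanh is locally linear there),
    # so the product is exactly w**T and grows exponentially.
    growing = grad_through_time(1.5, [0.0] * T)
    print(f"T={T:3d}  w=0.9: {shrinking:.3e}   w=1.5 (linear regime): {growing:.3e}")
```

In a real RNN the scalar `w` is replaced by a weight matrix and the product is a product of Jacobians, but the same exponential shrink-or-grow behaviour applies.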

## Demo
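As a starting point for the exercises, here is a minimal sketch of one LSTM step with scalar state, following the forget/input/output gate description in Key ideas. The parameter values are hypothetical and hand-set (not trained): the forget gate is biased toward "keep", and the input gate opens only when the input is large.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps a gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))  # how much of the old cell state to keep
    i = sigmoid(pre("input"))   # how much of the candidate to write
    o = sigmoid(pre("output"))  # how much of the cell state to expose
    g = math.tanh(pre("cand"))  # candidate value to write
    c = f * c_prev + i * g      # additive cell-state update
    h = o * math.tanh(c)        # hidden state passed to the next step
    return h, c

# Hypothetical hand-set parameters, chosen only to illustrate the gating.
params = {
    "forget": (0.0, 0.0, 6.0),
    "input": (10.0, 0.0, -5.0),
    "output": (0.0, 0.0, 6.0),
    "cand": (2.0, 0.0, 0.0),
}

xs = [1.0] + [0.0] * 49  # one informative input, then 49 empty steps
h, c = 0.0, 0.0
cells = []
for x in xs:
    h, c = lstm_step(x, h, c, params)
    cells.append(c)

print(f"cell state after step 1:  {cells[0]:.3f}")
print(f"cell state after step 50: {cells[-1]:.3f}")
```

Because the cell state is updated additively and the forget gate stays close to 1, most of the value written at step 1 is still present at step 50. Lowering the forget bias is one way to see the memory fade faster.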

In [None]:
print('Try the exercises below and follow the linked materials.')

## Try it
- Modify the demo
- Add a tiny dataset or counter-example


## References
- [Eisenstein 6.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.2](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.4](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [[Blog] Understanding LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
- [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)
- [[Blog] The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation](https://arxiv.org/abs/2108.12409)
- [The Impact of Positional Encoding on Length Generalization in Transformers](https://arxiv.org/abs/2305.19466)


*Links only; we do not redistribute slides or papers.*