# Sequence Modeling and RNNs

In this chapter, we deal with variable-length sequence data.
This is fundamentally different from the fixed-shape data that we have encountered so far, yet variable-length data is abundant in the real world. Tasks such as translating passages of text from one language to another,
engaging in dialogue, or controlling a robot demand that models both ingest and output sequentially structured data. Here we focus on text data, which is our primary interest. We first sample character sequences from a dataset of Spanish names as training data. To increase complexity, we then move on to *The Time Machine* (1895) by [H. G. Wells](https://en.wikipedia.org/wiki/H._G._Wells).

For sequence modeling, we explore two approaches. The first approach uses fixed-length windows, or **contexts**, to predict the probability distribution of the next token. This allows us to use familiar models such as CNNs and MLPs. The second approach introduces **Recurrent Neural Networks** (RNNs) which can process sequences of arbitrary length. RNNs are neural networks that capture the dynamics of sequences via *recurrent connections* ({numref}`04-rnn`), which can be thought of as cycles that iteratively update a hidden state vector in the network. 
The resulting hidden representation depends on the order of the inputs. Hence, RNNs inherit causality from the structure of the text.

To understand the challenges of training RNNs, we derive the **BPTT equations** (**B**ack**p**ropagation **T**hrough **T**ime). We will see that RNNs accumulate gradients with a depth that corresponds to the number of time steps, rather than the number of layers as in MLPs[^1]. In particular, we will see that RNNs struggle to model long-term dependencies (i.e., tokens that are spaced far apart but share a significant relationship), which manifests as vanishing gradients. This motivated the development of more advanced RNN architectures (e.g., LSTM {cite}`lstm` and GRU {cite}`gru`) that aim to mitigate vanishing gradients.

[^1]: Both can be formulated in terms of the path length in the computation graph between two nodes that share a dependency.

<br>

```{figure} ../../../img/nn/04-rnn.svg
---
width: 600px
name: 04-rnn
align: center
---
RNN unit: (a) cyclic and (b) unrolled. The unrolled RNN is essentially a deep MLP with shared weights.
```

## Sequential data

We consider inputs of the form
$\boldsymbol{\mathsf{x}}_1, \ldots, \boldsymbol{\mathsf{x}}_T$ where $\boldsymbol{\mathsf{x}}_t \in \mathbb{R}^d$ for $t = 1, \ldots, T.$ 
Examples include the sequence of words in a document, or the sequence of events encountered by an RL agent.
In each case, the entities are represented by vectors in $\mathbb{R}^d.$ Note that $T$ is usually a maximum length, and the model may process variable-length inputs of length $\tau$ where $1 \leq \tau \leq T.$ In terms of targets, we can have:


|Task|Mapping|Example|
|------|------|-----|
| Fixed target |$(\boldsymbol{\mathsf{x}}_1, \ldots, \boldsymbol{\mathsf{x}}_T) \mapsto \boldsymbol{\mathsf{y}}$ | Sentiment Analysis [[1]](https://www.nvidia.com/en-us/glossary/sentiment-analysis/) |
| Fixed input | $\boldsymbol{\mathsf{x}} \mapsto (\boldsymbol{\mathsf{y}}_1, \ldots, \boldsymbol{\mathsf{y}}_T)$ | Image Captioning [[3]](https://cs.stanford.edu/people/karpathy/deepimagesent/) |
| Sequence-to-Sequence | $(\boldsymbol{\mathsf{x}}_1, \ldots, \boldsymbol{\mathsf{x}}_T) \mapsto (\boldsymbol{\mathsf{y}}_1, \ldots, \boldsymbol{\mathsf{y}}_T)$  | Video Captioning, Machine Translation |

Sequence-to-sequence tasks take two forms: 

|Type|Constraint|Example|
|------|------|-----|
| Aligned | Corresponding target aligns with input at each time step | Speech recognition ([STT](https://www.nvidia.com/en-us/glossary/speech-to-text/)) |
| Unaligned | No step-for-step correspondence required | Machine Translation [[2]](https://research.google/research-areas/machine-translation/) |


To construct training data from historical data, we typically create examples by sampling windows at random positions in the sequence. We often assume that the underlying data-generating process does not change over time, i.e. that it is *stationary*. In practice, this means that the same weights are used at every time step.
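Random window sampling can be sketched as follows. This is a minimal illustration, not the chapter's data pipeline: the function name `sample_windows`, its parameters, and the toy string are all assumptions made for the example.

```python
import random

def sample_windows(tokens, window_size, num_samples, seed=0):
    """Sample (context, target) pairs at random positions in a token sequence.

    Hypothetical helper: each example is a window of `window_size` tokens
    followed by the single token that comes right after it.
    """
    if len(tokens) <= window_size:
        raise ValueError("sequence must be longer than the window size")
    rng = random.Random(seed)
    examples = []
    for _ in range(num_samples):
        # Pick a start index such that the window and its target both fit.
        start = rng.randrange(len(tokens) - window_size)
        context = tokens[start : start + window_size]
        target = tokens[start + window_size]
        examples.append((context, target))
    return examples

text = "the time traveller"  # toy stand-in for a real corpus
pairs = sample_windows(list(text), window_size=4, num_samples=3)
for context, target in pairs:
    print("".join(context), "->", target)
```

Every window is treated the same way regardless of where it starts, which is how the stationarity assumption shows up in the data preparation.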

<br>

## Autoregressive modeling

The goal of autoregressive modeling is to characterize 
$p(\boldsymbol{\mathsf{x}}_t \mid \boldsymbol{\mathsf{x}}_1, \ldots, \boldsymbol{\mathsf{x}}_{t-1}).$
This is autoregressive in the sense that previous elements of the same sequence are used to estimate the next element. Note that the
entire distribution is generally hard to compute, and we may be content with $\mathbb{E}\left[\boldsymbol{\mathsf{x}}_t \mid \boldsymbol{\mathsf{x}}_1, \ldots, \boldsymbol{\mathsf{x}}_{t-1}\right]$, i.e. estimating the average value of the next element.
One issue is that the number of conditioning inputs grows with $t$, i.e. with the amount of data that we encounter.
Much of the sequence modeling literature revolves around techniques for dealing with
this growing context size when predicting the next token or certain statistics of the distribution $p(\boldsymbol{\mathsf{x}}_t \mid \boldsymbol{\mathsf{x}}_1, \ldots, \boldsymbol{\mathsf{x}}_{t-1}).$
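
As background for why these conditionals matter (a standard identity, stated here as a reminder rather than a result derived in this chapter), the chain rule of probability factorizes the joint distribution of a sequence into exactly such terms:

$$
p(\boldsymbol{\mathsf{x}}_1, \ldots, \boldsymbol{\mathsf{x}}_T) = p(\boldsymbol{\mathsf{x}}_1) \prod_{t=2}^{T} p(\boldsymbol{\mathsf{x}}_t \mid \boldsymbol{\mathsf{x}}_1, \ldots, \boldsymbol{\mathsf{x}}_{t-1}).
$$

The $t$-th factor conditions on $t - 1$ previous elements, so the context size grows with the position in the sequence.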

A natural strategy is to simply ignore the distant past, i.e. to use only the past $\tau$ observations, so that we estimate $p(\boldsymbol{\mathsf{x}}_t \mid \boldsymbol{\mathsf{x}}_{t-\tau}, \ldots, \boldsymbol{\mathsf{x}}_{t-1}).$ This is a [Markovian assumption](https://en.wikipedia.org/wiki/Markov_model): we assume that the past $\tau$ elements are sufficient to approximate the next element. This makes sense especially for phenomena where long-range dependencies are rare, or where their importance decays quickly with time.
In this case, all inputs are of length $\tau$
which allows us to train any linear model or deep network that requires fixed-length vectors as inputs.
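
To make the fixed-context idea concrete, here is a minimal count-based sketch that estimates the next-token distribution from the previous $\tau$ tokens. It deliberately swaps in simple frequency counts instead of a linear model or a deep network; the function names and the toy sequence are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def fit_markov_counts(tokens, tau):
    """Count how often each token follows each length-tau context."""
    counts = defaultdict(Counter)
    for t in range(tau, len(tokens)):
        context = tuple(tokens[t - tau : t])  # the past tau observations
        counts[context][tokens[t]] += 1
    return counts

def next_token_distribution(counts, context):
    """Relative frequencies of the next token given a context.

    Returns an empty dict for contexts never seen during fitting.
    """
    counter = counts.get(tuple(context))
    if not counter:
        return {}
    total = sum(counter.values())
    return {token: n / total for token, n in counter.items()}

tokens = list("abababac")
counts = fit_markov_counts(tokens, tau=2)
probs = next_token_distribution(counts, "ba")
print(probs)  # 'b' follows "ba" twice and 'c' once
```

Because every context has exactly $\tau$ elements, the same code path handles every position in the sequence. The cost is that anything older than $\tau$ steps cannot influence the estimate.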

Next, we develop models that maintain a summary $\boldsymbol{\mathsf{h}}_t$ of the past observations. This summary is used to predict the next output and is updated with each new observation, i.e. $\boldsymbol{\mathsf{y}}_t = g(\boldsymbol{\mathsf{h}}_{t})$ and $\boldsymbol{\mathsf{h}}_t = f(\boldsymbol{\mathsf{x}}_{t-1}, \boldsymbol{\mathsf{h}}_{t-1}).$
Since the state $\boldsymbol{\mathsf{h}}_t$ is never observed, these models are called **latent autoregressive models**. RNNs are an example of such models[^2].
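
The recurrence $\boldsymbol{\mathsf{h}}_t = f(\boldsymbol{\mathsf{x}}_{t-1}, \boldsymbol{\mathsf{h}}_{t-1})$, $\boldsymbol{\mathsf{y}}_t = g(\boldsymbol{\mathsf{h}}_t)$ can be sketched in NumPy as below. The specific choices are assumptions for illustration: a `tanh` state update, a linear readout, a zero initial state, randomly initialized (untrained) weights, and the dimensions `d` and `k`.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 3, 4  # input dimension d and hidden dimension k (arbitrary choices)

# Hypothetical parameters; in a real model these would be learned.
W_x = rng.normal(scale=0.1, size=(k, d))
W_h = rng.normal(scale=0.1, size=(k, k))
b = np.zeros(k)
W_y = rng.normal(scale=0.1, size=(d, k))

def f(x_prev, h_prev):
    """State update h_t = f(x_{t-1}, h_{t-1}); tanh is one common choice."""
    return np.tanh(W_x @ x_prev + W_h @ h_prev + b)

def g(h):
    """Readout y_t = g(h_t); here a plain linear map."""
    return W_y @ h

def run(xs):
    """Process a sequence of any length, reusing the same weights each step."""
    h = np.zeros(k)  # state before any observation (assumed zero)
    ys = []
    for x_prev in xs:
        h = f(x_prev, h)  # fold the new observation into the summary
        ys.append(g(h))   # predict from the summary alone
    return np.stack(ys), h

xs = rng.normal(size=(5, d))
ys, h_T = run(xs)
print(ys.shape, h_T.shape)  # (5, 3) (4,)
```

The state has a fixed size `k` no matter how many observations have been processed, so the same `run` function works for sequences of any length.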

[^2]: A [future chapter](dl/07-attention) covers stateless models that simply look at interactions between pairs of tokens in a context. Since iterative processing of inputs is not necessary, such models are highly parallelizable.

```python
!rm chapter.py; touch chapter.py
```

## References and readings

- [Recurrent Neural Networks. Dive into Deep Learning](https://www.d2l.ai/chapter_recurrent-neural-networks/index.html)
- [Sentiment Analysis. NVIDIA](https://www.nvidia.com/en-us/glossary/sentiment-analysis/)
- [Machine Translation. Google Research](https://research.google/research-areas/machine-translation/)
- [Automatic Speech Recognition (ASR), or Speech-to-Text](https://www.nvidia.com/en-us/glossary/speech-to-text/)
- [Deep Visual-Semantic Alignments for Generating Image Descriptions](https://cs.stanford.edu/people/karpathy/deepimagesent/)
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)
