# Recurrent Neural Networks

In this chapter, we deal with variable-length sequence data.
This is fundamentally different from the fixed-shape data we have encountered so far, yet variable-length data is abundant in the real world. Tasks such as translating passages of text from one natural language to another,
engaging in dialogue, or controlling a robot demand that models both ingest and output sequentially structured data.
Here we focus on text data, which is our primary interest.

To handle such datasets, we introduce **recurrent neural networks** (RNNs).
RNNs are neural network models that capture the dynamics of sequences via recurrent connections, 
which can be thought of as cycles in the network of nodes that iteratively update a hidden state vector ({numref}`04-rnn`). 
The updates depend on the order in which inputs are fed into the network, so RNNs have a built-in causal structure. Later we introduce **backpropagation through time** (BPTT), which computes gradients through an RNN by treating the unrolled network as a deep network whose depth is the number of time steps rather than the number of layers. We will see that computing these gradients is challenging when training RNNs, which motivates the modern RNN architectures discussed in the next chapter.

<br>

```{figure} ../../../img/nn/04-rnn.svg
---
width: 600px
name: 04-rnn
align: center
---
RNN unit (a) cyclic, and (b) unrolled.
```
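
The recurrence in the figure can be sketched in a few lines. This is a minimal illustration, not the chapter's implementation: the tanh nonlinearity, the weight names `W_xh`, `W_hh`, `b_h`, the sizes, and the helper `rnn_forward` are all assumptions introduced for the example.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Run a simple recurrent update over a sequence.

    xs has shape (T, d). Returns the hidden states, shape (T, h).
    The tanh nonlinearity is an assumption made for this sketch.
    """
    h = np.zeros(W_hh.shape[0])  # initial hidden state h_0 = 0
    states = []
    for x in xs:  # inputs are consumed strictly in order
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, d, hidden = 5, 3, 4  # hypothetical sizes
W_xh = rng.normal(scale=0.5, size=(d, hidden))
W_hh = rng.normal(scale=0.5, size=(hidden, hidden))
b_h = np.zeros(hidden)

xs = rng.normal(size=(T, d))
H = rnn_forward(xs, W_xh, W_hh, b_h)
H_rev = rnn_forward(xs[::-1], W_xh, W_hh, b_h)  # same inputs, reversed order

print(H.shape)  # (5, 4)
# Order matters: the two final states generally differ.
print(np.allclose(H[-1], H_rev[-1]))
```

The same weights are reused at every step, and the state at step $t$ depends only on inputs up to step $t$, which is the causal structure described above.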

## Sequential data

We consider inputs of the form
$\boldsymbol{\mathsf{x}}_1, \ldots, \boldsymbol{\mathsf{x}}_T$ where $\boldsymbol{\mathsf{x}}_t \in \mathbb{R}^d$ for $t = 1, \ldots, T.$ 
Examples include the sequence of words in a document, or the sequence of events experienced by an RL agent.
In each case, every element is represented as a vector in $\mathbb{R}^d.$ Note that $T$ is usually a maximum length, and the model may process variable-length inputs of length $\tau$ where $1 \leq \tau \leq T.$ In terms of targets, we can have:


|Task|Mapping|Example|
|------|------|-----|
| Fixed target |$(\boldsymbol{\mathsf{x}}_1, \ldots, \boldsymbol{\mathsf{x}}_T) \mapsto \boldsymbol{\mathsf{y}}$ | Sentiment analysis |
| Fixed input | $\boldsymbol{\mathsf{x}} \mapsto (\boldsymbol{\mathsf{y}}_1, \ldots, \boldsymbol{\mathsf{y}}_T)$ | Image captioning |
| Sequence-to-sequence | $(\boldsymbol{\mathsf{x}}_1, \ldots, \boldsymbol{\mathsf{x}}_T) \mapsto (\boldsymbol{\mathsf{y}}_1, \ldots, \boldsymbol{\mathsf{y}}_T)$ | Video captioning, machine translation |

Sequence-to-sequence tasks take two forms: 

|Type|Constraint|Example|
|------|------|-----|
| Aligned | The input at each time step aligns with a corresponding target | Part-of-speech tagging |
| Unaligned | No step-for-step correspondence required | Machine translation | 
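
To make the roles of $T$ and $\tau$ concrete, here is a minimal sketch of one common way to store variable-length sequences: each sequence is padded to the maximum length $T$, and a boolean mask records which time steps are real. Zero padding and the helper name `pad_sequences` are assumptions for this example; the text does not prescribe a storage scheme.

```python
import numpy as np

def pad_sequences(seqs, T, d):
    """Pad sequences of shape (tau_i, d) with zeros up to length T.

    Returns a (batch, T, d) array and a (batch, T) boolean mask
    that is True at real time steps and False at padding.
    """
    batch = np.zeros((len(seqs), T, d))
    mask = np.zeros((len(seqs), T), dtype=bool)
    for i, s in enumerate(seqs):
        tau = len(s)
        if not 1 <= tau <= T:
            raise ValueError(f"sequence length {tau} is outside [1, {T}]")
        batch[i, :tau] = s
        mask[i, :tau] = True
    return batch, mask

d, T = 2, 4  # hypothetical feature size and maximum length
seqs = [np.ones((3, d)), np.ones((1, d)), np.ones((4, d))]
batch, mask = pad_sequences(seqs, T, d)

print(batch.shape)       # (3, 4, 2)
print(mask.sum(axis=1))  # [3 1 4]
```

The mask lets later computations, such as a loss, ignore the padded positions.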


## Autoregressive modeling

- The goal is to model the conditional distribution $p(\boldsymbol{\mathsf{x}}_t \mid \boldsymbol{\mathsf{x}}_1, \ldots, \boldsymbol{\mathsf{x}}_{t-1})$ of the next element given everything that came before it.
- The entire distribution can be hard to estimate, and a user may be content with the conditional expectation $\mathbb{E}\left[\boldsymbol{\mathsf{x}}_t \mid \boldsymbol{\mathsf{x}}_1, \ldots, \boldsymbol{\mathsf{x}}_{t-1}\right]$, estimated for example with a linear regression model.
- Models that regress the value of a signal on previous values of that same signal are naturally called **autoregressive models**.
- One issue is that the number of inputs $\boldsymbol{\mathsf{x}}_1, \ldots, \boldsymbol{\mathsf{x}}_{t-1}$ varies with $t$: it grows with the amount of data that we encounter.
- Much of what follows in this chapter revolves around techniques for dealing with this when estimating $p(\boldsymbol{\mathsf{x}}_t \mid \boldsymbol{\mathsf{x}}_1, \ldots, \boldsymbol{\mathsf{x}}_{t-1})$ or some statistic(s) of this distribution.
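
The growth in the number of inputs can be seen by listing the (context, target) pairs that conditioning on the full history would require. This is a toy sketch with a made-up integer sequence, and the helper `full_history_pairs` is hypothetical.

```python
def full_history_pairs(seq):
    """Pair each x_t with its full history (x_1, ..., x_{t-1})."""
    return [(seq[:t], seq[t]) for t in range(1, len(seq))]

seq = [3, 1, 4, 1, 5]  # toy data, not from the text
pairs = full_history_pairs(seq)

for context, target in pairs:
    print(len(context), context, "->", target)
# 1 [3] -> 1
# 2 [3, 1] -> 4
# 3 [3, 1, 4] -> 1
# 4 [3, 1, 4, 1] -> 5
```

Because every context has a different length, these pairs cannot be stacked into a single fixed-width design matrix.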



- The first strategy is to use only the past $\tau$ observations, so that we estimate $p(\boldsymbol{\mathsf{x}}_t \mid \boldsymbol{\mathsf{x}}_{t-\tau}, \ldots, \boldsymbol{\mathsf{x}}_{t-1}).$
- Then all inputs have the same length $\tau$, at least for $t > \tau.$
- This allows us to train any linear model or deep network that requires fixed-length vectors as inputs.
- The second strategy is to maintain a summary $\boldsymbol{\mathsf{h}}_t$ of the past observations, which is used to predict the next output and is updated with each new observation, i.e. $\boldsymbol{\mathsf{y}}_t = f(\boldsymbol{\mathsf{h}}_{t})$ and $\boldsymbol{\mathsf{h}}_t = g(\boldsymbol{\mathsf{x}}_{t-1}, \boldsymbol{\mathsf{h}}_{t-1}).$
- Since $\boldsymbol{\mathsf{h}}_t$ is never observed, these models are also called **latent autoregressive models**.
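
Both strategies can be sketched on a toy signal. Everything concrete here is an assumption for illustration: the noisy sine wave, the window size `tau = 4`, ordinary least squares as the fixed-window model, and exponential smoothing as a deliberately simple latent summary (an RNN would learn $f$ and $g$ instead of fixing them by hand).

```python
import numpy as np

rng = np.random.default_rng(0)
T, tau = 200, 4
t = np.arange(T)
x = np.sin(0.1 * t) + 0.05 * rng.normal(size=T)  # toy scalar signal

# Strategy 1: condition on the last tau observations only.
# Row i holds (x_i, ..., x_{i+tau-1}); the label is x_{i+tau}.
features = np.stack([x[i:i + tau] for i in range(T - tau)])
labels = x[tau:]
A = np.hstack([features, np.ones((T - tau, 1))])  # add a bias column
w, *_ = np.linalg.lstsq(A, labels, rcond=None)
window_pred = A @ w

# Strategy 2: keep a running summary h_t of the past.
# Here g is exponential smoothing and f is the identity:
#   h_t = alpha * h_{t-1} + (1 - alpha) * x_{t-1},   y_t = h_t
alpha = 0.5
h = x[0]  # initialise the summary with the first observation
latent_pred = []
for prev in x[:-1]:
    h = alpha * h + (1 - alpha) * prev
    latent_pred.append(h)  # prediction for the next value
latent_pred = np.array(latent_pred)

print(features.shape, labels.shape)  # (196, 4) (196,)
print("window MSE:", np.mean((window_pred - labels) ** 2))
print("latent MSE:", np.mean((latent_pred - x[1:]) ** 2))
```

The windowed model sees a fixed-length input at every step, while the latent model carries a single number forward. Neither error figure says anything general about the two strategies, since both models and the signal were picked for brevity.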


<br>

- To construct training data from historical data, one typically creates examples by sampling windows at random.
- We often assume that the underlying data-generating process does not change over time, i.e. that it is stationary. In practice, this means that the model's weights do not depend on the current time step.
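
Random window sampling can be sketched as follows, assuming the history is a 1-D array and using a hypothetical helper `sample_windows`. Under the stationarity assumption a window taken from any position is an equally valid training example, so start positions are drawn uniformly.

```python
import numpy as np

def sample_windows(series, tau, batch_size, rng):
    """Sample (context, target) pairs from random positions.

    Each context is tau consecutive values; the target is the value
    that immediately follows it.
    """
    n = len(series) - tau  # number of valid start positions
    if n <= 0:
        raise ValueError("series must be longer than tau")
    starts = rng.integers(0, n, size=batch_size)
    X = np.stack([series[s:s + tau] for s in starts])
    y = series[starts + tau]
    return X, y

rng = np.random.default_rng(0)
series = np.arange(20.0)  # toy history: 0, 1, ..., 19
X, y = sample_windows(series, tau=4, batch_size=3, rng=rng)

print(X.shape, y.shape)  # (3, 4) (3,)
# With this toy series every target is the last context value plus one.
print(np.all(y == X[:, -1] + 1))  # True
```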


## References

- [Recurrent Neural Networks. Dive into Deep Learning](https://www.d2l.ai/chapter_recurrent-neural-networks/index.html)