# Sequence Modeling: Recurrent and Recursive Nets
**Goodfellow Chapter 10**

**Recurrent neural networks**, or RNNs, are a family of neural networks for processing sequential data. Just as a convolutional network is designed to process a grid of values $\textbf{X},$ an RNN is designed to process a sequence of values $\textbf{x}^{(1)}, \dots, \textbf{x}^{(\tau)}.$ RNNs can process longer sequences than would be practical with other architectures, and can typically process sequences of variable length.

The key idea needed to move from feedforward to recurrent networks is parameter sharing. RNNs need to share parameters across different parts of the model in order to generalize to different sequence lengths. This is also important because the same piece of information can often appear in different positions within a sequence. 

Suppose we want to train a model to extract years from text. The sentence "I went to Nepal in 2009" contains the same information if it is rewritten as "In 2009, I went to Nepal." A traditional feedforward network, trained on sentences of a fixed length with a different parameter for each input feature, would need to learn what a year looks like separately at each position in the sentence. This requires a lot of redundant learning. An RNN, by contrast, shares the same weights across all time steps, eliminating the need to re-learn the rules of language for each position in a sequence.

An RNN shares parameters by applying the same update rule at every step: each step's output is a function of the previous steps' outputs. This way, the model always carries context forward from earlier positions in the sequence while reusing the same parameters.

## Unfolding Computational Graphs

Recall that a computational graph is a way to formalize the structure of a set of computations, mapping inputs and parameters to outputs and loss values. What makes a network _recurrent_ is a connection from the graph's nodes at one time step to its nodes at the next. **Unfolding** the graph makes this explicit: the recurrent computation is expanded into a chain of repeated nodes, one copy per time step.

For example, consider the classical form of a dynamical system:

$$ \textbf{s}^{(t)} = f(\textbf{s}^{(t-1)};\mathbf{\theta}), $$

where $\textbf{s}^{(t)}$ is called the state of the system. This equation is recurrent because the definition of $\textbf{s}$ at time $t$ refers back to the same definition at time $t - 1.$

For a finite number of steps $\tau,$ we can unfold this graph by applying the function $\tau - 1$ times. For example, with $\tau = 3,$ the unfolded version of the above graph becomes:


$$
\begin{aligned}
\textbf{s}^{(3)} &= f(\textbf{s}^{(2)};\mathbf{\theta}) \\
&= f(f(\textbf{s}^{(1)};\mathbf{\theta});\mathbf{\theta})
\end{aligned}
$$

The unfolded function is no longer recurrent, and can now be represented as an acyclic graph.
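The unfolding above can be sketched in a few lines of Python. The transition function `f` below is a hypothetical scalar map chosen only for illustration, and the helper name `unfold` is likewise an assumption; the point is that a loop applying `f` exactly $\tau - 1$ times matches the hand-unfolded expression.

```python
def f(s, theta):
    """Hypothetical transition function (illustrative only): s -> theta * s + 1."""
    return theta * s + 1.0

def unfold(s1, theta, tau):
    """Compute s^(tau) from s^(1) by applying f exactly tau - 1 times."""
    s = s1
    for _ in range(tau - 1):
        s = f(s, theta)
    return s

s1, theta = 4.0, 0.5

# Hand-unfolded version for tau = 3: s^(3) = f(f(s^(1); theta); theta)
s3_unfolded = f(f(s1, theta), theta)

# Loop version: the same computation written as repeated application
s3_loop = unfold(s1, theta, tau=3)

print(s3_unfolded, s3_loop)  # 2.5 2.5
```

Note that `theta` is passed unchanged to every application of `f`, which is the parameter sharing the chapter describes.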

A typical RNN will use the following equation, differing only in its use of the input data $\mathbf{x}^{(t)},$ with $\textbf{h}$ representing the model state:

$$ \textbf{h}^{(t)} = f(\textbf{h}^{(t-1)}, \textbf{x}^{(t)};\mathbf{\theta}) $$

When an RNN is tasked with predicting the future given the past items in a sequence, it learns to use $\textbf{h}^{(t)}$ as a summary of the task-relevant aspects of the past sequence of inputs up to $t.$ This summary is in general lossy, since it maps an arbitrary-length sequence to a fixed-length vector $\textbf{h}^{(t)}.$ In practice, this typically means the state "forgets" pieces of the past sequence that are not relevant to the task.
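To make the fixed-length summary concrete, here is a minimal sketch in plain Python. It assumes a particular form for $f$ that the notes above do not specify: $\tanh$ applied to an affine function of the previous state and the current input. The names `rnn_step` and `summarize`, the parameter values, and the all-zero initial state are all hypothetical.

```python
import math

def rnn_step(h_prev, x, theta):
    """One application of h^(t) = f(h^(t-1), x^(t); theta).

    Assumed form of f (not given in the text): tanh(W h_prev + U x + b),
    written with plain lists instead of a matrix library.
    """
    W, U, b = theta
    h = []
    for i in range(len(b)):
        a = b[i]
        a += sum(W[i][j] * h_prev[j] for j in range(len(h_prev)))
        a += sum(U[i][k] * x[k] for k in range(len(x)))
        h.append(math.tanh(a))
    return h

# Hypothetical parameters: a 2-unit state and 1-dimensional inputs.
theta = (
    [[0.5, -0.3], [0.1, 0.4]],  # W: state-to-state weights
    [[1.0], [-1.0]],            # U: input-to-state weights
    [0.0, 0.0],                 # b: bias
)

def summarize(xs, theta, state_size=2):
    """Fold a whole sequence into one fixed-size state vector."""
    h = [0.0] * state_size  # assumed all-zero initial state
    for x in xs:
        h = rnn_step(h, x, theta)
    return h

short = summarize([[0.2], [0.7]], theta)
long = summarize([[0.2], [0.7], [-0.4], [0.9], [0.1]], theta)
print(len(short), len(long))  # 2 2
```

A two-item sequence and a five-item sequence both end up as a two-entry state, so whatever the state retains about the longer sequence has to fit in the same space.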

The unfolded recurrence after $t$ steps can be represented by a function $g^{(t)}$ that takes the whole past sequence as input:

$$
\begin{aligned}
\textbf{h}^{(t)} &= g^{(t)}(\textbf{x}^{(t)}, \textbf{x}^{(t-1)}, \dots, \textbf{x}^{(2)}, \textbf{x}^{(1)}) \\
&= f(\textbf{h}^{(t-1)}, \textbf{x}^{(t)};\mathbf{\theta}).
\end{aligned}
$$
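
As a worked instance of this unfolding, take $t = 3$ and assume some fixed initial state $\textbf{h}^{(0)}$ (the notes do not specify one; it is introduced here only for illustration). Substituting the recurrence into itself gives:

$$
\begin{aligned}
\textbf{h}^{(3)} &= f(\textbf{h}^{(2)}, \textbf{x}^{(3)};\mathbf{\theta}) \\
&= f(f(\textbf{h}^{(1)}, \textbf{x}^{(2)};\mathbf{\theta}), \textbf{x}^{(3)};\mathbf{\theta}) \\
&= f(f(f(\textbf{h}^{(0)}, \textbf{x}^{(1)};\mathbf{\theta}), \textbf{x}^{(2)};\mathbf{\theta}), \textbf{x}^{(3)};\mathbf{\theta}) \\
&= g^{(3)}(\textbf{x}^{(3)}, \textbf{x}^{(2)}, \textbf{x}^{(1)}).
\end{aligned}
$$

Under that assumption, $g^{(3)}$ is just three nested applications of the same $f$ with the same $\mathbf{\theta}.$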

The advantages of this unfolded representation are:

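* The learned model always has the same input size regardless of sequence length, because it is specified in terms of a transition from one state to the next rather than a variable-length history of states
* It is possible to use the same transition function $f$ with the same parameters at every time step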

This means we get to learn a single model $f$ that operates on all time steps and all sequence lengths, rather than needing a separate model $g^{(t)}$ for each possible time step.

Learning a single shared model allows generalization to sequence lengths that did not appear in the training set, and enables the model to be estimated with relatively few training examples. 
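The following sketch illustrates the idea of one shared $f$ under stated assumptions: a scalar state, a $\tanh$ transition, and made-up parameter values, none of which come from the notes. The helper name `run` is hypothetical. The same three parameters process sequences of any length.

```python
import math

def f(h_prev, x, theta):
    """Shared transition function; the tanh form is an assumption."""
    w, u, b = theta
    return math.tanh(w * h_prev + u * x + b)

def run(xs, theta, h0=0.0):
    """Apply the same f with the same theta at every step; return all states."""
    states = []
    h = h0  # assumed initial state
    for x in xs:
        h = f(h, x, theta)
        states.append(h)
    return states

theta = (0.8, 0.5, 0.1)  # hypothetical values; three parameters in total

# Sequences of length 1, 4, and 7 all reuse the same three parameters.
for xs in ([1.0], [1.0, -2.0, 0.5, 3.0], [0.1] * 7):
    states = run(xs, theta)
    print(len(xs), len(states), len(theta))
```

Because each state depends only on the previous state and the current input, the states computed for a prefix of a sequence match the first states computed for the full sequence. A separate $g^{(t)}$ per time step would offer no such reuse.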

## Recurrent Neural Networks

### Teacher Forcing and Networks with Output Recurrence

### Computing the Gradient in an RNN

### RNNs as Directed Graphical Models

### Modeling Sequences Conditioned on Context with RNNs

## Bidirectional RNNs

## Encoder-Decoder Sequence-to-Sequence Architecture

## Deep Recurrent Networks

## Recursive Neural Networks

## The Challenge of Long-Term Dependencies

## Echo State Networks 

## Leaky Units and Other Strategies for Multiple Time Scales

### Adding Skip Connections through Time

### Leaky Units and a Spectrum of Different Time Scales

### Removing Connections

## The Long Short-Term Memory (LSTM) and Other Gated RNNs

### LSTM

### Other Gated RNNs

## Optimization for Long-Term Dependencies

### Clipping Gradients

### Regularizing to Encourage Information Flow

## Explicit Memory