# The Recurrent Neural Network

## Learning objectives

1. Understand the principles behind the creation of the recurrent neural network
2. Obtain intuition about difficulties training RNNs, namely: vanishing/exploding gradients and long-term dependencies
3. CODE/MODELS
4. MORE STUFF

## Historical and theoretical background

The poet Delmore Schwartz once wrote: **"...time is the fire in which we burn"**. We can't escape time. Time is embedded in every human thought and action. Yet, so far, we have been oblivious to the role of time in neural network modeling. Indeed, in all the models we have examined so far, we have implicitly assumed that **data is "perceived" all at once**, although there are countless examples where time is a critical consideration: movement, speech production, planning, decision-making, etc. We have also implicitly assumed that **past states have no influence on future states**. That is, the input pattern at time-step $t-1$ has no influence on the output at time-step $t$, or $t+1$, or any subsequent outcome for that matter. In probabilistic jargon, this amounts to assuming that each sample is drawn independently of the others. We know that in many scenarios this is simply not true: when giving a talk, my next utterance will depend upon my past utterances; when running, my last stride will condition my next stride, and so on. You can imagine endless examples.

Multilayer Perceptrons and Convolutional Networks can, in principle, be used to approach problems where time and sequences are a consideration. Nevertheless, such architectures were not designed with time in mind, and better architectures have been envisioned. In particular, **Recurrent Neural Networks (RNNs)** are the modern standard for dealing with **time-dependent** and/or **sequence-dependent** problems. This type of network is "recurrent" in the sense that it can **revisit or reuse past states as inputs to predict the next or future states**. To put it plainly, RNNs have **memory**. Indeed, memory is what allows us to incorporate our past thoughts and behaviors into our future thoughts and behaviors.

### Hopfield Network

One of the earliest examples of networks incorporating "recurrences" was the so-called **Hopfield Network**, introduced in 1982 by [John Hopfield](https://en.wikipedia.org/wiki/John_Hopfield), at the time a physicist at Caltech. Hopfield networks were important because they helped to reignite interest in neural networks in the early '80s (along with backpropagation). Hopfield wanted to address the fundamental question of **emergence** in cognitive systems: can relatively stable cognitive phenomena, like memories, emerge from the collective action of large numbers of simple neurons? After all, such behavior had been observed in other physical systems, like vortex patterns in fluid flow. Brains seemed like another promising candidate.

Hopfield networks are known as a type of **energy-based** (instead of error-based) network because their properties derive from a global energy function. Like the McCulloch-Pitts neuron, Hopfield neurons are binary threshold units, but with recurrent instead of feed-forward connections, where each unit is **bi-directionally connected** to every other unit, as shown in **Figure X**. This means that each unit *receives* inputs from and *sends* inputs to every other connected unit. In this architecture, **weight values are symmetric**, such that the weights *coming into* a unit are the same as the ones *coming out* of a unit. The value of each unit is determined by a linear function wrapped in a threshold function $T$, as $y_i = T(\sum_j w_{ji}y_j + b_i)$.

<center> Figure X </center>

<img src="./images/rec-net/hopfield-net.svg">

The basic idea of Hopfield networks is that each configuration of binary values $C$ in the network is associated with a **global energy value $E$**. Here is a simplified picture of the settling process: imagine you have a network with five neurons in the configuration $C_1=(0, 1, 0, 1, 0)$. Now, imagine $C_1$ yields a global energy value $E_1 = 2$ (following the energy function formula). Your goal is to *minimize* $E$ by changing one element of the network $c_i$ at a time. By applying the unit update rule to a single unit, you can get a new configuration like $C_2=(1, 1, 0, 1, 0)$, where the first unit flipped from $0$ to $1$. If $C_2$ yields a *lower value of $E$*, let's say $1.5$, you are moving in the right direction. If you keep iterating over new configurations, the network will eventually "settle" into a **local energy minimum** (which one depends on the initial state of the network).
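To make the settling process concrete, here is a minimal sketch in NumPy. The text does not spell out the energy function, so the sketch assumes the standard form $E = -\frac{1}{2}\sum_{i,j} w_{ij} s_i s_j - \sum_i b_i s_i$, with symmetric weights and no self-connections. The weights are random and purely illustrative, and the names `energy` and `update_unit` are hypothetical helpers, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Illustrative symmetric weights with no self-connections (an assumption).
A = rng.normal(size=(n, n))
W = (A + A.T) / 2
np.fill_diagonal(W, 0.0)
b = np.zeros(n)

def energy(s, W, b):
    """Assumed standard Hopfield energy for a binary state vector s."""
    return -0.5 * s @ W @ s - b @ s

def update_unit(s, i, W, b):
    """Threshold update of unit i; all other units stay fixed."""
    s = s.copy()
    s[i] = 1 if W[i] @ s + b[i] >= 0 else 0
    return s

s = np.array([0, 1, 0, 1, 0])  # the configuration C_1 from the text
energies = [energy(s, W, b)]

# Update one unit at a time until a full sweep changes nothing.
changed = True
while changed:
    changed = False
    for i in range(n):
        new_s = update_unit(s, i, W, b)
        if not np.array_equal(new_s, s):
            changed = True
        s = new_s
        energies.append(energy(s, W, b))

print("settled state:", s)
print("energy trace:", np.round(energies, 3))
```

Under these assumptions the energy trace never increases, and the loop stops at a configuration that no single-unit update can change, which is the "settled" state described above.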

A fascinating aspect of Hopfield networks, besides the introduction of recurrence, is that they are closely based on neuroscience research about learning and memory, particularly Hebbian learning (XXXX). In fact, Hopfield proposed this model as a way to capture **memory formation and retrieval**. Basically, the idea is that the energy minima of the network could represent the **formation of a memory**, which further gives rise to a property known as **content-addressable memory (CAM)**. Here is the idea with a computer analogy: when you access information stored in the random access memory (RAM) of your computer, you give the "address" where the "memory" is located to retrieve it. CAM works the other way around: you give information about the **content** you are searching for, and the computer should retrieve the "memory". This is great because it works even when you have **partial or corrupted** information about the content, which is a much more **realistic depiction of how human memory works**. It is similar to doing a Google search. Just think of how many times you have searched for lyrics with partial information, like "song with the beeeee bop ba bodda bope!".

It is important to highlight that the sequential adjustment of Hopfield networks is **not driven by error correction**. Actually, there isn't a "target" as in supervised neural networks. Hopfield networks are systems that "evolve" until they find a stable low-energy state. If you "perturb" such a system, the system will "re-evolve" towards its previous stable state, similar to how those inflatable "Bop Bag" toys get back to their initial position no matter how hard you punch them. It is almost like the system "remembers" its previous stable state (isn't it?). This ability to "return" to a previous stable state after perturbation is why they serve as models of memory.
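The "perturb and return" behavior can be sketched in a few lines. This is an illustrative toy, not Hopfield's original setup: it assumes bipolar ($+1/-1$) units instead of $(0, 1)$ units because that keeps the Hebbian storage rule short, it stores two made-up patterns, and `recall` is a hypothetical helper name.

```python
import numpy as np

# Two made-up patterns in bipolar (+1/-1) coding (an assumption for simplicity).
patterns = np.array([
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1, 1, 1, 1, -1, -1, -1, -1],
])
n = patterns.shape[1]

# Hebbian storage: units that are active together get a positive weight.
W = (patterns.T @ patterns) / n
np.fill_diagonal(W, 0.0)  # no self-connections

def recall(cue, W, max_sweeps=20):
    """Update units one at a time until the state stops changing."""
    s = cue.copy()
    for _ in range(max_sweeps):
        previous = s.copy()
        for i in range(len(s)):
            s[i] = 1 if W[i] @ s >= 0 else -1
        if np.array_equal(s, previous):
            break
    return s

# "Perturb" the first stored pattern by flipping one unit, then let it settle.
cue = patterns[0].copy()
cue[0] = -cue[0]
restored = recall(cue, W)

print("corrupted cue:", cue)
print("restored     :", restored)
```

Here the corrupted cue plays the role of partial information about the content, and the settled state is the retrieved "memory". There is no target and no error signal anywhere in the loop.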

### Elman Network

Although Hopfield networks were innovative and fascinating models, the first successful example of a recurrent network trained with backpropagation was the so-called **Elman Network**, introduced by [Jeffrey Elman](https://en.wikipedia.org/wiki/Jeffrey_Elman). Elman was a cognitive scientist at UC San Diego at the time, part of the group of researchers that published the famous PDP book.

In 1990, Elman published "Finding Structure in Time", a highly influential work for both cognitive science and machine learning (particularly natural language processing). Elman was concerned with the problem of representing "time" or "sequences" in neural networks. In his view, you could take either an "explicit" approach or an "implicit" approach. The **explicit** approach represents time **spatially**. Consider a vector $\bf{x} = [x_1, x_2, \cdots, x_n]$, where element $x_1$ represents the first value of a sequence, $x_2$ the second element, and $x_n$ the last element. Hence, the spatial location in $\bf{x}$ indicates the temporal location of a value. You can think of the elements of $\bf{x}$ as sequences of words or actions, one after the other; for instance, $x^1=[\text{Sound}, \text{of}, \text{the}, \text{funky}, \text{drummer}]$ is a sequence of length five. Elman saw several drawbacks to this approach. First, although $\bf{x}$ is a sequence, the network still needs to represent the sequence all at once as an input; that is, a network would need five input neurons to process $x^1$. Second, it imposes a rigid limit on the duration of patterns; in other words, the network needs a fixed number of elements for every input vector $\bf{x}$: a network with five input units can't accommodate a sequence of length six. True, you could start with a six-input network, but then shorter sequences would be misrepresented, since the mismatched units would receive zero input. This is a problem for most domains where sequences have variable duration. Finally, it can't easily distinguish **relative** temporal position from **absolute** temporal position. Consider the sequence $s = [1, 1]$ and an input vector length of four bits. Such a sequence can be presented in at least three variations:

$$
x_1 = [0, 1, 1, 0]\\
x_2 = [0, 0, 1, 1]\\
x_3 = [1, 1, 0, 0]
$$

Here, $\bf{x_1}$, $\bf{x_2}$, and $\bf{x_3}$ are instances of $\bf{s}$, but spatially displaced in the input vector. Geometrically, those three vectors are very different from each other (you can compute similarity measures to put a number on that), although they represent the same instance. Even though you can train a neural net to learn that those three patterns are associated with the same target, their inherent dissimilarity will hinder the network's ability to generalize the learned association. It is like training a network to learn that blue, yellow, and red are associated with the same target "1": the network will learn the association, but it won't make sense outside the training set.
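To put a number on that dissimilarity, here is a small sketch using cosine similarity. The choice of measure is an assumption (the text does not name one), and `cosine_similarity` is a hypothetical helper defined here.

```python
import numpy as np

x1 = np.array([0, 1, 1, 0])
x2 = np.array([0, 0, 1, 1])
x3 = np.array([1, 1, 0, 0])

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1 means same direction)."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

sim_12 = cosine_similarity(x1, x2)  # the vectors share one active position
sim_13 = cosine_similarity(x1, x3)  # the vectors share one active position
sim_23 = cosine_similarity(x2, x3)  # the vectors share no active position

print(round(sim_12, 2), round(sim_13, 2), round(sim_23, 2))
```

Identical vectors would score $1$. Here the pairs score about $0.5$, $0.5$, and $0$, even though all three encode the same sequence $s = [1, 1]$.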

The **implicit** approach represents time by **its effect on intermediate computations**. To do this, Elman added a **context unit** to save past computations and incorporate them into future computations. In short, **memory**. Elman based his approach on the work of [Michael I. Jordan](https://people.eecs.berkeley.edu/~jordan/) on serial processing (1986). Jordan's network implements recurrent connections from the network output $\hat{y}$ to its hidden units $h$, via a "memory unit" $\mu$ (equivalent to Elman's "context unit"), as depicted in **Figure X**. In short, the memory unit keeps a running average of **all past outputs**: this is how the entire past history is implicitly accounted for in each new computation. There is no learning in the memory unit, which means the weights are fixed to $1$.

<center> Figure X: Jordan Network </center>

<img src="./images/rec-net/jordan-net.svg">

**Note**: Jordan's network diagram exemplifies the two ways in which recurrent nets are usually represented: the **compact format** depicts the network structure as a circuit, whereas the **unfolded representation** incorporates the notion of time-step calculations. The unfolded representation also illustrates how a recurrent network can be seen in a pure feed-forward fashion, with as many layers as time-steps in your sequence. One key consideration is that the weights will be identical at each time-step (or layer). Keep this unrolled representation in mind, as it will become important later.

Elman's innovation was twofold: **recurrent connections between hidden units and memory** (context) units, and **trainable parameters from the memory units to the hidden units**. Memory units now have to "remember" the past state of the hidden units, which means that instead of keeping a running average, they "clone" the value at the previous time-step $t-1$. Memory units also have to learn useful representations (weights) for encoding the temporal properties of the sequential input. **Figure X** summarizes Elman's network in compact and unfolded fashion.

<center> Figure X: Elman Network </center>

<img src="./images/rec-net/elman-net.svg">

**Note**: there is something curious about Elman's architecture. What is the point of "cloning" $h$ into $c$ at each time-step? You could bypass $c$ altogether by sending the value of $h_t$ straight into $h_{t+1}$, which yields mathematically identical results. The most likely explanation is that Elman's starting point was Jordan's network, which had a separate memory unit. Regardless, keep in mind that we don't need $c$ units to design a mathematically identical network.
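The equivalence in the note can be checked numerically. The sketch below is a simplification under stated assumptions: random untrained weights, a tanh hidden activation, and a linear output layer, none of which is claimed to match Elman's exact setup. The names `forward_with_context` and `forward_direct` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 1, 4, 1

# Illustrative random weights (assumptions, not trained values).
W_xh = rng.normal(scale=0.5, size=(n_hidden, n_in))      # input -> hidden
W_ch = rng.normal(scale=0.5, size=(n_hidden, n_hidden))  # context -> hidden
W_hy = rng.normal(scale=0.5, size=(n_out, n_hidden))     # hidden -> output
b_h = np.zeros(n_hidden)
b_y = np.zeros(n_out)

def forward_with_context(xs):
    """Elman-style pass: the hidden state is cloned into context units c."""
    c = np.zeros(n_hidden)
    outputs = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_ch @ c + b_h)
        outputs.append(W_hy @ h + b_y)
        c = h.copy()  # the "clone" step, with a fixed weight of 1
    return np.array(outputs)

def forward_direct(xs):
    """Same computation, feeding h from step t-1 straight into step t."""
    h = np.zeros(n_hidden)
    outputs = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_ch @ h + b_h)
        outputs.append(W_hy @ h + b_y)
    return np.array(outputs)

xs = np.array([[1.0], [0.0], [1.0], [0.0], [0.0], [0.0]])
same = np.allclose(forward_with_context(xs), forward_direct(xs))
print("identical outputs:", same)
```

Both functions process the inputs one at a time with the same weights at every step, so the same code runs unchanged on a sequence of any length.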

Elman performed multiple experiments with this architecture, demonstrating that it was capable of solving multiple problems with a sequential structure: a temporal version of the XOR problem; learning the structure (i.e., the sequential order of vowels and consonants) in sequences of letters; discovering the notion of "word"; and even learning complex lexical classes like word order in short sentences. Let's briefly explore the temporal XOR solution as an exemplar. **Table 1** shows the XOR problem:

**Table 1**: Truth Table For XOR Function

| $x_1$ | $x_2$ | $y$ |
|---|---|--------|
| 0 | 0 | 0      |
| 0 | 1 | 1      |
| 1 | 0 | 1      |
| 1 | 1 | 0      |

Here is a way to transform the XOR problem into a sequence. Consider the following vector: 

$$
s= [1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1,...]
$$

In $\bf{s}$, the first and second elements, $s_1$ and $s_2$, represent $x_1$ and $x_2$ inputs of **Table 1**, whereas the third element, $s_3$, represents the corresponding output $y$. This pattern repeats until the end of the sequence $s$ as shown in **Figure X**.
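The construction just described can be sketched as follows. The random seed and the number of triplets are arbitrary choices for illustration; 1,000 triplets gives a sequence of 3,000 elements.

```python
import numpy as np

rng = np.random.default_rng(0)
n_triplets = 1000  # arbitrary; yields 3 * 1000 = 3000 elements

# Draw random (x_1, x_2) pairs and compute y = XOR(x_1, x_2) for each pair.
pairs = rng.integers(0, 2, size=(n_triplets, 2))
targets = np.bitwise_xor(pairs[:, 0], pairs[:, 1])

# Interleave as x_1, x_2, y, x_1, x_2, y, ...
s = np.column_stack([pairs, targets]).reshape(-1)

# Next-item prediction: the input at step t is s[t], the target is s[t + 1].
inputs, next_items = s[:-1], s[1:]

print(s[:15])
```

Only every third element is predictable from the two elements before it. The first and second elements of each triplet are random, so no network can do better than chance on them.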

<center> Figure X: Temporal XOR </center>

<img src="./images/rec-net/temporal-xor.svg">

Elman trained his network with a 3,000-element sequence for 600 iterations over the entire dataset, on the task of predicting the next item $s_{t+1}$ of the sequence $s$, meaning that he fed inputs to the network **one by one**. He showed that the **error pattern** followed a predictable trend: the mean squared error was **lower every 3 elements**, and higher in between, meaning the network learned to predict the third element in the sequence, as shown in **Figure X** (the numbers are made up, but the pattern is the same as the one found by Elman (1990)).

In [1]:
import numpy as np
import pandas as pd
import altair as alt

In [2]:
s = pd.DataFrame({"MSE": [0.35, 0.15, 0.30, 0.27, 0.14, 0.40, 0.35, 0.12, 0.36, 0.31, 0.15, 0.32],
                  "cycle": np.arange(1, 13)})
alt.Chart(s).mark_line().encode(x="cycle", y="MSE")

An immediate advantage of this approach is that the network can take **inputs of any length**, without having to alter the network architecture at all.

In subsequent experiments, Elman showed that the **internal (hidden) representations** learned by the network grouped into meaningful categories; that is, **semantically similar words group together** when analyzed with [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering). This was remarkable, as it demonstrated the utility of RNNs as a model of cognition in sequence-based problems.

## Interlude: vanishing and exploding gradients in RNNs

It turns out that training recurrent neural networks is hard. Considerably harder than training multilayer perceptrons. When faced with the task of training **very deep networks**, like RNNs, the gradients have the impolite tendency of either (1) **vanishing** or (2) **exploding**. Recall that RNNs can be *unrolled* so that recurrent connections follow pure feed-forward computations. This unrolled RNN will have as many layers as elements in the sequence. Thus, a sequence of 50 words will be unrolled as an RNN of 50 layers.

Concretely, the **vanishing gradient problem** makes it really hard to learn **long-term dependencies** in sequences. Let's say you have a collection of poems, where the last sentence makes reference to the first one. Such a dependency will be hard to learn for a deep RNN where gradients vanish as we move backward through the network. It is as if the network were missing long-term memory capacity. The **exploding gradient problem** will completely derail the learning process. In very deep networks this is often a problem because more layers amplify the effect of large gradients, compounding into very large updates to the network weights, to the point where values completely blow up.

Here is the intuition for the **mechanics of gradient vanishing**: when gradients begin small, as you move backwards through the network computing gradients, they will get even smaller as you get closer to the input layer. Consequently, when doing the weight update based on such gradients, the weights closer to the output layer will obtain larger updates than the weights closer to the input layer. This means that the weights closer to the input layer will hardly change at all, whereas the weights closer to the output layer will change a lot. This is a serious problem when **earlier layers matter for prediction**: they will keep propagating more or less the same signal forward because no learning (i.e., weight updates) will happen, which may significantly hinder the network's performance.

Here is the intuition for the **mechanics of gradient explosion**: when gradients begin large, as you move backwards through the network computing gradients, they will get even larger as you get closer to the input layer. Consequently, when doing the weight update based on such gradients, the weights closer to the input layer will obtain larger updates than the weights closer to the output layer. Learning can go wrong really fast. Remember that the signal propagated by each layer is the outcome of taking the dot product between a weight matrix and the output of the previous layer. If the weights in earlier layers get really large, they will forward-propagate larger and larger signals on each iteration, and the predicted output values will spiral out of control, making the error $y-\hat{y}$ so large that the network will be unable to learn at all. In fact, your computer will "overflow" quickly, as it will be unable to represent numbers that big. Very dramatic.
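Both intuitions come down to repeated multiplication. The sketch below is a deliberately simplified toy: it assumes purely linear layers that all share one weight matrix, a scaled identity, and it ignores the derivative of the activation function. The helper name `backpropagated_norms` is hypothetical. It tracks the size of a gradient as it travels backwards through 50 layers.

```python
import numpy as np

def backpropagated_norms(scale, n_layers=50):
    """Norm of a gradient after each backward step through linear layers.

    Assumes every layer shares the weight matrix W = scale * I, so each
    backward step multiplies the gradient by W transposed.
    """
    W = scale * np.eye(2)
    grad = np.ones(2)  # gradient arriving at the output layer
    norms = [np.linalg.norm(grad)]
    for _ in range(n_layers):
        grad = W.T @ grad
        norms.append(np.linalg.norm(grad))
    return np.array(norms)

small = backpropagated_norms(0.5)  # shrinks by half at every layer
large = backpropagated_norms(1.5)  # grows by half at every layer

print(f"after 50 layers, scale 0.5: {small[-1]:.2e}")
print(f"after 50 layers, scale 1.5: {large[-1]:.2e}")
```

A factor slightly below or above one looks harmless for a single layer, but compounding it over 50 layers leaves a gradient that is either numerically negligible or enormous by the time it reaches the layers closest to the input.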

The mathematics of gradient vanishing and explosion gets complicated quickly. If you want to delve into the mathematics, see [Bengio et al. (1994)](http://ai.dinfo.unifi.it/paolo/ps/tnn-94-gradient.pdf), [Pascanu et al. (2012)](https://arxiv.org/abs/1211.5063), and [Philipp et al. (2017)](https://arxiv.org/abs/1712.05577).

For our purposes, I'll give you a simplified numerical example for intuition. Consider the task of predicting a vector $y = \begin{bmatrix} 1 & 1 \end{bmatrix}$, from inputs $x = \begin{bmatrix} 1 & 1 \end{bmatrix}$, with a multilayer-perceptron with 5 hidden layers and tanh activation functions. We have two cases:

- the weight matrix $W_l$ is initialized to large values $w_{ij} = 2$
- the weight matrix $W_s$ is initialized to small values $w_{ij} = 0.02$

Now, let's compute a single forward-propagation pass:

In [3]:
import numpy as np

In [4]:
x = np.array([[1],[1]])
W_l = np.array([[2, 2],
                [2, 2]])

h1 = np.tanh(W_l @ x)
h2 = np.tanh(W_l @ h1)
h3 = np.tanh(W_l @ h2)
h4 = np.tanh(W_l @ h3)
h5 = np.tanh(W_l @ h4)
y_hat = (W_l @ h5)
y_hat

array([[3.99730269],
       [3.99730269]])

In [5]:
x = np.array([[1],[1]])
W_s = np.array([[0.02, 0.02],
                [0.02, 0.02]])

h1 = np.tanh(W_s @ x)
h2 = np.tanh(W_s @ h1)
h3 = np.tanh(W_s @ h2)
h4 = np.tanh(W_s @ h3)
h5 = np.tanh(W_s @ h4)
y_hat = (W_s @ h5)
y_hat

array([[4.09381337e-09],
       [4.09381337e-09]])

We see that for $W_l$ with large weights, the output is $\hat{y}\approx4$, whereas for $W_s$ with small weights, the output is $\hat{y} \approx 0$. Why does this matter? We haven't done the gradient computation, but you can probably anticipate what is going to happen: for the $W_l$ case, the gradient update is going to be very large, and for the $W_s$ case, very small. If you keep cycling through forward and backward passes, these problems will become worse, leading to gradient explosion and vanishing, respectively.

### Long Short-Term Memory Network

The challenges of training RNNs limited their applicability during the early '90s. In addition to vanishing and exploding gradients, we have the fact that the **forward computation is slow**, as RNNs can't compute in parallel: to preserve the time-dependencies through the layers, each layer has to be computed sequentially, which naturally takes more time. Elman networks proved to be effective at solving relatively simple problems, but as the sequences scaled in size and complexity, this type of network struggled.

Several approaches were proposed in the '90s to address the aforementioned issues (time-delay neural networks, simulated annealing, Kalman filters, and many others). The architecture that really moved the field forward was the so-called **Long Short-Term Memory (LSTM) Network**, introduced by [Sepp Hochreiter](https://en.wikipedia.org/wiki/Sepp_Hochreiter) and [Jürgen Schmidhuber](https://en.wikipedia.org/wiki/J%C3%BCrgen_Schmidhuber) in 1997. As the name suggests, the defining characteristic of LSTMs is the addition of units combining both short-term and long-term memory capabilities.

In LSTMs, instead of having a simple memory unit "cloning" values from the hidden unit as in Elman networks, we have (1) a **cell unit**, which effectively acts as long-term memory storage, and (2) a **hidden state**, which acts as a memory-updating mechanism. These two elements are integrated as a circuit of logic gates controlling the flow of information at each time-step. Understanding the notation, depicted in **Figure X**, is crucial here.

<center> Figure X: LSTM architecture </center>

<img src="./images/rec-net/lstm-unit.svg">

In LSTMs, $x_t$, $h_t$, and $c_t$ represent vectors of values. Lightish-pink circles represent element-wise operations, and darkish-pink boxes are fully-connected layers with trainable weights. The top part of the diagram acts as **memory storage**, whereas the bottom part has a double role: (1) passing the hidden-state information from the previous time-step $t-1$ to the next time-step $t$, and (2) regulating the **influx** of information from $x_t$ and $h_{t-1}$ **into** the memory storage, and the **outflux** of information **from** the memory storage into the next hidden state $h_t$. The second role is the core idea behind LSTMs. Think about it as making **three decisions** at each time-step:

1. Is the *old information* $c_{t-1}$ worth keeping in my memory storage $c_t$? If so, let the information pass; otherwise, "forget" it. This is controlled by the *forget gate*.
2. Is this *new information* (inputs) worth saving into my memory storage $c_t$? If so, let the information flow into $c_t$. This is controlled by the *input gate* and the *candidate memory cell*.
3. What elements of the information saved in my memory storage $c_t$ are relevant for the computation of the next hidden state $h_t$? Select them from $c_t$, combine them into the new hidden-state output, and let them pass into the next hidden state $h_t$. This is controlled by the *output gate* and the *tanh* function.

Decisions 1 and 2 will determine the information that keeps flowing through the memory storage at the top. Decision 3 will determine the information that flows to the next hidden state at the bottom.

<center> Figure X: LSTM as a sequence of decisions </center>

<img src="./images/rec-net/lstm-choices.svg">

To put LSTMs in context, imagine the following scenario: we are trying to **predict the next word in a sequence**. Let's say these sequences are about sports. From past sequences, we saved in memory the type of sport: "soccer". For the current sequence, we receive a phrase like "A basketball player...". In such a case, we first want to "forget" the previous type of sport, "soccer" (*decision 1*), by multiplying $c_{t-1} \odot f_t$. Next, we want to "update" memory with the new type of sport, "basketball" (*decision 2*), by adding $c_t = (c_{t-1} \odot f_t) + (i_t \odot \tilde{c}_t)$. Finally, we want to output (*decision 3*) a verb relevant for "A basketball player...", like "shoot" or "dunk", by $y_t = h_t = o_t \odot \tanh(c_t)$.
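The three decisions can be written as a single time-step in NumPy. This is a minimal sketch under assumptions: the gates are computed from the concatenation of $h_{t-1}$ and $x_t$ with sigmoid activations (a common formulation, not spelled out in the text above), the weights are random and untrained, and names such as `lstm_step`, `W_f`, and `b_f` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time-step; returns the new hidden state and cell state."""
    z = np.concatenate([h_prev, x_t])  # gates see h_{t-1} and x_t together

    f_t = sigmoid(params["W_f"] @ z + params["b_f"])      # decision 1: forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])      # decision 2: input gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])  # candidate memory cell
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])      # decision 3: output gate

    c_t = c_prev * f_t + i_t * c_tilde  # keep old memory, add new memory
    h_t = o_t * np.tanh(c_t)            # expose the relevant part of memory
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4

# Illustrative random parameters (assumptions, not trained values).
params = {}
for gate in ("f", "i", "c", "o"):
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in))
    params[f"b_{gate}"] = np.zeros(n_hidden)

# Run five time-steps over a made-up input sequence.
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x_t, h, c, params)

print("h_t:", np.round(h, 3))
print("c_t:", np.round(c, 3))
```

The update of `c_t` is additive, so a forget gate close to $1$ lets old memory pass through a step almost unchanged, while a forget gate close to $0$ erases it, as in the "soccer" example.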

LSTMs are very good at capturing long-term dependencies, which makes them really successful in practical sequence-modeling applications. For instance, when you type words into a Google search or a text message, the next-word suggestions you see have often been produced by some type of LSTM trying to predict the next word you may want to type.

### RNNs and cognition

As with Convolutional Neural Networks, researchers using RNNs to approach sequential problems like natural language processing (NLP) or time-series prediction do not necessarily care about how good a model of cognition and brain activity RNNs are. What they really care about is solving problems like translation, speech recognition, and stock market prediction, and many advances in the field come from pursuing such goals. Still, RNNs have many **desirable traits as a model of neuro-cognitive activity**, and have been used to model several aspects of human cognition and behavior: child behavior in object permanence tasks (Munakata et al., 1997); knowledge-intensive text comprehension (St. John, 1992); processing in quasi-regular domains, like English word reading (Plaut et al., 1996); human performance in processing recursive language structures (Christiansen & Chater, 1999); human sequential action (Botvinick & Plaut, 2004); and movement patterns in typically and atypically developing children (Muñoz-Organero et al., 2019), among many others. Neuroscientists have used RNNs to model a wide variety of aspects as well (for reviews, see Barak, 2017; Güçlü & van Gerven, 2017; Jarne & Laje, 2019). Overall, RNNs have proven to be a prolific tool for modeling cognitive and brain function.

## Mathematical formalization

## Code implementation

## Application

## Limitations

## Conclusions

## References

 Elman (1990)
 LSTM paper (1997)
 3 gradients problem papers.
 RNN deep learning chapters
 



Botvinick, M., & Plaut, D. C. (2004). Doing without schema hierarchies: A recurrent connectionist approach to normal and impaired routine sequential action. Psychological Review, 111(2), 395.

Barak, O. (2017). Recurrent neural networks as versatile tools of neuroscience research. Current Opinion in Neurobiology, 46, 1–6. https://doi.org/10.1016/j.conb.2017.06.003

Christiansen, M. H., & Chater, N. (1999). Toward a connectionist model of recursion in human linguistic performance. Cognitive Science, 23(2), 157–205.

Güçlü, U., & van Gerven, M. A. (2017). Modeling the dynamics of human brain activity with recurrent neural networks. Frontiers in Computational Neuroscience, 11, 7.

Jarne, C., & Laje, R. (2019). A detailed study of recurrent neural networks used to model tasks in the cerebral cortex. ArXiv Preprint ArXiv:1906.01094.

St. John, M. F. (1992). The story gestalt: A model of knowledge-intensive processes in text comprehension. Cognitive Science, 16(2), 271–306.

Munakata, Y., McClelland, J. L., Johnson, M. H., & Siegler, R. S. (1997). Rethinking infant knowledge: Toward an adaptive process account of successes and failures in object permanence tasks. Psychological Review, 104(4), 686.

Muñoz-Organero, M., Powell, L., Heller, B., Harpin, V., & Parker, J. (2019). Using Recurrent Neural Networks to Compare Movement Patterns in ADHD and Normally Developing Children Based on Acceleration Signals from the Wrist and Ankle. Sensors (Basel, Switzerland), 19(13). https://doi.org/10.3390/s19132935

Plaut, D. C., McClelland, J. L., Seidenberg, M. S., & Patterson, K. (1996). Understanding normal and impaired word reading: Computational principles in quasi-regular domains. Psychological Review, 103(1), 56.