# The Recurrent Neural Network

## Learning objectives

1. Understand the principles behind the creation of the recurrent neural network
2. MATH
3. CODE/MODELS
4. MORE STUFF

## Historical and theoretical background

The poet Delmore Schwartz once wrote: **"...time is the fire in which we burn"**. We can't escape time. Time is embedded in every human thought and action. Yet, so far, we have been oblivious to the role of time in neural network modeling. Indeed, in all the models we have examined so far we have implicitly assumed that **data is "perceived" all at once**, although there are countless examples where time is a critical consideration: movement, speech production, planning, decision-making, etc. We have also implicitly assumed that **past states have no influence on future states**. That is, the input pattern at time-step $t-1$ has no influence on the output at time-step $t$, or $t+1$, or any subsequent outcome for that matter. In probabilistic jargon, this amounts to assuming that each sample is drawn independently of the others. We know that in many scenarios this is simply not true: when giving a talk, my next utterance will depend upon my past utterances; when running, my last stride will condition my next stride, and so on. You can imagine endless examples.

Multilayer perceptrons and convolutional networks can, in principle, be used to approach problems where time and sequences are a consideration. Nevertheless, such architectures were not designed with time in mind, and better architectures were envisioned. In particular, **Recurrent Neural Networks (RNNs)** are a standard architecture for **time-dependent** and/or **sequence-dependent** problems. These networks are "recurrent" in the sense that they can **revisit or reuse past states** as inputs to predict the next or future states. To put it plainly, they have **memory**. Indeed, memory is what allows us to incorporate our past thoughts and behaviors into our future thoughts and behaviors.

### Hopfield Network

One of the earliest examples of networks incorporating "recurrences" was the so-called **Hopfield Network**, introduced in 1982 by John Hopfield, at the time a physicist at Caltech. Hopfield networks were important as they helped to reignite interest in neural networks in the early '80s (along with backpropagation).

Hopfield wanted to address the fundamental question of **emergence** in cognitive systems: can relatively stable cognitive phenomena, like memories, emerge from the collective action of large numbers of simple neurons? After all, such behavior had been observed in other physical systems, like vortex patterns in fluid flow. Why not in cognition?

Hopfield networks are known as a type of energy-based (instead of error-based) network because their properties derive from a global energy function. Like the McCulloch-Pitts neuron, Hopfield neurons are binary threshold units, but with recurrent instead of feed-forward connections, where each unit is **bi-directionally connected** in the network. This means that each unit receives inputs from and sends outputs to every other connected unit. In this architecture, **weights are symmetric** ($w_{ij} = w_{ji}$), such that the weights *coming into* a unit are the same as the ones *coming out* of it. The value of each unit is determined by a linear function wrapped in a threshold function $T$, as $y_i = T(\sum_j w_{ji}y_j + b_i)$.
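
The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the 3-unit weight matrix, the zero biases, the function name `update_unit`, and the convention that a net input of exactly zero switches the unit on are all assumptions made for the example.

```python
import numpy as np

def update_unit(y, W, b, i):
    """Return a copy of state y where unit i is recomputed from the other units."""
    y_new = y.copy()
    net_input = W[:, i] @ y + b[i]      # sum_j w_ji * y_j + b_i
    y_new[i] = int(net_input >= 0)      # threshold T (assumed: ties switch the unit on)
    return y_new

# Hypothetical 3-unit network: symmetric weights, no self-connections.
W = np.array([[ 0.0, 1.0, -2.0],
              [ 1.0, 0.0,  0.5],
              [-2.0, 0.5,  0.0]])
b = np.zeros(3)
y = np.array([1, 0, 1])

assert np.allclose(W, W.T)              # w_ij == w_ji
# Unit 1 receives 1.0*1 + 0.5*1 = 1.5 >= 0, so it switches on.
y_next = update_unit(y, W, b, i=1)
print(y_next)
```

Only one unit is recomputed per call, which mirrors the one-unit-at-a-time updating used when a Hopfield network settles.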



<img src="./images/rec-net/hopfield-net.svg">

The basic idea of Hopfield networks is that each configuration of binary values $C$ in the network is associated with a **global energy value $E$**. Here is a simplified picture of how the network settles: imagine you have a network with five neurons in the configuration $C_1=(0, 1, 0, 1, 0)$. Now, imagine $C_1$ yields a global energy value $E_1 = 2$. Your goal is to *minimize* $E$ by changing one element of the network $c_i$ at a time. By applying the threshold update rule to a single unit, you can get a new configuration like $C_2=(1, 1, 0, 1, 0)$, since the activation value of that unit flips between $0$ and $1$. If $C_2$ yields a *lower value of $E$*, let's say $1.5$, you are moving in the right direction. If you keep iterating over the units, the network will eventually "settle" into an **energy minimum**. Which minimum it reaches depends on the initial state of the network, so in general it is a local rather than a global minimum.
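
To make the settling process concrete, here is a small sketch of energy descent. It assumes the usual quadratic energy for this kind of network, $E = -\frac{1}{2}\sum_{i,j} w_{ij} y_i y_j - \sum_i b_i y_i$, which is not written out in the text above, and it assumes the weights stay fixed while only the unit states change. The random weights and the names `energy` and `settle` are hypothetical.

```python
import numpy as np

def energy(y, W, b):
    """Global energy of configuration y (assumed quadratic form)."""
    return -0.5 * y @ W @ y - b @ y

def settle(y, W, b, max_sweeps=100):
    """Update one unit at a time until no unit changes; record the energy after each flip."""
    y = y.copy()
    energies = [energy(y, W, b)]
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(y)):
            new_value = int(W[:, i] @ y + b[i] >= 0)
            if new_value != y[i]:
                y[i] = new_value
                changed = True
                energies.append(energy(y, W, b))
        if not changed:                  # a full sweep with no flips: stable state
            break
    return y, energies

rng = np.random.default_rng(0)
n = 5
A = rng.normal(size=(n, n))
W = (A + A.T) / 2                        # symmetric weights
np.fill_diagonal(W, 0.0)                 # no self-connections
b = np.zeros(n)

C1 = np.array([0, 1, 0, 1, 0])
C_final, energies = settle(C1, W, b)
print(C_final, energies)
```

With symmetric weights and no self-connections, each single-unit flip can only keep the energy the same or lower it, so the recorded energies never increase. Starting from a different configuration may lead to a different stable state.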

A fascinating aspect of Hopfield networks, besides the introduction of recurrence, is that they are closely based on neuroscience research about learning and memory. In fact, Hopfield proposed this model as a way to capture **memory formation and retrieval**. Basically, the idea is that the energy minima of the network could represent the **formation of a memory**, which further gives rise to a property known as **content-addressable memory (CAM)**. Here is the idea: when you access information stored in the random access memory (RAM) of your computer, you give the "address" where the "memory" is located to retrieve it. CAM works the other way around: you give information about the **content** you are searching for, and the computer should be able to retrieve the "memory". This is great because it works even when you have **partial or corrupted** information about the content, which is a much more **realistic depiction of how human memory works**. It is similar to doing a Google search. Just think of how many times you have searched for lyrics with partial information, like "song with the beeeee bop ba bodda bope!".

It is important to highlight that the sequential adjustment of Hopfield networks is **not driven by error correction**. Actually, there isn't a "target" as in supervised neural networks. Hopfield networks are systems that "evolve" until they find a stable low-energy state. If you "perturb" such a system, and the perturbation is small enough, the system will just "evolve" again until it reaches the same stable state, similar to how those inflatable "Bop Bag" toys get back to their initial position no matter how hard you punch them. It is almost like the system "remembers" its previous stable state (isn't it?). This ability to "return" to a previous stable state after perturbation is why they serve as models of memory.
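
The content-addressable behavior described above can be illustrated with a tiny example: store two patterns, corrupt one bit of the first, and let the network settle. The storage rule used here (a Hebbian-style sum of outer products over the patterns mapped to $\pm 1$) is an assumption, since the text does not specify how weights are set. The patterns and the names `store` and `recall` are hypothetical.

```python
import numpy as np

def store(patterns):
    """Build symmetric weights from 0/1 patterns (assumed Hebbian-style outer-product rule)."""
    S = 2 * np.asarray(patterns) - 1     # map {0, 1} to {-1, +1}
    W = (S.T @ S).astype(float)          # sum of outer products, one per pattern
    np.fill_diagonal(W, 0.0)             # no self-connections
    return W

def recall(cue, W, max_sweeps=20):
    """Update units one at a time until the state stops changing."""
    y = np.array(cue)
    for _ in range(max_sweeps):
        previous = y.copy()
        for i in range(len(y)):
            y[i] = int(W[:, i] @ y >= 0)
        if np.array_equal(y, previous):
            break
    return y

patterns = np.array([[1, 1, 1, 1, 0, 0, 0, 0],
                     [1, 0, 1, 0, 1, 0, 1, 0]])
W = store(patterns)

cue = np.array([0, 1, 1, 1, 0, 0, 0, 0])  # first pattern with its first bit corrupted
recalled = recall(cue, W)
print(recalled)
```

No target is supplied during recall: the corrupted cue is simply an initial state, and the dynamics carry it back to the nearby stored pattern. With many stored patterns, or heavily corrupted cues, recall can fail or land on a different stable state.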

### Elman Network

Although Hopfield networks were fascinating models, the first really successful example of a recurrent network trained with backpropagation was the so-called "Elman Network", introduced by [Jeffrey Elman](https://en.wikipedia.org/wiki/Jeffrey_Elman). Elman was a cognitive scientist at UC San Diego at the time, part of the group of researchers that published the PDP book.

In 1990, Elman published "Finding Structure in Time", an incredibly influential work in both cognitive science and machine learning (natural language processing). Elman was concerned with the problem of representing "time" or "sequences" in neural networks. In his view, you could take either an "explicit" approach or an "implicit" approach.


TO CONTINUE...

### Long Short-Term Memory Network

TODO

Notes Goodfellow et al.:
- for processing sequential data
- sequences of variable length
- parameter sharing allows applying the model to sequences of different lengths (forms) and generalizing across them
- we want to recognize information regardless of its position in the sequence
- MLPs need fixed-size inputs (problematic for language), and learn weights for each feature separately
- each member of the output depends on (is a function of) the previous members of the output
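
The parameter-sharing point in these notes can be sketched with a plain recurrent update, $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$. This is a generic illustration under assumed sizes and random, untrained weights; the names `W_xh`, `W_hh`, `b_h`, and `run_rnn` are hypothetical and do not come from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4                    # hypothetical sizes

# One set of parameters, reused at every time step.
W_xh = rng.normal(scale=0.5, size=(n_hidden, n_in))
W_hh = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
b_h = np.zeros(n_hidden)

def run_rnn(sequence):
    """Apply the same weights at every step and return all hidden states."""
    h = np.zeros(n_hidden)
    states = []
    for x_t in sequence:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.array(states)

short_seq = rng.normal(size=(2, n_in))   # sequence of length 2
long_seq = rng.normal(size=(7, n_in))    # sequence of length 7

print(run_rnn(short_seq).shape)          # (2, 4)
print(run_rnn(long_seq).shape)           # (7, 4)
```

The same three parameter arrays handle both sequences, whereas an MLP over a flattened input would need a separate weight for every position and a fixed input length.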

Notes Elman 1990:
- First proposed by Jordan (1986) 
- hidden unit patterns feed back to themselves
- internal representations reflect prior internal states
- context-dependent and generalization across classes of items
- language is expressed as temporal sequences
- two approaches to represent time in neural nets: 
    - explicitly, as an additional dimension (a matrix instead of a vector)
    - implicitly, by the effect it has on processing (next outcome)
- Explicit approach: 
    - in a vector, the first temporal event is x_1, the second is x_2, and so on: a time series. Hence space (i.e., position in a vector) is encoding time.
    - Problem 1: the whole sequence needs to be presented to the network all at once
    - Problem 2: the input pattern has to be large enough to represent the longest sequence, and is fixed at that size. Hence, all inputs have to be of the same length.
    - Problem 3: it's hard to distinguish relative temporal position from absolute temporal position. The same pattern, occurring at slightly different times, would be processed as different in vector space, when they are the same.
- Implicit approach:
    - network requires memory
    - Many ways to do it: Jordan 1986 introduced a good one. The recurrent connections allow the hidden units to see their previous output and use it as an input for the current computation. In other words, memory.
    - Modification. (Temporal) Context units: copy the state of the hidden units, and then serve as input to the hidden units in the next time-step. It is like a short-term memory storage. The hidden units must learn representations that support both classification and the encoding of temporal properties of the sequential input.
- Temporal XOR:
    - feeding the input one data point at a time: (x_1, x_2) -> (y_3), (x_4, x_5) -> (y_6), and so on.
    - since prediction is only possible at every third time step (the XOR of the two preceding bits), the error is going to follow a cyclic behavior.
    - after training, the network will try to use XOR at all time steps, although it is going to work only at every third time step.
- Structure in letter sequences:
    - architecture: there were 6 input units, 20 hidden units, 6 output units, and 20 context units
    - learns statistical regularities: the consonants were ordered randomly (high error), but the vowels were not (low error)
    - After each consonant, the network can predict the next vowel. At the end of the vowel sequence, it can't predict the next consonant.
    - Net learns which vowels follow which consonants
    - Net learns how many vowels follow each consonant
    - Net knows a consonant follows, but not which one
- Discovering the notion "word":
    - many linguists at the time assumed that basic language constructs must be innate; otherwise, their language theories do not work. This assumes clear-cut and uncontroversial definitions of constructs like phoneme, morpheme, word, etc. This is not true in practice: there are many counterexamples.
    - Fundamental concepts in linguistics are fluid. Learning is crucial.
    - Neural nets learn graded representations.
    - Can words emerge from learning the sequential structure of letter (or sound, gesture, etc.) sequences?
    - Elman's model learns to parse words, but such criteria are relative. This is similar to what happens in child language acquisition, where children sometimes treat groups of words as single units.
    - This is not a full-blown model of language acquisition. Instead, it shows how a neural net can learn to parse words by extracting patterns from data.
- Discovering lexical classes from word order:
    - Lexical classes (syntactic structure, etc.) can be learned from data. Such classes are implicit in the data.
    - Elman had already demonstrated that a network was able to learn the temporal structure of letter sequences
    - In the simulation, input patterns do not contain any information about lexical classes, yet the network somehow learns such structure from the co-occurrence statistics in the data.
    - Evaluate similarity structure of learned representations with hierarchical clustering. 
    - The network learns representations that resemble the natural organization of classes, like nouns and verbs, or clusters of animals and objects.
    - The network learns even though it has less information than a human in a real-world context.
    - Replacing man with zog example 
- Remarks about distributed representations:
    - No limit to the number of concepts to be represented by a finite set of units
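
Two ideas from the Elman notes, the temporal XOR stream and the context units, can be sketched together. This is a rough illustration with untrained random weights, so the predictions are not meaningful; it only shows the data layout and the copy-back mechanism. The number of hidden units, the sigmoid activation, and the names `temporal_xor_stream` and `elman_forward` are assumptions.

```python
import numpy as np

def temporal_xor_stream(n_triplets, rng):
    """Concatenate triplets (a, b, a XOR b) into one long bit stream."""
    a = rng.integers(0, 2, size=n_triplets)
    b = rng.integers(0, 2, size=n_triplets)
    return np.stack([a, b, a ^ b], axis=1).reshape(-1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elman_forward(stream, W_in, W_ctx, W_out):
    """Predict the next bit at each step; context units hold a copy of the last hidden state."""
    context = np.zeros(W_ctx.shape[0])
    predictions = []
    for x_t in stream[:-1]:
        hidden = sigmoid(W_in * x_t + W_ctx @ context)
        predictions.append(sigmoid(W_out @ hidden))
        context = hidden.copy()          # copy hidden state for the next time step
    return np.array(predictions)

rng = np.random.default_rng(0)
stream = temporal_xor_stream(n_triplets=4, rng=rng)

n_hidden = 2                             # hypothetical size
W_in = rng.normal(size=n_hidden)         # single input unit to hidden units
W_ctx = rng.normal(size=(n_hidden, n_hidden))
W_out = rng.normal(size=n_hidden)        # hidden units to single output unit

predictions = elman_forward(stream, W_in, W_ctx, W_out)
targets = stream[1:]                     # the next bit is the target at every step
print(stream, predictions.shape)
```

Only every third element of `targets` is determined by the two bits before it, which is why the error of a trained network would be expected to dip periodically rather than uniformly.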

## Mathematical formalization

## Code implementation

## Application

## Limitations

## Conclusions

## References