# Recurrent Neural Networks
![image.png](attachment:image.png)

**Many to one**: the input is a sequence (e.g. the word embeddings of a sentence) and the output is a single vector (e.g. the sentiment).

**Many to many**: the network first reads the whole input sequence and then outputs another sequence, as in speech recognition.

## Background

RNNs recursively compute new states by applying transfer functions to previous states and inputs. They map input sequences to output sequences (or a vector on either side).

RNNs have the universal approximation property: given enough hidden units, they can approximate arbitrary nonlinear dynamical systems to arbitrary precision.

In the context of prediction, an RNN is trained on temporal input data $x(t)$ to reproduce a desired temporal output $y(t)$. The target $y(t)$ can be a time series related to the input or a temporally shifted version of $x(t)$ itself.
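
As a minimal sketch of the second case, where the target is a temporal shift of the input, the snippet below builds training pairs from a sine wave. The signal, the horizon `shift`, and the helper name `make_shifted_pairs` are illustrative assumptions, not something taken from these notes.

```python
import numpy as np

def make_shifted_pairs(x, shift):
    """Return (inputs, targets) where targets[t] = x[t + shift]."""
    if shift <= 0 or shift >= len(x):
        raise ValueError("shift must be between 1 and len(x) - 1")
    inputs = x[:-shift]   # x(t) for every t that still has a future value
    targets = x[shift:]   # y(t) = x(t + shift)
    return inputs, targets

t = np.arange(100)
x = np.sin(0.1 * t)       # hypothetical input signal x(t)
inputs, targets = make_shifted_pairs(x, shift=5)
print(inputs.shape, targets.shape)
```

Each `inputs[t]` is paired with the value the signal takes five steps later, so a model trained on these pairs learns to forecast five steps ahead.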

![image.png](attachment:image.png)

## Architecture of RNNs

![image.png](attachment:image.png)

We're building a feedback loop! The network recycles what it computed in the previous step and feeds it back in together with the next input (the next image, word, or whatever the sequence contains).

The architecture of a general RNN can be seen as a weighted, directed, and cyclic graph that contains three different kinds of nodes, namely the input, hidden, and output nodes.

The circles represent the input nodes $x$, hidden nodes $h$, and output nodes $y$, respectively. The solid squares $W_i^h$, $W_h^h$, and $W_h^o$ are matrices representing the input, hidden, and output weights. The polygon represents the nonlinear transformation (activation function) performed by the neurons, and $z^{-1}$ is the unit delay operator.
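
The loop described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions the notes do not state: a `tanh` activation, no bias terms, and a linear readout. The names `W_ih`, `W_hh`, and `W_ho` stand in for $W_i^h$, $W_h^h$, and $W_h^o$.

```python
import numpy as np

def rnn_forward(xs, W_ih, W_hh, W_ho, h0):
    """Run a simple RNN over a sequence.

    xs has shape (T, n_in). Returns the hidden states, shape (T, n_hidden),
    and the outputs, shape (T, n_out).
    """
    h = h0
    hs, ys = [], []
    for x in xs:
        # New state from the current input and the delayed (previous) state.
        h = np.tanh(W_ih @ x + W_hh @ h)
        hs.append(h)
        ys.append(W_ho @ h)  # linear readout (an assumption)
    return np.array(hs), np.array(ys)

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 3, 4, 2, 6
W_ih = rng.normal(scale=0.5, size=(n_hidden, n_in))
W_hh = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
W_ho = rng.normal(scale=0.5, size=(n_out, n_hidden))
xs = rng.normal(size=(T, n_in))

hs, ys = rnn_forward(xs, W_ih, W_hh, W_ho, h0=np.zeros(n_hidden))
many_to_many = ys      # one output per time step
many_to_one = ys[-1]   # only the output after the whole sequence was read
```

The same three matrices are reused at every time step; only the state `h` changes as the sequence is consumed.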

## Backpropagation Through Time

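Unfolding an RNN in time replicates the network once per time step, turning the cyclic graph into a layered feedforward one. The key difference between such an unfolded RNN and a standard feedforward neural network (FFNN) is that the weight matrices are constrained to take the same values in all replicas of the layers, since they represent the recursive application of the same operation.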

Through this transformation the network can be trained with standard learning algorithms, originally conceived for feedforward architectures. This learning procedure is called **Backpropagation Through Time (BPTT)** (Rumelhart et al. 1985) and is one of the most successful techniques adopted for training RNNs.
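
To make the weight sharing concrete, here is a small sketch of BPTT for a scalar RNN, $h_t = \tanh(w_x x_t + w_h h_{t-1})$, with a squared-error loss on every step. The scalar setup, the loss, and all names (`forward`, `loss`, `bptt`) are assumptions chosen for readability. The point to notice is that the gradient contributions of every unfolded step are summed into the same two weights.

```python
import math

def forward(xs, w_x, w_h):
    """Return the states [h_0, h_1, ..., h_T] with h_0 = 0."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w_x * x + w_h * hs[-1]))
    return hs

def loss(xs, ys, w_x, w_h):
    hs = forward(xs, w_x, w_h)
    return 0.5 * sum((h - y) ** 2 for h, y in zip(hs[1:], ys))

def bptt(xs, ys, w_x, w_h):
    """Gradients of the loss w.r.t. the two shared weights."""
    hs = forward(xs, w_x, w_h)
    g_wx = g_wh = 0.0
    dh_next = 0.0  # gradient flowing back from later time steps
    for t in reversed(range(len(xs))):
        dh = (hs[t + 1] - ys[t]) + dh_next   # local error plus future error
        da = dh * (1.0 - hs[t + 1] ** 2)     # back through tanh
        g_wx += da * xs[t]                   # shared weights: accumulate
        g_wh += da * hs[t]
        dh_next = da * w_h                   # pass on to the previous step
    return g_wx, g_wh

xs = [0.5, -0.1, 0.3, 0.8]
ys = [0.2, 0.1, -0.3, 0.4]
w_x, w_h = 0.7, -0.4
g_wx, g_wh = bptt(xs, ys, w_x, w_h)

# Compare against central finite differences as a sanity check.
eps = 1e-6
num_wx = (loss(xs, ys, w_x + eps, w_h) - loss(xs, ys, w_x - eps, w_h)) / (2 * eps)
num_wh = (loss(xs, ys, w_x, w_h + eps) - loss(xs, ys, w_x, w_h - eps)) / (2 * eps)
print(abs(g_wx - num_wx), abs(g_wh - num_wh))
```

The printed differences should be close to zero, which indicates that the backward loop agrees with a numerical estimate of the gradient.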

![image.png](attachment:image.png)

**The problem: plain RNNs only work well on fairly simple problems!**

Their memory is poor: whatever the network remembers from the first inputs becomes increasingly blurry over time.

## LSTMs

LSTMs solved this!

Regular RNN cells suffer from a significant disadvantage: they are bad at representing long-term correlations in the data. The influence of early time steps becomes more and more diluted as the RNN propagates in time. Moreover, gradients propagated over many time steps tend to vanish or explode, which makes RNNs hard to train on long sequences.
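
The dilution can be illustrated with a scalar RNN, $h_t = \tanh(w_h h_{t-1} + x_t)$. By the chain rule, the sensitivity of the final state to the initial state is a product of one factor $w_h (1 - h_t^2)$ per time step. The sketch below is a toy example; the weight `0.9`, the constant input, and the function name `state_sensitivity` are assumptions.

```python
import math

def state_sensitivity(w_h, xs, h0=0.0):
    """Return |d h_T / d h_0| for h_t = tanh(w_h * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in xs:
        h = math.tanh(w_h * h + x)
        grad *= w_h * (1.0 - h * h)  # chain rule through one time step
    return abs(grad)

xs = [0.1] * 50
short_range = state_sensitivity(0.9, xs[:5])   # influence after 5 steps
long_range = state_sensitivity(0.9, xs)        # influence after 50 steps
print(short_range, long_range)
```

Because every factor has magnitude at most `|w_h|`, the influence of the first state shrinks at least geometrically with the sequence length whenever `|w_h| < 1`.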

![image.png](attachment:image.png)

LSTM (Long Short-Term Memory) cells come to the rescue. They encode a long-term memory using four separate activations. These activations are responsible for:

- forwarding the input to the output
- storing the input in the long-term memory
- modifying the output by the long-term memory
- erasing the long-term memory

**Four activation functions in an LSTM: sigmoids and tanh!**

**You need fewer LSTM neurons than in other neural networks!**

- One activation function (the sigmoid on the far right) filters out anomalies by taking an element-wise product of what the cell remembers from previous inputs and the new input. For example, if it remembers the colour orange and you then introduce an iceberg, it finds no connection between the two, so the iceberg makes no contribution.

- The activation function on the far left (a sigmoid) is responsible for storing information in the network.

- The piece in the middle deletes the long-term memory. If you show an iceberg and then a polar bear, the network starts to think that maybe the orange images were the anomalies and that we actually want to train it on arctic images.
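
The four roles listed above can be written down as one step of an LSTM cell. This sketch follows the commonly used textbook formulation, which may not match the layout of the figure in these notes. Stacking all four weight blocks into one matrix `W` and the order of the gates within it are assumptions made for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell.

    W has shape (4 * n_hidden, n_in + n_hidden), b has shape (4 * n_hidden,).
    """
    n = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[0 * n:1 * n])   # forget gate: erases the long-term memory
    i = sigmoid(z[1 * n:2 * n])   # input gate: stores new content
    g = np.tanh(z[2 * n:3 * n])   # candidate content computed from the input
    o = sigmoid(z[3 * n:4 * n])   # output gate: filters what is emitted
    c = f * c_prev + i * g        # element-wise update of the long-term memory
    h = o * np.tanh(c)            # output modified by the long-term memory
    return h, c

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
W = rng.normal(scale=0.1, size=(4 * n_hidden, n_in + n_hidden))
b = np.zeros(4 * n_hidden)

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x in rng.normal(size=(5, n_in)):   # a hypothetical 5-step input sequence
    h, c = lstm_step(x, h, c, W, b)
```

All gating is done with element-wise products. A forget gate close to 1 combined with an input gate close to 0 leaves the memory `c` almost unchanged, which is how the cell carries information across many steps.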

## GRU Cells
GRU (Gated Recurrent Unit) cells use a simpler architecture than LSTMs. Although they have only one output, they can be trained to represent long-term relationships. Put simply (maybe a bit too simply), a GRU cell either forwards the old memory or the new input in each step.
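
A rough sketch of one GRU step shows where the "old memory or new input" choice happens. It uses one common convention, in which the update gate `z` weights the new candidate; some references swap the roles of `z` and `1 - z`. Bias terms are omitted and the weight names are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, W_z, W_r, W_h):
    """One step of a GRU cell; each W has shape (n_hidden, n_in + n_hidden)."""
    xh = np.concatenate([x, h_prev])
    z = sigmoid(W_z @ xh)   # update gate: how much new content to take
    r = sigmoid(W_r @ xh)   # reset gate: how much old memory the candidate sees
    h_cand = np.tanh(W_h @ np.concatenate([x, r * h_prev]))
    # Blend between forwarding the old memory and the new candidate.
    return (1.0 - z) * h_prev + z * h_cand

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
W_z, W_r, W_h = (rng.normal(scale=0.1, size=(n_hidden, n_in + n_hidden))
                 for _ in range(3))

h = np.zeros(n_hidden)
for x in rng.normal(size=(5, n_in)):   # a hypothetical 5-step input sequence
    h = gru_step(x, h, W_z, W_r, W_h)
```

In practice the gate is rarely exactly 0 or 1, so the cell mixes old memory and new input rather than making a hard choice. The single vector `h` serves as both memory and output.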

![image.png](attachment:image.png)

## Applications of RNNs

| input | output | applications |
|-------|--------|--------------|
| vector | vector | use a normal ANN instead! |
| sequence | vector | sentiment analysis, text categorization, recommender based on history (collaborative filtering) |
| vector | sequence | image captioning, music generation |
| sequence | sequence | forecasting stock prices, weather forecast, per-character sentiment, video captioning, alarm system (e.g. patient surveillance) | 
| sequence | sequence | machine translation (encoder-decoder) |