# Recurrent Neural Networks
### Machine Learning Reading Group, 
### The University of Melbourne

&nbsp;

Dr Peter Cudmore  
Systems Biology Lab  
The University of Melbourne

## Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are neural networks with 'memory'.  

Useful for:
- Natural Language Processing
- Signal Reconstruction
- Motion Tracking/Control

Recurrent Neural Networks are designed for *sequential data*.

## Feedback in RNNs

Memory is usually implemented by introducing *feedback loops* into the neural network.  

Feedback can be introduced at different places in the network: at the output stage, between layers, or at the level of the individual artificial neuron.
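
To make the idea of node-level feedback concrete, here is a minimal sketch contrasting a memoryless neuron with one whose previous output is fed back in. The `tanh` activation, the weights `w`, `u`, `b` and both function names are illustrative assumptions, not something fixed by these notes.

```python
import math

def feedforward_neuron(x, w=0.8, b=0.0):
    """Memoryless neuron: the output depends only on the current input."""
    return math.tanh(w * x + b)

def recurrent_neuron(x, h_prev, w=0.8, u=0.5, b=0.0):
    """Neuron with node-level feedback: the previous output h_prev is fed back in."""
    return math.tanh(w * x + u * h_prev + b)

inputs = [1.0, 0.0, 0.0]  # a single impulse followed by silence

ff_outputs = [feedforward_neuron(x) for x in inputs]

h = 0.0  # assumed initial feedback value
rec_outputs = []
for x in inputs:
    h = recurrent_neuron(x, h)
    rec_outputs.append(h)

print(ff_outputs)   # the response stops as soon as the input is zero
print(rec_outputs)  # the response persists after the input has gone
```

The feedback term `u * h_prev` is the only difference between the two functions, and it is what lets the impulse at the first step influence later outputs.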


Introducing feedback has some consequences:

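- Topology is very important.
- An input can influence the output indefinitely ('Infinite Impulse Response'), so training has to truncate the sequence.
- Gradient magnitude issues are very common.
- Dynamical effects such as instability, strange attractors and chaos can appear.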
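
The gradient magnitude issue can be seen in a deliberately simple case. The sketch below assumes a scalar linear recurrence $z_n = a z_{n-1} + x_n$ (a toy model chosen for illustration, with the hypothetical helper `state_sensitivity`), for which the sensitivity of $z_N$ to $z_0$ is exactly $a^N$.

```python
def state_sensitivity(a, n_steps):
    """d z_N / d z_0 for the linear recurrence z_n = a * z_{n-1} + x_n."""
    grad = 1.0
    for _ in range(n_steps):
        grad *= a  # chain rule: each step multiplies by d z_n / d z_{n-1} = a
    return grad

for a in (0.5, 1.0, 1.5):
    # |a| < 1 shrinks the gradient, |a| > 1 blows it up, a = 1 preserves it
    print(a, state_sensitivity(a, 50))
```

In a nonlinear RNN the factor `a` is replaced by the Jacobian of the state update at each step, but the same repeated multiplication is what makes gradients shrink or grow with sequence length.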

- Draw an example of a neural network and contrast it against one with a feedback loop (i.e., an RNN).
- Draw the output feedback, layer feedback and node feedback topologies.


## RNNs and Signal Processing


Suppose:
- We have partial and/or noisy observations $x(t)$ of some sequential process 
- which is assumed to evolve on a space $z$ according to some evolution rule $f$,
- and is measured to make a prediction or decision $y(t)$.

If we were doing control systems, we might write this as  

$$
\dot{z} = f(z,x)\qquad y = g(z,x)
$$  
to predict $y$ from some sequence of values of $x$.

## RNNs and Digital Signal Processing
Consider
$$\dot{z} = f(z,x)\qquad y = g(z,x).$$

An RNN can be interpreted as:
- Applying some discretization scheme to the evolution on some sufficiently large state space such that $z_n = F(z_{n-1},x_n)$.
- Applying the *Universal Approximation Theorem* to approximate the resulting function $F$ with a feed-forward neural network.
- Conceptually discretizing and splitting $g$ into post-processing (for example, soft-max to transform the output layer data into a pdf) and 'everything else' (which is then assimilated into the network).
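
As one concrete instance of the first step, assume a forward Euler scheme with step size $h$ and samples $x_n = x(nh)$ (the scheme and $h$ are illustrative choices; any consistent discretization would do). Then

$$
z_n = z_{n-1} + h\, f(z_{n-1}, x_n) =: F(z_{n-1}, x_n),
$$

so the discrete update $F$ is the identity plus a scaled copy of the continuous evolution rule $f$, and it is this $F$ that the network is asked to approximate.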

RNN in general form:
$$
z_n = F(z_{n-1}, x_n; \theta),\qquad y_n = M(z_n, x_n)
$$
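
A minimal sketch of this general form in plain Python, assuming a single `tanh` layer for $F$ and a soft-max readout for $M$. The parameter layout (`W`, `U`, `b`), the two-dimensional state and the numerical values are all hypothetical and chosen only for illustration.

```python
import math

def F(z_prev, x, theta):
    """State update z_n = F(z_{n-1}, x_n; theta): one tanh layer (an assumed choice)."""
    W, U, b = theta["W"], theta["U"], theta["b"]
    return [
        math.tanh(sum(W[i][j] * z_prev[j] for j in range(len(z_prev))) + U[i] * x + b[i])
        for i in range(len(b))
    ]

def M(z, x):
    """Readout y_n = M(z_n, x_n): soft-max of the state (x is unused in this sketch)."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical parameters for a 2-dimensional state and scalar input.
theta = {"W": [[0.5, -0.3], [0.2, 0.1]], "U": [1.0, -1.0], "b": [0.0, 0.0]}

z = [0.0, 0.0]  # initial state z_0
ys = []
for x in [1.0, 0.5, -0.2]:
    z = F(z, x, theta)  # the same F and theta are reused at every step
    ys.append(M(z, x))

print(ys)  # one probability vector per time step
```

Note that `theta` enters only through `F`; the readout `M` here has no trainable parameters, matching the way the general form is written.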

Observations:
- RNNs are _compositional_ (more on this soon).
- RNNs are Iterated Function Systems: a class of functions known to generate fractals, etc.
- The diff-eq representation makes it clear that we need some initial state $z(0)$ (which we usually assume to be zero for training).

## Composition and Unfolding 

Instead of thinking of an RNN acting in an iterative sense, it is often useful to 'unfold' the neural network, that is, to think of it as a map $G: X^N \rightarrow Y^N$, where each $x_n\in X$ (for example, a vector of EEG data at time $t_n$) and each $y_n\in Y$ (the corresponding decision/output). 

Recall:
$$z_n = F(z_{n-1}, x_n; \theta),\qquad y_n = M(z_n, x_n)$$
then it follows that
$$ y_1 = M(F(z_0,x_1), x_1),\qquad y_2 = M(F(F(z_0,x_1),x_2),x_2),\qquad \ldots$$   
Let's define $M_n(z) := M(z,x_n)$ and $F_n(z) := F(z,x_n)$; then it follows that
$$ y_1 = M_1\circ F_1(z_0) = M_1(F_1(z_0)) = G_1(z_0, x_1)$$
$$ y_2 = M_2 \circ F_2\circ F_1(z_0) = G_2(z_0, x_1, x_2)$$
$$\vdots$$
$$ y_N  = M_N\circ F_N\circ F_{N-1}\circ\cdots\circ F_1 (z_0) = G_N(z_0,x_1,x_2,\ldots,x_N; \theta)$$

![Unrolled RNN](images/UnrolledRNN.svg)
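
The equivalence between iterating the network and evaluating the unfolded map can be checked numerically. This is a sketch with a scalar state; the particular `F`, the readout `M` and the values in `theta` are hypothetical stand-ins.

```python
import math
from functools import reduce

def F(z, x, theta):
    """Scalar state update (an assumed tanh form)."""
    w_z, w_x = theta
    return math.tanh(w_z * z + w_x * x)

def M(z, x):
    """Hypothetical readout; any function of (z, x) would do."""
    return 2.0 * z + x

def G(z0, xs, theta):
    """Unfolded map: G_N(z0, x_1..x_N; theta) = M_N(F_N(...F_1(z0)...))."""
    z_N = reduce(lambda z, x: F(z, x, theta), xs, z0)  # compose F_1, ..., F_N
    return M(z_N, xs[-1])

theta = (0.9, 0.4)
xs = [1.0, -0.5, 0.25]
z0 = 0.0

# Iterative view: carry the state forward one step at a time.
z = z0
ys_iterative = []
for x in xs:
    z = F(z, x, theta)
    ys_iterative.append(M(z, x))

# Unfolded view: each output is a function of z0 and all inputs so far.
ys_unfolded = [G(z0, xs[:n], theta) for n in range(1, len(xs) + 1)]

print(ys_iterative)
print(ys_unfolded)
```

Both lists contain the same values; the unfolded form simply makes explicit that $y_n$ depends on $z_0$ and on every input up to $x_n$.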

## Advantages of Unfolding

1. Unfolding makes the input-output conditioning explicit.
2. Writing out the composition sequences makes it clear how to do gradient descent. 

Applying backprop to the unfolded graph is known as _backpropagation through time_.
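
A sketch of where the gradient comes from, assuming $F$ and $M$ are differentiable and that $z_0$ does not depend on $\theta$ (assumptions not stated above). Differentiating the recursion $z_n = F(z_{n-1}, x_n;\theta)$ with the chain rule gives

$$
\frac{dz_n}{d\theta} = \frac{\partial F}{\partial z}(z_{n-1}, x_n;\theta)\,\frac{dz_{n-1}}{d\theta} + \frac{\partial F}{\partial \theta}(z_{n-1}, x_n;\theta),\qquad \frac{dz_0}{d\theta} = 0,
$$

and unrolling this recursion yields

$$
\frac{dy_N}{d\theta} = \frac{\partial M}{\partial z}(z_N, x_N)\sum_{k=1}^{N}\left(\prod_{j=k+1}^{N}\frac{\partial F}{\partial z}(z_{j-1}, x_j;\theta)\right)\frac{\partial F}{\partial \theta}(z_{k-1}, x_k;\theta),
$$

where, for a vector-valued state, the product is ordered from $j=N$ on the left down to $j=k+1$ on the right. Each term of the sum is the contribution of the parameters as they were used at time step $k$, and the repeated product of Jacobians is the source of the gradient magnitude issues.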

## Training an RNN with Backpropagation Through Time

Requires:
- A sequence size $N$, which determines how far to unroll the graph.
- Training input sequence $\{x_n\}_{n=1}^K$ and corresponding true outputs $\{\hat{y}_n\}_{n=1}^K$ for some $K = N + k$, where $k$ is the batch size (the number of length-$N$ windows used per pass).
- A loss function $L$.
- An initial guess for the state: $z_0 =0$.
- The RNN $F(z,x;\theta)$.
- The graph unfolding $G_N(z_0,x_1,x_2,\ldots,x_N;\theta)$.


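Algorithm:

    while training:
        z = 0                                    # initial state guess z_0
        for step in range(k):
            window = x[step:step + N]            # N consecutive inputs
            y = G_N(z, window, theta)            # unrolled forward pass starting from state z
            error = L(y, y_hat[step + N - 1])    # loss against the true output for this window
            theta = backprop(error, G_N, theta)  # backpropagate the error through the unrolled graph
            z = F(z, x[step], theta)             # advance the carried state by one sample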
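
The algorithm above can be made runnable in a toy setting. The sketch below assumes a scalar state with $F(z,x;\theta)=\tanh(w_z z + w_x x)$, an identity readout, a squared-error loss on the last output of each window, and plain gradient descent with learning rate `lr`. All function names, the toy data and the "teacher" parameters are hypothetical.

```python
import math

def F(z, x, theta):
    """Scalar state update z_n = tanh(w_z * z_{n-1} + w_x * x_n)."""
    w_z, w_x = theta
    return math.tanh(w_z * z + w_x * x)

def unroll(z0, window, theta):
    """Forward pass over a window; returns the states [z_0, z_1, ..., z_N]."""
    zs = [z0]
    for x in window:
        zs.append(F(zs[-1], x, theta))
    return zs

def bptt_gradient(z0, window, target, theta):
    """Gradient of 0.5 * (z_N - target)**2 w.r.t. theta via the unrolled graph."""
    w_z, w_x = theta
    zs = unroll(z0, window, theta)
    grad_z = zs[-1] - target                   # dLoss / dz_N
    grad_wz, grad_wx = 0.0, 0.0
    for n in range(len(window), 0, -1):        # walk backwards through time
        grad_a = grad_z * (1.0 - zs[n] ** 2)   # through tanh: 1 - tanh(a)^2
        grad_wz += grad_a * zs[n - 1]          # the same w_z is used at every step
        grad_wx += grad_a * window[n - 1]      # the same w_x is used at every step
        grad_z = grad_a * w_z                  # pass the gradient on to z_{n-1}
    return grad_wz, grad_wx

def train(xs, targets, N, theta=(0.0, 0.0), lr=0.1, epochs=200):
    """Sliding-window training loop following the algorithm above."""
    k = len(xs) - N
    for _ in range(epochs):
        z = 0.0                                # initial state guess z_0
        for step in range(k):
            window = xs[step:step + N]
            g_wz, g_wx = bptt_gradient(z, window, targets[step + N - 1], theta)
            theta = (theta[0] - lr * g_wz, theta[1] - lr * g_wx)
            z = F(z, xs[step], theta)          # advance the carried state by one sample
    return theta

# Toy data generated by a hypothetical "teacher" RNN of the same form.
true_theta = (0.5, 1.0)
xs = [math.sin(0.7 * n) for n in range(40)]
targets = unroll(0.0, xs, true_theta)[1:]      # targets[n] is the teacher state after xs[n]

theta = train(xs, targets, N=5)
print(theta)
```

Because the gradient is only propagated back `N` steps, this is a truncated form of backpropagation through time; the carried state `z` is what lets information from before the window still enter the forward pass.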

## Teacher Forcing




## RNNs and Statistical Signal Processing

Some observations:
- In the absence of input ($x = 0$), the RNN obeys the 'Markov Property' (the next state depends only on the current state) and hence generates a stationary process $y_1,y_2,\ldots,y_n$.
- Hence, RNNs are causal (but there are some ways around this)!
- In the presence of input, the process $y_1,y_2,\ldots$ is no longer stationary (courtesy of the graph $G$) and hence may have long-range dependence encoded in the internal state $z$.
