# RNN (Recurrent Neural Network)

## Sequential data
Data with a temporal property is called sequential data. In sequential data, the order in which features appear matters. Samples also differ in length, so a recurrent neural network accepts variable-length input by adding recurrent edges to the hidden layer. Sequential data is context dependent: there is no fixed relationship between features, but there is a context dependency, for example between two elements $\mathbf{x}^{(i)}$ and $\mathbf{x}^{(j)}$ of a sample.  
A sample $\mathbf{x}$ has $T$ elements, and each element is $d$-dimensional. The length $T$ usually differs from sample to sample. A sample is written as  
$$\mathbf{x} = (\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \cdots, \mathbf{x}^{(T)})^T$$
And the training set is $\mathbb{X} = \{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n\}, \mathbb{Y} = \{\mathbf{y}_1, \mathbf{y}_2, \cdots, \mathbf{y}_n\}$  


RNNs are used for applications such as machine translation and image annotation. A dictionary is needed to represent text data. It can be built by collecting words that are commonly used in daily life, or it can be extracted automatically by analyzing a given corpus. There are several ways to represent words: BoW (Bag of Words), one-hot code, and word embedding.  
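
As a minimal sketch of the first two representations, the snippet below extracts a dictionary from a tiny made-up corpus and encodes a sentence both as a sequence of one-hot vectors and as a bag of words. The corpus, the helper names `one_hot` and `bag_of_words`, and the alphabetical dictionary order are assumptions for illustration only.

```python
# Toy corpus (hypothetical); the dictionary is extracted from it automatically.
corpus = ["the cat sat", "the dog sat down"]
vocab = sorted({word for sentence in corpus for word in sentence.split()})
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a single 1 at the word's dictionary index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def bag_of_words(sentence):
    """Count word occurrences; the word order is lost."""
    vec = [0] * len(vocab)
    for word in sentence.split():
        vec[index[word]] += 1
    return vec

sentence = "the dog sat"
sequence = [one_hot(w) for w in sentence.split()]  # T vectors, each d-dimensional
bow = bag_of_words(sentence)                       # a single d-dimensional vector
```

The one-hot sequence keeps the order of the words, which is what an RNN consumes element by element, while the bag of words collapses the whole sentence into one vector.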


## RNN
A neural network that processes sequential data should have three properties:  
- temporality : Features must be entered one at a time, in order.
- variable length : The hidden layer must appear $T$ times when the sample length is $T$, and $T$ varies from sample to sample.
- context dependency : The network must remember previous features and use them at the right moment.


<img src="./img/5_RNN.png" width="50%" height="50%">  
Unlike an MLP, the RNN has only one hidden layer, and there are edges among the hidden nodes. These edges are called recurrent edges. Thanks to the recurrent edges, the RNN provides all three properties: temporality, variable length, and context dependency. In brief, a recurrent edge passes the information generated at moment $t-1$ to moment $t$.  
$$
\mathbf{h}^{(t)} = f(\mathbf{h}^{(t-1)}, \mathbf{x}^{(t)};\Theta)
$$  
The RNN computes with the parameter set $\Theta$, a set of RNN weights that includes $\mathbf{U}, \mathbf{V}, \mathbf{W}$. The hidden layer value at moment $t$, $\mathbf{h}^{(t)}$, is determined by the input at moment $t$, $\mathbf{x}^{(t)}$, and by the hidden layer value at the previous moment, $\mathbf{h}^{(t-1)}$. In this way the hidden layer of the RNN remembers previous information.  



$$
\begin{align}
\mathbf{h}^{(T)} & = f(\mathbf{h}^{(T-1)}, \mathbf{x}^{(T)};\Theta) \\
& = f(f(\mathbf{h}^{(T-2)}, \mathbf{x}^{(T-1)};\Theta), \mathbf{x}^{(T)};\Theta) \\
& \vdots \\
& = f(f(\cdots f(\mathbf{h}^{(1)}, \mathbf{x}^{(2)};\Theta), \cdots \mathbf{x}^{(T-1)};\Theta), \mathbf{x}^{(T)};\Theta) \\
& = f(f(\cdots f(f(\mathbf{h}^{(0)}, \mathbf{x}^{(1)};\Theta),\mathbf{x}^{(2)};\Theta), \cdots \mathbf{x}^{(T-1)};\Theta), \mathbf{x}^{(T)};\Theta)
\end{align}
$$  


The elements of $\mathbf{x}$, namely $\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \cdots, \mathbf{x}^{(T)}$, are fed sequentially into the input layer of the RNN, and the outputs are $\mathbf{y}'^{(1)}, \mathbf{y}'^{(2)}, \cdots, \mathbf{y}'^{(T)}$. Thus the input layer has as many nodes as the dimension of $\mathbf{x}^{(i)}$, and the output layer has as many nodes as the dimension of $\mathbf{y}'^{(i)}$.  
The RNN has several types of weights: $\mathbf{U}$, the weights of the edges from the input layer to the hidden layer; $\mathbf{W}$, the weights of the recurrent edges between hidden layers; and $\mathbf{V}$, the weights of the edges from the hidden layer to the output layer. Additionally, there are the weights $\mathbf{b}$ and $\mathbf{c}$, which are connected to a bias node whose value is always 1. The parameter set of the RNN is therefore $\Theta = \{\mathbf{U}, \mathbf{W}, \mathbf{V}, \mathbf{b}, \mathbf{c}\}$. The weights are shared in the RNN: at every moment a new vector is received and processed with the same weights $\mathbf{U}, \mathbf{W}, \mathbf{V}, \mathbf{b}, \mathbf{c}$.  
Weight sharing has several advantages.  
- It keeps the optimization problem at a reasonable size by dramatically reducing the number of parameters estimated during training.  
- The number of parameters is constant regardless of the sample length $T$.  
- It can produce the same or a similar output even if the moment at which a feature appears changes.  
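
To make the second point concrete, the sketch below counts the parameters in $\Theta$ for a hypothetical RNN with $d$ input nodes, $p$ hidden nodes, and $q$ output nodes. The function name and the sizes are assumptions chosen for illustration; note that $T$ never enters the count.

```python
def rnn_param_count(d, p, q):
    """Number of weights in Theta = {U, W, V, b, c}."""
    u = p * d   # input -> hidden
    w = p * p   # hidden -> hidden (recurrent edges)
    v = q * p   # hidden -> output
    b = p       # hidden bias
    c = q       # output bias
    return u + w + v + b + c

# The same count applies to a sample of length 5 or of length 5000.
n_params = rnn_param_count(d=100, p=64, q=10)
```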


The number of inputs and the number of outputs are both $T$ because the RNN produces an output at every moment. In many applications, however, they can differ.  
<img src="./img/5_RNN_applications.png" width="80%" height="80%">  


The input at moment $t$, $\mathbf{x}^{(t)}$, affects the hidden state $\mathbf{h}^{(t)}$, and $\mathbf{h}^{(t)}$ in turn affects the later states $\mathbf{h}^{(t+1)}, \mathbf{h}^{(t+2)}, \cdots$ through the recurrent edges. In other words, the hidden layer remembers $\mathbf{x}^{(t)}$. However, the number of hidden nodes is limited, and the hidden layer must also remember the inputs that arrive after $t$. The RNN cannot remember selectively, so the memory of $\mathbf{x}^{(t)}$ fades out.  



$u_{ji}$ is the weight of the edge that connects the $i^{th}$ input node and the $j^{th}$ hidden node.  
$$
\mathbf{U} = \begin{pmatrix}
u_{11} & u_{12} & \cdots & u_{1d} \\
u_{21} & u_{22} & \cdots & u_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
u_{p1} & u_{p2} & \cdots & u_{pd}
\end{pmatrix},
\mathbf{V} = \begin{pmatrix}
v_{11} & v_{12} & \cdots & v_{1p} \\
v_{21} & v_{22} & \cdots & v_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
v_{q1} & v_{q2} & \cdots & v_{qp}
\end{pmatrix},
\mathbf{W} = \begin{pmatrix}
w_{11} & w_{12} & \cdots & w_{1p} \\
w_{21} & w_{22} & \cdots & w_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
w_{p1} & w_{p2} & \cdots & w_{pp}
\end{pmatrix}
$$  


The $d$ weights from the input layer to $h_j$ are $\mathbf{u}_j = (u_{j1}, u_{j2}, \cdots, u_{jd})$. Similarly, the weights of the recurrent edges into $h_j$ are $\mathbf{w}_j = (w_{j1}, w_{j2}, \cdots, w_{jp})$.  
The RNN has a structure similar to an MLP, and it behaves similarly as well.  


$$
\begin{array}{lcl}
\mathbf{h}^{(t)} = \tau(\mathbf{a}^{(t)}) \\
\mathbf{a}^{(t)} = \mathbf{W}\mathbf{h}^{(t-1)} + \mathbf{U}\mathbf{x}^{(t)} + \mathbf{b}
\end{array}
$$  

After the hidden layer computation, the output layer is computed.  
$$
\begin{array}{lcl}
\mathbf{o}^{(t)} = \mathbf{V}\mathbf{h}^{(t)} + \mathbf{c} \\
\mathbf{y}'^{(t)} = \text{softmax}(\mathbf{o}^{(t)})
\end{array}
$$  
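
The hidden layer and output layer equations above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: it assumes $\tau$ is tanh, takes $\mathbf{h}^{(0)}$ to be the zero vector, and uses made-up sizes and random weights. The names `rnn_forward` and `softmax` are hypothetical helpers.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))      # subtract the max for numerical stability
    return e / e.sum()

def rnn_forward(xs, U, W, V, b, c):
    """Run the RNN over a sequence xs of shape (T, d)."""
    h = np.zeros(W.shape[0])       # h^(0), assumed to be the zero vector
    hs, ys = [], []
    for x in xs:                   # the same weights are reused at every step
        a = W @ h + U @ x + b      # a^(t) = W h^(t-1) + U x^(t) + b
        h = np.tanh(a)             # h^(t) = tau(a^(t)), tau assumed to be tanh
        o = V @ h + c              # o^(t) = V h^(t) + c
        hs.append(h)
        ys.append(softmax(o))      # y'^(t) = softmax(o^(t))
    return np.array(hs), np.array(ys)

rng = np.random.default_rng(0)
d, p, q, T = 3, 4, 2, 5            # hypothetical sizes
U = rng.normal(scale=0.1, size=(p, d))
W = rng.normal(scale=0.1, size=(p, p))
V = rng.normal(scale=0.1, size=(q, p))
b, c = np.zeros(p), np.zeros(q)
hs, ys = rnn_forward(rng.normal(size=(T, d)), U, W, V, b, c)
```

Because the loop simply runs once per element, the same function handles any sample length $T$ without changing the weights.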

What are the differences between an RNN and a DMLP?  
1. An RNN has an input and an output at every moment, whereas a DMLP has only one input layer and one output layer.  
2. An RNN shares weights, whereas a DMLP has different weights in each layer.  


### BPTT learning
An RNN is trained by SGD (stochastic gradient descent) with the training set $\mathbb{X} = \{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n\}, \mathbb{Y} = \{\mathbf{y}_1, \mathbf{y}_2, \cdots, \mathbf{y}_n\}$. Because an RNN carries temporal information, the gradient must be computed with BPTT (back-propagation through time).  


<img src="./img/5_expanded_RNN.png" width="50%" height="50%">  


The RNN outputs $\mathbf{y}' = (\mathbf{y}'^{(1)}, \mathbf{y}'^{(2)}, \cdots, \mathbf{y}'^{(T)})^T$ as time passes through $1, 2, \cdots, t, t+1, \cdots, T$, and each output should match its target. Therefore the objective function $J(\mathbf{\Theta})$ is  


$$
J(\mathbf{\Theta}) = \sum_{t=1}^T J^{(t)}(\mathbf{\Theta})
$$  


$J^{(t)}(\mathbf{\Theta})$ is a function that measures the error at moment $t$. You can choose one of MSE, cross-entropy, or log likelihood.  
- MSE : $J^{(t)}(\mathbf{\Theta}) = \sum_{j=1}^q (y_j^{(t)} - y_j'^{(t)})^2$  
- cross-entropy : $J^{(t)}(\mathbf{\Theta}) = -\mathbf{y}^{(t)}\log \mathbf{y}'^{(t)} = - \sum_{j=1}^q y_j^{(t)}\log y_j'^{(t)}$  
- log likelihood : $J^{(t)}(\mathbf{\Theta}) = - \log y_j'^{(t)}$, where $j$ is the index of the element of $\mathbf{y}^{(t)}$ that is 1  
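
The three choices can be written out for a single moment $t$ as in the sketch below. The target and output values are made up, and the function names are hypothetical.

```python
import math

def mse(y, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y, y_pred))

def cross_entropy(y, y_pred):
    return -sum(a * math.log(b) for a, b in zip(y, y_pred))

def neg_log_likelihood(y, y_pred):
    j = y.index(1)            # position of the 1 in a one-hot target
    return -math.log(y_pred[j])

y = [0, 1, 0]                 # one-hot target y^(t) (hypothetical)
y_pred = [0.2, 0.7, 0.1]      # softmax output y'^(t) (hypothetical)

losses = (mse(y, y_pred), cross_entropy(y, y_pred), neg_log_likelihood(y, y_pred))
```

For a one-hot target, every term of the cross-entropy sum except the one at the true class is zero, so cross-entropy and log likelihood give the same value.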

We should find the best parameter to minimize the objective function. Thus,  
$$
\hat{\mathbf{\Theta}} = \underset{\mathbf{\Theta}}{\text{argmin}} J({\mathbf{\Theta}}) = \underset{\mathbf{\Theta}}{\text{argmin}} \sum_{t=1}^T J^{(t)}(\mathbf{\Theta})
$$  


We need $\frac {\partial J}{\partial \mathbf{\Theta}}$, so just compute $\frac {\partial J}{\partial \mathbf{U}}, \frac {\partial J}{\partial \mathbf{W}}, \frac {\partial J}{\partial \mathbf{V}}, \frac {\partial J}{\partial \mathbf{b}}, \frac {\partial J}{\partial \mathbf{c}}$ because $\mathbf{\Theta} = \{\mathbf{U}, \mathbf{W}, \mathbf{V}, \mathbf{b}, \mathbf{c}\}$.  
Applying the chain rule to the element $\frac {\partial J}{\partial v_{ji}}$ in the $i^{th}$ column of the $j^{th}$ row of $\frac {\partial J}{\partial \mathbf{V}}$ gives:  


$$
\frac {\partial J^{(t)}}{\partial v_{ji}} = \frac {\partial J^{(t)}}{\partial y'^{(t)}}\frac {\partial y'^{(t)}}{\partial o_j^{(t)}}\frac {\partial o_j^{(t)}}{\partial v_{ji}}
$$
$\frac {\partial o_j^{(t)}}{\partial v_{ji}} = h_i^{(t)}$ because $o_j^{(t)} = \mathbf{v}_j\mathbf{h}^{(t)}$.  


Usually the log likelihood is used as the objective function. So we should consider two cases:  
if the target value is $\mathbf{y}^{(t)} = (1,0)^T$, then $J^{(t)} = -\log y_j'^{(t)}$, and if $\mathbf{y}^{(t)} = (0,1)^T$, then $J^{(t)} = -\log y_i'^{(t)}$.  
Then,  


$$
\begin{array}{l}
\frac {\partial J^{(t)}}{\partial v_{ji}} = (y_j'^{(t)} - 1)h_i^{(t)},\quad \text{if the }j^{th}\text{ element of }\mathbf{y}^{(t)}\text{ is 1} \\
\frac {\partial J^{(t)}}{\partial v_{ji}} = y_j'^{(t)}h_i^{(t)},\quad \text{if the }j^{th}\text{ element of }\mathbf{y}^{(t)}\text{ is 0}
\end{array}
$$  


Modern machine learning prefers matrix notation, not only for its simplicity but also because it parallelizes well.  
In vector form, the result is $\frac {\partial J^{(t)}}{\partial \mathbf{o}^{(t)}} = \mathbf{y}'^{(t)} - \mathbf{y}^{(t)}$  
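
The vector form can be checked numerically. The sketch below compares $\mathbf{y}'^{(t)} - \mathbf{y}^{(t)}$ with a central finite-difference estimate of the gradient of the log likelihood loss with respect to $\mathbf{o}^{(t)}$. The values of `o` and `y` are made up, and the step size `eps` is an arbitrary choice.

```python
import math

def softmax(o):
    m = max(o)
    e = [math.exp(v - m) for v in o]
    s = sum(e)
    return [v / s for v in e]

def loss(o, y):
    """Negative log likelihood of a one-hot target y under softmax(o)."""
    return -math.log(softmax(o)[y.index(1)])

o = [0.5, -1.0, 2.0]     # hypothetical pre-softmax values o^(t)
y = [0, 0, 1]            # one-hot target y^(t)

analytic = [p - t for p, t in zip(softmax(o), y)]   # y' - y

eps = 1e-6
numeric = []
for j in range(len(o)):
    plus = o[:j] + [o[j] + eps] + o[j + 1:]
    minus = o[:j] + [o[j] - eps] + o[j + 1:]
    numeric.append((loss(plus, y) - loss(minus, y)) / (2 * eps))
```

The two lists agree up to the finite-difference error, which covers both cases of the element-wise formula: the entry at the true class is $y_j' - 1$ and the other entries are $y_j'$.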


Differentiation at the hidden layer is complicated because the hidden layer value at moment $t$ affects both the hidden layers after $t$ and the output layer. Thus it must be expressed recursively.  
First, at the last moment $T$,  
$$
\frac {\partial J^{(T)}}{\partial \mathbf{h}^{(T)}} = \frac {\partial J^{(T)}}{\partial \mathbf{o}^{(T)}} \frac {\partial \mathbf{o}^{(T)}}{\partial \mathbf{h}^{(T)}} = \mathbf{V}^T\frac {\partial J^{(T)}}{\partial \mathbf{o}^{(T)}}
$$  

Just before that, the hidden layer at $T-1$ affects both the output layer at $T-1$ and the hidden layer at $T$, so the gradient at $T-1$ has two terms, as follows. Here $\mathbf{D}(1 - (\mathbf{h}^{(T)})^2)$ is a diagonal matrix whose $i^{th}$ diagonal entry is $1 - (h_i^{(T)})^2$, which is the derivative of the activation when $\tau$ is tanh.  


$$
\begin{alignat}{2}
\frac {\partial (J^{(T-1)} + J^{(T)})}{\partial \mathbf{h}^{(T-1)}}
& = \frac {\partial J^{(T-1)}}{\partial \mathbf{o}^{(T-1)}} \frac {\partial \mathbf{o}^{(T-1)}}{\partial \mathbf{h}^{(T-1)}} + \frac {\partial \mathbf{h}^{(T)}}{\partial \mathbf{h}^{(T-1)}} \frac {\partial J^{(T)}}{\partial \mathbf{h}^{(T)}} \\
& = \mathbf{V}^T \frac {\partial J^{(T-1)}}{\partial \mathbf{o}^{(T-1)}} + \mathbf{W}^T \mathbf{D} \left(1 - (\mathbf{h}^{(T)})^2 \right) \frac {\partial J^{(T)}}{\partial \mathbf{h}^{(T)}}
\end{alignat}
$$  


If we generalize the above expression to the instant $t$, we get  


$$
\frac {\partial J^{(\tilde{t})}}{\partial \mathbf{h}^{(t)}} = \mathbf{V}^T \frac {\partial J^{(t)}}{\partial \mathbf{o}^{(t)}} + \mathbf{W}^T \mathbf{D} \left(1 - (\mathbf{h}^{(t+1)})^2 \right) \frac {\partial J^{(\tilde{t+1})}}{\partial \mathbf{h}^{(t+1)}}
$$  
<img src="./img/5_BPTT.png" width="70%" height="70%">  


$J^{(\tilde{t})}$ is the sum of the objective functions from moment $t$ onward, including $t$, and $\frac {\partial J^{(\tilde{t+1})}}{\partial \mathbf{h}^{(t+1)}}$ is a gradient that reflects all of the gradients of the variables in the gray area. The gradient is multiplied by $\mathbf{W}^T$ and $\mathbf{D} (1 - (\mathbf{h}^{(t+1)})^2)$ as it backpropagates from $\mathbf{h}^{(t+1)}$ to $\mathbf{h}^{(t)}$. $\mathbf{h}^{(t)}$ adds the gradient coming from $\mathbf{h}^{(t+1)}$ and the gradient $\mathbf{V}^T \frac {\partial J^{(t)}}{\partial \mathbf{o}^{(t)}}$ coming from $\mathbf{y}'^{(t)}$ to obtain its own gradient. Then $\mathbf{h}^{(t)}$ backpropagates to $\mathbf{h}^{(t-1)}$ by the same process.  


* BPTT Algorithms  
$$
\begin{cases}
\frac {\partial J}{\partial \mathbf{V}} = \sum_{t=1}^T \frac {\partial J^{(t)}}{\partial \mathbf{o}^{(t)}} (\mathbf{h}^{(t)})^T \\
\frac {\partial J}{\partial \mathbf{W}} = \sum_{t=1}^T \mathbf{D} \left(1 - (\mathbf{h}^{(t)})^2 \right) \frac {\partial J^{(\tilde{t})}}{\partial \mathbf{h}^{(t)}} (\mathbf{h}^{(t-1)})^T \\
\frac {\partial J}{\partial \mathbf{U}} = \sum_{t=1}^T \mathbf{D} \left(1 - (\mathbf{h}^{(t)})^2 \right) \frac {\partial J^{(\tilde{t})}}{\partial \mathbf{h}^{(t)}} (\mathbf{x}^{(t)})^T \\
\frac {\partial J}{\partial \mathbf{c}} = \sum_{t=1}^T \frac {\partial J^{(t)}}{\partial \mathbf{o}^{(t)}} \\
\frac {\partial J}{\partial \mathbf{b}} = \sum_{t=1}^T \mathbf{D} \left(1 - (\mathbf{h}^{(t)})^2 \right) \frac {\partial J^{(\tilde{t})}}{\partial \mathbf{h}^{(t)}}
\end{cases}
$$  
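
The BPTT sums can be sketched in NumPy as below. This is an illustrative implementation under stated assumptions: $\tau$ is tanh, the loss is the log likelihood with targets given as class indices, $\mathbf{h}^{(0)}$ is zero, and all sizes and weights are made up. The bias gradient `db` accumulates the same backpropagated vector that multiplies $\mathbf{h}^{(t-1)}$ and $\mathbf{x}^{(t)}$ in `dW` and `dU`. A finite-difference estimate on one entry of $\mathbf{W}$ serves as a sanity check.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def forward(xs, ys, U, W, V, b, c):
    """Return hidden states h^(0..T), outputs y'^(1..T) and the summed loss J."""
    hs, preds, J = [np.zeros(W.shape[0])], [], 0.0
    for x, y in zip(xs, ys):
        h = np.tanh(W @ hs[-1] + U @ x + b)
        pred = softmax(V @ h + c)
        J -= np.log(pred[y])           # log likelihood loss, y is a class index
        hs.append(h)
        preds.append(pred)
    return hs, preds, J

def bptt(xs, ys, U, W, V, b, c):
    hs, preds, _ = forward(xs, ys, U, W, V, b, c)
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros_like(b), np.zeros_like(c)
    da_next = np.zeros_like(b)         # gradient w.r.t. a^(t+1); zero beyond T
    for t in range(len(xs), 0, -1):    # t = T, ..., 1
        do = preds[t - 1].copy()
        do[ys[t - 1]] -= 1.0           # dJ^(t)/do^(t) = y' - y
        dh = V.T @ do + W.T @ da_next  # output path + recurrent path
        da = (1.0 - hs[t] ** 2) * dh   # D(1 - (h^(t))^2) applied to dh
        dV += np.outer(do, hs[t])
        dW += np.outer(da, hs[t - 1])
        dU += np.outer(da, xs[t - 1])
        dc += do
        db += da
        da_next = da
    return dU, dW, dV, db, dc

rng = np.random.default_rng(0)
d, p, q, T = 3, 4, 2, 6                # hypothetical sizes
U, W, V = (rng.normal(scale=0.5, size=s) for s in [(p, d), (p, p), (q, p)])
b, c = np.zeros(p), np.zeros(q)
xs, ys = rng.normal(size=(T, d)), rng.integers(0, q, size=T)

dU, dW, dV, db, dc = bptt(xs, ys, U, W, V, b, c)

# Finite-difference check on one entry of W.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric = (forward(xs, ys, U, W_plus, V, b, c)[2]
           - forward(xs, ys, U, W_minus, V, b, c)[2]) / (2 * eps)
```

Storing `da_next` instead of the full hidden-layer gradient keeps the recurrent term a plain matrix-vector product at each step.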


## LSTM
**Long-term dependency**  
When the elements involved are far apart, the dependency is called a long-term dependency. When the formula $\mathbf{a}^{(t)} = \mathbf{W}\mathbf{h}^{(t-1)} + \mathbf{U}\mathbf{x}^{(t)} + \mathbf{b}$ is applied repeatedly, the influence of $\mathbf{x}^{(1)}$ decreases over time and eventually vanishes if the elements of $\mathbf{W}$ are less than 1. Conversely, if the elements of $\mathbf{W}$ are greater than 1, the values grow steadily, which makes the RNN very unstable. The same problems, vanishing gradients and exploding gradients, also occur during training. They are not unique to RNNs, but they are especially critical for RNNs for two reasons.  
1. In an RNN, the hidden layer computation is repeated as many times as the length of the input sample, and very long samples, as in speech recognition, occur frequently.  
2. Since an RNN shares weights, the same $\mathbf{W}$ is applied at every step, so the values keep shrinking or keep growing, which makes the problem worse.  
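
A scalar caricature shows the effect. The sketch below ignores the activation function and the new inputs and simply multiplies by a single shared weight `w` at every step; the weight values and the step count are arbitrary assumptions.

```python
def repeated_scale(w, steps):
    """Influence of x^(1) after `steps` multiplications by a scalar weight w."""
    influence = 1.0
    for _ in range(steps):
        influence *= w
    return influence

shrinking = repeated_scale(0.9, 100)   # about 2.7e-5: the influence vanishes
growing = repeated_scale(1.1, 100)     # about 1.4e4: the influence explodes
```

Because the weight is shared, every step pushes in the same direction, so even a weight close to 1 leads to a very small or very large value after enough steps.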

To address these issues, Hochreiter proposed the LSTM model.  
The core idea of LSTM is the gate. At each moment there is an input gate and an output gate, which can be opened or closed.

<img src="./img/5_LSTM.png" width="35%" height="35%">  
In LSTM, the hidden layers are connected with each other as in the RNN (thin black lines). The thin blue lines are the connections from the input layer to the hidden layer, and the thin red ones are the connections from the hidden layer to the output layer. The difference between LSTM and RNN is the connections to the gates. The output of the memory block is fed to the input gate and the output gate (thick black lines). The input vector $\mathbf{x}$ is also fed to the input gate and the output gate (thick blue lines).  


<img src="./img/5_LSTM_memory_block.png" width="30%" height="30%">  

A thick connecting line corresponds to a weight vector, and the number above the line is the dimension of that vector. A thin line with a number means that the same value is connected to multiple nodes.  
In LSTM, the hidden layer state $\mathbf{h}^{(t-1)}$ from the previous moment $t-1$ is fed to the input, the input gate, and the output gate. So there are three types of recurrent edges: $\mathbf{w}^g$ connected to the input, $\mathbf{w}^i$ to the input gate, and $\mathbf{w}^o$ to the output gate. The same is true for the input vector $\mathbf{x}^{(t)}$: $\mathbf{u}^g$ is connected to the input, $\mathbf{u}^i$ to the input gate, and $\mathbf{u}^o$ to the output gate.  
- input : $g = \tau_g(\mathbf{u}_j^g\mathbf{x}^{(t)} + \mathbf{w}_j^g\mathbf{h}^{(t-1)} + b_j^g)$  
- input gate : $i = \tau_f(\mathbf{u}_j^i\mathbf{x}^{(t)} + \mathbf{w}_j^i\mathbf{h}^{(t-1)} + b_j^i)$
- output gate : $o = \tau_f(\mathbf{u}_j^o\mathbf{x}^{(t)} + \mathbf{w}_j^o\mathbf{h}^{(t-1)} + b_j^o)$  


$\tau_g$ is usually the tanh function (range \[-1.0, 1.0\]) and $\tau_f$ is the logistic sigmoid function (range \[0.0, 1.0\]). If the value of $\tau_f$ is 0.0, the gate is completely closed, and if it is 1.0, the gate is completely open. The symbol * means multiplication. The product at the bottom multiplies the output $g$ of the input by the output $i$ of the input gate and outputs $g * i$. If the gate value is close to 0.0, it blocks the input; if it is close to 1.0, it passes $g$ upwards as is.  
The circle is the state of the memory block, and the line drawn inside it denotes a linear activation function. It has a self edge with weight 1.0, so $s^{(t)} = s^{(t-1)} + g * i$. If the input gate is 0.0, the memory content stays the same as before. Thus the block remembers the previous content and extends the range of influence of an earlier input.  
$h_j^{(t)} = \tau_h(s^{(t)}) * o$ is passed to the $q$ output nodes to compute the output, and it is also used at the next moment $t+1$.  


The formula for the LSTM memory block is expressed as a matrix as follows:  


$$
\begin{cases}
\text{input : }\mathbf{g} = \tau_g(\mathbf{U}^g\mathbf{x}^{(t)} + \mathbf{W}^g\mathbf{h}^{(t-1)} + \mathbf{b}^g) \\
\text{input gate : }\mathbf{i} = \tau_f(\mathbf{U}^i\mathbf{x}^{(t)} + \mathbf{W}^i\mathbf{h}^{(t-1)} + \mathbf{b}^i)\\
\text{output gate : }\mathbf{o} = \tau_f(\mathbf{U}^o\mathbf{x}^{(t)} + \mathbf{W}^o\mathbf{h}^{(t-1)} + \mathbf{b}^o) \\
\mathbf{s}^{(t)} = \mathbf{s}^{(t-1)} + \mathbf{g} \odot \mathbf{i} \\
\mathbf{h}^{(t)} = \tau_h(\mathbf{s}^{(t)}) \odot \mathbf{o} \\
\mathbf{y}'^{(t)} = \text{softmax}(\mathbf{V}\mathbf{h}^{(t)} + \mathbf{c}) \\
\end{cases}
$$  
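
One time step of the memory block equations above can be sketched in NumPy as follows. The sketch assumes $\tau_g$ and $\tau_h$ are tanh and $\tau_f$ is the logistic sigmoid. The function name `lstm_step`, the parameter ordering, the sizes, and the random weights are all hypothetical. The output layer is left out to keep the example short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, params):
    """One time step of the memory block with an input gate and an output gate."""
    Ug, Wg, bg, Ui, Wi, bi, Uo, Wo, bo = params
    g = np.tanh(Ug @ x + Wg @ h_prev + bg)    # input, tau_g = tanh
    i = sigmoid(Ui @ x + Wi @ h_prev + bi)    # input gate, tau_f = sigmoid
    o = sigmoid(Uo @ x + Wo @ h_prev + bo)    # output gate
    s = s_prev + g * i                        # state with self edge 1.0
    h = np.tanh(s) * o                        # tau_h assumed to be tanh
    return h, s

rng = np.random.default_rng(0)
d, p = 3, 4                                   # hypothetical sizes
params = []
for _ in range(3):                            # input, input gate, output gate
    params += [rng.normal(size=(p, d)), rng.normal(size=(p, p)), np.zeros(p)]

# Force the input gate shut with a large negative bias: the state barely changes.
params[5] = np.full(p, -50.0)
s_prev = rng.normal(size=p)
h, s = lstm_step(rng.normal(size=d), np.zeros(p), s_prev, params)
```

With the input gate closed, `s` stays essentially equal to `s_prev`, which is the mechanism that lets the block keep earlier content over many steps.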