In [1]:
%run Latex_macros.ipynb

# Inside the RNN: update equations

An RNN layer, at time step $\tt$
- Takes input element $\x_\tp$
- Updates latent state $\h_\tp$
- Optionally outputs $\y_\tp$

according to the equations

$$
\begin{array}{lll}
\h_\tp & = & \phi(\W_{xh}\x_\tp  + \W_{hh}\h_{(\tt-1)}  + \b_h) \\
\y_\tp & = &  \W_{hy} \h_\tp  + \b_y \\
\end{array}
$$

where 
- $\phi$ is an activation function (usually $\tanh$)
- $\W$ are the weights of the RNN layer
    - partitioned into $\W_{xh}, \W_{hh}, \W_{hy}$
    - $\W_{xh}$: weights that update $\h_\tp$ based on $\x_\tp$
    - $\W_{hh}$: weights that update $\h_\tp$ based on $\h_{(\tt-1)}$
    - $\W_{hy}$: weights that update $\y_\tp$ based on $\h_\tp$ 
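
The two update equations can be sketched in NumPy as follows. This is a minimal illustration, not a library implementation: the function name `rnn_step`, the sizes `n_x`, `n_h`, `n_y`, `T`, the random weights, and the zero initial state are all assumptions made for the example.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y, phi=np.tanh):
    """One time step of a simple RNN layer for a single example."""
    # New latent state: combines the current input and the prior state
    h_t = phi(W_xh @ x_t + W_hh @ h_prev + b_h)
    # Output: an affine "translation" of the new latent state
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

# Hypothetical sizes, chosen only for illustration
n_x, n_h, n_y, T = 3, 4, 2, 5
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(n_h, n_x))
W_hh = rng.normal(size=(n_h, n_h))
W_hy = rng.normal(size=(n_y, n_h))
b_h = np.zeros(n_h)
b_y = np.zeros(n_y)

xs = rng.normal(size=(T, n_x))   # input sequence x_(1), ..., x_(T)
h = np.zeros(n_h)                # initial latent state, assumed to be zero
ys = []
for x_t in xs:
    # The same weights are reused at every time step
    h, y_t = rnn_step(x_t, h, W_xh, W_hh, W_hy, b_h, b_y)
    ys.append(y_t)
ys = np.stack(ys)                # one output per time step, shape (T, n_y)
```

The loop makes the recurrence explicit: the state returned by one step is the `h_prev` of the next.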

<table>
    <tr>
        <th><center>RNN</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer.png"></td>
    </tr>
</table>

**Notes**
- The RNN literature uses $\phi$ rather than $a_\llp$ to denote an activation function
- This is the update equation for a single example $\x^\ip$
- In practice, we can simultaneously update for *multiple examples*
    - For instance, the $m' \le m$ examples in a minibatch, since examples are independent of one another
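
A possible sketch of the minibatch update, assuming the examples are stored one per row (a layout chosen here for illustration, as are the sizes and random values). It checks that the batched computation agrees with updating each example separately.

```python
import numpy as np

n_x, n_h, m_batch = 3, 4, 8   # hypothetical sizes
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(n_h, n_x))
W_hh = rng.normal(size=(n_h, n_h))
b_h = rng.normal(size=n_h)

X_t = rng.normal(size=(m_batch, n_x))      # row i: input of example i at step t
H_prev = rng.normal(size=(m_batch, n_h))   # row i: prior state of example i

# Batched update: row i of H_t is the new state of example i.
# The transposes appear because examples are stored as rows.
H_t = np.tanh(X_t @ W_xh.T + H_prev @ W_hh.T + b_h)

# Same computation, one example at a time
H_loop = np.stack([np.tanh(W_xh @ x + W_hh @ h + b_h)
                   for x, h in zip(X_t, H_prev)])
same = bool(np.allclose(H_t, H_loop))
```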

Let's try to understand these equations, starting with the state update
$$
\h_\tp  =  \phi(\W_{xh}\x_\tp  + \W_{hh}\h_{(t-1)}  + \b_h) 
$$

$\h_\tp$ is the latent state after time step $\tt$
- It is a *vector* of length $|| \h ||$
- We drop the time subscript in $|| \h ||$ because the dimension is the same at every step

$\W_{xh}\x_\tp$ must therefore *also be a vector* of length $|| \h ||$
- $\W_{xh}$ is a matrix of shape $( || \h || \times || \x ||)$
- Element $j$ of $\W_{xh}\x_\tp$ is the dot product of row $j$ of $\W_{xh}$ and $\x_\tp$
- So row $\W_{xh}^{(j)}$ describes how input $\x_\tp$ influences new state element $\h_{\tp,j}$

That is: for each element $j$ of $\h$, there is a separate set of weights describing its interaction with $\x$
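
A small check of the row-wise reading, using arbitrary illustrative sizes and random values: element $j$ of the product equals the dot product of row $j$ of the weight matrix with the input.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h = 3, 4                      # hypothetical sizes
W_xh = rng.normal(size=(n_h, n_x))   # shape (||h||, ||x||)
x_t = rng.normal(size=n_x)

contribution = W_xh @ x_t            # vector of length ||h||
j = 2
row_j_dot = W_xh[j] @ x_t            # row j of W_xh dotted with the input
matches = bool(np.isclose(contribution[j], row_j_dot))
```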

Similarly, $\W_{hh}\h_{(\tt-1)}$ must be a *vector* of length $|| \h ||$
- $\W_{hh}$ is a matrix of shape $( || \h || \times || \h || )$
- So row $\W_{hh}^{(j)}$ describes how prior state $\h_{(\tt-1)}$ influences new state element $\h_{\tp,j}$

$\b_h$, the bias/threshold must also be a vector of length $|| \h ||$
- It adjusts the threshold of activation function $\phi$
- As per our practice: we will usually fold $\b$ into the weight matrices $\W_{xh}, \W_{hh}$
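
One common way to fold the bias into a weight matrix is sketched below: append $\b_h$ as an extra column of $\W_{xh}$ and append a constant $1$ to the input. Folding into $\W_{xh}$ rather than $\W_{hh}$ is a choice made for this illustration, and the names `W_xh_aug` and `x_aug` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h = 3, 4                      # hypothetical sizes
W_xh = rng.normal(size=(n_h, n_x))
W_hh = rng.normal(size=(n_h, n_h))
b_h = rng.normal(size=n_h)
x_t = rng.normal(size=n_x)
h_prev = rng.normal(size=n_h)

# Explicit bias term
explicit = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Bias folded in: extra weight column times a constant-1 input feature
W_xh_aug = np.hstack([W_xh, b_h[:, None]])   # shape (n_h, n_x + 1)
x_aug = np.append(x_t, 1.0)                  # length n_x + 1
folded = np.tanh(W_xh_aug @ x_aug + W_hh @ h_prev)

same = bool(np.allclose(explicit, folded))
```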

Finally, activation $\phi$ is applied elementwise, mapping a vector of length $|| \h ||$ to another vector of length $|| \h ||$ 
- The updated state

So the updated latent state $\h_\tp$ is influenced by
- The input $\x_\tp$
- The prior latent state $\h_{(\tt-1)}$

The second equation
$$\y_\tp  =   \W_{hy} \h_\tp  + \b_y$$

is just a "translation" of the latent state $\h_\tp$ 
- To $\y_\tp$, the $\tt^{th}$ element of the output sequence
- $\W_{hy}$ is a matrix of shape $( || \y || \times || \h || )$
    - $|| \y ||$ is the length of each output element and is problem dependent
    - For example: the length of a One Hot Encoding (OHE) when the output is categorical


It is common to equate $\y_\tp = \h_\tp$
- No separate "output"
- Just the latent state
- Particularly when using stacked RNN layers
    - $\y_\tp$ becomes the input to the next layer

## Equation in pseudo-matrix form

You will often see a short-hand form of the equation.

Look at $\h_\tp$ as a function of two inputs $\x_\tp, \h_{(\tt-1)}$.

We can stack the two inputs into a single vector $\mathbf{I}$.

Similarly, we can place the two matrices $\W_{xh}, \W_{hh}$ side by side in a single weight matrix $\W$

$$
\begin{array}{l}
\h_\tp  = \phi( \W \mathbf{I} + \b_h ) \\
\text{ with } \\
\W = \left[
 \begin{matrix}
    \W_{xh} & \W_{hh}
 \end{matrix} 
 \right] \\
\mathbf{I} = \left[
 \begin{matrix}
    \x_\tp  \\
    \h_{(\tt-1)}
 \end{matrix} 
 \right] \\
\end{array}
$$
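
The equivalence between the stacked form and the original two-matrix update can be checked numerically. The sketch below assumes $\phi = \tanh$ and uses arbitrary sizes and random values; the variable names `W` and `I` mirror the symbols above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h = 3, 4                      # hypothetical sizes
W_xh = rng.normal(size=(n_h, n_x))
W_hh = rng.normal(size=(n_h, n_h))
b_h = rng.normal(size=n_h)
x_t = rng.normal(size=n_x)
h_prev = rng.normal(size=n_h)

# Two-matrix form of the state update
h_two = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Stacked form: weight matrices side by side, inputs on top of each other
W = np.hstack([W_xh, W_hh])          # shape (n_h, n_x + n_h)
I = np.concatenate([x_t, h_prev])    # length n_x + n_h
h_stacked = np.tanh(W @ I + b_h)

same = bool(np.allclose(h_two, h_stacked))
```

The block structure guarantees the match: $\W \mathbf{I} = \W_{xh}\x_\tp + \W_{hh}\h_{(t-1)}$.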

## Stacked RNN layers revisited

With the benefit of the RNN update equations, we can clarify how stacked RNN layers work.

Let superscript $[\ll]$ denote layer $\ll$ in a stack of RNN layers.

So the RNN update equation for the bottom layer $1$ becomes
$$
\begin{array}{lll}
\h^{[1]}_\tp & = & \phi(\W^{[1]}_{xh}\x_\tp  + \W^{[1]}_{hh}\h^{[1]}_{(\tt-1)}  + \b^{[1]}_h) \\
\end{array}
$$

The RNN update equation for layer $[\ll]$, with $\ll > 1$, becomes

$$
\begin{array}{lll}
\h^{[\ll]}_\tp & = & \phi(\W^{[\ll]}_{xh}\h^{[\ll-1]}_\tp  + \W^{[\ll]}_{hh}\h^{[\ll]}_{(\tt-1)}  + \b^{[\ll]}_h) \\
\end{array}
$$

That is: the input to layer $[\ll]$ is $\h^{[\ll-1]}_\tp$ rather than $\x_\tp$, and each layer has its own weights and latent state
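
A two-layer stack might be sketched as below. The helper `rnn_layer`, the layer sizes, the random weights, and the zero initial states are assumptions for illustration. Each layer is given its own weights, and the sequence of states from layer 1 is fed as the input sequence to layer 2.

```python
import numpy as np

def rnn_layer(inputs, W_xh, W_hh, b_h):
    """Run one RNN layer over a sequence; return the state at every step."""
    h = np.zeros(W_hh.shape[0])      # initial state, assumed to be zero
    states = []
    for x_t in inputs:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)          # shape (T, number of units)

rng = np.random.default_rng(0)
T, n_x, n_h1, n_h2 = 5, 3, 4, 6      # hypothetical sizes
xs = rng.normal(size=(T, n_x))

# Layer 1 consumes x; its input weights have shape (n_h1, n_x)
layer1 = (rng.normal(size=(n_h1, n_x)), rng.normal(size=(n_h1, n_h1)), np.zeros(n_h1))
# Layer 2 consumes the states of layer 1; its input weights have shape (n_h2, n_h1)
layer2 = (rng.normal(size=(n_h2, n_h1)), rng.normal(size=(n_h2, n_h2)), np.zeros(n_h2))

h1 = rnn_layer(xs, *layer1)          # states of layer 1 at every step
h2 = rnn_layer(h1, *layer2)          # states of layer 2 at every step
```

Processing a whole layer before the next one gives the same states as advancing all layers together one time step at a time, because layer $[\ll]$ at step $t$ depends only on layer $[\ll-1]$ at step $t$ and on its own state at step $t-1$.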

# Loss function

As usual, the objective of training is to find the weights $\W$ that minimize a loss function 

$$\loss = L(\hat{\y},\y; \W)$$
which is the average of per example losses $\loss^\ip$
$$\loss = \frac{1}{m} \sum_{i=1}^m { \loss^\ip }$$

    

When the output is a sequence
- It's important to recognize that the *target* is a sequence too!
- So the per example loss has an added temporal dimension
- Loss per example *per time step*
- Comparing the *predicted* $\tt^{th}$ output $\hat{\y}^\ip_\tp$ to the $\tt^{th}$ target $\y^\ip_\tp$

In the case that the API outputs sequences
- $\loss^\ip = \sum_{\tt=1}^T \loss^\ip_\tp$

In the case that the API outputs a single value
- $\loss^\ip = \loss^\ip_{(T)}$, the loss at the final time step only
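
The two cases can be compared in a short sketch. The squared-error `step_loss` is an assumption made for illustration (the per-step loss is problem dependent), as are the toy predictions and targets.

```python
import numpy as np

def step_loss(y_hat_t, y_t):
    """Per-time-step loss; squared error is assumed here for illustration."""
    return float(np.sum((y_hat_t - y_t) ** 2))

m, T, n_y = 2, 3, 2                  # hypothetical sizes
y_hat = np.zeros((m, T, n_y))        # toy predictions, indexed [example, step, element]
y = np.ones((m, T, n_y))             # toy targets

# Sequence output: per-example loss sums the loss over all time steps
per_example_seq = [sum(step_loss(y_hat[i, t], y[i, t]) for t in range(T))
                   for i in range(m)]
loss_seq = sum(per_example_seq) / m  # average over the m examples

# Single-value output: per-example loss uses only the final time step
per_example_last = [step_loss(y_hat[i, T - 1], y[i, T - 1]) for i in range(m)]
loss_last = sum(per_example_last) / m
```

In both cases the overall loss is still the average of the per-example losses; only the definition of the per-example loss changes.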

<table>
    <tr>
        <th><center>RNN Loss: Forward pass</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_loss.png"></td>
    </tr>
</table>

