In [1]:
%run Latex_macros.ipynb

# Warning: Higher dimensions ahead !

A Fully Connected/Dense layer is insensitive to the order of features, as long as the weights are reordered in the same way.

This is just a property of the dot product
$$
\Theta^T \cdot \x =  \Theta[ \text{perm} ]^T \cdot \x[ \text{perm} ] 
$$

where $\Theta[ \text{perm} ]$ and $\x[ \text{perm} ]$ are obtained by applying the same permutation to the elements of $\Theta$ and $\x$.
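
A quick numerical check of this identity, as a sketch. The vectors `theta` and `x`, their length, and the permutation `perm` are random values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
theta = rng.normal(size=n)   # weights of a single Dense unit
x = rng.normal(size=n)       # one example with n features

perm = rng.permutation(n)    # an arbitrary reordering of the indices

original = theta @ x
permuted = theta[perm] @ x[perm]   # the same permutation applied to both

print(np.isclose(original, permuted))
```

The two dot products agree (up to floating-point rounding) because a sum does not depend on the order of its terms.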



But there are many problems in which order is important.

Consider the following examples


<table>
    <tr>
        <th><center>Same prices</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_sequence_1.jpg" width=800></td>
    </tr>
</table>

<center>Same words</center>
$$
\begin{matrix}
\text{Machine} & \text{Learning} & \text{is} & \text{easy} & \text{not} & \text{difficult} \\
\text{Machine} & \text{Learning} & \text{is} & \text{difficult} & \text{not} & \text{easy} \\
\end{matrix}
$$


<table>
    <tr>
        <th><center>Same pixels</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_sequence_2.jpg" width=800></td>
    </tr>
</table>
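
The "same words" example can be checked directly. This is a small sketch using only the two sentences above: an order-insensitive representation (word counts) cannot tell them apart, while the sequences themselves differ.

```python
from collections import Counter

s1 = "Machine Learning is easy not difficult".split()
s2 = "Machine Learning is difficult not easy".split()

# Order-insensitive view: how many times each word occurs
same_counts = Counter(s1) == Counter(s2)

# Order-sensitive view: the words in the order they appear
same_sequence = s1 == s2

print(same_counts, same_sequence)
```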

In this lecture, we will be dealing with examples that are *sequences*.

That is, we will add a new dimension to each example which we will call the *temporal dimension*.



To make this concrete, consider the difference between a snapshot and a movie
- A movie is a sequence of snapshots


We have already encountered (when introducing CNN's) data with a *spatial dimension*
- location of a feature within a 1D or 2D space.

The main difference between the spatial and temporal dimensions:
- We have some degree of freedom to alter the spatial dimension without affecting the problem
    - e.g., rotating an image
- There is *no* ability to rearrange data in the temporal dimension
    - Time flows forward and we can't peek ahead.


A single example $\x^\ip$ will now be written as
$$
[ \x^\ip_\tp \; | \; 1 \le \tt \le T]
$$

Using the movie analogy
- $\x^\ip$ is a movie: a sequence of frames
- $\x^\ip_\tp$ is the $t^{th}$ frame in the movie
- $\x^\ip_{\tp, j, j'}$ is a particular pixel within the frame $\x^\ip_\tp$
    - The temporal dimension is indexed by $\tp$ and the spatial dimensions by $j,j'$
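
The indexing above maps naturally onto an array whose first axis is time. A minimal sketch, assuming a grayscale movie with made-up sizes `T`, `height`, `width` (note that NumPy indices start at 0, while the text counts from 1):

```python
import numpy as np

T, height, width = 4, 3, 5   # hypothetical sizes, for illustration only

# One example x^(i): a sequence of T frames, each height x width
movie = np.arange(T * height * width).reshape(T, height, width)

t, j, j_prime = 2, 1, 3          # 0-based indices
frame = movie[t]                 # one frame: the temporal index alone
pixel = movie[t, j, j_prime]     # one pixel: temporal index, then spatial indices

print(movie.shape, frame.shape, pixel)
```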

# Functions on sequences

In the absence of a temporal dimension, our multi-layer networks
- Computed functions from vectors to vectors

With a temporal dimension, there are several variants of the function
- Many to one
    - Sequence as input, vector as output
    - Examples:
        - Predict next value in a time series (sequence of values)
        - Summarize the sentiment of a sentence (sequence of words)

- Many to many
    - Sequence as input, sequence of vectors as output
    - Examples
        - Translation of sentence in one language to sentence in second language
        - Caption a movie: sequence of frames to sequence of words

- One to many
    - Single input vector, sequence of vectors as output
    - Examples
        - Generating sentences from seed
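
The three variants differ only in the shapes of their inputs and outputs. The sketch below illustrates those shapes with a toy state update `step`, which is an assumption made for illustration and not a trained model. For simplicity, the many-to-many version emits one output per input, although tasks such as translation generally have outputs of a different length.

```python
import numpy as np

def step(x, h):
    """Toy state update, assumed here only to have something to iterate."""
    return np.tanh(x + h)

def many_to_one(xs):
    h = np.zeros_like(xs[0])
    for x in xs:
        h = step(x, h)
    return h                      # a single vector

def many_to_many(xs):
    h = np.zeros_like(xs[0])
    outputs = []
    for x in xs:
        h = step(x, h)
        outputs.append(h)         # one output per input element
    return np.stack(outputs)      # a sequence of vectors

def one_to_many(seed, T):
    h = np.zeros_like(seed)
    x, outputs = seed, []
    for _ in range(T):
        h = step(x, h)
        outputs.append(h)
        x = h                     # feed the output back in as the next input
    return np.stack(outputs)      # a sequence generated from a single vector

xs = np.random.default_rng(0).normal(size=(6, 3))   # 6 time steps, 3 features
print(many_to_one(xs).shape)           # (3,)
print(many_to_many(xs).shape)          # (6, 3)
print(one_to_many(xs[0], T=4).shape)   # (4, 3)
```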

# Recurrent Neural Network (RNN) layer

With a sequence $\x^\ip$ as input, and a sequence $\y$ as a potential output, the question arises:
- How does an RNN produce $\y_\tp$, the $t^{th}$ output ?

Some choices
- Predict $\y_\tp$ as a direct function of the prefix of $\x$ of length $\tt$: 
$$\pr{\y_\tp | \x_{(1)} \dots \x_\tp} $$

- Use a "latent state" that is updated with each element of the sequence, then predict the output from the latent state

$$
\begin{array}{ll}
\pr{\h_\tp | \x_\tp, \h_{(\tt-1)} } & \text{latent variable } \h_\tp \text{ encodes } [ \x_{(1)} \dots \x_\tp ]\\
\pr{\y_\tp | \h_\tp }              & \text{prediction contingent on latent variable} \\
\end{array}
$$



The Recurrent Neural Network (RNN) adopts the latter approach.

Here is some pseudo-code:

In [2]:
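import numpy as np

def RNN(input_sequence, state_size, f):
    # f(x, state) -> (out, state) is the per-step function, passed in by the caller
    # Initial latent state (random here, as in the original sketch)
    state = np.random.uniform(size=state_size)
    out = None  # stays None if input_sequence is empty

    for x in input_sequence:  # `x` rather than `input`, which shadows a builtin
        # Consume one input, update the state
        out, state = f(x, state)

    # Only the output of the final step is returned (many to one)
    return out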
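
The pseudo-code leaves the step function `f` unspecified. One common choice, shown here purely as an assumption (nothing above fixes the form of `f`), is a `tanh` of a linear combination of the input and the previous state, followed by a linear read-out. The weight names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, state_size, n_outputs = 3, 4, 2   # hypothetical sizes

# Weights are created once and reused at every time step
W_xh = rng.normal(scale=0.1, size=(state_size, n_features))
W_hh = rng.normal(scale=0.1, size=(state_size, state_size))
b_h = np.zeros(state_size)
W_hy = rng.normal(scale=0.1, size=(n_outputs, state_size))
b_y = np.zeros(n_outputs)

def f(x, state):
    """One RNN step: returns (output, new_state)."""
    new_state = np.tanh(W_xh @ x + W_hh @ state + b_h)   # update the latent state
    out = W_hy @ new_state + b_y                          # predict from the latent state
    return out, new_state

input_sequence = rng.normal(size=(5, n_features))   # 5 time steps
state = np.zeros(state_size)
for x in input_sequence:
    out, state = f(x, state)

print(out.shape, state.shape)
```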
        

and a picture/movie

<table>
    <tr>
        <th><center>RNN many to many API</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_API_many_to_many.jpg"></td>
    </tr>
</table>

At each time step $\tt$
- Input $\x_\tp$ is processed
- Causes latent state $\h$ to update from $\h_{(\tt-1)}$ to $\h_\tp$
    - We use the same sequence notation to record the sequence of latent states $[ \h_{(1)}, \ldots, \h_{(T)} ]$
- Optionally outputs $\y_\tp$ (for outputs that are of type sequence)

When processing $\x_\tp$
- The function computed takes $\h_{(t-1)}$ as input
- Latent state $\h_{(t-1)}$ has been derived by having processed $[\x_{(1)} \dots \x_{(\tt-1)}]$
- And is thus a *summary* of the prefix of the input encountered thus far
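
The claim that each latent state summarizes only the prefix seen so far can be checked numerically. In this sketch, the update rule and the weights `W_x`, `W_h` are assumptions chosen for illustration. Changing the inputs from some step onward leaves all earlier states untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
W_x = rng.normal(size=(4, 3))   # hypothetical input-to-state weights
W_h = rng.normal(size=(4, 4))   # hypothetical state-to-state weights

def states(xs):
    """Return the sequence of latent states, one per input element."""
    h = np.zeros(4)
    hs = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
        hs.append(h)
    return np.stack(hs)

xs = rng.normal(size=(6, 3))
xs_changed = xs.copy()
xs_changed[4:] += 1.0           # alter only the last two inputs

hs, hs_changed = states(xs), states(xs_changed)

prefix_same = np.array_equal(hs[:4], hs_changed[:4])   # states for the shared prefix
suffix_same = np.array_equal(hs[4:], hs_changed[4:])   # states after the change
print(prefix_same, suffix_same)
```

The states for the shared prefix are identical, while the states computed after the altered inputs are not.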


One can look at this unrolled graph as being a dynamically-created computation graph.


A short-hand picture for the movie that you will often see is

<table>
    <tr>
        <th><center>RNN</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_loop.jpg" width=1000></td>
    </tr>
</table>

The movie version is more explicit: it is what results from "unrolling the loop"
in the short-hand version.

The unrolled version will be crucial in understanding how Gradient Descent works when RNN layers are present.
- The unrolled graph looks just like an ordinary graph
- Because it resembles a non-loop computation, our logic and intuition for computing gradients transfers directly


Note that $\x, \y, \h$ are all vectors. 

In particular, the state $\h$ *may have many* elements
-  to record information about the entire prefix of the input.

One extremely important aspect that might not be apparent from the movie version:
- Each unrolled "frame" in the movie shares the *same weights* and computes the *same* function $F$
- In contrast to a true multi-layer network where each layer has its *own* weights

That is, the unrolled RNN computes
$$
\begin{array}{lll}
\y_\tp & = & F( \y_{(\tt-1)}; \W ) \\
& = &  F( \; F( \y_{(\tt-2)}; \,\W ); \,\W \;) \\
& = &  F( \; F( \; F( \y_{(\tt-3)}; \,\W ); \,\W \;  ); \W \;) \\
& = & \vdots \\
\end{array}
$$
rather than
$$
\begin{array}{lll}
\y_\llp & = & F_\llp( \y_{(\ll-1)}; \W_\llp ) \\
& = &  F_\llp( \; F_{(\ll-1)}( \y_{(\ll-2)}; \,\W_{(\ll-1)} ); \,\W_\llp \;) \\
& = &  F_\llp( \; F_{(\ll-1)}( \; F_{(\ll-2)}( \y_{(\ll-3)}; \,\W_{(\ll-2)} ); \,\W_{(\ll-1)} \;  ); \W_\llp \;) \\
& = & \vdots \\
\end{array}
$$
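
The contrast between the two compositions can be sketched in code. The form of `F` (a `tanh` of a matrix-vector product) and the sizes are assumptions made for illustration. The point is only that the unrolled RNN reuses one weight matrix, while the multi-layer network has one per layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 4, 3   # hypothetical width and number of steps/layers

def F(y, W):
    """An assumed form for F: a linear map followed by tanh."""
    return np.tanh(W @ y)

y0 = rng.normal(size=n)

# Unrolled RNN: the same weight matrix W at every step
W = rng.normal(size=(n, n))
y_rnn = y0
for _ in range(depth):
    y_rnn = F(y_rnn, W)

# Multi-layer network: a separate weight matrix for each layer
Ws = [rng.normal(size=(n, n)) for _ in range(depth)]
y_mlp = y0
for W_l in Ws:
    y_mlp = F(y_mlp, W_l)

rnn_params = W.size                        # n*n, independent of depth
mlp_params = sum(W_l.size for W_l in Ws)   # depth * n*n
print(rnn_params, mlp_params)
```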

Note, in particular
- The repeated occurrence of the term $\W$ will complicate computing the derivative
- As we will see in a subsequent lecture
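
To hint at why, here is a sketch of the chain rule applied to the simplified recurrence $\y_\tp = F( \y_{(\tt-1)}; \W )$ used above (a preview under that simplified form, not the full derivation). Because $\W$ appears both directly and inside $\y_{(\tt-1)}$:

$$
\frac{d \y_\tp}{d \W} = \frac{\partial F( \y_{(\tt-1)}; \W )}{\partial \W} + \frac{\partial F( \y_{(\tt-1)}; \W )}{\partial \y_{(\tt-1)}} \cdot \frac{d \y_{(\tt-1)}}{d \W}
$$

The last factor has the same form one step earlier, so expanding it yields one term for each time step back to the start of the sequence. In the multi-layer case, $\y_{(\ll-1)}$ does not depend on $\W_\llp$, so the second term vanishes.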

RNN's are sometimes drawn without separate outputs $\y_\tp$
- in that case, $\h_\tp$ may be considered the output. 

The computation of $\y_\tp$ will just be a transformation of $\h_\tp$ so there is no loss in omitting
it from the RNN and creating a separate node in the computation graph.

Geron does not distinguish between $\y_\tp$ and $\h_\tp$ and he uses the single $\y_\tp$ to denote the state.

I will use $\h$ rather than $\y$ to denote the "hidden state".


## $\h_\tp$ latent state

$\h_\tp$ is the latent state (sometimes called the *hidden state* as it is not visible outside the layer).

It is essentially a *fixed length* encoding of the variable length sequence $[\x_{(1)} \dots \x_\tp]$
- All essential information about the prefix of $\x$ ending at step $\tt$ is recorded in $\h_\tp$
- Hence, the size of $\h_\tp$ may need to be large

Having a fixed length encoding for a variable length input is crucial
- We can feed the (fixed length representation of a) sequence to a Classical ML Classifier/Regressor
- Which requires fixed length inputs

<table>
    <tr>
        <th><center><strong>RNN Many to one; followed by classifier</strong></center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_many_to_one_to_classifier.jpg" width=800></td>
    </tr>
</table>
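
The pipeline in the picture can be sketched as follows. All weights here are random and untrained, and the update rule and the logistic classifier are assumptions chosen for illustration. The point is that sequences of different lengths all map to an encoding of the same size, which a fixed-input classifier can consume.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, state_size = 3, 8   # hypothetical sizes
W_x = rng.normal(scale=0.5, size=(state_size, n_features))
W_h = rng.normal(scale=0.5, size=(state_size, state_size))
w_clf = rng.normal(size=state_size)   # untrained logistic-regression weights

def encode(xs):
    """Many-to-one RNN: return the final latent state."""
    h = np.zeros(state_size)
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
    return h

def classify(h):
    """Probability of the positive class, given a fixed-length input."""
    return 1.0 / (1.0 + np.exp(-(w_clf @ h)))

# Three sequences of different lengths
sequences = [rng.normal(size=(T, n_features)) for T in (2, 5, 11)]
encodings = [encode(xs) for xs in sequences]
probs = [float(classify(h)) for h in encodings]

print([h.shape for h in encodings])   # every encoding has shape (state_size,)
print(probs)
```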

# Conclusion

We have introduced the key concepts of Recurrent Neural Networks.
- An unrolled RNN is just a multi-layer network
- In which *all the layers are identical*: they share the same weights and compute the same function
- The latent state is a fixed length encoding of the prefix of the input

A more detailed view of sequences and RNN's will be our next topic.

