In [1]:
%run Latex_macros.ipynb


# Warning: Higher dimensions ahead !

A Fully Connected/Dense layer is insensitive to the order of features.

This is just a property of the dot product
$$
\Theta^T \cdot \x =  \Theta[ \text{perm} ]^T \cdot \x[ \text{perm} ] 
$$

where $\Theta[ \text{perm} ]$ and $\x[ \text{perm} ]$ denote $\Theta$ and $\x$ reordered by the *same* permutation.



$$
\begin{matrix}
\sum{
\begin{cases}
\text{Machine} & \text{Learning} & \text{is} & \text{easy} & \text{not} & \text{hard} \\
\cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\
\Theta_\text{Machine} & \Theta_\text{Learning} & \Theta_\text{is} & \Theta_\text{easy} & \Theta_\text{not} & \Theta_\text{hard} \\
\end{cases}
}
\\
= \\
\sum{
\begin{cases}
\text{Machine} & \text{Learning} & \text{is} & \text{hard}& \text{not} & \text{easy}  \\
\cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\
\Theta_\text{Machine} & \Theta_\text{Learning} & \Theta_\text{is} & \Theta_\text{hard}& \Theta_\text{not} & \Theta_\text{easy} \\
\end{cases}
}
\end{matrix}
$$
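The identity above can be checked numerically. The following is a minimal sketch; the weights, feature values and the particular permutation are arbitrary choices made for illustration, not values from the lecture.

```python
import numpy as np

# Hypothetical weights and features for the six words (values chosen arbitrarily)
theta = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x = np.array([1.0, 0.0, 2.0, 0.0, 1.0, 3.0])

# Swap positions 3 and 5 (the words "easy" and "hard" in the example above)
perm = np.array([0, 1, 2, 5, 4, 3])

original = theta @ x
both_permuted = theta[perm] @ x[perm]   # same permutation applied to both
only_x_permuted = theta @ x[perm]       # features reordered, weights left alone

print(original, both_permuted, only_x_permuted)
```

The first two values agree because each feature still meets its own weight. The third differs because the identity requires the same permutation on both sides.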

But there are many problems in which order is important.

Consider the following examples


<table>
    <tr>
        <th><center>Same prices</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_sequence_1.jpg" width=800></td>
    </tr>
</table>

<center>Same words</center>
$$
\begin{matrix}
\text{Machine} & \text{Learning} & \text{is} & \text{easy} & \text{not} & \text{difficult} \\
\text{Machine} & \text{Learning} & \text{is} & \text{difficult} & \text{not} & \text{easy} \\
\end{matrix}
$$


<table>
    <tr>
        <th><center>Same pixels</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_sequence_2.jpg" width=800></td>
    </tr>
</table>

In this lecture, we will be dealing with examples that are *sequences*.

That is, we will add a new dimension to each example which we will call the
- *positional* dimension

and we will write $\x_\tp$ for the element at position $\tt$ of sequence $\x$.

Often, the position is equated with time.  In such cases we can also refer to the positional dimension as
the *temporal* dimension.

To make this concrete, consider a movie
- A movie is a sequence of snapshots
- Frame $\tt$ of the movie corresponds to position $\tt$ of the sequence.

Note that each snapshot has its usual dimensions
- spatial dimensions
- feature dimension

Let $\x^\ip$ be an example that happens to be a movie.

It is a sequence of items, one at each of $T$ positions
$$
[ \x^\ip_\tp \; | \; 1 \le \tt \le T]
$$

where
- $\x^\ip$ is a movie: a sequence of frames
- $\x^\ip_\tp$ is the $t^{th}$ frame in the movie
- $\x^\ip_{\tp, j, j'}$ is a particular pixel within the frame $\x^\ip_\tp$
    - The positional dimension is indexed by $\tp$ and the spatial dimensions by $j,j'$
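As a rough illustration of this indexing, a movie can be held in a single array whose first axis is the positional dimension. The sizes below and the axis order (position, height, width, channels) are assumptions made for this sketch.

```python
import numpy as np

# Assumed sizes: T frames, each height x width pixels with 3 colour channels
T, height, width, channels = 4, 2, 3, 3
movie = np.arange(T * height * width * channels).reshape(T, height, width, channels)

t, j, j_prime = 2, 1, 0          # 0-based here; the text counts positions from 1
frame = movie[t]                 # one frame: spatial dimensions + feature dimension
pixel = movie[t, j, j_prime]     # feature vector of one pixel in that frame

print(movie.shape, frame.shape, pixel.shape)
```

Indexing the first axis selects a frame; the remaining axes are the frame's usual spatial and feature dimensions.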


There is an important difference between the positional and spatial dimensions
- spatial dimensions can often be permuted without changing meaning
    - shifting or flipping a frame
- positional dimensions often *cannot* be permuted
    - causal relationships are encoded by order
    - frame $\tt$ makes sense only if it occurs immediately after frame $(\tt-1)$ in the sequence


# Functions on sequences

In the absence of a positional dimension, our multi-layer networks
- Computed functions from vectors to vectors

With a positional dimension, there are several variants of the function
- Many to one
    - Sequence as input, vector as output
    - Examples:
        - Predict next value in a time series (sequence of values)
        - Summarize the sentiment of a sentence (sequence of words)

- Many to many
    - Sequence as input, sequence of vectors as output
    - Examples
        - Translation of sentence in one language to sentence in second language
        - Caption a movie: sequence of frames to sequence of words

- One to many
    - Single input vector, sequence of vectors as output
    - Examples
        - Generating sentences from seed

# Recurrent Neural Network (RNN) layer

With a sequence $\x^\ip$ as input, and a sequence $\y$ as a potential output, the question arises:
- How does an RNN produce $\y_\tp$, the $t^{th}$ output ?

Some choices
- Predict $\y_\tp$ as a direct function of the prefix of $\x$ of length $\tt$: 
$$\pr{\y_\tp | \x_{(1)} \dots \x_\tp} $$

<br>
<div>
    <center><strong>Direct function</strong></center>
    <img src="images/RNN_arch_parallel.png" width=50%>
</div>

- Loop
    - Uses a "latent state" that is updated with each element of the sequence, then predict the output

$$
\begin{array}{ll}
\pr{\h_\tp | \x_\tp, \h_{(\tt-1)} } & \text{latent variable } \h_\tp \text{ encodes } [ \x_{(1)} \dots \x_\tp ]\\
\pr{\y_\tp | \h_\tp }              & \text{prediction contingent on latent variable} \\
\end{array}
$$

    
<br>
<div>
    <center><strong>Loop with latent state</strong></center>
    <img src="images/RNN_arch_loop.png" width=70%>
</div>


Since elements of the sequence are presented **one element at a time**
- the latent state $\h_\tp$ must act as a **summary** of all elements seen so far, $\x_{(1)} \dots \x_\tp$

$$
\h_\tp = \text{summary}(\x_{([1:\tt])})
$$

Note that $\h_\tp$ is a *vector* of fixed length.

Thus, it is a *fixed length* representation of the key aspects
of a sequence $\x$ of potentially *unbounded* length.

**Example**

Let's use an RNN to compute the sum of a sequence of numbers
- the latent state $\h_\tp$ can be maintained as 
$$
\h_\tp = \text{summary}(\x_{([1:\tt])}) = \sum_{\tt' =1}^\tt { \x_{(\tt')} }
$$
- by updating $\h_\tp$ in the loop
$$
\h_\tp = \h_{(\tt-1)} + \x_\tp
$$
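The running sum can be written directly as a loop over the sequence. This is a minimal sketch; the function name `running_sum_states` and the sample values are hypothetical.

```python
def running_sum_states(sequence):
    """Return the latent state after each element, using only the previous state."""
    h = 0.0                      # initial state: the sum of an empty prefix
    states = []
    for x_t in sequence:
        h = h + x_t              # h_t = h_{t-1} + x_t
        states.append(h)
    return states

sequence = [3.0, 1.0, 4.0, 1.0, 5.0]
states = running_sum_states(sequence)
print(states)                    # [3.0, 4.0, 8.0, 9.0, 14.0]
```

At no point does the loop look back at earlier elements: the single number `h` carries everything needed from the prefix.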


The Recurrent Neural Network (RNN) adopts the "latent state" approach.

A prime advantage of the latent state approach
- it can handle sequences of *unbounded* length

Here is some pseudo-code:

In [2]:
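import numpy as np

def RNN( input_sequence, state_size, f ):
    # f(input, state) -> (out, new_state) is the per-step function;
    # it is passed in here only so that the sketch is self-contained
    state = np.random.uniform(size=state_size)   # initial latent state
    out = None                                   # returned if the sequence is empty
    
    for x_t in input_sequence:
        # Consume one input, update the state
        out, state = f(x_t, state)
        
    # Output produced after the final element
    return out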
        

And the corresponding diagram, showing the output `out` ($\y_\tp$)

<table>
    <tr>
        <th><center>RNN</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_loop.png"></td>
    </tr>
</table>

## Output $\hat\y_\tp$ of an RNN

According to our pseudo-code and diagram
$$
\hat\y_\tp = \h_\tp
$$

That is: the output is the same as the latent state.

It is easy to add another NN to transform $\h_\tp$ into a $\hat\y_\tp$ that is different
- we will omit this additional layer for clarity


## Unrolled RNN diagram

We can "unroll" the loop into a kind of movie
- a sequence of steps
- step $\tt-1$ arranged to the left of step $\tt$

<table>
    <tr>
        <th><center>RNN many to many API</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_API_many_to_many.jpg"></td>
    </tr>
</table>

At each time step $\tt$
- Input $\x_\tp$ is processed
- Causes latent state $\h$ to update from $\h_{(\tt-1)}$ to $\h_\tp$
    - We use the same sequence notation to record the sequence of latent states $[ \h_{(1)}, \ldots, \h_{(T)} ]$
- Optionally outputs $\y_\tp$ (for outputs that are of type sequence)

When processing $\x_\tp$
- The function computed takes $\h_{(\tt-1)}$ as input
- Latent state $\h_{(\tt-1)}$ has been derived by having processed $[\x_{(1)} \dots \x_{(\tt-1)}]$
- And is thus a *summary* of the prefix of the input encountered thus far


One can look at this unrolled graph as being a dynamically-created computation graph.
- A sequence of layers
- One layer per time step
- But with an **identical** computation for all layers

The unrolled version will be crucial in understanding how Gradient Descent works when RNN layers are present.
- Just conceptualize the unrolled loop as a sequence of layers
- All our logic and intuition carries over


Note that $\x, \y, \h$ are all vectors. 

In particular, the state $\h$ *may have many* elements
- it is a vector of "synthesized" features
-  to record information about the entire prefix of the input.

$\h_\tp$ is the latent state (sometimes called the *hidden state* as it is not visible outside the layer).

It is essentially a *fixed length* encoding of the variable length sequence $[\x_{(1)} \dots \x_\tp]$
- All essential information about the prefix of $\x$ ending at step $\tt$ is recorded in $\h_\tp$
- Hence, the size of $\h_\tp$ may need to be large

We will shortly attempt to gain some intuition as to what these synthesized features may be.

## All "layers" in the unrolled graph share weights

One extremely important aspect that might not be apparent from the movie version:
- Each unrolled "frame" in the movie shares the *same weights* and computes the *same* function $F$
- In contrast to a true multi-layer network where each layer has its *own* weights
<br>
<br>
<table>
    <tr>
        <th><center>RNN shared weights</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_API_shared_weights.png" width="80%"></td>
    </tr>
</table>

That is, the unrolled RNN computes
$$
\begin{array}{lll}
\y_\tp & = & F( \y_{(\tt-1)}; \W ) \\
& = &  F( \; F( \y_{(\tt-2)}; \,\W ); \,\W \;) \\
& = &  F( \; F( \; F( \y_{(\tt-3)}; \,\W ); \,\W \;  ); \W \;) \\
& = & \vdots \\
\end{array}
$$
rather than
$$
\begin{array}{lll}
\y_\llp & = & F_\llp( \y_{(\ll-1)}; \W_\llp ) \\
& = &  F_\llp( \; F_{(\ll-1)}( \y_{(\ll-2)}; \,\W_{(\ll-1)} ); \,\W_\llp \;) \\
& = &  F_\llp( \; F_{(\ll-1)}( \; F_{(\ll-2)}( \y_{(\ll-3)}; \,\W_{(\ll-2)} ); \,\W_{(\ll-1)} \;  ); \W_\llp \;) \\
& = & \vdots \\
\end{array}
$$

Note, in particular
- The repeated occurrence of the term $\W$ will complicate computing the derivative
- As we will see in a subsequent lecture
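The weight sharing can be made concrete with a small sketch. It assumes a commonly used form of the step function, $\tanh$ applied to a linear combination of the previous state and the current input; this section does not specify $F$, so the form, the sizes and the parameter names below are all assumptions. Unlike the simplified equations above, the sketch passes the input $\x_\tp$ to $F$ explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
state_size, input_size, T = 3, 2, 4       # assumed sizes

# One set of weights, reused at every time step
W = {
    "h": rng.normal(size=(state_size, state_size)),
    "x": rng.normal(size=(state_size, input_size)),
    "b": np.zeros(state_size),
}

def F(h_prev, x_t, W):
    """One step: new state from the previous state and the current input."""
    return np.tanh(W["h"] @ h_prev + W["x"] @ x_t + W["b"])

x = rng.normal(size=(T, input_size))
h0 = np.zeros(state_size)

# The loop ...
h = h0
for x_t in x:
    h = F(h, x_t, W)

# ... is the same computation as the nested ("unrolled") expression
h_unrolled = F(F(F(F(h0, x[0], W), x[1], W), x[2], W), x[3], W)

print(np.allclose(h, h_unrolled))
```

The same `W` appears in every nested call, which is exactly the repeated occurrence that complicates the derivative.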

RNN's are sometimes drawn without separate outputs $\y_\tp$
- in that case, $\h_\tp$ may be considered the output. 

The computation of $\y_\tp$ will just be a transformation of $\h_\tp$ so there is no loss in omitting
it from the RNN and creating a separate node in the computation graph.

Geron does not distinguish between $\y_\tp$ and $\h_\tp$ and he uses the single $\y_\tp$ to denote the state.

I will use $\h$ rather than $\y$ to denote the "hidden state".


# Typical uses of RNN

## Many to one: Creating a fixed length summary of a variable length sequence

A typical Many to One task is predicting the next element in a sequence

For example
- Predict the next word in a sentence
- Predict the next price in a timeseries of prices

These are implemented by a NN (with RNN layers as components) followed by a Head Layer (Classifier or Regressor)


But the Head Layers take **fixed length** inputs and our sequences are of potentially unbounded length !

We first need to convert the variable length sequence into a fixed length representation.

Let's make this concrete with an example: a sequence of words

<table>
    <tr>
        <th><center>RNN</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_loop_NLP.png" width=1000></td>
    </tr>
</table>

$\h_\tp$ is a **fixed length** vector that "summarizes" the prefix of sequence $\x$ up to element $\tt$.

The sequence is processed element by element, so order matters.

$$
\begin{array}{lll}
\h_{(0)} & = & \text{summary}( [ \text{Machine} ]) \\
\h_{(1)} & = & \text{summary}( [ \text{Machine, Learning} ]) \\
\vdots \\
\h_\tp & = & \text{summary}( [ \x_{(0)}, \ldots, \x_\tp ] ) \\
\vdots \\
\h_{(5)} & = & \text{summary}( [ \text{Machine, Learning, is, easy, not, hard} ]) \\
\end{array}
$$

Turning an unbounded length sequence into a fixed length vector is very useful !
- All our other layer types take fixed length input

So we can feed $\h_{(5)}$ into a Classifier to decide on the sentiment of the sentence.

<table>
    <tr>
        <th><center><strong>RNN Many to one; followed by classifier</strong></center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_many_to_one_to_classifier.jpg" width=870%></td>
    </tr>
</table>
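A sketch of the many-to-one pipeline follows: summarize the sentence with a recurrent loop, then feed the final state to a classifier. Everything here is an assumption made for illustration: the random (untrained) embeddings and weights, the $\tanh$ update, and the logistic head. The point is only that the summary depends on word order, whereas an order-insensitive sum of word vectors does not.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["Machine", "Learning", "is", "easy", "not", "hard"]
embed_size, state_size = 4, 5             # assumed sizes

# Hypothetical, randomly initialised (untrained) parameters
embedding = {w: rng.normal(size=embed_size) for w in vocab}
W_h = 0.5 * rng.normal(size=(state_size, state_size))
W_x = 0.5 * rng.normal(size=(state_size, embed_size))
w_clf = rng.normal(size=state_size)

def summarize(words):
    """Process the words one at a time; return the final latent state."""
    h = np.zeros(state_size)
    for w in words:
        h = np.tanh(W_h @ h + W_x @ embedding[w])
    return h

def classify(h):
    """Logistic head: a score in (0, 1) computed from the fixed length summary."""
    return 1.0 / (1.0 + np.exp(-(w_clf @ h)))

s1 = ["Machine", "Learning", "is", "easy", "not", "hard"]
s2 = ["Machine", "Learning", "is", "hard", "not", "easy"]

h1, h2 = summarize(s1), summarize(s2)
bag1 = np.sum([embedding[w] for w in s1], axis=0)   # order-insensitive summary
bag2 = np.sum([embedding[w] for w in s2], axis=0)

print(np.allclose(bag1, bag2), np.allclose(h1, h2))
print(classify(h1), classify(h2))
```

Since the weights are untrained, the two scores carry no meaning as sentiment; they only show that the two sentences reach the classifier as different inputs.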

## Many to many: Encoder-Decoder

Another common paradigm using RNN's that we will encounter is the *Encoder-Decoder*, which is useful for tasks mapping sequences to sequences
- for example, language translation
- Output and input sequence elements do not have a one to one correspondence
- The Encoder-Decoder decouples the sequences
    - Encoder summarizes the input sequence $\x_{([1:\bar T])}$ with $\bar \h_{(\bar T)}$
    - Decoder generates output sequence $\hat \y_{([1:T])}$ from the summary
    
<table>
    <tr>
        <th><center><strong>Encoder-Decoder for language translation</strong></center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_API_Encoder_Decoder_Language_Translation.png" width=80%></td>
    </tr>
</table>

- The final latent state $\bar\h_{(\bar T)}$ of the Encoder "summarizes" the source sentence (English)
- It initializes the latent state of the Decoder which produces the target sentence (French)
- The Decoder implements a one-to-many API
    - source language "summary" as seed
    
Decoupling means that the length of $\x$ (length $\bar T$) need not be equal to the length of $\hat\y$ (length $T$).
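The decoupling of lengths can be seen in a small sketch. It makes several simplifying assumptions: untrained random weights, a $\tanh$ update, a Decoder that does not feed its previous output back in, and an output length $T$ that is supplied by the caller rather than decided by the model.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, state_size, output_size = 3, 4, 2   # assumed sizes

# Hypothetical untrained parameters; Encoder and Decoder have separate weights
enc_W_h = 0.5 * rng.normal(size=(state_size, state_size))
enc_W_x = 0.5 * rng.normal(size=(state_size, input_size))
dec_W_h = 0.5 * rng.normal(size=(state_size, state_size))
dec_W_y = rng.normal(size=(output_size, state_size))

def encode(x):
    """Many to one: summarize the whole input sequence in the final state."""
    h = np.zeros(state_size)
    for x_t in x:
        h = np.tanh(enc_W_h @ h + enc_W_x @ x_t)
    return h

def decode(summary, T):
    """One to many: start from the Encoder summary and emit T outputs."""
    h = summary
    outputs = []
    for _ in range(T):
        h = np.tanh(dec_W_h @ h)
        outputs.append(dec_W_y @ h)
    return np.stack(outputs)

T_bar, T = 5, 3                                 # input and output lengths differ
x = rng.normal(size=(T_bar, input_size))
y_hat = decode(encode(x), T)
print(x.shape, y_hat.shape)                     # (5, 3) (3, 2)
```

The only thing passed from `encode` to `decode` is the fixed length summary, so nothing ties the two sequence lengths together.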

## One to many: Generative ML (generating sequences from a seed)

The two main Machine Learning Tasks we have studied thus far (Regression, Classification) are called *discriminative* tasks
- they learn the relationship between features and targets of an example

We can also use Machine Learning for the *generative* task of creating new examples
- learns the distribution of features
- can sample from the learned distribution to construct new examples



RNN's are often used for generative tasks. 

We generate a long sequence that is highly probable (from the learned distribution) given a short sequence as seed.

- The model is initially given a short "seed" sequence as input.
- The output is a prediction of the  **next** element of the sequence
- The input sequence is extended by the prediction
- Repeat !
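The loop described above can be sketched as follows. The function `predict_next` is a hypothetical stand-in for a trained model: here it just adds the last two elements, whereas a real generative model would sample the next element from its learned distribution.

```python
def predict_next(sequence):
    """Stand-in for a trained model: the sum of the last two elements."""
    return sequence[-1] + sequence[-2]

def generate(seed, n_steps):
    """Extend the seed one predicted element at a time."""
    sequence = list(seed)
    for _ in range(n_steps):
        next_element = predict_next(sequence)   # predict the next element
        sequence.append(next_element)           # extend the input by the prediction
    return sequence

print(generate([1, 1], 5))   # [1, 1, 2, 3, 5, 8, 13]
```

Each prediction becomes part of the input for the following step, so the generated sequence can be made as long as desired.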


[Here](https://app.inferkit.com/demo) is a demo of creating an entire story from an initial idea consisting of a few words.

# Conclusion

We have introduced the key concepts of Recurrent Neural Networks.
- An unrolled RNN is just a multi-layer network
- In which *all the layers are identical*
- The latent state is a fixed length encoding of the prefix of the input

A more detailed view of sequences and RNN's will be our next topic.

In [3]:
print("Done")

Done
