In [1]:
%run Latex_macros.ipynb

<IPython.core.display.Latex object>

**CHANGES TODO**
- All the back prop stuff has to wait until the lecture on Training
- Can't do LSTM until backprop
- Residual connections depend on back prop
- Visualization OK

# The RNN API

During one time step of computation, the RNN computes 2 values
- new state $\h_\tp$
- output $\y_\tp$ (sometimes simply taken to be the same as the short term state)

The state computation is a function of the previous state $\h_{(t-1)}$,
and the current input $\x_\tp$.

$$
\h_\tp = f(\x_\tp;  \h_{(t-1)})
$$

Note the recursive aspect of the computation of  $\h_\tp$: 
- it implicitly depends
on the values of the states at all previous time steps $t' < t$.
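
The recursion can be made concrete with a minimal sketch. The function `f` below is a hypothetical stand-in (a decayed running sum), not an actual RNN cell; it only illustrates that $\h_\tp$ depends on every earlier input through $\h_{(t-1)}$.

```python
def f(x_t, h_prev):
    """Toy state update (an assumption for illustration): decay the old state, add the new input."""
    return 0.5 * h_prev + x_t


def final_state(xs, h0=0.0):
    """Apply h_t = f(x_t, h_{t-1}) over the whole sequence and return the last state."""
    h = h0
    for x_t in xs:
        h = f(x_t, h)
    return h


h_T = final_state([1.0, 2.0, 3.0])

# Changing only the FIRST input changes the final state:
# the dependence on x_1 is carried forward through the intermediate states.
h_T_perturbed = final_state([10.0, 2.0, 3.0])
print(h_T, h_T_perturbed)   # 4.25 6.5
```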


# RNN as a layer

## Many to one

Although the unrolled RNN looks confusing, as an "API" the RNN just acts as any other layer
- takes some input $\x$ (which happens to be a sequence)
- produces a single output

If we draw a box around the unrolled RNN, we can see the "API":

<table>
    <tr>
        <th><center>RNN many to one</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_many_one.jpg" width=800></td>
    </tr>
</table>


## RNN layer as an encoder
The many to one RNN essentially creates a compact encoding of an arbitrarily long sequence.

This can be very useful as we can feed this "summary" (representation) of the entire sequence
into layers that can't handle sequence inputs.

Note that there is nothing special about a layer creating a compact encoding (representation) of its input.

A CNN layer, with outputs flattened to one dimension, creates a compact encoding of an image.

The real power of the RNN is the ability to encode all sequences, regardless of length, into a fixed size
representation.

### Sequences: variable length input summarized

$\h_\tp$ summarizes the length $\tt$ sequence $\x_{1,\ldots, \tt}$ in a *fixed size* vector $\h_\tp$.
- makes sequences amenable to models that can only deal with fixed size input


To be clear
- the RNN is a layer, just like any other
    - Internally it implements a loop but that is ordinarily hidden
    - The intuition about the "unrolled loop" is to help us to better understand the inner workings, not as a coding matter

- Like any other layer, it produces an output (although after multiple time steps for an RNN versus
a single time step for a Dense layer).
                                          
- If the length of sequence $\x$ is $T$, there is ordinarily a **single** output $\y_{(T)}$
    - $\y_{(T)}$ is only available after the entire input sequence has been consumed
    - the intermediate results 
    $$\h_\tp, \y_\tp, \; t = 1, \ldots, (T-1)$$ 
    are not visible through the API
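
A minimal NumPy sketch of the many to one behavior, assuming the $\tanh$ update given in the section on update equations. The weights are random (untrained) and the sizes are arbitrary; the function name `encode` is made up for illustration. The point is only the shapes: sequences of different lengths come out as vectors of the same size, and the intermediate states stay inside the function.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden = 3, 4            # sizes chosen arbitrarily for the sketch

# Randomly initialized (untrained) weights
W_xh = rng.normal(size=(n_hidden, n_features))
W_hh = rng.normal(size=(n_hidden, n_hidden))
b_h = np.zeros(n_hidden)


def encode(x_seq):
    """Many to one: consume a (T, n_features) sequence, return only the final state."""
    h = np.zeros(n_hidden)
    for x_t in x_seq:                   # the loop is hidden inside the "layer"
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h                            # intermediate states are not exposed


short = rng.normal(size=(2, n_features))    # T = 2
long_ = rng.normal(size=(50, n_features))   # T = 50

print(encode(short).shape, encode(long_).shape)   # (4,) (4,)
```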
    

## Many to many

The above behavior defines a many to one mapping from input sequence (many) to single output (one).

With a minor change, we can define a many to many mapping:
- each element of the input sequence
results in one element of an output sequence.

Many Deep Learning software APIs give recurrent layers two optional arguments
- `return_sequences`
- `return_state`
- both default to `False` in Keras.

`return_sequences` controls the output behavior of the RNN layer: whether it returns one value per time step
$$
       \h_{(1)}, \ldots, \h_{(T)} \\
       \y_{(1)}, \ldots, \y_{(T)}
$$
or just
$$
\h_{(T)} \\
\y_{(T)}
$$

This is how any RNN behaves when the function it's implementing is many to many:
- one output per time step.

When the RNN needs to implement a many to many function, the layer looks like

<table>
    <tr>
        <th><center>RNN many to many</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_many_many.jpg" width=800></td>
    </tr>
</table>


The art-work needs to be clarified
- the RNN layer produces sequences
    - as outputs $\y$
    - as states $\h$

These sequences are available when the RNN layer *completes* its consumption of input $\x$.

The following diagram may clarify


<table>
    <tr>
        <th><center>RNN many to many, clarified</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_loop_many_many.jpg" width=1000></td>
    </tr>
</table>

- the `return_sequences` argument instructs the layer to produce a sequence $\y$
    - rather than a single output, as in the many to one case
- the `return_state` argument instructs the layer to return the state $\h$ as well
    - useful if we stack RNN layers
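
The effect of the two flags can be sketched without any Deep Learning library. The function `simple_rnn` below is a hypothetical NumPy mimic, not the Keras implementation; it assumes the output $\y_\tp$ is taken to be the state $\h_\tp$, and it borrows the Keras spelling `return_state` for the second flag.

```python
import numpy as np


def simple_rnn(x_seq, W_xh, W_hh, b_h, return_sequences=False, return_state=False):
    """NumPy mimic of a recurrent layer whose output y_t is taken to be the state h_t."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in x_seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    output = np.stack(states) if return_sequences else h
    return (output, h) if return_state else output


rng = np.random.default_rng(0)
T, n_features, n_hidden = 5, 3, 4
x_seq = rng.normal(size=(T, n_features))
params = (rng.normal(size=(n_hidden, n_features)),
          rng.normal(size=(n_hidden, n_hidden)),
          np.zeros(n_hidden))

y_last = simple_rnn(x_seq, *params)                          # many to one: shape (4,)
y_all = simple_rnn(x_seq, *params, return_sequences=True)    # many to many: shape (5, 4)
y_seq, h_T = simple_rnn(x_seq, *params, return_sequences=True, return_state=True)
```

The last row of `y_all` equals `y_last`: the many to one output is simply the final element of the many to many output.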

### Stacked RNN layers

One can connect RNN layers into "stacks" 
- by feeding
the output state of one RNN layer as the input to the successor layer:


<table>
    <tr>
        <th><center>RNN Stacked layers</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layers_stacked.jpg" width=800></td>
    </tr>
</table>

### Encoder/Decoder architecture

An *Encoder/Decoder* is a two part Neural Network that is applied to many NLP tasks
- *Encoder* converts sequence (sentence) into intermediate representation (sequence)
- *Decoder* converts intermediate sequence to final sequence

**IS THE DIAGRAM CORRECT ??**

Or is it
- Encoder: many to one
- Decoder: one to many

Attention
- Encoder: many to many
- Decoder: many to many


<table>
    <tr>
        <th><center>RNN Encoder/Decoder</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_Encoder_Decoder.jpg" width=800></td>
    </tr>
</table>


### Sequence to Sequence

- Many to one encoder
    - variable length input to fixed length final state $\h$
- One to one decoder
    - *initial* state of decoder set to *final* state of encoder
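    - teacher forcing (training only)
        - input $(\tt +1)$ of decoder is the correct target $\tt$, not output $\tt$ of the decoder
        - at inference, input $(\tt +1)$ of decoder is output $\tt$ of decoder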

Example: language translation

This is useful when there is not an exact one to one correspondence between tokens in the source and target languages.
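
A rough sketch of the inference-time data flow, under several simplifying assumptions: random untrained weights, no biases, real-valued vectors instead of tokens, and a fixed number of decoder steps instead of an end-of-sequence token. All names are made up for illustration. It shows only the hand-off: the decoder's initial state is the encoder's final state, and the source and target lengths need not match.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, n_hidden = 3, 2, 4


def init(n_x):
    """Random (untrained) parameters for one RNN with input size n_x."""
    return rng.normal(size=(n_hidden, n_x)), rng.normal(size=(n_hidden, n_hidden))


enc_W_xh, enc_W_hh = init(n_in)
dec_W_xh, dec_W_hh = init(n_out)           # the decoder consumes its own outputs
W_hy = rng.normal(size=(n_out, n_hidden))


def encode(x_seq):
    """Many to one: return the final encoder state."""
    h = np.zeros(n_hidden)
    for x_t in x_seq:
        h = np.tanh(enc_W_xh @ x_t + enc_W_hh @ h)
    return h


def decode(h_enc, start, n_steps):
    h, y = h_enc, start                    # initial decoder state = final encoder state
    outputs = []
    for _ in range(n_steps):
        h = np.tanh(dec_W_xh @ y + dec_W_hh @ h)
        y = W_hy @ h                       # inference: y becomes the next decoder input
        outputs.append(y)
    return np.stack(outputs)


source = rng.normal(size=(7, n_in))        # source sequence of length 7
target = decode(encode(source), start=np.zeros(n_out), n_steps=4)
print(target.shape)                        # (4, 2)
```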

<table>
    <tr>
        <th><center>Sequence to Sequence: inference</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_seq2seq_inference.jpg" width=1000></td>
    </tr>
</table>

<table>
    <tr>
        <th><center>Sequence to Sequence: training (teacher forcing)</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_seq2seq_training.jpg" width=1000></td>
    </tr>
</table>

# RNN details: update equations

$$
\begin{array}{lll}
\h_\tp & = & \phi(\W_{xh}\x_\tp  + \W_{hh}\h_{(t-1)}  + \b_h) \\
\y_\tp & = &  \W_{hy} \h_\tp  + \b_y \\
\end{array}
$$

where $\phi$ is an activation function (usually $\tanh$)
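
One time step of these equations, written out in NumPy for a single example. The sizes (`n`, `n_h`, `n_y`) and the random weights are assumptions made for the sketch; the comments give the dimensions of each term.

```python
import numpy as np

n, n_h, n_y = 3, 4, 2          # input, state, and output sizes (arbitrary)
rng = np.random.default_rng(0)

W_xh = rng.normal(size=(n_h, n))     # (n_h x n):   input -> state
W_hh = rng.normal(size=(n_h, n_h))   # (n_h x n_h): state -> state
W_hy = rng.normal(size=(n_y, n_h))   # (n_y x n_h): state -> output
b_h = np.zeros(n_h)
b_y = np.zeros(n_y)


def rnn_step(x_t, h_prev):
    """One time step of the update equations, with phi = tanh."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    y_t = W_hy @ h_t + b_y
    return h_t, y_t


h_t, y_t = rnn_step(rng.normal(size=n), np.zeros(n_h))
print(h_t.shape, y_t.shape)    # (4,) (2,)
```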

**Note** Geron prefers right multiplying weights $\x_\tp \W_{xh}$ versus $\W_{xh}\x_\tp$
- left multiplying seems more common in literature

**Note**
The equation is for a single example.  

In practice, we process an entire minibatch at once, so we have $m$ examples $\x$,
given as the rows of an $(m \times n)$ matrix $\mathbf{X}$.
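
A short check, with made-up sizes and random weights, that the minibatch form agrees with the single-example equation. With one example per row, the weights are right-multiplied (transposed), which is the convention mentioned in the note on Geron above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, n_h = 5, 3, 4                        # batch size, input size, state size
W_xh = rng.normal(size=(n_h, n))
W_hh = rng.normal(size=(n_h, n_h))
b_h = rng.normal(size=n_h)

X_t = rng.normal(size=(m, n))              # one time step for m examples, one per row
H_prev = rng.normal(size=(m, n_h))         # previous state of each example

# Batched update: right-multiply by the transposed weights; b_h broadcasts over rows
H_t = np.tanh(X_t @ W_xh.T + H_prev @ W_hh.T + b_h)

# Same result, one example at a time, using the left-multiplied single-example form
H_loop = np.stack([np.tanh(W_xh @ x + W_hh @ h + b_h) for x, h in zip(X_t, H_prev)])
print(np.allclose(H_t, H_loop))            # True
```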

**page 471: mention dimensions of each**

## Equation in pseudo-matrix form

You will often see a short-hand form of the equation.

Look at $\h_\tp$ as a function of two inputs $\x_\tp, \h_{(t-1)}$.

We can stack the two inputs into a single vector $\mathbf{I}$.

Similarly, we can stack the two matrices $\W_{xh}, \W_{hh}$ side by side into a single weight matrix $\W$

$$
\begin{array}{l}
\h_\tp  = \phi(\W \mathbf{I} + \b_h) \\
\text{ with } \\
\W = \left[
 \begin{matrix}
    \W_{xh} & \W_{hh}
 \end{matrix} 
 \right] \\
\mathbf{I} = \left[
 \begin{matrix}
    \x_\tp  \\
    \h_{(t-1)}
 \end{matrix} 
 \right] \\
\end{array}
$$
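
A quick numerical check that the stacked form is the same computation as the two-term form, with the activation $\phi = \tanh$ applied as in the update equation for $\h_\tp$. Sizes and values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_h = 3, 4
W_xh = rng.normal(size=(n_h, n))
W_hh = rng.normal(size=(n_h, n_h))
b_h = rng.normal(size=n_h)
x_t = rng.normal(size=n)
h_prev = rng.normal(size=n_h)

W = np.hstack([W_xh, W_hh])            # shape (n_h, n + n_h): blocks side by side
I = np.concatenate([x_t, h_prev])      # shape (n + n_h,): inputs on top of each other

stacked = np.tanh(W @ I + b_h)
separate = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
print(np.allclose(stacked, separate))  # True
```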

## Stacked RNN layers revisited

With the benefit of the RNN update equations, we can clarify how stacking RNN layers works.

Let superscript $[\ll]$ denote a stacked layer of RNN.

So the RNN update equation for the bottom layer $1$ becomes
$$
\begin{array}{lll}
\h^{[1]}_\tp & = & \phi(\W_{xh}\x_\tp  + \W_{hh}\h^{[1]}_{(t-1)}  + \b_h) \\
\end{array}
$$

The RNN update equation for layer $[\ll]$, with $\ll > 1$, becomes

$$
\begin{array}{lll}
\h^{[\ll]}_\tp & = & \phi(\W_{xh}\h^{[\ll-1]}_\tp  + \W_{hh}\h^{[\ll]}_{(t-1)}  + \b_h) \\
\end{array}
$$

That is: the input to layer $[\ll]$ is $\h^{[\ll-1]}_\tp$ rather than $\x_\tp$.

Each layer has its own weights and bias; the layer superscript on $\W_{xh}, \W_{hh}, \b_h$ is omitted to reduce clutter.
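
A sketch of the stacked update in NumPy, assuming every layer has the same state size and its own (random, untrained) parameters. At each time step the state just computed by one layer is handed to the layer above as its input.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, n_h, n_layers = 6, 3, 4, 2

# One (W_xh, W_hh, b_h) triple per layer; layer 1 reads x, higher layers read the state below
sizes_in = [n] + [n_h] * (n_layers - 1)
params = [(rng.normal(size=(n_h, n_in)), rng.normal(size=(n_h, n_h)), np.zeros(n_h))
          for n_in in sizes_in]

x_seq = rng.normal(size=(T, n))
h = [np.zeros(n_h) for _ in range(n_layers)]    # h[l] is the state of layer l+1 at step t-1

for x_t in x_seq:
    layer_input = x_t
    for l, (W_xh, W_hh, b_h) in enumerate(params):
        h[l] = np.tanh(W_xh @ layer_input + W_hh @ h[l] + b_h)
        layer_input = h[l]                       # becomes the input of the layer above

print([state.shape for state in h])              # [(4,), (4,)]
```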

# Loss function

Examples
- an example $\x^\ip$ is now a *sequence* $\x^\ip_{(1)}, \x^\ip_{(2)}, \ldots, \x^\ip_{(T)} $
    - variable length
    - $\x^\ip_\tp$ *may* be a vector (doesn't have to be scalar), 
        - e.g., word embedding
        

Per example loss $\loss^\ip$ *per time step*
- In many to many: there is a loss per time step.  
- Total loss (over which we optimize) is the sum, over time, of the loss per time step
    - $\loss^\ip = \sum_{\tt=1}^T \loss^\ip_\tp$
- In many to one: loss is a single value (per example): depends on final state 
    - $\loss^\ip = \loss^\ip_{(T)}$ 
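
The two cases for a single example, sketched in NumPy. Squared error is an arbitrary choice of per-step loss made for this illustration, and the predictions and targets are random placeholders rather than the output of a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_y = 5, 2
y_hat = rng.normal(size=(T, n_y))     # predictions for one example, one row per time step
y_true = rng.normal(size=(T, n_y))    # targets for the same example

# Loss at each time step (squared error is an assumption of this sketch)
loss_per_step = np.sum((y_hat - y_true) ** 2, axis=1)    # shape (T,)

loss_many_to_many = loss_per_step.sum()    # sum over t = 1..T
loss_many_to_one = loss_per_step[-1]       # only the final step contributes
print(loss_many_to_many, loss_many_to_one)
```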

<table>
    <tr>
        <th><center>RNN Loss: Forward pass</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_loss.jpg" width=800></td>
    </tr>
</table>

# Sequences: Variable length

There are lots of small potholes one encounters with sequences.

What if the examples in my training set have widely varying lengths?

- Within a batch, short examples may behave differently than long examples:
    - Maybe learn less in short examples, noisier gradient updates
    
- Padding sequences to make them equal length
    - Pad at the start ? Or at the end ?

The general advice is to arrange your data so that an epoch contains examples of similar lengths.
- You may require multiple fittings, one per length
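
Both ideas can be sketched in plain Python. The helper names (`pad`, `bucket_by_length`) and the padding value are made up for illustration; libraries provide their own utilities for this. Grouping sequences of similar length means each group needs only a little padding.

```python
from collections import defaultdict

PAD = 0   # hypothetical padding value; it should not collide with real data


def pad(seq, length, at_start=False):
    """Pad `seq` with PAD up to `length`, either before or after the real elements."""
    padding = [PAD] * (length - len(seq))
    return padding + seq if at_start else seq + padding


def bucket_by_length(seqs, bucket_width):
    """Group sequences whose lengths fall in the same range of `bucket_width` values."""
    buckets = defaultdict(list)
    for seq in seqs:
        buckets[(len(seq) - 1) // bucket_width].append(seq)
    # Within a bucket, pad only up to that bucket's longest sequence
    return [[pad(s, max(map(len, group))) for s in group]
            for _, group in sorted(buckets.items())]


seqs = [[1, 2], [3], [4, 5, 6, 7, 8], [9, 1, 2, 3, 4, 5]]
print(pad([1, 2], 4))                  # [1, 2, 0, 0]
print(pad([1, 2], 4, at_start=True))   # [0, 0, 1, 2]
print(bucket_by_length(seqs, bucket_width=3))
```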

# Visualization of RNN hidden state

Here is a [visualization](http://karpathy.github.io/2015/05/21/rnn-effectiveness/#visualizing-the-predictions-and-the-neuron-firings-in-the-rnn) of single elements within the hidden state, as they consume the input sequence
of *single characters*.

The color reflects the intensity (value) of the particular cell (blue=low, red=high)


<table>
    <tr>
        <th><center>State activations after seeing prefix of input</center></th>
    </tr>
    <tr>
        <td><img src="images/Unreasonable_effectiveness_1.png" width=800></td>
    </tr>
</table>

# Issues with RNN's

Preview:
- back prop, BPTT
- exploding gradient
- vanishing gradient

- Length of a sequence potentially very long
    - Gradient computation
    - Can we/should we really unroll the whole loop
    - Vanishing/exploding gradients
    - Dealing with long sequences in Keras

These are special cases of issues that arise in Back Prop.

We will defer many of these issues until after we learn Back Prop.

# RNN as a generative model


Up to now, an RNN's inputs were a prespecified sequence $\x$.

For each example during training, one element of $\x$ was fed into the RNN per time-step.

Similarly for inference time.

This behavior is characteristic of a discriminative network.

Consider: Suppose there were **no** inputs (or more precisely: a very short sequence $\x$ of length $t'$, used to "prime" the RNN).

Instead, let's set the input at time step $t$ to be the output of step $(t-1)$
$$
\x_\tp = \y_{(t-1)}
$$
for $t > t'$.

Then the RNN would be self-perpetuating, never exhausting its inputs, and generating new outputs
*conditional* on previous outputs !

This would be a generative form of the RNN.
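
A minimal sketch of the feedback loop, assuming random untrained weights, no biases, and an output size equal to the input size (required so that an output can serve as the next input). The names `step` and `generate` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_h = 3, 4                              # output size must equal input size here
W_xh = rng.normal(size=(n_h, n))
W_hh = rng.normal(size=(n_h, n_h))
W_hy = rng.normal(size=(n, n_h))


def step(x_t, h_prev):
    """One RNN time step: new state and output."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)
    return h_t, W_hy @ h_t


def generate(prime, n_new):
    """Consume the (non-empty) priming sequence, then feed each output back as the next input."""
    h = np.zeros(n_h)
    for x_t in prime:                      # t <= t': inputs come from the priming sequence
        h, y = step(x_t, h)
    generated = []
    for _ in range(n_new):                 # t > t': x_t = y_{t-1}
        h, y = step(y, h)
        generated.append(y)
    return np.stack(generated)


prime = rng.normal(size=(2, n))            # t' = 2
out = generate(prime, n_new=5)
print(out.shape)                           # (5, 3)
```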

## Training by teacher forcing

One way to train this RNN is via a supervised task
- given sequence $\x$ up to time $t$: $\x_{(1, \ldots t)}$
- target is $\x_{(t+1)}$

This is just ordinary supervised training with a specially constructed input derived from a single sequence $\x$.  

Of course, we would do this for a training set with many sequences, as usual.

This is similar to a classifier where the class we are trying to predict is the class of the next input
element.

The only "trick" is that, at step $t$, the RNN may output the "wrong" value $\hat{\x} \ne \x_{(t+1)}$.
If we fed the wrong $\hat{\x}$ as the next input to the RNN during training, the error would carry into
every later step, making it hard for the RNN to learn.

Instead, during *training*, the next input to the RNN (the input at step $(t+1)$)
- is **forced** to be the correct value, regardless of what the RNN predicted at step $t$

$$
\x_{(t+1)} \text{ rather than } \hat{\x}
$$

This type of supervised learning is called *teacher forcing*.

During *inference* time, we feed back as input whatever the generated output is.
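
Constructing the training pairs amounts to shifting the sequence by one position. A small sketch on a character sequence (the helper name is made up for illustration):

```python
def teacher_forcing_pairs(seq):
    """Inputs are x_1..x_{T-1}; the target at each step is the next element x_{t+1}."""
    inputs = seq[:-1]
    targets = seq[1:]
    return inputs, targets


inputs, targets = teacher_forcing_pairs(list("hello"))
for x_t, target in zip(inputs, targets):
    # During training the next input is the true x_{t+1} (the target),
    # regardless of what the model actually predicted at step t.
    print(f"input {x_t!r} -> target {target!r}")
```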

### Sampling
We have described a deterministic generation process.

This would be pretty boring, lacking variety
- as well as being problematic for generalization
- we would be encouraging
the RNN to memorize inputs.

In producing the single output, what is really happening is
- our classifier has one logit per
class
- we arbitrarily decide to pick the largest.

But we can properly view the (post-softmax) output
- as a probability vector (elements sum to $1$).

Instead of choosing the class with maximum probability
- we can sample from the probability space
defined by this vector
- e.g., if the probability for class $c_1$ is twice as great as that for class
$c_2$, the probability of sampling $c_1$ would be twice as great.  

There is still some chance
that $c_2$ is sampled, unlike the deterministic case.

If we do this, the generator can create output sequences different from any training example.
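
The difference between picking the largest class and sampling can be sketched in NumPy. The three-class probability vector is made up for illustration; the logits are chosen so that the softmax reproduces it.

```python
import numpy as np


def softmax(logits):
    """Convert logits to a probability vector (shifted by the max for numerical stability)."""
    z = np.exp(logits - np.max(logits))
    return z / z.sum()


rng = np.random.default_rng(0)
logits = np.log(np.array([0.6, 0.3, 0.1]))    # chosen so that softmax gives [0.6, 0.3, 0.1]
p = softmax(logits)

greedy = int(np.argmax(p))                          # deterministic: always class 0
samples = rng.choice(len(p), size=10_000, p=p)      # stochastic: class k drawn with probability p[k]
freq = np.bincount(samples, minlength=len(p)) / len(samples)
print(greedy, freq)
```

The empirical frequencies in `freq` should be close to `p`: class 0 is drawn about twice as often as class 1, but the less likely classes still appear.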

<table>
    <tr>
        <th><center>Training, with Teacher Forcing</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_teacher_forcing_training.jpg" width=800></td>
    </tr>
</table>

<table>
    <tr>
        <th><center>Test time: no forcing</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_teacher_forcing_inference.jpg" width=800></td>
    </tr>
</table>

## Generating strange things

We haven't specified what each element in sequence $\x$ is.

For text, $\x_\tp$ could be either a character or a word, for example.

You'll be surprised how successful an RNN can be when its task is to consume sequences of characters
and predict the next character.

Although it hasn't been explicitly programmed to generate valid words, punctuation, etc.,
it tends to produce realistic text !

Another interesting fact: these "character RNN's" also learn semantically meaningful constructs
- the need for nested things to match
    - multi-level parenthetical phrases, e.g., "(this is (very important) I think)"
    - opening/closing markup
    - indentation/un-indentation of code blocks

This suggests that the hidden state may be learning to "count" certain concepts.

As we will see in a visualization of the hidden state, and in how LSTM's work, this may in fact be true.

RNN's of this type were quite popular and have been used to generate
- Fake [Shakespeare](http://karpathy.github.io/2015/05/21/rnn-effectiveness/#shakespeare), or fake politician-speak
- Fake code 
- Fake [math textbooks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/#algebraic-geometry-latex)
- [Click bait headline generator](http://clickotron.com/about)


In [3]:
print("Done")

Done
