In [1]:
%run Latex_macros.ipynb


# Dealing with Sequences: Recurrent Neural Network (RNN) layer

For a function that takes 
sequence $\x^\ip$ as input
and creates sequence $\y$ as  output we had two choices for implementing the function.

The RNN implements the function as a "loop"
- A function taking **a single** $\x_\tp$ as input at a time
- Outputting $\y_\tp$ 
- Using a "latent state" $\h_\tp$  to summarize the prefix $\x_{(1\ldots \tt)}$
- Repeat in a loop over $\tt$

$$
\begin{array}{ll}
\pr{\h_\tp | \x_\tp, \h_{(\tt-1)} } & \text{latent variable } \h_\tp \text{ encodes } [ \x_{(1)} \dots \x_\tp ]\\
\pr{\y_\tp | \h_\tp }              & \text{prediction contingent on latent variable} \\
\end{array}
$$
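
The loop can be sketched in a few lines of plain Python. This is a minimal, deterministic illustration rather than a definitive implementation: the `tanh` activation, the output matrix `W_hy`, the helper `matvec`, and all toy weights are assumptions made for the example.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_forward(xs, W_xh, W_hh, b_h, W_hy):
    """Run the loop over the sequence xs; return all outputs and the final latent state."""
    h = [0.0] * len(b_h)                      # initial latent state
    ys = []
    for x in xs:                              # a single x_(t) at a time
        pre = [a + b + c for a, b, c in zip(matvec(W_xh, x), matvec(W_hh, h), b_h)]
        h = [math.tanh(p) for p in pre]       # h_(t) summarizes the prefix x_(1..t)
        ys.append(matvec(W_hy, h))            # y_(t) depends only on h_(t)
    return ys, h

# Toy sizes (illustrative): inputs of length 2, latent state of length 3, outputs of length 1
W_xh = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
W_hh = [[0.1, 0.0, 0.2], [0.0, 0.1, -0.1], [0.3, -0.2, 0.1]]
b_h = [0.0, 0.1, -0.1]
W_hy = [[1.0, -1.0, 0.5]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

ys, h_final = rnn_forward(xs, W_xh, W_hh, b_h, W_hy)
print(len(ys), len(h_final))
```

Because $\h_\tp$ is the only thing carried between iterations, running the loop on a prefix of `xs` reproduces the corresponding prefix of `ys`.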
    
<br>
<div>
    <center><strong>Loop with latent state</strong></center>
    <img src="images/RNN_arch_loop.png" width=70%>
</div>

"Unrolling" the loop makes it equivalent to a multi-layer network

<br>
<table>
    <tr>
        <th><center>RNN unrolled</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_API_many_to_many.jpg"></td>
    </tr>
</table>

# Transformer layer

The alternative to the loop was to create a "direct function"
- Taking a **sequence** $\x_{(1 \dots \tt)}$ as input
- Outputting $\y_\tp$

<br>
<div>
    <center><strong>Direct function</strong></center>
    <img src="images/RNN_arch_parallel.png" width=50%>
</div>

In order to output the sequence $\y_{(1)} \ldots \y_{(T)}$ we
create $T$ copies of the function (one for each $\y_\tp$)
- computes each $\y_\tp$ in **parallel**, not sequentially as in the loop

<br>
<div>
    <center><strong>Direct function, in parallel (masked input)</strong></center>
<img src="images/Transformer_parallel_masked.png" width=50%>
</div>

The parallel units constitute a *Transformer layer*

<br>
<table>
    <tr>
        <th><center>Transformer layer (masked)</center></th>
    </tr>
    <tr>
        <td><img src="images/Transformer_1.png"></td>
    </tr>
</table>

Compared to the unrolled RNN, the Transformer layer
- Has **no** data (e.g., $\h_\tp$) passing between the computations of successive time steps (e.g., from $\tt$ to $(\tt +1)$)
- Takes a **sequence** $\x_{(1..t)}$ as input
    - Because $\y_\tp$ is computed as a *direct* function of the prefix $\x_{(1..t)}$ rather than recursively
- Outputs generated in parallel, not sequentially
- No gradients flowing backward over time

With this architecture, we can compute more general functions
- where each $\y_\tp$ depends on the entire $\x_{(1 \ldots T)}$ rather than a prefix $\x_{(1 \ldots \tt)}$

<br>
<div>
    <center><strong>Direct function, in parallel (un-masked input)</strong></center>
<img src="images/Transformer_parallel.png" width=50%>
</div>


<table>
    <tr>
        <th><center>Transformer layer</center></th>
    </tr>
    <tr>
        <td><img src="images/Transformer_2.png"></td>
    </tr>
</table>

To illustrate the different types of functions:
- In a "predict the next" element function, we restrict the input to a causal prefix
- In a "summarize" the sequence function, we allow access to the full sequence
    - Context Sensitive Encoding of a word within a sentence
    - Same effect as a bi-directional RNN


For generality
- The Transformer input is usually the entire sequence $\x_{(1 \ldots T)}$ 
- When we need to "hide" part of the sequence, we can use an **input mask**
- The mask can be arbitrary
- The particular input mask restricting to a prefix implements **causal** masking
    - Can't "look into the future inputs"

## Self Attention

If we look inside the box computing the direct function, we will find several layers
- An Attention layer
    - To influence which elements of the input sequence $\x$ to attend to (focus on) when outputting $\y_\tp$
- A Feed Forward Network (FF) layer to compute the function

<br>
<table>
    <tr>
        <th><center>Transformer Layer (Encoder)</center></th>
    </tr>
    <tr>
        <td><img src="images/Transformer_Encoder.png" width=60%></td>
    </tr>
</table>

An Attention layer that attends to (focuses on) its own inputs implements what is called *Self-attention*
- We will soon see the possibility of attending to other values

If the function for $\y_\tp$ is restricted to prefix $\x_{(1 \ldots \tt)}$ the Attention layer can use causal masking.

This is referred to as *Masked Self-Attention*.
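
The following is a minimal sketch of Self-attention with optional causal masking, written in plain Python. It assumes scaled dot-product scoring and, to keep it short, uses the inputs themselves as queries, keys and values (a real layer would first apply learned projections). The function names are illustrative.

```python
import math

def softmax(scores):
    """Convert scores to weights that sum to 1; a score of -inf gets weight 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs, causal=False):
    """Each output is a weighted average of the inputs; weights come from dot products."""
    d = len(xs[0])
    outputs, weights = [], []
    for t, query in enumerate(xs):
        scores = []
        for s, key in enumerate(xs):
            if causal and s > t:
                scores.append(float("-inf"))   # hide future positions
            else:
                scores.append(sum(q * k for q, k in zip(query, key)) / math.sqrt(d))
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w_s * x[j] for w_s, x in zip(w, xs)) for j in range(d)])
    return outputs, weights

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
masked_out, masked_w = self_attention(xs, causal=True)
full_out, full_w = self_attention(xs, causal=False)
print(masked_w[0])   # the first position can only attend to itself
```

With `causal=True`, changing a later input leaves all earlier outputs unchanged; with `causal=False`, every output depends on the whole sequence.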

## Cross Attention

It is common to use two Transformers in an Encoder-Decoder configuration.

Recall the Encoder-Decoder architecture (using RNNs rather than Transformers in the diagram)

<table>
    <tr>
        <th><center><strong>Encoder-Decoder for language translation</strong></center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_API_Encoder_Decoder_Language_Translation.png" width=80%></td>
    </tr>
</table>

The Decoder in the Encoder-Decoder architecture is *generative*
- Outputs $\hat \y_\tp$ for a single $\tt$ at a time
- Appending output $\hat \y_\tp$ to the input available to output the next $\hat \y_{(\tt+1)}$
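
A sketch of this generative loop, assuming greedy selection of one output per step. The names `generate`, `next_token` and `end_token` are hypothetical, and `toy_next_token` is a stand-in for a trained Decoder.

```python
def generate(next_token, prompt, max_len, end_token):
    """Autoregressive loop: each output is appended to the input for the next step."""
    sequence = list(prompt)
    while len(sequence) < max_len:
        y_hat = next_token(sequence)   # predict from everything available so far
        sequence.append(y_hat)         # the output becomes part of the next input
        if y_hat == end_token:
            break
    return sequence

def toy_next_token(prefix):
    """Stand-in for a trained Decoder: simply counts upward."""
    return prefix[-1] + 1

print(generate(toy_next_token, prompt=[0], max_len=10, end_token=5))
# [0, 1, 2, 3, 4, 5]
```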


The Encoder in the Encoder-Decoder architecture creates a latent state $\bar \h_\tp$ which summarizes the input prefix $\x_{(1 \ldots \tt)}$.

In the above diagram the Decoder only has access to $\bar\h_{(\bar T)}$,
the final latent state
- summarizing the entire input sequence

This is very restrictive, forcing $\bar \h_{(\bar T)}$ to encode a lot of information.

But we motivated Attention by suggesting that the Decoder have access to *each* $\bar \h_{(\tt')}$ for $1 \le \tt' \le \bar T$
- and use the Attention mechanism to decide which $\bar \h_{(\tt')}$ to focus on when generating $\hat \y_\tp$

<table>
    <tr>
        <th><center>Decoder: Attention</center></th>
    </tr>
    <tr>
        <td><img src="images/Decoder_attention.png" width=80%></td>
    </tr>
</table>

Thus the Decoder Transformer can also attend to the output of the Encoder.

This is called *Cross Attention* (Encoder-Decoder Attention).
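
A minimal sketch of the difference from Self-attention: in Cross Attention the queries come from the Decoder while the keys and values come from the Encoder. Scaled dot-product scoring without learned projections is assumed here, and the toy state vectors are made up for illustration.

```python
import math

def attention(queries, keys, values):
    """For each query, return a weighted average of values (weights from query-key dot products)."""
    d = len(keys[0])
    contexts, all_weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        all_weights.append(weights)
        contexts.append([sum(w * v[j] for w, v in zip(weights, values))
                         for j in range(len(values[0]))])
    return contexts, all_weights

# Toy Encoder latent states (3 input positions) and Decoder states (2 output positions)
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[2.0, 0.0], [0.0, 2.0]]

# Cross Attention: Decoder queries, Encoder keys and values
context, weights = attention(decoder_states, encoder_states, encoder_states)
print(len(context), len(weights[0]))   # one context per Decoder position, one weight per Encoder state
```

Self-attention is the special case in which `queries`, `keys` and `values` all come from the same sequence.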


<table>
    <tr>
        <th><center>Transformer Layer (Decoder)</center></th>
    </tr>
    <tr>
        <td><img src="images/Transformer_Decoder.png" width=70%></td>
    </tr>
</table>

<table>
    <tr>
        <th><center>Transformer Layer (Encoder/Decoder)</center></th>
    </tr>
    <tr>
        <td><img src="images/Transformer_Encoder_Decoder_2.png" width=70%></td>
    </tr>
</table>

**Explanation of diagram**
- The Encoder uses Self-attention (<span style="color:green">wide Green arrow</span>) to attend to input sequence $\x$
- The Decoder uses Masked Self-attention (<span style="color:red">wide Red arrow</span>) to attend to its input
    - Its input is the prefix of the output sequence $\y$
    - Limited to prefix of length $\tt$ by **masking**
- The Decoder uses Cross Attention (between Encoder and Decoder) <span style="color:blue">(wide Blue arrow)</span>
    - To enable the Decoder to choose which Encoder latent state $\bar \h_\tp$ to attend to
- The dotted <span style="color:blue">(thin Blue arrow)</span> indicates that the output $\hat \y_\tp$ is appended to the input that is available when generating $\hat \y_{(\tt+1)}$
  

## Stacked Transformer

Just as with many other layer types (e.g., RNN), we may stack Transformer layers.
- Each layer creating alternate representations of the input of increasing complexity

In fact, stacking $N > 1$  Transformer layers is typical.

$N = 6$ was the choice of the original paper.

<table>
    <tr>
        <th><center>Stacked Transformer Layers (Encoder/Decoder)</center></th>
    </tr>
    <tr>
        <td><img src="images/Transformer_Encoder_Decoder_multi.png" width=70%></td>
    </tr>
</table>

## Full Encoder-Decoder Transformer architecture

There are other components of the Encoder and Decoder that we have yet to describe.

We will do so briefly.

(The Transformer was introduced in the paper [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf).)
   
<table>
    <tr>
        <th><center>Transformer (Encoder/Decoder)</center></th>
    </tr>
    <tr>
        <td><img src="images/Attention_is_all_u_need_Transformer.png" width=60%></td>
    </tr>
</table>

**Embedding layers**

We will motivate and describe Embeddings in the NLP module.

For now:
- an embedding is an encoding of a categorical value that is shorter than a one-hot encoding (OHE)

It is used in the Transformer to
- encode the input sequence of words
- encode the output sequence of words

**Positional Encoding**

The inputs are ordered (i.e., sequences) and thus describe a relative ordering relationship between elements.

But inputs to most layer types (e.g., Fully Connected) are unordered.

The Positional Encoding is a way of encoding the relative ordering of elements.
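
One concrete scheme is the sinusoidal encoding used in the original paper, sketched below. Other choices (e.g., learned position embeddings) are possible, so treat this as one example rather than the only method; the function name is illustrative.

```python
import math

def positional_encoding(T, d):
    """Sinusoidal encoding: one vector of length d for each of T positions."""
    pe = []
    for pos in range(T):
        row = []
        for j in range(d):
            # dimensions 2i and 2i+1 share the same frequency
            angle = pos / (10000 ** (2 * (j // 2) / d))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = positional_encoding(T=3, d=4)
print(pe[0])   # [0.0, 1.0, 0.0, 1.0]
```

In the original paper, the vector for position `pos` is added element-wise to the embedding of the element at that position, so that otherwise identical elements at different positions get different representations.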

**Add and Norm**

We have seen each of these layer types before
- Norm: a Normalization layer (the original paper uses Layer Normalization rather than Batch Normalization)
- Add: the residual (skip) connection that adds the input of a sub-layer to its output
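
A minimal sketch of the combination, assuming Layer Normalization as in the original paper and omitting the learned scale and shift parameters for brevity. The function names and the toy sub-layer are illustrative.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize a single vector to (approximately) zero mean and unit variance."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def add_and_norm(x, sublayer):
    """Residual connection followed by normalization: Norm(x + sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

# Toy sub-layer standing in for Attention or the Feed Forward Network
out = add_and_norm([1.0, 2.0, 3.0, 4.0], lambda v: [0.5 * a for a in v])
print(out)
```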

The diagram shows an Encoder-Decoder pair.

You will notice that each element of the pair is different.

- It is possible to use each element independently as well.

- But first we need to understand
the source of the differences and their implications.



## Encoder-style Transformer

The Transformers used for the Encoder and the Decoder of an Encoder-Decoder Transformer are slightly different.

They can also be used individually as well as in pairs.

It's important to understand the differences in order to know when to use each individually.

The Encoder side of the pair **does not** restrict the order in which its inputs are accessed.
- Self-attention **without** causal masking

So the Encoder is appropriate for tasks that require a context-sensitive representation of
each input element.

For example: the meaning of the word "**it**" changes with a small change to a subsequent word in the following sentences:
- "The animal didn't cross the road because **it** was too tired"

- "The animal didn't cross the road because **it** was too wide"

Some tasks with this characteristic are
- Sentiment
- Masked Language Modeling: fill-in the masked word
- Semantic Search
    - compare a summary of the sequence that is the context-sensitive representation of
        - query sentence
        - document sentences
    - Each summary is a kind of **sentence embedding**
    - Summary
        - pooling over each word
        - final token
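
A sketch of the Semantic Search idea using the pooling option: average the per-word representations into a sentence embedding, then compare embeddings. Cosine similarity is an assumed choice of comparison, and the token vectors below are made-up stand-ins for Encoder outputs.

```python
import math

def mean_pool(token_vectors):
    """Summarize a sentence by averaging its per-word representations."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical context-sensitive representations (one vector per word)
query_tokens = [[1.0, 0.0], [0.8, 0.2]]
doc_a_tokens = [[0.9, 0.1], [1.0, 0.0], [0.7, 0.3]]
doc_b_tokens = [[0.0, 1.0], [0.1, 0.9]]

query = mean_pool(query_tokens)
sim_a = cosine_similarity(query, mean_pool(doc_a_tokens))
sim_b = cosine_similarity(query, mean_pool(doc_b_tokens))
print(sim_a > sim_b)   # document A is the closer match to the query
```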
        

## Decoder-style Transformer

One notable aspect of the Decoder is its recurrent (generative) architecture
- Output $\y_{(\tt-1)}$ is appended to the Decoder inputs available at step $\tt$.
    - The Decoder inputs are $\y_{(1..T)}$, where $T$ is the full length of the Decoder output
    - **But** Causal Masking ensures that only $\y_{(1..\tt-1)}$ (the outputs already generated) is *available* at step $\tt$.
    
Thus, the Decoder is appropriate for *generative* tasks
- Text generation
- Predict the next word in a sentence

# Advantages of a Transformer compared to an RNN

One of the most important advantages of the Transformer over an RNN is its ability to capture long-term dependencies
- All elements of the sequence are processed in parallel, so each output has direct access to every input
    - no vanishing gradient or truncated back propagation
    
This has made the Transformer the architecture of choice for NLP.


The computational advantages (detailed in the next section) are many:
- Time: All steps computed in parallel
    - $O(1)$ sequential steps versus $O(T)$
- Fewer operations: faster training
    - $O( T^2 * d )$ versus $O(T * d^2)$, where $d$ is size of latent state and length of a single input element
        - e.g., $\x_\tp$ replaced by an embedding of dimension $d$
    - Transformer has fewer operations when $T \lt d$
- Similar number of parameters 
    - When $T < \sqrt{d}$: Self attention has about the same number of parameters

Note that, because of Truncated Back Propagation Through Time (TBPTT), $T$ is the length of a *chunk* rather than the full input length
- Typical $T = 64, d \ge 256$
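
As a worked check of the operation counts with these typical values, taking $d = 256$ (the lower end of the stated range) as an illustrative assumption:

$$
\begin{array}{lll}
\text{Self-attention:} & T^2 * d = 64^2 * 256 & = 1{,}048{,}576 \\
\text{RNN:} & T * d^2 = 64 * 256^2 & = 4{,}194{,}304 \\
\end{array}
$$

The ratio of the two counts is $d / T = 4$, so with these values Self-attention needs roughly a quarter of the operations.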

So in the special case (which applies to sequences) where chunk length is short relative to representation size,
it is not "crazy" to process each element of $\x$ with a separate FC.

The faster training enables
- larger datasets
- deeper models

## Detailed computational comparison of architectures

| Type | Parameters  | Operations &nbsp; &nbsp; | Path length |
|------|------       |------      |------       |
|  CNN | $k * d^2$   | $T * k * d^2$ | $T/k$ |
| RNN  | $d^2$       | $T * d^2$     | $T$ |
| Self-attention | $T^2 *d $ | $T^2 *d$ | 1 |


Here are the details of the math

Attention involves a dot product (of vectors of length $d$)
- Each input matched against all others: $T * T$
- So $T^2 *d$ operations

RNN
- $T$ sequential steps: path length $T$ 
- Each step evaluates
    $$
\h_\tp  =  \phi(\W_{xh}\x_\tp  + \W_{hh}\h_{(\tt-1)}  + \b_h) 
$$
- $\h_\tp$ has multiple elements, assume $| \h | = O(d)$
    - Computing updated hidden state element $j$ (i.e., $\h_{\tp, j}$) involves a dot product of vectors of length $d$ (the size of $\x_\tp$)
    - $d$ multiplications per element of $\h$, times $O(d)$ elements of $\h$ is $O(d^2)$ per step
    - So $T * d^2$ operations
    
- $\W_{hh}$ matrix: $d^2$ parameters
  - $| \h | = d$

CNN
- path length $O(T/k)$
    - each kernel multiplication connects only $k$ elements of $\x$
    - so connecting elements that are $T$ apart requires a stack of about $T/k$ layers
        - can reduce to $\log_k(T)$ with a tree structure (dilated convolutions)
- Parameters
    - kernel size $k$
    - number of input channels = number of output channels = $d$
    - $k *d$ parameters for kernel of one channel
    - $k * d^2$ parameters for kernel for all $d$ output channels
    
- Operations
    - for a single output channel: $k$ per input channel
        - There are $d$ input channels, so $k *d$ for each dot product of *one* output channel
        - There are $d$ output channels, so $k * d^2$ per time step
    - $T$ time steps so $T * k * d^2$ number of operations
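
The operation counts derived above can be compared numerically. This sketch simply evaluates the three formulas from the Operations column; the kernel size `k = 3` and the particular values of `T` and `d` are illustrative assumptions.

```python
def operations(T, d, k=3):
    """Operation counts per layer, using the formulas from the Operations column."""
    return {
        "CNN": T * k * d ** 2,
        "RNN": T * d ** 2,
        "Self-attention": T ** 2 * d,
    }

d = 256
for T in (64, 256, 1024):
    ops = operations(T, d)
    cheaper = ops["Self-attention"] < ops["RNN"]
    print(f"T={T:5d}  attention={ops['Self-attention']:>12,d}  "
          f"rnn={ops['RNN']:>12,d}  attention cheaper: {cheaper}")
```

The crossover sits at $T = d$: below it Self-attention performs fewer operations than the RNN, above it more.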


# A free lunch ? Almost !

Transformers offer the possibility of great improvements in training speed
- Parallelism
- Fewer operations
    
Sounds too good to be true.  Is there such a thing as a free lunch ?

Almost
- RNN can handle sequences of arbitrary length ($T$ unbounded)
- Transformer has a fixed number of parallel units, which limits the length of sequences

But, in practice: RNN uses *Truncated* Back Propagation Through Time
- So the maximum distance between input sequence elements is bounded by $k$, the truncation length
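
To make the truncation concrete, here is a sketch of how a long sequence might be cut into chunks of length $k$. The helper name `chunks` is hypothetical, and the sketch shows only the splitting, not the training step applied to each chunk.

```python
def chunks(sequence, k):
    """Split a sequence into consecutive chunks of at most k elements."""
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]

print(chunks(list(range(10)), 4))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Gradients are propagated only within a chunk, so two elements in different chunks are never directly related during training.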

## Some other advantages

- Can learn long-range dependencies
    - Gradients within a layer don't flow backwards: always a single step
        - Can't vanish or explode
    - The output $\y^{[\ll]}_\tp$ of layer $\ll$ (for stacked Transformer layers) is a function of **all** inputs
    $$
    \y^{[\ll-1]}_{(\tt')} \text{ for } 1 \le \tt' \le T
    $$
        - so can directly access a distant input
        - not diminished by passing through multiple intermediate time steps


## Some drawbacks

- The output $\y^{[\ll]}_\tp$ of layer $\ll$ (for stacked Transformer layers) is a function of **all** inputs, **always**
    - Perhaps less efficient
- Unless you add positional encoding, you lose ordering relationships between inputs
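
The loss of ordering can be demonstrated directly: without positional information, reversing the inputs of a Self-attention layer just reverses its outputs. This is a minimal sketch assuming scaled dot-product attention with no learned projections and no mask.

```python
import math

def self_attention(xs):
    """Un-masked Self-attention using the inputs as queries, keys and values."""
    d = len(xs[0])
    outputs = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        outputs.append([sum(e / total * x[j] for e, x in zip(exps, xs)) for j in range(d)])
    return outputs

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(xs)
out_reversed = self_attention(xs[::-1])

# The output for each element is the same (up to rounding) wherever it appears
same = all(abs(a - b) < 1e-9
           for row_a, row_b in zip(out, out_reversed[::-1])
           for a, b in zip(row_a, row_b))
print(same)
```

The layer treats its input as a set rather than a sequence, which is why position must be injected explicitly.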
