In [1]:
%run Latex_macros.ipynb


# Dealing with Sequences: Recurrent Neural Network (RNN) layer

For a function that takes
sequence $\x^\ip$ as input
and creates sequence $\y$ as output, we had two choices for implementing the function.

The RNN implements the function as a "loop"
- A function taking **a single** $\x_\tp$ as input at a time
- Outputting $\y_\tp$ 
- Using a "latent state" $\h_\tp$  to summarize the prefix $\x_{(1\ldots \tt)}$
- Repeat in a loop over $\tt$

$$
\begin{array}{ll}
\pr{\h_\tp | \x_\tp, \h_{(\tt-1)} } & \text{latent variable } \h_\tp \text{ encodes } [ \x_{(1)} \dots \x_\tp ]\\
\pr{\y_\tp | \h_\tp }              & \text{prediction contingent on latent variable} \\
\end{array}
$$
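
The loop can be sketched in a few lines of NumPy. This is only an illustrative sketch: the choice of `tanh` for the activation, the output weights `W_hy` and `b_y`, and all sizes are assumptions made for the example.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h, W_hy, b_y):
    """Run a simple RNN over a sequence, one element at a time.

    xs has shape (T, n_features).
    Returns latent states of shape (T, n_hidden) and outputs of shape (T, n_out).
    """
    h = np.zeros(W_hh.shape[0])                     # initial latent state h_(0)
    hs, ys = [], []
    for x_t in xs:                                  # loop over time steps t = 1..T
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)    # h_(t) from x_(t) and h_(t-1)
        y = W_hy @ h + b_y                          # y_(t) depends only on h_(t)
        hs.append(h)
        ys.append(y)
    return np.stack(hs), np.stack(ys)

# Toy sizes and random weights (assumptions for illustration)
rng = np.random.default_rng(0)
T, n_features, n_hidden, n_out = 5, 3, 4, 2
xs = rng.normal(size=(T, n_features))
W_xh = rng.normal(size=(n_hidden, n_features))
W_hh = rng.normal(size=(n_hidden, n_hidden))
W_hy = rng.normal(size=(n_out, n_hidden))
b_h, b_y = np.zeros(n_hidden), np.zeros(n_out)

hs, ys = rnn_forward(xs, W_xh, W_hh, b_h, W_hy, b_y)
print(hs.shape, ys.shape)
```

Because each latent state is built only from the current input and the previous latent state, changing the last input leaves every earlier latent state unchanged.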
    
<br>
<div>
    <center><strong>Loop with latent state</strong></center>
    <img src="images/RNN_arch_loop.png" width=70%>
</div>

"Unrolling" the loop makes it equivalent to a multi-layer network

<br>
<table>
    <tr>
        <th><center>RNN unrolled</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_API_many_to_many.jpg"></td>
    </tr>
</table>

# Transformer: Encoder-style

The alternative to the loop was to create a "direct function"
- Taking a **sequence** $\x_{(1 \dots \tt)}$ as input
- Outputting $\y_\tp$

<br>
<div>
    <center><strong>Direct function</strong></center>
    <img src="images/RNN_arch_parallel.png" width=50%>
</div>

In order to output the sequence $\y_{(1)} \ldots \y_{(T)}$ we
create $T$ copies of the function (one for each $\y_\tp$)
- computes each $\y_\tp$ in **parallel**, not sequentially as in the loop

<br>
<div>
    <center><strong>Direct function, in parallel (masked input)</strong></center>
<img src="images/Transformer_parallel_masked.png" width=50%>
</div>

The parallel units constitute a *Transformer Encoder*

<br>
<table>
    <tr>
        <th><center>Transformer Encoder (causal masked input)</center></th>
    </tr>
    <tr>
        <td><img src="images/Transformer_1.png"></td>
    </tr>
</table>

Compared to the unrolled RNN, the Transformer Encoder
- Takes a **sequence** $\x_{(1 \ldots \tt)}$ as input
    - Because $\y_\tp$ is computed as a *direct* function of the prefix $\x_{(1 \ldots \tt)}$ rather than recursively
- Has **no** latent state: output is a direct function of the input sequence
- Passes **no** data (e.g., $\h_\tp$) between the computations for different time steps (e.g., from $\tt$ to $(\tt +1)$)
- Generates outputs in parallel, not sequentially
- Has no gradients flowing backward over time

With this architecture, we can compute more general functions than the RNN
- where each $\y_\tp$ depends on the entire $\x_{(1 \ldots T)}$ rather than a prefix $\x_{(1 \ldots \tt)}$

<br>
<div>
    <center><strong>Direct function, in parallel (un-masked input)</strong></center>
<img src="images/Transformer_parallel.png" width=50%>
</div>


<table>
    <tr>
        <th><center>Transformer Encoder (unmasked input)</center></th>
    </tr>
    <tr>
        <td><img src="images/Transformer_2.png"></td>
    </tr>
</table>

An example of such a general function is the "meaning" of a word, in context.

| Sentence | Meaning of "it" |
|:----------|:-----------------|
| The animal didn't cross the street because **it** was too tired | the animal |
| The animal didn't cross the street because **it** was too wide  | the street |

The meaning of the word "it" is determined by a word that follows it ("tired" or "wide")

So even though the Transformer output at each position is a function of the entire sequence $\x_{(1 \ldots T)}$
- the output is different for each position $\tt$

We can control whether the input to Transformer element at position $\tt$ is
prefix $\x_{(1 \ldots \tt)}$
or
the entire sequence $\x_{(1 \ldots T)}$
by **masking** the input to element $\tt$
- no masking: the entire sequence is visible
- *causal masking*: only the prefix up to $\tt$ is visible: $\x_{(1 \ldots \tt)}$
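
The two masking choices can be written as Boolean matrices. In this small sketch (the names `no_mask` and `causal_mask` and the length `T = 4` are assumptions for illustration), row $\tt$ marks which input positions are visible to the element at position $\tt$.

```python
import numpy as np

T = 4  # toy sequence length (assumption for illustration)

# No masking: every position sees the entire sequence x_(1..T)
no_mask = np.ones((T, T), dtype=bool)

# Causal masking: position t sees only the prefix x_(1..t)
# (lower-triangular matrix, including the diagonal)
causal_mask = np.tril(np.ones((T, T), dtype=bool))

print(no_mask.astype(int))
print(causal_mask.astype(int))
```

In the causal matrix the first row has a single visible position and the last row has all $T$ of them.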

# Technical clarifications

When we introduced the RNN, at each step $\tt$ of the loop, we defined two outputs
$\h_\tp$
and
$\y_\tp$.

In general, we only need to output $\h_\tp$
- $\y_\tp$ can be defined as a further processing of $\h_\tp$

Henceforth, we will assume the style of a single output $\h_\tp$

The reason for doing this:
- We can "stack" $N$ Transformer layers (just as we can stack RNN layers)
- The output of the non-top layer $j$ is $\h^{[j]}_\tp$, not the final $\y_\tp$
- We identify $\y_\tp$ as the output of the top layer $\h^{[N]}_\tp$
    - perhaps after a further processing

Furthermore: 
    
Since the Encoder part is no longer a "loop"
- It is inaccurate to refer to the Encoder output $\bar \h_\tp$ as a "latent" state
- However, $\bar \h_\tp$ *is still* a summary of the input sequence
    - a summary of $\x_{(1 \ldots \tt)}$ when causal attention is used
    - a summary of $\x_{(1 \ldots \bar T)}$ otherwise
- Out of **bad habit** we may continue to erroneously refer to $\bar \h$ and $\h$ as "latent" states

# Inside the Transformer Encoder: Self Attention

If we look inside the box computing the direct function, we will find several layers
- An Attention layer
    - To influence which elements of the input sequence $\x$ to attend to (focus on) when outputting $\y_\tp$
- A Feed Forward Network (FF) layer to compute the function

<br>
<table>
    <tr>
        <th><center>Transformer Layer (Encoder)</center></th>
    </tr>
    <tr>
        <td><img src="images/Transformer_Encoder.png" width=60%></td>
    </tr>
</table>

An Attention layer that attends to (focuses on) its own inputs implements what is called *Self-attention*
- We will soon see the possibility of attending to other values

If the function for $\h_\tp$ is restricted to prefix $\x_{(1 \ldots \tt)}$ the Attention layer can use causal masking of the input sequence.

This is referred to as *Masked Self-Attention*.

The Feed Forward Network computes the function, given the elements of the input sequence
that are being attended to.
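
A minimal NumPy sketch of the two layers follows. It assumes scaled dot-product attention with projection matrices `W_q`, `W_k`, `W_v` and a two-layer ReLU Feed Forward Network; these names, the random weights, and the sizes are illustrative assumptions, not details taken from the diagram.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v, causal=False):
    """Self-attention over the rows of X (shape (T, d)).

    Queries, keys and values are all derived from the same input X.
    With causal=True, position t may only attend to positions <= t.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (T, T) query-key dot products
    if causal:
        T = X.shape[0]
        future = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)   # hide positions after t
    weights = softmax(scores)                        # attention weights, rows sum to 1
    return weights @ V, weights

def feed_forward(A, W_1, b_1, W_2, b_2):
    """Position-wise FFN: the same weights are applied to every row of A."""
    return np.maximum(0.0, A @ W_1 + b_1) @ W_2 + b_2

# Toy sizes and random weights (assumptions for illustration)
rng = np.random.default_rng(0)
T, d = 4, 8
X = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
W_1, W_2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))
b_1, b_2 = np.zeros(4 * d), np.zeros(d)

attended, weights = self_attention(X, W_q, W_k, W_v, causal=True)
H = feed_forward(attended, W_1, b_1, W_2, b_2)
print(np.round(weights, 2))   # entries above the diagonal are zero
print(H.shape)
```

Calling `self_attention` with `causal=False` gives the un-masked variant, in which every row of `weights` spreads over all $T$ positions.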

# Advantages of a Transformer compared to an RNN

As we will demonstrate in detail below
- The Transformer's operations can be performed in parallel versus sequentially for the RNN
- Gradients less likely to vanish or explode



We can **leverage** these advantages in complexity by
- Making a Transformer model bigger (e.g., more stacked Transformer layers)
- Making the sequence lengths longer
- Increasing the number of examples of training data

So, for the same time "cost" as an RNN, we can use a bigger Transformer on more data
- **Hence: we can learn more complex functions for similar time cost**

Moreover: the path length from the output to the input is constant in a Transformer, compared to $T$ in the RNN.

Recall that gradients tend to vanish or explode as the path length during Back Propagation increases.

So **Transformers are better able to capture long-range dependencies than an RNN**

This gives them an advantage in learning as well.

The price we pay for this is that the number of parameters of a Transformer is greater than a similar RNN
- By a factor of $T$
- The parameters for each time step of a Transformer are *independent*
- The parameters for each time step of an RNN are *shared*

We give the detailed math below.

## Number of sequential steps

The most obvious advantage of the "direct function" as opposed to the "loop" is
that outputs are computed in parallel versus sequentially.

For an input sequence of length $T$:
- The loop requires $T$ steps
- The direct function requires $1$ step

## Path length
The *Path length* is the distance that the Loss Gradient needs to travel backwards during Back Propagation.

At each step, the gradient is subject to being diminished or increased (Vanishing/Exploding gradients).

Since the Transformer operates in parallel across positions, this is $\OrderOf{1}$.

It is $\OrderOf{T}$ for the RNN due to the sequential computation.


**The constant path length is critical to the success of the Transformer**
- The query used for the input at position $\tt$ can access **all** prior positions $\tt' \le \tt$ at the same cost
    - Gradient not diminished
    - RNN
        - Gradient signal diminished for position $\tt' << \tt$
        - Truncated Back Propagation may kill the gradient flow from position $\tt$ back to $\tt'$ beyond truncation window

A key strength of the Transformer is that it enables learning long-range dependencies.
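
A rough numerical illustration of why path length matters: assume, as a deliberate simplification, that every step along the path scales the gradient by the same constant factor. The function name `gradient_scale` and the factors 0.9 and 1.1 are hypothetical choices for this sketch.

```python
def gradient_scale(per_step_factor, path_length):
    """Overall scaling of a gradient after travelling path_length steps,
    if every step multiplies it by per_step_factor (a simplifying assumption)."""
    return per_step_factor ** path_length

for T in (10, 100, 1000):
    rnn_vanish = gradient_scale(0.9, T)    # path length O(T): shrinks geometrically
    rnn_explode = gradient_scale(1.1, T)   # path length O(T): grows geometrically
    direct = gradient_scale(0.9, 1)        # path length O(1): independent of T
    print(T, rnn_vanish, rnn_explode, direct)
```

With a path length of $1$ the scaling does not depend on $T$ at all, whereas over $T$ steps it shrinks or grows geometrically.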

## Number of operations

What about the number of operations ? 

Let $d$ denote the length of the output of a Transformer
- i.e., $d = || \h_\tp ||$

When we examine the internals of the Transformer in precise detail
- We will discover additional layers
- The size of the output of each layer is also $d$

The Self Attention layer attends to (transformed) inputs
- each element assumed size of $d$

The keys and values of the CSM implementing Attention are the size $d$ input elements.
- Each attention lookup (dot product of query with a key) requires $d$ multiplications.
- There are $T$ key/value pairs in the CSM
- There are $T$ attention units (one for each position, outputting $\h_\tp$)

Thus: $\OrderOf{T^2 *d}$ multiplications.

What about the number of operations for an RNN computing the same function ?

The RNN outputs $\h_\tp$  of size $d$ (same as Transformer).
- In the RNN $\h_\tp$ is also the latent state

The RNN "loops" for $T$ steps.

Each step updates latent state $\h_\tp$ via the equation
$$
\h_\tp  =  \phi(\W_{xh}\x_\tp  + \W_{hh}\h_{(\tt-1)}  + \b_h)
$$
- $\x_\tp$ is also size $d$ (same assumption as for Transformer)
- The weight matrices $\W_{xh}$ and $\W_{hh}$ are of size $\OrderOf{d \times d}$

So each step involves $\OrderOf{d^2}$ multiplications.

For $T$ steps: 
 $\OrderOf{T * d^2}$ multiplications.

Transformer number of operations: $\OrderOf{T^2 * d}$

RNN number of operations: $\OrderOf{T * d^2}$

When $T \lt d$, the Transformer uses fewer operations compared to the RNN.

Typical values
- $d \ge 768$
- $T \lt d$ in typical RNN
    - remember: TBPTT divides the input sequence into shorter segments
- **but** $T > d$ in the most recent Transformer models
    - Path length is constant, so able to increase $T$ without fear of vanishing/exploding gradients
    - Can capture very long-term dependencies
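
The crossover at $T = d$ can be checked with a small calculation. The function names and the sample values of $T$ are assumptions for illustration; the counts ignore constant factors, as in the $\OrderOf{\cdot}$ expressions above.

```python
def transformer_operations(T, d):
    """Order-of-magnitude multiplication count for self-attention: T^2 * d."""
    return T * T * d

def rnn_operations(T, d):
    """Order-of-magnitude multiplication count for an RNN: T * d^2."""
    return T * d * d

d = 768
for T in (128, 768, 4096):
    t_ops = transformer_operations(T, d)
    r_ops = rnn_operations(T, d)
    # The ratio of the two counts is simply T / d
    print(f"T={T:5d}  transformer={t_ops:,}  rnn={r_ops:,}  ratio={t_ops / r_ops:.2f}")
```

The ratio of the two counts is $T / d$, so the Transformer count is smaller exactly when $T \lt d$.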

## Number of parameters

Multi-head Attention transforms the keys and values of the CSM
from length $d$ to a fraction (determined by number of heads) of $d$.

So these matrices are of size $\OrderOf{d^2}$.

The matrices for each position $\tt$ are not shared.

With $T$ positions, the total number of parameters is $\OrderOf{T * d^2}$.

**Note**

With multi-head attention, $d_\text{head} = \frac{d}{n_\text{heads}}$

The matrices are thus of size $d_\text{head}^2$; there are $n_\text{heads}$ of these matrices for
total size $n_\text{heads} * d_\text{head}^2 = \frac{d^2}{n_\text{heads}}$

This is $\OrderOf{d^2}$ but with a smaller multiplicative constant
- Can explain the difference when trying to perform an exact calculation of number of parameters
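
As a worked instance of the arithmetic in this note, take $d = 768$ and assume $n_\text{heads} = 12$ (the head count and the function name are assumptions chosen for illustration).

```python
def multi_head_sizes(d, n_heads):
    """Return (d_head, size of one per-head matrix, total over all heads),
    using the d_head x d_head per-head matrix size stated in the note."""
    assert d % n_heads == 0, "d must be divisible by the number of heads"
    d_head = d // n_heads
    per_head = d_head ** 2
    total = n_heads * per_head
    return d_head, per_head, total

d, n_heads = 768, 12
d_head, per_head, total = multi_head_sizes(d, n_heads)
print(d_head, per_head, total, d * d // n_heads)   # 64 4096 49152 49152
```

The total matches $\frac{d^2}{n_\text{heads}}$, a factor of $n_\text{heads}$ smaller than $d^2$.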

The FFN layer (assuming a single Fully Connected layer) would also have $d^2$ parameters.
- These **are shared** across positions.
- So only $\OrderOf{d^2}$ parameters for the shared FFN.

Total number of parameters in the combined Attention + FFN layers: $\OrderOf{T * d^2}$.

We previously derived that the size of the weight matrices in the RNN are $\OrderOf{d^2}$.

The number of parameters in the Transformer is $\OrderOf{T * d^2}$.

So the number of parameters in the Transformer is larger by a factor of $T$.


## Complexity: summary

We also throw in a CNN for comparison

The detailed CNN math is given in a following section.

| Type | Parameters  | Operations  &nbsp; &nbsp; &nbsp; | Sequential steps | Path length
|:------|:---|:---|:---|:---|
|  CNN | $\OrderOf{k * d^2}$   | $\OrderOf{T * k * d^2}$ | $\OrderOf{T}$   | $\OrderOf{T}$ |
| RNN  | $\OrderOf{d^2}$       | $\OrderOf{T * d^2}$     | $\OrderOf{T}$    | $\OrderOf{T}$ |
| Self-attention | $\OrderOf{T *d^2} $ | $\OrderOf{T^2 *d}$ | $\OrderOf{1}$ | $\OrderOf{1}$ |

Reference:
- [Table 1 of Attention paper](https://arxiv.org/pdf/1706.03762.pdf#page=6)
- See [Stack overflow](https://stackoverflow.com/questions/65703260/computational-complexity-of-self-attention-in-the-transformer-model) for a correction of the number of Operations calculated in the paper

Here are the details of the math for the CNN

- Path length
    - each kernel multiplication connects only $k$ elements of $\x$
    - since kernels overlap inputs, can't parallelize, hence $\OrderOf{T/k}$ path length, which is $\OrderOf{T}$ for a fixed kernel size $k$
        - can reduce to $\log(T)$ with tree structure
- Parameters
    - kernel size $k$
    - number of input channels = number of output channels = $d$
    - $k *d$ parameters for kernel of one channel
    - $\OrderOf{k * d^2}$ parameters for kernel for all $d$ output channels
    
- Operations
    - for a single output channel: $k$ per input channel
        - There are $d$ input channels, so $k *d$ for each dot product of *one* output channel
        - There are $d$ output channels, so $k * d^2$ per time step
    - $T$ time steps so $\OrderOf{T * k * d^2}$ number of operations
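
The CNN counts above can be reproduced with a short calculation. The function names and the sample values $k = 3$, $d = 768$, $T = 512$ are assumptions for illustration.

```python
def cnn_parameters(k, d):
    """k * d weights per output channel, for d output channels."""
    per_output_channel = k * d
    return per_output_channel * d

def cnn_operations(T, k, d):
    """k * d^2 multiplications per time step, repeated for T time steps."""
    per_time_step = k * d * d
    return T * per_time_step

k, d, T = 3, 768, 512
print(cnn_parameters(k, d))
print(cnn_operations(T, k, d))
```

Note that the operation count is $T$ times the parameter count, since the same kernel is applied at every time step.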


# A free lunch ? Almost !

Transformers sound almost too good to be true
- Faster compute (through reduced number of Sequential steps)
- Constant Path Length
    - Better able to capture long range dependencies
    
Is there really such a thing as a free lunch ?

Almost.

The Transformer is potentially more expensive in some aspects
- more weights
- more operations (but the operations for each of the $T$ positions can occur in parallel)

Moreover: the RNN can handle sequences of arbitrary length ($T$ unbounded)
- Transformer has a fixed number of parallel units, which limits the length of sequences

But, in practice: RNN uses *Truncated* Back Propagation Through Time
- So the maximum distance between input sequence elements is bounded by $k$, the truncation length

## Some drawbacks

- The output $\y^{[\ll]}_\tp$ of layer $\ll$ (for stacked Transformer layers) is a function of **all** inputs, **always**
    - Perhaps less efficient
- Unless you add positional encoding, you lose ordering relationships between inputs

# Transformer: Decoder style

It is common to use two Transformers in an Encoder-Decoder configuration.

Recall the Encoder-Decoder architecture (using RNN's rather than Transformers in the diagram)

<table>
    <tr>
        <th><center><strong>Encoder-Decoder for language translation</strong></center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_API_Encoder_Decoder_Language_Translation.png" width=80%></td>
    </tr>
</table>

The Decoder in the Encoder-Decoder architecture is *generative*
- Outputs $\hat \y_\tp$ for a single $\tt$ at a time
- Appending output $\hat \y_\tp$ to the input available to output the next $\hat \y_{(\tt+1)}$


The Encoder in the Encoder-Decoder architecture creates a latent state $\bar \h_\tp$ which summarizes the input prefix $\x_{(1 \ldots \tt)}$.

In the above diagram the Decoder only has access to $\bar\h_{(\bar T)}$,
the final latent state
- summarizing the entire input sequence

This is very restrictive, forcing $\bar \h_{(\bar T)}$ to encode a lot of information.

But we motivated Attention by suggesting that the Decoder have access to *each* $\bar \h_\tp$ for $1 \le \tt \le \bar T$.
- and use the Attention mechanism to decide which $\bar \h_\tp$ to focus on when generating $\hat \y_\tp$

<table>
    <tr>
        <th><center>Decoder: Attention</center></th>
    </tr>
    <tr>
        <td><img src="images/Decoder_attention.png" width=80%></td>
    </tr>
</table>

The *Decoder Transformer*
- is similar to the Encoder Transformer in that both use Self-Attention to their own inputs
- differs in that it can also attend to the output of the Encoder.

Attending to the output of another model (e.g., Decoder attending to Encoder output) is called *Cross Attention* (Encoder-Decoder Attention).


<table>
    <tr>
        <th><center>Transformer Layer (Decoder)</center></th>
    </tr>
    <tr>
        <td><img src="images/Transformer_Decoder.png" width=70%></td>
    </tr>
</table>

The combined Encoder-Decoder Transformer diagram looks like this

<table>
    <tr>
        <th><center>Transformer Layer (Encoder/Decoder)</center></th>
    </tr>
    <tr>
        <td><img src="images/Transformer_Encoder_Decoder_2.png" width=70%></td>
    </tr>
</table>

**Explanation of diagram**
- The Encoder uses Self-attention (<span style="color:green">wide Green arrow</span>) to attend to input sequence $\x$
- The Decoder uses Masked Self-attention (<span style="color:red">wide Red arrow</span>) to attend to its input
    - Its input is the prefix of the output sequence $\y$
    - Limited to prefix of length $\tt$ by **masking**
- The Decoder uses Cross Attention (between Encoder and Decoder) <span style="color:blue">(wide Blue arrow)</span>
    - To enable the Decoder to choose which Encoder latent state $\bar \h_\tp$ to attend to
- The dotted <span style="color:blue">(thin Blue arrow)</span> indicates that the output $\hat \y_\tp$ is appended to the input that is available when generating $\hat \y_{(\tt+1)}$
  

Note that the Decoder is recurrent (generative)
- it generates a single output at a time
- unlike the Encoder, which generates all outputs (i.e., "encodings") in parallel

## Functional versus Sequential architecture

The architecture diagram is more complex than we have seen thus far.

In particular: data no longer strictly flows forward in a layer-wise arrangement !
- There are two independent sub-networks (Encoder and Decoder)
- Connection from the Encoder output to the middle of the Decoder (Cross-Attention)

Each of the Encoder and Decoder is an independent Functional model.
- not our familiar Sequential models

The Encoder-Decoder pair combination is also constructed as a Functional model.

Since we have not yet addressed Functional Models, you may not be prepared to completely grasp the totality.

But hopefully you can absorb the concepts even without fully understanding the details.

# Detailed Encoder-Decoder Transformer architecture

There are other components of the Encoder and Decoder that we have yet to describe.

We will do so briefly.

(The Transformer was introduced in the paper [Attention is all you Need](https://arxiv.org/pdf/1706.03762.pdf).)
   
<table>
    <tr>
        <th><center>Transformer (Encoder/Decoder)</center></th>
    </tr>
    <tr>
        <td><img src="images/Attention_is_all_u_need_Transformer.png" width=60%></td>
    </tr>
</table>

**Embedding layers**

We will motivate and describe Embeddings in the NLP module.

For now:
- an embedding is an encoding of a categorical value that is shorter than OHE

It is used in the Transformer to
- encode the input sequence of words
- encode the output sequence of words

**Positional Encoding**

The inputs are ordered (i.e., sequences) and thus describe a relative ordering relationship between elements.

But inputs to most layer types (e.g., Fully Connected) are unordered.

The Positional Encoding is a way of encoding the relative ordering of elements.

To represent the relative position of each element in the sequence,
- we can pair the input element  with an encoding of its position in the sequence.
$$
\langle \x_\tp, \text{encode}(\tt) \rangle
$$

The box labeled "Positional Encoding" creates $\text{encode}(\tt)$.

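The "+" combines the Input Embedding and Positional Encoding, playing the role of the pair $\langle \x_\tp, \text{encode}(\tt) \rangle$.
- In the original paper the combination is an element-wise addition of the two vectors, rather than a concatenation.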

If relative position is important, the NN can learn the values of  $\text{encode}(\tt)$.
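
One concrete choice for $\text{encode}(\tt)$ is the fixed sinusoidal encoding used in the original paper, which is added element-wise to the embedding. The sketch below assumes an even encoding size `d`; the function name and the random stand-in for the embeddings are illustrative assumptions.

```python
import numpy as np

def sinusoidal_encoding(T, d):
    """Fixed positional encoding of shape (T, d), for even d.

    Even columns hold sin(pos / 10000^(2i/d)), odd columns the matching cosine.
    """
    positions = np.arange(T)[:, None]                 # (T, 1): position index
    even_dims = np.arange(0, d, 2)[None, :]           # (1, d/2): the values 2i
    angles = positions / np.power(10000.0, even_dims / d)
    encoding = np.zeros((T, d))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding

T, d = 6, 8
encoding = sinusoidal_encoding(T, d)

# Random stand-in for the embedded input sequence (assumption for illustration)
embedded = np.random.default_rng(0).normal(size=(T, d))
with_position = embedded + encoding                   # element-wise addition
print(with_position.shape)
```

Each position receives a distinct encoding vector, so two identical input elements at different positions no longer look the same to the layers that follow.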

**Self Attention layers (Encoder and Decoder)**

The 3 arrows flowing into the Multi-Head Attention box
- are identical
- are the inputs (after being Embedded and having Positional Encoding added)

The Self-Attention layers for the Encoder and Decoder
- **differ in that the Decoder uses Causal Masking versus no-masking for the Encoder**
- Decoder can't "look ahead" at output $\y_{\tt'}$ for $\tt' \ge \tt$
    - it hasn't been generated yet at test time step $\tt$
    - it **is** available at training time (via Teacher Forcing)
        - but shouldn't look at it during training time, in order for training to be similar to test time
  

**Cross Attention layer (Decoder)**

The two arrows flowing from the Encoder output are the keys and values of the CSM

The arrow flowing from the Self Attention layer is the query
- The output of the Self Attention layer is the **query** used in Cross Attention
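
A minimal sketch of Cross Attention, assuming scaled dot-product attention and omitting the learned projection matrices for brevity. The names `decoder_states` and `encoder_outputs`, and the toy sizes, are assumptions for illustration.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_attention(decoder_states, encoder_outputs):
    """Queries come from the Decoder; keys and values are the Encoder outputs."""
    queries = decoder_states                    # (T_dec, d): output of Decoder Self Attention
    keys = values = encoder_outputs             # (T_enc, d): one row per Encoder position
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores)                   # (T_dec, T_enc): focus over Encoder positions
    return weights @ values, weights

rng = np.random.default_rng(0)
T_enc, T_dec, d = 6, 3, 8
encoder_outputs = rng.normal(size=(T_enc, d))
decoder_states = rng.normal(size=(T_dec, d))

context, weights = cross_attention(decoder_states, encoder_outputs)
print(context.shape, weights.shape)   # (3, 8) (3, 6)
```

Each row of `weights` says how strongly one Decoder position focuses on each of the Encoder outputs.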

**Add and Norm**

We have seen each of these layer types before
- Norm: a Normalization layer (Layer Normalization in the original paper, rather than Batch Normalization)
- Add: the part of the residual network that joins outputs of multiple previous layers
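
The Add and Norm step can be sketched as a residual addition followed by a normalization over the features of each position. This sketch assumes Layer Normalization without the learned scale and shift; the function names and sizes are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row (one position) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(layer_input, sublayer_output):
    """Add: residual (skip) connection; Norm: normalize the sum."""
    return layer_norm(layer_input + sublayer_output)

rng = np.random.default_rng(0)
T, d = 4, 8
layer_input = rng.normal(size=(T, d))
sublayer_output = rng.normal(size=(T, d))    # stand-in for an Attention or FFN output

out = add_and_norm(layer_input, sublayer_output)
print(out.shape)
```

The residual path lets the layer input bypass the Attention or Feed Forward sub-layer, so the output has the same shape as the input.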

The diagram shows an Encoder-Decoder pair.

You will notice that each element of the pair is different.

- It is possible to use each element independently as well.

- But first we need to understand
the source of the differences and their implications.



## How is the direct function computed ?

The Encoder uses self-attention
- So the keys and values of the CSM are derived directly from input sequence $\x_{(1 \ldots T)}$

During training, the Encoder
- learns a query, derived from input sequence $\x_{(1 \ldots T)}$
- learns weights for the Feed Forward Network

The Attention output 
- is equal to a weighted combination of CSM values
    - i.e., weighted sum of input elements

The Feed Forward Network transforms the Attention output into Encoder output $\bar \h_\tp$.

Similarly for the Decoder.

The Self-Attention layer CSM has keys and values that are incrementally constructed
from the outputs $\y_{(1 \ldots \tt)}$ that have been created from the first $\tt$ steps.

The Cross-Attention layer CSM has keys and values that are outputs $\bar \h_\tp$ of the Encoder.

During training, the Self-Attention layer outputs **the query** that is used for Cross Attention.

The query is created by self-attention to the inputs.

The Decoder **learns (from training)** 
- the Self-Attention query 
- the Cross Attention query
- the weights of the Feed Forward Network



## Stacked Transformer

Just as with many other layer types (e.g., RNN), we may stack Transformer layers.
- Each layer creating alternate representations of the input of increasing complexity

In fact, stacking $N > 1$  Transformer layers is typical.

$N = 6$ was the choice of the original paper.

<table>
    <tr>
        <th><center>Stacked Transformer Layers (Encoder/Decoder)</center></th>
    </tr>
    <tr>
        <td><img src="images/Transformer_Encoder_Decoder_multi.png" width=70%></td>
    </tr>
</table>

## Uses of  an Encoder-style Transformer

The Transformers used for the Encoder and the Decoder of an Encoder-Decoder Transformer are slightly different.

They can also be used individually as well as in pairs.

It's important to understand the differences in order to know when to use each individually.

The Encoder side of the pair **does not** restrict the order in which its inputs are accessed.
- Self-attention **without** causal masking

So the Encoder is appropriate for tasks that require a context-sensitive representation of
each input element.

For example: the meaning of the word "**it**" changes with a small change to a subsequent word in the following sentences:
- "The animal didn't cross the road because **it** was too tired"

- "The animal didn't cross the road because **it** was too wide"

Some tasks with this characteristic are
- Sentiment
- Masked Language Modeling: fill-in the masked word
- Semantic Search
    - compare a summary of the sequence that is the context-sensitive representation of
        - query sentence
        - document sentences
    - Each summary is a kind of **sentence embedding**
    - Summary
        - pooling over each word
        - final token
        

## Uses of a Decoder-style Transformer

One notable aspect of the Decoder is its recurrent (generative) architecture
- Output $\y_{(\tt-1)}$ is appended to the Decoder inputs available at step $\tt$.
    - The Decoder inputs are $\y_{(1..T)}$, where $T$ is the full length of the Decoder output
    - **But** Causal Masking ensures that only $\y_{(1..\tt)}$ is *available* at step $\tt$.
    
Thus, the Decoder is appropriate for *generative* tasks
- Text generation
- Predict the next word in a sentence
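
The generative loop can be sketched as follows. The function `toy_next_token` is a hypothetical stand-in for a trained Decoder (it simply counts upward modulo 10); only the structure of the loop, in which each output is appended to the input for the next step, is the point.

```python
def generate(next_token, prompt, max_new_tokens, end_token=None):
    """Generate tokens one at a time, appending each output to the input."""
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(sequence)   # sees only the prefix generated so far
        sequence.append(token)         # output becomes part of the next input
        if token == end_token:
            break
    return sequence

def toy_next_token(prefix):
    """Hypothetical stand-in for a Decoder: next token is the last token + 1 (mod 10)."""
    return (prefix[-1] + 1) % 10

print(generate(toy_next_token, [3], 4))                # [3, 4, 5, 6, 7]
print(generate(toy_next_token, [8], 5, end_token=0))   # [8, 9, 0]
```

Unlike the Encoder, this loop is inherently sequential at generation time: step $\tt$ cannot start until the output of step $(\tt-1)$ exists.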

# Conclusion

The Transformer architecture has come to dominate tasks with long sequences (e.g., NLP).

The operations of a Transformer occur in parallel for each position.

This allows us to leverage the compute time
- Use many stacked Transformer layers
- At time cost still less than a sequential RNN layer

Moreover, the constant path length means the gradients are less likely to vanish/explode for long sequences
- No need to truncate Back Propagation as in an RNN
- Long term dependencies between positions become feasible.

We pay for these advantages in terms of increasing
- number of operations
    - but they occur in parallel, so no increase in elapsed time
- number of weights

Thus, Transformer training is both compute and memory intensive.
- This limits the number of individuals/organizations able to train very large models.

