In [1]:
%run Latex_macros.ipynb


# Dealing with Sequences: Recurrent Neural Network (RNN) layer

For a function that takes 
sequence $\x^\ip$ as input
and creates sequence $\y$ as  output we had two choices for implementing the function.

The RNN implements the function as a "loop"
- A function taking **a single** $\x_\tp$ as input at a time
- Outputting $\y_\tp$ 
- Using a "latent state" $\h_\tp$  to summarize the prefix $\x_{(1\ldots \tt)}$
- Repeat in a loop over $\tt$

$$
\begin{array}{ll}
\pr{\h_\tp | \x_\tp, \h_{(\tt-1)} } & \text{latent variable } \h_\tp \text{ encodes } [ \x_{(1)} \dots \x_\tp ]\\
\pr{\y_\tp | \h_\tp }              & \text{prediction contingent on latent variable} \\
\end{array}
$$
    
<br>
<div>
    <center><strong>Loop with latent state</strong></center>
    <img src="images/RNN_arch_loop.png" width=70%>
</div>
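
The loop can be sketched in NumPy as follows. This is a minimal illustration rather than the notebook's implementation: the name `rnn_loop`, the weight names, the `tanh` activation, and the toy sizes are all assumptions made for the example.

```python
import numpy as np

def rnn_loop(xs, W_xh, W_hh, b_h):
    """Process a sequence one element at a time, carrying a latent state."""
    h = np.zeros(W_hh.shape[0])           # initial latent state h_(0)
    hs = []
    for x_t in xs:                        # loop over time steps t = 1..T
        # h_(t) depends only on the current input and the previous state
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hs.append(h)                      # simplest RNN: y_(t) = h_(t)
    return np.stack(hs)

rng = np.random.default_rng(0)
T, d = 5, 4                               # hypothetical sequence length and feature size
xs = rng.normal(size=(T, d))
W_xh = rng.normal(size=(d, d)) * 0.1
W_hh = rng.normal(size=(d, d)) * 0.1
b_h = np.zeros(d)

hs = rnn_loop(xs, W_xh, W_hh, b_h)
print(hs.shape)                           # one latent state per time step
```

Changing the final input leaves every earlier state untouched, because each $\h_\tp$ only summarizes the prefix $\x_{(1 \ldots \tt)}$.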

"Unrolling" the loop makes it equivalent to a multi-layer network

<br>
<table>
    <tr>
        <th><center>RNN unrolled</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_API_many_to_many.jpg"></td>
    </tr>
</table>

# Transformer (Encoder style)

The alternative to the loop was to create a "direct function"
- Taking a **sequence** $\x_{(1 \dots \tt)}$ as input
- Outputting $\y_\tp$

<br>
<div>
    <center><strong>Direct function</strong></center>
    <img src="images/RNN_arch_parallel.png" width=50%>
</div>

In order to output the sequence $\y_{(1)} \ldots \y_{(T)}$ we
create $T$ copies of the function (one for each $\y_\tp$)
- computes each $\y_\tp$ in **parallel**, not sequentially as in the loop

<br>
<div>
    <center><strong>Direct function, in parallel (masked input)</strong></center>
<img src="images/Transformer_parallel_masked.png" width=50%>
</div>

The parallel units constitute a *Transformer Encoder*

<br>
<table>
    <tr>
        <th><center>Transformer Encoder (causal masked input)</center></th>
    </tr>
    <tr>
        <td><img src="images/Transformer_1.png"></td>
    </tr>
</table>

Compared to the unrolled RNN, the Transformer Encoder
- Takes a **sequence** $\x_{(1..t)}$ as input
    - Because $\y_\tp$ is computed as a *direct* function of the prefix $\x_{(1..t)}$ rather than recursively
- Has **no** latent state: output is a direct function of the input sequence
- Has **no** data (e.g., $\h_\tp$) passing between the computations at different time steps (e.g., from $\tt$ to $(\tt +1)$)
- Outputs generated in parallel, not sequentially
- No gradients flowing backward over time

With this architecture, we can compute more general functions than the RNN
- where each $\y_\tp$ depends on the entire $\x_{(1 \ldots T)}$ rather than a prefix $\x_{(1 \ldots \tt)}$

<br>
<div>
    <center><strong>Direct function, in parallel (un-masked input)</strong></center>
<img src="images/Transformer_parallel.png" width=50%>
</div>


<table>
    <tr>
        <th><center>Transformer Encoder (unmasked input)</center></th>
    </tr>
    <tr>
        <td><img src="images/Transformer_2.png"></td>
    </tr>
</table>

Note that we use *unmasked* Self Attention for the Encoder
- it is OK for the encoder to access all input positions $\x_{(1:\bar T)}$

This is useful, for example, when the meaning of a word depends on its *entire* context.


An example of such a general function is the "meaning" of a word, in context.

| Sentence | Meaning of "it" |
|:----------|:-----------------|
The animal didn't cross the street because **it** was too tired | the animal
The animal didn't cross the street because **it** was too wide  | the street

The meaning of the word "it" is determined by a word that follows it ("tired" or "wide")

So even though the Transformer output at each position is a function of the entire sequence $\x_{(1 \ldots T)}$
- the output is different for each position $\tt$

Here is a diagram of an Encoder style Transformer block.

<br>
    <table>
    <tr>
        <center><strong>Transformer layer: Encoder style</strong></center>
     </tr> 
     <tr>   
        <img src='images/Transformer_Encoder.png' width=30%>
    </tr> 
</table>

# Attention is all you need

The key to the Transformer is multiple uses of the Attention mechanism
- Self-attention
    - Both the Encoder and Decoder attend to *their own inputs*
    - Implementing the *direct function*, rather than "loop" approach
- Cross-attention
    - The Decoder attends to the intermediate representations of the Encoder

The queries used in attention lookup are $\h_\tp, \bar \h_\tp$
- as are the keys and values

We will use $d$ (or $d_\text{model}$) to denote the length of these vectors.

Moreover, all pathways in the Transformer will maintain the size $d$.

# Technical clarifications

## Shared Transformer blocks across positions

The transformer blocks ("circles" in the diagram)
- are **shared** across all positions
- that is: the same computation (with shared parameters) is performed in parallel
- Thus, the number of parameters is **not** a function of sequence length $T$
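
The sharing can be illustrated with a toy position-wise computation. This is a sketch under assumptions: `apply_shared_block` is a hypothetical stand-in (a single ReLU layer) for a full Transformer block, chosen only to show that one set of weights serves sequences of any length.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W = rng.normal(size=(d, d))          # one weight matrix, shared by every position

def apply_shared_block(X):
    """Apply the same weights to every row (position) of X."""
    return np.maximum(X @ W, 0.0)

short_seq = rng.normal(size=(3, d))  # T = 3
long_seq = rng.normal(size=(50, d))  # T = 50

print(apply_shared_block(short_seq).shape)
print(apply_shared_block(long_seq).shape)
print("parameters:", W.size)         # depends on d only, not on T
```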

## Identifying $\hat\y_\tp$ with $\h_\tp$

The simplest RNN (corresponding to our diagrams) uses the latent state $\h_\tp$ as the output $\hat\y_\tp$
$$
\hat\y_\tp = \h_\tp
$$

It is easy to add another NN to transform $\h_\tp$ into a $\hat\y_\tp$ that is different.

- We can add a NN to the Decoder RNN that implements a function $D$ that transforms the latent state into an output.

$$\hat\y_\tp = D(\h_\tp)$$

Here is what the additional NN looks like:
<br>
<br>



<table>
    <tr>
        <th><center>Decoder output transformation: No attention</center></th>
    </tr>
    <tr>
        <td><img src="images/Decoder_no_attention.png" width=70%></td>
    </tr>
</table>

In the context of the Transformer, we will assume the style of a single output $\h_\tp$.

The reason for doing this:
- We can "stack" $N$ Transformer layers (just as we can stack RNN layers)
- The output of the non-top layer $j$ is $\h^{[j]}_\tp$, not the final $\y_\tp$
- We identify $\y_\tp$ as the output of the top layer $\h^{[N]}_\tp$
    - perhaps after further processing

Furthermore: 
    
Since the Encoder part is no longer a "loop"
- It is inaccurate to refer to the Encoder output $\bar \h_\tp$ as a "latent" state
- However, $\bar \h_\tp$ *is still* a summary of the input sequence
    - a summary of $\x_{(1 \ldots \tt)}$ when causal attention is used
    - a summary of $\x_{(1 \ldots \bar T)}$ otherwise
- Out of **bad habit** we may continue to erroneously refer to $\bar \h$ and $\h$ as "latent" states

# Inside the Transformer Encoder: Self Attention

There are two "styles" of Transformer
- Encoder
- Decoder

They are often paired into an Encoder-Decoder architecture but may also be used stand-alone.

We begin our discussion of the Transformer by looking at the Encoder style Transformer.

If we look inside the box computing the direct function, we will find several layers
- An Attention layer
    - To determine which elements of the input sequence $\x$ to attend to (focus on) when outputting $\y_\tp$
- A Feed Forward Network (FF) layer to compute the function

<br>
<table>
    <tr>
        <th><center>Transformer Layer (Encoder)</center></th>
    </tr>
    <tr>
        <td><img src="images/Transformer_Encoder.png" width=60%></td>
    </tr>
</table>

The above diagram shows an Encoder that takes an input sequence $\x_{(1 \ldots T)}$
- consistent with our diagram of an Encoder computing a direct function of the entire input sequence

When the Attention is directed at the Encoder's own inputs: we call this *Self-Attention*.

Moreover, the Attention allows access to all elements of the input sequence
- allowing the computation of a direct function of all elements


# Advantages of a Transformer compared to an RNN

As we will demonstrate in detail below
- The Transformer's operations can be performed in parallel versus sequentially for the RNN
    - parallel processing of each element of the output sequence
    - number of steps to produce an output sequence of length $T$
        - is constant, rather than $T$
    - faster !
- Gradients less likely to vanish or explode



We can **leverage** these advantages in complexity by
- Making a Transformer model bigger (e.g., more stacked Transformer layers)
- Making the sequence lengths longer
- Increasing the number of examples of training data

So, for the same time "cost" as an RNN, we can use a bigger Transformer on more data
- **Hence: we can learn more complex functions for similar time cost**

The **path length** from the output to the input is constant in a Transformer, compared to $T$ in the RNN.
- parallel computation: reduced wall time
- less likely for gradients to vanish/explode over shorter path
    - Transformer better able to capture long-range dependencies than an RNN


But there are costs to pay for this (relative to an RNN), many due to Attention Lookup.

- higher memory 
    - the $Q$ matrix is shape $(T \times d)$; the $K, V$ matrices are $(\bar T \times d)$
        - internal dimensions are size $d$
        - One query for each of the $T$ positions of the output sequence
            - or, as we will see in the Encoder-Decoder combination
                - $\bar T$ outputs ("latent states") of the Encoder to attend to
    - intermediate matrices, e.g. 
    $$Q * K^T$$
    are of shape $(T \times \bar T)$
- the number of operations is greater

- the number of parameters is greater

We give the detailed math below.

## Number of sequential steps

The most obvious advantage of the "direct function" as opposed to the "loop" is
that outputs are computed in parallel versus sequentially.

For an input sequence of length $T$:
- The loop requires $T$ steps
- The direct function requires $1$ step

## Path length
The *Path length* is the distance that the Loss Gradient needs to travel backwards during Back Propagation.

At each step, the gradient is subject to being diminished or increased (Vanishing/Exploding gradients).

Since the Transformer operates in parallel across positions, this is $\OrderOf{1}$.

It is $\OrderOf{T}$ for the RNN due to the sequential computation.


**The constant path length is critical to the success of the Transformer**
- The query used for the input at position $\tt$ can access **all** prior positions $\tt' \le \tt$ at the same cost
    - Gradient not diminished
    - RNN
        - Gradient signal diminished for position $\tt' << \tt$
        - Truncated Back Propagation may kill the gradient flow from position $\tt$ back to $\tt'$ beyond truncation window

A key strength of the Transformer is that it enables learning long-range dependencies.

## Number of parameters

In the Transformer,
the $Q, K, V$ matrices are first projected through $(d \times d)$ matrices, $\W_Q, \W_K, \W_V$
   
out  &nbsp;  &nbsp;  &nbsp;  &nbsp; | &nbsp; | left &nbsp; &nbsp;  &nbsp;  &nbsp; | &nbsp;  | right &nbsp;  &nbsp;  &nbsp;|
:--:|:-:|:-:|:-:|:-:
$Q$ | = | $Q$| * |$\W_Q$ |
$(T \times d)$ | | $(T \times d)$ | | $(d \times d)$
&nbsp;
$K$ | = | $K$| * |$\W_K$ |
$V$ | = | $V$| * |$\W_V$ |
$(\bar T \times d)$ | | $(\bar T \times d)$ | | $(d \times d)$

Each of the matrices, $\W_Q, \W_K, \W_V$, has $\OrderOf{d^2}$ parameters.

The Feed Forward Network is usually implemented by 2 `Dense` layers.

The first takes the length $d$ attention output
- and creates $d_\text{ffn}$ new features
- by custom: $d_\text{ffn} = 4 * d$

The second takes the length $d_\text{ffn}$ output of the first and creates the final length $d_\text{model}$ output.

Thus, each `Dense` layer has $\OrderOf{ d^2 }$ weights.
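
The counts above can be tallied with a few lines of Python. This is a rough sketch: the function name is hypothetical, and it deliberately ignores biases, normalization parameters, and the output projection that multi-head attention adds, so it is not an exact count for any particular library.

```python
def transformer_block_param_count(d, d_ffn=None):
    """Count weight-matrix entries in one block (biases and norms ignored)."""
    if d_ffn is None:
        d_ffn = 4 * d                      # customary choice from the text
    attention = 3 * d * d                  # W_Q, W_K, W_V, each (d x d)
    feed_forward = d * d_ffn + d_ffn * d   # two Dense layers: d -> d_ffn -> d
    return attention + feed_forward

for d in (64, 128, 256):
    print(d, transformer_block_param_count(d))
```

With $d_\text{ffn} = 4 * d$ the total is $11 * d^2$: quadratic in $d$ and independent of the sequence length $T$.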


## Number of operations

What about the number of operations ? 

The Attention Lookup is computed via matrix multiplication 
    $$
    Q * K^T * V
    $$
    
$Q * K^T$ has $(T \times \bar T)$ elements, each the result of $d$ multiplications.

Thus (with $\bar T = T$ for Self-Attention): $\OrderOf{T^2 *d}$ multiplications.

The Self Attention layer attends to (transformed) inputs
- each element assumed size of $d$

The keys and values of the CSM implementing Attention are the size $d$ input elements.
- Each attention lookup (dot product of query with a key) requires $d$ multiplications.
- There are $T$ key/value pairs in the CSM
- There are $T$ attention units (one for each position, outputting $\h_\tp$)

Thus: $\OrderOf{T^2 *d}$ multiplications.
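
As a sanity check on the count, the sketch below multiplies toy matrices and tallies the multiplications by hand. The helper `attention_multiplications` is hypothetical, and the tally covers only the two matrix products (it ignores the softmax and the projections).

```python
import numpy as np

def attention_multiplications(T, T_bar, d):
    """Multiplications for the scores Q K^T and for weighting the values."""
    scores = T * T_bar * d        # each of the (T x T_bar) scores is a length-d dot product
    weighted_sum = T * T_bar * d  # each of the (T x d) outputs sums over T_bar values
    return scores + weighted_sum

T, d = 8, 4                       # toy sizes
Q = np.ones((T, d))
K = np.ones((T, d))
V = np.ones((T, d))

scores = Q @ K.T                  # shape (T x T)
out = scores @ V                  # shape (T x d)
print(scores.shape, out.shape)
print(attention_multiplications(T, T, d))
```

Doubling $T$ quadruples the count, which is the $T^2$ factor in $\OrderOf{T^2 * d}$.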

## RNN calculations

Let's examine the RNN's number of operations and weights.

The RNN inputs $\x_\tp$ and outputs $\h_\tp$  of size $d$ (same as Transformer).
- In the RNN $\h_\tp$ is also the latent state

Each step of the RNN updates the latent state $\h_\tp$ via the equation
    $$
\h_\tp  =  \phi(\W_{xh}\x_\tp  + \W_{hh}\h_{(\tt-1)}  + \b_h) 
$$

The weight matrices $\W_{xh}$ and $\W_{hh}$ are each of size $(d \times d)$, i.e., $\OrderOf{d^2}$ parameters
- transforming length $d$ vectors ($\x_\tp, \h_{(\tt-1)}$) into a length $d$ vector $\h_\tp$

The multiplication of a $(d \times d)$ weight matrix by a vector of length $d$
- requires $d$ multiplications per element
- there are $d$ elements in $\h_\tp$

Thus $\OrderOf{d^2}$ operations per time step.

There are $T$ *sequential* time-steps
- $\OrderOf{T * d^2}$ total operations
- involving  $T$ sequential steps
    - steps are computed sequentially in the RNN, versus in parallel in the Transformer
- path length $T$ as gradient flows backward through each of the $T$ time steps

## Complexity: summary

We also throw in a CNN for comparison

The detailed CNN math is given in a following section.

| Type | Parameters  | Operations  &nbsp; &nbsp; &nbsp; | Sequential steps | Path length
|:------|:---|:---|:---|:---|
|  CNN | $\OrderOf{k * d^2}$   | $\OrderOf{T * k * d^2}$ | $\OrderOf{T}$   | $\OrderOf{T/k}$ |
| RNN  | $\OrderOf{d^2}$       | $\OrderOf{T * d^2}$     | $\OrderOf{T}$    | $\OrderOf{T}$ |
| Self-attention | $\OrderOf{d^2} $ | $\OrderOf{T^2 *d}$ | $\OrderOf{1}$ | $\OrderOf{1}$ |

Reference:
- [Transformer Scaling paper](https://arxiv.org/pdf/2001.08361.pdf#page=6)
- [Table 1 of Attention paper](https://arxiv.org/pdf/1706.03762.pdf#page=6)
- See [Stack overflow](https://stackoverflow.com/questions/65703260/computational-complexity-of-self-attention-in-the-transformer-model) for a correction of the number of operations calculated in the paper

The Transformer's main points of comparison to the RNN
- fewer Sequential Steps: $\OrderOf{1}$ versus $\OrderOf{T}$
- operations: $\OrderOf{T^2 * d}$ versus $\OrderOf{T * d^2}$
    - more when sequences are long, i.e., $T \gt d$
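
The crossover at $T = d$ is easy to see numerically. The sketch below simply evaluates the two leading terms from the table, dropping constant factors; the model width $d = 512$ and the sequence lengths are hypothetical choices.

```python
def self_attention_ops(T, d):
    return T * T * d          # leading term of O(T^2 * d)

def rnn_ops(T, d):
    return T * d * d          # leading term of O(T * d^2)

d = 512                       # hypothetical model width
for T in (128, 512, 2048):
    ratio = self_attention_ops(T, d) / rnn_ops(T, d)
    print(f"T={T:5d}  attention/RNN operation ratio = {ratio:.2f}")
```

The ratio is just $T / d$: below 1 for sequences shorter than the model width, above 1 for longer ones.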
    

**But:** because of the reduced number of sequential steps, Transformers
- can stack *many* (i.e.,  $n_\text{layers}$) blocks, each taking $\OrderOf{1}$ time
    - $\OrderOf{n_\text{layers}}$ Sequential Steps total
- and still be less than the $\OrderOf{T}$ Sequential Steps of an RNN
- at the cost of increasing number of operations and parameters by $\OrderOf{n_\text{layers}}$

Transformers consume a larger number of parameters and operations through this factor of $n_\text{layers}$ blocks.

### CNN calculations

Here are the details of the math for the CNN

- Path length $\OrderOf{T/k}$
    - each kernel multiplication connects only $k$ elements of $\x$
    - so connecting the first and last positions requires a chain of $\OrderOf{T/k}$ kernel applications
        - can reduce to $\OrderOf{\log_k(T)}$ with a tree structure (e.g., dilated convolutions)
- Parameters
    - kernel size $k$
    - number of input channels = number of output channels = $d$
    - $k *d$ parameters for kernel of one channel
    - $\OrderOf{k * d^2}$ parameters for kernel for all $d$ output channels
    
- Operations
    - for a single output channel: $k$ per input channel
        - There are $d$ input channels, so $k *d$ for each dot product of *one* output channel
        - There are $d$ output channels, so $k * d^2$ per time step
    - $T$ time steps so $\OrderOf{T * k * d^2}$ number of operations


# A free lunch ? Almost !

Transformers sound almost too good to be true
- Faster compute (through reduced number of Sequential steps)
- Constant Path Length
    - Better able to capture long range dependencies
    
Is there really such a thing as a free lunch ?

Almost.

In order to achieve the full benefit of reduced path length
- the operations across all $T$ positions must be computed in parallel
- this involves a tremendous amount of simultaneous compute power
    - very expensive in hardware and power costs


In addition, *positional encoding* needs to be preserved at each layer
- to maintain relative ordering (e.g., for causal attention)
- more complicated than an RNN


# Transformer: Decoder

The second type of Transformer is called the Decoder style.

<br>
    <table>
    <tr>
        <center><strong>Transformer layer: Decoder style</strong></center>
     </tr> 
     <tr>   
        <img src='images/Transformer_Decoder_style.png' width=30%>
    </tr> 
</table>

A Decoder usually operates in an *auto regressive* manner
- it has **no initial** input
    - Technically: it has a special "start" token
    $$\y_{(0)} = \langle \text{START} \rangle$$
- the output $\hat \y_\tp$ of time step $\tt$ is appended to the input
    - so the input at time step $\tt$ is $$\hat\y_{(1 \ldots \tt-1)}$$
    
<br>    
<br>
    <table>
    <tr>
        <center><strong>Transformer layer: Decoder style</strong></center>
     </tr> 
     <tr>   
        <img src='images/Transformer_Decoder_style_autoregressive.png' width=35%>
    </tr> 
</table>

When this occurs, the Decoder at time step $\tt$ can only attend to a *prefix* of $\hat\y_{(1..T)}$
$$
\hat\y_{(1 \ldots \tt-1)}
$$
- it can't refer to an input that will only be available (at *inference time*) in the future !
- Recall
    - During training, at step $\tt$
    - the entire Target $\y_{(1..T)}$ is available as input
    - masking prevents peeking into the future


This is implemented by a form of Attention called **Masked** Self-Attention
-  $\hat\y_{(1..T)}$ is masked to make visible only the prefix $\hat\y_{(1 \ldots \tt-1)}$
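
A minimal NumPy sketch of the masking follows. It rests on several assumptions: the helper `masked_self_attention` is hypothetical, learned projections are omitted, and the rows of `Y_in` stand for the Decoder inputs already shifted by one (row 1 is the start token, row $\tt$ holds $\hat\y_{(\tt-1)}$), so letting row $\tt$ see rows $1 \ldots \tt$ exposes exactly the prefix $\hat\y_{(1 \ldots \tt-1)}$.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(Y_in):
    """Row t of the result only mixes rows 1..t of Y_in."""
    T, d = Y_in.shape
    scores = Y_in @ Y_in.T / np.sqrt(d)
    visible = np.tril(np.ones((T, T), dtype=bool))   # row t may see columns <= t
    scores = np.where(visible, scores, -np.inf)      # future positions get zero weight
    return softmax(scores) @ Y_in

rng = np.random.default_rng(0)
Y_in = rng.normal(size=(5, 4))       # toy sizes: T = 5, d = 4
out = masked_self_attention(Y_in)

# Changing the last row must not affect any earlier output.
Y_future = Y_in.copy()
Y_future[-1] += 1.0
out_future = masked_self_attention(Y_future)
print(np.allclose(out[:-1], out_future[:-1]))
```

Because the mask is applied to the scores before the softmax, hidden positions receive a weight of exactly zero, and all $T$ rows can still be computed in one pass during training.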



# Transformer: (Encoder/Decoder style)

It is common to use two Transformers in an Encoder/Decoder configuration.

Refer back to our [Attention module](Intro_to_Attention.ipynb#Attention)
- used to motivate Attention
- through several steps
    - we modified a pair of RNN's (Encoder and Decoder)
    - into a pair of direct function modules
    - which form the basis for the Transformer
    

<table>
    <tr>
        <th><center>Encoder/Decoder with Cross Attention and Self Attention (Encoder/Decoder)</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_API_Encoder_Decoder_Attention_All_Self_Attention.png"
             width=80%></td>
    </tr>
   
</table>

Example on Question Answering
- an Input consisting of  Context and a Question is processed by the Encoder
- the Output Answer is generated auto-regressively by the Decoder
    - attending to the Encoder output

The Encoder part of the pair
- has full visibility to the input sequence $\x_{(1 \ldots \bar T)}$
    - via un-masked Self-Attention
- it transforms the input into the sequence $\bar \h_{(1 \ldots \bar T)}$
    - which is made available to the Decoder when generating each element of the Output $\hat \y_\tp$
    


The Decoder part of the pair
- can attend to the Encoder output $\bar \h_{(1 \ldots \bar T)}$ when generating any Output $\hat \y_\tp$
- the Output is produced auto-regressively
    - so Output at position $\tt$ is a function of
        - $\hat\y_{(1..\tt-1)}$: the Output generated thus far
        - the encoded input $\bar \h_{(1 \ldots \bar T)}$ 
        

The Decoder uses **two types** of Attention
- Attention to the Encoder output $\bar \h_{(1 \ldots \bar T)}$: Encoder/Decoder **Cross Attention**
- Self-Attention to the Output generated thus far $\hat\y_{(1..\tt-1)}$
    - **Masked Self-Attention**
        - Self-Attention: attends to its own input
        - Masked: access only to prefix $\hat\y_{(1..\tt-1)}$ rather than full $\hat\y_{(1..T)}$
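
Cross Attention differs from Self-Attention only in where the keys and values come from. The sketch below illustrates the shapes involved; `cross_attention` is a hypothetical helper that omits the learned projections, and the sizes are toy values.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_outputs):
    """Queries come from the Decoder; keys and values come from the Encoder."""
    d = decoder_states.shape[-1]
    scores = decoder_states @ encoder_outputs.T / np.sqrt(d)   # (T x T_bar)
    weights = softmax(scores)                                  # each row sums to 1
    return weights @ encoder_outputs, weights                  # output is (T x d)

rng = np.random.default_rng(0)
T, T_bar, d = 3, 5, 4
decoder_states = rng.normal(size=(T, d))        # one query per Decoder position
encoder_outputs = rng.normal(size=(T_bar, d))   # stand-in for the Encoder outputs

attended, weights = cross_attention(decoder_states, encoder_outputs)
print(attended.shape, weights.shape)
```

No mask is needed here: the whole input sequence has already been encoded, so every Decoder position may look at every Encoder output.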

<table>
    <tr>
        <th><center>Transformer Layer (Decoder)</center></th>
    </tr>
    <tr>
        <td><img src="images/Transformer_Decoder.png" width=70%></td>
    </tr>
</table>

The combined Encoder-Decoder Transformer diagram looks like this

<table>
    <tr>
        <th><center>Transformer Layer (Encoder/Decoder)</center></th>
    </tr>
    <tr>
        <td><img src="images/Transformer_Encoder_Decoder_2.png" width=70%></td>
    </tr>
</table>

**Explanation of diagram**
- The Encoder uses Self-attention (<span style="color:green">wide Green arrow</span>) to attend to input sequence $\x$
- The Decoder uses Masked Self-attention (<span style="color:red">wide Red arrow</span>) to attend to its input
    - Its input is the prefix of the output sequence $\y$
    - Limited to prefix of length $\tt$ by **masking**
- The Decoder uses Cross Attention (between Encoder and Decoder) <span style="color:blue">(wide Blue arrow)</span>
    - To enable the Decoder to focus on which Encoder latent state $\bar \h_\tp$ to attend to
- The dotted <span style="color:blue">(thin Blue arrow)</span> indicates that the output $\hat \y_\tp$ is appended to the input that is available when generating $\hat \y_{(\tt+1)}$
  

Note that the Decoder is recurrent (auto-regressive; generative)
- it generates a single output at a time
- unlike the Encoder, which generates all outputs (i.e., "encodings") in parallel

During training, at step $\tt$
- the entire Target $\y_{(1..T)}$ is available as input
- **but** Causal masking ensures that only $\y_{(1..\tt-1)}$ is visible
- so the *available* input at step $\tt$ is $\y_{(1..\tt-1)}$
- note that Training time input is $\y_{(1..\tt-1)}$ **not** $\hat \y_{(1..\tt-1)}$
    - Teacher forcing to prevent cascading errors
    - stops errors at step $\tt-1$ from affecting predictions at subsequent steps
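
Teacher forcing amounts to feeding the Decoder the true target shifted right by one position. A toy sketch, with a hypothetical helper name and a made-up three-word target:

```python
START = "<START>"

def teacher_forcing_inputs(target):
    """Shift the target right by one so step t sees y_(1..t-1), not its own label."""
    return [START] + list(target[:-1])

target = ["the", "cat", "sat"]
decoder_input = teacher_forcing_inputs(target)

for t, label in enumerate(target, start=1):
    visible = decoder_input[:t]            # what causal masking exposes at step t
    print(t, visible, "->", label)
```

At every step the visible prefix consists of true target tokens, so a wrong prediction at one step cannot contaminate the inputs of later steps.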

## Functional versus Sequential architecture

The architecture diagram is more complex than we have seen thus far.

In particular: data no longer strictly flows forward in a layer-wise arrangement !
- There are two independent sub-networks (Encoder and Decoder)
- Connection from the Encoder output to the middle of the Decoder (Cross-Attention)

Each of the Encoder and Decoder is an independent Functional model.
- not our familiar Sequential models

The Encoder/Decoder pair combination is also constructed as a Functional model.

Since we have not yet addressed Functional Models, you may not be prepared to completely grasp the totality.

But hopefully you can absorb the concepts even without fully understanding the details.

# Detailed Encoder/Decoder Transformer architecture

There are other components of the Encoder and Decoder that we have yet to describe.

We will do so briefly.

The Transformer was introduced in the paper [Attention is all you Need](https://arxiv.org/pdf/1706.03762.pdf)
   
<table>
    <tr>
        <th><center>Transformer (Encoder/Decoder)</center></th>
    </tr>
    <tr>
        <td><img src="images/Attention_is_all_u_need_Transformer.png" width=60%></td>
    </tr>
</table>

**Embedding layers**

We will motivate and describe Embeddings in the NLP module.

For now:
- an embedding is an encoding of a categorical value that is shorter than OHE

It is used in the Transformer to
- encode the input sequence of words
- encode the output sequence of words

**Positional Encoding**

The inputs are ordered (i.e., sequences) and thus describe a relative ordering relationship between elements.

But inputs to most layer types (e.g., Fully Connected) are unordered.

The Positional Encoding is a way of encoding the relative ordering of elements.

To represent the relative position of each element in the sequence,
- we can pair the input element  with an encoding of its position in the sequence.
$$
\langle \x_\tp, \text{encode}(\tt) \rangle
$$

The box labeled "Positional Encoding" creates $\text{encode}(\tt)$.

The "+" combines the Input Embedding and the Positional Encoding (in the original paper by element-wise addition, not concatenation), playing the role of the pair $\langle \x_\tp, \text{encode}(\tt) \rangle$.

If relative position is important, the NN can learn the values of  $\text{encode}(\tt)$.

The encoding is subtle.

A fuller explanation is given in this [module](Transformer_PositionalEmbedding.ipynb).
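
One concrete choice for $\text{encode}(\tt)$ is the fixed sinusoidal scheme of the original paper, in which the encoding is added element-wise to the embedding. The sketch below is illustrative only: the function name and the toy sizes are assumptions, and the zero matrix merely stands in for a real Input Embedding.

```python
import numpy as np

def sinusoidal_positional_encoding(T, d):
    """Return a (T x d) matrix whose row t is encode(t). Assumes d is even."""
    positions = np.arange(T)[:, None]                    # (T x 1)
    rates = 1.0 / (10000 ** (np.arange(0, d, 2) / d))    # one frequency per pair of dimensions
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(positions * rates)              # even dimensions
    pe[:, 1::2] = np.cos(positions * rates)              # odd dimensions
    return pe

T, d = 10, 8
embeddings = np.zeros((T, d))        # stand-in for the Input Embedding
x_with_position = embeddings + sinusoidal_positional_encoding(T, d)
print(x_with_position.shape)         # size d is preserved by the addition
```

Adding rather than concatenating keeps every pathway at size $d$.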

**Self Attention layers (Encoder and Decoder)**

The 3 arrows flowing into the Multi-Head Attention box
- are identical
- are the inputs (after being Embedded and having Positional Encoding added)

The Self-Attention layers for the Encoder and Decoder
- **differ in that the Decoder uses Causal Masking versus no-masking for the Encoder**
- Decoder can't "look ahead" at output $\y_{\tt'}$ for $\tt' \ge \tt$
    - it hasn't been generated yet at test time step $\tt$
    - it **is** available at training time (via Teacher Forcing)
        - but shouldn't look at it during training time, in order for training to be similar to test time
  

**Cross Attention layer (Decoder)**

The two arrows flowing from the Encoder output are the keys and values of the CSM

The arrow flowing from the Self Attention layer is the query
- The output of the Self Attention layer is the **query** used in Cross Attention

**Add and Norm**

We have seen each of these layer types before
- Norm: a Normalization layer (the Transformer uses Layer Normalization rather than Batch Normalization)
- Add: the part of the residual network that joins outputs of multiple previous layers
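
In the original paper each sub-layer is wrapped as $\text{LayerNorm}(x + \text{Sublayer}(x))$. A minimal NumPy sketch of that wrapper follows; the helper names are hypothetical, the learned gain and bias of Layer Normalization are omitted, and a single matrix multiplication stands in for the Attention or Feed Forward sub-layer.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    """Residual connection ("Add") followed by normalization ("Norm")."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                  # T = 5 positions, d = 8 features
W = rng.normal(size=(8, 8)) * 0.1            # stand-in sub-layer weights
out = add_and_norm(x, lambda z: z @ W)
print(out.shape)
```

The residual addition is only possible because the sub-layer output has the same size $d$ as its input.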

The diagram shows an Encoder/Decoder pair.

You will notice that each element of the pair is different.

- It is possible to use each element independently as well.

- But first we need to understand
the source of the differences and their implications.



## What happens during training ?

**Encoder**

The Encoder uses self-attention
- So the keys and values of the CSM are derived directly from input sequence $\x_{(1 \ldots T)}$

During training, the Encoder
- learns a query, derived from input sequence $\x_{(1 \ldots T)}$
- learns weights for the Feed Forward Network

The Attention output 
- is equal to a weighted combination of CSM values
    - i.e., weighted sum of input elements

The Feed Forward Network transforms the Attention output into Encoder output $\bar \h_\tp$.

**Decoder**

Similarly for the Decoder.

The Self-Attention layer CSM  has keys and values that are incrementally constructed
from the outputs $\hat\y_{(1 \ldots \tt)}$ that have been created in the first $\tt$ steps.

The Cross-Attention layer CSM has keys and values that are outputs $\bar \h_\tp$ of the Encoder.

During training, the Self-Attention layer learns to construct
- **The query** that is used for Self Attention.
    -  attention to the inputs.
- The **output** of Self-Attention
    - the weighted sum of input positions
    - becomes **the query** that is used for Cross Attention.
- The **query** used for Cross Attention
    - attention to the Encoder outputs
- The **output** of Cross Attention
    - the weighted sum of Encoder outputs
    - becomes the input to the Feed Forward Network
- The **weights** of the Feed Forward Network
    - this is where "world knowledge" from training data is stored
    


## Stacked Transformer

Just as with many other layer types (e.g., RNN), we may stack Transformer layers.
- Each layer creating alternate representations of the input of increasing complexity

In fact, stacking $N > 1$  Transformer layers is typical.

$N = 6$ was the choice of the original paper.

Note that this is still an Encoder/Decoder
- so the *final* output of the Encoder is attended to by *each* layer of the Decoder.

<table>
    <tr>
        <th><center>Stacked Transformer Layers (Encoder/Decoder)</center></th>
    </tr>
    <tr>
        <td><img src="images/Transformer_Encoder_Decoder_multi.png" width=70%></td>
    </tr>
</table>

# Use cases for each style of Transformer

The Transformers used for the Encoder and the Decoder of an Encoder/Decoder pair are slightly different.

Each can be used on its own as well as in a pair.

It's important to understand the differences in order to know when to use each individually.

## Encoder/Decoder uses

The Encoder/Decoder acts as a function
- from Input, processed by the Encoder
- to Target, processed by the Decoder
 
This is a natural architecture for Sequence to Sequence tasks.

## Encoder only uses

The Encoder side of the pair **does not** restrict the order in which its inputs are accessed.
- Self-attention **without** causal masking

Thus the Encoder output at *each* position is a function of the input at *all* positions.

This is valuable for tasks that require a context-sensitive representation of
each input element.

For example: the meaning of the word "**it**" changes with a small change to a subsequent word in the following sentences:
- "The animal didn't cross the road because **it** was too tired"

- "The animal didn't cross the road because **it** was too wide"

Some tasks with this characteristic are
- Sentiment
- Masked Language Modeling: fill-in the masked word
- Semantic Search
    - compare a summary of the sequence that is the context-sensitive representation of
        - query sentence
        - document sentences
    - Each summary is a kind of **sentence embedding**
    - Summary
        - pooling over each word
        - final token
        

It is often the case that special tokens are added to the input of an Encoder style transformer.
- Designating a special role for this token, compared to the other tokens in the sequence
- For example `<CLS>` ("Classification") is the single token used as input to a subsequent Classifier layer

Thus Encoder style Transformers are usually used as the first "layer" of a multi-layer network
- later layers being, e.g., task-specific "heads"

## Decoder only uses

A Decoder style Transformer
- looks like the Decoder side of the Encoder-Decoder
- *without* Cross-Attention, since there is no Encoder

One notable aspect of the Decoder is its auto-regressive behavior
- Initial input is empty
- Output $\hat\y_{(\tt-1)}$ is appended to the Decoder inputs available at step $\tt$.
- step $\tt$ input: $\hat\y_{(0 \ldots \tt-1)}$

Thus, a Decoder only Transformer is useful for completely *generative* tasks
- create sequence output
- from no input
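
The auto-regressive loop itself is short. In the sketch below, `toy_next_token` is a hypothetical stand-in for a trained Decoder, and the `<END>` token used to stop generation is an assumption added for the example.

```python
START, END = "<START>", "<END>"

def toy_next_token(prefix):
    """Hypothetical stand-in for a trained Decoder: counts up, then stops."""
    n = len(prefix) - 1                   # tokens generated so far (excluding START)
    return str(n + 1) if n < 3 else END

def generate(next_token, max_steps=10):
    sequence = [START]                    # the Decoder's only initial input
    for _ in range(max_steps):
        token = next_token(sequence)      # depends on everything generated so far
        if token == END:
            break
        sequence.append(token)            # the output is appended to the input
    return sequence[1:]

print(generate(toy_next_token))
```

Seeding `sequence` with a function input instead of only the start token would give the Input-to-Target variant.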

One can modify a Decoder only Transformer to implement a function from Input to Target
- just like an Encoder/Decoder
- by initializing the Decoder input to the function input sequence $\x_{(0..\bar T)}$

Thus, a Decoder only Transformer becomes similar in function to an Encoder/Decoder Transformer.
- but with roughly half the parameters (since there is no Encoder)

# Conclusion

The Transformer architecture has come to dominate tasks with long sequences (e.g., NLP).

The operations of a Transformer occur in parallel for each position.

This allows us to leverage the compute time
- Use many stacked Transformer layers
- At time cost still less than a sequential RNN layer

Moreover, the constant path length means the gradients are less likely to vanish/explode for long sequences
- No need to truncate Back Propagation as in an RNN
- Long term dependencies between positions become feasible.

We pay for these advantages in terms of increasing
- number of operations
    - but they occur in parallel, so no increase in elapsed time
- number of weights

Thus, Transformer training is both compute and memory intensive.
- This limits the number of individuals/organizations able to train very large models.

In [2]:
print("Done")

Done
