# Natural Language Processing

### Transformers

<br><br>
Prof. Iacopo Masi and Prof. Stefano Faralli

In [1]:
import matplotlib.pyplot as plt
import scipy
import random
import numpy as np
import pandas as pd
pd.set_option('display.colheader_justify', 'center')

In [2]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
#plt.style.use('seaborn-whitegrid')

font = {'family' : 'Times',
        'weight' : 'bold',
        'size'   : 12}

matplotlib.rc('font', **font)


# Aux functions

def plot_grid(Xs, Ys, axs=None):
    ''' Aux function to plot a grid'''
    t = np.arange(Xs.size) # define progression of int for indexing colormap
    if axs:
        axs.plot(0, 0, marker='*', color='r', linestyle='none') #plot origin
        axs.scatter(Xs,Ys, c=t, cmap='jet', marker='.') # scatter x vs y
        axs.axis('scaled') # axis scaled
    else:
        plt.plot(0, 0, marker='*', color='r', linestyle='none') #plot origin
        plt.scatter(Xs,Ys, c=t, cmap='jet', marker='.') # scatter x vs y
        plt.axis('scaled') # axis scaled
        
def linear_map(A, Xs, Ys):
    '''Map src points with A'''
    # [NxN,NxN] -> NxNx2 # add 3-rd axis, like adding another layer
    src = np.stack((Xs,Ys), axis=Xs.ndim)
    # flatten first two dimension
    # (NN)x2
    src_r = src.reshape(-1,src.shape[-1]) #ask reshape to keep last dimension and adjust the rest
    # 2x2 @ 2x(NN)
    dst = A @ src_r.T # 2xNN
    #(NN)x2 and then reshape as NxNx2
    dst = (dst.T).reshape(src.shape)
    # Access X and Y
    return dst[...,0], dst[...,1]


def plot_points(ax, Xs, Ys, col='red', unit=None, linestyle='solid'):
    '''Plots points'''
    ax.set_aspect('equal')
    ax.grid(True, which='both')
    ax.axhline(y=0, color='gray', linestyle="--")
    ax.axvline(x=0, color='gray',  linestyle="--")
    ax.plot(Xs, Ys, color=col)
    if unit is None:
        plotVectors(ax, [[0,1],[1,0]], ['gray']*2, alpha=1, linestyle=linestyle)
    else:
        plotVectors(ax, unit, [col]*2, alpha=1, linestyle=linestyle)

def plotVectors(ax, vecs, cols, alpha=1, linestyle='solid'):
    '''Plot set of vectors.'''
    for i in range(len(vecs)):
        x = np.concatenate([[0,0], vecs[i]])
        ax.quiver([x[0]],
                   [x[1]],
                   [x[2]],
                   [x[3]],
                   angles='xy', scale_units='xy', scale=1, color=cols[i],
                   alpha=alpha, linestyle=linestyle, linewidth=2)

<div align='center'><img src='https://www.dottorgadget.it/news/wp-content/uploads/2022/07/transformers-1984-optimus-prime.gif' width='15%' ></div>

## My own latex definitions

$$\def\mbf#1{\mathbf{#1}}$$
$$\def\bmf#1{\boldsymbol{#1}}$$
$$\def\bx{\mbf{x}}$$
$$\def\bxt#1{\mbf{x}_{\text{#1}}}$$
$$\def\bv{\mbf{v}}$$
$$\def\bz{\mbf{z}}$$
$$\def\bmu{\bmf{\mu}}$$
$$\def\bsigma{\bmf{\Sigma}}$$
$$\def\Rd#1{\in \mathbb{R}^{#1}}$$
$$\def\chain#1#2{\frac{\partial #1}{\partial #2}}$$
$$\def\loss{\mathcal{L}}$$
$$\def\params{\bmf{\theta}}$$


# Today's lecture
## - Limitations of RNN
## - Self and Cross-Attention
## - The Transformers Architecture

# This lecture material is taken from
📘 **Chapter 9, 10, 11 Jurafsky Book**

📘 **Chapter 6.3 Eisenstein Book**
- [Stanford Slide Transformers](http://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture08-transformers.pdf)
- [Stanford Lecture Transformers](https://www.youtube.com/watch?v=ptuGllU5SQQ&list=PLoROMvodv4rOSH4v6133s9LFPRHjEmbmJ&index=9&themeRefresh=1)
- [Stanford Notes on Transformers](http://web.stanford.edu/class/cs224n/readings/cs224n-self-attention-transformers-2023_draft.pdf)

Another resource with code is [[d2l.ai] Attention and Transformers](https://d2l.ai/chapter_attention-mechanisms-and-transformers/index.html)

Illustrated Transformer [jalammar.github.io/illustrated-transformer](https://jalammar.github.io/illustrated-transformer/)

# Our last lecture: NLP  ❤️ RNN

- From what we have seen so far, the state of the art is the **bidirectional LSTM** (used, for example, to encode the source sentence in a translation). **This was circa 2016.**
- Define your input (sequence) and use an **LSTM** to reduce it (classification) or to generate from it (sequence generation).
- Notably, we have seen that we can use **attention** 🧐 to avoid the encoder/decoder bottleneck and to look back at previous parts of the text.


<div align='center'><img src="figs/attention_13.png" width='40%' ></div>

# 2️⃣0️⃣1️⃣7️⃣   ➡️➡️ 2️⃣0️⃣2️⃣3️⃣

# Limitations of RNN

<div align='center'><img src="figs/intro_transf.png" width='90%' ></div>


# Limitations of RNN

### Linear interaction distance
### Intrinsic Lack of Parallelizability


# Limitations of RNN: Linear interaction distance

- RNNs are unrolled “left-to-right”.
- This encodes **linear locality:** a useful heuristic!
- Nearby words often affect each other’s meanings


- **Problem:** RNNs take $\mathcal{O}(T)$ steps for
distant word pairs to interact where $T$ is the sequence length.

<div align='center'><img src="figs/short_seq.png?1" width='10%' ></div>

# Limitations of RNN: Linear interaction distance

- RNNs are unrolled “left-to-right”.
- This encodes **linear locality:** a useful heuristic!
- Nearby words often affect each other’s meanings
- **Problem:** RNNs take $\mathcal{O}(T)$ steps for
distant word pairs to interact where $T$ is the sequence length.

<div align='center'><img src="figs/long_seq.png?1" width='30%' ></div>

# Limitations of RNN: Linear interaction distance

$\mathcal{O}(T)$ steps for distant word pairs to interact means:
- Hard to learn **long-distance dependencies (because of vanishing gradient problems!)**
- Linear order of words is **“baked in”**; we already know linear order is not the right way to think about sentences...

<div align='center'><img src="figs/long_seq.png?1" width='30%' ></div>

# Limitations of RNN: Lack of Parallelizability

Forward and backward passes have $\mathcal{O}(T)$ **unparallelizable operations**
- GPUs can perform a bunch of independent computations at once!
- But **future RNN hidden states cannot be computed in full before past RNN hidden states have been computed**
- Inhibits training on very large datasets!
<div align='center'><img src="figs/rnn_time_dependency.png" width='30%' ></div>

# If not recurrence, then what?

# What about a window-based classifier?

Word window models **aggregate local contexts**
- Also known as 1D convolution! On images, this works very well, aka 2D convolution!
- **Number of unparallelizable operations does not increase with sequence length!**


<div align='center'><img src="figs/local_context.png" width='30%' ></div>

# What about a window-based classifier?

The hidden state at layer 2, at time position $t=2$ will "see" hidden states at layer 1 at positions 1,2,3.

<br>
<div align='center'><img src="figs/local_context_02.png" width='30%' ></div>

# OK, good, but how do we capture long-distance dependencies?


<br>
<div align='center'><img src="figs/local_context_02.png" width='30%' ></div>

# Ideas?

# Long-distance dependencies with local context [Receptive Field]

Stacking word window layers in depth allows interaction between farther words
- Maximum Interaction distance = sequence length / window size
- (But if your sequences are too long, you’ll just ignore long-distance context)

<br/><br/>
<div align='center'><img src="figs/local_context_03.png?2" width='40%' ></div>

# Overall take-home message so far: 
# - parallelizability in time is a must (RNNs do not have that)
# - we can "pass on" parallelizability over depth

# What about Attention? 🧐

Attention treats each word’s representation as a **query** to access and incorporate information from a **set of values**.

We saw attention from **the decoder to the encoder**; today: attention with a single sentence.


<div align='center'><img src="figs/attention_13.png" width='50%' ></div>

# What about Attention? 🧐

- Number of unparallelizable operations does not increase with sequence length.
- Maximum interaction distance: $\mathcal{O}(1)$, since all words interact at every layer!
- All words attend to all words in previous layer (most arrows are omitted)

<br><br>
<div align='center'><img src="figs/self_attention_01.png" width='40%' ></div>

<br><div align='center'><img src="../2_04_from_rnn_to_nmt/figs/asteroidi.jpg" width='60%' ></div>

<div align='center'><img src="figs/all_you_need.png" width='60%' ></div>


# A few words on query, key, values

So far all the networks we reviewed crucially relied on the **input being of a well-defined size.**

- **[VISION]** The images in ImageNet are of size $224 \times 224$ pixels and CNNs are specifically tuned to this size. 

- **[NLP]** **the input size for RNNs is well defined and fixed**. Variable size is addressed by sequentially processing one token at a time, or by specially designed convolution kernels.

In particular, for **long sequences it becomes quite difficult to keep track** of everything that has already been generated or even viewed by the network. 

# Query, key, values terminology is from databases

In their simplest form they are collections of keys ($k$) and values ($v$). For instance, our database $\mathcal{D}$ might consist of tuples.

 `{("key", "value")}`

 `{("Zhang", "Aston"), ("Lipton", "Zachary"), ("Li", "Mu"), ("Smola", "Alex"), ("Hu", "Rachel"), ("Werness", "Brent")}`

- query ($q$) for "Li"  $\longrightarrow$ "Mu". Note query matches the key and returns the value.
- In case `("Li", "Mu")` was not a record in $\mathcal{D}$, there would be no valid answer!


- If we also allowed for **approximate matches**, we would retrieve `("Lipton", "Zachary")` instead.
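The exact and approximate lookups above can be sketched in a few lines of Python. This is only an illustration: the similarity score `common_prefix_len` is a hypothetical choice made for this example (any string similarity would do), and the helper names are not from the original material.

```python
# Toy key-value "database" from the example above.
database = {"Zhang": "Aston", "Lipton": "Zachary", "Li": "Mu",
            "Smola": "Alex", "Hu": "Rachel", "Werness": "Brent"}

def exact_lookup(query, db):
    """Return the value whose key equals the query, or None if there is none."""
    return db.get(query)

def common_prefix_len(a, b):
    """Hypothetical similarity score: length of the shared prefix of a and b."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

def approximate_lookup(query, db):
    """Return the (key, value) pair whose key is most similar to the query."""
    best_key = max(db, key=lambda k: common_prefix_len(query, k))
    return best_key, db[best_key]

print(exact_lookup("Li", database))        # exact match on the key "Li"

# Remove the record ("Li", "Mu"): exact matching now has no valid answer ...
smaller_db = {k: v for k, v in database.items() if k != "Li"}
print(exact_lookup("Li", smaller_db))
# ... while approximate matching falls back to the most similar key.
print(approximate_lookup("Li", smaller_db))
```

Attention replaces the hard `max` in `approximate_lookup` with a soft, weighted average over all records.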

# Query, key, values terminology is from databases

* We can design queries $q$ that operate on ($k$,$v$) pairs in such a manner as to be valid regardless of the  database size. 
* The same query can receive different answers, according to the contents of the database. 
* The "code" being executed to operate on a large state space (the database) can be quite simple (e.g., exact match, approximate match, top-$k$). 
* There is no need to compress or simplify the database to make the operations effective. 

# Lookup table vs Soft-average Lookup table

<div align='center'><img src="figs/self_attention_02.png" width='70%' ></div>

# Self-Attention: soft, averaging lookup table

<div align='center'><img src="https://d2l.ai/_images/attention-output.svg" width='70%' ></div>

# Self-Attention

<div align='center'><img src="figs/self_attention_03.png" width='60%' ></div>

# Self-Attention: soft, averaging lookup table

Denote by $\mathcal{D} \stackrel{\mathrm{def}}{=} \{(\mathbf{k}_1, \mathbf{v}_1), \ldots (\mathbf{k}_m, \mathbf{v}_m)\}$ a database of $m$ tuples of *keys* and *values*. Moreover, denote by $\mathbf{q}$ a *query*. Then we can define the *attention* over $\mathcal{D}$ as

$$\mathrm{Attention}(\mathbf{q}, \mathcal{D}) \stackrel{\mathrm{def}}{=} \sum_{i=1}^m \alpha(\mathbf{q}, \mathbf{k}_i) \mathbf{v}_i,$$

where $\alpha(\mathbf{q}, \mathbf{k}_i) \in \mathbb{R}$ ($i = 1, \ldots, m$) are **scalar attention weights**. The operation itself is typically referred to as *attention pooling*. 


The name *attention* derives from the fact that the operation pays particular attention to the terms for which the weight $\alpha$ is significant (i.e., large). As such, the attention over $\mathcal{D}$ generates a linear combination of values contained in the database. In fact, this contains the above example as a special case where all but one weight is zero.

# Self-Attention Properties
* The weights $\alpha(\mathbf{q}, \mathbf{k}_i)$ are nonnegative. In this case the output of the attention mechanism is contained in the convex cone spanned by the values $\mathbf{v}_i$. 
* The weights $\alpha(\mathbf{q}, \mathbf{k}_i)$ form a convex combination, i.e., $\sum_i \alpha(\mathbf{q}, \mathbf{k}_i) = 1$ and $\alpha(\mathbf{q}, \mathbf{k}_i) \geq 0$ for all $i$. This is the most common setting in deep learning. 
* Exactly one of the weights $\alpha(\mathbf{q}, \mathbf{k}_i)$ is $1$, while all others are $0$. This is akin to a traditional database query. 
* All weights are equal, i.e., $\alpha(\mathbf{q}, \mathbf{k}_i) = \frac{1}{m}$ for all $i$. This amounts to averaging across the entire database, also called average pooling in deep learning. 

# Softmax normalization
A common strategy to ensure that the weights sum up to $1$ is to normalize them via 

$$\alpha(\mathbf{q}, \mathbf{k}_i) = \frac{\alpha(\mathbf{q}, \mathbf{k}_i)}{{\sum_j} \alpha(\mathbf{q}, \mathbf{k}_j)}.$$

In particular, to ensure that the weights are also nonnegative, one can resort to exponentiation. This means that we can now pick *any* function  $a(\mathbf{q}, \mathbf{k})$ and then apply the softmax operation used for multinomial models to it via

$$\alpha(\mathbf{q}, \mathbf{k}_i) = \frac{\exp\big(a(\mathbf{q}, \mathbf{k}_i)\big)}{\sum_j \exp\big(a(\mathbf{q}, \mathbf{k}_j)\big)}. $$
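As a minimal sketch of softmax-normalized attention pooling, the NumPy snippet below computes the weights and the weighted sum of values for one query. The toy keys, values, and the dot-product scoring function are assumptions chosen for illustration; any scoring function $a(\mathbf{q}, \mathbf{k})$ could be plugged in.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention_pooling(q, keys, values, score_fn):
    """Attention(q, D) = sum_i alpha(q, k_i) v_i with softmax-normalized weights."""
    scores = np.array([score_fn(q, k) for k in keys])  # a(q, k_i), shape (m,)
    alpha = softmax(scores)                            # nonnegative, sums to 1
    return alpha @ values, alpha

# Assumed toy data: m = 3 key/value pairs in 2 dimensions.
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])
q = np.array([2.0, 0.0])

output, alpha = attention_pooling(q, keys, values, score_fn=np.dot)
print("weights:", alpha)
print("output :", output)
```

Because the weights form a convex combination, the output always lies inside the convex hull of the values.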

# Self-Attention

<div align='center'><img src="https://d2l.ai/_images/qkv.svg" width='50%' ></div>

# So far, <u>nothing is learnable</u>, just linear combination



# Before going into the learning part, let us make a connection with classic machine learning

# Attention Pooling via Nadaraya-Watson Regression [1964]


$$f(\mathbf{x}) = \sum_i \mathbf{y}_i \underbrace{\frac{\alpha(\mathbf{x}, \mathbf{x}_i)}{\sum_j \alpha(\mathbf{x}, \mathbf{x}_j)}}_{\text{attention}}.$$

<br>
<div align='center'><img src="https://d2l.ai/_images/output_attention-pooling_d5e6b2_63_0.svg" width='80%' ></div>

<small>[Taken from d2l.ai](https://d2l.ai/chapter_attention-mechanisms-and-transformers/attention-pooling.html)</small>


# Attention Pooling via Nadaraya-Watson Regression


$$f(\mathbf{q}) = \sum_i \mathbf{v}_i \underbrace{\frac{\alpha(\mathbf{q}, \mathbf{k}_i)}{\sum_j \alpha(\mathbf{q}, \mathbf{k}_j)}}_{\text{attention}}.$$

<br>
<div align='center'><img src="https://d2l.ai/_images/output_attention-pooling_d5e6b2_78_0.svg" width='80%' ></div>

<small>[Taken from d2l.ai](https://d2l.ai/chapter_attention-mechanisms-and-transformers/attention-pooling.html)</small>


# Attention Pooling via Nadaraya-Watson Regression

The choice of the kernel $\alpha(\mathbf{q}, \mathbf{k}_i)$ gives rise to different shapes of the regression function (it controls the range and the smoothness).

$$\begin{aligned}
\alpha(\mathbf{q}, \mathbf{k}) & = \exp\left(-\frac{1}{2} \|\mathbf{q} - \mathbf{k}\|^2 \right) && \mathrm{Gaussian} \\
\alpha(\mathbf{q}, \mathbf{k}) & = 1 \text{ if } \|\mathbf{q} - \mathbf{k}\| \leq 1 && \mathrm{Boxcar} \\
\alpha(\mathbf{q}, \mathbf{k}) & = 1 && \mathrm{constant} \\
\alpha(\mathbf{q}, \mathbf{k}) & = \mathop{\mathrm{max}}\left(0, 1 - \|\mathbf{q} - \mathbf{k}\|\right) && \mathrm{Epanechnikov}
\end{aligned}
$$
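The four kernels above can be written directly as functions of the distance $\|\mathbf{q} - \mathbf{k}\|$. The sketch below uses NumPy and scalar keys; the toy data (a noisy smooth curve on a regular grid) and the helper name `kernel_regression` are assumptions made for this example.

```python
import numpy as np

# The four kernels above, as functions of the (signed) distance q - k.
def gaussian(d):
    return np.exp(-0.5 * d**2)

def boxcar(d):
    return (np.abs(d) <= 1.0).astype(float)

def constant(d):
    return np.ones_like(d, dtype=float)

def epanechnikov(d):
    return np.maximum(0.0, 1.0 - np.abs(d))

def kernel_regression(keys, values, queries, kernel):
    """f(q) = sum_i v_i * alpha(q, k_i) / sum_j alpha(q, k_j) for scalar keys."""
    dists = queries[:, None] - keys[None, :]      # (n_queries, n_keys)
    k = kernel(dists)
    weights = k / k.sum(axis=1, keepdims=True)    # normalize over keys
    return weights @ values, weights

# Assumed toy data: noisy samples of a smooth 1D function.
rng = np.random.default_rng(0)
keys = np.linspace(0, 5, 40)
values = 2 * np.sin(keys) + keys + rng.normal(scale=0.3, size=keys.shape)
queries = np.linspace(0, 5, 50)

results = {}
for kernel in (gaussian, boxcar, constant, epanechnikov):
    results[kernel.__name__] = kernel_regression(keys, values, queries, kernel)
    y_hat = results[kernel.__name__][0]
    print(f"{kernel.__name__:>12}: prediction at q=0 is {y_hat[0]:.3f}")
```

With the constant kernel every key gets the same weight, so the prediction is the mean of the values for every query.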

# Attention Pooling via Nadaraya-Watson Regression

```python
import torch

def nadaraya_watson(x_train, y_train, x_val, kernel):
    # Pairwise differences: rows index keys (x_train), columns index queries (x_val)
    dists = x_train.reshape((-1, 1)) - x_val.reshape((1, -1))
    # Kernel value for each (key, query) pair
    k = kernel(dists).type(torch.float32)
    # Normalization over keys (rows) for each query (column)
    attention_w = k / k.sum(0)
    # Weighted average of the training targets for each query
    y_hat = y_train@attention_w
    return y_hat, attention_w
```

# Attention Weights
<br>
<div align='center'><img src="https://d2l.ai/_images/output_attention-pooling_d5e6b2_63_0.svg" width='80%' ></div>

<div align='center'>Note that, besides the constant kernel, they are all similar. 
<br>Why not stick to Gaussian kernel and tune its bandwidth?<img src="https://d2l.ai/_images/output_attention-pooling_d5e6b2_78_0.svg" width='80%' ></div>


# Let's tune the bandwidth parameter in the [Gaussian] kernel
$$ \alpha(\mathbf{q}, \mathbf{k}) = \exp\left(-\frac{1}{2 \sigma^2} \|\mathbf{q} - \mathbf{k}\|^2 \right) $$
<div align='center'><img src="https://d2l.ai/_images/output_attention-pooling_d5e6b2_93_0.svg" width='80%' ></div>
<div align='center'><img src="https://d2l.ai/_images/output_attention-pooling_d5e6b2_108_0.svg" width='80%' ></div>

# Instead of tuning the bandwidth, why not learn it?

# Learnable self-attention

Let $\{w_1,\ldots,w_n \}$ be a sequence of words in vocabulary $V$, like 

```Iacopo made his daughter food```.

For each word token $w_i$, let $\mbf{x}_i = \mbf{E}w_i$ (with $w_i$ represented as a one-hot vector), where $\mbf{E} \in \mathbb{R}^{d\times |V|}$ is an embedding matrix.

# Learnable self-attention

1. We transform each word embedding $\mbf{x}_i$ with **<u>learnable</u> weight matrices** $\mbf{Q},\mbf{K},\mbf{V} \in \mathbb{R}^{d\times d}$

 $$ \mbf{q}_i=\mbf{Q}\mbf{x}_i \qquad  \mbf{k}_i=\mbf{K}\mbf{x}_i \qquad \mbf{v}_i=\mbf{V}\mbf{x}_i \qquad$$
2. Compute pairwise similarities between keys and queries; normalize with softmax (across keys):

$$ e_{ij} = \mbf{q}_i^{\top}\mbf{k}_j \qquad \bmf{\alpha}_{ij}=\frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'}) } $$

3. Compute the output for each word as a weighted sum of values:
$$ \mbf{o}_i = \sum_{j} \bmf{\alpha}_{ij}\mbf{v}_j $$
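The three steps can be sketched token by token in NumPy. The sizes and the random matrices below are placeholders for illustration: in a real model $\mbf{Q},\mbf{K},\mbf{V}$ are learned, and each output is the weighted sum of the value vectors $\mbf{v}_j$ of all tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 4                       # assumed toy sizes: 5 tokens, 4 dimensions
x = rng.normal(size=(n, d))       # x[i] plays the role of the embedding x_i

# Randomly initialized stand-ins for the learnable matrices Q, K, V (d x d).
Q, K, V = (rng.normal(size=(d, d)) for _ in range(3))

# Step 1: queries, keys, values (q_i = Q x_i, k_i = K x_i, v_i = V x_i).
q = np.array([Q @ x_i for x_i in x])
k = np.array([K @ x_i for x_i in x])
v = np.array([V @ x_i for x_i in x])

outputs = []
for i in range(n):
    # Step 2: scores e_ij = q_i^T k_j, then softmax over the keys j.
    e_i = np.array([q[i] @ k[j] for j in range(n)])
    alpha_i = np.exp(e_i - e_i.max())
    alpha_i /= alpha_i.sum()
    # Step 3: output o_i = sum_j alpha_ij v_j.
    outputs.append(sum(alpha_i[j] * v[j] for j in range(n)))
outputs = np.array(outputs)

print(outputs.shape)  # one d-dimensional output per token
```

The explicit loops are only for readability; the same computation is normally written with matrix products.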

# Computation with two word tokens

<div align='center'><img src="http://jalammar.github.io/images/t/transformer_self_attention_vectors.png" width='60%' ></div>

<small>[Taken from illustrated-transformer](http://jalammar.github.io/illustrated-transformer/)</small>


# Computation with two word tokens

<div align='center'><img src="http://jalammar.github.io/images/t/transformer_self_attention_score.png" width='60%' ></div>

<small>[Taken from illustrated-transformer](http://jalammar.github.io/illustrated-transformer/)</small>


# Computation with two word tokens

<div align='center'><img src="http://jalammar.github.io/images/t/self-attention_softmax.png" width='60%' ></div>

<small>[Taken from illustrated-transformer](http://jalammar.github.io/illustrated-transformer/)</small>


# Computation with two word tokens

<div align='center'><img src="http://jalammar.github.io/images/t/self-attention-output.png" width='50%' ></div>

<small>[Taken from illustrated-transformer](http://jalammar.github.io/illustrated-transformer/)</small>


# Computation with more than 2 word tokens

<div align='center'><img src="figs/self_attention_04.png" width='60%' ></div>

<div align='center'><img src="figs/self_attention_05.png" width='60%' ></div>

<div align='center'><img src="figs/self_attention_06.png" width='60%' ></div>

# Attention can be easily made into tensor

<div align='center'><img src="http://jalammar.github.io/images/t/self-attention-matrix-calculation.png" width='50%' ></div>

<div align='center'><img src="http://jalammar.github.io/images/t/self-attention-matrix-calculation-2.png" width='60%' ></div>

# Attention with tensor

Let $\mbf{X} = [\mbf{x}_1,\ldots,\mbf{x}_n] \in \mathbb{R}^{n\times d}$ be the matrix of stacked input vectors (one row per token). We are going to write everything as a function of the input $\mbf{X}$. Note also that:
- $\underbrace{\mbf{X}}_{n\times d}\underbrace{\mbf{K}}_{d\times d} \in \mathbb{R}^{n \times d}$; the same holds for $\mbf{X}\mbf{Q} \in \mathbb{R}^{n \times d}$ and $\mbf{X}\mbf{V} \in \mathbb{R}^{n \times d}$ 
- The output is defined as $\operatorname{softmax}(\mbf{X}\mbf{Q}(\mbf{X}\mbf{K})^{\top})\cdot\mbf{X}\mbf{V}$, where the softmax is applied row-wise (across keys)
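In matrix form the whole layer is three matrix products and a row-wise softmax. The following NumPy sketch assumes tokens are stored as rows of `X` and uses random matrices as stand-ins for the learned weights.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax: each row (one query) is normalized across the keys."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def self_attention(X, Q, K, V):
    """softmax(XQ (XK)^T) XV for X of shape (n, d) and Q, K, V of shape (d, d)."""
    scores = (X @ Q) @ (X @ K).T      # (n, n): entry (i, j) compares token i with token j
    A = softmax_rows(scores)          # attention weights
    return A @ (X @ V), A             # (n, d) outputs and (n, n) weights

rng = np.random.default_rng(0)
n, d = 5, 4                           # assumed toy sizes
X = rng.normal(size=(n, d))
Q, K, V = (rng.normal(size=(d, d)) for _ in range(3))

out, A = self_attention(X, Q, K, V)
print(out.shape, A.shape)
```

All $n \times n$ pairwise interactions are computed at once, which is what makes the layer parallelizable over the sequence.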

<div align='center'><img src="figs/self_attention_07.png" width='60%' ></div>

# Self-Attention with tensor

Let $\mbf{X} = [\mbf{x}_1,\ldots,\mbf{x}_n] \in \mathbb{R}^{n\times d}$ be the matrix of stacked input vectors (one row per token). We are going to write everything as a function of the input $\mbf{X}$. Note also that:
- $\underbrace{\mbf{X}}_{n\times d}\underbrace{\mbf{K}}_{d\times d} \in \mathbb{R}^{n \times d}$; the same holds for $\mbf{X}\mbf{Q} \in \mathbb{R}^{n \times d}$ and $\mbf{X}\mbf{V} \in \mathbb{R}^{n \times d}$ 
- The output is defined as $\operatorname{softmax}(\mbf{X}\mbf{Q}(\mbf{X}\mbf{K})^{\top})\cdot\mbf{X}\mbf{V}$, where the softmax is applied row-wise (across keys)

<div align='center'><img src="figs/self_attention_08.png" width='60%' ></div>

# Are we done with Self-Attention?
<br>
<div align='center'><img src="figs/self_attention_09.png" width='60%' ></div>

# No! Still many things to fix 😪

# Self-attention as a NLP block: Fix 1) No notion of order!

Self-attention, as of now, <u>works on **sets**</u> and does not have a **notion of order**; it is thus permutation equivariant (i.e., permuting the inputs permutes the outputs in the same way).

`Iacopo made his daughter food` $=$ ` his daughter made Iacopo food`
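Permutation equivariance can be checked numerically. The sketch below uses random vectors as stand-ins for the five word embeddings and random weight matrices (all assumptions for illustration), reorders the tokens as in the second sentence, and compares the outputs.

```python
import numpy as np

def self_attention(X, Q, K, V):
    """Plain self-attention without positional information (tokens as rows)."""
    scores = (X @ Q) @ (X @ K).T
    scores = scores - scores.max(axis=1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=1, keepdims=True)
    return A @ (X @ V)

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))       # rows: Iacopo, made, his, daughter, food
Q, K, V = (rng.normal(size=(d, d)) for _ in range(3))

perm = np.array([2, 3, 1, 0, 4])  # his, daughter, made, Iacopo, food
out = self_attention(X, Q, K, V)
out_perm = self_attention(X[perm], Q, K, V)

# Permuting the inputs permutes the outputs in exactly the same way,
# so each word gets the same representation in both sentences.
is_equivariant = np.allclose(out_perm, out[perm])
print(is_equivariant)
```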

RNN had order encoded **implicitly** in the representation Iacopo $\rightarrow$ made $\rightarrow$ his $\rightarrow$ daughter 

Why not put the **word index explicitly in the representation**?

| Iacopo | made | his | daughter | food |
|:------:|:----:|:---:|:--------:|:----:|
|    0   |   1  |  2  |     3    |   4  |

# Self-attention as a NLP block: Fix 1) No notion of order!

| **x** | Iacopo | made | his | daughter | food |
|-------|:------:|:----:|:---:|:--------:|:----:|
| **i** |    0   |   1  |  2  |     3    |   4  |

With $\mbf{x}$ we encode $i$ as well. The **sequence index** is mapped to a vector with a function. Let's call  $\mbf{p}$ the output of this function given $i$.

$$ \mbf{\tilde{x}}_i = \mbf{x}_i + \mbf{p}_i $$

So we **add the signal of the index to the feature itself** (we could also concatenate it).

# Fix 1) No notion of order!
The dominant approach for preserving  information about the order of tokens is to represent this to the model 
as an additional input associated  with each token.  These inputs are called **positional encodings** and they can either be:
1. learned or 
2. fixed a priori.

We now describe a simple scheme for fixed positional encodings based on sine and cosine functions.

<div align='center'><img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*pS2_ywtYRO7hIRoj0XpHBQ.png" width='55%' ></div>

[Taken from towardsdatascience.com](https://towardsdatascience.com/master-positional-encoding-part-i-63c05d90a0c3)

<div align='center'><img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*R3U8xOrxuYRYNLe961n7bg.png" width='55%' ></div>

[Taken from towardsdatascience.com](https://towardsdatascience.com/master-positional-encoding-part-i-63c05d90a0c3)

# Positional Encoding

### Absolute Positional Information

To see how the monotonically decreased frequency
along the encoding dimension relates to absolute positional information,
let's print out **the binary representations** of $0, 1, \ldots, 7$.
As we can see, the lowest bit, the second-lowest bit, 
and the third-lowest bit alternate on every number, 
every two numbers, and every four numbers, respectively.

```python
0 in binary is 000
1 in binary is 001
2 in binary is 010
3 in binary is 011
4 in binary is 100
5 in binary is 101
6 in binary is 110
7 in binary is 111
```
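A listing like the one above can be produced with a short loop; this is a plausible way to generate it, not necessarily the code used for the slide.

```python
# Format each integer as a 3-bit binary string, padded with leading zeros.
lines = [f"{i} in binary is {i:03b}" for i in range(8)]
for line in lines:
    print(line)
```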


# Positional Encoding

Suppose that the input representation 
$\mathbf{X} \in \mathbb{R}^{n \times d}$ 
contains the $d$-dimensional embeddings 
for $n$ tokens of a sequence.
The positional encoding outputs
$\mathbf{X} + \mathbf{P}$
using a positional embedding matrix 
$\mathbf{P} \in \mathbb{R}^{n \times d}$ of the same shape,
whose element on the $i^\mathrm{th}$ row 
and the $(2j)^\mathrm{th}$
or the $(2j + 1)^\mathrm{th}$ column is

$$\begin{aligned} p_{i, 2j} &= \sin\left(\frac{i}{10000^{2j/d}}\right),\\p_{i, 2j+1} &= \cos\left(\frac{i}{10000^{2j/d}}\right).\end{aligned}$$

At first glance,
this trigonometric-function
design looks weird.
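The definition translates almost literally into NumPy. This is a minimal sketch that assumes an even embedding size $d$; the function name `positional_encoding` is chosen for this example.

```python
import numpy as np

def positional_encoding(n, d):
    """Return P of shape (n, d) with
    P[i, 2j]   = sin(i / 10000^(2j/d)),
    P[i, 2j+1] = cos(i / 10000^(2j/d)).
    Assumes d is even."""
    P = np.zeros((n, d))
    positions = np.arange(n)[:, None]                 # (n, 1): position index i
    freqs = 1.0 / 10000 ** (np.arange(0, d, 2) / d)   # (d/2,): 10000^(-2j/d)
    P[:, 0::2] = np.sin(positions * freqs)            # even columns
    P[:, 1::2] = np.cos(positions * freqs)            # odd columns
    return P

P = positional_encoding(n=6, d=8)
print(P.shape)
print(np.round(P[:3], 3))   # first three positions
```

The encoding is then simply added to the embeddings, `X + P`, before the first attention layer.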

# Positional Encoding

In the positional embedding matrix $\mathbf{P}$,
**rows correspond to positions within a sequence
and columns represent different positional encoding dimensions**.
In the example below,
we can see that
the $6^{\mathrm{th}}$ and the $7^{\mathrm{th}}$
columns of the positional embedding matrix 
have a higher frequency than 
the $8^{\mathrm{th}}$ and the $9^{\mathrm{th}}$
columns.
The offset between 
the $6^{\mathrm{th}}$ and the $7^{\mathrm{th}}$ (same for the $8^{\mathrm{th}}$ and the $9^{\mathrm{th}}$) columns
is due to the alternation of sine and cosine functions.

<br>
<div align='center'><img src="https://d2l.ai/_images/output_self-attention-and-positional-encoding_ce9eb6_48_0.svg" width='60%' ></div>


# Positional Encoding - Absolute Positional Info

In binary representations, a higher bit has a lower frequency than a lower bit. Similarly, as demonstrated in the heat map below, the positional encoding decreases frequencies along the encoding dimension by using trigonometric functions. Since the outputs are float numbers, such continuous representations are more space-efficient than binary representations.
<div align='center'><img src="https://d2l.ai/_images/output_self-attention-and-positional-encoding_ce9eb6_78_0.svg" width='25%' ></div>



# Positional Encoding - Relative Positional Info

Besides capturing absolute positional information,
the above positional encoding
also allows
a model to easily learn to attend by relative positions.
This is because
for any fixed position offset $\delta$,
the positional encoding at position $i + \delta$
can be represented by a linear projection
of that at position $i$.

This projection can be explained
mathematically.
Denoting
$\omega_j = 1/10000^{2j/d}$,
any pair of $(p_{i, 2j}, p_{i, 2j+1})$ 
in equation below
can 
be linearly projected to $(p_{i+\delta, 2j}, p_{i+\delta, 2j+1})$
for any fixed offset $\delta$:

$$\begin{aligned}
&\begin{bmatrix} \cos(\delta \omega_j) & \sin(\delta \omega_j) \\  -\sin(\delta \omega_j) & \cos(\delta \omega_j) \\ \end{bmatrix}
\begin{bmatrix} p_{i, 2j} \\  p_{i, 2j+1} \\ \end{bmatrix}\\
=&\begin{bmatrix} \cos(\delta \omega_j) \sin(i \omega_j) + \sin(\delta \omega_j) \cos(i \omega_j) \\  -\sin(\delta \omega_j) \sin(i \omega_j) + \cos(\delta \omega_j) \cos(i \omega_j) \\ \end{bmatrix}\\
=&\begin{bmatrix} \sin\left((i+\delta) \omega_j\right) \\  \cos\left((i+\delta) \omega_j\right) \\ \end{bmatrix}\\
=& 
\begin{bmatrix} p_{i+\delta, 2j} \\  p_{i+\delta, 2j+1} \\ \end{bmatrix},
\end{aligned}$$

where the $2\times 2$ projection matrix does not depend on any position index $i$.
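The identity is easy to verify numerically. In the sketch below the embedding size, the frequency index $j$, and the offset $\delta$ are arbitrary values picked for illustration.

```python
import numpy as np

d, j = 32, 3                      # assumed embedding size and frequency index
omega = 1.0 / 10000 ** (2 * j / d)

def pe_pair(i):
    """(p_{i,2j}, p_{i,2j+1}) = (sin(i * omega), cos(i * omega))."""
    return np.array([np.sin(i * omega), np.cos(i * omega)])

delta = 5
# The 2x2 projection matrix depends only on delta, not on the position i.
R = np.array([[ np.cos(delta * omega), np.sin(delta * omega)],
              [-np.sin(delta * omega), np.cos(delta * omega)]])

checks = {i: np.allclose(R @ pe_pair(i), pe_pair(i + delta)) for i in (0, 7, 42)}
print(checks)
```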

# Self-Attention so far
<br>
<div align='center'><img src="figs/self_attention_10.png" width='40%' ></div>

# Self-attention as a NLP block: Fix 2) No nonlinearities! Just weighted average.




# Fix 2) Adding non-linearity

- Note that there are no elementwise
nonlinearities in self-attention;
stacking more self-attention layers
just re-averages value vectors

- Easy fix: **add a feed-forward network (MLP)**
to "post-process" each output vector.

$$\begin{aligned}
\mbf{m}_i = &\text{MLP}(\mbf{output}_i)=\\
= & \mbf{W}_2\operatorname{ReLU}\big(\mbf{W}_1\mbf{output}_i+ \mbf{b}_1 \big)+ \mbf{b}_2 
\end{aligned}$$
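A minimal NumPy sketch of this position-wise MLP follows. The hidden size and the random weights are assumptions for illustration; the important point is that the same two-layer network is applied to every output vector independently.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def position_wise_mlp(outputs, W1, b1, W2, b2):
    """Apply m_i = W2 ReLU(W1 output_i + b1) + b2 to every row output_i."""
    hidden = relu(outputs @ W1.T + b1)   # (n, d_hidden)
    return hidden @ W2.T + b2            # (n, d)

rng = np.random.default_rng(0)
n, d, d_hidden = 5, 4, 16                # assumed toy sizes
outputs = rng.normal(size=(n, d))        # stand-in for the self-attention outputs
W1, b1 = rng.normal(size=(d_hidden, d)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d, d_hidden)), np.zeros(d)

mlp_out = position_wise_mlp(outputs, W1, b1, W2, b2)
print(mlp_out.shape)
```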

#   

<br>
<div align='center'><img src="figs/self_attention_11.png" width='100%' ></div>

# Positionwise Feed-Forward Networks (MLP)

<br>
<div align='center'><img src="figs/self_attention_11.png" width='60%' ></div>

# Self-attention as a NLP block: Fix 3) We can look in the future! 
## <u> Only used in the Decoder </u>

With RNNs this was not possible, but with self-attention we need to ensure that we **do not “look at the future”** when predicting a sequence for:
- Machine Translation
- Language Modeling

# Masking Self-attention

- To use self-attention in
decoders, we need to ensure
we can’t peek at the future

- At every timestep, we could
change the set of keys and
queries to include only past
words. (Inefficient!)

- To enable parallelization, we
mask out attention to future
words by setting attention
scores to $-\infty$

\begin{equation}
   e_{ij} = \begin{cases}
      \mbf{q}_i^{\top}\mbf{k}_j & j \leq i\\
      -\infty & j > i\\
    \end{cases}\,.
\end{equation}
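The masking rule can be sketched in NumPy by setting the scores above the diagonal to $-\infty$ before the softmax. The toy sizes and random weights are assumptions for illustration.

```python
import numpy as np

def masked_self_attention(X, Q, K, V):
    """Causal self-attention: token i only attends to tokens j <= i."""
    n = X.shape[0]
    scores = (X @ Q) @ (X @ K).T
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True where j > i
    scores = np.where(future, -np.inf, scores)           # mask out the future
    scores = scores - scores.max(axis=1, keepdims=True)
    A = np.exp(scores)                                   # exp(-inf) = 0
    A = A / A.sum(axis=1, keepdims=True)
    return A @ (X @ V), A

rng = np.random.default_rng(0)
n, d = 5, 4
X = rng.normal(size=(n, d))
Q, K, V = (rng.normal(size=(d, d)) for _ in range(3))

out, A = masked_self_attention(X, Q, K, V)
print(np.round(A, 2))   # lower-triangular attention weights
```

All positions are still computed in a single pass, so training remains parallel over the sequence.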

#   

<br>
<div align='center'><img src="figs/self_attention_12.png" width='100%' ></div>

# The Transformer Encoder-Decoder

<div align='center'><img src='https://www.dottorgadget.it/news/wp-content/uploads/2022/07/transformers-1984-optimus-prime.gif' width='15%' ></div>

# The Transformer
<div align='center'><img src="https://d2l.ai/_images/transformer.svg" width='35%' ></div>


# Missing pieces

## 1) Multi-Head Attention
### 2a) Residual Connection
### 2b) Layer Normalization
### 2c) Scaled Dot-Product





<div align='center'><img src="https://d2l.ai/_images/transformer.svg" width='80%' ></div>

# The world is "multi-modal"

# Self-Attention

<div align='center'><img src="figs/self_attention_03.png" width='60%' ></div>

# Multi-head self-attention
<br>
<div align='center'><img src="https://dmdave.com/wp-content/uploads/2019/03/demonic-hydra.jpg" width='60%' ></div>


# Multi-head self-attention
<br>
<div align='center'><img src="figs/self_attention_13.png" width='90%' ></div>

# Multi-head self-attention
<br>
<div align='center'><img src="figs/multi-head.png" width='60%' ></div>

# Self-Attention with tensor

Let $\mbf{X} = [\mbf{x}_1,\ldots,\mbf{x}_n] \in \mathbb{R}^{n\times d}$ be the matrix of stacked input vectors (one row per token). We are going to write everything as a function of the input $\mbf{X}$. Note also that:
- $\underbrace{\mbf{X}}_{n\times d}\underbrace{\mbf{K}}_{d\times d} \in \mathbb{R}^{n \times d}$; the same holds for $\mbf{X}\mbf{Q} \in \mathbb{R}^{n \times d}$ and $\mbf{X}\mbf{V} \in \mathbb{R}^{n \times d}$ 
- The output is defined as $\operatorname{softmax}(\mbf{X}\mbf{Q}(\mbf{X}\mbf{K})^{\top})\cdot\mbf{X}\mbf{V}$, where the softmax is applied row-wise (across keys)
<br>
<div align='center'><img src="figs/self_attention_08.png" width='60%' ></div>

# Multi Head Self-Attention with tensor

Instead of a single set of weights, we now have **$h$** sets of weights $\{\mbf{Q}_l, \mbf{K}_l, \mbf{V}_l\}_{l=1}^h$, but to keep the same computational cost we reduce the output dimensionality of each head.

- $\mbf{Q}_l,\mbf{K}_l, \mbf{V}_l \in \mathbb{R}^{d\times \frac{d}{h}}$ where $l \in  \{1,\ldots,h \}$.
- The output$_l$ is defined as $\operatorname{softmax}(\mbf{X}\mbf{Q}_l(\mbf{X}\mbf{K}_l)^{\top})\cdot\mbf{X}\mbf{V}_l   \in \mathbb{R}^{n\times \frac{d}{h}}$
- The final output is the concatenation of all head outputs (along the feature dimension), processed by a linear projection $\mbf{Y} \in \mathbb{R}^{d\times d}$
 $$ \text{output} = [\text{output}_1;\ldots; \text{output}_h]\,\mbf{Y}$$

<div align='center'><img src="figs/multi-head_00.png" width='60%' ></div>

# Multi Head Self-Attention with tensor

Instead of a single set of weights, we now have **$h$** sets of weights $\{\mbf{Q}_l, \mbf{K}_l, \mbf{V}_l\}_{l=1}^h$, but to keep the same computational cost we reduce the output dimensionality of each head.

- $\mbf{Q}_l,\mbf{K}_l, \mbf{V}_l \in \mathbb{R}^{d\times \frac{d}{h}}$ where $l$ ranges over $1,\ldots,h$.
- The output$_l$ is defined as $\operatorname{softmax}(\mbf{X}\mbf{Q}_l(\mbf{X}\mbf{K}_l)^{\top})\cdot\mbf{X}\mbf{V}_l   \in \mathbb{R}^{n\times \frac{d}{h}}$
- The final output is the concatenation of all head outputs (along the feature dimension), processed by a linear projection $\mbf{Y} \in \mathbb{R}^{d\times d}$
 $$ \text{output} = [\text{output}_1;\ldots; \text{output}_h]\,\mbf{Y}$$

### <u>Computationally, this is the same as before, with the same number of parameters except for $\mbf{Y}$.</u>
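A minimal NumPy sketch of multi-head self-attention is given below. It assumes that tokens are stored as rows of `X`, so the projection `Y` (taken here to be $d \times d$) is applied on the right of the concatenated heads; sizes and random weights are placeholders.

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def multi_head_self_attention(X, Qs, Ks, Vs, Y):
    """Run h independent heads of size d/h, concatenate them, project with Y."""
    heads = []
    for Q_l, K_l, V_l in zip(Qs, Ks, Vs):
        A_l = softmax_rows((X @ Q_l) @ (X @ K_l).T)   # (n, n) weights of head l
        heads.append(A_l @ (X @ V_l))                 # (n, d/h) output of head l
    concat = np.concatenate(heads, axis=1)            # (n, d)
    return concat @ Y                                 # (n, d)

rng = np.random.default_rng(0)
n, d, h = 5, 8, 2                                     # assumed toy sizes, h divides d
X = rng.normal(size=(n, d))
Qs = [rng.normal(size=(d, d // h)) for _ in range(h)]
Ks = [rng.normal(size=(d, d // h)) for _ in range(h)]
Vs = [rng.normal(size=(d, d // h)) for _ in range(h)]
Y = rng.normal(size=(d, d))

out = multi_head_self_attention(X, Qs, Ks, Vs, Y)
n_params_heads = sum(W.size for W in Qs + Ks + Vs)    # h * 3 * d * (d/h) = 3 d^2
print(out.shape, n_params_heads == 3 * d * d)
```

The parameter count of the heads equals that of a single head with full $d \times d$ matrices, which is the point made above.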

# 1) Multi Head Self-Attention 
<br>
<div align='center'><img src="figs/multi-head_01.png" width='60%' ></div>

# 2a) Residual Connection [He et al., 2016]

Residual connections are a powerful mechanism to allow gradients to flow better in your model. Sometimes they are also called skip connections. They have been popular in computer vision [He et al., 2016]

 $$\mbf{x}^i = \operatorname{Layer}\big(\mbf{x}^{i-1}\big)+\mbf{x}^{i-1}  $$
<br>
<div align='center'><img src="figs/residual_connection.png" width='30%' ></div>


# 2a) Residual Connection

$$\mbf{x}^i = \operatorname{Layer}\big(\mbf{x}^{i-1}\big)+\mbf{x}^{i-1}  $$
<br>
<div align='center'><img src="figs/residual_connection.png" width='70%' ></div>


<br>
<div align='center'><img src="figs/residual_connection_01.png" width='100%' ></div>

# 2b) Layer Normalization [Ba et al., 2016]

Let $\mbf{x} \in \mathbb{R}^{d}$ be a word embedding; then:
 $$ \mbf{x}^{\prime} = \frac{\mbf{x}-\mu}{\sqrt{\sigma^2+\epsilon}}\cdot \mbf{\gamma}+\mbf{\beta}$$

Note that $\mu,\sigma$ (the mean and standard deviation of the entries of $\mbf{x}$) are scalars, while $\mbf{\gamma}$, $\mbf{\beta}$ are vectors. $~~~~~\downarrow$
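As a small sketch, layer normalization of a single vector can be written in NumPy as follows. It assumes the usual convention of dividing by the standard deviation $\sqrt{\sigma^2+\epsilon}$; the value of `eps` and the toy input are assumptions.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one d-dimensional vector to zero mean and unit variance,
    then rescale with gamma and shift with beta (both d-dimensional)."""
    mu = x.mean()                 # scalar mean over the d features
    var = x.var()                 # scalar variance over the d features
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

d = 6
x = np.array([2.0, -1.0, 0.5, 3.0, 10.0, -4.0])
gamma, beta = np.ones(d), np.zeros(d)   # identity scale and shift

x_norm = layer_norm(x, gamma, beta)
print(np.round(x_norm, 3), x_norm.mean(), x_norm.std())
```

In a Transformer the same operation is applied independently to every token, with $\mbf{\gamma}$ and $\mbf{\beta}$ learned.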

<div align='center'><img src="figs/layer_norm.png" width='100%' ></div>


# 2c) Scaled Dot Product
Last, we need to keep the order of magnitude of the arguments in the exponential function under control. Assume that all the elements of the query $\mathbf{q} \in \mathbb{R}^d$ and the key $\mathbf{k}_i \in \mathbb{R}^d$ are independent and identically drawn random variables **with zero mean and unit variance**. The dot product between both vectors has zero mean and a variance of $d$. To ensure that the variance of the dot product still remains one regardless of vector length, we use the **scaled dot-product attention** scoring function. That is, we rescale the dot-product by $1/\sqrt{d}$. 

$$ a(\mathbf{q}, \mathbf{k}_i) = \frac{\mathbf{q}^\top \mathbf{k}_i}{\sqrt{d}}.$$

[More info here](https://github.com/BAI-Yeqi/Statistical-Properties-of-Dot-Product/blob/master/proof.pdf)
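The variance claim can be checked empirically. The sketch below draws standard normal queries and keys (one possible zero-mean, unit-variance choice, assumed here) and compares the variance of the raw and of the scaled dot product; sample sizes and dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000
variances = {}

for d in (16, 64, 256):
    q = rng.normal(size=(n_samples, d))    # zero mean, unit variance entries
    k = rng.normal(size=(n_samples, d))
    dots = (q * k).sum(axis=1)             # one dot product per sampled (q, k) pair
    scaled = dots / np.sqrt(d)
    variances[d] = (dots.var(), scaled.var())
    print(f"d={d:4d}  var(q.k)={dots.var():8.2f}  var(q.k/sqrt(d))={scaled.var():.3f}")
```

The raw variance grows roughly like $d$, while the scaled one stays close to one, which keeps the softmax arguments in a comparable range for any vector length.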

# The Transformer

<div align='center'><img src="figs/paper_00.png?1" width='70%' ></div>

# The Transformer

<div align='center'><img src="figs/paper_01.png" width='38%' ></div>

# Cross-Attention

- In self-attention, queries, keys and values all come from the same **source**.
- In the decoder, we have attention that looks more like what we saw in NMT.
- $\{ \mbf{h}_1,\ldots,\mbf{h}_T \}$ are the outputs of the last encoder layer.
- $\{ \mbf{z}_1,\ldots,\mbf{z}_T \}$ are the inputs to the decoder layer [either the decoder's input representations or the input passed on to the next layer].

**Keys and Values are drawn from the encoder (like a memory)**

- $\underbrace{\mbf{k}_i = \mbf{K}\mbf{h}_i \quad \mbf{v}_i = \mbf{V}\mbf{h}_i}_{\text{encoder}} \quad \underbrace{\mbf{q}_i = \mbf{Q}\mbf{z}_i}_{\text{decoder}}$

# Cross-Attention with tensors

$\mbf{H} = [\mbf{h}_1,\ldots,\mbf{h}_n] \in \mathbb{R}^{n\times d}$ is the matrix that stacks the outputs of the last encoder layer.

$\mbf{D} = [\mbf{d}_1,\ldots,\mbf{d}_n] \in \mathbb{R}^{n\times d}$ is the matrix that stacks the inputs to the decoder layer (called $\mbf{z}_i$ on the previous slide).

The output is defined as $\operatorname{softmax}(\mbf{D}\mbf{Q}(\mbf{H}\mbf{K})^{\top})\cdot\mbf{H}\mbf{V}  $

<div align='center'><img src="figs/corss-attention.png" width='50%' ></div>
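The matrix form of cross-attention translates almost literally into NumPy. This is a sketch under assumptions: the names are illustrative, the encoder and decoder sequences are allowed to have different lengths, and the optional `scaled` flag adds the $1/\sqrt{d}$ factor of the scaled dot product.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(D, H, Q, K, V, scaled=False):
    """D: (m, d) decoder inputs. H: (n, d) encoder outputs. Q, K, V: (d, d)."""
    queries = D @ Q                  # queries come from the decoder
    keys = H @ K                     # keys come from the encoder
    values = H @ V                   # values come from the encoder
    scores = queries @ keys.T        # (m, n): each decoder position scores every encoder position
    if scaled:
        scores = scores / np.sqrt(queries.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
n_enc, n_dec, d = 7, 4, 8
H = rng.normal(size=(n_enc, d))
D = rng.normal(size=(n_dec, d))
Q, K, V = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = cross_attention(D, H, Q, K, V)
print(out.shape, weights.shape)
```

Each row of `weights` is a distribution over encoder positions, so the encoder acts as a memory that every decoder position reads from.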

# Transformers Decoding

<div align='center'>
    <img src="http://jalammar.github.io/images/t/transformer_decoding_1.gif" width='80%' >
</div>

# Transformers Decoding

<div align='center'><img src="http://jalammar.github.io/images/t/transformer_decoding_2.gif" width='80%' ></div>


# Final Linear Layer

<br>
<div align='center'><img src="http://jalammar.github.io/images/t/transformer_decoder_output_softmax.png" width='50%' ></div>

# Machine Translation Performance

<br>
<div align='center'><img src="figs/performance_01.png" width='60%' ></div>

# Text Summarization

<br>
<div align='center'><img src="figs/performance_00.png" width='70%' ></div>

# Natural Language Processing

### Contextual Embedding, Subword model, 
### BERT, Transfer Learning (Pre-training)

<br><br>
Prof. Iacopo Masi and Prof. Stefano Faralli

# Today's lecture
## - A few words on subword modeling (byte pair encoding)
## - GPT, BERT
## - Transfer Learning (Pre-training)

# This lecture material is taken from
📘 **Chapter 9, 10, 11 Jurafsky Book**

📘 **Chapter 6.3 Eisenstein Book**
- [Stanford Slide Transformers](http://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture08-transformers.pdf)
- [Stanford Lecture Transformers](https://www.youtube.com/watch?v=ptuGllU5SQQ&list=PLoROMvodv4rOSH4v6133s9LFPRHjEmbmJ&index=9&themeRefresh=1)
- [Stanford Notes on Transformers](http://web.stanford.edu/class/cs224n/readings/cs224n-self-attention-transformers-2023_draft.pdf)

Another resource with code is [[d2l.ai] Attention and Transformers](https://d2l.ai/chapter_attention-mechanisms-and-transformers/index.html)

Illustrated Transformer [jalammar.github.io/illustrated-transformer](https://jalammar.github.io/illustrated-transformer/)

# Subword Modeling

# Word structure and subword models

Let us take a look at the assumptions we have made about a language's vocabulary.

We assume a fixed vocabulary of tens of thousands of words, built from the training set.
All novel words seen at test time are mapped to a single **`UNK`** token.

| type       | word         | V mapping     | embedding         |
|------------|--------------|---------------|-------------------|
| common     | hat          | hat (index)   | $\mbf{e}_{hat}$   |
| common     | learn        | learn (index) | $\mbf{e}_{learn}$ |
| variations | taaaasty     | UNK (index)   | $\mbf{e}_{UNK}$   |
| typo       | laern        | UNK (index)   | $\mbf{e}_{UNK}$   |
| new word   | Transformify | UNK (index)   | $\mbf{e}_{UNK}$   |

# Word structure and subword models: Words Morphology

Finite vocabulary assumptions make **even less sense in many COMPLEX languages**.
- Many languages exhibit **complex morphology,** or word structure.
- The effect is more word types, each occurring fewer times

<br>
<div align='center'><img src="figs/verbs.png" width='50%' ></div>

# Subword modeling: we give up the assumption of single word token

# The Byte Pair Encoding [2015]

To allow for **variable-length subwords** in a **fixed-size vocabulary**, we can apply a compression algorithm called byte pair encoding (BPE) to extract subwords. BPE sits between two extremes:

1. Breaking all words down into characters (the vocabulary is the set of characters): the model has to make a big effort to learn words.
2. The usual splitting of the text into word tokens: too rigid, since any infrequent or unknown word is mapped to `UNK`.

BPE starts from extreme 1 and slowly builds towards extreme 2: the vocabulary is learned greedily, starting from single characters.

# The Byte Pair Encoding

Simple, effective strategy for defining a subword vocabulary.

1. Start with a **vocabulary containing only characters and an “end-of-word” symbol**.
2. Using a corpus of text, find the most common adjacent characters “a,b”; add “ab” as a subword.
3. Replace instances of the character pair with the new subword; repeat until desired vocab size.

Originally used in NLP for machine translation; now a similar method (**WordPiece**) is used in pretrained models.

**BPE is used in GPT-2 and RoBERTa.**

# The Byte Pair Encoding
| type       | word         | V mapping     |      embedding     |
|------------|--------------|---------------|:------------------:|
| common     | hat          | hat (index)   |   $\mbf{e}_{hat}$  |
| common     | learn        | learn (index) |  $\mbf{e}_{learn}$ |
| variations | taaaasty     | taa aaa sty   | $\mbf{e}_i ~i=1..3$ |
| typo       | laern        | la ern        | $\mbf{e}_i ~i=1..2$ |
| new word   | Transformify | Transform ify | $\mbf{e}_i ~i=1..2$ |

[Taken from d2l.ai](https://d2l.ai/chapter_natural-language-processing-pretraining/subword-embedding.html#byte-pair-encoding)



# The Byte Pair Encoding

We start from: 

```python
symbols = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
           'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
           '_', '[UNK]']
```

# The Byte Pair Encoding

From a corpus, we get the raw token frequencies (only 4 word types here, for simplicity).

```python
raw_token_freqs = {'fast_': 4, 'faster_': 3, 'tall_': 5, 'taller_': 4}
```

From this we break each word into characters; `_` is needed to remember where each word ends.

```python
{'f a s t _': 4, 'f a s t e r _': 3, 't a l l _': 5, 't a l l e r _': 4}
```


# Merging process by most frequent adjacent subwords


```python
merge #1: ('t', 'a')
merge #2: ('ta', 'l')
merge #3: ('tal', 'l')
merge #4: ('f', 'a')
merge #5: ('fa', 's')
merge #6: ('fas', 't')
merge #7: ('e', 'r')
merge #8: ('er', '_')
merge #9: ('tall', '_')
merge #10: ('fast', '_')
```


```python
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '_', '[UNK]', 'ta', 'tal', 'tall', 'fa', 'fas', 'fast', 'er', 'er_', 'tall_', 'fast_']
```
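The merge loop can be sketched as follows. This is a minimal implementation written for this toy corpus, not the d2l.ai code: words are kept as tuples of symbols, the helper names are illustrative, and ties between equally frequent pairs are broken by taking the first pair encountered.

```python
from collections import Counter

def count_pairs(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(raw_token_freqs, num_merges):
    # Start from single characters: 'fast_' -> ('f', 'a', 's', 't', '_')
    word_freqs = {tuple(word): freq for word, freq in raw_token_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(word_freqs)
        if not pairs:
            break                                  # every word is already a single symbol
        best = max(pairs, key=pairs.get)           # ties: the first pair encountered wins
        merges.append(best)
        word_freqs = {merge_pair(s, best): f for s, f in word_freqs.items()}
    return merges, word_freqs

raw_token_freqs = {'fast_': 4, 'faster_': 3, 'tall_': 5, 'taller_': 4}
merges, word_freqs = learn_bpe(raw_token_freqs, num_merges=10)
for i, pair in enumerate(merges, start=1):
    print(f"merge #{i}: {pair}")
print([' '.join(symbols) for symbols in word_freqs])
```

The learned vocabulary is the initial character set plus one new symbol per merge, which is why the number of merges controls the vocabulary size.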

# English syntax and rules emerge

The four words are now segmented as follows; `faster_` and `taller_` are split into a stem plus the suffix `er_`:

```python
['fast_', 'fast er_', 'tall_', 'tall er_']
```

# Out of Vocabulary

We have to model:
```python
tokens = ['tallest_', 'fatter_']
```

What happens with **BPE**?


```python
['tall e s t _', 'fa t t er_']
```

BPE is able to recover `tall`, `fa` and `er_`, but it fragments the rest into single characters (`e`, `s`, `t`, etc.). **Learning the merges on a bigger corpus can fix this.**

[Taken from d2l.ai](https://d2l.ai/chapter_natural-language-processing-pretraining/subword-embedding.html#byte-pair-encoding)
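Segmenting an unseen token can be sketched as a greedy longest-match search over the learned vocabulary. The function below is an illustrative variant, not the d2l.ai code: it emits `[UNK]` for any character that is not in the vocabulary and then keeps going.

```python
import string

# Characters, the end-of-word symbol, and the ten subwords learned by the merges
vocab = set(string.ascii_lowercase) | {'_'} | {
    'ta', 'tal', 'tall', 'fa', 'fas', 'fast', 'er', 'er_', 'tall_', 'fast_'}

def segment_bpe(token, vocab):
    """Greedily split `token` into the longest subwords found in `vocab`."""
    pieces, start = [], 0
    while start < len(token):
        end = len(token)
        # Shrink the candidate from the right until it is a known subword
        while end > start and token[start:end] not in vocab:
            end -= 1
        if end == start:
            pieces.append('[UNK]')     # unknown character: emit [UNK] and skip it
            start += 1
        else:
            pieces.append(token[start:end])
            start = end
    return ' '.join(pieces)

tokens = ['tallest_', 'fatter_']
print([segment_bpe(token, vocab) for token in tokens])
```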


# Contextual Embedding

# Motivating word meaning and context

Stated by J. R. Firth (1957) as: 

> "You shall know a word by the company it keeps"

Distributional statistics have a striking ability to capture lexical semantic relationships such as **analogies**.

J. R. Firth (1935) also stated:

> "... the complete meaning of a word is always contextual,
and no study of meaning apart from a complete context
can be taken seriously."

Consider `I record the record`: the two instances of `record` mean different things.

# Where we were: pretrained word embeddings

Circa 2017:
- Start with pretrained word embeddings (no
context!)
- Learn how to incorporate context in an LSTM or Transformer while training on the task

Issues:
- The training data we have for **our downstream task** (like question answering) **must be sufficient** to teach all contextual aspects of language
- Most of the parameters in our network are **randomly initialized!**

# Pretrained word embeddings



<br>
<div align='center'><img src="figs/pre_training.png" width='100%' ></div>

# Where we are: pretrained whole model

In modern NLP:
- All (or almost all) parameters in NLP
networks are initialized via **pretraining**.
- Pretraining methods hide parts of the input
from the model, and train the model to
reconstruct those parts

This has been exceptionally effective at
building strong:
- **representations of language**
- **parameter initializations** for strong NLP
models.
- **Probability distributions** over language that we can sample from

# Pretrained model



<br>
<div align='center'><img src="figs/pre_training_01.png" width='90%' ></div>

# Pretraining ❤️ Self-Supervised Learning

# What can we learn from reconstructing the input?

<br><br><br>
$$ \text{Sapienza University is located in __________, Italy}$$
<br><br>

# What can we learn from reconstructing the input?

<br><br><br>
$$ \text{I put ___ fork down on the table.}$$
<br><br>

# What can we learn from reconstructing the input?

<br><br><br>
$$ \text{The woman walked across the street,
checking for traffic over ___ shoulder.}$$
<br><br>

# What can we learn from reconstructing the input?

<br><br><br>
$$ \text{I went to the ocean to see the fish, turtles, seals, and _____.}$$
<br><br>

# What can we learn from reconstructing the input?

<br><br><br>
$$\text{Overall, the value I got from the two hours watching
it was the sum total of the popcorn and the drink.}$$

$$\text{The movie was ______.}$$
<br><br>

# What can we learn from reconstructing the input?

<br><br><br>
$$\text{Iroh went into the kitchen to make some tea.}$$
$$\text{Standing next to Iroh, Zuko pondered his destiny.}$$
$$\text{Zuko left the ______.}$$
<br><br>

# The Transformer (Encoder and Decoder)

<div align='center'><img src="figs/paper_01.png" width='35%' ></div>

# Pretraining through language modeling (LM)

Recall the language modeling task:
- Model $p_{\theta}(w_t | w_1,\ldots,w_{t-1})$, the probability
distribution over words given their past
contexts.
- There’s lots of data for this! (In English)

Pretraining through language modeling:
- Train a neural network to perform language
modeling on a large amount of text.
- Save the network parameters

<div align='center'><img src="figs/pre_training_00.png" width='98%' ></div>
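The language modeling objective amounts to a cross-entropy between the prediction made at each position and the token that actually comes next. A minimal NumPy sketch, assuming the network has already produced one row of logits per position (the names and the toy sizes are illustrative):

```python
import numpy as np

def log_softmax(logits):
    logits = logits - logits.max(axis=-1, keepdims=True)
    return logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

def lm_loss(logits, token_ids):
    """logits: (T, V), row t scores the token that follows position t. token_ids: (T,)."""
    log_probs = log_softmax(logits[:-1])          # predictions made at positions 0..T-2
    targets = token_ids[1:]                       # the tokens found at positions 1..T-1
    picked = log_probs[np.arange(len(targets)), targets]
    return -picked.mean()                         # average negative log-likelihood

rng = np.random.default_rng(0)
T, V = 6, 10
token_ids = rng.integers(0, V, size=T)

# A model that predicts uniformly over the vocabulary pays log(V) per token
uniform_loss = lm_loss(np.zeros((T, V)), token_ids)

# A model that puts almost all its mass on the true next token pays almost nothing
confident = np.full((T, V), -50.0)
confident[np.arange(T - 1), token_ids[1:]] = 50.0
confident_loss = lm_loss(confident, token_ids)
print(uniform_loss, confident_loss)
```

No labels are needed: the targets are just the input shifted by one position, which is why raw text is enough for pretraining.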

# The Pretraining / Finetuning Paradigm

Think of pretraining as a **smart parameter initialization.**

# The Pretraining
**Step 1: Pretrain (on language modeling)**
Lots of text; learn general things!<br>
<div align='center'><img src="figs/pre_training_00.png" width='70%' ></div>

# Finetuning Paradigm

**Step 2: Finetune (on your task)**
Not many labels; adapt to the task!<br>
<div align='center'><img src="figs/pre_training_02.png" width='65%' ></div>

# Three ways to pre-train a model

# Pretraining for three types of architectures


## 1) Decoder Only

Language models! What we have seen so far. Nice to generate from, but they can **not** condition on future words.<br>
<div align='center'><img src="figs/decoder_only.png" width='30%' ></div>

## 2) Encoder Only

Gets **bidirectional context – can condition on future!** How do we train them to build strong representations?<br>

<div align='center'><img src="figs/encoder_only.png" width='30%' ></div>

## 3) Encoder-Decoder

The best of both worlds: Good parts of decoders and encoders?  What’s the best way to pretrain them?

<div align='center'><img src="figs/e2d.png" width='60%' ></div>

# 1) Decoder only - Option 1

When using language model pretrained decoders, we can ignore that they were trained to model $p_{\theta}(w_t | w_1,\ldots,w_{t-1})$.

We can fine-tune them by training a classifier
on the last word’s hidden state.

$$ \mbf{h}_1,\ldots,\mbf{h}_t=\operatorname{decoder}(w_1,\ldots,w_t)\\
\mbf{y} = \mbf{W}\mbf{h}_t+\mbf{b}$$

$\mbf{W},\mbf{b}$ are randomly initialized.


<div align='center'><img src="figs/decoder_finetune.png" width='60%' ></div>
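A sketch of the classification head in NumPy. The hidden states are random placeholders standing in for the output of a pretrained decoder, and the sizes are arbitrary; only $\mbf{W}$ and $\mbf{b}$ are new parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
t, d, n_classes = 5, 8, 3            # toy sizes (assumed)

# Placeholder for h_1..h_t; in practice these come from the pretrained decoder
hidden_states = rng.normal(size=(t, d))

# New classification head: randomly initialized, learned during fine-tuning
W = rng.normal(scale=0.02, size=(n_classes, d))
b = np.zeros(n_classes)

h_last = hidden_states[-1]           # only the last position has seen the whole input
y = W @ h_last + b                   # class scores, shape (n_classes,)

exp_y = np.exp(y - y.max())
probs = exp_y / exp_y.sum()          # softmax over classes
predicted_class = int(probs.argmax())
```

The last hidden state is used because, with causal attention, it is the only one that conditions on every input word.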

# 1) Decoder only - Option 2

Re-use them as LM $p_{\theta}(w_t | w_1,\ldots,w_{t-1})$.

This is helpful in tasks where the output is a sequence with a vocabulary like that at
pretraining time!

- **Dialogue** (context=dialogue history)
- **Summarization** (context=document)

$$ \mbf{h}_1,\ldots,\mbf{h}_t=\operatorname{decoder}(w_1,\ldots,w_t)\\
w_{t+1} \sim \operatorname{softmax}(\mbf{W}\mbf{h}_t+\mbf{b})$$

$\mbf{W},\mbf{b}$ were pretrained in the language model!

<div align='center'><img src="figs/decoder_finetune2.png" width='60%' ></div>

# Generative Pretrained Transformer (GPT) [Radford et al., 2018]
<br>
<div align='center'><img src="https://d2l.ai/_images/gpt-decoder-only.svg" width='30%'></div>

2018’s GPT was a big success in pretraining a decoder!

- Transformer decoder with **12 layers, 117M parameters.**
- **768**-dimensional hidden states, **3072**-dimensional feed-forward hidden layers.
- Byte-pair encoding with **40,000 merges** (it is not the size of the vocab.)
- Trained on **BooksCorpus: over 7000 unique books.**
- Contains long spans of contiguous text, for learning **long-distance dependencies.**
- The acronym "GPT" never showed up in the original paper; it could stand for "Generative PreTraining" or "Generative Pretrained Transformer".

# Generative Pretrained Transformer (GPT) [Radford et al., 2018]
<br>
<div align='center'><img src="figs/GPT_paper.png" width='70%'></div>

# Generative Pretrained Transformer (GPT) [Radford et al., 2018]

```
Premise: The man is in the doorway
Hypothesis: The man is near the door

labels: entailment/contradiction/neutral
```
Input: `[START] The man is in the doorway [DELIM] The man is near the door [EXTRACT]`
<div align='center'><img src="figs/GPT_experiments.png" width='65%'></div>

# GPT2: increasingly convincing generations [Radford et al., 2019]
<br>
<div align='center'><img src="figs/GPT2_generation.png" width='80%'></div>

# GPT3


<div align='center'><img src="figs/GPT3_01.png" width='80%' ></div>

# GPT3


<div align='center'><img src="figs/GPT3_00.png?2" width='80%' ></div>

# GPT3

GPT3 has **175 billion parameters** and was trained on **300B tokens of text**. The context window is **2048 tokens** (a few pages?).

The architecture is not much different from before:

> We use the same model and architecture as GPT-2 [RWC+19], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer

# GPT3 -  Huge Model, Huge Training set


<div align='center'><img src="figs/GPT3_02.png" width='80%' ></div>

# GPT3 -  In-context learning

We can "interact" with pretrained models in two ways:
- Sample from the distributions they define (maybe providing a prompt)
- Fine-tune them on a task we care about, and take their predictions.

**Very large language models seem to perform some kind of learning without gradient steps simply from examples you provide within their contexts.**

# GPT3 -  In-context learning

<div align='center'><img src="figs/GPT3_03.png" width='80%' ></div>

# GPT3 -  In-context learning

<br><div align='center'><img src="figs/GPT3_04.png" width='80%' ></div>

# GPT3 -  In-context learning

<br><div align='center'><img src="figs/prompt_zeroshot.png" width='70%' ></div>

## 2) Encoder Only

Gets **bidirectional context – can condition on future!** How do we train them to build strong representations?<br>

<div align='center'><img src="figs/encoder_only.png" width='30%' ></div>

# Masked Language Modeling + BERT (Bidirectional Encoder Representations from Transformers)

# Masked Language Modeling

We have looked at language model pre-training. But **encoders** get **bidirectional context**, so we **can not do language modeling!**

**Idea:** replace some fraction of words in the
input with a special `[MASK]` token; predict
these words.


$$ \mbf{h}_1,\ldots,\mbf{h}_t=\operatorname{encoder}(\tilde{w}_1,\ldots,\tilde{w}_t)\\
w_i \sim \operatorname{softmax}(\mbf{W}\mbf{h}_i+\mbf{b}) \quad \text{for each masked position } i$$

**Masked LM:** Only add loss terms from words that are
"masked out." If $\tilde{\mbf{x}}$ is the masked version of $\mbf{x}$, we are learning $p_{\mbf{\theta}}(\mbf{x}|\tilde{\mbf{x}})$.

<div align='center'><img src="figs/masked_LM.png" width='70%' ></div>

# BERT: Masked LM

[Devlin et al., 2018] proposed the **Masked LM** objective and released the weights of a
pretrained Transformer, a model they labeled **BERT**.


Some more details about Masked LM for BERT:
- Select a random **15% of (sub)word tokens** to predict; then, for each selected token:
  - replace the input word with **[MASK] 80% of the time**
  - replace the input word with a **random token 10% of the time**
  - leave the input word **unchanged 10% of the time** (but still predict it!)
- Why? This keeps the model from getting complacent and failing to build strong representations of non-masked words. (No masks are seen at fine-tuning time!)


<div align='center'><img src="figs/BERT_01.png" width='70%' ></div>
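The 15% / 80-10-10 corruption rule can be sketched with the standard library. This is a simplification under stated assumptions: each token is selected independently with probability 0.15 (rather than selecting a fixed 15% of the positions), and the function name and toy vocabulary are made up for illustration.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return the corrupted input and a dict {position: original token} to predict."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                           # not selected: no loss term at this position
        targets[i] = tok                       # selected: always predict the original token
        r = rng.random()
        if r < 0.8:
            corrupted[i] = '[MASK]'            # 80%: replace with the mask token
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the token unchanged
    return corrupted, targets

vocab = ['the', 'cat', 'sat', 'on', 'a', 'mat']
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']
corrupted, targets = mask_tokens(tokens, vocab)
print(corrupted, targets)
```

The loss is computed only at the positions stored in `targets`; every other position contributes nothing.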

# BERT: Masked LM



<div align='center'><img src="figs/BERT_02.png" width='70%' ></div>

# BERT: Masked LM



<div align='center'><img src="figs/BERT_03.png" width='70%' ></div>

# BERT: Next Sentence Prediction

<div align='center'><img src="https://d2l.ai/_images/bert-input.svg" width='70%' ></div>


BERT was also trained to predict whether the second chunk really follows the first one in the corpus or was randomly sampled.

# BERT Details

Two models were released:
- BERT-base: 12 layers, 768-dim hidden states, 12 attention heads, **110 million params.**
- BERT-large: 24 layers, 1024-dim hidden states, 16 attention heads, **340 million params**.

Trained on:
- BooksCorpus (800 million words)
- English Wikipedia (2,500 million words)

Pretraining is expensive and impractical on a single GPU.
- BERT was pretrained with 64 TPU chips for a total of 4 days.
- (TPUs are special tensor operation acceleration hardware)

Finetuning is practical and common on a single GPU **("Pretrain once, finetune many times.")**

# BERT: One model many tasks

<div align='center'><img src="figs/BERT_04.png" width='70%' ></div>

- **QQP:** Quora Question Pairs (detect paraphrase questions)
- **QNLI:** natural language inference over question answering data
- **SST-2:** sentiment analysis
- **CoLA:** corpus of linguistic acceptability (detect whether sentences are grammatical)
- **STS-B:** semantic textual similarity
- **MRPC:** Microsoft paraphrase corpus
- **RTE:** a small natural language inference corpus

# GLUE Benchmarks
<br>
<div align='center'><img src="figs/GLUE_01.png" width='70%' ></div>

https://gluebenchmark.com/leaderboard/

# BERT Extensions

There are a lot of BERT variants, like **RoBERTa**, **SpanBERT**, etc.

Some generally accepted improvements to the BERT pretraining formula:
- **RoBERTa:** mainly just train BERT for longer and remove next sentence prediction!
- **SpanBERT:** masking contiguous spans of words makes a harder, more useful pretraining task

# RoBERTa: A Robustly Optimized BERT Pretraining Approach


<br>
<div align='center'><img src="figs/roberta.png" width='70%' ></div>


## 3) Encoder-Decoder

The best of both worlds: Good parts of decoders and encoders?  What’s the best way to pretrain them?

<div align='center'><img src="figs/e2d.png" width='60%' ></div>

## 3) Encoder-Decoder Pretraining [T5 Model]

For encoder-decoders, we could do something like language modeling, but where a prefix of every input is provided to the encoder and is not predicted.

$$
\begin{gathered}
h_1, \ldots, h_T=\operatorname{Encoder}\left(w_1, \ldots, w_T\right) \\
h_{T+1}, \ldots, h_{2T}=\operatorname{Decoder}\left(w_1, \ldots, w_T, h_1, \ldots, h_T\right) \\
y_i \sim A h_i+b, i>T
\end{gathered}
$$


The encoder portion benefits from
bidirectional context; the decoder portion is
used to train the whole model through
language modeling

# 
<div align='center'><img src="figs/T5.png" width='80%' ></div>

# Fine-tuning approaches

# Full Fine-tuning vs. Parameter-Efficient Fine-tuning


- Finetuning every parameter in a pretrained model works well, but is **memory-intensive.**
- Lightweight finetuning methods **adapt pretrained models in a constrained way**: may lead to **less overfitting** and/or **more efficient finetuning** and inference.


<div align='center'><img src="figs/fine-tuning_00.png" width='70%' ></div>

# Parameter-Efficient Fine-tuning: Prefix-Tuning, Prompt tuning

Prefix-Tuning **adds a prefix of parameters**, and freezes all pre-trained parameters.

The prefix is processed by the model just like real words would be.

**Advantage: each element of a batch at inference could run a different tuned model.**

<div align='center'><img src="figs/fine-tuning_01.png" width='80%'></div>

# Parameter-Efficient Fine-tuning: Low-Rank Tuning

Low-Rank Adaptation (LoRA) learns a low-rank "diff" between the pre-trained and fine-tuned weight matrices.

_Hypothesis:_ **the fine-tuning updates (diffs) produce changes that live in a low-dimensional subspace.**

<div align='center'><img src="figs/fine-tuning_02.png" width='70%'></div>
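A minimal NumPy sketch of the low-rank "diff", assuming it is factored as a product $\mbf{B}\mbf{A}$ of two thin matrices of rank $r$. The sizes, the rank and the initialization (small Gaussian `A`, zero `B`, so that the diff starts at zero) are choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4                      # toy sizes and rank (assumed)

W0 = rng.normal(size=(d_out, d_in))             # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))      # trainable, shape (r, d_in)
B = np.zeros((d_out, r))                        # trainable, zero so the diff starts at 0

def lora_forward(x, W0, A, B):
    # Equivalent to (W0 + B @ A) @ x, without materializing the full diff
    return W0 @ x + B @ (A @ x)

x = rng.normal(size=d_in)
y = lora_forward(x, W0, A, B)

full_params = W0.size                           # what full fine-tuning would update
lora_params = A.size + B.size                   # what the low-rank diff updates
print(full_params, lora_params)
```

With these sizes the diff has 512 trainable parameters instead of 4096, and the saving grows with the size of the weight matrix.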

# How do LLMs scale?

# How do LLMs scale?

<br><div align='center'><img src="figs/GPT3_05.png" width='80%' ></div>

# How do LLMs scale?

<br><div align='center'><img src="figs/GPT3_06.png" width='80%' ></div>

# How do LLMs scale?

<br><div align='center'><img src="figs/GPT3_07.png" width='80%' ></div>

# How do LLMs scale?

<br><div align='center'><img src="figs/GPT3_08.png" width='80%' ></div>

# Evolutionary Tree of LLMs

<br><div align='center'><img src="figs/tree_of_LLM.jpg" width='80%' ></div>