# Recurrent Neural Networks
## Transformer

Author: Binghen Wang

Last Updated: 6 Feb, 2023

<nav>
    <b>Deep learning navigation:</b> <a href="./Deep Learning Basics.ipynb">Deep Learning Basics</a> |
    <a href="./Deep Learning Optimization.ipynb">Optimization</a> |
    <a href="./Convolutional Neural Networks.ipynb">Convolutional Neural Networks</a>
    <br>
    <b>RNN navigation:</b> <a href="./Recurrent Neural Networks.ipynb">Basics</a> |
    <a href="./Natural Language Processing.ipynb">Natural Language Processing</a>
</nav>

---
<nav>
    <a href="../Machine%20Learning.ipynb">Machine Learning</a> |
    <a href="../Supervised Learning/Supervised%20Learning.ipynb">Supervised Learning</a>
</nav>

---

<div class="alert alert-block alert-warning">
Original paper:
<ul>
    <li><a href = "https://arxiv.org/pdf/1706.03762.pdf">Attention Is All You Need. (Vaswani et al., 2017)</a>
</ul>
    
Nice explanations:
<ul>
    <li><a href = "https://gb.coursera.org/learn/nlp-sequence-models">Sequential Models, Course 5. (DeepLearning.AI)</a>
    <li><a href= "https://people.tamu.edu/~sji/classes/attn.pdf">A Mathematical View of Attention Models in Deep Learning. (Ji, Gao & Xie, 2019)</a>
</ul>
This note consolidates the information based on the above sources.
</div>

## Contents
- Attention Mechanism
- Multi-Head Attention
- The Transformer

## Attention Mechanism
<blockquote>
    Given a set of $n$ query vectors $q_1,q_2,\dots,q_n \in \mathbb{R}^d$, $m$ key vectors $k_1,k_2,\dots,k_m \in \mathbb{R}^d$, and $m$ value vectors $v_1,v_2,\dots,v_m \in \mathbb{R}^p$, the <b>attention mechanism</b> computes a set of output vectors $o_1,o_2,\dots,o_n \in \mathbb{R}^q$ by linearly combining the $g$-transformed value vectors $g(v_i) \in \mathbb{R}^q$ using the relations between the corresponding query vector and each key vector as coefficients. Formally,
    $$
    o_j = \frac{1}{C}\sum_{i=1}^{m} f(q_j,k_i) g(v_i),
    $$
where $f(q_j,k_i)$ characterizes the relation (e.g., similarity) between $q_j$ and $k_i$, $g(\cdot)$ is commonly a linear transformation as $g(v_i) = W_v v_i \in \mathbb{R}^q$, where $W_v \in \mathbb{R}^{q\times p}$, and $C = \sum_{i=1}^{m} f(q_j,k_i)$ is a normalization factor. A commonly used similarity function is the embedded Gaussian, defined as $f(q_j,k_i) = \exp\left(\theta(q_j)^{\mathsf{T}}\phi(k_i)\right)$, where $\theta(\cdot)$ and $\phi(\cdot)$ are commonly linear transformations as $\theta(q_j)= W_q q_j$ and $\phi (k_i) = W_k k_i$.
    <div style ="text-align:right">— Ji, Gao & Xie, 2019</div>
</blockquote>

Here, for **notational consistency** with Vaswani et al. (2017), we stack the vectors as rows of $Q \in \mathbb{R}^{n\times d}$, $K \in \mathbb{R}^{m\times d}$ and $V \in \mathbb{R}^{m\times p}$, and **redefine** the linear transformations as their transposes, so that $W_v \in \mathbb{R}^{p\times q}$, $W_q \in \mathbb{R}^{d\times d}$ and $W_k \in \mathbb{R}^{d\times d}$.

<div style = "text-align: center;">
    <img src="./images/attention mechanism.png" style="width:80%;" >
</div>

The dimension of the output matrix $O \in \mathbb{R}^{n\times q}$ is governed by the number of queries $n$ and the dimension $q$ of the linearly transformed value vectors.
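As a minimal NumPy sketch of the definition above (with the embedded Gaussian similarity and vectors stored as rows), the matrix form can be checked against the summation form for one query. The toy sizes, the random inputs, and the helper names `softmax` and `attention` are assumptions made for illustration only.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max keeps exp() from overflowing."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def attention(Q, K, V, W_q, W_k, W_v):
    """O = softmax((Q W_q)(K W_k)^T) (V W_v), with vectors stored as rows."""
    scores = (Q @ W_q) @ (K @ W_k).T      # (n, m): theta(q_j)^T phi(k_i)
    return softmax(scores) @ (V @ W_v)    # (n, q)

rng = np.random.default_rng(0)
n, m, d, p, q = 3, 5, 4, 6, 2             # toy sizes, chosen arbitrarily
Q = rng.normal(size=(n, d))
K = rng.normal(size=(m, d))
V = rng.normal(size=(m, p))
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(p, q))

O = attention(Q, K, V, W_q, W_k, W_v)

# Cross-check the first output row against the summation form of the definition.
f = np.exp((Q[0] @ W_q) @ (K @ W_k).T)    # f(q_1, k_i) for every key, shape (m,)
o_first = (f[:, None] * (V @ W_v)).sum(axis=0) / f.sum()

print(O.shape, np.allclose(O[0], o_first))
```

The row-wise softmax plays the role of the factor $\frac{1}{C}$: each row of weights is the set of $f(q_j,k_i)$ values for one query, normalized to sum to one.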

### Self-Attention
In **self-attention**, $Q=K=V=X$, i.e., the queries, keys, and values all come from the same input embedding matrix $X$.
$$
O = \text{softmax}\left((QW_q){(KW_k)}^T\right)VW_v
$$
### Scaled Dot-Product Attention
Although dot-product attention is **much faster and more space-efficient** in practice than additive attention (which computes the compatibility function with a feed-forward neural network), it does not work as well for larger values of $d$. Vaswani et al. (2017) suspect that the dot products grow large in magnitude, pushing the softmax into regions where its gradients are extremely small (compare the vanishing gradient problem). In response, they propose **scaled dot-product attention**, which divides the dot products by $\sqrt{d}$:

$$
\begin{align}
O =& \text{softmax}\left(\frac{\tilde{Q}\tilde{K}^{T}}{\sqrt{d}}\right)\tilde{V} \\
=& \text{softmax}\left(\frac{(QW_q){(KW_k)}^{T}}{\sqrt{d}}\right)VW_v
\end{align}
$$
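The effect of the $\sqrt{d}$ factor can be illustrated numerically. The sketch below assumes query and key components that are independent with zero mean and unit variance (the setting Vaswani et al. use to motivate the scaling); under that assumption the raw dot product has variance close to $d$, while the scaled one stays close to 1. The function name `scaled_dot_product_attention` and all sizes are illustrative choices.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q_t, K_t, V_t):
    """softmax(Q_t K_t^T / sqrt(d)) V_t for already-projected inputs (rows are vectors)."""
    d = Q_t.shape[-1]
    return softmax(Q_t @ K_t.T / np.sqrt(d)) @ V_t

rng = np.random.default_rng(0)

# Empirical variance of q.k before and after scaling, for unit-variance components.
num_pairs = 5_000
variances = {}
for d in (4, 64, 512):
    q = rng.normal(size=(num_pairs, d))
    k = rng.normal(size=(num_pairs, d))
    dots = (q * k).sum(axis=1)
    variances[d] = (dots.var(), (dots / np.sqrt(d)).var())
    print(d, variances[d])

# Shape check on toy inputs: 3 queries, 5 keys/values, 2 output columns.
O = scaled_dot_product_attention(
    rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 2))
)
print(O.shape)
```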

### Attention with Learnable Query
A common variant of self-attention, **attention with learnable query**, takes no input for $Q$ and instead learns the (already transformed) query matrix entirely as trainable variables:
$$
O = \text{softmax}\left(\tilde{Q}_{\text{trainable}}{(KW_k)}^T\right)VW_v
$$
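One consequence of this variant is that the number of output rows is fixed by the number of learned queries, whatever the input length. A small NumPy sketch follows; the parameters are only randomly initialized here (no training is performed), and `num_queries`, `Q_trainable`, and `learnable_query_attention` are hypothetical names used for illustration.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d, p, q = 4, 6, 2
num_queries = 3   # hypothetical number of learned query vectors

# Parameters that training would update; here they are just random initial values.
Q_trainable = rng.normal(size=(num_queries, d))
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(p, q))

def learnable_query_attention(K, V):
    """O = softmax(Q_trainable (K W_k)^T) (V W_v); no query input is needed."""
    return softmax(Q_trainable @ (K @ W_k).T) @ (V @ W_v)

# Inputs of different lengths m map to outputs of the same shape.
shapes = []
for m in (5, 50):
    K = rng.normal(size=(m, d))
    V = rng.normal(size=(m, p))
    shapes.append(learnable_query_attention(K, V).shape)

print(shapes)
```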

## Multi-Head Attention
**Multi-head attention** runs several attention operators in parallel, each with its own projections of the queries, keys and values, which allows each head (one attention computation) to specialize. For an $M$-head attention, the output of the $i$-th head is computed (using scaled dot-product attention) as:
$$
H_i = \text{softmax}\left(\frac{(QW_q^{(i)}){(KW_k^{(i)})}^{T}}{\sqrt{d}}\right)VW_v^{(i)} \in \mathbb{R}^{n \times q_i}
$$
In principle, the heads may differ in the length $q_i$ of their output vectors.

The final output is computed by concatenating the heads and multiplying the result by a weight matrix $W_o$ (a linear transformation):
$$
O=\underbrace{\left[\begin{array}{ccccc} H_1 & H_2 & \cdots & H_{M-1} & H_M\end{array}\right]}_{n\times\sum_i q_i} W_o
$$
where $W_o \in \mathbb{R}^{\sum_i q_i \times q}$ projects the concatenated heads into the desired dimensionality $q$.
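The concatenate-then-project step can be sketched in NumPy as follows. This follows the notation of this note ($W_q^{(i)}, W_k^{(i)} \in \mathbb{R}^{d\times d}$ and scaling by $\sqrt{d}$) rather than any particular library's implementation; the head sizes in `head_dims`, the random weights, and the function name `multi_head_attention` are assumptions for illustration.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) tuples, one tuple per head."""
    d = Q.shape[-1]
    outputs = []
    for W_q, W_k, W_v in heads:
        scores = (Q @ W_q) @ (K @ W_k).T / np.sqrt(d)
        outputs.append(softmax(scores) @ (V @ W_v))   # H_i with shape (n, q_i)
    concatenated = np.concatenate(outputs, axis=-1)   # (n, sum_i q_i)
    return concatenated @ W_o                         # (n, q)

rng = np.random.default_rng(0)
n, m, d, p, q = 3, 5, 4, 6, 8
head_dims = [2, 3, 5]   # q_i is allowed to differ across heads

heads = [
    (rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(p, q_i)))
    for q_i in head_dims
]
W_o = rng.normal(size=(sum(head_dims), q))

Q = rng.normal(size=(n, d))
K = rng.normal(size=(m, d))
V = rng.normal(size=(m, p))

O = multi_head_attention(Q, K, V, heads, W_o)
print(O.shape)
```

Whatever the individual $q_i$ are, $W_o$ maps the concatenation back to $q$ columns, so the output has shape $n \times q$.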

<div style = "text-align: center;">
    <img src="./images/multi-head attention.png" style="width:80%;" >
</div>

## The Transformer

The Transformer relies on **self-attention** and **point-wise feed-forward networks** to compute representations of its input and output **without using sequence-aligned RNNs or convolution**.

### Model Architecture

<div style = "text-align: center;">
    <img src="./images/Transformer.png" style="width:50%;" ><br>
    Source: <a href = "https://arxiv.org/pdf/1706.03762.pdf">Attention Is All You Need. (Vaswani et al., 2017)</a>
</div>

### Positional Encoding
Since the Transformer uses neither recurrence nor convolution, an additional **positional encoding** is incorporated into the model so that it can take into account the relative and absolute positions of the tokens in the sequence.

There are two main types of positional encoding: 1) learned and 2) fixed. Vaswani et al. (2017) use a fixed encoding built from sinusoidal functions of different frequencies, where $pos$ is the token position and $i$ indexes the sine-cosine pairs:
$$
\begin{align}
PE_{(pos, 2i)} = & \sin\left(\frac{pos}{10000^{\frac{2i}{d}}}\right) \\
PE_{(pos, 2i+1)} = & \cos\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)
\end{align}
$$

<div style = "text-align: center;">
    <img src="./images/positional encoding.png" style="width:70%;" >
</div>
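The two formulas translate directly into a few lines of NumPy. The sketch below assumes an even $d$ and fills even columns with sines and odd columns with cosines; the function name `positional_encoding` and the sizes are illustrative.

```python
import numpy as np

def positional_encoding(max_len, d):
    """Sinusoidal encodings of shape (max_len, d); d is assumed to be even."""
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(d // 2)[None, :]           # (1, d/2), index of each sine-cosine pair
    angles = pos / (10000 ** (2 * i / d))    # (max_len, d/2)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)             # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)             # PE(pos, 2i+1)
    return pe

PE = positional_encoding(max_len=50, d=16)
print(PE.shape)
print(PE[0, :4])   # at position 0 every sine is 0 and every cosine is 1
```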

**Property 1: Each positional encoding vector has a constant norm.**

For an even number of dimensions $d$, the resulting position encoding vectors have a **constant norm** regardless of the position:
$$
\Vert PE_{(pos)} \Vert = \sqrt{\frac{d}{2}}
$$
because each of the $d/2$ sine-cosine pairs contributes
$$\sin^2\theta + \cos^2 \theta = 1$$
to the squared norm.
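Property 1 can be checked numerically with a short sketch. The sizes `d` and `max_len` are arbitrary toy values, and the encoding is rebuilt inline so the snippet runs on its own.

```python
import numpy as np

d, max_len = 16, 100                                    # toy sizes; d must be even
pos = np.arange(max_len)[:, None]
angles = pos / (10000 ** (2 * np.arange(d // 2) / d))   # (max_len, d/2)

PE = np.empty((max_len, d))
PE[:, 0::2] = np.sin(angles)
PE[:, 1::2] = np.cos(angles)

# Every row should have norm sqrt(d/2), independent of the position.
norms = np.linalg.norm(PE, axis=1)
print(norms.min(), norms.max(), np.sqrt(d / 2))
```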

**Property 2: The distance between two positional encoding vectors depends on the relative positions of words irrespective of the absolute positions of each word.**

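To see this, consider a single sine-cosine pair, where $\theta$ is the angle at one position and $k$ is the angle offset produced by the shift in position: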
$$
\begin{align}
&{\left[\sin{(\theta+ k)} - \sin{\theta}\right]}^2 + {\left[\cos{(\theta+ k)} - \cos{\theta}\right]}^2 \\
= & \;2 - 2\left[\sin{(\theta+ k)}\sin{\theta}+\cos{(\theta+ k)}\cos{\theta}\right] \\
= & \;2- 2\cos{(\theta + k - \theta)} \\
= & \;2- 2\cos{k} \left( = 4\sin^2\left(\frac{k}{2}\right) \right)
\end{align}
$$
which does not depend on $\theta$.
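Summing the per-pair result over all $d/2$ frequencies gives a closed form for the full distance. Writing $\omega_i = 10000^{-2i/d}$ (a symbol introduced here for convenience), a shift of $k$ positions changes the $i$-th angle by $k\omega_i$, so the squared distance is $\sum_i \left(2 - 2\cos(k\omega_i)\right)$. The sketch below checks this for an arbitrary shift; the sizes and the function name `positional_encoding` are illustrative.

```python
import numpy as np

def positional_encoding(max_len, d):
    """Sinusoidal encodings of shape (max_len, d); d is assumed to be even."""
    pos = np.arange(max_len)[:, None]
    angles = pos / (10000 ** (2 * np.arange(d // 2) / d))
    pe = np.empty((max_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

d, max_len, k = 16, 200, 7        # toy sizes and an arbitrary position shift k
PE = positional_encoding(max_len, d)

# ||PE(pos + k) - PE(pos)|| for every valid starting position pos.
dists = np.linalg.norm(PE[k:] - PE[:-k], axis=1)

# Closed form: sum over frequency pairs of 2 - 2 cos(k * omega_i).
omega = 1.0 / (10000 ** (2 * np.arange(d // 2) / d))
closed_form = np.sqrt(np.sum(2 - 2 * np.cos(k * omega)))

print(np.allclose(dists, closed_form))
```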

Link: <a href = "https://en.wikipedia.org/wiki/List_of_trigonometric_identities">List of trigonometric identities </a>

<div style = "text-align: center;">
    <img src="./images/relative positions_positional encoding.png" style="width:100%;" >
</div>