# Summary of *Attention Is All You Need* by Vaswani et al.

Implementation of Transformer model.

This module contains classes and functions which implement the main parts of
the Transformer model, as presented in the paper *Attention Is All You Need*
by Vaswani et al.

In [10]:
import math

import torch
import torch.nn as nn

<h2 id="Model_Architecture">Model Architecture</h2>

The Transformer model has an encoder-decoder architecture, where $N$ encoder layers and $N$ decoder layers are stacked, each layer taking the outputs of the previous layer as inputs.

The layers contained in each encoder or decoder layer are referred to as "sublayers".

The inputs of the first layer are vectors of real numbers with dimension set by the hyperparameter $d_\text{model}$. Hereafter, vectors in formulae are considered row vectors.
The output of the last decoder layer is transformed by a linear layer to obtain vectors of dimension appropriate to the classification task.

<img src="figures/ModalNet-21.png" alt="The Transformer" width="300" />

<h3 id="Embeddings_and_Softmax">Embeddings and Softmax</h3>

Input and output sentences are sequences of tokens. Tokens are not necessarily words or characters; they are identified through a specific algorithm, called a "tokenizer". In the original paper, the byte pair encoding algorithm is used (cf. section [Training Data and Batching](#Training_Data_and_Batching) for more details).

The resulting set of tokens is called the "vocabulary"; its cardinality is $d_\text{vocabulary}$ and depends on the dataset considered for the task.
Special tokens are included to mark the beginning of the sentence, the end of the sentence and padding (i.e. a position in the sentence not occupied by a token with useful meaning).
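
As a minimal sketch of how special tokens are used, the snippet below wraps tokenized sentences with start and end markers and pads them to a common length. The token strings `<sos>`, `<eos>` and `<pad>`, the toy vocabulary and the helper `encode` are illustrative assumptions, not taken from the paper.

```python
# Hypothetical special tokens; the actual strings depend on the tokenizer.
SOS, EOS, PAD = "<sos>", "<eos>", "<pad>"

# Toy vocabulary mapping each token to an integer index.
vocabulary = {PAD: 0, SOS: 1, EOS: 2, "the": 3, "cat": 4, "sleeps": 5}


def encode(tokens: list[str], length: int) -> list[int]:
    """Wrap tokens with SOS/EOS and right-pad with PAD up to `length`."""
    wrapped = [SOS] + tokens + [EOS]
    if len(wrapped) > length:
        raise ValueError("sequence longer than the target length")
    padded = wrapped + [PAD] * (length - len(wrapped))
    return [vocabulary[token] for token in padded]


batch = [encode(["the", "cat", "sleeps"], 6), encode(["the", "cat"], 6)]
print(batch)  # [[1, 3, 4, 5, 2, 0], [1, 3, 4, 2, 0, 0]]
```

Padding lets sentences of different lengths be stored in a single rectangular batch.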

The vocabulary is embedded in a vector space of real numbers $\mathbb{R}^{d_\text{model}}$. The embedding is equivalent to a linear layer without bias applied to one-hot encoded tokens, where the weights are learned during training.
This embedding allows the model to learn hidden relations among tokens of the training set, while lowering the dimensionality of the representation since $d_\text{model} < d_\text{vocabulary}$.
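
The equivalence between an embedding lookup and a linear layer can be checked on a toy example: multiplying a one-hot row vector by the weight matrix selects the corresponding row. The matrix values and the helper names below are arbitrary placeholders chosen for illustration; in the model the weights are learned.

```python
# Toy embedding matrix with d_vocabulary = 4 rows and d_model = 2 columns.
# The values are arbitrary; in the model they are learned.
weights = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
    [0.7, 0.8],
]


def one_hot(index: int, size: int) -> list[float]:
    """Row vector with a single 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vector_times_matrix(vector: list[float], matrix: list[list[float]]) -> list[float]:
    """Product of a row vector with a matrix."""
    n_columns = len(matrix[0])
    return [
        sum(vector[row] * matrix[row][col] for row in range(len(matrix)))
        for col in range(n_columns)
    ]


token_index = 2
as_linear_layer = vector_times_matrix(one_hot(token_index, len(weights)), weights)
as_lookup = weights[token_index]
print(as_linear_layer == as_lookup)  # True
```

In practice the lookup is preferred because it avoids building the one-hot vectors.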

Learned weights are shared between the input and output embedding layers. Moreover, the values returned by the embedding layer are scaled by a factor $\sqrt{d_\text{model}}$.

In [11]:
class InputEmbedding(nn.Module):
    def __init__(self, d_vocabulary: int, d_model: int) -> None:
        super().__init__()
        self.d_vocabulary = d_vocabulary
        self.d_model = d_model
        self.embedding = nn.Embedding(self.d_vocabulary, self.d_model)
    
    def forward(self, input):
        return self.embedding(input) * math.sqrt(self.d_model)

The weights of the embedding layer are also shared with the linear layer placed before the softmax layer, which determines the probabilities of the next token.
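
One way to read this weight sharing: the pre-softmax logits are the dot products between the decoder output vector and every row of the embedding matrix, i.e. the linear layer uses the transposed embedding matrix (any bias term aside). The sketch below uses a toy matrix with arbitrary values; the names `tied_logits` and `hidden` are hypothetical.

```python
import math

# Toy embedding matrix: d_vocabulary = 3, d_model = 2 (arbitrary values).
embedding = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]


def tied_logits(hidden: list[float]) -> list[float]:
    """Logits computed with the transposed embedding matrix as weights."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in embedding]


def softmax(values: list[float]) -> list[float]:
    """Numerically stable softmax over a list of numbers."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


hidden = [2.0, 1.0]  # stand-in for a decoder output vector
logits = tied_logits(hidden)
probabilities = softmax(logits)
print(logits)  # [2.0, 1.0, 3.0]
```

The token whose embedding is most aligned with the decoder output receives the highest probability.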

In [12]:
class Linear(nn.Module):
    def __init__(self, d_model: int, d_vocabulary: int) -> None:
        super().__init__()
        self.linear = nn.Linear(d_model, d_vocabulary)
    
    def forward(self, input):
        return self.linear(input)

In [13]:
class Softmax(nn.Module):
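    def __init__(self, dim: int = -1) -> None:
        super().__init__()
        # Normalize along the last dimension by default; nn.Softmax without
        # an explicit dim relies on a deprecated implicit choice.
        self.softmax = nn.Softmax(dim=dim)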
    
    def forward(self, input):
        return self.softmax(input)

<h3 id="Positional_Encoding">Positional Encoding</h3>

The meaning of a sentence is determined by the words it contains and their relative positions. Since the model contains no recurrence or convolution, its operations do not depend on the order of the tokens, so the information on the position of tokens is inserted explicitly in the model.
This is achieved by encoding the position of each token in numeric values which are evaluated by analytic formulae. These functions are fixed, i.e. no learning is performed for them: the authors also experimented with learned positional embeddings and found the differences in performance between the two versions negligible.

For each token in a sequence, a vector with the same size $d_\text{model}$ as the embedding is generated by the equation
\begin{equation*}
    \mathrm{PE}(\mathrm{pos}, i) =
    \begin{cases}
        \sin \bigg( \frac{\mathrm{pos}}{10000^{\frac{i}{d_\text{model}}}} \bigg) & \text{$i$ even} \\
        \cos \bigg( \frac{\mathrm{pos}}{10000^{\frac{i - 1}{d_\text{model}}}} \bigg) & \text{$i$ odd}
    \end{cases}
    \quad ,
\end{equation*}
depending on the parity of the element index, where $\mathrm{pos}$ is the position of the token in the sequence and $i = 0, \dots, d_\text{model} - 1$ is the index of the element in the vector. Each pair of consecutive elements $(2k, 2k + 1)$ therefore shares the same frequency, as in the original paper.

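Trigonometric functions are chosen because they are defined for every position, so positional encodings can be evaluated for sequences longer than the ones encountered during training without any learned parameters. Moreover, for a fixed offset $k$, $\mathrm{PE}(\mathrm{pos} + k, \cdot)$ is a linear function of $\mathrm{PE}(\mathrm{pos}, \cdot)$, which the authors hypothesize makes it easy for the model to attend by relative positions.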
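
The sketch below evaluates the sinusoidal encoding in plain Python and checks numerically a property noted in the original paper: the encoding of position $\mathrm{pos} + k$ can be obtained from the encoding of $\mathrm{pos}$ through a linear map (a rotation of each sine/cosine pair) that depends only on $k$. The function names are illustrative and $d_\text{model}$ is assumed to be even.

```python
import math


def positional_encoding(pos: int, d_model: int) -> list[float]:
    """Sinusoidal encoding of one position; d_model is assumed even."""
    encoding = []
    for pair in range(d_model // 2):
        # Elements 2 * pair and 2 * pair + 1 share the same frequency.
        frequency = 1.0 / 10000 ** (2 * pair / d_model)
        encoding.append(math.sin(pos * frequency))
        encoding.append(math.cos(pos * frequency))
    return encoding


def shift(encoding: list[float], offset: int) -> list[float]:
    """Encoding of position pos + offset, computed from the encoding of pos.

    Each (sin, cos) pair is rotated by the angle offset * frequency, so the
    map is linear in the encoding and does not depend on pos.
    """
    d_model = len(encoding)
    shifted = []
    for pair in range(d_model // 2):
        angle = offset / 10000 ** (2 * pair / d_model)
        sin_pos, cos_pos = encoding[2 * pair], encoding[2 * pair + 1]
        shifted.append(sin_pos * math.cos(angle) + cos_pos * math.sin(angle))
        shifted.append(cos_pos * math.cos(angle) - sin_pos * math.sin(angle))
    return shifted


print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
direct = positional_encoding(7, 8)
via_rotation = shift(positional_encoding(3, 8), 4)
print(all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(direct, via_rotation)))  # True
```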

The positional encoding is added to the embedding of each token. Dropout is applied to this sum during training and the dropout probability is stored in the hyperparameter $\mathrm{dropout}$.

In [14]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, d_sequence: int, dropout: float) -> None:
        super().__init__()
        self.d_model = d_model
        self.d_sequence = d_sequence
        self.dropout = nn.Dropout(dropout)
        
        pe = torch.zeros(d_sequence, d_model)
        pos = torch.arange(0, d_sequence, dtype=torch.float).unsqueeze(1)
        # Compute 10000^(2k / d_model) as exp(2k / d_model * log(10000)).
        denominator = torch.exp(torch.arange(0, d_model, 2).float() / d_model * math.log(10000))
        # Even indices hold sines, odd indices hold cosines with the same frequency.
        pe[:, 0::2] = torch.sin(pos / denominator)
        pe[:, 1::2] = torch.cos(pos / denominator)
        
        # Add batch dimension for parallel processing of sequences.
        pe = pe.unsqueeze(0)
        
        # Register as a buffer: saved and moved with the module, but not trained.
        self.register_buffer("pe", pe)
    
    def forward(self, input):
        # Buffers are not parameters, so no gradient is computed for the encoding.
        input = input + self.pe[:, :input.shape[1], :]
        return self.dropout(input)

<h3 id="Layer_Normalization_and_Residual_Connection">Layer Normalization and Residual Connection</h3>

Layer normalization is the last transformation applied to the output values of each sublayer. Normalization layers help to stabilize training and to speed up its convergence.

The transformation consists in normalizing the input values of each position using the mean and standard deviation computed over the elements of its vector, the latter being evaluated with the biased estimator.
A learned gain and a learned bias then rescale and shift the normalized values. To avoid division by zero, a small constant scalar $\mathrm{eps}$ is added to the standard deviation.
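
As a worked example of the normalization, the plain-Python sketch below applies it to a single vector. The scalar `gain` and `bias` arguments and their default values are assumptions made for illustration.

```python
import math


def layer_norm(values: list[float], gain: float = 1.0, bias: float = 0.0,
               eps: float = 1e-6) -> list[float]:
    """Normalize one vector with its own mean and biased standard deviation."""
    n = len(values)
    mean = sum(values) / n
    # Biased estimator: divide by n, not by n - 1.
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [gain * (v - mean) / (std + eps) + bias for v in values]


normalized = layer_norm([1.0, 2.0, 3.0, 4.0])
print([round(v, 3) for v in normalized])  # [-1.342, -0.447, 0.447, 1.342]
```

With unit gain and zero bias the output has zero mean and a standard deviation close to 1.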

In [15]:
class LayerNormalization(nn.Module):
    def __init__(self, eps: float = 1e-6) -> None:
        super().__init__()
        self.eps = eps
        self.gain = nn.Parameter(torch.ones(1))
        self.bias = nn.Parameter(torch.zeros(1))
    
    def forward(self, input):
        # Dimension is kept to allow broadcasting.
        mean = input.mean(dim=-1, keepdim=True)
        std = input.std(dim=-1, correction=0, keepdim=True)
        return self.gain / (std + self.eps) * (input - mean) + self.bias

Before applying the layer normalization, the output values of the sublayer are added to its input values. This sum is called a residual connection and eases the flow of gradients through the stacked layers during training. Dropout is applied to the output of the sublayer before the sum.

In [16]:
class ResidualConnection(nn.Module):
    def __init__(self, eps: float, dropout: float) -> None:
        super().__init__()
        self.norm = LayerNormalization(eps)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, input, sublayer):
        # Add the sublayer output (after dropout) to its input, then normalize.
        return self.norm(input + self.dropout(sublayer(input)))

<h3 id="Position-wise_Feed-Forward_Networks">Position-wise Feed-Forward Networks</h3>

A feedforward neural network is applied separately and identically to the data corresponding to each position (i.e. each token in the input sequence). The network has two linear layers with a ReLU activation function between them. Dimensions are $d_\text{model}$ for the input and output layers and $d_\text{ff}$ for the hidden layer.

The model is
\begin{equation*}
    \mathrm{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2
    \quad ,
\end{equation*}
where $x$ is the input row vector and the ReLU function $\max(0, \cdot)$ is applied element-wise. Weights are shared among all the positions, but are different between the layers of the model.
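
The formula can be traced by hand on a toy case. The sketch below uses $d_\text{model} = 2$ and $d_\text{ff} = 3$ with arbitrary weights chosen for illustration, and applies the same network to every position of a short sequence.

```python
def relu(vector: list[float]) -> list[float]:
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in vector]


def affine(vector: list[float], weights: list[list[float]], bias: list[float]) -> list[float]:
    """Row vector times matrix, plus bias."""
    return [
        sum(vector[i] * weights[i][j] for i in range(len(vector))) + bias[j]
        for j in range(len(bias))
    ]


# Toy dimensions d_model = 2 and d_ff = 3; the weights are arbitrary.
W_1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]
b_1 = [0.0, 0.0, 0.0]
W_2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_2 = [0.5, -0.5]


def ffn(x: list[float]) -> list[float]:
    """FFN(x) = max(0, x W_1 + b_1) W_2 + b_2."""
    return affine(relu(affine(x, W_1, b_1)), W_2, b_2)


# The same weights are applied independently to each position.
sequence = [[1.0, 2.0], [2.0, 1.0]]
outputs = [ffn(x) for x in sequence]
print(outputs)  # [[1.5, 2.5], [2.5, -0.5]]
```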

In [17]:
class Feedforward(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout: float) -> None:
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)
    
    def forward(self, input):
        return self.linear_2(self.dropout(self.relu(self.linear_1(input))))

<h3 id="Attention">Attention</h3>

The "attention" is a mapping between vectors corresponding to tokens in a given sequence. Its purpose is to identify the tokens with the highest correlation without losing information on the less correlated tokens. This information is then aggregated as the output of the function. The attention sublayer implements this mapping.

In more detail, the attention function applied to each token is
\begin{equation*}
    \mathrm{Attention}(Q, K, V) = \mathrm{softmax} \Bigg( \frac{Q K^\intercal}{\sqrt{d_k}} \Bigg) V
\end{equation*}
where $Q \in \mathbb{R}^{d_k}$ is the query vector of the token, and the rows of $K$ and $V$ are the key vectors in $\mathbb{R}^{d_k}$ and the value vectors in $\mathbb{R}^{d_v}$ of the tokens in the sequence, respectively. Queries and keys have the same dimension so that their dot products can be evaluated without introducing additional weights. The softmax turns the scaled dot products into weights that sum to 1, and the output is the corresponding weighted sum of the value vectors.
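
A minimal plain-Python sketch of the formula for a single query vector, with the key and value vectors of the sequence stored as rows of `keys` and `values`. The numeric values and the helper names are arbitrary and only meant to make the steps visible.

```python
import math


def softmax(values: list[float]) -> list[float]:
    """Numerically stable softmax over a list of numbers."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    # One scaled score per key.
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys
    ]
    weights = softmax(scores)
    # Weighted sum of the value vectors.
    d_v = len(values[0])
    output = [
        sum(w * value[j] for w, value in zip(weights, values)) for j in range(d_v)
    ]
    return output, weights


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]
output, weights = attention(query, keys, values)
print(weights)
print(output)
```

The key most aligned with the query receives the largest weight, but the other values still contribute to the output.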

The authors call this attention function "Scaled Dot-Product Attention", in contrast to the "additive attention", which is implemented through a feedforward network with a single hidden layer, and to the "dot-product attention", which is equivalent to the function $\mathrm{Attention}$ without dividing the argument of the softmax by $\sqrt{d_k}$. Dot-product attention is computed more efficiently than additive attention, but for large values of $d_k$ the dot products grow large in magnitude and push the softmax into saturated regions where gradients are extremely small. The factor $\frac{1}{\sqrt{d_k}}$ in the Scaled Dot-Product Attention addresses this issue.
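
The effect of the scaling can be seen on a small numeric example. The paper argues that, if the components of query and key are independent with zero mean and unit variance, their dot product has variance $d_k$; for $d_k = 64$ scores of magnitude around 8 are therefore typical. The scores below are hand-picked under that assumption.

```python
import math


def softmax(values: list[float]) -> list[float]:
    """Numerically stable softmax over a list of numbers."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


d_k = 64
# Hand-picked scores of the typical size sqrt(d_k) = 8 before scaling.
raw_scores = [8.0, 0.0, -8.0]
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

unscaled = softmax(raw_scores)
scaled = softmax(scaled_scores)
print([round(p, 3) for p in unscaled])  # [1.0, 0.0, 0.0]
print([round(p, 3) for p in scaled])    # [0.665, 0.245, 0.09]
```

Without scaling the softmax is nearly one-hot, so small changes in the scores barely change the output; after scaling the weights stay in a range where gradients are informative.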

In the Transformer, attention is applied in parallel $h$ times, each application being called a "head" of a "multi-head attention" sublayer. The vectors resulting from the heads are concatenated into a single vector, which is then multiplied by a weight matrix. The formula presented in the original paper is
\begin{equation*}
    \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) W^O
\end{equation*}
where $\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$, index $i = 1, \dots, h$ identifies the head and the matrices $W_i^Q \in \mathbb{R}^{d_\text{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_\text{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_\text{model} \times d_v}$ and $W^O \in \mathbb{R}^{h d_v \times d_\text{model}}$ are the linear projections for the query, key, value and output vectors, respectively.

The authors note that, by choosing $d_k = d_v = \frac{d_\text{model}}{h}$, the total computational cost is similar to that of attention with a single head of full dimensionality.
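
A quick count of projection weights illustrates this choice. The values $d_\text{model} = 512$ and $h = 8$ are those of the base model in the paper; the comparison with a single full-dimensional head is a back-of-the-envelope sketch, not a measurement of running time.

```python
d_model = 512  # value used for the base model in the paper
h = 8          # number of heads in the base model
d_k = d_v = d_model // h

# Per-head projections W_i^Q, W_i^K, W_i^V, summed over the h heads.
multi_head = h * (d_model * d_k + d_model * d_k + d_model * d_v)
# Output projection W^O.
multi_head += (h * d_v) * d_model

# Single head with full-dimensional projections (d_k = d_v = d_model).
single_head = 3 * d_model * d_model + d_model * d_model

print(d_k, multi_head, single_head)  # 64 1048576 1048576
```

Both configurations use the same number of weights, which is why the projections of all heads can be stored in single $d_\text{model} \times d_\text{model}$ matrices.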

Moreover, the query, key and value vectors of all the tokens are stacked as rows of the matrices $Q$, $K$ and $V$, to parallelize the evaluation on the entire token sequence.
The inputs of the multi-head attention sublayer differ depending on the position of the sublayer inside the model:

- In the encoder layer, the query, key and value vectors are all taken from the same output of the previous sublayer (self-attention).
- In the multi-head attention sublayer which connects the encoder to a decoder layer, the key and value vectors are the outputs of the encoder, while the query is the output of the previous sublayer in the decoder layer. This mimics the encoder-decoder attention of sequence-to-sequence models.
- In the decoder layer, the query, key and value vectors are again taken from the same output of the previous sublayer, but each token attends only to its own and previous positions in the sequence. This condition is achieved by masking the entries corresponding to subsequent positions, i.e. setting them to $-\infty$ in the argument of the softmax layer.
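
The masking used in the decoder can be sketched in plain Python. The helper names are hypothetical, and uniform scores are used so that the effect of the mask is easy to read.

```python
import math


def softmax(values: list[float]) -> list[float]:
    """Numerically stable softmax over a list of numbers."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def causal_mask(length: int) -> list[list[bool]]:
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(length)] for i in range(length)]


def masked_softmax(scores, mask):
    """Row-wise softmax with disallowed entries set to -inf beforehand."""
    rows = []
    for score_row, mask_row in zip(scores, mask):
        masked = [
            s if allowed else -math.inf for s, allowed in zip(score_row, mask_row)
        ]
        rows.append(softmax(masked))
    return rows


scores = [[0.0] * 3 for _ in range(3)]  # uniform scores for readability
weights = masked_softmax(scores, causal_mask(3))
for row in weights:
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.333, 0.333, 0.333]
```

Each row still sums to 1, but positions to the right of the diagonal receive exactly zero weight.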

In [18]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, h: int, dropout: float) -> None:
        super().__init__()
        # Dimension of embedding must be divisible by number of heads.
        assert d_model % h == 0, "d_model must be divisible by h"
        self.d_model = d_model
        self.h = h
        self.d_k = d_model // h
        self.d_v = self.d_k
        # Each linear layer stacks the projection matrices of all h heads.
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        # Normalize over the key positions (last dimension of the scores).
        self.softmax = nn.Softmax(dim=-1)
        self.W_O = nn.Linear(d_model, d_model, bias=False)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor, mask: torch.BoolTensor):
        query = self.W_Q(Q)
        key = self.W_K(K)
        value = self.W_V(V)
        
        # (batch, seq_len, d_model) --> (batch, seq_len, h, d_k) --> (batch, h, seq_len, d_k)
        query = query.reshape(query.shape[0], query.shape[1], self.h, self.d_k).transpose(1, 2)
        key = key.reshape(key.shape[0], key.shape[1], self.h, self.d_k).transpose(1, 2)
        value = value.reshape(value.shape[0], value.shape[1], self.h, self.d_v).transpose(1, 2)
        
        # Scaled scores with shape (batch, h, seq_len_query, seq_len_key).
        scores = torch.matmul(query, key.transpose(2, 3)) / math.sqrt(self.d_k)
        # Masked positions get a large negative score, hence ~0 weight after softmax.
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e15)
        attention = self.dropout(self.softmax(scores))
        # Weighted sum of values: (batch, h, seq_len_query, d_v).
        output = torch.matmul(attention, value)
        
        # Concatenate heads: (batch, h, seq_len, d_v) --> (batch, seq_len, h * d_v).
        output = output.transpose(1, 2)
        output = output.reshape(output.shape[0], output.shape[1], self.h * self.d_v)
        return self.W_O(output)