# Summary of *Attention Is All You Need* by Vaswani et al.

Implementation of the Transformer model.

This module contains classes and functions that implement the main parts of
the Transformer model, as presented in the paper *Attention Is All You Need*
by Vaswani et al.

In [8]:
import math

import torch
import torch.nn as nn

<h2 id="Embeddings">Embeddings</h2>

Input and output sentences are sequences of tokens. Tokens are not necessarily words or characters: they are produced by a specific algorithm, called a *tokenizer*. In the case of the original paper, the byte pair encoding algorithm is used (cf. section [Training Data and Batching](#Training_Data_and_Batching) for more details).

The resulting set of tokens is called the *vocabulary*. Its cardinality is $d_\text{vocabulary}$ and depends on the dataset considered for the task.
Special tokens are included to represent the beginning of the sentence, the end of the sentence, and padding (i.e. a position in the sentence not occupied by a token with useful meaning).

The vocabulary is embedded in a vector space of real numbers $\mathbb{R}^{d_\text{model}}$, where $d_\text{model}$ is a hyperparameter. The embedding is equivalent to a bias-free linear layer applied to one-hot encoded tokens, with weights that are learned during training.
This embedding allows the model to learn hidden relations among tokens of the training set, while reducing the dimensionality of the token representation, since $d_\text{model} < d_\text{vocabulary}$.
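
The equivalence between an embedding and a linear layer can be checked numerically. The following is a minimal sketch, assuming toy sizes (`d_vocabulary = 10`, `d_model = 4`) and arbitrary token indices that are chosen only for illustration: looking up rows of the embedding matrix gives the same result as multiplying one-hot vectors by that matrix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy sizes, chosen only for illustration.
d_vocabulary, d_model = 10, 4
embedding = nn.Embedding(d_vocabulary, d_model)

# One sentence made of three (arbitrary) token indices.
token_ids = torch.tensor([[1, 5, 9]])

# Lookup: each index selects one row of the weight matrix.
looked_up = embedding(token_ids)

# Same result as a bias-free linear map applied to one-hot vectors.
one_hot = F.one_hot(token_ids, num_classes=d_vocabulary).float()
projected = one_hot @ embedding.weight

same = torch.allclose(looked_up, projected)
```

In practice the lookup is preferred because it avoids building the large, sparse one-hot tensor.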

The values produced by the embedding layer are scaled by a factor $\sqrt{d_\text{model}}$.

In [9]:
class InputEmbedding(nn.Module):
    def __init__(self, d_vocabulary: int, d_model: int) -> None:
        super().__init__()
        self.d_vocabulary = d_vocabulary
        self.d_model = d_model
        self.embedding = nn.Embedding(self.d_vocabulary, self.d_model)
    
    def forward(self, x):
        return self.embedding(x) * math.sqrt(self.d_model)

<h2 id="Positional_Encoding">Positional Encoding</h2>

The meaning of a sentence is determined by the words it contains and by their relative positions. Since the operations applied to a sequence are insensitive to token order (permuting the input tokens merely permutes the outputs), the information on the position of tokens is inserted explicitly in the model.
This is achieved by encoding the position of each token in numeric values that are evaluated by analytic formulae. These functions are fixed, i.e. they have no learned parameters: the paper also tested learned positional embeddings and reports nearly identical results for the two versions.
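
The effect of token permutation can be illustrated with a small experiment. The sketch below uses a hypothetical helper, `plain_self_attention`, which is a simplified scaled dot-product self-attention without learned projections (queries, keys and values are all the input itself); it is an assumption made for illustration, not the full attention layer of the paper. Shuffling the input tokens only shuffles the output rows in the same way, so the operation by itself carries no notion of position.

```python
import math

import torch

torch.manual_seed(0)


def plain_self_attention(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: attention with queries = keys = values = x.
    scores = x @ x.transpose(-2, -1) / math.sqrt(x.shape[-1])
    return torch.softmax(scores, dim=-1) @ x


x = torch.randn(5, 8)      # 5 tokens, each with 8 features (toy sizes)
perm = torch.randperm(5)   # a random reordering of the tokens

out = plain_self_attention(x)
out_perm = plain_self_attention(x[perm])

# Permuting the input permutes the output rows in the same way.
order_is_invisible = torch.allclose(out_perm, out[perm], atol=1e-5)
```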

For each token in a sequence, a vector of the same size $d_\text{model}$ as the embedding is generated by the equations
\begin{align*}
    \mathrm{PE}(\mathrm{pos}, 2t) & = \sin \bigg( \frac{\mathrm{pos}}{10000^{\frac{2t}{d_\text{model}}}} \bigg)
    \quad , \\
    \mathrm{PE}(\mathrm{pos}, 2t + 1) & = \cos \bigg( \frac{\mathrm{pos}}{10000^{\frac{2t}{d_\text{model}}}} \bigg)
    \quad ,
\end{align*}
depending on the parity of the index of the vector element, where $\mathrm{pos}$ is the position of the token in the sequence and $t \in \mathbb{N}$, with $0 \le t < d_\text{model} / 2$, identifies each sine and cosine pair of elements.
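
As a worked example, the formulae can be evaluated entry by entry in plain Python. This is a minimal sketch: the helper name `positional_encoding_entry` and the toy value `d_model = 4` are assumptions made for illustration only.

```python
import math


def positional_encoding_entry(pos: int, i: int, d_model: int) -> float:
    # i is the index inside the vector; t = i // 2 pairs a sine with a cosine.
    t = i // 2
    angle = pos / (10000 ** (2 * t / d_model))
    # Even indices use the sine, odd indices use the cosine.
    return math.sin(angle) if i % 2 == 0 else math.cos(angle)


d_model = 4  # toy size
row_pos0 = [positional_encoding_entry(0, i, d_model) for i in range(d_model)]
row_pos1 = [positional_encoding_entry(1, i, d_model) for i in range(d_model)]
```

At position 0 every angle is zero, so the vector alternates between $\sin(0) = 0$ and $\cos(0) = 1$. At position 1 the first pair uses the angle $1$ and the second pair the angle $1 / 10000^{2/4} = 0.01$.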

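Sinusoidal functions are chosen because they are defined for every position, so positional encodings can be evaluated for sequences longer than the ones encountered during training; the authors hypothesize that this may help the model extrapolate to such lengths. In addition, for any fixed offset $k$, $\mathrm{PE}(\mathrm{pos} + k, \cdot)$ is a linear function of $\mathrm{PE}(\mathrm{pos}, \cdot)$, which may make it easier for the model to attend by relative position.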

Dropout is applied during training to the sum of the embeddings and the positional encodings. The dropout probability is stored in the hyperparameter $\mathrm{dropout}$.
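
The behaviour of dropout in the two modes can be sketched as follows. The probability `p = 0.1` and the tensor of ones are assumptions chosen only for illustration. In evaluation mode `nn.Dropout` leaves its input unchanged, while in training mode each element is either zeroed or rescaled by $1 / (1 - p)$.

```python
import torch
import torch.nn as nn

p = 0.1  # illustrative dropout probability
dropout = nn.Dropout(p=p)
x = torch.ones(2, 3, 4)

# Evaluation mode: dropout is the identity.
dropout.eval()
eval_out = dropout(x)

# Training mode: elements are zeroed with probability p,
# the surviving ones are scaled by 1 / (1 - p).
dropout.train()
torch.manual_seed(0)
train_out = dropout(x)

kept_value = torch.tensor(1.0 / (1.0 - p))
only_zero_or_scaled = bool(
    ((train_out == 0) | torch.isclose(train_out, kept_value)).all()
)
```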

In [10]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, d_sequence: int, dropout: float) -> None:
        super().__init__()
        self.d_model = d_model
        self.d_sequence = d_sequence
        self.dropout = nn.Dropout(dropout)
        
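        # Table of encodings, one row per position: shape (d_sequence, d_model).
        pe = torch.zeros(d_sequence, d_model)
        pos = torch.arange(0, d_sequence, dtype=torch.float).unsqueeze(1)
        # 10000^(2t / d_model), computed as exp((2t / d_model) * log(10000)).
        denominator = torch.exp(torch.arange(0, d_model, 2).float() / d_model * math.log(10000))
        pe[:, 0::2] = torch.sin(pos / denominator)  # even indices
        pe[:, 1::2] = torch.cos(pos / denominator)  # odd indices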
        
        # Add batch dimension for parallel training.
        pe = pe.unsqueeze(0)
        
        # Store positional encoding values.
        self.register_buffer("pe", pe)
    
    def forward(self, x):
        # No learning needed for positional encoding parameters.
        x = x + (self.pe[:, :x.shape[1], :]).requires_grad_(False)
        return self.dropout(x)
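
The `denominator` in the class above is written with an exponential and a logarithm instead of a direct power. The following minimal sketch, assuming a toy value `d_model = 8`, checks that the two forms agree and that the resulting table of encodings has one row per position and stays within $[-1, 1]$.

```python
import math

import torch

d_model, d_sequence = 8, 6  # toy sizes, for illustration only

# Even indices 2t = 0, 2, 4, 6.
even_indices = torch.arange(0, d_model, 2).float()

# Direct power versus the exp/log form used in the class.
direct = torch.pow(10000.0, even_indices / d_model)
via_exp = torch.exp(even_indices / d_model * math.log(10000))
forms_agree = torch.allclose(direct, via_exp, rtol=1e-4)

# Build the table of encodings in the same way as the class does.
pos = torch.arange(0, d_sequence, dtype=torch.float).unsqueeze(1)
pe = torch.zeros(d_sequence, d_model)
pe[:, 0::2] = torch.sin(pos / via_exp)
pe[:, 1::2] = torch.cos(pos / via_exp)
```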