# Summary of *Attention Is All You Need* by Vaswani et al.

Implementation of the Transformer model.

This module contains the classes and functions that implement the main parts of
the Transformer model, as presented in the paper *Attention Is All You Need*
by Vaswani et al.

In [1]:
import math

import torch
import torch.nn as nn

<h2 id="Embeddings">Embeddings</h2>

Input and output sentences are sequences of tokens. Tokens are not necessarily words or characters; they are produced by a dedicated algorithm called a *tokenizer*. The original paper uses byte pair encoding (cf. section [Training Data and Batching](#Training_Data_and_Batching) for more details).

The resulting set of tokens is called the *vocabulary*. Its cardinality is $d_\text{vocabulary}$, which depends on the dataset considered for the task.
Special tokens are included to mark the beginning of the sentence, the end of the sentence, and padding (i.e. a position in the sentence not occupied by a meaningful token).
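
As a rough illustration of how a vocabulary with special tokens can be built and used, here is a minimal sketch in plain Python. It is not the tokenizer used in the paper: tokens are whole words rather than byte-pair units, and the special-token strings (`<pad>`, `<s>`, `</s>`, `<unk>`), the unknown-token fallback, and the helper names `build_vocabulary` and `encode` are all assumptions made for this example.

```python
# Toy vocabulary; the special-token strings below are illustrative assumptions.
PAD, BOS, EOS, UNK = "<pad>", "<s>", "</s>", "<unk>"


def build_vocabulary(corpus: list[list[str]]) -> dict[str, int]:
    """Map every token to an integer id, reserving the first ids for special tokens."""
    vocabulary = {token: idx for idx, token in enumerate([PAD, BOS, EOS, UNK])}
    for sentence in corpus:
        for token in sentence:
            # Assign the next free id the first time a token is seen.
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def encode(sentence: list[str], vocabulary: dict[str, int], max_len: int) -> list[int]:
    """Wrap a sentence in BOS/EOS markers, then pad it with PAD ids up to max_len."""
    ids = [vocabulary[BOS]]
    ids += [vocabulary.get(token, vocabulary[UNK]) for token in sentence]
    ids.append(vocabulary[EOS])
    if len(ids) > max_len:
        raise ValueError("sentence is longer than max_len")
    return ids + [vocabulary[PAD]] * (max_len - len(ids))


corpus = [["the", "cat", "sat"], ["the", "dog"]]
vocabulary = build_vocabulary(corpus)
d_vocabulary = len(vocabulary)  # 4 special tokens + 4 distinct corpus tokens
batch = [encode(sentence, vocabulary, max_len=6) for sentence in corpus]
```

Padding brings both sentences to the same length, so they can be stacked into a single tensor of token ids.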

The vocabulary is embedded in a real vector space $\mathbb{R}^{d_\text{model}}$, where $d_\text{model}$ is a hyperparameter. The embedding is equivalent to a bias-free linear layer applied to one-hot encoded tokens, with weights learned during training.
This embedding allows the model to learn hidden relations among the tokens of the training set while lowering the dimensionality of the representation from $d_\text{vocabulary}$ to $d_\text{model}$, since $d_\text{model} < d_\text{vocabulary}$.
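
The equivalence between an embedding lookup and a linear map can be checked numerically. The sketch below, using toy sizes chosen only for the demonstration, compares `nn.Embedding` with an explicit product between one-hot vectors and the same weight matrix; the variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_vocabulary, d_model = 10, 4  # toy sizes, chosen only for the demonstration

embedding = nn.Embedding(d_vocabulary, d_model)
tokens = torch.tensor([[1, 5, 9], [0, 2, 2]])  # (batch, sequence) of token ids

# Path 1: table lookup, as nn.Embedding does it.
looked_up = embedding(tokens)

# Path 2: one-hot encode the ids, then multiply by the same weight matrix.
one_hot = F.one_hot(tokens, num_classes=d_vocabulary).float()
projected = one_hot @ embedding.weight  # (2, 3, 10) @ (10, 4) -> (2, 3, 4)

# Both paths select the same rows of the weight matrix.
same = torch.allclose(looked_up, projected)
```

The lookup is preferred in practice because it avoids materializing one-hot vectors of size $d_\text{vocabulary}$ for every token.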

The values produced by the embedding layer are multiplied by a factor $\sqrt{d_\text{model}}$.

In [None]:
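class InputEmbedding(nn.Module):
    """Map token ids to d_model-dimensional vectors scaled by sqrt(d_model)."""

    def __init__(self, d_vocabulary: int, d_model: int) -> None:
        super().__init__()
        self.d_vocabulary = d_vocabulary
        self.d_model = d_model
        # Learnable lookup table of shape (d_vocabulary, d_model).
        self.embedding = nn.Embedding(self.d_vocabulary, self.d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x holds integer token ids; the output gains a trailing dimension of size d_model.
        return self.embedding(x) * math.sqrt(self.d_model)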