In [5]:
import numpy
import torch
import typing
import yaml

from torch import Tensor

# Implementation steps
1. Scaled Dot-product attention
2. Attention Head
3. Multi-Head attention mechanism
4. Positional encoding
5. Feed-forward and residual blocks
6. Encoder
7. Decoder
8. Transformer

**Note** that we'll try to produce clean, commented code and use the `typing` module.

## 1. Scaled Dot-product attention

Scaled dot-product attention scores each **Query** (the data being processed) against every **Key**
(for example, the hidden states of the encoder) with a dot product. A softmax turns the scores into weights,
and the weights are used to average the **Value** vectors.

Because the magnitude of the dot product grows with the dimensionality $d_k$ of the Query and Key vectors,
we divide it by $\sqrt{d_k}$ to keep the scores from becoming so large that the softmax saturates.

\begin{equation*}
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V
\end{equation*}


In [6]:
def scaled_dot_product_attention(query: Tensor, key: Tensor, value: Tensor) -> Tensor:
    """
    :param query: a Tensor
    :param key: a Tensor
    :param value: a Tensor
    
    :return: a Tensor, the softmax of the scaled dot-product applied to the value
    """
    # TO DO
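
One possible way to complete the stub above is sketched here. It assumes batched inputs of shape `(batch, length, dimension)`, which the notebook does not state explicitly, and it leaves out masking.

```python
import torch
from torch import Tensor


def scaled_dot_product_attention(query: Tensor, key: Tensor, value: Tensor) -> Tensor:
    # Assumed shapes: query (batch, len_q, d_k), key (batch, len_k, d_k),
    # value (batch, len_k, d_v).
    d_k = query.size(-1)
    # Similarity of every query with every key, rescaled by sqrt(d_k).
    scores = query.bmm(key.transpose(1, 2)) / d_k ** 0.5
    # Each row of weights sums to 1 over the key positions.
    weights = torch.softmax(scores, dim=-1)
    # Weighted average of the value vectors.
    return weights.bmm(value)


q = torch.rand(2, 5, 16)
k = torch.rand(2, 7, 16)
v = torch.rand(2, 7, 32)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 5, 32])
```

The output has one row per query and the dimension of the value vectors, whatever the number of keys.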

## 2. Attention Head

A class that inherits from `torch.nn.Module`.
The constructor takes as parameters:
- the input dimension `dim_in` as an `int` 
- an output dimension `dim_k` as an `int` for the Query and the Key
- an output dimension `dim_v` as an `int` for the Value

The forward method takes as parameters:
- the `query` as a Tensor
- the `key` as a Tensor
- the `value` as a Tensor

and returns the scaled dot-product attention as a Tensor.

In [8]:
class AttentionHead(torch.nn.Module):
    """
    
    """
    # TO DO
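
A minimal sketch of such a head, assuming one `Linear` projection per input (the attribute names `q`, `k` and `v` are our own choice). The attention function is repeated in compact form so that the block runs on its own.

```python
import torch
from torch import Tensor


def scaled_dot_product_attention(query: Tensor, key: Tensor, value: Tensor) -> Tensor:
    # Compact version, repeated so that this block runs on its own.
    scores = query.bmm(key.transpose(1, 2)) / query.size(-1) ** 0.5
    return torch.softmax(scores, dim=-1).bmm(value)


class AttentionHead(torch.nn.Module):
    """One attention head: project the inputs, then attend."""

    def __init__(self, dim_in: int, dim_k: int, dim_v: int):
        super().__init__()
        # Query and key share dim_k so that their dot product is defined.
        self.q = torch.nn.Linear(dim_in, dim_k)
        self.k = torch.nn.Linear(dim_in, dim_k)
        self.v = torch.nn.Linear(dim_in, dim_v)

    def forward(self, query: Tensor, key: Tensor, value: Tensor) -> Tensor:
        return scaled_dot_product_attention(self.q(query), self.k(key), self.v(value))


head = AttentionHead(dim_in=30, dim_k=8, dim_v=12)
x = torch.rand(4, 10, 30)
print(head(x, x, x).shape)  # torch.Size([4, 10, 12])
```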

## 3. Multi-Head attention mechanism

A class that inherits from `torch.nn.Module` and runs a number `num_heads` of Attention Heads on the same inputs, 
then concatenates their results and applies a `Linear` layer.

The constructor takes as parameters:
- the number of heads as an `int` (num_heads)
- the input dimension `dim_in` as an `int` 
- an output dimension `dim_k` as an `int` for the Query and the Key
- an output dimension `dim_v` as an `int` for the Value

The `Linear` layer takes as input the concatenation of the attention vectors and outputs a vector of 
dimension `dim_in`.

The forward method takes as parameters:
- the `query` as a Tensor
- the `key` as a Tensor
- the `value` as a Tensor

and returns the result of the multi-head attention mechanism.

In [9]:
class MultiHeadAttention(torch.nn.Module):
    """
    
    """
    def __init__(self, num_heads: int, dim_in: int, dim_k: int, dim_v: int):
        super().__init__()
        # TO DO

    def forward(self, query: Tensor, key: Tensor, value: Tensor) -> Tensor:
        # TO DO
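
The stub above could be completed along these lines. This is a sketch under the assumption that the heads are stored in a `ModuleList` and concatenated along the last dimension; a compact version of the section 2 head is included so that the block runs on its own.

```python
import torch
from torch import Tensor


class AttentionHead(torch.nn.Module):
    # Compact version of the section 2 head, with the attention inlined.
    def __init__(self, dim_in: int, dim_k: int, dim_v: int):
        super().__init__()
        self.q = torch.nn.Linear(dim_in, dim_k)
        self.k = torch.nn.Linear(dim_in, dim_k)
        self.v = torch.nn.Linear(dim_in, dim_v)

    def forward(self, query: Tensor, key: Tensor, value: Tensor) -> Tensor:
        q, k, v = self.q(query), self.k(key), self.v(value)
        scores = q.bmm(k.transpose(1, 2)) / q.size(-1) ** 0.5
        return torch.softmax(scores, dim=-1).bmm(v)


class MultiHeadAttention(torch.nn.Module):
    def __init__(self, num_heads: int, dim_in: int, dim_k: int, dim_v: int):
        super().__init__()
        self.heads = torch.nn.ModuleList(
            [AttentionHead(dim_in, dim_k, dim_v) for _ in range(num_heads)]
        )
        # Each head outputs dim_v features, so the concatenation has
        # num_heads * dim_v features, mapped back to dim_in.
        self.linear = torch.nn.Linear(num_heads * dim_v, dim_in)

    def forward(self, query: Tensor, key: Tensor, value: Tensor) -> Tensor:
        concatenated = torch.cat([h(query, key, value) for h in self.heads], dim=-1)
        return self.linear(concatenated)


mha = MultiHeadAttention(num_heads=8, dim_in=30, dim_k=3, dim_v=3)
x = torch.rand(4, 10, 30)
out = mha(x, x, x)
print(out.shape)  # torch.Size([4, 10, 30])
```

Because the final `Linear` layer maps back to `dim_in`, the output has the same shape as the query, which is what allows a residual connection around the module.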

## 4. Positional encoding

As the attention mechanism is not aware of any sequence order, this information is 
provided to the network explicitly by encoding the position of each element in the sequence.

The positional encodings have the same dimension $d_{model}$ as the embeddings, so that the two can be summed.

There are different ways to encode this information. In the original paper, the authors propose to use the 
following equation:

\begin{equation*}
\mathrm{PE}(pos, 2i) = \sin\left(\frac{pos}{10000^\frac{2i}{d_{model}}}\right)
\end{equation*}

and 

\begin{equation*}
\mathrm{PE}(pos, 2i+1) = \cos\left(\frac{pos}{10000^\frac{2i}{d_{model}}}\right)
\end{equation*}

Where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional 
encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$.
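
To make the two equations concrete, here is a deliberately slow, loop-based transcription. The helper name `sinusoid_table` is hypothetical and the sketch assumes an even `dim_model`; it is meant as a reference to check values against, not as an efficient implementation.

```python
import math
import torch


def sinusoid_table(seq_len: int, dim_model: int) -> torch.Tensor:
    """Hypothetical helper: PE(pos, 2i) and PE(pos, 2i+1) for an even dim_model."""
    table = torch.zeros(seq_len, dim_model)
    for pos in range(seq_len):
        for i in range(dim_model // 2):
            angle = pos / 10000 ** (2 * i / dim_model)
            table[pos, 2 * i] = math.sin(angle)
            table[pos, 2 * i + 1] = math.cos(angle)
    return table


pe = sinusoid_table(seq_len=4, dim_model=6)
print(pe[0])  # position 0: sin(0) = 0 in even slots, cos(0) = 1 in odd slots

# The wavelength of pair i is 2*pi * 10000^(2i / d_model), so consecutive
# pairs differ by the constant factor 10000^(2 / d_model).
wavelengths = [2 * math.pi * 10000 ** (2 * i / 6) for i in range(3)]
print(wavelengths)
```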

We implement the position encoding with a function that takes as input:
- an integer `seq_len` that is the length of the sequence
- an integer `dim_model` that is the dimension of the embeddings
- the device, `torch.device` used by PyTorch

and returns a `Tensor`.

In [10]:
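def position_encoding(seq_len: int, dim_model: int, device: torch.device = torch.device("cpu")) -> Tensor:
    """
    :param seq_len: length of the sequence
    :param dim_model: dimension of the embeddings
    :param device: device on which the encoding is created

    :return: a Tensor of shape (1, seq_len, dim_model)
    """
    pos = torch.arange(seq_len, dtype=torch.float, device=device).reshape(1, -1, 1)
    dim = torch.arange(dim_model, dtype=torch.float, device=device).reshape(1, 1, -1)
    # Exponent 2i / d_model: each (sin, cos) pair shares the even index 2i
    phase = pos / (1e4 ** ((dim - dim % 2) / dim_model))

    return torch.where(dim.long() % 2 == 0, torch.sin(phase), torch.cos(phase))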

## 5. Feed-forward and residual blocks

`We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.`

The feed-forward **sub-layer** is implemented as a function that returns a `torch.nn.Sequential` composed of:
- one linear layer ($dim\_input \times dim\_feedforward$)
- one `ReLU` activation
- one linear layer ($dim\_feedforward \times dim\_input$)

The dimensionality of input and output is 512, and the inner-layer has dimensionality 2048.

In [11]:
def feed_forward(dim_input: int = 512, dim_feedforward: int = 2048) -> torch.nn.Module:
    """
    
    """
    # TO DO
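
Following the three-item description above, the stub could be completed as follows (a sketch, with the defaults 512 and 2048 taken from the text):

```python
import torch


def feed_forward(dim_input: int = 512, dim_feedforward: int = 2048) -> torch.nn.Module:
    return torch.nn.Sequential(
        torch.nn.Linear(dim_input, dim_feedforward),  # expand to the inner dimension
        torch.nn.ReLU(),
        torch.nn.Linear(dim_feedforward, dim_input),  # project back to the input size
    )


ff = feed_forward()
x = torch.rand(2, 10, 512)
print(ff(x).shape)  # torch.Size([2, 10, 512])
```

The `Linear` layers act on the last dimension only, so the same two-layer network is applied independently at every position of the sequence.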

`The output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. … We apply dropout to the output of each sub-layer, before it is added to the sub-layer input and normalized.`

Now we implement the residual block as a `class` that inherits from `torch.nn.Module`.

The constructor of the class takes as input:
- `sublayer`, the module to wrap (multi-head attention or feed-forward), of type `torch.nn.Module`
- `dimension`, an integer that is the size of the output of the sub-layer
- `dropout`, a `float` giving the dropout probability (by default: 0.1)

The forward method of the `Residual` module takes as input a variable number of `Tensor` arguments.
We assume that the **value** `Tensor` comes last, to be compliant with the signature of 
`MultiHeadAttention`.

In [12]:
class Residual(torch.nn.Module):
    """
    
    """
    def __init__(self, sublayer: torch.nn.Module, dimension: int, dropout: float = 0.1):
        """
        
        """
        super().__init__()
        # TO DO

    def forward(self, *tensors: Tensor) -> Tensor:
        """
        
        """
        # TO DO
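
A possible completion of the stub above. Treating the last tensor as the sub-layer input `x` that gets added back is our reading of the remark that the value tensor comes last; for attention between two different sequences one may prefer to add back the query instead.

```python
import torch
from torch import Tensor


class Residual(torch.nn.Module):
    def __init__(self, sublayer: torch.nn.Module, dimension: int, dropout: float = 0.1):
        super().__init__()
        self.sublayer = sublayer
        self.norm = torch.nn.LayerNorm(dimension)
        self.dropout = torch.nn.Dropout(dropout)

    def forward(self, *tensors: Tensor) -> Tensor:
        # Assumption: the last tensor is the input x that is added back,
        # giving LayerNorm(x + Dropout(Sublayer(x))).
        return self.norm(tensors[-1] + self.dropout(self.sublayer(*tensors)))


# A plain Linear layer stands in for the wrapped sub-layer.
block = Residual(torch.nn.Linear(16, 16), dimension=16)
block.eval()  # disable dropout so that the output is deterministic
x = torch.rand(3, 5, 16)
out = block(x)
print(out.shape)  # torch.Size([3, 5, 16])
```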

## 6. Encoder

Now we can create the `TransformerEncoderLayer` that will be duplicated to create the Encoder (left side of the network).

The constructor of the class takes as input:
- an integer `dim_model` that gives the dimension of the embeddings, which is also the output dimension of the multi-head attention mechanism (default is 512)
- an integer `num_heads` that gives the number of attention heads (default is 8)
- an integer `dim_feedforward` that gives the dimension of the inner-layer of the feed-forward block (default is 2048)
- the probability of dropout given as a float `dropout` whose default value is 0.1

The `TransformerEncoderLayer` consists of one `Residual` block encapsulating a `MultiHeadAttention` followed
by one `Residual` block encapsulating a `feed_forward`.

The `forward` method takes a single `Tensor`, `src`, as input.

In [14]:
class TransformerEncoderLayer(torch.nn.Module):
    """
    
    """
    def __init__(
        self, 
        dim_model: int = 512, 
        num_heads: int = 8, 
        dim_feedforward: int = 2048, 
        dropout: float = 0.1, 
    ):
        """
        
        """
        super().__init__()
        dim_k = dim_v = dim_model // num_heads
        
        # TO DO

    def forward(self, src: Tensor) -> Tensor:
        """
        
        """
        # TO DO
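
Putting the pieces together, the encoder layer could look like the sketch below. Compact versions of the building blocks from sections 1 to 5 are included so that the block runs on its own; the attribute names `attention` and `feed_forward` are our own choice.

```python
import torch
from torch import Tensor


class AttentionHead(torch.nn.Module):
    def __init__(self, dim_in: int, dim_k: int, dim_v: int):
        super().__init__()
        self.q = torch.nn.Linear(dim_in, dim_k)
        self.k = torch.nn.Linear(dim_in, dim_k)
        self.v = torch.nn.Linear(dim_in, dim_v)

    def forward(self, query: Tensor, key: Tensor, value: Tensor) -> Tensor:
        q, k, v = self.q(query), self.k(key), self.v(value)
        scores = q.bmm(k.transpose(1, 2)) / q.size(-1) ** 0.5
        return torch.softmax(scores, dim=-1).bmm(v)


class MultiHeadAttention(torch.nn.Module):
    def __init__(self, num_heads: int, dim_in: int, dim_k: int, dim_v: int):
        super().__init__()
        self.heads = torch.nn.ModuleList(
            [AttentionHead(dim_in, dim_k, dim_v) for _ in range(num_heads)]
        )
        self.linear = torch.nn.Linear(num_heads * dim_v, dim_in)

    def forward(self, query: Tensor, key: Tensor, value: Tensor) -> Tensor:
        return self.linear(torch.cat([h(query, key, value) for h in self.heads], dim=-1))


def feed_forward(dim_input: int = 512, dim_feedforward: int = 2048) -> torch.nn.Module:
    return torch.nn.Sequential(
        torch.nn.Linear(dim_input, dim_feedforward),
        torch.nn.ReLU(),
        torch.nn.Linear(dim_feedforward, dim_input),
    )


class Residual(torch.nn.Module):
    def __init__(self, sublayer: torch.nn.Module, dimension: int, dropout: float = 0.1):
        super().__init__()
        self.sublayer = sublayer
        self.norm = torch.nn.LayerNorm(dimension)
        self.dropout = torch.nn.Dropout(dropout)

    def forward(self, *tensors: Tensor) -> Tensor:
        return self.norm(tensors[-1] + self.dropout(self.sublayer(*tensors)))


class TransformerEncoderLayer(torch.nn.Module):
    def __init__(self, dim_model=512, num_heads=8, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        dim_k = dim_v = dim_model // num_heads
        self.attention = Residual(
            MultiHeadAttention(num_heads, dim_model, dim_k, dim_v),
            dimension=dim_model, dropout=dropout,
        )
        self.feed_forward = Residual(
            feed_forward(dim_model, dim_feedforward),
            dimension=dim_model, dropout=dropout,
        )

    def forward(self, src: Tensor) -> Tensor:
        # Self-attention: query, key and value are all the source sequence.
        src = self.attention(src, src, src)
        return self.feed_forward(src)


layer = TransformerEncoderLayer(dim_model=30, num_heads=8, dim_feedforward=64)
out = layer(torch.rand(4, 10, 30))
print(out.shape)  # torch.Size([4, 10, 30])
```

With `dim_model=30` and 8 heads, each head works in `30 // 8 = 3` dimensions; the concatenation has 24 features and the final `Linear` layer restores the 30 needed for the residual sum.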


And now the complete Encoder.

In [15]:
class TransformerEncoder(torch.nn.Module):
    # TO DO
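
The stacking logic could be sketched as follows. To keep the block short, PyTorch's built-in `torch.nn.TransformerEncoderLayer` stands in for the notebook's own layer (it requires `dim_model` to be divisible by `num_heads`), and `position_encoding` is written out from the two equations of section 4.

```python
import torch
from torch import Tensor


def position_encoding(seq_len: int, dim_model: int, device=None) -> Tensor:
    pos = torch.arange(seq_len, dtype=torch.float, device=device).reshape(1, -1, 1)
    dim = torch.arange(dim_model, dtype=torch.float, device=device).reshape(1, 1, -1)
    # Exponent 2i / d_model, where 2i is the even index of each (sin, cos) pair.
    phase = pos / (1e4 ** ((dim - dim % 2) / dim_model))
    return torch.where(dim.long() % 2 == 0, torch.sin(phase), torch.cos(phase))


class TransformerEncoder(torch.nn.Module):
    def __init__(
        self,
        num_layers: int = 6,
        dim_model: int = 512,
        num_heads: int = 8,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        # Stand-in: the built-in layer replaces the notebook's own
        # TransformerEncoderLayer; only the stacking is illustrated here.
        self.layers = torch.nn.ModuleList([
            torch.nn.TransformerEncoderLayer(
                dim_model, num_heads, dim_feedforward, dropout, batch_first=True
            )
            for _ in range(num_layers)
        ])

    def forward(self, src: Tensor) -> Tensor:
        seq_len, dimension = src.size(1), src.size(2)
        # Add the positions once, before the first layer.
        src = src + position_encoding(seq_len, dimension, device=src.device)
        for layer in self.layers:
            src = layer(src)
        return src


encoder = TransformerEncoder(num_layers=2, dim_model=32, num_heads=4, dim_feedforward=64)
out = encoder(torch.rand(4, 10, 32))
print(out.shape)  # torch.Size([4, 10, 32])
```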

## 7. Decoder

Now we create the classes for the Decoder, the right part of the network.

First the `TransformerDecoderLayer` that will be duplicated.

The constructor takes as inputs:
- an integer `dim_model` that gives the dimension of the embeddings, which is also the output dimension of the multi-head attention mechanism (default is 512)
- an integer `num_heads` that gives the number of attention heads (default is 8)
- an integer `dim_feedforward` that gives the dimension of the inner-layer of the feed-forward block (default is 2048)
- the probability of dropout given as a float `dropout` whose default value is 0.1

The `TransformerDecoderLayer` consists of two consecutive `Residual` blocks, each encapsulating a `MultiHeadAttention`, followed by one `Residual` block encapsulating a `feed_forward`.

The `forward` method takes two tensors, `tgt` and `memory`, as input.

In [16]:
class TransformerDecoderLayer(torch.nn.Module):
    """
    
    """
    def __init__(
        self, 
        dim_model: int = 512, 
        num_heads: int = 8, 
        dim_feedforward: int = 2048, 
        dropout: float = 0.1, 
    ):
        # TO DO

    def forward(self, tgt: Tensor, memory: Tensor) -> Tensor:
        """
        
        """
        # TO DO
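
The data flow of the decoder layer can be sketched as below. This is not the notebook's own solution: `torch.nn.MultiheadAttention` stands in for `MultiHeadAttention` (it requires `dim_model` to be divisible by `num_heads`), the residual sums are written out explicitly and add back `tgt`, and no mask is applied since the `forward` signature takes none.

```python
import torch
from torch import Tensor


class TransformerDecoderLayer(torch.nn.Module):
    def __init__(self, dim_model=512, num_heads=8, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        # Stand-ins for the notebook's MultiHeadAttention.
        self.self_attention = torch.nn.MultiheadAttention(
            dim_model, num_heads, batch_first=True
        )
        self.cross_attention = torch.nn.MultiheadAttention(
            dim_model, num_heads, batch_first=True
        )
        self.feed_forward = torch.nn.Sequential(
            torch.nn.Linear(dim_model, dim_feedforward),
            torch.nn.ReLU(),
            torch.nn.Linear(dim_feedforward, dim_model),
        )
        self.norms = torch.nn.ModuleList(
            [torch.nn.LayerNorm(dim_model) for _ in range(3)]
        )
        self.dropout = torch.nn.Dropout(dropout)

    def forward(self, tgt: Tensor, memory: Tensor) -> Tensor:
        # 1. Self-attention over the target sequence.
        attended, _ = self.self_attention(tgt, tgt, tgt)
        tgt = self.norms[0](tgt + self.dropout(attended))
        # 2. Cross-attention: queries from the target, keys and values
        #    from the encoder output (memory).
        attended, _ = self.cross_attention(tgt, memory, memory)
        tgt = self.norms[1](tgt + self.dropout(attended))
        # 3. Position-wise feed-forward block.
        return self.norms[2](tgt + self.dropout(self.feed_forward(tgt)))


layer = TransformerDecoderLayer(dim_model=32, num_heads=4, dim_feedforward=64)
tgt = torch.rand(4, 7, 32)
memory = torch.rand(4, 10, 32)
out = layer(tgt, memory)
print(out.shape)  # torch.Size([4, 7, 32])
```

The output keeps the length of `tgt` (7 here) even though `memory` has 10 positions, because the queries of both attention blocks come from the target sequence.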

And now the `TransformerDecoder` class, which consists of 6 `TransformerDecoderLayer` followed by a linear layer
and the final softmax (don't forget the position encoding).

In [22]:
class TransformerDecoder(torch.nn.Module):
    """
    """
    def __init__(
        self, 
        num_layers: int = 6,
        dim_model: int = 512, 
        num_heads: int = 8, 
        dim_feedforward: int = 2048, 
        dropout: float = 0.1, 
        number_classes: int = 2,
    ):
        super().__init__()
        self.layers = torch.nn.ModuleList([
            TransformerDecoderLayer(dim_model, num_heads, dim_feedforward, dropout)
            for _ in range(num_layers)
        ])
        self.linear = torch.nn.Linear(dim_model, number_classes)

    def forward(self, tgt: Tensor, memory: Tensor) -> Tensor:
        seq_len, dimension = tgt.size(1), tgt.size(2)
        # Out-of-place add on tgt's device: the caller's tensor is not modified
        tgt = tgt + position_encoding(seq_len, dimension, device=tgt.device)
        for layer in self.layers:
            tgt = layer(tgt, memory)

        return torch.softmax(self.linear(tgt), dim=-1)

## 8. Transformer

Now just put the encoder and decoder together.

And put the data through...

In [23]:
class Transformer(torch.nn.Module):
    """
    
    """
    def __init__(
        self, 
        num_encoder_layers: int = 6,
        num_decoder_layers: int = 6,
        dim_model: int = 512, 
        num_heads: int = 8, 
        dim_feedforward: int = 2048, 
        dropout: float = 0.1, 
        activation: torch.nn.Module = torch.nn.ReLU(),
    ):
        # TO DO

    def forward(self, src: Tensor, tgt: Tensor) -> Tensor:
        # TO DO
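
The wiring itself is short: the encoder output becomes the `memory` argument of the decoder. In the sketch below PyTorch's built-in encoder and decoder stacks stand in for the notebook's own classes, so there is no positional encoding, final linear layer or softmax, and the `activation` argument of the stub is left out.

```python
import torch
from torch import Tensor


class Transformer(torch.nn.Module):
    def __init__(
        self,
        num_encoder_layers: int = 6,
        num_decoder_layers: int = 6,
        dim_model: int = 512,
        num_heads: int = 8,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        # Stand-ins: built-in stacks replace the notebook's own
        # TransformerEncoder and TransformerDecoder.
        self.encoder = torch.nn.TransformerEncoder(
            torch.nn.TransformerEncoderLayer(
                dim_model, num_heads, dim_feedforward, dropout, batch_first=True
            ),
            num_layers=num_encoder_layers,
        )
        self.decoder = torch.nn.TransformerDecoder(
            torch.nn.TransformerDecoderLayer(
                dim_model, num_heads, dim_feedforward, dropout, batch_first=True
            ),
            num_layers=num_decoder_layers,
        )

    def forward(self, src: Tensor, tgt: Tensor) -> Tensor:
        # The encoder output ("memory") is what the decoder attends to.
        memory = self.encoder(src)
        return self.decoder(tgt, memory)


model = Transformer(
    num_encoder_layers=2, num_decoder_layers=2,
    dim_model=32, num_heads=4, dim_feedforward=64,
)
out = model(torch.rand(8, 10, 32), torch.rand(8, 7, 32))
print(out.shape)  # torch.Size([8, 7, 32])
```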

# Test

Now create a Transformer that will process sequences of 100 vectors of dimension 30.
Create a dummy batch of input data and check that everything runs smoothly.

In [27]:
src = torch.rand(64, 100, 30)
tgt = torch.rand(64, 100, 30)
out = Transformer(dim_model=30)(src, tgt)
print(out.shape)

torch.Size([64, 100, 2])
