# Some links
- [Code reference](https://towardsdatascience.com/build-your-own-transformer-from-scratch-using-pytorch-84c850470dcb)
- [Original Paper -- Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf)

In [16]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import torch.nn.functional as F
import math
import copy

# Positional encoding
The full sentence is passed to the encoder. From the encoder's view, these are just embeddings without any order. Therefore, positional encoding is used to tell the encoder the original position of each word in a sentence.

![image](https://github.com/guyuxuan9/Transformer-from-scratch/assets/58468284/c218d9f1-49b3-4ad2-ba0a-2b4fe1625503)


**Original paper**:

![image](https://github.com/guyuxuan9/Transformer-from-scratch/assets/58468284/53fd61b6-a49f-424c-a0cb-9f8b5e738f23)

**Code implementation** (rewrite $10000^{-2i/d_{model}}$ as the exponential of a logarithm; the value is identical and it is computed with one vectorised `torch.exp` call):


$div\_term = e^{-2i \frac{\log(10000)}{d_{model}}} = \frac{1}{10000^{2i/d_{model}}}$,

$PE_{(pos,\,2i)} = \sin(position \times div\_term)$ (even dimensions), $PE_{(pos,\,2i+1)} = \cos(position \times div\_term)$ (odd dimensions)
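
A quick numeric check that the exponential form used in the code equals the power form from the paper. This is only a sketch; `d_model = 8` is an arbitrary small value chosen for illustration, and `k` stands for the even index $2i$.

```python
import math

d_model = 8        # illustrative width (assumption, not the value used for training)
base = 10000.0

# k = 0, 2, 4, ... plays the role of 2i: one frequency per (sin, cos) pair.
div_exp = [math.exp(-k * math.log(base) / d_model) for k in range(0, d_model, 2)]  # form used in the code
div_pow = [1.0 / base ** (k / d_model) for k in range(0, d_model, 2)]              # form used in the paper

for k, a, b in zip(range(0, d_model, 2), div_exp, div_pow):
    print(k, a, b)
    assert math.isclose(a, b, rel_tol=1e-9)
```

The frequencies shrink geometrically from 1 down to roughly $1/10000$, so low dimensions oscillate quickly with position and high dimensions slowly.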

In [3]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super(PositionalEncoding, self).__init__()

        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term) # 0 to end, step size = 2
        pe[:, 1::2] = torch.cos(position * div_term) # 1 to end, step size = 2

        self.register_buffer('pe', pe.unsqueeze(0)) # buffer, not a parameter: never trained, but moves with .to(device) and is saved/loaded with state_dict

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]


- [nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html): base class for all neural network modules. Every model should subclass it:

```
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        ...
    def forward(self, x):
        ...
        # return output
```
- Create an instance with `model = Model()`.
- `model(x)` gives the output. It runs `forward` through `nn.Module.__call__`, so there is no need to call `model.forward(x)` directly.
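
A small sketch of the difference between `model(x)` and `model.forward(x)`. `Doubler` is a hypothetical toy module made up for this example; the forward hook is only there to make the difference visible.

```python
import torch
import torch.nn as nn

class Doubler(nn.Module):  # hypothetical toy module, not part of the Transformer
    def forward(self, x):
        return 2 * x

model = Doubler()
x = torch.tensor([1.0, 2.0])

calls = []
# Forward hooks are triggered by nn.Module.__call__, not by forward() itself.
handle = model.register_forward_hook(lambda module, inputs, output: calls.append("hook"))

y_call = model(x)            # __call__ -> forward + hooks
y_direct = model.forward(x)  # same numbers, but hooks are skipped
handle.remove()

print(y_call, y_direct, calls)
```

Both calls return the same tensor, but only `model(x)` fires the hook, which is one reason to prefer it.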

# MultiHeadAttention
**Intuition**: Each head captures different relationships between words.

_NB_: The following code and explanation implement multi-head attention slightly differently from the original paper. $Q,K,V \in \mathbb{R}^{seq \times d_{model}}$

Original paper:
-  $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}, \quad W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}, \quad W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}, \quad \text{and} \quad W^O \in \mathbb{R}^{h \cdot d_v \times d_{\text{model}}}$
- Each head $i$ has its own set of weight matrices.

Code below:
- $W^Q \in \mathbb{R}^{d_{\text{model}} \times d_{model} }, \quad W^K \in \mathbb{R}^{d_{\text{model}} \times d_{model} }, \quad W^V \in \mathbb{R}^{d_{\text{model}} \times d_{model} }, \quad W^O \in \mathbb{R}^{h \cdot d_v \times d_{\text{model}}}$, with $h \cdot d_v = d_{model}$
- Only one big weight matrix each for $Q$, $K$ and $V$ (no per-head index $i$).
- The heads are split after the linear transformation.
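
The two formulations give the same numbers: slicing the output of one big projection into heads is the same as projecting with a per-head slice of the big weight matrix. A minimal sketch, assuming small illustrative sizes and `bias=False` to match the bias-free matrices in the formulas:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, d_model, num_heads = 4, 8, 2   # illustrative sizes (assumption)
d_k = d_model // num_heads

x = torch.randn(seq_len, d_model)
W_q = nn.Linear(d_model, d_model, bias=False)

# Route used by the code: project once, then cut the columns into heads.
q_heads = W_q(x).view(seq_len, num_heads, d_k)   # head h = columns [h*d_k, (h+1)*d_k)

# Route described in the paper: head h has its own (d_model x d_k) matrix.
# nn.Linear stores its weight as (out_features, in_features), hence the transpose.
per_head = torch.stack(
    [x @ W_q.weight[h * d_k:(h + 1) * d_k].T for h in range(num_heads)], dim=1
)

print(torch.allclose(per_head, q_heads, atol=1e-6))
```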

![image](https://github.com/guyuxuan9/UROP_robotic_arm/assets/58468284/5e12311f-dee3-4d09-9dce-90e81c93458c)

## Linear Layer for $Q$, $K$ and $V$

$Q_{original} = \begin{bmatrix}
    q_{1,1} & q_{1,2} & \dots & q_{1,d_{model}} \\
    q_{2,1} & q_{2,2} & \dots & q_{2,d_{model}} \\
    \vdots & \vdots & \ddots & \vdots \\
    q_{\text{seq},1} & q_{\text{seq},2} & \dots & q_{\text{seq},d_{model}} \\
\end{bmatrix}$,    $K_{original} = \begin{bmatrix}
    k_{1,1} & k_{1,2} & \dots & k_{1,d_{model}} \\
    k_{2,1} & k_{2,2} & \dots & k_{2,d_{model}} \\
    \vdots & \vdots & \ddots & \vdots \\
    k_{\text{seq},1} & k_{\text{seq},2} & \dots & k_{\text{seq},d_{model}} \\
\end{bmatrix}$

The weight matrices $W_Q$ and $W_K$ hold the learnable parameters. Each one is implemented as an _nn.Linear_ layer (which also adds a bias term) and has dimension ($d_{model}, d_{model}$).

$W_Q = \begin{bmatrix}
\vdots & \vdots & \dots & \vdots \\
w_1^Q & w_2^Q & \dots & w_{d_{model}}^Q \\
\vdots & \vdots & \dots & \vdots
\end{bmatrix}$ $W_K = \begin{bmatrix}
\vdots & \vdots & \dots & \vdots \\
w_1^K & w_2^K & \dots & w_{d_{model}}^K \\
\vdots & \vdots & \dots & \vdots
\end{bmatrix}$

By multiplying the original embeddings with the learnable weights, the network can learn more patterns, which gives it more expressive power than attention computed directly on the raw embeddings. Each entry of the product is a row-times-column sum (not an element-wise product): $q_{i,j}' = \sum_{m=1}^{d_{model}} q_{i,m}w_{m,j}^Q$, where $w_{m,j}^Q$ is the $m$-th entry of column $w_j^Q$.

$Q = Q_{original}W_Q = \begin{bmatrix}
    \sum_m q_{1,m}w_{m,1}^Q & \sum_m q_{1,m}w_{m,2}^Q & \dots & \sum_m q_{1,m}w_{m,d_{model}}^Q \\
    \sum_m q_{2,m}w_{m,1}^Q & \sum_m q_{2,m}w_{m,2}^Q & \dots & \sum_m q_{2,m}w_{m,d_{model}}^Q \\
    \vdots & \vdots & \ddots & \vdots \\
    \sum_m q_{\text{seq},m}w_{m,1}^Q & \sum_m q_{\text{seq},m}w_{m,2}^Q & \dots & \sum_m q_{\text{seq},m}w_{m,d_{model}}^Q \\
\end{bmatrix} = \begin{bmatrix}
    q_{1,1}' & q_{1,2}' & \dots & q_{1,d_{model}}' \\
    q_{2,1}' & q_{2,2}' & \dots & q_{2,d_{model}}' \\
    \vdots & \vdots & \ddots & \vdots \\
    q_{\text{seq},1}' & q_{\text{seq},2}' & \dots & q_{\text{seq},d_{model}}' \\
\end{bmatrix}$,

$K = K_{original}W_K = \begin{bmatrix}
    k_{1,1}' & k_{1,2}' & \dots & k_{1,d_{model}}' \\
    k_{2,1}' & k_{2,2}' & \dots & k_{2,d_{model}}' \\
    \vdots & \vdots & \ddots & \vdots \\
    k_{\text{seq},1}' & k_{\text{seq},2}' & \dots & k_{\text{seq},d_{model}}' \\
\end{bmatrix}$
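
To connect the matrices to the code, here is a sketch of what `nn.Linear` computes. The sizes are illustrative assumptions. Note that `nn.Linear` stores its weight transposed relative to $W_Q$ and adds a bias by default, which the matrices in this section leave out.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, d_model = 3, 4                   # illustrative sizes (assumption)
Q_original = torch.randn(seq_len, d_model)

linear = nn.Linear(d_model, d_model)      # weight: (d_model, d_model), bias: (d_model,)

with torch.no_grad():
    Q = linear(Q_original)
    W_Q = linear.weight.T                 # W_Q in the notation of this section
    Q_manual = Q_original @ W_Q + linear.bias

    # One entry written out as a row-times-column sum: row 0 of Q_original, column 1 of W_Q.
    q_01 = sum(Q_original[0, m] * W_Q[m, 1] for m in range(d_model)) + linear.bias[1]

print(torch.allclose(Q, Q_manual, atol=1e-6), float(q_01), float(Q[0, 1]))
```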





## Split heads
(batch size, seq length, $d_{model}$) --> (batch size, # heads, seq length, $d_k$), with $d_k = d_{model} / h$ for $h$ heads (the matrices below write $k$ for $d_k$)

$\begin{bmatrix}
    q_{1,1}' & q_{1,2}' & \dots & q_{1,d_{model}}' \\
    q_{2,1}' & q_{2,2}' & \dots & q_{2,d_{model}}' \\
    \vdots & \vdots & \ddots & \vdots \\
    q_{\text{seq},1}' & q_{\text{seq},2}' & \dots & q_{\text{seq},d_{model}}' \\
\end{bmatrix}$ --> $\begin{bmatrix}
    q_{1,1}' & \dots & q_{1,k}'  \\
    q_{2,1}' & \dots &q_{2,k}'\\
    \vdots & \ddots & \vdots   \\
    q_{\text{seq},1}' & \dots & q_{\text{seq},k}' \\
\end{bmatrix}$ $\begin{bmatrix}
    q_{1,k+1}' & \dots & q_{1,2k}'  \\
    q_{2,k+1}' & \dots &q_{2,2k}'\\
    \vdots & \ddots & \vdots   \\
    q_{\text{seq},k+1}' & \dots & q_{\text{seq},2k}' \\
\end{bmatrix}$ ...
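
The same split on a tiny tensor filled with consecutive integers, so it is easy to see which columns end up in which head. The sizes are illustrative assumptions; the `view` + `transpose` pattern is the one used in `split_heads`.

```python
import torch

batch_size, seq_len, d_model, num_heads = 1, 3, 6, 2   # illustrative sizes (assumption)
d_k = d_model // num_heads

# Row t holds the values 6*t ... 6*t + 5.
x = torch.arange(batch_size * seq_len * d_model).view(batch_size, seq_len, d_model)

heads = x.view(batch_size, seq_len, num_heads, d_k).transpose(1, 2)

print(heads.shape)   # torch.Size([1, 2, 3, 3])
print(heads[0, 0])   # head 0: columns 0..2 of every row
print(heads[0, 1])   # head 1: columns 3..5 of every row
```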



## Attention calculation

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_{k}}}\right) V$
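- **Q**: Why scale by $\sqrt{d_k}$? **A**: The dot product is $q \cdot k = \sum_{i=1}^{d_k} q_ik_i$. Assume the components of $q$ and $k$ are independent with zero mean and unit variance. Then $E\{q \cdot k\} = 0$ and $Var(q \cdot k) = d_k$, so dividing by $\sqrt{d_k}$ brings the variance back to 1. Without scaling, large dot products push softmax into its saturation region --> vanishing gradients
- The attention scores $QK^T/\sqrt{d_k}$ have shape (batch size, # heads, seq length, seq length); the attention output has shape (batch size, # heads, seq length,  $d_k$ )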
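
The variance claim can be checked with a small simulation. This is a rough sketch under the same assumption as above (independent standard-normal components); `dot_product_variance` is a helper name invented for this example, and the printed numbers are estimates that vary with the seed.

```python
import random
import statistics

random.seed(0)

def dot_product_variance(d_k, n_samples=4000):
    """Empirical variance of q.k for i.i.d. standard-normal q and k of length d_k."""
    dots = [
        sum(random.gauss(0, 1) * random.gauss(0, 1) for _ in range(d_k))
        for _ in range(n_samples)
    ]
    return statistics.pvariance(dots)

for d_k in (4, 16, 64):
    raw = dot_product_variance(d_k)
    # Dividing the dot product by sqrt(d_k) divides its variance by d_k.
    print(d_k, round(raw, 1), round(raw / d_k, 2))
```

The raw variance grows roughly in proportion to $d_k$, while the scaled one stays near 1.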

## Combine heads
(batch size, # heads, seq length, $d_k$) --> (batch size, seq length, $d_{model}$)
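
A sketch of the combine step on a stand-in attention output, showing why `.contiguous()` is needed before `.view`. The sizes are illustrative assumptions, and `heads` is random data rather than a real attention result.

```python
import torch

torch.manual_seed(0)
batch_size, num_heads, seq_len, d_k = 2, 4, 5, 2   # illustrative sizes (assumption)
d_model = num_heads * d_k

heads = torch.randn(batch_size, num_heads, seq_len, d_k)   # stand-in for the attention output

swapped = heads.transpose(1, 2)    # (batch, seq, heads, d_k); only the strides change
print(swapped.is_contiguous())     # False

try:
    swapped.view(batch_size, seq_len, d_model)   # view cannot merge the last two dims here
    view_failed = False
except RuntimeError:
    view_failed = True

combined = swapped.contiguous().view(batch_size, seq_len, d_model)

# Head h ends up in columns [h*d_k, (h+1)*d_k) of the combined tensor.
print(view_failed, torch.equal(combined[:, :, d_k:2 * d_k], heads[:, 1]))
```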

In [4]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads" # d_k = d_v = d_model/num_heads
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # Q, K, V: (batch size, # heads, seq length, d_k)
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k) # (batch size, # heads, seq length, seq length)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9) # masked positions get ~0 weight after softmax
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, V) # (batch size, # heads, seq length, d_k)
        return output

    def split_heads(self, x):
        batch_size, seq_length, d_model = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)

    def combine_heads(self, x):
        batch_size, _, seq_length, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)

    def forward(self, Q, K, V, mask=None):
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))

        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)
        output = self.W_o(self.combine_heads(attn_output))
        return output

# Position-wise Feed-Forward Networks
![image](https://github.com/guyuxuan9/UROP_robotic_arm/assets/58468284/4c2f54fb-27d0-4e19-a99f-0562123240d7)


In [5]:
class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionWiseFeedForward, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))
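
"Position-wise" means the same two-layer network is applied to every position independently. A minimal sketch of that property, using an `nn.Sequential` stand-in with the same layer structure and small illustrative sizes (assumptions):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff, seq_len = 8, 32, 5   # illustrative sizes (assumption)

# Same structure as the class above: Linear -> ReLU -> Linear.
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
x = torch.randn(1, seq_len, d_model)

with torch.no_grad():
    full = ffn(x)   # whole sequence in one call
    # One position at a time, then stacked back along the sequence axis.
    per_position = torch.stack([ffn(x[:, t]) for t in range(seq_len)], dim=1)

print(torch.allclose(full, per_position, atol=1e-6))
```

Since no information flows between positions here, mixing across the sequence happens only in the attention sub-layers.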

# Encoder Layer

![image](https://github.com/guyuxuan9/Transformer-from-scratch/assets/58468284/28b7d53f-2d51-40db-9853-76a6ce400e7b)

**Batch Normalisation vs. Layer Normalisation** (Why layer norm?)

![image](https://github.com/guyuxuan9/Transformer-from-scratch/assets/58468284/5099cb58-16cf-48dd-be40-b79fc263c56e)

- Layer Norm: normalise across all features of each input word (token)
- Batch Norm: normalise each feature across the batch

The problem with Batch Norm in NLP is that input sentences can have different lengths, as the figure below indicates.

![image](https://github.com/guyuxuan9/Transformer-from-scratch/assets/58468284/c6cd143c-bede-4a43-8c1a-0fb7781cf86b)

If a sequence is shorter than the max length, zero padding is used to fill the empty positions. However, the added zeros change the mean and variance computed over the batch (the batch statistics). Layer norm avoids this because each token is normalised on its own, so the Transformer uses layer norm rather than batch norm.
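
A small sketch of the padding effect. It compares `nn.LayerNorm` with the per-feature mean that a batch-style normalisation would compute; the sentence is random data and the sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 4                                   # illustrative size (assumption)
layer_norm = nn.LayerNorm(d_model)

short = torch.randn(1, 2, d_model)            # a 2-token "sentence"
padded = torch.cat([short, torch.zeros(1, 3, d_model)], dim=1)   # same sentence, zero-padded to 5 tokens

with torch.no_grad():
    out_short = layer_norm(short)
    out_padded = layer_norm(padded)

# Layer norm: every token is normalised over its own features, so padding changes nothing.
same_under_layer_norm = torch.allclose(out_short, out_padded[:, :2], atol=1e-6)

# Batch-style statistics: the per-feature mean over all positions shrinks once zeros are added.
mean_without_pad = short.mean(dim=(0, 1))
mean_with_pad = padded.mean(dim=(0, 1))

print(same_under_layer_norm)
print(mean_without_pad, mean_with_pad)
```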

In [6]:
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x

# Decoder Layer

![image](https://github.com/guyuxuan9/Transformer-from-scratch/assets/58468284/89dd4b3e-cdf7-4eb6-be9e-69d7af4a826f)

In [7]:
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask, tgt_mask):
        attn_output = self.self_attn(x, x, x, tgt_mask) # Q,K,V
        x = self.norm1(x + self.dropout(attn_output))
        attn_output = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        return x

# Transformer

![image](https://github.com/guyuxuan9/Transformer-from-scratch/assets/58468284/0ec7bb55-5685-45ab-b87a-dbd269e7f3bf)

- **Nx** in the encoder and decoder means the layer is stacked N times. The output of each layer is fed to the input of the next one.
- **Purpose of the mask**: when predicting an output token we don't want to look ahead, i.e. position $t$ may only attend to positions $\le t$. A separate padding mask hides the padding tokens (index 0).

![image](https://github.com/guyuxuan9/Transformer-from-scratch/assets/58468284/20259fc2-ccf9-4104-bda0-5be29a9a96e3)
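
A minimal sketch of the look-ahead mask on dummy scores. `torch.tril` is used here as an equivalent way to build the lower-triangular pattern, the scores are all zeros for illustration, and the `-1e9` fill value matches the attention code in this document.

```python
import torch

seq_len = 4   # illustrative length (assumption)

# Row t is True for positions 0..t, i.e. what token t is allowed to see.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

scores = torch.zeros(seq_len, seq_len)                  # dummy attention scores, all equal
masked = scores.masked_fill(causal_mask == 0, -1e9)     # future positions get a huge negative score
weights = torch.softmax(masked, dim=-1)

print(causal_mask)
print(weights)   # row t is uniform over positions 0..t and zero afterwards
```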


In [17]:
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout):
        super(Transformer, self).__init__()
        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)

        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.decoder_layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

        self.fc = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def generate_mask(self, src, tgt):
        src_mask = (src != 0).unsqueeze(1).unsqueeze(2) # (batch, 1, 1, src_len): True for non-padding tokens
        tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(3) # (batch, 1, tgt_len, 1)
        seq_length = tgt.size(1)
        nopeak_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length, device=tgt.device), diagonal=1)).bool() # 1 - upper triangular ones --> lower triangular ones; built on the same device as tgt
        tgt_mask = tgt_mask & nopeak_mask # (batch, 1, tgt_len, tgt_len)
        return src_mask, tgt_mask

    def forward(self, src, tgt):
        src_mask, tgt_mask = self.generate_mask(src, tgt)
        src_embedded = self.dropout(self.positional_encoding(self.encoder_embedding(src)))
        tgt_embedded = self.dropout(self.positional_encoding(self.decoder_embedding(tgt)))

        enc_output = src_embedded
        for enc_layer in self.encoder_layers:
            enc_output = enc_layer(enc_output, src_mask)

        dec_output = tgt_embedded
        for dec_layer in self.decoder_layers:
            dec_output = dec_layer(dec_output, enc_output, src_mask, tgt_mask)

        output = self.fc(dec_output) # raw logits; nn.CrossEntropyLoss applies log-softmax itself, so no softmax here
        return output

# Training & Testing

In [9]:
src_vocab_size = 5000
tgt_vocab_size = 5000
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
max_seq_length = 100
dropout = 0.1

transformer = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout)

# Generate random sample data
src_data = torch.randint(1, src_vocab_size, (64, max_seq_length))  # (batch_size=64, seq_length)
tgt_data = torch.randint(1, tgt_vocab_size, (64, max_seq_length))  # (batch_size=64, seq_length)

In [10]:
src_data

tensor([[1133, 3281, 1928,  ..., 3837, 3115,  781],
        [4246,  473, 2231,  ...,  596, 4612, 3928],
        [2661, 2564, 4253,  ..., 2675, 1316, 2385],
        ...,
        [2023, 4028, 4269,  ..., 2424,  640, 2002],
        [4030,  749, 2769,  ..., 3078, 1955,  867],
        [3401, 1888,  822,  ..., 3024,  379, 3776]])

In [18]:
criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

transformer.train()

for epoch in range(2):
    optimizer.zero_grad()
    output = transformer(src_data, tgt_data[:, :-1])
    loss = criterion(output.contiguous().view(-1, tgt_vocab_size), tgt_data[:, 1:].contiguous().view(-1))
    loss.backward()
    optimizer.step()
    print(f"Epoch: {epoch+1}, Loss: {loss.item()}")

Epoch: 1, Loss: 8.28443717956543
Epoch: 2, Loss: 8.30500316619873
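
A sketch of what `nn.CrossEntropyLoss(ignore_index=0)` computes, using random stand-in logits instead of the model (all sizes and values are illustrative assumptions). It expects raw logits; passing probabilities applies softmax a second time.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 10                                   # illustrative vocabulary (assumption)
logits = torch.randn(6, vocab_size) * 5           # stand-in for model outputs, one row per token
targets = torch.tensor([3, 1, 0, 7, 0, 2])        # 0 marks padding, as in this document

criterion = nn.CrossEntropyLoss(ignore_index=0)
loss_from_logits = criterion(logits, targets)

# Manual version: log-softmax, pick the target's log-probability,
# average over the non-padding positions only.
log_probs = F.log_softmax(logits, dim=-1)
keep = targets != 0
manual = -log_probs[keep].gather(1, targets[keep].unsqueeze(1)).mean()

# Probabilities instead of logits: the inputs all lie in [0, 1], so the loss
# is confined to a narrow band around log(vocab_size).
loss_from_probs = criterion(F.softmax(logits, dim=-1), targets)

print(float(loss_from_logits), float(manual), float(loss_from_probs))
```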


In [19]:
src = torch.tensor([[0, 2, 5, 6, 4, 3, 9, 5, 2, 9, 10, 1]])
trg = torch.tensor([[0]])
print(src.shape,trg.shape)
out = transformer(src, trg)
print(out, out.shape)

torch.Size([1, 12]) torch.Size([1, 1])
tensor([[[-0.1130, -0.2085, -0.2666,  ...,  0.2334,  0.9185, -0.3312]]],
       grad_fn=<ViewBackward0>) torch.Size([1, 1, 5000])
