In [1]:
import numpy as np

In [2]:
import torch
import torch.nn as nn
import math
from torch.autograd import Variable

# Embedding 

In [11]:
class Embedder(nn.Module):
    def __init__(self, vocab_size:int, emb_dim:int):
        super().__init__()
        self.emb_dim = emb_dim 
        self.embed = nn.Embedding(vocab_size, emb_dim)

    def forward(self, x: torch.Tensor):
        return self.embed(x)

# Why use `register_buffer`

- Ans [link](https://discuss.pytorch.org/t/what-is-the-difference-between-register-buffer-and-register-parameter-of-nn-module/32723/11)

An example where I find this distinction difficult is in 
the context of fixed positional encodings in the Transformer 
model. Typically I see implementations where the fixed positional 
encodings are registered as buffers but I’d consider these tensors 
as non-learnable parameters (that should show up in the list of 
model parameters), especially when comparing between methods 
that don’t rely on such injection of fixed tensors.

So in general:
- buffers = `fixed tensors / non-learnable parameters / stuff that does not require gradient`
- parameters = `learnable parameters, requires gradient`

Piotr Bialecki

If you have parameters in your model that should be saved and restored in the `state_dict` but not trained by the optimizer, you should register them as buffers. Buffers won’t be returned by `model.parameters()`, so the optimizer won’t have a chance to update them.

Both approaches work the same regarding training etc.
There are some differences in the function calls however. Using register_parameter you have to pass the name as a string, which can make the creation of a range of parameters convenient. Besides that I think it’s just coding style which one you prefer.

If your `self.some_params` are `nn.Parameter` objects, then you don’t have to worry about this. If they’re tensors, then they won’t be in the `state_dict` (unless registered as buffer).

> simple `torch.tensor` will not be available under `state_dict`

One reason to register the tensor as a buffer is to be able to serialize the model and restore all internal states.
Another is that all buffers and parameters are pushed to the device when the call is made on the parent model:

```python
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.my_tensor = torch.randn(1)                     # plain attribute: not saved, not moved
        self.register_buffer('my_buffer', torch.randn(1))   # saved and moved, but not trained
        self.my_param = nn.Parameter(torch.randn(1))        # saved, moved and trained

    def forward(self, x):
        return x

model = MyModel()
print(model.my_tensor)
# > tensor([0.9329])
print(model.state_dict())
# > OrderedDict([('my_param', tensor([-0.2471])), ('my_buffer', tensor([1.2112]))])

model.cuda()  # requires a CUDA-capable device
print(model.my_tensor)
# > tensor([0.9329])
print(model.state_dict())
# > OrderedDict([('my_param', tensor([-0.2471], device='cuda:0')), ('my_buffer', tensor([1.2112], device='cuda:0'))])
```

As you can see, `model.my_tensor` is still on the CPU, where it was created, while all parameters and buffers were pushed to the GPU after calling `model.cuda()`.
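The quoted answer also says that buffers are not returned by `model.parameters()`, so an optimizer never updates them. The following is a minimal sketch of how that could be checked on the CPU; the class name `TinyModel` and the toy loss are assumptions made for illustration and are not part of the forum thread.

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):  # hypothetical model, for illustration only
    def __init__(self):
        super().__init__()
        self.register_buffer('my_buffer', torch.randn(1))
        self.my_param = nn.Parameter(torch.randn(1))

model = TinyModel()

# Parameters and buffers are tracked in two separate collections
param_names = [name for name, _ in model.named_parameters()]
buffer_names = [name for name, _ in model.named_buffers()]
print(param_names)   # ['my_param']
print(buffer_names)  # ['my_buffer']

# An optimizer built from model.parameters() only ever sees 'my_param'
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
before = model.my_buffer.clone()
loss = (model.my_param * model.my_buffer).sum()  # toy loss
loss.backward()
optimizer.step()
print(torch.equal(before, model.my_buffer))  # True: the buffer is unchanged
```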

<center>
    <img src="https://miro.medium.com/max/566/1*B-VR6R5vJl3Y7jbMNf5Fpw.png" width="400">
</center>

# Make embedding relatively larger by scaling the values. WHY?

The reason we scale up the embedding values before 
the addition is to make the positional encoding relatively 
smaller. This way the original meaning in the embedding 
vector is not lost when the two are added together.

```python
x = x*math.sqrt(self.emb_dim)
```
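To get a feel for the size of this effect, here is a small sketch. It assumes the embedding entries are drawn from a standard normal distribution (the default initialisation of `nn.Embedding`) and uses a random tensor as a stand-in for real word embeddings. Sinusoidal positional encodings are bounded in $[-1, 1]$, while the scaling multiplies every embedding entry by $\sqrt{512} \approx 22.6$.

```python
import math
import torch

torch.manual_seed(0)
emb_dim = 512

# Stand-in for the output of an embedding layer: 5 tokens, emb_dim features
x = torch.randn(5, emb_dim)
scaled = x * math.sqrt(emb_dim)

# Average magnitude of an entry before and after scaling;
# positional-encoding entries never exceed 1 in magnitude
print(x.abs().mean())
print(scaled.abs().mean())
print(math.sqrt(emb_dim))  # the scaling factor
```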

In [162]:
class PositionalEmbedding(nn.Module):
    def __init__(self, emb_dim:int, max_seq_len:int = 200, dropout_pct:float = 0.1):
        super().__init__()

        self.emb_dim = emb_dim
        self.dropout = nn.Dropout(dropout_pct)

        # create constant 'pe' matrix with values dependent on 
        # word position 'pos' and embedding position 'i'
        pe = torch.zeros(max_seq_len, emb_dim)

        for pos in range(max_seq_len):
            for i in range(0, emb_dim, 2):
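                # i already walks the even indices (0, 2, 4, ...), so it plays the
                # role of "2i" in PE(pos, 2i) = sin(pos / 10000^(2i / d_model));
                # the paired odd index i+1 uses cos with the same frequency
                pe[pos,i] = math.sin(pos/(10000**(i/emb_dim)))
                pe[pos,i+1] = math.cos(pos/(10000**(i/emb_dim)))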

        # print(pe.size())
        # pe = pe.unsqueeze(0)
        # print(pe.size())

        self.register_buffer('pe', pe)


    def forward(self, x):
        
        # scale values
        x = x*math.sqrt(self.emb_dim)  

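        # sequence length is the second-to-last dimension, so this works for
        # both (seq_len, emb_dim) and (batch, seq_len, emb_dim) inputs
        seq_len = x.size(-2)
        
        # add constant positional embedding to the word embedding;
        # 'pe' is a registered buffer, so it needs no gradient and already
        # follows the module when model.cuda() or model.to(device) is called
        pe = self.pe[:seq_len,:]
        
        x = x + pe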
        return self.dropout(x)
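The nested Python loops in the constructor fill the table one entry at a time. A vectorised construction is a common alternative. The sketch below assumes the standard sinusoidal formula from the Transformer paper (sine on even indices, cosine on odd indices, base 10000) and an even `emb_dim`. The helper name `sinusoidal_table` is hypothetical and not part of this notebook.

```python
import math
import torch

def sinusoidal_table(max_seq_len: int, emb_dim: int) -> torch.Tensor:
    """Build a (max_seq_len, emb_dim) table of sinusoidal position encodings.

    Assumes emb_dim is even.
    """
    # Column vector of positions: shape (max_seq_len, 1)
    position = torch.arange(max_seq_len, dtype=torch.float32).unsqueeze(1)
    # 1 / 10000^(i / emb_dim) for the even indices i = 0, 2, 4, ...
    div_term = torch.exp(
        torch.arange(0, emb_dim, 2, dtype=torch.float32)
        * (-math.log(10000.0) / emb_dim)
    )
    pe = torch.zeros(max_seq_len, emb_dim)
    pe[:, 0::2] = torch.sin(position * div_term)  # even feature indices
    pe[:, 1::2] = torch.cos(position * div_term)  # odd feature indices
    return pe

pe = sinusoidal_table(200, 512)
print(pe.shape)  # torch.Size([200, 512])
```

The result can be passed to `register_buffer` in the same way as the loop-built table.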

## Test module

In [151]:
VOCAB_SIZE = 20000
EMB_DIM = 512

In [152]:
e = Embedder(VOCAB_SIZE, EMB_DIM)

In [153]:
idx = torch.tensor([1,2,3,4,5]) # torch.randint(3, 5, (3,))

In [154]:
idx

tensor([1, 2, 3, 4, 5])

In [155]:
emb = e(idx)

In [156]:
emb.size()

torch.Size([5, 512])

In [157]:
p = PositionalEmbedding(emb_dim=EMB_DIM)

In [158]:
p.pe.size()

torch.Size([200, 512])

In [159]:
p.pe.squeeze().size()

torch.Size([200, 512])

In [160]:
p_emb = p(emb)

In [161]:
p_emb.size()

torch.Size([5, 512])

# Layer Normalization

Normalisation is highly important in deep neural networks. It keeps the range of values in each layer from changing too much, which helps the model train faster and generalise better.

<center>
    <img src="https://miro.medium.com/max/511/1*4w3sQ14caDRkrQsAeK5Flw.png" width="400">
</center>

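We will be normalising our results between each layer in the encoder/decoder, so before building our model let’s define that module. Note that the `Norm` class normalises each token over its last (feature) dimension, which is layer normalisation rather than batch normalisation; the blog post linked here covers the general motivation: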

- [blog](https://kharshit.github.io/blog/2018/12/28/why-batch-normalization)

<center>
    <img src="https://kharshit.github.io/img/batch_normalization.png" width="400">
</center>

In [176]:
class Norm(nn.Module):
    def __init__(self, d_model, eps = 1e-6):
        super().__init__()
    
        self.size = d_model
        
        # create two learnable parameters to calibrate normalisation
        self.alpha = nn.Parameter(torch.ones(self.size))
        self.bias = nn.Parameter(torch.zeros(self.size))
        
        self.eps = eps
    
    def forward(self, x):
        
        x_mean = x.mean(dim=-1, keepdim=True)
        # standard deviation (not variance) over the feature dimension
        x_std = x.std(dim=-1, keepdim=True)
        
        normalized_x = (x - x_mean) / (x_std + self.eps)
        
        # scale and shift
        y = self.alpha * normalized_x + self.bias
        return y
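As a sanity check on the arithmetic in `forward`, the sketch below repeats the same normalisation with plain tensor operations (taking `alpha = 1` and `bias = 0`) and compares it with PyTorch's built-in `F.layer_norm`. The two are not expected to match exactly: `Tensor.std` uses the unbiased (Bessel-corrected) estimate by default, while the built-in layer norm uses the biased variance with `eps` inside the square root. The input shape and the seed are arbitrary choices for illustration.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model = 5
x = torch.rand(2, 10, d_model)
eps = 1e-6

# Same arithmetic as Norm.forward with alpha = 1 and bias = 0
mean = x.mean(dim=-1, keepdim=True)
std = x.std(dim=-1, keepdim=True)  # unbiased estimate by default
manual = (x - mean) / (std + eps)

# Built-in layer normalisation over the last dimension (no affine parameters)
builtin = F.layer_norm(x, normalized_shape=(d_model,), eps=eps)

# Each token now has (approximately) zero mean and unit standard deviation
print(manual.mean(dim=-1).abs().max())
print(manual.std(dim=-1))

# The two versions differ by a constant factor of about sqrt(d / (d - 1))
ratio = (builtin.norm() / manual.norm()).item()
print(ratio, math.sqrt(d_model / (d_model - 1)))
```

With a realistic `d_model` such as 512 the factor is very close to 1, so the difference matters little in practice.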

## Test module

In [178]:
d_model = 5
bs = 2
seq_len = 10
x = torch.rand(size=(bs,seq_len, d_model))
x

tensor([[[0.6995, 0.8797, 0.3724, 0.2287, 0.4503],
         [0.9911, 0.6278, 0.0657, 0.3674, 0.2664],
         [0.9776, 0.4216, 0.2456, 0.8553, 0.1205],
         [0.3701, 0.1670, 0.7497, 0.4190, 0.3809],
         [0.2739, 0.2522, 0.5973, 0.6230, 0.7718],
         [0.8050, 0.0362, 0.4502, 0.5093, 0.9923],
         [0.9708, 0.5319, 0.8902, 0.8937, 0.0268],
         [0.6310, 0.3133, 0.3671, 0.5056, 0.5416],
         [0.9670, 0.6497, 0.8942, 0.6173, 0.3297],
         [0.8384, 0.7972, 0.8147, 0.2408, 0.3032]],

        [[0.1791, 0.7929, 0.8694, 0.4868, 0.7919],
         [0.2966, 0.0549, 0.5732, 0.8265, 0.9248],
         [0.5742, 0.8801, 0.1025, 0.9999, 0.7294],
         [0.4161, 0.3759, 0.0544, 0.9366, 0.2160],
         [0.5113, 0.1617, 0.3394, 0.9271, 0.2979],
         [0.5145, 0.8691, 0.2940, 0.0353, 0.1214],
         [0.4028, 0.7736, 0.6155, 0.2101, 0.9731],
         [0.8073, 0.2950, 0.9592, 0.8006, 0.8004],
         [0.7588, 0.8528, 0.7546, 0.1169, 0.4026],
         [0.1103, 0.0015, 0.7

In [179]:
x.size()

torch.Size([2, 10, 5])

In [180]:
n = Norm(d_model=d_model)

In [181]:
n(x)

tensor([[[ 0.6639,  1.3536, -0.5886, -1.1386, -0.2903],
         [ 1.4746,  0.4588, -1.1127, -0.2691, -0.5517],
         [ 1.2053, -0.2724, -0.7403,  0.8801, -1.0726],
         [-0.2249, -1.1910,  1.5814,  0.0078, -0.1732],
         [-1.0005, -1.0948,  0.4079,  0.5196,  1.1678],
         [ 0.6731, -1.4272, -0.2962, -0.1347,  1.1849],
         [ 0.7816, -0.3317,  0.5771,  0.5862, -1.6132],
         [ 1.2266, -1.2201, -0.8055,  0.2608,  0.5382],
         [ 1.0904, -0.1659,  0.8023, -0.2940, -1.4329],
         [ 0.7996,  0.6621,  0.7206, -1.1953, -0.9870]],

        [[-1.5405,  0.5848,  0.8495, -0.4750,  0.5812],
         [-0.6582, -1.3249,  0.1048,  0.8035,  1.0747],
         [-0.2379,  0.6388, -1.5899,  0.9822,  0.2068],
         [ 0.0491, -0.0719, -1.0391,  1.6147, -0.5528],
         [ 0.2159, -0.9662, -0.3655,  1.6218, -0.5059],
         [ 0.4406,  1.4981, -0.2175, -0.9890, -0.7323],
         [-0.6405,  0.5949,  0.0683, -1.2824,  1.2596],
         [ 0.2946, -1.7238,  0.8934,  0.2685, 

# Attention

<center>
    <img src="https://miro.medium.com/max/445/1*evdACdTOBT5j1g1nXialBg.png" width="400">
</center>


<center>
    <img src="https://miro.medium.com/max/140/1*15E9qKg9bKnWdSRWCyY2iA.png" width="100">
</center>

- First we multiply $Q$ by the transpose of $K$. The result is then `scaled` by dividing it by the square root of $d_k$.
- A step that’s not shown in the equation is the `masking operation`. Before we perform `Softmax`, we apply our mask and set the scores to a large negative value wherever the input is padding (or, in the decoder, wherever the input is ahead of the current word), so those positions receive almost no weight.
- Another step not shown is `dropout`, which we apply after `Softmax`.
- Finally, the last step is a matrix multiplication between the result so far and $V$.

In [173]:
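import torch.nn.functional as F  # needed for F.softmax below

def attention(q, k, v, d_k, mask=None, dropout=None):
    
    # scaled dot-product scores: (bs, h, sl, d_k) x (bs, h, d_k, sl) -> (bs, h, sl, sl)
    scores = torch.matmul(q, k.transpose(-2, -1)) /  math.sqrt(d_k)
    
    if mask is not None:
        # add a head dimension so the mask broadcasts over all heads
        mask = mask.unsqueeze(1)
        # masked positions get a large negative score, i.e. ~0 weight after softmax
        scores = scores.masked_fill(mask == 0, -1e9)
    
    # normalise the scores over the key dimension
    scores = F.softmax(scores, dim=-1)
    
    if dropout is not None:
        scores = dropout(scores)
        
    # weighted sum of the values: (bs, h, sl, sl) x (bs, h, sl, d_k) -> (bs, h, sl, d_k)
    output = torch.matmul(scores, v)
    return output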
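To make the masking step concrete, here is a minimal sketch of the decoder-style case, where a position must not look ahead of itself. It assumes a lower-triangular mask built with `torch.tril` and small arbitrary sizes; the variable names are illustrative. Only the attention weights are computed, not the full function.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_k = 4, 8
q = torch.randn(1, seq_len, d_k)
k = torch.randn(1, seq_len, d_k)

# Lower-triangular mask: position i may attend only to positions <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)  # (1, 4, 4)

scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
scores = scores.masked_fill(causal_mask == 0, -1e9)
weights = F.softmax(scores, dim=-1)

# Each row sums to 1, and everything above the diagonal is (numerically) zero
print(weights[0])
print(weights[0].sum(dim=-1))
```

A padding mask works the same way, with zeros in the columns that correspond to padding tokens.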

# Multi-Headed Attention

Once we have our embedded values (with positional encodings) and our masks, we can start building the layers of our model.

Here is an overview of the multi-headed attention layer:


<center>
    <img src="https://miro.medium.com/max/523/1*1tsRtfaY9z6HxmERYhw8XQ.png" width="300">
</center>

- $V$, $K$ and $Q$ stand for `value`, `key` and `query`. These are terms used in attention functions.

- In the case of the **Encoder**, $V, K$ and $Q$ will simply be **identical copies** of the `emb_vector + pos_encoding`. 
- They will have the dimensions `batch_size * seq_len * d_model`.

<center>
    <img src="images/tensor_dimension.png" width="400">
</center>

- Drawing tool [Excalidraw](https://excalidraw.com/)

In [182]:
class MultiHeadAttention(nn.Module):
    def __init__(self, heads, d_model, dropout = 0.1):
        super().__init__()
        
        self.d_model = d_model
        self.d_k = d_model // heads
        self.h = heads
        
        self.q_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(d_model, d_model)
    
    def forward(self, q, k, v, mask=None):
        
        bs = q.size(0)
        
        # perform linear operation and split into h heads:
        # (bs, sl, d_model) -> (bs, sl, h, d_k)
        k = self.k_linear(k).view(bs, -1, self.h, self.d_k)
        q = self.q_linear(q).view(bs, -1, self.h, self.d_k)
        v = self.v_linear(v).view(bs, -1, self.h, self.d_k)
        
        # transpose to get dimensions bs * h * sl * d_k
        k = k.transpose(1,2)
        q = q.transpose(1,2)
        v = v.transpose(1,2)
        

        # calculate attention using the function defined above
        scores = attention(q, k, v, self.d_k, mask, self.dropout)
        # concatenate heads and put through final linear layer
        concat = scores.transpose(1,2).contiguous()\
        .view(bs, -1, self.d_model)
        output = self.out(concat)
    
        return output
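The reshaping in `forward` is easier to follow with concrete shapes. The sketch below traces a dummy tensor through the same `view` and `transpose` calls, leaving out the linear layers and the attention itself. The sizes (batch of 2, sequence length 10, `d_model` of 512, 8 heads) are assumptions chosen for illustration.

```python
import torch

bs, seq_len, d_model, heads = 2, 10, 512, 8
d_k = d_model // heads  # 64 features per head

x = torch.rand(bs, seq_len, d_model)

# Split the feature dimension into (heads, d_k)
split = x.view(bs, -1, heads, d_k)
print(split.shape)      # torch.Size([2, 10, 8, 64])

# Move the head dimension forward so each head sees a (seq_len, d_k) matrix
per_head = split.transpose(1, 2)
print(per_head.shape)   # torch.Size([2, 8, 10, 64])

# Undo the transpose and merge the heads back into d_model features
merged = per_head.transpose(1, 2).contiguous().view(bs, -1, d_model)
print(merged.shape)     # torch.Size([2, 10, 512])
```

The call to `contiguous()` is needed because `view` requires a tensor whose memory layout matches its shape, which is no longer the case after `transpose`.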