# <font color = 'blue'> Attention is All You Need - Small example forward loop only

<font size = 4, color ='green'>
    
The differences between the implementation in this notebook and the paper are:
- Learned positional encoding instead of a static one
- Standard Adam optimizer with a static learning rate instead of one with warm-up and cool-down steps
- Label Smoothing
- Weight sharing between embedding layer and final Linear layer before softmax
- BPE for tokenization
- Effcient Batching (make sure we have to do minimum padding)
- Multiple GPU training

# <font color = 'blue'> Import Libraries
As always, let's import all the required modules and set the random seeds for reproducability.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from torch.nn.utils.rnn import pad_sequence

import torchtext 
from torchtext.datasets import Multi30k
from torchtext.vocab import vocab
#from torchtext.legacy.data import Field, BucketIterator

import spacy
import numpy as np
from collections import Counter, OrderedDict

import random
import math
import time
import pandas as pd
from pathlib import Path

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

In [2]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [3]:
torchtext.__version__, torch.__version__, torch.cuda.is_available(), spacy.__version__

('0.11.0', '1.10.0', True, '3.2.4')

# <font color = 'blue'>  Preparing the Data



In [4]:
data_folder = Path('/home/harpreet/Insync/google_drive_shaannoor/Data/NLP')
project_folder = Path('/home/harpreet/Insync/google_drive_harpreet/Research/NLP/pytorch-seq2seq')

In [5]:
torchtext.__version__, torch.__version__, torch.cuda.is_available(), spacy.__version__

('0.11.0', '1.10.0', True, '3.2.4')

We'll then create our tokenizers as before.

## <font color = 'blue'> Load tokenized data

In [6]:
df_train = pd.read_pickle(project_folder/'df_train_en_de.pickel')

In [7]:
df_train

Unnamed: 0,source_tokens,target_tokens,source_tokens_reverse
0,"[zwei, junge, weiße, männer, sind, i, m, freie...","[two, young, ,, white, males, are, outside, ne...","[., büsche, vieler, nähe, der, in, freien, m, ..."
1,"[mehrere, männer, mit, schutzhelmen, bedienen,...","[several, men, in, hard, hats, are, operating,...","[., antriebsradsystem, ein, bedienen, schutzhe..."
2,"[ein, kleines, mädchen, klettert, in, ein, spi...","[a, little, girl, climbing, into, a, wooden, p...","[., holz, aus, spielhaus, ein, in, klettert, m..."
3,"[ein, mann, in, einem, blauen, hemd, steht, au...","[a, man, in, a, blue, shirt, is, standing, on,...","[., fenster, ein, putzt, und, leiter, einer, a..."
4,"[zwei, männer, stehen, am, herd, und, bereiten...","[two, men, are, at, the, stove, preparing, foo...","[., zu, essen, bereiten, und, herd, am, stehen..."
...,...,...,...
28995,"[., wand, verschnörkelten, einer, hinter, schr...","[a, woman, behind, a, scrolled, wall, is, writ...","[eine, frau, schreibt, hinter, einer, verschnö..."
28996,"[., kletterwand, einer, an, übt, bergsteiger, ...","[a, rock, climber, practices, on, a, rock, cli...","[ein, bergsteiger, übt, an, einer, kletterwand..."
28997,"[., hauses, einem, vor, straße, einer, auf, ar...","[two, male, construction, workers, are, workin...","[zwei, bauarbeiter, arbeiten, auf, einer, stra..."
28998,"[., fassade, einer, vor, wagen, einem, mit, ju...","[an, elderly, man, sits, outside, a, storefron...","[ein, älterer, mann, sitzt, mit, einem, jungen..."


## <font color = 'blue'> Small subset of data

In [8]:
df_train_small= df_train[0:4]

In [9]:
df_train_small

Unnamed: 0,source_tokens,target_tokens,source_tokens_reverse
0,"[zwei, junge, weiße, männer, sind, i, m, freie...","[two, young, ,, white, males, are, outside, ne...","[., büsche, vieler, nähe, der, in, freien, m, ..."
1,"[mehrere, männer, mit, schutzhelmen, bedienen,...","[several, men, in, hard, hats, are, operating,...","[., antriebsradsystem, ein, bedienen, schutzhe..."
2,"[ein, kleines, mädchen, klettert, in, ein, spi...","[a, little, girl, climbing, into, a, wooden, p...","[., holz, aus, spielhaus, ein, in, klettert, m..."
3,"[ein, mann, in, einem, blauen, hemd, steht, au...","[a, man, in, a, blue, shirt, is, standing, on,...","[., fenster, ein, putzt, und, leiter, einer, a..."


In [10]:
df_train_small= df_train_small.drop(columns=['source_tokens_reverse'])

In [11]:
df_train_small

Unnamed: 0,source_tokens,target_tokens
0,"[zwei, junge, weiße, männer, sind, i, m, freie...","[two, young, ,, white, males, are, outside, ne..."
1,"[mehrere, männer, mit, schutzhelmen, bedienen,...","[several, men, in, hard, hats, are, operating,..."
2,"[ein, kleines, mädchen, klettert, in, ein, spi...","[a, little, girl, climbing, into, a, wooden, p..."
3,"[ein, mann, in, einem, blauen, hemd, steht, au...","[a, man, in, a, blue, shirt, is, standing, on,..."


## <font color = 'blue'> Build Vocab

In [12]:
def create_vocab(text, min_freq, specials):
    my_counter = Counter()
    for line in text:
       my_counter.update(line)
    my_vocab = vocab(my_counter, min_freq=min_freq)
    for i, special in enumerate(specials):
        my_vocab.insert_token(special, i)
    my_vocab.set_default_index(0)
    return my_vocab

Create source vocab, We will add four special tokens - ```['<unk>', '<BOS>', '<EOS>', '<PAD>']```

### <font color = 'blue'> Source Vocab

In [13]:
source_vocab = create_vocab(df_train_small['source_tokens'], 1, ['<unk>', '<BOS>', '<EOS>', '<PAD>'])

In [14]:
len(source_vocab)

41

In [15]:
pd.DataFrame(source_vocab.get_stoi().items(), columns=['tokens', 'index']).sort_values(by = ['index'])[0:10]

Unnamed: 0,tokens,index
28,<unk>,0
34,<BOS>,1
23,<EOS>,2
19,<PAD>,3
17,zwei,4
27,junge,5
14,weiße,6
12,männer,7
16,sind,8
7,i,9


In [16]:
# check index of unknown word - it should be zero
source_vocab['abracdabra']

0

### <font color = 'blue'> Target Vocab

In [17]:
target_vocab = create_vocab(df_train_small['target_tokens'], 1, ['<unk>', '<BOS>', '<EOS>', '<PAD>'])

In [18]:
len(target_vocab)

40

## <font color = 'blue'> Create Dataset and Dataloader

In [19]:
class EngGerman(Dataset):
    def __init__(self, X1, X2):
        self.X1 = X1
        self.X2 = X2
        
    def __len__(self):
        return len(self.X1)
    
    def __getitem__(self, indices):
        return (self.X1.iloc[indices] , self.X2.iloc[indices]) 

In [20]:
trainset = EngGerman(df_train_small['source_tokens'], df_train['target_tokens'])

In [21]:
trainset[0]

(['zwei',
  'junge',
  'weiße',
  'männer',
  'sind',
  'i',
  'm',
  'freien',
  'in',
  'der',
  'nähe',
  'vieler',
  'büsche',
  '.'],
 ['two',
  'young',
  ',',
  'white',
  'males',
  'are',
  'outside',
  'near',
  'many',
  'bushes',
  '.'])

<font color = 'green'> **Function to replace words woth their index. Also add tokens BOS and EOS for beginning and end of sentences**

In [22]:
def text_transform (my_vocab, text):
     text_numerical = [my_vocab[token] for token in text]
     return torch.tensor([my_vocab['<BOS>']] + text_numerical + [my_vocab['<EOS>']])
     #return list(my_vocab['<BOS>']) + text_numerical + list(my_vocab['<EOS>'])

In [23]:
text = trainset[0][1]
print(text)
text_transform(target_vocab, text)

['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']


tensor([ 1,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14,  2])

In [24]:
text = trainset[1][1]
print(text)
text_transform(target_vocab, text)

['several', 'men', 'in', 'hard', 'hats', 'are', 'operating', 'a', 'giant', 'pulley', 'system', '.']


tensor([ 1, 15, 16, 17, 18, 19,  9, 20, 21, 22, 23, 24, 14,  2])

<font color = 'green'> Create a function that will be use by dataloaders to group obsevations. We will first use transform function to add eos and bos tokens and replace words with indexes. Finally we will add pad tokens for smaller sentences in a batch.

In [25]:
def collate_batch(batch):
   source_list, target_list = [], []
   for (source_text, target_text) in batch:
        source_transform = text_transform(source_vocab, source_text)
        source_list.append(source_transform)
        target_transform =text_transform(target_vocab, target_text)
        target_list.append(target_transform)
        
   source_pad = pad_sequence(source_list, padding_value=3.0, batch_first = True)
   target_pad = pad_sequence(target_list, padding_value=3.0, batch_first = True)
   #print(source_list)
   return (source_pad, target_pad)

In [26]:
torch.manual_seed(0)
batch_size = 2

train_loader = DataLoader(trainset, batch_size=batch_size, shuffle=True, 
                              collate_fn=collate_batch)

In [27]:
torch.manual_seed(40)
for i, (source, target) in enumerate(train_loader):
   
  print('batch number:' ,i)
  print('source')  
  print(source)
  print('target')  
  print(target)

batch number: 0
source
tensor([[ 1, 22, 30, 12, 31, 32, 33, 34, 35, 36, 37, 38, 39, 22, 40, 17,  2],
        [ 1,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,  2,  3]])
target
tensor([[ 1, 21, 31, 17, 21, 32, 33, 34, 35, 36, 21, 37, 38, 21, 39, 14,  2],
        [ 1,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14,  2,  3,  3,  3,  3]])
batch number: 1
source
tensor([[ 1, 22, 24, 25, 26, 12, 22, 27, 28, 29, 17,  2],
        [ 1, 18,  7, 19, 20, 21, 22, 23, 17,  2,  3,  3]])
target
tensor([[ 1, 21, 25, 26, 27, 28, 21, 29, 30, 14,  2,  3,  3,  3],
        [ 1, 15, 16, 17, 18, 19,  9, 20, 21, 22, 23, 24, 14,  2]])


In [28]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
#device = torch.device('cpu')
device

device(type='cuda')

<font size = 3, color = 'green'> **We will be using the second batch for our example.**

In [29]:
src = source.clone().to(device)
print(src)
print(src.shape)

tensor([[ 1, 22, 24, 25, 26, 12, 22, 27, 28, 29, 17,  2],
        [ 1, 18,  7, 19, 20, 21, 22, 23, 17,  2,  3,  3]], device='cuda:0')
torch.Size([2, 12])


In [30]:
trg= target.clone().to(device)
print(trg)
print(trg.shape)

tensor([[ 1, 21, 25, 26, 27, 28, 21, 29, 30, 14,  2,  3,  3,  3],
        [ 1, 15, 16, 17, 18, 19,  9, 20, 21, 22, 23, 24, 14,  2]],
       device='cuda:0')
torch.Size([2, 14])


- <font size = 3, color = 'green'> First token will have index 1 - corresponding to < BOS > token
- <font size = 3, color = 'green'> In target last token in first senence have index 3 - corresponding to < PAD >
- <font size = 3, color = 'green'> In target last token in second senence have index 2 - corresponding to < EOS >
</font>

In [31]:
print(src[0][0]) # first token will have index 1 - corresponding to '<BOS>'
print(trg[0][-1]) # In target last token in first senence have index 3 - corresponding to '<PAD>'
print(trg[1][-1]) # In target last token in second senence have index 2 - corresponding to '<EOS>'
# batch_size, src_len

tensor(1, device='cuda:0')
tensor(3, device='cuda:0')
tensor(2, device='cuda:0')


In [32]:
print(src.device)
print(trg.device)

cuda:0
cuda:0


<font color = 'green'> **INPUT, LABEL, OUTPUT FOR DECODER** <br>
Original Sequence: $trg = [sos, x_1, x_2, x_3, eos]$ <br>
Input to Model: $trg[:-1][sos, x_1, x_2, x_3]$  <br>
Predicted Values: $[y_1, y_2, y_3, eos]$<br>
Lable or True y : $trg[1:] = [x_1, x_2, x_3, eos]$

<font size = 3, color = 'green'> **ISSUE**: We are trying to remove eos token using  $trg[:,-1]$. However, this will remove pad token and not eos token. 

<font color = 'green'> **SOLUTION**: Response from  https://github.com/bentrevett/pytorch-seq2seq/issues/182
    
<font color = 'green'> Our trg sequence will be something like [sos, x1, x2, x3, eos] When we do $trg[:,-1]$ the sequence will be [sos, x1, x2, x3], and our predicted sequence will be [y1, y2, y3, y4], where y1 should be x1, y2 should be x2, y3 should be x3 and y4 should be eos. The predicted sequence should be a shifted version of the target sequence -- this is because we calculate the loss of output against $trg[,1:]$ = [x1, x2, x3, eos]

<font color = 'green'> With padding, let's say the target sequence is [sos, x1, x2, x3, eos, pad, pad], thus $trg[:,-1]$ = [sos, x1, x2, x3, eos, pad] and thus our predicted sequence is [y1, y2, y3, y4, y5, y6]. Same as before, but y5 and y6 should be the model predicting pad tokens, so we are calculating the loss of our output against $trg[,1:]$ = [x1, x2, x3, eos, pad, pad]. However, because we use nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX) we ignore the loss values over the pad tokens, so we only calculate loss from the output compared against $trg[,1:]$ = [x1, x2, x3, eos] -- which is the exact same sequence without padding.

<font color = 'green'>**CONCLUSION** <br>
Original Sequence: $trg = [sos, x_1, x_2, x_3, eos, pad, pad]$ <br>
Input to Model: $trg[:-1][sos, x_1, x_2, x_3, eos, pad]$  <br>
Predicted Values: $[y_1, y_2, y_3, eos, garbage, garbage]$ <br>
Lable or True y : $trg[1:] = [x_1, x_2, x_3, eos, pad, pad]$ <br>
    
<font color = 'red'>Since, we ignore pad token in Label while calculationg loss, this does not effect our model. Ideally we do not want to give eos as input to model, but there is no way around with pad sequences. 
    

In [33]:
trg_in = trg[:, :-1]

In [34]:
trg_in

tensor([[ 1, 21, 25, 26, 27, 28, 21, 29, 30, 14,  2,  3,  3],
        [ 1, 15, 16, 17, 18, 19,  9, 20, 21, 22, 23, 24, 14]], device='cuda:0')

In [35]:
trg_in.shape

torch.Size([2, 13])

In [36]:
trg.shape

torch.Size([2, 14])

# <font color = 'blue'> Building the Model

## <font color = 'blue'> Input Embeddings

![](assets/transformer-encoder.png)

<font size = 3, color = 'green'>
    
1. Pass the token indices through embedding layer. 
2. Multiply token embedding by a scaling factor :$\sqrt{d_{model}}$, where $d_{model}$ is the hidden dimension size, `hid_dim`. <font color = 'red'> **NOT UNDERSTOOD**
3.Create a vector of token positions (assime max length. pass the positions through the  *positional embedding layer*. <font color = 'red'> **Check the fixed static embeddings used in the original paper** </font>

<font color = 'green'>
    
4. Add the two embeddings  
    
5. **Dropout is then applied to the combined embeddings.**
   

### <font color = 'blue'>  Step 1 Token embedding

In [37]:
hid_dim = 8
torch.manual_seed(0)
src_token_embedding_layer = nn.Embedding(len(source_vocab), hid_dim).to(device)
trg_token_embedding_layer = nn.Embedding(len(target_vocab), hid_dim).to(device)

In [38]:
print(f'{src_token_embedding_layer.weight[0:5]}')

tensor([[-1.1258, -1.1524, -0.2506, -0.4339,  0.8487,  0.6920, -0.3160, -2.1152],
        [ 0.3223, -1.2633,  0.3500,  0.3081,  0.1198,  1.2377,  1.1168, -0.2473],
        [-1.3527, -1.6959,  0.5667,  0.7935,  0.5988, -1.5551, -0.3414,  1.8530],
        [ 0.7502, -0.5855, -0.1734,  0.1835,  1.3894,  1.5863,  0.9463, -0.8437],
        [-0.6136,  0.0316, -0.4927,  0.2484,  0.4397,  0.1124,  0.6408,  0.4412]],
       device='cuda:0', grad_fn=<SliceBackward0>)


In [39]:
print(f'{trg_token_embedding_layer.weight[0:5]}')

tensor([[-0.5627, -0.8328, -1.3955, -0.3993, -0.3099, -0.0561,  0.5174, -1.5962],
        [ 0.3570, -2.2975, -0.8711, -1.6740,  0.5631, -1.4351,  0.7194, -1.3707],
        [ 0.3221, -0.1016,  0.2060,  1.2168,  1.2359, -0.1002,  2.1364,  0.0700],
        [ 0.4990,  0.0565,  0.4061, -1.7384,  1.1901,  2.6352,  0.2284,  0.3241],
        [-1.1154,  2.1914,  0.1158,  0.7773, -1.0921, -0.0611, -1.4928, -1.7644]],
       device='cuda:0', grad_fn=<SliceBackward0>)


In [40]:
src_embedding = src_token_embedding_layer(src)
trg_embedding = trg_token_embedding_layer(trg_in)

In [41]:
print(src_embedding.shape)
print(trg_embedding.shape)
# batch_size, seq_len, hid_dim

torch.Size([2, 12, 8])
torch.Size([2, 13, 8])


<font size = 3, color = 'green'> Each word is represented as  vector of size hidden_dim. Hidden_dim is same for source and target.

In [42]:
src_embedding[0][0]

tensor([ 0.3223, -1.2633,  0.3500,  0.3081,  0.1198,  1.2377,  1.1168, -0.2473],
       device='cuda:0', grad_fn=<SelectBackward0>)

In [43]:
trg_embedding[0][0]

tensor([ 0.3570, -2.2975, -0.8711, -1.6740,  0.5631, -1.4351,  0.7194, -1.3707],
       device='cuda:0', grad_fn=<SelectBackward0>)

### <font color = 'blue'>  Step 2 - scale output of embedding
<font color = 'red'> **NOT UNDERSTOOD**</font>

In [44]:
print(torch.var(src_embedding [0][4]))
torch.var(trg_embedding [0][4])

tensor(1.0456, device='cuda:0', grad_fn=<VarBackward0>)


tensor(0.5988, device='cuda:0', grad_fn=<VarBackward0>)

In [45]:
scale = torch.sqrt(torch.FloatTensor([hid_dim])).to(device)
scale

tensor([2.8284], device='cuda:0')

In [46]:
src_embedding_scaled = src_embedding*scale
trg_embedding_scaled = trg_embedding*scale

In [47]:
print(torch.var(src_embedding_scaled[0][4]))
torch.var(trg_embedding_scaled[0][4])

tensor(8.3652, device='cuda:0', grad_fn=<VarBackward0>)


tensor(4.7907, device='cuda:0', grad_fn=<VarBackward0>)

### <font color = 'blue'>  Step 3 Positional Embedding

In [48]:
torch.manual_seed(0)
max_length = 20
src_position_embedding_layer = nn.Embedding(max_length, hid_dim).to(device)
trg_position_embedding_layer = nn.Embedding(max_length, hid_dim).to(device)

In [49]:
print(src_position_embedding_layer.weight.shape)
print(trg_position_embedding_layer.weight.shape)
# max seq len, hid_dim

torch.Size([20, 8])
torch.Size([20, 8])


In [50]:
src_position_embedding_layer.weight[0:5]

tensor([[-1.1258, -1.1524, -0.2506, -0.4339,  0.8487,  0.6920, -0.3160, -2.1152],
        [ 0.3223, -1.2633,  0.3500,  0.3081,  0.1198,  1.2377,  1.1168, -0.2473],
        [-1.3527, -1.6959,  0.5667,  0.7935,  0.5988, -1.5551, -0.3414,  1.8530],
        [ 0.7502, -0.5855, -0.1734,  0.1835,  1.3894,  1.5863,  0.9463, -0.8437],
        [-0.6136,  0.0316, -0.4927,  0.2484,  0.4397,  0.1124,  0.6408,  0.4412]],
       device='cuda:0', grad_fn=<SliceBackward0>)

In [51]:
batch_size = src_embedding_scaled.shape[0]
src_len = src_embedding_scaled.shape[1]
trg_len = trg_embedding_scaled.shape[1]

In [52]:
print(batch_size)
print(src_len)
print(trg_len)

2
12
13


In [53]:
src_position = torch.arange(0, src_len)
print(src_position)
print(src_position.shape)

tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
torch.Size([12])


In [54]:
src_position = src_position.unsqueeze(0)
print(src_position.shape)
src_position

torch.Size([1, 12])


tensor([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11]])

In [55]:
src_position = src_position.repeat(batch_size,1)
src_position = src_position.to(device)
# [batch_size, seq_len]

In [56]:
src_position.shape

torch.Size([2, 12])

In [57]:
src_position

tensor([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11],
        [ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11]], device='cuda:0')

In [58]:
trg_position = torch.arange(trg_len).view(1,-1).repeat(batch_size,1).to(device)

In [59]:
print(trg_position)
print(trg_position.shape)

tensor([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12],
        [ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12]], device='cuda:0')
torch.Size([2, 13])


In [60]:
src_position_embedding = src_position_embedding_layer(src_position)
trg_position_embedding = trg_position_embedding_layer(trg_position)

In [61]:
print(src_position_embedding.shape)
print(trg_position_embedding.shape)
# [ batch_size, seq_len, hid_dim]

torch.Size([2, 12, 8])
torch.Size([2, 13, 8])


### <font color = 'blue'> Step 4 Combine scaled token embedding and position embedding

In [62]:
encoder_input = src_position_embedding + src_embedding_scaled
decoder_input = trg_position_embedding + trg_embedding_scaled

In [63]:
encoder_input[0][0]

tensor([-0.2143, -4.7256,  0.7393,  0.4377,  1.1877,  4.1926,  2.8427, -2.8146],
       device='cuda:0', grad_fn=<SelectBackward0>)

### <font color = 'blue'> Step 5 Apply Dropout

In [64]:
torch.manual_seed(0)
encoder_input_dropout_layer = nn.Dropout(p=0.1)
decoder_input_dropout_layer = nn.Dropout(p=0.1)

In [65]:
encoder_input_after_dropout = encoder_input_dropout_layer(encoder_input)
decoder_input_after_dropout = encoder_input_dropout_layer(decoder_input)

In [66]:
encoder_input_after_dropout[0][0]
# [batch_size, seq_len, hid_dim]

tensor([-0.2381, -5.2507,  0.8215,  0.4863,  1.3196,  0.0000,  3.1586, -3.1274],
       device='cuda:0', grad_fn=<SelectBackward0>)

<font size = 3, color = 'green'> **Dropout layer set 10% (p =0.1) of neurons to zero and divides the other by 0.9 (1-p)**. We do not apply dropout during inference. In pytorch when we use model.train() - dropout layer is applied, whereas when we use model.eval() dropout layer is not applied.

In [67]:
encoder_input[0][0]/0.9

tensor([-0.2381, -5.2507,  0.8215,  0.4863,  1.3196,  4.6585,  3.1586, -3.1274],
       device='cuda:0', grad_fn=<DivBackward0>)

## <font color = 'blue'>  <font size =5> **Encoder Layers**


<font size =4, color = 'green'> Sublayer Self Attention </font>

<font color = 'green'>
    
- Step1: Pass the source sentence and its mask into the *multi-head attention layer*- self attention.
    
- Step2: Layer Noramalization(dropout(output of self attention) + input self attention). Adding input to the output layer is also referred as redidual connection or skip connection.
</font>

<font size =4, color = 'green'> Sublayer Position-Wise Feedforward </font>

<font color = 'green'>
    
- Step3: Pass the output of Self attention through following layers: Linear layer > RELU > Dropout > Linear
   
- Step4: Again, apply dropout, a residual connection and then layer normalization i.e. Layer Noramalization(dropout(output of position-wise feedforward) + input position-wise feedforward).

#### <font color = 'blue'> **Self Attention**

**Mutli Head Attention Layer**


![](assets/transformer-attention.png)

<font color = 'green'>
    
- Attention can be thought of as *queries*, *keys* and *values* - where the query is used with the key to get an attention vector (usually the output of a *softmax* operation and has all values between 0 and 1 which sum to 1) which is then used to get a weighted sum of the values.

$$ \text{Attention}(Q, K, V) = \text{Softmax} \big( \frac{QK^T}{\sqrt{d_k}} \big)V $$ 

- Scaling is done to stop the results of the dot products growing large, causing gradients to become too small - similar to some inititalization methods.

- However, the scaled dot-product attention isn't simply applied to the queries, keys and values. Instead of doing a single attention application the queries, keys and values have their `hid_dim` split into $h$ *heads* and the scaled dot-product attention is calculated over all heads in parallel. This means instead of paying attention to one concept per attention application, we pay attention to $h$. We then re-combine the heads into their `hid_dim` shape, thus each `hid_dim` is potentially paying attention to $h$ different concepts.

$$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1,...,\text{head}_h)W^O $$

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) $$

$W^O$ is the linear layer applied at the end of the multi-head attention layer, `fc`. $W^Q, W^K, W^V$ are the linear layers `fc_q`, `fc_k` and `fc_v`.

**Steps for Self Attention:** 
- Step1: first we calculate $QW^Q$, $KW^K$ and $VW^V$ with the linear layers, `fc_q`, `fc_k` and `fc_v`, to give us `Q`, `K` and `V`. 
- Step2: Next, we split the `hid_dim` of the query, key and value into `n_heads` using `.view` and correctly permute them so they can be multiplied together. 
- Step3: We then calculate the `energy` (the un-normalized attention) by multiplying `Q` and `K` together and scaling it by the square root of `head_dim`, which is calulated as `hid_dim // n_heads`. 
- Step 4: We then mask the energy so we do not pay attention over any elements of the sequeuence we shouldn't.
- Step 5: then apply the softmax 
- Step 6: apply dropout on output of softmax. 
- Step 7: We then apply the attention to the value heads, `V`, before combining the `n_heads` together. 
- Step 8: Finally, we multiply this $W^O$, represented by `fc_o`.
    
NOTE - All the above steps can also be applied using nn.Multiheadattention() layer. See the Appendix where we show that we get su=milar results.

##### <font color = 'blue'> **Step 1: Linear Transformation of embeddings to generate Queries, Keys and values**</font>

In [68]:
torch.manual_seed(0)
out_hid_dim = 12
hid_dim = 8
fc_q = nn.Linear(hid_dim, out_hid_dim).to(device)
fc_k = nn.Linear(hid_dim, out_hid_dim).to(device)
fc_v = nn.Linear(hid_dim, out_hid_dim).to(device)
fc_o = nn.Linear(out_hid_dim, hid_dim).to(device)

In [69]:
fc_q.weight.shape

torch.Size([12, 8])

In [70]:
fc_q.bias.shape

torch.Size([12])

In [71]:
Q = fc_q(encoder_input_after_dropout)
K = fc_k(encoder_input_after_dropout)
V = fc_v(encoder_input_after_dropout)

In [72]:
Q.shape
#[batch_size, query_len, hid_dim]

torch.Size([2, 12, 12])

##### <font color = 'blue'>**Step2: Split the `hid_dim` of the query, key and value into `n_heads`**

In [73]:
n_heads = 3
head_dim = out_hid_dim // n_heads
print(head_dim)

4


In [74]:
assert out_hid_dim % n_heads == 0

In [75]:
Q = Q.view(batch_size, -1, n_heads, head_dim)
K = K.view(batch_size, -1, n_heads, head_dim)
V = V.view(batch_size, -1, n_heads, head_dim)

In [76]:
Q.shape
#[batch_size, query_len, n_heads, head_dim]

torch.Size([2, 12, 3, 4])

In [77]:
Q = Q.permute(0, 2, 1, 3)
K = K.permute(0, 2, 1, 3)
V = V.permute(0, 2, 1, 3)

In [78]:
Q.shape
#[batch_size,num_heads, query_len, head_dim ]

torch.Size([2, 3, 12, 4])

In [79]:
K.shape
#[batch_size,num_heads, key_len, head_dim ]

torch.Size([2, 3, 12, 4])

##### <font color = 'blue'> **Step3: Scaled dot product of Queries and Keys**</font>

In [80]:
scale = torch.sqrt(torch.FloatTensor([head_dim])).to(device)

In [81]:
energy = torch.matmul(Q, K.permute(0, 1, 3, 2)) / scale

In [82]:
energy.shape
#[batch_size, num_heads, query_len, key_len]

torch.Size([2, 3, 12, 12])

##### <font color = 'blue'> **Step4: Apply mask to output of Q, K dot product**</font><br>
<font color = 'green'>**We do not want tokens to pay attention to pad tokens**</font>

In [83]:
SRC_PAD_IDX = source_vocab['<PAD>']
SRC_PAD_IDX

3

In [84]:
src_mask = (src!= SRC_PAD_IDX )
src_mask

tensor([[ True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True],
        [ True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         False, False]], device='cuda:0')

In [85]:
src_mask.shape

torch.Size([2, 12])

In [86]:
src_mask = src_mask.unsqueeze(1).unsqueeze(2)

In [87]:
src_mask.shape

torch.Size([2, 1, 1, 12])

In [88]:
src_mask = src_mask.to(device)

In [89]:
energy_masked = energy.masked_fill(src_mask == 0, -1e10)

<font color = 'red'>**Print energy values for second sentence**</font>

<font color ='green'> **We can see below that for the last two tokens (pad tokens) energy has a very high negative value.**

In [90]:
print(energy_masked[1,1,1,10].data, energy_masked[1,1,1,11].data)
      
#[batch_size, num_heads, query_len, key_len]

tensor(-1.0000e+10, device='cuda:0') tensor(-1.0000e+10, device='cuda:0')


In [91]:
print(energy_masked[1,0,5,10].data, energy_masked[1,0,5,11].data)
      
#[batch_size, num_heads, query_len, key_len]

tensor(-1.0000e+10, device='cuda:0') tensor(-1.0000e+10, device='cuda:0')


<font color = 'green'> **We are not ignoring pad tokens completely - we are ignoring pad tokens in keys but not in queries**

In [92]:
print(energy_masked[1,0,10,1].data, energy_masked[1,0,11,1].data)      
#[batch_size, num_heads, query_len, key_len]

tensor(-3.0029, device='cuda:0') tensor(-1.2690, device='cuda:0')


<font color = 'green'>**Print energy values for first sentence**</font>
<font color = 'green'> **There are no pad tokens for the first sentence**

In [93]:
print(energy_masked[0,0,5,10].data, energy_masked[0,1,5,11].data, 
      energy_masked[0,0,10,5].data, energy_masked[0,1,11,5].data)
#[batch_size, num_heads, query_len, key_len]

tensor(-1.5125, device='cuda:0') tensor(-0.4217, device='cuda:0') tensor(-0.7463, device='cuda:0') tensor(-1.1680, device='cuda:0')


##### <font color = 'blue'>**Step 5: Apply softmax to convert QV dot product to probabilities**</font><br>

In [94]:
attention_prob = torch.softmax(energy_masked, dim = -1)                 
#attention_prob = [batch size, n heads, query len, key len]

<font color ='green'> **Query will not pay attention to last two tokens in the Key as these has zero attention probabilities. The last two tokens were pad tokens. Teh softmax for very high negative values is zero**

In [95]:
print(attention_prob [1,1,1,10].data, attention_prob [1,1,1,11].data)
#[batch_size, num_heads, query_len, key_len]

tensor(0., device='cuda:0') tensor(0., device='cuda:0')


<font color ='green'> **We do not obseve zero probabilites for the first senence as it has no pad tokens**

In [96]:
print(attention_prob[0, 1,1, :])

tensor([7.9895e-03, 3.0690e-02, 1.1646e-05, 2.0567e-03, 2.4597e-03, 6.2799e-01,
        2.9352e-02, 1.9576e-03, 4.1289e-03, 2.9197e-01, 1.3934e-03, 1.5670e-06],
       device='cuda:0', grad_fn=<SliceBackward0>)


In [97]:
print(attention_prob[0, 0,1, :])

tensor([8.5804e-03, 1.7263e-02, 2.7257e-05, 1.6996e-03, 7.9512e-04, 1.5986e-02,
        6.9464e-03, 2.9713e-03, 3.5198e-02, 9.0983e-01, 6.8868e-04, 1.1910e-05],
       device='cuda:0', grad_fn=<SliceBackward0>)


In [98]:
attention_prob[0, 0,1, :].sum()

tensor(1., device='cuda:0', grad_fn=<SumBackward0>)

##### <font color = 'blue'>**Step 6: Apply dropout layer to attention probabilities**</font><br>
<font color = 'red'>**NOT UNDERSTOOD- why apply dropout here (probs will not sum to 1)**</font><br>
<font color = 'green'>**Quotes from paper --We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized.**</font>


In [99]:
torch.manual_seed(0)
att_enc_dropout =  nn.Dropout(p=0.1)

In [100]:
attention_prob_after_dropout = att_enc_dropout(attention_prob)

In [101]:
print(attention_prob_after_dropout [0, 0,1, :])

tensor([0.0000e+00, 1.9181e-02, 3.0286e-05, 1.8884e-03, 0.0000e+00, 1.7762e-02,
        7.7182e-03, 3.3015e-03, 3.9109e-02, 1.0109e+00, 7.6521e-04, 1.3233e-05],
       device='cuda:0', grad_fn=<SliceBackward0>)


In [102]:
attention_prob_after_dropout[0, 0,1, :].sum()

tensor(1.1007, device='cuda:0', grad_fn=<SumBackward0>)

<font size = 3, color = 'red'>**Probs do not sum to 1, sometimes these are greater than one and sometimes these are less than one**</font>

##### <font color = 'blue'>**Step 7: Apply the attention to the value heads**</font><br>
<font size = 3, color = 'green'>**Final vectors are wighted sum of values. This gives us the final embeddings afer considering the context words. These represent the contextualized embeddings for the tokens**</font><br>

In [103]:
V.shape
# [batch_size, num_heads, value_len, head_dim]

torch.Size([2, 3, 12, 4])

In [104]:
attention_prob_after_dropout.shape
#[batch_size, num_heads, query_len, key_len]

torch.Size([2, 3, 12, 12])

<font size = 3, color = 'green'>**NOTE: key_len will be same as value_len**.<br>

<font size = 3, color = 'green'>
    
- Query comes from focal word (sentence), keys and values are from context.
- In self attention both focal words and context are based on same sentence and hence same length.
- In encoder-decoder attention of machine translation - focal word is target sentence and context word comes from source language. Hence queries are generated from target language. Whereas keys and values are generated from source language. We are trying to find which focal word in target language should pay attention to which words in source language.

<font size = 3, color = 'red'>**Not Understood** - Since keys and values both capture context, why can we not use same matrix for Values and Keys i.e fc_k = fc_v. This is exacyly what we did in seq2seq paper with attention (without self attention). The source vectors were used both as keys and values.

In [105]:
# We can do this batch multiplication of the matrices of shape 
# [query_len, key_len] and [value_len, head_dim] as key_len = value_len
encoder_self_att_output = torch.matmul(attention_prob_after_dropout, V)
#[batch_size, num_heads, query_len, head_dim]

In [106]:
encoder_self_att_output.shape
# [batch_size, number_of_heads, query_len, head_dim]

torch.Size([2, 3, 12, 4])

In [107]:
encoder_self_att_output = encoder_self_att_output.permute(0, 2, 1, 3)
# [batch_size, query_len, number_of_heads, head_dim]

In [108]:
encoder_self_att_output.shape

torch.Size([2, 12, 3, 4])

In [109]:
encoder_self_att_output = encoder_self_att_output.view(batch_size, -1, out_hid_dim)

RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

We cannot use .view because the tensor is no londer contiguous. We can use reshape which will create a copy and make a contiguous tensor.

In [170]:
encoder_self_att_output = encoder_self_att_output.reshape(batch_size, -1, out_hid_dim)

In [171]:
encoder_self_att_output.shape
#[batch_size, query_len, out_hid_dim]

torch.Size([2, 12, 12])

<font size = 3, color = 'green'>We need to project the final values to have same shape as  input embedding. To accomplish this we will use fc_o linear layer we created earlier.

##### <font color = 'blue'>**Step 8: Apply the linear layer to get the output representaion as hid_dim**</font>

In [172]:
encoder_self_att_output = fc_o(encoder_self_att_output)

In [173]:
encoder_self_att_output.shape
#[batch_size, query_len, hid_dim]

torch.Size([2, 12, 8])

In [174]:
encoder_self_att_output[0,0,:]

tensor([-0.1231, -0.9036,  0.3362,  0.6744,  0.0063, -0.5722, -0.7662, -0.7788],
       device='cuda:0', grad_fn=<SliceBackward0>)

<font  size =3, color ='green'>**Implementing MultiHeadAttention using Pytorch Layer**
Limitation: hid_dim = output_hid_dim. Moved to separate notebook    
if hid_dim = output_hid_dim , we can use torch.nn.MultiheadAttention for attention sublayer </font>
**See the Appendix**
    

##### <font color = 'blue'>**Step 9: Apply dropout**</font>

In [175]:
torch.manual_seed(0)
encoder_self_attn_dropout = nn.Dropout(p=0.1)

In [176]:
encoder_self_att_output_after_dropout = encoder_self_attn_dropout(encoder_self_att_output)

##### <font color = 'blue'>**Step 10 Add input to output of the sublayer**</font><br>

In [177]:
encoder_self_att_output_plus_input = encoder_input_after_dropout + encoder_self_att_output

In [178]:
encoder_self_att_output_plus_input.shape

torch.Size([2, 12, 8])

In [179]:
# first sentence first worrd, all dimensions
encoder_self_att_output_plus_input[0, 1, :]

tensor([ 1.0798, -4.8025, -5.0172,  0.1274, -0.6700,  0.8240,  5.5056,  1.9648],
       device='cuda:0', grad_fn=<SliceBackward0>)

##### <font color = 'blue'>**Step 11: Normalize Encoder Output**</font><br>

In [180]:
torch.manual_seed(0)
norm_layer_encoder_attention = nn.LayerNorm(hid_dim).to(device)

In [181]:
encoder_self_att_output_plus_input_normalized = norm_layer_encoder_attention(encoder_self_att_output_plus_input)

In [182]:
encoder_self_att_output_plus_input_normalized[0, 1, :]

tensor([ 0.3700, -1.4389, -1.5049,  0.0772, -0.1681,  0.2914,  1.7310,  0.6422],
       device='cuda:0', grad_fn=<SliceBackward0>)

<font color = 'green'>**Norm Layer Manual Calculations for one word**

In [183]:
for name, param in norm_layer_encoder_attention.named_parameters():
    if param.requires_grad:
        print(name, param.data)

weight tensor([1., 1., 1., 1., 1., 1., 1., 1.], device='cuda:0')
bias tensor([0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')


In [184]:
_mean = torch.mean(encoder_self_att_output_plus_input[0, 1, :])

In [185]:
_std = torch.std(encoder_self_att_output_plus_input[0, 1, :], unbiased= False)

In [186]:
nl_w = norm_layer_encoder_attention.weight
nl_b = norm_layer_encoder_attention.bias

In [187]:
encoder_attention_output_normalized_manual =  nl_b + nl_w * (encoder_self_att_output_plus_input[0, 1, :]-_mean)/_std

In [188]:
encoder_attention_output_normalized_manual

tensor([ 0.3700, -1.4389, -1.5049,  0.0772, -0.1681,  0.2914,  1.7310,  0.6422],
       device='cuda:0', grad_fn=<AddBackward0>)

#### <font color = 'blue'> **Positionwise FeedForward Layer**

In [189]:
torch.manual_seed(0)
pf_dim = 16
enc_positionwise_feed_forward_layer = nn.Sequential(nn.Linear(hid_dim, pf_dim),
                                nn.ReLU(),
                                nn.Dropout(p=0.1),
                                nn.Linear(pf_dim, hid_dim)).to(device)

In [190]:
enc_positionwise_output = enc_positionwise_feed_forward_layer(encoder_self_att_output_plus_input_normalized)

In [191]:
enc_positionwise_output.shape

torch.Size([2, 12, 8])

In [192]:
torch.manual_seed(0)
enc_dropout_positionwise =  nn.Dropout(p=0.1)

In [193]:
enc_positionwise_output_after_dropout = enc_dropout_positionwise(enc_positionwise_output)

In [194]:
norm_layer_encoder_positionwise = nn.LayerNorm(hid_dim).to(device)

In [195]:
encoder_layer_output = enc_positionwise_feed_forward_layer(
                        enc_positionwise_output_after_dropout + 
                        encoder_self_att_output_plus_input_normalized)

In [196]:
encoder_layer_output.shape

torch.Size([2, 12, 8])

In [197]:
encoder_layer_output.shape[1]

12

## <font color = 'blue'> Decoder Layers

### <font color = 'blue'> Decoder Self Attention Layer

As mentioned previously, the decoder layer is similar to the encoder layer except that it now has two multi-head attention layers, `self_attention` and `encoder_attention`. 

The first performs self-attention, as in the encoder, by using the decoder representation so far as the query, key and value. This is followed by dropout, residual connection and layer normalization. This `self_attention` layer uses the target sequence mask, `trg_mask`, in order to prevent the decoder from "cheating" by paying attention to tokens that are "ahead" of the one it is currently processing as it processes all tokens in the target sentence in parallel.

The second is how we actually feed the encoded source sentence, `enc_src`, into our decoder. In this multi-head attention layer the queries are the decoder representations and the keys and values are the encoder representations. Here, the source mask, `src_mask` is used to prevent the multi-head attention layer from attending to `<pad>` tokens within the source sentence. This is then followed by the dropout, residual connection and layer normalization layers. 

Finally, we pass this through the position-wise feedforward layer and yet another sequence of dropout, residual connection and layer normalization.

The decoder layer isn't introducing any new concepts, just using the same set of layers as the encoder in a slightly different way.

#### <font color = 'blue'> Multi Head Attention

<font  size =3, color ='green'>**Implementing MultiHeadAttention using Pytorch Layer**
Limitation: hid_dim == output_hid_dim.
    

In [198]:
hid_dim = 8
n_heads = 2

In [199]:
torch.manual_seed(0)
multihead_attnetion_layer = torch.nn.MultiheadAttention(embed_dim=hid_dim, num_heads=n_heads, 
                                                        dropout=0.0, 
                                                        bias=True, add_bias_kv=False, 
                                                        add_zero_attn=False, kdim=None, 
                                                        vdim=None, batch_first=True, 
                                                        device=device, dtype=None)

<font size = 3, color = 'green'>It combines Q, Kand V into one metrics. To compare the results, we need to make sure that initial merices are the same

In [200]:
for name, parameter in multihead_attnetion_layer.named_parameters():
    print(name, parameter.data.shape)

in_proj_weight torch.Size([24, 8])
in_proj_bias torch.Size([24])
out_proj.weight torch.Size([8, 8])
out_proj.bias torch.Size([8])


In [201]:
TRG_PAD_IDX = target_vocab['<PAD>']
TRG_PAD_IDX 

3

<font color = 'green'> Index where we will apply mask i.e. not pay attention

In [202]:
trg_pad_mask = (trg_in== TRG_PAD_IDX )
trg_pad_mask=trg_pad_mask.to(device)

In [203]:
trg_pad_mask

tensor([[False, False, False, False, False, False, False, False, False, False,
         False,  True,  True],
        [False, False, False, False, False, False, False, False, False, False,
         False, False, False]], device='cuda:0')

In [204]:
trg_pad_mask.shape
# batchsize, seq_len

torch.Size([2, 13])

<font color = 'green'> In deocoder a token can pay attention to only preceeding tokens. Hence we need to create a mask for all the tokens ahead for a goven token.

In [205]:
torch.tril(torch.ones(trg_in.shape[1], trg_in.shape[1], device = device))

tensor([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], device='cuda:0')

In [206]:
trg_att_mask = ~torch.tril(torch.ones((trg_in.shape[1], trg_in.shape[1]), device = device)).bool()

In [207]:
trg_att_mask

tensor([[False,  True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True],
        [False, False,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True],
        [False, False, False,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True],
        [False, False, False, False,  True,  True,  True,  True,  True,  True,
          True,  True,  True],
        [False, False, False, False, False,  True,  True,  True,  True,  True,
          True,  True,  True],
        [False, False, False, False, False, False,  True,  True,  True,  True,
          True,  True,  True],
        [False, False, False, False, False, False, False,  True,  True,  True,
          True,  True,  True],
        [False, False, False, False, False, False, False, False,  True,  True,
          True,  True,  True],
        [False, False, False, False, False, False, False, False, False,  True,
          True,  True,  True],
        [F

In [208]:
decoder_self_att_output, decoder_self_att_probs = multihead_attnetion_layer (
                                              query=decoder_input_after_dropout, 
                                              key= decoder_input_after_dropout,
                                              value= decoder_input_after_dropout, 
                                              key_padding_mask=trg_pad_mask, 
                                              need_weights=True, 
                                              attn_mask=trg_att_mask)

In [209]:
decoder_self_att_output[1,12,:]

tensor([ 0.4591, -0.7687,  0.4112, -0.3029, -0.4572,  0.3844,  0.5456, -0.5902],
       device='cuda:0', grad_fn=<SliceBackward0>)

<font size = 3, color = 'green'> **First word in first sentence can pay attention to itself only**

In [210]:
decoder_self_att_probs[0,0,:]

tensor([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0',
       grad_fn=<SliceBackward0>)

<font size = 3, color = 'green'> **Second word in first sentence can pay attention to itself and previous word**

In [211]:
decoder_self_att_probs[0,1,:]

tensor([0.5225, 0.4775, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000], device='cuda:0',
       grad_fn=<SliceBackward0>)

<font size = 3, color = 'green'> **Last word in first sentence can pay attention to all the words except last two words as last three words are pad tokens**

In [212]:
decoder_self_att_probs[0,12,:]

tensor([5.2618e-01, 2.7541e-07, 4.4419e-03, 1.1549e-10, 8.9900e-04, 7.2436e-05,
        3.2480e-08, 3.0348e-07, 1.2610e-03, 1.3610e-05, 4.6713e-01, 0.0000e+00,
        0.0000e+00], device='cuda:0', grad_fn=<SliceBackward0>)

<font size = 3, color = 'green'> **Last word in second sentence can pay attention to all the words as this sentence has no pad tokens**

In [213]:
decoder_self_att_probs[1,12,:]

tensor([0.0217, 0.0448, 0.1272, 0.0989, 0.0915, 0.1613, 0.2903, 0.0122, 0.0374,
        0.0046, 0.0096, 0.0764, 0.0242], device='cuda:0',
       grad_fn=<SliceBackward0>)

<font size = 3, color = 'green'> **The probbalitoes sum to 1 as I have kept dropout = 0**

In [214]:
decoder_self_att_probs[1,12,:].sum()

tensor(1., device='cuda:0', grad_fn=<SumBackward0>)

#### <font color = 'blue'>  Dropout

In [215]:
decoder_self_att_dropout = nn.Dropout(p=0.1)

#### <font color = 'blue'>  Residual Connection

In [216]:
decoder_self_att_output_after_dropout = decoder_self_att_dropout(decoder_self_att_output)

In [217]:
decoder_self_att_plus_input = decoder_input_after_dropout + \
                                  decoder_self_att_output_after_dropout

In [218]:
decoder_self_att_plus_input.shape
#Batch_size, query_len, hid_dim

torch.Size([2, 13, 8])

#### <font color = 'blue'>  Layer_norm

In [219]:
dec_self_att_layer_norm= nn.LayerNorm(hid_dim).to(device)
decoder_self_att_plus_input_normalized = dec_self_att_layer_norm(decoder_self_att_plus_input)

In [220]:
decoder_self_att_plus_input_normalized.shape
#batch_size, query_len, hid_dim]

torch.Size([2, 13, 8])

### <font color = 'blue'> Decoder Encoder Attention Layer

As mentioned previously, the decoder layer is similar to the encoder layer except that it now has two multi-head attention layers, `self_attention` and `encoder_attention`. 

The first performs self-attention, as in the encoder, by using the decoder representation so far as the query, key and value. This is followed by dropout, residual connection and layer normalization. This `self_attention` layer uses the target sequence mask, `trg_mask`, in order to prevent the decoder from "cheating" by paying attention to tokens that are "ahead" of the one it is currently processing as it processes all tokens in the target sentence in parallel.

The second is how we actually feed the encoded source sentence, `enc_src`, into our decoder. In this multi-head attention layer the queries are the decoder representations and the keys and values are the encoder representations. Here, the source mask, `src_mask` is used to prevent the multi-head attention layer from attending to `<pad>` tokens within the source sentence. This is then followed by the dropout, residual connection and layer normalization layers. 

Finally, we pass this through the position-wise feedforward layer and yet another sequence of dropout, residual connection and layer normalization.

The decoder layer isn't introducing any new concepts, just using the same set of layers as the encoder in a slightly different way.

#### <font color = 'blue'> Multi Head Attention

<font  size =3, color ='green'>**Implementing MultiHeadAttention using Pytorch Layer**
Limitation: hid_dim == output_hid_dim.
    

In [221]:
hid_dim = 8
n_heads = 2

In [222]:
torch.manual_seed(0)
multihead_attnetion_layer = torch.nn.MultiheadAttention(embed_dim=hid_dim, num_heads=n_heads, 
                                                        dropout=0.0, 
                                                        bias=True, add_bias_kv=False, 
                                                        add_zero_attn=False, kdim=None, 
                                                        vdim=None, batch_first=True, 
                                                        device=device, dtype=None)

<font size = 3, color = 'green'>It combines Q, Kand V into one metrics. To compare the results, we need to make sure that initial merices are the same

In [223]:
for name, parameter in multihead_attnetion_layer.named_parameters():
    print(name, parameter.data.shape)

in_proj_weight torch.Size([24, 8])
in_proj_bias torch.Size([24])
out_proj.weight torch.Size([8, 8])
out_proj.bias torch.Size([8])


In [224]:
SRC_PAD_IDX = target_vocab['<SRC>']
SRC_PAD_IDX 

0

<font color = 'green'> Index where we will apply mask i.e. not pay attention

In [225]:
src_pad_mask = (src== SRC_PAD_IDX )
src_pad_mask = src_pad_mask.to(device)

In [226]:
src_pad_mask

tensor([[False, False, False, False, False, False, False, False, False, False,
         False, False],
        [False, False, False, False, False, False, False, False, False, False,
         False, False]], device='cuda:0')

In [227]:
src_pad_mask.shape
# batchsize, seq_len

torch.Size([2, 12])

<font color = 'green'> Here we only need src_pad_mask. The query will not pay attention to pad tokens of source

In [228]:
decoder_enc_att_output, decoder_enc_att_probs = multihead_attnetion_layer (
                                              query=decoder_self_att_plus_input_normalized,
                                              key=encoder_layer_output,
                                              value=encoder_layer_output, 
                                              key_padding_mask=src_pad_mask, 
                                              need_weights=True, 
                                              attn_mask=None)

In [229]:
decoder_enc_att_output[1,12,:]

tensor([-0.1441,  0.1203, -0.0382,  0.0077,  0.0479, -0.1381, -0.1649,  0.0225],
       device='cuda:0', grad_fn=<SliceBackward0>)

<font size = 3, color = 'green'> **Last word in first sentence of target can pay attention to all the words of source except last three words as last three words are pad tokens**

In [230]:
decoder_enc_att_probs[0,12,:]

tensor([0.0883, 0.0860, 0.0854, 0.0795, 0.0856, 0.0767, 0.0878, 0.0778, 0.0777,
        0.0891, 0.0822, 0.0840], device='cuda:0', grad_fn=<SliceBackward0>)

<font size = 3, color = 'green'> **Last word in second sentence OF target can pay attention to all the words of teh source as this sentence has no pad tokens**

In [231]:
decoder_enc_att_probs[1,12,:]

tensor([0.0814, 0.0804, 0.0882, 0.0843, 0.0853, 0.0858, 0.0787, 0.0862, 0.0812,
        0.0831, 0.0808, 0.0846], device='cuda:0', grad_fn=<SliceBackward0>)

<font size = 3, color = 'green'> **The probbalitoes sum to 1 as I have kept dropout = 0**

In [232]:
decoder_enc_att_probs[1,12,:].sum()

tensor(1., device='cuda:0', grad_fn=<SumBackward0>)

#### <font color = 'blue'>  Dropout

In [233]:
decoder_enc_att_dropout = nn.Dropout(p=0.1)

#### <font color = 'blue'>  Residual Connection

In [234]:
decoder_enc_att_output_after_dropout = decoder_enc_att_dropout(decoder_enc_att_output)

In [235]:
decoder_enc_att_plus_input = decoder_self_att_plus_input_normalized + \
                             decoder_enc_att_output_after_dropout

In [236]:
decoder_enc_att_plus_input.shape
#Batch_size, query_len, hid_dim

torch.Size([2, 13, 8])

#### <font color = 'blue'>  Layer_norm

In [237]:
dec_enc_att_layer_norm= nn.LayerNorm(hid_dim).to(device)
decoder_enc_att_plus_input_normalized = dec_self_att_layer_norm(decoder_self_att_plus_input)

In [238]:
decoder_enc_att_plus_input_normalized.shape
#batch_size, query_len, hid_dim]

torch.Size([2, 13, 8])

### <font color = 'blue'> **Decoder Positionwise FeedForward Layer**

#### <font color = 'blue'>  Linera Layers and ReLU Activation 

In [239]:
torch.manual_seed(0)
pf_dim = 16
dec_positionwise_feed_forward_layer = nn.Sequential(nn.Linear(hid_dim, pf_dim),
                                nn.ReLU(),
                                nn.Dropout(p=0.1),
                                nn.Linear(pf_dim, hid_dim)).to(device)

In [240]:
dec_positionwise_output = dec_positionwise_feed_forward_layer(
                                              decoder_enc_att_plus_input_normalized)

In [241]:
dec_positionwise_output.shape
# batch_size, query_len(trg_len), hid_dim)

torch.Size([2, 13, 8])

#### <font color = 'blue'>  Dropout

In [242]:
torch.manual_seed(0)
decoder_pos_dropout =  nn.Dropout(p=0.1)

In [243]:
dec_positionwise_output_after_dropout = decoder_pos_dropout(dec_positionwise_output)

#### <font color = 'blue'>  Residual Connection + Layer Norm

In [244]:
norm_layer_decoder_pos = nn.LayerNorm(hid_dim).to(device)

In [245]:
dec_positionwise_plus_input_norm = norm_layer_decoder_pos(
                                     dec_positionwise_output_after_dropout + 
                                     decoder_enc_att_plus_input_normalized)

In [246]:
dec_positionwise_plus_input_norm.shape
# batch_size, query_len(trg_len), hid_dim)

torch.Size([2, 13, 8])

### <font color = 'blue'> **Decoder Final Linear Layer**

In [247]:
final_linear_layer = nn.Linear(hid_dim, len(target_vocab)).to(device)

In [248]:
decoder_output = final_linear_layer(dec_positionwise_plus_input_norm)

In [249]:
decoder_output.shape
# batch, seq_len, vocab_size

torch.Size([2, 13, 40])

### <font color = 'blue'> **Loss Calculation**

Original Sequence: $trg = [sos, x_1, x_2, x_3, eos]$ <br>
Input to Model: $trg[:-1][sos, x_1, x_2, x_3]$  <br>
Predicted Values: $[y_1, y_2, y_3, eos]$<br>
Lable or True y : $trg[1:] = [x_1, x_2, x_3, eos]$


In [250]:
output_dim = decoder_output.shape[-1]
output_dim

40

In [251]:
output = decoder_output.contiguous().view(-1, output_dim)

In [252]:
output.shape

torch.Size([26, 40])

In [253]:
trg.shape

torch.Size([2, 14])

In [254]:
trg

tensor([[ 1, 21, 25, 26, 27, 28, 21, 29, 30, 14,  2,  3,  3,  3],
        [ 1, 15, 16, 17, 18, 19,  9, 20, 21, 22, 23, 24, 14,  2]],
       device='cuda:0')

<font color = 'green'> Remove <BOS> token from trg for loss computation

In [255]:
trg_y = trg[:, 1:]

In [256]:
trg_y

tensor([[21, 25, 26, 27, 28, 21, 29, 30, 14,  2,  3,  3,  3],
        [15, 16, 17, 18, 19,  9, 20, 21, 22, 23, 24, 14,  2]], device='cuda:0')

In [257]:
trg_y.shape

torch.Size([2, 13])

In [258]:
trg_y = trg_y.contiguous().view(-1)

In [259]:
trg_y.shape

torch.Size([26])

#### <font color = 'blue'> nn.CrossEntropy
<font color = 'green'>Note that this case is equivalent to the combination of LogSoftmax and NLLLoss.

In [260]:
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)

In [262]:
loss = criterion(output, trg_y)

In [263]:
loss


tensor(3.7780, device='cuda:0', grad_fn=<NllLossBackward0>)

#### <font color = 'blue'> Manual Loss Calculations

In [305]:
def softmax(output):
    num = torch.exp(output) # 26, 40 (words, hiddn_dim)
    den = torch.sum(num, axis = 1, keepdim= True) #|(26, 1)
    return num/den

In [307]:
def crossentropy_m(prob, y):
    return - torch.log(prob[range(len(y)), y]).mean()

In [308]:
loss_m = crossentropy_m(softmax(output), trg_y)

In [309]:
loss_m

tensor(3.7797, device='cuda:0', grad_fn=<NegBackward0>)

#### <font color = 'blue'> Stable Manual Calculations

$${\displaystyle \mathrm {LSE} (x_{1},\dots ,x_{n})=x^{*}+\log \left(\exp(x_{1}-x^{*})+\cdots +\exp(x_{n}-x^{*})\right)}$$
where $$ {\displaystyle x^{*}=\max {\{x_{1},\dots ,x_{n}\}}}{\displaystyle x^{*}=\max {\{x_{1},\dots ,x_{n}\}}}$$

In [328]:
torch.log(torch.sum(torch.exp(output-max_output), dim = 1, keepdim=True)).shape

torch.Size([26, 1])

In [329]:
def crossentropy_softmax(output, y):
    max_output, _ = torch.max(output, dim = 1, keepdim=True)
    
    neglogp =  -(output[range(len(y)), y]- (max_output + 
                           torch.log(torch.sum(torch.exp(output-max_output), 
                                               dim = 1, keepdim=True))))
    return neglogp.mean()

In [330]:
crossentropy_softmax(output, trg_y)

tensor(3.7797, device='cuda:0', grad_fn=<MeanBackward0>)