# Transformers

In this notebook, we are going to define and implement a Transformer model.

In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

## Scaled Dot-Product Attention

<img src='assets/1/scaled-dot-product-3.PNG'/> 

In [17]:
# Scaled Dot-Product Attention calculation
# from the paper (section 3.2.1 Scaled Dot-Product Attention):
# 'We compute the dot products of the query with all keys, divide each by √dk,
# and apply a softmax function to obtain the weights on the values.'

class ScaledDotProductAttention(nn.Module):
    def __init__(self):
        super(ScaledDotProductAttention, self).__init__()

    def forward(self, Q, K, V, mask=False):

        # compute the dot products of all queries with all keys
        # (torch.matmul keeps the computation on tensors, so gradients can flow)
        dot_products = torch.matmul(Q, K.transpose(-2, -1))   # shape: (..., n_queries, n_keys)

        # divide each by √dk
        d_k = K.shape[-1]    # get length of key vector
        scaled_dot_products = dot_products / d_k ** 0.5

        # apply mask to prevent positions from attending to subsequent positions in decoder;
        # the mask is applied to the input of the softmax by setting illegal connections to -inf
        if mask:
            n_q, n_k = scaled_dot_products.shape[-2:]
            # True above the diagonal marks the subsequent (future) positions
            look_ahead_mask = torch.triu(
                torch.ones(n_q, n_k, dtype=torch.bool, device=Q.device), diagonal=1)
            scaled_dot_products = scaled_dot_products.masked_fill(look_ahead_mask, float('-inf'))

        # apply a softmax function over the keys to obtain weights on values
        weights = F.softmax(scaled_dot_products, dim=-1)

        # get weighted values by multiplying values with weights (softmax scores)
        weighted_values = torch.matmul(weights, V)

        return weighted_values
    

Scaled dot-product attention is identical to the [dot-product attention algorithm](https://arxiv.org/pdf/1508.04025.pdf), except for the scaling factor of 1/√dk.
>For large values of dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by 1/√dk. <sub>-from section 3.2.1 Scaled Dot-Product Attention</sub>
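The effect described in the quote can be checked numerically. The sketch below assumes that query and key components are independent samples from a standard normal distribution, in which case the dot product has a standard deviation of about √dk. The helper names `dot_product_std` and `softmax_peak` and the sample sizes are illustrative choices, not part of the paper.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def dot_product_std(d_k, n_samples=10000):
    """Empirical standard deviation of q·k for random q, k with unit-variance components."""
    q = torch.randn(n_samples, d_k)
    k = torch.randn(n_samples, d_k)
    return (q * k).sum(dim=1).std().item()

def softmax_peak(d_k, scale, n_keys=10):
    """Largest softmax weight one random query assigns over n_keys random keys."""
    q = torch.randn(d_k)
    K = torch.randn(n_keys, d_k)
    scores = K @ q
    if scale:
        scores = scores / d_k ** 0.5    # the 1/√dk factor from the paper
    return F.softmax(scores, dim=0).max().item()

for d_k in (4, 64, 1024):
    print(d_k,
          round(dot_product_std(d_k), 2),
          round(softmax_peak(d_k, scale=False), 3),
          round(softmax_peak(d_k, scale=True), 3))
```

Without scaling, the largest weight tends to approach 1 as dk grows, which is the saturated regime where softmax gradients become very small. With scaling, the weights tend to stay more evenly spread.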

**Masked Multi-Head Attention**

<img src='assets/1/masked-attention.PNG' width=50% height=50% /> 

>We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We **implement this inside of scaled dot-product attention** by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections. <sub>-from section 3.2.3 Applications of Attention in our Model</sub>

>We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i. <sub>-from section 3.1 Encoder and Decoder Stacks</sub>
    
The auto-regressive property means that each prediction depends only on past values. We therefore need to mask every input that comes after the current position. Otherwise, because the whole target sequence is fed to the decoder in parallel during training, the decoder could simply learn to copy the target outputs it is given.
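A minimal sketch of this masking, assuming a square matrix of attention scores where row i holds the scores of query position i against every key position. The function name `masked_attention_weights` is illustrative.

```python
import torch
import torch.nn.functional as F

def masked_attention_weights(scores):
    """Softmax over keys after hiding every position to the right of the query."""
    n = scores.shape[-1]
    # True above the diagonal marks the future positions that must stay hidden
    future = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    # -inf before the softmax becomes a weight of exactly 0 after it
    return F.softmax(scores.masked_fill(future, float('-inf')), dim=-1)

scores = torch.zeros(4, 4)    # equal scores make the effect easy to read
weights = masked_attention_weights(scores)
print(weights)
```

With equal scores, row i spreads its weight evenly over positions 0 to i and assigns zero weight to everything after it, so every row still sums to 1.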

Let's unit test the Scaled Dot-Product Attention implementation.

<img src='assets/1/scaled-dot-product-matrix.PNG' /> 

In [23]:
# define mock data assuming n=3 embeddings
n = 3
d_k = 64

# define K,Q and V matrices
K = torch.Tensor(np.arange(n*d_k).reshape(n, d_k))
Q = torch.Tensor(np.arange(n*d_k).reshape(n, d_k))
V = torch.Tensor(np.arange(n*d_k).reshape(n, d_k))

scaled_dot_product = ScaledDotProductAttention()
output = scaled_dot_product(Q, K, V)    # apply scaled-dot product attention

output.shape

torch.Size([3, 64])

In [None]:
masked_output = scaled_dot_product(Q, K, V, mask=True)

## Attention

<img src='assets/1/attention-3.PNG'/> 

In [None]:
class Attention(nn.Module):
    def __init__(self, d_k=64, d_model=512):
        super(Attention, self).__init__()

        # define Key, Query and Value weight matrices;
        # each one projects a d_model-dimensional input down to d_k dimensions
        # and is stored on self so that it is registered as a trainable parameter
        self.K = nn.Parameter(torch.randn(d_model, d_k))
        self.Q = nn.Parameter(torch.randn(d_model, d_k))
        self.V = nn.Parameter(torch.randn(d_model, d_k))

        self.scaled_dot_product = ScaledDotProductAttention()

    def forward(self, inputs, mask=False):
        K, Q, V = self.K, self.Q, self.V

        # apply linear transformations

        # get packed Key, Query and Value matrices
        # by multiplying inputs with K, Q and V
        Key = torch.matmul(inputs, K)
        Query = torch.matmul(inputs, Q)
        Value = torch.matmul(inputs, V)

        # apply scaled dot-product attention
        attention_output = self.scaled_dot_product(Query, Key, Value, mask)

        return attention_output
    
# 'We compute the attention function on a set of queries simultaneously, packed together
# into a matrix Q. The keys and values are also packed together into matrices K and V.'
# from section - 3.2.1 Scaled Dot-Product Attention

Let's unit test the Attention implementation.

In [None]:
# define mock data assuming 
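One way such a test could look, sketched under the assumption that the inputs are `n` embeddings of size `d_model` and that each weight matrix has shape `(d_model, d_k)`. `SingleHeadAttention` is a hypothetical, self-contained stand-in for the `Attention` class so that the snippet runs on its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadAttention(nn.Module):
    """Hypothetical stand-in: one attention head with learned projections."""
    def __init__(self, d_model=512, d_k=64):
        super().__init__()
        self.W_q = nn.Parameter(torch.randn(d_model, d_k) / d_model ** 0.5)
        self.W_k = nn.Parameter(torch.randn(d_model, d_k) / d_model ** 0.5)
        self.W_v = nn.Parameter(torch.randn(d_model, d_k) / d_model ** 0.5)

    def forward(self, inputs, mask=False):
        Q, K, V = inputs @ self.W_q, inputs @ self.W_k, inputs @ self.W_v
        scores = Q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5
        if mask:
            n = scores.shape[-1]
            future = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
            scores = scores.masked_fill(future, float('-inf'))
        return F.softmax(scores, dim=-1) @ V

torch.manual_seed(0)
n, d_model, d_k = 3, 512, 64
inputs = torch.randn(n, d_model)
attention = SingleHeadAttention(d_model, d_k)

with torch.no_grad():
    output = attention(inputs)
    masked_output = attention(inputs, mask=True)
    first_value = (inputs @ attention.W_v)[0]

# one d_k-dimensional output per input embedding
assert output.shape == (n, d_k)
# with the mask, the first position can only attend to itself
assert torch.allclose(masked_output[0], first_value, atol=1e-5)
```

The second assertion relies on the look-ahead mask: the first row of the masked weights is `[1, 0, 0]`, so the first output must equal the first value vector.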

## Multi-Head Attention

<img src='assets/1/multi-head-attention-3.PNG'/>

In [None]:
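# Multi-Head Attention consists of several attention layers running in parallel.
class MultiHeadAttention(nn.Module):

    def __init__(self, h=8, d_model=512):
        super(MultiHeadAttention, self).__init__()
        d_k = d_model // h    # d_k = d_v = d_model / h (section 3.2.2)

        # add the h attention layers (heads) to a module list
        # so that their parameters are registered
        self.heads = nn.ModuleList(
            [Attention(d_k=d_k, d_model=d_model) for _ in range(h)])

        # define the output linear layer of shape (h * d_v, d_model)
        self.W = nn.Parameter(torch.randn(h * d_k, d_model))

    def forward(self, inputs, mask=False):

        # apply attention and collect the output of every head
        attentions = [head(inputs, mask) for head in self.heads]

        # concat attentions along the feature dimension
        attentions = torch.cat(attentions, dim=-1)

        # apply linear transformation
        # to project the concatenated heads back to d_model dimensions
        return torch.matmul(attentions, self.W)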

Let's unit test the Multi-Head Attention implementation.
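A possible shape check for multi-head attention, written against a self-contained stand-in. `CompactMultiHeadAttention` is a hypothetical name, and it uses a common equivalent formulation (one large projection that is then split into `h` heads) instead of a loop over separate heads. Treat it as a sketch under those assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompactMultiHeadAttention(nn.Module):
    """Hypothetical stand-in: all heads computed at once by reshaping."""
    def __init__(self, h=8, d_model=512):
        super().__init__()
        assert d_model % h == 0, "d_model must be divisible by the number of heads"
        self.h, self.d_k = h, d_model // h
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)

    def split_heads(self, x):
        # (n, d_model) -> (h, n, d_k)
        n = x.shape[0]
        return x.view(n, self.h, self.d_k).transpose(0, 1)

    def forward(self, x):
        Q, K, V = (self.split_heads(W(x)) for W in (self.W_q, self.W_k, self.W_v))
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5        # (h, n, n)
        heads = F.softmax(scores, dim=-1) @ V                      # (h, n, d_k)
        concat = heads.transpose(0, 1).reshape(x.shape[0], -1)     # (n, h * d_k)
        return self.W_o(concat)                                    # (n, d_model)

torch.manual_seed(0)
mha = CompactMultiHeadAttention(h=8, d_model=512)
x = torch.randn(3, 512)
out = mha(x)

# the output keeps the model dimension, so layers can be stacked
assert out.shape == (3, 512)
# every head works on d_k = d_model / h = 64 features
assert mha.split_heads(x).shape == (8, 3, 64)
```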

## Feed Forward Neural Network
<img src='assets/1/ffnn.PNG' width=75% height=75%/>


In [None]:
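# feedforward neural network here
# from section 3.3: FFN(x) = max(0, xW1 + b1)W2 + b2
class FeedForwardNetwork(nn.Module):
    def __init__(self, d_ff=2048, d_model=512):
        super(FeedForwardNetwork, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)    # expand to the inner dimension
        self.fc2 = nn.Linear(d_ff, d_model)    # project back to the model dimension

    def forward(self, x):
        # two linear transformations with a ReLU activation in between
        return self.fc2(F.relu(self.fc1(x)))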
  

In [None]:
# Let's not skip the unit test of FeedForward network. 
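A sketch of what this test could check, assuming the feed-forward network follows section 3.3 of the paper with `d_model = 512` and `d_ff = 2048`. `PositionwiseFeedForward` is a hypothetical stand-in defined here so that the snippet is self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionwiseFeedForward(nn.Module):
    """Hypothetical stand-in: FFN(x) = max(0, xW1 + b1)W2 + b2."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))

torch.manual_seed(0)
ffn = PositionwiseFeedForward()
x = torch.randn(3, 512)    # 3 positions, each a d_model-dimensional vector

with torch.no_grad():
    out = ffn(x)
    out_first = ffn(x[:1])    # the first position on its own

# the network maps d_model -> d_ff -> d_model, so the shape is preserved
assert out.shape == x.shape
# position-wise: each position is transformed independently of the others
assert torch.allclose(out[:1], out_first, atol=1e-4)
```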

## Sublayer

In [None]:
# 'We employ a residual connection around each of the two sub-layers,
# followed by layer normalization. That is, the output of each sub-layer is
# LayerNorm(x + Sublayer(x)), where Sublayer(x) is
# the function implemented by the sub-layer itself.'
# from section - 3.1 Encoder and Decoder Stacks

# define residual learning block
class Residual_Connection(nn.Module):
    def __init__(self, layer):
        super(Residual_Connection, self).__init__()
        self.layer = layer     # function implemented by the sublayer

    def forward(self, x):
        f_x = self.layer(x)    # apply Sublayer(x)
        x = x + f_x            # x + Sublayer(x)
        return x

class Sublayer(nn.Module):
    def __init__(self, layer, d_model=512):
        super(Sublayer, self).__init__()

        self.sublayer = nn.Sequential(
                            Residual_Connection(layer),
                            nn.LayerNorm(d_model)    # normalize over the d_model features
        )

    def forward(self, x):
        x = self.sublayer(x)
        return x

In [None]:
# unit test on sublayer
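One possible check for the sub-layer wrapper, assuming `d_model = 512`. `ResidualLayerNorm` is a hypothetical stand-in for `Sublayer`, and a plain `nn.Linear` plays the role of the wrapped layer.

```python
import torch
import torch.nn as nn

class ResidualLayerNorm(nn.Module):
    """Hypothetical stand-in: LayerNorm(x + Sublayer(x))."""
    def __init__(self, layer, d_model=512):
        super().__init__()
        self.layer = layer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.layer(x))

torch.manual_seed(0)
sublayer = ResidualLayerNorm(nn.Linear(512, 512))
x = torch.randn(3, 512)

with torch.no_grad():
    out = sublayer(x)

# the residual connection requires input and output shapes to match
assert out.shape == x.shape
# a freshly initialized LayerNorm gives every position zero mean and unit variance
assert torch.allclose(out.mean(dim=-1), torch.zeros(3), atol=1e-5)
assert torch.allclose(out.var(dim=-1, unbiased=False), torch.ones(3), atol=1e-3)
```

The statistics check holds only right after initialization, while the learnable scale and shift of `nn.LayerNorm` are still 1 and 0.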

## Encoder

<img src='assets/1/encoder.PNG' /> 

In [None]:
class Encoder(nn.Module):
    def __init__(self):
        super(Encoder, self).__init__()

        self.attention = MultiHeadAttention()
        self.feed_forward = FeedForwardNetwork()

        # two sub-layers
        # 1. Self-Attention
        self.sublayer1 = Sublayer(self.attention)

        # 2. Feedforward Neural Network
        self.sublayer2 = Sublayer(self.feed_forward)

    def forward(self, x):
        x = self.sublayer1(x)    # LayerNorm(x + SelfAttention(x))
        x = self.sublayer2(x)    # LayerNorm(x + FeedForward(x))
        return x

## Decoder

<img src='assets/1/decoder.PNG' /> 

In [None]:
class Decoder(nn.Module):
    def __init__(self):
        super(Decoder, self).__init__()
        
        # three sub-layers
        # 1. Masked Self-Attention
        # 2. Encoder-Decoder Attention over the output of the encoder stack
        # 3. Feedforward Neural Network
        
        # employ a residual connection around each of
        # the three sub-layers, followed by layer normalization
        
        # That is, the output of each sub-layer is
        # LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function 
        # implemented by the sub-layer itself
        
    def forward(self, x):
        pass
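For reference, the three decoder sub-layers could be wired together as in the sketch below. It swaps in PyTorch's built-in `nn.MultiheadAttention` rather than the attention classes defined in this notebook, and it assumes a PyTorch version that supports `batch_first=True`. The class name `DecoderLayerSketch` and the tensor sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderLayerSketch(nn.Module):
    """Illustrative decoder layer built on torch's nn.MultiheadAttention."""
    def __init__(self, d_model=512, h=8, d_ff=2048):
        super().__init__()
        self.self_attention = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.cross_attention = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, encoder_output):
        n = x.shape[1]
        # True marks the subsequent positions a query is not allowed to attend to
        future = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), diagonal=1)

        # 1. masked self-attention over the decoder inputs
        attended, _ = self.self_attention(x, x, x, attn_mask=future)
        x = self.norm1(x + attended)

        # 2. encoder-decoder attention: queries from the decoder,
        #    keys and values from the output of the encoder stack
        attended, _ = self.cross_attention(x, encoder_output, encoder_output)
        x = self.norm2(x + attended)

        # 3. position-wise feed-forward network
        x = self.norm3(x + self.fc2(F.relu(self.fc1(x))))
        return x

torch.manual_seed(0)
layer = DecoderLayerSketch().eval()
target = torch.randn(2, 5, 512)    # (batch, target length, d_model)
memory = torch.randn(2, 7, 512)    # (batch, source length, d_model)

with torch.no_grad():
    out = layer(target, memory)
    changed = target.clone()
    changed[:, -1] += 1.0          # alter only the last target position
    out_changed = layer(changed, memory)

assert out.shape == target.shape
# earlier positions must not be affected by a change at a later position
assert torch.allclose(out[:, :-1], out_changed[:, :-1], atol=1e-5)
```

The last assertion is a direct test of the auto-regressive property: altering the final target position leaves the outputs at all earlier positions unchanged.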

## Transformer Model

<img src='assets/1/transformer-model.PNG' width=80% height=80%/> 

In [None]:
# The encoders are all identical in structure (yet they do not share weights)

class Transformer(nn.Module):

    # N is the number of stacked layers of encoders & decoders
    def __init__(self, N=6, h=8, d_k=64, d_v=64):
        super(Transformer, self).__init__()

        # initialize the encoder and decoder stacks with sequential container
        encoders, decoders = [], []

        # stack encoders
        for i in range(N):
            encoder = Encoder()
            encoders.append(encoder)

        # stack decoders
        for i in range(N):
            decoder = Decoder()
            decoders.append(decoder)

        # store the stacks on self so that their parameters are registered
        self.encoders = nn.Sequential(*encoders)
        self.decoders = nn.Sequential(*decoders)

    def forward(self):
        # pass through encoders

        # pass through decoders; every decoder also attends to
        # the output of the last encoder
        pass
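The remark that stacked layers are identical in structure but do not share weights depends on how the stack is built. The sketch below contrasts the two cases, using a small `nn.Linear` as a hypothetical stand-in for `Encoder()`. The helper names `make_layer` and `count_parameters` are illustrative.

```python
import torch
import torch.nn as nn

N = 6

def make_layer():
    # hypothetical stand-in for Encoder()
    return nn.Linear(8, 8)

# a fresh module per iteration: identical structure, independent weights
independent = nn.Sequential(*[make_layer() for _ in range(N)])

# the same module object repeated N times: every position shares one set of weights
layer = make_layer()
shared = nn.Sequential(*[layer] * N)

def count_parameters(module):
    # parameters() reports each distinct tensor only once
    return sum(p.numel() for p in module.parameters())

print(count_parameters(independent), count_parameters(shared))
```

Creating a new module inside the loop, as the `Transformer` class does, yields six times as many parameters as repeating a single module object.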
    
        
        
        

###  Specifying the Hyperparameters

In [None]:
# Define the model parameters

# from section 3.1 Encoder and Decoder Stacks
N = 6  # Number of layers in the encoder stack and decoder stack
d_model = 512  # Dimensionality of model layers' outputs

# from section 3.2.2 Multi-Head Attention
# d_k = d_v = d_model / h = 64
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values

# from section 3.3 Position-wise Feed-Forward Network
d_ff = 2048  # Dimensionality of the inner fully connected layer
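The relation between these values can be verified with a few lines of arithmetic. This is only a sanity check on the numbers listed above, not part of the model.

```python
N, d_model, h, d_ff = 6, 512, 8, 2048

# the model dimension must split evenly across the heads
assert d_model % h == 0
d_k = d_v = d_model // h

# concatenating h heads of size d_v restores the model dimension
assert h * d_v == d_model
print(d_k, d_v)
```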


In [None]:
# Define the training parameters

# from section 5.3 Optimizer 
beta_1 = 0.9
beta_2 = 0.98
epsilon = 1e-9
warmup_steps = 4000
# train steps is stated as 100K in the paper, 
# we will use warmup steps to train 

# from section 5.4 Regularization
dropout_rate = 0.1
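These values plug into the Adam optimizer and the learning rate schedule from section 5.3 of the paper, `lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)`. The sketch below shows one way to express that in PyTorch. The `nn.Linear` model is a hypothetical stand-in, and driving the schedule through `LambdaLR` with a base learning rate of 1.0 is an implementation choice, not something the paper prescribes.

```python
import torch
import torch.nn as nn

d_model, warmup_steps = 512, 4000

def learning_rate(step):
    """Linear warmup for warmup_steps steps, then inverse square root decay."""
    step = max(step, 1)    # the scheduler starts counting at 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

model = nn.Linear(d_model, d_model)    # hypothetical stand-in for the Transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
# LambdaLR multiplies the base lr (1.0 here) by learning_rate(step)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=learning_rate)

for step in (1, 2000, 4000, 8000, 100000):
    print(step, learning_rate(step))
```

The learning rate peaks at `step == warmup_steps`, where both arguments of `min` are equal. In a training loop, `scheduler.step()` would be called once after every `optimizer.step()`.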

### Resources

1. [Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/) - Blog Post
2. [Attention is All You Need](https://arxiv.org/pdf/1706.03762.pdf) - Paper