# Transformers

In this notebook, we are going to define and implement a Transformer model.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

## Scaled Dot-Product Attention

<img src='assets/1/scaled-dot-product-3.PNG'/> 

In [2]:
# Scaled Dot-Product Attention calculation 
# from the paper (section 3.2.1 Scaled Dot-Product Attention):
# 'We compute the dot products of the query with all keys, divide each by √dk, 
# and apply a softmax function to obtain the weights on the values.'

class ScaledDotProductAttention(nn.Module):
    def __init__(self, mask=False):
        super(ScaledDotProductAttention, self).__init__()
        self.mask=mask
        
    def forward(self, Q, K, V):
        
        # compute dot products of all queries with all keys
        dot_products = torch.matmul(Q, torch.transpose(K, -2, -1)) 
        
        # divide each by √dk
        d_k = K.shape[-1]    # get length of key vector
        scaled_dot_products = dot_products / np.sqrt(d_k)
        
        # apply mask to prevent positions from attending to subsequent positions in decoder
        # (illegal connections are set to -inf in the input of the softmax)
        if self.mask:
            size = scaled_dot_products.shape[-2:]
            look_ahead_mask = torch.triu(torch.ones(size, device=scaled_dot_products.device), diagonal=1).bool()
            scaled_dot_products = scaled_dot_products.masked_fill(look_ahead_mask, float('-inf'))
        
        # apply a softmax function to obtain weights on values
        weights = F.softmax(scaled_dot_products, dim=-1)
        
        # get weighted values by multiplying values with weights(softmax scores)
        weighted_values = torch.matmul(weights, V)
        
        return weighted_values
    

Scaled Dot-Product Attention is identical to the [Dot-product attention algorithm](https://arxiv.org/pdf/1508.04025.pdf), except for the scaling factor of 1/√dk.
>For large values of dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by 1/√dk. <sub>-from section 3.2.1 Scaled Dot-Product Attention</sub>
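
To make the quoted argument concrete, here is a small sketch that samples random queries and keys and compares the spread of the raw and the scaled dot products. It assumes the components of q and k are independent with mean 0 and variance 1; the helper name `dot_product_variance`, the sample count and the seed are illustrative choices, not part of the paper.

```python
import torch

def dot_product_variance(d_k, n_samples=10000, seed=0):
    """Return the variance of raw and scaled q·k for random unit-variance vectors."""
    gen = torch.Generator().manual_seed(seed)
    q = torch.randn(n_samples, d_k, generator=gen)
    k = torch.randn(n_samples, d_k, generator=gen)
    raw = (q * k).sum(dim=-1)      # one dot product per sample
    scaled = raw / d_k ** 0.5      # scale by 1/√dk
    return raw.var().item(), scaled.var().item()

for d_k in (4, 64, 512):
    raw_var, scaled_var = dot_product_variance(d_k)
    print(f"d_k={d_k:4d}  raw variance={raw_var:8.2f}  scaled variance={scaled_var:.2f}")
```

Under these assumptions the raw variance grows roughly in proportion to dk, while the scaled variance stays close to 1, which keeps the softmax inputs in a range where gradients are not vanishingly small.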

**Masked Multi-Head Attention**

<img src='assets/1/masked-attention.PNG' width=50% height=50% /> 

>We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We **implement this inside of scaled dot-product attention** by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections. <sub>-from section 3.2.3 Applications of Attention in our Model</sub>

>We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i. <sub>-from section 3.1 Encoder and Decoder Stacks</sub>
    
The auto-regressive property means that each output is predicted from past outputs only. Because the whole target sequence is given to the decoder in parallel during training, we need to mask the positions that come after the current one. Otherwise the decoder could learn a trivial mapping that copies the target outputs it is given instead of predicting them.
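
A minimal sketch of the masking described in the quotes above: the scores are set to −∞ above the main diagonal before the softmax, so the resulting weights for subsequent positions are exactly zero. The random `scores` matrix is only a stand-in for the scaled dot products, and the size `n = 4` is an arbitrary choice.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n = 4                                   # number of positions (illustrative)
scores = torch.randn(n, n)              # stand-in for the scaled dot products

# True above the main diagonal: position i must not attend to positions j > i
look_ahead_mask = torch.triu(torch.ones(n, n), diagonal=1).bool()

# set illegal connections to -inf in the input of the softmax
masked_scores = scores.masked_fill(look_ahead_mask, float('-inf'))
weights = F.softmax(masked_scores, dim=-1)

print(weights)
```

Each row of `weights` still sums to 1, and the first position can only attend to itself.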
    
 ---

Let's do a unit test on the Scaled Dot-Product Attention implementation.

<img src='assets/1/scaled-dot-product-matrix.PNG' /> 

In [3]:
# define data, n=3 embeddings
n = 3

# from section 3.2.2 Multi-Head Attention
d_key = 64 # key dimension

# define K,Q and V matrices
K = torch.randn(n, d_key)
Q = torch.randn(n, d_key)
V = torch.randn(n, d_key)

scaled_dot_product = ScaledDotProductAttention()

# apply scaled-dot product attention
output = scaled_dot_product(Q, K, V)    

output.shape

torch.Size([3, 64])

In [4]:
scaled_dot_product = ScaledDotProductAttention(mask=True)

masked_output = scaled_dot_product(Q, K, V)
print(masked_output.shape)

torch.Size([3, 64])


## Attention

<img src='assets/1/attention-3.PNG'/> 

In [5]:
class Attention(nn.Module):
    def __init__(self, d_model=512, d_key=64, mask=False):
        super(Attention, self).__init__()
        
        # define Key, Query and Value weight matrices
        self.WK = nn.Parameter(torch.randn(d_model, d_key))
        self.WQ = nn.Parameter(torch.randn(d_model, d_key))
        self.WV = nn.Parameter(torch.randn(d_model, d_key))
        
        # init scaled dot-product attention
        self.scaled_dot_product = ScaledDotProductAttention(mask)
        
    def forward(self, inputs):
        WK, WQ, WV = self.WK, self.WQ, self.WV
        
        # apply linear transformations
        
        # get packed Key, Query and Value matrices
        # by multiplying inputs with WK, WQ and WV
        K = torch.matmul(inputs, WK)
        Q = torch.matmul(inputs, WQ)
        V = torch.matmul(inputs, WV)
        
        # apply scaled dot-product attention
        attention_scores = self.scaled_dot_product(Q, K, V)
        
        return attention_scores

>An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. <sub>-from section 3.2 Attention</sub>

>In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V. <sub>-from section 3.2.1 Scaled Dot-Product Attention</sub>
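
The "weighted sum of the values" can be traced by hand on a toy case. The sketch below uses one query and two key-value pairs; the numbers are made up for illustration only.

```python
import torch
import torch.nn.functional as F

# one query and two key-value pairs (toy numbers chosen for illustration)
query = torch.tensor([1.0, 0.0])
keys = torch.tensor([[1.0, 0.0],
                     [0.0, 1.0]])
values = torch.tensor([[10.0, 0.0],
                       [0.0, 10.0]])

# compatibility of the query with each key (scaled dot product)
scores = keys @ query / keys.shape[-1] ** 0.5
weights = F.softmax(scores, dim=-1)

# the output is the weighted sum of the values
output = weights @ values

print(weights)   # the first key matches the query better, so it gets the larger weight
print(output)
```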

The paper presents the K, Q and V weight matrices as distinct. In practice, many implementations store them as **one weight matrix**, so the projected matrices also come out as one matrix that is split afterwards. The reason for this is to implement the model with highly optimized matrix multiplication code.

<img src='assets/1/weight-impl.PNG' />

We will not implement it this way in this notebook, because keeping the matrices separate makes the code easier to follow.
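
For reference, one way the fused variant could look is sketched below. It assumes the three projection matrices are simply placed side by side in one `(d_model, 3 * d_key)` matrix; the name `W_qkv` is hypothetical, and real libraries may lay out the fused weights differently.

```python
import torch

torch.manual_seed(0)
n, d_model, d_key = 3, 512, 64

# three distinct projection matrices, as in this notebook
WQ = torch.randn(d_model, d_key)
WK = torch.randn(d_model, d_key)
WV = torch.randn(d_model, d_key)

# fused variant: one matrix holding all three side by side (hypothetical layout)
W_qkv = torch.cat([WQ, WK, WV], dim=1)

X = torch.randn(n, d_model)

# a single matrix multiplication, then split the result back into Q, K and V
Q, K, V = torch.matmul(X, W_qkv).chunk(3, dim=-1)

print(Q.shape, K.shape, V.shape)
```

The split matrices match the ones obtained from three separate multiplications, up to floating-point rounding.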

---

Let's do a unit test on the Attention implementation.

<img src='assets/1/attention-matrix.PNG'/>

In [6]:
# from section 3.1 Encoder and Decoder Stacks
d_model = 512    # embedding dimension

# define input
X = torch.randn(n, d_model)

# init attention 
attention = Attention(d_model, d_key)

# compute attention scores
attention_scores = attention(X)

attention_scores.shape

torch.Size([3, 64])

## Multi-Head Attention

<img src='assets/1/multi-head-attention-3.PNG' width=75% height=75%/>

In [7]:
# Multi-Head Attention consists of several attention layers running in parallel.       
class MultiHeadAttention(nn.Module):
    def __init__(self, n_head=8, d_model=512, d_key=64, mask=False):
        super(MultiHeadAttention, self).__init__()
        self.mask = mask
        
        # number of heads
        self.h = n_head
        
        # add attention layers to a ModuleList container
        attention_list = [Attention(d_model, d_key, mask) for _ in range(n_head)]
        self.multi_head_attention = nn.ModuleList(attention_list)
        
        # define linear layer 
        self.W = nn.Parameter(torch.randn(n_head*d_key, d_model))
    
    def forward(self, x):
        # apply & concat attention layers
        Z = torch.cat([attention(x) for attention in self.multi_head_attention], -1) 
        
        # apply linear transformation
        output = torch.matmul(Z, self.W)    # project concatenated heads back to d_model
        
        return output
        

Let's do a unit test on the Multi-Head Attention implementation.

<img src='assets/1/multi-head-matrix.PNG'/>

In [8]:
# from section 3.2.2 Multi-Head Attention
h = 8 # number of heads

# define input
X = torch.randn(n, d_model)

# init multi-head attention 
multi_head_attention = MultiHeadAttention(h, d_model, d_key)

# compute multi_head attention score
Z = multi_head_attention(X)

Z.shape

torch.Size([3, 512])

### Parallelization of Multi-Head Attention in Code

Multi-Head Attention consists of Attention layers that run in parallel. In the code above, however, the heads stored in the `nn.ModuleList` are called one after another in a Python loop. To our knowledge, PyTorch does not offer a built-in container that runs arbitrary modules in parallel, so developers often use third-party libraries to get parallelism. Executing the code on a GPU reduces the cost of the loop, because CUDA kernels are launched asynchronously with respect to the Python code. Additionally, **if we have multiple GPUs in our system (and don’t use data parallel), we could execute different modules on each device and concatenate the result back on a single device.**
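
Another common way to avoid the loop on a single device is to stack the per-head weights and let one batched matrix multiplication compute all heads at once. The sketch below is an illustration under that assumption, not the approach used in this notebook; the function names `heads_in_loop` and `heads_batched` and the weight scaling are made up for the example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d_model, d_key, h = 3, 512, 64, 8

X = torch.randn(n, d_model)
# per-head projection weights stacked along a leading "head" dimension
WQ = torch.randn(h, d_model, d_key) * d_model ** -0.5
WK = torch.randn(h, d_model, d_key) * d_model ** -0.5
WV = torch.randn(h, d_model, d_key) * d_model ** -0.5

def heads_in_loop(X):
    # one head after another, as in the ModuleList version
    outputs = []
    for i in range(h):
        Q, K, V = X @ WQ[i], X @ WK[i], X @ WV[i]
        weights = F.softmax(Q @ K.T / d_key ** 0.5, dim=-1)
        outputs.append(weights @ V)
    return torch.cat(outputs, dim=-1)            # (n, h * d_key)

def heads_batched(X):
    # broadcasting X against the stacked weights computes every head at once
    Q, K, V = X @ WQ, X @ WK, X @ WV             # each (h, n, d_key)
    weights = F.softmax(Q @ K.transpose(-2, -1) / d_key ** 0.5, dim=-1)
    Z = weights @ V                              # (h, n, d_key)
    return Z.transpose(0, 1).reshape(n, h * d_key)

Z_loop = heads_in_loop(X)
Z_batched = heads_batched(X)
print(Z_loop.shape, Z_batched.shape)
```

Both functions produce the same concatenated output up to floating-point rounding; the batched version just replaces the Python loop with a broadcasted `matmul`.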

## Feed Forward Neural Network
<img src='assets/1/ffnn.PNG' width=75% height=75%/>


In [9]:
# The dimensionality of input and output is d_model = 512, 
# and the inner-layer has dimensionality d_ff = 2048
# - from section 3.3 Position-wise Feed-Forward Networks

class FeedForwardNetwork(nn.Module):
    def __init__(self, d_model=512, d_feedforward=2048):
        super(FeedForwardNetwork, self).__init__()
        
        # define the feedforward neural network 
        self.feedforwardnn = nn.Sequential(   
                nn.Linear(d_model, d_feedforward),
                nn.ReLU(),
                nn.Linear(d_feedforward, d_model),
        )
        
    def forward(self, x):
        x = self.feedforwardnn(x)
        return x

In [10]:
# unit test of FeedForward network

# from section 3.3 Position-wise Feed-Forward Network
d_feedforward = 2048

# define input
X = torch.rand(n, d_model)

# define feedforward neural network
ffnn = FeedForwardNetwork(d_model, d_feedforward)

# compute ffnn output
output = ffnn(X)

output.shape

torch.Size([3, 512])
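
"Position-wise" means the same network is applied to every position (row) independently. The sketch below checks this on a stand-in network built inline with the same layer sizes; the variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d_model, d_ff = 3, 512, 2048

# stand-in network with the same structure as the feed-forward block above
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)

X = torch.rand(n, d_model)

# apply the network to all positions at once ...
all_at_once = ffn(X)
# ... and to each position (row) separately
one_by_one = torch.stack([ffn(x) for x in X])

print(all_at_once.shape, one_by_one.shape)
```

The two results agree up to floating-point rounding, so no information flows between positions inside this block.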

## Sublayer


<img src='assets/1/sublayer.PNG' width=80% height=80%/> 

In [12]:
# 'We employ a residual connection around each of the two sub-layers, 
# followed by layer normalization. That is, the output of each sub-layer is
# LayerNorm(x + Sublayer(x)), where Sublayer(x) is
# the function implemented by the sub-layer itself.'
# - from section 3.1 Encoder and Decoder Stacks

# define residual learning block
class Residual_Connection(nn.Module):
    def __init__(self, layer):
        super(Residual_Connection, self).__init__()
        self.layer = layer     # function implemented by the layer

    def forward(self, x):
        f_x = self.layer(x)    # apply Sublayer(x)
        x = x + f_x            # x + Sublayer(x)
        return x
    
class Sublayer(nn.Module):
    def __init__(self, layer, d_model=512):
        super(Sublayer, self).__init__()
    
        self.sublayer = nn.Sequential(
                            Residual_Connection(layer), 
                            nn.LayerNorm(d_model)  # embedding vector length
        )
        
    def forward(self, x):
        x = self.sublayer(x)
        return x

In [14]:
# unit test on sublayer

# define input
X = torch.rand(n, d_model)

# init feedforward neural network (with default settings) sublayer
nn_layer = Sublayer(FeedForwardNetwork())

# compute ffnn sublayer output
ffnn_output = nn_layer(X)

# init attention sublayer (with default settings)
attention_layer = Sublayer(MultiHeadAttention())

# compute attention sublayer output
attention_output = attention_layer(X)

print(ffnn_output.shape)
print(attention_output.shape)

torch.Size([3, 512])
torch.Size([3, 512])
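
To see what LayerNorm(x + Sublayer(x)) computes, the sketch below writes the normalization out by hand and compares it with `nn.LayerNorm`. A plain `nn.Linear` is used as a stand-in for the sub-layer; that choice is an assumption made only to keep the example short.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d_model = 3, 512

X = torch.rand(n, d_model)
sublayer = nn.Linear(d_model, d_model)     # stand-in for any sub-layer
layer_norm = nn.LayerNorm(d_model)

# LayerNorm(x + Sublayer(x)) using the built-in module
expected = layer_norm(X + sublayer(X))

# the same computation written out: normalize each row to zero mean and unit variance
residual = X + sublayer(X)
mean = residual.mean(dim=-1, keepdim=True)
var = residual.var(dim=-1, unbiased=False, keepdim=True)
manual = (residual - mean) / torch.sqrt(var + layer_norm.eps)
manual = manual * layer_norm.weight + layer_norm.bias   # learnable scale and shift

print(expected.shape, manual.shape)
```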


## Encoder

<img src='assets/1/encoder.PNG' /> 

In [None]:
class Encoder(nn.Module):
    def __init__(self, n_head=8, d_model=512, d_key=64):
        super(Encoder, self).__init__()
        
        # two sub-layers, each wrapped in a residual connection + layer normalization
        # 1. Self-Attention
        self.attention_layer = Sublayer(MultiHeadAttention(n_head, d_model, d_key), d_model)
          
        # 2. Feedforward Neural Network
        self.feedforward_layer = Sublayer(FeedForwardNetwork(d_model), d_model)
        
    def forward(self, x):
        # pass the input through self-attention, then through the feedforward network
        x = self.attention_layer(x)
        x = self.feedforward_layer(x)
        return x

## Decoder

<img src='assets/1/decoder.PNG' /> 

In [None]:
class Decoder(nn.Module):
    def __init__(self):
        super(Decoder, self).__init__()
        
        # three sub-layers
        # 1. Masked Self-Attention
        # 2. Encoder-Decoder Attention (over the output of the encoder stack)
        # 3. Feedforward Neural Network
        
    def forward(self, x):
        pass

## Transformer Model

<img src='assets/1/transformer-model.PNG' width=80% height=80%/> 

In [None]:
# The encoders are all identical in structure (yet they do not share weights)

class Transformer(nn.Module):
    
    # N is the number of stacked layers of encoders & decoders
    def __init__(self, N=6, n_head=8, d_model=512):
        super(Transformer, self).__init__()
        
        # initialize dimensionality of key (integer division keeps it an int)
        d_key = d_model // n_head
        
        # initialize the encoder and decoder stacks with sequential container
        encoders, decoders = [], []
        
        # stack encoders
        for i in range(N):
            encoder = Encoder(n_head, d_model, d_key)
            encoders.append(encoder)
        
        # stack decoders
        for i in range(N):
            decoder = Decoder()
            decoders.append(decoder)
            
        # register the stacks as attributes so that their parameters are tracked
        self.encoders = nn.Sequential(*encoders)
        self.decoders = nn.Sequential(*decoders)
        
    def forward(self):
        # pass through encoders
        
        # pass through decoders but considering same input
        pass
    
        
        
        

In [None]:
# from section 3.1 Encoder and Decoder Stacks
N = 6

## Comparing Results

with `nn.Transformer`

**Transformers Success**

**Transformers Weakness**



### Quick Recap of Familiar Components

<img src='assets/1/components.PNG'/> 

### Final Notes

The implementation here is a very basic representation of how Transformers work. Several components of the full model are left out to keep the code simple. In addition, in practice the Transformer model is used with batches of sequences. This adds batch and sequence-length dimensions to the tensors discussed above and makes the matrix multiplications more involved. The implementation here can also be improved in many ways, and its calculations can be optimized further.
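
As a rough idea of what the extra batch dimension changes, the sketch below computes scaled dot-product attention for inputs of shape `(batch, seq_len, d_k)`. The function name `batched_attention` and the sizes are illustrative assumptions; masking and multiple heads are left out.

```python
import torch
import torch.nn.functional as F

def batched_attention(Q, K, V):
    """Scaled dot-product attention for inputs of shape (batch, seq_len, d_k)."""
    d_k = K.shape[-1]
    # transpose only the last two dimensions so the batch dimension is untouched
    scores = torch.matmul(Q, K.transpose(-2, -1)) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)          # (batch, seq_len, seq_len)
    return torch.matmul(weights, V)              # (batch, seq_len, d_k)

torch.manual_seed(0)
batch_size, seq_len, d_key = 2, 5, 64            # illustrative sizes
Q = torch.randn(batch_size, seq_len, d_key)
K = torch.randn(batch_size, seq_len, d_key)
V = torch.randn(batch_size, seq_len, d_key)

output = batched_attention(Q, K, V)
print(output.shape)
```

Each sequence in the batch is processed independently: `output[0]` matches what the same function returns for the first sequence alone, up to floating-point rounding.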

### Resources

1. [Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/) - Blog Post
2. [Attention is All You Need](https://arxiv.org/pdf/1706.03762.pdf) - Paper
3. [Transformers for Beginners | What are they and how do they work](https://www.youtube.com/watch?v=_UVfwBqcnbM&t=4s) - Video
4. [Implementing a Transformer from Scratch](https://towardsdatascience.com/7-things-you-didnt-know-about-the-transformer-a70d93ced6b2) - Blog Post